Abstract
We present a study comparing the cost and efficiency tradeoffs of multiple features for multimedia event detection. Low-level as well as semantic features are a critical part of contemporary multimedia and computer vision research. Arguably, combinations of multiple feature sets have been a major reason for recent progress in the field, not just as low-dimensional representations of multimedia data, but also as a means to semantically summarize images and videos. However, their efficacy for complex event recognition in unconstrained videos on standardized datasets has not been systematically studied. In this paper, we evaluate the accuracy and contribution of more than 10 multi-modality features, including semantic and low-level video representations, using two newly released NIST TRECVID Multimedia Event Detection (MED) open source datasets, MEDTEST and KINDREDTEST, which together contain more than 1000 hours of video. Contrasting multiple performance metrics, such as average precision, probability of missed detection, and minimum normalized detection cost, we propose a framework to balance the trade-off between accuracy and computational cost. This study provides an empirical foundation for selecting feature sets that can handle large-scale data with limited computational resources and are likely to produce superior multimedia event detection accuracy. The framework also applies to other resource-limited multimedia analyses, such as selecting or fusing multiple classifiers and different representations of each feature set.
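The three metrics named above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, and the cost constants (c_miss, c_fa, p_target) are assumed placeholder defaults that should be checked against the relevant TRECVID MED evaluation plan, since the abstract does not state them.

```python
from typing import List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of the precision values measured at the rank of each positive video."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def miss_and_false_alarm(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (P_miss, P_fa) when videos scoring >= threshold are detections."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    missed = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    p_miss = missed / n_pos if n_pos else 0.0
    p_fa = false_alarms / n_neg if n_neg else 0.0
    return p_miss, p_fa


def min_ndc(
    scores: Sequence[float],
    labels: Sequence[int],
    c_miss: float = 80.0,    # assumed cost of a missed detection
    c_fa: float = 1.0,       # assumed cost of a false alarm
    p_target: float = 0.001, # assumed prior probability of the event
) -> float:
    """Minimum normalized detection cost over all candidate thresholds."""
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    # Each distinct score is a candidate threshold; +inf rejects every video.
    thresholds = sorted(set(scores)) + [float("inf")]
    best = float("inf")
    for t in thresholds:
        p_miss, p_fa = miss_and_false_alarm(scores, labels, t)
        cost = (c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)) / norm
        best = min(best, cost)
    return best


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6]  # hypothetical detector outputs
    toy_labels = [1, 0, 1, 0]          # 1 = video contains the event
    print(average_precision(toy_scores, toy_labels))
    print(miss_and_false_alarm(toy_scores, toy_labels, 0.75))
    print(min_ndc(toy_scores, toy_labels))
```

Average precision rewards a good ranking as a whole, whereas the detection cost depends on a single operating threshold, which is one reason the two metrics can rank feature sets differently.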
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Lan, ZZ., Yang, Y., Ballas, N., Yu, SI., Hauptmann, A. (2014). Resource Constrained Multimedia Event Detection. In: Gurrin, C., Hopfgartner, F., Hurst, W., Johansen, H., Lee, H., O’Connor, N. (eds) MultiMedia Modeling. MMM 2014. Lecture Notes in Computer Science, vol 8325. Springer, Cham. https://doi.org/10.1007/978-3-319-04114-8_33
DOI: https://doi.org/10.1007/978-3-319-04114-8_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04113-1
Online ISBN: 978-3-319-04114-8
eBook Packages: Computer Science (R0)