Abstract
Current research shows that detecting semantic concepts (e.g., animal, bus, person, dancing) in multimedia documents such as videos requires several types of complementary descriptors to achieve good results. In this work, we explore strategies for efficiently combining dozens of complementary content descriptors (or “experts”) through late fusion approaches for concept detection in multimedia documents. We explore two fusion approaches that share a common structure: both start with an expert clustering stage, continue with an intra-cluster fusion, and finish with an inter-cluster fusion. We also experiment with other state-of-the-art methods. The first fusion approach relies on a priori knowledge about the internals of each expert to group the available experts by similarity. The second approach automatically derives measures of expert similarity from the experts' outputs, groups the experts using agglomerative clustering, and then combines the results of this fusion with those from other methods. Finally, we show that an additional performance boost can be obtained by also considering the context of multimedia elements.
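The common structure of the two approaches (cluster the experts, fuse within each cluster, then fuse across clusters) can be sketched as follows. This is a minimal illustration and not the chapter's implementation: the abstract does not specify the similarity measure, the linkage rule, or the fusion operators, so the sketch assumes Pearson correlation between expert outputs, average-linkage agglomerative clustering with a stopping threshold, and plain arithmetic means for both fusion stages. The expert names, scores, and the `threshold` parameter are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant expert carries no ranking information
    return cov / sqrt(var_a * var_b)


def cluster_experts(scores, threshold):
    """Average-linkage agglomerative clustering on output similarity (assumed)."""
    clusters = [[name] for name in scores]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(scores[a], scores[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def average(score_lists):
    """Element-wise mean of several score lists."""
    return [sum(values) / len(values) for values in zip(*score_lists)]


def hierarchical_late_fusion(scores, threshold=0.8):
    clusters = cluster_experts(scores, threshold)
    # Intra-cluster fusion: one fused score list per cluster.
    intra = [average([scores[name] for name in cluster]) for cluster in clusters]
    # Inter-cluster fusion: combine the per-cluster results.
    return clusters, average(intra)


# Hypothetical detection scores of three experts on four video shots.
expert_scores = {
    "color_a": [0.9, 0.8, 0.2, 0.1],
    "color_b": [0.8, 0.9, 0.1, 0.2],
    "motion": [0.2, 0.9, 0.8, 0.1],
}
clusters, fused = hierarchical_late_fusion(expert_scores)
print(clusters)
print(fused)
```

In this toy example the two color experts are grouped together and the motion expert stays alone. Because the inter-cluster stage averages cluster results rather than individual experts, the two redundant color experts together carry the same weight as the single motion expert, which is the motivation for fusing hierarchically rather than averaging all experts at once.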
Notes
1. TREC Video Retrieval Evaluation, http://trecvid.nist.gov/.
Acknowledgments
This work was supported by the Quaero Program and the QCompere project, respectively funded by OSEO (French State agency for innovation) and ANR (French national research agency). The authors would also like to thank the members of the IRIM consortium for the expert scores used throughout the experiments described in this paper.
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Strat, S.T., Benoit, A., Lambert, P., Bredin, H., Quénot, G. (2014). Hierarchical Late Fusion for Concept Detection in Videos. In: Ionescu, B., Benois-Pineau, J., Piatrik, T., Quénot, G. (eds) Fusion in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-05696-8_3
DOI: https://doi.org/10.1007/978-3-319-05696-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05695-1
Online ISBN: 978-3-319-05696-8
eBook Packages: Computer Science, Computer Science (R0)