Abstract
This paper investigates the potential benefit of low-level human vision behaviors for high-level semantic concept detection. Many current approaches rely on the Bag-of-Words (BoW) model, which has proven to be a good choice, especially for object recognition in still images. Its extension from static images to video sequences raises new problems, chiefly how to exploit the temporal information related to the concepts to detect (swimming, drinking...). In this study, we apply a human retina model to preprocess video sequences before running a state-of-the-art BoW analysis. This preprocessing, designed to enhance relevant information, increases performance by adding robustness to common image and video degradations such as luminance variation, shadows, compression artifacts and noise. Additionally, we propose a new segmentation method that selects low-level spatio-temporal potential areas of interest in the visual scene, without slowing the computation as much as a high-level saliency model would. These approaches are evaluated on the TRECVid 2010 and 2011 Semantic Indexing Task datasets, which contain from 130 to 346 high-level semantic concepts. We also experiment with various parameter settings to assess their effect on performance.
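As background for readers unfamiliar with the BoW model mentioned above, the sketch below illustrates its two core steps on synthetic data: building a visual vocabulary by clustering local descriptors (e.g. SURF), then quantizing a video shot's descriptors into a normalized word histogram. This is a minimal illustration, not the paper's actual pipeline; descriptor extraction, the retina preprocessing and the SVM classification stage are omitted, and all function names are hypothetical.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Toy k-means: cluster local descriptors into k visual words."""
    rng = np.random.default_rng(seed)
    # initialize centers from randomly chosen descriptors
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every descriptor to its nearest center (Euclidean distance)
        dist = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Quantize descriptors against the vocabulary; return a normalized histogram."""
    dist = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

In a real system, the histogram produced for each shot would then be fed to a classifier (one per concept); the retina preprocessing described in the paper acts earlier, on the video frames from which the descriptors are extracted.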
Notes
IRIM: Indexation et Recherche d'Information Multimédia (Multimedia Information Indexing and Searching), http://mrim.imag.fr/irim/.
Acknowledgement
This work would not have been possible without the IRIM French consortium (see Notes), which provided the processing toolchain for the unified descriptor evaluation.
Strat, S.T., Benoit, A., Lambert, P. et al. Retina enhanced SURF descriptors for spatio-temporal concept detection. Multimed Tools Appl 69, 443–469 (2014). https://doi.org/10.1007/s11042-012-1280-0