Abstract
For a long time, it was difficult to automatically extract meanings from video shots because, even for a single meaning, shots exhibit significantly different visual appearances depending on camera techniques and shooting environments. Recently, a promising approach has emerged in which a large number of shots are statistically analyzed to cover the diverse visual appearances associated with a meaning. Inspired by the resulting performance improvement, concept-based video retrieval has received much research attention. Here, concepts are abstracted names of meanings that humans can perceive in shots, such as objects, actions, events, and scenes. For each concept, a detector is built in advance by analyzing a large number of shots. Given a query, shots are then retrieved based on concept detection results. Since each detector can detect a concept robustly across diverse visual appearances, effective retrieval can be achieved by using detection results as “intermediate” features. However, despite recent improvements, it remains difficult to detect every kind of concept accurately. In addition, shots can be taken with arbitrary camera techniques and in arbitrary shooting environments, which unboundedly increases the diversity of visual appearances. Concepts therefore cannot be expected to be detected with \(100\,\%\) accuracy. This chapter explores how to utilize such uncertain detection results to improve concept-based video retrieval.
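As the chapter title indicates, uncertain detection results are handled with Dempster–Shafer theory [44]. The following is a minimal illustrative sketch, not the chapter's actual method: two detectors' confidence scores for a shot are discounted into mass functions over the frame {relevant (R), irrelevant (I)} and fused with Dempster's rule of combination. The score-to-mass mapping and the reliability values are hypothetical assumptions for illustration.

```python
# Minimal sketch of Dempster's rule over the frame {R, I}.
# "RI" denotes the whole frame, i.e. the mass assigned to ignorance.

def mass_from_score(score, reliability):
    """Discount a detection score into a mass function.
    This particular mapping is an assumption, not the chapter's method."""
    return {"R": score * reliability,
            "I": (1.0 - score) * reliability,
            "RI": 1.0 - reliability}

def combine(m1, m2):
    """Dempster's rule of combination for the two-hypothesis frame."""
    # Conflict mass: one source supports R while the other supports I.
    k = m1["R"] * m2["I"] + m1["I"] * m2["R"]
    norm = 1.0 - k  # renormalization factor
    return {
        "R":  (m1["R"] * m2["R"] + m1["R"] * m2["RI"] + m1["RI"] * m2["R"]) / norm,
        "I":  (m1["I"] * m2["I"] + m1["I"] * m2["RI"] + m1["RI"] * m2["I"]) / norm,
        "RI": (m1["RI"] * m2["RI"]) / norm,
    }

m1 = mass_from_score(0.9, 0.8)  # confident detector, high assumed reliability
m2 = mass_from_score(0.6, 0.5)  # weaker detector, low assumed reliability
m = combine(m1, m2)
print(m)  # masses sum to 1; belief in "R" dominates
```

The key property shown here is that an unreliable detector contributes more mass to ignorance ("RI"), so its opinion is automatically down-weighted when combined with a more reliable one.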
Notes
- 1.
- 2. Since the Search task was discontinued after TRECVID 2009, videos from that year are the latest for which retrieval performance using example shots can be evaluated.
References
Petkovic M, Jonker W (2002) Content-based video retrieval: a database perspective. Kluwer Academic Publishers, Boston
Smeulders A, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 22(12):1349–1380
Djordjevic D, Izquierdo E, Grzegorzek M (2007) User driven systems to bridge the semantic gap. In: Proceedings of the EUSIPCO 2007, pp 718–722
Staab S, Scherp A, Arndt R, Troncy R, Grzegorzek M, Saathoff C, Schenk S, Hardman L (2008) Semantic multimedia. In: Baroglio C, Bonatti PA, Małuszyński J, Polleres A, Schaffert S (eds) Reasoning Web. LNCS. Springer, Berlin
Naphade MR, Smith JR (2004) On the detection of semantic concepts at TRECVID. In: Proceedings of the MM 2004, pp 660–667
Snoek CGM, Worring M (2009) Concept-based video retrieval. Found Trends Inf Retr 2(4):215–322
Li X, Wang D, Li J, Zhang B (2007) Video search in concept subspace: a text-like paradigm. In: Proceedings of the CIVR 2007, pp 603–610
Natsev AP, Haubold A, Tešić J, Xie L, Yan R (2007) Semantic concept-based query expansion and re-ranking for multimedia retrieval. In: Proceedings of the MM 2007, pp 991–1000
Ngo C et al (2009) VIREO/DVMM at TRECVID 2009: high-level feature extraction, automatic video search and content-based copy detection. In: Proceedings of the TRECVID 2009, pp 415–432
Wei XY, Jiang YG, Ngo CW (2011) Concept-driven multi-modality fusion for video search. IEEE Trans Circuits Syst Video Technol 21(1):62–73
Naphade M, Smith J, Tesic J, Chang SF, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimed 13(3):86–91
Shirahama K, Uehara K (2011) Constructing and utilizing video ontology for accurate and fast retrieval. Int J Multimed Data Eng Manag (IJMDEM) 2(4):59–75
Zhu S, Wei X, Ngo C (2013) Error recovered hierarchical classification. In: Proceedings of the MM 2013, pp 697–700
Hauptmann A, Yan R, Lin WH, Christel M, Wactlar H (2007) Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Trans Multimed 9(5):958–966
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: Proceedings of the CVPR 2009, pp 248–255
Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with mechanical turk. In: Proceedings of the CHI 2008, pp 453–456
Ayache S, Quénot G (2008) Video corpus annotation using active learning. In: Proceedings of the ECIR 2008, pp 187–198
Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, Kadir T, Gool LV (2005) A comparison of affine region detectors. Int J Comput Vis 65(1–2):43–72
Lowe D (1999) Object recognition from local scale-invariant features. In: Proceedings of the ICCV 1999, pp 1150–1157
Bay H, Tuytelaars T, Gool L (2006) SURF: speeded up robust features. In: Proceedings of the ECCV 2006, pp 404–417
van de Sande KEA, Gevers T, Snoek CGM (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans Pattern Anal Mach Intell 32(9):1582–1596
Csurka G, Bray C, Dance C, Fan L (2004) Visual categorization with bags of keypoints. In: Proceedings of the ECCV 2004 SLCV, pp 1–22
Inoue N, Shinoda K (2012) A fast and accurate video semantic-indexing system using fast MAP adaptation and GMM supervectors. IEEE Trans Multimed 14(4):1196–1205
Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: Proceedings of the CVPR 2007, pp 1–8
Vapnik V (1998) Statistical learning theory. Wiley-Interscience, New York
Lin HT, Lin CJ, Weng RC (2007) A note on Platt’s probabilistic outputs for support vector machines. Mach Learn 68(3):267–276
Smeaton AF, Over P, Kraaij W (2006) Evaluation campaigns and TRECVid. In: Proceedings of the MIR 2006, pp 321–330
The PASCAL Visual Object Classes Homepage. http://pascallin.ecs.soton.ac.uk/challenges/VOC/
ImageNet Large Scale Visual Recognition Competition (2013) (ILSVRC2013). http://www.image-net.org/challenges/LSVRC/2013/
Shirahama K, Uehara K (2012) Kobe university and Muroran institute of technology at TRECVID 2012 semantic indexing task. In: Proceedings of the TRECVID 2012, pp 239–247
Snoek CGM et al (2009) The MediaMill TRECVID 2009 semantic video search engine. In: Proceedings of the TRECVID 2009, pp 226–238
Natsev AP, Naphade MR, Tešić J (2005) Learning the semantics of multimedia queries and concepts from a small number of examples. In: Proceedings of the MM 2005, pp 598–607
Rasiwasia N, Moreno P, Vasconcelos N (2007) Bridging the gap: query by semantic example. IEEE Trans Multimed 9(5):923–938
Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton
Denoeux T (2013) Maximum likelihood estimation from uncertain data in the belief function framework. IEEE Trans Knowl Data Eng 25(1):119–130
Kanamori T, Hido S, Sugiyama M (2009) A least-squares approach to direct importance estimation. J Mach Learn Res 10(7):1391–1445
He H, Garcia E (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proceedings of the ECCV 2006, pp 490–503
Snoek CGM, Worring M, Geusebroek JM, Koelma D, Seinstra F (2005) On the surplus value of semantic video analysis beyond the key frame. In: Proceedings of the ICME 2005, pp 386–389
Wang H, Klaser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: Proceedings of the CVPR 2011, pp 3169–3176
Peng Y et al (2009) PKU-ICST at TRECVID 2009: high level feature extraction and search. In: Proceedings of the TRECVID 2009
Aggarwal C, Yu P (2009) A survey of uncertain data algorithms and applications. IEEE Trans Knowl Data Eng 21(5):609–623
Bi J, Zhang T (2005) Support vector classification with input data uncertainty. In: Proceedings of the NIPS 2004, pp 161–168
Kriegel HP, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of the KDD 2005, pp 672–677
Wang H, McClean S (2008) Deriving evidence theoretical functions in multivariate data spaces: a systematic approach. IEEE Trans Syst Man Cybern B Cybern 38(2):455–465
Aregui A, Denoeux T (2008) Constructing consonant belief functions from sample data using confidence sets of pignistic probabilities. Int J Approx Reason 49(3):575–594
Zribi M (2003) Parametric estimation of Dempster-Shafer belief functions. In: Proceedings of the ISIF 2003, pp 485–491
Benmokhtar R, Huet B (2008) Perplexity-based evidential neural network classifier fusion using MPEG-7 low-level visual features. In: Proceedings of the MIR 2008, pp 336–341
Wang X, Kankanhalli M (2010) Portfolio theory of multimedia fusion. In: Proceedings of the MM 2010, pp 723–726
Li X, Snoek CG (2009) Visual categorization with negative examples for free. In: Proceedings of the MM 2009, pp 661–664
Quattoni A, Wang S, Morency L, Collins M, Darrell T (2007) Hidden conditional random fields. IEEE Trans Pattern Anal Mach Intell 29(10):1848–1852
Acknowledgments
The research by Kimiaki Shirahama was funded by the Postdoctoral Fellowship for Research Abroad of the Japan Society for the Promotion of Science (JSPS). This work was also supported in part by JSPS through a Grant-in-Aid for Scientific Research (B): KAKENHI (26280040).
Appendix
We evaluate our video retrieval method on the \(24\) queries specified in the TRECVID 2009 Search task [27]. For each query, shots in the test videos are manually assessed according to the following criterion: a shot is relevant to the query if it contains sufficient evidence for humans to recognize the relevance. In other words, the evidence may appear only in a region of some video frames in the shot. Below, we list the ID and text description of each query:
- 269: Find shots of a road taken from a moving vehicle through the front window
- 270: Find shots of a crowd of people, outdoors, filling more than half of the frame area
- 271: Find shots with a view of one or more tall buildings (more than four stories) and the top story visible
- 272: Find shots of a person talking on a telephone
- 273: Find shots of a closeup of a hand, writing, drawing, coloring, or painting
- 274: Find shots of exactly two people sitting at a table
- 275: Find shots of one or more people, each walking up one or more steps
- 276: Find shots of one or more dogs, walking, running, or jumping
- 277: Find shots of a person talking behind a microphone
- 278: Find shots of a building entrance
- 279: Find shots of people shaking hands
- 280: Find shots of a microscope
- 281: Find shots of two or more people, each singing and/or playing a musical instrument
- 282: Find shots of a person pointing
- 283: Find shots of a person playing a piano
- 284: Find shots of a street scene at night
- 285: Find shots of printed, typed, or handwritten text, filling more than half of the frame area
- 286: Find shots of something burning with flames visible
- 287: Find shots of one or more people, each at a table or desk with a computer visible
- 288: Find shots of an airplane or helicopter on the ground, seen from outside
- 289: Find shots of one or more people, each sitting in a chair, talking
- 290: Find shots of one or more ships or boats, in the water
- 291: Find shots of a train in motion, seen from outside
- 292: Find shots with the camera zooming in on a person’s face
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this chapter
Shirahama, K., Kumabuchi, K., Grzegorzek, M., Uehara, K. (2015). Video Retrieval Based on Uncertain Concept Detection Using Dempster–Shafer Theory. In: Baughman, A., Gao, J., Pan, JY., Petrushin, V. (eds) Multimedia Data Mining and Analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-14998-1_12
Print ISBN: 978-3-319-14997-4
Online ISBN: 978-3-319-14998-1