
A State-of-Art Review on Automatic Video Annotation Techniques

  • Conference paper
  • In: Intelligent Systems Design and Applications (ISDA 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 940)

Abstract

Video annotation has gained attention because of the rapid growth of video data and the widespread use of video analysis across domains. By describing video content at the semantic level, video annotation supports numerous video-analysis applications. Because manual video annotation is slow and labor-intensive, automatic video annotation was introduced. In this paper, distinctive methodologies for automatic video annotation are discussed. These models are classified into five classes: (1) generative models, (2) distance-based similarity models, (3) discriminative models, (4) ontology-based models, and (5) deep learning-based models. The key theoretical contributions of the current decade to video annotation strategies are reviewed, and future directions for video annotation research are outlined.
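As a concrete illustration of the fifth class above, the sketch below shows a minimal deep learning-based annotator: sample frames from a video, classify each frame with a pretrained CNN, and pool the per-frame predictions into video-level tags. This is a simplified sketch, not the method of any paper surveyed here; it assumes Python with OpenCV and torchvision (v0.13 or later, for the pretrained-weights API), and the model choice, sampling interval, and file name example.mp4 are illustrative.

    # Minimal deep learning-based video annotator (illustrative sketch):
    # sample every n-th frame, label it with a pretrained ImageNet CNN,
    # and keep the most frequent labels as video-level tags.
    from collections import Counter

    import cv2                      # OpenCV for video decoding
    import torch
    from torchvision import models

    weights = models.ResNet50_Weights.DEFAULT       # pretrained ImageNet weights
    model = models.resnet50(weights=weights).eval()
    preprocess = weights.transforms()               # resize/crop/normalize preset
    labels = weights.meta["categories"]             # ImageNet class names

    def annotate_video(path, every_n=30, top_k=5):
        """Return the top_k most frequent frame-level labels as video tags."""
        cap = cv2.VideoCapture(path)
        votes = Counter()
        idx = 0
        while True:
            ok, frame = cap.read()                  # frame is BGR, HxWx3 uint8
            if not ok:
                break
            if idx % every_n == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                chw = torch.from_numpy(rgb).permute(2, 0, 1)   # to CxHxW
                with torch.no_grad():
                    logits = model(preprocess(chw).unsqueeze(0))
                votes[labels[logits.argmax(1).item()]] += 1
            idx += 1
        cap.release()
        return [tag for tag, _ in votes.most_common(top_k)]

    if __name__ == "__main__":
        print(annotate_video("example.mp4"))        # hypothetical input file

Real systems in this class typically replace the per-frame classifier with spatiotemporal encoders (e.g., combined CNN+RNN pipelines) and model correlations among video-level labels, but the sample-classify-aggregate pattern above is the common baseline they build on.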



Author information

Correspondence to Krunal Randive.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Randive, K., Mohan, R. (2020). A State-of-Art Review on Automatic Video Annotation Techniques. In: Abraham, A., Cherukuri, A.K., Melin, P., Gandhi, N. (eds) Intelligent Systems Design and Applications (ISDA 2018). Advances in Intelligent Systems and Computing, vol 940. Springer, Cham. https://doi.org/10.1007/978-3-030-16657-1_99

