
A State-of-Art Review on Automatic Video Annotation Techniques

  • Conference paper
  • In: Intelligent Systems Design and Applications (ISDA 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 940)

Abstract

Video annotation has gained attention because of the rapid growth of video data and the widespread use of video analysis across domains. By describing video content at the semantic level, video annotation supports numerous video-analysis applications. Because manual video annotation is slow and labor-intensive, automatic video annotation was introduced. In this paper, distinctive methodologies for automatic video annotation are discussed. These models are classified into five classes: (1) generative models, (2) distance-based similarity models, (3) discriminative models, (4) ontology-based models, and (5) deep learning-based models. The key theoretical contributions of the current decade to video annotation strategies are reviewed, and future directions for video annotation research are outlined.
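As a concrete illustration of the fifth class above, the sketch below shows a minimal deep learning-based annotator: sample frames from a video, classify each frame with a pretrained CNN, and pool the per-frame predictions into video-level tags. This is a simplified sketch, not the method of any paper surveyed here; it assumes Python with OpenCV and torchvision (v0.13 or later, for the pretrained-weights API), and the model choice, sampling interval, and file name example.mp4 are illustrative.

    # Minimal deep learning-based video annotator (illustrative sketch):
    # sample every n-th frame, label it with a pretrained ImageNet CNN,
    # and keep the most frequent labels as video-level tags.
    from collections import Counter

    import cv2                      # OpenCV for video decoding
    import torch
    from torchvision import models

    weights = models.ResNet50_Weights.DEFAULT       # pretrained ImageNet weights
    model = models.resnet50(weights=weights).eval()
    preprocess = weights.transforms()               # resize/crop/normalize preset
    labels = weights.meta["categories"]             # ImageNet class names

    def annotate_video(path, every_n=30, top_k=5):
        """Return the top_k most frequent frame-level labels as video tags."""
        cap = cv2.VideoCapture(path)
        votes = Counter()
        idx = 0
        while True:
            ok, frame = cap.read()                  # frame is BGR, HxWx3 uint8
            if not ok:
                break
            if idx % every_n == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                chw = torch.from_numpy(rgb).permute(2, 0, 1)   # to CxHxW
                with torch.no_grad():
                    logits = model(preprocess(chw).unsqueeze(0))
                votes[labels[logits.argmax(1).item()]] += 1
            idx += 1
        cap.release()
        return [tag for tag, _ in votes.most_common(top_k)]

    if __name__ == "__main__":
        print(annotate_video("example.mp4"))        # hypothetical input file

Real systems in this class typically replace the per-frame classifier with spatiotemporal encoders (e.g., combined CNN+RNN pipelines) and model correlations among video-level labels, but the sample-classify-aggregate pattern above is the common baseline they build on.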



Author information

Correspondence to Krunal Randive.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Randive, K., Mohan, R. (2020). A State-of-Art Review on Automatic Video Annotation Techniques. In: Abraham, A., Cherukuri, A.K., Melin, P., Gandhi, N. (eds) Intelligent Systems Design and Applications (ISDA 2018). Advances in Intelligent Systems and Computing, vol 940. Springer, Cham. https://doi.org/10.1007/978-3-030-16657-1_99

