A State-of-Art Review on Automatic Video Annotation Techniques

  • Krunal Randive
  • R. Mohan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 940)


Video annotation has gained attention because of the rapid growth in video data and the widespread use of video analysis across many domains. Because it describes video at the semantic level, video annotation has numerous applications in video analysis. Automatic Video Annotation was introduced to overcome the shortcomings of manual video annotation. This paper discusses the distinct approaches to automatic video annotation, classified into five classes: (1) generative models, (2) distance-based similarity models, (3) discriminative models, (4) ontology-based models, and (5) deep learning-based models. It reviews the key theoretical contributions of the current decade in support of video annotation strategies and outlines future research directions for them.


Automatic Video Annotation (AVA); Deep learning; Ontology; Feature extraction; Convolutional Neural Network (CNN)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, India
