Deep Learning Based Semantic Video Indexing and Retrieval

  • Anna Podlesnaya
  • Sergey Podlesnyy
Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 16)


The vast amount of video stored in web archives makes retrieval based on manual text annotations impractical. This study presents a video retrieval system that capitalizes on image recognition techniques. The article describes the implementation details and empirical evaluation results of a system based entirely on features extracted by convolutional neural networks. It is shown that these features can serve as universal signatures of the semantic content of video and can be used to implement several types of multimedia retrieval queries defined in the MPEG-7 standard. Further, a graph-based structure for the video index storage is proposed in order to efficiently implement complex spatial and temporal search queries. Thus, the technical approaches proposed in this work may help to build a cost-efficient and user-friendly multimedia retrieval system.
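As a rough illustration of how per-frame feature vectors could act as semantic signatures, the following minimal sketch splits a frame sequence into shots and ranks the shots against a query vector. It is not the authors' implementation: the use of cosine similarity, the threshold value of 0.5, mean pooling of frame features into a shot signature, and all function names are assumptions made for this example, and the short vectors stand in for real convolutional network features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_shot_boundaries(frame_features, threshold=0.5):
    """Return indices i at which frame i is assumed to start a new shot.

    A boundary is declared when the similarity between consecutive frames
    drops below `threshold` (a hypothetical value, to be tuned on real data).
    """
    boundaries = []
    for i in range(1, len(frame_features)):
        if cosine_similarity(frame_features[i - 1], frame_features[i]) < threshold:
            boundaries.append(i)
    return boundaries


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def shot_signatures(frame_features, boundaries):
    """Map shot id -> mean feature vector of the frames in that shot."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [len(frame_features)]
    return {
        shot_id: mean_vector(frame_features[start:end])
        for shot_id, (start, end) in enumerate(zip(starts, ends))
    }


def rank_shots(query, signatures):
    """Return shot ids ordered from most to least similar to the query vector."""
    scored = [(cosine_similarity(query, sig), shot_id)
              for shot_id, sig in signatures.items()]
    scored.sort(reverse=True)
    return [shot_id for _, shot_id in scored]
```

In this sketch the same feature vectors serve both tasks: consecutive-frame similarity locates shot boundaries, and the pooled signatures answer a query-by-example request. A production system would replace the linear scan in `rank_shots` with an index structure.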


Video indexing; Video retrieval; Shot boundary detection; Graph database; Semantic features; Convolutional neural networks; Deep learning; MPEG-7



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Cinema and Photo Research Institute (NIKFI), Creative Production Association “Gorky Film Studio”, Moscow, Russia
