
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries

  • Dian Shao
  • Yu Xiong
  • Yue Zhao
  • Qingqiu Huang
  • Yu Qiao
  • Dahua Lin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11213)

Abstract

The thriving of video sharing services brings new challenges to video retrieval, e.g., the rapid growth in video duration and content diversity. Meeting such challenges calls for new techniques that can effectively retrieve videos with natural language queries. Existing methods along this line, which mostly rely on embedding videos as a whole, remain far from satisfactory for real-world applications due to their limited expressive power. In this work, we aim to move beyond this limitation by delving into the internal structures of both sides, the queries and the videos. Specifically, we propose a new framework called Find and Focus (FIFO), which not only performs top-level matching (paragraph vs. video), but also makes part-level associations, localizing a video clip for each sentence in the query with the help of a focusing guide. These levels are complementary: the top-level matching narrows the search, while the part-level localization refines the results. On both the ActivityNet Captions and modified LSMDC datasets, the proposed framework achieves remarkable performance gains (project page: https://ycxioooong.github.io/projects/fifo).
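
To make the two-level design concrete, below is a minimal Python sketch of the general "find then focus" retrieval flow described in the abstract: a top-level paragraph-video similarity shortlists candidate videos, and per-sentence clip matching then localizes and re-ranks within the shortlist. The embeddings, the cosine scoring, and the greedy per-sentence assignment are illustrative assumptions for exposition only; they are not the learned matching models or the focusing guide of the actual FIFO framework.

    import numpy as np

    # Minimal illustrative sketch (not the authors' implementation): score videos
    # at two levels using precomputed embeddings. All names and shapes below are
    # assumptions made for exposition.

    def cosine(a, b):
        """Cosine similarity between two vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def find_then_focus(paragraph_emb, sentence_embs, video_embs, clip_embs, top_k=10):
        """paragraph_emb: (d,) query-paragraph embedding.
        sentence_embs: list of (d,) embeddings, one per query sentence.
        video_embs: dict video_id -> (d,) video-level embedding.
        clip_embs: dict video_id -> list of (clip_id, (d,) clip embedding).
        Returns the shortlist re-ranked by aggregated sentence-clip similarity."""
        # Stage 1 ("find"): top-level matching narrows the search to top_k videos.
        shortlist = sorted(video_embs,
                           key=lambda vid: cosine(paragraph_emb, video_embs[vid]),
                           reverse=True)[:top_k]

        # Stage 2 ("focus"): localize the best-matching clip for every sentence,
        # then re-rank the shortlisted videos by the summed part-level scores.
        results = []
        for vid in shortlist:
            alignments, score = [], 0.0
            for s_emb in sentence_embs:
                clip_id, sim = max(((cid, cosine(s_emb, c_emb))
                                    for cid, c_emb in clip_embs[vid]),
                                   key=lambda pair: pair[1])
                alignments.append(clip_id)
                score += sim
            results.append((vid, score, alignments))
        return sorted(results, key=lambda item: item[1], reverse=True)

The split mirrors the complementarity noted above: the coarse stage keeps retrieval tractable over large collections, while the fine stage recovers the sentence-level localization that whole-video embeddings cannot express.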

Notes

Acknowledgements

This work is partially supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), the Early Career Scheme (ECS) of Hong Kong (No. 24204215), and International Partnership Program of Chinese Academy of Sciences (172644KYSB20160033).

Supplementary material

Supplementary material 1: 474192_1_En_13_MOESM1_ESM.pdf (PDF, 2,923 KB)

Supplementary material 2 (MP4, 17,789 KB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, Shatin, Hong Kong
  2. SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, China
