Video Re-localization

  • Yang Feng (email author)
  • Lin Ma
  • Wei Liu
  • Tong Zhang
  • Jiebo Luo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)


Many methods have been developed to help people find the video content they want efficiently. However, some problems in this area remain unsolved. For example, given a query video and a reference video, how can we accurately localize a segment in the reference video that semantically corresponds to the query video? We define a new task, video re-localization, to address this need. Video re-localization is an important enabling technology with many applications, such as fast seeking in videos, video copy detection, and video surveillance. It is also a challenging research task because the visual appearance of a semantic concept can vary widely across videos. The first hurdle for the video re-localization task is the lack of existing datasets: collecting pairs of videos with semantic coherence or correspondence and labeling the corresponding segments is labor-intensive. We first exploit and reorganize the videos in ActivityNet to form a new dataset for video re-localization research, which consists of about 10,000 videos of diverse visual appearances associated with localized boundary information. We then propose a cross gated bilinear matching model in which every time-step in the reference video is matched against the attentively weighted query video. The prediction of the starting and ending time is then formulated as a classification problem based on the matching results. Extensive experimental results show that the proposed method outperforms the baseline methods. Our code is available at:
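The matching idea in the abstract can be illustrated with a small, untrained sketch. It is not the authors' implementation: the dot-product attention, the sigmoid gating form, the random weights, and the choice to score start and end positions with a softmax over reference time-steps are all assumptions made for illustration, and every function and parameter name (`attend`, `cross_gate`, `bilinear`, `localize`, `Wa`, `Wb`, `W_start`, `W_end`) is hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend(ref_t, query):
    # Attention-weighted summary of the query for one reference time-step
    # (dot-product attention is an assumption, not taken from the paper).
    weights = softmax([dot(ref_t, q) for q in query])
    return [sum(w * q[d] for w, q in zip(weights, query))
            for d in range(len(ref_t))]

def cross_gate(a, b, Wa, Wb):
    # Each vector is scaled by a sigmoid gate computed from the other one.
    gate_for_a = [sigmoid(x) for x in matvec(Wb, b)]
    gate_for_b = [sigmoid(x) for x in matvec(Wa, a)]
    gated_a = [g * x for g, x in zip(gate_for_a, a)]
    gated_b = [g * x for g, x in zip(gate_for_b, b)]
    return gated_a, gated_b

def bilinear(a, b, W):
    # Bilinear matching score a^T W b.
    return dot(a, matvec(W, b))

def localize(ref, query, params):
    start_scores, end_scores = [], []
    for r in ref:
        q = attend(r, query)
        r_g, q_g = cross_gate(r, q, params["Wa"], params["Wb"])
        start_scores.append(bilinear(r_g, q_g, params["W_start"]))
        end_scores.append(bilinear(r_g, q_g, params["W_end"]))
    # Assumed formulation: classify which time-step is the start / the end.
    p_start, p_end = softmax(start_scores), softmax(end_scores)
    n = len(ref)
    pairs = [(s, e) for s in range(n) for e in range(s, n)]
    best = max(pairs, key=lambda se: p_start[se[0]] * p_end[se[1]])
    return best, p_start, p_end

# Toy data: random features stand in for real per-time-step video features.
random.seed(0)
dim = 4

def rand_matrix(n):
    return [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

params = {name: rand_matrix(dim) for name in ("Wa", "Wb", "W_start", "W_end")}
ref = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
query = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]

(start, end), p_start, p_end = localize(ref, query, params)
print("predicted segment:", start, end)
```

With random weights the predicted segment is meaningless; the sketch only shows how per-time-step matching scores could be turned into a start/end decision with the start constrained not to follow the end.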


Keywords: Video re-localization, Cross gating, Bilinear matching



We gratefully acknowledge the support of New York State through the Goergen Institute for Data Science and of NSF Award #1722847.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yang Feng (2, email author)
  • Lin Ma (1)
  • Wei Liu (1)
  • Tong Zhang (1)
  • Jiebo Luo (2)
  1. Tencent AI Lab, Shenzhen, China
  2. University of Rochester, Rochester, USA
