Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering
Abstract
This work addresses the problem of video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer from the video content given a natural-language question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly tackle the problem with overall frame-level visual understanding, which neglects the finer-grained and temporal information inside the video, or combine multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that is able to search for the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then combined hierarchically by a double-layer LSTM to generate the answer. Furthermore, we evaluate several alternative multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on the ActivityNet dataset.
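To make the fusion scheme concrete, below is a minimal PyTorch sketch of the hierarchy described above: a coarse frame-level stream and a fine region-level stream are each attended with the question, and the two attended features are fused by a two-layer (stacked) LSTM that feeds an answer classifier. All module names, dimensions, and the 1000-way answer vocabulary are illustrative assumptions, and the single-direction (question-to-visual) attention is a simplification of the paper's mutual attention; this is not the authors' implementation.

```python
# Illustrative sketch only: names, dimensions, and the one-directional
# attention are assumptions, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Soft attention of a question vector over a visual feature sequence."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual, question):
        # visual: (B, T, D); question: (B, D)
        scores = torch.bmm(self.proj(visual), question.unsqueeze(2))  # (B, T, 1)
        weights = F.softmax(scores, dim=1)       # attention over the T steps
        return (weights * visual).sum(dim=1)     # attended feature: (B, D)

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.frame_att = QuestionGuidedAttention(dim)   # coarse, frame-level
        self.region_att = QuestionGuidedAttention(dim)  # fine, region-level
        # Two stacked LSTM layers fuse the multi-grained attended features.
        self.fusion = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, frames, regions, question):
        f = self.frame_att(frames, question)     # (B, D)
        r = self.region_att(regions, question)   # (B, D)
        seq = torch.stack([f, r], dim=1)         # (B, 2, D): coarse-to-fine
        _, (h, _) = self.fusion(seq)             # h: (num_layers, B, D)
        return self.classifier(h[-1])            # answer logits: (B, num_answers)

# Usage with random tensors standing in for CNN / word-embedding features.
model = HierarchicalFusion()
frames = torch.randn(4, 20, 512)    # e.g. 20 sampled frames per video
regions = torch.randn(4, 36, 512)   # e.g. 36 detected region proposals
question = torch.randn(4, 512)      # pooled question embedding
logits = model(frames, regions, question)  # (4, 1000)
```

Feeding the attended features through the LSTM as an ordered coarse-to-fine sequence, rather than concatenating or adding them, is what makes the fusion hierarchical in this sketch.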
Keywords
Video question answering · Multi-grained representation · Temporal co-attention
Acknowledgements
This work was supported by the Zhejiang Natural Science Foundation (LR19F020002, LZ17F020001), the National Natural Science Foundation of China (61572431), the Key R&D Program of Zhejiang Province (2018C01006), the Chinese Knowledge Center for Engineering Sciences and Technology, and the Joint Research Program of ZJU and Hikvision Research Institute.