Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network

  • Feng MaoEmail author
  • Xiang Wu
  • Hui Xue
  • Rong Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


High accuracy video label prediction (classification) models are attributed to large scale data. These data could be frame feature sequences extracted by a pre-trained convolutional-neural-network, which promote the efficiency for creating models. Unsupervised solutions such as feature average pooling, as a simple label-independent parameter-free based method, has limited ability to represent the video. While the supervised methods, like RNN, can greatly improve the recognition accuracy. However, the video length is usually long, and there are hierarchical relationships between frames across events in the video, the performance of RNN based models are decreased. In this paper, we proposes a novel video classification method based on a deep convolutional graph neural network (DCGN). The proposed method utilize the characteristics of the hierarchical structure of the video, and performed multi-level feature extraction on the video frame sequence through the graph network, obtained a video representation reflecting the event semantics hierarchically. We test our model on YouTube-8M Large-Scale Video Understanding dataset, and the result outperforms RNN based benchmarks.


Video classification Sequence representation Graph neural network Deep convolutional neural network 


  1. 1.
    Abu-El-Haija, S., et al.: Youtube-8m: a large-scale video classification benchmark. CoRR abs/1609.08675 (2016)Google Scholar
  2. 2.
    Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. CoRR abs/1512.00567 (2015)Google Scholar
  3. 3.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(9), 1735–1780 (1997)CrossRefGoogle Scholar
  4. 4.
    Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014)Google Scholar
  5. 5.
    Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852 (2015)Google Scholar
  6. 6.
    Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702 (2015)Google Scholar
  7. 7.
    Ballas, N., Yao, L., Pal, C., Courville, A.: Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432 (2015)
  8. 8.
    Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (IndRNN): building a longer and deeper RNN. CoRR abs/1803.04831 (2018)Google Scholar
  9. 9.
    Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907 (2016)Google Scholar
  10. 10.
    Potapov, D., Douze, M., Harchaoui, Z., Schmid, C.: Category-specific video summarization. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 540–555. Springer, Cham (2014). Scholar
  11. 11.
    Jordan, M.I., Jacobs, R.A.: Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 6(2), 181–214 (1994)CrossRefGoogle Scholar
  12. 12.
    Google: Youtube-8m starter code (2018).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.Alibaba GroupHangzhouChina

Personalised recommendations