Learning Event Representations by Encoding the Temporal Context
This work aims at learning image representations suitable for event segmentation, a largely unexplored problem in the computer vision literature. The proposed approach is a self-supervised neural network that captures patterns of temporal overlap by learning to predict the feature vectors of neighboring frames given that of the current frame. The model is inspired by recent experimental findings in neuroscience showing that stimuli associated with similar temporal contexts are grouped together in the representational space. Experiments on image sequences captured at regular intervals show that a representation able to encode the temporal context provides promising results on the task of temporal segmentation.
Keywords: Representation learning, Event learning, LSTM, Neural networks
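The self-supervised objective described in the abstract, predicting the feature vectors of neighboring frames from that of the current frame, needs training pairs that require no manual labels. The following is a minimal sketch of how such pairs could be built from a sequence of per-frame feature vectors. The function name `temporal_context_pairs` and the `window` parameter are hypothetical and introduced only for illustration. The abstract does not specify the context size or the network architecture, so this shows only the pairing step, not the authors' model.

```python
from typing import List, Sequence, Tuple

Feature = Sequence[float]


def temporal_context_pairs(
    features: Sequence[Feature], window: int = 2
) -> List[Tuple[Feature, List[Feature]]]:
    """Pair each frame's feature vector with those of its temporal neighbors.

    `window` (an assumed parameter) is the number of frames taken on each
    side of the current frame. Frames near the start or end of the sequence
    simply receive fewer neighbors.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pairs = []
    for t, current in enumerate(features):
        lo = max(0, t - window)
        hi = min(len(features), t + window + 1)
        # Every frame in [lo, hi) except the current one is a prediction target.
        neighbors = [features[i] for i in range(lo, hi) if i != t]
        pairs.append((current, neighbors))
    return pairs


# Toy sequence of four 2-dimensional feature vectors (made-up values).
frames = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
pairs = temporal_context_pairs(frames, window=1)
```

Each element of `pairs` holds an input (the current frame's features) and its targets (the neighbors' features). A predictive model, for instance one based on an LSTM as the keywords suggest, could then be trained to regress the targets from the input.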
This work was partially funded by TIN2015-66951-C2, SGR 1742, ICREA Academia 2014, Marató TV3 (20141510), Nestore Horizon2020 SC1-PM-15-2017 (769643) and CERCA. The funders had no role in the study design, data collection, analysis, and preparation of the manuscript. The authors gratefully acknowledge NVIDIA Corporation for the donation of the GPU used in this work.