Learning Event Representations by Encoding the Temporal Context

  • Catarina Dias
  • Mariella Dimiccoli
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11131)


This work aims at learning image representations suitable for event segmentation, a largely unexplored problem in the computer vision literature. The proposed approach is a self-supervised neural network that captures patterns of temporal overlap by learning to predict the feature vectors of neighboring frames, given that of the current frame. The model is inspired by recent experimental findings in neuroscience showing that stimuli associated with similar temporal contexts are grouped together in the representational space. Experiments performed on image sequences captured at regular intervals show that a representation able to encode the temporal context yields very promising results on the task of temporal segmentation.
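The self-supervised objective described above can be made concrete with a minimal sketch. The paper trains a neural network (an LSTM) to predict the feature vectors of neighboring frames from that of the current frame; in the toy example below, a plain linear predictor stands in for the LSTM and synthetic random-walk vectors stand in for CNN frame features. All names and choices here (dimensions, learning rate, optimizer) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Hypothetical sketch of the self-supervised objective: predict the
# feature vector of the next frame from that of the current frame.
# A linear map W replaces the paper's LSTM to keep the example minimal.
rng = np.random.default_rng(0)

T, d = 200, 16
# Slowly drifting synthetic "frame features": temporal neighbors are
# correlated, mimicking frames captured at regular intervals.
feats = np.cumsum(rng.normal(scale=0.1, size=(T, d)), axis=0)

X, Y = feats[:-1], feats[1:]  # input: frame t, target: frame t + 1

def mse(W):
    """Mean squared error of predicting Y from X with linear map W."""
    return float(np.mean((X @ W - Y) ** 2))

W = np.zeros((d, d))
loss_before = mse(W)
for _ in range(500):  # plain gradient descent on the MSE objective
    grad = 2.0 * X.T @ (X @ W - Y) / len(X)
    W -= 0.01 * grad
loss_after = mse(W)
# After training, W approaches the identity: for a slowly drifting
# sequence, the best linear guess for the next frame's features is
# close to the current frame's features.
```

Because frames from the same event share a temporal context, a representation trained this way places them close together, which is what makes a simple boundary detector on the learned features plausible.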


Keywords: Representation learning · Event learning · LSTM · Neural networks



This work was partially funded by TIN2015-66951-C2, SGR 1742, ICREA Academia 2014, Marató TV3 (20141510), Nestore Horizon2020 SC1-PM-15-2017 (769643) and CERCA. The funders had no role in the study design, data collection, analysis, or preparation of the manuscript. The authors gratefully acknowledge NVIDIA Corporation for the donation of the GPU used in this work.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of Engineering, University of Porto, Porto, Portugal
  2. Department of Mathematics and Computer Science, University of Barcelona, Barcelona, Spain
  3. Computer Vision Center, Campus UAB, Cerdanyola del Valles, Barcelona, Spain
