Learning Event Representations by Encoding the Temporal Context

Dias, Catarina; Dimiccoli, Mariella

doi:10.1007/978-3-030-11015-4_44

Catarina Dias¹⁴ &
Mariella Dimiccoli^15,16

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11131))

Included in the following conference series:

European Conference on Computer Vision

1594 Accesses
2 Citations

Abstract

This work aims at learning image representations suitable for event segmentation, a largely unexplored problem in the computer vision literature. The proposed approach is a self-supervised neural network that captures patterns of temporal overlap by learning to predict the feature vector of neighbor frames, given the one of the current frame. The model is inspired to recent experimental findings in neuroscience, showing that stimuli associated with similar temporal contexts are grouped together in the representational space. Experiments performed on image sequences captured at regular intervals have shown that a representation able to encode the temporal context provides very promising results on the task of temporal segmentation.

You have full access to this open access chapter, Download conference paper PDF

Learning to Segment Moving Objects

Article 22 September 2018

A Differentiable Recurrent Surface for Asynchronous Event-Based Data

Dynamic and Evolving Neural Network for Event Discrimination

Keywords

1 Introduction

As our sensory system is inherently continuous, we experience the world as an uninterrupted stream of perceptual stimuli. However, sensory information is automatically segmented by our brain into discrete events that can be understood, remembered and retrieved from the memory. How these event representations are generated at neural level is a very active area of research in neuroscience [1,2,3,4,5]. Although firstly questioned more than fifty years ago [1], it was only in 2007 that the seminal work of Zacks [2] revealed the key role of uncertainty and surprise in determining event boundaries. Later on, Kurby and Zacks [3] hypothesized that event segmentation might arise as a side effect of integrating information over the recent past to improve predictions about the near future. More recently, Shapiro et al. [4] have shown that neural representation of events are not tied to predictive uncertainty, but arise from temporal community structures: items that share the temporal context are grouped together in a representational space. Focusing more on a higher level processing, DuBrow and Davachi [5] have argued that event boundaries are generated by changes in our goals and these goals determine how information is stored and retrieved in our brain.

Besides neuroscience, event segmentation is also receiving an increasing attention in computer vision [6,7,8,9,10]. Indeed, temporal video segmentation can be considered the first step towards automatic annotation and recognition of digital video sequences (see Fig. 1). The growing size of today’s available unconstrained videos on internet raises the need of automatically detecting, recognizing and retrieving the type of complex events occurring in them. Typically, state of the art algorithms for event or action recognition rely on event boundary detection and visual features extracted at image level or at event level such as semantic features or optical flow. Most recent and promising approaches for event detection [11, 12] use concept scores as intermediate representation, which are the confidence of the occurrence of the concepts in the video. For example, the event Having a dinner with friends can be described as the occurrence of food, laughing, people, bottles ... etc. However, the resulting concept-based event representation is highly noisy due to the high variability of concept’s appearance on the one hand and to the high variability of the concept representation for a complex event on the other hand. Therefore it is difficult to segment temporally a video based on semantic features.

Given the relevance of the problem, in this paper, we propose to learn a representation for image sequences suitable for the task of temporal segmentation. Inspired by the theory Shapiro et al. [4], our model aims at embedding the temporal context of images via a self-supervised pretext task consisting in predicting the feature vector of neighbor frames given the concept vector of the current frame.

The reminder of this paper is as follows. In Sect. 2, we present the proposed approach, while in Sect. 3 we detail and discuss experimental results. We conclude the paper with Sect. 4, summarizing its main contributions and findings.

2 Learning Event Representations

Here, we first introduce the concepts underlying the proposed model. Second, we describe the pretext task that we use to learn the temporal representations and subsequently present the validation task which is employed to evaluate the quality of the learned representations (see Fig. 2).

2.1 Underlying Model

Hereafter, we consider a toy example to illustrate the intuition underlying our model. Let us suppose that we are given a sequence of N images and each image is represented by a feature vector as shown in Fig. 3(a).

Relying on the feature vector representation, we built a directed graph, where each vertex corresponds to a feature vector and each direct edge going from, says, A to B indicates that feature vector B temporally follows feature vector A in the image sequence. The resulting graph, shown in Fig. 3(b) indicates that the underlying representation of the image sequence presents two community structures, with many edges joining vertices of the same community and comparatively few edges joining vertices of different communities. Therefore, the directed graph could be regarded as a more intuitive simplified model that represents the temporal context of sequence data.

In the next section, we focus on how to automatically learn a feature space, where temporally nearby frames belonging to the same event lie close to each other.

2.2 Pretext Task

Representation learning aims at building intermediate representations of data useful to solve machine learning tasks. In self-supervised learning, one trains a model to solve a so-called pretext task on a dataset without the need for human annotation by exploiting labeling that comes for “free” with the data.

In our case, similarly to word2vec models in natural language processing [13], the pretext task is a prediction task, that given a frame, aims at predicting the temporally neighbor frames corresponding to the temporal context. This leads to learn a function from a given frame to the frames surrounding it. We considered two different implementations. The first one is a simple neural network with a single hidden layer that takes as input the feature vector \(x_i\) of the frame i and is trained to output the concept vector \(x_{i\pm n}\) of the frame \(i\pm n\), with \(n \in \{1,..,m\}\) by minimizing the Mean Squared Error (MSE) between \(x_i\) and the estimation of \(x_{i+1}\), say \(\hat{x}_{i+1}\). After training, the new feature vector \(\tilde{x}_i\) embedding the temporal context for the image i, is obtained by multiplying \(x_i\) to the learned weight matrix W. This procedure is illustrated on Fig. 4.

The second one is a many-to-many encoder-decoder long short term memory (LSTM) recurrent neural network. The model is trained to predict the feature vector of the next frame in a sequence based on the n previous ones. This process is illustrated in Fig. 5. The network is trained by feeding batches of sequential frames of size \(n<N\) randomly extracted, where N is the full length of the video sequence. The new family of feature vectors \(\tilde{\mathbf{x}}\) are obtained after training, as output of the encoder by feeding it the original feature vectors \(\mathbf{x}\).

In the case of the toy example illustrated in the previous section, we used one-hot encoding for the 12 feature vectors and we trained a neural network to predict the feature vector of the image \(i+1\) given the concept vector i. In Fig. 3(c) is shown the multidimensional scaling of the activation vectors obtained multiplying the original feature vectors by the weight matrix learned with the prediction task. From this figure we can clearly see how the feature vectors that on Fig. 3(b) were linked by numerous edges lie close to each other on this new representational space (A, B, C, D, E on the one side, and G, I, F, M, H, L, O on the other side).

Since the main objective is to transfer the learned representation to target task, in the next section we detail how we validate the goodness of the new learned representation.

2.3 Validation Task

To validate the goodness of the learned representation we applied a classical temporal segmentation algorithm to both, the concept vector representation and the new learned representation. In particular, we used a Binary Partition Tree (BPT) based approach [14]. BPT is a hierarchical clustering method that iteratively merges the most similar temporally adjacent neighbor frames until a single event is obtained. The initial partition of events is given by all frames. Each time two neighboring events are merged, the resulting event is modeled as the mean feature vector of the two original events. The temporal segmentation is obtained by cutting the binary tree, using as criterion the number of events or a more complex function that takes into account the similarity of all the merging. Since the goal of this work is to show that encoding the temporal context is beneficial for temporal segmentation, we used the most simple criterion.

3 Experimental Results

In the following, we first detail our experimental setup, including the dataset used in the experiments, the metrics employed in the validation task, how the initial feature vectors where computed and which type of experiments were performed. Then, we present and discuss the experimental results.

3.1 Experimental Setup

Dataset. We used a subset of the EDUB-Seg dataset [15, 16], consisting of ten image sequences belonging to five different users and captured by a wearable photo-camera that takes pictures at regular intervals of 30 s, with an average of 662 images per sequence. The subset we considered, comes together with the ground truth event segmentation and concept vectors describing the probability of each concept in the images.

Evaluation Measure. To evaluate the performance of the temporal segmentation we used the F-measure. In particular, we considered as true positives (TPs) the images that the BPT detects as boundaries of an event and that were manually defined in the GT. The false positives (FPs) are the images detected as events delimiters, but that were not marked in the GT, and the false negatives (FNs) the boundaries not detected but present in the GT. In all cases we considered a tolerance of 5 frames as in [16]. Good event segmentations correspond to F-measure values close to 1.

Feature Extraction. In our experiments, the original feature vectors are obtained as in [16] by firstly detecting concepts independently on each image by means of a concept detector, and then by clustering them in a semantic space by relying on WordNet [17] (see Fig. 6). The number of clusters found determines the size of the vocabulary of concepts. Each element of the feature vector corresponds to the probability of finding a given concept in the image.

Experiments. We performed three different class of experiments, with slightly different prediction tasks. The first one is the most simple forward prediction (NN forward): given the i^th feature vector as input to predict the \((i+1)\)^th feature vector. The second one is a forward prediction that takes into account the previous n feature vectors to predict the next feature vector (LSTM forward). In our experiments we used \(n=50\). The third one is a forward-backward prediction (NN forward-backward): given the i^th feature vector as input to predict the neighbor feature vectors inside a window of length n centered at i. We considered \(n=\{2,3,4,5\}\). All models were trained by using the MSE of the prediction as loss function and by using Stochastic Gradient Descent (SGD) as optimizer. The results reported on Tables 2 and 3 were achieved by setting the learning rate to 0.1 for the encoder-decoder LSTM model, with momentum to 0.9 for the NN models. The number of neurons used in the hidden layer was different accordingly to the prediction task as shown on Table 1. The size of the original feature vector was 50, extracted following the procedure described in Sect. 2.2. For each sequence the weights that ensured a better performance were saved for later generation of the new learned representation. The BPT based approach described in Sect. 2.3 for event segmentation was applied by relying on both, the original feature vectors and the new learned representation.

Table 1. Number of neurons in the hidden layer used in the different prediction tasks.

Full size table

Table 2. Results obtained in the different experiments for each sequence.

Full size table

Table 3. Average results for each prediction task and baseline performance.

Full size table

3.2 Results and Discussion

Table 2 shows the results obtained on each sequence and for each prediction task. On Table 3, is reported the average over all sequences for each experiment. As it can be observed on Table 3, the F-measure of the temporal segmentation obtained relying on a representation that encode the temporal context, outperforms the baseline, obtained using the original feature vectors, for all prediction tasks. In particular, the representation learned through the forward-backward prediction achieves the best performance. Furthermore, the performance increases with the size of the temporal window achieving the maximum value for a window of size 3 and then decreases again. These results have shown that, although its simplicity, the proposed approach is very effective to learn event representations, and suggest that encoding the temporal context is crucial for event learning.

Key issues to be investigated, is what features are most suited as basis for the temporal embedding in videos, if they can be learned in an end-to-end fashion and which prediction task would be more effective in the video domain.

4 Conclusions

To the best of our knowledge, this work has presented the first attempt to learn image representations suitable for event segmentation. The proposed approach is inspired to recent experimental findings in neuroscience showing that neural representations of events arise from temporal community structures. To learn the temporal embedding, we proposed a pretext task consisting of predicting the feature vector of neighboring images in a temporal window of fixed size, by using two different approaches: a simple neural network and an encoder-decoder LSTM. Experimental results performed on a dataset of image sequences captured at regular intervals have shown that the new learned representation outperforms the original feature-based representation on the task of temporal segmentation. The generalization of the approach to temporal segmentation of video, would have an important impact in the processing of untrimmed videos.

References

Newtson, D., Engquist, G.A., Bois, J.: The objective basis of behavior units. J. Pers. Soc. Psychol. 35(12), 847 (1977)
Article Google Scholar
Zacks, J.M., Speer, N.K., Swallow, K.M., Braver, T.S., Reynolds, J.R.: Event perception: a mind-brain perspective. Psychol. Bull. 133(2), 273 (2007)
Article Google Scholar
Kurby, C.A., Zacks, J.M.: Segmentation in the perception and memory of events. Trends Cogn. Sci. 12(2), 72–79 (2008)
Article Google Scholar
Schapiro, A.C., Rogers, T.T., Cordova, N.I., Turk-Browne, N.B., Botvinick, M.M.: Neural representations of events arise from temporal community structure. Nature Neurosci. 16(4), 486 (2013)
Article Google Scholar
DuBrow, S., Davachi, L.: Temporal binding within and across events. Neurobiol. Learn. Memory 134, 107–114 (2016)
Article Google Scholar
Koprinska, I., Carrato, S.: Temporal video segmentation: a survey. Signal Process. Image Commun. 16(5), 477–500 (2001)
Article Google Scholar
Krishna, M.V., Bodesheim, P., Körner, M., Denzler, J.: Temporal video segmentation by event detection: a novelty detection approach. Pattern Recogn. Image Anal. 24(2), 243–255 (2014)
Article Google Scholar
Liwicki, S., Zafeiriou, S.P., Pantic, M.: Online kernel slow feature analysis for temporal video segmentation and tracking. IEEE Trans. Image Process. 24(10), 2955–2970 (2015)
Article MathSciNet Google Scholar
Theodoridis, T., Tefas, A., Pitas, I.: Multi-view semantic temporal video segmentation. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3947–3951. IEEE (2016)
Google Scholar
Iwan, L.H., Thom, J.A.: Temporal video segmentation: detecting the end-of-act in circus performance videos. Multimed. Tools Appl. 76(1), 1379–1401 (2017)
Article Google Scholar
Xu, Z., Yang, Y., Hauptmann, A.G.: A discriminative CNN video representation for event detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1798–1807 (2015)
Google Scholar
Chang, X., Yang, Y., Hauptmann, A.G., Xing, E.P., Yu, Y.L.: Semantic concept discovery for large-scale zero-shot event detection. In: International Joint Conference on Artificial Intelligence (IJCAI) (2015)
Google Scholar
Mikolov, T., Yih, W.t., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751 (2013)
Google Scholar
Salembier, P., Garrido, L.: Binary partition tree as an efficient representation for image processing, segmentation, and information retrieval. IEEE Trans. Image Process. 9(4), 561–576 (2000)
Article Google Scholar
Talavera, E., Dimiccoli, M., Bolaños, M., Aghaei, M., Radeva, P.: R-clustering for egocentric video segmentation. In: Paredes, R., Cardoso, J.S., Pardo, X.M. (eds.) IbPRIA 2015. LNCS, vol. 9117, pp. 327–336. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19390-8_37
Chapter Google Scholar
Dimiccoli, M., Bolaños, M., Talavera, E., Aghaei, M., Nikolov, S.G., Radeva, P.: SR-clustering: semantic regularized clustering for egocentric photo streams segmentation. Comput. Vis. Image Underst. 155, 55–69 (2017)
Article Google Scholar
Miller, G.A.: Wordnet: a lexical database for english. Commun. ACM 38(11), 39–41 (1995)
Article Google Scholar

Download references

Acknowledgments

This work was partially founded by TIN2015-66951-C2, SGR 1742, ICREA Academia 2014, Marató TV3 (20141510), Nestore Horizon2020 SC1-PM-15-2017 (769643) and CERCA. The funders had no role in the study design, data collection, analysis, and preparation of the manuscript. The authors gratefully acknowledge NVIDIA Corporation for the donation of the GPU used in this work.

Author information

Authors and Affiliations

Faculty of Engineering, University of Porto, Rua Doutor Roberto Frias, 4200-465, Porto, Portugal
Catarina Dias
Department of Mathematics and Computer Science, University of Barcelona, Gran via de les Corts Catalanes 585, 08007, Barcelona, Spain
Mariella Dimiccoli
Computer Vision Center, Campus UAB, Edifici O, 08193, Cerdanyola del Valles, Barcelona, Spain
Mariella Dimiccoli

Authors

Catarina Dias
View author publications
You can also search for this author in PubMed Google Scholar
Mariella Dimiccoli
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Mariella Dimiccoli .

Editor information

Editors and Affiliations

Technical University of Munich, Garching, Germany
Laura Leal-Taixé
Technische Universität Darmstadt, Darmstadt, Germany
Stefan Roth

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Dias, C., Dimiccoli, M. (2019). Learning Event Representations by Encoding the Temporal Context. In: Leal-Taixé, L., Roth, S. (eds) Computer Vision – ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science(), vol 11131. Springer, Cham. https://doi.org/10.1007/978-3-030-11015-4_44

Download citation

DOI: https://doi.org/10.1007/978-3-030-11015-4_44
Published: 23 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11014-7
Online ISBN: 978-3-030-11015-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Learning Event Representations by Encoding the Temporal Context

Abstract