Complex Activity Recognition Via Attribute Dynamics

  • Published in: International Journal of Computer Vision

Abstract

The problem of modeling the dynamic structure of human activities is considered. Video is mapped to a semantic feature space, which encodes activity attribute probabilities over time. The binary dynamic system (BDS) model is proposed to jointly learn the distribution and dynamics of activities in this space. This is a non-linear dynamic system that combines binary observation variables and a hidden Gauss–Markov state process, extending both binary principal component analysis and the classical linear dynamic systems. A BDS learning algorithm, inspired by the popular dynamic texture, and a dissimilarity measure between BDSs, which generalizes the Binet–Cauchy kernel, are introduced. To enable the recognition of highly non-stationary activities, the BDS is embedded in a bag of words. An algorithm is introduced for learning a BDS codebook, enabling the use of the BDS as a visual word for attribute dynamics (WAD). Short-term video segments are then quantized with a WAD codebook, allowing the representation of video as a bag-of-words for attribute dynamics. Video sequences are finally encoded as vectors of locally aggregated descriptors, which summarize the first moments of video snippets on the BDS manifold. Experiments show that this representation achieves state-of-the-art performance on the tasks of complex activity recognition and event identification.

Notes

  1. There is an ongoing debate on how deep architectures can capture long-term low-level motion information. While early models failed to achieve competitive performance (Ji et al. 2013; Karpathy et al. 2014), recent works (Simonyan and Zisserman 2014; Ng et al. 2015; Wang et al. 2015) show promising results, albeit still inferior to those of the best hand-crafted features (Peng et al. 2014a, b; Ni et al. 2015; Lan et al. 2015). It is worth noting that this issue is orthogonal to the contributions of this work, since the proposed method is built on a space of attribute responses which could be computed with a convolutional neural network (CNN).

  2. The optimization of the lengths \(\{\tau _i\}\) of the video segments \(\{\varvec{s}^{(i)}\}\) is left for further research. In this work, we simply considered segments of equal length, \(\tau _i = \tau , \forall i\), with \(\tau \) chosen from a finite set of segment lengths, selected so as to achieve good empirical performance on the datasets considered. The specific values of \(\tau \) used are discussed in the experimental section.

  3. Although the square root of the symmetric KL divergence is not a metric (since the triangle inequality does not hold), it has been shown to be effective for the design of probability distribution kernels in various applications (Moreno et al. 2004; Vasconcelos et al. 2004; Haasdonk 2005; Chan and Vasconcelos 2005).

  4. In practice, the Fisher information metric \({{\mathcal {I}}}_M\) is often omitted, since the Fisher kernel is a Euclidean metric in the range space of the invertible linear transformation, by \({{\mathcal {I}}}_M^{\frac{1}{2}}\), of the tangent space of the manifold at \(M\).

  5. For simplicity, we consider the precision matrices \(S^{-1}\) and \(Q^{-1}\) instead of the covariances \(S, Q\) in the computation of Fisher scores.

  6. Binary for STIP available at http://www.di.ens.fr/~laptev/download; source code for ITF available at http://lear.inrialpes.fr/~wang/download.

  7. Note that the version of Olympic Sports used in (Niebles et al. 2010) is different from that released publicly. DMS performance on the latter was reported in (Tang et al. 2012).

References

  • Afsari, B., Chaudhry, R., Ravichandran, A., & Vidal, R. (2012). Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes. In CVPR.

  • Aggarwal, J. K., & Ryoo, M. S. (2011). Human activity analysis: A review. ACM Computing Surveys, 43(16), 1–16.

  • Amari, S., & Nagaoka, H. (2000). Methods of information geometry. American Mathematical Society.

  • Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

  • Attias, H. (1999). A variational bayesian framework for graphical models. In NIPS.

  • Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., & Baskurt, A. (2010). Action classification in soccer videos with long short-term memory recurrent neural networks. In ICANN.

  • Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., & Baskurt, A. (2011). Sequential deep learning for human action recognition. In 2nd international workshop on human behavior understanding.

  • Bhattacharya, S. (2013). Recognition of complex events in open-source web-scale videos: A bottom up approach. In ACM International Conference on Multimedia.

  • Bhattacharya, S., Kalayeh, M. M., Sukthankar, R., & Shah, M. (2014). Recognition of complex events: Exploiting temporal dynamics between underlying concepts. In CVPR.

  • Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In ICML.

  • Bobick, A. F., & Davis, J. W. (2001). The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), 257–267.

  • Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  • Bregler, C. (1997). Learning and recognizing human dynamics in video sequences. In CVPR.

  • Campbell, L., & Bobick, A. (1995). Recognition of human body motion using phase space constraints. In ICCV.

  • Chan, A., & Vasconcelos, N. (2005). Probabilistic kernels for the classification of auto-regressive visual processes. In CVPR.

  • Chan, A., & Vasconcelos, N. (2008). Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Transaction on Pattern Analysis and Machine Intelligence, 30(5), 909–926.

  • Chan, A. B., & Vasconcelos, N. (2007). Classifying video with kernel dynamic textures. In CVPR.

  • Chang, C., & Lin, C. (2011). Libsvm: A library for support vector machines. ACM Transaction on Intelligent Systems and Technology, 2(3), 27.

  • Chaudhry, R., Ravichandran, A., Hager, G., & Vidal, R. (2009). Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR.

  • Chomat, O., & Crowley, J. L. (1999). Probabilistic recognition of activity using local appearance. In CVPR.

  • Cinbis, R. G., Verbeek, J., & Schmid, C. (2012). Image categorization using fisher kernels of non-iid image models. In CVPR.

  • Collins, M., Dasgupta, S., & Schapire, R. E. (2002). A generalization of principal component analysis to the exponential family. In NIPS.

  • Deng, L., & Yu, D. (2014). Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3–4), 197–387.

  • Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In CVPR.

  • Doretto, G., Chiuso, A., Wu, Y. N., & Soatto, S. (2003). Dynamic textures. International Journal of Computer Vision, 51(2), 91–109.

  • Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

  • Fathi, A., & Mori, G. (2008). Action recognition by learning mid-level motion features. In CVPR.

  • Gaidon, A., Harchaoui, Z., & Schmid, C. (2011). Actom sequence models for efficient action detection. In CVPR.

  • Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Temporal localization of actions with actoms. IEEE Transaction on Pattern Analysis and Machine Intelligence, 35(11), 2782–2795.

  • Ghahramani, Z., & Beal, M. J. (2000). Propagation algorithms for variational bayesian learning. In NIPS.

  • Gorelick, L., Blank, M., Shechtman, E., Irani, M., & Basri, R. (2007). Actions as space-time shapes. IEEE Transaction on Pattern Analysis and Machine Intelligence, 29(12), 2247–2253.

  • Graves, A., & Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS.

  • Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In ICASSP.

  • Haasdonk, B. (2005). Feature space interpretation of svms with indefinite kernels. IEEE Transaction on Pattern Analysis and Machine Intelligence, 27(4), 482–492.

  • Hajimirsadeghi, H., Yan, W., Vahdat, A., & Mori, G. (2015). Visual recognition by counting instances: A multi-instance cardinality potential kernel. In CVPR.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Ikizler, N., & Forsyth, D. A. (2008). Searching for complex human activities with no visual examples. International Journal of Computer Vision, 80(3), 337–357.

  • Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In NIPS.

  • Jain, A., Gupta, A., Rodriguez, M., & Davis, L. S. (2013a). Representing videos using mid-level discriminative patches. In CVPR.

  • Jain, M., Jegou, H., & Bouthemy, P. (2013b). Better exploiting motion for better action recognition. In CVPR.

  • Jain, M., van Gemert, J. C., & Snoek, C. G. M. (2015). What do 15,000 object categories tell us about classifying and localizing actions? In CVPR.

  • Jegou, H., Perronnin, F., Douze, M., Sanchez, J., Perez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transaction on Pattern Analysis and Machine Intelligence, 34(9), 1704–1718.

  • Jhuang, H., Gall, J., Zuffi, S., Schmid, C., & Black, M. J. (2013). Towards understanding action recognition. In ICCV.

  • Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transaction on Pattern Analysis and Machine Intelligence, 35(1), 221–231.

  • Jiang, Y. G., Dai, Q., Xue, X., Liu, W., & Ngo, C. W. (2012). Trajectory-based modeling of human actions with motion reference points. In ECCV.

  • Jones, S., & Shao, L. (2014). A multigraph representation for improved unsupervised/semi-supervised learning of human actions. In CVPR.

  • Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

  • Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR.

  • Kellokumpu, V., Zhao, G., & Pietikäinen, M. (2008). Human activity recognition using a dynamic texture based method. In BMVC.

  • Kovashka, A., & Grauman, K. (2010). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR.

  • Krapac, J., Verbeek, J., & Jurie, F. (2011). Modeling spatial layout with fisher vectors for image categorization. In ICCV.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). Hmdb: A large video database for human motion recognition. In ICCV.

  • Kullback, S. (1997). Information theory and statistics. Mineola: Courier Dover Publications.

  • Lai, K., Liu, D., Chen, M., & Chang, S. (2014). Recognizing complex events in videos by learning key static-dynamic evidences. In ECCV.

  • Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In CVPR.

  • Lan, Z., Li, X., & Hauptmann, A. G. (2014). Temporal extension of scale pyramid and spatial pyramid matching for action recognition. http://arxiv.org/abs/1408.7071.

  • Lan, Z., Lin, M., Li, X., Hauptmann, A. G., & Raj, B. (2015). Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. In CVPR.

  • Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–3), 107–123.

  • Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In CVPR.

  • Laxton, B., Lim, J., & Kriegman, D. (2007). Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video. In CVPR.

  • Li, B., Ayazoglu, M., Mao, T., Camps, O., & Sznaier, M. (2011). Activity recognition using dynamic subspace angles. In CVPR.

  • Li, W., & Vasconcelos, N. (2012). Recognizing activities by attribute dynamics. In NIPS.

  • Li, W., Yu, Q., Divakaran, A., & Vasconcelos, N. (2013a). Dynamic pooling for complex event recognition. In ICCV.

  • Li, W., Yu, Q., Sawhney, H., & Vasconcelos, N. (2013b). Recognizing activities via bag of words for attribute dynamics. In CVPR.

  • Liu, J., Kuipers, B., & Savarese, S. (2011). Recognizing human actions by attributes. In CVPR.

  • Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. In ECCV, pp. 508–521. doi:10.1007/978-3-642-15549-9_37.

  • Moore, D. J., Essa, I. A., & Hayes, M. H., III (1999). Exploiting human actions and object context for recognition tasks. In ICCV.

  • Moreno, P. J., Ho, P. P., & Vasconcelos, N. (2004). A kullback-leibler divergence based kernel for svm classification in multimedia applications. In NIPS.

  • Ng, J. Y. H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In CVPR.

  • Ni, B., Moulin, P., Yang, X., & Yan, S. (2015). Motion part regularization: Improving action recognition via trajectory selection. In CVPR.

  • Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In ECCV.

  • Niyogi, S., & Adelson, E. (1994). Analyzing and recognizing walking figures in xyt. In CVPR.

  • Over, P., Awad, G., Fiscus, J., Antonishek, B., Michel, M., Smeaton, A. F., Kraaij, W., & Quenot, G. (2011). Trecvid 2011—an overview of the goals, tasks, data, evaluation mechanisms, and metrics. Proceedings of TRECVID 2011.

  • Palatucci, M., Pomerleau, D., Hinton, G., & Mitchell, T. (2009). Zero-shot learning with semantic output codes. In NIPS.

  • Peng, X., Wang, L., Wang, X., & Qiao, Y. (2014a). Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. http://arxiv.org/abs/1405.4506.

  • Peng, X., Zou, C., Qiao, Y., & Peng, Q. (2014b). Action recognition with stacked fisher vectors. In ECCV.

  • Perronnin, F., Sanchez, J., & Mensink, T. (2010). Improving the Fisher Kernel for large-scale image classification. In ECCV.

  • Pinhanez, C., & Bobick, A. (1998). Human action detection using pnf propagation of temporal constraints. In CVPR.

  • Quattoni, A., Collins, M., & Darrell, T. (2007). Learning visual representations using images with captions. In CVPR.

  • Rasiwasia, N., Moreno, P. J., & Vasconcelos, N. (2007). Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 9(5), 923–938.

  • Rasiwasia, N., & Vasconcelos, N. (2008). Scene classification with low-dimensional semantic spaces and weak supervision. In CVPR.

  • Rasiwasia, N., & Vasconcelos, N. (2009). Holistic context modeling using semantic co-occurrences. In CVPR.

  • Rasiwasia, N., & Vasconcelos, N. (2012). Holistic context models for visual recognition. IEEE Transaction on Pattern Analysis and Machine Intelligence, 34(5), 902–917.

  • Ravichandran, A., Chaudhry, R., & Vidal, R. (2012). Categorizing dynamic textures using a bag of dynamical systems. IEEE Transaction on Pattern Analysis and Machine Intelligence, 35(2), 342–353.

  • Rodriguez, M., Ahmed, J., & Shah, M. (2008). Action mach: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR.

  • Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11(2), 305–345.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. http://arxiv.org/abs/1409.0575.

  • Saul, L. K., & Jordan, M. I. (2000). Attractor dynamics in feedforward neural networks. Neural Computation, 12, 1313–1335.

  • Schein, A. L., Saul, L. K., & Ungar, L. H. (2003). A generalized linear model for principal component analysis of binary data. In AISTATS.

  • Schölkopf, B., Smola, A., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.

  • Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local svm approach. In ICPR.

  • Shao, L., Zhen, X., Tao, D., & Li, X. (2014). Spatio-temporal Laplacian pyramid coding for action recognition. IEEE Transaction on Cybernetics, 44(6), 817–827.

  • Shao, L., Liu, L., & Yu, M. (2015). Kernelized multiview projection for robust action recognition. International Journal of Computer Vision.

  • Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the em algorithm. Journal of Time Series Analysis, 3(4), 253–264.

  • Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NIPS.

  • Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep Fisher networks for large-scale image classification. In NIPS.

  • Snoek, C. G. M., Worring, M., & Smeulders, A. W. M. (2005). Early versus late fusion in semantic video analysis. In ACM International Conference on Multimedia.

  • Sun, C., & Nevatia, R. (2013). Active: Activity concept transitions in video event classification. In ICCV.

  • Sun, C., & Nevatia, R. (2014). Discover: Discovering important segments for classification of video events and recounting. In CVPR.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In CVPR.

  • Tamrakar, A., Ali, S., Yu, Q., Liu, J., Javed, O., Divakaran, A., Cheng, H., & Sawhney, H. (2012). Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR.

  • Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In CVPR.

  • Todorovic, S. (2012). Human activities as stochastic kronecker graphs. In ECCV.

  • Vahdat, A., Cannons, K., Mori, G., Oh, S., & Kim, I. (2013). Compositional models for video event detection: A multiple kernel learning latent variable approach. In ICCV.

  • Vasconcelos, N., Ho, P., & Moreno, P. (2004). The kullback-leibler kernel as a framework for discriminant and localized representations for visual recognition. In ECCV.

  • Vedaldi, A., & Zisserman, A. (2012). Efficient additive kernels via explicit feature maps. IEEE Transaction on Pattern Analysis and Machine Intelligence, 34(3), 480–492.

  • Vishwanathan, S. V. N., Smola, A. J., & Vidal, R. (2006). Binet-cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1), 95–119.

  • Vrigkas, M., Nikou, C., & Kakadiaris, I. (2015). A review of human activity recognition methods. Frontiers in Robotics and AI, 2, 1–28.

  • Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In ICCV.

  • Wang, H., Ullah, M., Kläser, A., Laptev, I., & Schmid, C. (2009). Evaluation of local spatio-temporal features for action recognition. In BMVC.

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.

  • Wang, L., Qiao, Y., & Tang, X. (2015). Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR.

  • Wang, X., & McCallum, A. (2006). Topics over time: A non-markov continuous-time model of topical trends. In ACM SIGKDD.

  • Winn, J., & Bishop, C. M. (2005). Variational message passing. Journal of Machine Learning Research, 6, 661–694.

  • Xu, Z., Tsang, I., Yang, Y., Ma, Z., & Hauptmann, A. (2014). Event detection using multi-level relevance labels and multiple features. In CVPR.

  • Yacoob, Y., & Black, M. J. (1998). Parameterized modeling and recognition of activities. In ICCV.

  • Ye, G., Liu, D., Jhuo, I. H., & Chang, S. F. (2012). Robust late fusion with rank minimization. In CVPR.

  • Yu, M., Liu, L., & Shao, L. (2015). Structure-preserving binary representations for RGB-D action recognition. IEEE Transaction on Pattern Analysis and Machine Intelligence.

  • Yu, Q., Liu, J., Cheng, H., Divakaran, A., & Sawhney, H. (2012). Multimedia event recounting with concept based representation. In ACM International Conference on Multimedia.

Acknowledgments

The authors acknowledge valuable discussions on the manuscript and references to variational inference by professor Lawrence K. Saul. This work was partially supported by National Science Foundation Grant IIS-1208522.

Author information

Corresponding author

Correspondence to Wei-Xin Li.

Additional information

Communicated by Hiroshi Ishikawa, Takeshi Masuda, Yasuyo Kita and Katsushi Ikeuchi.

Appendices

Appendix 1: Convergence of Bag-of-Models Clustering

The bag-of-models clustering procedure of Algorithm 2 is a general framework for clustering examples in a Riemannian manifold \({{\mathcal {M}}}\) of statistical models. The goal is to find a preset number of models \(\{M_j\}_{j=1}^{K}\subset {{\mathcal {M}}}\) in the manifold that best explain a corpus \({{\mathcal {D}}}=\{\varvec{z}_i\}_{i=1}^N\) (\(\varvec{z}_i\in {{\mathcal {Z}}}, \forall i\)). It is assumed that all models \(M\) are parametrized by a set of parameters \({\varvec{\theta }}\) and have smooth likelihood functions (derivatives of all orders exist and are bounded), and that Algorithm 2 satisfies the following conditions.

Condition 1: the operation \(f_{{{\mathcal {M}}}}\) of (20) consists of estimating the parameters \({\varvec{\theta }}\) of \({{\mathcal {M}}}\) by the maximum likelihood estimation (MLE) principle.

Condition 2: the Riemannian metric of the manifold \({{\mathcal {M}}}\) defined by the Fisher information \({{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}}\) (Jaakkola and Haussler 1999; Amari and Nagaoka 2000) is used as the dissimilarity measure of (21). More precisely, the metric of \({{\mathcal {M}}}\) in the neighborhood of model \(M_{\varvec{z}}\) is

$$\begin{aligned} d_{{{\mathcal {M}}}}(M^*,~M_{\varvec{z}}) = ||{\varvec{\theta }}^* - {\varvec{\theta }}_{\varvec{z}}||^2_{{{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}}}, \end{aligned}$$
(80)

where \(||{\varvec{\theta }}_1-{\varvec{\theta }}_2||^2_{\mathcal {I}}= ({\varvec{\theta }}_1-{\varvec{\theta }}_2)^{\intercal }{\mathcal {I}}({\varvec{\theta }}_1-{\varvec{\theta }}_2)\), and the Fisher information \({{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}}\) is defined as (Amari 1998)

$$\begin{aligned} {{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}} = -{\mathbb {E}}_{{\varvec{x}}\sim p({\varvec{x}}; {\varvec{\theta }}_{\varvec{z}})}\left[ \nabla ^2_{{\varvec{\theta }}}\ln p({\varvec{x}}; {\varvec{\theta }})|_{{\varvec{\theta }}={\varvec{\theta }}_{\varvec{z}}} \right] . \end{aligned}$$
(81)
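
As an illustration of (81), the following minimal sketch evaluates the Fisher information of a scalar Bernoulli model as the negative expected Hessian of the log-likelihood, and compares it with the known closed form \(1/(\theta (1-\theta ))\). The Bernoulli model, the function names, and the finite-difference step `eps` are assumptions made for this example only; they are not part of the original derivation.

```python
import numpy as np

def bernoulli_log_lik(x, theta):
    """ln p(x; theta) for a Bernoulli variable x in {0, 1}."""
    return x * np.log(theta) + (1 - x) * np.log(1 - theta)

def fisher_information(theta, eps=1e-4):
    """Negative expected Hessian of ln p(x; theta), as in (81).

    The Hessian is approximated by a central finite difference, and the
    expectation over x ~ p(x; theta) is an exact sum over x in {0, 1}.
    """
    info = 0.0
    for x, prob in ((0, 1 - theta), (1, theta)):
        hess = (bernoulli_log_lik(x, theta + eps)
                - 2 * bernoulli_log_lik(x, theta)
                + bernoulli_log_lik(x, theta - eps)) / eps**2
        info -= prob * hess
    return info

theta = 0.3
numeric = fisher_information(theta)
closed_form = 1.0 / (theta * (1 - theta))  # known Bernoulli Fisher information
```

The two values should agree up to the finite-difference error, which is why the sketch stores both rather than printing a single number.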

Given the similarity between Algorithm 2 and k-means, the convergence of the former can be studied with the techniques commonly used to show that the latter converges. This requires the definition of a suitable objective function to quantify the quality of the fit of the set \(\{M_j\}_{j=1}^{K}\) to the corpus \({{\mathcal {D}}}\). We rely on the objective

$$\begin{aligned} \zeta (\{M_j\}_{j=1}^{K}, \{S_j\}_{j=1}^K) = \sum \limits _{j}\sum \limits _{\varvec{z}\in S_j} \ln p_{M_j}(\varvec{z}), \end{aligned}$$
(82)

where \(p_{M}(\cdot )\) is the likelihood function of model \(M\), and \(S_j\) is the subset of \({{\mathcal {D}}}\) containing all examples assigned to the \(j\)-th model. Note that this implies that \(\forall i\ne j, S_i\bigcap S_j=\varnothing \) and \(\bigcup \limits _jS_j={{\mathcal {D}}}\). From the assumption of smooth models \(M\) (i.e., \(\forall \varvec{z}\in {{\mathcal {Z}}}, M\in {{\mathcal {M}}}\), \(p_M(\varvec{z})<\infty \)) and the fact that there is only a finite set of assignments \(\{S_j\}_{j=1}^K\), the objective function of (82) is upper bounded. Since the refinement step of Algorithm 2 updates the models so that

$$\begin{aligned} M^{(t+1)}_j = f_{{{\mathcal {M}}}}(S^{(t+1)}_j) = \mathop {\mathrm {argmax}}\limits _{M\in {{\mathcal {M}}}} \sum \limits _{\varvec{z}\in S^{(t+1)}_j}\ln p_M(\varvec{z}), \end{aligned}$$

the objective either increases or remains constant after each refinement step. It remains to prove that the same holds for each assignment step. If that is the case, Algorithm 2 produces a monotonically non-decreasing and upper-bounded sequence of objective function values. By the monotone convergence theorem, this sequence converges and, since the set of possible assignments is finite, the algorithm converges in a finite number of steps. Note that, as in k-means, there is no guarantee of convergence to the global optimum.

It thus remains to prove that the objective of (82) increases with each assignment step. The Riemannian structure of the manifold \({{\mathcal {M}}}\) makes this proof more technical than the corresponding one for k-means. In what follows, we provide a sketch of the proof. Let \(M^*\) be the model (of parameters \({\varvec{\theta }}^*\)) to which example \(\varvec{z}\) is assigned by the assignment step of Algorithm 2, i.e.,

$$\begin{aligned} M^* = \mathop {\mathrm {argmin}}\limits _{M\in \{M^{(t)}_j\}_{j=1}^K}d_{{{\mathcal {M}}}} (M_{\varvec{z}},M) \end{aligned}$$
(83)

and \(M^\circ \) (of parameters \({\varvec{\theta }}^\circ \)) the model to which \(\varvec{z}\) was assigned in the previous iteration. It follows from Condition 2 that

$$\begin{aligned}&d_{{{\mathcal {M}}}}(M^*,~M_{\varvec{z}}) = ||{\varvec{\theta }}^* - {\varvec{\theta }}_{\varvec{z}}||^2_{{{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}}} \nonumber \\&\quad \leqslant d_{{{\mathcal {M}}}}(M^\circ ,~M_{\varvec{z}}) = ||{\varvec{\theta }}^\circ - {\varvec{\theta }}_{\varvec{z}}||^2_{{{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}}}. \end{aligned}$$
(84)

Note that \(M_{\varvec{z}}\) is the model \(p(\varvec{z}; {\varvec{\theta }}_{\varvec{z}})\) onto which \(\varvec{z}\) is mapped by (20). From Condition 1, \({\varvec{\theta }}_{\varvec{z}}=\mathop {\mathrm {argmax}}\nolimits _{{\varvec{\theta }}} p(\varvec{z}; {\varvec{\theta }})\) and, using a Taylor series expansion,

$$\begin{aligned} \ln p(\varvec{z}; {\varvec{\theta }})&\approx \ln p(\varvec{z}; {\varvec{\theta }}_{\varvec{z}}) + \left<\nabla _{\varvec{\theta }}\ln p(\varvec{z}; {\varvec{\theta }})|_{{\varvec{\theta }}={\varvec{\theta }}_{\varvec{z}}}, {\varvec{\theta }}- {\varvec{\theta }}_{\varvec{z}}\right> \nonumber \\& + \frac{1}{2}||{\varvec{\theta }}- {\varvec{\theta }}_{\varvec{z}}||^2_{H_{{\varvec{\theta }}_{\varvec{z}}}} \end{aligned}$$
(85)
$$\begin{aligned}&= \ln p(\varvec{z}; {\varvec{\theta }}_{\varvec{z}}) + \frac{1}{2}||{\varvec{\theta }}- {\varvec{\theta }}_{\varvec{z}}||^2_{H_{{\varvec{\theta }}_{\varvec{z}}}} , \end{aligned}$$
(86)

where \(H_{{\varvec{\theta }}_{\varvec{z}}}=\nabla ^2_{\varvec{\theta }}\ln p(\varvec{z}; {\varvec{\theta }})|_{{\varvec{\theta }}={\varvec{\theta }}_{\varvec{z}}}\) is the Hessian of \(\ln p(\varvec{z}; {\varvec{\theta }})\) at \({\varvec{\theta }}_{\varvec{z}}\), and the first-order term of (85) vanishes because \({\varvec{\theta }}_{\varvec{z}}\) is a maximizer of the likelihood, so that the gradient is zero at that point. Since \(p(\varvec{z}; {\varvec{\theta }}_{\varvec{z}})\) is the model obtained from a single example \(\varvec{z}\), it is a sharply peaked distribution centered at \(\varvec{z}\). Hence, the expectation of (81) can be approximated by

$$\begin{aligned} {{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}} \approx -H_{{\varvec{\theta }}_{\varvec{z}}}. \end{aligned}$$
(87)

Combining (84), (86), and (87) then results in

$$\begin{aligned} \ln p(\varvec{z}; {\varvec{\theta }}^*)&\approx \ln p(\varvec{z}; {\varvec{\theta }}_{\varvec{z}}) + \frac{1}{2}||{\varvec{\theta }}^* - {\varvec{\theta }}_{\varvec{z}}||^2_{H_{{\varvec{\theta }}_{\varvec{z}}}} \\&\approx \ln p(\varvec{z}; {\varvec{\theta }}_{\varvec{z}}) - \frac{1}{2}||{\varvec{\theta }}^* - {\varvec{\theta }}_{\varvec{z}}||^2_{{{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}}}\\&\geqslant \ln p(\varvec{z}; {\varvec{\theta }}_{\varvec{z}}) - \frac{1}{2}||{\varvec{\theta }}^\circ - {\varvec{\theta }}_{\varvec{z}}||^2_{{{\mathcal {I}}}_{{\varvec{\theta }}_{\varvec{z}}}} \approx \ln p(\varvec{z}; {\varvec{\theta }}^\circ ). \end{aligned}$$

It follows that the objective of (82) increases after each assignment step. This is intuitive in the sense that the closer a model \(M\) is to an example’s representative model, the better \(M\) can explain that example.
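
The monotonicity argument above can be checked numerically in a simple special case. The sketch below assumes that every model is a Gaussian with identity covariance and unknown mean; this model family is chosen only for illustration and is not the BDS of the paper. Under this assumption the Fisher information is the identity, the dissimilarity of (80) is the squared Euclidean distance, the MLE refinement is the cluster mean, and the procedure reduces to k-means. The function name `bag_of_models_kmeans` and the synthetic data are hypothetical.

```python
import numpy as np

def bag_of_models_kmeans(Z, K, n_iter=20, seed=0):
    """Alternate assignment and refinement for unit-covariance Gaussian models.

    Returns the model means, the final assignments, and the value of the
    objective of (82) (up to an additive constant) after every step.
    """
    rng = np.random.default_rng(seed)
    means = Z[rng.choice(len(Z), size=K, replace=False)].copy()
    labels = None
    history = []

    def objective(means, labels):
        # Sum of Gaussian log-likelihoods, dropping the normalization constant.
        return -0.5 * np.sum((Z - means[labels]) ** 2)

    for _ in range(n_iter):
        # Assignment step: nearest model under the (identity) Fisher metric.
        dists = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        history.append(objective(means, labels))
        # Refinement step: maximum likelihood estimate of each model.
        for j in range(K):
            if np.any(labels == j):
                means[j] = Z[labels == j].mean(axis=0)
        history.append(objective(means, labels))
    return means, labels, history

# Two synthetic clusters, used only to exercise the procedure.
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(-3.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
means, labels, history = bag_of_models_kmeans(Z, K=2)
```

In this special case the quadratic approximation of (86) is exact, so `history` should never decrease from one entry to the next, whether the step is an assignment or a refinement.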

Appendix 2: Optimization

In this appendix, we derive (72) by considering the optimization problem

$$\begin{aligned} X^* = \mathop {\mathrm {argmax}}\limits _{X\in {{\mathcal {S}}}_{++}}&~b\ln \left| X\right| -\mathrm {tr}(AX), \\ s.t.&~A\in {{\mathcal {S}}}_{++},~b>0. \nonumber \end{aligned}$$
(88)

Since 1) both \(b\ln \left| X\right| \) and \(-\mathrm {tr}(AX)\) are smooth and concave functions in X (Boyd and Vandenberghe 2004), and 2) the domain \({{\mathcal {S}}}_{++}\) is an open convex set, the supremum of (88) is achieved at either 1) its stationary point(s) (if any), or 2) the boundary of its domain. The derivative of the objective function of (88) is

$$\begin{aligned} \frac{\partial }{\partial X}&\big \{b\ln \left| X\right| - \mathrm {tr}(AX)\big \} = b({X}^{-1})^{\intercal } - A. \end{aligned}$$
(89)

Setting (89) to zero leads to

$$\begin{aligned} {X}^{*} = bA^{-1} \in {{\mathcal {S}}}_{++}. \end{aligned}$$
(90)

Applying this result to (71), with \(b=1\), \(X={\varSigma }\), and \(A=W\), leads to (72).
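
The closed form of (90) can be sanity-checked numerically. The sketch below draws a random positive-definite \(A\), forms \(X^* = bA^{-1}\), verifies that the gradient of (89) vanishes there, and compares the objective of (88) at \(X^*\) with its value at other randomly drawn positive-definite matrices. The helper names, the matrix size, and the value of \(b\) are arbitrary choices made for this example.

```python
import numpy as np

def objective(X, A, b):
    """b * ln|X| - tr(A X) for a symmetric positive-definite X."""
    _, logdet = np.linalg.slogdet(X)
    return b * logdet - np.trace(A @ X)

def random_spd(n, rng):
    """Draw a random symmetric positive-definite matrix."""
    G = rng.standard_normal((n, n))
    return G @ G.T + n * np.eye(n)

rng = np.random.default_rng(0)
n, b = 4, 2.5
A = random_spd(n, rng)
X_star = b * np.linalg.inv(A)            # closed form of (90)

# The gradient of (89) should vanish at X_star.
grad = b * np.linalg.inv(X_star).T - A

# No other positive-definite candidate should achieve a larger objective.
best = objective(X_star, A, b)
others = [objective(random_spd(n, rng), A, b) for _ in range(1000)]
```

Because the objective is concave on \({{\mathcal {S}}}_{++}\), the stationary point is a global maximizer, so `best` should upper-bound every entry of `others`.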

Appendix 3: Variational Inference for BDS

The key computation of the variational inference procedure of Sect. 6.2.3 is to determine

$$\begin{aligned}&\varvec{m}_t =\left\langle \varvec{x}_t\right\rangle _{q}, \\&{\varSigma }_{t,t} = \left\langle (\varvec{x}_t-\varvec{m}_t)(\varvec{x}_t-\varvec{m}_t)^{\intercal }\right\rangle _{q}, \\&{\varSigma }_{t,t+1} = \left\langle (\varvec{x}_t-\varvec{m}_t)(\varvec{x}_{t+1}-\varvec{m}_{t+1})^{\intercal }\right\rangle _{q}. \end{aligned}$$

In this appendix, we derive an efficient method for this computation, which draws on the solution of the identical variational inference problem for the LDS of (10). We start by discussing the LDS case.

1.1 (a) Inference for Linear Dynamic Systems

Consider the LDS of (10) with parameters \({\varvec{\theta }}_{LDS}=\{S, {\varvec{\mu }}, A, C, Q, {R} , \varvec{u}\}\), an observation sequence \(\{{\varvec{y}}_1^\tau \}\) (\({\varvec{y}}_t\in {\mathbb {R}}^K\)), and the variational distribution \(q(\varvec{x})\) of (57). Similarly to the derivation of Sect. 6.2.3, the variational lower bound of (41) for the log-likelihood of the LDS can be shown to be

$$\begin{aligned} {\mathscr {L}}({\varvec{\theta }}, {\varvec{y}}, q)= & {} \left\langle \ln p(\varvec{x}_1)\right\rangle _{q} + \sum _{t=1}^{\tau -1} \left\langle \ln p(\varvec{x}_{t+1}|\varvec{x}_t)\right\rangle _{q} \nonumber \\&+ \sum _{t=1}^{\tau } \left\langle \ln p({\varvec{y}}_t|\varvec{x}_t)\right\rangle _{q} + H_{q}(X), \end{aligned}$$
(91)

with \(\left\langle \ln p(\varvec{x}_1)\right\rangle _{q}, \left\langle \ln p(\varvec{x}_{t+1}|\varvec{x}_t)\right\rangle _{q},\) and \(H_{q}(X)\) as in (63)-(65). Furthermore, defining \({{\tilde{{\varvec{y}}}}}_t ={\varvec{y}}_t - \varvec{u}\),

$$\begin{aligned} \left\langle \ln p({\varvec{y}}_t|\varvec{x}_t)\right\rangle _{q}= & {} \left\langle \ln {\mathcal {G}}({{\tilde{{\varvec{y}}}}}_t;C\varvec{x}_t,R)\right\rangle _{q(\varvec{x}_t)} \nonumber \\= & {} \left\langle \ln {\mathcal {G}}(C\varvec{x}_t;{{\tilde{{\varvec{y}}}}}_t,R)\right\rangle _{{\mathcal {G}}(\varvec{x}_t; \varvec{m}_t, {\varSigma }_{t,t})} \nonumber \\= & {} \left\langle \ln {\mathcal {G}}(\varvec{x}_t;{{\tilde{{\varvec{y}}}}}_t,R)\right\rangle _{{\mathcal {G}}(\varvec{x}_t; C\varvec{m}_t, C{\varSigma }_{t,t}C^{\intercal })} \end{aligned}$$
(92)

and, from (8),

$$\begin{aligned} \left\langle \ln p({\varvec{y}}_t|\varvec{x}_t)\right\rangle _{q}\propto & {} -\frac{1}{2}\Big [||{{\tilde{{\varvec{y}}}}}_t-C\varvec{m}_t||^2_{ {R} } + \mathrm {tr}( {R} ^{-1}C{\varSigma }_{t,t}C^{\intercal }) \Big ]. \end{aligned}$$

It follows that

$$\begin{aligned} {\mathscr {L}}({\varvec{\theta }}, {\varvec{y}}, q)\propto & {} -\frac{1}{2}\bigg \{ ||{\varvec{\mu }}-\varvec{m}_1||^2_{S} +\mathrm {tr}(S^{-1}{\varSigma }_{1,1}) \nonumber \\+ & {} \sum _{t=1}^{\tau -1} \mathrm {tr}({\varGamma }^{-1}{\varPhi }_t) + \sum _{t=1}^{\tau } \mathrm {tr}( {R} ^{-1}C{\varSigma }_{t,t}C^{\intercal }) \nonumber \\+ & {} \sum _{t=1}^{\tau } ||{{\tilde{{\varvec{y}}}}}_t-C\varvec{m}_t||^2_{ {R} } \bigg \} + \frac{1}{2}\ln \left| {\varSigma }\right| , \end{aligned}$$
(93)

where \({\varGamma }\) and \({\varPhi }_t\) are defined in Sect. 6.2.3.

As was the case with (69), the optimization of (93) with respect to the variational distribution \(q\) can be factorized into two optimization problems

$$\begin{aligned} \{\varvec{m}^*, {\varSigma }^*\} =&\mathop {\mathrm {argmax}}\limits _{\{\varvec{m}, {\varSigma }\}\in {\mathbb {R}}^{L\tau }\times {{\mathcal {S}}}^{L\tau }_{++}} {\mathscr {L}}({\varvec{\theta }}, {\varvec{y}}, q) \\ =&~\bigg \{ \mathop {\mathrm {argmax}}\limits _{\varvec{m}\in {\mathbb {R}}^{L\tau }} {\mathscr {L}}({\varvec{\theta }}, {\varvec{y}}, q), \mathop {\mathrm {argmax}}\limits _{{\varSigma }\in {{\mathcal {S}}}^{L\tau }_{++}} {\mathscr {L}}({\varvec{\theta }}, {\varvec{y}}, q) \bigg \}. \end{aligned}$$

In fact, the dependence of (93) on \({\varSigma }\) is identical to that of (69), up to the replacement of R by 4I. Hence, the optimal \({\varSigma }\) is still the solution of (71), i.e.,

$$\begin{aligned} {\varSigma }^*&= W^{-1}, \end{aligned}$$
(94)

but with a matrix \(W\in {{\mathcal {S}}}_{++}\) which is slightly different from (71), namely

$$\begin{aligned} W_{i,j} = {\left\{ \begin{array}{ll} A^{\intercal }Q^{-1}A+ S^{-1}+C^{\intercal } {R} ^{-1}C, &{} i=j=1, \\ A^{\intercal }Q^{-1}A+ Q^{-1}+C^{\intercal } {R} ^{-1}C,&{}1<i=j<\tau , \\ Q^{-1}+C^{\intercal } {R} ^{-1}C, &{} i=j=\tau , \\ -Q^{-1}A, &{} i=j+1, \\ -A^{\intercal }Q^{-1}, &{} i=j-1, \\ {0}, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(95)

where \(W_{i,j}\in {\mathbb {R}}^{L\times L}\) is the block in row-i, column-j of \(W\).

In summary, the algorithm for learning the covariance of the variational distribution of the BDS is identical to the learning algorithm for the covariance of the variational distribution of an LDS with \(R = 4I\). Furthermore, since all random variables \(\varvec{x}\) and \({\varvec{y}}\) (as well as all marginal or conditional distributions) of the LDS are Gaussian, the variational inference is exact in this case and

$$\begin{aligned} q^*(\varvec{x})=p(\varvec{x}|{\varvec{y}}; {\varvec{\theta }}_{LDS}). \end{aligned}$$

It thus follows that the standard algorithms for exact inference of \(p(\varvec{x}|{\varvec{y}}; {\varvec{\theta }}_{LDS})\) with the LDS can be used to compute the covariance \({\varSigma }^*\) of the variational distribution of the BDS. In the following section, we briefly review the Kalman smoothing filter, which is the most popular such algorithm.

The situation is, however, different for the mean of the variational distribution. In this case, the LDS continues to have a simple closed-form solution, namely

$$\begin{aligned} \varvec{m}^*&= W^{-1}\varvec{\nu }, \end{aligned}$$
(96)

where \(W\) is as in (95) and

$$\begin{aligned} \varvec{\nu }= \begin{bmatrix} \varvec{\nu }_1\\ \vdots \\ \varvec{\nu }_\tau \end{bmatrix},~~ \varvec{\nu }_t = {\left\{ \begin{array}{ll} S^{-1}{\varvec{\mu }}+ C^{\intercal } {R} ^{-1}{{\tilde{{\varvec{y}}}}}_1, &{} t=1, \\ C^{\intercal } {R} ^{-1}{{\tilde{{\varvec{y}}}}}_t, &{} 1<t\leqslant \tau . \end{array}\right. } \end{aligned}$$
(97)

However, because the dependence of (93) on \(\varvec{m}\) is no longer identical to that of (69), the LDS solution is not informative for learning the BDS. A different procedure is thus required to learn the variational mean of the BDS. This is discussed in part (c) of this appendix.

1.2 (b) Inferring the Variational Covariance \({\varSigma }\) of the BDS

Note that \(\varvec{m}^*\) and \({\varSigma }^*\) have size linear and quadratic, respectively, in \(\tau \), the length of the sequence \(\{{\varvec{y}}_1^\tau \}\). This makes the direct solution of (96) and (94) expensive for long sequences, with complexity \(O(L^\rho \tau ^\rho )\), \(\rho \approx 2.4\). Furthermore, this solution is unnecessary, since inference with both the LDS and BDS only requires \({\varSigma }_{t,t}^*\) and \({\varSigma }_{t,t+1}^*\). A popular efficient alternative is the Kalman smoothing filter (Shumway and Stoffer 1982; Roweis and Ghahramani 1999), which is commonly used to estimate the posteriors \(p(\varvec{x}_t|{\varvec{y}}_{1}^{\tau })\) and \(p(\varvec{x}_t, \varvec{x}_{t+1}|{\varvec{y}}_{1}^{\tau })\) of the LDS, i.e., \(\varvec{m}^*\) of (96), \({\varSigma }^*_{t,t}\) and \({\varSigma }^*_{t,t+1}\) of (94).

Defining expectations conditioned on the observed sequence from time \(t=1\) to \(t=r\) as

$$\begin{aligned} \hat{\varvec{x}}_t^{r}= & {} \left\langle \varvec{x}_t\right\rangle _{p(\varvec{x}_t|{\varvec{y}}_1,\ldots ,{\varvec{y}}_r)}, \end{aligned}$$
(98)
$$\begin{aligned} \hat{V}_{t,k}^r= & {} \left\langle (\varvec{x}_t-\hat{\varvec{x}}_t^r)(\varvec{x}_{k}-\hat{\varvec{x}}_{k}^r)^{\intercal }\right\rangle _{p({\varvec{x}_t,\varvec{x}_k|{\varvec{y}}_1,\ldots ,{\varvec{y}}_r})}, \end{aligned}$$
(99)

the estimates are calculated via the forward and backward recursions:

  • In the forward recursion, for \(t=1, \ldots , \tau \), compute

    $$\begin{aligned} \hat{V}_{t,t}^{t-1}= & {} {A} \hat{V}_{t-1,t-1}^{t-1} {A} ^{\intercal }+ {Q} , \end{aligned}$$
    (100)
    $$\begin{aligned} K_t= & {} \hat{V}_{t,t}^{t-1} {C} ^{\intercal }( {C} \hat{V}_{t,t}^{t-1} {C} ^{\intercal }+ {R} )^{-1}, \end{aligned}$$
    (101)
    $$\begin{aligned} \hat{V}_{t,t}^t= & {} \hat{V}_{t,t}^{t-1} - K_t {C} \hat{V}_{t,t}^{t-1}, \end{aligned}$$
    (102)
    $$\begin{aligned} \hat{\varvec{x}}_t^{t-1}= & {} {A} \hat{\varvec{x}}_{t-1}^{t-1}, \end{aligned}$$
    (103)
    $$\begin{aligned} \hat{\varvec{x}}_t^t= & {} \hat{\varvec{x}}_t^{t-1} + K_t ({{\tilde{{\varvec{y}}}}}_t- {C} \hat{\varvec{x}}_t^{t-1}), \end{aligned}$$
    (104)

    with initial conditions \(\hat{\varvec{x}}_1^0 = {\varvec{\mu }} \) and \(\hat{V}_{1,1}^0= {S} \).

  • In the backward recursion, for \(t=\tau , \ldots ,1\),

    $$\begin{aligned} J_{t-1}= & {} \hat{V}_{t-1,t-1}^{t-1} {A} ^{\intercal }(\hat{V}_{t,t}^{t-1})^{-1}, \end{aligned}$$
    (105)
    $$\begin{aligned} \hat{\varvec{x}}_{t-1}^\tau= & {} \hat{\varvec{x}}_{t-1}^{t-1}+J_{t-1}(\hat{\varvec{x}}_t^\tau - {A} \hat{\varvec{x}}_{t-1}^{t-1}), \end{aligned}$$
    (106)
    $$\begin{aligned} \hat{V}_{t-1,t-1}^\tau= & {} \hat{V}_{t-1,t-1}^{t-1}+J_{t-1}(\hat{V}_{t,t}^\tau -\hat{V}_{t,t}^{t-1})J_{t-1}^{\intercal }, \end{aligned}$$
    (107)

    and for \(t=\tau ,\ldots ,2\),

    $$\begin{aligned}&\hat{V}_{t-1,t-2}^\tau = ~\hat{V}_{t-1,t-1}^{t-1}J_{t-2}^{\intercal } \nonumber \\&\quad ~ + J_{t-1}(\hat{V}_{t,t-1}^\tau - {A} \hat{V}_{t-1,t-1}^{t-1})J_{t-2}^{\intercal } \end{aligned}$$
    (108)

    with initial condition \(\hat{V}_{\tau ,\tau -1}^\tau = (I-K_\tau {C} ) {A} \hat{V}_{\tau -1,\tau -1}^{\tau -1}\).

This algorithm can be used to efficiently compute the variational covariance parameters \({\varSigma }_{t,t}^*\) and \({\varSigma }_{t,t+1}^*\) of the BDS, which are exactly the matrices \(\hat{V}_{t,t}^\tau \) of (107) and \(\hat{V}_{t,t-1}^\tau \) of (108), respectively. This has complexity \(O(L^\rho \tau ),\) with \(\rho \approx 2.4\).
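The equivalence between the direct solution and the smoother can be illustrated on a small example. The sketch below is not the authors' implementation: the function names `build_W_nu` and `kalman_smoother`, the dimensions, and the random parameters are all assumptions made for illustration. It builds \(W\) of (95) and \(\varvec{\nu }\) of (97) (using \(C^{\intercal }R^{-1}{{\tilde{{\varvec{y}}}}}_t\), so that the dimensions agree), solves (94) and (96) directly, and compares the result with the forward and backward recursions (100)-(107). For brevity, the lag-one covariance is obtained from the standard smoother identity \(\mathrm {Cov}(\varvec{x}_{t-1},\varvec{x}_t|{\varvec{y}}_1^\tau ) = J_{t-1}\hat{V}_{t,t}^\tau \) rather than from the recursion (108).

```python
import numpy as np

def build_W_nu(S, mu, A, C, Q, R, Yc):
    """Block-tridiagonal W of (95) and vector nu of (97).
    Yc holds the centered observations y_t - u, one row per time step."""
    tau, L = Yc.shape[0], A.shape[0]
    Si, Qi, Ri = np.linalg.inv(S), np.linalg.inv(Q), np.linalg.inv(R)
    W, nu = np.zeros((tau * L, tau * L)), np.zeros(tau * L)
    for t in range(tau):
        s = slice(t * L, (t + 1) * L)
        W[s, s] = C.T @ Ri @ C + (Si if t == 0 else Qi)
        nu[s] = C.T @ Ri @ Yc[t]
        if t == 0:
            nu[s] += Si @ mu
        if t < tau - 1:
            n = slice((t + 1) * L, (t + 2) * L)
            W[s, s] += A.T @ Qi @ A
            W[n, s] = -Qi @ A              # block with i = j + 1
            W[s, n] = -A.T @ Qi            # block with i = j - 1
    return W, nu

def kalman_smoother(S, mu, A, C, Q, R, Yc):
    """Forward recursion (100)-(104) and backward recursion (105)-(107)."""
    tau, L = Yc.shape[0], A.shape[0]
    xp, Vp = np.zeros((tau, L)), np.zeros((tau, L, L))   # x_t^{t-1}, V_{t,t}^{t-1}
    xf, Vf = np.zeros((tau, L)), np.zeros((tau, L, L))   # x_t^{t},   V_{t,t}^{t}
    for t in range(tau):
        if t == 0:
            xp[t], Vp[t] = mu, S                          # initial conditions
        else:
            xp[t] = A @ xf[t - 1]                         # (103)
            Vp[t] = A @ Vf[t - 1] @ A.T + Q               # (100)
        K = Vp[t] @ C.T @ np.linalg.inv(C @ Vp[t] @ C.T + R)   # (101)
        Vf[t] = Vp[t] - K @ C @ Vp[t]                     # (102)
        xf[t] = xp[t] + K @ (Yc[t] - C @ xp[t])           # (104)
    xs, Vs = xf.copy(), Vf.copy()                         # x_t^{tau}, V_{t,t}^{tau}
    Vc = np.zeros((tau - 1, L, L))                        # Cov(x_t, x_{t+1} | y_1..y_tau)
    for t in range(tau - 1, 0, -1):
        J = Vf[t - 1] @ A.T @ np.linalg.inv(Vp[t])        # (105)
        xs[t - 1] = xf[t - 1] + J @ (xs[t] - A @ xf[t - 1])    # (106)
        Vs[t - 1] = Vf[t - 1] + J @ (Vs[t] - Vp[t]) @ J.T      # (107)
        Vc[t - 1] = J @ Vs[t]      # standard smoother identity, used instead of (108)
    return xs, Vs, Vc

rng = np.random.default_rng(1)
L, K, tau = 2, 3, 6                                       # illustrative sizes
A = 0.9 * np.eye(L) + 0.05 * rng.standard_normal((L, L))
C = rng.standard_normal((K, L))
Q, R, S = 0.1 * np.eye(L), 4.0 * np.eye(K), np.eye(L)     # R = 4I, as in the BDS case
mu, Yc = rng.standard_normal(L), rng.standard_normal((tau, K))

W, nu = build_W_nu(S, mu, A, C, Q, R, Yc)
Sigma = np.linalg.inv(W)                 # direct solution (94)
m = (Sigma @ nu).reshape(tau, L)         # direct solution (96)
xs, Vs, Vc = kalman_smoother(S, mu, A, C, Q, R, Yc)

blk = lambda i, j: Sigma[i * L:(i + 1) * L, j * L:(j + 1) * L]
print(np.allclose(m, xs),
      all(np.allclose(blk(t, t), Vs[t]) for t in range(tau)),
      all(np.allclose(blk(t, t + 1), Vc[t]) for t in range(tau - 1)))
```

The direct route inverts an \(L\tau \times L\tau \) matrix, whereas the recursions only ever invert \(L\times L\) and \(K\times K\) matrices, which is the source of the linear dependence on \(\tau \) noted above.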

1.3 (c) Inferring the Variational Mean \(\varvec{m}\) of the BDS

The variational mean \(\varvec{m}\) is the solution of

$$\begin{aligned} \varvec{m}^* =&~\mathop {\mathrm {argmax}}_{\varvec{m}}~{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q) \nonumber \\ =&~\mathop {\mathrm {argmax}}_{\varvec{m}} \bigg \{ {\varvec{\mu }}^{\intercal }S^{-1}\varvec{m}_1 -\frac{1}{2} \varvec{m}_1^{\intercal }S^{-1}\varvec{m}_1\nonumber \\&~~-\frac{1}{2} \sum _{t=1}^{\tau -1} \lambda _t^{\intercal }{\varGamma }^{-1}\lambda _t \nonumber \\&~~+ \sum _{t,k}\Big [ {\pi }_{kt}\ln \sigma ({\hat{{\omega }}}_{kt}) + (1-{\pi }_{kt})\ln \sigma (-{\hat{{\omega }}}_{kt})\Big ]\bigg \}. \end{aligned}$$
(109)

This can be rewritten as

$$\begin{aligned} \varvec{m}^* =&~\mathop {\mathrm {argmax}}_{\varvec{m}} \bigg \{-\frac{1}{2}\varvec{m}^{\intercal }{{\tilde{W}}}\varvec{m}+ \varvec{ b}_1^{\intercal }\varvec{m}_1 \nonumber \\&+ \sum _{t, k} \Big [{\pi }_{kt}\ln \sigma ({\hat{{\omega }}}_{kt}) + (1-{\pi }_{kt})\ln \sigma (-{\hat{{\omega }}}_{kt}) \Big ] \bigg \}, \end{aligned}$$
(110)

where

$$\begin{aligned} {{\tilde{W}}}_{i,j} = {\left\{ \begin{array}{ll} A^{\intercal }Q^{-1}A+ S^{-1}, &{} i=j=1, \\ A^{\intercal }Q^{-1}A+ Q^{-1},&{}1<i=j<\tau , \\ Q^{-1}, &{} i=j=\tau , \\ -Q^{-1}A, &{} i=j+1, \\ -A^{\intercal }Q^{-1}, &{} i=j-1, \\ {0}, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(111)

\({\hat{{\omega }}}_{kt}=C_{k\cdot }\varvec{m}_t+u_k\), and \(\varvec{ b}_1 = S^{-1}{\varvec{\mu }}\) is the coefficient of the term linear in \(\varvec{m}_1\) in (109). Since \({{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q)\) is a concave function of \(\varvec{m}\in {\mathbb {R}}^{\tau L}\), any stationary point is a global optimum, and gradient-based methods can be used to find it.

The gradient of \({{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q)\) is

$$\begin{aligned} \frac{\partial }{\partial \varvec{m}}{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q) =&~ -{{\tilde{W}}}\varvec{m}+ \begin{bmatrix} \varvec{ b}_1\\ {\varvec{0}} \end{bmatrix} - \begin{bmatrix} C^{\intercal }&&\\&\ddots&\\&&C^{\intercal } \end{bmatrix} \begin{bmatrix} {\varvec{\beta }}_1 \\ \vdots \\ {\varvec{\beta }}_\tau \end{bmatrix}, \end{aligned}$$
(112)

where

$$\begin{aligned} {\varvec{\beta }}_t =&~ \begin{bmatrix} \sigma ({\hat{{\omega }}}_{1t})-{\pi }_{1t}\\ \vdots \\ \sigma ({\hat{{\omega }}}_{Kt})-{\pi }_{Kt} \end{bmatrix}. \end{aligned}$$

The Hessian of \({{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q)\) with respect to \(\varvec{m}\) is

$$\begin{aligned} \frac{\partial ^2}{\partial \varvec{m}^2}{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q)&= -{{\tilde{W}}} - \begin{bmatrix} C^{\intercal }{\varXi }_1C&&\\&\ddots&\\&&C^{\intercal }{\varXi }_\tau C\end{bmatrix}, \end{aligned}$$
(113)

where

$$\begin{aligned} {\varXi }_t = {\mathrm {diag}}(\sigma ({\hat{{\omega }}}_{1t})\sigma (-{\hat{{\omega }}}_{1t}),~\ldots ,~\sigma ({\hat{{\omega }}}_{Kt})\sigma (-{\hat{{\omega }}}_{Kt})). \end{aligned}$$

Given the concavity and smoothness of \({{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q)\), many popular numerical optimization algorithms can be used to search for its optimum, e.g., gradient ascent, the Newton–Raphson method, or the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm.
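As an illustration of one such algorithm, the sketch below applies Newton–Raphson iterations to the objective of (110), using the gradient (112) and the Hessian (113). It is a minimal example under stated assumptions, not the procedure used in the experiments: the function names, problem sizes, random parameters, and the simple step-halving safeguard are all hypothetical, and the linear coefficient is taken as \(S^{-1}{\varvec{\mu }}\), as read off from (109).

```python
import numpy as np

def build_W_tilde(S, A, Q, tau):
    """Block-tridiagonal matrix of (111)."""
    L = A.shape[0]
    Si, Qi = np.linalg.inv(S), np.linalg.inv(Q)
    W = np.zeros((tau * L, tau * L))
    for t in range(tau):
        s = slice(t * L, (t + 1) * L)
        W[s, s] = Si if t == 0 else Qi
        if t < tau - 1:
            n = slice((t + 1) * L, (t + 2) * L)
            W[s, s] += A.T @ Qi @ A
            W[n, s], W[s, n] = -Qi @ A, -A.T @ Qi
    return W

def objective(m, Wt, b, C, u, Pi):
    """Objective of (110); Pi[t, k] holds pi_{kt}."""
    omega = m.reshape(Pi.shape[0], -1) @ C.T + u
    ll = Pi * np.logaddexp(0, -omega) + (1 - Pi) * np.logaddexp(0, omega)
    return -0.5 * m @ Wt @ m + b @ m - ll.sum()

def gradient(m, Wt, b, C, u, Pi):
    """Gradient (112), with beta_t = sigma(omega_t) - pi_t; also returns sigma."""
    sig = 1.0 / (1.0 + np.exp(-(m.reshape(Pi.shape[0], -1) @ C.T + u)))
    return -Wt @ m + b - ((sig - Pi) @ C).ravel(), sig

def newton_mean(Wt, b, C, u, Pi, iters=50, tol=1e-9):
    """Newton-Raphson ascent with step halving (the safeguard is an assumption)."""
    tau, L = Pi.shape[0], C.shape[1]
    m = np.zeros(tau * L)
    for _ in range(iters):
        g, sig = gradient(m, Wt, b, C, u, Pi)
        if np.linalg.norm(g) < tol:
            break
        H = -Wt.copy()                                    # Hessian (113)
        for t in range(tau):
            s = slice(t * L, (t + 1) * L)
            H[s, s] -= C.T @ (C * (sig[t] * (1 - sig[t]))[:, None])   # C' Xi_t C
        step = np.linalg.solve(H, g)
        f0, a = objective(m, Wt, b, C, u, Pi), 1.0
        while a > 1e-8 and objective(m - a * step, Wt, b, C, u, Pi) < f0 - 1e-10 * (1 + abs(f0)):
            a *= 0.5                                      # halve the step if the objective drops
        m = m - a * step
    return m

rng = np.random.default_rng(2)
L, K, tau = 2, 3, 5                                       # illustrative sizes
A = 0.9 * np.eye(L)
C, u = rng.standard_normal((K, L)), rng.standard_normal(K)
Q, S, mu = 0.1 * np.eye(L), np.eye(L), rng.standard_normal(L)
Pi = rng.uniform(size=(tau, K))                           # attribute probabilities pi_{kt}

Wt = build_W_tilde(S, A, Q, tau)
b = np.zeros(tau * L)
b[:L] = np.linalg.inv(S) @ mu                             # coefficient of m_1 in (109)
m_star = newton_mean(Wt, b, C, u, Pi)
g_star, _ = gradient(m_star, Wt, b, C, u, Pi)
f_star = objective(m_star, Wt, b, C, u, Pi)
print(np.linalg.norm(g_star))
```

Because the Hessian is block tridiagonal, a practical implementation could replace the dense `np.linalg.solve` call with a banded solver; the dense version is kept here for clarity.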

Appendix 4: The Fisher Vector for BDS

In this section, we present the derivation of the Fisher vector for BDS using the tightest variational lower bound \({{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*)\) of (69). This consists of computing partial derivatives of \({{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*)\) w.r.t. each of the BDS parameters \({\varvec{\theta }}=\{S^{-1}, {\varvec{\mu }}, A, Q^{-1}, C, \varvec{u}\}\).

1.1 (a) Derivative w.r.t. \(S^{-1}\)

We have

$$\begin{aligned}&\frac{\partial }{\partial S^{-1}} {{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*) \nonumber \\&\quad = \frac{\partial }{\partial S^{-1}}~\frac{1}{2}\bigg [\ln \left| S^{-1}\right| -\mathrm {tr}\Big (({\hat{P}}_{1,1}^*- 2\varvec{m}^*_1{\varvec{\mu }}^{\intercal }+ {\varvec{\mu }}{\varvec{\mu }}^{\intercal })S^{-1}\Big ) \bigg ] \nonumber \\&\quad =\frac{1}{2}\Big (S+ 2{\varvec{\mu }}\varvec{m}_1^{*\intercal } - {\hat{P}}_{1,1}^* - {\varvec{\mu }}{\varvec{\mu }}^{\intercal } \Big ), \end{aligned}$$
(114)

where \({\hat{P}}_{t_1,t_2}^*\) is defined in (62). Note that \(S^{-1}\in {{\mathcal {S}}}^L_{++}\), so the derivative of (114) needs to be projected onto the space of symmetric matrices \({{\mathcal {S}}}^L\). Since an orthonormal basis of \({{\mathcal {S}}}^L\) is \(\{\frac{1}{2}(E_{i,j}+E_{j,i}), 1\leqslant i\leqslant j\leqslant L\}\), where \(E_{i,j}\in {\mathbb {R}}^{L\times L}\) is the matrix whose \((i,j)\)-th element is one and whose other elements are zero, it can be shown that, after the projection, (114) becomes

$$\begin{aligned} \frac{\partial }{\partial S^{-1}}&{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*) \nonumber \\ =&~\frac{1}{2}\Big (S+ {\varvec{\mu }}\varvec{m}_1^{*\intercal } + \varvec{m}^*_1{\varvec{\mu }}^{\intercal } - {\hat{P}}_{1,1}^* - {\varvec{\mu }}{\varvec{\mu }}^{\intercal } \Big ). \end{aligned}$$
(115)

1.2 (b) Derivative w.r.t. \({\varvec{\mu }}\)

We have

$$\begin{aligned} \frac{\partial }{\partial {\varvec{\mu }}}&{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*) \nonumber \\ =&~\frac{\partial }{\partial {\varvec{\mu }}} \Big [ {\varvec{\mu }}^{\intercal }S^{-1}\varvec{m}^*_1 - \frac{1}{2}{\varvec{\mu }}^{\intercal }S^{-1}{\varvec{\mu }}\Big ] \nonumber \\ =&~S^{-1}(\varvec{m}^*_1 - {\varvec{\mu }}). \end{aligned}$$
(116)

1.3 (c) Derivative w.r.t. \(A\)

We have

$$\begin{aligned} \frac{\partial }{\partial A}&{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*) \nonumber \\ =&~\frac{\partial }{\partial A} \bigg [ \sum \limits _{t=1}^{\tau -1} \mathrm {tr}\bigg ( {\hat{P}}_{t,t+1}^*Q^{-1}A-\frac{1}{2}{\hat{P}}_{t,t}^*A^{\intercal }Q^{-1}A\bigg ) \bigg ] \nonumber \\ =&~\frac{\partial }{\partial A} \bigg [ \mathrm {tr}\bigg ( {\varPsi }^{\intercal }Q^{-1}A-\frac{1}{2}\phi A^{\intercal }Q^{-1}A\bigg ) \bigg ] \nonumber \\ =&~({\varPsi }^{\intercal }Q^{-1})^{\intercal } - \frac{1}{2}\Big [Q^{-\intercal }A\phi ^{\intercal } + Q^{-1}A\phi \Big ] \nonumber \\ =&~Q^{-1}({\varPsi }-A\phi ), \end{aligned}$$
(117)

where

$$\begin{aligned} \phi = \sum \limits _{t=2}^{\tau } {\hat{P}}_{t-1,t-1}^*,~ {\varPsi }= \sum \limits _{t=2}^{\tau } {\hat{P}}_{t,t-1}^*. \end{aligned}$$
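The closed form (117) can be sanity-checked against finite differences. The sketch below is illustrative only: the helper names are hypothetical, and random matrices stand in for \({\varPsi }\), \(\phi \), and \(Q^{-1}\), with \(\phi \) and \(Q^{-1}\) taken symmetric, as the last step of (117) requires.

```python
import numpy as np

def bound_terms_in_A(A, Psi, phi, Qi):
    """Terms of the bound that depend on A, as in the second line of (117)."""
    return np.trace(Psi.T @ Qi @ A - 0.5 * phi @ A.T @ Qi @ A)

def grad_A(A, Psi, phi, Qi):
    """Closed-form derivative (117): Q^{-1} (Psi - A phi)."""
    return Qi @ (Psi - A @ phi)

def numerical_grad(fun, X, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
L = 3                                              # illustrative state dimension
A = rng.standard_normal((L, L))
Psi = rng.standard_normal((L, L))                  # stand-in for the sum of P_{t,t-1}
B = rng.standard_normal((L, L))
phi = B @ B.T                                      # symmetric stand-in for the sum of P_{t-1,t-1}
Bq = rng.standard_normal((L, L))
Qi = Bq @ Bq.T + np.eye(L)                         # symmetric positive-definite Q^{-1}

G_closed = grad_A(A, Psi, phi, Qi)
G_num = numerical_grad(lambda X: bound_terms_in_A(X, Psi, phi, Qi), A)
print(np.abs(G_closed - G_num).max())
```

Setting (117) to zero recovers the familiar update \(A={\varPsi }\phi ^{-1}\), which is consistent with the role of these statistics in LDS learning.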

1.4 (d) Derivative w.r.t. \(Q^{-1}\)

We have

$$\begin{aligned} \frac{\partial }{\partial Q^{-1}}&{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*) \nonumber \\ =&~\frac{\partial }{\partial Q^{-1}} \bigg [ \sum \limits _{t=1}^{\tau -1} \mathrm {tr}\bigg ( A{\hat{P}}_{t,t+1}^*Q^{-1} -\frac{1}{2}A{\hat{P}}_{t,t}^*A^{\intercal }Q^{-1} \nonumber \\& -\frac{1}{2}{\hat{P}}_{t+1,t+1}^*Q^{-1} \bigg ) + (\frac{\tau -1}{2})\ln \left| Q^{-1}\right| \bigg ] \nonumber \\ =&~\frac{\partial }{\partial Q^{-1}} \bigg [ \mathrm {tr}\bigg ( A{\varPsi }^{\intercal }Q^{-1} -\frac{1}{2}A\phi A^{\intercal }Q^{-1} -\frac{1}{2}\varphi Q^{-1} \bigg ) \nonumber \\& + (\frac{\tau -1}{2})\ln \left| Q^{-1}\right| \bigg ] \nonumber \\ =&~{\varPsi }A^{\intercal } + \frac{1}{2}\Big [(\tau -1)Q-A\phi A^{\intercal } - \varphi \Big ], \end{aligned}$$
(118)

where

$$\begin{aligned} \varphi&= \sum \limits _{t=2}^{\tau } {\hat{P}}_{t,t}^*.~ \end{aligned}$$
(119)

Again, since \(Q^{-1}\in {{\mathcal {S}}}_{++}\), the partial derivative of (118) is projected into \({{\mathcal {S}}}\), giving

$$\begin{aligned} \frac{\partial }{\partial Q^{-1}}&{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*) \nonumber \\ =~\frac{1}{2}&\Big [{\varPsi }A^{\intercal }+A{\varPsi }^{\intercal }-A\phi A^{\intercal } - \varphi + (\tau -1)Q\Big ]. \end{aligned}$$
(120)
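The effect of the projection can be illustrated numerically. The sketch below, which uses hypothetical helper names and random stand-ins for \({\varPsi }\), \(\phi \), \(\varphi \), and \(Q^{-1}\), checks that the directional derivative of the bound along a symmetric direction \(D\) equals \(\mathrm {tr}(G D)\), where \(G\) is the projected derivative (120), and that (120) is the symmetric part of (118).

```python
import numpy as np

def bound_terms_in_P(P, A, Psi, phi, vphi, tau):
    """Terms of the bound that depend on P = Q^{-1}, as in (118)."""
    _, logdet = np.linalg.slogdet(P)
    T = A @ Psi.T @ P - 0.5 * A @ phi @ A.T @ P - 0.5 * vphi @ P
    return np.trace(T) + 0.5 * (tau - 1) * logdet

def grad_unconstrained(P, A, Psi, phi, vphi, tau):
    """Derivative (118), with Q = P^{-1}."""
    Q = np.linalg.inv(P)
    return Psi @ A.T + 0.5 * ((tau - 1) * Q - A @ phi @ A.T - vphi)

def grad_symmetric(P, A, Psi, phi, vphi, tau):
    """Projected derivative (120)."""
    Q = np.linalg.inv(P)
    return 0.5 * (Psi @ A.T + A @ Psi.T - A @ phi @ A.T - vphi + (tau - 1) * Q)

rng = np.random.default_rng(4)
L, tau = 3, 8                                      # illustrative sizes
A, Psi = rng.standard_normal((L, L)), rng.standard_normal((L, L))
B1, B2, B3 = (rng.standard_normal((L, L)) for _ in range(3))
phi, vphi = B1 @ B1.T, B2 @ B2.T                   # symmetric stand-ins for phi and varphi
P = B3 @ B3.T + np.eye(L)                          # symmetric positive-definite Q^{-1}
D = rng.standard_normal((L, L))
D = D + D.T                                        # a symmetric perturbation direction

eps = 1e-6
f = lambda X: bound_terms_in_P(X, A, Psi, phi, vphi, tau)
num = (f(P + eps * D) - f(P - eps * D)) / (2 * eps)   # numerical directional derivative
G_sym = grad_symmetric(P, A, Psi, phi, vphi, tau)
G_raw = grad_unconstrained(P, A, Psi, phi, vphi, tau)
print(num, np.trace(G_sym @ D))
```

Only symmetric directions are admissible for \(Q^{-1}\), which is why the antisymmetric part of (118) plays no role and can be discarded.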

1.5 (e) Derivative w.r.t. \({\tilde{C}}\)

Assuming \({\tilde{C}}=\begin{bmatrix} C&\varvec{u}\end{bmatrix}\), we have

$$\begin{aligned} \frac{\partial }{\partial {\tilde{C}}}&{{\hat{{\mathscr {L}}}}}({\varvec{\theta }}, q^*) \nonumber \\ =&~\frac{\partial }{\partial {\tilde{C}}} \bigg \{ \sum \limits _{k,t} \Big [ {\pi }_{kt}\ln \sigma ({\tilde{C}}_{k\cdot }{\varvec{b}_{t}}) + (1-{\pi }_{kt})\ln \sigma (-{\tilde{C}}_{k\cdot }{\varvec{b}_{t}}) \Big ] \nonumber \\&- \frac{1}{8}\mathrm {tr}\big ({\tilde{C}}{{\tilde{{\varUpsilon }}}} {\tilde{C}}^{\intercal }\big ) \bigg \} \nonumber \\ =&~- \bigg \{\frac{1}{4}{\tilde{C}}{{\tilde{{\varUpsilon }}}} + \sum _{t=1}^{\tau } \begin{bmatrix} \sigma ({\tilde{C}}_{1\cdot }{\varvec{b}_{t}}) - {\pi }_{1t}\\ \vdots \\ \sigma ({\tilde{C}}_{K\cdot }{\varvec{b}_{t}}) - {\pi }_{Kt} \end{bmatrix} {\varvec{b}}_{t}^{\intercal }\bigg \}, \end{aligned}$$
(121)

where \({\tilde{C}}_{k\cdot }\) is the k-th row of \({\tilde{C}}\), and

$$\begin{aligned} {{\tilde{{\varUpsilon }}}} = \begin{pmatrix} \sum \limits _{t=1}^{\tau } \varSigma _{t,t}^{*} &{} 0\\ 0 &{} 0 \end{pmatrix},~ {\varvec{b}}_{t} = \begin{pmatrix} \varvec{m}^*_t\\ 1 \end{pmatrix}. \end{aligned}$$
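The derivative with respect to \({\tilde{C}}\) can also be checked by finite differences. In the sketch below (hypothetical helper names, random illustrative data), the gradient is assembled term by term: \(-\frac{1}{4}{\tilde{C}}{{\tilde{{\varUpsilon }}}}\) from the trace term, and \(-\sum _t (\sigma ({\tilde{C}}{\varvec{b}}_t)-{\varvec{\pi }}_t){\varvec{b}}_t^{\intercal }\) from the log-sigmoid terms, where \({\varvec{\pi }}_t\) denotes the vector of \({\pi }_{kt}\) at time \(t\).

```python
import numpy as np

def bound_terms_in_C(Ct, Bm, Pi, Ups):
    """Terms of the bound that depend on C~ = [C u], as in (121).
    Bm[t] is b_t = [m_t; 1] and Pi[t, k] is pi_{kt}."""
    omega = Bm @ Ct.T                                    # omega[t, k] = C~_{k.} b_t
    ll = Pi * np.logaddexp(0, -omega) + (1 - Pi) * np.logaddexp(0, omega)
    return -ll.sum() - 0.125 * np.trace(Ct @ Ups @ Ct.T)

def grad_C(Ct, Bm, Pi, Ups):
    """Term-by-term derivative: -(1/4) C~ Ups - sum_t (sigma(C~ b_t) - pi_t) b_t'."""
    sig = 1.0 / (1.0 + np.exp(-(Bm @ Ct.T)))
    return -0.25 * Ct @ Ups - (sig - Pi).T @ Bm

def numerical_grad(fun, X, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(5)
L, K, tau = 2, 3, 5                                      # illustrative sizes
M = rng.standard_normal((tau, L))                        # stand-in for the means m_t^*
Bm = np.hstack([M, np.ones((tau, 1))])                   # rows are b_t = [m_t; 1]
Pi = rng.uniform(size=(tau, K))                          # attribute probabilities pi_{kt}
Bs = rng.standard_normal((L, L))
Ups = np.zeros((L + 1, L + 1))
Ups[:L, :L] = Bs @ Bs.T                                  # stand-in for the sum of Sigma_{t,t}^*
Ct = rng.standard_normal((K, L + 1))

G_closed = grad_C(Ct, Bm, Pi, Ups)
G_num = numerical_grad(lambda X: bound_terms_in_C(X, Bm, Pi, Ups), Ct)
print(np.abs(G_closed - G_num).max())
```

The last column of the gradient corresponds to the offset \(\varvec{u}\); since the last row and column of \({{\tilde{{\varUpsilon }}}}\) are zero, the trace term does not contribute to it.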

Appendix 5: Weizmann Complex Activity

1.1 (a) Synthetic Datasets

The synthetic dataset contains three sets: Syn-4/5/6, Syn20 \(\times \) 1 and Syn10 \(\times \) 2, which are generated using the 10 atomic actions (per person) from the original Weizmann dataset by Gorelick et al. (2007). Exemplar activities in Syn-4/5/6, Syn20 \(\times \) 1, and Syn10 \(\times \) 2 are shown in Tables 7, 8, and 9, respectively. For Syn20 \(\times \) 1 and Syn10 \(\times \) 2, two of the 9 instances of an activity are shown (each instance is assembled from the atomic actions of one of the 9 people).

Table 7 Examples for Syn-4/5/6
Table 8 Examples for Syn20 \(\times \) 1
Table 9 Examples for Syn10 \(\times \) 2
Table 10 Attributes for Weizmann actions

Appendix 6: Attribute Definition

1.1 (a) Weizmann Complex Activity

Attribute definitions from Liu et al. (2011) for the Weizmann complex activity dataset are shown in Table 10.

1.2 (b) Olympic Sports

Attribute definitions from Liu et al. (2011) for the Olympic Sports dataset (Niebles et al. 2010) are shown in Table 11.

Table 11 Attributes for Olympic Sports
Table 12 Attribute list for TRECVID MED11

1.3 (c) TRECVID MED11

Attribute definitions from Bhattacharya (2013) for the TRECVID MED11 dataset (Over et al. 2011) are shown in Table 12.

Cite this article

Li, WX., Vasconcelos, N. Complex Activity Recognition Via Attribute Dynamics. Int J Comput Vis 122, 334–370 (2017). https://doi.org/10.1007/s11263-016-0918-1