DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features


The prediction of human interactions in videos has recently become an active topic in computer vision. Its goal is to infer the type of an interaction from its early stages, before the interaction is complete. Many approaches have been proposed, but the problem remains challenging. In this paper, features are extracted from the optical flow fields of video frames using convolutional neural networks. The features extracted from successive frames form a time series, so the task is modeled as a time series prediction problem. The interaction type is predicted by matching the time series of the test video against the time series in the training set. Dynamic time warping (DTW) finds an optimal match between a pair of time series through a nonlinear mapping between them. Finally, SVM and KNN classifiers that use the DTW distance predict the video label. The results show that the proposed model improves results on standard interaction recognition datasets, including TVHI, BIT, and UT-Interaction.
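The matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean local cost, the function names `dtw_distance` and `knn_predict`, the toy scalar sequences (standing in for per-frame CNN features), and the class labels are all assumptions made for illustration. The SVM variant with a DTW-based kernel is not shown.

```python
from collections import Counter

import numpy as np


def dtw_distance(a, b):
    """DTW distance between two sequences of per-frame feature vectors.

    Assumption: the local cost is the Euclidean distance between frames.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat 1-D input as a sequence of scalar features.
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # a advances, b repeats a frame
                cost[i, j - 1],      # b advances, a repeats a frame
                cost[i - 1, j - 1],  # both advance
            )
    return cost[n, m]


def knn_predict(query, train_series, train_labels, k=1):
    """Label a query series by majority vote among its k DTW-nearest neighbors."""
    dists = np.array([dtw_distance(query, s) for s in train_series])
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two training "videos" and a query that unfolds
# more slowly than the first training series.
train_series = [[0, 1, 2, 3], [3, 2, 1, 0]]
train_labels = ["handshake", "hug"]
query = [0, 1, 1, 2]
predicted = knn_predict(query, train_series, train_labels, k=1)
```

Because DTW lets one frame align with several frames of the other sequence, two series that differ only in speed can still have a small distance, which is what makes it usable as the distance in a nearest-neighbor classifier over variable-length videos.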






We would like to thank our colleague, Professor Leili Tapak, for her valuable comments and suggestions.

Author information



Corresponding author

Correspondence to Hassan Khotanlou.

Ethics declarations

Conflict of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Afrasiabi, M., Khotanlou, H. & Mansoorizadeh, M. DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis Comput 36, 1127–1139 (2020). https://doi.org/10.1007/s00371-019-01722-6

Keywords


  • Interaction prediction
  • Convolutional neural network
  • Dynamic time warping
  • Support vector machine
  • k-Nearest neighbor