DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features

  • Mahlagha Afrasiabi
  • Hassan Khotanlou
  • Muharram Mansoorizadeh
Original Article


Interaction prediction in videos has recently become an active topic in computer vision. Its goal is to infer the type of an interaction while it is still in its early stages. Although many approaches have been proposed, interaction prediction remains a challenging problem. In this paper, the features are optical flow fields extracted from video frames with a convolutional neural network. Because these features are computed over successive frames, they form a time series, and the problem is therefore modeled as time-series prediction. The interaction type is predicted by matching the time series of the test video against the time series in the training set. Dynamic time warping provides an optimal match between a pair of time series through a nonlinear mapping between them. Finally, SVM and KNN classifiers with the dynamic time warping distance are used to predict the video label. Experiments show that the proposed model improves results on standard interaction recognition datasets, including TVHI, BIT, and UT-Interaction.
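The pipeline described above reduces to three steps: treat per-frame CNN features as a multivariate time series, align a test sequence with each training sequence using dynamic time warping (DTW), and classify by the resulting distances. The following Python sketch is a minimal illustration of that idea, assuming the per-frame features have already been extracted into arrays of shape (frames, feature_dim); the function names and shapes are illustrative and not taken from the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature sequences.

    a, b: arrays of shape (T1, D) and (T2, D), e.g. per-frame
    optical-flow/CNN feature vectors (hypothetical shapes).
    """
    t1, t2 = len(a), len(b)
    # Pairwise Euclidean distances between all frame pairs.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Accumulated-cost matrix for the dynamic program.
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[t1, t2]

def knn_predict(test_seq, train_seqs, train_labels, k=1):
    """Label a test video by majority vote of its k DTW-nearest
    training sequences."""
    dists = np.array([dtw_distance(test_seq, s) for s in train_seqs])
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

For the SVM variant, the DTW distances could similarly be turned into a similarity measure, for example exp(-d/sigma); such DTW-based kernels are not guaranteed to be positive definite, which is a known caveat in the DTW-kernel literature.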


Interaction prediction · Convolutional neural network · Dynamic time warping · Support vector machine · k-Nearest neighbor



We would like to thank our colleague Professor Leili Tapak for her valuable comments and suggestions.

Compliance with ethical standards

Conflict of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran