
Deep-Learning-Based Human Intention Prediction Using RGB Images and Optical Flow

  • Shengchao Li
  • Lin Zhang
  • Xiumin Diao
Article

Abstract

A key technical issue in predicting human intention from observed human actions is how to discover and exploit the spatio-temporal patterns behind those actions. Inspired by the well-known two-stream architecture for action recognition, this paper proposes an approach to human intention prediction based on a two-stream architecture using RGB images and optical flow. First, the action-start frame of each trial of a human action is determined by calculating the L2 distance of the human joint positions between frames of the skeleton data. Second, a spatial network and a temporal network are trained separately to predict human intentions from RGB images and optical flow, respectively. Both the early concatenation fusion methods in the spatial network and the sampling methods in the temporal network are optimized through experiments. Finally, average fusion combines the prediction results of the spatial and temporal networks. To verify the effectiveness of the proposed approach, a new dataset of human intentions behind human actions is introduced; it contains RGB images, RGB-D images, and skeleton data of the human action of pitching a ball. Experiments show that the proposed approach predicts human intentions behind human actions with an accuracy of 74% on this dataset. The approach is further evaluated on the Intention from Motion (IfM) dataset, a dataset of human intentions behind the action of grasping a bottle, where it achieves a prediction accuracy of 77%. These results indicate that the proposed approach is effective in predicting human intentions behind human actions in different applications.
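To make the two key computational steps concrete, the sketch below illustrates (i) locating an action-start frame from skeleton data via the inter-frame L2 distance of joint positions and (ii) average (late) fusion of the two streams' class scores. This is a minimal illustration under stated assumptions, not the authors' released code: the array shapes, the motion threshold, and the function names are illustrative assumptions.

```python
import numpy as np

def find_action_start(skeleton, threshold=0.02):
    """Locate the action-start frame of one trial.

    skeleton: array of shape (num_frames, num_joints, 3) of 3-D joint positions.
    The start frame is taken as the first frame whose summed L2 distance of joint
    positions to the previous frame exceeds `threshold` (an assumed,
    data-dependent value; the paper does not specify one).
    """
    # Per-joint L2 distance between consecutive frames: shape (num_frames - 1, num_joints)
    diffs = np.linalg.norm(skeleton[1:] - skeleton[:-1], axis=-1)
    # Total inter-frame motion per frame pair: shape (num_frames - 1,)
    motion = diffs.sum(axis=-1)
    # Index of the first frame pair exceeding the threshold (0 if none does)
    start = int(np.argmax(motion > threshold))
    return start + 1  # frame index of the later frame in that pair

def average_fusion(spatial_scores, temporal_scores):
    """Average fusion of per-class scores from the spatial and temporal streams.

    Both inputs are softmax outputs of shape (num_classes,); the predicted
    intention is the class with the highest averaged score.
    """
    fused = (np.asarray(spatial_scores) + np.asarray(temporal_scores)) / 2.0
    return fused, int(np.argmax(fused))
```

As a usage example, `average_fusion([0.2, 0.8], [0.6, 0.4])` returns the fused scores `[0.4, 0.6]` and predicted class `1`; averaging the streams' softmax outputs is a common late-fusion choice for two-stream networks.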

Keywords

Human intention prediction · Deep learning · RGB image · Optical flow



Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. School of Engineering Technology, Purdue University, West Lafayette, USA
