A spatiotemporal attention-based ResC3D model for large-scale gesture recognition

  • Yunan Li
  • Qiguang Miao
  • Xiangda Qi
  • Zhenxin Ma
  • Wanli Ouyang
Special Issue Paper

Abstract

Abnormal gesture recognition has many applications in visual surveillance, crowd behavior analysis, and sensitive video content detection. However, recognizing dynamic gestures in large-scale videos remains challenging because of gesture-irrelevant factors such as variations in illumination, movement path, and background. In this paper, we propose a spatiotemporal attention-based ResC3D model for abnormal gesture recognition in large-scale videos. One key idea is to find a compact and effective representation of the gesture in both spatial and temporal contexts. To eliminate the influence of gesture-irrelevant factors, we first apply enhancement techniques such as Retinex and a hybrid median filter to improve the quality of the RGB and depth inputs. Then, we design a spatiotemporal attention scheme to focus on the most valuable cues related to the moving parts of the gesture. Based on these representations, a ResC3D network, which combines the advantages of the residual network and the C3D model, is developed to extract features, together with a canonical correlation analysis-based fusion scheme for blending features from different modalities. The performance of our method is evaluated on the ChaLearn IsoGD dataset. Experiments demonstrate the effectiveness of each module of our method and show that the final accuracy reaches 68.14%, which outperforms other state-of-the-art methods, including our earlier work at the 2017 ChaLearn Looking at People Workshop of ICCV.
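
The abstract names the hybrid median filter as one of the enhancement steps but gives no details of how it is applied. As a rough illustration only, the Python sketch below shows the textbook 3×3 hybrid median filter: the output pixel is the median of three values, namely the median of the plus-shaped neighbourhood, the median of the X-shaped neighbourhood, and the centre pixel itself. The function name `hybrid_median_3x3`, the choice to copy border pixels unchanged, and the sample depth patch are assumptions made for this example and are not taken from the paper.

```python
from statistics import median


def hybrid_median_3x3(image):
    """Apply a 3x3 hybrid median filter to a 2D list of numbers.

    Border pixels are copied unchanged (an assumption made for brevity).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    output = [row[:] for row in image]

    for y in range(1, height - 1):
        for x in range(1, width - 1):
            center = image[y][x]
            # Plus-shaped neighbourhood: centre plus the 4 axis-aligned neighbours.
            plus = [center, image[y - 1][x], image[y + 1][x],
                    image[y][x - 1], image[y][x + 1]]
            # X-shaped neighbourhood: centre plus the 4 diagonal neighbours.
            cross = [center, image[y - 1][x - 1], image[y - 1][x + 1],
                     image[y + 1][x - 1], image[y + 1][x + 1]]
            # The output is the median of the two sub-medians and the centre.
            output[y][x] = median([median(plus), median(cross), center])
    return output


# Hypothetical depth patch with one impulse-noise pixel (255) in the centre.
depth_patch = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
filtered = hybrid_median_3x3(depth_patch)
```

In this toy patch the isolated outlier is replaced by the value of its neighbours. Because the two sub-medians are computed along different directions, this kind of filter tends to preserve thin lines and corners better than a plain 3×3 median, which is one plausible reason to prefer it for noisy depth maps.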

Keywords

  • Gesture recognition
  • Spatiotemporal attention mechanism
  • ResC3D model

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. School of Computer Science and Technology, Xidian University, Xi’an, China
  2. Xi’an Key Laboratory of Big Data and Intelligent Vision, Xi’an, China
  3. School of Electrical and Information Engineering, The University of Sydney, Sydney, Australia
