Comparing Hybrid NN-HMM and RNN for Temporal Modeling in Gesture Recognition

  • Nicolas Granger
  • Mounîm A. el Yacoubi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10635)


This paper provides an extended comparison of two temporal models for gesture recognition, namely Hybrid Neural Network-Hidden Markov Models (NN-HMM) and Recurrent Neural Networks (RNN), both of which have recently achieved state-of-the-art performance. Experiments were conducted on both models within the same body of work, with similar representation learning capacity and comparable computational costs. For both solutions, we integrated recent contributions to the model architectures and training techniques. We show that, for this task, Hybrid NN-HMM models remain competitive with Recurrent Neural Networks in a standard setting. For both models, we analyze the influence of the training objective function on the final evaluation metric. We further tested the influence of temporal convolution to improve context modeling, a technique recently reported to improve the accuracy of gesture recognition.
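The temporal convolution mentioned above can be illustrated with a minimal sketch: a 1D convolution along the time axis mixes each frame's features with those of its neighbours, widening the temporal context seen by the classifier. The function name, shapes, and kernel values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def temporal_conv(frames, kernel):
    """Convolve a (T, D) feature sequence along the time axis.

    `kernel` has shape (K,) with K odd; the sequence is zero-padded
    at both ends so the output keeps length T.
    """
    T, D = frames.shape
    K = len(kernel)
    pad = K // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)))
    out = np.zeros_like(frames, dtype=float)
    for t in range(T):
        # weighted sum of the K frames centred on frame t
        out[t] = kernel @ padded[t:t + K]
    return out

# toy example: 5 frames of 3-D features, 3-tap smoothing kernel
seq = np.arange(15, dtype=float).reshape(5, 3)
smoothed = temporal_conv(seq, np.array([0.25, 0.5, 0.25]))
```

In practice this operation would be a learned convolutional layer applied to per-frame features before the temporal model; the loop above only makes the frame-mixing explicit.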


Hybrid NN-HMM · RNN · Gesture recognition · End-to-end learning · Representation learning
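One way to read the "Hybrid NN-HMM" keyword: in the classic hybrid scheme of Bourlard and Morgan, a neural network estimates per-frame state posteriors, which are divided by the state priors to obtain scaled likelihoods that stand in for the HMM emission probabilities. The numbers below are toy values for illustration only, not from the paper.

```python
import numpy as np

# Network outputs p(state | frame) for two frames over three HMM states.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])

# State priors p(state), typically estimated from the training alignments.
priors = np.array([0.5, 0.3, 0.2])

# Bayes' rule up to a constant: p(frame | state) ∝ p(state | frame) / p(state).
# These scaled likelihoods replace the HMM emission scores during decoding.
scaled_likelihoods = posteriors / priors
```

In a real system these scores would feed a Viterbi decoder over the gesture-state graph; this fragment only shows the posterior-to-likelihood conversion that defines the hybrid.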



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. SAMOVAR, Télécom SudParis, CNRS, University of Paris-Saclay, Évry, France
