Class Confusability Reduction in Audio-Visual Speech Recognition Using Random Forests

  • Gonzalo D. Sad
  • Lucas D. Terissi
  • Juan C. Gómez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10657)


This paper presents an audio-visual speech classification system based on Random Forests classifiers, aiming to reduce class-confusability problems, a common situation, especially in speech recognition tasks. A novel training procedure is proposed, introducing the concept of Complementary Random Forests (CRF) classifiers. Experimental results over three audio-visual databases show that the proposed system achieves good performance for the different types of input information considered, viz., audio-only information, video-only information, and fused audio-video information. These results also indicate that the proposed method performs satisfactorily over all three databases using the same configuration parameters.
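As a rough illustration of the classification setting described above (not the authors' CRF training procedure, whose details are not given here), the following sketch trains a standard Random Forest on early-fused audio-visual feature vectors using scikit-learn. The feature dimensions and the random data are hypothetical placeholders.

```python
# Illustrative sketch only: a plain Random Forest on fused audio-video
# features. The CRF procedure from the paper is NOT implemented here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical dimensions: e.g. audio features (MFCC-like) and visual
# features (lip-shape-like), one vector per utterance.
n_samples, n_audio, n_video, n_classes = 300, 26, 12, 10
audio_feats = rng.normal(size=(n_samples, n_audio))
video_feats = rng.normal(size=(n_samples, n_video))
labels = rng.integers(0, n_classes, size=n_samples)

# Early (feature-level) audio-video fusion: concatenate the modalities.
fused = np.hstack([audio_feats, video_feats])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(fused, labels)
pred = clf.predict(fused)  # one predicted class label per utterance
```

Audio-only or video-only classification, as evaluated in the paper, corresponds to fitting the same classifier on `audio_feats` or `video_feats` alone instead of `fused`.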


Keywords: Speech recognition, Audio-visual speech, Random forests



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Gonzalo D. Sad (1)
  • Lucas D. Terissi (1)
  • Juan C. Gómez (1)

  1. Laboratory for System Dynamics and Signal Processing, CIFASIS-CONICET, Universidad Nacional de Rosario, Rosario, Argentina
