International Journal of Speech Technology, Volume 21, Issue 2, pp 293–307

Robust front-end for audio, visual and audio–visual speech classification

  • Lucas D. Terissi
  • Gonzalo D. Sad
  • Juan C. Gómez


This paper proposes a robust front-end for speech classification which can be employed interchangeably with acoustic, visual or audio–visual information. Wavelet multiresolution analysis is employed to represent the temporal input data associated with the speech information. These wavelet-based features are then used as inputs to a Random Forest classifier to perform the speech classification. The performance of the proposed speech classification scheme is evaluated in different scenarios, namely, considering only acoustic information, only visual information (lip-reading), and fused audio–visual information. These evaluations are carried out over three different audio–visual databases, two of them publicly available and the third one compiled by the authors of this paper. Experimental results show that the proposed system achieves good performance over the three databases and for the different kinds of input information considered. In addition, the proposed method performs better than other methods reported in the literature over the same two public databases. All the experiments were implemented using the same configuration parameters. These results also indicate that the proposed method performs satisfactorily without requiring the tuning of either the wavelet decomposition parameters or the Random Forest classifier parameters for each particular database and input modality.


Keywords: Audio–visual speech recognition · Wavelet decomposition · Random forests
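The front-end idea summarized in the abstract — a wavelet multiresolution decomposition that turns a temporal speech stream into approximation and detail coefficients, which then feed a classifier — can be sketched as follows. The averaging Haar filter, the three-level depth, and the toy signal are illustrative assumptions, not the authors' exact configuration.

```python
def haar_step(signal):
    """One level of averaging-Haar analysis: pairwise means and half-differences."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def wavelet_features(signal, levels):
    """Decompose recursively and concatenate the coarsest approximation
    with the detail coefficients, from coarse to fine."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    features = list(approx)
    for detail in reversed(details):
        features.extend(detail)
    return features

# Toy temporal stream (e.g. one acoustic-energy or lip-contour trajectory).
print(wavelet_features([4, 6, 10, 12, 8, 6, 5, 5], levels=3))
# → [7.0, 1.0, -3.0, 1.0, -1.0, -1.0, 1.0, 0.0]
```

In the scheme described by the paper, feature vectors of this kind would be passed to a Random Forest classifier (e.g. scikit-learn's `RandomForestClassifier`), with the same decomposition and forest settings reused across the acoustic, visual and fused modalities.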



Funding was provided by the Agencia Nacional de Promoción Científica y Tecnológica (Grant No. PICT 2014-2041), the Ministerio de Ciencia, Tecnología e Innovación Productiva (STIC-AmSud Project 15STIC-05) and the Universidad Nacional de Rosario (Project Ing395).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Laboratory for System Dynamics and Signal Processing, FCEIA, Universidad Nacional de Rosario; CIFASIS, CONICET; Rosario, Argentina
