Robust front-end for audio, visual and audio–visual speech classification
This paper proposes a robust front-end for speech classification that can be employed interchangeably with acoustic, visual, or audio–visual information. Wavelet multiresolution analysis is used to represent the temporal input data associated with the speech information, and the resulting wavelet-based features are fed to a Random Forest classifier that performs the speech classification. The performance of the proposed scheme is evaluated in three scenarios: acoustic information only, visual information only (lip reading), and fused audio–visual information. These evaluations are carried out over three audio–visual databases, two of them public and the third compiled by the authors of this paper. Experimental results show that the proposed system performs well over all three databases and for each kind of input information considered, and that it outperforms other methods reported in the literature on the same two public databases. All experiments used the same configuration parameters; this indicates that the method performs satisfactorily without tuning either the wavelet decomposition parameters or the Random Forest classifier parameters to each particular database and input modality.
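To make the pipeline concrete, the following is a minimal sketch of the wavelet-features-plus-Random-Forest front-end described above, assuming PyWavelets and scikit-learn. The 'db4' mother wavelet, the decomposition depth of four, and the per-subband log-energy features are illustrative assumptions, not the configuration reported in the paper.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier


def wavelet_features(signal, wavelet="db4", level=4):
    """Wavelet multiresolution decomposition of a 1-D temporal signal
    (e.g., audio samples or a lip-parameter trajectory), summarized as
    the log-energy of each subband. Wavelet and level are assumptions."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])


# Toy data standing in for a database: 100 signals, 1024 samples, 2 classes.
rng = np.random.default_rng(0)
X = np.vstack([wavelet_features(rng.standard_normal(1024)) for _ in range(100)])
y = rng.integers(0, 2, size=100)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```

Under this sketch, audio–visual fusion would amount to concatenating the feature vectors extracted from the acoustic and visual streams before classification; the paper's actual fusion scheme may differ.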
Keywords: Audio–visual speech recognition · Wavelet decomposition · Random forests
Funding was provided by the Agencia Nacional de Promoción Científica y Tecnológica (Grant PICT 2014-2041), the Ministerio de Ciencia, Tecnología e Innovación Productiva (STIC-AmSud Project 15STIC-05), and the Universidad Nacional de Rosario (Project Ing395).