Abstract
This paper implements and compares the performance of a number of techniques proposed for improving the accuracy of Automatic Speech Recognition (ASR) systems. Because ASR that relies on audio alone can be degraded by environmental noise, some applications may benefit from Audio-Visual Speech Recognition (AVSR), in which recognition uses both the audio signal and mouth movements obtained from a video recording of the speaker’s face region. Three model validation techniques, namely the holdout method, leave-one-out cross validation and bootstrap validation, are implemented both to validate the performance of an AVSR system and to compare the validation techniques themselves. A new speech corpus is used, the Loughborough University Audio-Visual (LUNA-V) dataset, which contains 10 speakers with five sets of samples uttered by each speaker. The database is divided into training and testing sets and processed in a manner suited to each validation technique under investigation. Performance is evaluated over a range of signal-to-noise ratios, using a variety of noise types taken from the NOISEX-92 dataset.
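As a rough illustration of how the three validation schemes partition a corpus, the following minimal Python sketch builds train/test splits over an index of 10 speakers with five sample sets each. The abstract does not say how the LUNA-V data were actually divided, so the 70/30 holdout fraction, the per-sample leave-one-out unit, the out-of-bag test set for the bootstrap, and all function names are assumptions made for this example only.

```python
import random


def holdout_split(samples, train_fraction=0.7, seed=0):
    """Shuffle once and cut into a single train/test pair.

    The 0.7 fraction is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def leave_one_out_splits(samples):
    """Yield one (train, test) pair per sample; the test set has one item."""
    samples = list(samples)
    for i in range(len(samples)):
        yield samples[:i] + samples[i + 1:], [samples[i]]


def bootstrap_split(samples, seed=0):
    """Draw n items with replacement for training.

    Items that were never drawn (the out-of-bag items) form the test set.
    """
    rng = random.Random(seed)
    samples = list(samples)
    n = len(samples)
    picked = [rng.randrange(n) for _ in range(n)]
    picked_set = set(picked)
    train = [samples[i] for i in picked]
    test = [samples[i] for i in range(n) if i not in picked_set]
    return train, test


# Hypothetical corpus index: (speaker, sample_set) pairs, 10 speakers x 5 sets.
corpus = [(speaker, sample_set)
          for speaker in range(10)
          for sample_set in range(5)]

holdout_train, holdout_test = holdout_split(corpus)
loo_splits = list(leave_one_out_splits(corpus))
boot_train, boot_test = bootstrap_split(corpus)
```

In a full experiment each split would drive one train-and-score cycle of the recogniser, with the scores averaged across splits for leave-one-out and across repeated resamples for the bootstrap.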
Acknowledgments
This work was supported by Universiti Malaysia Pahang and funded by the Ministry of Higher Education Malaysia under FRGS Grant RDU160108.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Seong, T.W., Ibrahim, M.Z., Arshad, N.W.B., Mulvaney, D.J. (2018). A Comparison of Model Validation Techniques for Audio-Visual Speech Recognition. In: Kim, K., Kim, H., Baek, N. (eds) IT Convergence and Security 2017. Lecture Notes in Electrical Engineering, vol 449. Springer, Singapore. https://doi.org/10.1007/978-981-10-6451-7_14
DOI: https://doi.org/10.1007/978-981-10-6451-7_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6450-0
Online ISBN: 978-981-10-6451-7
eBook Packages: Engineering, Engineering (R0)