A Comparison of Model Validation Techniques for Audio-Visual Speech Recognition

Seong, Thum Wei; Ibrahim, Mohd Zamri; Arshad, Nurul Wahidah Binti; Mulvaney, D. J.

doi:10.1007/978-981-10-6451-7_14

Conference paper
First Online: 31 August 2017

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 449)

Abstract

This paper implements and compares the performance of a number of techniques proposed for improving the accuracy of Automatic Speech Recognition (ASR) systems. Because audio-only ASR can be degraded by environmental noise, some applications may benefit from Audio-Visual Speech Recognition (AVSR), in which recognition uses both the audio signal and mouth movements obtained from a video recording of the speaker’s face region. Three model validation techniques, namely the holdout method, leave-one-out cross-validation and bootstrap validation, are implemented to validate the performance of an AVSR system and to compare the validation techniques themselves. A new speech corpus is used, the Loughborough University Audio-Visual (LUNA-V) dataset, which contains 10 speakers with five sets of samples uttered by each speaker. The database is divided into training and testing sets in a manner suited to each validation technique under investigation. Performance is evaluated over a range of signal-to-noise ratios using a variety of noise types obtained from the NOISEX-92 dataset.
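
As a rough illustration of how the three validation schemes partition a corpus of this size, the following minimal Python sketch builds train/test splits over 50 placeholder samples (10 speakers, five sets each). The abstract does not state how the authors formed their splits, so everything here is an assumption: the function names, the 30% holdout fraction, the per-sample leave-one-out unit, and the out-of-bag test set used for the bootstrap are illustrative choices, not the paper's procedure.

```python
import random

def holdout_split(samples, test_fraction=0.3, seed=0):
    """Shuffle once and reserve a fixed fraction for testing (fraction is assumed)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def leave_one_out_splits(samples):
    """Yield one split per sample: train on all the others, test on that one."""
    for i in range(len(samples)):
        yield samples[:i] + samples[i + 1:], [samples[i]]

def bootstrap_split(samples, seed=0):
    """Draw a training set with replacement; test on samples never drawn (out-of-bag)."""
    rng = random.Random(seed)
    train = [rng.choice(samples) for _ in samples]
    drawn = set(train)
    test = [s for s in samples if s not in drawn]
    return train, test

# Placeholder corpus: (speaker, set) pairs standing in for LUNA-V recordings.
samples = [(speaker, set_id) for speaker in range(10) for set_id in range(5)]

holdout_train, holdout_test = holdout_split(samples)
loo_splits = list(leave_one_out_splits(samples))
boot_train, boot_test = bootstrap_split(samples)
print(len(holdout_train), len(holdout_test), len(loo_splits), len(boot_train), len(boot_test))
```

In practice each split would feed a full train-and-recognise cycle, so leave-one-out costs as many training runs as there are held-out units, while the holdout method needs only one and the bootstrap as many as the number of resamples chosen.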

Acknowledgments

This work was supported by Universiti Malaysia Pahang and funded by the Ministry of Higher Education Malaysia under FRGS Grant RDU160108.

Author information

Authors and Affiliations

Faculty of Electrical and Electronic Engineering, Universiti Malaysia Pahang, 26600, Pekan, Pahang, Malaysia
Thum Wei Seong, Mohd Zamri Ibrahim & Nurul Wahidah Binti Arshad
School of Electronic, Electrical and Systems Engineering, Loughborough University, Loughborough, LE11 3TU, UK
D. J. Mulvaney

Corresponding author

Correspondence to Mohd Zamri Ibrahim.

Editor information

Editors and Affiliations

iCatse, B-3001, Intellige 2, Kyonggi University, Seongnam-si, Kyonggi-do, Korea (Republic of)
Kuinam J. Kim
Computer Science, Namseoul University, Cheonan, Ch'ungch'ong-namdo, Korea (Republic of)
Hyuncheol Kim
School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea (Republic of)
Nakhoon Baek

Cite this paper

Seong, T.W., Ibrahim, M.Z., Arshad, N.W.B., Mulvaney, D.J. (2018). A Comparison of Model Validation Techniques for Audio-Visual Speech Recognition. In: Kim, K., Kim, H., Baek, N. (eds) IT Convergence and Security 2017. Lecture Notes in Electrical Engineering, vol 449. Springer, Singapore. https://doi.org/10.1007/978-981-10-6451-7_14

DOI: https://doi.org/10.1007/978-981-10-6451-7_14
Published: 31 August 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6450-0
Online ISBN: 978-981-10-6451-7
eBook Packages: Engineering, Engineering (R0)
