Intelligent Multi-modal Interfaces for Mobile Applications in Hostile Environment (IM-HOST)

Stricker, Claude; Wagen, Jean-Frédéric; Aradilla, Guillermo; Bourlard, Hervé; Hermansky, Hynek; Pinto, Joel; Rey, Paul-Henri; Théraulaz, Jérôme

doi:10.1007/978-3-642-00437-7_4

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5440)

Abstract

Multi-modal interfaces for mobile applications include tiny screens, keyboards, touch screens, earphones, microphones and software components for voice-based man-machine interaction. The software enabling voice recognition and the microphone are of primary importance in a noisy environment. The current performance of voice applications is reasonably good in quiet environments. However, in many practical situations the surrounding noise severely degrades the quality of the speech signal, and the recognition rate decreases significantly as a consequence. Noise management is therefore a major focus in developing voice-enabled technologies. This project addresses the problem of voice recognition with the goal of reaching a high success rate (ideally above 99%) in an outdoor environment that is noisy and hostile: the user stands on the open deck of a motorboat and uses his/her voice to command applications running on a laptop through a wireless microphone. In addition to the problem of noise, other constraints strongly limit the hardware options. Furthermore, the user must perform several tasks simultaneously. The success of the solution therefore relies on the efficiency and effectiveness of the voice recognition algorithm and on the choice of microphone. In addition, the training of the recognizer should be kept to a minimum, and recognition should take no longer than 3 seconds. For these two reasons, only a limited set of voice commands has been tested.

A first demonstrator based on digit keyword spotting trained over phone speech performed poorly in very noisy conditions. A second demonstrator combining neural network and template matching techniques led to nearly acceptable results when the user recorded the keywords. Since the recognition rate was estimated at approximately 90%, no additional field test was undertaken. This R&D project shows that state-of-the-art voice recognition needs further investigation before spoken keywords can be recognized reliably in noisy environments. In addition to ongoing improvements, unconventional research approaches worth testing include deriving keywords adapted to specialized algorithms and having the user learn these keywords.
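
The template matching idea mentioned for the second demonstrator can be illustrated with a generic sketch. The abstract does not specify the algorithm, so everything below is an assumption: plain dynamic time warping (DTW) with a Euclidean local distance over per-frame feature vectors, nearest-template classification, and an optional rejection threshold. The names `dtw_distance`, `recognize` and `reject_above` are hypothetical and are not taken from the chapter.

```python
import math


def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of feature vectors.

    Assumption: Euclidean local distance; the chapter may use another measure.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame is repeated
                cost[i][j - 1],      # b advances, a frame is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(utterance, templates, reject_above=None):
    """Return (label, score) of the closest recorded template.

    `templates` maps a keyword label to a list of recorded examples.
    The score is the DTW cost divided by the summed lengths, so that
    long and short keywords are roughly comparable. If `reject_above`
    is set and the best score exceeds it, the label is None.
    """
    best_label, best_score = None, float("inf")
    for label, examples in templates.items():
        for example in examples:
            score = dtw_distance(utterance, example) / (len(utterance) + len(example))
            if score < best_score:
                best_label, best_score = label, score
    if reject_above is not None and best_score > reject_above:
        return None, best_score
    return best_label, best_score


# Toy one-dimensional "features"; a real system would use per-frame
# acoustic features or neural network outputs instead.
templates = {
    "start": [[(0.0,), (1.0,), (2.0,)]],
    "stop": [[(2.0,), (1.0,), (0.0,)]],
}
utterance = [(0.0,), (0.1,), (1.0,), (2.1,)]
label, score = recognize(utterance, templates)
```

In this toy setup the rising utterance is matched to the rising `start` template even though the two sequences differ in length, which is the property that makes DTW a common choice when the user records the keywords. A rejection threshold such as `reject_above` is one simple way to discard non-keyword input, at the price of having to tune it for the noise conditions.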

Author information

Authors and Affiliations

AISTS, CH-1015, Lausanne, Switzerland
Claude Stricker & Paul-Henri Rey
EIA-FR (HES-SO Fribourg), CH-1705, Fribourg, Switzerland
Jean-Frédéric Wagen & Jérôme Théraulaz
HES-SO Valais, CH-3960, Sierre, Switzerland
Claude Stricker & Paul-Henri Rey
Idiap Research Institute, CH-1920, Martigny, Switzerland
Guillermo Aradilla, Hervé Bourlard, Hynek Hermansky & Joel Pinto

Editor information

Editors and Affiliations

Department of Informatics, University of Fribourg, Bd. de Pérolles 90, CH-1700, Fribourg, Switzerland
Denis Lalanne & Jürg Kohlas

About this chapter

Cite this chapter

Stricker, C. et al. (2009). Intelligent Multi-modal Interfaces for Mobile Applications in Hostile Environment (IM-HOST). In: Lalanne, D., Kohlas, J. (eds) Human Machine Interaction. Lecture Notes in Computer Science, vol 5440. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00437-7_4

DOI: https://doi.org/10.1007/978-3-642-00437-7_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00436-0
Online ISBN: 978-3-642-00437-7
eBook Packages: Computer Science, Computer Science (R0)
