Speech Recognition

Fink, Gernot A.

doi:10.1007/978-1-4471-6308-4_13

Part of the book series: Advances in Computer Vision and Pattern Recognition (ACVPR)

Abstract

Today, a number of commercial speech recognition systems are available on the market and, just recently, speech-enabled assistive services were introduced for smartphones. Nevertheless, the problem of automatic speech recognition should by no means be considered solved.

Building a competitive speech recognition system requires integrating a multitude of techniques. Probably the best documented research systems are those developed by Hermann Ney and colleagues at the former Philips Research Lab in Aachen, Germany, and later at RWTH Aachen University. This chapter emphasizes the work at RWTH Aachen University, although many aspects of the systems in this research tradition are identical to those developed at Philips. We then present the speech recognizer of BBN, which, in contrast to most systems developed by private companies, is documented in several scientific publications. The chapter concludes with a description of our own speech recognition system, developed on the basis of ESMERALDA.

Notes

1. Recent extensions to the system also introduced multi-pass decoding in order to exploit the capabilities of speaker adaptation techniques and of especially expensive modeling aspects (cf. e.g. [182]).
2. This corresponds to a quite simple variant of cepstral mean normalization as it is also used in the ESMERALDA system. Further explanations can, therefore, be found in Sect. 13.3.
3. The respective efficient decoding method was patented by BBN (cf. [212]).
4. Environment for Statistical Model Estimation and Recognition on Arbitrary Linear Data Arrays.
5. ESMERALDA is free software and is distributed under the terms of the GNU Lesser General Public License (LGPL) [86].
6. ESMERALDA also supports the use of full covariance matrices. However, the large amount of training samples required to estimate such models and the higher costs in decoding make the simpler diagonal models the more favorable alternative. The reduced capabilities for describing correlations between feature vector components are usually outweighed by the additional modeling precision achieved when using larger numbers of densities.
7. Most models receive 3 states, with some 2-state models representing especially short acoustic units.
8. Good results are achieved for a threshold of 75 samples. More compact and possibly also more robust models are obtained by requiring a few hundred feature vectors to support a state cluster.
9. However, when using higher-order n-gram models, substantial improvements in the accuracy of the results could be achieved in lexicon-free recognition experiments using language models based on phonetic units only [88].
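
Note 2 refers to a simple variant of cepstral mean normalization. As a minimal sketch, assuming the common per-utterance form in which the mean of each cepstral coefficient is subtracted from every frame (the exact variant used in the systems described here is not specified in this preview), the idea can be illustrated as follows. The function name and the list-of-lists frame layout are illustrative assumptions and are not taken from ESMERALDA.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean computed over all frames of one utterance.

    `frames` is a hypothetical layout: one list of cepstral coefficients per frame.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across the whole utterance.
    mean = [sum(frame[k] for frame in frames) / n for k in range(dim)]
    # Remove that mean from every frame.
    return [[frame[k] - mean[k] for k in range(dim)] for frame in frames]


if __name__ == "__main__":
    utterance = [[1.0, 2.0], [3.0, 6.0]]
    print(cepstral_mean_normalize(utterance))
```

A constant offset added to every frame, such as a stationary channel would introduce in the cepstral domain, cancels out under this normalization, which is the usual motivation for the technique.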

References

Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading (1986)
Bauckhage, C., Fink, G.A., Fritsch, J., Kummert, F., Lömker, F., Sagerer, G., Wachsmuth, S.: An integrated system for cooperative man-machine interaction. In: IEEE International Symposium on Computational Intelligence in Robotics and Automation, Banff, Canada, pp. 328–333 (2001)
Beyerlein, P., Aubert, X.L., Haeb-Umbach, R., Harris, M., Klakow, D., Wendemuth, A., Molau, S., Pitz, M., Sixtus, A.: The Philips/RWTH system for transcription of broadcast news. In: Proc. European Conf. on Speech Communication and Technology, Budapest, vol. 2, pp. 647–650 (1999)
Billa, J., Colthurst, T., El-Jaroudi, A., Iyer, R., Ma, K., Matsoukas, S., Quillen, C., Richardson, F., Siu, M., Zvaliagkos, G., Gish, H.: Recent experiments in large vocabulary conversational speech recognition. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ (1999)
Brandt-Pook, H., Fink, G.A., Wachsmuth, S., Sagerer, G.: Integrated recognition and interpretation of speech for a construction task domain. In: Bullinger, H.-J., Ziegler, J. (eds.) Proceedings 8th International Conference on Human-Computer Interaction, München, vol. 1, pp. 550–554 (1999)
Brindöpke, C., Fink, G.A., Kummert, F.: A comparative study of HMM-based approaches for the automatic recognition of perceptively relevant aspects of spontaneous German speech melody. In: Proc. European Conf. on Speech Communication and Technology, Budapest, vol. 2, pp. 699–702 (1999)
Brindöpke, C., Fink, G.A., Kummert, F., Sagerer, G.: An HMM-based recognition system for perceptive relevant pitch movements of spontaneous German speech. In: Proc. Int. Conf. on Spoken Language Processing, Sydney, vol. 7, pp. 2895–2898 (1998)
Colthurst, T., Kimball, O., Richardson, F., Shu, H., Wooters, C., Iyer, R., Gish, H.: The 2000 BBN Byblos LVCSR system. In: 2000 Speech Transcription Workshop, Maryland (2000)
Fink, G.A.: Developing HMM-based recognizers with ESMERALDA. In: Matoušek, V., Mautner, P., Ocelíková, J., Sojka, P. (eds.) Text, Speech and Dialogue. Lecture Notes in Artificial Intelligence, vol. 1692, pp. 229–234. Springer, Berlin (1999)
Fink, G.A., Plötz, T.: Integrating speaker identification and learning with adaptive speech recognition. In: 2004: A Speaker Odyssey – The Speaker and Language Recognition Workshop, Toledo, pp. 185–192 (2004)
Fink, G.A., Plötz, T.: On appearance-based feature extraction methods for writer-independent handwritten text recognition. In: Proc. Int. Conf. on Document Analysis and Recognition, Seoul, Korea, vol. 2, pp. 1070–1074 (2005)
Fink, G.A., Plötz, T.: Unsupervised estimation of writing style models for improved unconstrained off-line handwriting recognition. In: Proc. Int. Workshop on Frontiers in Handwriting Recognition, La Baule, France, pp. 429–434 (2006)
Fink, G.A., Plötz, T.: ESMERALDA: a development environment for HMM-based pattern recognition systems. In: 7th Open German/Russian Workshop on Pattern Recognition and Image Understanding, Ettlingen, Germany (2007)
Fink, G.A., Plötz, T.: ESMERALDA: a development environment for HMM-based pattern recognition systems (2007). http://sourceforge.net/projects/esmeralda
Fink, G.A., Plötz, T.: On the use of context-dependent modelling units for HMM-based offline handwriting recognition. In: Proc. Int. Conf. on Document Analysis and Recognition, Curitiba, Brazil, vol. 2, pp. 729–733 (2007)
Fink, G.A., Sagerer, G.: Zeitsynchrone Suche mit n-Gramm-Modellen höherer Ordnung (Time-synchronous search with higher-order n-gram models). In: Konvens 2000/Sprachkommunikation. ITG-Fachbericht, vol. 161, pp. 145–150. VDE Verlag, Berlin (2000) (in German)
Fink, G.A., Schillo, C., Kummert, F., Sagerer, G.: Incremental speech recognition for multimodal interfaces. In: Proc. Annual Conference of the IEEE Industrial Electronics Society, Aachen, vol. 4, pp. 2012–2017 (1998)
Fink, G.A., Vajda, S., Bhattacharya, U., Parui, S.K., Chaudhuri, B.B.: Online Bangla word recognition using sub-stroke level features and hidden Markov models. In: Proc. Int. Conf. on Frontiers in Handwriting Recognition, Kolkata, India, pp. 393–398 (2010)
Fink, G.A., Wienecke, M.: Experiments in video-based whiteboard reading. In: First Int. Workshop on Camera-Based Document Analysis and Recognition, Seoul, Korea, pp. 95–100 (2005)
Fink, G.A., Wienecke, M., Sagerer, G.: Video-based on-line handwriting recognition. In: Proc. Int. Conf. on Document Analysis and Recognition, pp. 226–230 (2001)
Huang, X., Acero, A., Hon, H.-W.: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, Englewood Cliffs (2001)
Kanthak, S., Molau, S., Sixtus, A., Schlüter, R., Ney, H.: The RWTH large vocabulary speech recognition system for spontaneous speech. In: Konvens 2000/Sprachkommunikation. ITG-Fachbericht, vol. 161, pp. 249–256. VDE Verlag, Berlin (2000)
Kirchhoff, K., Fink, G.A., Sagerer, G.: Conversational speech recognition using acoustic and articulatory input. In: Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Istanbul (2000)
Kirchhoff, K., Fink, G.A., Sagerer, G.: Combining acoustic and articulatory information for robust speech recognition. Speech Commun. 37(3–4), 303–319 (2002)
Kummert, F., Fink, G.A., Sagerer, G.: A hybrid speech recognizer combining HMMs and polynomial classification. In: Proc. Int. Conf. on Spoken Language Processing, Beijing, China, vol. 3, pp. 814–817 (2000)
Lööf, J., Bisani, M., Gollan, C., Heigold, G., Hoffmeister, B., Plahl, C., Schlüter, R., Ney, H.: The 2006 RWTH parliamentary speeches transcription system. In: TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, Spain, pp. 133–138 (2006)
Matsoukas, S., Colthurst, T., Kimball, O., Solomonoff, A., Richardson, F., Quillen, C., Gish, H., Dognin, P.: The 2001 BYBLOS English large vocabulary conversational speech recognition system. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, pp. 721–724 (2002)
Nguyen, L., Matsoukas, S., Billa, J., Schwartz, R., Makhoul, J.: The 1999 BBN BYBLOS 10xRT broadcast news transcription system. In: 2000 Speech Transcription Workshop, Maryland (2000)
Nguyen, L., Schwartz, R.: The BBN single-phonetic-tree fast-match algorithm. In: Proc. Int. Conf. on Spoken Language Processing, Sydney (1998)
Pfeiffer, M.: Architektur eines multimodalen Forschungssystems zur iterativen inhaltsbasierten Bildsuche (Architecture of a multimodal research system for iterative content-based image retrieval). PhD thesis, Bielefeld University, Faculty of Technology, Bielefeld, Germany (2006) (in German)
Plötz, T., Fink, G.A.: Robust time-synchronous environmental adaptation for continuous speech recognition systems. In: Proc. Int. Conf. on Spoken Language Processing, Denver, vol. 2, pp. 1409–1412 (2002)
Plötz, T., Fink, G.A.: Feature extraction for improved profile HMM based biological sequence analysis. In: Proc. Int. Conf. on Pattern Recognition, pp. 315–318 (2004)
Plötz, T., Fink, G.A.: A new approach for HMM based protein sequence modeling and its application to remote homology classification. In: Proc. Workshop Statistical Signal Processing, Bordeaux, France (2005)
Plötz, T., Fink, G.A.: Robust remote homology detection by feature based Profile Hidden Markov Models. Stat. Appl. Genet. Mol. Biol. 4(1) (2005)
Plötz, T., Fink, G.A.: Pattern recognition methods for advanced stochastic protein sequence analysis using HMMs. Pattern Recognit. 39, 2267–2280 (2006). Special Issue on Bioinformatics
Plötz, T., Fink, G.A.: An efficient method for making un-supervised adaptation of HMM-based speech recognition systems robust against out-of-domain data. In: Proc. 4th Int. Workshop on Natural Language Processing and Cognitive Science, Funchal, Portugal, June 2007
Plötz, T., Fink, G.A., Husemann, P., Kanies, S., Lienemann, K., Marschall, T., Martin, M., Schillingmann, L., Steinrücken, M., Sudek, H.: Automatic detection of song changes in music mixes using stochastic models. In: Proc. Int. Conf. on Pattern Recognition, pp. 665–668 (2006)
Plötz, T., Thurau, C., Fink, G.A.: Camera-based whiteboard reading: new approaches to a challenging task. In: Proc. Int. Conf. on Frontiers in Handwriting Recognition, Montreal, Canada, pp. 385–390 (2008)
Richarz, J., Fink, G.A.: Visual recognition of 3D emblematic gestures in an HMM framework. J. Ambient Intell. Smart Environ. 3(3), 193–211 (2011). Thematic Issue on Computer Vision for Ambient Intelligence
Rosenberg, A.E., Lee, C.-H., Soong, F.K.: Cepstral channel normalization techniques for HMM-based speaker verification. In: Proc. Int. Conf. on Spoken Language Processing, Yokohama, Japan, vol. 4, pp. 1835–1838 (1994)
Rothacker, L., Rusiñol, M., Fink, G.A.: Bag-of-features HMMs for segmentation-free word spotting in handwritten documents. In: Proc. Int. Conf. on Document Analysis and Recognition, Washington DC, USA (2013)
Rothacker, L., Vajda, S., Fink, G.A.: Bag-of-features representations for offline handwriting recognition applied to Arabic script. In: Proc. Int. Conf. on Frontiers in Handwriting Recognition, Bari, Italy (2012)
Rybach, D., Gollan, C., Heigold, G., Hoffmeister, B., Lööf, J., Schlüter, R., Ney, H.: The RWTH Aachen University open source speech recognition system. In: Interspeech, Brighton, UK, pp. 2111–2114 (2009)
Schillo, C., Fink, G.A., Kummert, F.: Grapheme based speech recognition for large vocabularies. In: Proc. Int. Conf. on Spoken Language Processing, Beijing, China, vol. 4, pp. 584–587 (2000)
Sixtus, A., Molau, S., Kanthak, S., Schlüter, R., Ney, H.: Recent improvements of the RWTH large vocabulary speech recognition system on spontaneous speech. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Istanbul, pp. 1671–1674 (2000)
Spiess, T., Wrede, B., Kummert, F., Fink, G.A.: Data-driven pronunciation modeling for ASR using acoustic subword units. In: Proc. European Conf. on Speech Communication and Technology, Geneva, pp. 2549–2552 (2003)
Wachsmuth, S., Fink, G.A., Kummert, F., Sagerer, G.: Using speech in visual object recognition. In: Sommer, G., Krüger, N., Perwass, C. (eds.) Mustererkennung 2000, 22. DAGM-Symposium Kiel. Informatik Aktuell, pp. 428–435. Springer, Berlin (2000)
Wachsmuth, S., Fink, G.A., Sagerer, G.: Integration of parsing and incremental speech recognition. In: Proceedings of the European Signal Processing Conference, Rhodes, Sep. 1998, vol. 1, pp. 371–375 (1998)
Welling, L., Kanthak, S., Ney, H.: Improved methods for vocal tract normalisation. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, pp. 761–764 (1999)
Wendt, S., Fink, G.A., Kummert, F.: Forward masking for increased robustness in automatic speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Aalborg, vol. 1, pp. 615–618 (2001)
Wendt, S., Fink, G.A., Kummert, F.: Dynamic search-space pruning for time-constrained speech recognition. In: Proc. Int. Conf. on Spoken Language Processing, Denver, vol. 1, pp. 377–380 (2002)
Westphal, M.: The use of cepstral means in conversational speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Rhodes, Greece, vol. 3, pp. 1143–1146 (1997)
Wienecke, M., Fink, G.A., Sagerer, G.: A handwriting recognition system based on visual input. In: Schiele, B., Sagerer, G. (eds.) Computer Vision Systems. Lecture Notes in Computer Science, pp. 63–72. Springer, Berlin (2001)
Wienecke, M., Fink, G.A., Sagerer, G.: Experiments in unconstrained offline handwritten text recognition. In: Proc. 8th Int. Workshop on Frontiers in Handwriting Recognition, Niagara on the Lake, Canada, August 2002
Wienecke, M., Fink, G.A., Sagerer, G.: Towards automatic video-based whiteboard reading. In: Proc. Int. Conf. on Document Analysis and Recognition, Edinburgh, vol. 1, pp. 87–91 (2003)
Wienecke, M., Fink, G.A., Sagerer, G.: Toward automatic video-based whiteboard reading. Int. J. Doc. Anal. Recognit. 7(2–3), 188–200 (2005)
Wrede, B., Fink, G.A., Sagerer, G.: Influence of duration on static and dynamic properties of German vowels in spontaneous speech. In: Proc. Int. Conf. on Spoken Language Processing, Beijing, China, vol. 1, pp. 82–85 (2000)
Wrede, B., Fink, G.A., Sagerer, G.: An investigation of modelling aspects for rate-dependent speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Aalborg, vol. 4, pp. 2527–2530 (2001)
Zwicker, E., Fastl, H.: Psychoacoustics: Facts and Models, 2nd edn. Springer Series in Information Sciences, vol. 22. Springer, Berlin (1999)

Author information

Authors and Affiliations

Department of Computer Science, TU Dortmund University, Dortmund, Germany
Gernot A. Fink

About this chapter

Cite this chapter

Fink, G.A. (2014). Speech Recognition. In: Markov Models for Pattern Recognition. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-6308-4_13

DOI: https://doi.org/10.1007/978-1-4471-6308-4_13
Publisher Name: Springer, London
Print ISBN: 978-1-4471-6307-7
Online ISBN: 978-1-4471-6308-4
eBook Packages: Computer Science, Computer Science (R0)