Abstract
Today, a number of commercial speech recognition systems are available on the market, and speech-enabled assistive services have recently been introduced for smartphones. Nevertheless, the problem of automatic speech recognition should by no means be considered solved.
Building a competitive speech recognition system requires the integration of a multitude of techniques. Probably the best-documented research systems are those developed by Hermann Ney and colleagues at the former Philips Research Lab in Aachen, Germany, and later at RWTH Aachen University. This chapter puts the emphasis on the work at RWTH Aachen University; many aspects of the systems within this research tradition are, however, identical to those developed by Philips. We then present the speech recognizer of BBN, which, in contrast to most systems developed by private companies, is documented in several scientific publications. The chapter concludes with a description of our own speech recognition system, developed on the basis of ESMERALDA.
Notes
- 1.
Recent extensions to the system also introduced multi-pass decoding in order to be able to exploit the capabilities of speaker adaptation techniques and especially expensive modeling aspects (cf. e.g. [182]).
- 2.
This corresponds to a quite simple variant of cepstral mean normalization as it is also used in the ESMERALDA system. Further explanations can, therefore, be found in Sect. 13.3.
- 3.
The respective efficient decoding method was patented by BBN (cf. [212]).
- 4.
Environment for Statistical Model Estimation and Recognition on Arbitrary Linear Data Arrays.
- 5.
ESMERALDA is free software and is distributed under the terms of the GNU Lesser General Public License (LGPL) [86].
- 6.
ESMERALDA also supports the use of full covariance matrices. However, the large number of training samples required to estimate such models and the higher decoding costs make the simpler diagonal models the more favorable alternative. Their reduced ability to describe correlations between feature vector components is usually outweighed by the additional modeling precision achieved when using larger numbers of densities.
- 7.
Most models receive 3 states, with some 2-state models representing especially short acoustic units.
- 8.
Good results are achieved for a threshold of 75 samples. More compact and possibly also more robust models are obtained by requiring a few hundred feature vectors to support a state cluster.
- 9.
However, when using higher-order n-gram models, substantial improvements in the accuracy of the results could be achieved in lexicon-free recognition experiments using language models based on phonetic units only [88].
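Note 2 refers to cepstral mean normalization. As a rough illustration only, the following Python sketch shows the textbook per-utterance form: the mean of each cepstral coefficient over all frames is subtracted from every frame. The function name and the toy data are assumptions made for this example; the notes do not specify the exact variant used in the systems described, so this should not be read as their implementation.

```python
from typing import List, Sequence


def cepstral_mean_normalize(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean over the utterance from every frame.

    Hypothetical helper for illustration: `frames` is a list of cepstral
    feature vectors (one per time step), all of the same dimension.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimension")
    n = len(frames)
    # Mean of each coefficient across the whole utterance.
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    # Remove that mean from each frame, coefficient by coefficient.
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy utterance with three 2-dimensional frames (made-up values).
utterance = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
normalized = cepstral_mean_normalize(utterance)
```

After normalization each coefficient has zero mean over the utterance, which removes a constant offset such as the one a fixed transmission channel adds in the cepstral domain.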
References
Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading (1986)
Bauckhage, C., Fink, G.A., Fritsch, J., Kummert, F., Lömker, F., Sagerer, G., Wachsmuth, S.: An integrated system for cooperative man-machine interaction. In: IEEE International Symposium on Computational Intelligence in Robotics and Automation, Banff, Canada, pp. 328–333 (2001)
Beyerlein, P., Aubert, X.L., Haeb-Umbach, R., Harris, M., Klakow, D., Wendemuth, A., Molau, S., Pitz, M., Sixtus, A.: The Philips/RWTH system for transcription of broadcast news. In: Proc. European Conf. on Speech Communication and Technology, Budapest, vol. 2, pp. 647–650 (1999)
Billa, J., Colhurst, T., El-Jaroudi, A., Iyer, R., Ma, K., Matsoukas, S., Quillen, C., Richardson, F., Siu, M., Zvaliagkos, G., Gish, H.: Recent experiments in large vocabulary conversational speech recognition. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ (1999)
Brandt-Pook, H., Fink, G.A., Wachsmuth, S., Sagerer, G.: Integrated recognition and interpretation of speech for a construction task domain. In: Bullinger, H.-J., Ziegler, J. (eds.) Proceedings 8th International Conference on Human-Computer Interaction, München, vol. 1, pp. 550–554 (1999)
Brindöpke, C., Fink, G.A., Kummert, F.: A comparative study of HMM-based approaches for the automatic recognition of perceptively relevant aspects of spontaneous German speech melody. In: Proc. European Conf. on Speech Communication and Technology, Budapest, vol. 2, pp. 699–702 (1999)
Brindöpke, C., Fink, G.A., Kummert, F., Sagerer, G.: An HMM-based recognition system for perceptive relevant pitch movements of spontaneous German speech. In: Proc. Int. Conf. on Spoken Language Processing, Sydney, vol. 7, pp. 2895–2898 (1998)
Colthurst, T., Kimball, O., Richardson, F., Shu, H., Wooters, C., Iyer, R., Gish, H.: The 2000 BBN Byblos LVCSR system. In: 2000 Speech Transcription Workshop, Maryland (2000)
Fink, G.A.: Developing HMM-based recognizers with ESMERALDA. In: Matoušek, V., Mautner, P., Ocelíková, J., Sojka, P. (eds.) Text, Speech and Dialogue. Lecture Notes in Artificial Intelligence, vol. 1692, pp. 229–234. Springer, Berlin (1999)
Fink, G.A., Plötz, T.: Integrating speaker identification and learning with adaptive speech recognition. In: 2004: A Speaker Odyssey – The Speaker and Language Recognition Workshop, Toledo, pp. 185–192 (2004)
Fink, G.A., Plötz, T.: On appearance-based feature extraction methods for writer-independent handwritten text recognition. In: Proc. Int. Conf. on Document Analysis and Recognition, Seoul, Korea, vol. 2, pp. 1070–1074 (2005)
Fink, G.A., Plötz, T.: Unsupervised estimation of writing style models for improved unconstrained off-line handwriting recognition. In: Proc. Int. Workshop on Frontiers in Handwriting Recognition, La Baule, France, pp. 429–434 (2006)
Fink, G.A., Plötz, T.: ESMERALDA: a development environment for HMM-based pattern recognition systems. In: 7th Open German/Russian Workshop on Pattern Recognition and Image Understanding, Ettlingen, Germany (2007)
Fink, G.A., Plötz, T.: ESMERALDA: a development environment for HMM-based pattern recognition systems (2007). http://sourceforge.net/projects/esmeralda
Fink, G.A., Plötz, T.: On the use of context-dependent modelling units for HMM-based offline handwriting recognition. In: Proc. Int. Conf. on Document Analysis and Recognition, Curitiba, Brazil, vol. 2, pp. 729–733 (2007)
Fink, G.A., Sagerer, G.: Zeitsynchrone Suche mit n-Gramm-Modellen höherer Ordnung (Time-synchronous search with higher-order n-gram models). In: Konvens 2000/Sprachkommunikation. ITG-Fachbericht, vol. 161, pp. 145–150. VDE Verlag, Berlin (2000) (in German)
Fink, G.A., Schillo, C., Kummert, F., Sagerer, G.: Incremental speech recognition for multimodal interfaces. In: Proc. Annual Conference of the IEEE Industrial Electronics Society, Aachen, vol. 4, pp. 2012–2017 (1998)
Fink, G.A., Vajda, S., Bhattacharya, U., Parui, S.K., Chaudhuri, B.B.: Online Bangla word recognition using sub-stroke level features and hidden Markov models. In: Proc. Int. Conf. on Frontiers in Handwriting Recognition, Kolkata, India, pp. 393–398 (2010)
Fink, G.A., Wienecke, M.: Experiments in video-based whiteboard reading. In: First Int. Workshop on Camera-Based Document Analysis and Recognition, Seoul, Korea, pp. 95–100 (2005)
Fink, G.A., Wienecke, M., Sagerer, G.: Video-based on-line handwriting recognition. In: Proc. Int. Conf. on Document Analysis and Recognition, pp. 226–230 (2001)
Huang, X., Acero, A., Hon, H.-W.: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, Englewood Cliffs (2001)
Kanthak, S., Molau, S., Sixtus, A., Schlüter, R., Ney, H.: The RWTH large vocabulary speech recognition system for spontaneous speech. In: Konvens 2000/Sprachkommunikation. ITG-Fachbericht, vol. 161, pp. 249–256. VDE Verlag, Berlin (2000)
Kirchhoff, K., Fink, G.A., Sagerer, G.: Conversational speech recognition using acoustic and articulatory input. In: Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Istanbul (2000)
Kirchhoff, K., Fink, G.A., Sagerer, G.: Combining acoustic and articulatory information for robust speech recognition. Speech Commun. 37(3–4), 303–319 (2002)
Kummert, F., Fink, G.A., Sagerer, G.: A hybrid speech recognizer combining HMMs and polynomial classification. In: Proc. Int. Conf. on Spoken Language Processing, Beijing, China, vol. 3, pp. 814–817 (2000)
Lööf, J., Bisani, M., Gollan, C., Heigold, G., Hoffmeister, B., Plahl, C., Schlüter, R., Ney, H.: The 2006 RWTH parliamentary speeches transcription system. In: TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, Spain, pp. 133–138 (2006)
Matsoukas, S., Colthurst, T., Kimball, O., Solomonoff, A., Richardson, F., Quillen, C., Gish, H., Dognin, P.: The 2001 BYBLOS English large vocabulary conversational speech recognition system. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, pp. 721–724 (2002)
Nguyen, L., Matsoukas, S., Billa, J., Schwartz, R., Makhoul, J.: The 1999 BBN BYBLOS 10xRT broadcast news transcription system. In: 2000 Speech Transcription Workshop, Maryland (2000)
Nguyen, L., Schwartz, R.: The BBN single-phonetic-tree fast-match algorithm. In: Proc. Int. Conf. on Spoken Language Processing, Sydney (1998)
Pfeiffer, M.: Architektur eines multimodalen Forschungssystems zur iterativen inhaltsbasierten Bildsuche (Architecture of a multimodal research system for iterative content-based image retrieval). PhD thesis, Bielefeld University, Faculty of Technology, Bielefeld, Germany (2006) (in German)
Plötz, T., Fink, G.A.: Robust time-synchronous environmental adaptation for continuous speech recognition systems. In: Proc. Int. Conf. on Spoken Language Processing, Denver, vol. 2, pp. 1409–1412 (2002)
Plötz, T., Fink, G.A.: Feature extraction for improved profile HMM based biological sequence analysis. In: Proc. Int. Conf. on Pattern Recognition, pp. 315–318 (2004)
Plötz, T., Fink, G.A.: A new approach for HMM based protein sequence modeling and its application to remote homology classification. In: Proc. Workshop Statistical Signal Processing, Bordeaux, France (2005)
Plötz, T., Fink, G.A.: Robust remote homology detection by feature based Profile Hidden Markov Models. Stat. Appl. Genet. Mol. Biol. 4(1) (2005)
Plötz, T., Fink, G.A.: Pattern recognition methods for advanced stochastic protein sequence analysis using HMMs. Pattern Recognit. 39, 2267–2280 (2006). Special Issue on Bioinformatics
Plötz, T., Fink, G.A.: An efficient method for making un-supervised adaptation of HMM-based speech recognition systems robust against out-of-domain data. In: Proc. 4th Int. Workshop on Natural Language Processing and Cognitive Science, Funchal, Portugal, June 2007
Plötz, T., Fink, G.A., Husemann, P., Kanies, S., Lienemann, K., Marschall, T., Martin, M., Schillingmann, L., Steinrücken, M., Sudek, H.: Automatic detection of song changes in music mixes using stochastic models. In: Proc. Int. Conf. on Pattern Recognition, pp. 665–668 (2006)
Plötz, T., Thurau, C., Fink, G.A.: Camera-based whiteboard reading: new approaches to a challenging task. In: Proc. Int. Conf. on Frontiers in Handwriting Recognition, Montreal, Canada, pp. 385–390 (2008)
Richarz, J., Fink, G.A.: Visual recognition of 3D emblematic gestures in an HMM framework. J. Ambient Intell. Smart Environ. 3(3), 193–211 (2011). Thematic Issue on Computer Vision for Ambient Intelligence
Rosenberg, A.E., Lee, C.-H., Soong, F.K.: Cepstral channel normalization techniques for HMM-based speaker verification. In: Proc. Int. Conf. on Spoken Language Processing, Yokohama, Japan, vol. 4, pp. 1835–1838 (1994)
Rothacker, L., Rusinol, M., Fink, G.A.: Bag-of-features HMMs for segmentation-free word spotting in handwritten documents. In: Proc. Int. Conf. on Document Analysis and Recognition, Washington DC, USA (2013)
Rothacker, L., Vajda, S., Fink, G.A.: Bag-of-features representations for offline handwriting recognition applied to Arabic script. In: Proc. Int. Conf. on Frontiers in Handwriting Recognition, Bari, Italy (2012)
Rybach, D., Gollan, C., Heigold, G., Hoffmeister, B., Lööf, J., Schlüter, R., Ney, H.: The RWTH Aachen University open source speech recognition system. In: Interspeech, Brighton, UK, pp. 2111–2114 (2009)
Schillo, C., Fink, G.A., Kummert, F.: Grapheme based speech recognition for large vocabularies. In: Proc. Int. Conf. on Spoken Language Processing, Beijing, China, vol. 4, pp. 584–587 (2000)
Sixtus, A., Molau, S., Kanthak, S., Schlüter, R., Ney, H.: Recent improvements of the RWTH large vocabulary speech recognition system on spontaneous speech. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Istanbul, pp. 1671–1674 (2000)
Spiess, T., Wrede, B., Kummert, F., Fink, G.A.: Data-driven pronunciation modeling for ASR using acoustic subword units. In: Proc. European Conf. on Speech Communication and Technology, Geneva, pp. 2549–2552 (2003)
Wachsmuth, S., Fink, G.A., Kummert, F., Sagerer, G.: Using speech in visual object recognition. In: Sommer, G., Krüger, N., Perwass, C. (eds.) Mustererkennung 2000, 22. DAGM-Symposium Kiel. Informatik Aktuell, pp. 428–435. Springer, Berlin (2000)
Wachsmuth, S., Fink, G.A., Sagerer, G.: Integration of parsing and incremental speech recognition. In: Proceedings of the European Signal Processing Conference, Rhodes, Sep. 1998, vol. 1, pp. 371–375 (1998)
Welling, L., Kanthak, S., Ney, H.: Improved methods for vocal tract normalisation. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, pp. 761–764 (1999)
Wendt, S., Fink, G.A., Kummert, F.: Forward masking for increased robustness in automatic speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Aalborg, vol. 1, pp. 615–618 (2001)
Wendt, S., Fink, G.A., Kummert, F.: Dynamic search-space pruning for time-constrained speech recognition. In: Proc. Int. Conf. on Spoken Language Processing, Denver, vol. 1, pp. 377–380 (2002)
Westphal, M.: The use of cepstral means in conversational speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Rhodes, Greece, vol. 3, pp. 1143–1146 (1997)
Wienecke, M., Fink, G.A., Sagerer, G.: A handwriting recognition system based on visual input. In: Schiele, B., Sagerer, G. (eds.) Computer Vision Systems. Lecture Notes in Computer Science, pp. 63–72. Springer, Berlin (2001)
Wienecke, M., Fink, G.A., Sagerer, G.: Experiments in unconstrained offline handwritten text recognition. In: Proc. 8th Int. Workshop on Frontiers in Handwriting Recognition, Niagara on the Lake, Canada, August 2002
Wienecke, M., Fink, G.A., Sagerer, G.: Towards automatic video-based whiteboard reading. In: Proc. Int. Conf. on Document Analysis and Recognition, Edinburgh, vol. 1, pp. 87–91 (2003)
Wienecke, M., Fink, G.A., Sagerer, G.: Toward automatic video-based whiteboard reading. Int. J. Doc. Anal. Recognit. 7(2–3), 188–200 (2005)
Wrede, B., Fink, G.A., Sagerer, G.: Influence of duration on static and dynamic properties of German vowels in spontaneous speech. In: Proc. Int. Conf. on Spoken Language Processing, Beijing, China, vol. 1, pp. 82–85 (2000)
Wrede, B., Fink, G.A., Sagerer, G.: An investigation of modelling aspects for rate-dependent speech recognition. In: Proc. European Conf. on Speech Communication and Technology, Aalborg, vol. 4, pp. 2527–2530 (2001)
Zwicker, E., Fastl, H.: Psychoacoustics: Facts and Models, 2nd edn. Springer Series in Information Sciences, vol. 22. Springer, Berlin (1999)
Copyright information
© 2014 Springer-Verlag London
About this chapter
Cite this chapter
Fink, G.A. (2014). Speech Recognition. In: Markov Models for Pattern Recognition. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-6308-4_13
DOI: https://doi.org/10.1007/978-1-4471-6308-4_13
Publisher Name: Springer, London
Print ISBN: 978-1-4471-6307-7
Online ISBN: 978-1-4471-6308-4
eBook Packages: Computer Science (R0)