Recovery of Rare Words in Lecture Speech

Kombrink, Stefan; Hannemann, Mirko; Burget, Lukáš; Heřmanský, Hynek

doi:10.1007/978-3-642-15760-8_42

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6231)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

The vocabulary used in speech usually consists of two types of words: a limited set of common words, shared across many documents, and a virtually unlimited set of rare words, each of which may appear only a few times and only in particular documents. Any given rare word is therefore absent from most documents. Words of the first type are typically included in the language model of an automatic speech recognizer (ASR) and are thus widely referred to as in-vocabulary (IV). Words of the second type are missing from the language model and are therefore called out-of-vocabulary (OOV), even though they usually carry important information.
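The IV/OOV distinction can be illustrated with a minimal sketch. The vocabulary, the sentence, and the helper name `split_iv_oov` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def split_iv_oov(tokens, vocabulary):
    """Partition tokens into in-vocabulary (IV) and out-of-vocabulary (OOV) lists."""
    iv = [t for t in tokens if t in vocabulary]
    oov = [t for t in tokens if t not in vocabulary]
    return iv, oov


# Hypothetical toy data: a tiny recognizer vocabulary and one reference sentence.
vocabulary = {"the", "model", "is", "trained", "on", "speech"}
tokens = "the model is trained on eigenvoice speech".split()

iv, oov = split_iv_oov(tokens, vocabulary)
oov_rate = len(oov) / len(tokens)  # fraction of running words not in the vocabulary
```

Here the rare word "eigenvoice" is the only token outside the vocabulary, so a purely word-based recognizer could not output it.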

We use a hybrid word/sub-word recognizer to detect OOV words occurring in English talks and describe them as sequences of sub-words. We detected about one third of all OOV words and recovered the correct spelling for 26.2% of all detections using phoneme-to-grapheme (P2G) conversion trained on the recognition dictionary. By omitting detections whose recovered spelling corresponds to an IV word, we increased the precision of the OOV detection substantially.
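The filtering step in the last sentence can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Detection` record, the reference labels, and the toy data are hypothetical, and the P2G output is assumed to be available already as a string.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    recovered_spelling: str  # assumed P2G output for the detected region
    is_true_oov: bool        # reference label: does the region contain an OOV word?


def precision(detections):
    """Fraction of detections that really are OOV words."""
    if not detections:
        return 0.0
    return sum(d.is_true_oov for d in detections) / len(detections)


def drop_iv_recoveries(detections, vocabulary):
    """Discard detections whose recovered spelling is already an IV word."""
    return [d for d in detections if d.recovered_spelling.lower() not in vocabulary]


# Hypothetical toy data, not results from the paper.
vocabulary = {"speech", "model", "lecture"}
detections = [
    Detection("eigenvoice", True),
    Detection("model", False),    # false alarm that is recovered as an IV word
    Detection("cepstrum", True),
    Detection("lecture", False),  # false alarm that is recovered as an IV word
    Detection("speach", False),   # false alarm that the filter cannot catch
]

before = precision(detections)
filtered = drop_iv_recoveries(detections, vocabulary)
after = precision(filtered)
```

In this toy case two of the three false alarms are recovered as IV words and removed, so precision rises from 2/5 to 2/3 while both true OOV detections are kept.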

This work was partly supported by the European project DIRAC (FP6-027787), by the Grant Agency of the Czech Republic project No. 102/08/0707, by the Czech Ministry of Education project No. MSM0021630528, and by BUT FIT grant No. FIT-10-S-2.

Author information

Authors and Affiliations

Speech@FIT, Brno University of Technology, Czech Republic
Stefan Kombrink, Mirko Hannemann, Lukáš Burget & Hynek Heřmanský
Johns Hopkins University, Baltimore, USA
Hynek Heřmanský

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Botanická 68a, CZ-602 00, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Karel Pala

About this paper

Cite this paper

Kombrink, S., Hannemann, M., Burget, L., Heřmanský, H. (2010). Recovery of Rare Words in Lecture Speech. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2010. Lecture Notes in Computer Science, vol 6231. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15760-8_42

DOI: https://doi.org/10.1007/978-3-642-15760-8_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15759-2
Online ISBN: 978-3-642-15760-8
eBook Packages: Computer Science, Computer Science (R0)
