Dealing with Newly Emerging OOVs in Broadcast Programs by Daily Updates of the Lexicon and Language Model

Cerva, Petr; Volna, Veronika; Weingartova, Lenka

doi:10.1007/978-3-030-60276-5_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

This paper deals with out-of-vocabulary (OOV) word recognition in the task of 24/7 broadcast stream transcription. In this task, most OOVs that newly emerge over time are names of politicians, athletes, major world events, disasters, etc. The absence of these content OOVs, e.g. COVID-19, hampers human understanding of the recognized text and harms further NLP processing, such as machine translation, named entity recognition, or any type of semantic or dialogue analysis. In production environments, content OOVs are extremely important, and it is essential that they are transcribed correctly as soon as possible. For this purpose, an approach based on daily updates of the lexicon and language model is proposed. It consists of three consecutive steps: a) the identification of new content OOVs from already existing text sources, b) their controlled addition to the lexicon of the transcription system, and c) proper tuning of the language model. Experimental evaluation is performed on an extensive dataset compiled from various Czech broadcast programs. These data were produced by a real transcription platform over the course of 300 days in 2019. Detailed statistics and an analysis of the new content OOVs emerging within this period are also provided.
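
Steps a) and b) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not describe the selection criteria, so the capitalization filter, the `min_count` frequency threshold, the `max_new` cap, and the function names `find_candidate_oovs` and `update_lexicon` are all assumptions made for illustration. Step c), tuning the language model, depends on the toolkit in use and is left out.

```python
import re
from collections import Counter

# Tokens are runs of word characters, optionally joined by hyphens,
# so that forms such as "COVID-19" stay in one piece.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def find_candidate_oovs(daily_texts, lexicon, min_count=2):
    """Step a) (assumed heuristic): count capitalized tokens missing from the lexicon.

    Sentence-initial words are not treated specially here; they are only
    filtered out if their lowercase form is already in the lexicon.
    """
    known = {word.lower() for word in lexicon}
    counts = Counter()
    for text in daily_texts:
        for token in TOKEN_RE.findall(text):
            if token[0].isupper() and token.lower() not in known:
                counts[token] += 1
    # Keep only tokens seen often enough in the day's texts.
    return {word: n for word, n in counts.items() if n >= min_count}


def update_lexicon(lexicon, candidates, max_new=100):
    """Step b) (assumed policy): add at most max_new candidates, most frequent first."""
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    added = [word for word, _ in ranked[:max_new]]
    return set(lexicon) | set(added), added


if __name__ == "__main__":
    texts = [
        "Experts warn about COVID-19 spread.",
        "COVID-19 cases rise in Prague.",
    ]
    lexicon = {"experts", "warn", "about", "spread", "cases", "rise", "in", "prague"}
    candidates = find_candidate_oovs(texts, lexicon)
    new_lexicon, added = update_lexicon(lexicon, candidates)
    print(candidates, added)
```

Capping the number of additions per day is one possible way to keep the addition "controlled"; a production system would presumably also need pronunciations for the new entries and some form of review.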

Acknowledgments

This work was supported by the Technology Agency of the Czech Republic (Project No. TH03010018).

Author information

Authors and Affiliations

Institute of Information Technologies and Electronics, Technical University of Liberec, Studentska 2, Liberec, 46117, Czech Republic
Petr Cerva
Newton Technologies, Na Pankraci 1683/127, Praha 4, 140 00, Czech Republic
Veronika Volna & Lenka Weingartova

Corresponding author

Correspondence to Petr Cerva.

Editor information

Editors and Affiliations

St. Petersburg Institute for Informatics and Automation, Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Institute for Applied and Mathematical Linguistics, Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Cerva, P., Volna, V., Weingartova, L. (2020). Dealing with Newly Emerging OOVs in Broadcast Programs by Daily Updates of the Lexicon and Language Model. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_10

DOI: https://doi.org/10.1007/978-3-030-60276-5_10
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science, Computer Science (R0)
