Multiple Model Text Normalization for the Polish Language

Brocki, Łukasz; Marasek, Krzysztof; Koržinek, Danijel

doi:10.1007/978-3-642-34624-8_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7661)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

This paper describes a text normalization program for the Polish language. The program combines rule-based and statistical approaches to text normalization. The words modelled by the solution were represented in three domains: grammar features, word lemmas, and the words themselves. Each word in the lexicon was assigned a suitable element from each of these domains. Three n-gram models, operating on grammar classes, word lemmas, and individual words respectively, were then combined using weights adjusted by an evolution strategy to obtain the final solution. The tool can also produce grammar tags for words to aid in further language model creation.
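
As a rough illustration of the weighting step, the sketch below combines three per-model probabilities by linear interpolation and tunes the weights with a simple (1+1) evolution strategy. The abstract does not specify the combination formula, the strategy variant, or the fitness function, so all of these are assumptions: the function names, the held-out log-likelihood objective, and the `heldout` numbers are hypothetical and are not taken from the paper.

```python
import math
import random


def interpolate(weights, probs):
    """Weighted sum of the per-model probabilities for one token."""
    return sum(w * p for w, p in zip(weights, probs))


def normalize(weights):
    """Keep weights strictly positive and rescale them to sum to 1."""
    clipped = [max(w, 1e-9) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]


def log_likelihood(weights, data):
    """Sum of log interpolated probabilities over the held-out tokens."""
    return sum(math.log(interpolate(weights, probs)) for probs in data)


def evolve_weights(data, generations=200, sigma=0.1, seed=0):
    """(1+1) evolution strategy: mutate the parent, keep the child if no worse."""
    rng = random.Random(seed)
    parent = normalize([1.0, 1.0, 1.0])
    best = log_likelihood(parent, data)
    for _ in range(generations):
        # Gaussian mutation of every weight, then projection back to valid weights.
        child = normalize([w + rng.gauss(0.0, sigma) for w in parent])
        score = log_likelihood(child, data)
        if score >= best:
            parent, best = child, score
    return parent, best


# Hypothetical held-out data: each row holds the probability that the
# grammar-class, lemma, and word n-gram models assign to the correct token.
heldout = [
    (0.20, 0.05, 0.01),
    (0.10, 0.30, 0.40),
    (0.15, 0.25, 0.02),
    (0.05, 0.10, 0.50),
]

weights, score = evolve_weights(heldout)
print("weights:", [round(w, 3) for w in weights])
print("held-out log-likelihood:", round(score, 3))
```

Because a child replaces the parent only when it scores at least as well, the returned weights are never worse on the held-out data than the uniform starting point. A population-based strategy or a different objective, such as normalization accuracy, would fit the same skeleton.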

Author information

Authors and Affiliations

Polish-Japanese Institute of Information Technology, Warsaw, Poland
Łukasz Brocki, Krzysztof Marasek & Danijel Koržinek

Editor information

Editors and Affiliations

Hong Kong Baptist University, 224 Waterloo Road, Kowloon, Hong Kong
Li Chen & Jiming Liu
Institute for Software Technology, Graz University of Technology, Inffeldgasse 16b, 8010, Graz, Austria
Alexander Felfernig
University of North Carolina, Charlotte, NC 28223, USA and Warsaw University of Technology, Nowowiejska 15/19, 00-665, Warsaw, Poland
Zbigniew W. Raś

About this paper

Cite this paper

Brocki, Ł., Marasek, K., Koržinek, D. (2012). Multiple Model Text Normalization for the Polish Language. In: Chen, L., Felfernig, A., Liu, J., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2012. Lecture Notes in Computer Science, vol 7661. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34624-8_17

DOI: https://doi.org/10.1007/978-3-642-34624-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34623-1
Online ISBN: 978-3-642-34624-8
eBook Packages: Computer Science, Computer Science (R0)