Skip to main content

Multiple Model Text Normalization for the Polish Language

  • Conference paper
Book cover Foundations of Intelligent Systems (ISMIS 2012)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 7661))

Included in the following conference series:

Abstract

The following paper describes a text normalization program for the Polish language. The program is based on a combination of rule-based and statistical approaches for text normalization. The scope of all words modelled by this solution was divided in three ways: by using grammar features, lemmas of words and words themselves. Each word in the lexicon was assigned a suitable element from each of the aforementioned domains. Finally, the combination of three n-gram models operating in the domains of grammar classes, word lemmas and individual words was combined together using weights adjusted by an evolution strategy to obtain the final solution. The tool is also capable of producing grammar tags on words to aid in further language model creation.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Filip, G., Krzysztof, J., Agnieszka, W., Mikołaj, W.: Text Normalization as a Special Case of Machine Translation. In: Proceedings of the International Multiconference on Computer Science and Information Technology, Wisła, Poland, vol. 1 (2006)

    Google Scholar 

  2. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge (1998)

    Google Scholar 

  3. Dumke, R.R., Abran, A. (eds.): IWSM 2000. LNCS, vol. 2006. Springer, Heidelberg (2001)

    MATH  Google Scholar 

  4. Michalewicz, Z.: Genetic algorithms + Data Structures = Evolution Programs. Springer (1994)

    Google Scholar 

  5. Michalewicz, Z., Fogel, D.B.: How to Solve It: Modern Heuristics. Springer (1999)

    Google Scholar 

  6. Przepiórkowski, A.: Korpus IPI PAN. Wersja wstępna / The IPI PAN Corpus: Preliminary version. IPI PAN, Warszawa (2004)

    Google Scholar 

  7. Savary, A., Rabiega-Wiśniewska, J., Woliński, M.: Inflection of Polish Multi-Word Proper Names with Morfeusz and Multiflex. In: Marciniak, M., Mykowiecka, A. (eds.) Aspects of Natural Language Processing. LNCS, vol. 5070, pp. 111–141. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  8. http://sgjp.pl/morfeusz/

  9. Bilmes, J.A., Kirchhoff, K.: Factored language models and generalized parallel backoff. In: Proceedings of HLT/NACCL, pp. 4–6 (2003)

    Google Scholar 

  10. Chen, S.F., Goodman, J.T.: An empirical study of smoothing techniques for language modeling. Computer, Speech and Language 393, 359–393 (1999)

    Google Scholar 

  11. Katz, S.M.: Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing 3, 400–401 (1987)

    Article  Google Scholar 

  12. Kneser, R., Ney, H.: Improved backing-off for n-gram language modeling. In: International Conference on Acoustics, Speech and Signal Processing, pp. 181–184 (1995)

    Google Scholar 

  13. Chung, G., Seneff, S., Wang, C.: Automatic Induction of Language Model Data for A Spoken Dialogue System. In: 6th SIGdial Workshop on Discourse and Dialogue Lisbon, Portugal, September 2-3 (2005)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brocki, Ł., Marasek, K., Koržinek, D. (2012). Multiple Model Text Normalization for the Polish Language. In: Chen, L., Felfernig, A., Liu, J., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2012. Lecture Notes in Computer Science(), vol 7661. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34624-8_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-34624-8_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34623-1

  • Online ISBN: 978-3-642-34624-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics