Extended N-gram Model for Analysis of Polish Texts

  • Conference paper
Man-Machine Interactions 5 (ICMMI 2017)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 659)

Abstract

The paper presents an extended N-gram model designed for the analysis of texts in Polish. One possible application of the model is the automatic detection and correction of errors that occur during computer-based text editing. N-grams belong to the group of statistical methods in Natural Language Processing (NLP). They are created by analyzing sufficiently large language data resources called corpora. In the classic version, N-grams represent sequences of words of a certain length that appear in the analyzed language resources. The presented approach introduces N-grams that also include the results of morphological analysis of texts. As a result, three types of N-grams can be distinguished: lexical (containing the original words from the text or their base forms), morphosyntactic (sequences of morphosyntactic tags assigned to words) and mixed (a combination of lexical and morphological description). The extended model, with its new types of N-grams, accounts for language properties specific to Polish, such as free word order and complex inflection.
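
The three N-gram types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sentence, its hand-written tags (loosely in the style of IPI PAN morphosyntactic tags), the function names, and the `pattern` notation for mixed N-grams are all assumptions made for illustration. In practice the base forms and tags would come from a morphological analyzer and tagger.

```python
# Hypothetical, hand-annotated toy sentence: (word form, base form, tag).
# The tags are illustrative only and were not produced by a real tagger.
TOKENS = [
    ("Ala", "Ala", "subst:sg:nom:f"),
    ("ma", "mieć", "fin:sg:ter:imperf"),
    ("kota", "kot", "subst:sg:acc:m2"),
]


def ngrams(items, n):
    """Return all contiguous windows of length n as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def lexical_ngrams(tokens, n, use_base=False):
    """N-grams over original word forms, or over base forms if requested."""
    field = 1 if use_base else 0
    return ngrams([tok[field] for tok in tokens], n)


def morphosyntactic_ngrams(tokens, n):
    """N-grams over the morphosyntactic tags assigned to the words."""
    return ngrams([tok[2] for tok in tokens], n)


def mixed_ngrams(tokens, pattern):
    """N-grams that mix lexical and morphological descriptions.

    `pattern` is an assumed notation: one letter per position, where
    'W' selects the word form, 'B' the base form, and 'T' the tag.
    """
    field = {"W": 0, "B": 1, "T": 2}
    return [
        tuple(tok[field[p]] for tok, p in zip(window, pattern))
        for window in ngrams(tokens, len(pattern))
    ]


if __name__ == "__main__":
    print(lexical_ngrams(TOKENS, 2))
    print(lexical_ngrams(TOKENS, 2, use_base=True))
    print(morphosyntactic_ngrams(TOKENS, 2))
    print(mixed_ngrams(TOKENS, "WT"))
```

Counting such tuples over a corpus (for example with `collections.Counter`) would give the frequency statistics on which an N-gram model is built. A tag-level or mixed N-gram abstracts away from the concrete word forms, which is one plausible reason such N-grams suit a highly inflected language.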

Author information

Corresponding author

Correspondence to Dariusz Banasiak.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Banasiak, D., Mierzwa, J., Sterna, A. (2018). Extended N-gram Model for Analysis of Polish Texts. In: Gruca, A., Czachórski, T., Harezlak, K., Kozielski, S., Piotrowska, A. (eds) Man-Machine Interactions 5. ICMMI 2017. Advances in Intelligent Systems and Computing, vol 659. Springer, Cham. https://doi.org/10.1007/978-3-319-67792-7_35

  • DOI: https://doi.org/10.1007/978-3-319-67792-7_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67791-0

  • Online ISBN: 978-3-319-67792-7

  • eBook Packages: Engineering, Engineering (R0)
