Extended N-gram Model for Analysis of Polish Texts

Banasiak, Dariusz; Mierzwa, Jarosław; Sterna, Antoni

doi:10.1007/978-3-319-67792-7_35

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 659)

Included in the following conference series:

International Conference on Man–Machine Interactions

Abstract

The paper presents an extended N-gram model designed for the analysis of Polish texts. One possible application of the model is the automatic detection and correction of errors that occur during computer-based text editing. N-grams belong to the group of statistical methods in Natural Language Processing (NLP). They are built by analyzing sufficiently large collections of language data called corpora. In the classic version, N-grams represent sequences of words of a given length that appear in the analyzed language resources. The presented approach introduces N-grams that also include the results of morphological analysis of the text. As a result, three types of N-grams can be distinguished: lexical (containing the original words from the text or their basic forms), morphosyntactic (sequences of the morphosyntactic tags assigned to words) and mixed (a combination of lexical and morphological description). The extended model with the new types of N-grams captures language properties specific to Polish, such as free word order and complex inflection.
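
The three N-gram types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Token` structure, the function names, the `lexical_positions` parameter and the hand-assigned tags for the sample sentence "Ala ma kota" are all assumptions made for illustration, and a real system would obtain basic forms and tags from a morphological analyzer.

```python
from typing import NamedTuple


class Token(NamedTuple):
    orth: str  # word form as it appears in the text
    base: str  # basic form (lemma)
    tag: str   # morphosyntactic tag (hand-assigned here, for illustration only)


def ngrams(items, n):
    """Return every contiguous window of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def lexical_ngrams(tokens, n, use_base=False):
    """Lexical N-grams: original word forms, or basic forms if use_base is set."""
    return ngrams([t.base if use_base else t.orth for t in tokens], n)


def morphosyntactic_ngrams(tokens, n):
    """Morphosyntactic N-grams: sequences of tags only."""
    return ngrams([t.tag for t in tokens], n)


def mixed_ngrams(tokens, n, lexical_positions):
    """Mixed N-grams: positions listed in lexical_positions keep the word form,
    all other positions in the window are replaced by the tag."""
    return [
        tuple(t.orth if i in lexical_positions else t.tag
              for i, t in enumerate(window))
        for window in ngrams(tokens, n)
    ]


# Hypothetical, manually annotated sentence: "Ala ma kota" ("Ala has a cat").
SENTENCE = [
    Token("Ala", "Ala", "subst:sg:nom:f"),
    Token("ma", "mieć", "fin:sg:ter:imperf"),
    Token("kota", "kot", "subst:sg:acc:m2"),
]

if __name__ == "__main__":
    print(lexical_ngrams(SENTENCE, 2))
    print(lexical_ngrams(SENTENCE, 2, use_base=True))
    print(morphosyntactic_ngrams(SENTENCE, 2))
    print(mixed_ngrams(SENTENCE, 2, lexical_positions={0}))
```

In this sketch the mixed variant keeps the first word of each bigram and abstracts the second one to its tag. This is only one possible way of combining lexical and morphological description; the abstract does not specify which combinations the model uses.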

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Electronics, Wroclaw University of Science and Technology, Wroclaw, Poland
Dariusz Banasiak, Jarosław Mierzwa & Antoni Sterna

Corresponding author

Correspondence to Dariusz Banasiak.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Aleksandra Gruca
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Tadeusz Czachórski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Katarzyna Harezlak
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Agnieszka Piotrowska

About this paper

Cite this paper

Banasiak, D., Mierzwa, J., Sterna, A. (2018). Extended N-gram Model for Analysis of Polish Texts. In: Gruca, A., Czachórski, T., Harezlak, K., Kozielski, S., Piotrowska, A. (eds) Man-Machine Interactions 5. ICMMI 2017. Advances in Intelligent Systems and Computing, vol 659. Springer, Cham. https://doi.org/10.1007/978-3-319-67792-7_35

DOI: https://doi.org/10.1007/978-3-319-67792-7_35
Published: 20 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67791-0
Online ISBN: 978-3-319-67792-7
eBook Packages: Engineering, Engineering (R0)
