Liner2 – A Customizable Framework for Proper Names Recognition for Polish

Marcińczuk, Michał; Kocoń, Jan; Janicki, Maciej

doi:10.1007/978-3-642-35647-6_17


Part of the book series: Studies in Computational Intelligence (SCI, volume 467)


Abstract

In this paper we present Liner2, a customizable, open-source framework for proper name recognition. The framework provides several universal methods for sequence chunking: dictionary look-up, pattern matching and statistical processing. The statistical processing uses Conditional Random Fields (CRF) with a rich set of features that includes morphological, lexical and semantic information. We apply the framework to the task of recognizing proper names in Polish texts, covering 5 common categories: first names, surnames, city names, road names and country names. The Liner2 framework was also used to train an extended model that recognizes 56 categories of proper names, which in turn was used to bootstrap the manual annotation of the KPWr corpus. We also present a CRF-based model integrated with a heterogeneous named entity similarity function, and show that adding the similarity function to the best configuration improved the final result in cross-domain evaluation. The last section presents NER-WS, a web service for proper name recognition in Polish texts that uses the Liner2 framework and the model for 56 categories of proper names. The web service can be tested using a web-based demo available at http://nlp.pwr.wroc.pl/inforex/.
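To make the dictionary look-up method named in the abstract more concrete, here is a minimal Python sketch of gazetteer-based sequence chunking with greedy longest-match and BIO output tags. It is an illustration only and is not the Liner2 implementation or API: the function name `dictionary_chunk`, the toy `GAZETTEER`, the category labels and the BIO tagging scheme are all assumptions introduced for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer (an assumption for this sketch):
# token tuples mapped to a proper name category.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Jan",): "first_name",
    ("Kowalski",): "surname",
    ("Wrocław",): "city",
    ("Nowy", "Sącz"): "city",
    ("Polska",): "country",
}


def dictionary_chunk(
    tokens: List[str],
    gazetteer: Dict[Tuple[str, ...], str] = GAZETTEER,
) -> List[str]:
    """Return one BIO tag per token using greedy longest-match look-up."""
    tags = ["O"] * len(tokens)
    if not gazetteer:
        return tags
    max_len = max(len(key) for key in gazetteer)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags
```

For example, `dictionary_chunk(["Jan", "Kowalski", "mieszka", "w", "Nowy", "Sącz"])` tags the two-token city name as a single chunk. In a combined setup, tags of this kind could also serve as one input feature for a statistical chunker, though how Liner2 combines its methods is described in the chapter itself, not here.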

Author information

Authors and Affiliations

Wrocław University of Technology, Wrocław, Poland
Michał Marcińczuk, Jan Kocoń & Maciej Janicki

Corresponding author

Correspondence to Michał Marcińczuk.

Editor information

Editors and Affiliations

Robert Bembenik, Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, Warszawa, 00-665, Poland
Lukasz Skonieczny, Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, Warszawa, 00-665, Poland
Henryk Rybinski, Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, Warszawa, 00-665, Poland
Marzena Kryszkiewicz, Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, Warszawa, 00-665, Poland
Marek Niezgodka, Interdisciplinary Centre for, University of Warsaw, Pawińskiego 5a bl. D, Warsaw, 02-106, Poland

About this chapter

Cite this chapter

Marcińczuk, M., Kocoń, J., Janicki, M. (2013). Liner2 – A Customizable Framework for Proper Names Recognition for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol 467. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35647-6_17

DOI: https://doi.org/10.1007/978-3-642-35647-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35646-9
Online ISBN: 978-3-642-35647-6
eBook Packages: Engineering, Engineering (R0)
