Liner2 – A Customizable Framework for Proper Names Recognition for Polish

  • Chapter
Intelligent Tools for Building a Scientific Information Platform

Part of the book series: Studies in Computational Intelligence ((SCI,volume 467))

Abstract

This paper presents Liner2, a customizable, open-source framework for proper name recognition. The framework combines several universal methods for sequence chunking: dictionary look-up, pattern matching and statistical processing. The statistical processing uses Conditional Random Fields (CRF) with a rich set of features that include morphological, lexical and semantic information. We apply the framework to the task of recognizing proper names in Polish texts, covering five common categories: first names, surnames, city names, road names and country names. The Liner2 framework was also used to train an extended model that recognizes 56 categories of proper names; this model was used to bootstrap the manual annotation of the KPWr corpus. We also present a CRF-based model integrated with a heterogeneous named entity similarity function, and show that adding the similarity function to the best configuration improved the final result in cross-domain evaluation. The last section presents NER-WS, a web service for proper name recognition in Polish texts that uses the Liner2 framework and the model for 56 categories of proper names. The web service can be tested using a web-based demo available at http://nlp.pwr.wroc.pl/inforex/.
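
To make the idea of dictionary look-up as a sequence chunking method concrete, the following is a minimal Python sketch. It is not Liner2's code or API: the gazetteer contents, the function name dictionary_chunk, the category labels and the BIO output format are all assumptions chosen for illustration. It labels tokens by greedy longest-match look-up against a small in-memory dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer (an assumption, not a Liner2 resource):
# tuples of tokens mapped to a proper name category.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Jan",): "first_name",
    ("Kowalski",): "surname",
    ("Nowy", "Sącz"): "city_name",
    ("Polska",): "country_name",
}


def dictionary_chunk(
    tokens: List[str],
    gazetteer: Dict[Tuple[str, ...], str] = GAZETTEER,
) -> List[str]:
    """Label tokens with BIO tags using greedy longest-match look-up."""
    max_len = max((len(key) for key in gazetteer), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, so that a multi-token
        # entry wins over any shorter entry starting at the same token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            category = gazetteer.get(tuple(tokens[i:i + n]))
            if category is not None:
                labels[i] = "B-" + category
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + category
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


if __name__ == "__main__":
    sentence = ["Jan", "Kowalski", "odwiedził", "Nowy", "Sącz"]
    for token, label in zip(sentence, dictionary_chunk(sentence)):
        print(token, label)
```

Exact string matching of this kind misses inflected forms of Polish names, which is one reason a look-up step on its own is limited and is combined in the framework with pattern matching and a statistical model.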

Corresponding author

Correspondence to Michał Marcińczuk.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Marcińczuk, M., Kocoń, J., Janicki, M. (2013). Liner2 – A Customizable Framework for Proper Names Recognition for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol 467. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35647-6_17

  • DOI: https://doi.org/10.1007/978-3-642-35647-6_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35646-9

  • Online ISBN: 978-3-642-35647-6

  • eBook Packages: Engineering, Engineering (R0)
