Abstract
This paper presents Liner2, a customizable, open-source framework for proper name recognition. The framework provides several general-purpose methods for sequence chunking: dictionary look-up, pattern matching and statistical processing. The statistical processing uses Conditional Random Fields (CRF) with a rich set of features that includes morphological, lexical and semantic information. We apply the framework to the task of recognizing proper names in Polish texts, covering five common categories: first names, surnames, city names, road names and country names. The Liner2 framework was also used to train an extended model that recognizes 56 categories of proper names; this model was used to bootstrap the manual annotation of the KPWr corpus. We also present a CRF-based model integrated with a heterogeneous named entity similarity function, and show that adding the similarity function to the best configuration improved the final result in cross-domain evaluation. The last section presents NER-WS, a web service for proper name recognition in Polish texts that uses the Liner2 framework and the 56-category model. The web service can be tested through a web-based demo available at http://nlp.pwr.wroc.pl/inforex/.
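To make the idea of combining chunking methods concrete, the following is a minimal Python sketch of two of the non-statistical methods named above, dictionary look-up and pattern matching, applied to a tokenized sentence. It is not Liner2's code or API: the names `Chunk`, `dictionary_chunker`, `pattern_chunker` and `merge`, the labels `city` and `road`, the tiny gazetteer and the road-prefix pattern are all assumptions made for illustration.

```python
import re
from typing import List, NamedTuple, Set


class Chunk(NamedTuple):
    start: int  # index of the first token in the chunk
    end: int    # index one past the last token
    label: str  # hypothetical category label


def dictionary_chunker(tokens: List[str], gazetteer: Set[str],
                       label: str, max_len: int = 3) -> List[Chunk]:
    """Greedy longest-match look-up of token n-grams in a gazetteer."""
    chunks = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in gazetteer:
                match_end = i + n
                break
        if match_end is not None:
            chunks.append(Chunk(i, match_end, label))
            i = match_end
        else:
            i += 1
    return chunks


# Assumed pattern: a street-type abbreviation followed by capitalised tokens.
ROAD_PREFIX = re.compile(r"^(ul\.|al\.|pl\.)$")


def pattern_chunker(tokens: List[str]) -> List[Chunk]:
    """Mark capitalised tokens that follow a road prefix as a road name."""
    chunks = []
    i = 0
    while i < len(tokens):
        if ROAD_PREFIX.match(tokens[i]):
            j = i + 1
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            if j > i + 1:
                chunks.append(Chunk(i + 1, j, "road"))
                i = j
                continue
        i += 1
    return chunks


def merge(*chunkings: List[Chunk]) -> List[Chunk]:
    """Combine chunkings; earlier chunkers win when chunks overlap."""
    result: List[Chunk] = []
    for chunking in chunkings:
        for c in chunking:
            if all(c.end <= r.start or c.start >= r.end for r in result):
                result.append(c)
    return sorted(result)


tokens = ["Jan", "Kowalski", "odwiedził", "Wrocław", "i", "ul.", "Długą"]
cities = {"Wrocław", "Zielona Góra"}  # toy gazetteer, illustration only
chunks = merge(dictionary_chunker(tokens, cities, "city"),
               pattern_chunker(tokens))
```

In this sketch the order of arguments to `merge` sets the priority between methods. A statistical chunker such as a CRF tagger could be added as a further argument, provided it returns the same `Chunk` structure.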
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Marcińczuk, M., Kocoń, J., Janicki, M. (2013). Liner2 – A Customizable Framework for Proper Names Recognition for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol 467. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35647-6_17
DOI: https://doi.org/10.1007/978-3-642-35647-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35646-9
Online ISBN: 978-3-642-35647-6