A Mixed Approach in Recognising Geographical Entities in Texts

  • Conference paper
Linguistic Linked Open Data (RUMOUR 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 588)

Abstract

The paper describes an approach to the automatic identification, in Romanian texts, of named entities belonging to the geographical domain. The research is part of a project (MappingBooks) that aims to link mentions of entities in an e-book with external information, as found in social media, Wikipedia, or web pages containing cultural or touristic information, in order to enhance the reader’s experience. The described named entity recognizer mixes ontological information, as found in public resources, with handwritten symbolic rules. The outputs of the two component modules are compared, and heuristics are used to make decisions in cases of conflict.
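The mixed design summarised above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy gazetteer, the trigger-word rule, and the conflict heuristic (the rule-based label wins when both modules annotate the same span) are hypothetical and are not taken from the MappingBooks recognizer, whose actual resources and heuristics are not given in this abstract.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    label: str
    source: str  # which module produced the mention


# Hypothetical toy gazetteer standing in for a public resource.
GAZETTEER = {"Suceava": "CITY", "Prut": "RIVER"}

# Hypothetical hand-written rule: a trigger noun followed by a capitalised word.
TRIGGERS = {"râul": "RIVER", "orașul": "CITY", "muntele": "MOUNTAIN"}
RULE = re.compile(
    r"\b(" + "|".join(map(re.escape, TRIGGERS)) + r")\s+([A-ZĂÂÎȘȚ]\w+)"
)


def gazetteer_mentions(text):
    """Look up every gazetteer name that occurs as a whole word."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append(Mention(m.start(), m.end(), m.group(), label, "gazetteer"))
    return found


def rule_mentions(text):
    """Label the capitalised word that follows a trigger noun."""
    return [
        Mention(m.start(2), m.end(2), m.group(2), TRIGGERS[m.group(1)], "rule")
        for m in RULE.finditer(text)
    ]


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


def recognize(text):
    """Merge both outputs; on overlap the rule-based mention is kept
    (an assumed heuristic: local context outranks the dictionary)."""
    rules = rule_mentions(text)
    kept = list(rules)
    for g in gazetteer_mentions(text):
        if not any(overlaps(g, r) for r in rules):
            kept.append(g)
    return sorted(kept)
```

With this sketch, `recognize("râul Suceava")` labels "Suceava" as a river because the trigger noun overrides the gazetteer's city entry, while a bare "Suceava" falls back to the gazetteer label. A real system would need morphological processing to handle inflected Romanian forms, which this sketch ignores.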

Notes

  1. Financed by the Romanian Ministry of Education and Research (UEFISCDI) under the Partnerships Programme (PN II Parteneriate, competition PCCA 2013), project code: PN-II-PT-PCCA-2013-4-1878.

  2. The Romanian Academy Center for Artificial Intelligence, in Bucharest.

  3. A box or a container is associated with each character (entity), which in a text is, at the first mention, partially filled in with pieces of information and, subsequently, complemented with details (name, sex, kinship connections, composition, beliefs, religion, etc.).

  4. http://itextpdf.com/

  5. http://nlptools.info.uaic.ro/

  6. http://www.geonames.org/

  7. http://sourceforge.net/projects/ggs/

  8. http://85.122.23.18:8181/MappingBooks/resources/recognizer

  9. http://85.122.23.18:8181/MappingBooks/resources/editor

  10. The API allows also identification of relations between entities, a facility not described in this paper.

  11. http://geonetwork-opensource.org/

  12. http://www.naturalearthdata.com/

  13. http://earth.unibuc.ro/download/harta-unitati-relief-romania

  14. http://www.openstreetmap.org

  15. www.bing.com/maps

  16. http://ro.wikipedia.org

  17. http://geoportal.ancpi.ro

  18. ANCPI – http://www.ancpi.ro

  19. http://www.insse.ro

  20. https://statistici.insse.ro/shop/?lang=ro

  21. http://edemos.insse.ro/portal

  22. http://www.insse.ro/cms/files/IDDT%202012/index_IDDT.htm

  23. http://www.insse.ro/cms/files/Web_IDD_BD_ro/index.htm

Acknowledgement

The work was published with the support of the PN-II-PT-PCCA-2013-4-1878 Partnership PCCA 2013 grant MappingBooks – Jump in the Book!, having as partners the “Alexandru Ioan Cuza” University of Iaşi, SIVECO S.R.L. Bucharest and “Ștefan cel Mare” University of Suceava.

Author information

Corresponding author

Correspondence to Daniela Gîfu.

Appendix

It is worth mentioning that, in MappingBooks, the identified geographical entities are intended to be used as location points in the document, linked to actual maps or external web pages, or participating in relevant semantic relations. GeoNetwork (footnote 11) was used as the repository of spatial data. It is an open source platform for creating catalogues of spatial data and for searching and storing their spatial metadata. The application is based on the principles of FOSS (Free and Open Source Software) and implements international standards (ISO/TC211 and OGC). Running as a server-side service, GeoNetwork stores the data in a database and provides a web interface through which the user can access catalogues, view and publish spatial data, or enter, visualise and edit the metadata associated with the geospatial data.

Our intention is to attach to the recognised geographical entities different types of information found in public sources. To this end, we have identified a number of possible sources of free geospatial data:

  • Natural Earth (footnote 12): a set of cultural, physical and raster data layers, generalised for three spatial scales: 1:10 million, 1:50 million and 1:110 million;
  • the Romanian geomorphological regionalisation (footnote 13), digitised from a number of analogue versions;
  • Open Street Map (footnote 14): a dataset created by the community, open to anyone for contribution and editing, containing points of interest (POI), lines and polygons that represent different types of spatial entities, complemented with further information;
  • Bing Maps® (footnote 15): a Microsoft® product providing a WMS service with maps and aerial images and a geocoding service, with a suite of data licences that, to some extent, allow use for personal and educational purposes;
  • Wikipedia (footnote 16), which contains, in addition to the text of each entry, a location given as geographical coordinates for toponyms;
  • the Romanian SDI and the Romanian INSPIRE geoportal (footnote 17), created by the National Agency for Cadastre and Registration (footnote 18) through the National Geodetic Fund and several collaborations;
  • Data.gov.ro: a portal of partially geospatial data produced by Romanian government agencies (the SIRUTA national codes for administrative units);
  • statistical data provided by the National Statistics Institute (footnote 19), the Romanian national statistics service, linkable to geospatial boundaries: the TEMPO database (footnote 20), the eDemos database (footnote 21), the IDDT database (footnote 22) of sustainable development indexes (footnote 23), etc.

Compiled datasets can also be produced by linking statistical databases with geospatial data, or through generalisation or other kinds of spatial analysis. For most datasets, global processing is needed to clip the region of interest and, possibly, to change the format and projection.
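The two operations mentioned above, linking statistics to spatial units and cutting out a region of interest, can be sketched in a few lines. This is only an illustration under stated assumptions: the unit codes, the indicator values and the function names are hypothetical placeholders (real data would use SIRUTA codes and published statistics), the coordinates are approximate, and the clipping is a simple bounding-box filter on points rather than a true polygon clip.

```python
# Hypothetical point features; "code" stands in for an administrative unit code.
FEATURES = [
    {"code": "A1", "name": "Iași", "lon": 27.59, "lat": 47.16},
    {"code": "A2", "name": "Suceava", "lon": 26.26, "lat": 47.65},
    {"code": "A3", "name": "București", "lon": 26.10, "lat": 44.43},
]

# Hypothetical statistics table keyed by the same unit code (placeholder values).
STATS = {
    "A1": {"indicator": 1.0},
    "A2": {"indicator": 2.0},
}


def link_by_code(features, stats):
    """Attach a statistics row to every feature that shares its unit code.
    Features without a matching row are dropped (an inner join)."""
    return [{**f, **stats[f["code"]]} for f in features if f["code"] in stats]


def clip_to_bbox(features, min_lon, min_lat, max_lon, max_lat):
    """Keep only the features whose point lies inside the bounding box."""
    return [
        f
        for f in features
        if min_lon <= f["lon"] <= max_lon and min_lat <= f["lat"] <= max_lat
    ]


# Link first, then restrict to an assumed region of interest.
linked = link_by_code(FEATURES, STATS)
region = clip_to_bbox(linked, 26.0, 46.5, 28.5, 48.5)
```

In practice a GIS library would also handle polygon geometries and the change of format and projection, which this sketch leaves out.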

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Cristea, D., Gîfu, D., Pistol, I., Sfirnaciuc, D., Niculiţă, M. (2016). A Mixed Approach in Recognising Geographical Entities in Texts. In: Trandabăţ, D., Gîfu, D. (eds) Linguistic Linked Open Data. RUMOUR 2015. Communications in Computer and Information Science, vol 588. Springer, Cham. https://doi.org/10.1007/978-3-319-32942-0_4

  • DOI: https://doi.org/10.1007/978-3-319-32942-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32941-3

  • Online ISBN: 978-3-319-32942-0

  • eBook Packages: Computer Science (R0)
