Automatic Gazetteer Generation from Wikipedia

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6699)

Abstract

A high-quality Named Entity gazetteer is crucial for a Cross-Language Information Retrieval (CLIR) system to provide multilingual access to digital resources, particularly in the domain of Digital Libraries. In this paper we investigate an unsupervised approach for automatically extracting such resources from Wikipedia: it leverages the DBpedia classification of the English articles to induce the same classification onto encyclopedia pages written in other languages. By exploiting the structured information present in Wikipedia, we further aim to enrich our standard gazetteer with translations into other languages and with alternative spellings of the entities.
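The class-projection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that cross-language links connect an English article to its counterparts in other languages and that redirect titles supply the alternative spellings, neither of which the abstract states. The function name `build_gazetteer`, the three input mappings, and the toy data are all hypothetical.

```python
def build_gazetteer(en_classes, langlinks, redirects):
    """Project English article classes onto linked pages in other languages.

    All three inputs are hypothetical structures chosen for this sketch:
      en_classes: English title -> class label (e.g. taken from DBpedia)
      langlinks:  English title -> {language code: title in that language}
      redirects:  (language code, title) -> list of alternative spellings
    """
    entries = []
    for en_title, entity_class in sorted(en_classes.items()):
        # Start with the English name, then add one name per linked language.
        names = {"en": [en_title]}
        for lang, title in langlinks.get(en_title, {}).items():
            names.setdefault(lang, []).append(title)
        # Append alternative spellings (assumed to come from redirects).
        for lang, variants in names.items():
            canonical = variants[0]
            variants.extend(redirects.get((lang, canonical), []))
        # Every language version inherits the class of the English article.
        entries.append({"class": entity_class, "names": names})
    return entries


# Toy data, invented for illustration only.
en_classes = {"Turin": "Place", "Dante Alighieri": "Person"}
langlinks = {
    "Turin": {"it": "Torino", "de": "Turin"},
    "Dante Alighieri": {"it": "Dante Alighieri"},
}
redirects = {
    ("en", "Turin"): ["Torino"],
    ("en", "Dante Alighieri"): ["Dante"],
}

entries = build_gazetteer(en_classes, langlinks, redirects)
for entry in entries:
    print(entry["class"], entry["names"])
```

Each resulting entry carries a single class label shared by all language versions of the entity, which is what makes the gazetteer usable for lookup in any of the covered languages.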

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bosca, A., Dini, L. (2011). Automatic Gazetteer Generation from Wikipedia. In: Bernardi, R., Chambers, S., Gottfried, B., Segond, F., Zaihrayeu, I. (eds) Advanced Language Technologies for Digital Libraries. NLP4DL 2009, AT4DL 2009. Lecture Notes in Computer Science, vol 6699. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23160-5_5

  • DOI: https://doi.org/10.1007/978-3-642-23160-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23159-9

  • Online ISBN: 978-3-642-23160-5

  • eBook Packages: Computer Science, Computer Science (R0)
