Automatic Gazetteer Generation from Wikipedia

  • Alessio Bosca
  • Luca Dini
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6699)


A high-quality Named Entity gazetteer is crucial for a Cross-Language Information Retrieval (CLIR) system that provides multilingual access to digital resources, particularly in the domain of Digital Libraries. In this paper we investigate an unsupervised approach for automatically extracting such a resource from Wikipedia: it leverages the DBpedia classification of the English articles to induce the same classification onto encyclopedia pages written in other languages. By exploiting the structured information present in Wikipedia, we furthermore aim to enrich our standard gazetteer with translations into other languages as well as with alternative spellings of the entities.
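
The core idea of the abstract, carrying an English article's DBpedia class over to the corresponding pages in other languages, can be illustrated with a minimal sketch. It assumes that pages are aligned through Wikipedia interlanguage links and that redirect titles stand in for alternative spellings; the function name, the data layout, and all sample values are hypothetical and are not taken from the paper, which may use a different induction procedure.

```python
from collections import defaultdict

# Hypothetical toy inputs; real data would come from DBpedia and Wikipedia dumps.
# English article title -> DBpedia ontology class (illustrative values only).
EN_CLASSES = {
    "Turin": "Place",
    "Dante Alighieri": "Person",
}

# English article title -> {language code: title of the linked article}.
LANGLINKS = {
    "Turin": {"it": "Torino", "fr": "Turin"},
    "Dante Alighieri": {"it": "Dante Alighieri"},
}

# (language, article title) -> redirect titles, treated here as
# alternative spellings (an assumption of this sketch).
REDIRECTS = {
    ("it", "Dante Alighieri"): ["Dante"],
}


def build_gazetteer(en_classes, langlinks, redirects):
    """Propagate English class labels across interlanguage links.

    Returns {language: {title: {"class", "translations", "variants"}}}.
    """
    gazetteer = defaultdict(dict)
    for en_title, ne_class in en_classes.items():
        # All known titles of this entity, keyed by language code.
        titles = {"en": en_title, **langlinks.get(en_title, {})}
        for lang, title in titles.items():
            gazetteer[lang][title] = {
                # The class is inherited from the English article.
                "class": ne_class,
                # Titles of the same entity in the other languages.
                "translations": {
                    other: t for other, t in titles.items() if other != lang
                },
                "variants": sorted(redirects.get((lang, title), [])),
            }
    return dict(gazetteer)


gazetteer = build_gazetteer(EN_CLASSES, LANGLINKS, REDIRECTS)
```

In this sketch, pages without an interlanguage link to an English article receive no class at all; covering them would need an additional step that is not shown here.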


Digital Library, Machine Translation, Search Query, Statistical Machine Translation, Entity Recognition
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Alessio Bosca (1)
  • Luca Dini (1)
  1. CELI s.r.l., Torino, Italy
