Abstract
Controlled vocabularies are important resources for tasks such as machine translation, text summarization, and text analysis. However, developing such resources is expensive and time-consuming. Wikipedia, a free collaborative encyclopedia, contains plenty of semi-structured information that an automatic process can use to create new resources. This paper proposes a method to extract semantic information from Wikipedia in the form of a controlled vocabulary. The method obtains keywords for a specific Wikipedia article by combining three strategies: Wikipedia annotations called wikilinks, a ranking measure that extracts keywords from text, and a dependency parser. To evaluate the method, we analyzed the coverage and performance of the acquired vocabulary using WordNet as a gold standard.
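As a rough illustration of the first two strategies, the sketch below pulls wikilink targets out of raw wikitext and ranks the remaining words by raw frequency. It is not the authors' implementation: the function names, the regular expressions, and the use of plain term frequency as the ranking measure are all assumptions made for this example, and the dependency-parser strategy is left out entirely.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target (before any "#" or "|") is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Matches a whole wikilink and captures its visible text.
WIKILINK_TEXT = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]")


def wikilink_targets(wikitext):
    """Return link targets in order of appearance.

    Namespaced links such as File: or Category: are skipped
    (an assumption of this sketch, not a rule from the paper).
    """
    targets = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if ":" not in target:
            targets.append(target)
    return targets


def rank_terms(wikitext, top_n=5):
    """Rank word tokens by raw frequency, a stand-in ranking measure."""
    plain = WIKILINK_TEXT.sub(r"\1", wikitext)  # keep only the visible link text
    tokens = re.findall(r"[a-z]+", plain.lower())
    counts = Counter(t for t in tokens if len(t) > 3)  # drop very short words
    return [term for term, _ in counts.most_common(top_n)]


def candidate_vocabulary(wikitext, top_n=5):
    """Merge both keyword sources, wikilinks first, without duplicates."""
    merged = []
    for term in wikilink_targets(wikitext) + rank_terms(wikitext, top_n):
        key = term.lower()
        if key not in (m.lower() for m in merged):
            merged.append(term)
    return merged
```

Calling `candidate_vocabulary` on the wikitext of one article yields a flat list of candidate terms; comparing such a list against WordNet entries would be one way to approximate the coverage analysis the abstract mentions.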
Notes
1. The library can be found in the git repository: https://github.com/rdorado79/wikitools.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Dorado, R., Bramy, A., Mejía-Moncayo, C., Rojas, A.E. (2017). Automatic Acquisition of Controlled Vocabularies from Wikipedia Using Wikilinks, Word Ranking, and a Dependency Parser. In: Solano, A., Ordoñez, H. (eds) Advances in Computing. CCC 2017. Communications in Computer and Information Science, vol 735. Springer, Cham. https://doi.org/10.1007/978-3-319-66562-7_3
DOI: https://doi.org/10.1007/978-3-319-66562-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66561-0
Online ISBN: 978-3-319-66562-7
eBook Packages: Computer Science, Computer Science (R0)