Abstract
Relation extraction is a subtask of information extraction that aims at obtaining instances of semantic relations present in texts. This information can be arranged in machine-readable formats, useful for several applications that need structured semantic knowledge. The work presented in this paper explores different strategies to automate the extraction of semantic relations from texts in Portuguese, Galician and Spanish. Both machine learning (distant-supervised and supervised) and rule-based techniques are investigated, and the impact of different levels of linguistic knowledge is analyzed for each approach. Regarding domains, the experiments focus on the extraction of encyclopedic knowledge, by means of the development of biographical relation classifiers (in a closed domain) and the evaluation of an open information extraction tool. To implement the extraction systems, several natural language processing tools have been built for the three research languages, from sentence splitting and tokenization modules to part-of-speech taggers, named entity recognizers and coreference resolution systems. Furthermore, several lexica and corpora have been compiled and enriched with different levels of linguistic annotation, which are useful for both training and testing probabilistic and symbolic models. As a result of this work, new resources and tools are available for the automated processing of texts in Portuguese, Galician and Spanish.
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness through the project FFI2014-51978-C2-1-R, and by a Juan de la Cierva formación grant, reference FJCI-2014-22853.
Notes
1. A possible English translation could be: “John A. Garcia (born in 1949 in Galicia) is one of the pioneers of the modern American computer game industry and the current president of Novalogic.”
2. All of them are freely available at http://gramatica.usc.es/~marcos/phd.html.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Garcia, M. (2016). Semantic Relation Extraction. Resources, Tools and Strategies. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_15
DOI: https://doi.org/10.1007/978-3-319-41552-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)