Abstract
Massive text data can help users obtain information and expand the boundaries of human knowledge, but most of it cannot be easily processed or understood by computers. According to statistics, more than 80% of the text information on the Internet is unstructured. Such unstructured text greatly increases the difficulty and cost of obtaining information. There is therefore an urgent need for technology that can automatically analyze unstructured text, mine relevant and valuable knowledge from it, and present the results to users in a structured form. Information extraction (IE) technology emerged to meet this need. This chapter first introduces the concepts and history of IE and then details typical methods for IE tasks such as entity recognition, entity disambiguation, relation extraction, and event extraction.
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6. Multiview refers to multiple views of data, such as speech and vision views in videos. The two views are independent of each other and can be regarded as two dimensions of the data.
- 7. This idea is very similar to the PageRank algorithm, in which the importance of a webpage is determined by the pages linking to it.
- 8. The entity category can be retrieved from the knowledge base and is generally expressed by a phrase. For example, Donald Trump’s entity category is president of the United States.
- 9. The transition is unidirectional: there is no transition from the entity to the mention.
- 10. The mention is usually a person’s name, which is ambiguous because it corresponds to different entities in different contexts. For example, Michael Jordan refers to different entities in different documents. The clustering-based entity disambiguation method clusters all occurrences of Michael Jordan in the document set.
- 11. CFG stands for context-free grammar, e.g., VP → PP VP.
Copyright information
© 2021 Tsinghua University Press
About this chapter
Cite this chapter
Zong, C., Xia, R., Zhang, J. (2021). Information Extraction. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_10
DOI: https://doi.org/10.1007/978-981-16-0100-2_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0099-9
Online ISBN: 978-981-16-0100-2
eBook Packages: Computer Science, Computer Science (R0)