Information Extraction

Text Data Mining

Abstract

Massive text data can help users obtain information and expand the boundaries of human knowledge, but most text data cannot be easily processed and understood by computers. According to statistics, more than 80% of the text information on the Internet is unstructured. Such unstructured text greatly increases the difficulty and cost of obtaining information. There is therefore an urgent need for technology that can automatically analyze unstructured text data, mine relevant and valuable knowledge from it, and present the results to users in structured form. Information extraction (IE) technology emerged to meet this need. This chapter first introduces the concepts and history of IE and then details typical methods for IE tasks such as entity recognition, entity disambiguation, relation extraction, and event extraction.

Notes

  1. https://www.ldc.upenn.edu/collaborations/past-projects/ace.

  2. https://www.ldc.upenn.edu/collaborations/past-projects/ace.

  3. https://www.signll.org/conll.

  4. https://www.signll.org/conll.

  5. https://taku910.github.io/crfpp.

  6. Multiview refers to multiple views of data, such as the speech and vision views in videos. The two views are independent of each other and can be regarded as two dimensions of the data.

  7. This idea is very similar to the PageRank algorithm, in which the importance of a webpage is determined by the pages linking to it.

  8. The entity category can be retrieved from the knowledge base and is generally expressed by a phrase. For example, Donald Trump’s entity category is president of the United States.

  9. The transition is unidirectional: there is no transition from the entity to the mention.

  10. The mention is usually a person’s name that is ambiguous because it corresponds to different entities in different contexts. For example, Michael Jordan corresponds to different entities in different documents. The clustering-based entity disambiguation method clusters all appearances of Michael Jordan in the document set.

  11. CFG stands for context-free grammar, e.g., VP → PP VP.

Copyright information

© 2021 Tsinghua University Press

About this chapter

Cite this chapter

Zong, C., Xia, R., Zhang, J. (2021). Information Extraction. In: Text Data Mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-0100-2_10

  • DOI: https://doi.org/10.1007/978-981-16-0100-2_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-0099-9

  • Online ISBN: 978-981-16-0100-2

  • eBook Packages: Computer Science (R0)
