Web Information Extraction

Chiticariu, Laura; Danilevsky, Marina; Ho, Howard; Krishnamurthy, Rajasekar; Li, Yunyao; Raghavan, Sriram; Reiss, Frederick; Vaithyanathan, Shivakumar; Zhu, Huaiyu

doi:10.1007/978-1-4899-7993-3_459-2

Living reference work entry
First Online: 01 January 2017

Synonyms

Information extraction; Text analytics

Definition

Information extraction (IE) is the process of automatically extracting structured pieces of information from unstructured or semi-structured text documents. Classical problems in information extraction include named-entity recognition (identifying mentions of persons, places, organizations, etc.) and relationship extraction (identifying mentions of relationships between such named entities). Web information extraction is the application of IE techniques to process the vast amounts of unstructured content on the Web. Due to the nature of the content on the Web, in addition to named-entity and relationship extraction, there is growing interest in more complex tasks such as extraction of reviews, opinions, and sentiments.
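To make the two classical tasks concrete, the following is a minimal sketch of dictionary-based named-entity recognition followed by a pattern-based relationship extractor. It is an illustration only, not the method of any system discussed in this entry: the dictionaries, the sample sentence, the `employed_by` relation label, and the names `Mention`, `find_mentions`, and `find_employment` are all hypothetical, and a real Web-scale extractor would need far richer rules or learned models.

```python
import re
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """A span of text labeled with an entity type."""
    kind: str
    text: str
    start: int
    end: int


# Hypothetical toy dictionaries; real systems use much larger resources.
PERSONS = ("Ada Lovelace", "Alan Turing")
ORGANIZATIONS = ("Acme Corp", "Example University")

# Hypothetical trigger phrases that signal an employment relationship.
EMPLOYMENT_CUE = re.compile(r"\b(?:works (?:at|for)|joined)\b")


def find_mentions(text: str) -> List[Mention]:
    """Named-entity recognition by exact dictionary lookup."""
    mentions = []
    for kind, names in (("Person", PERSONS), ("Organization", ORGANIZATIONS)):
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                mentions.append(Mention(kind, m.group(), m.start(), m.end()))
    # Return mentions in document order.
    return sorted(mentions, key=lambda m: m.start)


def find_employment(text: str, mentions: List[Mention],
                    max_gap: int = 30) -> List[Tuple[str, str, str]]:
    """Relationship extraction: a Person followed, within max_gap
    characters, by a cue phrase and then an Organization."""
    relations = []
    for person in (m for m in mentions if m.kind == "Person"):
        for org in (m for m in mentions if m.kind == "Organization"):
            if org.start < person.end:
                continue  # the organization must come after the person
            between = text[person.end:org.start]
            if len(between) <= max_gap and EMPLOYMENT_CUE.search(between):
                relations.append((person.text, "employed_by", org.text))
    return relations


SAMPLE = ("Ada Lovelace joined Acme Corp in 2015. "
          "Alan Turing lectures near Example University.")

if __name__ == "__main__":
    found = find_mentions(SAMPLE)
    for mention in found:
        print(mention)
    print(find_employment(SAMPLE, found))
```

The output of each step is a set of structured records (typed spans, then typed tuples) derived from free text, which is the sense of "structured pieces of information" in the definition above. The second sentence of the sample contains both a person and an organization but no cue phrase, so the sketch reports no relationship for it.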

Historical Background

Historically, information extraction was studied by the Natural Language Processing community in the context of identifying organizations, locations, and person names in news...

Author information

Authors and Affiliations

IBM Almaden Research Center, San Jose, CA, USA
Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan & Huaiyu Zhu

Corresponding author

Correspondence to Laura Chiticariu.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, Georgia, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, Ontario, Canada
M. Tamer Özsu

Section Editor information

Google Research, 76th 9th Ave, 10018, New York, NY, USA
Cong Yu, Research Scientist

About this entry

Cite this entry

Chiticariu, L. et al. (2016). Web Information Extraction. In: Liu, L., Özsu, M. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7993-3_459-2

Received: 29 August 2014
Accepted: 14 June 2016
Published: 27 January 2017
Publisher Name: Springer, New York, NY
Online ISBN: 978-1-4899-7993-3
eBook Packages: Springer Reference Computer Sciences; Reference Module Computer Science and Engineering
