Using Grammatical Inference to Automate Information Extraction from the Web

Hong, Theodore W.; Clark, Keith L.

doi:10.1007/3-540-44794-6_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2168)

Included in the following conference series: European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

The World-Wide Web contains a wealth of semistructured information sources that often give partial/overlapping views on the same domains, such as real estate listings or book prices. These partial sources could be used more effectively if integrated into a single view; however, since they are typically formatted in diverse ways for human viewing, extracting their data for integration is a difficult challenge. Existing learning systems for this task generally use hardcoded ad hoc heuristics, are restricted in the domains and structures they can recognize, and/or require manual training. We describe a principled method for automatically generating extraction wrappers using grammatical inference that can recognize general structures and does not rely on manually-labelled examples. Domain-specific knowledge is explicitly separated out in the form of declarative rules. The method is demonstrated in a test setting by extracting real estate listings from web pages and integrating them into an interactive data visualization tool based on dynamic queries.
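
The abstract does not spell out the inference procedure, so the following is only a toy sketch of the general idea, not the authors' algorithm. It assumes a page can be abstracted into a sequence of tag and text symbols, and it infers the simplest kind of grammar rule, an iteration `(unit)+`, by searching for the contiguous repeated unit that covers the most symbols. All names (`TagTokenizer`, `infer_repeat`, `extract_records`) are hypothetical and introduced purely for illustration.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Reduce a page to structural symbols, keeping text values alongside."""

    def __init__(self):
        super().__init__()
        self.symbols = []  # e.g. "<li>", "</li>", "TEXT"
        self.values = []   # stripped text for TEXT symbols, None for tags

    def handle_starttag(self, tag, attrs):
        self.symbols.append(f"<{tag}>")
        self.values.append(None)

    def handle_endtag(self, tag):
        self.symbols.append(f"</{tag}>")
        self.values.append(None)

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.symbols.append("TEXT")
            self.values.append(data.strip())


def infer_repeat(symbols):
    """Return (start, unit, count) for the contiguous repeated unit that
    covers the most symbols, or (0, [], 0) if nothing repeats."""
    best = (0, [], 0)
    n = len(symbols)
    for start in range(n):
        for size in range(1, (n - start) // 2 + 1):
            unit = symbols[start:start + size]
            count = 1
            # Count back-to-back copies of the candidate unit.
            while symbols[start + count * size:start + (count + 1) * size] == unit:
                count += 1
            if count >= 2 and count * size > best[2] * len(best[1]):
                best = (start, unit, count)
    return best


def extract_records(html):
    """Use the inferred repetition as a wrapper: one record per repeated unit,
    each record being the list of text fields found inside it."""
    tokenizer = TagTokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    start, unit, count = infer_repeat(tokenizer.symbols)
    size = len(unit)
    records = []
    for i in range(count):
        chunk = tokenizer.values[start + i * size:start + (i + 1) * size]
        records.append([value for value in chunk if value is not None])
    return records


if __name__ == "__main__":
    page = "<ul><li><b>A</b> 1</li><li><b>B</b> 2</li><li><b>C</b> 3</li></ul>"
    print(extract_records(page))  # [['A', '1'], ['B', '2'], ['C', '3']]
```

No labelled examples are supplied here: the record boundary is recovered from the regularity of the markup alone. A real system would need to handle optional fields, nesting, and noise, and would apply separate domain rules to decide what each field means; none of that is attempted in this sketch.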

Author information

Authors and Affiliations

Department of Computing, Imperial College of Science, Technology, and Medicine, 180 Queen’s Gate, London, SW7 2BZ, UK
Theodore W. Hong & Keith L. Clark

Editor information

Editors and Affiliations

Department of Computer Science, Albert-Ludwigs University Freiburg, Georges Köhler-Allee, Geb. 079, 79110, Freiburg, Germany
Luc De Raedt
Inst. of Information and Computing Sciences, Dept. of Mathematics and Computer Science, University of Utrecht, Padualaan 14, de Uithof, 3508 TB Utrecht, The Netherlands
Arno Siebes

About this paper

Cite this paper

Hong, T.W., Clark, K.L. (2001). Using Grammatical Inference to Automate Information Extraction from the Web. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_18

DOI: https://doi.org/10.1007/3-540-44794-6_18
Published: 28 August 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive
