Abstract
The World-Wide Web contains a wealth of semistructured information sources that often give partial/overlapping views on the same domains, such as real estate listings or book prices. These partial sources could be used more effectively if integrated into a single view; however, since they are typically formatted in diverse ways for human viewing, extracting their data for integration is a difficult challenge. Existing learning systems for this task generally use hardcoded ad hoc heuristics, are restricted in the domains and structures they can recognize, and/or require manual training. We describe a principled method for automatically generating extraction wrappers using grammatical inference that can recognize general structures and does not rely on manually-labelled examples. Domain-specific knowledge is explicitly separated out in the form of declarative rules. The method is demonstrated in a test setting by extracting real estate listings from web pages and integrating them into an interactive data visualization tool based on dynamic queries.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hong, T.W., Clark, K.L. (2001). Using Grammatical Inference to Automate Information Extraction from the Web. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_18
DOI: https://doi.org/10.1007/3-540-44794-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive