Extracting Information from Semistructured Data

Ma, Liping; Shepherd, John; Zhang, Yanchun

doi:10.1007/3-540-45703-8_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2419)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

This paper describes work towards automatically building on-line structured information resources from sources that consist largely of natural language but follow some structuring conventions. The conversion requires two phases: identifying the regions of each incoming document, and mapping the information those regions contain into a more structured form. We describe a system that uses decision-tree-based machine learning to build a classifier that accurately identifies document regions, and we discuss pattern-discovery methods for extracting information from the identified regions. Experiments demonstrate that this approach works well.
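The two phases named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not give the features, the training data, or the learner's details, so the sketch uses a small ID3-style tree over four hypothetical boolean line features, made-up training lines, and a hand-written regular expression standing in for a discovered pattern. It is not the authors' system.

```python
import math
import re
from collections import Counter

def features(line):
    """Hypothetical boolean features of one text line (not taken from the paper)."""
    return {
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", line)),
        "has_comma": "," in line,
        "starts_with_in": line.startswith("In "),
        "ends_with_period": line.rstrip().endswith("."),
    }

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, names):
    """ID3-style induction: split on the feature with the highest information gain."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not names:
        return majority  # leaf node: a region label

    def gain(name):
        remainder = 0.0
        for value in (True, False):
            subset = [lab for row, lab in zip(rows, labels) if row[name] == value]
            if subset:
                remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(names, key=gain)
    rest = [n for n in names if n != best]
    branches = {}
    for value in (True, False):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        if idx:
            branches[value] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx], rest
            )
        else:
            branches[value] = majority  # no training rows: fall back to majority
    return (best, branches)

def classify(tree, row):
    """Phase 1: walk the tree until a leaf (a region label) is reached."""
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[row[name]]
    return tree

def extract_venue(line):
    """Phase 2 stand-in: a hand-written pattern mapping a venue line to fields."""
    match = re.match(r"In (?P<name>.+), (?P<year>\d{4})$", line)
    return match.groupdict() if match else None

# Made-up training lines with hand-assigned region labels.
TRAIN = [
    ("A. Author, B. Writer, C. Scribe", "author"),
    ("Learning regions in plain text documents.", "title"),
    ("In Workshop on Text Mining, 1999", "venue"),
    ("D. Person, E. Human", "author"),
    ("A survey of wrapper construction.", "title"),
    ("In Journal of Examples, 2001", "venue"),
]

ROWS = [features(text) for text, _ in TRAIN]
LABELS = [label for _, label in TRAIN]
TREE = build_tree(ROWS, LABELS, list(ROWS[0]))

line = "In Conference on Things, 2000"
if classify(TREE, features(line)) == "venue":
    print(extract_venue(line))
```

The region classifier runs first and the extraction pattern is applied only to lines labelled as the matching region, which mirrors the order of the two phases in the abstract. A real system would need far richer features and training data than this toy set.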

Author information

Authors and Affiliations

School of Computer Science and Engineering, The University of New South Wales, Australia
Liping Ma & John Shepherd
School of Computer Science, University of Tasmania, Australia
Yanchun Zhang

Editor information

Editors and Affiliations

Information School, Renmin University of China, Beijing, 100872, China
Xiaofeng Meng
Department of Computer Science, University of California, Santa Barbara, CA, 93106-5110, USA
Jianwen Su & Yujun Wang

About this paper

Cite this paper

Ma, L., Shepherd, J., Zhang, Y. (2002). Extracting Information from Semistructured Data. In: Meng, X., Su, J., Wang, Y. (eds) Advances in Web-Age Information Management. WAIM 2002. Lecture Notes in Computer Science, vol 2419. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45703-8_13

DOI: https://doi.org/10.1007/3-540-45703-8_13
Published: 21 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44045-1
Online ISBN: 978-3-540-45703-9
eBook Packages: Springer Book Archive
