A Conceptual-Modeling Approach to Extracting Data from the Web

  • D. W. Embley
  • D. M. Campbell
  • Y. S. Jiang
  • S. W. Liddle
  • Y.-K. Ng
  • D. W. Quass
  • R. D. Smith
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1507)

Abstract

Electronically available data on the Web is exploding at an ever-increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data. The approach is based on an ontology – a conceptual model instance – that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth.
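
The pipeline described in the abstract (parse an ontology, generate a database scheme and recognizers, then extract and structure data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy car-advertisement ontology, its regular expressions and context keywords, and the function names `generate_scheme`, `build_recognizers`, and `extract` are hypothetical and are not taken from the paper, whose actual ontology language and heuristics are not shown on this page.

```python
import re

# Hypothetical toy ontology for car advertisements. The attribute names,
# patterns (lexical appearance), and context keywords are illustrative
# assumptions, not the ontology used by the authors.
ONTOLOGY = {
    "entity": "Car",
    "attributes": {
        "Year": {"pattern": r"\b(?:19|20)\d{2}\b", "keywords": []},
        "Make": {"pattern": r"\b(?:Ford|Honda|Toyota)\b", "keywords": []},
        "Price": {"pattern": r"\$\d{1,3}(?:,\d{3})*", "keywords": ["asking", "price"]},
        "Phone": {"pattern": r"\b\d{3}-\d{4}\b", "keywords": ["call"]},
    },
}


def generate_scheme(ontology):
    """Derive a single-relation database scheme from the ontology."""
    columns = ", ".join(f"{name} TEXT" for name in ontology["attributes"])
    return f"CREATE TABLE {ontology['entity']} ({columns});"


def build_recognizers(ontology):
    """Compile one constant recognizer per attribute, paired with its keywords."""
    return {
        name: (re.compile(spec["pattern"]), [k.lower() for k in spec["keywords"]])
        for name, spec in ontology["attributes"].items()
    }


def extract(text, recognizers, window=20):
    """Return one record: the first matching constant for each attribute.

    If an attribute lists context keywords, a match is accepted only when
    one of them occurs within `window` characters before the constant.
    """
    lowered = text.lower()
    record = {}
    for name, (regex, keywords) in recognizers.items():
        record[name] = None
        for match in regex.finditer(text):
            context = lowered[max(0, match.start() - window):match.start()]
            if not keywords or any(k in context for k in keywords):
                record[name] = match.group()
                break
    return record


if __name__ == "__main__":
    ad = "1995 Honda Civic, runs great. Asking $4,500 obo. Call 555-1234 after 6pm."
    print(generate_scheme(ONTOLOGY))
    print(extract(ad, build_recognizers(ONTOLOGY)))
```

The sketch treats each document as a single record and keeps only the first acceptable match per attribute; handling several records per page or many-valued attributes would need additional logic that is not attempted here.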

Keywords

data extraction, data structuring, unstructured data, data-rich document, World-Wide Web, ontology, ontological conceptual modeling

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • D. W. Embley (1)
  • D. M. Campbell (1)
  • Y. S. Jiang (1)
  • S. W. Liddle (2)
  • Y.-K. Ng (1)
  • D. W. Quass (2)
  • R. D. Smith (1)

  1. Department of Computer Science, Brigham Young University, Provo, U.S.A.
  2. School of Accountancy and Information Systems, Brigham Young University, Provo, U.S.A.
