Recognizing Ontology-Applicable Multiple-Record Web Documents

  • David W. Embley
  • Yiu-Kai Ng
  • Li Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2224)


Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percentage of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable for a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
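The three heuristic measurements described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular-expression recognizers, the function names, the use of cosine similarity for the expected-value comparison, and the particular grouping score are all assumptions chosen for illustration. The learned rules that combine the measurements are not shown.

```python
import math
import re
from collections import Counter

# Hypothetical recognizers (assumption): one regular expression per kind of value.
RECOGNIZERS = {
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "Price": re.compile(r"\$\d[\d,]*"),
    "Phone": re.compile(r"\b\d{3}-\d{4}\b"),
}


def recognize(text):
    """Return (start, end, kind) for every recognized value, in document order."""
    hits = []
    for kind, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), kind))
    return sorted(hits)


def density(text, hits):
    """Fraction of the document's characters covered by recognized values."""
    if not text:
        return 0.0
    covered = set()
    for start, end, _ in hits:
        covered.update(range(start, end))
    return len(covered) / len(text)


def expected_value_similarity(hits, expected_per_record):
    """Cosine similarity between observed and expected counts per kind (assumed measure)."""
    observed = Counter(kind for _, _, kind in hits)
    kinds = sorted(expected_per_record)
    obs = [observed.get(kind, 0) for kind in kinds]
    exp = [expected_per_record[kind] for kind in kinds]
    dot = sum(a * b for a, b in zip(obs, exp))
    norm = math.sqrt(sum(a * a for a in obs)) * math.sqrt(sum(b * b for b in exp))
    return dot / norm if norm else 0.0


def grouping(hits, record_kinds):
    """Average share of distinct kinds in consecutive groups of len(record_kinds) values.

    A score near 1.0 suggests the values repeat in record-like groups (assumed measure).
    """
    kinds = [kind for _, _, kind in hits if kind in record_kinds]
    size = len(record_kinds)
    groups = [kinds[i:i + size] for i in range(0, len(kinds), size)]
    if not groups:
        return 0.0
    return sum(len(set(group)) for group in groups) / (size * len(groups))
```

In a full system, the three numbers produced for each training document would be passed to a rule learner, which would then decide applicability for unseen documents.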


Keywords: Data Frame, Vector Space Model, Participation Constraint, Universal Rule, Document Vector





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • David W. Embley (1)
  • Yiu-Kai Ng (1)
  • Li Xu (1)
  1. Dept. of Computer Science, Brigham Young University, Provo, USA
