Recognizing Ontology-Applicable Multiple-Record Web Documents
Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percentage of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable for a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
Keywords: Data Frame, Vector Space Model, Participation Constraint, Universal Rule, Document Vector