A Case-Based Recognition of Semantic Structures in HTML Documents

An Automated Transformation from HTML to XML
  • Masayuki Umehara
  • Koji Iwanuma
  • Hidetomo Nabeshima
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2412)


The recognition and extraction of semantic/logical structures in HTML documents are important and difficult tasks for intelligent document processing. In this paper, we show that alignment is well suited to recognizing the characteristic semantic/logical structures of a series of HTML documents within a framework of case-based reasoning. That is, given a series of HTML documents and a sample transformation of one HTML document into an XML format, alignment can identify the semantic/logical structures in the remaining HTML documents of the series by matching the text-block sequence of each remaining document against that of the sample transformation. Several important properties of texts, such as continuity and sequentiality, can be exploited naturally by alignment. Alignment can significantly improve the case-based transformation method, which transforms a spatial/temporal series of HTML documents into machine-readable XML formats. Through experimental evaluations, we show that the case-based method with alignment achieves highly accurate transformation of HTML documents into XML.
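The matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the alignment algorithm or the similarity measure, so everything below is an assumption: a standard global alignment by dynamic programming, a word-set Jaccard similarity between text blocks, and a fixed gap penalty. The names `similarity`, `align`, and `transfer_labels` are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed, simplistic measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def align(sample: List[str], target: List[str],
          gap: float = -0.5) -> List[Tuple[Optional[int], Optional[int]]]:
    """Globally align two text-block sequences, preserving their order.

    Returns (sample_index, target_index) pairs; None marks a gap.
    """
    n, m = len(sample), len(target)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Map similarity in [0, 1] to a match score in [-1, 1].
            pair = 2 * similarity(sample[i - 1], target[j - 1]) - 1
            options = [
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Follow the stored back-pointers from the end to recover the alignment.
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def transfer_labels(sample_blocks: List[str], sample_tags: List[str],
                    target_blocks: List[str]) -> List[Tuple[Optional[str], str]]:
    """Give each target block the XML tag of the sample block it aligns with."""
    pairs = align(sample_blocks, target_blocks)
    return [(sample_tags[i] if i is not None else None, target_blocks[j])
            for i, j in pairs if j is not None]


# Hypothetical sample transformation: text blocks of one page and their XML tags.
sample_blocks = ["Seminar announcement", "Speaker: Alice Smith",
                 "Date: 3 May 2002", "Room 101"]
sample_tags = ["title", "speaker", "date", "place"]
# Text blocks of another page in the same series.
target_blocks = ["Seminar announcement", "Speaker: Bob Jones",
                 "Date: 10 May 2002", "Refreshments provided", "Room 204"]

for tag, block in transfer_labels(sample_blocks, sample_tags, target_blocks):
    print(tag, "|", block)
```

Because the alignment preserves block order, a target block with no counterpart in the sample is returned with a tag of `None` rather than being forced onto an unrelated tag. This is one way the sequentiality mentioned above could be put to use; a realistic system would need a richer similarity measure than word overlap.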


Semantic Structure; Text Block; Basement Floor; Sample Transformation; Alignment Technology
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Masayuki Umehara (1)
  • Koji Iwanuma (1)
  • Hidetomo Nabeshima (1)
  1. Yamanashi University, Kofu-shi, Japan
