Abstract
When we describe a Web page informally, we often use phrases like “it looks like a newspaper site”, “there are several unordered lists” or “it's just a collection of links”. Unfortunately, no Web search or classification tools provide the capability to retrieve information using such informal descriptions that are based on the appearance, i.e., structure, of the Web page. In this paper, we take a look at the concept of structurally similar Web pages. We note that some structural properties can be identified with semantic properties of the data and provide measures for comparison between HTML documents.
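To make the idea of a structural measure concrete, the following is a minimal sketch and not the measure defined in the paper. It assumes that a page's structure can be approximated by the sequence of its opening HTML tags, and that two pages can be compared by the edit distance between those sequences. The names `TagCollector`, `tag_sequence`, `edit_distance`, and `structural_similarity` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs, assumed)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def structural_similarity(html_a, html_b):
    """Score in [0, 1]; 1.0 means identical tag sequences (illustrative only)."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Under these assumptions, two pages that are both "just an unordered list" score 1.0 regardless of their text, while a list page and a page consisting of a paragraph with a link score much lower. A tree-based comparison would capture nesting that this flat sequence ignores.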
Research supported in part by the National Science Foundation under CAREER Award IRI-9896052. URLs: www.cs.wpi.edu/~ifc and casa.wpi.edu
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cruz, I.F., Borisov, S., Marks, M.A., Webb, T.R. (1998). Measuring structural similarity among web documents: preliminary results. In: Hersch, R.D., André, J., Brown, H. (eds) Electronic Publishing, Artistic Imaging, and Digital Typography. RIDT 1998. Lecture Notes in Computer Science, vol 1375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0053296
DOI: https://doi.org/10.1007/BFb0053296
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64298-5
Online ISBN: 978-3-540-69718-3
eBook Packages: Springer Book Archive