Clustering XML Documents by Structure

Dalamagas, Theodore; Cheng, Tao; Winkel, Klaas-Jan; Sellis, Timos

doi:10.1007/978-3-540-24674-9_13


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3025)

Included in the following conference series:

Hellenic Conference on Artificial Intelligence

Abstract

This work explores clustering methods for grouping structurally similar XML documents. We model XML documents as rooted, ordered, labeled trees and apply clustering algorithms with distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We propose using tree structural summaries to improve the performance of the distance calculation while maintaining, or even improving, its quality. Experimental results are provided using a prototype testbed.
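The tree model and a structural distance described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance shown is a simple top-down edit distance (relabel a node, insert or delete whole subtrees, unit cost per node), and `summarize` is a hypothetical summary that only collapses consecutive identical sibling subtrees. The abstract does not specify the paper's actual distance or summary construction; the function names `to_tree`, `size`, `tree_distance`, and `summarize` are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache


def to_tree(xml_text):
    """Parse XML text into a rooted ordered labeled tree.

    A tree is a nested tuple (label, (child, child, ...)); only element
    tags are kept, so text and attributes are ignored.
    """
    def build(elem):
        return (elem.tag, tuple(build(child) for child in elem))
    return build(ET.fromstring(xml_text))


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def tree_distance(a, b):
    """Top-down edit distance between two trees (assumed cost model).

    Roots are matched to each other (cost 1 if labels differ). Child
    sequences are aligned with a string-edit-style dynamic program in
    which inserting or deleting a subtree costs its node count and
    matching two subtrees costs their recursive distance.
    """
    relabel = 0 if a[0] == b[0] else 1
    kids_a, kids_b = a[1], b[1]
    rows, cols = len(kids_a) + 1, len(kids_b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),      # delete subtree
                d[i][j - 1] + size(kids_b[j - 1]),      # insert subtree
                d[i - 1][j - 1]
                + tree_distance(kids_a[i - 1], kids_b[j - 1]),  # match
            )
    return relabel + d[-1][-1]


def summarize(tree):
    """Hypothetical structural summary: collapse runs of identical
    consecutive sibling subtrees into a single subtree."""
    label, children = tree
    kept = []
    for child in map(summarize, children):
        if not kept or kept[-1] != child:
            kept.append(child)
    return (label, tuple(kept))
```

Under these assumptions, two documents that differ only in how often an element repeats (for example, one `author` versus two) are at distance 1 on the raw trees and at distance 0 on their summaries, and the summaries are never larger than the original trees. A pairwise distance matrix computed this way could then be passed to any standard clustering algorithm.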

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, National Technical University of Athens, Greece
Theodore Dalamagas & Timos Sellis
Dept. of Computer Science, University of California, Santa Barbara, USA
Tao Cheng
Faculty of Computer Science, University of Twente, the Netherlands
Klaas-Jan Winkel

Editor information

Editors and Affiliations

Information and Communication Systems Engineering, Aegean University, 83200, Karlovassi, Samos, Greece
George A. Vouros
Department of Informatics, University of Piraeus, Piraeus, Greece
Themistoklis Panayiotopoulos

About this paper

Cite this paper

Dalamagas, T., Cheng, T., Winkel, KJ., Sellis, T. (2004). Clustering XML Documents by Structure. In: Vouros, G.A., Panayiotopoulos, T. (eds) Methods and Applications of Artificial Intelligence. SETN 2004. Lecture Notes in Computer Science, vol 3025. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24674-9_13

DOI: https://doi.org/10.1007/978-3-540-24674-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21937-8
Online ISBN: 978-3-540-24674-9
eBook Packages: Springer Book Archive
