Clustering XML Documents by Structure

  • Conference paper
Methods and Applications of Artificial Intelligence (SETN 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3025)

Abstract

This work explores clustering methods for grouping structurally similar XML documents. We model XML documents as rooted, ordered, labeled trees and apply clustering algorithms with distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We propose using tree structural summaries to improve the performance of the distance calculation while maintaining or even improving its quality. Experimental results obtained with a prototype testbed are provided.
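The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the summary construction, the distance, or the clustering algorithm, so everything below is an assumption made for illustration. The hypothetical `summarize` merges same-label siblings, the hypothetical `distance` is a simple top-down (Selkow-style) tree edit distance with unit node costs, and the hypothetical `cluster` is a threshold-based single-link grouping.

```python
import xml.etree.ElementTree as ET


def to_tree(elem):
    """Model an XML element as a rooted ordered labeled tree: (label, children)."""
    return (elem.tag, tuple(to_tree(child) for child in elem))


def parse(xml_text):
    """Parse an XML string into the (label, children) representation."""
    return to_tree(ET.fromstring(xml_text))


def summarize(tree):
    """Assumed structural summary: merge sibling subtrees that share a label.

    Children of merged siblings are concatenated and summarized recursively,
    so repeated elements collapse into a single representative.
    """
    label, children = tree
    merged = {}  # label -> concatenated grandchildren, in first-seen order
    for child_label, grandchildren in children:
        merged.setdefault(child_label, []).extend(grandchildren)
    return (label, tuple(summarize((l, tuple(g))) for l, g in merged.items()))


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def distance(a, b):
    """Assumed top-down tree edit distance with unit cost per node.

    Relabeling the root costs 1; inserting or deleting a child costs the
    size of its whole subtree; matched children are compared recursively.
    """
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),                    # delete subtree
                d[i][j - 1] + size(cb[j - 1]),                    # insert subtree
                d[i - 1][j - 1] + distance(ca[i - 1], cb[j - 1]),  # match children
            )
    return relabel + d[m][n]


def cluster(trees, threshold):
    """Assumed single-link grouping: trees within `threshold` share a label."""
    labels = list(range(len(trees)))
    for i in range(len(trees)):
        for j in range(i + 1, len(trees)):
            if distance(trees[i], trees[j]) <= threshold:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return labels


docs = [
    "<a><b/><b/><c><d/></c></a>",
    "<a><b/><c><d/><d/></c></a>",
    "<x><y/></x>",
]
trees = [parse(doc) for doc in docs]
summaries = [summarize(tree) for tree in trees]
print(distance(trees[0], trees[1]), distance(summaries[0], summaries[1]))
print(cluster(summaries, threshold=1))
```

In this toy input the first two documents differ only in how often an element repeats, so their summaries coincide and the distance on summaries drops to zero, while the third document stays in its own group. Summaries are also smaller than the original trees, which is where a speedup in the distance calculation would come from under these assumptions.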

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dalamagas, T., Cheng, T., Winkel, KJ., Sellis, T. (2004). Clustering XML Documents by Structure. In: Vouros, G.A., Panayiotopoulos, T. (eds) Methods and Applications of Artificial Intelligence. SETN 2004. Lecture Notes in Computer Science, vol 3025. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24674-9_13

  • DOI: https://doi.org/10.1007/978-3-540-24674-9_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21937-8

  • Online ISBN: 978-3-540-24674-9

  • eBook Packages: Springer Book Archive
