This paper addresses the problem of semantically clustering the growing number of schemaless XML documents. In our approach, each document in a collection is first represented by a macro-path sequence. Next, a similarity matrix for the collection is constructed by computing the pairwise similarity between these macro-path sequences. Finally, the desired clusters are built using hierarchical clustering. Experimental results are also presented.
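
The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the macro-path sequence or the similarity measure, so the sketch assumes root-to-leaf tag paths as a stand-in for the sequence, Jaccard overlap of path sets as the similarity, and average-linkage agglomerative clustering with a stopping threshold. All function names (`path_sequence`, `similarity`, `similarity_matrix`, `hierarchical_clusters`) and the sample documents are hypothetical, and no semantic matching of tag names is attempted.

```python
import xml.etree.ElementTree as ET


def path_sequence(xml_text):
    """Root-to-leaf tag paths in document order (assumed stand-in for a macro-path sequence)."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.append(current)  # leaf element: record the full path
        for child in children:
            walk(child, current)

    walk(root, "")
    return paths


def similarity(seq_a, seq_b):
    """Jaccard overlap of the two path sets (an assumed measure, not the paper's)."""
    a, b = set(seq_a), set(seq_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(sequences):
    """Symmetric matrix of pairwise similarities between path sequences."""
    n = len(sequences)
    return [[similarity(sequences[i], sequences[j]) for j in range(n)] for i in range(n)]


def hierarchical_clusters(matrix, threshold):
    """Average-linkage agglomerative clustering; stops when no pair reaches the threshold."""
    clusters = [[i] for i in range(len(matrix))]

    def linkage(c1, c2):
        return sum(matrix[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = linkage(clusters[x], clusters[y])
                if score > best:
                    best, pair = score, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]  # merge the most similar pair
        del clusters[y]
    return clusters


# Hypothetical sample collection: two book-like documents and one order document.
docs = [
    "<book><title>A</title><author>B</author></book>",
    "<book><title>C</title><author>D</author><year>2003</year></book>",
    "<order><item><price>1</price></item></order>",
]
sequences = [path_sequence(d) for d in docs]
matrix = similarity_matrix(sequences)
clusters = hierarchical_clusters(matrix, threshold=0.5)
print(clusters)
```

With this sample collection the two book documents share two of three distinct paths and are merged, while the order document stays in its own cluster. The threshold of 0.5 is an arbitrary choice for illustration.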


Content Node; Weight Mechanism; Tree Inclusion; Bitmap Indexing; Path Sequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Yun Shen (1)
  • Bing Wang (1)
  1. Department of Computer Science, University of Hull, Hull, UK
