Abstract
We propose a novel methodology for clustering XML documents on the basis of their structural similarities. The idea is to equip each cluster with an XML cluster representative, i.e. an XML document subsuming the most typical structural specifics of a set of XML documents. Clustering is essentially accomplished by comparing cluster representatives, and updating the representatives as soon as new clusters are detected. We present an algorithm for the computation of an XML representative based on suitable techniques for identifying significant node matchings and for reliably merging and pruning XML trees. Experimental evaluation performed on both synthetic and real data shows the effectiveness of our approach.
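As a rough illustration of the representative idea, the sketch below merges the structure of several XML documents and prunes the parts that occur rarely. It is a deliberately simplified stand-in and not the algorithm of the paper: it assumes that structure can be summarized by root-to-node tag paths, that merging is a support count over those paths, and that pruning is a frequency cut-off. The names `tag_paths`, `representative`, `similarity` and the parameter `min_support` are hypothetical and introduced only for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def representative(docs, min_support=0.5):
    """Merge the path sets of all documents, then prune infrequent paths.

    Assumption: a path belongs to the representative if it occurs in at
    least `min_support` (a fraction) of the documents.
    """
    counts = Counter()
    for doc in docs:
        counts.update(tag_paths(doc))
    threshold = min_support * len(docs)
    return {path for path, count in counts.items() if count >= threshold}


def similarity(paths_a, paths_b):
    """Jaccard similarity between two path sets (an assumed measure)."""
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


docs = [
    "<a><b/><c/></a>",
    "<a><b/><d/></a>",
    "<a><b/><c><e/></c></a>",
]
rep = representative(docs, min_support=0.6)
print(sorted(rep))
```

Because a path can never occur in more documents than its parent path, the pruned set remains prefix-closed and can still be read as a tree. Comparing two such representatives with `similarity` gives one possible basis for deciding whether two clusters should be merged.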
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Costa, G., Manco, G., Ortale, R., Tagarelli, A. (2004). A Tree-Based Approach to Clustering XML Documents by Structure. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_15
DOI: https://doi.org/10.1007/978-3-540-30116-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive