A Tree-Based Approach to Clustering XML Documents by Structure

Costa, Gianni; Manco, Giuseppe; Ortale, Riccardo; Tagarelli, Andrea

doi:10.1007/978-3-540-30116-5_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3202)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

We propose a novel methodology for clustering XML documents on the basis of their structural similarities. The idea is to equip each cluster with an XML cluster representative, i.e. an XML document subsuming the most typical structural specifics of a set of XML documents. Clustering is essentially accomplished by comparing cluster representatives, and updating the representatives as soon as new clusters are detected. We present an algorithm for the computation of an XML representative based on suitable techniques for identifying significant node matchings and for reliably merging and pruning XML trees. Experimental evaluation performed on both synthetic and real data shows the effectiveness of our approach.
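The merge-and-prune idea behind a cluster representative can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: instead of the node-matching technique the abstract mentions, it assumes each document is reduced to its set of root-to-node tag paths, merges those sets by counting, and prunes paths whose support falls below a threshold. The names `tag_paths`, `representative`, `jaccard`, and the `min_support` parameter are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def representative(docs, min_support=0.5):
    """Merge the path sets of all documents, then prune rare paths.

    Assumption for this sketch: a path is kept when it occurs in at
    least `min_support` of the documents. A path never has higher
    support than its parent path, so the result is still a tree.
    """
    counts = Counter()
    for doc in docs:
        counts.update(tag_paths(doc))
    return {p for p, c in counts.items() if c / len(docs) >= min_support}


def jaccard(a, b):
    """Structural similarity of two path sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


docs = ["<a><b/><c/></a>", "<a><b/><d/></a>", "<a><b/><c/></a>"]
rep = representative(docs, min_support=0.6)
print(sorted(rep))  # ['a', 'a/b', 'a/c']
print(jaccard(tag_paths("<a><b/></a>"), rep))
```

In a clustering loop, a new document would be compared against each cluster's representative with a similarity such as `jaccard`, assigned to the closest cluster, and that cluster's representative recomputed. Path sets ignore sibling order and repeated elements, which a tree-matching approach like the one described in the abstract would be able to take into account.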

Author information

Authors and Affiliations

ICAR-CNR – Institute of Italian National Research Council, Via Pietro Bucci 41c, 87036, Rende (CS), Italy
Gianni Costa & Giuseppe Manco
DEIS, University of Calabria, Via Pietro Bucci 41c, 87036, Rende (CS), Italy
Riccardo Ortale & Andrea Tagarelli

Editor information

Editors and Affiliations

INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Dipartimento di Informatica, Università degli Studi di Bari,
Floriana Esposito
Pisa KDD Laboratory, ISTI - CNR, Area della Ricerca di Pisa, Via Giuseppe Moruzzi 1, Pisa, Italy
Fosca Giannotti
Dipartimento di Informatica, Via F. Buonarroti 2, 56127, Pisa, Italy
Dino Pedreschi

About this paper

Cite this paper

Costa, G., Manco, G., Ortale, R., Tagarelli, A. (2004). A Tree-Based Approach to Clustering XML Documents by Structure. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_15

DOI: https://doi.org/10.1007/978-3-540-30116-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive
