Abstract
As the Web continues to grow and evolve, more and more information is being placed in structurally rich documents, XML documents in particular, to improve the efficiency of similarity clustering, information retrieval, and data management applications. Various algorithms for comparing hierarchically structured data, such as XML documents, have been proposed in the literature. Most of them model XML documents as ordered labeled trees and rely on techniques for computing the edit distance between tree structures. Nevertheless, a thorough investigation of current approaches led us to identify several structural similarity aspects, namely sub-tree related similarities, that are not sufficiently addressed when comparing XML documents. In this paper, we provide an improved comparison method that deals with fine-grained sub-trees and leaf node repetitions without increasing overall complexity with respect to current XML comparison methods. Our approach consists of two main algorithms, for discovering the structural commonality between sub-trees and for computing the costs of tree-based edit operations. A prototype has been developed to evaluate the optimality and performance of our method. Experimental results on both real and synthetic XML data demonstrate better performance with respect to alternative XML comparison methods.
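To make the notion of tree edit distance over ordered labeled trees concrete, the following is a minimal sketch of a classic top-down (Selkow-style) distance, in which a root may be relabeled and whole child sub-trees may be inserted or deleted at a cost equal to their node count. It is not the method proposed in this paper, and it does not handle sub-tree commonality or leaf node repetitions; the function names (`to_tree`, `size`, `top_down_distance`, `xml_distance`) and the unit costs are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache


def to_tree(elem):
    """Model an XML element as an ordered labeled tree: (label, children)."""
    return (elem.tag, tuple(to_tree(child) for child in elem))


def size(tree):
    """Number of nodes in the tree; used as the sub-tree insert/delete cost."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Top-down edit distance between two ordered labeled trees.

    Assumed costs: relabeling a node costs 1, inserting or deleting a
    sub-tree costs its number of nodes.
    """
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)

    # d[i][j]: cost of turning the first i children of a into the first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),  # delete sub-tree
                d[i][j - 1] + size(cb[j - 1]),  # insert sub-tree
                d[i - 1][j - 1] + top_down_distance(ca[i - 1], cb[j - 1]),
            )
    return relabel + d[m][n]


def xml_distance(xml_a, xml_b):
    """Parse two XML strings and compare their element structure only."""
    tree_a = to_tree(ET.fromstring(xml_a))
    tree_b = to_tree(ET.fromstring(xml_b))
    return top_down_distance(tree_a, tree_b)


if __name__ == "__main__":
    doc_a = "<paper><title/><author/><author/></paper>"
    doc_b = "<paper><title/><author/><year/></paper>"
    print(xml_distance(doc_a, doc_b))  # 1: the last child is relabeled
```

Because this sketch charges every repeated `author` leaf independently and compares children strictly position by position, it illustrates the kind of baseline in which repeated leaves and similar sub-trees occurring elsewhere in the document go unrecognized.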
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tekli, J., Chbeir, R., Yetongnon, K. (2007). A Fine-Grained XML Structural Comparison Approach. In: Parent, C., Schewe, KD., Storey, V.C., Thalheim, B. (eds) Conceptual Modeling - ER 2007. ER 2007. Lecture Notes in Computer Science, vol 4801. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75563-0_39
DOI: https://doi.org/10.1007/978-3-540-75563-0_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75562-3
Online ISBN: 978-3-540-75563-0
eBook Packages: Computer Science (R0)