LAX: An Efficient Approximate XML Join Based on Clustered Leaf Nodes for XML Data Integration

  • Wenxin Liang
  • Haruo Yokota
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3567)


Increasingly, data are published and exchanged as XML on the Internet. However, different XML data sources may contain the same data in different structures. An efficient method is therefore needed to integrate such sources, so that users can conveniently access more complete and useful information.

The tree edit distance is regarded as an effective metric for evaluating structural similarity between XML documents. However, it is extremely expensive to compute, and the traditional wisdom in join algorithms cannot easily be applied to it. In this paper, we propose LAX (Leaf-clustering based Approximate XML join algorithm), in which the two XML document trees are clustered into subtrees representing independent items, and the similarity between the documents is determined by calculating a similarity degree based on the leaf nodes of each pair of subtrees. We also propose an effective algorithm for clustering XML documents for LAX. We show that the traditional wisdom in join algorithms is easy to apply to LAX and that the join result contains the complete information of both documents. We then conduct experiments that compare LAX with the tree edit distance and evaluate its performance on both synthetic and real data sets. The experimental results show that LAX is more efficient than the tree edit distance and more effective at measuring the approximate similarity between XML documents.
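The idea of comparing subtrees by their leaf nodes can be illustrated with a minimal sketch. It is not the algorithm from the paper: the abstract does not give the similarity-degree formula or the clustering procedure, so the code below assumes that the item subtrees are already identified by a known tag, and it uses a hypothetical measure (matched leaf tag/text pairs divided by the smaller leaf count). The names `leaf_values`, `similarity_degree`, and `approximate_join`, the threshold, and the sample documents are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def leaf_values(subtree):
    """Collect (tag, text) pairs for every leaf element under `subtree`."""
    return {
        (node.tag, (node.text or "").strip())
        for node in subtree.iter()
        if len(node) == 0  # an element with no children is a leaf
    }


def similarity_degree(a, b):
    """Assumed measure: shared leaves divided by the smaller leaf count."""
    leaves_a, leaves_b = leaf_values(a), leaf_values(b)
    if not leaves_a or not leaves_b:
        return 0.0
    return len(leaves_a & leaves_b) / min(len(leaves_a), len(leaves_b))


def approximate_join(doc_a, doc_b, item_tag_a, item_tag_b, threshold):
    """Nested-loop join over item subtrees (clustering is assumed done:
    each item is simply an element with the given tag)."""
    pairs = []
    for sub_a in doc_a.iter(item_tag_a):
        for sub_b in doc_b.iter(item_tag_b):
            degree = similarity_degree(sub_a, sub_b)
            if degree >= threshold:
                pairs.append((sub_a, sub_b, degree))
    return pairs


# Two made-up documents holding the same item under different structures.
doc_a = ET.fromstring(
    "<bib>"
    "<article><title>Paper A</title><year>2005</year></article>"
    "<article><title>Paper B</title><year>2002</year></article>"
    "</bib>"
)
doc_b = ET.fromstring(
    "<papers><paper>"
    "<info><title>Paper A</title><year>2005</year></info>"
    "<venue>Workshop X</venue>"
    "</paper></papers>"
)

matches = approximate_join(doc_a, doc_b, "article", "paper", threshold=0.8)
for sub_a, sub_b, degree in matches:
    print(sub_a.find("title").text, degree)
```

Because only leaf nodes are compared, the extra `info` level in the second document does not affect the match, which is the kind of structural difference that a tree edit distance would penalize. The nested loop is the simplest join strategy; since each pairwise comparison is a cheap set operation, it could in principle be swapped for a sort-merge or hash-based variant.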


Leaf Node, Candidate Element, Traditional Wisdom, Tree Edit Distance, Approximate Similarity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Wenxin Liang (1)
  • Haruo Yokota (2)
  1. Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
  2. Global Scientific Information and Computer Center, Tokyo Institute of Technology, Tokyo, Japan
