Approximate Joins for XML Using g-String

  • Fei Li
  • Hongzhi Wang
  • Cheng Zhang
  • Liang Hao
  • Jianzhong Li
  • Hong Gao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6309)


When integrating XML documents from autonomous databases, exact joins often fail because data items representing the same real-world object may not be exactly the same, so the join must be approximate. Join methods based on tree edit distance achieve high join quality but low efficiency, whereas more efficient methods cannot perform the join as effectively as tree edit distance does.

To balance efficiency and effectiveness, we propose a novel method for approximately joining XML documents. In our method, trees are transformed into g-strings, in which each entry is a tiny subtree. The distance between two trees is then evaluated as the g-string distance between their corresponding g-strings. To make the g-string-based join scale to large XML databases, we propose the g-bag distance as a lower bound of the g-string distance. With the g-bag distance, only a very small fraction of the g-string distances needs to be computed directly, so the whole join can be performed very efficiently. We theoretically analyze the properties of the g-string distance. Experiments with synthetic and various real-world data confirm the effectiveness and efficiency of our method and suggest that our technique is both scalable and useful.
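The filter-and-refine idea behind a lower-bounding distance can be illustrated with a minimal Python sketch. It is not the method of the paper: the abstract does not define the g-string or g-bag distance, so the sketch substitutes two well-known stand-ins, the unit-cost edit distance on flat token sequences and the classical bag (multiset) distance, which never exceeds that edit distance. The function names `bag_distance`, `edit_distance`, and `approximate_join`, the threshold `tau`, and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import product


def bag_distance(a, b):
    """Multiset lower bound on the unit-cost edit distance (stand-in for g-bag)."""
    ca, cb = Counter(a), Counter(b)
    only_a = sum((ca - cb).values())  # tokens that a has in excess of b
    only_b = sum((cb - ca).values())  # tokens that b has in excess of a
    return max(only_a, only_b)


def edit_distance(a, b):
    """Unit-cost edit distance between two sequences (stand-in for g-string)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_join(left, right, tau):
    """Return matching (i, j, distance) triples and the number of refined pairs."""
    matches, refined = [], 0
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        # Filter: if the cheap lower bound exceeds tau, the exact distance must too.
        if bag_distance(a, b) > tau:
            continue
        # Refine: compute the expensive distance only for the surviving pairs.
        refined += 1
        d = edit_distance(a, b)
        if d <= tau:
            matches.append((i, j, d))
    return matches, refined


# Hypothetical token sequences standing in for serialized trees.
left = [("book", "title", "author"), ("cd", "artist", "track")]
right = [("book", "title", "editor"), ("dvd", "director", "cast", "year")]
matches, refined = approximate_join(left, right, tau=1)
print(matches, refined)
```

In this toy input, three of the four candidate pairs are discarded by the lower bound alone, so the quadratic-time distance is evaluated only once. Because the filter never rejects a pair whose exact distance is within `tau`, the join result is the same as computing the exact distance for every pair.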


Keywords (machine-generated, not supplied by the authors): Leaf Node, Label Information, Edit Operation, Tree Pair, Dummy Node





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Fei Li (1)
  • Hongzhi Wang (1)
  • Cheng Zhang (1)
  • Liang Hao (1)
  • Jianzhong Li (1)
  • Hong Gao (1)

  1. The School of Computer Science and Technology, Harbin Institute of Technology, China
