Improving XML Instances Comparison with Preprocessing Algorithms

  • Rodrigo Gonçalves
  • Ronaldo dos Santos Mello
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4653)


Integrating data instances, especially on the Web, involves analyzing and matching data from two or more sources, including XML sources. XML sources in particular introduce new challenges to the integration process because of their dynamic and irregular structure. In this context, one of the hardest steps is determining which XML instances are similar. This paper presents a set of algorithms that prepare XML instances for comparison. We analyze the benefit these algorithms bring to existing XML comparison approaches.


Keywords: Similarity Metrics; Complex Element; Semistructured Data; Preprocessing Algorithm; Tree Edit Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Rodrigo Gonçalves (1)
  • Ronaldo dos Santos Mello (1)
  1. Universidade Federal de Santa Catarina, Florianópolis, Santa Catarina, 88045-360, Brazil
