SemSynX: Flexible Similarity Analysis of XML Data via Semantic and Syntactic Heterogeneity/Homogeneity Detection

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9648)

Abstract

In this paper we introduce and experimentally assess SemSynX, a novel technique for supporting similarity analysis of XML data via semantic and syntactic heterogeneity/homogeneity detection. Given two XML trees, SemSynX retrieves a list of semantic and syntactic heterogeneity/homogeneity matches between objects (i.e., elements, values, tags, attributes) occurring in certain paths of the trees. Each heterogeneity/homogeneity found is assigned a local score that takes into account path and value similarity. A global score that summarizes the number of equal matches as well as the local scores is also provided. The proposed technique is highly customizable: it permits the specification of thresholds for the required degree of similarity of paths and values, as well as for the degree of relevance of path and value matching. It thus makes it possible to “adjust” the similarity analysis depending on the nature of the input XML trees. SemSynX has been implemented as an XQuery library, so as to enhance interoperability with other XML processing tools. To complete our analytical contributions, we provide a comprehensive experimental assessment and evaluation of SemSynX over several classes of XML documents.
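
The scoring scheme outlined in the abstract can be illustrated with a minimal Python sketch. It is not the SemSynX XQuery library, and every name in it is hypothetical (leaf_paths, compare, the threshold and weight parameters). The sketch assumes that the local score is a weighted sum of path and value similarity, that the global score is the mean of the local scores, and it uses the ratio from Python's difflib as a stand-in string similarity measure. The abstract does not specify any of these choices.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def similarity(a, b):
    # Stand-in string similarity in [0, 1]; an assumption, not the paper's measure.
    return SequenceMatcher(None, a, b).ratio()


def leaf_paths(root):
    # Yield (path, text) for every leaf element of the tree.
    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            yield path, (node.text or "").strip()
        for child in children:
            yield from walk(child, path)
    yield from walk(root, "")


def compare(xml_a, xml_b, path_threshold=0.8, value_threshold=0.8,
            path_weight=0.5, value_weight=0.5):
    # Thresholds filter candidate matches; weights model the "degree of relevance"
    # of path and value matching (hypothetical interpretation).
    leaves_a = list(leaf_paths(ET.fromstring(xml_a)))
    leaves_b = list(leaf_paths(ET.fromstring(xml_b)))
    matches = []
    equal = 0
    for path_a, value_a in leaves_a:
        for path_b, value_b in leaves_b:
            path_sim = similarity(path_a, path_b)
            value_sim = similarity(value_a, value_b)
            if path_sim >= path_threshold and value_sim >= value_threshold:
                local = path_weight * path_sim + value_weight * value_sim
                matches.append((path_a, path_b, value_a, value_b, local))
                if path_a == path_b and value_a == value_b:
                    equal += 1
    # Assumed global score: mean of the local scores, 0.0 when nothing matches.
    global_score = sum(m[4] for m in matches) / len(matches) if matches else 0.0
    return matches, equal, global_score


if __name__ == "__main__":
    doc_a = "<book><title>XML Data</title><year>2016</year></book>"
    doc_b = "<book><title>XML data</title><year>2016</year></book>"
    for match in compare(doc_a, doc_b)[0]:
        print(match)
```

Lowering path_threshold or value_threshold admits looser matches, while changing the two weights shifts the local score toward structural or toward content agreement. This is one plausible reading of how the analysis might be "adjusted" to the input trees.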

Notes

  1. The use of keys is not mandatory in our approach, but keys are used in the running example to guide similarity search.

Acknowledgments

This work was funded by the EU ERDF and the Spanish Ministry of Economy and Competitiveness (MINECO) under Project TIN2013-44742-C4-4-R as well as by the Andalusian Regional Government under Project P10-TIC-6114.

Author information

Corresponding author

Correspondence to Jesús M. Almendros-Jiménez.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Almendros-Jiménez, J.M., Cuzzocrea, A. (2016). SemSynX: Flexible Similarity Analysis of XML Data via Semantic and Syntactic Heterogeneity/Homogeneity Detection. In: Martínez-Álvarez, F., Troncoso, A., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2016. Lecture Notes in Computer Science, vol 9648. Springer, Cham. https://doi.org/10.1007/978-3-319-32034-2_2

  • DOI: https://doi.org/10.1007/978-3-319-32034-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32033-5

  • Online ISBN: 978-3-319-32034-2

  • eBook Packages: Computer Science, Computer Science (R0)
