Content Data Based Schema Matching

  • Marcin Szymczak
  • Antoon Bronselaer
  • Sławomir Zadrożny
  • Guy De Tré
Part of the Studies in Computational Intelligence book series (SCI, volume 634)


A novel automatic method for detecting corresponding attributes in schemas based on content data is studied. More specifically, the proposed method detects coreferent attributes through a statistical and lexical comparison of content data and of coreferent tuples detected across multiple datasets, which increases the possibility of a correct schema matching. We show that knowledge of even a small number of coreferent tuples is sufficient to establish a correct matching between corresponding attributes of heterogeneous schemas. The behaviour of the technique has been evaluated on several real-life datasets, giving valuable insight into the influence of the different parameters of the approach on the results obtained.
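
The general idea of using a few coreferent tuples to align attributes can be illustrated with a small sketch. This is not the chapter's algorithm: the function names, the use of `difflib` as the lexical similarity, the averaging step, and the 0.8 threshold are all assumptions made here for illustration, and the statistical comparison mentioned in the abstract is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Lexical similarity in [0, 1] between two values, compared as lowercase strings."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def match_attributes(coreferent_pairs, threshold=0.8):
    """Map each attribute of schema A to its best-scoring attribute of schema B.

    coreferent_pairs is a non-empty list of (tuple_a, tuple_b) dicts, where both
    dicts describe the same real-world entity (hypothetical input format).
    """
    # Sum the value similarity of every attribute pair over all coreferent tuples.
    totals = defaultdict(float)
    for row_a, row_b in coreferent_pairs:
        for attr_a, val_a in row_a.items():
            for attr_b, val_b in row_b.items():
                totals[(attr_a, attr_b)] += similarity(val_a, val_b)

    n = len(coreferent_pairs)
    attrs_a = list(coreferent_pairs[0][0])
    attrs_b = list(coreferent_pairs[0][1])

    # Keep, for each attribute of A, the attribute of B with the highest mean
    # similarity, provided that mean reaches the (assumed) threshold.
    matching = {}
    for attr_a in attrs_a:
        best_attr, best_score = max(
            ((attr_b, totals[(attr_a, attr_b)] / n) for attr_b in attrs_b),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            matching[attr_a] = best_attr
    return matching


# Two coreferent tuple pairs from schemas with different attribute names.
pairs = [
    ({"name": "Cafe Luna", "town": "Ghent"}, {"title": "Cafe Luna", "city": "Ghent"}),
    ({"name": "Bistro Roma", "town": "Warsaw"}, {"title": "Bistro Roma", "city": "Warsaw"}),
]
matching = match_attributes(pairs)
```

In this toy input the values of coreferent tuples agree exactly, so two pairs are enough to separate the attributes; with noisier data more pairs or a lower threshold would likely be needed.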


Keywords: Content Data; Schema Match; Aggregation Operator; Fuzzy Measure; Possibility Distribution
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This contribution is supported by the Foundation for Polish Science under International PhD Projects in Intelligent Computing, a project financed by the European Union within the Innovative Economy Operational Programme 2007–2013 and the European Regional Development Fund. This work was also partially supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Marcin Szymczak (1, 2), corresponding author
  • Antoon Bronselaer (2)
  • Sławomir Zadrożny (1)
  • Guy De Tré (2)

  1. Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
  2. Department of Telecommunications and Information Processing, University Ghent, Ghent, Belgium
