Quantifying and Propagating Uncertainty in Automated Linked Data Integration

  • Klitos Christodoulou (email author)
  • Fernando Rene Sanchez Serrano
  • Alvaro A. A. Fernandes
  • Norman W. Paton
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10940)

Abstract

The Web of Data consists of numerous Linked Data (LD) sources from many largely independent publishers, giving rise to the need for data integration at scale. At this scale, automation can provide candidate integrations that underpin a pay-as-you-go approach. However, automated approaches need: (i) to operate across several data integration steps; (ii) to build on diverse sources of evidence; and (iii) to contend with uncertainty. This paper describes the construction of probabilistic models that yield degrees of belief both on the equivalence of real-world concepts and on the ability of mapping expressions to return correct results. The paper shows how such models can underpin a Bayesian approach to assimilating different forms of evidence: syntactic (in the form of similarity scores derived by string-based matchers), semantic (in the form of semantic annotations stemming from LD vocabularies), and internal (in the form of fitness values for candidate mappings). The paper presents an empirical evaluation of the methodology against equivalence and correctness judgements made by human experts. The evaluation confirms that the proposed Bayesian methodology is suitable as a generic, principled approach for quantifying and assimilating different pieces of evidence throughout the various phases of an automated data integration process.
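
As a rough illustration of what assimilating several pieces of evidence by Bayesian updating can look like, the following minimal sketch revises a degree of belief that two concepts are equivalent, first with a syntactic observation and then with a semantic one. It is not the chapter's model: the function names, the prior of 0.5, and all likelihood values are hypothetical, and the sequential update assumes the pieces of evidence are conditionally independent given the hypothesis.

```python
def bayes_update(prior: float, lik_if_equiv: float, lik_if_not: float) -> float:
    """Return P(equivalent | evidence) via Bayes' rule.

    prior        -- P(equivalent) before seeing the evidence
    lik_if_equiv -- P(evidence | equivalent)
    lik_if_not   -- P(evidence | not equivalent)
    """
    numerator = lik_if_equiv * prior
    evidence = numerator + lik_if_not * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


def assimilate(prior: float, evidence_items) -> float:
    """Apply Bayes' rule once per item; each posterior becomes the next prior."""
    belief = prior
    for lik_if_equiv, lik_if_not in evidence_items:
        belief = bayes_update(belief, lik_if_equiv, lik_if_not)
    return belief


# Hypothetical likelihood pairs (P(e | equivalent), P(e | not equivalent)).
syntactic = (0.8, 0.2)  # e.g. a high string-similarity score was observed
semantic = (0.9, 0.3)   # e.g. a vocabulary annotation links the two concepts

posterior = assimilate(0.5, [syntactic, semantic])
print(round(posterior, 4))  # 0.9231
```

Under the independence assumption the order in which evidence is assimilated does not change the final posterior, which is what allows evidence from different integration steps to be folded in as it becomes available.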

Keywords

Probabilistic modelling, Bayesian updating, Data integration, Linked Data

Notes

Acknowledgments

Fernando R. Sanchez S. is supported by a grant from the Mexican National Council for Science and Technology (CONACyT).

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Klitos Christodoulou (2), email author
  • Fernando Rene Sanchez Serrano (1)
  • Alvaro A. A. Fernandes (1)
  • Norman W. Paton (1)
  1. School of Computer Science, University of Manchester, Manchester, UK
  2. Department of Information Sciences, Neapolis University Pafos, Paphos, Cyprus
