Semantics and Verification of Entity Resolution and Data Fusion Operations via Transformation into a Formal Notation

  • Conference paper
Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 706)

Abstract

Throughout the development of data integration methods and tools, questions of formal semantics definition and verification have arisen repeatedly. Three levels of integration can be distinguished: data model integration, schema matching and integration, and integration of the data proper. This paper aims to develop methods and tools for formal semantics definition and verification at the third level, that of the data proper. An approach to defining formal semantics for high-level data integration programs is proposed. The semantics is defined by a transformation into a formal specification language supported by automatic and interactive provers. The semantics is applied to the verification of structured data integration workflows. The workflow properties to be verified are expressed in the chosen specification language, and the semantic specification of the data integration workflow is then verified with respect to those properties. The practical aim of the work is to provide a basis for formal verification of data integration workflows during problem solving in various integration environments.

Acknowledgments

This research was partially supported by the Russian Foundation for Basic Research (projects 15-29-06045, 16-07-01028).

Author information

Corresponding author

Correspondence to Sergey Stupnikov.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Stupnikov, S. (2017). Semantics and Verification of Entity Resolution and Data Fusion Operations via Transformation into a Formal Notation. In: Kalinichenko, L., Kuznetsov, S., Manolopoulos, Y. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2016. Communications in Computer and Information Science, vol 706. Springer, Cham. https://doi.org/10.1007/978-3-319-57135-5_11

  • DOI: https://doi.org/10.1007/978-3-319-57135-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-57134-8

  • Online ISBN: 978-3-319-57135-5

  • eBook Packages: Computer Science (R0)
