Rationality of Cross-System Data Duplication: A Case Study

  • Wiebe Hordijk
  • Roel Wieringa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6051)


Duplication of data across systems in an organization is a problem because it wastes effort and leads to inconsistencies. Researchers have proposed several technical solutions, but duplication still occurs in practice. In this paper we report on a case study of how and why duplication occurs in a large organization, and we discuss generalizable lessons learned from it. Our case study research questions are why data gets duplicated, how large the negative effects of duplication are, and why existing solutions are not used. We frame our findings in terms of design rationale and explain them with a causal model. Our findings suggest that, in addition to technological factors, organizational and project factors have a large effect on duplication. We discuss the implications of our findings for technical solutions in general.


Data duplication, design rationale, field study



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Wiebe Hordijk (1)
  • Roel Wieringa (1)
  1. University of Twente, The Netherlands
