Abstract
Duplication of data across systems in an organization is a problem because it wastes effort and leads to inconsistencies. Researchers have proposed several technical solutions, but duplication still occurs in practice. In this paper we report on a case study of how and why duplication occurs in a large organization, and we discuss generalizable lessons learned from it. Our case study research questions are why data gets duplicated, how large the negative effects of duplication are, and why existing solutions are not used. We frame our findings in terms of design rationale and explain them with a causal model. Our findings suggest that, in addition to technological factors, organizational and project factors have a large effect on duplication. We discuss the implications of our findings for technical solutions in general.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hordijk, W., Wieringa, R. (2010). Rationality of Cross-System Data Duplication: A Case Study. In: Pernici, B. (eds) Advanced Information Systems Engineering. CAiSE 2010. Lecture Notes in Computer Science, vol 6051. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13094-6_7
DOI: https://doi.org/10.1007/978-3-642-13094-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13093-9
Online ISBN: 978-3-642-13094-6
eBook Packages: Computer Science (R0)