Anomaly-Based Duplicate Detection: A Probabilistic Approach

  • Andreas Obermeier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11491)

Abstract

The importance of identifying records in databases that refer to the same real-world entity (“duplicate detection”) has been recognized in both research and practice. However, existing supervised approaches for duplicate detection need training data with labeled instances of duplicates and non-duplicates, which is often costly and time-consuming to generate. In contrast, unsupervised approaches can forgo such training data but may rely on limiting assumptions (e.g., monotonicity) and provide less reliable results. To generate high-quality results using only duplicate-free training data, which is easy to acquire, we propose a probabilistic approach for anomaly-based duplicate detection. Duplicates exhibit specific characteristics that differ significantly from the characteristics of non-duplicates and therefore represent anomalies. Based on the grade of anomaly compared to duplicate-free training data, our approach assigns a probability of being a duplicate to each analyzed pair of records while avoiding the limiting assumptions of existing approaches. We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analyzing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform even fully supervised state-of-the-art approaches for duplicate detection.
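
The general idea of scoring a record pair by its grade of anomaly relative to duplicate-free data can be illustrated with a minimal sketch. It is not the probabilistic model proposed in the paper, whose details the abstract does not give. It assumes a simple mean string similarity per attribute and uses an empirical tail fraction as the anomaly grade. The function names (similarity, fit_reference, anomaly_grade) and the sample records are hypothetical.

```python
from bisect import bisect_left
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Mean per-attribute string similarity in [0, 1] (an illustrative choice)."""
    ratios = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(ratios) / len(ratios)


def fit_reference(records):
    """Sorted similarities of all record pairs in duplicate-free training data."""
    return sorted(similarity(a, b) for a, b in combinations(records, 2))


def anomaly_grade(reference, a, b):
    """Fraction of duplicate-free pairs that are strictly less similar than (a, b).

    Values near 1 mean the pair is more similar than almost every known
    non-duplicate pair, i.e. it is anomalous. This is a score, not a
    calibrated duplicate probability.
    """
    s = similarity(a, b)
    return bisect_left(reference, s) / len(reference)


# Hypothetical duplicate-free training records: (first name, last name, city).
training = [
    ("Anna", "Schmidt", "Ulm"),
    ("Bernd", "Maier", "Berlin"),
    ("Clara", "Vogel", "Hamburg"),
    ("Dieter", "Krause", "Munich"),
]
reference = fit_reference(training)

near_duplicate = anomaly_grade(
    reference, ("Anna", "Schmidt", "Ulm"), ("Ana", "Schmitt", "Ulm")
)
unrelated = anomaly_grade(
    reference, ("Anna", "Schmidt", "Ulm"), ("Dieter", "Krause", "Munich")
)
print(near_duplicate, unrelated)
```

In this sketch the near-duplicate pair receives a higher grade than the unrelated pair. Turning such a grade into a reliable duplicate probability would require the kind of probabilistic modeling the paper describes, which this example does not attempt.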

Keywords

Duplicate detection · Unsupervised classification · Data quality

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Ulm, Ulm, Germany
