Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

  • Rebecca C. Steorts
  • Andrea Tancredi
  • Brunero Liseo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)


Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. One goal of the downstream task is to obtain a larger reference data set, allowing one to perform more accurate statistical analyses. In addition, there is inherent record linkage uncertainty that is passed to the downstream task. Motivated by the above, we propose a generalized Bayesian record linkage method and consider multiple regression analysis as the downstream task. Records are linked via a random partition model, which allows a wide class of such models to be considered. In addition, we jointly model the record linkage and the downstream task, which allows one to account for the record linkage uncertainty exactly. Moreover, one is able to generate a feedback propagation mechanism that carries information from the proposed Bayesian record linkage model into the downstream task. This feedback effect is essential to eliminate potential biases that can jeopardize the resulting downstream task. We apply our methodology to multiple linear regression, and illustrate empirically that the “feedback effect” is able to improve the performance of record linkage.
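
As a rough illustration of what linking records "via a random partition model" can mean, the sketch below draws a partition of record indices from a two-parameter Chinese restaurant process (Pitman-Yor). This particular prior, the function name `sample_partition`, and the parameters `theta` and `sigma` are assumptions made for illustration only; the abstract does not state which partition prior the authors use, and this is not their sampler.

```python
import random


def sample_partition(n_records, theta=1.0, sigma=0.5, seed=None):
    """Draw cluster labels for n_records from a two-parameter Chinese
    restaurant process. Records sharing a label are treated as referring
    to the same latent entity (hypothetical illustration)."""
    if not (0.0 <= sigma < 1.0) or theta <= -sigma:
        raise ValueError("need 0 <= sigma < 1 and theta > -sigma")
    rng = random.Random(seed)
    labels = []  # labels[i] = cluster index assigned to record i
    sizes = []   # sizes[k]  = number of records currently in cluster k
    for _ in range(n_records):
        if not sizes:
            k = 0  # the first record always opens the first cluster
        else:
            # Existing cluster k gets weight (size_k - sigma);
            # opening a new cluster gets weight (theta + sigma * K).
            weights = [s - sigma for s in sizes]
            weights.append(theta + sigma * len(sizes))
            k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels


if __name__ == "__main__":
    # Ten records; equal labels mean "same entity" under this draw.
    print(sample_partition(10, theta=1.0, sigma=0.5, seed=1))
```

In a joint model of the kind the abstract describes, a partition like this would be one unknown among others, updated together with the regression parameters rather than fixed before the regression is run.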



Steorts was supported by NSF-1652431 and NSF-1534412. Tancredi and Liseo were supported by Ministero dell’Istruzione, dell’Università e della Ricerca, Italia, PRIN 2015.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Rebecca C. Steorts (1)
  • Andrea Tancredi (2)
  • Brunero Liseo (2)
  1. Department of Statistical Science, affiliated faculty, Computer Science, Biostatistics and Bioinformatics, the information initiative at Duke (iiD), and the Social Science Research Institute (SSRI), Duke University; Principal Researcher, Center for Statistical Research Methodology, Duke University and U.S. Census Bureau, Durham, USA
  2. Department of Methods and Models for Economics, Territory and Finance, La Sapienza, Rome, Italy
