Optimal Decision Rules for Constrained Record Linkage: An Evolutionary Approach

Zardetto, Diego; Scannapieco, Monica

doi:10.1007/978-3-319-00032-9_44


Conference paper
First Online: 01 January 2013


Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Record Linkage (RL) aims at identifying pairs of records that come from different sources and represent the same real-world entity. Probabilistic RL methods assume that the pairwise distances computed in the record-comparison process obey a well-defined statistical model, and exploit the statistical inference machinery to draw conclusions on the unknown Match/Unmatch status of each pair. Once the model parameters have been estimated, classical Decision Theory results (e.g. the MAP rule) can generally be used to obtain a probabilistic clustering of the pairs into Matches and Unmatches. Constrained RL tasks (arising whenever one knows in advance that either or both of the data sets to be linked contain no duplicates) represent a relevant exception. In this paper we propose an Evolutionary Algorithm to find optimal decision rules according to arbitrary objectives (e.g. Maximum complete-Likelihood) while fulfilling 1:1, 1:N and N:1 matching constraints. We also present some experiments on real-world constrained RL instances, showing the accuracy and efficiency of our approach.
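
To make the contrast between the unconstrained MAP rule and a constrained decision rule concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes that a matrix `p` of estimated match probabilities (strictly between 0 and 1) is already available, that the objective is the sum of log posterior odds over the declared Matches (a simplified stand-in for a complete-likelihood objective), and that only the 1:1 constraint is enforced. The names `map_rule`, `log_odds_score`, `mutate` and `evolve`, the toy data and all parameter values are hypothetical.

```python
import math
import random

def map_rule(p):
    """Unconstrained MAP rule: declare (i, j) a Match when P(Match) > 0.5."""
    return {(i, j) for i, row in enumerate(p)
            for j, pij in enumerate(row) if pij > 0.5}

def log_odds_score(p, assign):
    """Sum of log posterior odds over declared Matches (assumed objective).

    assign[i] is the column matched to row i, or None for no match.
    """
    return sum(math.log(p[i][j] / (1.0 - p[i][j]))
               for i, j in enumerate(assign) if j is not None)

def mutate(assign, n_cols, rng):
    """Apply 1 to 3 random reassignments, always preserving the 1:1 constraint."""
    child = list(assign)
    for _ in range(rng.randint(1, 3)):
        i = rng.randrange(len(child))
        j = rng.choice([None] + list(range(n_cols)))
        if j is not None and j in child:
            child[child.index(j)] = None  # free column j before reusing it
        child[i] = j
    return child

def evolve(p, pop_size=40, generations=300, seed=0):
    """Elitist (mu + lambda) evolutionary search over feasible 1:1 decisions."""
    rng = random.Random(seed)
    n_rows, n_cols = len(p), len(p[0])
    pop = [[None] * n_rows for _ in range(pop_size)]  # start from "all Unmatch"
    for _ in range(generations):
        children = [mutate(rng.choice(pop), n_cols, rng) for _ in range(pop_size)]
        ranked = sorted(pop + children,
                        key=lambda a: log_odds_score(p, a), reverse=True)
        pop = ranked[:pop_size]  # keep the best individuals of parents + children
    return pop[0]

# Toy match probabilities (illustrative values, not data from the paper).
p = [[0.90, 0.80, 0.10],
     [0.85, 0.20, 0.10],
     [0.10, 0.10, 0.30]]

unconstrained = map_rule(p)  # may link one record to several records
best = evolve(p)             # at most one match per row and per column
```

On this toy matrix the MAP rule links record 0 of the first data set to two records of the second, which a 1:1 task forbids, whereas the evolutionary search only ever evaluates feasible decisions. Encoding each individual as a partial assignment keeps the constraint satisfied by construction, so no penalty term or repair pass is needed in this sketch.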


Notes

1. The expression “N:M” means that each record of either data set can in principle match many records of the other, and vice versa.
2. By duplicates we mean records that (i) correspond to the same real-world entity and (ii) belong to the same data set.
3. Owing to the stochastic nature of our Evolutionary Algorithm, we performed 10 runs on each instance. In any case, we found negligible variability in the results.
4. The increase in Precision, Recall and F-measure (when present) turned out to be at most 0.1%.


Author information

Authors and Affiliations

Istat - Italian National Institute of Statistics, Rome, Italy
Diego Zardetto & Monica Scannapieco


Corresponding author

Correspondence to Diego Zardetto.

Editor information

Editors and Affiliations

Department of Economics and Management, University of Pavia, Via San Felice 7, Pavia, 27100, Italy
Paolo Giudici
Department of Economics and Business, University of Catania, Corso Italia 55, Catania, 95129, Italy
Salvatore Ingrassia
Department of Statistics, University of Rome "La Sapienza", Piazzale Aldo Moro 5, Rome, 00185, Italy
Maurizio Vichi

About this paper

Cite this paper

Zardetto, D., Scannapieco, M. (2013). Optimal Decision Rules for Constrained Record Linkage: An Evolutionary Approach. In: Giudici, P., Ingrassia, S., Vichi, M. (eds) Statistical Models for Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00032-9_44

DOI: https://doi.org/10.1007/978-3-319-00032-9_44
Published: 22 May 2013
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-00031-2
Online ISBN: 978-3-319-00032-9
eBook Packages: Mathematics and Statistics (R0)
