Hierarchical Linkage Clustering with Distributions of Distances for Large-Scale Record Linkage

Ventura, Samuel L.; Nugent, Rebecca

doi:10.1007/978-3-319-11257-2_22

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8744)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

Distance-based clustering techniques such as hierarchical clustering use a single estimate of distance for each pair of observations, so their results rely on the accuracy of that estimate. However, in many applications, datasets include measurement error or are too large for traditional models, meaning that a single estimate of the distance between two observations may be subject to error or computationally prohibitive to calculate. For example, in many of today’s large-scale record linkage problems, datasets are so large that computing a single distance estimate for every pair is computationally infeasible. Using a distribution of distance estimates instead (e.g., from an ensemble of classifiers trained on subsets of record pairs) may resolve these issues. We present a large-scale record linkage framework that incorporates classifier ensembles and “distribution linkage” clustering to identify clusters of records corresponding to unique entities. We examine the performance of several different distributional summary measures in hierarchical clustering. We motivate and illustrate this approach with an application of record linkage to the United States Patent and Trademark Office database.
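
The general idea of clustering on a summary of a distribution of distances can be illustrated with a small sketch. This is not the authors' implementation: the function names (`summarize`, `cluster`), the toy distance values, the cut-off of 0.5, and the use of single linkage are all assumptions made for illustration. Each record pair carries several distance estimates (for instance, one per classifier in an ensemble); a summary function collapses them to one value, and records are then merged whenever the summarized distance falls at or below the cut-off, which is equivalent to cutting a single-linkage dendrogram at that height.

```python
from statistics import median

# Hypothetical toy input: three distance estimates per record pair,
# e.g. one per classifier in an ensemble. Values are made up.
samples = {
    (0, 1): [0.1, 0.2, 0.9],   # one estimate disagrees strongly
    (0, 2): [0.8, 0.9, 0.7],
    (0, 3): [0.9, 0.95, 0.85],
    (1, 2): [0.7, 0.8, 0.9],
    (1, 3): [0.9, 0.8, 0.95],
    (2, 3): [0.2, 0.1, 0.3],
}


def summarize(dist_samples, summary=median):
    """Collapse each pair's list of distance estimates into one number."""
    return {pair: summary(values) for pair, values in dist_samples.items()}


def cluster(n_records, pair_dist, threshold):
    """Single-linkage clustering cut at `threshold` (union-find based)."""
    parent = list(range(n_records))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge pairs from closest to farthest, stopping at the cut-off.
    for (i, j), d in sorted(pair_dist.items(), key=lambda kv: kv[1]):
        if d > threshold:
            break
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# The choice of summary measure changes the clustering.
by_median = cluster(4, summarize(samples, median), threshold=0.5)
by_max = cluster(4, summarize(samples, max), threshold=0.5)
print(by_median)  # [[0, 1], [2, 3]]
print(by_max)     # [[0], [1], [2, 3]]
```

In this toy example the median ignores the single outlying estimate for records 0 and 1 and links them, while the maximum is conservative and keeps them apart. This is the kind of difference between summary measures that the abstract says is examined.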

An Erratum for this chapter can be found at http://dx.doi.org/10.1007/978-3-319-11257-2_28


Author information

Authors and Affiliations

Department of Statistics, Carnegie Mellon University, U.S.A.
Samuel L. Ventura & Rebecca Nugent

Editor information

Editors and Affiliations

Department of Computer Engineering and Mathematics, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007, Tarragona, Catalonia
Josep Domingo-Ferrer

About this paper

Cite this paper

Ventura, S.L., Nugent, R. (2014). Hierarchical Linkage Clustering with Distributions of Distances for Large-Scale Record Linkage. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_22

DOI: https://doi.org/10.1007/978-3-319-11257-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science, Computer Science (R0)
