Abstract
Distance-based clustering techniques such as hierarchical clustering use a single estimate of distance for each pair of observations, so their results rely on the accuracy of that estimate. However, in many applications, datasets include measurement error or are too large for traditional models, meaning a single estimate of the distance between two observations may be subject to error or computationally prohibitive to calculate. For example, in many of today’s large-scale record linkage problems, datasets are prohibitively large, making distance estimates computationally infeasible. Using a distribution of distance estimates instead (e.g., from an ensemble of classifiers trained on subsets of record pairs) may resolve these issues. We present a large-scale record linkage framework that incorporates classifier ensembles and “distribution linkage” clustering to identify clusters of records corresponding to unique entities. We examine the performance of several different distributional summary measures in hierarchical clustering. We motivate and illustrate this approach with an application of record linkage to the United States Patent and Trademark Office database.
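The idea of replacing a single pairwise distance with a summary of a distribution of distance estimates can be illustrated with a small sketch. This is not the authors' "distribution linkage" algorithm, which the abstract does not specify. It only assumes that each record pair has B distance estimates (for instance, one per ensemble member), collapses them with a chosen summary measure, and passes the result to standard hierarchical clustering from SciPy. The function names, the toy data, the average-linkage choice, and the cut height are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Summary measures applied across the B estimates for each record pair
# (an illustrative set, not the measures studied in the paper).
SUMMARIES = {
    "mean": lambda s: s.mean(axis=0),
    "median": lambda s: np.median(s, axis=0),
    "min": lambda s: s.min(axis=0),
    "max": lambda s: s.max(axis=0),
}


def summarize_distances(dist_samples, summary="mean"):
    """Collapse a (B, n, n) stack of distance estimates into one n x n matrix."""
    d = SUMMARIES[summary](np.asarray(dist_samples, dtype=float))
    d = (d + d.T) / 2.0          # enforce symmetry
    np.fill_diagonal(d, 0.0)     # a record is at distance 0 from itself
    return d


def cluster_records(dist_samples, summary="mean", cut=0.5, method="average"):
    """Hierarchical clustering on the summarized distances, cut at height `cut`."""
    d = summarize_distances(dist_samples, summary)
    tree = linkage(squareform(d, checks=False), method=method)
    return fcluster(tree, t=cut, criterion="distance")


# Toy example: 4 records, where records 0-1 and 2-3 are assumed to be
# the same entity. Each of the B = 5 "ensemble members" gives a noisy
# estimate of the pairwise distances.
rng = np.random.default_rng(0)
base = np.array([
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
])
noise = rng.uniform(-0.05, 0.05, size=(5, 4, 4))
noise = (noise + np.transpose(noise, (0, 2, 1))) / 2.0  # symmetric noise
dist_samples = np.clip(base + noise, 0.0, 1.0)

labels = cluster_records(dist_samples, summary="median", cut=0.5)
print(labels)
```

In this toy setting every summary measure recovers the same two clusters because the groups are well separated. With noisier or partially missing estimates, the choice of summary would matter more.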
An Erratum for this chapter can be found at http://dx.doi.org/10.1007/978-3-319-11257-2_28
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Ventura, S.L., Nugent, R. (2014). Hierarchical Linkage Clustering with Distributions of Distances for Large-Scale Record Linkage. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_22
DOI: https://doi.org/10.1007/978-3-319-11257-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science, Computer Science (R0)