Sorted Nearest Neighborhood Clustering for Efficient Private Blocking

Vatsalan, Dinusha; Christen, Peter

doi:10.1007/978-3-642-37456-2_29


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7819)

Abstract

Record linkage is an emerging research area, required by various real-world applications, that identifies which records in different data sources refer to the same real-world entities. Privacy concerns and restrictions often prevent the use of traditional record linkage applications across different organizations. Linking records in situations where no private or confidential information can be revealed is known as privacy-preserving record linkage (PPRL). As with traditional record linkage, scalability is a main challenge in PPRL. This challenge is generally addressed by employing a blocking technique, which reduces the number of candidate record pairs by removing pairs that likely refer to non-matches without comparing them in detail. This paper presents an efficient private blocking technique based on a sorted neighborhood approach that combines k-anonymous clustering with the use of public reference values. An empirical study conducted on real-world databases shows that this approach scales to large databases and that it can provide effective blocking while preserving k-anonymous characteristics. The proposed approach can be up to two orders of magnitude faster than two state-of-the-art private blocking techniques, k-nearest neighbor clustering and Hamming-based locality sensitive hashing.
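
The abstract gives only a high-level description, so the following is a minimal illustrative sketch of the general idea rather than the authors' algorithm. It assumes that both database owners share one public list of reference values, that each record's blocking key is attached to the nearest preceding reference value in sort order, and that neighbouring reference groups are merged until every block holds at least k records. The function names (`build_blocks`, `candidate_pairs`), the rule for pairing blocks, and the toy surnames are all assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import product


def build_blocks(records, reference_values, k):
    """Cluster (record_id, key) pairs around sorted public reference values.

    Hypothetical helper: returns a list of (reference_tuple, record_ids)
    blocks, each holding at least k records when enough records exist.
    """
    refs = sorted(reference_values)
    groups = [[] for _ in refs]
    for record_id, key in records:
        # Attach the key to the nearest preceding reference value
        # (keys sorting before every reference go to the first one).
        idx = max(bisect_right(refs, key) - 1, 0)
        groups[idx].append(record_id)

    blocks, cur_refs, cur_ids = [], [], []
    for ref, ids in zip(refs, groups):
        # Merge neighbouring groups until the block reaches k records.
        cur_refs.append(ref)
        cur_ids.extend(ids)
        if len(cur_ids) >= k:
            blocks.append((tuple(cur_refs), cur_ids))
            cur_refs, cur_ids = [], []
    if cur_refs:
        # Fold an undersized tail into the previous block if there is one.
        if blocks:
            last_refs, last_ids = blocks[-1]
            blocks[-1] = (last_refs + tuple(cur_refs), last_ids + cur_ids)
        else:
            blocks.append((tuple(cur_refs), cur_ids))
    return blocks


def candidate_pairs(blocks_a, blocks_b):
    """Pair records only across blocks that share a reference value."""
    pairs = set()
    for refs_a, ids_a in blocks_a:
        for refs_b, ids_b in blocks_b:
            if set(refs_a) & set(refs_b):
                pairs.update(product(ids_a, ids_b))
    return pairs


# Toy data (invented for illustration): surnames as blocking keys.
public_refs = ["adams", "brown", "jones", "miller", "smith", "taylor"]
db_a = [("a1", "allen"), ("a2", "baker"), ("a3", "johnson"),
        ("a4", "mills"), ("a5", "smyth"), ("a6", "thomas")]
db_b = [("b1", "anderson"), ("b2", "jonson"),
        ("b3", "smith"), ("b4", "turner")]

blocks_a = build_blocks(db_a, public_refs, k=2)
blocks_b = build_blocks(db_b, public_refs, k=2)
pairs = candidate_pairs(blocks_a, blocks_b)
print(len(pairs), "candidate pairs instead of", len(db_a) * len(db_b))
```

In this sketch only reference values and block membership would need to be exchanged, and every block contains at least k records, which is the sense in which the blocks are k-anonymous here. Larger k gives more privacy but bigger blocks, and therefore more candidate pairs to compare.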

Author information

Authors and Affiliations

Research School of Computer Science, College of Engineering and Computer Science, The Australian National University, Canberra, ACT, 0200, Australia
Dinusha Vatsalan & Peter Christen

Editor information

Editors and Affiliations

School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Dept. of Computer Science and Information Engineering, Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, P.O. Box 123, 2007, Sydney, NSW, Australia
Longbing Cao & Guandong Xu
Asian Office of Aerospace Research and Development (AOARD), Air Force Office of Scientific Research (AFOSR), Air Force Research Laboratory USA, Osaka University, 7-23-17 Roppongi, 106-0032, Minato-ku, Tokyo, Japan
Hiroshi Motoda

Cite this paper

Vatsalan, D., Christen, P. (2013). Sorted Nearest Neighborhood Clustering for Efficient Private Blocking. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37456-2_29

DOI: https://doi.org/10.1007/978-3-642-37456-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37455-5
Online ISBN: 978-3-642-37456-2
eBook Packages: Computer Science, Computer Science (R0)
