Geocode Matching and Privacy Preservation

  • Peter Christen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5456)


Geocoding is the process of matching addresses to geographic locations, such as latitudes and longitudes, or local census areas. In many applications, addresses are the key to geo-spatial data analysis and mining. Privacy and confidentiality are of paramount importance when data from, for example, cancer registries or crime databases is geocoded. Various approaches to privacy-preserving data matching, also called record linkage or entity resolution, have been developed in recent times. However, most of these approaches have not considered the specific privacy issues involved in geocode matching. This paper provides a brief introduction to privacy-preserving data and geocode matching, and uses several real-world scenarios to illustrate the privacy and confidentiality issues involved. The challenges of making privacy-preserving matching practical for real-world applications are highlighted, and potential directions for future research are discussed.


Keywords: data matching, record linkage, entity resolution, privacy preservation, geocoding, secure multi-party computations


Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Peter Christen, Department of Computer Science, The Australian National University, Canberra, Australia
