An unsupervised classification method for inferring original case locations from low-resolution disease maps
Widespread availability of geographic information systems software has facilitated the use of disease mapping in academia, government and private sector. Maps that display the address of affected patients are often exchanged in public forums, and published in peer-reviewed journal articles. As previously reported, a search of figure legends in five major medical journals found 19 articles from 1994–2004 that identify over 19,000 patient addresses. In this report, a method is presented to evaluate whether patient privacy is being breached in the publication of low-resolution disease maps.
To demonstrate the effect, a hypothetical low-resolution map of geocoded patient addresses was created and the accuracy with which patient addresses can be resolved is described. Through georeferencing and unsupervised classification of the original image, the method precisely re-identified 26% (144/550) of the patient addresses from a presentation quality map and 79% (432/550) from a publication quality map. For the presentation quality map, 99.8% of the addresses were within 70 meters (approximately one city block length) of the predicted patient location, 51.6% of addresses were identified within five buildings, 70.7% within ten buildings and 93% within twenty buildings. For the publication quality map, all addresses were within 14 meters and 11 buildings of the predicted patient location.
This study demonstrates that lowering the resolution of a map displaying geocoded patient addresses does not sufficiently protect patient addresses from re-identification. Guidelines to protect patient privacy, including those of medical journals, should reflect policies that ensure privacy protection when spatial data are displayed or published.
KeywordsGlobal Position System Privacy Protection Unsupervised Classification Apartment Building Housing Density
Geocoding patient data – translating the plaintext addresses of patients into longitudes and latitudes – has become routine and enables display and analysis of disease patterns. Many public health surveillance systems and academic investigations rely on specific case locations for identifying patterns, correlates, and predictors of disease [1, 2, 3]. Maps that display such geocoded health data are frequently presented publicly and published electronically and in print.
However, publishing patient address locations on maps also creates a risk of re-identification of individuals [4, 5, 6, 7]. We recently reported an inadvertent breach of privacy across five major medical journals, identifying 19 articles from 1994–2004 that include maps with patient addresses plotted as individual dots or symbols [4, 5]. From these publications, over 19,000 patient addresses are plotted on map figures. We demonstrated through a process of reverse identification that the home addresses of many of these patients could be discovered, despite the low resolution of the disease maps.
Here, we provide the details of that method. We rely on unsupervised classification of the spectral properties of the map image to identify case locations. Since we do not have available to us the original addresses of the patients represented in the published maps, we devised an indirect approach relying on simulation.
We created a JPEG image with a resolution of 50 dots per inch (dpi), 550 × 400 pixels, a file size of 129 kb and a scale of 1:100,000. This low resolution is typical for web display and is lower than generally used in slide presentations. Also the re-identification of patient addresses was evaluated using a higher-resolution map (266 dpi, 2926 × 2261 pixels, 712 kb, 1:100,000), often the minimum resolution for peer-reviewed publications.
There are several steps involved in reversely identifying a patient address. First, the sample map is scanned or imported into GIS software as an image file . Second, the imported map is georeferenced. The cartographic projection of the map is used to set the coordinate system. Generally, the projection of a published map would be unknown and the correct projection would need to be found by manually matching the image of the map to an image of a correctly georegistered map of the same area. In this case, we have a priori knowledge of the map projection. In either case, ground control points are selected on the image using a corresponding vector outline of the map area to re-project the image file of patient locations and reference it to a coordinate system. In this example, an outline of counties around Boston provided by the US Census Bureau to set the ground control points . The process of scanning and georeferencing the disease map parallels the methodology detailed by Curtis et al . Third, using image analysis software , unsupervised classification of the georeferenced map is performed. Given the spectral properties of the image file, pixels are classified so that pixels representing the patient points are aggregated together. Fourth, a reclassified raster map (an image composed of individual pixel elements arranged in a grid) that only contains patient points is extracted and converted to a vector file. Finally, Coordinates of the patient points are then calculated.
Accuracy of reverse geocoding was measured as (a) the number of correctly identified patient addresses (b) the distance between the reversely identified address coordinate and the boundary of the building of the patient home address and (c) the number of buildings in which the patient could reside, given the reversely geocoded address. To calculate (c), we estimated the minimum buffer size from the predicted location needed to contain the centroid of the correct address. Accuracy in this case is therefore defined as the number of incorrect addresses within this buffer.
Our reverse identification method correctly identified 26% (144/550) of patient addresses precisely, from a sample map with low-resolution GIS output. We observed increased detection with the higher-resolution publication quality output to 79% (432/550) of patient addresses identified exactly.
Our results demonstrate that even lowering the resolution of a map displaying geocoded patient addresses does not sufficiently protect patient addresses from re-identification. Despite the low quality of output sources, these images – based on high precision input sources – preserve positional accuracy. Using a low quality map that would serve the purpose of web or presentation display, we were able to precisely identify more than one quarter of all randomly selected home addresses and on average patients could be identified to a city block or within one of eight buildings. Using a map with minimum resolution for peer-reviewed publication, we could identify almost all patient addresses and on average patients could be identified within 14 m.
The ultimate accuracy of the patient re-identification will no doubt depend on the number of individuals residing at these addresses. In the case of multi-family apartment dwellings, address identification may still afford a certain level of privacy protection. In the case of single family dwellings, re-identification becomes much more likely. However, even in the best case scenario of an urban area multi-family apartment building, an additional concern is that individuals at these addresses can be fully re-identified when linked with other datasets or by using other characteristics supplied in the publication . Previous research has shown that combinations of seemingly innocuous data is adequate to uniquely identify individuals with a high level of reliability . For example, an experiment using 1990 U.S. Census summary data surprised the public health community by showing that datasets previously thought to be adequately de-identified, containing only 5-digit ZIP code, gender and date of birth, could be linked with other publicly available data (e.g., voting records) and used to uniquely identify 87% of the population of the United States . Low-resolution maps of patient locations pose an additional risk to individual privacy – allowing considerably more precision in re-identification than might be expected. Although the Health Insurance Portability and Accountability Act Privacy Rule (Section 164.514) does not explicitly address the publication of such maps, certain formats of geographic data display most likely violate the spirit of that rule.
Curtis et al have also recently described a method to re-identify patients from published maps through manual outlining of case markers . Though the vector-based approach of heads-up digitizing can be more accurate than raster-based unsupervised classification in certain circumstances, in this case, it may be difficult to find the true border of the case markers from a scanned paper-based maps (such as the newspaper article described by Curtis et al) or even low-resolution digital images. If the marker is not digitized accurately, then it follows that the centroid of this polygon will also less accurately reflect the original geocoded location. Our approach differs from the manual approach in that we rely on analyzing the spectral properties of the map image through unsupervised classification to automatically identify patient locations. The raster-based method based on the spectral properties of the image can provide a reliable means of re-creating the original vector file and systematically obtaining the center point of a low-resolution marker. This comparison, however, warrants further evaluation. Nonetheless, the results of the two papers are very similar in that they show that maps containing point data are vulnerable to patient address re-identification. These studies and our previous publication on this topic  should be viewed together informing policy around the display of geographic data.
The main question that should be asked by both authors and editors is what are the benefits and risks of point localization of patients? Is it necessary to publish maps of point locations, for the presentation of relevant results of research or are they presented merely for illustrative purposes? The answer to these questions should guide decisions on how to report disease maps . If just for illustrative purposes, there are techniques available to visualize spatial data without revealing patient information . For instance, a common approach to de-identifying such data has been to use ZIP or postal code rather than home address to protect anonymity. While usually appropriate for the reporting of study results, aggregation of data to an administrative unit poses constraints on the analysis and visualization of disease patterns [17, 18, 19]. Other approaches are available for masking geographic data, such as spatial masking of cases by randomly relocating cases within a given distance of their true location [20, 21, 22, 23] or the population-density adjusted 2D Gaussian blurring approach which results in only a small reduction in sensitivity to detect clustering patterns . These methods avoid these visualization constraints of data aggregation and afford sufficient privacy for publication without substantial loss to visual display. Masking methods provide more systematic and reliable means of de-identification rather than simply reducing map resolution. Spruill developed a measure of privacy protection for any mask, analogous to our measure of number of addresses within which the patient could reside . Such a measure could be used by journal editors as a rule for not publishing maps of individual cases unless a certain value of anonymity was attained. This measure, often referred to as K-anonymity, could help to establish guidelines for the safe publication of disease maps [13, 24].
Our approach relies on simulation, rather than attempting to re-identify patients from published maps. We chose this approach to avoid propagating any prior inadvertent disclosures of patient identity, and to avoid impugning particular authors or journals. An advantage of our approach is that since we know the value of the original plotted location, we can precisely measure the accuracy of re-identification. Our analysis also does not address the geocoding method. Accuracy of re-identification will also be dependent on the method for geocoding patient address. Use of a global positioning system (GPS) will provide greater accuracy then that of an address geocoder (automatic conversion from home address text to latitude and longitude using interpolation along street line data). When a geocoder is applied, the input data source will affect the accuracy of the estimate address coordinate. Many US-based studies rely on the freely available US Census TIGER line file as input to assign coordinates to addresses. Although TIGER line files differ in accuracy across the US, they rarely, if ever, approach the geometric accuracy of GPS coordinates or even more detailed commercial datasets. In fact, geocoding based on the free Census data available to most health researchers increases patient anonymity as the proportional placement of the address location can greatly affect geocoding accuracy [10, 26]. Outside the US, street level data may not be available for address geocoding. Therefore, spatial analysis studies in these areas would rely on the more accurate GPS measures. By extension, greater positional accuracy is revealed in these studies. Our findings may therefore be highly pertinent for GIS-based studies in developing countries.
The issues we raise here have, of course, much wider implications than for just health data, including crime data, housing data (e.g.: Section 8 units, shelters for abused women, etc.), and other administrative data sets [20, 27, 28]. New spatial data standards that protect confidentiality while still effectively communicating information about spatial patterns require immediate evaluation .
The publication of low-resolution disease maps poses an inherent jeopardy to patient privacy. Because the appropriate use of the patient address level data can bring real benefit to many areas of public health research that deal with spatial analysis, accidental disclosure of patient information from such maps may lead to constraints on obtaining geographically referenced health data. Thus, guidelines for the display or publication of health data are needed to guarantee privacy protection. Further, the editors of journals and textbooks should consider implementing policies to ensure the safe reporting of spatial data.
We gratefully acknowledge Marc Overhage, MD PhD and Shaun Grannis MD, Regenstrief Institute, for their helpful comments on study design and manuscript drafting. This work was supported by R01LM007970-01, and R21LM009263-01 from the National Library of Medicine, National Institutes of Health and the Canadian Institutes of Health Research. Isaac Kohane is partially supported by National Institutes of HealthNational Center for Biomedical Computing grant 5U54LM008748-02.20.
- 3.Elliot P, Wakefiled JC, Best NG, Briggs DJ: Spatial epidemiology: methods and applications. 2000, Oxford , Oxford University Press, 233-239.Google Scholar
- 4.Brownstein JS, Cassa CA, Kohane IS, Mandl KD: Reverse geocoding: Concerns about patient confidentiality in the display of geospatial health data. AMIA Annu Symp Proc. 2005, 905-Google Scholar
- 8.Boston Water and Sewer Commission: Boston planimetric and topographic data. 1996Google Scholar
- 9.US Census Bureau: Redistricting Census 2000 TIGER/Line Files [machine-readable data files]. 2000, Washington, DCGoogle Scholar
- 11.ESRI: ArcGIS. 2001, Redlands, CA , Environmental Systems Institute Inc., 8.1Google Scholar
- 12.ERDAS: IMAGINE. 2001, Atlanta, GA , ERDAS, 8.5Google Scholar
- 15.Sweeney L: Uniqueness of Simple Demographics in the U.S. Population, LIDAP-WP4. Forthcoming book entitled, The Identifiability of Data. 2000, Pittsburgh, PA , Carnegie Mellon University, Laboratory for International Data PrivacyGoogle Scholar
- 25.Spruill NL: The confidentiality and analytic usefulness of masked business microdata. Proceedings of the American Statistical Association Section on Survey Research Methods. 1983, 602-607.Google Scholar
- 27.Monmonier M: Spying with Maps: Surveillance Technologies and the Future of Privacy. 2002, Chicago, IL , University of Chicago Press.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.