Abstract
Many public health policymaking questions involve data subsets representing application-specific attributes and geographic location. We develop and evaluate standard and tailored techniques for clustering via unsupervised learning (UL) algorithms on such amalgamated (dual-domain) data sets. The aim of these algorithms is to identify geographically efficient clusters that also maximize the number of statistically significant differences in disease incidence and demographic variables across top clusters. Two standard UL approaches, k-means with k-means++ initialization (k++) and the standard self-organizing map (SSOM), are considered along with a new, tailored version of the SOM (TSOM). The TSOM algorithm optimizes a customized objective function whose terms promote geographic cohesion within each cluster while maximizing the number of differences across clusters; two hyper-parameters control the relative weighting of the geographic and attribute subspaces in a non-Euclidean distance measure used within the clustering problem. The performance of the three techniques (k++, SSOM, TSOM) is compared and evaluated on a data set of colorectal cancer incidence in the state of California, at the level of individual counties. Clusters are visualized via choropleth maps, and ordered graphs are used to illustrate disparities in disease incidence among four identity groups. While all three approaches performed well, the TSOM identified the largest number of disease and demographic disparities while also yielding more geographically efficient top clusters. The techniques presented in this study are relevant to applications including the delivery of health care resources and the identification of disparities among identity groups, and to questions involving coordination between county- and state-level policymakers.
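The idea of weighting geographic and attribute subspaces separately can be illustrated with a minimal Python sketch. It is not the authors' TSOM or their distance measure, which the abstract does not specify: the weighted form below, the function names (`dual_distance`, `kmeanspp_seed`, `assign`), the parameters `w_geo` and `w_attr`, and the toy data are all assumptions made for illustration. The seeding step follows the general k-means++ recipe of Arthur and Vassilvitskii (2007), sampling each new center with probability proportional to its squared distance from the nearest existing center.

```python
import math
import random


def dual_distance(a, b, n_geo, w_geo, w_attr):
    """Weighted distance between two records (illustrative assumption only).

    The first n_geo entries are geographic coordinates and the rest are
    attributes; w_geo and w_attr are hypothetical subspace weights.
    """
    geo_sq = sum((x - y) ** 2 for x, y in zip(a[:n_geo], b[:n_geo]))
    attr_sq = sum((x - y) ** 2 for x, y in zip(a[n_geo:], b[n_geo:]))
    return math.sqrt(w_geo * geo_sq + w_attr * attr_sq)


def kmeanspp_seed(points, k, n_geo, w_geo, w_attr, rng):
    """k-means++ style seeding using the weighted distance.

    Assumes the data contain at least k distinct points.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [
            min(dual_distance(p, c, n_geo, w_geo, w_attr) for c in centers) ** 2
            for p in points
        ]
        # Sample the next center with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def assign(points, centers, n_geo, w_geo, w_attr):
    """Index of the nearest center for each point."""
    return [
        min(
            range(len(centers)),
            key=lambda j: dual_distance(p, centers[j], n_geo, w_geo, w_attr),
        )
        for p in points
    ]


# Synthetic toy records (x, y, attribute); not real county data.
toy = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.1], [5.0, 5.0, 9.0]]
seeds = kmeanspp_seed(toy, 2, n_geo=2, w_geo=1.0, w_attr=1.0, rng=random.Random(0))
labels = assign(toy, seeds, n_geo=2, w_geo=1.0, w_attr=1.0)
```

In this sketch, raising `w_geo` relative to `w_attr` pulls assignments toward geographically compact groups, while raising `w_attr` favors groups that are similar in their attributes.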
Data availability
Only public data sets were used herein, as cited in the text.
Code availability
The corresponding author can be contacted regarding the custom code developed herein.
References
Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035
Bacao F, Lobo V, Painho M (2005) The self-organizing map, the geo-SOM, and relevant variants for the geosciences. Computers & Geosciences 31:155–163
Bergin RJ, Emery J, Bollard RC, Falborg AZ, Jensen H, Weller D, Menon U, Vedsted P, Thomas RJ, Whitfield K, White V (2018) Rural-urban disparities in time to diagnosis and treatment for colorectal and breast cancer. Cancer Epidemiol Biomarkers Prev 27(9):1036–1046
Dunn OJ (1964) Multiple comparisons using rank sums. Technometrics 6:241–252
Fisher RA (1928) Statistical methods for research workers. Stechert
Goovaerts P, Jacquez G (2004) Accounting for regional background and population size in the detection of spatial clusters and outliers using geostatistical filtering and spatial neutral models: the case of lung cancer in Long Island, New York. International Journal of Health Geographics 3(1):14
Institute of Medicine (2003) The future of the public’s health in the 21st century. The National Academies Press, Washington, DC
Jackson CS, Oman M, Patel AM, Vega KJ (2016) Health disparities in colorectal cancer among racial and ethnic minorities in the United States. J Gastrointest Oncol 7(Suppl 1):S32–S43
Kohonen T (1982) Analysis of a simple self-organizing process. Biol Cybern 44:135–140
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59–69
Kohonen T (2014) MATLAB implementations and applications of the self-organizing map. Unigrafia, Helsinki
Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47:583–621
Liao ZX, Peng WC (2012) Clustering spatial data with a geographic constraint: exploring local search. Knowledge and Information Systems 31:153–170
Lin CR, Liu KH, Chen MS (2005) Dual clustering: integrating data clustering over optimization and constraint domains. IEEE Trans Knowl Data Eng 17:628–637
Murphy CC, Wallace K, Sandler RS, Baron JA (2019) Racial disparities in incidence of young-onset colorectal cancer and patient survival. Gastroenterology 156(4):958–965
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1:281–297
Marsland S (2015) Machine learning: an algorithmic perspective. Taylor & Francis Group, Boca Raton, FL
McMahon ME (2020) Unsupervised learning models for dual-domain data with proximal geographic clustering. PhD Thesis, North Carolina State University, Raleigh
National Program of Cancer Registries and Surveillance, Epidemiology, and End Results SEER*Stat Database: NPCR and SEER Incidence – U.S. Cancer Statistics 2001–2016 Public Use Research Database. November 2018 submission (2001-2016), United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute. Accessed at www.cdc.gov/cancer/uscs/public-use/. Released June 2019, based on the November 2018 submission.
National Cancer Institute (2020) State cancer profiles: dynamic views of cancer statistics for prioritizing cancer control efforts across the nation, https://statecancerprofiles.cancer.gov/index.html
R Core Team (2019) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rico J, Miguelino-Keasling V, Darsie B, Davis S, Kwong S, Snipes KP (1988) Colorectal cancer in California. California Department of Public Health, Cancer Surveillance Section, Sacramento, CA
Romesburg HC (2004) Cluster analysis for researchers. Lulu Press, Morrisville, NC (Reprint of 1984 edition, with minor revisions)
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53–65
Thorndike RL (1953) Who belongs in the family? Psychometrika 18(4):267–276
Tukey JW (1949) Comparing individual means in the analysis of variance. Biometrics 5:99–114
United States Gazetteer Files. United States Census Bureau. Accessed at https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html. 2019
Waller LA, Gotway CA (2004) Applied spatial statistics for public health data. John Wiley & Sons, Hoboken, NJ
Weinberg BA, Marshall JL (2019) Colon cancer in young adults: trends and their implications. Curr Oncol Rep 21(1):3
Wheeler DC (2007) A comparison of spatial clustering and cluster detection techniques for childhood leukemia incidence in Ohio, 1996–2003. Int J Health Geogr 6:13
Wittich AR, Shay LA, Flores B, De La Rosa EM, Mackay R, Valerio MA (2019) Colorectal cancer screening: understanding the health literacy needs of Hispanic rural residents. AIMS Public Health 6(2):107–120
Yager S, Cheung WY (2011) Gender disparities in colorectal cancer screening. J Clin Oncol 29(15):1544–1544
Yonto D, Issel LM, Thill J-C (2019) Spatial analytics based on confidential data for strategic planning in urban health departments. Urban Sci 3:75
Ethics declarations
Ethics statement
Ethics approval was not needed for any of the data used in this study, in accordance with the policies of our institutions.
Conflict of interests
None.
About this article
Cite this article
McMahon, M.E., Doroshenko, L., Roostaei, J. et al. Unsupervised learning methods for efficient geographic clustering and identification of disease disparities with applications to county-level colorectal cancer incidence in California. Health Care Manag Sci 25, 574–589 (2022). https://doi.org/10.1007/s10729-022-09604-5
DOI: https://doi.org/10.1007/s10729-022-09604-5