Unsupervised learning methods for efficient geographic clustering and identification of disease disparities with applications to county-level colorectal cancer incidence in California

McMahon, Mallory E.; Doroshenko, Lyubov; Roostaei, Javad; Cho, Hyunsoon; Haider, Mansoor A.

doi:10.1007/s10729-022-09604-5

Published: 23 June 2022

Health Care Management Science, volume 25, pages 574–589 (2022)

Abstract

Many public health policymaking questions involve data subsets representing application-specific attributes and geographic location. We develop and evaluate standard and tailored techniques for clustering via unsupervised learning (UL) algorithms on such amalgamated (dual-domain) data sets. The aim of the associated algorithms is to identify geographically efficient clusters that also maximize the number of statistically significant differences in disease incidence and demographic variables across top clusters. Two standard UL approaches, k-means with k-means++ initialization (k++) and the standard self-organizing map (SSOM), are considered along with a new, tailored version of the SOM (TSOM). The TSOM algorithm involves optimization of a customized objective function with terms promoting individual geographic cluster cohesion while also maximizing the number of differences across clusters, and two hyper-parameters controlling the relative weighting of geographic and attribute subspaces in a non-Euclidean distance measure within the clustering problem. The performance of these three techniques (k++, SSOM, TSOM) is compared and evaluated in the context of a data set for colorectal cancer incidence in the state of California, at the level of individual counties. Clusters are visualized via choropleth maps, and ordered graphs are used to illustrate disparities in disease incidence among four identity groups. While all three approaches performed well, the TSOM identified the largest number of disease and demographic disparities while also yielding more geographically efficient top clusters. Techniques presented in this study are relevant to applications including the delivery of health care resources and identifying disparities among identity groups, and to questions involving coordination between county- and state-level policymakers.
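To make the idea of weighting geographic and attribute subspaces concrete, the following is a minimal sketch of one possible dual-domain distance. The abstract does not give the form of the distance measure used in the TSOM, so the weighted sum of subspace norms below is an assumption, as are the names `dual_domain_distance`, `assign_clusters`, `n_geo`, `w_geo`, and `w_attr`. It should not be read as the authors' implementation.

```python
import numpy as np

def dual_domain_distance(x, y, n_geo, w_geo=1.0, w_attr=1.0):
    """Weighted sum of Euclidean distances in two subspaces.

    Assumption (not from the article): the first n_geo entries of each
    vector are geographic coordinates and the remaining entries are
    attributes. w_geo and w_attr are hypothetical weights standing in
    for the two hyper-parameters mentioned in the abstract.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d_geo = np.linalg.norm(x[:n_geo] - y[:n_geo])    # geographic part
    d_attr = np.linalg.norm(x[n_geo:] - y[n_geo:])   # attribute part
    return w_geo * d_geo + w_attr * d_attr

def assign_clusters(data, centers, n_geo, w_geo=1.0, w_attr=1.0):
    """Return, for each row of data, the index of the nearest center
    under the weighted dual-domain distance."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    dists = np.array([
        [dual_domain_distance(row, c, n_geo, w_geo, w_attr) for c in centers]
        for row in data
    ])
    return dists.argmin(axis=1)

# Toy data: two location coordinates followed by one attribute value.
points = [[0.0, 0.0, 10.0], [1.0, 0.0, 0.0]]
centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 10.0]]

# Geography only: each point joins the center at its own location.
geo_labels = assign_clusters(points, centers, n_geo=2, w_geo=1.0, w_attr=0.0)

# Attributes only: each point joins the center with the matching attribute.
attr_labels = assign_clusters(points, centers, n_geo=2, w_geo=0.0, w_attr=1.0)
```

In this toy case the two weightings produce opposite assignments, which illustrates why the relative weighting of the subspaces affects how geographically cohesive the resulting clusters are.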

Data availability

Only public data sets were used herein, as cited in the text.

Code availability

The corresponding author can be contacted regarding the custom code developed herein.



Author information

Authors and Affiliations

Department of Mathematics, Box 8205, North Carolina State University, Raleigh, NC, 27695-8205, USA
Mallory E. McMahon & Mansoor A. Haider
Department of Economics and Finance, La Sapienza University of Rome, 00185, Roma, Italy
Lyubov Doroshenko
Department of Environmental Sciences and Engineering, UNC Gillings School of Global Public Health Chapel Hill, Raleigh, NC, 27599-7400, USA
Javad Roostaei
Department of Cancer Control and Population Health, Graduate School of Cancer Science and Policy, National Cancer Center, Goyang, Korea
Hyunsoon Cho

Corresponding author

Correspondence to Mansoor A. Haider.

Ethics declarations

Ethics statement

Ethics approval was not needed for any of the data used in this study, in accordance with the policies of our institutions.

Conflict of interests

None.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

McMahon, M.E., Doroshenko, L., Roostaei, J. et al. Unsupervised learning methods for efficient geographic clustering and identification of disease disparities with applications to county-level colorectal cancer incidence in California. Health Care Manag Sci 25, 574–589 (2022). https://doi.org/10.1007/s10729-022-09604-5

Received: 17 December 2020
Accepted: 06 June 2022
Published: 23 June 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10729-022-09604-5
