Unsupervised learning methods for efficient geographic clustering and identification of disease disparities with applications to county-level colorectal cancer incidence in California

Health Care Management Science

Abstract

Many public health policymaking questions involve data subsets representing application-specific attributes and geographic location. We develop and evaluate standard and tailored techniques for clustering via unsupervised learning (UL) algorithms on such amalgamated (dual-domain) data sets. The aim of the associated algorithms is to identify geographically efficient clusters that also maximize the number of statistically significant differences in disease incidence and demographic variables across top clusters. Two standard UL approaches, k-means with k-means++ initialization (k++) and the standard self-organizing map (SSOM), are considered along with a new, tailored version of the SOM (TSOM). The TSOM algorithm optimizes a customized objective function whose terms promote geographic cohesion within each cluster while maximizing the number of differences across clusters. It also uses two hyper-parameters that control the relative weighting of the geographic and attribute subspaces in a non-Euclidean distance measure within the clustering problem. The performance of the three techniques (k++, SSOM, TSOM) is compared and evaluated on a data set for colorectal cancer incidence in the state of California, at the level of individual counties. Clusters are visualized via choropleth maps, and ordered graphs are used to illustrate disparities in disease incidence among four identity groups. While all three approaches performed well, the TSOM identified the largest number of disease and demographic disparities while also yielding more geographically efficient top clusters. Techniques presented in this study are relevant to applications including the delivery of health care resources and the identification of disparities among identity groups, and to questions involving coordination between county- and state-level policymakers.
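The idea of weighting geographic and attribute subspaces inside a single distance measure can be illustrated with a minimal sketch. The preview does not give the actual TSOM distance or objective function, so the weighted form below, the function names, the parameters `w_geo` and `w_attr`, and the toy data are all assumptions made for illustration only.

```python
import math


def dual_domain_distance(a, b, n_geo, w_geo=1.0, w_attr=1.0):
    """Weighted distance between two records.

    Assumed layout (hypothetical): the first n_geo entries of each record
    are geographic coordinates, the remaining entries are attributes.
    """
    geo_sq = sum((x - y) ** 2 for x, y in zip(a[:n_geo], b[:n_geo]))
    attr_sq = sum((x - y) ** 2 for x, y in zip(a[n_geo:], b[n_geo:]))
    return math.sqrt(w_geo * geo_sq + w_attr * attr_sq)


def assign_to_nearest(records, centers, n_geo, w_geo, w_attr):
    """Return, for each record, the index of its nearest center."""
    return [
        min(
            range(len(centers)),
            key=lambda j: dual_domain_distance(r, centers[j], n_geo, w_geo, w_attr),
        )
        for r in records
    ]


# Hypothetical toy data: (x, y, standardized incidence rate).
records = [(0.0, 0.0, 1.0), (0.1, 0.0, -1.0), (1.0, 1.0, 1.0)]
centers = [(0.0, 0.0, -1.0), (1.0, 1.0, 1.0)]

# Emphasizing geography groups the first record with its spatial neighbor.
geo_heavy = assign_to_nearest(records, centers, n_geo=2, w_geo=10.0, w_attr=0.1)

# Emphasizing attributes groups it with the distant record of similar rate.
attr_heavy = assign_to_nearest(records, centers, n_geo=2, w_geo=0.1, w_attr=10.0)
```

In this toy example the first record changes cluster when the weights are swapped, which is the trade-off between geographic cohesion and attribute similarity that such hyper-parameters are meant to control.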

Data availability

Only public data sets were used herein, as cited in the text.

Code availability

The corresponding author can be contacted regarding the custom code developed herein.

References

  1. Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035

  2. Bacao F, Lobo V, Painho M (2005) The self-organizing map, the geo-SOM, and relevant variants for the geosciences. Computers & Geosciences 31:155–163

  3. Bergin RJ, Emery J, Bollard RC, Falborg AZ, Jensen H, Weller D, Menon U, Vedsted P, Thomas RJ, Whitfield K, White V (2018) Rural-urban disparities in time to diagnosis and treatment for colorectal and breast cancer. Cancer Epidemiol Biomarkers Prev 27(9):1036–1046

  4. Dunn OJ (1964) Multiple comparisons using rank sums. Technometrics 6:241–252

  5. Fisher RA (1928) Statistical methods for research workers. Stechert

  6. Goovaerts P, Jacquez G (2004) Accounting for regional background and population size in the detection of spatial clusters and outliers using geostatistical filtering and spatial neutral models: the case of lung cancer in Long Island, New York. International Journal of Health Geographics 3(1):14

  7. Institute of Medicine (2003) The future of the public’s health in the 21st century. The National Academies Press, Washington, DC

  8. Jackson CS, Oman M, Patel AM, Vega KJ (2016) Health disparities in colorectal cancer among racial and ethnic minorities in the United States. J Gastrointest Oncol 7(Suppl 1):S32–S43

  9. Kohonen T (1982) Analysis of a simple self-organizing process. Biol Cybern 44:135–140

  10. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59–69

  11. Kohonen T (2014) MATLAB implementations and applications of the self-organizing map. Unigrafia, Helsinki

  12. Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47:583–621

  13. Liao ZX, Peng WC (2012) Clustering spatial data with a geographic constraint: exploring local search. Knowledge and Information Systems 31:153–170

  14. Lin CR, Liu KH, Chen MS (2005) Dual clustering: integrating data clustering over optimization and constraint domains. IEEE Trans Knowl Data Eng 17:628–637

  15. Murphy CC, Wallace K, Sandler RS, Baron JA (2019) Racial disparities in incidence of young-onset colorectal cancer and patient survival. Gastroenterology 156(4):958–965

  16. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1:281–297

  17. Marsland S (2015) Machine learning: an algorithmic perspective. Taylor & Francis Group, Boca Raton, FL

  18. McMahon ME (2020) Unsupervised learning models for dual-domain data with proximal geographic clustering. PhD Thesis, North Carolina State University, Raleigh

  19. National Program of Cancer Registries and Surveillance, Epidemiology, and End Results SEER*Stat Database: NPCR and SEER Incidence – U.S. Cancer Statistics 2001–2016 Public Use Research Database. November 2018 submission (2001–2016), United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute. Accessed at www.cdc.gov/cancer/uscs/public-use/. Released June 2019, based on the November 2018 submission.

  20. National Cancer Institute (2020) State cancer profiles: dynamic views of cancer statistics for prioritizing cancer control efforts across the nation. https://statecancerprofiles.cancer.gov/index.html

  21. R Core Team (2019) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  22. Rico J, Miguelino-Keasling V, Darsie B, Davis S, Kwong S, Snipes KP (1988) Colorectal cancer in California. California Department of Public Health, Cancer Surveillance Section, Sacramento, CA

  23. Romesburg HC (2004) Cluster analysis for researchers. Lulu Press, Morrisville, NC (reprint of 1984 edition, with minor revisions)

  24. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53–65

  25. Thorndike RL (1953) Who belongs in the family? Psychometrika 18(4):267–276

  26. Tukey JW (1949) Comparing individual means in the analysis of variance. Biometrics 5:99–114

  27. United States Gazetteer Files. United States Census Bureau. Accessed at https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html. 2019

  28. Waller LA, Gotway CA (2004) Applied spatial statistics for public health data. John Wiley & Sons, Hoboken, NJ

  29. Weinberg BA, Marshall JL (2019) Colon cancer in young adults: trends and their implications. Curr Oncol Rep 21(1):3

  30. Wheeler DC (2007) A comparison of spatial clustering and cluster detection techniques for childhood leukemia incidence in Ohio, 1996–2003. Int J Health Geogr 6:13

  31. Wittich AR, Shay LA, Flores B, De La Rosa EM, Mackay R, Valerio MA (2019) Colorectal cancer screening: understanding the health literacy needs of Hispanic rural residents. AIMS Public Health 6(2):107–120

  32. Yager S, Cheung WY (2011) Gender disparities in colorectal cancer screening. J Clin Oncol 29(15):1544–1544

  33. Yonto D, Issel LM, Thill J-C (2019) Spatial analytics based on confidential data for strategic planning in urban health departments. Urban Sci 3:75

Author information

Corresponding author

Correspondence to Mansoor A. Haider.

Ethics declarations

Ethics statement

Ethics approval was not needed for any of the data used in this study, in accordance with the policies of our institutions.

Conflict of interests

None.

About this article

Cite this article

McMahon, M.E., Doroshenko, L., Roostaei, J. et al. Unsupervised learning methods for efficient geographic clustering and identification of disease disparities with applications to county-level colorectal cancer incidence in California. Health Care Manag Sci 25, 574–589 (2022). https://doi.org/10.1007/s10729-022-09604-5

  • DOI: https://doi.org/10.1007/s10729-022-09604-5
