Journal of General Internal Medicine, Volume 34, Issue 2, pp 211–217

Applying Machine Learning Algorithms to Segment High-Cost Patient Populations

  • Jiali Yan
  • Kristin A. Linn
  • Brian W. Powers
  • Jingsan Zhu
  • Sachin H. Jain
  • Jennifer L. Kowalski
  • Amol S. Navathe
Original Research



Background

Efforts to improve the value of care for high-cost patients may benefit from care management strategies targeted at clinically distinct subgroups of patients.


Objective

To evaluate the performance of three different machine learning algorithms for identifying subgroups of high-cost patients.


Design

We applied three clustering algorithms to a clinical and administrative dataset: connectivity-based clustering using agglomerative hierarchical clustering, centroid-based clustering with the k-medoids algorithm, and density-based clustering with the OPTICS algorithm. We then examined the extent to which each algorithm identified subgroups of patients that were (1) clinically distinct and (2) associated with meaningful differences in relevant utilization metrics.
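As an illustration of the centroid-based approach only, the following is a minimal sketch of a simple alternating k-medoids procedure (assign each point to its nearest medoid, then re-pick each medoid). It is not the authors' implementation and may differ from the variant they used; the function name `k_medoids`, the Euclidean distance, and the two-feature toy "patients" are all assumptions made for the example.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_medoids(points, k, max_iter=100, seed=0):
    """Simple alternating k-medoids (illustrative, not the study's code).

    Returns (medoid_indices, labels), where labels[i] is the position in
    medoid_indices of the medoid closest to points[i].
    """
    rng = random.Random(seed)
    n = len(points)
    # Precompute all pairwise distances once.
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

    def assign(medoids):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = rng.sample(range(n), k)
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Hypothetical, already-scaled features for six patients
# (for example, an admissions score and a chronic-condition score).
patients = [(1.0, 0.2), (1.2, 0.1), (0.9, 0.3), (8.0, 5.1), (8.3, 4.9), (7.9, 5.3)]
medoids, labels = k_medoids(patients, k=2)
print(medoids, labels)
```

Unlike a k-means centroid, each medoid is an actual observation, so a subgroup can be summarized by pointing to a real representative patient record.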


Participants

Patients enrolled in a national Medicare Advantage plan, categorized in the top decile of spending (n = 6154).

Main Measures

Post hoc discriminative models comparing the importance of variables for distinguishing observations in one cluster from the rest. Variance in utilization and spending measures.
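One plausible reading of the variance measure is the spread of subgroup-level means: a segmentation whose subgroups differ more in average utilization yields a larger variance. The sketch below follows that reading and is an assumption, not the authors' stated computation; the helper `between_subgroup_variance` and the toy admission counts are hypothetical.

```python
from statistics import mean, pvariance


def between_subgroup_variance(values, labels):
    """Population variance of the per-subgroup means of a measure."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    subgroup_means = [mean(members) for members in groups.values()]
    return pvariance(subgroup_means)


# Hypothetical annual admissions for four patients.
admissions = [1.0, 3.0, 10.0, 12.0]

# Two hypothetical segmentations of the same patients.
segmentation_a = [0, 0, 1, 1]  # separates low and high utilizers
segmentation_b = [0, 1, 0, 1]  # mixes low and high utilizers

var_a = between_subgroup_variance(admissions, segmentation_a)  # means 2 and 11
var_b = between_subgroup_variance(admissions, segmentation_b)  # means 5.5 and 7.5
print(var_a, var_b)
```

Under this reading, segmentation A would be preferred because its subgroups are further apart in mean admissions.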

Key Results

Connectivity-based, centroid-based, and density-based clustering identified eight, five, and ten subgroups of high-cost patients, respectively. Post hoc discriminative models indicated that density-based clustering subgroups were the most clinically distinct. The variance of utilization and spending measures was the greatest among the subgroups identified through density-based clustering.


Conclusions

Machine learning algorithms can be used to segment a high-cost patient population into subgroups of patients that are clinically distinct and associated with meaningful differences in utilization and spending measures. For these purposes, density-based clustering with the OPTICS algorithm outperformed connectivity-based and centroid-based clustering algorithms.


Key Words

high-cost patients; machine learning; patient segmentation


Funding Information

This study was supported by a grant from the Anthem Public Policy Institute and, in part, under a grant with the Pennsylvania Department of Health.

Compliance with Ethical Standards

This study was approved by the Institutional Review Board of the University of Pennsylvania.

Conflict of Interest

Dr. Navathe reports grant support from Hawaii Medical Service Association, Anthem Public Policy Institute, and Oscar Health; personal fees from Navvis and Co, Navigant Inc., Lynx Medical, Indegene Inc., Agathos, Inc, and Sutherland Global Services; personal fees and equity from NavaHealth; uncompensated board service for Integrated Services, Inc.; speaking fees from the Cleveland Clinic; and honoraria from Elsevier Press. Dr. Linn reports grant support from Hawaii Medical Service Association. Dr. Jain reports employment by Anthem, Inc., stock ownership in Anthem, Inc., and honoraria from Elsevier Press. Ms. Kowalski reports employment by Anthem, Inc. and stock ownership in Anthem, Inc. and Amazon. Dr. Powers reports employment by Anthem, Inc. All other authors declare no conflicts of interest.


The Pennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations, or conclusions.

Supplementary material

ESM 1: 11606_2018_4760_MOESM1_ESM.docx (640 kb)



Copyright information

© Society of General Internal Medicine 2018

Authors and Affiliations

  • Jiali Yan (1)
  • Kristin A. Linn (2)
  • Brian W. Powers (3, 4, 5, 6)
  • Jingsan Zhu (7)
  • Sachin H. Jain (5)
  • Jennifer L. Kowalski (8)
  • Amol S. Navathe (7, 9), corresponding author
  1. Department of Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, USA
  2. Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, USA
  3. Department of Medicine, Brigham and Women’s Hospital, Boston, USA
  4. Department of Population Medicine, Harvard Medical School/Harvard Pilgrim Health Care Institute, Boston, USA
  5. CareMore Health System, Cerritos, USA
  6. Atrius Health, Boston, USA
  7. Department of Medical Ethics and Health Policy, University of Pennsylvania Perelman School of Medicine, Philadelphia, USA
  8. Anthem Public Policy Institute, Washington, USA
  9. Corporal Michael J. Crescenz VA Medical Center, Philadelphia, USA
