Key Points

Artificial intelligence (AI) prescription algorithms have been successfully applied to single disease problems, but previous applications have not considered comorbid conditions, pharmacological treatments, treatment histories, and other individual characteristics that are important for personalized diabetes management.

We trained and evaluated a series of AI algorithms to optimize patients’ glycemia, blood pressure, and cardiovascular disease (CVD) risk outcomes, either individually or jointly, using a retrospective cohort of patients with type 2 diabetes from an ambulatory care electronic health record database (2009–2017).

When optimizing glycemia, blood pressure, and CVD risk individually, the algorithms’ recommendations were consistent with clinicians’ prescriptions in 86.1%, 82.9%, and 98.4% of patient encounters, respectively. In encounters where the AI recommendation differed from the clinician’s prescription, estimated health outcomes were significantly improved.

The reinforcement learning (RL) algorithm can be integrated into electronic health record platforms to assist physicians with dynamic, real-time suggestions of personalized treatment paths.

1 Introduction

Comorbid chronic conditions are common among people with type 2 diabetes (T2DM) [1]. Hypertension (HTN) and atherosclerotic cardiovascular disease (CVD) are the two most common comorbidities in patients with T2DM [2]; therefore, the need to address comorbid chronic conditions, in addition to patients’ diabetes-specific treatment goals [3], poses a substantial challenge for effective T2DM management. Although improvements in glycemic monitoring and control have been documented in several large systems of care, and more widespread use of treatments such as angiotensin-converting enzyme (ACE) inhibitors and aspirin has decreased patients’ risk of cardiovascular death, current standards of care and guidelines are usually built around single diseases [4]. Despite the increasing number of patients with multimorbidity, such patients are usually excluded from randomized controlled trials [5,6,7]. A systematic review of interventions for patients with multimorbidity identified only 10 randomized trials worldwide and highlighted the paucity of research into interventions to improve outcomes for these patients [8]. Moreover, a large body of evidence suggests that the response to T2DM treatment, HTN treatment, and CVD prevention differs between population subgroups [9, 10]. The need for an individualized approach is therefore especially pressing, given the variety of comorbid conditions, pharmacological treatments, individual treatment histories, and other individual characteristics that may inform treatment selection.

We present an artificial intelligence (AI) prescription algorithm, based on reinforcement learning (RL), that dynamically suggests personalized optimal treatments for patients with T2DM to manage their multimorbidity, based on evidence from patients’ electronic health records (EHRs). RL has previously been applied successfully to single-disease problems, such as blood glucose control [11], HIV therapy [12], cancer treatment [13], anemia treatment in hemodialysis patients [14], treatment strategies for sepsis in intensive care [15], and personalized regimens of sedation dosage and ventilator support for patients in intensive care units (ICUs) [12]. Prescriptive algorithms using regression trees and \(k\)-nearest neighbors (kNN) have also shown great potential in personalized diabetes management [16, 17].

Our approach leverages RL and the abundant data in the EHR system to dynamically recommend treatment prescriptions that are personalized to patient characteristics, including age, sex, race, body mass index (BMI), blood pressure (BP), laboratory tests, duration of T2DM, and treatment history. We first applied RL to optimize glycemic control, BP control, and CVD prevention separately, and then studied the potential of RL for multimorbidity management by optimizing all three outcomes jointly. We evaluated the effectiveness of the personalized RL recommendations against clinicians’ observed prescriptions by estimating patient outcomes from the outcomes of similar patients in the EHR database.

2 Research Design and Methods

2.1 Study Design and Participants

We used ambulatory care EHR samples for T2DM patients from New York University Langone Health (NYULH–EHR) to derive and validate the RL algorithm. Eligible patients had at least one encounter with an NYULH ambulatory primary care physician between 2009 and 2017 and were identified by a rule-based T2DM phenotyping algorithm that required any of the following: (1) at least two encounters with an International Classification of Diseases, Tenth Revision (ICD-10) code for T2DM; (2) two or more abnormal hemoglobin A1c (A1c) values (≥ 6.5%) and at least one encounter with an ICD-10 code for T2DM; or (3) a prescription for a T2DM medication other than metformin or acarbose. We excluded patients seen for consultation only and patients in emergency department, inpatient, or specialist settings, as these lacked consistent documentation of T2DM across encounters. We randomly assigned 60% of the eligible patients to the training cohort, used to develop the RL algorithm, and the remaining 40% to the test cohort, used to evaluate its performance. This study was approved by the NYULH Institutional Review Board, and the data were de-identified to ensure anonymity.
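The eligibility rule and cohort split above can be sketched as a small function. This is a minimal illustration, not the study’s phenotyping code: the per-encounter record layout (`icd10`, `a1c`, `meds` keys), the use of the ICD-10 E11 prefix to identify T2DM codes, and the assumption that `meds` lists only diabetes medications are all hypothetical. The split is done by patient so that no patient contributes encounters to both cohorts.

```python
import random

A1C_ABNORMAL = 6.5                       # percent, threshold from criterion (2)
EXCLUDED_MEDS = {"metformin", "acarbose"}  # do not qualify under criterion (3)


def is_t2dm_code(code: str) -> bool:
    # Assumption: ICD-10 category E11 (type 2 diabetes mellitus) marks a T2DM code.
    return code.upper().startswith("E11")


def meets_t2dm_phenotype(encounters) -> bool:
    """Apply the three criteria to one patient's list of encounter dicts.

    Each dict is a hypothetical record with keys "icd10" (list of codes),
    "a1c" (float or None), and "meds" (list of diabetes drug names).
    """
    n_dx = sum(any(is_t2dm_code(c) for c in e["icd10"]) for e in encounters)
    n_abnormal_a1c = sum(
        e.get("a1c") is not None and e["a1c"] >= A1C_ABNORMAL for e in encounters
    )
    has_t2dm_med = any(
        m.lower() not in EXCLUDED_MEDS for e in encounters for m in e["meds"]
    )
    return (
        n_dx >= 2                                # criterion (1)
        or (n_abnormal_a1c >= 2 and n_dx >= 1)   # criterion (2)
        or has_t2dm_med                          # criterion (3)
    )


def split_patients(patient_ids, train_frac=0.6, seed=0):
    """Randomly split patient IDs (not encounters) into training and test sets."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(train_frac * len(ids)))
    return set(ids[:n_train]), set(ids[n_train:])
```

Splitting on patient IDs rather than on encounters is why the training and test encounter counts need not be exactly 60% and 40% of all encounters.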

For each patient, we had access to demographic data, including age, sex, race, ethnicity, and smoking status, as well as the following biomarkers: systolic BP (SBP), diastolic BP (DBP), BMI, A1c, total cholesterol (TC), low-density lipoprotein (LDL), high-density lipoprotein (HDL), creatinine, triglycerides, and estimated glomerular filtration rate (eGFR). In NYULH–EHR, 1% of samples had missing vitals (BP and BMI), 8% had missing A1c, 5–32% had missing renal function biomarkers, and 13% had missing lipid biomarkers. Following Lundberg et al. [18], we imputed missing biomarker values from the values observed at the patient’s previous encounters.
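The text does not spell out the imputation rule beyond using values from previous encounters. A minimal sketch, assuming simple last-observation-carried-forward (LOCF) within each patient’s date-ordered history; the function name and record layout are illustrative:

```python
def carry_forward(encounters, fields):
    """Fill missing biomarker values from the most recent earlier encounter.

    `encounters` is a list of dicts for ONE patient, sorted by date; a value of
    None means the biomarker was not measured at that encounter. Values that
    are missing before the first measurement stay None.
    """
    last_seen = {f: None for f in fields}
    filled = []
    for enc in encounters:
        row = dict(enc)  # copy so the caller's records are not modified
        for f in fields:
            if row.get(f) is None:
                row[f] = last_seen[f]    # reuse the last observed value
            else:
                last_seen[f] = row[f]    # remember the newest measurement
        filled.append(row)
    return filled
```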

Medication prescriptions were first grouped by therapeutic class (antihyperglycemic, antihypertensive, and antihyperlipidemic) and then analyzed by pharmacologic subclass. The antihyperglycemic therapeutic class contains nine pharmacologic subclasses: the peroxisome proliferator-activated receptor (PPAR) agonist thiazolidinedione (PPARg), insulin-release stimulant type (INSR), incretin mimetic (glucagon-like peptide 1 receptor agonist; GLP1), dipeptidyl peptidase-4 (DPP4) inhibitor and biguanide (DPP4-BIG), DPP-4 inhibitors (DPP4), biguanide type (BIG), insulin-release stimulant and biguanide (INSR-BIG), sodium-glucose cotransporter-2 inhibitors (SGLT2), and insulins (INSO). The antihypertensive therapeutic class contains 10 pharmacologic subclasses: angiotensin receptor antagonists (ARAs), potassium-sparing diuretics in combination (PSD), α/β-adrenergic blocking agents (ABAB), ACE inhibitor with thiazide or thiazide-like diuretic (ACE-TD), ARAs with thiazide diuretic (ARA-TD), ACE inhibitors (ACE), thiazide and related diuretics (TD), β-adrenergic blocking agents (BAB), calcium channel blocking agents (CCB), and ARAs with CCBs (ARA-CCB). The antihyperlipidemic therapeutic class contains five pharmacologic subclasses: bile salt sequestrants (BSS), HMG-CoA reductase inhibitors (HMG), HMG-CoA reductase inhibitors and cholesterol absorption inhibitors (HMG-CA), proprotein convertase subtilisin/kexin type 9 inhibitors (PCSK9), and lipotropics (LIP).
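One way to encode the subclass taxonomy above is a lookup table plus a function that turns the subclasses prescribed at one encounter into one regimen per therapeutic class. This is a hypothetical representation (regimens as frozensets, an empty set meaning no drug from that class), not the authors’ data schema; the short codes follow the abbreviations in the text, with `ARA` standing in for ARAs.

```python
# Pharmacologic subclasses per therapeutic class, using the abbreviations above.
SUBCLASSES = {
    "antihyperglycemic": ["PPARg", "INSR", "GLP1", "DPP4-BIG", "DPP4", "BIG",
                          "INSR-BIG", "SGLT2", "INSO"],
    "antihypertensive": ["ARA", "PSD", "ABAB", "ACE-TD", "ARA-TD", "ACE", "TD",
                         "BAB", "CCB", "ARA-CCB"],
    "antihyperlipidemic": ["BSS", "HMG", "HMG-CA", "PCSK9", "LIP"],
}

# Reverse lookup: subclass -> therapeutic class.
CLASS_OF = {sub: cls for cls, subs in SUBCLASSES.items() for sub in subs}


def regimen_by_class(prescribed):
    """Group the subclasses prescribed at one encounter into per-class regimens.

    Returns {class: frozenset of subclasses}; an empty frozenset means no drug
    from that class was prescribed. Unknown codes raise KeyError.
    """
    regimen = {cls: set() for cls in SUBCLASSES}
    for sub in prescribed:
        regimen[CLASS_OF[sub]].add(sub)
    return {cls: frozenset(subs) for cls, subs in regimen.items()}
```

Under this encoding, an RL–glycemia action at an encounter would be the antihyperglycemic regimen, and an RL–multimorbidity action the combination of all three per-class regimens.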

2.2 Overview of the Reinforcement Learning Algorithm

RL algorithms model the course of a patient’s EHR history, including prescriptions, biomarkers, and health outcomes that change over time, as a Markov decision process whose key elements are the state, action, and reward [15, 19]. In this setting, the ‘state’ comprises the observed patient demographics, the laboratory test results at the current encounter, and the patient’s history of laboratory tests and prescriptions. The ‘action’ is the treatment regimen prescribed at the current encounter, that is, a pharmacologic subclass or a combination of subclasses. Each action yields a numerical reward representing the improvement in health outcomes relative to the previous encounter, and the cumulative reward is the sum of the rewards along the patient’s sequence of EHR encounters. RL is a well-established framework for maximizing the cumulative reward by selecting an optimal action at each encounter; here, the optimal actions were learned with Deep Q-Networks (DQN) [20, 21], which approximate the expected cumulative reward of each action with a multilayer (deep) neural network. An important advantage of RL is that the action at every encounter is personalized to the patient’s individual characteristics as they are observed, in a way that optimizes the cumulative reward. In this paper, we focus on glycemic control (lowering A1c towards 6.5%), BP control (lowering SBP towards 120 mmHg), and CVD prevention (minimizing CVD risk). We first optimized each outcome individually using three separate RL algorithms, referred to as RL–glycemia, RL–BP, and RL–CVD. We then trained a multimorbidity management RL algorithm (RL–multimorbidity) to optimize glycemia, BP, and CVD risk simultaneously. The state, action, and reward are defined as follows:
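The core of a Deep Q-Network update is the Bellman target \(y = r + \gamma \max_{a'} Q(s', a')\), where \(s'\) is the state at the next encounter. The sketch below, assuming NumPy and a precomputed batch of Q-values, shows only that target computation, a squared-error loss, and greedy action selection. The network itself, the discount factor \(\gamma\) (set to 1 by default to mirror the undiscounted sum of rewards described above), and all function names are assumptions rather than the authors’ implementation.

```python
import numpy as np


def dqn_targets(rewards, next_q, done, gamma=1.0):
    """Bellman targets y = r + gamma * max_a' Q_target(s', a') for a batch.

    rewards: shape (B,), reward between consecutive encounters
    next_q:  shape (B, A), target-network Q-values at the next encounter
    done:    shape (B,), True when the encounter is the patient's last record
    gamma:   discount factor; 1.0 matches an undiscounted sum of rewards
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    done = np.asarray(done, dtype=bool)
    bootstrap = next_q.max(axis=1)  # value of the best regimen at the next visit
    return rewards + gamma * np.where(done, 0.0, bootstrap)


def td_loss(q_pred, actions, targets):
    """Mean squared error between Q(s, a) of the taken actions and the targets."""
    q_pred = np.asarray(q_pred, dtype=float)
    chosen = q_pred[np.arange(len(actions)), actions]
    return float(np.mean((chosen - np.asarray(targets, dtype=float)) ** 2))


def greedy_action(q_values):
    """Recommended regimen index: the action with the highest Q-value."""
    return int(np.argmax(q_values))
```

In an offline EHR setting, the "action taken" in `td_loss` is the clinician’s observed regimen, and the trained network is then queried with `greedy_action` to produce the recommendation.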

  • State: A list of observed patient characteristics: age, sex, race, and smoking status; vitals and laboratory test values at the current encounter and over the past 6 months, including BMI, weight, SBP, DBP, triglycerides, TC, HDL, LDL, A1c, and creatinine; prescription history over the past 6 months; and encounter history, including days since the previous encounter and days since the first encounter.

  • Action: The action space consists of the pharmacologic subclasses and their combinations, referred to as treatment regimens. The action space of RL–glycemia contains the nine pharmacologic subclasses of the antihyperglycemic therapeutic class and their combinations; the action space of RL–BP contains the 10 pharmacologic subclasses of the antihypertensive therapeutic class and their combinations; the action space of RL–CVD contains the five pharmacologic subclasses of the antihyperlipidemic therapeutic class and their combinations; and the action space of RL–multimorbidity contains the pharmacologic subclasses of all three therapeutic classes and their combinations.

  • Reward: The reward of a prescription is a numeric measure of treatment efficacy between two consecutive encounters. For RL–glycemia, the reward is zero if A1c is < 5.6% at both encounters; otherwise, it is the reduction in A1c. For RL–BP, the reward is zero if SBP is < 120 mmHg (no HTN) at both encounters; otherwise, it is the decrease in SBP. For RL–CVD, the reward is the reduction in the global CVD Framingham Risk Score (FRS) [22], which is a function of age, TC, HDL, SBP, treatment for HTN, smoking, and diabetes status (positive for all patients in this cohort). Sex-specific risk equations were applied to males and females separately. For RL–multimorbidity, the reward is the average of the standardized rewards of RL–BP, RL–glycemia, and RL–CVD (model and training details are provided in the electronic supplementary material).
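The single-outcome rewards above are differences between consecutive encounters, so they can be sketched directly. The function names are hypothetical, FRS values are taken as already computed (the sex-specific Framingham equations are not reproduced here), and standardizing each reward series by its own mean and standard deviation is an assumed reading of "standardized rewards"; in practice those statistics would presumably come from the training cohort.

```python
import statistics


def glycemia_reward(a1c_prev, a1c_curr, threshold=5.6):
    """Reduction in A1c (%) between encounters; zero if below threshold at both."""
    if a1c_prev < threshold and a1c_curr < threshold:
        return 0.0
    return a1c_prev - a1c_curr


def bp_reward(sbp_prev, sbp_curr, threshold=120.0):
    """Decrease in SBP (mmHg); zero if SBP is below threshold at both encounters."""
    if sbp_prev < threshold and sbp_curr < threshold:
        return 0.0
    return sbp_prev - sbp_curr


def cvd_reward(frs_prev, frs_curr):
    """Reduction in Framingham Risk Score (FRS values assumed precomputed)."""
    return frs_prev - frs_curr


def standardize(values):
    """z-score a series of rewards (assumption: population SD, nonzero)."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def multimorbidity_rewards(glyc, bp, cvd):
    """Per-transition average of the standardized glycemia, BP, and CVD rewards."""
    zs = [standardize(r) for r in (glyc, bp, cvd)]
    return [sum(t) / 3 for t in zip(*zs)]
```

Standardizing before averaging keeps one outcome measured in mmHg from dominating outcomes measured in percentage points.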

2.3 Model Evaluation

We evaluated the RL-recommended therapy by comparing its effect with that of the observed clinicians’ prescriptions in the test cohort of NYULH–EHR samples. At each encounter, the RL algorithm recommends a treatment regimen for the patient. If the recommendation matched the clinician’s observed prescription, we considered RL ‘consistent’ with the clinician. When RL was discrepant with the clinician’s prescription, the efficacy of the RL-recommended treatment was not directly observed. We therefore imputed the outcome of the RL-recommended treatment using kNN regression, an approach commonly used for causal inference in observational studies [23]. In short, the imputation averages the outcomes of the \(k\) patient encounters that are most similar in patient characteristics and in which clinicians had administered the RL-recommended therapy. Similarity between patient encounters was measured by Euclidean distance, as in Bertsimas et al. [16]. To assess the imputation, we first compared imputed outcomes with observed outcomes under clinicians’ treatments and found 87–95% correlation between them, indicating that the imputation algorithm can effectively estimate unobserved health outcomes (Table 1). We varied the number \(k\) of nearest neighbors and found that the performance of the imputation (for each of the three health outcomes) was insensitive to \(k\) between 8 and 10. We estimated the efficacy of the RL recommendations first in the whole test set and then in sex, racial, and age subgroups.
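The kNN counterfactual imputation described above can be sketched in a few lines of NumPy. This is an illustrative version, not the study’s code: it assumes the encounter features have already been scaled to comparable units (the text specifies only Euclidean distance), and the function and argument names are hypothetical.

```python
import numpy as np


def knn_counterfactual(x_query, regimen, X, regimens, outcomes, k=10):
    """Impute the outcome of `regimen` for the encounter with features x_query.

    Averages the observed outcomes of the k encounters closest to x_query
    (Euclidean distance) among those in which clinicians actually prescribed
    `regimen`. Features are assumed to be pre-scaled to comparable units.
    """
    X = np.asarray(X, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    mask = np.array([r == regimen for r in regimens])
    if not mask.any():
        raise ValueError("no encounters with this regimen to borrow outcomes from")
    candidates = X[mask]
    dists = np.linalg.norm(candidates - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]  # fewer than k if candidates are scarce
    return float(outcomes[mask][nearest].mean())
```

When the same function is used to check imputed against observed clinician outcomes, as for Table 1, the query encounter itself would need to be excluded from the candidate pool; this sketch leaves that to the caller.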

Table 1 Counterfactual outcome versus true clinical outcome comparison based on kNN regression

2.4 Feature Importance

To better understand which features have the most impact on treatment recommendations, we used SHAP (SHapley Additive exPlanations) [24, 25] to estimate and rank the contributions of clinical features to the RL and clinician prescriptions.
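SHAP values approximate Shapley values from cooperative game theory. The text does not say which model or SHAP explainer was used, so the sketch below does not use the `shap` library; it is a brute-force exact computation for a tiny model, assuming absent features are replaced by baseline values, intended only to show what is being estimated. Its cost grows as \(2^n\) in the number of features \(n\).

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, baseline):
    """Exact Shapley values of model f at input x by enumerating coalitions.

    Features outside a coalition are set to their baseline value (a simple
    'interventional' convention).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                s = set(subset)
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

For a linear model the result reduces to each weight times the feature’s deviation from baseline, and the values always sum to \(f(x) - f(\text{baseline})\), the property that lets SHAP rank features on a common scale.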

3 Results

Overall, 16,665 patients in the NYULH ambulatory care EHR samples met the query-based T2DM phenotype between 2009 and 2017, with 1,278,785 encounters (median 12 encounters per patient). The number of identified T2DM patients was robust to variations in the phenotyping algorithm, such as changes in the required number of encounters with T2DM ICD-10 codes and in the qualifying medications. The demographic and clinical characteristics of the analysis cohort are shown in Table 2. Patients were, on average, 65.6 years of age, and 8278 (54.6%) were female. On average, T2DM patients had an A1c of 7.1% and an SBP of 128.9 mmHg. Antihyperglycemic, antihypertensive, and antihyperlipidemic medications were prescribed in 665,768 (52.1%), 849,328 (66.4%), and 428,427 (33.5%) encounters, respectively. The median follow-up time since T2DM diagnosis was 2.6 years (interquartile range [IQR] 1.9–3.9 years). We trained the RL algorithms on the 530,786 encounters of the training cohort (60% of patients) and assessed their performance on the 394,447 encounters of the test cohort (the remaining 40% of patients).

Table 2 Demographics and clinical characteristics of NYULH-EHR patients with type 2 diabetes

The performance of the RL algorithms on the test dataset is summarized in Table 3. The RL–glycemia algorithm was consistent with clinicians’ prescriptions in 86.1% of encounters. In the remaining 15,578 (13.9%) encounters, the mean A1c under clinicians’ prescriptions was 8.09% (95% confidence interval [CI] 8.06–8.12), while the mean A1c under RL–glycemia was 7.80% (95% CI 7.78–7.82), a reduction of 0.30% (95% CI 0.28–0.32; p < 0.001). Significantly fewer encounters had uncontrolled A1c (A1c > 8%) under RL–glycemia than under clinicians’ prescriptions (35% vs. 43%, p < 0.001). The RL–BP algorithm was consistent with clinicians’ prescriptions in 82.9% of encounters. In the remaining 20,251 (17.1%) encounters with discrepant recommendations, RL–BP achieved a 0.58 mmHg (95% CI 0.37–0.79) reduction in SBP relative to clinicians’ prescriptions (131.77 vs. 132.35 mmHg, p < 0.001). Fewer encounters had uncontrolled HTN (SBP > 140 mmHg) under RL–BP than under clinicians’ prescriptions (16% vs. 27%, p < 0.001). RL–CVD was consistent with clinicians’ prescriptions in 98.4% of encounters. In the remaining 946 (1.6%) encounters with discrepant recommendations, the mean FRS was 3.53% (95% CI 2.94–4.12) lower under RL–CVD than under clinicians’ prescriptions (13.65% vs. 17.18%, p < 0.001), with fewer encounters at high FRS risk (> 20%; 25% vs. 31%, p < 0.01). Collectively, these results show high concordance between the RL algorithms and clinicians’ prescriptions for single-target management of patients with T2DM. Discrepancies were more frequent for RL–multimorbidity, which was consistent with clinicians’ prescriptions in 71.1% of encounters. In the remaining 102,184 (28.9%) encounters with discrepant prescriptions, the numbers of encounters with uncontrolled A1c, uncontrolled HTN, and high FRS risk under RL–multimorbidity were 16,436 (16.1%), 9800 (9.6%), and 48,283 (47.3%), respectively, significantly lower than under clinicians’ prescriptions (20.4%, 20.5%, and 54.8%, respectively).
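The concordance rates and mean differences reported above can be computed in outline as follows. The text does not state which tests produced the confidence intervals and p values; this sketch assumes per-encounter paired differences (clinician outcome minus RL-imputed outcome) with a normal-approximation 95% CI, and the function names are hypothetical.

```python
import math
import statistics


def concordance_rate(rl_regimens, clinician_regimens):
    """Fraction of encounters where the RL regimen equals the clinician's."""
    matches = sum(r == c for r, c in zip(rl_regimens, clinician_regimens))
    return matches / len(rl_regimens)


def paired_mean_difference(clinician_outcomes, rl_outcomes, z=1.96):
    """Mean (clinician - RL) outcome with a normal-approximation CI.

    Positive values mean the outcome (e.g. A1c or SBP) is lower under RL.
    Requires at least two discrepant encounters.
    """
    diffs = [c - r for c, r in zip(clinician_outcomes, rl_outcomes)]
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, (mean - z * se, mean + z * se)
```

The difference is computed only over discrepant encounters, because in consistent encounters the RL and clinician outcomes are identical by construction.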

Table 3 Performance of RL algorithms with comparison between RL and clinicians for glycemic control, hypertension control, and CVD prevention

To understand when and how RL prescriptions differ from clinicians’, Table 4 compares consistent and discrepant encounters by patient demographics and clinical characteristics. The factor most significantly associated with discrepancy was disease severity at the time of the encounter. For RL–glycemia, encounters with higher A1c were more likely to receive a different recommendation (average A1c 8.1% for discrepant vs. 7.5% for consistent encounters, p < 0.001). For RL–BP, encounters with higher SBP were more likely to receive a different recommendation (average SBP 132.85 vs. 131.00 mmHg, p < 0.001).

Table 4 Comparison of RL and clinicians for glycemic control, BP, and CVD prevention.

The benefit of the RL prescriptive algorithms was consistently observed across T2DM patients and across sex, racial, and age subgroups (Tables 5, 6, 7). Specifically, African American patients and patients aged 60 years and younger showed larger improvements under the RL algorithms relative to clinicians’ prescriptions than White patients and patients older than 60 years, respectively. For example, A1c under RL–glycemia was 0.39% lower than under clinicians’ treatment for African American patients, compared with 0.28% lower for White patients. Similarly, A1c under RL–glycemia was 0.47% lower than under clinicians’ treatment for patients aged 60 years and younger, compared with 0.19% lower for patients older than 60 years.

Table 5 Subgroup results of the glycemic control RL algorithm
Table 6 Subgroup results of the BP control RL algorithm
Table 7 Subgroup results of the multimorbidity control RL algorithm

The patterns of discrepant treatment recommendations, along with the resulting differences in health outcomes, for RL–glycemia, RL–BP, and RL–multimorbidity are illustrated in Fig. 1. For RL–glycemia, the most frequent discrepancy (1167 encounters) was clinicians prescribing insulin monotherapy (INSO) while RL recommended biguanide type (BIG); in these encounters, RL–glycemia achieved, on average, a 1.22% lower A1c than clinicians. For RL–BP, the most frequent discrepancy (1010 encounters) was clinicians prescribing ACE inhibitors (ACE) while RL recommended β-adrenergic blocking agents (BAB); in these encounters, RL–BP achieved a 6.78 mmHg lower SBP. For RL–multimorbidity, the most frequent discrepancy (1272 encounters) was clinicians prescribing biguanide type (BIG) while RL–multimorbidity recommended HMG-CoA reductase inhibitors (HMG); in these encounters, RL–multimorbidity achieved a 0.15% higher A1c but a 2.42% lower CVD risk and a 0.30 mmHg lower SBP. Overall, the RL algorithms tended to prescribe fewer medications than clinicians (Fig. 2).

Fig. 1

Patterns of the most frequent discrepancies between RL recommendations and clinicians’ prescriptions for (a) RL–glycemia, (b) RL–BP, and (c) RL–multimorbidity. Each cell represents the patient encounters for which RL recommended the regimen on the x axis while clinicians prescribed the regimen on the y axis, and the number in the cell is the count of such encounters. The color of each cell quantifies the improvement in health outcomes achieved by the RL recommendation relative to the clinician’s prescription, with blue indicating a benefit of the RL recommendation and orange indicating a worse outcome. The color shows (a) the mean A1c reduction (%) of RL–glycemia, (b) the mean SBP decrease (mmHg) of RL–BP, and (c) the mean difference in the multimorbidity reward of RL–multimorbidity, each relative to clinicians. RL–CVD was consistent with clinicians’ prescriptions for the vast majority of encounters and is therefore not shown. RL reinforcement learning, SBP systolic blood pressure

Fig. 2

Prescription medication use by RL versus clinicians. Total number of drugs prescribed for (a) blood glucose control, (b) BP control, and (c) multimorbidity management. RL reinforcement learning

Figure 3 shows the importance of features for the RL–multimorbidity algorithm and for clinicians’ prescriptions. In general, the feature importance estimates for RL–multimorbidity agreed reasonably well with those for clinicians’ prescriptions. A1c was the most important feature for clinicians, whereas RL–multimorbidity was most influenced by recent therapies, age, BMI, and A1c. One difference is creatinine, which was important for clinicians’ prescriptions but less so for RL–multimorbidity. Another difference is the smaller role of the time since the first encounter in RL–multimorbidity compared with clinicians’ prescriptions.

Fig. 3

Feature importance of (a) RL–multimorbidity and (b) clinician prescription. RL reinforcement learning, BMI body mass index, HDL high-density lipoprotein, LDL low-density lipoprotein, TC total cholesterol, BP blood pressure

4 Discussion

To the best of our knowledge, this is the first RL-assisted prescriptive algorithm for personalized management of single and multimorbidity outcomes in patients with T2DM. Using an EHR database, the developed RL algorithm can efficiently recommend treatment regimens that optimize patient health outcomes while incorporating individual demographics and treatment histories. Compared with other machine-learning methods, RL has a particular advantage in that it can efficiently learn complex, dynamic drug–disease and drug–drug interactions in the presence of high temporal variation, uncertain outcomes, and long-term treatment effects [15, 19]. RL recommendations showed high concordance with clinicians’ prescriptions for the single-outcome optimization of glycemia, BP, and CVD risk. This demonstrates the feasibility of using RL for T2DM management and indicates that clinicians make near-optimal decisions for single-outcome management.

RL–multimorbidity recommendations were discrepant more often, both with clinicians’ prescriptions and with the recommendations of the single-outcome RL algorithms. This provides data-driven evidence that optimizing multimorbidity management differs from optimizing single outcomes in parallel. For example, in the 1272 patient encounters with the most frequent discrepancy between RL–multimorbidity and clinicians, the average A1c was 7.0%, SBP was 127.2 mmHg, and CVD risk was 12.6%. In these encounters, clinicians prescribed BIG to prioritize glycemic control, whereas RL–multimorbidity prescribed HMG for lipid lowering. This illustrates the challenges and uncertainties of multimorbidity management for patients with borderline, balanced levels of severity across multiple chronic conditions [26, 27]. Overall, RL–multimorbidity improved the management of the three outcomes simultaneously, significantly reducing the number of encounters with uncontrolled glycemia, uncontrolled HTN, and high FRS CVD risk.

Although clinicians and RL–multimorbidity both place high importance on similar factors, they rank these factors differently. The RL algorithms gave less weight than clinicians to features not included in the reward functions, such as creatinine, which clinicians consider an important biomarker of renal function. This points to a potential limitation of RL algorithms whose optimization goal is a reward built on a narrow set of target outcomes. Ideally, a comprehensive reward function should incorporate domain knowledge and adverse events, such as hypoglycemia and kidney comorbidity, to optimize outcomes while balancing the risk of adverse events [28].

Typical limitations of EHR data include unobserved medication adherence, partially observed clinical data at each encounter, and uncontrolled time spans between encounters [29]. However, the RL algorithms were designed to accommodate these uncertainties in real-world settings. In particular, if observable patient characteristics were associated with greater non-adherence to a certain treatment, and hence with lower efficacy, RL would be able to identify this association and recommend different treatments for patients with those characteristics.

Although our evaluation methodology controls for several confounding factors that could explain differences in treatment effects, outcomes under RL recommendations for encounters with discrepant prescriptions can only be estimated as counterfactuals, not observed. In addition, the T2DM patient population from NYULH ambulatory care may not be representative of the United States T2DM population. Ultimately, validating the efficacy of the RL algorithms would require randomized clinical trials in which patients are randomly assigned to RL-guided or clinician-guided treatment.

5 Conclusions

In this study, we demonstrated the feasibility of using RL prescriptive algorithms to manage multimorbidity in patients with T2DM, based on test data from an ambulatory care center. The RL–glycemia, RL–BP, and RL–CVD algorithms showed high concordance (83–98%) with clinicians’ prescriptions, whereas RL–multimorbidity showed relatively low concordance (71%). For patient encounters in which the RL recommendations differed from clinicians’ prescriptions, RL prescriptions showed significantly improved estimated health outcomes compared with clinicians’ prescriptions. The algorithm could potentially be integrated into EHR platforms to assist physicians in T2DM management with dynamic, real-time suggestions of personalized treatment paths.