Type 2 diabetes mellitus (T2DM) is a major cause of morbidity and mortality worldwide, and its prevalence has been rising steadily over recent decades [1]. Although the number and availability of glucose-lowering agents have increased in recent years, selecting appropriate treatment for an individual patient with T2DM can be difficult, given that the relative benefits and risks of different drugs for individual patients are not well understood [2] and that knowledge of the association between patient characteristics and attainment of glycemic control is limited. In common clinical practice, combination therapy with agents that have complementary modes of action is frequently necessary to reach glycemic control [3]. The latest treatment strategy from the American Diabetes Association (ADA) recommends initial combination therapy when glycated hemoglobin (HbA1c) levels are ≥ 9%, as this may provide more rapid attainment of HbA1c targets than sequential therapy [4]. Similarly, the consensus statement by the American Association of Clinical Endocrinologists and American College of Endocrinology (AACE/ACE) states that dual combination therapy is usually required in patients with T2DM and should be initiated when HbA1c is ≥ 7.5% [3]. However, despite these recommendations, glycemic control remains suboptimal in a significant proportion of patients [5,6,7], and early predictors of glycemic response are lacking.

The Role of Machine Learning Techniques in Healthcare

Machine learning has recently been described as an important technology that can meaningfully process data that are beyond the capacity of the human brain to comprehend, in particular in relation to the huge clinical databases that are now available [8]. Although this remains a developing area in clinical medicine, a range of machine learning techniques are being increasingly used in healthcare, in particular to analyze the large and rapidly growing body of research and clinical data, and extract information that can lead to new hypotheses aimed at improved understanding and further investigation of medical conditions, including T2DM [9]. This approach can also be used to identify drug–target interactions in the search for potential candidates as a first step in the process of drug discovery [10]. To date, the main disease areas that have used machine learning techniques include oncology (e.g. for the prediction of breast cancer risk), cardiology (e.g. for predicting the occurrence of myocardial infarction), and neurology (e.g. in the evaluation of diagnostic imaging to predict outcomes after stroke) [11, 12]. In the field of diabetes, machine learning techniques have a range of applications, such as the use of computational algorithms in the evaluation of genomic data for the selection of biomarkers for T2DM [13], the identification of risk factors for predicting T2DM [14] or detecting individuals with impaired glucose tolerance or T2DM [15], for the prediction of T2DM following gestational diabetes [16], and in the classification of diabetic retinopathy [17]. An important limitation of machine learning models is that a model can only predict patient outcomes that are included in the dataset on which the model is based, and hence is dependent on the quality of data used to create the model [8]. 
Furthermore, the ease of interpretation of a model depends on the number of features evaluated: if the number is small, the resulting prediction tasks are simple and easy to understand. In contrast, complex tasks are inherently more difficult to interpret because the model has been developed to identify complex statistical patterns, and it may be difficult to explain which subtle patterns led to a particular prediction [8]. It is important that clinicians who use machine learning systems understand how to interpret them so that they can identify clinical situations in which a model might be helpful. This underlines the need for real-world clinical evaluation of analytical models.

Application of Machine Learning to Clinical Datasets

The use of analytical methods to determine how treatments can benefit certain patients, and which patients will benefit from specific treatments, may help to improve treatment success in T2DM. Machine learning algorithms can be used as clinical prediction models to extract new information from the ever-increasing amounts of data generated by clinical trials. One type of machine learning involves the construction of computer systems that learn from experience to identify patterns in data and predict outcomes [19]. Machine learning provides an hypothesis-free, unbiased methodology that can facilitate the search for predictors of therapeutic success in T2DM. This approach can be used to find patterns in clinical datasets and offers the potential to define predictive factors that help identify which patients could benefit most from a given treatment [18]. This is particularly important in the field of diabetes research, where predictors of response to antihyperglycemic therapies, both in terms of HbA1c reduction and maintenance of glycemic control, remain elusive.

Description and Use of Random Forests

A commonly used machine learning methodology is the random forest [20]. In this approach, decision trees derived from clinical data are combined into a group of tree-structured classifiers (a random forest). The random forest approach incorporates two effective machine learning techniques: bagging and random feature selection. Bagging involves training each tree on a bootstrap sample of the training data, with predictions based on a majority vote across trees. Random feature selection involves randomly selecting a subset of features as split candidates at each node as a tree is grown. During training, each tree is grown using a particular bootstrap sample, with some of the data (approximately one-third) left out of the sample. These omitted data form the out-of-bag (OOB) sample. Since the OOB data have not been used in tree construction, they can be used to estimate prediction performance [20]. Although each tree is unlikely to produce accurate predictions on its own, basing results on the final vote across hundreds of trees can optimize the accuracy of predictions; the larger the number of trees, the greater the accuracy of the predictions [11, 20]. In the field of diabetes research, the random forest approach has the potential to explore relationships between possible disease predictors, and has been shown to help screen potential biomarkers for T2DM [21]. This approach also has the potential to predict treatment success in T2DM by analyzing patient characteristics and treatment response.
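The three ideas described above (bagging, random feature selection, and OOB error estimation) can be illustrated with a minimal, self-contained sketch. Note that the study itself used the randomForest package in R; the Python code, toy dataset, and all names below are hypothetical stand-ins, not the study implementation:

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Return (score, feature, threshold) minimizing weighted Gini impurity."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(rows, labels, n_sub, depth=0, max_depth=4):
    # Leaf: node is pure or maximum depth reached -> majority class.
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    # Random feature selection: only n_sub randomly chosen features
    # are considered as split candidates at this node.
    feats = random.sample(range(len(rows[0])), n_sub)
    split = best_split(rows, labels, feats)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in li], [labels[i] for i in li],
                      n_sub, depth + 1, max_depth),
            grow_tree([rows[i] for i in ri], [labels[i] for i in ri],
                      n_sub, depth + 1, max_depth))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_forest(rows, labels, n_trees=50, n_sub=2):
    forest, oob_votes = [], [Counter() for _ in rows]
    for _ in range(n_trees):
        # Bagging: each tree is trained on a bootstrap sample; rows left
        # out of the sample form that tree's out-of-bag (OOB) set.
        idx = [random.randrange(len(rows)) for _ in rows]
        tree = grow_tree([rows[i] for i in idx], [labels[i] for i in idx], n_sub)
        forest.append(tree)
        for i in set(range(len(rows))) - set(idx):
            oob_votes[i][predict_tree(tree, rows[i])] += 1
    # OOB error: each row is scored only by trees that never saw it.
    voted = [(i, v.most_common(1)[0][0]) for i, v in enumerate(oob_votes) if v]
    oob_err = sum(pred != labels[i] for i, pred in voted) / len(voted)
    return forest, oob_err

def predict_forest(forest, row):
    # Final prediction is a majority vote across all trees.
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Toy dataset: 60 "patients" with 3 numeric features; the true class
# depends only on feature 0 (value > 5), the other two are noise.
random.seed(1)
X = [[random.uniform(0, 10) for _ in range(3)] for _ in range(60)]
y = [int(r[0] > 5) for r in X]
forest, oob_err = fit_forest(X, y)
print(predict_forest(forest, [9.0, 1.0, 1.0]),
      predict_forest(forest, [1.0, 9.0, 9.0]))
print("OOB error:", round(oob_err, 3))
```

On this separable toy problem the forest recovers the rule on feature 0 and the OOB error is low, even though individual trees may waste splits on the noise features.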

Aims and Objectives

In the present study, clinical trial data comparing the single-pill combination of the sodium-glucose co-transporter-2/dipeptidyl peptidase-4 (SGLT2/DPP-4) inhibitor empagliflozin/linagliptin with empagliflozin or linagliptin monotherapies [22, 23] were used to determine if random forest or classification tree models could identify new predictors of treatment success, defined as HbA1c reduction. Specifically, the aims of this analysis were to identify patient characteristics associated with achieving an HbA1c target of ≤ 7% at week 12 and maintaining the target through week 52. By applying a new machine learning methodology to an existing clinical dataset, the practical application of this approach will be evaluated, and the potential utility of this new approach to clinical decision making can be assessed.

Methods

Design and Patients

Data were pooled from two phase III studies of empagliflozin/linagliptin single-pill combination therapy versus empagliflozin or linagliptin monotherapy in T2DM. These studies were chosen as a convenient sample on which to test the proposed analytical methods. The two studies had a similar design but enrolled patients who were either treatment-naïve (study 1, n = 677) [22] or receiving background metformin (study 2, n = 686) [23]. Both trials were registered under NCT01422876 and have since been published, including a detailed description of the trial methods. In brief, both trials were randomized, double-blind, parallel-group studies that compared once-daily administration of a single-pill combination of empagliflozin plus linagliptin (empagliflozin 25 mg/linagliptin 5 mg, or empagliflozin 10 mg/linagliptin 5 mg) with empagliflozin monotherapy (25 mg or 10 mg daily) or linagliptin monotherapy (5 mg daily) for 52 weeks. Inclusion criteria were age ≥ 18 years, HbA1c > 7% and ≤ 10.5% at screening, and fasting plasma glucose (FPG) ≤ 240 mg/dL. Patients were randomized to one of five groups that were used for the descriptive analyses; however, for the machine learning analysis, the two empagliflozin/linagliptin single-pill combination groups were pooled, as were the two empagliflozin monotherapy groups. In both studies, the primary endpoint was the change from baseline in HbA1c at week 24.

Descriptive Analysis

Descriptive analysis was used to assess univariate associations between target categories and each baseline variable (e.g. HbA1c). Target attainment was defined by three groups: patients who achieved an HbA1c target of ≤ 7% at weeks 12 and 52; patients who reached the HbA1c target at week 12 but were above the target at week 52; and patients with HbA1c above the target at week 12 (irrespective of the week 52 value). Patients who discontinued before a specific time point were considered not at target for that time point. The differences in the distribution of at-target categories between groups were tested using a Chi-square test.
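The three-way categorization described above can be sketched as follows. The representation is an assumption for illustration (one HbA1c value per time point, with None marking a patient who discontinued before that time point); it is not the study's actual data layout:

```python
TARGET = 7.0  # HbA1c target (%)

def at_target(hba1c):
    # Patients who discontinued before a time point have no value (None)
    # and are counted as not at target for that time point.
    return hba1c is not None and hba1c <= TARGET

def categorize(hba1c_w12, hba1c_w52):
    """Assign one patient to one of the three at-target categories."""
    if not at_target(hba1c_w12):
        return "above target at week 12"       # irrespective of week 52
    if at_target(hba1c_w52):
        return "at target at weeks 12 and 52"
    return "at target at week 12 only"

print(categorize(6.8, 6.9))   # at target at weeks 12 and 52
print(categorize(6.8, None))  # at target at week 12 only (discontinued)
print(categorize(7.4, 6.5))   # above target at week 12
```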

Machine Learning Analysis

Within the category of machine learning, the random forest algorithm is a well-established and now commonly used method. It requires a dependent or outcome variable of interest and a list of independent variables as potential predictors of that outcome. The current study has a binary outcome variable (whether or not a patient had a sustained response) and a relatively large set of patient characteristics as potential predictors; therefore, a random forest approach was considered appropriate. The random forest algorithm was implemented using the randomForest R package (The R Foundation for Statistical Computing, Vienna, Austria).

Following the descriptive analysis, a machine learning analysis (classification tree and random forest methods) was conducted to estimate and predict target categories based on patient characteristics at baseline, without a priori selection of variables. This analysis was based on target status at 12, 24, and 52 weeks. For stronger contrast, the analysis was limited to patients with sustained control (at target at all time points) or not in control (not at target at any time point). We excluded from the analysis patients with delayed control (not at target at week 12 but at target at either week 24 or 52) or non-sustained control (at target at week 12 but not at target at week 24 or 52).
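The patient-selection rule above can be sketched as a small function, assuming each patient's record reduces to three at-target flags (weeks 12, 24, and 52). The function name and data layout are illustrative, not taken from the study code:

```python
def analysis_group(at_target_w12, at_target_w24, at_target_w52):
    """Return the analysis group for the machine learning contrast, or
    None for excluded patients (delayed or non-sustained control)."""
    flags = (at_target_w12, at_target_w24, at_target_w52)
    if all(flags):
        return "sustained control"   # at target at all time points
    if not any(flags):
        return "not in control"      # not at target at any time point
    return None                      # mixed pattern: excluded

print(analysis_group(True, True, True))     # sustained control
print(analysis_group(False, False, False))  # not in control
print(analysis_group(False, True, True))    # None (delayed control)
print(analysis_group(True, False, False))   # None (non-sustained control)
```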

Baseline variables included in the model were age, sex, race, ethnicity, geographic region, background therapy (treatment-naïve or receiving background metformin), height, weight, body mass index (BMI), waist circumference, smoking status, alcohol consumption, time since diagnosis, estimated creatinine clearance rate (CrCl), estimated glomerular filtration rate (eGFR), hypertension diagnosis, systolic blood pressure (SBP), diastolic blood pressure (DBP), HbA1c, and FPG.

Incorporation of the Random Forest Model

The random forest model is a well-established method in the statistical literature and has been increasingly applied in biomedical research. In random forests, in general, the more trees, the better the prediction results. However, this improvement declines as the number of trees continues to grow and, beyond a certain point, becomes negligible. The random forest methodological literature indicates that 500 trees is a prudent choice, beyond which little improvement in prediction results is expected. Thus, in the present study, a conventional approach was used [24], with 500 individual trees constructed in each analysis; at each tree split, a random subset of 4 of the 20 baseline variables was selected and considered as split candidates. The importance of the baseline variables was based on two parameters: (1) the mean decrease in prediction accuracy without the variable in the model; and (2) the mean decrease in the Gini index [25], a measure of the impurity of the dataset (i.e. the risk of misclassification of data), achieved by including the variable. For both parameters, the greater the score, the greater the importance of the variable. An advantage of this approach is that, unlike regression analysis, the random forest method is robust in the presence of collinearity among potential predictors.
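The Gini index mentioned above measures how mixed the class labels in a tree node are, and a variable's contribution to "mean decrease in Gini" accumulates the impurity reductions achieved at the splits that use it. A minimal sketch with toy labels (not trial data):

```python
from collections import Counter

def gini(labels):
    """Gini index (impurity) of the class labels in one node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into two child
    nodes, each weighted by its share of the parent's observations."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

pure = ["sustained control"] * 8
mixed = ["sustained control"] * 4 + ["not in control"] * 4
print(gini(pure))   # 0.0: a completely pure node
print(gini(mixed))  # 0.5: maximally mixed for two classes
# A split that separates the two classes perfectly removes all impurity:
print(gini_decrease(mixed, mixed[:4], mixed[4:]))  # 0.5
```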

For the present study, the descriptive analysis was performed using SAS 9.4 software (SAS Institute Inc., Cary, NC, USA); the machine learning analysis was performed in R version 3.3.2 (The R Project for Statistical Computing).

Use of Classification Tree Analysis as Comparator with Random Forest

Compared with the random forest method, classification tree analysis is a simpler tree-based method that involves the construction of a single tree. In this study, it was used as a reference against which to compare the random forest analysis. The two methods were compared using the training and validation set approach, in which the full-analysis population was randomly divided into two subsets in a 60:40 ratio (a standard choice). The first subset (training set) was used to build the models, and the second subset (validation set) was used to test the performance of the models.
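The split-and-score procedure above can be sketched as follows. The dataset and "model" are toy stand-ins (one feature standing in for baseline HbA1c), not the trial data or the actual tree models:

```python
import random

def train_validation_split(records, train_frac=0.6, seed=42):
    """Shuffle and divide records into training and validation subsets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, records):
    """Fraction of records whose outcome the model predicts correctly."""
    return sum(model(x) == y for x, y in records) / len(records)

# Toy stand-in: one feature (baseline HbA1c, %); outcome 1 if below 8.0.
data = [(x / 10, int(x / 10 < 8.0)) for x in range(60, 100)]
train, valid = train_validation_split(data)
print(len(train), len(valid))  # 24 16 (60% vs 40% of 40 records)

def threshold_model(hba1c):
    # Stand-in for a model fitted on the training set (here, the true rule).
    return int(hba1c < 8.0)

print(accuracy(threshold_model, valid))  # 1.0 on this perfectly separable toy
```

In the study, both the classification tree and the random forest would be fitted on the training subset and their validation-set accuracies compared side by side.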

Results

Overall, baseline patient characteristics were balanced between treatment groups, and the details have been published elsewhere for the individual studies [22, 23]. In summary, for the two studies overall (data given as mean ranges across study arms), the majority of participants were male (study 1, 48–58%; study 2, 46–61%), approximately 55 years of age (study 1, 53–56 years; study 2, 55–57 years), White (study 1, 70–78%; study 2, 71–76%), and diagnosed with T2DM at least 1–5 years previously (study 1, 36–45%; study 2, 34–37%). At baseline, mean HbA1c was approximately 8.0% in both studies (study 1, 7.99–8.05%; study 2, 7.90–8.02%) and mean FPG was approximately 156 mg/dL (study 1, 152.8–160.3 mg/dL; study 2, 154.6–161.6 mg/dL).

All treatment groups showed significant reductions from baseline in HbA1c over 24 weeks [22, 23]. Among patients who were treatment-naïve or receiving background metformin therapy, more of those treated with the empagliflozin/linagliptin single-pill combination achieved and maintained HbA1c targets compared with either agent alone (Table 1). The distribution of at-target categories differed significantly between the single-pill combination groups and the monotherapy groups, in favor of the combination (Chi-square test, p < 0.0001). In the descriptive analysis, lower mean baseline HbA1c and FPG were both associated with achieving and maintaining the HbA1c target (Table 1). Table 2 shows the number of patients who achieved sustained control in the trial, and the number predicted by the random forest model to achieve sustained glycemic control.

Table 1 Proportion in each at-target category for each treatment group, showing mean HbA1c and FPG at baseline
Table 2 Number of patients with response predicted by the random forest analysis versus observed response

Figure 1 shows the performance of the random forest model across the three treatment groups. Overall, the graphs show the likelihood of error (y-axis) against the number of trees (x-axis). The rate of error decreases as the number of trees increases. The OOB estimates of the prediction error rate were 22.0%, 18.4%, and 22.5% for the empagliflozin/linagliptin, empagliflozin, and linagliptin analyses, respectively. Of the variables included in the model, baseline HbA1c and FPG were the two most important predictors (Fig. 2).

Fig. 1
figure 1

Prediction error rates for a empagliflozin/linagliptin single-pill combination, b empagliflozin, and c linagliptin. The graphs show the out-of-bag estimate of the prediction error rate (a measure of incorrect predictions); the prediction error rate among patients who achieved sustained glycemic control (analogous to the false-negative rate); and the prediction error rate among patients who did not achieve glycemic control (analogous to the false-positive rate)

Fig. 2
figure 2

Importance of baseline variables in the random forest analysis for a empagliflozin/linagliptin single-pill combination, b empagliflozin, and c linagliptin. BMI body mass index, CrCl estimated creatinine clearance rate, DBP diastolic blood pressure, eGFR estimated glomerular filtration rate, FPG fasting plasma glucose, HbA1c glycated hemoglobin, HTN hypertension, SBP systolic blood pressure, waist waist circumference

Machine Learning Analysis

The machine learning analysis also identified HbA1c and FPG as the strongest predictors of attaining glycemic control. As can be seen in Table 2, the error rate was related to the number of patients in each group (or pooled group). For example, in the empagliflozin/linagliptin group, 225 patients achieved sustained control; the model correctly predicted 194/225 of these patients (86.2%) and misclassified 31/225 (13.8%). Fewer patients were not in control throughout the study (n = 144), and the model incorrectly predicted that 50/144 of these patients (34.7%) would achieve sustained control.

Comparison with Classification Tree Analysis

In the validation set, small improvements in prediction accuracy for the random forest model versus the classification tree model were observed: 81% versus 79% for the empagliflozin/linagliptin single-pill combination, 82% versus 80% for empagliflozin, and 78% versus 77% for linagliptin.

Discussion

Using both traditional and novel data analysis methodologies, this study has identified baseline glycemic status as the strongest predictor of target glycemic control attainment. In the group of patients who received empagliflozin or linagliptin monotherapies or the empagliflozin/linagliptin single-pill combination, low baseline HbA1c and FPG (within the ranges evaluated in the two studies) predicted attainment of glycemic control (i.e. target HbA1c) during the treatment period. In contrast, covariates including body weight, waist circumference, SBP, DBP, or the other variables studied did not contribute to the outcome in the final model for any of the three therapies studied. These findings are consistent with current experience of glucose-lowering therapies [26,27,28], where high baseline HbA1c is associated with an increased HbA1c response to therapy but lower baseline HbA1c is associated with better achievement of target HbA1c since the initial HbA1c level is already close to the targeted value. HbA1c at the start of therapy has been shown to be predictive of HbA1c reductions achieved with insulin therapies [26], glucagon-like peptide-1 (GLP-1) agonists, DPP-4 inhibitors [27], metformin, and sulfonylureas [28].

While acknowledging several limitations of this study (as noted below) that might have compromised our ability to discover novel predictors of glycemic control attainment in this population, we report here similar results obtained with either an hypothesis-based or hypothesis-free data analysis methodology. This is consistent with the notion that the smaller the difference from target glycemic control, the more likely the success in reaching and maintaining glycemic control in response to a treatment. However, while this outcome would typically be expected on the basis of conventional hypothesis-driven analyses, the present study provides new insights into the use of a new machine learning algorithm to predict treatment responses. Furthermore, the magnitude of the influence of HbA1c and FPG, compared with the other evaluated variables, could provide important insights into the relative importance of known predictors of treatment response. A previous study also used a machine learning approach to identify predictors of treatment response to another SGLT2 inhibitor, dapagliflozin, and while expected predictors of treatment response were identified, the study demonstrated the potential utility of an hypothesis-independent approach in the evaluation of clinical data [18]. Since data from clinical trials are usually limited in the range of variables that can be evaluated due to the study design, future research could benefit from the use of real-world data sources, such as evaluation of datasets obtained from electronic health records. It is therefore possible that the use of such an hypothesis-free, unbiased methodology could be a useful way to identify baseline predictors of glycemic control and, in turn, to inform the choice of individualized, effective therapies for patients with diabetes. The machine learning algorithm used in our study is an example of how to approach this task.
In particular, the random forest approach, as a machine learning method, offers the advantages of mimicking the human decision-making process and of providing personalized predictions for each assessed patient, for example, with respect to diagnosis, prognosis and treatment responses. These features make it an attractive tool to support clinicians in their practice and decision making [29].

The limitations of this study include the relatively small size of the population sample, which could restrict the generalizability of the findings. Furthermore, patients with delayed or non-sustained control were excluded from the analysis as the study was focused on the two extremes of response patterns (sustained control vs. not in control) with the aim of obtaining the greatest phenotypic contrast between these two patient groups to provide the best chance of detecting meaningful predictors of treatment response. The evaluation of patients with delayed or non-sustained control merits attention in future research. Another limitation of the study is that the set of variables measured at baseline in our study is limited and influenced by the design of clinical trials in T2DM. It is also possible that some predictive factors were not measured and were consequently not included in our analysis. Therefore, the prediction accuracy of the model could be improved by the inclusion of additional variables, such as T2DM biomarkers. The evaluation of larger and more comprehensive datasets, both in terms of the number of subjects and types of variables studied, is warranted. Real-world data sources, such as electronic health record data, are a promising alternative. The findings of this study suggest, however, that the use of hypothesis-free data analysis approaches is very promising and could have an important role in the search for predictors of therapeutic success in T2DM as this remains one of the most relevant criteria to guide therapeutic choices for patients with T2DM.

Conclusions

Identifying predictors of target glycemic control attainment can inform treatment choices and enhance success in treating diabetes. In this study, using both traditional and novel data analysis methodologies, we have identified baseline glycemic status as the strongest predictor of target glycemic control attainment. Machine learning algorithms provide an hypothesis-free, unbiased methodology, which can greatly enhance the search for predictors of therapeutic success in T2DM. We suggest that this approach may be applied to other drugs for which clinical datasets or real-world data are available. The approach used in the present analysis provides an example of how a machine learning algorithm can be applied to a clinical dataset and be used to develop predictions that can facilitate clinical decision making. The more widespread use of machine learning in healthcare has the potential to allow clinicians to take advantage of medically relevant data and assist them in the provision of optimal and individualized patient care.