INTRODUCTION

Emergency departments (ED) are becoming increasingly overwhelmed, compounding the poor outcomes associated with ED overcrowding.1,2,3,4,5 Triage scores6,7 aim to optimize waiting times and prioritize resource use according to the severity of the medical condition. The Emergency Severity Index (ESI) is the most widely used triage score.8 It is a subjective risk classification of patients, from 1 (most urgent) to 5 (least urgent), based on patient acuity and the resources needed. The ESI relies heavily on provider judgment, which can lead to inaccuracy and misclassification.9,10 Differentiating between levels 2 and 3 is a challenging task,11 and ESI level 3 is assigned to a large and clinically diverse group of ill patients.12

Artificial intelligence (AI) algorithms offer advantages for building predictive clinical applications because of their flexibility in handling large datasets from electronic medical records (EMR).13 AI algorithms are becoming better at prediction tasks, often outperforming current clinical scoring systems.14 In recent years, several prediction models have been developed using these techniques in an effort to improve the triage process.10,15,16,17,18,19,20,21,22 A machine learning prediction model may improve the identification of patients at greater risk of mortality and could outperform a nonsystematic, experience-based assessment.10,13

In this study, we aimed to evaluate a state-of-the-art machine learning model for predicting mortality at the triage level. By validating this automated tool, we aim to improve the categorization of patients in the ED using data readily available from the EMR at the time of arrival.

DESIGN: MATERIALS AND METHODS

An institutional review board (IRB) approval was granted for this retrospective study.

Study Cohort

Information on consecutive adult patients (ages 18–100) admitted to the academic ED of a single tertiary center was retrieved from the hospital's EMR. This acute care hospital has approximately 1700 beds and receives about 185,000 ED visits per year. Universal health coverage is provided to all residents under the public healthcare system.

The study time frame was from January 1, 2012, to December 31, 2018.

Data Acquisition

The data retrieved comprised information available at the triage level:

Demographics: age, sex

Admission date: retrieved as a timestamp variable

Arrival mode: either walk-in, by basic life support (BLS) ambulance, or by advanced life support (ALS) intensive care ambulance

Referral code: either self-referral (independent) or referred by a physician

Chief complaint: recorded in two ways: (1) selection from a structured list of 122 chief complaints based on Israeli Ministry of Health guidelines available at the ED, and (2) a two-word free-text chief complaint recorded by the triage nurse

Previous ED visits: dates of all previous visits to our ED during the study time frame

Previous hospitalizations: dates of all previous hospitalizations in our hospital during the study time frame

Comorbidities: coded as International Classification of Diseases, Ninth Revision (ICD-9) records

Home medications: coded using World Health Organization (WHO) Anatomical Therapeutic Chemical Classification System (ATC)

Vital signs: temperature (T°), heart rate (HR), systolic blood pressure (SBP), diastolic blood pressure (DBP), oxygen saturation (SO2)

ESI score

Mortality dates were obtained from the EMR and from Ministry of the Interior mortality records (when death occurred outside the hospital), and the duration in days from admission to death was computed. The following outcome endpoints were used: early mortality, defined as death within 2 days of registration to the ED, and short-term mortality, defined as death 2–30 days from registration to the ED.

Data Pre-processing

Data Cleaning

Patients aged 18–100 years were included. Records with erroneous or physiologically implausible values were removed; vital signs were limited to systolic blood pressure (SBP) < 300 mmHg, diastolic blood pressure (DBP) < 250 mmHg, pulse < 300 beats/min, temperature between 25 and 45 °C, and oxygen saturation ≤ 100%.
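For illustration only, the following is a minimal pandas sketch of these exclusion rules. The DataFrame df and its column names (age, sbp, dbp, hr, temp, spo2) are hypothetical stand-ins for the actual EMR fields; note that comparisons against missing values evaluate to False, so visits with missing vital signs are retained and only out-of-range values trigger exclusion.

```python
import pandas as pd

# df is a hypothetical DataFrame of ED visits, one row per visit.
# Flag physiologically implausible recorded values; NaN comparisons are False,
# so visits with missing vital signs are kept.
implausible = (
    (df["sbp"] >= 300)
    | (df["dbp"] >= 250)
    | (df["hr"] >= 300)
    | (df["temp"] < 25) | (df["temp"] > 45)
    | (df["spo2"] > 100)
)
df = df[df["age"].between(18, 100) & ~implausible]
```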

Data Encoding

Input features: all categorical factors were encoded numerically. For the high-cardinality variables (comorbidities, home medications, and the unstructured chief complaint), we used target encoding as an embedding. For home medications, we also used one-hot encoding of the ATC pharmacologic subgroups. From the lists of previous ED visits and previous hospitalizations at our hospital, we derived the following features: number of previous ED visits, number of previous hospitalizations, number of days since the most recent ED visit, and number of days since the most recent hospitalization. From the current ED visit timestamp we extracted the year, month, day, and hour.

The outcome was encoded as a binary variable (1 = the patient died within the selected time frame, 0 = the patient did not).
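As an illustration only, a minimal sketch of these encoding steps is shown below. The column names (admission_ts, free_text_complaint, atc_subgroups, days_to_death) are hypothetical, and the target-encoding statistics are fit on the training years only to avoid leakage.

```python
import pandas as pd

# Hypothetical flat table of ED visits, one row per visit.
df = pd.read_csv("ed_visits.csv", parse_dates=["admission_ts"])

# Binary outcome: 1 = death within 2 days of ED registration, 0 = otherwise.
df["early_mortality"] = df["days_to_death"].between(0, 2).astype(int)

# Target encoding of a high-cardinality variable: replace each category by the
# outcome rate observed for it in the training years (2012-2017).
train = df[df["admission_ts"].dt.year <= 2017]
rates = train.groupby("free_text_complaint")["early_mortality"].mean()
df["free_text_complaint_te"] = (
    df["free_text_complaint"].map(rates).fillna(train["early_mortality"].mean())
)

# One-hot encoding of ATC pharmacologic subgroups for home medications
# (subgroups stored here as a ';'-separated string).
atc = df["atc_subgroups"].str.get_dummies(sep=";").add_prefix("atc_")
df = pd.concat([df, atc], axis=1)

# Calendar features from the admission timestamp.
df["year"] = df["admission_ts"].dt.year
df["month"] = df["admission_ts"].dt.month
df["day"] = df["admission_ts"].dt.day
df["hour"] = df["admission_ts"].dt.hour
```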

Machine Learning Model

The algorithms were programmed using Python (version 3.6.5, 64-bit) and the open-source XGBoost library (version 0.8) with the scikit-learn wrapper (version 0.19.2). We addressed the class imbalance between mortality and non-mortality cases by using XGBoost's class weight scaling feature. The training data (X, a matrix of features) served as input to predict the target variable (Y), which was treated as the label. Statistical calculations were performed using Python. Computations were run on a machine with an Intel i7 CPU and an NVIDIA GeForce GTX 1080 Ti GPU.
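A minimal sketch of this setup is shown below, assuming a feature matrix X, a binary label vector y, and an array year aligned with the rows of X. The scale_pos_weight argument corresponds to the class weight scaling mentioned above; all other hyperparameters shown are illustrative, not the study's actual settings.

```python
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Temporal split: train on 2012-2017, validate on 2018 (no cross-validation).
train_mask = year <= 2017
X_train, y_train = X[train_mask], y[train_mask]
X_val, y_val = X[~train_mask], y[~train_mask]

# Handle class imbalance by up-weighting the rare positive (mortality) class.
pos_weight = (y_train == 0).sum() / max((y_train == 1).sum(), 1)

model = XGBClassifier(n_estimators=1000, scale_pos_weight=pos_weight)
model.fit(X_train, y_train)

# Predicted mortality probabilities and validation AUC.
y_score = model.predict_proba(X_val)[:, 1]
print("Validation AUC: %.3f" % roc_auc_score(y_val, y_score))
```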

Gradient boosting23 is a machine learning algorithm in which multiple weak learners (tree-based classifiers) are trained to complement each other and produce superior results. It differs from random forests (RF)24,25 in that the trees are learned sequentially, each conditioned on the performance of all previous trees. In gradient boosting, at each stage a new decision tree is learned with the aim of correcting the errors made by the existing trees. As a non-linear method, it can outperform linear models26 when higher-order relationships exist in the data. Gradient boosting has also surpassed other machine learning algorithms in a number of data challenges.23,27 Recent works have described its potential in the medical field.26,27,28,29,30,31,32,33

Analysis

Continuous features are reported as medians with interquartile ranges (IQR). Categorical features are reported as percentages. Model performance was assessed using the area under the receiver operating characteristic curve (AUC). We calculated the AUC both for the training set (years 2012–2017) and for the validation set (year 2018). An AUC of 1 indicates perfect outcome prediction in all patients, whereas a value of 0.5 means that the model performs no better than chance. An AUC greater than 0.80 is considered desirable.18 Models excluding the ESI were also calculated.

Bootstrap validation (1000 bootstrap resamples) was used to calculate 95% confidence intervals (CI) for the AUCs. Youden's index was used to identify the optimal cutoff point on the receiver operating characteristic (ROC) curve and to calculate the sensitivity, specificity, false-positive rate (FPR), negative predictive value (NPV), and positive predictive value (PPV) of the final models. We also evaluated the sensitivity, FPR, NPV, and PPV at fixed specificities of 95% and 97.5%.
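The sketch below illustrates, under the same assumptions as the previous snippet (validation labels y_val and predicted probabilities y_score), one way to compute a bootstrap 95% CI for the AUC and the Youden's index cutoff; it is not the study's exact implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.RandomState(0)

# 95% CI for the AUC from 1000 bootstrap resamples of the validation set.
aucs = []
n = len(y_val)
for _ in range(1000):
    idx = rng.randint(0, n, n)
    if len(np.unique(y_val[idx])) < 2:  # a resample must contain both classes
        continue
    aucs.append(roc_auc_score(y_val[idx], y_score[idx]))
ci_low, ci_high = np.percentile(aucs, [2.5, 97.5])

# Youden's index: the cutoff maximizing sensitivity + specificity - 1.
fpr, tpr, thresholds = roc_curve(y_val, y_score)
best = np.argmax(tpr - fpr)
cutoff = thresholds[best]
sensitivity, specificity = tpr[best], 1 - fpr[best]
```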

Instead of using cross-validation, models were trained on data from the years 2012–2017 and tested on data from the year 2018, thereby preventing any leakage of information between the training and test sets and limiting the risk of overfitting.

Variable Analysis

We evaluated single-variable predictions by using each variable as the sole input, with early mortality and short-term mortality as targets. These experiments were conducted using the same data split described above.

We also evaluated variable importance for both mortality time frames using the XGBoost feature importance property, which indicates, for each variable, how much on average the prediction changes when the value of that variable changes.
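The following sketch shows, under the same assumptions as the earlier snippets (arrays X_train, y_train, X_val, y_val, a fitted model, and a list feature_names), one plausible way to carry out these two analyses; it is illustrative rather than the study's exact code.

```python
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Single-variable AUCs: train a small model on each feature alone.
single_auc = {}
for j, name in enumerate(feature_names):
    m = XGBClassifier(n_estimators=100)
    m.fit(X_train[:, [j]], y_train)
    single_auc[name] = roc_auc_score(y_val, m.predict_proba(X_val[:, [j]])[:, 1])

# Feature importance of the full model via the scikit-learn wrapper property.
top10 = sorted(zip(feature_names, model.feature_importances_),
               key=lambda t: t[1], reverse=True)[:10]
```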

Model Assessment

We used the Brier score and calibration plots to evaluate the XGBoost model.
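A minimal sketch of this assessment, reusing the hypothetical y_val and y_score arrays from above, is shown below; the ten-bin reliability diagram is one common convention and not necessarily the binning used in the study.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

brier = brier_score_loss(y_val, y_score)

# Reliability diagram: observed mortality fraction vs. predicted probability.
frac_pos, mean_pred = calibration_curve(y_val, y_score, n_bins=10)
plt.plot(mean_pred, frac_pos, marker="o", label="XGBoost (Brier = %.3f)" % brier)
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of deaths")
plt.legend()
plt.show()
```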

Comparison with Previous Models

We compared the XGBoost model with the severity scores Shock Index (SI), Modified Shock Index (MSI), and Age Shock Index (ASI),9,11 where SI = HR/SBP; MSI = HR/(2/3 × DBP + 1/3 × SBP); ASI = age × SI.
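These scores are computed directly from the triage vital signs and age, as in the brief sketch below (column names are again hypothetical); each score can then be used as-is to compute an AUC against the mortality outcome.

```python
# Shock indices computed from triage vital signs and age.
df["si"] = df["hr"] / df["sbp"]
df["msi"] = df["hr"] / (2 / 3 * df["dbp"] + 1 / 3 * df["sbp"])
df["asi"] = df["age"] * df["si"]

# Each index can be scored directly against the outcome, e.g.:
# roc_auc_score(y_val, df.loc[val_mask, "asi"])
```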

Nine-Point Triage Score

The most predictive single variables identified above were tested as input features in an early mortality prediction model. The final model was evaluated on the entire cohort, on the cohort after dropping all visits with missing values, and on the cohort after dropping only visits with a missing structured chief complaint. We compared these models with a logistic regression (LR) model that used the same top features as inputs (with one-hot encoding of the structured chief complaint), evaluated on the data after dropping all cases with missing values.
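A minimal sketch of the LR comparison on the complete-case cohort is given below; the nine column names are hypothetical stand-ins for the actual variables, and hyperparameters such as class_weight="balanced" are illustrative choices rather than the study's settings.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Complete-case cohort: drop visits missing any of the nine triage variables.
nine = ["age", "arrival_mode", "structured_complaint", "temp", "spo2",
        "hr", "sbp", "dbp", "esi"]
cc = df.dropna(subset=nine).copy()

# One-hot encode the categorical inputs for the linear model.
X_lr = pd.get_dummies(cc[nine], columns=["structured_complaint", "arrival_mode"])
y_lr = cc["early_mortality"]

# Same temporal split as the gradient boosting models.
train = cc["admission_ts"].dt.year <= 2017
lr = LogisticRegression(class_weight="balanced", max_iter=1000)
lr.fit(X_lr[train], y_lr[train])
auc = roc_auc_score(y_lr[~train], lr.predict_proba(X_lr[~train])[:, 1])
```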

The most important features identified by the feature importance analysis were also tested on the entire cohort in an early mortality prediction model.

RESULTS

Study Cohort

A total of 990,864 ED visits were retrospectively retrieved during the 7-year study time frame. We excluded 190,609 records based on the age criterion and 733 records based on the vital sign criteria. Thus, 799,522 ED visits, representing 367,219 unique patients, were available for analysis. The overall early mortality rate was 4561/799,522 (0.6%) and the short-term mortality rate was 19,647/799,522 (2.5%). Of the 4561 patients with early mortality, 917 (20.1%) died in the ED, 3405 (74.6%) died during hospitalization, 33 (0.7%) died in the intensive care unit (ICU), and 206 (4.5%) died after being discharged from the ED.

Characteristics of the study cohort are presented in Table 1. Table A1 (online) presents the proportions of vital signs in the abnormal range. The ten most common chief complaints for the early mortality and non-early mortality groups are presented in Table A2 (online). The total cumulative mortality up to 30 days, overall and grouped according to ESI level, is presented in Figure A3 (online). The proportions of missing data per feature are presented in Table A4 (online).

Table 1 Study Cohort Characteristics (n = 799,522)

Model Performance

Models were trained on the triage variables using 1000 decision trees. The input vector included 308 features. The performances of the models on the validation set for both mortality time frames are reported in Table 2. The models yielded an AUC of 0.962 for early mortality and an AUC of 0.923 for short-term mortality. Mortality rates for the training and validation sets are presented separately in Figure 1. Models excluding the ESI are shown in Table A5 (online).

Table 2 Models’ Performances for Each Mortality Group in the 1-Year Validation Set. Sensitivity, Specificity, FPR, NPV, and PPV Are Calculated at the Youden’s Index Cutoff, and Sensitivity, NPV, and PPV Are Also Calculated at Fixed Specificities of 95% and 97.5%
Fig. 1

Receiver operating characteristic (ROC) curves comparing models’ performances for each mortality group.

Brier scores for the early mortality and short-term mortality models were 0.004 and 0.022, respectively. Figure A6 (online) presents the calibration curves.

Variable Analysis

Table 3 presents the ten highest single-variable AUCs for each mortality group. For early mortality, age, arrival mode, and structured chief complaint had the highest AUCs (0.810, 0.809, and 0.787, respectively). For short-term mortality, age, comorbidities (coded by ICD-9), and structured chief complaint had the highest AUCs (0.767, 0.754, and 0.749, respectively).

Table 3 Highest Single-Variable AUCs for Each Mortality Group for the Top Ten Clinical Features

Figures A7 and A8 (online) show the top ten features calculated by the XGBoost feature importance property for early mortality and short-term mortality, respectively.

Comparison Between Gradient Boosting Model and Previous Models

Table 4 presents comparisons between the AUCs of the gradient boosting model, SI, MSI, and ASI. Among the previously published scores evaluated, the ASI showed the highest predictive ability, with an AUC of 0.858 for early mortality and 0.834 for short-term mortality.

Table 4 Mortality Prediction Comparative Analysis (AUCs)

Figure 2 presents histograms of SBP, HR, SI, and ASI for patients who did and did not die within 2 days of ED admission. The ASI histogram shows the greatest separation between the two groups.

Fig. 2

Frequency histograms of the recorded SBP (a), HR (b), SI (c), and ASI (d) for subjects who did and did not die within 2 days of ED admission. SBP, systolic blood pressure; HR, heart rate; SI, Shock Index; ASI, Age Shock Index.

Nine-Point Triage Score

The ten most predictive single variables were tested as inputs in an early mortality prediction model. The number of comorbidities was dropped because it requires the patient's previous history, and we wanted to evaluate a model that does not depend on such data. Consequently, we assessed a gradient boosting model for early mortality prediction that included only nine elements: age, arrival mode, structured chief complaint, vital signs (T°, SO2, HR, SBP, and DBP), and ESI. Table 5 presents the AUCs, sensitivities, and specificities of the XGBoost model using the selected nine variables as inputs and early mortality as the output. These experiments were conducted on three cohorts: the entire dataset, the cohort after dropping all cases with missing data, and the cohort after dropping only cases with a missing structured chief complaint. An LR model trained after dropping all cases with missing data was added for comparison.

Table 5 Mortality Prediction and Statistical Analysis of the XGBoost Model Using Nine-Point Triage Score

An early mortality prediction model using the top ten features from XGBoost feature importance analysis was also calculated, with its properties presented in Table A9 (online).

DISCUSSION

The main aim of this project was to improve the categorization of patients in the ED by constructing a model that predicts mortality outcomes at the triage level. A prediction model can integrate all available information and facilitate the identification of patients at higher mortality risk who might otherwise be missed. As a tool for health care providers in the decision-making process, it may be used to ensure rapid treatment and to flag high-risk patients who were subjectively under-triaged.10,13

The non-linear gradient boosting model demonstrated a high predictive ability, with an AUC of 0.962 for early mortality using only features available at triage, and a lower predictive ability for short-term mortality. In our opinion, a possible explanation for this difference is that the severity of physiologic abnormalities at initial presentation is less strongly related to short-term mortality than to early mortality.

In the single-variable analysis, age and structured chief complaint were the strongest predictors of mortality across both time frames. Other factors predictive of early mortality are those reflecting the patient's acute condition, such as arrival mode, ESI, and vital signs. For short-term mortality, characteristics reflecting the patient's background, such as comorbidities and home medications, appear to be better predictors.

To date, the most widely used triage scoring tool is the ESI score.8 This score is a subjectively assessed five-level ED triage algorithm that provides risk stratification of patients, from 1 (most urgent) to 5 (least urgent), based on patients’ acuity and resources needed.

Several previous studies have developed scores for mortality prediction as a way to improve triage classification, and some of them used AI algorithms.9,10,11,15,16,17,18,19,20,21 Levin et al. utilized a random forest E-triage prediction model that had AUCs ranging from 0.90 to 0.92 for critical care outcome (in-hospital mortality or direct admission to an intensive care unit).10 A major strength of our study was the significantly larger number of ED visits in comparison with previous studies, which could have increased the model’s performance. When examining previous models, the ASI model9 showed impressive results.

Single-feature analysis helped us devise the nine-point triage score, which entered age, arrival mode, structured chief complaint, vital signs (T°, SO2, HR, SBP, and DBP), and ESI into the gradient boosting model. This model showed an AUC of 0.962 for early mortality, similar to that of the full-feature model. Using a simplified model makes it easier to understand which variables are truly driving the early mortality outcome, allowing further improvement of the model.34

Models that use hundreds of variables make manual entry impractical and are difficult to encode within EMR databases. A simplified model with fewer variables would make this task more feasible.31

Our study has several limitations. First, it was a retrospective single-institution study; the sample was homogeneous and may have been subject to local practices, which limits generalizability, and the high performance of the ASI raises the question of whether our model would still outperform it on external validation. Second, we lacked information from other institutions about previous ED visits and hospitalizations. Third, patients with do-not-resuscitate (DNR) orders, on hospice, who left before being seen, or who left against medical advice could not be excluded. Fourth, other outcomes, such as ICU admission or in-hospital mortality, were not evaluated.

We believe this study may serve as a proof of concept for the development of AI-based triage prediction models, to be replicated in multi-institutional projects by matching and mapping data across centers in order to ascertain the model's predictive accuracy.

The presented model is intended as a decision support tool and is not meant to replace clinical judgment. Its interaction with provider intuition can strengthen clinical decision-making, making it more consistent and reducing the risks of over- or under-triage. Strong collaboration between clinicians and machine learning experts is necessary to develop and validate a model that includes the best predictive variables and to avoid entering large amounts of data without clinical context.32

In conclusion, using data available at the time of triage in the ED, the gradient boosting model showed high predictive ability in screening patients at risk of early mortality.