Measurement and application of patient similarity in personalized predictive modeling based on electronic medical records
Conventional risk prediction techniques may not be the most suitable approach for personalized prediction for individual patients. Therefore, individualized predictive modeling based on similar patients has emerged. This study aimed to propose a comprehensive measurement of patient similarity using real-world electronic medical records data, and evaluate the effectiveness of the individualized prediction of a patient’s diabetes status based on the patient similarity.
When using no more than 30% of the whole training sample, the personalized predictive models outperformed corresponding traditional models built on randomly selected training samples of the same size as the personalized models (P < 0.001 for all). With only the top 1000 (10%), 700 (7%) and 1400 (14%) similar samples, personalized random forest, k-nearest neighbor and logistic regression models reached the globally optimal performance with the area under the receiver-operating characteristic (ROC) curve of 0.90, 0.82 and 0.89, respectively.
The proposed patient similarity measurement was effective when developing personalized predictive models. The successful application of patient similarity in predicting a patient’s diabetes status provided useful references for diagnostic decision-making support by investigating the evidence on similar patients.
Keywords: Patient similarity, Electronic medical records, Personalized prediction, Model performance, Diabetes mellitus
Abbreviations
AUC: area under the ROC curve
CCS: Clinical Classification Software
EMR: electronic medical records
ICD-10: International Classification of Diseases, tenth revision
NCA: nearest common ancestor
ROC: receiver operating characteristic
In personalized medicine, clinicians and health policy makers must choose the most appropriate clinical trial and make predictions for the right patient during decision-making [1, 2]. This approach is used to individualize medical practice.
At present, clinicians can predict diseases by many methods, such as diagnostic imaging techniques [3, 4, 5, 6, 7], but with fewer predictive models. In recent years, predictive modeling has been successfully applied in medical scenarios, including the identification of risk factors [8, 9] and early detection of disease onset [10, 11]. In addition, advances have been made in using predictive modeling to predict patient outcomes. The traditional predictive modeling approach involves building a global predictive model using all available training data. However, this may not be the most suitable approach for personalized prediction for individual patients. Furthermore, electronic medical records (EMR) data generally contain a variety of noisy data, because EMR systems were primarily designed for administration and for improving healthcare efficiency; nevertheless, many studies have found secondary uses for them, such as patient trajectory modeling, disease inference, and clinical decision support systems. It is recommended to de-noise the data before building a global predictive model, which is time-consuming, and the data are challenging to represent and model. In this context, individualized predictive modeling based on patient similarity emerged and was shown to be adjustable for individual patients. Employing patient similarity helps to identify a precision cohort for an index patient, which is then used to train a personalized model. Accordingly, when building a predictive model for an index patient, the training samples are determined as "patients like me," instead of using all available training samples in the conventional way. "Patients like me" are selected from the training sample set on the basis of the similarity between the index patient and each training sample. Of note, based on patient similarity, patients with noisy data are less likely to be selected as similar patients of an index patient because of the lower similarity between them.
Patient similarity is usually measured by considering information on demographics, disease history, comorbidities, laboratory tests, hospitalizations, treatment, and pharmacotherapy. Such data are easily extracted from the EMR for tens of millions of patients.
In this study, we defined a patient as a vector in a d-dimensional feature space. Then, a multi-dimensional approach to estimate patient similarity was proposed. To demonstrate the effectiveness of the proposed similarity measure, the most similar patients were retrieved to build personalized models to predict the diabetes status of a given patient.
To assist physicians with the selection of the most appropriate recommendations and the prediction for a given patient, several methodologies have been applied in personalized medicine, such as clustering, principal component analysis, and patient similarity computation.
Clustering is the most popular method used in personalized medicine. It aims to create groups of patients with similar disease evolution, with the prediction for a new patient identified with the label of the most similar cluster. To determine the subtype of a breast cancer patient and provide the most effective treatment, Wang et al. defined a novel consensus clustering method to automatically cluster numerical and categorical data using Euclidean distance and categorical distance, respectively. The proposed method demonstrated great superiority and robustness in clustering and differentiating patient outcomes. Li et al. presented an unsupervised clustering framework based on topological analysis to identify type 2 diabetes subgroups; the topology-based patient–patient network successfully identified three distinct subgroups of type 2 diabetes. Panahiazar et al. designed two different approaches for medication recommendation for heart-failure patients, using both unsupervised clustering (hierarchical clustering and K-means clustering) and supervised clustering (using the medication plan as the class variable). Their results showed that supervised clustering outperformed unsupervised clustering.
Another frequently used technique for predicting patient outcomes is based on patient similarity. Patient similarity evaluation has been investigated as a tool to enable precision medicine, and has been identified as a fundamental problem in many data mining algorithms and practical information processing systems. Most commonly, through exhaustive comparisons between a given patient and a cohort of existing patients, an assessment specific to the given patient can help in identifying his or her similar patients. Lee et al. used a cosine-based patient similarity metric to identify the patients most similar to each index patient. The results suggested that using fewer but more similar data could achieve higher predictive performance than using all available data. David et al. proposed an algorithm for anomaly detection and characterization on the basis of the Euclidean distance between medical laboratory data. Based on the selected neighbors, an index patient could be assigned to one of seven disease groups with higher accuracy. For early screening and assessment of suicidal risks, researchers used the sum of absolute distances for each predictor to retrieve a cohort of similar patients and determined the most likely risk level for a new patient. One of these studies compared the performance of patient similarity-based personalized predictive models with whole population-based global predictive models; the results demonstrated that the personalized predictive models achieved higher performance.
Most previous studies calculated patient similarity using a single similarity measure (e.g., Euclidean distance, cosine distance, or Mahalanobis distance), and most did not take the importance of patient features into consideration when calculating the similarity. In this study, we aimed to investigate patient similarity in depth in two respects: using different similarity metrics for different types of feature data, and assigning different weights (importance) to patient features when integrating feature similarities into a patient similarity.
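The two ideas above can be sketched together: per-feature similarity metrics chosen by data type (numeric, categorical, vector-valued), combined into one patient similarity through feature weights. The study's computations were done in R; this is an illustrative Python sketch, and the feature set, the linear age similarity, and the weight values are hypothetical choices, not the paper's exact formulas.

```python
import math

def age_similarity(a1, a2, age_range=100.0):
    """Numeric feature: similarity decreases linearly with absolute difference."""
    return 1.0 - abs(a1 - a2) / age_range

def sex_similarity(s1, s2):
    """Categorical feature: 1 on exact match, 0 otherwise."""
    return 1.0 if s1 == s2 else 0.0

def cosine_similarity(v1, v2):
    """Vector feature, e.g., a binary diagnosis-code indicator vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def patient_similarity(p1, p2, weights):
    """Weighted sum of per-feature similarities; weights should sum to 1."""
    sims = {
        "age": age_similarity(p1["age"], p2["age"]),
        "sex": sex_similarity(p1["sex"], p2["sex"]),
        "diagnoses": cosine_similarity(p1["diagnoses"], p2["diagnoses"]),
    }
    return sum(weights[k] * sims[k] for k in sims)
```

For example, two identical patients score 1.0, while patients differing in every feature score close to 0; the weights control how much each feature contributes.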
Overview of patient similarity
Evaluation of predictive performance
When the top 1000 (10%), 700 (7%), and 1400 (14%) similar samples selected according to the CCS-based similarity were used, the personalized RF, kNN, and LR models showed a clear increasing trend from the initial area under the receiver-operating characteristic (ROC) curve of 0.87, 0.79, and 0.70 to the saturated area under the ROC curve (AUC) of 0.90, 0.82, and 0.89, respectively. When the kNN model was built using up to the top 4% of similar samples, it outperformed the LR model. This suggested that more appropriate data were needed for the LR model parameters to be properly trained. Similar results were found when patient similarities were based on ICD-based similarity. When RF, kNN, and LR models were built on the top 12%, 7%, and 15% of similar samples, respectively, they showed the globally optimal performance. The RF model showed significantly higher performance than the LR and kNN models (Mann–Whitney U test adjusted by Bonferroni, P values < 0.001 for all), partially because of its built-in feature selection property.
Further comparisons of the predictive performance of the personalized models built on ICD-10- and CCS-based similar patients showed no significant differences for the RF, kNN, and LR models (Mann–Whitney U test adjusted by Bonferroni; P = 0.491, 0.988, and 0.635, respectively).
Interpretation of predictive models
Prediction of risk for specific diseases is important in a variety of applications, including health insurance, tailored health communication, and public health. In this paper, we proposed a method for predicting the risk of a potential disease using a large clinical dataset collected from an EMR system. In the proposed method, classification algorithms (kNN, LR, and RF) were built to predict a patient's diabetes status based on patient similarities assessed using a multi-dimensional approach covering demographics, disease diagnoses, and laboratory tests. The investigation pipeline can easily be extended to the study of other complex and multifactorial diseases.
Because patients’ disease diagnoses were an important part of EMR data and a key factor for disease prediction, we investigated two similarity measurements for disease diagnoses. One was calculated using a hierarchical similarity measure with ICD-10 disease codes, and the other using simple cosine similarity with CCS disease codes. Although the hierarchical similarity measure has been argued to be a more direct mapping of hierarchical information to distances, we found that predictive models built on the most similar samples selected according to patient similarity based on hierarchical similarity did not show higher performance than those based on cosine disease similarity. This suggests that narrowing ICD-10 diagnosis codes into CCS codes may be useful for presenting disease data at a descriptive statistical categorical level. Therefore, feature similarity for disease diagnoses based on CCS codes and cosine similarity was more effective and efficient than that based on ICD-10 codes and hierarchical similarity in this study.
A previous study suggested that in personalized medicine, using patient similarity in data-driven analysis of patient cohorts will significantly assist physicians in making informed decisions and choosing the most appropriate clinical trial. In this study, three different predictive models using similar cohorts showed consistently higher performance, even though they used fewer training samples than those built on randomly selected samples. This finding coincided with the conclusion that similarity-based selection is better than random selection. In particular, the personalized LR model showed the largest performance increase. This demonstrated that patient similarity has the potential to improve the predictive performance of machine learning models.
Furthermore, the predictive performance of both the personalized and traditional models reached a saturated level as increasing numbers of training samples were involved in the modeling, with the personalized models reaching it earlier. This finding was consistent with the conclusion of two previous studies that little was gained from using more dissimilar patients when building models [8, 25]. EMR data generally contain noisy data (errors); here, noisy data refers to data that are irrelevant and dissimilar for a patient with the specific disease. When building personalized models, the most similar samples measured by the proposed patient similarity were used as the training samples, which could be considered "the patients like me." In this situation, noisy data that might disturb the prediction were less likely to be selected as training samples because of their lower similarity; thus, the patient similarity measurement proposed herein could be harnessed as a de-noising method. This improved the predictive performance and the overall robustness of the aforementioned models to some degree. Using fewer but more similar samples, personalized predictive models may perform as well as traditional predictive models built on the entire training sample set. For the personalized models, as the training sample size increased, more and more samples with lower similarity were added to the training set, enlarging the overlap between the training sets of the personalized and traditional models. When the training sample size increased to include all available training samples, no difference remained between similarity-based and random selection of training samples; the personalized models thus degenerated into the traditional ones, and both showed the same global predictive performance.
Diabetes prediction is a challenging task because of its multifactorial characteristics and various manifestations. Park et al. applied their new knowledge discovery techniques to improve the performance of diabetes prediction, obtaining an average accuracy of 0.76. In another study of diabetes prediction, the best performance (AUC, 0.62) of the personalized models was obtained when the predictive model was built on 2000 similar patients. In our study, based on the proposed similarity measurement, the predictive performance for diabetes improved considerably, with the highest AUC reaching 0.90.
There are some limitations to our research. First, when constructing the study cohort, no exclusion criterion specific to the predictive task was employed. Second, the patient similarity was calculated directly, without making full use of the information provided by the large number of sample patients. Last, the performance of the proposed patient similarity measure was evaluated only for disease prediction. In future work, we will improve the algorithm for the similarity measurement, including learning the patient similarity automatically, and the patient similarity will be used in other application scenarios, such as patient stratification for disease sub-typing.
In this study, we proposed a comprehensive measurement of patient similarity using real-world EMR data, and evaluated the effectiveness of the individualized prediction of a patient’s diabetes status based on the patient similarity. The proposed similarity measure was designed to reflect the data type and clinical meaning of each patient feature. Moreover, predictive models built on similar cohorts had a consistently higher performance than those built on randomly selected samples. They also performed as well as models built on entire training samples. This makes it possible for further large-scale and high-dimensional predictive applications at relatively lower time and space costs and higher performance. The successful application of patient similarity in predicting a patient’s diabetes status provided useful references for diagnostic decision-making support by investigating the evidence on similar patients.
Feature similarity for age
Feature similarity for sex
Feature similarity for laboratory test
Feature similarity for disease diagnoses
Disease diagnoses were initially identified using ICD-10 codes. In the ICD-10 code scheme, each code begins with a letter (A–Z for 22 chapters) followed by five digits, arranged in a tree-like hierarchical manner (Additional file 1: Figure S1). The letter and first three digits are usually used for statistical purposes; they were, therefore, used to calculate feature similarity for disease diagnosis in this study. As an alternative to the ICD-10 code scheme, the CCS code scheme collapses ICD-10 codes into 259 diagnosis codes (numbered 1–259) with better generalization and clinical meaningfulness. For example, DM was designated as ICD-10 codes E10.x–E14.x; the corresponding CCS codes were 49 (DM without complications) and 50 (DM with complications).
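A hierarchical similarity over the ICD-10 tree can be sketched by treating each truncated code (e.g., "E11") as a path from the root and scoring the depth of the nearest common ancestor (NCA) of two codes. This Python illustration assumes a fixed depth-3 path (letter, letter + one digit, letter + two digits) and a simple depth normalization; it is a sketch of the idea, not the paper's exact formula.

```python
def icd10_prefixes(code):
    """Path from the ICD-10 tree root to a code: 'E11' -> ['E', 'E1', 'E11']."""
    return [code[:i] for i in range(1, len(code) + 1)]

def hierarchical_similarity(code_a, code_b, max_depth=3):
    """Depth of the nearest common ancestor, normalized by the tree depth."""
    shared = 0
    for pa, pb in zip(icd10_prefixes(code_a), icd10_prefixes(code_b)):
        if pa != pb:
            break
        shared += 1
    return shared / max_depth
```

Under this scheme, two DM codes such as "E11" and "E14" share the prefix "E1" and score 2/3, while codes from different chapters score 0.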
We proposed two methods of measuring disease diagnosis similarity based on the two code schemes with totally different structures.
Feature similarity for disease diagnoses based on the ICD-10 code scheme
Feature similarity for disease diagnoses based on the CCS code scheme
Application of patient similarity
EMR data used in this study were derived from all inpatients discharged from a tertiary hospital in Beijing, China between 2014 and 2016. Individual hospitalizations were de-identified and maintained as unique records, including age at admission, sex, disease diagnoses at discharge (up to 11), and laboratory tests during hospitalization. Disease diagnoses were identified using ICD-10 codes.
Records for patients who had disease diagnoses with ICD-10 codes starting with O (complications of pregnancy), P (certain conditions originating in the perinatal period), S and T (incidental conditions such as poisoning and injuries), and Y and V (supplementary classification codes) were excluded. In addition, for patients with more than one hospitalization (i.e., readmission), records for follow-up admissions were excluded to maintain a study dataset containing distinct patients.
In a single hospitalization episode, patients do not necessarily undergo all laboratory tests, leading to a large number of missing values in the laboratory test fields. This makes it difficult to compute feature similarity for laboratory tests. Therefore, records with missing laboratory tests were excluded from the current study. For the task of disease prediction, DM (ICD-10 codes E10–E14 [30, 31]) was chosen as the target disease. Thus, the 77 most regular laboratory test items related to DM, including blood tests, urine tests and electrolyte tests, were employed for the similarity computation. Records with missing values for any of these 77 laboratory test items were then excluded.
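The exclusion step above amounts to a completeness filter over the required lab items. A minimal Python sketch (the item names below are hypothetical placeholders; the study used 77 DM-related items):

```python
# Hypothetical lab item names for illustration only.
LAB_ITEMS = ["serum_glucose", "urine_glucose", "serum_potassium"]

def filter_complete_records(records, lab_items=LAB_ITEMS):
    """Keep only records with a non-missing value for every required lab item.

    `records` is a list of dicts mapping lab item name -> value (None = missing).
    """
    return [r for r in records
            if all(r.get(item) is not None for item in lab_items)]
```

A record missing even one of the required items is dropped, which trades sample size for a fully observed feature space.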
In total, 8245 patients with any diabetes diagnosis (positive samples) remained, and another 8245 patients without any diabetes diagnosis (negative samples) were randomly selected, giving a study dataset of 16,490 samples (Additional file 1: Figure S2). The mean ages of the patients with and without DM were 63.0 ± 11.6 years and 57.2 ± 17.1 years (t-test, P < 0.001), respectively. In the DM group, 5163 (62.6%) patients were male, compared with 6062 (73.5%) in the non-DM group (χ2 test, P < 0.001).
Machine learning models
For an index (test) patient with an unknown label, a personalized predictive model was built based on the most similar patients from the training samples. This model was then tested on the index patient. This study predicted the index patient as diabetic or not diabetic, which was a binary classification problem. To explore the impact of the model on the performance of the similarity-based predictive model, three machine learning-based classification models with disparate algorithms and structures were used: kNN, LR, and RF classifiers.
In our classification setting, the kNN classifier assigned each index patient the majority class of its k (k = 50 in this study) nearest labeled neighbors from the training set, based on Euclidean distance. The probability of that patient being predicted as diabetic was defined as the proportion of patients with diabetes among the k neighbors. LR is a discriminative model in machine learning, a kind of generalized linear model with a logit link function and binomial distribution. The predicted outcome of the LR classifier for the index patient was the probability of belonging to the positive class. RF is an ensemble classifier consisting of many decision trees (100 trees in this study) based on random feature selection [34, 35] and bootstrap aggregation. The final predicted probability of belonging to each class for the index patient was obtained by combining the predictions of individual trees.
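The kNN probability rule described above (k nearest neighbors by Euclidean distance; predicted probability = proportion of diabetic neighbors) can be sketched in a few lines. The paper's analyses were run in R; this pure-Python version is for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict_proba(index_patient, training_set, k=50):
    """Probability of diabetes for the index patient.

    `training_set` is a list of (feature_vector, label) pairs with label 1
    for diabetic and 0 for non-diabetic. The probability is the fraction of
    diabetic patients among the k nearest labeled neighbors.
    """
    neighbors = sorted(training_set,
                       key=lambda rec: euclidean(rec[0], index_patient))[:k]
    positives = sum(label for _, label in neighbors)
    return positives / len(neighbors)
```

The same interface (a predicted probability of the positive class) is what the LR and RF classifiers produce, which makes the three models directly comparable by AUC.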
Input features for the classification models were age, sex, disease diagnoses and 77 laboratory tests. To reduce the dimensionality of the feature space, diseases that occurred in less than 1% of the study dataset were ruled out. In total, 27 diseases with a statistically different occurrence rate between patients with and without DM (χ2 test, P < 0.05) remained for further modeling. Finally, 106 features were used as the input features for the models.
The basic characteristics of samples in the test set (n = 6490) and training set (n = 10,000)
Male gender, n (%)
Age (years), mean ± SD: 60.1 ± 14.7 (test set); 60.1 ± 15.0 (training set)
Myocardial infarction, n (%)
Congestive heart failure, n (%)
Chronic obstructive pulmonary disease, n (%)
Mild liver disease, n (%)
Hypertension, n (%)
Coronary heart disease, n (%)
Serum glucose (mmol/L), mean ± SD: 6.6 ± 2.9 (test set); 6.7 ± 2.9 (training set)
Abnormal urine glucose, n (%)
To dynamically evaluate the potential of the proposed patient similarity for selecting similar samples to predict diabetes, predictive models were trained on the top K similar patients, where the smaller the sample size K, the more similar the selected training patients. Performance evaluation and comparisons were then conducted among the three classification models built on similar and randomly selected samples of the same size, and the trends in predictive performance as the training sample size increased were analyzed. Predictive performance was evaluated by the AUC. Cubic polynomial fitting was used to depict the trends in AUC.
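The two building blocks of this evaluation, selecting the top K most similar training samples and scoring predictions by AUC, can be sketched as follows. The AUC here uses the rank-based (Mann-Whitney) formulation; Python is used for illustration only, as the study's computations were done in R.

```python
def top_k_similar(similarities, k):
    """Indices of the k training samples most similar to the index patient."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    return order[:k]

def auc(y_true, y_score):
    """AUC via the Mann-Whitney formulation: the probability that a randomly
    chosen positive sample is scored higher than a randomly chosen negative
    one (ties count as half)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping k from small to large and plotting `auc` against k reproduces the saturation curves discussed in the results.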
To help understand the classification process of the kNN model, the patient to be predicted and its k (k = 10, 50, 100) nearest neighbors were visualized. Another visualization was used to show the top 20 important features captured by the RF models built on similar patients and randomly selected patients, respectively. Feature importance was determined by the Gini coefficients.
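The Gini-based importance mentioned above accumulates, for each feature, the impurity decrease of every tree split that uses it. A minimal Python sketch of the two underlying quantities (illustrative only, not the paper's R implementation):

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in probs)

def split_importance(parent, left, right):
    """Impurity decrease of one split; RF sums this per feature to rank them."""
    n = len(parent)
    weighted_child = ((len(left) / n) * gini_impurity(left)
                      + (len(right) / n) * gini_impurity(right))
    return gini_impurity(parent) - weighted_child
```

A split that perfectly separates the classes yields the maximum impurity decrease of 0.5 for a balanced binary node, which is why highly discriminative features (e.g., serum glucose for DM) rank near the top.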
All computations and analyses were conducted using R 3.4.0 software (https://cran.r-project.org/).
HC conceived the study and developed the methods; XF and LW collected data; NW and YH sorted and analyzed the data. HL and XZ drafted the manuscript; NW prepared the figures; HC provided critical review of the manuscript. All authors have reviewed the final version of the manuscript for publication. All authors read and approved the final manuscript.
This work was supported by the National Natural Science Foundation of China (Nos. 81901707, 81671786 and 81701792).
Ethics approval and consent to participate
Consent for publication
All authors have approved the manuscript and agreed with submission and publication. The manuscript has not previously been published elsewhere and is not under consideration by any other journals.
The authors declare that they have no competing interests.
- 1. Henriques J, Carvalho P, Paredes S, Rocha T. Prediction of heart failure decompensation events by trend analysis of telemonitoring data. IEEE J Biomed Health Inform. 2014;19(5):1757–69.
- 2. Sharafoddini A, Dubin JA, Lee J. Patient similarity in prediction models based on health data: a scoping review. JMIR Med Inform. 2017;5(1):e7.
- 3. Krysik K, Dobrowolski D, Polanowska K, Lyssek-Boron A, Wylegala EA. Measurements of corneal thickness in eyes with pseudoexfoliation syndrome: comparative study of different image processing protocols. J Healthc Eng. 2017;2017:4315238.
- 4. Lyssek-Boroń A, Wylęgała A, Polanowska K, Krysik K, Dobrowolski D. Longitudinal changes in retinal nerve fiber layer thickness evaluated using Avanti Rtvue-XR optical coherence tomography after 23G vitrectomy for epiretinal membrane in patients with open-angle glaucoma. J Healthc Eng. 2017;2017:4673714.
- 5. Chatterjee A, He D, Fan X, Antic T, Jiang Y, Eggener S, Karczmar GS, Oto A. Diagnosis of prostate cancer by use of MRI-derived quantitative risk maps: a feasibility study. Am J Roentgenol. 2019;213:1–10.
- 6. Yang C, Lu M, Duan Y, Liu B. An efficient optic cup segmentation method decreasing the influences of blood vessels. Biomed Eng Online. 2018;17(1):130.
- 7. Krysik K, Dobrowolski D, Stanienda-Sokół K, Wylegala EA, Lyssek-Boron A. Scheimpflug camera and swept-source optical coherence tomography in pachymetry evaluation of diabetic patients. J Ophthalmol. 2019;2019:1–6.
- 8. Ng K, Sun J, Hu J, Wang F. Personalized predictive modeling and risk factor identification using patient similarity. AMIA Summits Transl Sci Proc. 2015;2015:132–6.
- 9. Whellan DJ, Ousdigian KT, Alkhatib SM, Pu W, Sarkar S, Porter CB, Pavri BB, O’Connor CM, Investigators PS. Combined heart failure device diagnostics identify patients at higher risk of subsequent heart failure hospitalizations: results from PARTNERS HF (program to access and review trending information and evaluate correlation to symptoms in patients with heart failure) study. J Am Coll Cardiol. 2010;55(17):1803–10.
- 10. Sepanski RJ, Godambe SA, Mangum CD, Bovat CS, Zaritsky AL, Shah SH. Designing a pediatric severe sepsis screening tool. Front Pediatr. 2014;2(56):56.
- 11. Wu J, Roy J, Stewart W. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med Care. 2010;48(6 Suppl):S106.
- 12. Shickel B, Tighe PJ, Bihorac A. Deep EHR: a survey of recent advances on deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2017;22(5):1589–604.
- 13. Marcos M, Maldonado JA, Martinez-Salvador B, Bosca D, Robles M. Interoperability of clinical decision-support systems and electronic health records using archetypes: a case study in clinical trial eligibility. J Biomed Inform. 2013;46(4):676–89.
- 14. Parimbelli E, Marini S, Sacchi L, Bellazzi R. Patient similarity for precision medicine: a systematic review. J Biomed Inform. 2018;83:87–96.
- 15. Wang C, Machiraju R, Huang K. Breast cancer patient stratification using a molecular regularized consensus clustering method. Methods. 2014;67(3):304–12.
- 16. Li L, Cheng WY, Glicksberg BS, Gottesman O, Tamler R, Chen R, Bottinger EP, Dudley JT. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Sci Transl Med. 2015;7(311):311ra174.
- 17. Panahiazar M, Taslimitehrani V, Pereira NL, Pathak J. Using EHRs for heart failure therapy recommendation using multidimensional patient similarity analytics. Stud Health Technol Inform. 2015;210:369–73.
- 18. Wang F. Adaptive semi-supervised recursive tree partitioning: the ART towards large scale patient indexing in personalized healthcare. J Biomed Inform. 2015;55:41–54.
- 19. Lee J, Maslove DM, Dubin JA. Personalized mortality prediction driven by electronic medical data and a patient similarity metric. PLoS ONE. 2015;10(5):e0127428.
- 20. David G, Bernstein L, Coifman RR. Generating evidence based interpretation of hematology screens via anomaly characterization. Open Clin Chem J. 2011;4(1):10–6.
- 21. Chattopadhyay S, Ray P, Chen HS. Suicidal risk evaluation using a similarity-based classifier. Adv Data Min Appl. 2008;5139:51–61.
- 22. Popescu M, Khalilia M. Improving disease prediction using ICD-9 ontological features. IEEE Int Conf Fuzzy Syst. 2011;56(10):1805–9.
- 23. Girardi D, Wartner S, Halmerbauer G, Ehrenmüller M, Kosorus H, Dreiseitl S. Using concept hierarchies to improve calculation of patient similarity. J Biomed Inform. 2016;63(C):66–73.
- 24. Hielscher T, Spiliopoulou M, Volzke H, Kuhn JP. Using participant similarity for the classification of epidemiological data on hepatic steatosis. In: IEEE international symposium on computer-based medical systems. Washington, D.C.: IEEE Computer Society; 2014. p. 1–7.
- 25. Park YJ, Kim BC, Chun SH. New knowledge extraction technique using probability for case-based reasoning: application to medical diagnosis. Expert Syst. 2010;23(1):2–20.
- 26. Ashley J. The international classification of diseases: the structure and content of the tenth revision. Health Trends. 1990;22(4):135.
- 27. Cowen ME, Dusseau DJ, Toth BG, Guisinger C, Zodet MW, Shyr Y. Casemix adjustment of managed care claims data using the clinical classification for health policy research method. Med Care. 1998;36(7):1108–13.
- 28. Gottlieb A, Stein GY, Ruppin E, Altman RB, Sharan R. A method for inferring medical diagnoses from patient similarities. BMC Med. 2013;11(1):194.
- 29. Huang Y, Wang N, Liu H, Zhang H, Fei X, Wei L, Chen H. Study on patient similarity measurement based on electronic medical records. Stud Health Technol Inform. 2019;264:1484–5.
- 30. Chen G, Khan N, Walker R, Quan H. Validating ICD coding algorithms for diabetes mellitus from administrative data. Diabetes Res Clin Pract. 2010;89(2):189–95.
- 31. Khokhar B, Jette N, Metcalfe A, Cunningham CT, Quan H, Kaplan GG, Butalia S, Rabi D. Systematic review of validated case definitions for diabetes in ICD-9-coded and ICD-10-coded data in adult populations. BMJ Open. 2016;6(8):e009952.
- 32. Neuvirth H, Ozery-Flato M, Hu J, Laserson J, Kohn MS, Ebadollahi S, Rosen-Zvi M. Toward personalized care management of patients at risk: the diabetes case study. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining; 2011. p. 395–403.
- 34. Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.
- 35. Ho T. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20(8):832–44.
- 37. Charlson ME, Pompei P, Ales KL, Mackenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–83.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.