Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints
Abstract
Background
Modern modelling techniques may potentially provide more accurate predictions of binary outcomes than classical techniques. We aimed to study the predictive performance of different modelling techniques in relation to the effective sample size (“data hungriness”).
Methods
We performed simulation studies based on three clinical cohorts: 1282 patients with head and neck cancer (46.9% 5-year survival), 1731 patients with traumatic brain injury (22.3% 6-month mortality) and 3181 patients with minor head injury (7.6% with CT scan abnormalities). We compared three relatively modern modelling techniques: support vector machines (SVM), neural nets (NN) and random forests (RF), and two classical techniques: logistic regression (LR) and classification and regression trees (CART). We created three large artificial databases with 20-fold, 10-fold and 6-fold replication of subjects, in which we generated dichotomous outcomes according to different underlying models. We applied each modelling technique to increasingly larger development parts (100 repetitions). The area under the ROC curve (AUC) indicated the performance of each model in the development part and in an independent validation part. Data hungriness was defined by plateauing of the AUC and small optimism (difference between the mean apparent AUC and the mean validated AUC <0.01).
Results
We found that a stable AUC was reached by LR at approximately 20 to 50 events per variable, followed by CART, SVM, NN and RF models. Optimism decreased with increasing sample sizes and the same ranking of techniques. The RF, SVM and NN models showed instability and a high optimism even with >200 events per variable.
Conclusions
Modern modelling techniques such as SVM, NN and RF may need over 10 times as many events per variable to achieve a stable AUC and a small optimism as classical modelling techniques such as LR. This implies that such modern techniques should only be used in medical prediction problems if very large data sets are available.
Keywords
Support vector machine; Random forest; Modelling technique; Reference model; Data hungriness
Abbreviations
CT: Computed tomography
SVM: Support vector machines
NN: Neural nets
RF: Random forest
CART: Classification and regression trees
LR: Logistic regression
ROC: Receiver operating characteristic
AUC: Area under the curve
EPV: Events per variable
HNSCC: Head and neck squamous cell carcinoma
TBI: Traumatic brain injury
CHIP: CT in head injury patients
UCI: University of California, Irvine
LASSO: Least absolute shrinkage and selection operator
Background
Prediction of binary outcomes is important in medical research. Interest in the development, validation, and clinical application of clinical prediction models is increasing [1]. Most prediction models are based on logistic regression analysis (LR), but other, more modern techniques may also be used. Support vector machines (SVM), neural nets (NN) and random forests (RF) have received increasing attention in medical research [2, 3, 4, 5, 6], since they hold the promise of better capturing nonlinearities and interactions in medical data. The increased flexibility of modern techniques implies that larger sample sizes may be required for reliable estimation. Little is known, however, about the sample size needed for a modern modelling technique to generate a prediction model that outperforms more traditional, regression-based modelling techniques on medical data.
Usually, only a relatively limited number of subjects is available for developing prediction models. In 1995, a comparative study of the performance of various prediction models for medical outcomes concluded that the ultimate limitation seemed to lie in the amount of information available in the data, a phenomenon the authors termed the "data barrier" [7].
Some researchers have aimed to develop a "power law" to determine the relation between sample size and the discriminatory ability of prediction models in terms of accuracy [8, 9, 10]. These studies clarified how a larger sample size leads to better accuracy: a satisfactory level of accuracy (the accuracy at infinite sample size +/− 0.01) was achieved at sample sizes varying from 300 to 16,000 records, depending on the modelling technique and the data structure. The relation between sample size and accuracy was reflected in learning curves. Similarly, the number of events per variable (EPV) has been studied in relation to model performance [11, 12, 13, 14, 15].
In the current study, we aimed to define learning curves that reflect the performance of a model in terms of discriminatory ability, a key aspect of the performance of prediction models in medicine [16]. We assumed that the discriminatory ability of a model is a monotonically increasing function of the sample size, converging to a maximum at infinite sample size. We hypothesized that modern, more flexible techniques are more "data hungry" [17] than more conventional modelling techniques such as regression analysis. The concept of data hungriness refers to the sample size that a modelling technique needs to generate a prediction model with good predictive accuracy. For a fair comparison, we generated reference models with each of the modelling techniques considered in our simulation study.
Methods
Patients
We performed a simulation study based on three patient cohorts.

The first cohort consisted of patients with head and neck squamous cell carcinoma ("HNSCC cohort") [18]. This cohort contained 7 predictor variables (2 dichotomous, 4 categorical and 1 continuous) and a dichotomous outcome with an incidence of 601/1282 (46.9%) (Table 1).

Table 1 Cohort characteristics

             HNSCC              TBI                CHIP
Outcome      5-year survival    6-month mortality  Intracranial findings
Type         Dichotomous        Dichotomous        Dichotomous
Event/total  601/1282 (46.9%)   386/1731 (22.3%)   243/3181 (7.6%)
Predictors   2 dichotomous      4 dichotomous      9 dichotomous
             4 categorical      1 categorical      1 categorical
             1 continuous       4 continuous       2 continuous
The second cohort consisted of patients with traumatic brain injury ("TBI cohort") [19]. This cohort contained 10 predictor variables (4 dichotomous, 1 categorical and 4 continuous) and a dichotomous outcome with an incidence of 386/1731 (22.3%) (Table 1).
The third cohort consisted of patients with minor head injury who underwent a CT scan ("CHIP cohort") [6]. This cohort contained 12 predictor variables (9 dichotomous, 1 categorical and 2 continuous) and a dichotomous (0/1) outcome with an incidence of 243/3181 (7.6%) (Table 1).
We generated artificial cohorts by replicating the HNSCC cohort 20 times, the TBI cohort 10 times and the CHIP cohort 6 times. This resulted in an artificial cohort consisting of 25,640 subjects (“HNSCC artificial cohort”), an artificial cohort consisting of 17,310 subjects (“TBI artificial cohort”) and an artificial cohort consisting of 19,086 subjects (“CHIP artificial cohort”).
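In R, such replication is a one-liner; a minimal sketch (with our own object names, not the authors' code) is:

```r
# Build an artificial cohort by stacking k exact copies of the original
# cohort (k = 20 for HNSCC, 10 for TBI, 6 for CHIP).
replicate_cohort <- function(cohort, k) {
  cohort[rep(seq_len(nrow(cohort)), times = k), , drop = FALSE]
}

# Example: 20-fold replication of a 1282-subject cohort gives 25,640 rows.
# hnscc_art <- replicate_cohort(hnscc, 20)
```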
Reference models
In the current study, we evaluated the following modelling techniques, using default settings as far as possible:

Logistic regression (LR)

Classification and regression trees (CART)

Support vector machines (SVM)

Neural nets (NN)

Random forest (RF)
For a description of these modelling techniques, based on the work of various authors [12, 15, 20, 21], we refer to Additional file 1.
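For orientation, all five techniques are available in standard R packages; the sketch below shows how they can be fitted with largely default settings. The package choices, the data frame `d` and the outcome `y` are our assumptions; the authors' actual code is in Additional file 2.

```r
library(rpart); library(e1071); library(nnet); library(randomForest)

d$y <- factor(d$y)  # 0/1 outcome as a factor for the classifiers

fit_lr   <- glm(y ~ ., family = binomial, data = d)       # logistic regression
fit_cart <- rpart(y ~ ., data = d, method = "class")      # CART
fit_svm  <- svm(y ~ ., data = d, probability = TRUE)      # support vector machine
fit_nn   <- nnet(y ~ ., data = d, size = 5, maxit = 200)  # neural net; size has no
                                                          # default, 5 is arbitrary
fit_rf   <- randomForest(y ~ ., data = d)                 # random forest
```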
As reference points for this evaluation, we first applied each modelling technique to each entire artificial cohort to generate an LR model, a CART model, an SVM model, an NN model and an RF model. These models were fitted with optimization according to default settings. Next, we generated probabilities of the outcome for each of these reference models. With these probabilities, we generated a new 0/1 outcome by comparing the predicted probability of each reference model with a random number drawn from a uniform (0,1) distribution. Using this new 0/1 outcome, we evaluated the five modelling techniques. The R code for the construction of the reference models is in Additional file 2.
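A minimal sketch of this outcome-generation step (the actual code is in Additional file 2; variable names here are our own): given a reference model's predicted probabilities `p` for all subjects, each new 0/1 outcome is drawn by comparing the probability with a uniform random number.

```r
# p: predicted probabilities from a fitted reference model, e.g.
# p <- predict(fit_lr, newdata = art_cohort, type = "response")
set.seed(1)
generate_outcome <- function(p) {
  # Subject i receives outcome 1 with probability p[i]
  as.integer(runif(length(p)) < p)
}
# y_new <- generate_outcome(p)  # outcome used to refit all five techniques
```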
Development and validation
For each of the five modelling techniques, we randomly divided the artificial cohort into a development set and a validation set for performance assessment. Each set consisted of 50% of the subjects of the artificial cohort.
Simulation design and analysis
 1. Development sets were samples of increasing size (varying from 200 to the maximum size of the development set, in increments of 1000), drawn at random from the non-validation part of the artificial cohort.
 2. For each of the five modelling techniques, we generated a model for each sample, taking the 0/1 outcome of a specific reference model as outcome. We evaluated the predictions on each sample.
 3. For each sample, the predictions of the model were evaluated on the validation set, taking the 0/1 outcome of the same reference model as outcome.
We repeated these steps 100 times for each sample size to achieve sufficient stability. We considered each of the five reference models in turn for a fair comparison of each of the modelling techniques. Evaluation of predictive performance focussed on the discriminatory ability according to the area under the Receiver Operating Characteristic curve (AUC). The AUC was determined using the development set (apparent AUC) and the validation set (validated AUC). We calculated optimism as mean apparent AUC minus mean validated AUC.
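To make the procedure concrete, a minimal R sketch of one simulation iteration for the LR technique is given below. The AUC is computed with the rank-based Mann-Whitney formula; all object names are illustrative, not the authors' actual code.

```r
# Rank-based (Mann-Whitney) estimate of the area under the ROC curve
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# One iteration: dev_pool = non-validation half of the artificial cohort,
# val = validation half, both with a 0/1 outcome y generated under a
# chosen reference model; n = development sample size.
one_iteration <- function(dev_pool, val, n) {
  samp <- dev_pool[sample(nrow(dev_pool), n), ]        # step 1: draw a sample
  fit  <- glm(y ~ ., family = binomial, data = samp)   # step 2: fit the model
  apparent  <- auc(samp$y, predict(fit, type = "response"))
  validated <- auc(val$y, predict(fit, newdata = val,  # step 3: validate
                                  type = "response"))
  c(apparent = apparent, validated = validated)
}

# 100 repetitions per sample size; optimism = mean apparent - mean validated
# res <- replicate(100, one_iteration(dev_pool, val, n = 1000))
# optimism <- mean(res["apparent", ]) - mean(res["validated", ])
```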
We defined the maximally attainable AUC (AUCmax) as the validated AUC value of a model based on the entire development set (50% of the artificial cohort).
Learning curves
For each modelling technique, we generated learning curves to visualize the relation between the AUC values and the optimism of the generated models on the one hand and the number of events per variable on the other.
Data hungriness
The data hungriness of a modelling technique was defined as the minimum number of events per variable at which the optimism of the generated model was <0.01. This limit was admittedly arbitrary, but in line with previous research [24].
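Expressed as code, reading the data hungriness off a learning curve amounts to the following (an illustrative sketch of our own; `epv` and `optimism` are assumed to be the coordinates of the learning curve):

```r
# EPV at a given sample: number of events divided by the number of
# candidate predictor variables, e.g. sum(samp$y == 1) / 12 for CHIP.
data_hungriness <- function(epv, optimism, limit = 0.01) {
  ok <- which(optimism < limit)
  if (length(ok) == 0) return(NA)  # threshold never reached (as for some RF models)
  min(epv[ok])
}
```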
Sensitivity analysis
We performed a sensitivity analysis to determine the influence of the endpoint incidence in the CHIP artificial cohort (7.6%). To this end, we selectively oversampled subjects with the outcome of interest to generate an artificial cohort with an endpoint incidence of 50% ("CHIP5050 cohort").
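A minimal sketch of such oversampling (our own construction; the paper does not print its resampling code) draws subjects with the event, with replacement, until they match the number of non-events:

```r
# Oversample subjects with the outcome (y == 1) until the event
# rate is 50%, leaving non-events untouched.
oversample_events <- function(cohort) {
  events     <- cohort[cohort$y == 1, ]
  non_events <- cohort[cohort$y == 0, ]
  extra <- events[sample(nrow(events), nrow(non_events), replace = TRUE), ]
  rbind(extra, non_events)  # equal numbers of events and non-events
}
```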
Results
HNSCC cohort
AUCmax per reference model, HNSCC cohort

Modelling technique   LR      CART    SVM     NN      RF
LR                    0.797   0.745   0.803   0.802   0.880
CART                  0.730   0.748   0.749   0.728   0.822
SVM                   0.787   0.740   0.814   0.802   0.898
NN                    0.785   0.744   0.800   0.804   0.869
RF                    0.784   0.747   0.810   0.810   0.929

Columns give the reference model used to generate the outcome; rows give the modelling technique applied. The same layout is used in the tables below.
For all reference models, the optimism of the models decreased with an increasing number of events per variable. For all reference models except CART, LR needed the smallest number of events per variable to reach an optimism <0.01 (55 to 127 events per variable).
TBI cohort
AUCmax per reference model, TBI cohort

Modelling technique   LR      CART    SVM     NN      RF
LR                    0.806   0.712   0.743   0.762   0.817
CART                  0.710   0.702   0.676   0.652   0.684
SVM                   0.754   0.677   0.765   0.765   0.838
NN                    0.800   0.701   0.746   0.802   0.828
RF                    0.744   0.685   0.750   0.776   0.988
CHIP cohort
AUCmax per reference model, CHIP cohort

Modelling technique   LR      CART    SVM     NN      RF
LR                    0.786   0.572   0.607   0.782   0.903
CART                  0.562   0.578   0.580   0.500   0.666
SVM                   0.584   0.560   0.615   0.616   0.871
NN                    0.758   0.564   0.589   0.791   0.856
RF                    0.728   0.579   0.594   0.755   0.916
Sensitivity analysis CHIP cohort
Discussion
Modern modelling techniques, such as SVM, NN and RF, needed far more events per variable to achieve a stable validated AUC and an optimism <0.01 than the more conventional modelling techniques, such as LR and CART. The CART models had stable performance, but at a fairly poor level; notably, a larger number of events did not lead to better validated performance in the cohort with a 7.6% event rate. The LR models had low optimism when the number of events per variable was at least 20 to 50. A remarkable finding was that the optimism of the RF models remained high in all three cohorts, even at a large number (over 200) of events per variable, indicating that these RF models were far from robust. Of note, the validated performance of RF models was similar to that of LR models. This implies that RF models in particular need careful validation to assess predictive performance, since apparent performance may be highly optimistic.
Since LR modelling is far less data hungry than alternative modelling techniques, this technique may especially be useful in relatively small data sets. With very small data sets, any modelling technique will lead to poorly performing models. Our results confirm the generally accepted rule that reasonable predictive modelling requires at least 10 events per variable, even with a robust technique such as LR [11, 12, 15]. We note that larger numbers of events per variable are desirable to achieve better stability and higher expected performance.
The modelling techniques SVM and NN needed far more events per variable to generate models with a stable mean validated AUC value and an optimism converging towards zero. For models generated with RF, the optimism did not converge towards zero even at the largest number of events per variable that we evaluated.
Obviously, models generated by the same modelling technique as the reference model generally performed best, reflecting a “home advantage” over models generated by a different modelling technique than the reference model. The performance of models according to different reference models was provided for a fair assessment of the performance of the approaches considered.
While RF and LR models consistently performed well, CART consistently performed poorly. The poor performance of CART modelling may be explained by the fact that continuous variables need to be categorized, with optimal cutoffs determined from all possible cutoff points, and that possibly unnecessary higherorder interactions are assumed between all predictor variables. RF modelling is an obvious improvement over CART modelling [24]. It is hence remarkable that CART is still advocated as the preferred modelling technique for prediction in some disease areas, such as trauma [25]. A researcher must always carefully consider which modelling technique is appropriate in a specific situation. Using, for instance, a random forest technique just because the number of subjects is over 10,000 is too simplistic.
The aim of our study was to investigate the data hungriness of the various modelling techniques, not to find the best modelling technique in terms of AUC. To our knowledge, the data hungriness of various modelling techniques has not been assessed before for medical prediction problems. A few studies, however, addressed this topic in the context of progressive sampling for the development of a power law to guide the required sample size for prediction modelling. For example, arithmetic sampling was applied with sample sizes of 100, 200, 300, 400, etc. to 11 of the UCI repository databases to obtain insight into the performance of a naive Bayes classifier [8]. This study found required sample sizes of 300 to 2180 to be within 2% of the accuracy of a model built from the entire database. Other researchers modelled 3 of the larger databases from the UCI repository using different progressive sampling techniques [9]. Using the C4.5 modelling technique, which we consider a CART variant, sample sizes of 2000 for the LED database, 8000 for the CENSUS database and 12,000 for the WAVEFORM database were required for a model to be no more than 1% less accurate than a model based on all the available data.
Another study compared the performance of 6 data mining tools at various sample sizes for 2 test databases (test database I with 50,000 records and test database II with 1,500,000 records), using accuracy as the performance measure. For test database I, a stable level of accuracy was reached at 16,000 records for all tools; for test database II, a stable level was reached at 8000 records [10]. The results of our study are in line with these findings. Although we used mean validated AUC values instead of accuracy to measure model performance, we also found that the more complex modelling techniques required large numbers of events per variable to generate models with an optimism <0.01.
A number of limitations need to be considered. Firstly, we used three cohorts with dichotomous outcomes in which nonlinearity was not a major issue. While this may be common in medical research, it limited the opportunity for the modern modelling techniques to outperform traditional logistic regression modelling. If important nonlinearity is truly present in a data set, techniques that capture such nonlinear patterns well are obviously attractive. Various approaches can be considered to address nonlinearity within the regression framework, including restricted cubic splines and fractional polynomials [15, 26], as sketched below. Secondly, we used default settings for the modelling techniques [8]. Further research might investigate the models we evaluated, as well as other modelling techniques such as LASSO, using other cohorts and other settings for the modelling (such as pruning options, priors, and the number of subjects in the end nodes).
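For instance, a restricted (natural) cubic spline for a continuous predictor can be added to a logistic regression with the base R splines package; `age`, `sex` and the data frame `d` below are illustrative names, not variables from the study cohorts.

```r
library(splines)

# ns() constructs a natural cubic spline basis: a restricted cubic spline
# constrained to be linear beyond the boundary knots.
fit_lin  <- glm(y ~ age + sex, family = binomial, data = d)
fit_flex <- glm(y ~ ns(age, df = 3) + sex, family = binomial, data = d)

# The flexible term costs extra degrees of freedom, one reason such models
# become more data hungry; compare e.g. via AIC(fit_lin, fit_flex).
```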
Thirdly, there was a considerable difference in incidence between the three cohorts (47%, 22% and 8%). To assess the effect of this difference in incidence on the data hungriness, we performed a sensitivity analysis. Further research should evaluate the relation between the incidence of the outcome and the data hungriness patterns of various modelling techniques.
Conclusions
Modern modelling techniques such as SVM, NN and RF need far more events per variable to achieve a stable AUC value than classical modelling techniques such as LR and CART. If very large data sets are available, modern techniques such as RF may potentially achieve an AUC value that exceeds those of modelling techniques such as LR. The improvement over simple LR models may, however, be minor, as shown in the two empirical examples in this study. This implies that modern modelling techniques should only be considered for medical prediction problems if very large data sets with many events are available.
Notes
Acknowledgements
The authors thank Mr. D. Nieboer for methodological and statistical advice and Mrs. L. van Hulst for editorial support.
Supplementary material
References
 1. Steyerberg EW, Moons KGM, van der Windt DA, Hayden JA, Perel P, Schroter S, Riley RD, Hemingway H, Altman DG: Prognosis Research Strategy (PROGRESS) 3: prognostic model research. PLoS Med. 2013, 10.
 2. Harrison RF, Kennedy RL: Artificial neural network models for prediction of acute coronary syndromes using clinical data from the time of presentation. Ann Emerg Med. 2005, 46: 431-439. doi:10.1016/j.annemergmed.2004.09.012.
 3. Lee YJ, Mangasarian OL, Wolberg WH: Breast cancer survival and chemotherapy: a support vector machine analysis. Discrete Mathematical Problems with Medical Applications: DIMACS Workshop, December 8-10, 1999, Volume 55. 2000, DIMACS Center, 1.
 4. Ennis M, Hinton G, Naylor D, Revow M, Tibshirani R: A comparison of statistical learning methods on the Gusto database. Stat Med. 1998, 17: 2501-2508. doi:10.1002/(SICI)1097-0258(19981115)17:21<2501::AID-SIM938>3.0.CO;2-M.
 5. Van der Ploeg T, Datema F, Baatenburg de Jong RJ, Steyerberg EW: Prediction of survival with alternative modeling techniques using pseudo values. PLoS One. 2014, 9: e100234. doi:10.1371/journal.pone.0100234.
 6. Van der Ploeg T, Smits M, Dippel DW, Hunink M, Steyerberg EW: Prediction of intracranial findings on CT-scans by alternative modelling techniques. BMC Med Res Methodol. 2011, 11: 143.
 7. Selker HP, Griffith JL, Patil S, Long WJ, D'Agostino RB: A comparison of performance of mathematical predictive methods for medical diagnosis: identifying acute cardiac ischemia among emergency department patients. 1995, 43: 468-476.
 8. John GH, Langley P: Static versus dynamic sampling for data mining. KDD. 1996, 367-370.
 9. Provost F, Jensen D, Oates T: Efficient progressive sampling. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1999, 23-32.
 10. Morgan J, Daugherty R, Hilchie A, Carey B: Sample size and modeling accuracy of decision tree based data mining tools. Acad Inf Manag Sci J. 2003, 6: 71-99.
 11. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR: A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996, 49: 1373-1379. doi:10.1016/S0895-4356(96)00236-3.
 12. Harrell FE: Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2001, New York: Springer.
 13. Steyerberg EW, Harrell FE, Borsboom GJJM, Eijkemans MJC, Vergouwe Y, Habbema JDF: Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001, 54: 774-781. doi:10.1016/S0895-4356(01)00341-9.
 14. Steyerberg EW, Eijkemans MJC, Harrell FE, Habbema JDF: Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med. 2000, 19: 1059-1079. doi:10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0.
 15. Steyerberg EW: Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2009, New York: Springer.
 16. Tzoulaki I, Liberopoulos G, Ioannidis JPA: Assessment of claims of improved prediction beyond the Framingham risk score. JAMA. 2009, 302: 2345-2352. doi:10.1001/jama.2009.1757.
 17. Lee DB: Requiem for large-scale models. ACM SIGSIM Simul Dig. 1975, 16-29.
 18. Datema FR, Ferrier MB, van der Schroeff MP, Baatenburg de Jong RJ: Impact of comorbidity on short-term mortality and overall survival of head and neck cancer patients. Head Neck. 2010, 32: 728-736.
 19. Marmarou A, Lu J, Butcher I, McHugh GS, Mushkudiani NA, Murray GD, Steyerberg EW, Maas AIR: IMPACT database of traumatic brain injury: design and description. J Neurotrauma. 2007, 24: 239-250. doi:10.1089/neu.2006.0036.
 20. Tufféry S: Data Mining and Statistics for Decision Making. 2011, Chichester: John Wiley & Sons.
 21. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, Belmont: Wadsworth.
 22. R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna: R Foundation for Statistical Computing.
 23. Kurahashi I: KsPlot: Check the Power of Several Statistical Models. 2011.
 24. Austin PC, Lee DS, Steyerberg EW, Tu JV: Regression trees for predicting mortality in patients with cardiovascular disease: what improvement is achieved by using ensemble-based methods?. Biometrical J. 2012, 54: 657-673. doi:10.1002/bimj.201100251.
 25. Young NH, Andrews PJD: Developing a prognostic model for traumatic brain injury—a missed opportunity?. PLoS Med. 2008, 5: 3.
 26. Harrell FE, Lee KL, Mark DB: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15: 361-387. doi:10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
Pre-publication history
 The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/137/prepub
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.