An adaptive optimal ensemble classifier via bagging and rank aggregation with applications to high dimensional data
Abstract
Background
Generally speaking, different classifiers tend to work well for certain types of data and, conversely, it is usually not known a priori which algorithm will be optimal in any given classification application. In addition, for most classification problems, selecting the best performing classification algorithm amongst a number of competing algorithms is a difficult task for various reasons; for example, the ranking of the algorithms may depend on the performance measure employed for the comparison. In this work, we present a novel adaptive ensemble classifier, constructed by combining bagging and rank aggregation, that is capable of adaptively changing its performance depending on the type of data that is being classified. The attractive feature of the proposed classifier is its multiobjective nature: the classification results can be simultaneously optimized with respect to several performance measures, for example, accuracy, sensitivity and specificity. We also show that our somewhat complex strategy has better predictive performance, as judged on test samples, than a more naive approach that attempts to directly identify the optimal classifier based on the training-data performances of the individual classifiers.
Results
We illustrate the proposed method with two simulated and two real-data examples. In all cases, the ensemble classifier performs at the level of the best individual classifier comprising the ensemble, or better.
Conclusions
For complex high-dimensional datasets resulting from present day high-throughput experiments, it may be wise to consider a number of classification algorithms combined with dimension reduction techniques rather than a fixed standard algorithm set a priori.
Keywords
Support Vector Machine, Partial Least Squares, Random Forest, Linear Discriminant Analysis, Classification Algorithm
Background
Sophisticated and advanced supervised learning techniques, such as Neural Networks (NNs) and Support Vector Machines (SVMs), now face a legitimate, even if somewhat surprising, competitor in the form of ensemble classifiers. The latter are usually bagging [1], boosting [2], or their variants (arcing, wagging): methods which improve the accuracy of "weak" classifiers that individually are no match for NNs and SVMs. Random Forests [3] and AdaBoost [4] are the two most notable examples of ensemble tree classifiers that have been shown to have superior performance in many circumstances.
Unfortunately, combining "strong" or stable classifiers characterized by small variance, for example K-nearest neighbor (KNN) classifiers or SVMs, generally does not result in smaller classification error rates. Thus, there seems to be little or no incentive to run computationally expensive classification methods on random subsets of the training data if the final classification accuracy will not improve. From a slightly different angle, it is also naive to expect significant improvements in a classifier's accuracy when it is already very close to that of the optimal Bayes classifier, which cannot be improved upon. However, in a real-world problem, neither the optimal classification accuracy nor the true accuracy of any individual classifier is known, and it is rather difficult to determine which classification algorithm has the best accuracy rates when applied to specific observed training data.
In a recent classification competition that took place in the Netherlands, several research groups were invited to build predictive models for breast cancer diagnosis based on proteomic mass spectrometry data [5]. Their models were objectively compared on separate testing data which were kept private before the competition. Interestingly enough, despite the "controlled" environment and the objectivity in assessing the results, no single group emerged as the winner. This was in part due to the difficulty of determining the "best" model, which depended strongly on which performance measure was used (accuracy, sensitivity or specificity). The overall conclusion drawn after the fact was that no single classification algorithm was the best and that an algorithm's performance correlated strongly with the user's sophistication and interaction with the method (setting tuning parameters, feature selection, and so on). In general, since no single classification algorithm performs optimally for all types of data, it is desirable to create an ensemble classifier consisting of commonly used "good" individual classification algorithms that adaptively matches its performance, depending on the type of data, to that of the best performing individual classifier.
In this work, we propose a novel adaptive ensemble classification method which internally makes use of several existing classification algorithms (user selectable) and combines them in a flexible way to adaptively produce results at least as good as the best classification algorithm in the ensemble. The proposed method is inspired by a combination of bagging and rank aggregation. In our earlier work, the latter was successfully applied in the context of aggregating clustering validation measures [6]. Out-of-bag (OOB) samples play a crucial role in the estimation of classification performance rates, which are then combined through weighted rank aggregation to obtain the locally best performing classifier ${A}_{(1)}^{j}$ for the j-th bootstrap sample.
Being an ensemble classification algorithm, the proposed classifier differs from traditional ensemble classifiers in at least two aspects. The first notable feature is its adaptive nature, which introduces enough flexibility for the classifier to exhibit consistently good performance on many different types of data. The second aspect is the multiobjective approach to classification where the resulting classification model is optimized with respect to several performance measures simultaneously through weighted rank aggregation. The proposed adaptive multiobjective ensemble classifier brings together several highly desirable properties at the expense of increased computational times.
The manuscript is organized as follows. The Results section presents two simulated examples (three-norm and simulated microarray data) and two real-data examples (breast cancer microarray data and ovarian cancer proteomics mass spectrometry data) that clearly demonstrate the utility of the proposed method. This is followed by a discussion and general comments. In the Methods section, we describe the construction of the adaptive ensemble classifier and introduce some common classification algorithms and dimension reduction techniques that we use for demonstrating the ensemble classifier.
Results and Discussion
Performance on Simulated Data
Three-norm data
This is a d-dimensional dataset with two class labels. The first class is generated with equal probability from one of the two normal distributions MN({a, a, ..., a}, I) and MN({−a, −a, ..., −a}, I) (I denotes the identity matrix), while the second class is generated from the multivariate normal distribution MN({a, −a, a, −a, ..., a, −a}, I). Here $a=\frac{2}{\sqrt{d}}$ depends on the number of features d. This benchmark dataset was introduced in [7] and is available in the mlbench R package.
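The generating scheme above can be sketched in a few lines. This is an illustrative Python version (the reference implementation is `mlbench.threenorm` in R); the function name and defaults are our own choices:

```python
import numpy as np

def threenorm(n, d, rng=None):
    """Generate n samples of d-dimensional three-norm data; a sketch,
    with mlbench.threenorm in R as the reference implementation."""
    rng = np.random.default_rng(rng)
    a = 2.0 / np.sqrt(d)
    y = rng.integers(0, 2, size=n)                 # class labels 0/1
    X = rng.standard_normal((n, d))                # unit-variance noise
    # Class 0: mean (a,...,a) or (-a,...,-a) with equal probability.
    sign = rng.choice([-1.0, 1.0], size=n)
    mean0 = sign[:, None] * np.full(d, a)
    # Class 1: alternating mean (a, -a, a, -a, ...).
    alt = a * np.where(np.arange(d) % 2 == 0, 1.0, -1.0)
    X += np.where(y[:, None] == 0, mean0, alt)
    return X, y

X, y = threenorm(500, 20, rng=0)
```

Because the two class-conditional distributions overlap heavily, even strong classifiers cannot get far above the scores reported in the table below.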
Three-norm simulation data (standard errors in parentheses)

Method      Accuracy             Sensitivity          Specificity          AUC
SVM         0.451900 (0.00988)   0.468200 (0.02144)   0.435600 (0.02314)   0.429016 (0.01318)
RF          0.562200 (0.00540)   0.557600 (0.00853)   0.566800 (0.00806)   0.591170 (0.00635)
PLS + LDA   0.610000 (0.00561)   0.608000 (0.00860)   0.612000 (0.00797)   0.610032 (0.00561)
PCA + LDA   0.503600 (0.00617)   0.501800 (0.00674)   0.505400 (0.00680)   0.505236 (0.00753)
PLS + RF    0.612200 (0.00506)   0.586400 (0.01250)   0.638000 (0.01198)   0.648102 (0.00595)
PLS + QDA   0.607500 (0.00577)   0.617200 (0.01142)   0.597800 (0.01218)   0.607500 (0.00577)
PLR         0.540800 (0.00459)   0.538000 (0.00819)   0.543600 (0.00804)   0.557342 (0.00553)
PLS         0.600300 (0.00542)   0.600400 (0.01319)   0.600200 (0.01361)   0.647896 (0.00609)
Greedy      0.596600 (0.00559)   0.581800 (0.01117)   0.611400 (0.01045)   0.621590 (0.00657)
Ensemble    0.613000 (0.00563)   0.606200 (0.00823)   0.619800 (0.00729)   0.653700 (0.00587)
For these datasets, the algorithm that uses PCA for dimension reduction, PCA + LDA, and the SVM clearly underperform in comparison to the other six individual classifiers. It is interesting to note that the PLS-based classification methods exhibit very strong performance, comparable to that of RF. Overall, PLS + RF has the best scores among the eight individual classifiers for three of the four performance measures, while PLS + QDA has the best sensitivity rate. The ensemble classifier's accuracy, sensitivity and specificity are very similar to those of the top performing individual classifiers. The greedy ensemble performs well, but its overall performance is consistently inferior to the proposed ensemble classifier, albeit not by much. Standard errors for the ensemble classifier are also a little smaller than those for the greedy algorithm. The AUC scores were not used in the aggregation process, where we optimized with respect to accuracy, sensitivity and specificity; they are therefore valid indicators of performance that take into consideration both sensitivity and specificity. The ensemble classifier has the largest AUC score.
Simulated microarray data
The simulation scheme uses the simplest model for microarray data, in which d = 5000 individual probes are generated independently of each other from N(μ, 1). Ninety percent of the probes do not differ between cases and controls; their expression values come from a normal distribution with mean 0 and unit variance. The remaining 10% of probes have different means between the groups: in cases their expression values are generated from N(0.3, 1), and in controls from N(0, 1).
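As a sketch, this scheme can be reproduced as follows; the function name and argument defaults are our own (the paper does not publish code for this step):

```python
import numpy as np

def simulate_microarray(n_cases, n_controls, d=5000, frac_diff=0.10,
                        shift=0.3, rng=None):
    """90% of probes are N(0, 1) in both groups; the remaining 10% are
    shifted to N(shift, 1) in cases only, matching the scheme above."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n_cases + n_controls, d))
    n_diff = int(frac_diff * d)
    X[:n_cases, :n_diff] += shift                  # mean shift for cases
    y = np.r_[np.ones(n_cases, int), np.zeros(n_controls, int)]
    return X, y

X, y = simulate_microarray(50, 50, rng=1)
```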
Simulated microarray data (standard errors in parentheses)

Method           Accuracy             Sensitivity          Specificity          AUC
linear SVM       0.902200 (0.00451)   0.907600 (0.00683)   0.896800 (0.00679)   0.967464 (0.00216)
polynomial SVM   0.506200 (0.00383)   0.716400 (0.05493)   0.296000 (0.05477)   0.498772 (0.00640)
radial SVM       0.773200 (0.03090)   0.882000 (0.02851)   0.664400 (0.04473)   0.833576 (0.03750)
sigmoid SVM      0.905000 (0.00432)   0.910400 (0.00655)   0.899600 (0.00581)   0.968472 (0.00210)
Greedy           0.671400 (0.04177)   0.807200 (0.03811)   0.535600 (0.05508)   0.702040 (0.05016)
Ensemble         0.900600 (0.00366)   0.902400 (0.00661)   0.898800 (0.00592)   0.968156 (0.00213)
To illustrate the point that the proposed ensemble algorithm can be used with any combination of individual classifiers, or even the same classifier with different settings of its tuning parameters, for this example we selected the SVM algorithm with four different kernels: linear, polynomial, radial and sigmoid. The default settings for each of the kernels were used. The ensemble classifier performs similarly to the SVM with the sigmoid kernel and clearly outperforms the greedy algorithm.
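Four component classifiers can indeed be obtained from a single SVM implementation by varying only the kernel. A minimal sketch using scikit-learn's `SVC` on toy data (the paper's experiments were run in R; the synthetic signal here is our own):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy linear signal
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The same algorithm with four kernels yields four distinct component
# classifiers for the ensemble (default settings, as in the text).
scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)
```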
Performance on real data
Breast cancer microarray data
These data are publicly available from the GEO database under accession number GSE16443 [8] and were collected to determine the potential of gene expression profiling of peripheral blood cells for early detection of breast cancer. The dataset consists of 130 samples: 67 cases and 63 controls.
Breast cancer microarray data

Method      Accuracy   Sensitivity   Specificity   AUC      Count
SVM         0.5846     0.6679        0.5525        0.6845   168
PLR         0.6154     0.6859        0.5706        0.6503   197
PLS + RF    0.6077     0.6615        0.5562        0.6498   170
PLS + LDA   0.6846     0.6744        0.6887        0.6826   305
PLS + QDA   0.6462     0.7063        0.5799        0.6871   78
PCA + QDA   0.4692     0.3127        0.6645        0.5401   92
Ensemble    0.6385     0.6563        0.6227        0.7108   —
Proteomics data
To assess the predictive power of proteomic patterns in screening for all stages of ovarian cancer, [9] carried out a case-control SELDI (surface-enhanced laser desorption and ionization time-of-flight) study with 100 cases and 100 controls. Each spectrum was composed of 15200 intensities corresponding to m/z values in the range 0 to 20000. The scientific findings of this paper were subsequently questioned by other researchers [10, 11], who argued that the discriminatory signals in this dataset may not be biological in nature. However, our use of this dataset as an illustrative example of the comparative classification ability of our ensemble classifier is still valid.
For this illustration, we applied five classification algorithms to these highdimensional data and our proposed ensemble classifier with the number of bootstrap samples equal to 101. Once again, the internal optimization of the ensemble classifier was performed with respect to accuracy, sensitivity and specificity.
Proteomics ovarian cancer data

Method      Accuracy   Sensitivity   Specificity   AUC
RF          0.9550     0.9639        0.9520        0.9924
SVM         0.9350     0.9021        0.9731        0.9795
PLS + RF    0.9050     0.9040        0.9029        0.9703
PLS + LDA   0.9600     0.9639        0.9624        0.9784
PLS + QDA   0.9550     0.9539        0.9648        0.9781
Ensemble    0.9650     0.9639        0.9711        0.9871
Conclusions and Discussion
For complex high-dimensional datasets resulting from present day high-throughput experiments, it may be wise to consider a number of reputable classification algorithms combined with dimension reduction techniques rather than a single standard algorithm. The proposed classification strategy borrows elements from bagging and rank aggregation to create an ensemble classifier optimized with respect to several objective performance functions. The ensemble classifier is capable of adaptively adjusting its performance depending on the data, reaching the performance level of the best performing individual classifier without explicitly knowing which one it is.
For a number of different data that we considered here, the best performing method according to any particular performance measure changes from one dataset to another. In some cases, if the three performance measures are considered (accuracy, sensitivity and specificity), it is not even clear what the best algorithm is. In such cases, the ensemble method appears to be optimized with respect to all three measures which can be concluded from it having the largest (or very close to the largest) AUC scores.
The biggest drawback of the proposed ensemble classifier is the computational time it takes to fit M classification algorithms on N bootstrap samples. In addition, rank aggregation may also take considerable time if M is large. We have implemented the procedure in R, using available classification routines to build the ensemble classifier. On a workstation with an AMD Athlon 64 X2 4000+ Dual Core processor and 4 GB of memory, it takes about five hours to run the ensemble classifier with 10-fold cross-validation on the breast cancer microarray data. For the slightly larger proteomics example, 101 bootstrap samples with 5-fold external cross-validation take approximately 17 hours to complete, mainly due to the size of the dataset (15200 covariates), for which even individual classifiers take considerable time to build their models (in particular RF). Computing variable importance is also very computationally intensive but is not essential for building an ensemble classifier. It should be noted that it is relatively easy to parallelize the ensemble classifier, which would reduce the computing times dramatically if run on a grid or cluster. If a cluster is not available and one is dealing with high-dimensional data, feature selection is commonly performed prior to running the ensemble classifier to reduce the dimensionality of the data to a more manageable size. As with any classification algorithm, feature selection should be done with great caution. If any cross-validation procedure is implemented, feature selection should be performed separately for every training set to avoid over-optimistic accuracy estimates [12]. In the simulation examples, the greedy algorithm performs somewhat worse than the proposed ensemble classifier, which is why it was not considered further for the real-data illustrations; not surprisingly, it still demonstrates good overall performance.
Generally speaking, the greedy algorithm also takes less time to execute because it is based on k-fold cross-validation, where k is relatively small (usually between 5 and 10), instead of computationally intensive bootstrap sampling, where N is usually much larger. Also, the greedy algorithm performs a single rank aggregation, while the proposed ensemble classifier performs N of them, one for each bootstrap sample. For a small number of individual classification algorithms, M ≤ 10 or so, this does not place a substantial additional computational burden on the ensemble classifier. If one is willing to sacrifice on the number of bootstrap samples N, the running times of the two algorithms are not too different.
For illustration purposes, we used some common classification algorithms and dimension reduction techniques in this paper. Obviously, many other individual classifiers and dimension reduction techniques could be incorporated into the ensemble. For example, one could select features based on the importance scores returned by Random Forests to reduce the dimension of the data [13, 14] and follow that with any classification algorithm. Also, performance measures are not limited to the commonly used accuracy, sensitivity and specificity. Moving beyond a binary classification problem, sensitivity and specificity can easily be replaced by class-specific accuracies. Still other performance measures are available which are functions of the class assignment probabilities, for example the Brier score [15] and the kappa statistic [16]. It is beyond the scope of this paper to discuss, or make specific recommendations about, which component classification algorithms should be included in the ensemble and how tuning parameters for the individual classifiers should be selected and set. A few more illustrations of our ensemble classifier are available on the supplementary website at http://www.somnathdatta.org/Supp/ensemble/.
Following the standard bagging principle, we have used simple random sampling to generate our bootstrap samples. Note that a given bootstrap sample may not include all the classes, and thus prediction using such samples will also be limited to the classes present. As pointed out by one of the reviewers, this may appear problematic, especially in situations where one or more of the classes are rare in the overall population. Since a large number of bootstrap samples is taken, the principle of unbiasedness still applies to the overall aggregation; nevertheless, this may lead to inefficiencies. More efficient alternative sampling strategies (e.g., sampling separately from each class to match the original training data, or non-uniform probability sampling related to class prevalence) can be considered in such situations. Subsequent aggregation should then be done through appropriate re-weighting of the individual predictions. A detailed investigation of such alternative resampling strategies is beyond the scope of this paper and will be explored elsewhere.
Methods
Construction of an adaptive ensemble classifier
The goal of any classification problem is to train classifiers on the training data X_{(n × p)}, with known class labels y = {y_{1}, ..., y_{n}}, in order to accurately predict the class labels $\hat{y}=\{\hat{y}_{1},\ldots,\hat{y}_{r}\}$ of the new testing data ${X}_{(r\times p)}^{\ast}$. Here, n and r are the numbers of samples in the training and testing data, respectively, and p is the number of predictors (features). Suppose one considers M classification algorithms, A_{1}, ..., A_{M}, with true, but unknown, classification error rates e_{1}, ..., e_{M}. By drawing random bootstrap samples [17] from the training data {X_{(n × p)}, y_{(n × 1)}} and training each classifier on them, it is possible to build a number of "independent" models which can then be combined or averaged in some meaningful fashion. Majority voting is a common solution to model averaging, but more complex schemes have been proposed in the literature [18, 19, 20].
To build an ensemble classifier, we combine bootstrap aggregation (bagging) and rank aggregation in a single procedure. Bagging is one of the earliest model averaging approaches to classification. The idea behind bagging is that averaging models reduces variance and improves the accuracy of "weak" classifiers, defined as classifiers whose final predictions change drastically with small changes to the training data. In bagging, we repeatedly sample from the training set using simple random sampling with replacement. For each bootstrap sample, a single "weak" classifier is trained. These classifiers are then used to predict class labels on testing data, and the class that obtains the majority of the votes wins.
We adopt the same strategy for building our adaptive ensemble classifier, with the exception that we train several (M) classifiers on each bootstrap sample. The classifier with the best performance on the OOB samples is kept and used for prediction on the testing data. The second major difference lies in the fact that we do not seek to improve upon the accuracies of the individual classifiers. The "strong" classifiers that we use are quite difficult to improve upon; the goal here is instead to create an ensemble classifier whose performance is very close to that of the best performing individual classifier, which is not known a priori. Our procedure is adaptive in the sense that it dynamically adjusts its performance to reflect the performance of the best individual method for any given classification problem.
Confusion matrix

                       True Class 1   True Class 0   Total
Predicted Class 1      a              b              a + b
Predicted Class 0      c              d              c + d
Total                  a + c          b + d          a + b + c + d
In many classification settings, particularly in medical applications, the overall prediction accuracy may not be the most important performance measure. Depending on the condition or treatment, one type of misclassification can be much more undesirable than the other. For binary prediction problems, large sensitivity and/or specificity rates are sometimes highly sought after in addition to the overall accuracy. Thus, under many circumstances it is important to consider several performance measures simultaneously. Explicit multiobjective optimization is very attractive, and the construction of a classifier with optimal performance according to all performance measures, perhaps weighted according to their degree of importance, is very desirable.
It is straightforward to determine which classification algorithm performs the best if a single performance measure is considered. For example, if overall accuracy is the only measure under consideration, a classifier with the largest accuracy on OOB samples will be kept. However, if several measures are of interest, determining which classifier to keep becomes a challenging problem in itself, since now we are interested in a classifier whose performance is optimized with respect to all performance measures.
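For the binary case, the three measures being optimized follow directly from the confusion-matrix cells a, b, c, d above. A small helper makes the definitions concrete (the cell counts in the usage line are made up):

```python
def performance_measures(a, b, c, d):
    """Accuracy, sensitivity and specificity from the confusion-matrix
    cells: a = true positives, b = false positives, c = false negatives,
    d = true negatives (Class 1 taken as 'positive')."""
    total = a + b + c + d
    return {
        "accuracy":    (a + d) / total,   # all correct calls
        "sensitivity": a / (a + c),       # true-positive rate
        "specificity": d / (b + d),       # true-negative rate
    }

# Hypothetical OOB counts for one bootstrap sample:
m = performance_measures(a=40, b=10, c=5, d=45)
```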
The best performing algorithm is determined by minimizing the weighted rank aggregation objective function

$\Phi(\delta) = \sum_{i=1}^{K} w_i \, d(\delta, L_i),$

where δ is any valid ordered list of the M classification algorithms, d is a distance function that measures the "closeness" between two ordered lists, and w_{i} is a weight factor associated with each performance measure. The two most common distance functions used in the literature are the Spearman footrule distance and Kendall's tau distance [22].
Here, the minimization of Φ can be carried out by brute force if M is relatively small (< 8). For larger optimization problems, many combinatorial optimization algorithms could be adapted; we use the Cross-Entropy [23] and/or Genetic [24] algorithms, which are described in the context of rank aggregation in [25]. The weights w_{i} play an important role in the aggregation, allowing for greater flexibility. If highly sensitive classification is needed, more weight can be put on sensitivity, and algorithms with higher sensitivity will be ranked higher by the aggregation scheme.
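For small M, the brute-force minimization of Φ over all orderings is feasible. A sketch using the Spearman footrule distance; the classifier names and rankings below are invented for illustration:

```python
from itertools import permutations

def spearman_footrule(delta, L):
    """Spearman footrule distance between two orderings of the same
    items: the sum of absolute differences in rank positions."""
    pos_delta = {alg: i for i, alg in enumerate(delta)}
    pos_L = {alg: i for i, alg in enumerate(L)}
    return sum(abs(pos_delta[a] - pos_L[a]) for a in delta)

def aggregate(lists, weights):
    """Brute-force minimization of Phi(delta) = sum_i w_i d(delta, L_i)
    over all M! candidate orderings; feasible only for small M."""
    return list(min(
        permutations(lists[0]),
        key=lambda delta: sum(w * spearman_footrule(delta, L)
                              for w, L in zip(weights, lists))))

# Invented rankings of three classifiers under accuracy, sensitivity
# and specificity, equally weighted:
ranks = [["PLS+RF", "RF", "SVM"],    # by accuracy
         ["PLS+RF", "SVM", "RF"],    # by sensitivity
         ["RF", "PLS+RF", "SVM"]]    # by specificity
best_order = aggregate(ranks, weights=[1.0, 1.0, 1.0])
locally_best = best_order[0]
```

The first element of the aggregated ordering plays the role of ${A}_{(1)}^{j}$ for a given bootstrap sample.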
1. Initialization. Set N, the number of bootstrap samples to draw, and let j = 1. Select the M classification algorithms along with the K performance measures to be optimized.

2. Sampling. Draw the j-th bootstrap sample of size n from the training samples using simple random sampling with replacement, obtaining $\{{X}_{j}^{\ast},{y}_{j}^{\ast}\}$. Sampling is repeated until samples from all classes are present in the bootstrap sample. Note that some samples will appear more than once, while others will be left out of the bootstrap sample; those left out are called out-of-bag (OOB) samples.

3. Classification. Using the j-th bootstrap sample, train the M classifiers.

4. Performance assessment. The M models fitted in the Classification step are used to predict class labels for the OOB cases that were not included in the j-th bootstrap sample, $\{{X}_{j}^{oob},{y}_{j}^{oob}\}$. Since the true class labels are known, we can compute the K performance measures. Each performance measure ranks the classification algorithms according to their performance under that measure, producing K ordered lists of size M, L_{1}, ..., L_{K}.

5. Rank aggregation. The ordered lists L_{1}, ..., L_{K} are aggregated using the weighted rank aggregation procedure, which determines the best performing classification algorithm ${A}_{(1)}^{j}$ for this bootstrap sample. Steps 2 through 5 (Sampling through Rank aggregation) are repeated N times.
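The five steps can be sketched as a training loop. This is an illustrative Python version (the paper's implementation is in R); `select_best` stands in for the weighted rank aggregation step, and the demo classifiers and measure are our own choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(X, y, classifiers, measures, select_best, N=25, rng=None):
    """Steps 1-5 sketched: for each of N bootstrap samples, fit all M
    classifiers, score them on the out-of-bag cases with the K measures,
    and keep the locally best model."""
    rng = np.random.default_rng(rng)
    n = len(y)
    kept = []
    for _ in range(N):
        # Step 2: resample until every class appears in the bootstrap sample
        while True:
            idx = rng.integers(0, n, size=n)
            if set(y[idx]) == set(y):
                break
        oob = np.setdiff1d(np.arange(n), idx)
        # Step 3: train the M classifiers on the bootstrap sample
        fitted = [make_clf().fit(X[idx], y[idx]) for make_clf in classifiers]
        # Step 4: assess each model on the OOB cases
        scores = [[m(y[oob], f.predict(X[oob])) for m in measures]
                  for f in fitted]
        # Step 5: keep the locally best model (rank aggregation stand-in)
        kept.append(fitted[select_best(scores)])
    return kept

# Demo with two component classifiers and accuracy as the only measure:
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 5))
y = (X[:, 0] > 0).astype(int)
models = train_ensemble(
    X, y,
    classifiers=[LogisticRegression, DecisionTreeClassifier],
    measures=[lambda yt, yp: float(np.mean(yt == yp))],
    select_best=lambda s: int(np.argmax([np.mean(row) for row in s])),
    N=10, rng=1)
```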
In essence, bagging takes the form of a nested cross-validation in our procedure, which is used to select the best performing algorithm for each bootstrap sample. An outer cross-validation can be added to estimate performance rates, and we use a k-fold cross-validation scheme for that purpose (see the breast cancer microarray data results).
To predict new cases, the ensemble algorithm runs them through the N fitted models. These will likely be of different types, unlike classification trees in bagging, since different classification algorithms will exhibit the "best" local performance. Each model casts a "vote" as to which class a particular sample belongs to. The final prediction is based on the majority vote and the class label with the most votes wins. A more detailed description of the prediction algorithm is given below.

Individual predictions. Use the N "best" individual models, ${A}_{(1)}^{1},\ldots,{A}_{(1)}^{N}$, built on the training data from each bootstrap sample, to make N class predictions for each new sample. Given a new sample x_{(p × 1)}, let $\hat{y}_{1},\ldots,\hat{y}_{N}$ denote the N class predictions from the N individual classifiers.

Majority voting. The final classification is the most frequent class among the N predicted class labels, defined as

$\arg\max_{c}\sum_{i=1}^{N}I(\hat{y}_{i}=c),$

where N is the number of bootstrap samples and c is one of the class labels.

Class probabilities. The probability of belonging to a particular class c is estimated by the proportion of votes for that class:

$P(C=c \mid X=x)=\frac{1}{N}\sum_{i=1}^{N}I(\hat{y}_{i}=c).$
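The prediction rules above amount to tallying the N votes. A minimal sketch; the stub models in the demo are placeholders for the N fitted locally best classifiers:

```python
import numpy as np
from collections import Counter

def ensemble_predict(models, X_new):
    """Majority vote over the N locally best models, plus the vote
    proportion as the class-probability estimate (the formulas above)."""
    votes = np.array([m.predict(X_new) for m in models])   # shape (N, r)
    labels, probs = [], []
    for j in range(votes.shape[1]):
        counts = Counter(votes[:, j])
        label, n_votes = counts.most_common(1)[0]
        labels.append(label)
        probs.append(n_votes / votes.shape[0])
    return labels, probs

# Stub models standing in for the N fitted "best" classifiers:
class Stub:
    def __init__(self, out):
        self.out = out
    def predict(self, X):
        return np.repeat(self.out, len(X))

labels, probs = ensemble_predict([Stub(1), Stub(1), Stub(0)],
                                 np.zeros((4, 2)))
```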
Variable importance
Some classical classification algorithms allow for formal statistical inference about the contribution of each predictor to the classification. For high-dimensional data, assessing variable importance becomes a challenge, as most classical methodologies fail to cope with high dimensionality. Computationally intensive nonparametric methods based on permutations can come to the rescue in those situations. In Random Forests, Breiman introduced a permutation-based variable importance measure, which we adapt for our ensemble classifier [3].
This idea can be easily adapted to the ensemble classifier with the exception that instead of averaging across the N trees, we average the misclassification error across locally best performing algorithms as selected through the rank aggregation.
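A single-model version of the permutation idea can be sketched as follows; in the ensemble, the same computation would be averaged across the N locally best models. The toy threshold model below is our own illustration:

```python
import numpy as np

def permutation_importance(model, X, y, rng=None):
    """Permutation-based importance in the spirit of Breiman's Random
    Forest measure, sketched for a single fitted model: the drop in
    accuracy after randomly shuffling each predictor column in turn."""
    rng = np.random.default_rng(rng)
    base = np.mean(model.predict(X) == y)
    importance = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])    # break the link between feature j and y
        importance[j] = base - np.mean(model.predict(Xp) == y)
    return importance

# Toy model that uses only the first feature (a stand-in for a
# locally best classifier selected by rank aggregation):
class ThresholdModel:
    def predict(self, X):
        return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 3))
y = (X[:, 0] > 0).astype(int)
imp = permutation_importance(ThresholdModel(), X, y, rng=0)
```

Only the first feature matters to this toy model, so shuffling it degrades accuracy while shuffling the other columns leaves the predictions unchanged.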
An alternative greedy ensemble approach
1. Data management. Split the training data into k folds.

2. Classification. Using the i-th fold (i = 1, ..., k) for testing, train the M classifiers on the remaining k − 1 folds and compute the K performance measures for each individual classification algorithm.

3. Averaging. Average the performance scores across the k folds.

4. Rank aggregation. Using the weighted rank aggregation procedure, determine the "best" performing classification algorithm.
We implement the greedy ensemble to compare its performance to the proposed adaptive ensemble classifier. We expect the greedy ensemble to possibly overfit the training data and, therefore, have an inferior performance with the test data.
Some common classification algorithms used in our ensembles
Classification algorithms in the statistical and machine learning literatures provide researchers with a very broad set of tools for discriminatory analysis [26]. They range from fairly simple ones, such as the K-nearest neighbor classifier, to advanced and sophisticated ones, such as Support Vector Machines. Which classification algorithm should be used in any specific case depends strongly on the nature of the data under consideration, and performance is usually sensitive to the choice of tuning parameters. In the next several sections we briefly describe several of the most common classification algorithms, which are particularly popular in bioinformatics. These algorithms, in combination with dimension reduction techniques, are used as component classifiers for our ensemble classifier. Of course, in principle, the user could use any set of classifiers in constructing the ensemble.
Logistic regression and penalized logistic regression
Here, λ is the tuning parameter controlling how much penalty is applied, and J(β) is the penalty term, which usually takes one of two common forms: the ridge penalty, defined as $\sum_{i=1}^{p}\beta_{i}^{2}$, and the lasso penalty, defined as $\sum_{i=1}^{p}|\beta_{i}|$. Due to the penalty term, many of the estimated parameters are shrunk toward zero (and, with the lasso, set exactly to zero).
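The practical difference between the two penalties can be seen with scikit-learn's `LogisticRegression` (a sketch on synthetic data; the paper itself works in R, and the regularization strength C here is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.standard_normal((150, 50))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(150) > 0).astype(int)

# Ridge penalty J(beta) = sum(beta_i^2): coefficients shrunk toward zero.
ridge = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
# Lasso penalty J(beta) = sum(|beta_i|): many coefficients exactly zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
```

With only two informative features among fifty, the lasso fit zeroes out most coefficients while the ridge fit merely shrinks them.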
Linear and Quadratic Discriminant Analysis
which is a linear function in x, corresponding directly to LDA. When the covariance matrices differ between classes, i.e. Σ_{i} ≠ Σ_{j}, we obtain Quadratic Discriminant Analysis (QDA), for which the discriminant is a quadratic function in x. Both LDA and QDA have been used extensively in practice, with a fair share of success.
Support Vector Machines
The Support Vector Machine (SVM) is among the most significant recent developments in the field of discriminatory analysis [29]. In its essence it is a linear classifier (just like logistic regression and LDA), as it directly seeks a separating hyperplane between classes having the largest possible margin. The margin is defined here as the distance between the hyperplane and the closest sample point. Usually there are several points, called support vectors, that lie exactly one margin away from the hyperplane and on which the hyperplane is constructed. As stated, however, the SVM would be of little practical use, because most classification problems have no distinct separation between classes and, therefore, no such hyperplane exists. To overcome this problem, two extensions have been proposed in the literature: penalty-based methods and kernel methods.
where k, c, k_{1}, and k_{2} are parameters that need to be specified. SVMs enjoy an advantage in flexibility over most other linear classifiers. The boundaries are linear in a transformed high-dimensional space, but on the original scale they are usually nonlinear, which gives the SVM its flexibility whenever required.
Random Forests
Classification trees are particularly popular among medical researchers due to their interpretability. Given a new sample, it is very easy to classify it by going down the tree until one reaches the terminal node, which carries the class assignment. Random Forests [3] take classification trees one step further by building not a single tree but multiple trees, each on a different bootstrap sample (drawn with replacement); in addition, only a random subset of the predictors is considered at each split, which decorrelates the trees. A new sample is classified by running it through each tree in the forest, yielding as many classifications as there are trees. These are then aggregated through a majority voting scheme and a single classification is returned. The idea of bagging, or averaging multiple classification results, as applied in this context greatly improves the accuracy of unstable individual classification trees.
One of the interesting elements of Random Forests is the ability to compute unbiased estimates of misclassification rates on the fly, without explicitly resorting to test data after building the classifier. The samples left out of the bootstrap sample when building a tree are known as the out-of-bag (OOB) data; RF runs the OOB data through the newly constructed tree and calculates an error estimate. These estimates are later averaged over all trees to obtain a single misclassification error estimate. This combination of bagging and the bootstrap is sometimes called .632 cross-validation, because the proportion of distinct samples entering each bootstrap sample (roughly 2/3) is really 1 − 1/e, which is approximately .632. This form of cross-validation is arguably very efficient in the way it uses the available data.
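The OOB mechanism described above is exposed directly by most Random Forest implementations; for instance, in scikit-learn (an illustration on simulated data, not the R code accompanying this paper) a single flag suffices:

```python
# Random Forest with the out-of-bag (OOB) error estimate computed
# during fitting -- no separate test set is required.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # signal in two features

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)
# Each sample is scored only by the trees that did NOT see it in
# their bootstrap sample, so this is an honest accuracy estimate.
print("OOB accuracy estimate:", rf.oob_score_)
```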
Some commonly used dimension reduction techniques
For high-dimensional data, such as microarrays, where the number of samples is much smaller than the number of predictors (features), most classical statistical methodologies require a preprocessing step in which the dimensionality of the data is reduced. Principal Component Analysis (PCA) [30] and Partial Least Squares (PLS) [31] are two of the most popular methods for data dimension reduction. Of course, other more sophisticated dimension reduction techniques can be used as well. We use PCA and PLS in combination with logistic regression, LDA, QDA and Random Forests as illustrative examples.
Both PCA and PLS effectively reduce the number of dimensions while preserving the structure of the data, but they differ in the way they construct their latent variables. The PCA selects the directions of its principal components along the axes of greatest variability in the data; it is based on the eigenvalue decomposition of the observed covariance matrix.
The PLS maximizes the covariance between the dependent and independent variables, trying to explain as much variability as possible in both. The very fact that it considers the dependent variable when constructing its latent components usually makes it a better dimension reduction technique than the PCA for classification problems.
Availability
R code and additional examples are available through the supplementary website at http://www.somnathdatta.org/Supp/ensemble.
Acknowledgements
This research was supported in part by grants from the National Science Foundation (DMS-0706965 to So D and DMS-0805559 to Su D) and the National Institutes of Health (NCI/NIH CA133844 and NIEHS/NIH 1P30ES014443 to Su D). We gratefully acknowledge receiving a number of constructive comments from the anonymous reviewers, which led to an improved manuscript.
Supplementary material
References
 1. Breiman L: Bagging predictors. Machine Learning 1996, 24: 123–140.
 2. Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997, 55: 119–139. 10.1006/jcss.1997.1504
 3. Breiman L: Random Forests. Machine Learning 2001, 45: 5–32. 10.1023/A:1010933404324
 4. Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT '95: Proceedings of the Second European Conference on Computational Learning Theory. London, UK: Springer-Verlag; 1995:23–37.
 5. Hand D: Breast cancer diagnosis from proteomic mass spectrometry data: a comparative evaluation. Statistical Applications in Genetics and Molecular Biology 2008, 7(15).
 6. Pihur V, Datta S, Datta S: Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach. Bioinformatics 2007, 23(13):1607–1615. 10.1093/bioinformatics/btm158
 7. Breiman L: Bias, Variance, and Arcing Classifiers. Technical Report 460, Statistics Department, University of California; 1996.
 8. Aaroe J, Lindahl T, Dumeaux V, Sebo S, et al.: Gene expression profiling of peripheral blood cells for early detection of breast cancer. Breast Cancer Res 2010, 12: R7. 10.1186/bcr2472
 9. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359(9306):572–577. 10.1016/S0140-6736(02)07746-2
 10. Sorace JM, Zhan M: A data review and reassessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics 2003, 4: 24. 10.1186/1471-2105-4-24
 11. Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 2004, 20(5):777–785. 10.1093/bioinformatics/btg484
 12. Simon R: Roadmap for Developing and Validating Therapeutically Relevant Genomic Classifiers. J Clin Oncol 2005, 23(29):7332–7341. 10.1200/JCO.2005.02.8712
 13. Datta S: Classification of breast cancer versus normal samples from mass spectrometry profiles using linear discriminant analysis of important features selected by Random Forest. Statistical Applications in Genetics and Molecular Biology 2008, 7(2): Article 7. 10.2202/1544-6115.1345
 14. Datta S, de Padilla L: Feature selection and machine learning with mass spectrometry data for distinguishing cancer and non-cancer samples. Statistical Methodology 2006, 3: 79–92. 10.1016/j.stamet.2005.09.006
 15. Brier GW: Verification of forecasts expressed in terms of probabilities. Monthly Weather Review 1950, 78: 1–3. 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
 16. Cohen J: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960, 20: 37–46. 10.1177/001316446002000104
 17. Efron B, Gong G: A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation. The American Statistician 1983, 37: 36–48. 10.2307/2685844
 18. LeBlanc M, Tibshirani R: Combining estimates in regression and classification. Journal of the American Statistical Association 1996, 91(436):1641–1650. 10.2307/2291591
 19. Yang Y: Adaptive regression by mixing. Journal of the American Statistical Association 2001, 96(454):574–588. 10.1198/016214501753168262
 20. Merz C: Using correspondence analysis to combine classifiers. Machine Learning 1999, 36(1–2):33–58. 10.1023/A:1007559205422
 21. Zweig MH, Campbell G: Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry 1993, 39(4):561–577.
 22. Fagin R, Kumar R, Sivakumar D: Comparing top k lists. SIAM Journal on Discrete Mathematics 2003, 17: 134–160. 10.1137/S0895480102412856
 23. Rubinstein R: The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability 1999, 2: 127–190. 10.1023/A:1010091220143
 24. Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley; 1989.
 25. Pihur V, Datta S, Datta S: RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics 2009, 10(62).
 26. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. New York: Springer-Verlag; 2001.
 27. Agresti A: Categorical Data Analysis. New York: Wiley-Interscience; 2002.
 28. Fisher R: The use of multiple measurements in taxonomic problems. Annals of Eugenics 1936, 7(2):179–188.
 29. Vapnik V: Statistical Learning Theory. New York: Wiley; 1998.
 30. Pearson K: On lines and planes of closest fit to systems of points in space. Philosophical Magazine 1901, 2(6):559–572.
 31. Wold S, Martens H: The multivariate calibration problem in chemistry solved by the PLS method. In Lecture Notes in Mathematics: Matrix Pencils. Edited by: Wold H, Ruhe A, Kågström B. Heidelberg: Springer-Verlag; 1983:286–293.
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.