Background

For more than a decade, microarray technology has provided enormous momentum to modern genomic research. The ability to quantify the expression of thousands of genes simultaneously has led to remarkable achievements in a wide range of biological studies. An abundance of microarray assays has been published worldwide in various databases. However, microarray technology has some limitations, such as the accuracy of expression measurements being limited by hybridization levels and by the variable hybridization properties of probes [1]. RNASeq is a next-generation sequencing technology that has recently become popular due to some of its advantages over microarrays. In particular, RNASeq has a potential advantage in measuring absolute expression levels compared to the microarray technique [2, 3]. Since these two methods fundamentally differ in their underlying technologies, it is interesting to know whether this disparity results in inconsistent experimental outcomes. Classifiers are among the most widely used statistical tools in genomics-oriented biomedical studies; an example is identifying individuals at risk for a certain disease type, such as a cancer, based on their genetic profiles. In this work, we explore the concordance between the microarray and RNASeq genomic platforms in the context of classification, based on a set of comparative classification experiments carried out on these two platforms.

In recent years, a number of authors have discussed the agreement between scientific conclusions drawn from microarray and RNASeq platforms, based on comparative analyses. A common focus for these studies was the concordance of differentially expressed genes (DEGs). A previous study describing a large scale comparison of the microarray and RNASeq platforms based on The Cancer Genome Atlas (TCGA) reported a high correlation between the expression levels obtained from the two platforms and suggested a reasonable concordance between DEGs identified by comparing tumors with normal tissues [4]. Another study compared the two platforms using data from a colon cancer study and concluded that RNASeq had an advantage over microarrays for detecting DEGs [5]. A recent article provided a comprehensive assessment of the microarray and RNASeq methods, comparing DEGs using gene expressions obtained from a rat liver experiment [6]. Further, the authors described the concordance in terms of classification, assessing the predictability of the classes defined by the chemical mode of action (MOA) using a set of classifiers trained on the two genomic platforms. Their study revealed weak classification accuracies for a set of classifiers applied to these platforms.

Our work is based on the previously described rat liver data [6], where we primarily focus on developing a common classifier that works reasonably well across platforms, providing better predictability. Next, we discuss the concordance between the microarray and RNASeq platforms in various forms from the perspective of classification. Furthermore, we identify a set of genes important for specifying the classes given by MOAs, by focusing on their effects on classifier accuracy. We use seven standard classifiers and an adaptive ensemble classifier built around them to achieve these goals. This study is part of the 2015 annual conference challenges on Critical Assessment of Massive Data Analysis (CAMDA). The rat liver experiment was conducted by the FDA SEQC consortium to assess the performance of modern gene transcript expression profiling methods; it is a comparative analysis designed for developing predictive models of the chemical MOA.

The remainder of the article is organized as follows. In Section “Results”, we provide the results and conclusions of the study. Section “Methods” explains all the underlying procedures applied. The main body of the paper ends with a discussion in Section “Discussion”.

Results

Classification in individual platforms

We first describe the outcomes of Analysis 1, which was performed using two basic strategies: the adjusted and the originally given test sets described in Section “Methods”. We provide a detailed summary of these results in Tables 1, 2, 3 and 4, where each table presents the classifiers’ overall prediction accuracies, class specific sensitivities, and the corresponding specificities. Graphical representations of the summarized results are also provided in Figs. 1 and 2.

Fig. 1

Prediction accuracies of RNASeq vs microarray for two different test sets using the common gene set, for eight different classification techniques, with classifiers trained and making predictions on individual platforms

Fig. 2

Prediction accuracies of RNASeq vs microarray for two different test sets using the complete gene set, for eight different classification techniques, with classifiers trained and making predictions on individual platforms

Table 1 Accuracies of predicting MOAs in the adjusted test set, based on classifiers developed on the common gene expression sets profiled from the microarray and RNASeq platforms
Table 2 Accuracies of predicting MOAs in the originally given test set, based on classifiers developed on the common gene expression sets profiled from the microarray and RNASeq platforms
Table 3 Accuracies of predicting MOAs in the adjusted test set, based on classifiers developed on the complete gene expression sets profiled from the microarray and RNASeq platforms
Table 4 Accuracies of predicting MOAs in the originally given test set, based on classifiers developed on the complete gene expression sets profiled from the microarray and RNASeq platforms

We first discuss the classification results obtained using the set of genes represented on both platforms. For the adjusted test set, the left panel of Fig. 1 shows that the performance of each classifier is similar on both platforms, since all the data points lie fairly close to the diagonal line (Pearson’s r=0.92). The accuracies of the individual classifiers vary from 17 to 75%, and, as expected, the ensemble classifier performs best on both platforms. The overall accuracy of the optimal classification method is slightly better on microarray than on RNASeq (75% vs 67%). In particular, we observe a lower prediction accuracy for the class “PPARA" on RNASeq (56%) than on microarray (89%). Overall, the class “CAR/PXR", with a maximum sensitivity of only 56%, seems to be the MOA that is hardest to predict. Some individual classifiers show widely different prediction sensitivities for the same class on the two platforms; for example, the sensitivity for “PPARA" with RPART is 100% on microarray, but as low as 22% on RNASeq.

When the original (i.e., unadjusted) test set is used, we again observe matching performance of the classifiers on both platforms (Table 2), similar to the case of the adjusted test set; in fact, the agreement is even higher (Pearson’s r=0.94), as shown in the right panel of Fig. 1. The overall accuracy ranges from 12 to 60%, indicating a drop in classification performance compared to the previous scenario; for example, 75% vs 50% on microarray and 67% vs 50% on RNASeq for the ensemble classifier. Comparing Tables 1 and 2, we also notice a decline in the sensitivities for predicting the three known classes “PPARA", “CAR/PXR", and “Control". Since this analysis was carried out using the alternative approach described in Section “Methods”, such a decline could result from classifying several samples belonging to these known classes as “OTHER", by depressing the “true" class probability below 0.5 when these classes are somewhat close to one another. In this case, a few individual classifiers, such as SVM and RF, outperform the ensemble classifier in terms of overall accuracy. Nevertheless, the ensemble classifier remains the best overall with regard to all performance measures.

Even with the complete sets of genes, we observe a similar conformity of classifier performance between the two platforms (Fig. 2). In particular, for the ensemble classifier, the overall accuracy is identical on the two platforms in each case. According to Tables 3 and 4, the overall accuracy ranges from 8 to 67% for the adjusted test set and from 10 to 55% for the original test set. Even though we used larger gene sets, there is no additional improvement in predicting MOAs; indeed, the performance gets worse, which is quite evident for the adjusted test set. However, some classifiers surprisingly maintain equal performance for both sets of genes; for example, RPART shows identical performance on the microarray platform under the larger and smaller sets of genes.

Classification in cross platforms

The results of the 2nd analysis, namely classification across platforms, are summarized in Table 5 and Fig. 3. We performed this study using only the common set of genes, since both platforms are involved together throughout the analysis. Compared to all the classifications discussed under Analysis 1, these results show even greater agreement between the prediction accuracies of classifiers trained on a larger training set on one platform and used to predict the larger test set on the other platform (Pearson’s r=0.99). Remarkably, the ensemble classifier provided 100% accurate predictions in both cases, despite the additional complexity introduced by 8 distinct classes. The component classifier PLS+LDA performed similarly to the ensemble classifier in both cases, also yielding 100% accurate class predictions. Apart from these two classifiers, SVM, RF, and PLS+RF also achieved substantially high prediction accuracies.

Fig. 3

Prediction accuracies of the RNASeq vs microarray test sets, for eight different classification techniques, with classifiers trained on one platform and making predictions on the other

Table 5 Accuracies of predicting MOAs in the whole datasets (including the testing and training sets) of the RNASeq and microarray platforms, using classifiers trained on the opposite platform

Exploring the outcomes of Analyses 1 and 2 (Tables 1, 2, 3, 4 and 5), we clearly notice that, of the two dimension reduction methods, PLS performs far better than PCA throughout this study. The performance of the classifiers integrated with PCA is clearly the weakest among all individual classifiers in every scenario.

Importance of genes

We summarize the results of the 3rd analysis in Tables 6, 7, 8 and 9, where each table lists the top 20 important gene names and the overall accuracy obtained by cross-validation. As described in the Methods section, this analysis was performed in two experiments: (i) using the adjusted test set and (ii) using the full dataset. Furthermore, we considered the common and complete sets of genes as additional sub-analyses within the above primary experiments.

Table 6 Genes ranked by importance (based on accuracy reduction), for microarray and RNASeq, using the adjusted test set with the common set of genes
Table 7 Analysis 3: genes ranked by importance, for microarray and RNASeq, using the adjusted test set with the complete sets of genes
Table 8 Genes ranked by importance (based on the measure R), for microarray and RNASeq, using the whole data including 8 varieties of MOAs with the common gene set
Table 9 Genes ranked by importance (based on the measure R), for microarray and RNASeq, using the whole data including 8 varieties of MOAs with the complete gene set

Referring to Table 6, we observe that five of the ten most important genes for classification (Cyp1a1, Fam111a, Ugt2b, Akr1b8, and Hbb) are common to the two platforms when the adjusted test set is used with the common set of genes. From a literature search we found that Cyp1a1 encodes a member of the cytochrome P450 superfamily of enzymes, which catalyze many reactions involved in drug metabolism [7]. Likewise, Ugt2b belongs to a large family of proteins capable of detoxifying a wide variety of both endogenous and exogenous substrates, such as biogenic amines, steroids, bile acids, phenolic compounds, and various other pharmacologically relevant compounds, including numerous carcinogens, toxic environmental pollutants, and prescription drugs [8]. Akr1b8 has been implicated in the pathogenesis of diabetic complications [9]. Mutations in Hbb have been implicated in a number of blood disorders [10], while mutations in Fam111a are strongly associated with type 2 Kenny-Caffey syndrome [11].

Table 7 presents the top 20 genes detected from the complete gene sets for the two platforms. We notice that 6 genes (Fam111a, Cyp1a1, Hbb, Aldh1a7, Psat1, and Obp3) for microarray and 5 genes (Fam111a, Hbb, Cyp1a1, Ugt2b, and Dhrs7) for RNASeq are in common with the top 20 of the previous analysis (Table 6).

Although the main goal of detecting important genes with the full data (Analysis 3.2) was to identify sets of genes with a considerable impact on classifying all eight MOAs, interestingly, the outcome of this study (Tables 8 and 9) reveals high average (unpermuted) prediction accuracies (close to 100%) for both platforms under 5-fold cross-validation. Tables 8 and 9 list the top genes ranked by the relative reduction of accuracy (R), for the common and complete gene sets, respectively. Clearly, no single gene makes a substantial contribution to the accuracy. However, we identified two genes (Cyp1a1 and Abcc3) that are present in the lists for both platforms when the common set of genes was used. In the same analysis performed with the complete sets of genes, we observe only one gene, Id1, that is important for both platforms. Abcc3 is a member of the superfamily of ATP-binding cassette (ABC) transporters, which is involved in multi-drug resistance [12]. The Id1 gene plays a crucial role in activating hepatic stellate cells (HSCs) in response to liver damage [13].

Methods

Ensemble classifier

Support Vector Machines (SVM), Random Forests (RF), Neural Networks (NN), and Linear and Quadratic Discriminant Analysis (LDA, QDA) are examples of standard techniques widely applied in classification problems. The performances of these classifiers are highly variable across problems, so no single standard classifier can be considered the best for all classification settings. In complex situations, such as classification of high dimensional genomic data, a more meaningful approach is to use an ensemble classifier, which combines many standard classification algorithms to develop an improved classifier. The ensemble classifier we use builds a number of individual models on randomly selected subsets of the data, which can then be combined or averaged in some meaningful fashion; majority voting is a popular choice. By allowing data-based utilization of a multitude of classification algorithms, such a classifier upholds consistent performance on various types of data and classification problems. In this work, we use the adaptive optimal ensemble classifier developed via bagging and rank aggregation [14]. In this approach, several user specified classifiers are trained on bootstrap samples drawn from the original data by simple random sampling. Since the sampling is done with replacement, some samples will be repeated multiple times while others will be left out of the bootstrap sample (the latter are known as out-of-bag (OOB) samples). Focusing on the prediction performance on the OOB samples, a best classifier is selected based on various performance measures. For example, in a binary classification problem, sensitivity, specificity, and the area under the Receiver Operating Characteristic (ROC) curve are legitimate performance measures. The method is equipped with rank aggregation [15, 16], which provides great flexibility in selecting the optimal classifier with respect to multiple performance measures. The predicted class for a given test sample is the highest voted class, as predicted by the set of “best" classifiers over all bootstrap resamples. Datta et al. [14] demonstrated the performance of the ensemble classifier in various numerical studies and real applications to gene expression data. Similar concepts have been developed in the context of regression [17].

The algorithm described below demonstrates the step-by-step procedure for developing an ensemble classifier [14]. Suppose the dataset of n samples with p-dimensional covariates is of the form \(\{\boldsymbol{X}_{n \times p}, \boldsymbol{Y}_{n \times 1}\}\), where X corresponds to the independent variables and Y represents the dependent categorical variable that specifies a class label. Assume the ensemble classifier is to be built from M classification algorithms based on K different performance measures, such as the overall accuracy, class specific sensitivities, etc., to optimize the predictive performance. We proceed as follows:

1. Resampling: Draw a bootstrap sample of size n, \(\left \{\boldsymbol {X}^{*}_{n \times p}, \boldsymbol {Y}^{*}_{n \times 1}\right \}\), from the original data \(\{\boldsymbol{X}_{n \times p}, \boldsymbol{Y}_{n \times 1}\}\) by resampling rows with simple random sampling. Repeat the sampling until samples from all classes are present in the bootstrap sample, then determine the corresponding OOB sample, which contains all samples left out of the bootstrap sample.

2. Classifier Training: Train the M classification algorithms, \(C_1,\ldots,C_M\), on the bootstrap sample.

3. Performance Assessment: Obtain M predicted class labels for each OOB case. Since the true classes of the OOB samples are known, calculate the K performance measures for each of the M algorithms using the corresponding predictions.

4. Rank Aggregation: Rank the M algorithms according to the K performance measures, giving K ordered lists \((L_1,\ldots,L_K)\) of size M. These lists are then rank-aggregated using weighted rank aggregation to determine the best overall algorithm \(C_{(1)}\).

Repeat the above procedure (steps 1–4) B times, where B is a large integer usually chosen according to the available computational capacity.

5. Prediction for a New Sample: Predict the class variable Y for a new sample X using the B prediction models \(C^{1}_{(1)},\ldots,C^{B}_{(1)}\) and take the highest voted class as the final class prediction \(\hat {Y}\). A minimal code sketch of this procedure follows the list.
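To make the algorithm concrete, the following is a minimal Python sketch of steps 1–5 using scikit-learn. It is an illustration under simplifying assumptions, not the authors' original implementation [14]: in particular, the mean rank used in step 4 is a simple stand-in for the weighted rank aggregation described in the next subsection, and `models` is whatever list of component classifiers the caller supplies.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, recall_score

def draw_bootstrap(y, rng):
    """Step 1: resample rows until all classes appear; return boot/OOB indices."""
    n = len(y)
    while True:
        boot = rng.choice(n, size=n, replace=True)
        oob = np.setdiff1d(np.arange(n), boot)
        if len(np.unique(y[boot])) == len(np.unique(y)) and oob.size:
            return boot, oob

def ensemble_predict(models, X, y, X_new, B=300, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(B):
        boot, oob = draw_bootstrap(y, rng)
        fitted = [clone(m).fit(X[boot], y[boot]) for m in models]    # step 2
        preds = [f.predict(X[oob]) for f in fitted]
        # Step 3: K measures per model: overall accuracy plus the
        # class-specific sensitivities, all computed on the OOB sample.
        scores = np.array([[accuracy_score(y[oob], p),
                            *recall_score(y[oob], p, average=None,
                                          zero_division=0)]
                           for p in preds])
        # Step 4: rank the models on each measure (columns), then pick the
        # best overall; mean rank stands in for weighted rank aggregation.
        ranks = (-scores).argsort(axis=0).argsort(axis=0)
        best = fitted[int(ranks.mean(axis=1).argmin())]
        votes.append(best.predict(X_new))                            # step 5
    votes = np.asarray(votes)
    # Majority vote over the B selected "best" classifiers.
    return np.array([max(set(col), key=list(col).count) for col in votes.T])
```

In our setting, `models` would hold the component classifiers listed under “Concordance experiments” (SVM, RF, PLS+LDA, and so on).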

Rank aggregation

Suppose the performances of M classifiers are evaluated on the basis of K performance measures. Assume we have ordered lists \(L_1,\ldots,L_K\), where the ith ordered list \(L_i\), \(i=1,\ldots,K\), ranks the M algorithms by their performance on the ith measure. The rank aggregation procedure [15, 16] produces a single ranked list of the M classifiers that minimizes the weighted sum of distances from all the individual lists, given by the following objective function,

$$ \Phi(L)=\sum_{i=1}^{K}w_{i}\,d(L,L_{i}), \qquad (1) $$

where L is any possible ordered list of the M classifiers and the \(w_i\) are weights representing the user specified importance of each of the K performance measures. The classifier in the first position of this aggregated list is the optimal classifier overall with respect to all the validation measures. The default choice is \(w_i=1\) for all i, which treats all validation measures as equally important in determining the optimal algorithm; throughout our analyses we have used \(w_i=1\). Here d is a distance function, such as Spearman’s footrule or Kendall’s tau, which measures the closeness between two ordered lists. In this work, we use Spearman’s footrule as the distance measure.
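For the small M used in this work, Eq. (1) can even be minimized by brute force. The sketch below enumerates all M! candidate orderings with Spearman's footrule distance; this is a didactic simplification, whereas [15, 16] use stochastic searches (e.g., cross-entropy) that scale to longer lists. The classifier names in the toy example are illustrative.

```python
from itertools import permutations

def spearman_footrule(a, b):
    """d(a, b): sum over items of |position in a - position in b|."""
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(a))

def rank_aggregate(lists, weights=None):
    """Ordered list L minimizing Phi(L) = sum_i w_i * d(L, L_i), as in Eq. (1)."""
    weights = weights or [1.0] * len(lists)      # default: w_i = 1 for all i
    return min(permutations(lists[0]),
               key=lambda L: sum(w * spearman_footrule(L, Li)
                                 for w, Li in zip(weights, lists)))

# Toy example: three performance measures ranking four classifiers.
L1 = ["SVM", "RF", "PLS+LDA", "RPART"]
L2 = ["PLS+LDA", "SVM", "RF", "RPART"]
L3 = ["SVM", "PLS+LDA", "RF", "RPART"]
print(rank_aggregate([L1, L2, L3])[0])   # classifier in the first position
```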

Often, for high dimensional data, standard classifiers are combined with dimension reduction, variable selection, or penalization techniques, such as Partial Least Squares (PLS), Principal Component Analysis (PCA), Random Forest (RF) based importance measures, and \(L_1\) regularization, for greater applicability and improved prediction accuracy [18, 19]. For genomic data characterized by high dimensionality, an ensemble classifier developed on such a set of improved component classifiers is an ideal choice.
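As an illustration of such a combination, a PLS+LDA component classifier of the kind used in our analyses can be written as a small scikit-learn estimator. This is a sketch of the general idea rather than our exact implementation; the one-hot response driving the PLS step is one common PLS-DA convention.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import LabelBinarizer

class PLSLDA(BaseEstimator, ClassifierMixin):
    """Dimension reduction by supervised PLS, then classification by LDA."""
    def __init__(self, n_components=5):
        self.n_components = n_components

    def fit(self, X, y):
        self.lb_ = LabelBinarizer()
        Y = self.lb_.fit_transform(y)        # one-hot response drives the PLS
        self.pls_ = PLSRegression(n_components=self.n_components).fit(X, Y)
        self.lda_ = LinearDiscriminantAnalysis().fit(self.pls_.transform(X), y)
        return self

    def predict(self, X):
        return self.lda_.predict(self.pls_.transform(X))

    def predict_proba(self, X):
        return self.lda_.predict_proba(self.pls_.transform(X))
```

A PCA+LDA analogue simply replaces PLSRegression with an (unsupervised) PCA; the choice of the number of components is discussed under “Concordance experiments”.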

Rat liver data

Our data for this study were released for the 2015 CAMDA competition. The microarray and RNASeq platforms contain gene expression measurements for nearly 31,000 and 46,000 genes, respectively. The dataset consists of gene expression responses profiled by Affymetrix microarrays and an Illumina RNASeq sequencer in liver tissues from 105 male Sprague-Dawley rats exposed to 27 different chemicals representing 9 different MOAs. In the original experiment, a training set was formed with 45 rats treated with 15 chemicals corresponding to the MOAs “PPARA", “CAR/PXR", “AhR", “Cytotoxic", and “DNA damage", together with 18 controls. The test set contains data on 36 rats treated with 12 chemicals corresponding to “PPARA", “CAR/PXR", “ER", and “HMGCOA", together with 6 controls. We found that two MOAs, “ER" and “HMGCOA", are present only in the test set. We further noticed that, on average, approximately 22,253 expression values per sample in the RNASeq data were recorded as “NA", indicating that too few reads mapped onto the gene to provide a reliable gene expression estimate. We retained gene sets of sizes 13,686 and 16,133 for the microarray and RNASeq platforms, respectively, after (i) removing unnamed genes, (ii) removing genes with unobserved expressions, and (iii) averaging multiple expression values reported for genes with identical names.
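In pandas-style code, the three cleaning steps might look as follows; `expr` is a hypothetical probes-by-samples data frame with a `gene` name column, not the actual CAMDA file layout.

```python
import pandas as pd

def clean_expression(expr: pd.DataFrame) -> pd.DataFrame:
    """Rows are probes/transcripts; a 'gene' column plus one column per sample."""
    expr = expr.dropna(subset=["gene"])   # (i) drop unnamed genes
    expr = expr.dropna()                  # (ii) drop genes with "NA" expressions
    return expr.groupby("gene").mean()    # (iii) average duplicate gene names
```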

In this work, we used microarray expression levels normalized with the Robust Multi-Array Average (RMA) method [20], whereas the RNASeq data were already normalized via Magic normalization [6, 21]. For a comparative study, we decided it would be reasonable to perform separate analyses with the common set of genes (8336) represented on both platforms and with the complete gene sets.

Concordance experiments

We conducted three types of investigations to study the performance of the proposed classifiers.

1. Train classifiers and make predictions on individual platforms.

2. Train classifiers on one platform to make predictions on the other platform.

3. Identify important variables (genes) for accurate classification.

In the 1st analysis, we explore the predictability of MOAs using various classifiers developed on the given training data. To our knowledge, there are no established criteria for predicting an unknown class that was not represented in the training data. Thus, we selected an adjusted test set by eliminating all test samples belonging to the two classes “ER" and “HMGCOA"; this new test set was used in parts of the 1st and 3rd analyses. However, we also considered the originally given test set as part of the 1st analysis by adopting the following alternative classification approach. First, we relabeled the “ER" and “HMGCOA" samples in the original test set as “OTHER". Then, for each classifier, we determined the maximum class probability for a given test sample; if that probability was less than 0.5, we set the predicted class to “OTHER", otherwise we kept the originally predicted class. For this purpose, class probabilities for the ensemble classifier were calculated as the predicted class proportions observed over the B bootstrap samples. This rule is sketched in code below.
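The decision rule amounts to a simple threshold on the maximum class probability. A minimal sketch, where for the ensemble classifier `proba` would hold the class proportions over the B bootstrap votes:

```python
import numpy as np

def predict_with_other(classes, proba, threshold=0.5):
    """classes: label array aligned with the columns of proba (n_samples x n_classes)."""
    pred = classes[proba.argmax(axis=1)].astype(object)
    pred[proba.max(axis=1) < threshold] = "OTHER"   # low confidence -> unknown class
    return pred
```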

Our objective in the 2nd analysis was to examine the inter-platform concordance between the microarray and RNASeq platforms. Thus, we trained classifiers on a selected platform using the full dataset, including both the given training and test sets, to make predictions on the other platform. Since the classifier needed to operate on both platforms in this analysis, each gene expression measurement was standardized prior to the analysis, separately for each platform.
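The standardization is an ordinary per-gene z-score computed within each platform; a minimal sketch, applied separately to the microarray and RNASeq matrices:

```python
import numpy as np

def standardize_genes(X):
    """X: samples x genes matrix for one platform; returns column-wise z-scores."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)   # guard against constant genes
    return (X - mu) / sd
```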

For analyses 1 and 2, we selected an ensemble classifier developed with a set of M=7 standard classifiers: SVM, RF, LDA, PLS+RF, PLS+LDA, PCA+RF, PCA+LDA, and Recursive Partitioning (RPART). These classifiers were selected primarily on the basis of prior knowledge of their suitability for high dimensional data classification. Based on the accuracies of the predicted classes, each classifier was ranked on each of the K performance measures (for example, overall accuracy, class specific accuracies, etc.). Since the selection of performance measures for a multi-class classification problem depends strongly on the aim of the study, for the 1st analysis we optimized the overall prediction accuracy and the class specific accuracy of each group, treating these performance measures as equally important (i.e., we used equal weights of \(w_i=1\) in Eq. (1)). In the 2nd analysis, on cross platforms, we focused only on the overall accuracy, without optimizing multiple group specific performances. For these analyses, we chose B=300. We performed a 10-fold cross-validation for each individual classifier to select the number of components for the PLS and PCA methods, separately for the two platforms. Assuming performance on bootstrap samples consistent with that on the original training data, we employed the same number of components when developing the ensemble classifier; a sketch of this selection step follows.
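The component-number selection can be sketched as a grid search; this is hypothetical usage, assuming the illustrative `PLSLDA` estimator from the “Rank aggregation” subsection and platform-specific training data `X_train`, `y_train`.

```python
from sklearn.model_selection import GridSearchCV

# Pick the number of PLS components for one platform by 10-fold CV;
# the selected value is then reused inside the bootstrap ensemble.
search = GridSearchCV(PLSLDA(), {"n_components": list(range(2, 21))},
                      cv=10, scoring="accuracy")
search.fit(X_train, y_train)
best_k = search.best_params_["n_components"]
```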

The 3rd analysis, on identifying important variables, is subdivided into the following two parts.

1. Detecting important genes with the adjusted test set.

2. Detecting important genes with the full data using cross-validation.

To quantify the impact of a given gene on the predictability of MOAs in a test set, we applied a classifier to the perturbed training data obtained by randomly permuting that gene’s expressions. Each gene was then ranked by the magnitude of the accuracy reduction relative to the true accuracy (on unpermuted data), such that rank 1 corresponds to the gene with the highest negative impact on the overall prediction accuracy. To reduce the computational burden, we did not use the ensemble classifier for this purpose; instead we used the component classifier PLS+LDA, whose overall accuracy was close to that of the ensemble classifier. We performed this analysis separately for both platforms to determine the common set of genes appearing among the top 20 genes in both platforms.

For Analysis 3.1, we randomly permuted a gene’s expressions in the training set and then made predictions for the adjusted test set using the classifier trained on the permuted training data. The permutation procedure was repeated l times for each gene to calculate an average overall prediction accuracy (A). Finally, genes were ordered by A in ascending order. We chose l=30 to achieve a reasonably stable approximation while keeping the computational cost in check; a code sketch follows.
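Analysis 3.1 is, in essence, a test-set permutation importance computed gene by gene. A sketch for a single gene, with the PLS+LDA model as `model` and l=30 permutations as in the text:

```python
import numpy as np

def permuted_accuracy(model, X_tr, y_tr, X_te, y_te, gene, l=30, seed=0):
    """Average overall accuracy A after l random permutations of one gene's
    training expressions; genes are then ranked by A in ascending order."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(l):
        X_perm = X_tr.copy()
        X_perm[:, gene] = rng.permutation(X_perm[:, gene])   # perturb one gene
        accs.append(np.mean(model.fit(X_perm, y_tr).predict(X_te) == y_te))
    return float(np.mean(accs))
```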

Analysis 3.2 was performed on the full data, containing both the originally given training and test sets. Here we applied 5-fold cross-validation to evaluate the effect of each gene on the classification of MOAs. Our approach consisted of two layers of randomization. For the jth outer randomization, j=1,…,J, we randomly partitioned the dataset into 5 folds and selected 4 folds as a training set, with the remaining fold as a test set. After randomly permuting the expressions of a given gene i across this training set, a classifier was trained and used to predict the selected test set. Using the same approach as in Analysis 3.1, we obtained an average overall prediction accuracy \((A^{cv}_{i_{j}})\) by repeating the permutation l times. The whole procedure was then repeated J times, for different random partitions, to obtain an average overall prediction accuracy \((A^{cv}_{i})\) for the ith gene based on all J scenarios:

$$A^{cv}_{i}=\frac{1}{J}\sum\limits_{j=1}^{J}A^{cv}_{i_{j}}. $$

Suppose \(A^{cv}\) is the average true accuracy (on unpermuted data) based on the J random partition sets. Since the magnitude of \(A^{cv}\) can vary, a better measure is the relative accuracy reduction \((R_i)\), given by

$$R_{i}=\frac{A^{cv}-A^{cv}_{i}}{A^{cv}}, $$

where large values of \(R_i\) indicate a high impact on the classification. For Analysis 3.2, we used l=30 and J=100, which stabilized the calculations without being computationally burdensome. The two layers of randomization are sketched below.
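Combining the pieces, \(R_i\) can be computed as follows; this sketch reuses `permuted_accuracy` from the Analysis 3.1 snippet and takes one held-out fold per outer 5-fold partition, under the assumptions stated there.

```python
import numpy as np
from sklearn.model_selection import KFold

def relative_reduction(model, X, y, gene, J=100, l=30, seed=0):
    """R_i = (A_cv - A_cv_i) / A_cv for one gene, averaged over J partitions."""
    a_true, a_perm = [], []
    for j in range(J):
        # Outer randomization: a fresh 5-fold partition, one fold held out.
        tr, te = next(iter(KFold(5, shuffle=True,
                                 random_state=seed + j).split(X)))
        a_true.append(np.mean(model.fit(X[tr], y[tr]).predict(X[te]) == y[te]))
        # Inner randomization: l permutations of the gene's training values.
        a_perm.append(permuted_accuracy(model, X[tr], y[tr], X[te], y[te],
                                        gene, l=l, seed=seed + j))
    A_cv, A_cv_i = np.mean(a_true), np.mean(a_perm)
    return (A_cv - A_cv_i) / A_cv   # larger R_i: more important gene
```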

Discussion

In this study, we used an ensemble classifier built on a set of standard classifiers to predict the MOA in rat liver experiment data profiled on both the microarray and RNASeq platforms. The ensemble classifier performed reasonably well on each platform individually. Using a selected test set and the set of genes present on both platforms, we observed comparable overall predictability of MOAs on the two platforms, with accuracies of 75% and 67% for microarray and RNASeq, respectively. Similarly, we observed well matched accuracies of 50% on both platforms for the full test sets under the alternative approach. An earlier classification study [6] on the same data reported average overall accuracies of 58% and 61% for microarray and RNASeq, suggesting slightly better predictability with RNASeq. However, the outcomes of these two studies are somewhat incomparable due to differences in the training and test datasets used; for example, we considered the controls as another class, whereas in their analysis the controls were not treated as a separate class. Interestingly, once we trained classifiers to make predictions across platforms, the ensemble classifier provided 100% accurate predictions for all 8 classes present in the whole experiment. This result exhibits perfect cross-platform concordance for the purpose of classification. Our study also clearly demonstrates a high agreement between the individual classifiers’ performances on the two genomic platforms. Except for a few scenarios, the ensemble classifier performed best with respect to the overall accuracy and the other class specific measures in all experiments. We observed widely different classification performances among the standard classifiers, which reflects the unreliability of restricting to a single classifier in high dimensional classification problems. On the other hand, this also demonstrates the utility of the adaptive ensemble classifier, which is expected to perform as well as or better than the individual classifiers with respect to multiple performance measures.

Conclusion

In this study, we explored the inter-platform concordance of microarray and RNASeq in their ability to classify samples based on genomic information, using data profiled in a rat liver experiment. We used an ensemble classifier built on a set of seven standard classifiers to predict the MOA in rat livers. The ensemble classifier performed reasonably well on each platform individually, yielding accuracies of 75% and 67% for microarray and RNASeq, respectively, on a selected test set. When we trained classifiers to make predictions across platforms, the ensemble classifier provided remarkable 100% accurate predictions. This study demonstrates a high agreement between the individual classifiers’ performances on the two genomic platforms. Additionally, we identified a set of important genes that specify MOAs, by focusing on their impact on the classification.

Reviewers’ comments

Reviewer’s report 1: Yiyi Liu (yiyi.liu@yale.edu), Yale University

In this manuscript, the authors investigated concordance between microarray and RNA-seq in classifying samples based on gene expression profiles. They tested the performances of eight classifiers, including one ensemble method, and obtained very interesting results. Overall the reviewer is positive about the work. There are several minor concerns that the authors need to address.

1. I suggest the authors add descriptions on the weights (\(w_i\)’s) they used in rank aggregation of the ensemble classifier. The authors explained the main idea of the aggregation method, but explicitly stating all the parameters could improve the readability of the paper.

2. The authors mentioned the RNA-seq data are “normalized via the Magic normalization”. I suggest citing the normalization method paper for reference.

Authors’ response:

1. We have described the role and the choice of the weights.

2. The two suggested references have been added.

Reviewer’s report 2: Partha Dey (pdey.bit@gmail.com), Academy of Technology at Adisaptagram, Hooghly, India

The article “Inter-platform Concordance of Gene Expression Data for the Prediction of Chemical Mode of Action” by Siriwardhana et al. studies the consistency of cross-platform classification accuracy between microarray and RNASeq in their ability to classify samples based on genomic information. Seven classifiers and an adaptive ensemble classifier developed around them were used to predict the chemical modes of action (MOA) on rat liver samples. The article is well written and nicely organized. In addition, addressing these few points should increase the impact of the research work across a wide spectrum of readers:

1. The “Results” section comes before the “Methods” section; if this is not due to some restriction of the publishers or typical of the field of investigation, the sequence may be reversed (to corroborate the usual practice in most research articles: after Methodology should come Results).

2. In the “Methods” section: The authors have mentioned the use of ‘sampling with replacement’. It would be relevant here to state the specific advantage of sampling with replacement compared to sampling without replacement (which would result in a partition of the original training set into a pure-training subset and an in-house testing subset, instead of some repeated data in the training samples and OOB samples for measuring the performance of the classifier). A brief description of the details of the different classifiers (viz. SVM, RF, LDA, PLS+RF, PLS+LDA, PCA+RF, PCA+LDA, and RPART), e.g., how many PCs were taken (or at least their range across different cases), whether linear or non-linear SVs, binary or n-ary partitioning, etc., may be provided to assist later users in this field in knowing the optimum classifier parameters.

3. In the “Discussion” section: Could a clue be given as to why the ensemble classifier performed worse than at least one intrinsic classifier in those few scenarios? In particular, is a better ensemble approach possible, or is it in general impossible to have an ensemble classifier that performs best on all performance indices? It would be nice to have a commentary summarizing the important genes reported in Tables 6, 7, 8 and 9. For example, those genes that appeared in most of the tables for microarray or RNASeq, or both, might be listed to enable biologists to get the condensed information from this study.

Authors’ response: Sampling with replacement is part of the bootstrap, which is a standard procedure in bagging. An advantage is that training data of the same size as the original can be created, and the out-of-bag samples can be used as test data in a natural way. The classifiers were described in Datta et al. [14]. The tuning parameters were selected by cross-validation as described in the paper. The ensemble classifier is optimal in an overall sense, as defined by the rank aggregation procedure; it may not be possible to beat every individual classifier with respect to every performance measure. We have commented on some genes in Section “Importance of genes”.