Diverse approaches to predicting drug-induced liver injury using gene-expression profiles
Drug-induced liver injury (DILI) is a serious concern during drug development and the treatment of human disease. The ability to accurately predict DILI risk could yield significant improvements in drug attrition rates during drug development, in drug withdrawal rates, and in treatment outcomes. In this paper, we outline our approach to predicting DILI risk using gene-expression data from Build 02 of the Connectivity Map (CMap) as part of the 2018 Critical Assessment of Massive Data Analysis CMap Drug Safety Challenge.
First, we used seven classification algorithms independently to predict DILI based on gene-expression values for two cell lines. Similar to what other challenge participants observed, none of these algorithms predicted liver injury consistently with high accuracy. In an attempt to improve accuracy, we aggregated predictions for six of the algorithms (excluding one that had performed exceptionally poorly) using a soft-voting method. This approach also failed to generalize well to the test set. We investigated alternative approaches, including a multi-sample normalization method, dimensionality-reduction techniques, a class-weighting scheme, and an expanded set of hyperparameter combinations as inputs to the soft-voting method. Each of these solutions met with limited success.
We conclude that alternative methods and/or datasets will be necessary to effectively predict DILI in patients based on RNA expression levels in cell lines.
This article was reviewed by Paweł P Labaj and Aleksandra Gruca (both nominated by David P Kreil).
Keywords: Machine learning, Classification, Cell lines, Drug development, Precision medicine
Drug-induced liver injury (DILI) is a serious concern during both drug development and the treatment of human disease. DILI is characterized by elevated levels of alanine aminotransferase; in serious cases, it can ultimately result in acute liver failure and patient death. Reactive drug metabolites may play a role in initiating DILI. Drug hepatotoxicity plays an important role in risk-benefit assessment during drug development, but the ability to accurately predict the risk of DILI for a new drug has evaded investigators. Historically, nearly one third of drug withdrawals may have been related to hepatotoxicity. The ability to accurately predict DILI risk could yield considerable reductions in drug-attrition and drug-withdrawal rates as well as improved treatment outcomes.
The 2018 Critical Assessment of Massive Data Analysis (CAMDA) Connectivity Map (CMap) Drug Safety Challenge was held in conjunction with the Intelligent Systems for Molecular Biology conference in Chicago, Illinois. The challenge organizers instructed participants to train predictive models on gene-expression data from Build 02 of CMap . CMap was created to facilitate the discovery of connections among drugs, genes, and human diseases . CMap contains gene-expression profiles from cell lines that were systematically exposed to a range of bioactive small molecules . For the CAMDA challenge, the class labels were binary values indicating whether treatment with a given drug was associated with liver injury in cell-based screens for the following cell lines: MCF7 (breast cancer) and PC3 (prostate cancer). Per the terms of the CAMDA challenge, we used data for 190 small molecules (of the 1309 total small molecules available in CMap) during model training and 86 additional small molecules for model testing. During Phase I of the challenge, the organizers asked each team to submit DILI predictions for the test set. Later the class labels were revealed to the challenge participants to enable follow-up analyses in Phase II.
After submitting our predictions to the challenge organizers, we learned that our predictions performed worse than random-chance expectations. Thus, during the second phase of the challenge, we explored various options for improving classification accuracy, including different preprocessing methods, feature-selection and feature-transformation approaches, class weighting, and multiple hyperparameter combinations (Fig. 1).
Summary of classification algorithms evaluated on the training set, with parameters selected after optimization:

Multilayer Perceptron: activation = 'relu'; alpha = 0.0001; batch_size = 'auto'; beta_1 = 0.9; beta_2 = 0.999; early_stopping = False; epsilon = 1e-08; hidden_layer_sizes = (30,30,30,30,30,30,30,30,30,30); learning_rate = 'constant'; learning_rate_init = 0.0376; max_iter = 200; momentum = 0.9; nesterovs_momentum = True; power_t = 0.5; random_state = None; shuffle = True; solver = 'adam'; tol = 0.0001; validation_fraction = 0.1; warm_start = False

Gradient Boosting: criterion = 'friedman_mse'; init = None; learning_rate = 0.31; loss = 'deviance'; max_depth = 3; max_features = None; max_leaf_nodes = None; min_impurity_decrease = 0.0; min_impurity_split = None; min_samples_leaf = 1; min_samples_split = 2; min_weight_fraction_leaf = 0.0; n_estimators = 100; presort = 'auto'; subsample = 1.0; warm_start = False

K-nearest Neighbors: algorithm = 'auto'; leaf_size = 30; metric = 'minkowski'; metric_params = None; n_neighbors = 8; p = 2; weights = 'distance'

Logistic Regression: C = 1.0; class_weight = None; dual = False; fit_intercept = True; intercept_scaling = 1; max_iter = 100; multi_class = 'ovr'; penalty = 'l2'; solver = 'lbfgs'; tol = 0.0001; warm_start = False

Gaussian Naïve Bayes: priors = None

Random Forests: bootstrap = False; class_weight = None; criterion = 'gini'; max_depth = 9; min_samples_split = 2; min_samples_leaf = 1; min_weight_fraction_leaf = 0.0; max_features = 'auto'; max_leaf_nodes = 25; min_impurity_decrease = 0.0; min_impurity_split = None; n_estimators = 25; oob_score = False; warm_start = False

Support Vector Machines: C = 1.0; class_weight = None; coef0 = 0.0; decision_function_shape = 'ovr'; degree = 3; gamma = 'auto'; kernel = 'rbf'; max_iter = -1; probability = False; shrinking = True; tol = 0.001

Voting ensemble: flatten_transform = True; voting = 'soft'; weights = None
[Table: Phase I cross-validation results for each classification algorithm, including Gaussian Naïve Bayes and Support Vector Machines.]
In addition to providing class labels for the test set, the CAMDA organizers provided us with suggestions from reviewers. These suggestions gave us ideas for improving classification performance, which we evaluated in Phase II. Because we did not have an additional, independent dataset, our Phase II evaluations were only exploratory in nature. We explored four types of techniques for improving performance: a multi-sample normalization method and batch correction, feature scaling/selection/reduction techniques, custom class weights, and scaling of the voting-based ensemble method. To quantify the effects of these alternative approaches, we compared the performance of our classifiers with and without each change, averaged across all classification algorithms—with the exception of adjusting the class weights, which was only possible for a subset of the algorithms (see Methods). Figure 3 illustrates the effects of these changes.
In Phase I, we preprocessed the microarray data using the SCAN algorithm, a single-sample normalization method. We hypothesized that preprocessing the data using the FARMS algorithm (a multi-sample normalization method) would result in improved performance by reducing technical variability across the samples via quantile normalization. In addition, because the CMap data had been processed in many batches, we hypothesized that correcting for batch effects using the ComBat algorithm would increase classification performance. In some cases, these changes improved predictive performance slightly, whereas in other cases the performance was reduced, irrespective of whether we used SCAN, FARMS, and/or batch adjustment (Fig. 3a).
Although microarray normalization methods help to remove technical biases and multi-sample corrections can remove inter-sample variations, some classification algorithms assume that each feature has been scaled to have the same mean and standard deviation. Accordingly, in Phase II, we used scikit-learn's RobustScaler functionality to scale the expression data for each gene; this method also adjusts for any outliers that may exist. In addition, we reduced the feature space via feature selection (using the ANOVA F-value) and dimensionality reduction (using Principal Component Analysis). These adjustments did not improve performance consistently (Fig. 3b).
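As a sketch of these Phase II preprocessing steps, the scaling, selection, and reduction stages can be chained in a single scikit-learn pipeline. The data here are synthetic stand-ins for the CMap expression matrix, and the k and n_components values are illustrative rather than those used in the study.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 500))    # 100 treatments x 500 "genes" (synthetic)
y = rng.randint(0, 2, size=100)    # binary DILI labels (synthetic)

pipe = Pipeline([
    ("scale", RobustScaler()),                  # median/IQR scaling, robust to outliers
    ("select", SelectKBest(f_classif, k=100)),  # ANOVA F-value feature selection
    ("reduce", PCA(n_components=10)),           # dimensionality reduction
    ("clf", LogisticRegression(solver="lbfgs")),
])
pipe.fit(X, y)
print(pipe.predict(X).shape)
```

Wrapping the steps in a Pipeline ensures that scaling and selection are fit only on training folds during cross-validation, avoiding information leakage into held-out samples.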
In an attempt to mitigate the effects of class imbalance, we adjusted the weights assigned to the class labels. By default, classification algorithms in scikit-learn place an equal weight on each class label, but many algorithms provide an option to adjust these weights. We attempted many different weight ratios, even placing 50 times more weight on the minority class than the majority class. These adjustments often improved sensitivity or specificity, but none of these changes resulted in a higher Matthews correlation coefficient (MCC) value (Fig. 3c).
Finally, we made various attempts at improving the voting-based classifier. We used hard voting rather than soft voting. With this approach, the predictions for the individual classifiers are treated as discrete rather than probabilistic values, which may improve ensemble predictions in situations where probabilistic predictions are poorly calibrated. In addition, we increased the number of individual classifiers used for voting. We retained the same classification algorithms, but we included predictions for multiple hyperparameter combinations per algorithm. We suspected that a larger and more diverse set of predictions would improve voting performance. None of these approaches resulted in consistent improvements for any of the metrics except specificity (Fig. 3d); the specificity gains were counterbalanced by decreases in the other metrics.
Our goal was to make progress toward accurately predicting DILI based on gene-expression profiles of cell lines. The ability to predict these outcomes could reduce patient injury, lower costs associated with drug development, and optimize treatment selection. As a step toward these objectives, we analyzed gene-expression levels from cancer cell lines that had been treated with small molecules; we used machine-learning classification to predict DILI. Our study design relied on the assumption that drugs causing liver injury induce transcriptional changes that are common across many or all of these drugs and that these transcriptional changes might also occur in liver tissue in vivo.
In Phase I, we employed seven classification algorithms as well as a soft-voting ensemble classifier that aggregated predictions from six of the seven individual algorithms. On the training data, we observed relatively high performance for the Random Forests and Logistic Regression algorithms, which coincides to an extent with prior findings. However, when applied to the test set, neither algorithm consistently produced predictions that exceeded what could be attained by defaulting to the majority class. The soft-voting approach yielded better performance than the individual algorithms at times, but this pattern was inconsistent. Voting-based approaches often outperform single-classifier approaches because they combine diverse algorithmic techniques—where one algorithm fails, other(s) may succeed. However, they rely on a diverse range of inputs; using algorithms from a narrow range of methodologies will generally be less performant.
We emphasize the importance of considering multiple, diverse performance metrics when evaluating classification results. Even though our classification algorithms sometimes attained higher levels of accuracy on the test set than the training set (Fig. 2a), these improvements were likely a consequence of different levels of class imbalance between the training and test sets—a higher proportion of drug compounds induced liver injury in the training samples than in the test samples. Our classifiers were prone to over-predicting liver injury. Although accuracy and sensitivity typically benefited from this bias, specificity typically suffered, offsetting those gains. Accordingly, we believe that the degree of class imbalance was a key reason that our methods underperformed. To address this limitation in Phase II, we assigned higher weights to the minority class, thus potentially helping to account for class imbalance. Even though this approach rests on a solid theoretical foundation, it resulted in minimal, if any, improvements in overall performance.
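The pitfall described above can be illustrated with a trivial classifier that always predicts the majority class: accuracy looks respectable while the MCC, which accounts for all four confusion-matrix cells, is zero. The class counts below are invented for illustration and do not reflect the actual test-set labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical imbalanced labels: 60 DILI (1) vs. 26 non-DILI (0) samples.
y_true = np.array([1] * 60 + [0] * 26)
y_pred = np.ones_like(y_true)  # a "classifier" that always predicts DILI

print(accuracy_score(y_true, y_pred))    # high accuracy from imbalance alone
print(matthews_corrcoef(y_true, y_pred)) # 0.0: no discriminative ability
```

This is why we report MCC alongside accuracy, sensitivity, and specificity: a metric that rewards over-prediction of the majority class can mask a complete lack of signal.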
Additionally, we attempted to improve classification performance using a multi-sample normalization method, adjusting for batch effects, scaling features, selecting features, reducing data dimensionality, and using multiple hyperparameter combinations as input to the voting-based classifier. Although these techniques might have resulted in improvements in other classification scenarios, they yielded minimal improvements, if any, in predictive ability in our analysis. The batch-effect correction method that we used (ComBat) requires the researcher to assign batch labels to each biological sample. Alternative tools such as PEER and SVA can be used in situations where batch labels are unknown or more generally to detect hidden variation. Indeed, hidden factors—perhaps due to treatment duration and physiological complexity—may have confounded this study. DILI was determined based on a meta-analysis of patient data, whereas our predictions were derived from treatments administered to cell lines over the course of only a few hours or days.
The original goal of this CAMDA challenge was to predict hepatic injury from mRNA expression profiles. Our findings suggest that some or all of the following factors may explain our limited success in predicting these outcomes: 1) gene-expression microarray measurements are often noisy, 2) mRNA expression levels in cell lines may be inadequate surrogates for in vivo responses in this setting, 3) larger datasets may be needed, and 4) more sophisticated analytic techniques may be needed.
The training set was a subset of CMap consisting of gene-expression data and known DILI status for 190 small molecules (130 of which had been found to cause DILI in patients). The test set consisted of an additional 86 small molecules. The CMap gene-expression data were generated using Affymetrix gene-expression microarrays. In Phase I, we used the Single Channel Array Normalization (SCAN) algorithm—a single-sample normalization method—to process the individual CEL files (raw data), which we downloaded from the CMap website (https://portals.broadinstitute.org/cmap/). As part of the normalization process, we used BrainArray annotations to discard faulty probes and to summarize the values at the gene level (using Entrez Gene identifiers). We wrote custom Python scripts (https://python.org) to summarize the data and execute analytical steps. The scripts we used to normalize and prepare the data can be found here: https://osf.io/v3qyg/.
For each treatment on each cell line, CMap provides gene-expression data for multiple biological replicates of vehicle-treated cells. For simplicity, we averaged gene-expression values across the multiple vehicle files. We then subtracted these values from the corresponding gene-expression values for the compounds of interest. Finally, we merged the vehicle-adjusted data into separate files for the MCF7 and PC3 cell lines.
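The vehicle-adjustment step described above amounts to a per-gene mean over the vehicle replicates followed by a subtraction. A minimal pandas sketch follows; the gene identifiers and expression values are invented for illustration.

```python
import pandas as pd

# Toy vehicle replicates: rows are genes, columns are replicate arrays.
vehicles = pd.DataFrame({"rep1": [5.0, 2.0], "rep2": [7.0, 4.0]},
                        index=["gene_a", "gene_b"])
# Expression profile for one compound-treated sample.
compound = pd.Series([8.0, 1.0], index=["gene_a", "gene_b"])

vehicle_mean = vehicles.mean(axis=1)  # per-gene mean across vehicle replicates
adjusted = compound - vehicle_mean    # vehicle-adjusted expression

print(adjusted.tolist())  # [2.0, -2.0]
```

Because pandas aligns on the gene index during subtraction, the adjustment remains correct even if the vehicle and compound files list genes in different orders.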
The SCAN algorithm is designed for precision-medicine workflows in which biological samples may arrive serially and thus may need to be processed one sample at a time. This approach provides logistical advantages and ensures that the data distribution of each sample is similar, but it does not attempt to adjust for systematic differences that may be observed across samples. Therefore, during Phase II, we generated an alternative version of the data, which we normalized using the FARMS algorithm—a multi-sample normalization method. This enabled us to evaluate whether the single-sample nature of the SCAN algorithm may have negatively affected classification accuracy in Phase I. Irrespective of normalization method, it is possible that batch effects can bias a machine-learning analysis. Indeed, the CMap data were processed in many batches. Therefore, for SCAN and FARMS, we created an additional version of the expression data by adjusting for batch effects using the ComBat algorithm.
Initially in Phase I, we used a variance-based approach for feature selection (with the goal of identifying which genes would be most informative for classification). We calculated the variance of the expression values for each gene across all samples; then we selected different quantities of genes that had the highest variance and used those as inputs to classification. However, in performing 10-fold cross-validation on the training set, we observed no improvement in classification performance regardless of the number of high-variance genes that we used, so we decided not to use feature selection for our Phase I predictions. To perform cross-validation, we wrote custom Python code that utilizes the scikit-learn module (version 0.19.2).
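A minimal sketch of this variance-based selection step, paired with 10-fold cross-validation, might look as follows. The data are synthetic, and the gene count, number of selected genes, and classifier settings are illustrative rather than those used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(190, 1000))  # 190 treatments x 1000 "genes" (synthetic)
y = rng.randint(0, 2, size=190)   # binary DILI labels (synthetic)

# Rank genes by variance across samples and keep the top k.
k = 200
top = np.argsort(X.var(axis=0))[::-1][:k]
X_top = X[:, top]

# 10-fold cross-validation on the reduced feature set.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=25, random_state=0), X_top, y, cv=10)
print(scores.shape)
```

In practice, different values of k would be compared against the full feature set; in our analysis, no value of k improved cross-validated performance.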
In Phase II, we used the following scaling and feature-selection methods in an attempt to improve performance: robust scaling, feature selection based on the ANOVA F-value, and principal component analysis. We used scikit-learn implementations of these methods with default hyperparameters.
We performed classification using the following algorithms from the scikit-learn library: Gradient Boosting, Logistic Regression, K-nearest Neighbors, Random Forests, Multilayer Perceptron, Support Vector Machines, and Gaussian Naïve Bayes. For each of these algorithms, we used scikit-learn to generate probabilistic predictions. For the voting-based ensemble classifier, we used the VotingClassifier class in scikit-learn. In Phase I, we used “soft” voting, which averages probabilistic predictions across the individual classifiers. In Phase II, we used “hard” voting, which predicts the class label as that which received the larger number of discrete votes.
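A condensed sketch of the soft-voting setup, using three of the component algorithms on synthetic data, is shown below; the estimators and their settings are illustrative and do not reproduce the tuned Phase I configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(1)
X = rng.normal(size=(60, 20))   # synthetic expression features
y = rng.randint(0, 2, size=60)  # synthetic DILI labels

# "Soft" voting averages predict_proba across classifiers;
# switching to voting="hard" takes the majority of discrete class votes instead.
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("lr", LogisticRegression(solver="lbfgs")),
        ("knn", KNeighborsClassifier(n_neighbors=8, weights="distance")),
    ],
    voting="soft",
)
voter.fit(X, y)
proba = voter.predict_proba(X)
print(proba.shape)  # one probability per class per sample
```

Note that soft voting requires every component estimator to expose predict_proba, which is one reason probabilistic predictions were generated for each algorithm.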
In an attempt to optimize the performance of the voting-based classifier, we devised a weighting scheme, which assigned higher weights to individual algorithms that performed relatively well during cross validation; we also experimented with excluding individual classifiers from the voting-based classifier. The only approach that appeared to have a consistently positive effect on performance was to exclude the Gaussian Naïve Bayes algorithm, which had also performed poorly in isolation. Our final voting-based model in Phase I excluded Gaussian Naïve Bayes and assigned an equal weight to each individual classifier.
In Phase II, we attempted to improve the voting-based classifier in multiple ways. First, rather than selecting a single hyperparameter combination for each algorithm and using those as input to the voting-based classifier, we used multiple hyperparameter combinations for each classification algorithm (except Gaussian Naïve Bayes). For this approach, we incorporated the following classification algorithms (with the number of distinct hyperparameter combinations): Multilayer Perceptron (n = 5), Support Vector Machines (n = 4), Logistic Regression (n = 2), Random Forests (n = 5), K-nearest Neighbor (n = 5), and Gradient Boosting classifiers (n = 3). We also investigated whether assigning weights to each class label would help overcome the effects of class imbalance and improve classification performance. Four of the classifiers from Phase I—Random Forests, Support Vector Machine, Logistic Regression, and the soft-voting ensemble method—support a class_weight hyperparameter, which allowed us to apply custom weights to each class label (or to determine the weights algorithmically). Adjusting the class_weight hyperparameter required providing a weight for the non-DILI (weight_1) and DILI observations (weight_2), indicated here as weight_1:weight_2. We used class weights of 50:1, 25:1, 10:1, 5:1, 2:1, 1:1, and 1:2.
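The class_weight adjustment can be sketched with Logistic Regression on synthetic data, as follows. The 10:1 ratio is one of the ratios listed above, while the features and labels are invented; the only point illustrated is that up-weighting the minority (non-DILI) class shifts predictions toward that class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(2)
X = rng.normal(size=(190, 10))       # synthetic features
y = np.array([1] * 130 + [0] * 60)   # class 1 (DILI) is the majority, as in training

# Apply a 10:1 weight to the minority (non-DILI) class versus the default
# equal weighting, then compare how often each model predicts non-DILI.
clf_weighted = LogisticRegression(solver="lbfgs",
                                  class_weight={0: 10, 1: 1}).fit(X, y)
clf_default = LogisticRegression(solver="lbfgs").fit(X, y)

pred_weighted = clf_weighted.predict(X)
pred_default = clf_default.predict(X)
print((pred_weighted == 0).sum(), (pred_default == 0).sum())
```

As described above, such reweighting tends to trade sensitivity for specificity (or vice versa) rather than improving overall discrimination.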
Reviewer’s report 1
Paweł P Labaj, Jagiellonian University (nominated by David P Kreil, Boku University Vienna).
Overall the manuscript is well written; however, the reader can lose track in both the Methods and the Results. A better structure, complemented with a figure outlining the analysis procedure, would improve readability and thereby the quality of the manuscript.
What is missing from the manuscript is a deeper description of the ensemble approach, with all its pros and cons. This approach can easily be misled if several of the methods used have similar bases / come from closely related families of solutions. That is not the case here, but it should be pointed out and described. Connected to this is the selection of the methods used; simply stating that they are available in the scikit-learn library is not enough.
The authors, in one of the improvements, used ComBat for batch correction, but this works only for known confounders. It would be interesting to see, or at least comment on, the application of solutions that can also detect hidden confounders, such as PEER or SVA.
A figure presenting an overview of the analysis and all additions should be provided to improve readability. A further comment on the second point: CMap was created by treating cell lines with specific doses, whereas DILI is based on a meta-analysis of real patient data. One could expect that an important factor for DILI is whether the therapy was short-term or prolonged, as in the latter even a small toxicity can accumulate and lead to DILI. Of course, the necessary data were not provided here, but the therapy-type factor could perhaps be detected as a hidden confounder.
We have revised the text in the Methods and Results sections to make the manuscript easier to read. We have also revised the sub-section headings to facilitate better organization. In addition, we have added a figure that illustrates our workflow across the two phases of the CAMDA challenge.
We modified the wording in the 3rd paragraph of the Introduction section to say the following: “Generally, voting approaches are most effective when they incorporate individual classifiers that perform reasonably well in isolation and when the component classifiers use diverse methodological approaches and thus are more likely to have deficiencies in different areas of the input space, often allowing for improved performance in aggregate. We hoped that this would hold true for predicting DILI in this study because the individual algorithms that we used represent diverse methodological approaches.” We also modified the Discussion section as follows: “The soft-voting approach yielded better performance than the individual algorithms at times, but this pattern was inconsistent. Voting-based approaches often outperform single-classifier approaches because they combine diverse algorithmic techniques—where one algorithm fails, other(s) may succeed. However, they rely on a diverse range of inputs; using algorithms from a narrow range of methodologies will generally be less performant.” In addition, we have provided an expanded table that shows which parameters we used for each algorithm.
We added the following statement to the last paragraph of the Discussion section: “The batch-effect correction method that we used (ComBat) requires the researcher to assign batch labels to each biological sample. Alternative tools such as PEER and SVA can be used in situations where batch labels are unknown or more generally to detect other types of hidden variation.”
In complement to the previous point, we have modified the Discussion to add the point that the reviewer mentioned: “… hidden factors—perhaps due to treatment duration and physiological complexity—may have confounded this study. DILI was determined based on a meta-analysis of patient data, whereas our predictions were derived from treatments administered to cell lines over the course of only a few hours or days.”
Reviewer’s report 2
Aleksandra Gruca, Silesian University of Technology (nominated by David P Kreil, Boku University Vienna).
The authors analysed the dataset from the CAMDA 2018 DILI contest. The main goal of the contest is to accurately predict the DILI risk of a particular drug based on cell-line gene-expression data. To achieve this, the authors try different parameter settings for data preprocessing and apply seven classification algorithms that are finally combined in an ensemble approach. The presented work is of limited novelty. In general, the data-processing workflow is designed correctly and the analytical steps performed by the authors are typical for such kinds of problems. I do not find any flaws in the proposed approach, although I also do not see any novelty in it. On the positive side, I notice that the authors have tried several different combinations of methods and parameters in searching for the best outcome. However, none of the applied techniques was able to significantly improve performance of the classifiers, which may be due to the fact that the DILI dataset from the CAMDA 2018 contest is very difficult to analyse, as it is characterised by a weak signal.
The analysed dataset is described very briefly in the paper. The paper is a separate piece of scientific work; therefore, the authors should not assume that the reader is familiar with the CAMDA contest and the dataset, and they should provide a more detailed description of the analysed data: for example, how many drugs were measured and what the distribution of objects is between the DILI and non-DILI classes.
I suggest adding a figure representing the proposed workflow. It would also clarify whether the preprocessing steps were performed separately or as a single workflow.
I notice the following sentence (2nd paragraph of page 8 of the manuscript): “Naive Bayes algorithm, which had performed quite poorly in isolation (Fig. 3)”. However, I cannot see any data in Fig. 3 related to this sentence.
In the description of Fig. 3, I notice the following statement: “For each adjustment in our procedure, we measured the performance of all the classifiers (with the exception of adjusting the class_weight hyperparameter, which was only available for the classifiers listed above (…)”. It is not clear what the authors mean by “classifiers listed above”.
In Fig. 1, the Y-axes for the accuracy, sensitivity, and specificity metrics are not scaled the same way and cover different ranges. As values of all of these measures are usually interpreted on the same range, presenting them on different scales might be misleading. I suggest either putting all of them on the same figure or at least presenting them on charts that have the same Y-axis range.
We now provide information about sample sizes and class imbalance in the Data preprocessing section of Methods.
We have added a workflow diagram that illustrates the key components of Phases I and II.
We thank the reviewer for catching this. We have removed the parenthetical text from the manuscript.
We have thoroughly revised this figure caption (as well as the others) to improve clarity.
We have updated this figure according to the reviewer’s suggestion (using the same Y-axis scale for all 4 sub-figures).
Our participation in the CAMDA challenge provided invaluable educational experiences for the authors of this study, most of whom were members of the Bioinformatics Research Group (BRG) at Brigham Young University. The BRG almost exclusively consists of and is led by undergraduate students. We thank the CAMDA organizers for providing the opportunity to participate in this competition and gain experience in this exciting research area.
GRS, MSB, and SRP developed the study methodology. GRS and SRP developed the data preprocessing pipeline. GRS, MSB, JTB, EF, GRGC, DJG, EDL, and ION implemented and tested algorithms in Phase I of the study. GRS implemented and tested algorithms in Phase II of the study. JTB, GRS, and SRP wrote the manuscript. GRS, JTB, and MSB edited the manuscript. GRS generated the figures, and JTB and GRS prepared the tables. SRP helped interpret results. All authors read and approved the manuscript.
SRP was supported by internal funds from Brigham Young University.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.