Background

In bioinformatics, molecular omics studies- like genomics, transcriptomics, proteomics and metabolomics are playing a prominent role in life sciences, health and biological research [1]. Among these approaches, metabolomics is frequently used to understand biological metabolic status, making a direct link between genotypes and phenotypes [2]. Many metabolomics-based biomarker discoveries have explored the key metabolites to discriminate between metabolic diseases, such as diabetes, cardiovascular diseases, and cancers [3]. The metabolites showing different concentrations among the given groups (e.g. healthy and disease subjects) is called as differential metabolites. Combinations of these metabolites can be used to identify subjects with a high risk of suffering from diabetes [4]. Thus, one of the most important tasks of metabolomics research is to identify a differential metabolite or a set of differential metabolites which have ability to differentiate patients with a disease from healthy subjects. The accurate identification of differential metabolites, or molecules that reflect a specific phenotype, is a cornerstone of many applications, such as predicting disease status and drug discovery [5,6,7,8].

To generate high-throughput metabolomics data, nuclear magnetic resonance (NMR) and hyphenated mass spectrometry (MS), such as gas chromatography-MS (GC-MS) and liquid chromatography-MS (LC-MS), are commonly used. These platforms can simultaneously identify and quantify hundreds of metabolites. All these analytical platforms can result in missing values in the observed data and outliers, which are caused by various reasons including analytical, experimental, and human errors, low quality measurements, malfunctioning equipment, and overlapping signals [9,10,11,12,13,14,15,16,17,18,19,20]. Thus, subsequent metabolomics data analysis should consider the presence of these problems in the given data.

Four types of statistical procedure have primarily been used to identify differential metabolites: (i) classical parametric approaches, such as Student’s t-test [21], classical volcano plot (CVP) [22] and fold change rank ordering statistics (FCROS) [23], (ii) classical non-parametric approaches, such as significance analysis of microarrays (SAM) [24], and the Wilcoxon [25] and Kruskal-Wallis (K-W) [26] tests, (iii) Bayesian parametric approaches, such as Bayesian robust inference for differential gene expression (BRIDGE) [27], empirical Bayes methods for microarrays (EBarrays) [28], and linear models for microarrays (Limma) [29], and (iv) Bayesian non-parametric approaches [30, 31]. In classical procedures, differential metabolites are identified using p-values (significance levels) that are estimated based on the distribution of a test statistic or a permutation, whereas in Bayesian procedures, differential metabolites are identified using posterior probabilities. However, most of the aforementioned techniques are not robust against outliers [27, 32]. Thus, they may produce misleading results in the presence of outlying samples or irregular concentrations of metabolites. Moreover, outlying samples or irregular concentrations of metabolites may violate the normality assumption in metabolomics datasets. Several nonparametric approaches (Wilcoxon and K-W test) and some Bayesian approaches (BRIDGE and Robust limma) are robust against outliers; however, increases in the number of outliers in these techniques reduce the accuracy of differential metabolite identification. One of the easiest ways to overcome this problem is to delete the outlying metabolites or outlying samples from the dataset. However, the deleted metabolites may be important metabolites in some cases, while deleting samples and metabolites can make the dataset much smaller or even vanish.

Comparatively, CVP [22] is a good technique for identifying differential metabolites because it can control the false discovery rate [33]. The volcano plot is based on p-values from a t-test and fold-change (FC) values [34], both of which depend on classical location and scatter, and thus volcano plot is affected by outliers. Therefore, in this paper, we develop an outlier-robust volcano plot by unifying CVP and a kernel weight function to overcome the problem of outliers. The advantage of the proposed method compared to existing methods is that it performs considerably better in the presence of outliers. We introduced a kernel weight function, which plays a key role in the performance of the proposed method. Robust volcano plot ensures robustness by producing smaller weights for outlying observations from the kernel weight function. Appropriate selection of the tuning parameter for the kernel function also improves the performance of the proposed method, as discussed later.

Since metabolomics dataset frequently contains outliers and all of the existing differential metabolite identification techniques are more or less influenced by outliers; as a result, outliers reduce the accuracy of differential metabolite identification. Therefore, in this paper we develop a kernel weight based outlier-robust volcano plot for detecting differential metabolites from metabolomics datasets in the presence of outliers. To measure the performance of the proposed method in comparison with other techniques, we consider nine existing differential metabolite identification techniques: three classical parametric approaches (t-test, FCROS, CVP), three nonparametric approaches (Wilcoxon test, K-W test, SAM) and three Bayesian approaches (BRIDGE, Limma, EBarrays). We also evaluate the performance of the proposed method using both artificially generated and experimentally measured metabolomics datasets in the absence and presence of outliers. Every metabolite identification method has a specific cutoff and its choices are sensitive to determine the metabolite identification which has large effect on the statistical analyses. In this paper, the cutoff of t-test, SAM, Wilcoxon test and K-W test have been taken as Bonferroni corrected p-value < 0.05. According to Dembélé et al. [23] we declared those metabolites as differential whose f-value is close to 0 or 1 and if f-value is close to 0.5 we took those metabolites as non differential. For CVP, a metabolite was said to be differential if p-value < 0.05 and | log2 (fold-change) | > 1. For Bayesian approaches we took the cutoff of Bonferroni corrected posterior probabilities > 0.95.

Methods

In this paper, a kernel weight based outlier-robust volcano plot is developed for detecting differential metabolites. To reduce the family wise error rate when comparing the performance of the proposed method with existing differential metabolite identification techniques, the p-values are adjusted using Bonferroni correction. The algorithm for outlier-robust volcano plot is given below.

Outlier-robust volcano plot (proposed)

We extend volcano plot by introducing a kernel weight function behind CVP. Classical volcano plot identifies differential metabolites using the t-test and fold-change (FC) methods, and plots log2 (fold-change) on the X-axis against -log10 (p-value) from the t-test on the Y-axis. Because the t-statistic depends on mean and variance and fold-change depends on mean, CVP is heavily influenced by outliers. Therefore, we use the kernel weighted average and variance instead of the classical mean and variance in the t-statistic and fold-change functions, and also plot log 2 fold-change on the X-axis and -log10 (p-value) from the t-test on the Y-axis. We refer to this procedure as robust volcano plot (RVP).

Let X = (x ij ); i = 1, 2, …, p and j = 1, 2, …, n, be a metabolomics data matrix with p metabolites and n samples. The rows and columns of X represent the metabolites and samples, respectively. In metabolomics data analysis, differential metabolites are the metabolites that show different concentrations between two groups (disease and control) of samples in a metabolomics dataset. According to the control and disease groups, the dataset can be expressed as

$$ X=\left[\begin{array}{c}\overset{Control}{\overbrace{x_{11}\kern0.5em {x}_{12}\kern0.5em \cdots \kern0.5em {x}_{1{g}_1}}}\\ {}{x}_{21}\kern0.5em {x}_{22}\kern0.5em \cdots \kern0.5em {x}_{2{g}_1}\\ {}\begin{array}{cccc}\vdots \kern0.75em & \vdots \kern0.5em & \ddots & \vdots \end{array}\\ {}\begin{array}{cccc}{x}_{p1}& {x}_{p2}& \cdots & {x}_{pg_1}\end{array}\end{array}\kern0.5em \begin{array}{c}\overset{Disease}{\overbrace{\begin{array}{cccc}{x}_{1\left({g}_1+1\right)}& {x}_{1\left({g}_1+2\right)}& \cdots & {x}_{1n}\end{array}}}\\ {}\begin{array}{cccc}{x}_{2\left({g}_1+1\right)}& {x}_{2\left({g}_1+2\right)}& \cdots & {x}_{2n}\end{array}\\ {}\begin{array}{cccc}\vdots \kern1.5em & \vdots \kern1.5em & \ddots & \vdots \end{array}\\ {}\begin{array}{cccc}{x}_{p\left({g}_1+1\right)}& {x}_{p\left({g}_1+2\right)}& \cdots & {x}_{pn}\end{array}\end{array}\right], $$

where g1 is the number of subjects in the control group and (n − g1) is the number of subjects in the disease group. In CVP, log 2 (fold-change) and -log10(p-value) from the t-test are calculated as follows.

The log 2 (fold-change) value for the i-th metabolite is

$$ {\log}_2\left({FC}_i\right)={\log}_2\left(\frac{{\overline{X}}_i^D}{{\overline{X}}_i^C}\right), $$
(1)

where \( {\overline{X}}_i^D \) represents the average intensity of the i-th metabolite for the disease group and \( {\overline{X}}_i^C \) represents the average intensity of the i-th metabolite for the control group.

The t-statistic for testing the hypothesis that the i-th metabolite is not differential, i.e.

$$ {H}_0:\kern0.75em {\mu}_i^C={\mu}_i^D\kern1.5em \mathrm{against}\kern1.5em {H}_1:\kern1em {\mu}_i^C\ne {\mu}_i^D, $$

for σ iC 2 = σ iD 2 is

$$ t=\frac{{\overline{X}}_i^C-{\overline{X}}_i^D}{\sqrt{S_i^2\left(\frac{1}{g_1}+\frac{1}{n-{g}_1}\right)}} $$
(2)

where

$$ {\displaystyle \begin{array}{l}{\overline{X}}_i^C=\frac{\sum \limits_{j=1}^{g_1}{x}_{ij}}{g_1};\kern1.25em {\overline{X}}_i^D=\frac{\sum \limits_{j={g}_1+1}^n{x}_{ij}}{n-{g}_1};\kern0.5em {S}_{iC}^2=\frac{1}{g_1-1}\sum \limits_{j=1}^{g_1}{\left({x}_{ij}-{\overline{X}}_i^C\right)}^2\kern0.75em ;\kern0.5em {S}_{iD}^2=\frac{1}{n-{g}_1-1}\sum \limits_{j={g}_1+1}^n{\left({x}_{ij}-{\overline{X}}_i^D\right)}^2\\ {};{S}_i^2=\frac{\left({g}_1-1\right){S}_{iC}^2+\left(n-{g}_1-1\right){S}_{iD}^2}{n-2}.\end{array}} $$

The value from eq. (2) is compared with Student’s t-value with n − 2 degrees of freedom.

If (σ iC 2 ≠ σ iD 2) then the test statistic is

$$ t=\frac{{\overline{X}}_i^C-{\overline{X}}_i^D}{\sqrt{\left(\frac{S_{iC}^2}{g_1}+\frac{S_{iD}^2}{n-{g}_1}\right)}} $$
(3)

In both cases, the p-value is calculated using

$$ p-\mathrm{value}=\underset{t_{calculated}}{\overset{\infty }{\int }}f(t) dt $$
(4)

In CVP, the FC value from (1) and t-value from (2) or (3) are calculated using the classical mean and variance. Because the classical mean and variance are influenced by outliers, we propose RVP using the weighted mean and variance instead of the classical mean and variance. For the weighted mean and variance, we use the kernel weight function \( {w}_j=\exp \left\{-\frac{\lambda }{2{\left( mad\left({x}_j\right)\right)}^2}{\left({x}_{ij}- median\left({x}_j\right)\right)}^2\right\}, \) where mad is the median absolute deviation. The value of this weight function lies between zero and one, and is close to zero if the observation is far from the median and close to one if the observation is close to the median. In the weight function, the tuning parameter λ is selected using k-fold cross validation (Fig. 1 summarizes the λ selection procedure). If the dataset does not contain outliers, then the value of λ is zero and all the weights are equal to 1, so the method is the same as the classical approach. The steps for RVP are given below.

  • Step − 1. Calculate log 2 (fold change) for the i-th metabolite as \( {\log}_2\left({FC}_i\right)={\log}_2\left(\frac{{\overline{X}}_i^D}{{\overline{X}}_i^C}\right), \) where \( {\overline{X}}_i^D=\sum \limits_{j={g}_1+1}^n{w}_j{x}_{ij}/\left(n-{g}_1\right) \) represents the weighted average intensity of the i-th metabolite for the disease group and \( {\overline{X}}_i^C=\sum \limits_{j=1}^{g_1}{w}_j{x}_{ij}/{g}_1 \) represents the weighted average intensity of the i-th metabolite for the control group.

  • Step − 2. Using the weighted average and weighted variance instead of the classical mean and variance, calculate -log10(p-value) for the i-th metabolite from the t-test using eqs. (2), (3) and (4), where \( {S}_{iC}^2=\sum \limits_{j=1}^{g_1}{w}_j{\left({x}_{ij}-{\overline{X}}_i^C\right)}^2/\left({g}_1-1\right) \) and \( {S}_{iD}^2=\sum \limits_{j={g}_1+1}^n{w}_j{\left({x}_{ij}-{\overline{X}}_i^D\right)}^2/\left(n-{g}_1-1\right). \)

  • Step − 3. Draw a scatter plot with log2 (fold-change) on the X-axis and -log10 (p-value) from the t-test on the Y-axis. This plot is considered to be an outlier-robust volcano plot (RVP). A metabolite is said to be differential if p-value < 0.05 and | log2 (fold-change) | > 1.

Fig. 1
figure 1

Flowchart of λ selection procedure

The R package of the proposed method with its user manual is available at https://github.com/nishithkumarpaul/Rvolcano.

Any user can install the “Rvolcano” package in R platform from the GitHub using the following code

library(devtools)

install_github("Rvolcano","nishithkumarpaul")

library(Rvolcano)

To draw the robust volcano plot using the package, the user manual is available at GitHub website.

Dataset description

In this paper, we use an artificially generated dataset and an experimentally measured metabolomics dataset to evaluate the performance of the proposed method in comparison with nine other methods.

Artificial data

In this study, as in [6], we generate an artificial metabolomics dataset using a one-way ANOVA model y ijk  = μ i  + g ij  + ∈ ijk , where y ijk is the intensity of the ith metabolite, jth group and kth sample, μ i denotes the overall intensity of metabolite i, g ij is the jth group effect for the ith metabolite, and ∈ ijk is a random error term. In this linear model, μ i ~ uniform (10, 20) and ∈ ijk ~ N(0,1). The disease and control group effects for increased concentrations of metabolites are g ij  = N(4, 1) and g ij  = N(2, 1), respectively; for decreased concentrations of metabolites, we use g ij  = N(2, 1) and g ij  = N(4, 1) for the disease and control groups, respectively. Both group effects for non-differential (equal concentration) metabolites are g ij  = N(0, 1). To create the artificial metabolomics dataset, we designated 130 metabolites as non-differential and 20 metabolites as differential (having differential concentrations). The dataset contained 70 subjects with 40 subjects in group-1 and 30 subjects in group-2. To investigate the performance of the proposed method under different conditions, outliers were randomly distributed in the artificially generated data matrix at different rates (5%, 10%, 15%, 20%, and 25%). Note that these outliers can fall anywhere in the data matrix. The outliers for the i-th metabolite were taken from a normal distribution with mean 3*μ i and variance \( {\sigma}_i^2 \), i.e. N (3*μ i , \( {\sigma}_i^2 \)), where μi and \( {\sigma}_i^2 \) are the mean and variance of the i-th metabolite. In total, 500 artificial datasets were generated for each condition, and the performance of the proposed method was evaluated using these datasets.

Experimentally measured data

In this paper, we considered a well-known publicly available metabolomics dataset for breast cancer serum data and control serum data containing metabolite abundance level measurements from different subjects. This dataset is available from the National Institute of Health (NIH) data repository and was collected by the University of Hawaii Cancer Center under study ID ST000356. The data were measured using a gas chromatography with time-of-flight mass spectrometry (GC-TOFMS) instrument and quantified using the ChromaTOF software (v4.33, Leco Co, CA, USA). The dataset contains 134 subjects (103 breast cancer without any treatment and 31 control subjects) and 101 metabolites. Auto-scaling was used to reduce the systematic variation in the dataset. To investigate the performance of the proposed method under different conditions, we also modified the dataset by changing 5%, 10%, and 15% of the real values by N(4 × μ i , σ i 2), where μ i and σ i 2 are the mean and variance of the i-th metabolite in the breast cancer data matrix.

Results and discussion

The performance of our proposed method was compared with nine existing methods using both the artificial and experimental datasets.

Performance evaluation based on artificially generated data

The performance of the proposed method was compared with those of nine existing methods using 500 artificial datasets. The misclassification error rates (MERs) for differential metabolite identification were calculated for each method. The true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), the area under the receiver operating characteristic (ROC) curve (AUC), and the partial AUC (pAUC with FPR ≤ 0.2) were also calculated. The above performance indices were calculated both in the absence and presence of outliers. The average MER, AUC and pAUC values for the artificial datasets are shown in the Additional file 1: Table S1. A method with a lower MER value and higher AUC and pAUC values is considered better. From Additional file 1: Table S1, we observe that our proposed method gave a lower MER value and higher AUC and pAUC values both in the absence and presence of outliers. We also present the ROC curves in Fig. 2 and boxplots of the 500 MER and AUC values in Figs. 3 and 4, respectively. Figure 2 shows that our proposed method gave a higher average TPR with respect to average FPR in comparison with the existing methods both in the absence of outliers and with 15% outliers. In Fig. 3, it is clear that the proposed method produced a smaller MER with minimum variability, and in Fig. 4, the proposed method gave higher AUC values with minimum variability both in the absence of outliers and with 15% outliers. To check the robustness of the different methods, we plotted the ROC curve and a boxplot of MER and AUC values for the artificial datasets for different rates (0%, 5%, 10%, 15%, 20% and 25%) of outliers. The graphs are shown in Additional file 2: Figures S1, S2 and S3. Further more, to measure the efficiency of the proposed method, we also calculated the power and false discovery rate (FDR) for small sample in both absence and presence of outliers (Additional file 1: Table S2). In Additional file 1: Table S2, the proposed method gave higher power and lower FDR in absence and presence of outliers for small sample sizes. More over, we calculated the execution time (speed of execution) in seconds of different methods including the proposed one for different number of metabolites and samples (Additional file 1: Table S3). Additional file 1: Table S3 showed that it is seen that the execution time of the proposed technique was lower than the robust Bayesian technique BRIDGE in all the cases, but this time is comperatively higher than the execution time of other techniques. This is one of the limitations of the proposed technique. Another limitation of the proposed technique is its compatibility only for the analyses between two groups. Although the proposed technique has several limitations, on the basis of above analyses of artificial datasets, we could conclude that the proposed RVP-based differential metabolite identification technique performs better than the nine existing methods.

Fig. 2
figure 2

Performance evaluation using ROC curves for different differential metabolite identification techniques a in the absence of outliers, b with 15% outliers, c zoom image of upper left region in absence of outliers, and d zoom image of upper left region with 15% outliers

Fig. 3
figure 3

Performance evaluation using box plots of 500 misclassification error rates (MERs) for different differential metabolite identification techniques a in the absence of outliers, and b with 15% outliers

Fig. 4
figure 4

Performance evaluation using box plots of 500 AUC values for different differential metabolite identification techniques a in the absence of outliers, and b with 15% outliers

Performance evaluation based on experimentally measured data

For the experimentally measured metabolomics (breast cancer) data, the performance of the proposed method was measured using differential metabolite identification from the experimental dataset and the modified experimental datasets. The experimental data was modified by artificially incorporating different rates (5%, 10% and 15%) of outliers. Methods that identified similar combinations of differential metabolites for the experimental dataset and the modified datasets were considered to be more outlier-robust. Additional file 1: Table S4 shows the number of differential metabolites identified by different methods for the original and modified datasets. While Additional file 1: Table S4 shows any differences in the number of differential metabolites identified by each method for the datasets in the absence and presence of outliers, we also need to know which methods identified similar combinations of differential metabolites. We use Venn diagrams to find methods that gave similar combinations of differential metabolites in the absence and presence of outliers. From Additional file 1: Table S4, we chose three techniques (Limma, FCROS and the proposed method) according to the lowest variability in the number of differential metabolites. The corresponding Venn diagrams are presented in Fig. 5. Venn diagrams for all methods are given in the Additional file 2: Figure S4. From Fig. 5, we observe that our proposed method produced similar combinations of differential metabolites in the absence and presence of outliers. Therefore, we conclude that our proposed method performs better than the nine existing techniques.

Fig. 5
figure 5

Performance evaluation using Venn diagrams for the number of differential metabolites identified by a RVP, b Limma, and c FCROS for the experimental dataset

Because we modified CVP to create an outlier-robust version called RVP, we also examine the results of these two methods for the experimental data. For the experimental dataset, CVP identified 36 metabolites as differential, whereas our proposed RVP identified 37 metabolites as differential (Fig. 6). The same 36 metabolites were identified by both methods, while RVP also identified cyclohexanone. From reviewing the literature, we found that cyclohexanone is a metabolite that is associated with breast cancer as well as several other cancer diseases (Table 1) [35,36,37,38,39,40]. This suggests that our method is more reliable for differential metabolite identification.

Fig. 6
figure 6

Differential metabolite identification for the experimental dataset using a CVP, and b RVP

Table 1 Literature review of cyclohexanone metabolite associated diseases

Sometimes, a set of metabolites may show the same pattern of behavior, in that if one of them is differential then the whole set is identified as differential. To identify the potential biomarkers from the 37 differential metabolites identified by RVP, we clustered the differential metabolites using hierarchical clustering (Fig. 7) and found the most important metabolites in each cluster (the importance score is calculated using a support vector machine (SVM) classifier with radial basis kernel function) (Fig. 7). From Fig. 7 (b), we obtained four clusters and chose the most important metabolite from each cluster according to the importance score in Fig. 7 (c). For the first cluster in Fig. 7 (b), there are 15 metabolites of which the most important is glutamate. Similarly, ethanolamine is the most important metabolite for the second cluster, pyruvic acid for the third cluster and cyclohexanone for the fourth cluster. These four metabolites (glutamate, ethanolamine, pyruvic acid, and cyclohexanone) may thus be biomarkers for breast cancer. Laboratory-based targeted metabolomics analysis to test this hypothesis could be an avenue for future research.

Fig. 7
figure 7

Metabolomic biomarker identification for breast cancer. a Heatmap plot of up-regulation and down-regulation for the 37 differential metabolites identified by the proposed method (red indicates cancer samples and blue indicates control samples). b Clustering of the 37 differential metabolites for the experimental dataset. Hierarchical clustering was used after normalizing the experimentally measured breast cancer dataset by auto-scaling. c Ranking of the 37 differential metabolites according to the importance score calculated using an SVM classifier with radial basis kernel function

Conclusions

Outlying observations weaken the performance of existing differential metabolite identification techniques. In this paper, we have proposed a new outlier-robust differential metabolite identification technique for identifying differential metabolites in the presence of outliers. To investigate the performance of our proposed method, we analyzed artificial data and experimental data in the absence and presence of outliers. We also compared the performance of our proposed method with nine existing differential metabolite identification techniques using the ROC curve, and the average MER, AUC and pAUC values. Both the artificial and experimental data analysis show that our proposed method performed better. The proposed RVP also identified an additional metabolite (cyclohexanone) that was overlooked by CVP, and it has been shown that this metabolite is associated with several cancer diseases. We recommend using the proposed method to identify differential metabolites from noisy metabolomics datasets.