# eUTOPIA: solUTion for Omics data PreprocessIng and Analysis

## Abstract

### Background

The application of microarrays in omics technologies enables the simultaneous quantification of many biomolecules. Microarrays are widely used to observe, by quantitative comparison, the positive or negative effect on biomolecule activity in a perturbed state versus the steady state. Community resources, such as Bioconductor and CRAN, host tools based on the R language that have become standard for high-throughput analytics. However, applying these tools is technically challenging for generic users and requires specific computational skills. There is a need for an intuitive and easy-to-use platform to process omics data and to visualize and interpret the results.

### Results

We propose an integrated software solution, eUTOPIA, that implements a set of essential processing steps as a guided workflow presented to the user as an R Shiny application.

### Conclusions

eUTOPIA allows researchers to perform preprocessing and analysis of microarray data via a simple and intuitive graphical interface while using state-of-the-art methods.

## Keywords

Transcriptomics analysis, Microarray, Gene expression, R Shiny

## Background

Omics data have become an integral part of biological studies as researchers leverage these techniques to obtain a broader perspective of complex biological phenomena. Scientific community resources such as CRAN and Bioconductor [1] contain an extensive collection of computational tools developed in the R programming language [2] to process omics data. Applying these tools, however, requires a deep understanding of the computational and statistical aspects of the methods employed, while integrating them into a workflow requires a certain degree of proficiency in computer programming. We tasked ourselves with creating a solution that implements “state of the art” practices to preprocess, statistically analyze, and visualize microarray data from multiple platforms. eUTOPIA’s workflow caters to researchers with varied levels of computational and statistical skills. This solution fills a void in the research environment by allowing experimental biologists to process microarray data while avoiding critical errors due to a lack of proficiency in programming languages and pipeline development. The guided interface simplifies the complex tasks of setting up the study and allows for more focus on the biological interpretation of the results rather than on the technical challenges of executing statistical methods and generating visual representations.

## Implementation

There are seven main steps in the workflow: ‘DATA INPUT’ (Additional file 1, p. 8–13), ‘QUALITY CONTROL’ & ‘FILTERING’ (Additional file 1, p. 14), ‘NORMALIZATION’ (Additional file 1, p. 15), ‘BATCH CORRECTION & ANNOTATION’ (Additional file 1, p. 20–29), and ‘DIFFERENTIAL ANALYSIS’ (Additional file 1, p. 30–31).

eUTOPIA requires the user to provide a detailed phenotype information file (Additional file 2) describing all biological and technical variables of the samples in the experiment.

A quality control report can be generated from the raw microarray data with the affyQCReport R package [4], the yaqcaffy R package [5], the arrayQualityMetrics R package [6], or the shinyMethyl R package [7], depending on the microarray platform. It is essential that poor quality probes are omitted from the experimental data prior to data normalization. In the gene expression specific platforms, this is accomplished by estimating the robustness of probe signals against the background (negative control probes). For Illumina methylation platforms, a detection *p*-value is computed with the minfi R package [8] by comparing the total DNA signal (methylated + unmethylated) against the background signal. The expression value or *p*-value threshold determined from the background signal is used to evaluate each probe across a percentage of samples specified by the user. Finally, the probes failing this evaluation are considered unreliable and are filtered out.
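The filtering rule described above can be illustrated with a minimal Python sketch. This is not eUTOPIA's implementation (which is written in R); the function name `filter_probes`, the parameters `threshold` and `min_fraction`, and the toy intensities are all assumptions made for illustration. A probe is kept when its signal exceeds the background-derived threshold in at least the user-specified fraction of samples.

```python
def filter_probes(expression, threshold, min_fraction):
    """Return the probe IDs whose signal exceeds `threshold` in at least
    `min_fraction` of the samples (hypothetical helper, illustration only)."""
    kept = []
    for probe_id, values in expression.items():
        # Fraction of samples in which this probe is above the background threshold.
        passing = sum(1 for v in values if v > threshold) / len(values)
        if passing >= min_fraction:
            kept.append(probe_id)
    return kept


# Toy log2 intensities for three probes across four samples (made-up values).
expression = {
    "probe_A": [9.1, 8.7, 9.4, 8.9],
    "probe_B": [5.2, 7.9, 5.0, 5.1],
    "probe_C": [7.6, 7.8, 5.9, 8.1],
}
kept = filter_probes(expression, threshold=7.5, min_fraction=0.75)
print(kept)
```

For detection *p*-values the comparison would be reversed (a probe passes in a sample when its *p*-value is below the threshold), but the per-probe fraction test stays the same.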

Normalization of the expression and methylation signal distributions across the samples is performed with methods from the limma [9] and minfi R packages, respectively. The *scale*, *quantile*, and *loess* methods perform normalization on log2-scaled intensities and ratios, whereas the *vsn* method uses a variance stabilizing transformation, which performs better for weakly expressed features. Normalization of Illumina methylation arrays is performed with methods from the minfi R package, with options for background subtraction, control feature measure, dye bias, and quantile normalization.
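To make the idea behind the *quantile* method concrete, the following Python sketch forces every sample to share the same empirical distribution by replacing each value with the mean of the values holding the same rank across samples. It is a simplified illustration under stated assumptions: the function name `quantile_normalize` and the toy data are hypothetical, ties are broken by position, and missing values are not handled, so the limma method referred to above may behave differently in those cases.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equally long sample columns
    (simplified sketch: no tie averaging, no missing values)."""
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all samples.
    rank_means = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_rows)
    ]
    normalized = []
    for col in columns:
        # Row indices of this sample in ascending order of value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        new_col = [0.0] * n_rows
        for rank, idx in enumerate(order):
            new_col[idx] = rank_means[rank]
        normalized.append(new_col)
    return normalized


# Three toy samples (columns) measured on four probes (made-up values).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
for col in normalized:
    print(col)
```

After this transformation every sample contains exactly the same set of values, only in a different probe order, which is why density curves of quantile-normalized arrays coincide.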

In microarray data analysis, a fundamental step is to attenuate the effects associated with technical variables (batch effects) while retaining the variation associated with biological variables. Batch effects can arise for multiple reasons, most commonly when the experiments are conducted in multiple batches and the data are pooled together for processing. These batches can contribute to the variability of the features and could introduce a systematic error in their assessment, ultimately leading to incorrect results in the worst scenario [10]. Batch effects can be caused by known variables (e.g., dye, RNA quality, experiment date) or by hidden sources of variation not explained by the known variables. Known biological (e.g., treatment, disease status, age, tissue) and technical (e.g., dye, array) variables are provided by the user in the phenotype information, while unknown sources of variation can be identified by using the *sva* function from the sva R package [11]. First, the impact of the technical variables is computed with the *prince* function from the swamp R package [12], and the correlation between biological and technical variables is evaluated from the confounding plot, which is generated with the *confounding* function from the same package. This information is used to identify batch variables, that is, known technical or surrogate variables that are associated with strong sources of variation and are not correlated with the biological variables of interest. These identified batch variables can be justifiably corrected to remove technical noise from the data. Finally, the correction is performed with the *ComBat* [13] function from the sva R package, which employs an empirical Bayes approach to estimate systematic batch biases affecting many genes.
The batch correction is carried out by specifying the variable of interest, any biological covariates, and a set of known batches or surrogate variables (obtained from the *sva* function described above). The batch correction process implemented in eUTOPIA applies the *ComBat* function iteratively to remove one batch covariate at a time while the rest are modeled as covariates of interest. The *ComBat* function can process only one batch covariate at a time, and this process of blocking other batches ensures clear separation of variation adjustment effects for each batch without any interference from the adjustment of other batches.
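The iteration scheme can be sketched in Python as a simple planning step. This is only a schematic of the order of operations, not the eUTOPIA code and not a reimplementation of *ComBat*: the function name `plan_iterative_correction` and the example variable names are hypothetical, and the sketch assumes that a batch which has already been corrected is dropped from the model of later iterations, which the text does not state explicitly.

```python
def plan_iterative_correction(variable_of_interest, covariates, batches):
    """List, for each iteration, the batch to remove and the terms kept in
    the model (hypothetical helper; assumes corrected batches are dropped)."""
    steps = []
    for i, batch in enumerate(batches):
        # Batches not yet corrected are protected as model covariates.
        remaining = batches[i + 1:]
        model_terms = [variable_of_interest] + covariates + remaining
        steps.append({"batch": batch, "model_terms": model_terms})
    return steps


# Illustrative variable names only.
steps = plan_iterative_correction("group", ["age"], ["array", "dye", "date"])
for step in steps:
    print("remove", step["batch"], "while modeling", step["model_terms"])
```

In an actual run, each step would correspond to one batch adjustment call on the output of the previous step, with the listed model terms supplied as the covariates to preserve.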

Linear models make it possible to model the covariate dependencies between samples. The differential analysis is performed with the linear model implementation in the R package limma. The *lmFit* function from the limma R package fits gene-wise linear models to the microarray data. The user defines the design of the model by providing the biological variable of interest and the covariates (biological and technical batch variables). The contrasts of interest are then specified to obtain contrast-specific coefficients from the original coefficients of the linear model. The *eBayes* function is applied to assess differential expression by using the fitted model with the contrast coefficients. Final reporting of the differentially expressed genes is performed with the *toptable* function, where *p*-values adjusted for multiple comparisons can be obtained by specifying one of the methods “Holm”, “Hochberg”, “Hommel”, “Bonferroni”, “Benjamini & Hochberg”, “Benjamini & Yekutieli”, or “False Discovery Rate”.

Differential analysis results for the comparisons defined by the user are reported in tabular format and through several visualizations. The dynamic plots help to perform a preliminary interpretation of the analysis results. The distribution of the differential features by fold-change magnitude and significance can be observed for a chosen contrast by means of volcano plots. The expression profile of user-specified top significant features from one or more contrasts can be inspected in the heatmap. The comparison of differential features from different contrasts by set intersections is represented as Venn diagrams or UpSet plots. The distribution of the signal for one or more genes of interest across sample annotations (e.g., experimental condition) can be inspected by means of box plots. A user manual with sample data analysis and plot descriptions is provided in Additional file 1.
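As an illustration of what a multiple-comparison adjustment does, the Python sketch below implements the standard Benjamini & Hochberg step-up procedure, one of the options listed for the differential analysis. It is a plain reference implementation for explanatory purposes; the function name `benjamini_hochberg` and the toy *p*-values are assumptions, and eUTOPIA itself relies on the R implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini & Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy raw p-values for four features (made-up values).
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Each raw *p*-value is scaled by the number of tests divided by its rank, and the running minimum guarantees that a feature with a smaller raw *p*-value never receives a larger adjusted one.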

## Results and discussion

The functional capabilities of eUTOPIA are showcased here by processing a publicly available dataset GSE92900 [14] obtained from the GEO [15] repository. The phenotype table provided in Additional file 2 is used for sample annotations.

In Fig. 2a, array 10 has high distance measures with respect to the rest of the arrays, as represented by the bright and dull yellow colored cells. This distance information can also be interpreted from the hierarchical cluster in the right margin of Fig. 2a, where array 10 clusters separately from the others. Outliers are detected based on this distance information as arrays with an exceptionally large distance to all other arrays. A summarized distance measure is determined for each array by summing up its distances to all other arrays, and this sum is checked against an outlier threshold obtained from the well-established interquartile range (IQR) rule known as Tukey's fences. The outliers can be observed in Fig. 2b, where array 10 has a large summarized distance from the rest of the arrays and is clearly recognizable as an outlier.
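The outlier rule described above can be sketched in a few lines of Python. The sketch assumes the conventional upper Tukey fence, Q3 + 1.5 × IQR, applied to the per-array sums of distances; the exact quartile definition and multiplier used by the quality control package are not stated in the text, so `k=1.5`, the "inclusive" quartile method, the function name `upper_fence_outliers`, and the toy distance matrix are all assumptions.

```python
import statistics


def upper_fence_outliers(distance_matrix, k=1.5):
    """Flag arrays whose summed distance to all other arrays exceeds
    the upper Tukey fence Q3 + k * IQR (illustrative sketch)."""
    sums = [sum(row) for row in distance_matrix]
    q1, _, q3 = statistics.quantiles(sums, n=4, method="inclusive")
    threshold = q3 + k * (q3 - q1)
    outliers = [i for i, s in enumerate(sums) if s > threshold]
    return sums, threshold, outliers


# Symmetric toy distance matrix for six arrays; the last one is far from the rest.
distances = [
    [0.0, 1.0, 1.5, 2.0, 1.2, 6.0],
    [1.0, 0.0, 1.1, 1.8, 1.4, 6.2],
    [1.5, 1.1, 0.0, 1.3, 1.6, 5.8],
    [2.0, 1.8, 1.3, 0.0, 1.9, 6.1],
    [1.2, 1.4, 1.6, 1.9, 0.0, 5.9],
    [6.0, 6.2, 5.8, 6.1, 5.9, 0.0],
]
sums, threshold, outliers = upper_fence_outliers(distances)
print("threshold:", threshold, "outlier indices:", outliers)
```

Only the upper fence is checked because an array with an unusually small summed distance is not a quality concern.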

The difference in the distribution of expression values in different arrays can be observed in the box plot before normalization (Fig. 3a), suggesting the need for adjustment of the distributions for fair comparison across the set of arrays. The distribution of expression values is harmonious across the arrays in the box plot after normalization with the quantile method (Fig. 3b).

The difference in the distribution of expression values from individual channels of different arrays can be observed from the smoothed curves in the density plot (Fig. 3c) before correction. Only a single smoothed curve is visible in the density plot after normalization (Fig. 3d) since the channels from all arrays have the same distribution of expression values as a result of quantile normalization.

The larger scale of the log2 ratios (M-values on the y-axis) observed in the MD plot before normalization (Fig. 3e) shows that the data points are farther away from a log2 expression ratio of zero, suggesting bias. The average log2 expression values also have a large scale (x-axis) before normalization. The smaller scale of the M-values (y-axis) observed in the MD plot after normalization (Fig. 3f) shows that the data points are much closer to a log2 expression ratio of zero, as the bias has been adjusted. From this plot, it is also possible to appreciate how the average log2 expression values have a much smaller scale (x-axis) after normalization.
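For readers unfamiliar with the quantities on the two axes, the Python sketch below computes them for a two-channel array under the usual definitions: the M-value is the log2 ratio of the two channel intensities and the average is the mean of their log2 intensities. The function name `md_values`, the channel labels `red` and `green`, and the toy intensities are assumptions for illustration.

```python
import math


def md_values(red, green):
    """Return (M, A) lists for paired channel intensities:
    M = log2(red) - log2(green), A = mean of the two log2 intensities."""
    m_values, a_values = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)
        a_values.append((log_r + log_g) / 2)
    return m_values, a_values


# Toy raw intensities for three spots (made-up values).
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 128.0]
m, a = md_values(red, green)
print("M:", m)
print("A:", a)
```

A spot with equal intensity in both channels has an M-value of zero, which is why a well-normalized array shows points centered on the horizontal zero line.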

[…] (*p*-value: 0.004) with only the second principal component, while the identified batch variable ‘array’ was significantly associated with all three principal components (*p*-values: 3e-06, 0.02, and 0.06, respectively) and ‘n.mice’ was significantly associated with the first principal component (*p*-value: 0.03). In the prince plot generated from the data after batch correction (Fig. 5b), the first three principal components represent 49, 23, and 11% variation in the data, which is a significant shift. The variable of interest ‘group’ is now observed to have a high association with all three principal components (*p*-values: 1e-13, 3e-14, and 7e-10, respectively). The batch variable ‘array’ has a comparatively less significant association with the first principal component (*p*-value: 0.06), and the batch variable ‘n.mice’ also has a comparatively weaker association with the third principal component (*p*-value: 0.06). The batch variables ‘slide’, ‘area’, ‘operator’, ‘date’, and ‘dye’ are no longer associated with the principal components representing high variation. This allows the user to check that the variation associated with the variable of interest is preserved while the noise associated with the batch variables has been corrected. The principal component analysis (PCA) plot generated from the data before batch correction (Fig. 5c) displays samples scattered across the projected components with no obvious grouping of samples by the variable of interest. The PCA plot after the known batch correction (Fig. 5d) displays a more discrete grouping of samples (smaller intragroup distances) and better separation of groups (intergroup distances) in the projected components.

[…] (*p*-value: 0.004) with the second principal component. The surrogate variable ‘svaD.1’ was significantly associated with the first principal component (*p*-value: 6e-04), ‘svaD.2’ was significantly associated with the second and third principal components (*p*-values: 0.05 and 5e-05, respectively), and ‘svaD.3’ was significantly associated with the third principal component (*p*-value: 0.06). In the prince plot generated from the data after batch correction (Fig. 7b), the variable of interest ‘group’ is now observed to have a high association with the first principal component (*p*-value: 3e-07). The discretized surrogate variables ‘svaD.1’, ‘svaD.2’, and ‘svaD.3’ are no longer associated with the first three principal components representing high variation. The removal of hidden batch effects with surrogate variables removes artefactual technical variation from the data, thus revealing the true variation signal of the variable of interest. The PCA plot generated from the data before batch correction (Fig. 7c) displays samples scattered across the projected components with no obvious grouping of samples by the variable of interest ‘group’. The grouping of samples in the PCA from the corrected data (Fig. 7d) is more discrete than in the uncorrected data, but the separation between the groups (intergroup distances) is not very clear: only ‘rCNT’ and ‘Ctrl’ show visibly distinct separation (Fig. 7d), while the samples from the other groups are much closer. The tighter grouping of samples can be observed from the circular outlines drawn around the groups in Fig. 7d.

## Conclusion

eUTOPIA allows users to reliably process microarray data and generate visual interpretations of the results seamlessly via a simple yet intuitive interface. It is focused on the preprocessing of microarray data, thus providing an agile and robust alternative to more comprehensive tools. Both commercially available and free software for microarray data preprocessing either provide limited methodological options or force the user to design an appropriate pipeline from the tools provided. This can be a daunting task for researchers with limited experience in omics data analysis. eUTOPIA is designed to balance both reliability and flexibility.

The case study showcases the features of eUTOPIA that enable the user to perform array data preprocessing and analysis. eUTOPIA’s guided workflow helps to easily preprocess the data, identify sources of unwanted variation, remove them, and evaluate the sanity of the corrected data. The graphical user interface makes it easy to define a linear model to test the data, and the wide choice of dynamic plots helps to classify and characterize conditions by sets of differential features and expression patterns.

We compared eUTOPIA against a set of microarray analysis tools that are free for academic use and have a graphical interface, namely AGA [16], shinyMethyl, MeV [17], O-miner [18], Chipster [19], and Babelomics [20]. The comparison table (Additional file 4) evaluates the tools over a list of implemented analytical steps and supported data platforms. One major feature that is not supported by most of the compared tools is the batch correction of known variables and, more noticeably, of surrogate variables. eUTOPIA’s analysis workflow integrates the visual representation of sample annotations and principal components of variation to identify batch variables, along with the ability to perform correction of batch effects seamlessly. Furthermore, it is the only tool among those compared that incorporates surrogate variable (hidden batch) identification, visualization, and correction. This process of batch identification and correction is of extreme importance for microarray analysis because it can help to isolate technical noise from the biological signal. In this comparison, Chipster has the most comprehensive toolbox; it provides the most features, which even extend beyond microarray analysis. However, these features are presented as separate tools with no specific workflows or guidelines for choosing the optimal set of tools. This can pose a challenge to users who need to design analysis workflows by combining these tools appropriately. In contrast, eUTOPIA incorporates a specific set of tools in a streamlined workflow to ensure intuitiveness and ease of use. eUTOPIA does not impose the technical challenges of workflow design on the user, thus allowing them to focus more on the biological aspects of the data and results.

## Availability and requirements

**Project name:** eUTOPIA

**Project home page:** https://github.com/Greco-Lab/eUTOPIA

**Operating system(s):** Platform independent

**Programming language:** R Shiny

**License:** GNU GPL 3.

**Any restrictions to use by non-academics:** none

## Notes

### Acknowledgments

We would like to thank Nanna Fyhrquist (University of Helsinki, Karolinska Institute) and Marit Ilves (University of Helsinki) for providing valuable user feedback during the development process.

### Funding

This study was supported by the Academy of Finland (grant agreements 275151 and 292307) and EU H2020 LIFEPATH (grant agreement 633666).

### Availability of data and materials

Not Applicable.

### Authors’ contributions

VM implemented the stepwise guided workflow for microarray data analysis as an interactive R Shiny app and wrote the manuscript. GS defined the framework for Illumina methylation analysis, performed evaluation of the implemented R Shiny app, and critically evaluated the manuscript. PK contributed to the case study and performed critical evaluation of the manuscript. AS contributed to the implementation of the R Shiny app, performed evaluation of the implemented R Shiny app, and critically evaluated the manuscript. HA performed evaluation of the implemented R Shiny app. VF developed the methods and framework for the Agilent microarray analysis, developed the methodology of batch effect mitigation, and critically evaluated the manuscript. DG conceived and supervised the project, contributed to the development of the analysis framework and R Shiny app implementation, and critically evaluated the manuscript. All authors reviewed and approved the final manuscript.

### Ethics approval and consent to participate

Not Applicable.

### Consent for publication

Not Applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12:115–21.
2. R Core Team. R: a language and environment for statistical computing [Internet]. Vienna: R Foundation for Statistical Computing; 2018. Available from: https://www.R-project.org/
3. Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J. shiny: Web Application Framework for R [Internet]. 2017. Available from: https://CRAN.R-project.org/package=shiny
4. Parman C, Halling C, Gentleman R. affyQCReport: QC Report Generation for affyBatch objects; 2017.
5. Gatto L. yaqcaffy: Affymetrix expression data quality control and reproducibility analysis; 2017.
6. Kauffmann A, Gentleman R, Huber W. arrayQualityMetrics—a bioconductor package for quality assessment of microarray data. Bioinformatics. 2009;25:415–6.
7. Fortin J-P, Fertig E, Hansen K. shinyMethyl: interactive quality control of Illumina 450k DNA methylation arrays in R. F1000Res. 2014;3.
8. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9.
9. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.
10. Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, et al. Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform. 2013;14:469–90.
11. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28:882–3.
12. Lauss M. swamp: Visualization, Analysis and Adjustment of High-Dimensional Data in Respect to Sample Annotations [Internet]. 2017. Available from: https://CRAN.R-project.org/package=swamp
13. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–27.
14. Kinaret P, Marwah V, Fortino V, Ilves M, Wolff H, Ruokolainen L, et al. Network analysis reveals similar transcriptomic responses to intrinsic properties of carbon nanomaterials in vitro and in vivo. ACS Nano. 2017;11:3786–96 (GEO accession GSE92900).
15. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013;41:D991–5.
16. Considine M, Parker H, Wei Y, Xia X, Cope L, Ochs M, et al. AGA: interactive pipeline for reproducible gene expression and DNA methylation data analyses. F1000Res. 2015;4.
17. Howe EA, Sinha R, Schlauch D, Quackenbush J. RNA-Seq analysis in MeV. Bioinformatics. 2011;27:3209–10.
18. Cutts RJ, Dayem Ullah AZ, Sangaralingam A, Gadaleta E, Lemoine NR, Chelala C. O-miner: an integrative platform for automated analysis and mining of -omics data. Nucleic Acids Res. 2012;40:W560–8.
19. Kallio MA, Tuimala JT, Hupponen T, Klemelä P, Gentile M, Scheinin I, et al. Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics. 2011;12:507.
20. Alonso R, Salavert F, Garcia-Garcia F, Carbonell-Caballero J, Bleda M, Garcia-Alonso L, et al. Babelomics 5.0: functional interpretation for new generations of genomic data. Nucleic Acids Res. 2015;43:W117–21.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.