Integrating omics datasets with the OmicsPLS package
Abstract
Background
With the exponential growth in available biomedical data, there is a need for data integration methods that can extract information about relationships between the data sets. However, these data sets might have very different characteristics. For interpretable results, dataspecific variation needs to be quantified. For this task, Twoway Orthogonal Partial Least Squares (O2PLS) has been proposed. To facilitate application and development of the methodology, free and opensource software is required. However, this is not the case with O2PLS.
Results
We introduce OmicsPLS, an opensource implementation of the O2PLS method in R. It can handle both low and highdimensional datasets efficiently. Generic methods for inspecting and visualizing results are implemented. Both a standard and faster alternative crossvalidation methods are available to determine the number of components. A simulation study shows good performance of OmicsPLS compared to alternatives, in terms of accuracy and CPU runtime. We demonstrate OmicsPLS by integrating genetic and glycomic data.
Conclusions
We propose the OmicsPLS R package: a free and opensource implementation of O2PLS for statistical data integration. OmicsPLS is available at https://cran.rproject.org/package=OmicsPLS and can be installed in R via install.packages(“OmicsPLS”).
Keywords
Omics data integration Joint principal components Dataspecific variation R package O2PLSAbbreviations
 Genetic PCs
Genetic principal components
 JIVE
Joint and individual variances explained
 O2PLS
Twoway orthogonal partial least squares
Background
With rapid advances in technology, several levels of biological variation can be measured. Consequently, multiple omics data sets are available on the same set of subjects. For a better understanding of the underlying biological systems, these data should be analyzed simultaneously [1].
Several data integration methods have been developed that estimate joint parts while ignoring some of the dataspecific characteristics. For example, Partial Least Squares (PLS) [2] maximizes the covariance to calculate joint principal components. Canonical Correlation Analysis (CCA) [3] considers correlation rather than covariance. Several other methods perform analysis on a concatenated version of the data sets, such as Simultaneous Component Analysis (SCA) [4]. For many data integration methods, open source software packages are available [5]. In particular, the mixOmics R package implements several variants of PCA, PLS and CCA [6].
Omics data sets might be heterogeneous in that they typically differ in dataspecific characteristics, such as size, scale, distribution and experimental error. This hampers the estimation of joint parts between these data. For correct interpretation of data integration results, dataspecific variation should be modeled [7]. This variation captures information specific to each data set. Furthermore, it can distort interpretation of the estimated joint part [8]. Therefore, we consider approaches that estimate both joint and specific components. Such approaches include TwoWay Orthogonal PLS (O2PLS) [8], JIVE [9] and DISCOSCA [10]. O2PLS considers two data sets and decomposes both in a joint, specific and residual part. The joint parts are calculated by maximizing the covariation between the two data sets, while correcting for dataspecific variation. JIVE uses iterative PCA on the concatenation of multiple datasets to alternately find joint and dataspecific parts. DISCOSCA performs SCA and rotates the solution to obtain joint and specific components for each data set.
In the JIVE and DISCOSCA approach, the joint and specific components are constrained to be orthogonal to each other. Moreover, they assume that the data sets share exactly the same joint latent variables. O2PLS only imposes orthogonality of the components within each part and assumes correlated joint latent variables for each data set. Therefore, we expect a better performance of O2PLS in complex situations.
O2PLS is implemented within the software package SIMCA [11], which is closedsource and commercial. Unavailability of source code hampers developing and extending the methodology. No free and open source alternative implementing O2PLS is available to the best of our knowledge. Therefore, we propose OmicsPLS, a free and opensource R software package to decompose two datasets into joint and specific parts. With regard to the other methods, DISCOSCA [12] is available only from the commercial computing environment MATLAB, whereas JIVE is freely available in the r.jive package [13]. Therefore, we compare OmicsPLS to r.jive.
Our aim is to provide easy access to both the method and visualization tools and to facilitate the development of more advanced methodology. The rest of the article is organized as follows. First, we discuss the implementation of OmicsPLS in detail. Second, the OmicsPLS package is illustrated using genetic and glycan data from a Croatian population cohort. We also apply JIVE to these data. Motivated by the data analysis, we conduct a simulation study to compare OmicsPLS to r.jive in terms of estimation accuracy, execution time and robustness against the presence of dataspecific characteristics. Finally, we discuss future extensions of OmicsPLS.
Implementation
O2PLS model
Each row of X and Y contains measurements on the same subject. Throughout the paper, it is assumed that the columns of X and Y are centered around zero. The relationship between T and U is given by the linear model U=TB_{T}+H or T=UB_{U}+H^{′}. Here, B_{T} and B_{U} are square matrices of size n, representing regression coefficients for the two models. The particular choice of the model does not affect the estimates, as the O2PLS algorithm is symmetric in X and Y.
Note that, in PLS, only a joint and a residual part is considered for each data set. Any dataspecific variation is absorbed by these two parts. This makes interpretation of PLS results more difficult, as the estimated loadings may be biased and the correlation between the joint scores typically seem weaker. O2PLS restricts the joint loadings W and C and the specific scores T_{Y⊥} and U_{X⊥} to have orthonormal columns. JIVE and DISCOSCA additionally restrict the columns of the matrices [ WP_{Y⊥}] and [ CP_{X⊥}] to be orthonormal. Furthermore, both methods assume that U=T, while O2PLS only assumes a linear relation between U and T.
The O2PLS algorithm for estimating the O2PLS components is provided in [8]. Briefly, singular vectors of the covariance matrix X^{⊤}Y are calculated. From these vectors, loadings and scores containing both joint and specific variation are estimated. Then, specific variation is estimated using SVD and subtracted from the data. Finally, using the corrected data, the joint parts are reestimated.
Interpretation Within each part, the components have a similar interpretation as PCA. In particular, the loading value w_{jk} indicates the importance of the variable x_{j} for component k. If w_{jk} and \(\phantom {\dot {i}\!}w_{j^{\prime }k}\) have the same sign, the corresponding variables x_{j} and \(\phantom {\dot {i}\!}x_{j^{\prime }}\) are positively correlated within component k. The same interpretation holds for the other parts. The scores can be used to define similarity between subjects within each component: for example, if \(\phantom {\dot {i}\!}t_{ik} \approx t_{i^{\prime }k}\), then subjects i and i^{′} are similar in component k. Between the joint parts, in the k’th joint component, the loading values w_{jk} and \(\phantom {\dot {i}\!}c_{j^{\prime }k}\) indicate correlation between x_{j} and \(\phantom {\dot {i}\!}y_{j^{\prime }}\). High positive or negative loading values indicate high positive or negative correlation in this component between these variables, respectively. As a consequence, the joint loading values w_{k} and c_{k} can be sorted to prioritize variables in X and Y based on high covariation.
Implementation

Crossvalidating: Functions to determine the number of O2PLS components.

Fitting: Functions to fit the O2PLS model.

Summarizing & visualizing: Functions to summarize and visualize the results.
Crossvalidating. Crossvalidation is a wellknown technique to choose tuning parameters of a model, while limiting the risk to overfit. All samples are partitioned in k blocks (denoted as folds), and the model is fitted on k−1 folds. The left out fold is used to evaluate the model fit. For O2PLS, an approach to determine the number of components is to maximize the prediction error over a threedimensional grid of possible integers and select the triple (n,n_{X},n_{Y}) that minimizes this error. As O2PLS is symmetric in X and Y, the sum of the two prediction errors \(Y  \hat {Y}^{2}+X  \hat {X}^{2}\) is taken as error measure. Here, \(A^{2} := \sum _{ij} a_{ij}^{2}\). This approach is implemented in the crossval_o2m function:
crossval_o2m(X, Y, a, ax, ay, nr_folds)
Here, a, ax and ay are vectors of integers to consider for the number of components n, n_{X} and n_{Y}. The vector a must have positive elements, while both ax and ay may contain zeros. The number of folds is specified by nr_folds and should be between two and N. The crossval_o2m function returns a threedimensional array with the prediction errors.
Crossvalidation over a threedimensional grid can be computationally intensive, especially with many grid points. For this reason, we have proposed an alternative crossvalidation procedure [14]. The rationale behind this approach lies in the interpretation of the specific parts: specific variation in the data will affect the joint scores, thereby reducing the covariance between T and U. Correcting for specific variation will increase this covariance. On the other hand, overcorrecting will again reduce the covariance between the joint scores. Candidates for n_{X} and n_{Y}, given n, are those integers for which the covariance of the joint scores are maximized. This approach is called by:
crossval_o2m_adjR2(X, Y, a, ax, ay, nr_folds)
It performs the crossvalidation over a onedimensional grid a, while maximizing the covariance between the joint scores T and U over a twodimensional grid given by ax and ay. The last maximization does not involve crossvalidation. Consequently, the looping over nr_folds folds is omitted in two dimensions. This can drastically reduce computation time, while often yielding similar minimizers to those obtained with the full crossvalidation approach. The output is a matrix containing the prediction errors and the number of components (n,n_{X},n_{Y}).
Note that these two crossvalidation strategies can be combined: The alternative crossvalidation is used to find candidate minimizers of the prediction error. Based on these minimizers, a threedimensional grid is constructed on which the full crossvalidation is performed. Both crossvalidation implementations support parallel computation.
Fitting In its simplest form, the function call for fitting the O2PLS model is o2m(X, Y, n, nx, ny) The input parameters are the two data matrices X and Y, and the number of components in the joint, Xspecific and Yspecific part. The output is a list containing scores and loadings in the notation of [8], as well as proportions of explained variance and residual matrices; these proportions are defined below. The user can choose a ‘stripped’ output, by adding stripped=TRUE as an argument, to discard the residual matrices (and reduce memory usage).
By default, a Singular Value Decomposition (SVD) of the covariance matrix between X and Y is used to calculate joint and specific components. If both X and Y are highdimensional, the covariance matrix X^{⊤}Y will use a high amount of memory. Therefore, an alternative algorithm is implemented in the OmicsPLS package, named NIPALS [2]. The NIPALS algorithm is an iterative algorithm that avoids construction and storage of the covariance matrix. Moreover, the NIPALSbased joint components are numerically equal to the SVDbased PLS components (up to sign) if the number of NIPALS iterations is large enough. In the case that p or q is not too large, the NIPALS approach can be slower than the SVD approach. Therefore, a check on data dimensionality is performed to determine the proper approach. The threshold is by default at p=q=3000 and can be adjusted.

variation in X and Y explained by the joint, specific and residual parts, e.g., TW^{⊤}/X.

variation in U and T that is predictable by T resp. U, e.g., TB_{T}^{2}/U^{2}.
Note that the proportion of predictable variation in Y by X is then TB_{T}^{2}/U^{2}×UC^{⊤}^{2}/Y^{2}=TB_{T}^{2}/Y^{2}.
The OmicsPLS package provides a flexible framework to plot loadings in each component. As this framework is built on the ggplot2 package, several plotting layers can be added to enhance visualization and aid interpretation of the results. The command for constructing a plot is
plot(x, loading_name).
Here x is the O2PLS fit and the only required object. The parameter loading_name represents which of the four parts (Xjoint, Yjoint, Xspecific or Yspecific) should be plotted. The plot command calls geom_text from the ggplot2 package. Its documentation contains information about editing, for example, text color, transparency and size. These attributes can be changed within the OmicsPLS plot function.
Results
Analysis of genetic and glycomic data
We consider p=333858 genotyped Single Nucleotide Polymorphisms (SNPs) and q=20 quantified IgG1 glycan (glycopeptide) abundances, measured with nanoLCESIMS, for N=885 participants in the CROATIA_Korcula cohort [15]. Both data sets contain highly correlated measurements and are heterogeneous (as they differ in scale, distribution and measurement error).
Our aim is to investigate how IgG1 glycans relate to genetic variation by determining the statistical overlap between IgG1 glycan data and genetic data, as in Eq. (1). To this end, we use the OmicsPLS package to obtain estimates of the amount of joint variation and estimate the contribution of the genetic and IgG1 glycan measurements to this joint variation.
The SNPs were summarized by taking, for each gene (in the UCSC hg18 database), all SNPs within 50 kilobases from that gene and applying Principal Components Analysis. For each gene, the set of corresponding SNPs were replaced by as many principal components as needed to explain at least 80% of this set of SNPs. This provided a new data set with one or several variables, which we denote as Genetic PCs, per gene. This ‘Genetic PCs’ data set contains 37819 variables and is referred to as X. The glycan measurements were logtransformed, batchcorrected [16] and quantilenormalized [17]. The resulting data matrix is referred to as Y.
Top 5 genes and loading values of the GeneticGlycan joint principal components
Component 1: ‘average’ glycan  Component 2: ‘fucosylation’  Component 3: ‘galactosylation’  

Gene symbol  Loading value  Gene symbol  Loading value  Gene symbol  Loading value 
DNAJC10  0.0929  FUT8  0.0844  MTO1  0.0875 
ARID3B  0.0880  LGALS8  0.0781  AKAP9  0.0627 
ZNF502  0.0756  LDB3  0.0766  MRPL33  0.0622 
TBC1D13  0.0611  ARID3B  0.0701  MYLPF  0.0562 
ZC2HC1C  0.0601  LCE2D  0.0677  POLR2F  0.0554 
For comparison purposes, r.jive was also applied to the data. However, the algorithm did not converge after 500 iterations (and 3000 s). We will investigate possible reasons in the simulation study.
Simulation study
where U=T. Note that in the O2PLS formulation, B_{T}=I_{r} and H=0. In the first scenario, we take N=500, p=q=100, n=2, n_{X}=3 and n_{Y}=1. In the second scenario, we consider p=q=10^{4}. Elements of W, C, P_{Y⊥} and P_{X⊥} are drawn independently from a standard normal distribution. The JIVE constraints are applied by orthogonalizing each column in both joint and specific parts with respect to each other. Elements of T, T_{Y⊥} and U_{X⊥} are drawn independently from a standard normal distribution. Noise, represented by E and F, is added to X and Y to account for about 10% of the total variation. For both r.jive and OmicsPLS, loading matrices are extracted. To evaluate estimation accuracy, the absolute value of the inner product between corresponding columns are calculated. Here, higher values represent lower estimation errors. For each scenario, we generated 1000 replicates.
To investigate the lack of convergence of r.jive in the data analysis, two additional scenarios are considered. In the first additional scenario, elements in U have a standard deviation of 10, i.e., U=10T. In the second additional scenario, elements in the specific parts will be normally distributed with a standard deviation of 10. The dimensions and sample size are taken as above. Note that both scenarios represent an ‘imbalance’ in the amount of variation per part. Here, r.jive is considered converged if it needs less than 500 steps. In these additional scenarios, we generated 100 replicates.
Simulation results for OmicsPLS and r.jive: inner products
OmicsPLS  r.jive  

X joint  0.88 (0.09)  0.88 (0.09) 
X specific  0.79 (0.08)  0.78 (0.09) 
Y joint  0.85 (0.08)  0.85 (0.08) 
Y specific  0.93 (0.013)  0.92 (0.014) 
Performance comparison of OmicsPLS and r.jive w.r.t. median (MAD) total elapsed time in seconds across 1000 replicates, and convergence across 100 runs
CPU time (sec)  Convergence (%)  

Dimensions  OmicsPLS  r.jive  OmicsPLS  r.jive 
Low (p=q=100)  0.04 (0.007)  14 (2.8)  100  9 
High (p=q=10^{4})  18 (4.1)  132 (16)  100  8 
R code for the data analysis and simulation study is available as an online supplement (Additional file 2).
Discussion
In this article, we introduced the OmicsPLS package for integration of two (omics) data sets. We evaluated its performance with a simulation study and demonstrated it using genetic and IgG1 glycomic data. Regarding the data analysis, the proportion of joint variation in the Genetic PCs data set was 0.8%. This proportion is expected to be small since it is not likely that a large fraction of genetic variation (in particular SNPs) is related to IgG1 glycosylation. In the joint components, several genes were found that might play a role in the genetic regulation of IgG1 glycosylation. Some of these genes are known to be directly involved (e.g., FUT8), while others (DNAJC10 and AKAP9) are localized to cell compartments where the majority of glycosylation takes place (the endoplasmic reticulum and Golgi Apparatus). However, much is still unknown about the genetic regulation of (IgG) glycosylation.
Additionally, we considered JIVE for this type of data, but without success: the algorithm did not converge. A potential cause for this lack of convergence is the different dataspecific characteristics of the two data sets. In particular, the dimensionality and amount of variation differ. Therefore, the JIVE assumption U=T might not be reasonable. This is confirmed by our simulation: the r.jive algorithm is not robust against an ‘imbalance’ in the amount of variation between the two joint parts, or between the joint and specific parts. In particular, when U=10T, r.jive did not converge in more than 90% of the replicates. This suggests that r.jive might be inappropriate for analyzing heterogeneous data sets (in which dataspecific characteristics differ across data sets). Note that in DISCOSCA the same assumption (U=T) is made, therefore we expect a suboptimal performance of this method as well when analyzing heterogeneous data.
As part of a future update of the OmicsPLS software package, we intend to deal with missing data. To impute missing values and simultaneously estimate O2PLS components, the OmicsPLS algorithm can be extended [2]. The imputation step can also be performed prior to analysis. For multiple omics data, Ensemble Regression Imputation [18] and Multiple Factor Analysis imputation [19] have been proposed. Note that, as with all imputation methods, uncertainty due to missing data should be assessed and presented to the user. A probabilistic framework for O2PLS would facilitate imputation and simultaneously addresses additional uncertainty due to missing data.
An important extension of OmicsPLS involves obtaining standard errors for the estimates. To this end, bootstrap approaches, similar to those found in PLS literature, can be applied [20]. A drawback of using resampling methods is the computational burden, especially with highdimensional data sets. To avoid such procedures, a probabilistic framework for O2PLS can be used to directly calculate asymptotic standard errors.
Interpretability of the OmicsPLS output can be increased by extending the algorithm to produce sparse estimates. This extension can be implemented by considering Sparse PLS [21] or by considering a probabilistic framework for O2PLS and obtaining penalized maximum likelihood estimates.
We are currently investigating the possibilities of Probabilistic O2PLS for data integration, which facilitates multiple imputation and statistical inference, such as calculation of asymptotic standard errors. By penalizing the likelihood, sparse estimates can be obtained.
As OmicsPLS is opensource, it is straightforward to extend the current implementation to handle more complex situations. For example, in the GitHub repository, several ‘branches’ can be initialized in which new functionalities can be developed.
Conclusion
We propose OmicsPLS, an opensource and freely available R package for robust integration of heterogeneous data with O2PLS. It includes functions to determine the number of components, fit, and inspect results. For highdimensional data, a memoryefficient implementation is used.
Availability and requirements

Project name: OmicsPLS

Project home page: https://github.com/selbouhaddani/OmicsPLS

Operating systems: Linux, Mac OS, Windows

Programming language: R

License: GPL3

Any restrictions to use by nonacademics: none.
Notes
Acknowledgements
The authors would like to thank the Editor and the reviewers for their valuable comments and suggestions. We would like to acknowledge the MIMOmics consortium and the staff of several institutions in Croatia that supported the fieldwork, including but not limited to The University of Split and Zagreb Medical Schools, Institute for Anthropological Research in Zagreb and the Croatian Institute for Public Health.
Funding
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7HealthF52012) under grant agreement number 305280 (MIMOmics).
The CROATIA_Korcula study was funded by grants from the Medical Research Council (UK), European Commission Framework 6 project EUROSPAN (Contract No. LSHGCT2006018947), FP7 contract BBMRILPC (grant No. 313010), Croatian Science Foundation (grant 8875) and the Republic of Croatia Ministry of Science, Education and Sports (21610803150302). Glycome analysis was supported by the European Commission HighGlycan (contract No. 278535), MIMOmics (contract No. 305280), HTPGlycoMet (contract No. 324400), IntegraLife (contract No. 315997). The funding bodies did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Availability of data and materials
The data that support the findings of this study are available from C.H. [caroline.hayward@igmm.ed.ac.uk], but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of C.H.
Authors’ contributions
SB, HWU, SK, GJ and JH designed the study, interpreted the results and drafted the manuscript. C.H. acquired the data. L.K interpreted the data analysis results. S.B carried out the computational aspect of the work. S.B conceived, designed and directed the study. All authors read and approved the final version of the manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
 1.Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015; 16(2):85–97. https://doi.org/10.1038/nrg3868.CrossRefGoogle Scholar
 2.Wold H. Nonlinear iterative partial least squares (NIPALS) modelling: some current developments. In: Multivar. Anal. III (Proc. Third Internat. Symp. Wright State Univ., Dayton, Ohio, 1972). New York: Academic Press: 1973. p. 383–407.Google Scholar
 3.Hotelling H. Relations between two sets of variates. Biometrika. 1936; 28(34):321–77. https://doi.org/10.1093/biomet/28.34.321.CrossRefGoogle Scholar
 4.Van Deun K, Smilde AK, van der Werf MJ, Kiers HaL, Van Mechelen I. A structured overview of simultaneous component based data integration. BMC Bioinformatics. 2009; 10:246. https://doi.org/10.1186/1471210510246.CrossRefGoogle Scholar
 5.Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM, Culhane AC. Dimension reduction techniques for the integrative analysis of multiomics data. Brief Bioinform. 2016; 17(October 2015):108. https://doi.org/10.1093/bib/bbv108.Google Scholar
 6.Rohart F, Gautier B, Singh A, Lê Cao KA. mixOmics: An R package for ’omics feature selection and multiple data integration. PLoS Comput Biol. 2017; 13(11):1005752. https://doi.org/10.1371/journal.pcbi.1005752.CrossRefGoogle Scholar
 7.van der Kloet FM, SebastiánLeón P, Conesa A, Smilde AK, Westerhuis JA. Separating common from distinctive variation. BMC Bioinformatics. 2016; 17(S5):195. https://doi.org/10.1186/s1285901610372.CrossRefGoogle Scholar
 8.Trygg J, Wold S. O2PLS, a twoblock (X–Y) latent variable regression (LVR) method with an integral OSC filter. J Chemom. 2003; 17(1):53–64. https://doi.org/10.1002/cem.775.CrossRefGoogle Scholar
 9.Lock EF, Hoadley Ka, Marron JS, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013; 7(1):523. 10.1214/12AOAS597. 1102.4110.CrossRefGoogle Scholar
 10.Van Deun K, Van Mechelen I, Thorrez L, Schouteden M, De Moor B, van der Werf MJ, De Lathauwer L, Smilde AK, Kiers HaL. DISCOSCA and Properly Applied GSVD as Swinging Methods to Find Common and Distinctive Processes. PLoS ONE. 2012; 7(5):37840. https://doi.org/10.1371/journal.pone.0037840.CrossRefGoogle Scholar
 11.UMetrics. SIMCA O2PLS software. 2017. http://umetrics.com/products/simca.
 12.KULEUVEN. DISCOSCA software. 2017. https://ppw.kuleuven.be/okp/software/discosca/. Accessed 05 May 2017.
 13.O’Connell MJ, Lock EF. R.JIVE for exploration of multisource molecular data. Bioinformatics. 2016; 32(June):324. https://doi.org/10.1093/bioinformatics/btw324.Google Scholar
 14.el Bouhaddani S, HouwingDuistermaat J, Salo P, Perola M, Jongbloed G, Uh HW. Evaluation of O2PLS in Omics data integration. BMC Bioinformatics. 2016;17(2). https://doi.org/10.1186/s128590150854z.
 15.Lauc G, Huffman JE, Pučić M, Zgaga L, Adamczyk B, Mužinić A, Novokmet M, Polašek O, Gornik O, Krištić J, Keser T, Vitart V, Scheijen B, Uh HW, Molokhia M, Patrick AL, McKeigue P, Kolčić I, Lukić IK, Swann O, van Leeuwen FN, Ruhaak LR, HouwingDuistermaat JJ, Slagboom PE, Beekman M, de Craen AJM, Deelder AM, Zeng Q, Wang W, Hastie ND, Gyllensten U, Wilson JF, Wuhrer M, Wright AF, Rudd PM, Hayward C, Aulchenko Y, Campbell H, Rudan I. Loci Associated with NGlycosylation of Human Immunoglobulin G Show Pleiotropy with Autoimmune Diseases and Haematological Cancers. PLoS Genet. 2013; 9(1):1003225. https://doi.org/10.1371/journal.pgen.1003225.CrossRefGoogle Scholar
 16.Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007; 8(1):118–27. https://doi.org/10.1093/biostatistics/kxj037.CrossRefGoogle Scholar
 17.Amaratunga D, Cabrera J. Analysis of Data From Viral DNA Microchips. J Am Stat Assoc. 2001; 96(456):1161–70. https://doi.org/10.1198/016214501753381814.CrossRefGoogle Scholar
 18.Lin D, Zhang J, Li J, Xu C, Deng Hw, Wang Yp. An integrative imputation method based on multiomics datasets. BMC Bioinformatics. 2016; 17(1):247. https://doi.org/10.1186/s1285901611226.CrossRefGoogle Scholar
 19.Voillet V, Besse P, Liaubet L, San Cristobal M, González I. Handling missing rows in multiomics data integration: multiple imputation in multiple factor analysis framework. BMC Bioinformatics. 2016; 17(1):402. https://doi.org/10.1186/s1285901612735.CrossRefGoogle Scholar
 20.Wehrens R, van der Linden WE. Bootstrapping principal component regression models. J Chemom. 1997; 11(2):157–71. 10.1002/(SICI)1099128X(199703)11:2<157::AIDCEM471>3.0.CO;2J.CrossRefGoogle Scholar
 21.Lê Cao KA, Rossouw D, RobertGranié C, Besse P, Stat Appl Genet Mol Biol. A sparse PLS for variable selection when integrating omics data. 2008; 7(1). https://doi.org/10.2202/15446115.1390.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.