Abstract
Biotechnological advances in genomics have ushered in a new era of quantitative molecular biology in which it is now routine to measure tens of thousands of molecular features (e.g., gene expression levels) in hundreds, if not thousands, of patient samples. A key statistical challenge in the analysis of such large omic datasets is the presence of confounding sources of variation, which are often either unknown or known only with error. In this chapter, we present a supervised normalization method in which Blind Source Separation (BSS) is applied to identify the sources of variation, and we demonstrate that this leads to improved statistical inference in subsequent supervised analyses. The statistical framework presented here will be of interest to biologists, bioinformaticians, and signal processing experts alike.
Notes
1. Instead of the residual variation matrix \(R\), which requires specification of the phenotype of interest (POI) and is thus supervised.
References
Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Campbell, P.J., Stratton, M.R.: Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 3(1), 246–259 (2013)
Baufays, H.: Unification de techniques de séparation aveugle de sources avec application à l’analyse de l’expression des gènes. École Polytechnique de Louvain, Master thesis with Prof. P.-A. Absil (2011)
Bell, C.G., Teschendorff, A.E., Rakyan, V.K., Maxwell, A.P., Beck, S., Savage, D.A.: Genome-wide DNA methylation analysis for diabetic nephropathy in type 1 diabetes mellitus. BMC Med. Genomics 3, 33 (2010)
Bibikova, M., Le, J., Barnes, B., Saedinia-Melnyk, S., Zhou, L., Shen, R., Gunderson, K.L.: Genome-wide DNA methylation profiling using the Infinium assay. Epigenomics 1(1), 177–200 (2009)
Blenkiron, C., Goldstein, L.D., Thorne, N.P., Spiteri, I., Chin, S.F., Dunning, M.J., Barbosa-Morais, N.L., Teschendorff, A.E., Green, A.R., Ellis, I.O., Tavaré, S., Caldas, C., Miska, E.A.: MicroRNA expression profiling of human breast cancer identifies new markers of tumor subtype. Genome Biol. 8(10), R214 (2007)
Cardoso, J.F.: High-order contrasts for independent component analysis. Neural Comput. 11(1), 157–192 (1999)
1000 Genomes Project Consortium, Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T., McVean, G.A.: An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422), 56–65 (2012)
Curtis, C., Shah, S.P., Chin, S.F., Turashvili, G., Rueda, O.M., Dunning, M.J., Speed, D., Lynch, A.G., Samarajiwa, S., Yuan, Y., Gräf, S., Ha, G., Haffari, G., Bashashati, A., Russell, R., McKinney, S., Watson, P., Markowetz, F., Murphy, L., Ellis, I., Purushotham, A., Børresen-Dale, A.L., Brenton, J.D., Tavaré, S., Caldas, C., Aparicio, S.: The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486(7403), 346–352 (2012)
Deaton, A.M., Bird, A.: CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011)
Doane, A.S., Danso, M., Lal, P., Donaton, M., Zhang, L., Hudis, C., Gerald, W.L.: An estrogen receptor-negative breast cancer subset characterized by a hormonally regulated transcriptional program and response to androgen. Oncogene 25(28), 3994–4008 (2006)
Feinberg, A.P., Vogelstein, B.: Hypomethylation distinguishes genes of some human cancers from their normal counterparts. Nature 301(5895), 89–92 (1983)
Frigyesi, A., Veerla, S., Lindgren, D., Höglund, M.: Independent component analysis reveals new and biologically significant structures in micro array data. BMC Bioinformatics 7, 290 (2006)
Gao, Y., Church, G.: Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 21(21), 3970–3975 (2005)
Huang, D.S., Zheng, C.H.: Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics 22(15), 1855–1862 (2006)
Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
Johnson, W.E., Li, C., Rabinovic, A.: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1), 118–127 (2007)
Jones, P.A., Baylin, S.B.: The epigenomics of cancer. Cell 128(4), 683–692 (2007)
Lee, S.I., Batzoglou, S.: Application of independent component analysis to microarrays. Genome Biol. 4(11), R76 (2003)
Leek, J.T., Storey, J.D.: A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. USA 105(48), 18718–18723 (2008)
Leek, J.T., Storey, J.D.: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3(9), 1724–1735 (2007)
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11(10), 733–739 (2010)
Liao, J.C., Boscolo, R., Yang, Y.L., Tran, L.M., Sabatti, C., Roychowdhury, V.P.: Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl. Acad. Sci. USA 100(26), 15522–15527 (2003)
Liebermeister, W.: Linear modes of gene expression determined by independent component analysis. Bioinformatics 18(1), 51–60 (2002)
Liu, Y., Aryee, M.J., Padyukov, L., Fallin, M.D., Hesselberg, E., Runarsson, A., Reinius, L., Acevedo, N., Taub, M., Ronninger, M., Shchetynsky, K., Scheynius, A., Kere, J., Alfredsson, L., Klareskog, L., Ekström, T.J., Feinberg, A.P.: Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotechnol. 31(2), 142–147 (2013)
Liu, N.W., Sanford, T., Srinivasan, R., Liu, J.L., Khurana, K., Aprelikova, O., Valero, V., Bechert, C., Worrell, R., Pinto, P.A., Yang, Y., Merino, M., Linehan, W.M., Bratslavsky, G.: Impact of ischemia and procurement conditions on gene expression in renal cell carcinoma. Clin. Cancer Res. 19(1), 42–49 (2013)
Loi, S., Haibe-Kains, B., Desmedt, C., Lallemand, F., Tutt, A.M., Gillet, C., Ellis, P., Harris, A., Bergh, J., Foekens, J.A., Klijn, J.G., Larsimont, D., Buyse, M., Bontempi, G., Delorenzi, M., Piccart, M.J., Sotiriou, C.: Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J. Clin. Oncol. 25(10), 1239–1246 (2007)
Maegawa, S., Hinkal, G., Kim, H.S., Shen, L., Zhang, L., Zhang, J., Zhang, N., Liang, S., Donehower, L.A., Issa, J.P.: Widespread and tissue specific age-related DNA methylation changes in mice. Genome Res. 20(3), 332–340 (2010)
Martoglio, A.M., Miskin, J.W., Smith, S.K., MacKay, D.J.: A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics 18(12), 1617–1624 (2002)
Plerou, V., Gopikrishnan, P., Rosenow, B., Amaral, L.A., Guhr, T., Stanley, H.E.: Random matrix approach to cross correlations in financial data. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 65(6), 066126 (2002)
Rakyan, V.K., Down, T.A., Maslau, S., Andrew, T., Yang, T.P., Beyan, H., Whittaker, P., McCann, O.T., Finer, S., Valdes, A.M., Leslie, R.D., Deloukas, P., Spector, T.D.: Human aging-associated DNA hypermethylation occurs preferentially at bivalent chromatin domains. Genome Res. 20(4), 434–439 (2010)
Rakyan, V.K., Down, T.A., Balding, D.J., Beck, S.: Epigenome-wide association studies for common human diseases. Nat. Rev. Genet. 12(8), 529–541 (2011)
Rhodes, D.R., Chinnaiyan, A.M.: Integrative analysis of the cancer transcriptome. Nat. Genet. 37, S31–S37 (2005)
Sainlez, M., Absil, P.-A., Teschendorff, A.: Gene expression data analysis using spatiotemporal blind source separation. In: Proceedings of ESANN’2009, pp. 159–164 (2009)
Sawyers, C.L.: The cancer biomarker problem. Nature 452(7187), 548–552 (2008)
Schmidt, M., Böhm, D., von Törne, C., Steiner, E., Puhl, A., Pilch, H., Lehr, H.A., Hengstler, J.G., Kölbl, H., Gehrmann, M.: The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res. 68(13), 5405–5413 (2008)
Sotiriou, C., Wirapati, P., Loi, S., Harris, A., Fox, S., Smeds, J., Nordgren, H., Farmer, P., Praz, V., Haibe-Kains, B., Desmedt, C., Larsimont, D., Cardoso, F., Peterse, H., Nuyten, D., Buyse, M., Van de Vijver, M.J., Bergh, J., Piccart, M., Delorenzi, M.: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J. Natl. Cancer Inst. 98(4), 262–272 (2006)
Stone, J.V., Porrill, J., Porter, N.R., Wilkinson, I.D.: Spatiotemporal independent component analysis of event-related fMRI data using skewed probability density functions. Neuroimage 15 (2002)
Storey, J.D., Tibshirani, R.: Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100(16), 9440–9445 (2003)
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102(43), 15545–15550 (2005)
Swanton, C., Caldas, C.: From genomic landscapes to personalized cancer management-is there a roadmap? Ann. N. Y. Acad. Sci. 1210, 34–44 (2010)
Teschendorff, A.E., Naderi, A., Barbosa-Morais, N.L., Caldas, C.: PACK: profile analysis using clustering and kurtosis to find molecular classifiers in cancer. Bioinformatics 22(18), 2269–2275 (2006)
Teschendorff, A.E., Journée, M., Absil, P.A., Sepulchre, R., Caldas, C.: Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Comput. Biol. 3(8), e161 (2007)
Teschendorff, A.E., Menon, U., Gentry-Maharaj, A., Ramus, S.J., Gayther, S.A., Apostolidou, S., Jones, A., Lechner, M., Beck, S., Jacobs, I.J., Widschwendter, M.: An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS ONE 4(12), e8274 (2009)
Teschendorff, A.E., Menon, U., Gentry-Maharaj, A., Ramus, S.J., Weisenberger, D.J., Shen, H., Campan, M., Noushmehr, H., Bell, C.G., Maxwell, A.P., Savage, D.A., Mueller-Holzner, E., Marth, C., Kocjan, G., Gayther, S.A., Jones, A., Beck, S., Wagner, W., Laird, P.W., Jacobs, I.J., Widschwendter, M.: Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 20(4), 440–446 (2010)
Teschendorff, A.E., Zhuang, J., Widschwendter, M.: Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics 27(11), 1496–1505 (2011)
The Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma. Nature 474(7353), 609–615 (2011)
Theis, F., Gruber, P., Keck, I., Meyer-Bäse, A., Lang, E.: Spatiotemporal blind source separation using double-sided approximate joint diagonalization. In: Proceedings of EUSIPCO 2005, Antalya, Turkey (2005)
Wang, Y., Klijn, J.G., Zhang, Y., Sieuwerts, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Yu, J., Jatkoe, T., Berns, E.M., Atkins, D., Foekens, J.A.: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365(9460), 671–679 (2005)
Zhang, X.W., Yap, Y.L., Wei, D., Chen, F., Danchin, A.: Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis. Eur. J. Hum. Genet. 13(12), 1303–1311 (2005)
Zhang, S., Liu, C.C., Li, W., Shen, H., Laird, P.W., Zhou, X.J.: Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 40(19), 9379–9391 (2012)
Zhuang, J., Widschwendter, M., Teschendorff, A.E.: A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform. BMC Bioinformatics 13, 59 (2012)
Acknowledgments
AET was supported by a Heller Research Fellowship. This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Program initiated by the Belgian Science Policy Office.
Appendix
1.1 Simulated Data
We simulated data matrices with 2,000 features and 50 samples and considered the case of two confounding factors (CFs) in addition to the primary phenotype of interest. The primary phenotype is a binary variable \(I_1\), with 25 samples in one class (\(I_1=0\)) and the remaining 25 in the other (\(I_1=1\)). Similarly, each confounding factor is assumed to be a binary variable affecting one half of the samples (randomly selected). For a given sample \(s\) we thus have a 3-tuple of indicator variables \(I_s=(I_{1s},I_{2s},I_{3s})\), where \(I_2\) and \(I_3\) are the indicators for the two confounding factors. Thus, samples fall into eight classes. For instance, if \(I_s=(0,0,0)\), the sample belongs to phenotype class 1 and is not affected by either confounding factor. Similarly, \(I_s=(0,1,0)\) means that the sample belongs to class 1 and is affected by the first confounding factor but not the second.
We assume 10% of features (200 features) to be true positives (TPs) discriminating between the two phenotypic classes. We model the confounding factors as follows: each confounding factor is assumed to affect 10% of features, with a 25% overlap with the TPs (i.e., 50 of the 200 TPs are confounded by each factor). Let \(J_g\) denote the indicator variable of feature \(g\), so \(J_g\) is a 3-tuple \((J_{1g},J_{2g},J_{3g})\), with \(J_{1g}\) an indicator for the feature being a true positive, and \(J_{2g}\) (\(J_{3g}\)) an indicator for the feature being affected by the first (second) confounding factor. Thus, the space of features is also divided into eight groups. Furthermore, let \((e_1,e_2,e_3)\) denote the effect sizes of the primary variable and the two confounding factors, respectively, where we assume for simplicity that \(e_2=e_3\). Without loss of generality, we further assume that noise is modeled by a Gaussian of mean zero and unit variance, \(N(0,1)\). Thus, for a given sample \(s\) we draw data values for the various feature groups as follows:
1. \(J_g=(0,0,0)\): null unaffected features
$$\begin{aligned} p(x|I_s)&\sim \delta_{J_g,000}N(0,1) \end{aligned}$$
2. \(J_g=(0,1,0)\) or \((0,0,1)\): null features affected by only one CF
$$\begin{aligned} p(x|I_s)&\sim \delta_{J_g,010}\bigl\{\delta_{I_s,x1z}N(e_2,1) + \delta_{I_s,x0z}N(0,1)\bigr\} \\ &\quad + \delta_{J_g,001}\bigl\{\delta_{I_s,xy1}N(e_3,1) + \delta_{I_s,xy0}N(0,1)\bigr\} \end{aligned}$$
3. \(J_g=(0,1,1)\): null features affected by the two CFs
$$\begin{aligned} p(x|I_s)&\sim \delta_{J_g,011}\bigl\{\delta_{I_s,x11}N(e_2+e_3,1) + \delta_{I_s,x01}N(e_3,1) \\ &\quad + \delta_{I_s,x10}N(e_2,1) + \delta_{I_s,x00}N(0,1)\bigr\} \end{aligned}$$
4. \(J_g=(1,0,0)\): true positives not affected by CFs
$$\begin{aligned} p(x|I_s)&\sim \delta_{J_g,100}\bigl\{\delta_{I_s,0yz}N(0,1) + \delta_{I_s,1yz}\bigl(\pi_{-1}N(-e_1,1)+\pi_1N(e_1,1)\bigr)\bigr\} \end{aligned}$$
5. \(J_g=(1,0,1)\) or \((1,1,0)\): true positives affected by one CF
$$\begin{aligned} p(x|I_s)&\sim \delta_{J_g,101}\bigl\{\delta_{I_s,0y0}N(0,1)+\delta_{I_s,0y1}N(e_3,1) \\ &\quad + \delta_{I_s,1y0}\bigl(\pi_{-1}N(-e_1,1)+\pi_1N(e_1,1)\bigr) \\ &\quad + \delta_{I_s,1y1}\bigl(\pi_{-1}N(-e_1+e_3,1)+\pi_1N(e_1+e_3,1)\bigr)\bigr\} \\ p(x|I_s)&\sim \delta_{J_g,110}\bigl\{\delta_{I_s,00z}N(0,1)+\delta_{I_s,01z}N(e_2,1) \\ &\quad + \delta_{I_s,10z}\bigl(\pi_{-1}N(-e_1,1)+\pi_1N(e_1,1)\bigr) \\ &\quad + \delta_{I_s,11z}\bigl(\pi_{-1}N(-e_1+e_2,1)+\pi_1N(e_1+e_2,1)\bigr)\bigr\} \end{aligned}$$
6. \(J_g=(1,1,1)\): true positives affected by all CFs
$$\begin{aligned} p(x|I_s)&\sim \delta_{J_g,111}\bigl\{\delta_{I_s,000}N(0,1) + \delta_{I_s,010}N(e_2,1) + \delta_{I_s,001}N(e_3,1) \\ &\quad + \delta_{I_s,011}N(e_2+e_3,1) \\ &\quad + \delta_{I_s,100}\bigl(\pi_{-1}N(-e_1,1)+\pi_1N(e_1,1)\bigr) \\ &\quad + \delta_{I_s,101}\bigl(\pi_{-1}N(-e_1+e_3,1)+\pi_1N(e_1+e_3,1)\bigr) \\ &\quad + \delta_{I_s,110}\bigl(\pi_{-1}N(-e_1+e_2,1)+\pi_1N(e_1+e_2,1)\bigr) \\ &\quad + \delta_{I_s,111}\bigl(\pi_{-1}N(-e_1+e_2+e_3,1)+\pi_1N(e_1+e_2+e_3,1)\bigr)\bigr\} \end{aligned}$$
where in the above \(\delta_{x'y'z',xyz}\) denotes the triple Kronecker delta: \(\delta_{x'y'z',xyz}=1\) if and only if \(x'=x\), \(y'=y\) and \(z'=z\), otherwise \(\delta_{x'y'z',xyz}=0\). An index position occupied by a letter rather than a digit (e.g., the \(x\) and \(z\) in \(\delta_{I_s,x1z}\)) is unconstrained, so the delta depends only on the positions given as digits. The weights \((\pi_{-1},\pi_{1})\) satisfy \(\pi_{-1}+\pi_1=1\); in our case, we used \(\pi_1=\pi_{-1}=0.5\).
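The case-by-case densities above can be read as a single additive model, \(x_{gs} = \sigma_g e_1 J_{1g} I_{1s} + e_2 J_{2g} I_{2s} + e_3 J_{3g} I_{3s} + \epsilon_{gs}\) with \(\epsilon_{gs}\sim N(0,1)\) and \(\sigma_g=\pm 1\). The following NumPy sketch draws one data matrix under that reading. Several details are assumptions that the text does not specify: the effect sizes (the values of 2.0 are placeholders), the way CF-affected features are chosen (50 TPs and 150 null features picked at random for each CF), and the choice to draw the sign \(\sigma_g\) once per feature with probability 0.5 rather than once per data value. The function name `simulate_confounded_data` is hypothetical.

```python
import numpy as np


def simulate_confounded_data(n_feat=2000, n_samp=50,
                             e1=2.0, e2=2.0, e3=2.0, seed=0):
    """Draw one simulated matrix X (features x samples) plus indicators I, J.

    Effect sizes and the feature-selection scheme are illustrative
    assumptions, not values taken from the chapter.
    """
    rng = np.random.default_rng(seed)

    # Sample indicators I[k, s]: row 0 is the phenotype (half the samples),
    # rows 1 and 2 are the CFs, each hitting a random half of the samples.
    I = np.zeros((3, n_samp), dtype=int)
    I[0, n_samp // 2:] = 1
    for k in (1, 2):
        I[k, rng.choice(n_samp, n_samp // 2, replace=False)] = 1

    # Feature indicators J[k, g]: 10% TPs; each CF affects 10% of features,
    # of which a quarter of the TP count (50 of 200) are TPs.
    n_tp = n_feat // 10
    n_cf = n_feat // 10
    n_overlap = n_tp // 4
    J = np.zeros((3, n_feat), dtype=int)
    J[0, :n_tp] = 1
    for k in (1, 2):
        tp_hit = rng.choice(n_tp, n_overlap, replace=False)
        null_hit = n_tp + rng.choice(n_feat - n_tp, n_cf - n_overlap,
                                     replace=False)
        J[k, tp_hit] = 1
        J[k, null_hit] = 1

    # Assumption: the mixture sign (pi_1 = pi_-1 = 0.5) is drawn per feature.
    sign = rng.choice([-1.0, 1.0], size=n_feat)

    # Additive mean structure plus N(0, 1) noise.
    mean = (e1 * sign[:, None] * np.outer(J[0], I[0])
            + e2 * np.outer(J[1], I[1])
            + e3 * np.outer(J[2], I[2]))
    X = mean + rng.standard_normal((n_feat, n_samp))
    return X, I, J


X, I, J = simulate_confounded_data()
```

Drawing the sign per feature makes each TP consistently up- or down-regulated in the second phenotype class; drawing it per value instead would follow the mixture density literally but would leave the class means equal.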
1.2 DNA Methylation Data (Whole Blood Tissue)
In all datasets, age is the phenotype of interest. (i) T1D: this DNA methylation (DNAm) dataset consists of 187 blood samples from patients (94 women and 93 men) with type 1 diabetes. This set served as validation for a DNAm signature for aging [44]. We take bisulfite conversion efficiency (BSCE), beadchip, cohort, and sex as potential confounding factors. Samples were distributed over 17 beadchips. (ii) UKOPS1: this DNAm set consists of 108 blood samples from healthy postmenopausal women, which served as controls for the UKOPS study [43]. Confounding factors in this study include BSCE, beadchip, and DNA concentration (DNAc). Samples were distributed over 10 beadchips. (iii) UKOPS2: this set is similar to UKOPS1 but consists of 145 blood samples from healthy postmenopausal women distributed over 36 beadchips (i.e., approximately four healthy samples per chip; the other eight blood samples per chip were from cancer cases) [43]. (iv) WBBC: this dataset consists of whole blood samples from a total of 84 women (49 healthy and 35 with breast cancer). Samples were distributed over seven beadchips, and confounders are BSCE, status (cancer/healthy), and beadchip.
1.3 Breast Cancer mRNA Expression Data
The mRNA expression profiles are all from primary breast cancers. Three of the datasets were profiled on Affymetrix platforms and one on an Illumina Beadchip. Normalized data were downloaded from GEO (http://ncbi.nlm.nih.gov/), and probes mapping to the same Entrez Gene ID were averaged. Sotiriou: 14,223 genes and 101 samples [36]; Loi: 15,736 genes and 137 samples [26]; Schmidt: 13,292 genes and 200 samples [35]; Blenkiron: 17,941 genes and 128 samples [5]. In these datasets, we take histological grade as the phenotype of interest and consider estrogen receptor status and tumor size as potential confounders. Cell-cycle-related genes are known to discriminate low- and high-grade breast cancers irrespective of estrogen receptor status [26, 36]. Therefore, we compare the algorithms in their ability to detect specifically cell-cycle-related genes and not estrogen-regulated genes. To this end, we focused attention on two gene sets, one representing cell-cycle-related genes from Reactome (http://www.reactome.org), and another representing estrogen receptor (ESR1) upregulated genes [10]. The cell-cycle set showed negligible overlap with the ESR1 gene set; nevertheless, we removed the few overlapping genes to ensure mutual exclusivity of the cell-cycle and ESR1 sets.
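The two preprocessing steps described above (averaging probes that map to the same gene identifier, and removing genes shared by the two gene sets) can be sketched in plain Python as follows. This is only an illustration under assumed data structures: the dictionary layout, the function names `average_probes` and `make_disjoint`, and all identifiers in the toy example are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def average_probes(probe_values, probe_to_gene):
    """Average, sample by sample, the profiles of probes sharing a gene ID.

    probe_values: dict mapping probe ID -> list of values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene ID; unmapped probes are dropped.
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(values)
    # zip(*rows) iterates over samples, so each fmean is a per-sample average.
    return {gene: [fmean(column) for column in zip(*rows)]
            for gene, rows in grouped.items()}


def make_disjoint(set_a, set_b):
    """Remove the genes common to both sets from each of them."""
    shared = set_a & set_b
    return set_a - shared, set_b - shared


# Toy example with made-up probe and gene identifiers.
expression = {"p1": [1.0, 2.0], "p2": [3.0, 4.0], "p3": [5.0, 6.0]}
mapping = {"p1": "g1", "p2": "g1", "p3": "g2"}
gene_level = average_probes(expression, mapping)

cell_cycle, esr1 = make_disjoint({"g1", "g2", "g3"}, {"g3", "g4"})
```

Removing the shared genes from both sets, rather than from one only, keeps the comparison symmetric: no gene can count toward both the cell-cycle and the ESR1 signal.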
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Teschendorff, A.E., Renard, E., Absil, P.A. (2014). Supervised Normalization of Large-Scale Omic Datasets Using Blind Source Separation. In: Naik, G., Wang, W. (eds) Blind Source Separation. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55016-4_17
DOI: https://doi.org/10.1007/978-3-642-55016-4_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-55015-7
Online ISBN: 978-3-642-55016-4
eBook Packages: Engineering, Engineering (R0)