Integrated data analysis for genome-wide research

  • Matthias Steinfath
  • Dirk Repsilber
  • Matthias Scholz
  • Dirk Walther
  • Joachim Selbig
Part of the Experientia Supplementum book series (EXS, volume 97)


Integrated data analysis is introduced as the intermediate level of a systems biology approach to analysing different ‘omics’ datasets, i.e., genome-wide measurements of transcripts, protein levels or protein-protein interactions, and metabolite levels, with the aim of generating a coherent understanding of biological function. In this chapter we focus on methods of correlation analysis, ranging from simple pairwise correlation to kernel canonical correlation, that have recently been applied in molecular biology. Several examples are presented to illustrate their application. The input data for this analysis frequently originate from different experimental platforms. Therefore, preprocessing steps such as data normalisation and missing value estimation are inherent to this approach. The corresponding procedures, potential pitfalls and biases, and available software solutions are reviewed. The multiplicity of observations obtained in omics-profiling experiments necessitates the application of multiple testing correction techniques.
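As a rough illustration of the simplest end of this range, the sketch below computes pairwise Pearson correlations between transcript and metabolite profiles and then corrects the resulting p-values for multiple testing. It is a minimal sketch under stated assumptions, not the chapter's own procedure: the permutation test and the Benjamini-Hochberg adjustment are choices made here for illustration, and the function names, gene and metabolite labels, and toy measurements are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed |r| (illustrative choice)."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the sample pairing between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical toy data: each profile holds one value per sample (6 samples).
    transcripts = {
        "geneA": [1.0, 2.1, 2.9, 4.2, 5.1, 5.9],
        "geneB": [3.0, 1.0, 4.0, 1.5, 5.0, 2.0],
    }
    metabolites = {
        "met1": [0.9, 2.0, 3.2, 3.9, 5.0, 6.1],
        "met2": [2.0, 2.2, 1.9, 2.1, 2.0, 2.3],
    }
    pairs = [(g, m) for g in transcripts for m in metabolites]
    pvals = [permutation_pvalue(transcripts[g], metabolites[m]) for g, m in pairs]
    for (g, m), p_adj in zip(pairs, benjamini_hochberg(pvals)):
        r = pearson(transcripts[g], metabolites[m])
        print(f"{g} vs {m}: r = {r:+.3f}, adjusted p = {p_adj:.3f}")
```

With thousands of transcripts and metabolites the number of pairs grows quickly, which is why the adjustment step matters: unadjusted p-values would flag many pairs by chance alone.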


Keywords: Mutual Information; Independent Component Analysis; Canonical Correlation Analysis; Biological Organisation





Copyright information

© Birkhäuser Verlag/Switzerland 2007

Authors and Affiliations

  • Matthias Steinfath (1)
  • Dirk Repsilber (1)
  • Matthias Scholz (2)
  • Dirk Walther (2)
  • Joachim Selbig (1, 2)

  1. Institute for Biology and Biochemistry, University Potsdam, Potsdam-Golm, Germany
  2. Max Planck Institute of Molecular Plant Physiology, Potsdam-Golm, Germany
