Biophysical Reviews, Volume 11, Issue 1, pp 41–50

Stems cells, big data and compendium-based analyses for identifying cell types, signalling pathways and gene regulatory networks

  • Md Humayun Kabir
  • Michael D. O’Connor


Identifying new drug and cell therapy targets for disease treatment will be facilitated by a detailed molecular understanding of normal and disease development. Human pluripotent stem cells can provide a large in vitro source of human cell types and, in a growing number of instances, three-dimensional multicellular tissues called organoids. The application of stem cell technology to the discovery and development of new therapies will be aided by detailed molecular characterisation of cell identity, cell signalling pathways and target gene networks. Big data or ‘omics’ techniques, particularly transcriptomics and proteomics, facilitate cell and tissue characterisation using thousands to tens of thousands of genes or proteins. These gene and protein profiles are analysed using existing and/or emergent bioinformatics methods, including a growing number of methods that compare sample profiles against compendia of reference samples. This review assesses how compendium-based analyses can aid the application of stem cell technology for new therapy development, both through robust definition of differentiated stem cell identity and through elucidation of the complex signalling pathways and target gene networks involved in normal and diseased states.
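As a rough illustration of what comparing a sample profile against a compendium of reference samples can look like, the sketch below ranks reference profiles by Spearman rank correlation with a query profile over their shared genes. It is a minimal sketch under stated assumptions, not the method of any tool discussed in this review: the gene names, cell-type labels, expression values and the function names (`ranks`, `pearson`, `spearman`, `rank_references`) are all hypothetical, and Spearman correlation is only one of several possible similarity measures.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_references(query, compendium):
    """Sort reference labels by similarity to the query, most similar first."""
    scores = {}
    for label, profile in compendium.items():
        shared = sorted(set(query) & set(profile))  # compare shared genes only
        scores[label] = spearman(
            [query[g] for g in shared], [profile[g] for g in shared]
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: placeholder gene names and made-up expression values.
compendium = {
    "lens_epithelium": {"GENE_A": 9.1, "GENE_B": 8.4, "GENE_C": 1.2, "GENE_D": 0.8},
    "pluripotent": {"GENE_A": 1.0, "GENE_B": 2.1, "GENE_C": 9.5, "GENE_D": 8.7},
}
query = {"GENE_A": 7.9, "GENE_B": 8.8, "GENE_C": 0.5, "GENE_D": 1.9}

best_label, best_score = rank_references(query, compendium)[0]
print(best_label, round(best_score, 2))
```

A rank-based measure is used here because it is insensitive to differences in measurement scale between platforms; real compendium-based analyses would typically also need normalisation, many more genes and a significance estimate.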


Pluripotent stem cell; Bioinformatics; Compendium; Signalling; Growth factor; Pathway; Gene regulatory network


Author contributions

M.H.K. drafted the manuscript. M.H.K. and M.D.O’C. revised and approved the manuscript.


M.H.K. was supported by WSU Postgraduate Research Awards. M.D.O’C. was supported by The Medical Advances Without Animals Trust.

Compliance with ethical standards

Conflict of interest

Md Humayun Kabir declares that he has no conflict of interest. Michael D. O’Connor declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human or animal subjects performed by any of the authors.



Copyright information

© International Union for Pure and Applied Biophysics (IUPAB) and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. School of Medicine, Western Sydney University, Campbelltown, Australia
  2. Department of Computer Science and Engineering, University of Rajshahi, Rajshahi, Bangladesh
  3. Medical Sciences Research Group, Western Sydney University, Campbelltown, Australia
