Pathprinting: An integrative approach to understand the functional basis of disease
New strategies to combat complex human disease require systems approaches to biology that integrate experiments from cell lines, primary tissues and model organisms. We have developed Pathprint, a functional approach that compares gene expression profiles in a set of pathways, networks and transcriptionally regulated targets. It can be applied universally to gene expression profiles across species. Integration of large-scale profiling methods and curation of the public repository overcomes platform, species and batch effects to yield a standard measure of functional distance between experiments. We show that pathprints combine mouse and human blood developmental lineage, and can be used to identify new prognostic indicators in acute myeloid leukemia. The code and resources are available at http://compbio.sph.harvard.edu/hidelab/pathprint
Keywords: Acute Myeloid Leukemia; Gene Expression Omnibus; Gene Expression Omnibus Database; Pathway Expression; Gene Expression Score
AML: Acute myeloid leukemia
ESC: Embryonic stem cell
GEB: Gene Expression Barcode
GEO: Gene Expression Omnibus
GSEA: Gene Set Enrichment Analysis
GPCR: G protein-coupled receptor
KEGG: Kyoto Encyclopedia of Genes and Genomes
iPSC: Induced pluripotent stem cell
PLCG2: 1-phosphatidylinositol-4,5-bisphosphate phosphodiesterase gamma-2
POE: Probability of expression
RAN: RAS-related nuclear protein
SCDE: Stem Cell Discovery Engine
SRAS: Self renewal-associated signature.
Complex human diseases arise from perturbations of the cellular system. Defining these changes from a systems biology perspective provides the opportunity to relate the function of genes, pathways, and processes. The ability to compare experiments across model organisms and humans directly influences our capacity to determine the basis of disease [2, 3, 4], and the importance of cross-species data analysis has been well illustrated: human disease genes have been identified by large-scale meta-analysis of conserved human-mouse co-expression, gene-based cross-species distance metrics have highlighted diseases that activate similar human and mouse pathways, and oncogenetic expression signatures have been prioritized by comparing human cancer and mouse model expression profiles [7, 8, 9]. Gene expression provides the most extensive resource to profile functional changes, and the opportunity for large-scale meta-analyses has been made possible by the development of public data repositories such as the National Center for Biotechnology Information Gene Expression Omnibus (GEO) and the European Bioinformatics Institute ArrayExpress. Cross-study analysis and integration is an area of extremely active research; however, most gene-based approaches are confounded by the challenge of comparing gene activity between different platforms and species. Consistent and scalable methods for combining these data are now required so that researchers can perform comprehensive integration of existing knowledge with new experiments, identify consistent signals, compare heterogeneous data, and validate hypotheses.
Methods for cross-study integration of gene expression data have tended to focus on differential expression in well-matched control and experimental samples, because approaches based on correlation or absolute profiles are dominated by laboratory and platform variability in cross-study analyses. The ability to leverage public data to address platform effects has been demonstrated most recently by the Gene Expression Barcode (GEB) and Gene Expression Commons, both of which define absolute gene expression scores based on a background distribution [15, 16]. However, by virtue of their reliance on gene-level comparisons, these compelling simplifying approaches are restricted to selected platforms, and so do not address global comparison of biological function across experiments and species.
We sought to develop a new function-based approach for comparing profiles, which can truly scale across the diversity of available experiments, platforms, and species. Expression of biological functions across batches and divergent expression platforms shows higher concordance than across genes, and assigning genes to pathways [18, 19, 20] or ontologies is effective for revealing phenotype associations [22, 23, 24, 25], performing cross-platform integration, and specifying disease subgroups. To this end, we have developed Pathprint, a global pathway activation map spanning 6 species and 31 array technologies, which represents expression profiles as a ternary score (underexpressed (-1), intermediately expressed (0), or overexpressed (+1)) in a set of 633 pathways, networks, and transcriptionally regulated targets. The method leverages a static background built from public data repositories, integrating pathway annotation and prediction with large-scale profiling.
Pathprint provides both a quantitative definition of cellular phenotype and a functional distance between all experiments, based on their global pathway activity. It presents a significant methodological advance over single-study, relative enrichment methods such as Gene Set Enrichment Analysis (GSEA)  and existing gene-based methods for comparison between platforms and species. Pathprinting provides a robust framework for large-scale meta-analyses of clinical data, and allows phylogenetic reconstruction of developmental lineages from a functional perspective. We demonstrate the use of pathprinting for retrieval of functionally matched samples from cross-platform expression databases, reconstruction of the blood developmental lineage across species, and integration of data from mouse experiments, human samples, and clinical studies to develop new prognostic indicators and drug targets in acute myeloid leukemia (AML).
Expression data for building pathway background distributions
A list of arrays from 31 of the most highly represented one-channel gene expression platforms in GEO that profiled Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Drosophila melanogaster, and Caenorhabditis elegans was compiled (see Additional file 1) and the normalized expression tables retrieved. All normalization methods were accepted. After discarding incomplete records, this list contained 176,971 arrays. It was necessary to restrict the platform coverage to one-channel arrays, because two-channel arrays provide the relative expression of genes between test and control samples, hindering direct comparison of the test sample between experiments when the control sample differs. The expression data were mapped to Entrez Gene identifications (IDs) using systematically updated annotations from AILUN (Array Information Library Universal Navigator). Multiple probes were merged to unique Entrez Gene IDs by taking the mean probe set intensity. It should be noted that although the mean expression level will produce stable gene expression values, it also 'averages out' the effects of alternative promoter usage and splice variants. Tissue-specific splicing has been recognized as an important factor in defining cellular function; however, at the present time, there are insufficient data to allow consistent mapping of individual splice variants to pathways.
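The probe-merging step described above can be illustrated with a minimal Python sketch. It is not the Pathprint package code; the function name `merge_probes_to_genes`, the probe IDs, and the gene IDs are hypothetical, and it assumes that probes without an annotation are simply dropped.

```python
from collections import defaultdict
from statistics import mean


def merge_probes_to_genes(probe_intensity, probe_to_gene):
    """Average probe-level intensities into one value per Entrez Gene ID."""
    by_gene = defaultdict(list)
    for probe, value in probe_intensity.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # assumption: unannotated probes are dropped
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


# Hypothetical probe IDs and Entrez Gene IDs, for illustration only.
merged = merge_probes_to_genes(
    {"p1": 8.0, "p2": 10.0, "p3": 5.0, "p4": 7.0},
    {"p1": "672", "p2": "672", "p3": "7157"},
)
print(merged)
```

Taking the mean keeps one stable value per gene, at the cost noted above of averaging out probe-specific effects such as splice variants.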
Canonical pathway gene sets were compiled from Reactome, Wikipathways, and KEGG (Kyoto Encyclopedia of Genes and Genomes), which were chosen because they include pathways relating to metabolism, signaling, cellular processes, and disease. For the major signaling pathways, experimentally derived transcriptionally upregulated and downregulated gene sets were obtained from Netpath. The pathways provide structured relationships between genes, unlike ontologies such as the Gene Ontology (GO) database that define relationships between but not within terms.
Pathprint is built to leverage expertly curated biological knowledge found in canonical pathway databases within a systematic framework. This approach provides a consistent biological annotation of datasets in terms that are well understood by the community. However, a uniquely pathway-centric approach would introduce an inherent curation bias towards well-studied genes and processes. Therefore, we have supplemented the curated pathways with non-curated sources of interactions by including highly connected modules from a functional-interaction network, termed 'static modules.' This functional-interaction network was constructed by extending curated pathways with non-curated sources of information, including protein-protein interactions, gene co-expression, protein domain interaction, GO annotations and text-mined protein interactions. The final functional-interaction network contains 181,706 interactions between 9,452 genes , representing close to 50% of the total human proteome. A Markov cluster algorithm was applied to decompose the network, yielding 144 closely related functional-interaction clusters or 'static modules', ranging from 10 to 743 nodes. Each cluster was named according to the member gene with the highest interaction degree. The modules cover 6,458 genes, 1,542 of which are not represented in any of the pathway databases. These static modules offer the opportunity to examine the activity of less studied or annotated biological processes, and also to compare their activity with that of the canonical pathways. To provide biological context for the static modules, the top GO terms associated with all the pathways have been compiled (see Additional file 2).
Compiling cross-species gene sets
M. musculus, R. norvegicus, D. rerio, D. melanogaster, and C. elegans gene sets were inferred using homology based on the HomoloGene database. HomoloGene uses pairwise gene comparison combined with a guide tree and gene neighborhood conservation. HomoloGene was selected because, compared with alternative inference methods, it provides a better functional proxy and higher specificity for the resolution of shared cellular ontogeny, albeit with lower overall coverage.
Summary of the pathprint gene sets
Table: Summary of gene sets used in Pathprint (columns: mean size, n; median size, n; minimum size, n; maximum size, n; total genes, n).
Calculating pathway expression
For each array, the expression score of a pathway was calculated as the mean squared rank of its member genes, $\frac{1}{n}\sum_{i=1}^{n} R_i^{2}$, where $R_i$ is the rank of gene $G_i$ in a pathway containing $n$ genes. Rank normalizations provide robust summary statistics to calculate pathway expression scores [6, 13] that can be applied across all technologies and do not depend on the dynamic range of an array. The mean squared rank was chosen based on a survey of statistical approaches for gene set analysis, and out-performed other summary statistics in a series of classification benchmarks based on tissue-specific pathway expression (see benchmarking section below).
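As an illustration of the mean squared rank statistic, the following Python sketch ranks the genes of one array and averages the squared ranks of the pathway members. It is a sketch under stated assumptions rather than the package implementation: ranks are assumed to ascend with expression (1 = lowest), ties are assumed to receive their average rank, and the function names and gene labels are hypothetical.

```python
def rank_genes(expression):
    """Rank genes by ascending expression (1 = lowest); ties share their average rank."""
    ordered = sorted(expression, key=expression.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of genes tied with ordered[i].
        while j + 1 < len(ordered) and expression[ordered[j + 1]] == expression[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for gene in ordered[i:j + 1]:
            ranks[gene] = average_rank
        i = j + 1
    return ranks


def mean_squared_rank(expression, gene_set):
    """Mean of the squared ranks of the pathway genes present on the array."""
    ranks = rank_genes(expression)
    members = [gene for gene in gene_set if gene in ranks]
    if not members:
        raise ValueError("no pathway genes are present on this array")
    return sum(ranks[gene] ** 2 for gene in members) / len(members)


# Hypothetical four-gene array; the pathway contains g2 and g4.
array = {"g1": 2.1, "g2": 9.4, "g3": 5.0, "g4": 7.7}
score = mean_squared_rank(array, {"g2", "g4"})
print(score)
```

Because only ranks enter the score, the result is unchanged by any monotonic rescaling of the array intensities.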
Normalization and probability of expression
When comparing gene-set expression scores between experiments, it is essential to assess the expression against a suitable null hypothesis. In this case, comparison of the expression of a gene set in one array with its expression in all other arrays, that is, sample permutation, is required to account for the internal gene expression correlation structure within gene sets, which is expected to be particularly high within pathways. For each gene set, the expression score was normalized against a background built using all arrays of the same platform type. To our knowledge, this is the first study comparing database-wide gene set expression, and the expected distribution scores are not known. We adopted a similar approach to the GEB, which estimates which genes are expressed and which are unexpressed in data from single microarrays. The GEB converts gene expression levels to binary scores based on a static background distribution built from public expression data for three distinct platforms. In this study, we constructed static pathway expression background distributions for each pathway across 31 platforms in GEO, and then fitted each of these distributions to a two-component uniform-normal mixture model. The normal component represents the core distribution of pathway expression scores for a particular pathway, that is, not significantly high or low expression. The uniform component represents outlying pathway expression due to significantly high or low expression. A signed probability of expression (POE), representing the probability that a pathway expression score belongs to the uniform component of the fitted mixture model, can be calculated. We took advantage of the increase in computation speed afforded by the expectation-maximization implementation of POE in the R package metaArray.
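To make the uniform-normal mixture concrete, the Python sketch below fits such a model to the background scores of a single pathway by a plain expectation-maximization loop and returns a signed POE for each score. It is not the metaArray implementation: the uniform component is assumed to span the observed range, the starting values and fixed iteration count are arbitrary choices, and the sign is assumed to follow the side of the fitted normal mean on which a score falls.

```python
import math
from statistics import median, pstdev


def signed_poe(scores, iterations=100):
    """Signed probability that each score belongs to the uniform (outlier) component."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # no spread, so nothing can be an outlier
    uniform_density = 1.0 / (hi - lo)
    mu, sigma, pi_u = median(scores), pstdev(scores), 0.1  # arbitrary starting values
    z = [0.0] * len(scores)
    for _ in range(iterations):
        # E-step: responsibility of the uniform component for each score.
        z = []
        for x in scores:
            normal = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
            z.append(pi_u * uniform_density / (pi_u * uniform_density + (1 - pi_u) * normal))
        # M-step: update the mixing weight and the normal parameters.
        weights = [1 - zi for zi in z]
        total = sum(weights)
        if total == 0:
            break
        pi_u = sum(z) / len(z)
        mu = sum(w * x for w, x in zip(weights, scores)) / total
        variance = sum(w * (x - mu) ** 2 for w, x in zip(weights, scores)) / total
        sigma = max(math.sqrt(variance), 1e-6)  # guard against a collapsing variance
    return [zi if x >= mu else -zi for zi, x in zip(z, scores)]


# Synthetic background: a core of scores near zero plus three outliers.
scores = [-10.0] + [round(-1.0 + 0.1 * k, 1) for k in range(21)] + [9.0, 10.0]
poe = signed_poe(scores)
```

In this synthetic example the outlying scores receive a POE close to -1 or +1, whereas scores in the core of the distribution stay close to 0.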
Application of a ternary threshold
The POE values were converted to a ternary score: $F_i = +1$ if $POE_i > T$, $F_i = -1$ if $POE_i < -T$, and $F_i = 0$ otherwise, where $POE_i$ ($i = 1, 2, \ldots, n$) represents the POE for gene set $i$, $T$ is the threshold, and $F_i$ are the components of the pathprint vector. Selection of the threshold, $T$, is of vital importance, as this directly modulates the sensitivity and specificity at which gene sets are scored as significant. Large values of $T$ (high stringency) are appropriate for gene expression, while small values of $T$ (low stringency) increase the weighting of subtle differences in expression, and may be required to discriminate arrays at the pathway level, where the coordinated effects of multiple genes are under consideration. The threshold was optimized by combining multiple benchmarks (see below). Thresholding improves sample clustering (see below), provides a read-out for sample annotation, and simplifies quantification of sample relationships.
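The thresholding step can be sketched in a few lines of Python. The sketch assumes that a gene set is scored +1 when its signed POE exceeds T, -1 when it falls below -T, and 0 otherwise; the handling of values exactly equal to T is an assumption, and `pathprint_vector` is a hypothetical name rather than a function of the Pathprint package.

```python
def pathprint_vector(poe_values, threshold=0.001):
    """Convert signed POE values to ternary scores (-1, 0, +1)."""
    return [
        1 if poe > threshold else -1 if poe < -threshold else 0
        for poe in poe_values
    ]


# Hypothetical POE values for five gene sets.
ternary = pathprint_vector([0.95, -0.4, 0.0005, -0.0002, 0.002])
print(ternary)  # [1, -1, 0, 0, 1]
```

Raising the threshold moves more gene sets into the intermediate (0) class, which corresponds to the higher stringency discussed above.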
Constructing consensus pathprints
The consensus score for pathway $i$ is $+1$ if $\mu_i > t$, $-1$ if $\mu_i < -t$, and $0$ otherwise, where $\mu_i$ is the mean score for pathway $i$ across the group of pathprints, and $t$ is a consensus threshold value. The consensus pathprint is the vector constructed by calculating the consensus score for each pathway, representing the consistently significantly expressed pathways across the group. The rationale behind introducing a threshold is to associate a set of pathways with a phenotype, and so provide a discrete functional representation of a cell type based on a collection of pathprints.
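A consensus pathprint of this kind might be computed as in the Python sketch below. It assumes that a pathway enters the consensus as +1 or -1 when its mean score across the group lies beyond t or -t, respectively, and as 0 otherwise; the function name and the example group are hypothetical.

```python
def consensus_pathprint(pathprints, t=0.75):
    """Threshold the per-pathway mean of a group of ternary pathprint vectors."""
    count = len(pathprints)
    means = [sum(column) / count for column in zip(*pathprints)]
    return [1 if mean > t else -1 if mean < -t else 0 for mean in means]


# Four hypothetical samples scored over four pathways.
group = [
    [1, -1, 0, 1],
    [1, -1, 1, 0],
    [1, -1, 0, 0],
    [1, 0, 0, 0],
]
consensus = consensus_pathprint(group, t=0.5)
print(consensus)  # [1, -1, 0, 0]
```

Only pathways that are scored consistently across the group survive the threshold, so the consensus vector is typically much sparser than any individual pathprint.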
Defining distance between pathprints
A functional distance between experiments is defined as the distance between two pathprint vectors. We defined the distance by the Manhattan distance, providing a simple read-out for the number of pathway scores differing between two samples. We defined the distance from a consensus pathprint to any other pathprint by the Manhattan distance between the subset of the pathprint vectors that contain only the pathways for which the consensus pathprint is non-zero. This ensures that only differences in the consistently expressed pathways that make up the consensus pathprint are considered.
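Both distance definitions translate directly into code. The following Python sketch uses hypothetical function names and vectors, and shows the full Manhattan distance alongside the restricted distance from a consensus pathprint.

```python
def manhattan_distance(a, b):
    """Manhattan distance between two pathprint vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_from_consensus(consensus, pathprint):
    """Manhattan distance restricted to pathways with a non-zero consensus score."""
    return sum(abs(c - p) for c, p in zip(consensus, pathprint) if c != 0)


sample_a = [1, 0, -1, 1]
sample_b = [1, 1, 1, 0]
consensus_vector = [1, 0, -1, 0]

full_distance = manhattan_distance(sample_a, sample_b)
restricted_distance = distance_from_consensus(consensus_vector, sample_b)
print(full_distance, restricted_distance)  # 4 2
```

In the restricted case, the second and fourth pathways are ignored because the consensus score is zero there, so only the pathways that define the phenotype contribute to the distance.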
Optimizing threshold value
The threshold value was optimized using cross-platform, cross-species gene expression data from a panel of human and mouse tissue samples and an independent dataset profiling brain sub-regions in human, mouse, and rat.
Four approaches were used to determine the optimum threshold.
1) Classification: five-fold cross-validation
The datasets were divided into five sub-sets of equal, or approximately equal, size. One of the sub-sets (the test set) was omitted, and mean pathprints were calculated for each tissue from the remaining samples (the training set). Next, the samples in the test set were assigned to the tissue with the closest mean tissue pathprint in the training set by Euclidean or Manhattan distance (both yielded similar results). An error rate was calculated by comparing these assignments with the known annotations. This was repeated, omitting each of the sub-sets in turn, to obtain a mean error rate. The cross-validation procedure was performed 10 times for each threshold value to estimate the mean and standard deviation (SD) of the error rate. The SD was small relative to the change in mean error rate over the thresholds, and so this number of repetitions was deemed sufficient (see Additional file 4). The procedure was also performed as a 'leave-one-out' cross-validation, equivalent to dividing the data into a number of sub-sets equal to the number of samples, with similar results.
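The cross-validation procedure can be sketched in Python as a nearest-mean classifier evaluated over k folds. This is an illustration only: folds are assigned deterministically by sample index rather than by the repeated random splits described above, the Manhattan distance is used, and all names and data are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cross_validated_error(pathprints, tissues, k=5):
    """Fraction of samples assigned to the wrong tissue under k-fold cross-validation."""
    labelled = list(zip(pathprints, tissues))
    errors = 0
    for fold in range(k):
        # Assumption: sample i belongs to fold i % k.
        train = [pair for i, pair in enumerate(labelled) if i % k != fold]
        test = [pair for i, pair in enumerate(labelled) if i % k == fold]
        centroids = {
            tissue: mean_vector([v for v, t in train if t == tissue])
            for tissue in {t for _, t in train}
        }
        for vector, tissue in test:
            predicted = min(centroids, key=lambda name: manhattan_distance(vector, centroids[name]))
            errors += predicted != tissue
    return errors / len(labelled)


# Ten hypothetical samples from two tissues, scored over three pathways.
example_pathprints = [
    [1, 1, -1], [1, 0, -1], [1, 1, 0], [1, 1, -1], [0, 1, -1],
    [-1, 0, 1], [-1, -1, 1], [-1, 0, 0], [0, 0, 1], [-1, 0, 1],
]
example_tissues = ["liver"] * 5 + ["brain"] * 5
error_rate = cross_validated_error(example_pathprints, example_tissues)
print(error_rate)
```

Setting k equal to the number of samples gives the leave-one-out variant mentioned above.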
2) Cluster validity (intra-tissue versus inter-tissue distance and principal components analysis)
Cluster validity was determined by the ratio of the intra- to inter-tissue variance, where variance was defined as the sum of the squared Euclidean distances between each sample and the mean pathprint for each tissue. A lower ratio indicates tighter clustering within tissues and/or better separation of the tissue type clusters. The clusters formed by pathprints had an intra-cluster/inter-cluster distance ratio of 0.63, compared with 1.26 for GEB and 0.92 for Spearman correlation (see Additional file 4).
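One way to compute such a ratio is shown in the Python sketch below. The normalization is an assumption: the squared distances from each sample to its own tissue mean (intra) and to every other tissue mean (inter) are averaged before the ratio is taken, so the value does not depend on the number of tissues. At least two tissues are required, and all names and data are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_validity_ratio(pathprints, tissues):
    """Mean intra-tissue squared distance divided by mean inter-tissue squared distance."""
    centroids = {
        tissue: mean_vector([v for v, t in zip(pathprints, tissues) if t == tissue])
        for tissue in set(tissues)
    }
    intra, inter = [], []
    for vector, tissue in zip(pathprints, tissues):
        for name, centroid in centroids.items():
            target = intra if name == tissue else inter
            target.append(squared_distance(vector, centroid))
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


# Two hypothetical tissues with two samples each, scored over two pathways.
ratio = cluster_validity_ratio(
    [[1, 1], [1, 0], [-1, -1], [-1, 0]],
    ["A", "A", "B", "B"],
)
print(round(ratio, 4))
```

A ratio well below 1, as in this toy example, indicates that samples sit much closer to their own tissue mean than to the means of other tissues.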
3) Retrieval: precision recall of cross-species tissue data
4) Comparison with randomly constructed gene sets
A pathprint based on 'random' gene sets was constructed to test whether the 'expert' knowledge contained within the pathways and modules contributed to the success of the pathprint, over and above the effect of simply reducing the dimensionality of the data. These random gene sets contained genes sampled without replacement from the genes used in the original pathways, and retained the size distribution of the original pathway list. The precision-recall performance of the pathprint based on random gene sets (Figure 2; see Additional file 4) was inferior to that of the original pathprint, and this was especially pronounced at stringent thresholds. At less stringent thresholds, the difference between the curves was smaller, implying that both the reduction in data dimensionality and the integration of biological knowledge contribute to the effectiveness of pathprint.
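The construction of size-matched random gene sets can be sketched as follows in Python. The sketch assumes that sampling is without replacement within each set (so a gene may still appear in several random sets, as it can in the original pathways); the function name, seed, and toy pathways are hypothetical.

```python
import random


def random_gene_sets(gene_sets, seed=0):
    """Build one random gene set per original set, matching its size."""
    rng = random.Random(seed)
    # Universe: every gene that appears in at least one original gene set.
    universe = sorted({gene for genes in gene_sets.values() for gene in genes})
    return {
        f"random_{index}": set(rng.sample(universe, len(genes)))
        for index, genes in enumerate(gene_sets.values(), start=1)
    }


# Toy pathways with hypothetical gene labels.
pathways = {
    "Wnt signaling": {"a", "b", "c"},
    "DNA replication": {"c", "d"},
    "Cell cycle": {"a", "d", "e", "f"},
}
randomized = random_gene_sets(pathways)
print({name: len(genes) for name, genes in randomized.items()})
```

Because the sizes and the gene universe are preserved, any loss of performance relative to the curated sets can be attributed to the missing biological grouping rather than to a change in dimensionality.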
A threshold value of 0.001 was chosen on the basis that it performed optimally across the majority of the performance measures. It is interesting to note that a highly stringent threshold, approximately 0.9, did not perform well in cross-validation but yielded good results for the precision-recall and cluster-validity tests, and produced the greatest difference in performance compared with the equivalently thresholded random gene sets. These results show that moderate pathway expression levels best characterize samples, but the most highly expressed pathway expression scores are also informative. Further work is required to determine whether combining more than one thresholding regimen would be beneficial.
Phenotype matching using the GEO database
Any set of arrays, such as tissue-specific arrays, can be used as a 'seed' to construct a consensus pathprint profile representing the commonly expressed functions of the set (Figure 2). The distance of every array in the GEO pathprint collection can then be measured to produce a table of GEO samples, ordered on the basis of their phenotypic similarity to the seed set, that is, a ranked list of retrieved samples (see Additional file 5).
Distribution of distances
In considering the distribution of distances from a consensus pathprint, a major problem is how to assign a measure of significance. This is particularly important if it is necessary to impose a cut-off point at which to evaluate retrieved results. Calculating significance based on the distribution of pathprint scores across the full GEO database is complicated because 1) each pathway has a different distribution of ternary scores, and 2) the pathway scores are known to be correlated. An alternative strategy is to use the distribution of the database to define a background distribution, based on the following assumptions: first, that there are two distinct populations, namely, a small number of closely matched and a large number of non-matched samples; and second, that the distances of the non-matched samples are normally distributed. The estimated distribution of the non-matched samples is derived from the interquartile range of the full distribution. The significance with which an array is matched with a pathprint, or with a consensus pathprint, is then calculated as a P-value from the normal distribution function of this estimated distribution. This approach is clearly an oversimplification, and a more complete significance model will form the basis of further study. We expect a large number of the samples contained in GEO to be disease-related, representative of a research focus bias inherent in the scientific literature, and so we are aware that the underlying distribution could be multimodal, owing to perturbed transcriptional programs and copy-number variations associated with disease, specifically cancer cell types. The correlation between this estimated P-value and the precision for each of the six tissue samples is shown (see Additional file 5).
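The background model described above might be approximated as in the Python sketch below. The details are assumptions rather than the published procedure: the normal background is centered on the median of all distances, its standard deviation is the interquartile range divided by the interquartile range of a standard normal (about 1.349), and the P-value is the lower tail, so that small distances (close matches) receive small P-values.

```python
from statistics import NormalDist, median, quantiles


def match_p_values(distances):
    """Lower-tail P-values of distances under an IQR-based normal background."""
    q1, _, q3 = quantiles(distances, n=4)
    iqr_of_standard_normal = 2 * NormalDist().inv_cdf(0.75)  # about 1.349
    sigma = (q3 - q1) / iqr_of_standard_normal  # must be > 0 for the cdf to be defined
    background = NormalDist(mu=median(distances), sigma=sigma)
    return [background.cdf(distance) for distance in distances]


# Hypothetical distances: two close matches and a bulk of non-matched samples.
distances = [20, 25] + list(range(90, 111))
p_values = match_p_values(distances)
print(p_values[0], p_values[11])
```

Using the median and interquartile range keeps the estimate of the non-matched population largely insensitive to the small number of closely matched samples.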
Pathprints corresponding to hematopoietic gene expression datasets GSE24759 and GSE6506 were calculated using the pathprint pipeline. A consensus pathprint was constructed for each of the cell types using an arbitrarily selected threshold of 0.75. Phylogenetic analysis was performed using the R package Phangorn. Optimized parsimony and (non-parametric) bootstrapped trees were found by nearest neighbor interchange with a cost matrix based on the difference between pathprint scores.
Self renewal-associated signature and survival analysis
Gene expression data for leukemia stem cells, normal stem cells, and progenitor cells in mouse and human were obtained from the GEO database (GSE24006 and GSE3722). Pathprints were calculated for each sample using the Pathprint package in R. Pathways shared by leukemic and normal stem cells that are differentially expressed in progenitor cells were identified for the human and mouse datasets. The self renewal-associated signature (SRAS) was defined as the set of pathways common to the human and mouse signatures. Gene expression arrays and the associated survival data were obtained from GEO for four clinical studies of AML (GSE10358, GSE12417, GSE1159, and GSE14468). Pathprints were calculated for each sample in these datasets. Survival plots and associated P-values were derived using the Kaplan-Meier method by stratifying patient samples into two groups by the sum of their pathprint scores across the SRAS pathways. For each dataset, the approach was repeated 1,000 times using random permutations of the pathprint pathways with the same number of member pathways as the SRAS set to produce a background distribution of P-values against which to compare the SRAS result.
Code and Pathprint R package
The code and data to process gene expression arrays to pathprints have been compiled into the R package Pathprint. Pathprints have also been pre-calculated for approximately 180,000 gene expression profiles from the GEO repository and included in the R package, along with their associated metadata, in order to create a searchable cross-platform matrix covering 31 platforms and 6 species (see Additional file 1). Future versions of Pathprint will extend the acquisition pipeline to encompass the remaining platforms and incorporate data from other repositories. The package and the complete R code (as Sweave documents) required to reproduce the analysis and figures contained within this manuscript are available online.
Results and Discussion
The ability of pathprints to classify cross-platform and species data was tested on a series of tissue-specific datasets, and compared with the GEB, gene expression correlation, and a pathprint based on random gene sets (Figure 2; see Additional file 4). In each test, pathprints improved sample classification, and clustered tissues together across platform and species. The biological and technical variation across pathprints in the tissue-specific dataset was investigated by principal components analysis (Figure 2c). The first two principal components separated most tissue types, irrespective of their originating platform and species, with some convolution of the lung and spleen samples. Notably, a corresponding plot produced from GEB data clustered samples first by platform and then tissue type (Figure 2d).
A high degree of overlap in gene membership is introduced when combining multiple pathway databases. Overlapping gene membership can be due to redundancy in the pathway sets, for example different views of the Wnt pathway in the Reactome, Wikipathway, and KEGG databases, or due to a close biological relationship between pathways and so sharing of a subset of their genes, such as 'G1 to S cell cycle control' and 'DNA replication'. Overlapping genes will result in correlation between the gene expression scores of these pathways. In addition to the correlation due to overlapping genes, it is well recognized that pathways do not function as discrete elements, but rather are organized into cascades and co-regulatory networks. We did not attempt to make a quantitative definition of the second source of correlation, but we did test the effect of correcting for overlapping genes by incorporating a pathway covariance matrix to adjust the contribution of each gene set using the Mahalanobis distance. The covariance matrix was calculated using pathway expression scores from 10,000 randomly permuted expression profiles to provide a measure of the covariance due to the gene-member overlap, without the additional complication of gene-gene expression correlations. In the benchmark tests, the Mahalanobis distance did not improve performance over the simpler Euclidean and Manhattan distances (Figure 2b); thus, all pathways, irrespective of size and including overlapping gene sets, were retained in the pathprint. No additional correction was made, as we wished to maximize the utility of the pathprint as a source of annotation of samples and for sample clustering and organization. Plans to include feature selection of gene sets that contribute the most toward performance, for example by non-negative matrix factorization, are the subject of ongoing algorithmic development.
We will now outline a series of case studies demonstrating major applications of pathprinting, focusing on integrating data from human and mouse.
Tissue-specific pathway profiles
Table: Pathprint-based retrieval of data from the Gene Expression Omnibus (GEO); arrays retrieved from GEO from consensus tissue pathprints at 95% precision (columns: seed arrays, n; correct retrievals, n).
Development of a pluripotent pathprint
The study and characterization of embryonic stem cells (ESCs) is dominated by subjective choices of selection markers. ESCs express consistent transcriptional profiles that provide benchmarks for pluripotency; however, to date, it has not been possible to consistently assess ESC signatures across all available data and platforms, and it is becoming increasingly important to provide biologically interpretable functional signatures that are robust across a range of experimental origins. An ESC pathprint was derived from 127 human and mouse samples (see Additional file 7) that includes high expression of known ESC-related functions such as DNA repair, one-carbon metabolism, and a network centered on SUMO1, the ubiquitin-related modifier thought to target and stabilize Oct4. The profile is a consistent indicator of pluripotency; 90% of the 1,000 closest pathprint-matched samples in GEO are ESCs and induced pluripotent stem cells (iPSCs) from 140 different human and mouse studies and 13 platforms (see Additional file 8; see Additional file 9). The non-ESC/iPSC samples retrieved were cancer cell lines known to express ESC pathways, consistent with the concept that pathways required for stem cell specification play fundamental roles in tissue regeneration and cancer. Systematically profiling stem cells using pathprints to integrate data from mouse models, human primary tissue, and clinical studies will resolve the contributions of these stem cell pathways to developing and aberrant systems and reveal pathways of clinical relevance.
Integration of the human and mouse hematopoietic lineage
Self-renewal pathways in acute myeloid leukemia
The pathprinting project provides the scientific community with a consistent, functional annotation of gene expression across a fixed 'set' of pathways. It moves beyond traditional approaches, resolving the major bottleneck on the road towards efficient systems biology-based modeling by addressing the inherent experimental and platform biases that confound microarray analyses. Pathprinting is now being applied to group the function of datasets within the Harvard Stem Cell Institute Stem Cell Commons so that samples that have similar function can be discovered within stem cell data. A Cytoscape plug-in is also in development as part of the National Heart, Lung, and Blood Institute Progenitor consortium, and we have integrated the method into the Stem Cell Discovery Engine (SCDE) to provide web-based accessibility. The SCDE is a portal for integrated access to tissue and cancer stem cell experimental information and molecular profiling analysis tools via a web-based Galaxy instance. Pathprinting is also embedded within the toolbench distribution of Galaxy. We encourage the community to employ pathprinting to communicate functional findings more consistently. It is important to note that pathprinting is effective for use on single samples; a sample can easily be pathprinted and compared with 'what is there'. This has important implications for applications in personalized medicine and single cell analyses.
The R package Pathprint is provided to calculate pathprints (or continuous pathway scores) from expression arrays and pathway enrichments from input gene lists. The package also contains a database of approximately 180,000 pathprints from GEO. The packages, along with Sweave files detailing the package usage and analysis in this paper, are available online. A supplementary package, pathprintTF, is also provided, containing a similar framework and database to pathprint but built upon protein interaction modules centered on transcription factors rather than pathways to enable cross-platform comparison of transcriptional control elements. The transcription factor modules are based on protein-protein interaction sub-networks centered on a series of 1,022 transcription factors. The package and more details are provided on the Pathprint website.
The correlation of mRNA expression with protein levels, and also with phenotype, depends on a variety of factors such as translation efficiency, mRNA abundance, ribosome occupancy, and protein abundance and turnover. Gene expression levels are a good surrogate for protein levels for housekeeping genes (ribosomal proteins, glycolytic enzymes, and tricarboxylic acid cycle proteins) but mRNA levels correlate less well with protein levels for kinases, proteases, secreted proteins and transcription factors, and overall mRNA variability explains only approximately 40% of the variability in protein levels. Pathprinting establishes a standardized method for large-scale quantitative comparisons of cellular function, and any analysis of this type depends on the availability of large-scale quantitative genome-wide datasets. Gene expression data repositories are currently the only resource expansive enough to address this need. Future versions of Pathprint will extend the value of existing array data by integrating RNA-sequencing, epigenetic and proteomic profiles, providing context for new experiments from the existing body of microarray data, and helping resolve the links between regulation and expression of cellular function.
We thank Dr R Gazit, Dr T Mehan, Prof F Michor, Prof C Seoighe, Dr A Sinha, Dr R Sompallae, C Spencer, and I Sytchev for conceptual and technical advice. This work was supported by the National Institutes of Health (1RC2CA148222-01 to WH).
- 6.Le H-S, Oltvai ZN, Bar-Joseph Z: Cross species queries of large gene expression databases. Bioinformatics. 2010, 28: 2349-2356.Google Scholar
- 9.Johnson RA, Wright KD, Poppleton H, Mohankumar KM, Finkelstein D, Pounds SB, Rand V, Leary SE, White E, Eden C, Hogg T, Northcott P, Mack S, Neale G, Wang YD, Coyle B, Atkinson J, DeWire M, Kranenburg TA, Gillespie Y, Allen JC, Merchant T, Boop FA, Sanford RA, Gajjar A, Ellison DW, Taylor MD, Grundy RG, Gilbertson RJ: Cross-species genomics matches driver mutations and cell compartments to model ependymoma. Nature. 2010, 466: 632-636. 10.1038/nature09173.
- 11.Rustici G, Kolesnikov N, Brandizi M, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Ison J, Keays M, Kurbatova N, Malone J, Mani R, Mupo A, Pedro Pereira R, Pilicheva E, Rung J, Sharma A, Tang YA, Ternent T, Tikhonov A, Welter D, Williams E, Brazma A, Parkinson H, Sarkans U: ArrayExpress update--trends in database growth and links to data analysis tools. Nucleic Acids Res. 2013, 41: D987-990. 10.1093/nar/gks1174.
- 18.Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, Jupe S, Kalatskaya I, Mahajan S, May B, Ndegwa N, Schmidt E, Shamovsky V, Yung C, Birney E, Hermjakob H, D'Eustachio P, Stein L: Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011, 39: D691-697. 10.1093/nar/gkq1018.
- 21.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- 25.Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O'Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, et al: Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010, 17: 98-110. 10.1016/j.ccr.2009.12.020.
- 27.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.
- 30.Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GSS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HKC, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TSK, Lin J-X, Houtman JCD, Desiderio S, Renauld J-C, Constantinescu SN, et al: NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010, 11: R3. 10.1186/gb-2010-11-1-r3.
- 32.Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, Feolo M, Fingerman IM, Geer LY, Helmberg W, Kapustin Y, Krasnov S, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Karsch-Mizrachi I, Ostell J, Panchenko A, Phan L, Pruitt KD, Schuler GD, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012, 40: D13-25. 10.1093/nar/gkr1184.
- 39.Molecular Brain. [http://molecularbrain.org]
- 40.Novershtern N, Subramanian A, Lawton LN, Mak RH, Haining WN, McConkey ME, Habib N, Yosef N, Chang CY, Shay T, Frampton GM, Drake AC, Leskov I, Nilsson B, Preffer F, Dombkowski D, Evans JW, Liefeld T, Smutko JS, Chen J, Friedman N, Young RA, Golub TR, Regev A, Ebert BL: Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell. 2011, 144: 296-309.
- 41.Chambers SM, Boles NC, Lin K-YK, Tierney MP, Bowman TV, Bradfute SB, Chen AJ, Merchant AA, Sirin O, Weksberg DC, Merchant MG, Fisk CJ, Shaw CA, Goodell MA: Hematopoietic fingerprints: an expression database of stem cells and their progeny. Cell Stem Cell. 2007, 1: 578-591. 10.1016/j.stem.2007.10.003.
- 42.Schliep K: Phylogenetics in R package phangorn. 2010, 1-46.
- 43.Pathprint. [http://compbio.sph.harvard.edu/hidelab/pathprint]
- 50.Metzeler KH, Hummel M, Bloomfield CD, Spiekermann K, Braess J, Sauerland MC, Heinecke A, Radmacher M, Marcucci G, Whitman SP, Maharry K, Paschka P, Larson RA, Berdel WE, Buchner T, Wormann B, Mansmann U, Hiddemann W, Bohlander SK, Buske C: An 86-probe-set gene-expression signature predicts survival in cytogenetically normal acute myeloid leukemia. Blood. 2008, 112: 4193-4201. 10.1182/blood-2008-02-134411.
- 51.Stirewalt DL, Meshinchi S, Kopecky KJ, Fan W, Pogosova-Agadjanyan EL, Engel JH, Cronk MR, Dorcy KS, McQuary AR, Hockenbery D, Wood B, Heimfeld S, Radich JP: Identification of genes with abnormal expression changes in acute myeloid leukemia. Genes Chromosomes Cancer. 2008, 47: 8-20. 10.1002/gcc.20500.
- 52.Tomasson MH, Xiang Z, Walgren R, Zhao Y, Kasai Y, Miner T, Ries RE, Lubman O, Fremont DH, McLellan MD, Payton JE, Westervelt P, DiPersio JF, Link DC, Walter MJ, Graubert TA, Watson M, Baty J, Heath S, Shannon WD, Nagarajan R, Bloomfield CD, Mardis ER, Wilson RK, Ley TJ: Somatic mutations and germline sequence variants in the expressed tyrosine kinase genes of patients with de novo acute myeloid leukemia. Blood. 2008, 111: 4797-4808. 10.1182/blood-2007-09-113027.
- 53.Wouters BJ, Lowenberg B, Erpelinck-Verschueren CA, van Putten WL, Valk PJ, Delwel R: Double CEBPA mutations, but not single CEBPA mutations, define a subgroup of acute myeloid leukemia with a distinctive gene expression profile that is uniquely associated with a favorable outcome. Blood. 2009, 113: 3088-3091. 10.1182/blood-2008-09-179895.
- 54.Assouline S, Culjkovic B, Cocolakis E, Rousseau C, Beslu N, Amri A, Caplan S, Leber B, Roy DC, Miller WH, Borden KL: Molecular targeting of the oncogene eIF4E in acute myeloid leukemia (AML): a proof-of-principle clinical trial with ribavirin. Blood. 2009, 114: 257-260. 10.1182/blood-2009-02-205153.
- 55.Stem Cell Commons. [http://stemcellcommons.org]
- 56.NHLBI Progenitor Cell Biology Consortium (PCBC). Bioinformatics and Genomics Tools. [http://www.progenitorcells.org/content/bioinformatics-and-genomics-tools]
- 57.Ho Sui SJ, Begley K, Reilly D, Chapman B, McGovern R, Rocca-Sera P, Maguire E, Altschuler GM, Hansen TA, Sompallae R, Krivtsov A, Shivdasani RA, Armstrong SA, Culhane AC, Correll M, Sansone SA, Hofmann O, Hide W: The Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell comparisons. Nucleic Acids Res. 2012, 40: D984-991. 10.1093/nar/gkr1051.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.