Content-based microarray search using differential expression profiles
- 9.3k Downloads
With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations.
We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3.
Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
KeywordsSmall Cell Lung Cancer Independent Component Analysis Gene Expression Omnibus Duchenne Muscular Dystrophy Dimension Reduction Method
With the development of the DNA microarray and other technologies that probe gene expression on an "omic" scale, we are now able to discover associations between biological conditions based on their molecular underpinnings. Seminal work by Golub et al.  classified leukemia samples by their global gene expression profiles, demonstrating that transcriptomic signatures can aid in functional prediction and improve our molecular understanding of disease. Hughes et al.  predicted the effects of novel gene deletions and chemical treatments by profiling yeast mutants and comparing new arrays to this reference. More recent studies examined cellular transcriptional response to drug treatment [3, 4] and disease [5, 6] in order to identify novel relationships between apparently unrelated conditions and compounds. This work not only demonstrated the utility of expression-based discovery, but also suggested that functional studies about drugs and diseases can utilize data from different platforms and cell types. This general approach to hypothesis generation - namely, finding associations between diverse conditions based on gene expression - has great potential to further biological and biomedical research if implemented on a large scale.
Here we develop methods for content-based gene expression search using an entire experiment as a query. That is, given an input experiment comparing case to control, we aim to identify other experiments that show similar patterns of differential expression. This concept is exemplified by the Connectivity Map , which searches for relationships between treatment-control comparisons for small molecules. While the Connectivity Map focused on drug treatment and disease, a similar approach across a sufficiently large data source would allow for the identification of associations between gene knockdowns, diseases, drugs, and myriad other perturbations and phenotypes. Public repositories provide a wealth of data amenable to this task. The largest of these repositories, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) , now contains over 400,000 individual samples from more than 17,000 experiments detailing the molecular characteristics of diverse cell types, diseases, and drug treatments. The European Bioinformatics Institute (EBI) ArrayExpress Repository  and Stanford Microarray Database  host additional data. While GEO supports searches of its content based on free-text and controlled-vocabulary annotations, there is increasing interest in methods for querying microarray databases based on the molecular measurements themselves [10, 11, 12, 13, 14]. The power of this approach would grow with the size of the repository.
Current methods for content-based search typically involve a two-step process: they identify a gene set of interest and then search for experiments in which this gene set is important. Several groups have introduced methods for identifying experiments that co-express  or differentially express  a given gene set. Recently, EBI implemented the Gene Expression Atlas, which provides this latter functionality over their curated array archive . These methods, however, require that both the query and target experiments differentially express genes above some hard threshold, and thus may miss more subtle or noisy relationships . Other approaches, typified by Gene Set Enrichment Analysis (GSEA) , partially bypass this requirement by comparing a subset of genes to ranked profiles, using a hard threshold for the query experiment and a soft threshold for the queried experiments [3, 4].
While previous approaches require designating a group of differentially expressed genes, we explore the possibility of using as a query a differential expression (DE) profile, consisting of a complete list of features and associated expression scores. By examining all genes shared between query and queried experiments, we aim to identify experimental conditions and perturbations that exhibit similar transcriptional responses. A successful strategy in this effort should reconcile differences between species, platform types, and normalization methods, as well as overcome the confounding effects of noise and technical replicability. To achieve this, we consider combinations of methods for three tasks: data representation, dimension reduction, and search algorithm.
First, we consider the problem of data representation. Typical microarray analysis methods represent differential expression as a fold-change, comparing the expression in one set of samples to that in another . However, because public expression databases consist of a broad range of data types and experimental modalities, rank-based representations are often employed to account for the disparities in the distributions of observed data [4, 10]. Here we compare both parametric and nonparametric data representations to determine the best approach for comparing DE profiles. We also consider an alternate representation of gene expression data, and construct DE profiles based on the p-value of differential expression.
A second challenge is that gene expression profiles from high-throughput technologies consist of up to tens of thousands of measurements per sample. In addition to the computational complexity involved in handling these large datasets, high dimensionality often confounds data mining techniques [16, 18]. In particular, high-dimensional, multimodal data lends itself to over-fitting and reduced performance . Many solutions to this problem have been proposed, of which dimensionality reduction is the foremost. Matrix decomposition [20, 21], feature selection , and module or gene-set based approaches [16, 23] attempt to capture the most relevant data while removing redundant or noisy features .
Given an appropriate data representation for differential expression, the final challenge is how best to calculate the similarity between two experiments. While Fujibuchi et al. use Spearman rank correlation to compare individual microarrays , it is not clear whether a similar approach is appropriate for DE profiles. Several recent studies use a modified Pearson correlation measure on rank-normalized profiles [4, 5, 24]. Other work suggests that weighting expression values by each gene's variance may improve classification and analysis [25, 26].
Evaluation of data representation and similarity measures
The receiver operating curves for the best-performing search methods indicated that, on average, about 50% of the true positives could be recovered with greater than 90% specificity. This high specificity is important for search because typically the first few results, rather than a complete list, are examined. To evaluate the performance of our search over the top results in each search, we calculated the "precision at 4" for each of the 32 experiments, permuting labels to create a null model (see Additional file 2). The average precision for Duchenne muscular dystrophy and Huntington's disease exceeded the random model at a 95% confidence interval for 13/15 and 8/8 experiments, respectively. The "precision at 4" for breast cancer, a genetically complex disease, was also high, but it significantly surpassed the random model in only 4/9 experiments.
Constructing a network of GEO differential expression experiments
Application to Nanog knockdown in embryonic stem cells
In addition to mouse ESC datasets, this search produced interesting comparisons with different experimental systems. Result 14 supports a similarity between Nanog knockdown and the comparison of non-small cell lung carcinoma (NSCLC) to small cell lung cancer (SCLC). SCLC, the more aggressive disease, has been linked with expression of stem cell factor  and the Hedgehog signaling pathway . These relationships suggest that, in a broad sense, SCLC compared to NSCLC may have a more stem-like transcriptional program.
Application to FoxO3 knockout in neural stem cells
Other matches from the FoxO3 search (Results 11, 12, 15) point to a role for FoxO3 in cellular response to cytokine interleukin-2 (IL-2) stimulation. All three of these matching profiles compare cytotoxic T cell line (CTLL-2) at 1 hour after IL-2 stimulation to a later time point (6, 12, or 16 hours). From the direction of these comparisons, we would predict that IL-2 stimulates a transcriptional program that is similar to that of FoxO3 knockout. Indeed, IL-2 signaling leads to phosphorylation and inactivation of FoxO3 in CTLL-2 cells , confirming this hypothesis.
In optimizing our data processing and search pipeline, we found that linear combinations of gene expression features derived in a separate compendium benefited our analysis. The most effective dimension reduction technique involved projecting each DE profile into a feature-space identified by independent component analysis. We previously used ICA to identify fundamental components of human gene expression from a large compendium of 10,000 arrays, of which only a small subset overlap with the experiments examined here . The ICA projection method reduced the set of features from on the order of 20,000 to less than 500, allowing for rapid indexing and searching of large libraries of differential expression profiles. Furthermore, this approach outperformed module-based methods, possibly because the linear model incorporated data from all of the genes rather than only those that participate in discrete gene sets. Despite the fact that these ICA features were derived in human data, they proved robust in identifying and ranking experiments in closely related species as well. Thus, our results support previous findings that gene expression features derived in one compendium can be useful for interpreting data from new datasets .
To calculate similarities between differential expression profiles, we introduced a novel weighting scheme that incorporates information about a feature's significance of differential expression. This approach provides an intuitive means for emphasizing the contributions of features that are significantly differentially expressed in both experiments, which may represent the most relevant common biology. At the same time, the weighted correlation incorporates even genes that are not significantly differentially expressed, potentially capturing the effects of broader transcriptional changes. We observed that this scheme worked well with Pearson correlation, but did not perform as well when combined with rank-based correlation. Future work will characterize the behavior of this similarity measure on a larger scale.
We used the most successful data processing pipeline to index all GEO DataSets. Our results with transcription factor experiments suggest that this approach can provide predictions for genes, phenotypes and perturbations that share functional similarities with a query experiment. Analysis of Nanog knockdown in ESCs successfully identified other ESC differentiation time courses, induced by a variety of factors, from amongst almost 10,000 other profiles (Figure 5). The same search predicted a link between small lung cell carcinoma and ESC transcriptional programs. For a less well characterized transcription factor, FoxO3, our method also succeeded in recapitulating known biology across species and experimental systems. Although it is clear that FoxO3 has lineage-specific effects [42, 49], we identified a role for FoxO3 in hypoxia response that appears to transcend tissue type [40, 41]. For uncharacterized comparisons, this information has the potential to provide useful hypotheses for phenotypes and pathways to investigate.
As in more traditional microarray analyses, however, interpretation of the most significant genes identified by our weighting scheme remains difficult. Our analysis of the FoxO3 search revealed a number of genes involved in both hypoxia and FoxO3 signaling, linking these two pathways. However, the top genes in the Nanog knockdown search failed to reveal convincing pathways that might explain the relationship between small lung carcinoma and ESCs. While we focus on the interpretation of several individual genes in this study, future efforts may benefit from the use of gene set enrichment tools to find pathways that are significantly represented in the top gene list.
As experimentalists continue to explore and deposit information about cellular processes and perturbations, the utility of content-based search approaches will increase. With a larger bank of transcriptomic data and a high chance of identifying overlapping and functionally related biology, an "experiment-omic" screen might be the first step in characterizing a novel dataset. To realize this, further ontological indexing of expression databases may also be necessary . Several groups have already begun to integrate expression with textual phenotype data to enable gene function prediction  and automatic disease diagnosis  from large databases. Even for expression-driven methods, controlled annotations for experimental variables, tissue types, and culture systems would allow for more accurate assessments of functional relevance. Finally, ontological indexing of textual annotations will enable the creation of more sophisticated connectivity maps linking not just diseases and drugs, but also gene knockdowns, over-expression studies, and genotype comparisons. These ontology-informed studies may not only search public repositories based on gene expression, but also provide meta-analysis across phenotypic categories.
We have explored computational methods needed to search large repositories for relevant experiments based on differential expression, using an experiment as a query. While previous studies use hard thresholding to select gene sets of interest [4, 11, 13], we propose a data-driven approach that uses information from all shared genes to compare two experiments. Differential expression profiles containing scores for each gene or feature were generated and compared using correlation metrics, following the hypothesis that this direct and intuitive method would perform well across diverse datasets. In a collection of 32 experiments comparing normal to diseased tissue, we achieved an average AUC of 0.737 for retrieving experiments that measure the same disease. We further demonstrated the ability of our method to identify functionally relevant experiments from a large database of studies. Future work will include implementing the principles learned here into a web-based application. Public deployment of these methods will enable discoveries in drug repurposing, disease classification, and systems repositioning as we explore the molecular underpinnings of diverse biological processes and phenotypes.
From a previous collection of disease-associated NCBI GEO microarray experiments , we collected 1,278 processed arrays comprising 32 experiments that compared normal to diseased tissue for Duchenne muscular dystrophy, breast cancer, and Huntington's disease. These experiments represented a variety of species, platforms, tissues, and normalization techniques, factors which might strongly influence the clustering of expression data.
Differential expression profiles
In transcriptomic studies, differential expression analysis identifies the genes and biological processes that vary between two samples. To represent this information from different datasets in a standardized manner, we mapped probesets to Entrez Gene identifiers using AILUN  and generated differential expression (DE) profiles for each comparison using Bioconductor software . Here, a DE profile consists of a list of features (e.g., genes) each with an associated score (e.g., fold change). For each comparison, we represented this differential expression score in three ways.
Log fold-change profile
We converted all microarray data to log values by examining the maximum and minimum values of the normalized probe-level data and applying log2 transformation as needed. We aggregated probes to genes using the fixed effects meta-estimate, calculating an average for each gene weighted by the variance of each probe . We calculated the fold-change difference between normal and disease by averaging samples within each group.
Probes were aggregated as for the log fold-change method. For each gene in each experiment, we determined the probability that the gene was differentially expressed with an empirical Bayes moderated t-statistic implemented in the limma R package (version 2.16.5) . We corrected for multiple hypothesis testing using the Benjamini-Hochberg method . For DE profiles represented in terms of a reduced set of features (see below), we applied limma to assess the differential expression of that feature.
For each sample in each experiment, we ranked probes based on their raw expression score, then averaged all scores for a probe to create a single score for normal and disease sample groups. We mapped from probes to genes by finding the median of the subtractive difference between all pairwise combinations of probes for the same gene in normal and disease.
For all three DE profile representations, we mapped genes to their human homologs using NCBI Homologene, removing genes that did not have one-to-one homologs between species (Additional file 7). While removing species-specific genes may result in loss of important biological information, we hypothesized that comparing global, conserved patterns of gene expression between experiments would prove sufficient to predict functional associations (see Additional file 1 for data on the number of genes mapped for each dataset discussed in the manuscript). Next, we applied one of two methods of dimension reduction.
Projection onto independent components
where A is the final reduced profile (423 features), S is the component matrix (components × genes), and X is the original profile in gene-space.
Fixed effect meta-estimate
To evaluate the performance of the ICA projection method, we also used a set of known features to reduce the dimensionality of our DE profiles. Given a collection of gene sets, we calculated a meta-score for each gene set using the fixed-effect meta-estimate, which represents an average across all genes in the set weighted by their inverse variance . This method summarizes the contributions of functionally coherent gene sets, and may be appropriate for expression analysis. We used MSigDB v2.5, a well described collection of 5,452 gene sets most often used in conjunction with GSEA . For comparison, we also derived gene sets from the ICA features described above: for each independent component, we created a module from all genes that scored three standard deviations above the mean in one direction. These 423 modules represented data-derived functionally coherent gene sets as determined by GO enrichment .
where w i is the weight for feature i, p ij is the FDR-corrected empirical Bayes p-value for experiment j, and C is a scaling factor. For this work, we empirically chose C = 2 because it delivered the best clustering of our disease compendium (data not shown). We used the ROCR package  to evaluate the performance of various data processing methods.
GEO DataSet search
To search GEO for experiments with similar transcriptional patterns, we indexed all GEO DataSets (GDSs). We downloaded processed data from GEO and used the GDS "Value type" field to transform the data to log2 space. Each GDS is manually annotated with one or more factors, e.g., "disease state" or "time", which outline the experimental conditions that vary between groups of samples. Within each GDS, we compared all combinations of groups for a single factor. For each of these comparisons, we created two DE profiles: one in gene-space, and one in the ICA feature-space described above. We calculated p-values for each gene and ICA feature using the empirical Bayes modified t-test as described . To search these DE profiles, we used the absolute value of the p-weighted Pearson correlation metric, since the direction of the comparison is arbitrary. To assess the significance of DE profile comparisons, we selected 10,000 random pairs of comparisons to serve as a background distribution of correlation scores. We estimated the false discovery rate (FDR) of our search results by calculating the percentage of these random comparisons that exceed a given similarity score. Because this random sampling may include true positive comparisons (e.g., two profiles from the same dataset), our corrected p-values may underestimate the significance of new comparisons.
The authors thank Boris Oskotsky and Alex Skrenchuk for assistance and technical support; Nicholas Tatonetti for critical comments; and Anne Brunet and Ashley Webb for interpretation of FoxO3 results. Computing resources at the Stanford Center for Biomedical Informatics Research were funded by the Hewlett Packard Foundation and the Lucile Packard Foundation for Children's Health. Financial support was provided by the National Library of Medicine (R01 LM009719) and Howard Hughes Medical Institute.
- 1.Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–7. 10.1126/science.286.5439.531CrossRefPubMedGoogle Scholar
- 2.Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell 2000, 102: 109–26. 10.1016/S0092-8674(00)00015-5CrossRefPubMedGoogle Scholar
- 3.Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR: The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 2006, 313(5795):1929–35. 10.1126/science.1132939CrossRefPubMedGoogle Scholar
- 4.Hassane DC, Guzman ML, Corbett C, Li X, Abboud R, Young F, Liesveld JL, Carroll M, Jordan CT: Discovery of agents that eradicate leukemia stem cells using an in silico screen of public gene expression data. Blood 2008, 111(12):5654–62. 10.1182/blood-2007-11-126003CrossRefPubMedPubMedCentralGoogle Scholar
- 7.7. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R: NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 2009, (37 Database):D885â€“90. 10.1093/nar/gkn764CrossRefGoogle Scholar
- 8.8. Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone SA, Sklyar N, Zhao M, Sarkans U, Brazma A: ArrayExpress update-from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 2009, (37 Database):D868â€“72. 10.1093/nar/gkn889CrossRefGoogle Scholar
- 16.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–50. 10.1073/pnas.0506580102CrossRefPubMedPubMedCentralGoogle Scholar
- 30.Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, Chen X, Bourque G, George J, Leong B, Liu J, Wong KY, Sung KW, Lee CWH, Zhao XD, Chiu KP, Lipovich L, Kuznetsov VA, Robson P, Stanton LW, Wei CL, Ruan Y, Lim B, Ng HH: The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat Genet 2006, 38(4):431–40. 10.1038/ng1760CrossRefPubMedGoogle Scholar
- 46.Shoshani T, Faerman A, Mett I, Zelin E, Tenne T, Gorodin S, Moshel Y, Elbaz S, Budanov A, Chajut A, Kalinski H, Kamer I, Rozen A, Mor O, Keshet E, Leshkowitz D, Einat P, Skaliter R, Feinstein E: Identification of a novel hypoxia-inducible factor 1-responsive gene, RTP801, involved in apoptosis. Mol Cell Biol 2002, 22(7):2283–93. 10.1128/MCB.22.7.2283-2293.2002CrossRefPubMedPubMedCentralGoogle Scholar
- 49.Paik JH, Kollipara R, Chu G, Ji H, Xiao Y, Ding Z, Miao L, Tothova Z, Horner JW, Carrasco DR, Jiang S, Gilliland DG, Chin L, Wong WH, Castrillon DH, DePinho RA: FoxOs are lineage-restricted redundant tumor suppressors and regulate endothelial cell homeostasis. Cell 2007, 128(2):309–23. 10.1016/j.cell.2006.12.029CrossRefPubMedPubMedCentralGoogle Scholar
- 53.Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80CrossRefPubMedPubMedCentralGoogle Scholar
- 55.Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3: Article 3.Google Scholar
- 56.Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc 1995, B(57):289–300.Google Scholar
- 57.Kaufman L, Rousseeuw PJ:Finding groups in data: an introduction to cluster analysis. Hoboken, N.J.: Wiley; 2005. [http://www.loc.gov/catdir/enhancements/fy0626/2005278659-b.html]Google Scholar
- 58.Hedges LV, Olkin I:Statistical methods for meta-analysis. Orlando: Academic Press; 1985. [http://www.loc.gov/catdir/description/els032/84012469.html]Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.