Connecting genes, coexpression modules, and molecular signatures to environmental stress phenotypes in plants
- 15k Downloads
One of the eminent opportunities afforded by modern genomic technologies is the potential to provide a mechanistic understanding of the processes by which genetic change translates to phenotypic variation and the resultant appearance of distinct physiological traits. Indeed much progress has been made in this area, particularly in biomedicine where functional genomic information can be used to determine the physiological state (e.g., diagnosis) and predict phenotypic outcome (e.g., patient survival). Ecology currently lacks an analogous approach where genomic information can be used to diagnose the presence of a given physiological state (e.g., stress response) and then predict likely phenotypic outcomes (e.g., stress duration and tolerance, fitness).
Here, we demonstrate that a compendium of genomic signatures can be used to classify the plant abiotic stress phenotype in Arabidopsis according to the architecture of the transcriptome, and then be linked with gene coexpression network analysis to determine the underlying genes governing the phenotypic response. Using this approach, we confirm the existence of known stress responsive pathways and marker genes, report a common abiotic stress responsive transcriptome and relate phenotypic classification to stress duration.
Linking genomic signatures to gene coexpression analysis provides a unique method of relating an observed plant phenotype to changes in gene expression that underlie that phenotype. Such information is critical to current and future investigations in plant biology and, in particular, to evolutionary ecology, where a mechanistic understanding of adaptive physiological responses to abiotic stress can provide researchers with a tool of great predictive value in understanding species and population level adaptation to climate change.
KeywordsGene Coexpression Network Weighted Gene Coexpression Network Analysis Salicylic Acid Signaling Late Embryogenesis Abundant Gene Coexpression Module
The advent of high-throughput genome sequencing coupled with breakthroughs in the field of functional genomics has provided an unprecedented opportunity to study the molecular mechanisms that govern the dynamic behavior of cells, organs, and organisms . Indeed, there are excellent examples documenting interdisciplinary use of these emerging technologies, from human genome SNP scans diagnostic of human disease susceptibility [2, 3] to discovery of the genetic mechanisms underlying beak morphology of Darwin's finches . Applications are also apparent in plant biology, where the use of genomic technologies have uncovered stress-dependent behaviors in mechanistic detail (see  for a review). Such studies have led to the elucidation of highly complex and interacting networks of the abiotic stress response. For example, salinity, drought, and cold elicit a dehydration response that shares many common elements and interacting pathways [6, 7]. These findings have spurred additional investigations searching for shared signaling cascades or molecules associated with pathway integration, or cross-talk, and have led to numerous candidates including reactive oxygen species (ROS) and calcium signaling [8, 9], hormones [10, 11] and others [12, 13, 14]. However, despite the advances made possible by "omics"-based technologies, we still struggle to accurately associate the genes, transcriptional cascades, and signaling networks with physiological performance and ecological fitness.
One obstacle to this lack of association is perhaps the result of two opposing paradigms often used in comparative physiology . The first approach, termed gene-to-phenotype, is typified by that of many "omics"-based studies where the effects of specific genes on phenotypic performance and fitness are evaluated (e.g., a reverse genetics approach, ). This is in contrast to the phenotype-to-gene approach where the biologist attempts to determine the evolutionary potential of a given trait within a population without identifying the underlying genes (e.g., ecological genetics ). Thus, the latter approach is interested in the potential for a trait to evolve, while the former focuses on the underlying genetic mechanism of a particular trait. The integration of both approaches will be an important component of the emerging field of evolutionary and ecological genomics, which aims to study adaptation of natural populations to their environment .
To fully understand the genetic mechanisms underlying physiological adaptation to abiotic stress, we must first begin to understand the complex biological processes of how the resultant phenotype is generated from the genotype and then seamlessly coalesce our newfound understanding with population and evolutionary genetics. To initiate this task, we have adapted and integrated two recent analytical advances from the biomedical community. The first approach uses a novel weighted gene coexpression network to determine signaling networks and core genes underlying disease states and evolutionary diversification [19, 20, 21]. The second approach explores the genomic signature concept as recently defined by Lamb et al. , and is currently used to connect the disease state of an organism with the underlying genes and possible drug treatments . Our purpose is to determine if these techniques can be used to associate the abiotic plant stress transcriptome with common and specific pathways underlying phenotypic response in a manner that is conducive to current and future genetic studies. We address this by combining gene coexpression networks with the genomic signature concept to investigate transcript profiles for plants exposed to drought, osmotic, salt, cold, heat, and UV-B stress. Our intent is not to describe in exhaustive detail the genes unique to or common among these stresses, although we do this to some extent, but rather to illustrate the power of this approach and provide sufficient information so that we and others can evaluate the full potential of this technique for plant biologists and evolutionary ecologists.
Arabidopsis stress gene coexpression network
It is known that the plant stress response is characteristic of highly complex and often integrated signaling pathways [6, 7, 8, 9, 10, 11, 12]. To help elucidate the transcriptional networks associated with exposure to abiotic environmental stress, a weighted gene coexpression network was constructed as described in Zhang and Horvath  and in Materials and Methods from a subset of the AtGenExpress abiotic stress dataset . The data subsets were determined by first analyzing all abiotic stress datasets separately for differential gene expression between control and treatment conditions using the limma package  operated within the R statistical program language . Genes that had an adjusted p < 0.01 and a log-odds ratio > 1.5 were deemed significantly differentially expressed and were subsequently included in the data subset. This subset, or initial input gene list, contained 16,036 (~57% of the genome) unique gene transcripts identifiers with significantly higher or lower abundance at least once per treatment and per time-point [see Additional file 1].
The green module contained genes known to participate in the heat response pathway including many of the heat shock proteins (HSP 83- At5g52640, HSP23.5-At5g51440, HSP17.6-At5g12030, mtHSP70-At5g09590, HSP21-At4g27670, mtHSP23.6-At4g25200, HSP22-At4g10250, HSP17.4-At3g46230, HSP70-At2g32120, HSP101-At1g74310, HSP-17.6-At1g53540). In addition, the green module eigengene significantly correlated with the expression of heat responsive markers genes At5g59720 (Pearson cor = 0.95, p = 10 × 10-20) and At4g36990 (Pearson cor = 0.78, p = 4.6 × 10-14) as previously reported [32, 33, 34].
The brown module showed a slight yet significant relationship to the cold treatment (Fig. 3C; p = 1.3 × 10-3, Kruskal-Wallis test) and was enriched with gene products targeting the thylakoid (GO:0044436, p = 1.4 × 10-78) and that participate in photosynthesis (GO:0015979, p = 1.8 × 10-7). The module was most associated with the early stages of cold stress as determined by differential gene enrichment [see Additional file 2] and the -0.57 Pearson correlation (p = 10 × 10-20) between eigengene value and the known cold responsive marker genes [33, 35, 36] COR15A (At2g42540), but not in At4g25480 (DREB1A), as previously reported .
The turquoise module showed a significant relationship for both salt (Fig. 3A; p = 5.4 × 10-5, Kruskal-Wallis test) and osmotic (p = 5.4 × 10-5) treatments. The module is enriched with GO categories in starch metabolism (GO:0005982, p = 1.9 × 10-5), protein/peptide degradation (GO:0009056, p = 1.1 × 10-4), and included dehydration responsive genes (At5g66400-RAB19, AT5G52300-RD29B, At5g25610-RD22, At3g56080, At1g01250-DREB A4), early dehydration responsive genes (At1g08930-ERD6, At1g69450-ERD4 like) and late embryogenesis abundant genes (At5g06760, At4g36600, At4g13560, At2g35300, At1g32560). The module eigengene value significantly correlated with known maker genes for salt [33, 37, 38] (ATPLC1, Pearson cor = 0.68, p = 7 × 10-10) and osmotic stress (KIN1, cor = 0.74 p = 3.5 × 10-12; COR78, cor = 0.64, p = 1 × 10-8) [33, 39, 40], further suggesting a role for the turquoise module in the dehydration responsive program.
The blue module showed a significant relationship with the UV-B treatment (Fig. 3D; p = 2.2 × 10-5, Kruskal-Wallis test), which is characterized as a down-regulation response. According to ontology analysis, blue module genes tend to encode products that are involved in protein modification (GO:0006464, p = 1.3 × 10-19) including amino acid phosphorylation (GO:0006468, p = 3.2 × 10-25) and translation (GO:0043687, p = 9.3 × 10-21).
The yellow module also showed a significant relationship with the UV-B treatment (Fig. 3B; p = 2.2 × 10-5, Kruskal-Wallis test), except that the module genes were up-regulated. Ontology analysis indicated that the yellow module was enriched for macromolecule biosynthetic processes (GO:0009059, p = 1.3 × 10-90) including organelle biogenesis (GO:0006996, p = 4.8 × 10-64) and ribosome biosynthesis and assembly (GO:0042254, p = 6.8 × 10-58). The blue module showed a significant relationship with the UV-B marker At2g24850 (Pearson cor = 0.7, p = 1.3 × 10-10), but not At5g52250 (cor = 0.05, p = 0.67). Alternatively, the yellow module displayed a weak yet significant relationship with At5g52250 (cor = 0.26, p = 0.037), but not At2g24850 (cor = -0.13, p = 0.29) as previously reported .
According to the singular decomposition values and differential gene enrichment analysis, the red module showed significant relationships with nearly all stress treatments [see Additional file 2, Fig. 2]. Ontology analysis indicted that the red module was overrepresented with genes participating in signal transducer activity, including transmembrane receptor activity (GO:0004888, p = 7.6 × 10-3) as well as response to environmental stimuli (GO:0050896, p = 0.013). Interestingly, the most connected gene within the red module, or hub, is an uncharacterized ankyrin repeat family protein [At5g54720; see Additional file 3], which has been shown to regulate salicylic acid signaling. Genes involved in calcium-based signaling also enriched the red module, including calcium dependent protein kinases, calmodulin related proteins, calcium and calmodulin binding proteins [see Additional file 3].
As illustrated above, genes with significantly higher or lower transcript abundance were associated with specific modules depending on the duration of and kind of stress treatment. To relate patterns of genome-wide mRNA expression to phenotypic state, we adapted the genomic signature concept from Lamb et al.  where statistical approaches are used to scan an unknown query signature against a database of known reference profiles. For our purposes, the database of reference profiles was created from the above AtGenExpress dataset specific to UV-B, heat, salt, cold, osmotic, and drought treatments at the four initial time-points already detailed above. The 'query' signatures were derived from independent studies imposing UV-B  and cold  treatments as well as our own heat and drought investigations. In contrast to the Lamb et al.  approach, we used ordered list statistics [43, 44, 45] to determine structural similarities among gene ranks of query and reference signatures.
Signature comparisons. Similarity score comparisons for independently derived query signatures for cold at 3 h, 6 h, and 24 h ; UV-B ; heat alone, drought alone, and heat and drought applied simultaneously; scanned against the reference signature database. The most similar scores per treatment are bolded. The top row represents time in (h) for stress treatment duration of the reference signature.
Cold 3 h
Cold 6 h
Cold 24 h
Heat & Drought Simultaneous
Next an independent UV-B specific query signature was generated from Brown et al.  data to scan against our reference database. Despite differences in growth and treatment induction conditions, the query signature showed strong similarity with the UV-B 3 h and 6 h reference signatures (Table 1, Fig. 4B). Interestingly, the common intersect genes between reference and query signatures were enriched with transcription factors, particularly the WKRY family At1g80840, At2g38470, At5g24110 and zinc finger family At1g27730, At5g27420, At5g27420, At5g59820, [see Additional file 4].
We next created a signature from the simultaneously imposed heat and drought treatments to query the reference database. Interestingly, the two highest similarity scores were 3 h heat (Table 1, Fig. 5B) and 6 h drought (Table 1, Fig. 6B), indicating that the genomic signature concept as applied here has the potential to detect more than one environmental perturbation within a single treatment. Unfortunately, the significance of the similarity score is difficult to interpret. To address this more specifically, we decomposed the dual stress signature into its 6 independent replicates [see Additional file 5]. The results still place drought and heat as the two most similar signatures yet the significance of the drought similarity is weak. However, the similarity score becomes significant when the depth of the signatures is increased [see Additional file 5 for further discussion]. Nonetheless, the query signature had 17 genes in common with the drought signature including the drought responsive marker RD20 (At2g33380), and contained 26 genes in common with the heat reference including many of the heat shock protein encoding genes and a number of genes encoding DNA heat shock N-terminal domain-containing proteins [see Additional file 4].
The goal of our study was to demonstrate the use of an integrative systems approach for connecting gene expression patterns to physiological characteristics, thereby providing mechanistic insight into genome function under abiotic stress conditions. Central to our approach is the use of the genomic signature concept to characterize the plant stress phenotype and provide a link to the underlying network pathways, modules, and eventually genes. The use of expression array data to create a signature cataloging system (reference signature database) has been used previously to characterize chemical perturbations on tissue samples and cell culture populations [47, 48], and more recently to link genes and disease states to potential therapies [22, 23]. In the present study, we extend the signature cataloging approach to plant biology/ecological genomics by using the ATGenExpress abiotic stress dataset to compile our first-generation reference signatures database.
Validation of the reference database, and the approach in general, was accomplished with independent datasets for UV-B , cold  and our own datasets for heat, drought, and the simultaneously imposed heat-drought treatments. Altogether, more than half of the stress treatments included in the signature database were scanned by independent query signatures. Our results are encouraging and show that despite differences in array platform, growth conditions, and even the application of treatments, the signature approach is robust in classifying the plant stress phenotype. This was particularly evident with highly conserved stress specific responses such as heat and UV-B. At the same time, our results illustrate the complexity of the stress response that is characteristic of cross-talk pathways [8, 9, 10, 11, 12, 13, 14] and multiple secondary effects from prolonged treatments. For example, the early cold stress query signatures (3 h and 6 h) showed very high similarity to cold signatures with only weak similarity scores to other signature phenotypes. Alternatively, the 24 h cold query showed similarity to cold signatures as well as drought and osmotic signatures. This result likely reflects the secondary effects of the prolonged (3 h and 6 h vs. 24 h) cold treatment. Not surprisingly, the co-occurrence of cold and dehydration response reflected in their signaling pathways, or cross-talk, is widely reported in the literature [7, 49].
One promising aspect of the signature approach as applied in this study is in the potential use for classification of the dual imposed heat/drought treatment. In nature, a departure from the homeostatic equilibrium, or stress, is often brought about by multiple environmental factors . Heat and drought, for example, are co-occurring stresses that have been implicated in severe yield losses (, and citations within). In this study, the highest similarity scores were observed with the heat and drought reference signatures, but the significance of the drought score was dependent upon the depth the signature lists interrogated (see Additional File 5). One explanation for this finding is that the drought responsive transcripts were further down the signature list than the more responsive heat induced transcripts, thereby requiring a greater depth of the signature lists to be compared. This suggests that care must be taken with comparisons between multiple stress phenotypes. However, our results are encouraging in this regard and future research should consider additional statistical means for determining depth of signature list comparisons.
Network theory and analysis was used in an attempt to relate the phenotypic signature information to genome-wide transcriptional programs. Network theory, in general, is promising in this regard because it allows us to view the biology as a system of networks and interacting modules . Here, we use the weighted gene coexpression network approach recently proposed by Zhang and Horvath , which has been used successfully to link molecular targets to oncogenic signals , complex traits (e.g., mouse weight; ), and even network divergence between human and chimpanzee neural patterns . This approach is particularly relevant for our application because it is based on unsupervised clustering, bypasses multiple testing problems when relating gene information to physiological traits, and does not need a priori gene ontology information. The latter point is especially important for ecological genomics, which continues to transition from the use of model organisms to those of more ecological relevance.
Results from weighted gene coexpression network analysis produced six distinct modules from the abiotic stress dataset. Importantly, this unsupervised approach grouped genes into network modules that are reflective of biological process. For example, brown module genes clearly participate in photosynthetic processes while turquoise module genes contribute to starch and sucrose regulation. In addition, specific stress responsive modules were identified. The green module, for example, was almost entirely unique to the heat stress pathway and was, in fact, enriched with genes known to participate in heat responsive programs. Equally interesting was the identification of module genes participating in multiple stress responsive pathways. This was apparent for modules consisting of conserved metabolic pathways i.e., brown (photosynthesis) and blue (starch/sucrose metabolism) modules.
One of the more promising aspects of weighted gene coexpression network analysis was the identification of a common abiotic stress responsive module (red module) that enriched differentially expressed genes for all treatments investigated. The most connected gene, or hub, within this module was an uncharacterized ankyrin repeat family protein that was specific to our analysis. Ankyrin proteins have been reported to act as regulators in salicylic acid signaling, which is a key molecule in signal transduction of biotic stress responses . The discovery of this ankyrin family member as the hub in our common stress responsive module suggests that salicylic acid signaling may play a role in abiotic stress response, which would corroborate results from exogenously applied salicylic acid . In addition, this common stress responsive module was enriched with genes known to participate in calcium and calmodulin signaling pathways, which have been shown to participate in a multitude of cellular functions including cell death .
Although our findings are robust within the current context, a number of questions remain to be answered. For example, the reference database is generated from immediately perturbed systems that typically exhibit marked and highly significant changes in transcript abundance, and does not include acclimated states where changes in transcript abundance are typically smaller. This has recently been shown in studies investigating changes in gene expression in response to long-term growth at elevated carbon dioxide concentration [54, 55, 56]. Therefore the feasibility of scanning the database with a signature from a fully acclimated organism and obtaining a highly correlated signature is uncertain. However, we hypothesize that the acclimated state will also be characterized by unique expression patterns that, in theory, should be amenable to our approach. Like Lamb et al. , we are also uncertain how to interpret the significance of the similarity score. Unique to our approach is the use of ordered lists statistics to compare signatures. This statistical test provides a p-value based on permuted data that indicates if signature comparisons are more similar than by chance alone. Unfortunately, the interconnectedness among stress responsive pathways resulted in low p-values even for some low similarity score comparisons (data not shown). However, we are reluctant to disregard the p-value entirely, because as the reference signature database grows and more diverse datasets are included, the p-value may help assign phenotypes to general category (e.g., abiotic stress vs. development).
Here, a first-step approach toward classifying and understanding the processes behind the plant stress phenotype is presented. We integrated two analytical techniques that have traditionally been applied only within the biomedical community. Results from our adaptation of the these techniques show that one can take an unknown query signature and through pattern matching software scan a reference database to classify both singular and multiple plant stress phenotype(s). Then, one can use a number of inferential techniques to link phenotypic attributes to their corresponding signaling modules and genes. In essence, this technique provides a tool allowing one to navigate the potential phenotypes of a given Arabidopsis genotype. In the current context, the approach is restricted to a single organism. However, a number of technical advances, including sequence-based transcriptomics , comparative gene ontology algorithms , and analytical approaches for linking network characteristics to quantitative genetics  illustrates the potential to enrich our methodology to address questions of evolutionary and ecological interest, particularly physiological trait development.
There are two attributes of our approach that facilitate its use for such purposes. First, the technique is applied within a network framework. Network theory, has been well received in molecular biology for providing a 'systems biology' framework for the discipline (see  for a historical perspective), and has more recently been proposed as possible means for determining the evolutionary basis of complex phenotypic traits . Second, and just as important, is the potential to link our approach with population-based genetic analyses. Many of the molecular-based, systems biology experiments are conducted within a narrow adaptive context, with little or no regard for other nonadaptive evolutionary forces (drift, mutation, recombination, gene flow). The inclusion of a genetic association with network analysis, as demonstrated by , and placed within a population genetic context allows the appropriate testable null models (e.g., genetic drift) to be included in such studies. Therefore, relating genomic information to genetic information, e.g., quantitative trait loci, is not only possible, but crucial for those interested in exploring the full potential of the evolutionary mechanisms shaping phenotypic development.
Although useful in its current form, we envision that the true potential of our approach will be realized when the scientific community accepts, critiques, and eventually amends our methods with current and future applicable analytical and technological improvements. To facilitate this process, we conducted our analysis within the publicly available R statistical language and make available to the scientific community our signature compendium, R scripts used within, and a brief tutorial illustrating the process, with the near-term goal of providing the community with an integrative systems tool for connecting genes and signaling networks to phenotypic characteristics in order to further the continuing goal of understanding plant genome function.
In the present study, we detail the initial stages of a theoretical and analytical framework for classifying the plant stress phenotype according to the architecture of the transcriptome, and relate that information to underlying coexpression networks and genes. Our results confirm the existence of known stress-specific genomic signatures, report previously unknown stress-responsive modules and genes, and successfully scale such information to the physiological state of the phenotype. We are encouraged by the results of our present investigation and believe that the approaches developed and information gained here will be critical as we continue to use these tools to better understand species and population-level adaptation to environmental stress, including stress resulting from climate change. Future research in our laboratory is centered on linking functional genomic approaches to genetic information, thereby providing a clear means to pursue evolutionary and ecological genomics at the level of individual organisms, populations, and ecosystems.
Plant growth and treatments
For hydroponic plant production, seeds of Arabidopsis thaliana 'Columbia' wild-type were cultivated in a modification of the system described in Norén et al. . Briefly, seeds were sterilized in 70% ethanol 5 min, gently agitated in 0.5% SDS solution 15 min, then triple rinsed in sterile Type I water and stratified for 3 d at 4°C. Sterilized seeds were suspended in a 0.1% agarose solution for dispensing onto nutrient agar. Nutrient agar for germination contained 0.5X MS (Sigma-Aldrich) pH 5.7 plus 0.3% phytagar/0.3% phytagel. Approximately 200 μl sterile nutrient agar was placed in the barrel of a 200 μl pipet tip in a 96-well cell well plate. After 2–3 seeds were placed on the cooled and solidified agar, plates were covered with lids, sealed with parafilm, and placed in a CMP 3244-controlled Conviron growth cabinet under 10 h, 22°C days/16 h, 18°C nights with fluorescent lighting (~45 μmol m-2 s-1) for 2 weeks until the seedlings had produced 6–8 leaves. The seedlings were then transferred to holes cored into a Styrofoam lid floating atop a 10L plastic tub filled with 1 mM N nutrient solution (1.0 mM KNO3, 1.5 mM CaCl2, 1.0 mM MgSO4·7H20, 1.0 mM K3PO4·7H20, 0.5 mM NH4Cl; micronutrients and Fe-EDTA as described in Norén et al. 2004; and 0.5 gL-1 MES, pH 5.7). Tubs were transferred to a Conviron BWD80-controlled growth room programmed for 10 h, 22°C days/16 h, 18°C nights, 80% RH with aeration provided by aquarium stones attached via tubing to an air pump. Metal halide and tungsten incandescent lamps provided an average of 110 μmol m-2 s-1 within the growth chamber. Seedlings were misted twice daily for approximately one week and lightly covered with a dome to reduce desiccation. A series of shadecloth coverage (70%-50%-30%-no shade) was used to acclimate the plants to MH lights over the course of 7–10 d. During the fourth week the plants were gradually introduced to lower humidity by loosening and finally removing domed lids. After complete chamber acclimation (week 5), plantlets were transferred to JetFlo® Econo Mini hydroponic systems (American Agritech). The system consisted of a 22" × 22" × 7" ebb and flow tray mounted to a 20 gallon reservoir containing 1 mM N nutrient solution as described above with aeration provided by a circulating pump and aquarium stones in both upper and lower chambers.
In total, there were 10 hydroponic systems resulting in a population of over 200 Arabidopsis plants at a relatively mature pre-flowering stage. At the end of 11 weeks, 24 plants were randomly chosen and six each were subjected to the following treatments: (1) Drought- systems drained and roots exposed to air until ~10 % fresh weight had been lost; (2) Heat – systems placed in 38°C chamber for 3 h; (3) plants subjected to combined treatments 1 & 2; and (4) no change (control). At the end of the treatment, plant shoots were harvested and immediately flash frozen in liquid nitrogen for subsequent RNA isolation.
RNA preparation, design, and microarray hybridization
Total RNA was isolated and labeled from control and treatment shoot tissue using methods from the Vicki Chandler Lab . Briefly, the TRIZOL reagent procedure was initially used to isolate total RNA that was further purified with RNA clean-up spin columns (Qiagen). This two-step process ensured for high quality RNA determined spectrophotometircally (Nanodrop) and with denaturing agarose gels. For each treatment, RNA was transcribed into cDNA primed with T7 Oligo (dT). Amino Allyl-modified RNA was then amplified and purified using manufacturer directions (Ambion) as modified by the Vicki Chandler lab (refer to  for protocol). Six μg of each amplified RNA (aRNA) sample was coupled to either CY3 or CY5 monochromeric dye (Amersham) also according to the Chandler protocols. Three μg of labeled aRNA were subsequently used for hybridization onto Arabidopsis slides printed with Operon v 3.1, 70 mer oligos obtained from the laboratory of David Galbraith at the University of Arizona . A formamide-based (50%) hybridization solution was applied at 42°C for 14 h and then washed according to Galbraith protocols .
The experiment consisted of one control and three treatments (heat alone, drought alone, heat and drought together) with six replicate plants RNA samples per treatment. To maximize the full power of the design, we used a direct loop design in which half of the samples from each treatment were labeled with CY3 and the other half were labeled with CY5. Therefore, each treatment was directly compared with all other treatments with the possible confounding effects of dye bias controlled for by the inclusion of dye reversals [see Additional file 6].
Microarray data analysis
Arrays were imaged with a Scan Array 5000 system (Perkin Elmer, Wellesley, MA) and the resultant image files were imported into Imagene v.6.0 for data extraction and initial data diagnostics. All statistical calculations were performed using the limma (v.2.10.5) Bioconductor open access package run in the R statistical framework (R Development Core Team, 2005). Quality of the hybridization was assessed with the Bioconductor packages arrayMagic v.1.14.0 and arrayQuality v.1.12.0. Log-fold (M) over variance (A) plots (M-A plots) were used to determine the most appropriate background correction method. We used basic subtraction in this study, but normexp, normexp offset = 50, and no correction at all were investigated. Arrays were normalized with print-tip loess and no between array normalization was applied. Using limma, a linear model was fitted to compare control to all treatment conditions as contrasts and empirical Bayes was used to compute a moderated t-statistic [25, 65]. Multiple testing errors were accounted for by using a false discovery rate (FDR) correction .
Construction of gene coexpression network
Data for network construction came from the AtGenExpress consortium  using the first four time-points for cold (4°C), heat (38°C), osmotic (300 mM mannitol), salt (150 mM NaCL), drought (15 min air dry, about 10% loss of fresh weight) and UV-B stresses (15 min exposure, 1.18 w/m2 Phillips TL40W/12). A complete description of experimental design and treatments for the abiotic portion of the AtGenExpress was recently reported by . All hybridizations were performed on the ATH1 affymetrix microarray platform . To reduce noise of the subsequent network, the input gene list was restricted to genes that were differentially expressed at a minimum of one time-point and one treatment. This resulted in a list of 16,036 (~71% of the all non-control array identifiers on the ATH1 slide) unique gene transcript identifiers [see Additional file 1]. Raw intensity values from the gene list were downloaded and log2 transformed before being subjected to network construction algorithms.
Construction of the gene coexpression network has been described in detail  and R scripts for our network construction were modeled after the Weighted Gene Co-Expression Network . Briefly, the framework for the weighted gene coexpression network (WGCNA) consists of 4 steps: (1) A similarity matrix of gene coexpression is initially determined by the absolute value of the Pearson correlation for all genes across all treatments; (2) transformation of coexpression similarities into connection strengths (connectivity) using a power adjacency function; (3) identification of network modules (highly correlated gene expression patterns across samples) by coupling linkage hierarchical clustering with topological overlap matrix; and (4) relating external gene or sample information to network properties. Details and tutorials of the WGCNA are available at the website.
Similar to Lamb et al. , the first step in our process was to create a catalog of reference signatures representing known biological states, plant stress phenotypes in our case. As a first step in this process, raw .cel files from the AtGenExpress dataset were downloaded and analyzed for differential expression. We again used the limma (v.2.10.5) Bioconductor open access package run in the R statistical framework. The entire dataset was normalized using gcRMA procedures and the linear model was fitted and contrasts were restricted to within treatment conditions. A moderated t-statistic was computed using empirical Bayes [25, 65] and a FDR correction . The log-fold change in expression relative to the control was used to rank the lists. Therefore, the first gene on the list represented the most over-expressed gene in relation to the control while the last gene in the signature represented the most down-regulated gene in relation to the control.
The reference signature database therefore consisted of six treatments for the first four time-points, resulting in 24 independent signatures. The rank-based nature of the signatures allowed us to use list statistical tests from the OrderedLists Bioconductor statistical package [43, 44]. The test includes a user defined value for how many genes to consider in the list comparison. In our case, we found that an alpha level of 0.3 interrogated a substantial portion of the list while focusing on list portions that were significantly differentially expressed for all conditions. The similarity score is computed as the shared number of genes between both signatures (lists) including the weighted sum score as determined by a weighting vector (refer to ). Therefore, the ends of the list are weighted most highly and contribute more to the similarity score, ensuring that the most differentially expressed genes contribute most to the score. Similar to Lamb et al. , the output consists of a non-parametric similarity score generated from the number and weights of the rank-based comparison between query and reference signatures. In addition, a p- value is calculated based on the rank of the query signature genes compared to the rank of the reference profile genes and a 1000 random permutations of the reference profile (refer to ). Therefore, a significant p-value means that the genes in common to the lists are not due to chance.
Availability of the methods and data
As mentioned in the manuscript, we present only a prototype version with limited number of signature phenotypes represented. The full potential of this resource, however, will become apparent when the representative phenotypes grow and began to exhaust that realized by the genome, and eventually genomes from other ecotypes and closely related species. Therefore, our microarray analysis has been deposited in the Gene Omnibus database (GEO: GSE9415). In addition, a the R scripts to reproduce our weighted gene coexpression network results [see Additional file 7] as well as the complete file of all reference genomic signature lists has been provided [see Additional file 8]. Information regarding network theory and R-code tutorials for weighted gene coexpression networks is available from the Steve Horvath Lab .
The authors would like to thank David Yang and Chris Schadt for critical reviews of the manuscript and the AtGenExpress consortium for availability of their data. In addition, the authors are indebted to Steve Horvath for contributing R-code. We acknowledge support from the U.S. Department of Energy (DOE), Office of Science, Biological and Environmental Research (BER) program as part of its Program for Ecosystem Research (PER) and contract No. DE-AC02-98CH10886 to Brookhaven National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the DOE under contract DE-AC05-00OR22725.
- 10.Anderson JP, Badruzsaufari E, Schenk PM, Manners JM, Desmond OJ, Ehlert C, Maclean DJ, Ebert PR, Kazan K: Antagonistic interaction between abscisic acid and jasmonate-ethylene signaling pathways modulates defense gene expression and disease resistance in Arabidopsis. Plant Cell. 2004, 16 (12): 3460-3479.PubMedCentralCrossRefPubMedGoogle Scholar
- 16.Lewin R: First success with reverse genetics. Science. 1986, 233 (4760): 159-160.Google Scholar
- 17.Lowe A, Harris S, Ashton P: Ecological Genetics. 2004, 326-Oxford , Blackwell ScienceGoogle Scholar
- 22.Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR: The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006, 313 (5795): 1929-1935.CrossRefPubMedGoogle Scholar
- 24.Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D'Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007, 50 (2): 347-363.CrossRefPubMedGoogle Scholar
- 26.Team RDC: R: A language and environment for statistical computing., http://www.R-project.org
- 28.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: From the Cover: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005, 102 (43): 15545-15550.CrossRefGoogle Scholar
- 30.Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Qi S, Chen Z, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS: Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci U S A. 2006, 103 (46): 17402-17407.PubMedCentralCrossRefPubMedGoogle Scholar
- 35.Seki M, Narusaka M, Ishida J, Nanjo T, Fujita M, Oono Y, Kamiya A, Nakajima M, Enju A, Sakurai T, Satou M, Akiyama K, Taji T, Yamaguchi-Shinozaki K, Carninci P, Kawai J, Hayashizaki Y, Shinozaki K: Monitoring the expression profiles of 7000 Arabidopsis genes under drought, cold and high-salinity stresses using a full-length cDNA microarray. Plant J. 2002, 31 (3): 279-292.CrossRefPubMedGoogle Scholar
- 36.Vergnolle C, Vaultier MN, Taconnat L, Renou JP, Kader JC, Zachowski A, Ruelland E: The cold-induced early activation of phospholipase C and D pathways determines the response of two distinct clusters of genes in Arabidopsis cell suspensions. Plant Physiol. 2005, 139 (3): 1217-1233.PubMedCentralCrossRefPubMedGoogle Scholar
- 47.Ganter B, Tugendreich S, Pearson CI, Ayanoglu E, Baumhueter S, Bostian KA, Brady L, Browne LJ, Calvin JT, Day GJ, Breckenridge N, Dunlea S, Eynon BP, Furness LM, Ferng J, Fielden MR, Fujimoto SY, Gong L, Hu C, Idury R, Judo MS, Kolaja KL, Lee MD, McSorley C, Minor JM, Nair RV, Natsoulis G, Nguyen P, Nicholson SM, Pham H, Roter AH, Sun D, Tan S, Thode S, Tolley AM, Vladimirova A, Yang J, Zhou Z, Jarnagin K: Development of a large-scale chemogenomics database to improve drug candidate selection and to understand mechanisms of chemical toxicity and action. J Biotechnol. 2005, 119 (3): 219-244.CrossRefPubMedGoogle Scholar
- 48.Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell. 2000, 102 (1): 109-126.CrossRefPubMedGoogle Scholar
- 50.Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S: Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007Google Scholar
- 62.Norén H, Svensson P, Andersson B: A convenient and versatile hydroponic cultivation system for Arabidopsis thaliana. Physiol Plant. 2004, 343-348.Google Scholar
- 64.Galbraith D:, http://www.cals.arizona.edu/microarray
- 66.Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological). 1995, 57 (1): 289-300. Royal Statistical SocietyGoogle Scholar
- 67.Consortium AGE:, http://www.weigelworld.org/resources/microarray/AtGenExpress/
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.