Detecting separate time scales in genetic expression data
- 4k Downloads
Biological processes occur on a vast range of time scales, and many of them occur concurrently. As a result, system-wide measurements of gene expression have the potential to capture many of these processes simultaneously. The challenge however, is to separate these processes and time scales in the data. In many cases the number of processes and their time scales is unknown. This issue is particularly relevant to developmental biologists, who are interested in processes such as growth, segmentation and differentiation, which can all take place simultaneously, but on different time scales.
We introduce a flexible and statistically rigorous method for detecting different time scales in time-series gene expression data, by identifying expression patterns that are temporally shifted between replicate datasets. We apply our approach to a Saccharomyces cerevisiae cell-cycle dataset and an Arabidopsis thaliana root developmental dataset. In both datasets our method successfully detects processes operating on several different time scales. Furthermore we show that many of these time scales can be associated with particular biological functions.
The spatiotemporal modules identified by our method suggest the presence of multiple biological processes, acting at distinct time scales in both the Arabidopsis root and yeast. Using similar large-scale expression datasets, the identification of biological processes acting at multiple time scales in many organisms is now possible.
KeywordsGene Ontology Arabidopsis Root Multiple Time Scale Meristematic Zone Pericycle Cell
Biological processes in living organisms occur on a vast range of time scales, from 10-9s (nanoseconds) to 109s (decades), many of them taking place simultaneously. The advent of high-throughput technologies has given biologists the ability to measure system-wide gene expression, potentially capturing many of these processes at once. As a result, one of the major challenges of biological data analysis is the separation of these processes and their time scales. In many cases it is not even known how many processes underlie the measured signal or what their respective time scales are. These questions are particularly relevant to the field of developmental biology. Developmental studies focus on systems such as animal embryos or plant organs in which processes such as growth, segmentation and differentiation can all take place simultaneously, but on different time scales, complicating the interpretation of expression data.
Here we introduce a method for detecting the presence of different time scales in time-series gene expression data. Our approach is based on two assumptions that hold for many data sets of this type. First, that at least two replicate time-series measurements have been produced. Second, that there is at least one time-dependent process for which the time scale is known. This known process allows us to synchronize the replicates, and is most often the time scale on which the data was measured.
As an example of an applicable dataset, consider a gene expression measurement time-series with two replicates that is used to study cell cycle. Both replicates are synchronized in order to start at the same point in the cell cycle. Now let us suppose there is a second time-dependent process that is not affected by the synchronization (ie. not the cell-cycle). The two time-series of cellular snapshots provided by the replicates will now catch this second process in different temporal states. However, to ensure that we are observing the same process in two different states, rather than a signal corrupted by noise, the two snapshots have to be shifted versions of each other with a high degree of similarity, which is why our significance analysis has to incorporate the temporal shift in an explicit manner.
Our approach is somewhat analogous to an astronomical device called the blink comparator , which is used to distinguish the separate time scales on which planets and stars move across the sky. Two photographs of the night sky are taken on different days and the stars aligned so that they occupy the same position on both photographs. The comparator then alternates between the two images. Any object that is not a star, such as a planet, will jump back and forth, because it moves on a different time scale relative to the stellar background (which only moves due to the Earth's motion). In this analogy, the two astronomical photographs correspond to the biological replicates, the stellar background to the known biological process, and the object which changes position represents another biological process on a different time scale.
We apply our approach to detect time scales in two datasets. The first is a Saccharomyces cerevisiae cell-cycle dataset , and acts as a benchmark. We demonstrate that our method can successfully detect processes operating on two different time scales, namely real (chronological) time and cell-cycle time. The second dataset measures gene expression through root developmental time in Arabidopsis thaliana . Using our approach we discover many classes of statistically significant shifted patterns for the root dataset. These patterns can be further divided into tightly co-expressed spatiotemporal transcriptional modules, some of which are related to processes that occur during branching of the root, termed lateral root initiation. These patterns and modules suggest a rich and complex set of genes that act at multiple time scales to perform a range of biological functions.
1) Detection of separate time scales in Saccharomyces cerevisiae
To validate our method we chose to analyze a dataset for which there was a known separation between the time scale of the experiment and the time scale of a biological process of interest. The dataset we chose was a recent synchrony/release time-series microarray dataset from the yeast Saccharomyces cerevisiae measuring gene expression through the cell cycle (GEO Accession: GSE8799).
In the synchrony/release protocol used by the study, a population of cells is synchronized to early G1 phase. The cells are subsequently released to progress through the cell cycle, during which a time-series of microarray measurements are made. Thus, the measured time scale in the dataset is chronological time (minutes since release from the synchronization event). However, as the kinetics of release from synchronization always varies from experiment to experiment , replicate time-series experiments will not progress through the cell-cycle in exactly the same way. This introduces a separation of time scales between the measured scale, chronological time, and that of a biological process of interest, the cell-cycle.
The dataset itself consists of two replicate synchrony/release time-series experiments, each with 15 Affymetrix Yeast 2.0 microarray measurements taken at 16 minute intervals after the first sample. In this dataset the start of sampling in each replicate began at slightly different times (30 minutes and 38 minutes). As our method requires directly comparable data, simple linear splines were fit to each replicate and sampled at 8 minute intervals starting at 38 minutes after release, with a total of 28 samples per replicate.
Distribution of significant shifts in the cell-cycle dataset
GO term analysis result for cell cycle data
M phase of mitotic cell cycle
proteolysis involved in cellular pro- tein catabolic process
ribonucleoprotein complex biogene- sis and assembly
ubiquitin-dependent protein catabolic process
mitotic cell cycle
modification-dependent protein catabolic process
rRNA metabolic process
cell cycle phase
ncRNA metabolic process
maturation of 5.8S rRNA
cell cycle process
maturation of 5.8S rRNA from tri- cistronic rRNA transcript
ribosomal large subunit biogenesis
sister chromatid segregation
maturation of SSU-rRNA from tri- cistronic rRNA transcript
response to DNA damage stimulus
maturation of SSU-rRNA
mitotic sister chromatid segregation
endonucleolytic cleavage in ITS1 to separate SSU-rRNA ...
glycoprotein metabolic process
cleavages during rRNA processing
carbohydrate metabolic process
endonucleolytic cleavages during rRNA processing
protein amino acid glycosylation
endonucleolytic cleavage of tri- cistronic rRNA transcript ...
RNA metabolic process
maturation of LSU-rRNA from tri- cistronic rRNA transcript ...
glycoprotein biosynthetic process
maturation of LSU-rRNA
cellular response to DNA damage stimulus
rRNA 5'-end processing
ncRNA 5'-end processing
RNA 5'-end processing
DNA metabolic process
cellular component organization
DNA-dependent DNA replication
endonucleolytic cleavage to gener- ate mature 5'-end ...
response to stimulus
ribosomal large subunit assembly and maintenance
response to stress
endonucleolytic cleavage in 5'-ETS of tricistronic ...
microtubule cytoskeleton organiza- tion
ribosomal subunit assembly
nucleobase, nucleoside, nucleotide and nucleic acid ...
The identification of biologically coherent sets of shifted genes strongly suggests that our method is able to successfully detect the presence of processes occurring on multiple unrelated time scales. Furthermore, by analyzing the identified genes associated with those shifts, we were able to correctly identify the other major process, associated with Shift+8, as the cell-cycle.
2) Detection of separate time scales in the Arabidopsis root
The Arabidopsis root is an excellent model system for studying development due to its simple physical structure. In the root, the majority of new cells are born at the root apex from a stem cell population that surrounds the quiescent center (QC). Cell types are constrained within files, and with each new cell division, an older cell is successively displaced distal to the stem cell population. A cell's developmental time line can therefore be tracked along the root's longitudinal axis. In the work of Brady et al. , two developmental microarray time courses were generated by taking 12 or 13 successive transverse sections along an Arabidopsis root (GEO Accession: GSE8934). We use these two time-courses as the replicates (removing the 1st section of the 13 section time-course) and use developmental time as the known time scale.
Many different enriched processes were identified in these spatiotemporally co-expressed groups suggesting that they possess distinct biological functions. One cluster of phloem-enriched, chloroplast genome-encoded genes whose expression is shifted -2 sections (Cluster 1, Additional File 1) is associated with the translation of energy capturing proteins. Two clusters that show a shift of +2 also show enrichment of genes known to be activated or repressed during lateral root initiation in the LRIS. Lateral roots develop, at regular intervals, from pericycle cells located at root xylem poles in the root maturation zone, and their initiation is dependent upon xylem pole architecture . Cluster 3 shows an extremely strong enrichment for genes activated during lateral root initiation in the LRIS in both the whole root and activated in pericycle cells located at the xylem pole in the LRIS (p = 2.4e-113, p = 3.1e-39). These genes also show enriched expression in xylem cells in the meristematic zone (p = 3.2e-63) and display a time shift within the meristematic zone (Figure 7A). Interestingly, a second cluster (cluster 6) is enriched for genes repressed during lateral root initiation in xylem pole pericycle cells in the LRIS (p = 1.92e-9), but also contains genes whose expression is enriched in phloem cells, phloem companion cells and phloem pole pericycle cells (p = 1.59e-4, p = 5.6e-5, p = 5.54e-5) (Figure 7A). This profile shows a shift in the root maturation zone.
Biological processes on multiple time scales occur during the development and morphogenesis of embryos, tissues and organs. Using time series microarray expression data in replicate, we have developed a method that identifies a number of temporal scales in addition to the time scale being measured. This method was able to identify these time scales in two different organisms, suggesting that it is an organism-independent method.
Given the number of genes in high-throughput datasets, the computational efficiency of any data analysis method is critically important. By converting the data to rank permutations (see Methods), we can use uniformly distributed random permutations as a null model. As a result, our method is able to use a continuous Gaussian distribution for p(γ i ) as a close approximation to the real (discrete) probability density function of γ i values. Using a Monte Carlo simulation over uniformly random permutations we confirmed that this continuous distribution is an accurate approximation (data not shown). Note that, since the Gaussian distribution extends below γ i = 0, the p-value given by the Gaussian distribution is in fact an upper bound on the true p-value for small (i.e. the most significant) γ i , which means that the true p-value lies even lower. Therefore, our method provides an efficient, accurate, and conservative method for determining the significance of shifts in high-throughput datasets. Previous work on time-shifted expression data [15, 16, 17, 18] has focused on other biological questions, such as the detection of pair wise interactions between genes, rather than on the detection of processes on separate time scales and the comparison of replicates. As a result these approaches do not place an emphasis on shift classification, as they do not explicitly incorporate the shift into their significance analysis [15, 16, 17] and only consider small shifts . The shift significance analysis however is crucial to our approach, which aims to detect similar but shifted patterns (see Background section above), and must therefore carefully weigh the relative significance of similarities across varying sizes of the comparison window. Because our approach focuses on comparing replicates, we seek pairs of series that are highly shape-similar across the widest possible comparison window.
In our analysis of the yeast cell-cycle dataset, it is not a coincidence that the cell cycle process was identified in the Shift+8 group and that the original study adjusted the sampling times by eight minutes in the second biological replicate. In the original study, the authors employed a model designed to use auxiliary budding index data to specifically analyze kinetics of populations released from synchrony/release experiments . They used this information to determine which of the samples to hybridize to microarrays. Our method, blind to this prior knowledge, successfully identified this difference as the +8 minute shift. To ensure this agreement was not due to the 8-minute interval data splining used, we repeated the analysis on data splined at one minute intervals (data not shown). This still identified shifts of +8 and +9 minutes as being highly enriched for known periodic probes, thus indicating the method is not sensitive to sampling intervals and successfully detects the known biological shift.
Numerous biological processes have been identified in plants that occur in multiple time scales ranging from seconds (calcium signaling) to hours (circadian rhythms). In the root however, the full spectrum of biological processes that act in multiple time scales has likely not been described, due to a lack of knowledge of the time scales that these processes are acting on. Our rigorous method is able to utilize the gene expression dataset measuring expression through root developmental time in Arabidopsis thaliana to identify numerous spatiotemporal transcriptional modules acting in separate time scales. Each spatiotemporal module demonstrates a strong conservation of biological association occurring during root development. Interestingly, the strongest observed associations are linked to genes expressed during the process of lateral root initiation.
Lateral root initiation occurs at regular intervals within pericycle cells located at the xylem pole, suggesting cell-cell communication between xylem and pericycle cells . However, specific causal factors within the xylem have not yet been identified. Furthermore, periodic fluctuations in auxin response activity in xylem cells within the root basal meristem that regulate lateral root initiation in the root's maturation zone have been reported using a synthetic auxin reporter, but no candidate genes have been identified to play a functional role in this process . We propose that genes contained within the two clusters showing a shifted profile of +2 and a corresponding statistically significant enrichment of genes activated or repressed during lateral root initiation in the LRIS may contain some of the previously unidentified factors that play an important role in regulating lateral root initiation. Genes in cluster 3 whose expression is enriched in xylem cells in the meristematic zone may act in the fluctuating auxin response mechanism associated with lateral root initiation, perhaps by signaling to associated pericycle cells. Genes in cluster 6 that are repressed in the xylem pole pericycle upon auxin-induced lateral root initiation are also highly expressed in phloem tissue and phloem pole pericycle cells. No functional link has been established between this tissue and lateral root initiation, but our data suggest that during lateral root initiation, genes that are actively repressed in xylem pole pericycle cells must also be repressed within phloem tissue and within phloem pole pericycle cells.
Our analysis uses p-values in two separate places, which should not be confused. Firstly, they are employed in form of the significance thresholds of p < 0.01 (for Arabidopsis) and p < 0.001 (for yeast), which are thresholds for a given shift class, and ensure that each such class contains only a small proportion of false positives. These thresholds are picked for technical reasons, and are therefore inevitably somewhat arbitrary. The second role of p-values is in the subsequent GO enrichment analysis for each class, where they measure the biological significance of the classification. The extremely small p-values we find in this context demonstrate that the shift classification is indeed biologically meaningful.
In principle this method can be generalized to the case of three or more replicates, by choosing m-1 independent pairs among the m replicates, calculating the relative shifts and their respective p-values for these pairs, and combining the p-values using Fisher's method  to find the most significant combination of shifts. The genes can then be classified according to their position in this (m-1)-dimensional space. Note that the rapid growth of the volume of this space with m is likely to limit the feasibility of this generalization for m larger than three or four.
For all identified, uncharacterized modules in both yeast and Arabidopsis, further studies are needed to determine the relevant time scale of the observed shifts, and the nature of these shifts. Do these shifts act as part of signaling pathways that are on the scale of seconds, as part of metabolic rhythms , or are they shifting with respect to the circadian clock? Furthermore, are these groups of co-expressed genes oscillatory in nature, or is their observed shift part of a moving wave of expression that is not oscillatory? How are these temporal shifts generated? Finally, the functional roles of these genes acting at separate scales need to be experimentally elucidated. Regardless, the numerous spatiotemporal modules identified by our method suggest the presence of multiple biological processes, acting at distinct time scales in both the Arabidopsis root and yeast. Using similar large-scale expression datasets, the identification of biological processes acting at multiple time scales in many organisms is now possible.
Significant Shift Detection in Replicates
Consider a pair of replicate datasets with M probesets or genes and N datapoints, which we write as M × N matrices d1ij and d2ij. We convert these series to rank permutations for each probeset, resulting in two new matrices π1ij and π2ij.
As an example with N = 4, consider a row in d1 reading '0.3, 0.5, 0.6, 0.2', for which the corresponding rank permutation in π1 would be '3, 2, 1, 4', since 0.6 is the highest value, 0.5 the second highest, and so on. Thus each row of π1 and π2 contains a permutation of length N. In the real data ties are highly unlikely due to the high resolution of the measurements, and can be broken randomly if necessary.
The conversion to permutations simplifies the null model considerably, which makes it straightforward to measure the significance of the similarity between the replicates in a computationally efficient way. Rank permutations also form the basis of other correlation measures, such as Spearman's rank correlation . Another advantage of rank-based measures is a lower sensitivity to outliers and a greater emphasis on shape similarity .
We then calculate a measure of similarity of two rank profiles for a given shift s:
We are only interested in similarity, i.e. in cases where γi < μ. Hence we can calculate a p-value for every gene i and every shift s under this condition. We can then record, for each gene, the shift at which the maximally significant (smallest) p-value occurs, and the p-value itself. Note that the most significant shift between two replicates does not necessarily correspond to the shift with the lowest value of γi itself.
Biological Process Enrichment - Arabidopsis thaliana
All probesets that display a shift with a significance value of p < 0.01 were selected for further analysis. AGI identifiers were assigned to probesets using data from the 2008-5-29 Affy_ATH1_array_elements file. The ChipEnrich program was modified by including genes enriched in individual root cell types as identified by . Genes activated or repressed in xylem pole pericycle cells during lateral root initiation in the auxin-activated Lateral Root Induction System (LRIS) were also included. These were obtained from , with activated genes from clusters 1, 2, 3, 5, 9 and 10, and repressed genes from clusters 4, 6, 7 and 8.
The authors would like to thank Aubrey HB, Wolfgang Busch, and Miguel Moreno-Risueno for their critical review of the manuscript. SMB was partially supported by a Natural Science and Engineering Research Council postdoctoral fellowship. This work was funded in part by the Defense Advanced Research Projects Agency (DARPA) Fundamental Laws of Biology (FunBio) grant HR 0011-05-1-0057 to PNB and TMAF, and by a NSF AT2010 grant to PNB. SEA was supported by The Leverhulme Trust, UK and The Royal Society, UK.
- 1.Liller W, Mayer B: The Cambridge astronomy guide: a practical introduction to astronomy. 1990, Cambridge: Cambridge University Press, (Chapter 15)Google Scholar
- 5.Boyle EI, Weng SA, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO:: TermFinder-open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004, 20: 3710-3715. 10.1093/bioinformatics/bth456.PubMedCentralPubMedCrossRefGoogle Scholar
- 7.Orlando DA, Brady SM, Koch JD, Dinneny JR, Benfey PN: Chapter 4 in Plant Systems Biology (Methods in Molecular Biology). Edited by: Belostotsky D. 2009, New Jersey: Humana Press, 553:Google Scholar
- 10.Menges M, de Jager SM, Gruissem W, Murray JAH: Global analysis of the core cell cycle regulators of Arabidopsis identifies novel genes, reveals multiple and highly specific profiles of expression and provides a coherent model for plant cell cycle control. Plant Journal. 2005, 41: 546-566. 10.1111/j.1365-313X.2004.02319.x.PubMedCrossRefGoogle Scholar
- 11.Vanneste S, De Rybel B, Beemster GTS, Ljung K, De Smet I, Van Isterdael G, Naudts M, Iida R, Gruissem W, Tasaka M, Inze D, Fukaki H, Beeckman T: Cell cycle progression in the pericycle is not sufficient for SOLITARY ROOT/IAA14-mediated lateral root initiation in Arabidopsis thaliana. Plant Cell. 2005, 17: 3035-3050. 10.1105/tpc.105.035493.PubMedCentralPubMedCrossRefGoogle Scholar
- 12.De Smet I, Vassileva V, De Rybel B, Levesque MP, Grunewald W, Van Damme D, Van Noorden G, Naudts M, Van Isterdael G, De Clercq R, Wang J, Meuli N, Vanneste S, Friml J, Hilson P, Juergens G, Ingram GC, Inze D, Benfey PN, Beeckman T: Receptor-like kinase ACR4 restricts formative cell divisions in the Arabidopsis root. Science. 2008, 322: 594-597. 10.1126/science.1160158.PubMedCrossRefGoogle Scholar
- 14.Parizot B, Laplaze L, Ricaud L, Boucheron-Dubuisson E, Bayle V, Bonke M, De Smet I, Poethig SR, Helariutta Y, Haseloff J, Chriqui D, Beeckman T, Nussaume L: Diarch Symmetry of the Vascular Bundle in Arabidopsis Root Encompasses the Pericycle and Is Reflected in Distich Lateral Root Initiation. Plant Physiology. 2008, 146: 140-148. 10.1104/pp.107.107870.PubMedCentralPubMedCrossRefGoogle Scholar
- 19.De Smet I, Tetsumura T, De Rybel B, Frey NFd, Laplaze L, Casimiro I, Swarup R, Naudts M, Vanneste S, Audenaert D, Inze D, Bennett MJ, Beeckman T: Auxin-dependent regulation of lateral root positioning in the basal meristem of Arabidopsis. Development. 2007, 134: 681-90. 10.1242/dev.02753.PubMedCrossRefGoogle Scholar
- 20.Fisher RA: Statistical Methods for Research Workers. 1932, London: Oliver and BoydGoogle Scholar
- 22.Barlow RJ: Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences. 1989, Chichester: WileyGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.