Spectral estimation in unevenly sampled space of periodically expressed microarray time series data
 6.4k Downloads
 17 Citations
Abstract
Background
Periodogram analysis of timeseries is widespread in biology. A new challenge for analyzing the microarray time series data is to identify genes that are periodically expressed. Such challenge occurs due to the fact that the observed time series usually exhibit nonidealities, such as noise, short length, and unevenly sampled time points. Most methods used in the literature operate on evenly sampled time series and are not suitable for unevenly sampled time series.
Results
For evenly sampled data, methods based on the classical Fourier periodogram are often used to detect periodically expressed gene. Recently, the LombScargle algorithm has been applied to unevenly sampled gene expression data for spectral estimation. However, since the LombScargle method assumes that there is a single stationary sinusoid wave with infinite support, it introduces spurious periodic components in the periodogram for data with a finite length. In this paper, we propose a new spectral estimation algorithm for unevenly sampled gene expression data. The new method is based on signal reconstruction in a shiftinvariant signal space, where a direct spectral estimation procedure is developed using the Bspline basis. Experiments on simulated noisy gene expression profiles show that our algorithm is superior to the LombScargle algorithm and the classical Fourier periodogram based method in detecting periodically expressed genes. We have applied our algorithm to the Plasmodium falciparum and Yeast gene expression data and the results show that the algorithm is able to detect biologically meaningful periodically expressed genes.
Conclusion
We have proposed an effective method for identifying periodic genes in unevenly sampled space of microarray time series gene expression data. The method can also be used as an effective tool for gene expression time series interpolation or resampling.
Keywords
Spectral Estimation Periodic Gene Sample Time Series Regulation Strength Gene Expression Time SeriesBackground
Periodic phenomena are widely studied in biology and there are numerous biological applications where periodicities must be detected from experimental biological data. Because the measured data are often nonideal, efficient algorithms are needed to extract as much information as possible. Spectral estimation has been a classical research topic in digital signal processing and has recently found important applications in DNA microarray time series data analysis. Many spectral estimation methods have been proposed in the past decades, including the modified periodogram, the autoregressive (AR) model, the MUSIC algorithm and the multitaper method [1, 2]. Although all these algorithms have their own advantages, they were all developed based on a basic assumption: the input signal is evenly sampled. However, in many realworld applications, the data can be unevenly sampled. For example, in DNA microarray gene expression experiments, a time series may be obtained with different sampling intervals [3, 4, 5]. Furthermore, an evenly sampled time series may contain missing values due to corruption or absence of some expression measurements [6, 7]. A time series with missing values can be considered as an unevenly sampled time series in general.
Recently, several methods for detecting periodic gene expression have been proposed [8, 9, 10, 11, 12, 13, 14, 15]. Lu et al. [8] have proposed a periodicnormal mixture (PNM) model to fit transcription profiles of periodically expressed genes in cell cycle microarray experiments. Ahdesmäki et al. [9] proposed a generalpurpose robust testing procedure for finding periodic sequences in multiple time series data, which is based on a robust spectral estimator that is incorporated into a hypothesis testing framework using the socalled Gstatistic together with correction for multiple testing. Chen [10] proposed a statistical inference approach, the C&G procedure, to effectively detect statistically significant periodically expressed genes based on two statistical hypothesis testing procedures. Wichert et al. [11] proposed to use the average periodogram as an exploratory tool to detect the presence of possible periodic genes and give an exact statistical test to determine whether or not a sinusoid is presence. Luan and Li [12] proposed to use the shapeinvariant model combined with a cubic Bspline estimation to model periodic gene expression profiles. Ruf [13] is one of the first to treat evenly sampled gene expression time series with missing values as unevenly sampled data for spectral analysis using the LombScargle periodogram. Bohn et al. [14] have used the LombScargle periodogram in their attempt to detect rhythmic components in the circadian cycle of the Crassulacean acid metabolism plants. Glynn et al. [15] also used the LombScargle periodogram to detect periodic patterns in unevenly spaced gene expression time series. The LombScargle periodogram produces better results on unevenly sampled data than the classical Fourier transform method since it weights the data on a "per point" basis instead of on a "per time interval" basis [16]. Lomb [17] proved that this periodogram is the same as the classical periodogram in the case of equally spaced data. However, since the LombScargle method assumes that there is a single stationary sinusoid wave with infinite support, it introduces spurious periodic components in the periodogram for data with a finite length. Also, due to the effect of noise in the data, it may produce inaccurate estimation results.
In this paper, we propose a new spectral estimation technique for unevenly sampled data. Our method models the signal in a shiftinvariant signal space, for which many theories and algorithms are available [18, 19, 20, 21, 22, 23, 24, 25]. In our method, a direct spectral estimation formula is derived based on the Bspline basis that has finite support. Experiments on simulated noisy periodic signals show that our algorithm is more accurate in detecting periodicity compared to the LombScargle algorithm.
Results and discussion
Our method is based on signal reconstruction in a shiftinvariant signal space, where a direct spectral estimation procedure is developed using the Bspline basis. The details of the reconstruction algorithm and the power spectrum density (PSD) estimation are given in the Method Section.
Simulated data
We first test our spectral estimation algorithm on simulated signals to compare the estimation accuracy with the LombScargle method. A cosine curve has been used to represent the ideal expression of a gene that goes from an "on" state, to an "off" state, and then back to "on" [11]. For a gene g and expression level observed at time t_{ i }, we denote the time series by Y_{ g }(t_{ i }),
where ${Y}_{g}({t}_{i})\beta \mathrm{cos}\phantom{\rule{0.2em}{0ex}}(\frac{{\text{t}}_{\text{i}}\pi}{12})+{\epsilon}_{\text{g}}({t}_{i})$
for i = 1,...,N and g = 1,...,G. Our test data consists of simulated time series data for the expression of G = 1000 genes, where 900 of them are random genes (i.e., β = 0) and 100 are noisy periodic genes.
To obtain this dataset, the time series (genes) is first evenly sampled at 48 points. That is, t_{ i }= i, (i = 1,...,48). Then, time points are randomly deleted from each time series to simulate the uneven sampling situation.
Number of periodic genes detected using different methods for simulated data
Noise levels (σ)  Missing ratio  P  

LombScargle method  Our Algorithm  
0.01  15%  93  98 
20%  88  92  
40%  68  80  
65%  40  65  
0.2  15%  85  92 
20%  75  85  
40%  60  78  
65%  32  50  
0.6  15%  72  89 
20%  65  80  
40%  51  65  
65%  28  47 
Experimental data
Plasmodium falciparum
Yeast
Spectral analysis is useful for the identification of cellcycleregulated genes. Spellman et al. [3] monitored genomewide mRNA levels for 6178 yeast ORFs simultaneously using several different methods of synchronization including an α (alpha)factormediated G1 arrest which covers approximately two cellcycle periods with measurements at 7 min intervals for 119 min with a total of 18 time points, a temperaturesensitive cdc15 mutation to induce a reversible Mphase arrest, and a temperaturesensitive cdc28 mutation to arrest cells in G1 phase reversibly, and finally, an elutriation synchronization to produce the elutriation dataset of 14 time points (raw data available at [35]). For the cdc15 experiment, gene expression data were measured every 10 min for 290 min, lacking observations for the 20, 40, 60, 260 and 280 min time points, and gives a total of 24 time points. For the cdc28 experiment, samples were taken every 10 min from 0 to 160 min for a total of 17 time points. These four microarray datasets have spawned a large body of work on the gene expressions of the yeast cell cycle.
As pointed out by Lichtenberg et al. [28], there is a remarkably poor agreement between the numbers of periodically expressed genes detected by various computational methods. To enable a more objective comparison between the performances of different algorithms, they proposed three benchmark sets B1, B2, and B3 (see their website [36]). Set B1 contains a total of 113 genes previously identified as periodically expressed in smallscale experiments. Set B2 contains 352 genes whose promoters were bound by at least one of the 9 known cell cycle transcription factors (i.e., Fkh1, Fkh2, Ndd1, Mcm1, Ace2, Mbp1, Swi4, Swi5, Swi6) in two Chromatin IP studies, and therefore many of the genes in this benchmark set should be expected to be cell cycle regulated. Set B3 contains 518 genes annotated in MIPS [37] as "cell cycle and DNA processing". However, since a large number of genes involved in the cell cycle are not subjected to transcriptional regulation (not periodic) and genes found in B1 were explicitly removed, only a small fraction of the genes in B3 are expected to be periodically expressed. They define a good method as one that is able to reproduce precious findings (B1), extract genes whose promoters are associated with known cell cycle transcription factors (B2), or enriched for genes that play a role in the cell cycle (B3).
Conclusion
In this paper, we have proposed a new spectral estimation algorithm based on a signal reconstruction technique in an unevenly sampled space. The advantage of our algorithm over the LombScargle spectral estimation method is that the new algorithm can effectively reduce the effects of noise and spurious oscillation components and therefore improve the estimation accuracy. Experiments on simulated signals and real gene expression data show that our method is effective in identifying periodically expressed genes. Finally, we remark that this paper focuses on the improvement of periodicity estimation accuracy using spectral analysis algorithms. Another important issue is the statistical significance of the periodicity of a time series. Interested readers are referred to Chen [10], Wichert et al. [11] and Glynn et al. [15], who have used hypothesis testing to address this problem.
Methods
Mathematical model
In the following, we first review existing work on signal analysis in the shiftinvariant signal space, and then derive the new spectral estimation algorithm.
Shannon's signal sampling and reconstruction theorem states that:
where the coefficients {c_{ i }} are related to the choice of basis function, ϕ.
Signal reconstruction in the shiftinvariant space is an active research area and there are many mathematical theories and algorithms available [18, 19, 20, 21, 22, 23, 24, 25]. When the signal f ∈ V(ϕ), we need to reconstruct signal f from sampled value {f(x_{ i })}, where {x_{ i }} is the sampled point set. If {x_{ i }} is an evenly sampled point set, this problem can be regarded as signal reconstruction in an even sampled space. Otherwise, this is a signal reconstruction problem in an uneven sampled space.
where $\tilde{x}$[n] and e[n] represent the estimation of x[n] and the corresponding estimation error, respectively. Yeung et al. [31] have presented an application of the AR model to microarray gene expression time series analysis for gene regulation study. Comparing Equation (4) with Equation (5), it is obvious that Equation (5) is a special case of Equation (4).
where ω is the frequency in Hz. In order to avoid introducing spurious periodic components caused by a basis function ϕ with infinite support, for example, the sinc function, we only consider basis functions with compact support.
PSD estimation algorithm
 (1)Given sampling points x_{1},...,x_{ J }∈ [A_{1}, A_{2}] and corresponding discrete function values y = (y_{1},...,y_{ J }). Assume that J ≥ J_{min} = A_{2}  A_{1} + 2Ω  1 and that the truncated matrix T defined below is invertible. Compute matrix: U = (U_{ jk }), T = T_{ kl }), where${U}_{jk}=\phi ({x}_{j}k),{T}_{kl}=\sum _{j=1}^{J}\overline{\phi ({x}_{j}k)}\phi ({x}_{j}l),$
j = 1,...,J, k, l = A_{1}  Ω + 1,..., A_{2} + Ω  1.
 (2)
Compute c = T^{1}b according to b = $\overline{U}$y, where $\overline{U}$ denotes the complex conjugate transpose of U and T^{1} is the inverse of T.
 (1)
The construction of matrix T has the advantage that T is a positive operator on ℓ^{2}(Z).
 (2)In the case of Bspline, matrix T is invertible under the condition that${\mathrm{sup}\phantom{\rule{0.2em}{0ex}}}_{j}({x}_{j+1}{x}_{j})\le \frac{({A}_{2}{A}_{1})}{{A}_{2}{A}_{1}+2\Omega 1}<1$
 (3)
Since the numerical solution of (c_{ k }) can be sensitive to particular sets of intersample spacing, we can approximate the inverse of the illconditioned T by its pseudoinverse using singular value decomposition (SVD).
We have the following observations on our PSD estimation method:

The proposed method does not make any assumption about the number of frequency components in the data, whereas the LombScargle method assumes that there is only a single stationary sinusoid wave with infinite support. In practice, noise in the data can contribute to many frequency components. Thus, our model is more appropriate for realworld noisy data.

Compared with the traditional PSD estimation algorithms, our method can directly compute the PSD for an unevenly sampled signal from Equation (11). The classical Fourier periodogram based method only can deal with evenly sampled data.

The order N of the Bspline basis function ϕ can be chosen adaptively according to Equation (9). From Equation (9), we know that the relationship between the Bspline order N and the minimum number of sampling points J_{min} is given by N ≈ J_{min}  (A_{2}  A_{1}) such that the matrix T is invertible. If we hold the number of sampling point J and the signal support [A_{1}, A_{2}] fixed, then increasing N (which would result in a smoother Bspline with larger support) would ensures a smoother reconstruction. In practice, some oversampling is done for stability of reconstruction, and we increase the oversampling rate (i.e., number of sample time points per time interval) by timescaling the original signal support appropriately to a smaller interval.

From Equation (11), we see that the periodogram decays more rapidly with increasing N, meaning that the reconstructed signal will be smoother with increasing N. This fact can be used for signal denoising.
Detecting periodically expressed genes
If the power ratio calculated is larger than a threshold, the corresponding profile is considered to be periodic. In [27], the threshold for S is heuristically chosen to be 0.7.
n = [N_{0}/2], and p is the largest integer less than 1/x. Equation (14) is computed over the set of normalized Fourier frequencies ω = k/N_{0}, where k = 0,1,...,N_{0}/2. Equation (13) gives an exact pvalue that allows one to test whether a gene expression profile behaves like a purely random white noise process or whether the maximum peak in the periodogram is significant. However, for gene expression profiles of short length, i.e., < 40 time points, the pvalue given by Equation (13) has weak statistical power to determine the number of periodic genes [9, 11]. In view of this, we follow [28] and provide instead a ranking of the expression profiles based on the Gstatistic of Equation (14).
LombScargle method
The statistical significance of the periodic components detected in the periodogram can be evaluated using an exponential probability distribution test [16] which we denoted in this work as the LombScargle test.
Authors' contributionsAWCL worked on mathematical modelling, implementation, and microarray datasets analysis. JX worked on signal reconstruction theories and is responsible for the simulated signal experiments. SHW worked on signal reconstruction theories. DS provided biological insight to this project. HY initiated the project and worked on spectral estimation. All authors read and approved the final manuscript.
Notes
Acknowledgements
This work is supported with a grant from the Hong Kong Research Grant Council (project CityU122005) and the Mathematical Tianyuan Foundation of the Chinese National Science Foundation (Grant number 10526036). We would like to thank the anonymous reviewers for their valuable comments.
Supplementary material
References
 1.Hayes MH: Statistical digital signal processing and modeling. 1996, John Wiley and Sons, IncGoogle Scholar
 2.Kay SM, Marple SL: Spectrum analysisA modern perspective. Proceeding of IEEE. 1981, 69: 13801418.CrossRefGoogle Scholar
 3.Spellman TS, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisia by microarray hybridization. Mol Biol Cell. 1998, 9: 32733297.PubMedCentralCrossRefPubMedGoogle Scholar
 4.Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science. 1998, 282: 699705. 10.1126/science.282.5389.699.CrossRefPubMedGoogle Scholar
 5.Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. PNAS. 1998, 95: 1486314868. 10.1073/pnas.95.25.14863.PubMedCentralCrossRefPubMedGoogle Scholar
 6.Bø TH, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares method. Nucleic Acids Research. 2004, 32: e3410.1093/nar/gnh026.PubMedCentralCrossRefPubMedGoogle Scholar
 7.Gan X, Liew AWC, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Research. 2006, 34: 16081619. 10.1093/nar/gkl047.PubMedCentralCrossRefPubMedGoogle Scholar
 8.Lu X, Zhang W, Qin ZH, Kwast KE, Liu JS: Statistical resynchronization and Bayesian detection of periodically expressed genes. Nucleic Acids Research. 2004, 32: 447455. 10.1093/nar/gkh205.PubMedCentralCrossRefPubMedGoogle Scholar
 9.Ahdesmäki M, Lähdesmäki H, Pearson R, Huttunen H, YliHarja O: Robust detection of periodic time series measured from biological systems. BMC Bioinformatics. 2005, 6: 11710.1186/147121056117.PubMedCentralCrossRefPubMedGoogle Scholar
 10.Chen J: Identification of significant periodic genes in microarray gene expression data. BMC Bioinformatics. 2005, 6: 286297. 10.1186/147121056286.PubMedCentralCrossRefPubMedGoogle Scholar
 11.Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics. 2004, 20: 520. 10.1093/bioinformatics/btg364.CrossRefPubMedGoogle Scholar
 12.Luan Y, Li H: Modelbased methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics. 2004, 20: 332339. 10.1093/bioinformatics/btg413.CrossRefPubMedGoogle Scholar
 13.Ruf T: The LombScargle Periodogram in biological rhythm research: Analysis of incomplete and unequally spaced timeseries. Biological Rhythm Research. 1999, 30: 178201. 10.1076/brhm.30.2.178.1422.CrossRefGoogle Scholar
 14.Bohn A, Hinderlich S, Hutt MT, Kaiser F, Luttge U: Identification of rhythmic subsystems in the circadian cycle of crassulacean acid metabolism under thermoperiodic perturbations. Biol Chem. 2003, 384: 721728. 10.1515/BC.2003.080.CrossRefPubMedGoogle Scholar
 15.Glynn EF, Chen J, Mushegian AR: Detecting periodic patterns in unevenly spaced gene expression time series using LombScargle periodograms. Bioinformatics. 2006, 22: 310316. 10.1093/bioinformatics/bti789.CrossRefPubMedGoogle Scholar
 16.Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C. 1995, chapter 13.8: 2Google Scholar
 17.Lomb NR: Leastsquares frequency analysis of unequally spaced data. Astrophys Space Sci. 1976, 39: 447462. 10.1007/BF00648343.CrossRefGoogle Scholar
 18.Benedetto JJ: Irregular sampling and frames. wavelets: A Tutorial in Theory and Applications. Edited by: Chui CK. 1992, 445507.CrossRefGoogle Scholar
 19.Chen W, Itoh S, Shiki J: Irregular sampling theorem for wavelet subspaces. IEEE Trans Information Theory. 1998, 44: 11311141. 10.1109/18.669187.CrossRefGoogle Scholar
 20.Chen W, Itoh S, Shiki J: On sampling in shift invariant spaces. IEEE Trans Information Theory. 2002, 48: 28022810. 10.1109/TIT.2002.802646.CrossRefGoogle Scholar
 21.Ericsson S, Grip N: An analysis method for sampling in shiftinvariant spaces. Int J Wavelet Multi Information Processing. 2005, 3: 301319. 10.1142/S0219691305000877.CrossRefGoogle Scholar
 22.Goh SS, Ong IGH: Reconstruction of bandlimited signals from irregular samples. Signal Processing. 1995, 46: 315329. 10.1016/01651684(95)000910.CrossRefGoogle Scholar
 23.Liu Y: Irregular sampling for spline wavelet subspaces. IEEE Trans Information Theory. 1996, 42: 623627. 10.1109/18.485731.CrossRefGoogle Scholar
 24.Xian J, Lin W: Sampling and reconstruction in timewarped spaces and their applications. Appl Math Comput. 2004, 157: 153173. 10.1016/j.amc.2003.08.031.CrossRefGoogle Scholar
 25.Xian J, Luo SP, Lin W: Weighted sampling and signal reconstruction in spline subspaces. Signal Processing. 2006, 86: 331340. 10.1016/j.sigpro.2005.05.013.CrossRefGoogle Scholar
 26.Grochenig K, Schwab H: Fast local reconstruction methods for nonuniform sampling in shift invariant spaces. SIAM Journal on Matrix Analysis and Applications. 2002, 24: 899913. 10.1137/S0895479802409067.CrossRefGoogle Scholar
 27.Bozdech Z, Llinas M, Pulliam BL, Wong ED, Zhu J, DeRisi JL: The Transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biology. 2003, 1: 116. 10.1371/journal.pbio.0000005.CrossRefGoogle Scholar
 28.Lichtenberg U, Jensen LJ, Fausbøll A, Jensen TS, Bork P, Brunak S: Comparison of computational methods for the identification of cell cycleregulated genes. Bioinformatics. 2005, 21: 11641171. 10.1093/bioinformatics/bti093.CrossRefPubMedGoogle Scholar
 29.Andersson CR, Isaksson A, Gustafsson MG: Bayesian detection of periodic mRNA time profiles without use of training examples. BMC Bioinformatics. 2006, 7: 6310.1186/14712105763.PubMedCentralCrossRefPubMedGoogle Scholar
 30.Dowski ER, Whitmore CA, Avery SK: Estimation of randomly sampled sinusoids in additive noise. IEEE Trans Acoustics Speech Signal Processing. 1988, 36: 19061908. 10.1109/29.9035.CrossRefGoogle Scholar
 31.Yeung LK, Szeto LK, Liew AWC, Yan H: Dominant spectral component analysis for transcriptional regulations using microarray timeseries data. Bioinformatics. 2004, 20: 742749. 10.1093/bioinformatics/btg479.CrossRefPubMedGoogle Scholar
 32.Schumarker LL: Spline functions: Basic Theory. 1981, WileyInterscience, New YorkGoogle Scholar
 33.Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. JR Statist Soc B. 1995, 57: 289300.Google Scholar
 34.Fisher RA: Tests of significance in Harmonic Analysis. Proceedings of the Royal Society of London, Series A. 1929, 125: 5459.CrossRefGoogle Scholar
 35.Cell Cycle Regulated Yeast Genes. [http://genomewww.stanford.edu/cellcycle/data/rawdata/]
 36.Supplementary information and datasets for reference [28]. [http://www.cbs.dtu.dk/cellcycle/yeast_benchmark/benchmark.php]
 37.The MIPS Comprehensive Yeast Genome Database. [http://mips.gsf.de/genre/proj/yeast/]
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.