Bayesian Hierarchical Model for Estimating Gene Expression Intensity Using Multiple Scanned Microarrays
We propose a method for improving the quality of signal from DNA microarrays by using several scans at varying scanner sensitivities. A Bayesian latent intensity model is introduced for the analysis of such data. The method improves the accuracy with which expression can be measured in all ranges and extends the dynamic range of measured gene expression at the high end. Our method is generic and can be applied to data from any organism, for imaging with any scanner that allows varying the laser power, and for extraction with any image analysis software. Results from a self-self hybridization data set illustrate an improved precision in the estimation of the expression of genes compared to what can be achieved by applying standard methods and using only a single scan.
Keywords: Laser Power; Posterior Distribution; Spot Intensity; Latent Intensity; Multiple Scan
DNA microarray technology is used to study simultaneously the expression profile of a large number of distinct genes. Several factors contribute to the accuracy with which these genes and their expressions (also referred to here as intensities) can be determined. In particular, very low or very high intensities may lead to poor estimation of the ratio between the two samples and thus to an incorrect identification of differentially expressed genes. Low intensities tend to be noisy and lead to highly variable ratio estimates, whereas very high intensities are saturated from above and hence give biased results.
One of the objectives of microarray experiments is to identify a subset of genes that are differentially expressed between the samples of interest. The relative intensity between the samples at a spot (also referred to here as a gene) is extracted by applying suitable image processing methods to the images produced by scanning the microarray slides on which the two samples have been hybridized. Errors occurring during image acquisition affect all further analyses and, therefore, the process of generating these digital images is crucial. The photomultiplier tube (PMT), laser power (LP), and analog-to-digital converter (ADC) are the main components of the acquisition device, the scanner, that control the formation of digital images. Each spot on the hybridized slide has fluorescent molecules corresponding to the two labeled samples, and they emit photons upon excitation by a laser. The photons are converted into electrons by the PMT, and the amount of current that eventually flows is directly proportional to the amount of incident light at the photocathode, unless saturation occurs. Saturation occurs when the signal from a pixel exceeds the scanner's upper threshold of detection ($2^{16} - 1 = 65\,535$ for a 16-bit computer storage system). This phenomenon is also called clipping, and it occurs when the ADC converts the electrons into a sequence of digital signals. This clipping effect renders the relation between the measured and the true intensities nonlinear in the upper range of intensities.
A single scanner setting will not be optimal for both weakly and highly expressed genes. The choice of the parameters (of a scanner) involves a trade-off between spots with a low intensity and spots that are saturated to some degree. Thus it seems reasonable to consider multiple scanning of the same microarray at different scanner sensitivities and estimate spot intensities from the combined data.
Not much work has been done so far on improving the data quality by using multiple scans and by adjusting for pixel censoring. Dudley et al. [3] increased the dynamic range of gene expression using a new method. They proposed to hybridize experimental and control samples against labeled oligos that would be complementary with respect to every microarray feature, rather than cohybridizing the samples. However, their method cannot be applied to experiments that follow the standard method of Schena et al. Khondoker et al. presented a regression model based on a nonlinear relationship, involving both additive and multiplicative error terms, to establish a link between the saturated and the true intensities, and used an approach based on maximum likelihood estimation to correct for saturation. Lyng et al. recalculated the saturated signals using a set of unsaturated intensities from a second scan. Though they proposed a method to determine the unsaturated intensities, the main focus of their paper was on investigating the relationship between PMT voltages, spot intensities, and expression ratios for three commercial scanners of two different brands. Piepho et al. suggested using a nonlinear latent regression model for correcting the bias caused by saturation and for combining data from multiple scans. Skibbe et al. compared the number of differentially expressed genes when using approaches based on linear regression and when considering a union of sets of differentially expressed genes that had been identified by scans made by varying the PMT and laser power. They showed that the latter approach effectively identifies a subset of statistically significant genes that the former approach is unable to find.
Our approach towards improving quality of intensity measurements is based on first producing three images with different scanner sensitivities, and obtaining three different sets of expression values. We then apply a novel Bayesian latent intensity model, in which these different sets of expression values are used to estimate suitably calibrated true expressions of genes. The resulting estimates, treated in the form of respective (posterior) distributions, can be used in a higher-level analysis for identifying differentially expressed genes. The proposed approach is applicable to standard microarray methodology and to cDNA arrays. The method, however, cannot be applied to Affymetrix gene chips as the current Affymetrix technology does not allow multiple scanning. The method is generic and can be applied to data from any organism, imaging with any scanner that allows scanning at different laser powers, and extraction with any image analysis software.
In this study, we used cDNA microarrays containing 16 000 individual fragments printed in duplicate (produced in Turku Centre for Biotechnology, University of Turku, Finland). Our approach was tested on two experiments. The first experiment was designed to examine the effects of RhoG on HeLa cells by comparing the expression profile of RhoG expressing cells versus control cells. This experiment was performed on two arrays. Each array had RNA from RhoG G12V labeled with Cy5, and control labeled with Cy3. The second experiment was a self-self hybridization experiment where RNA samples from T-Rex-HeLa cells (Invitrogen) transfected with a pcDNA4/TO vector were used. Details about sample preparation, RNA extraction, sample labeling, and microarray hybridization for the two experiments can be requested from the authors.
2.2. Multiple Scanning
2.3. Quantification of Spot Intensities
Digital images were processed using GenePix Pro version 3.0 software (Axon Instruments, Inc., Foster City, Calif, USA; http://www.axon.com/GN_GenePix4000.html). The automatic spot finding algorithm of GenePix was used to find spot boundaries and to calculate spot intensities. Spot foreground and background intensities for both channels were derived, and background-corrected intensities above 200 from all three scans were used for our study.
2.4. Latent Variable Model
Strictly speaking, right censoring at the scanner's upper threshold of detection, 65 535, is only appropriate in a pixel level model. In spot intensities, some degree of clipping takes place already well below this value. Piepho et al. showed in their paper that spot saturation begins between 15 and 16 on log (base 2) scale, that is, somewhere between 32 768 and 65 535 on natural scale. The reason is that the signal from a spot is obtained by averaging the readings over the pixels belonging to the spot, and some of these pixels may be already saturated. As a result, and unlike in pixel level data, in spot level data there is no sharp threshold value beyond which saturation has an effect. Gupta et al. provided data where, as could be expected, with increasing observed spot intensity also an increasing proportion of the pixel readings had reached their maximal value 65 535. At a spot summary value of 60 000, most of the pixels comprising the spot were already saturated. However, although a pixel level model can be said to give a more truthful description of the saturation phenomenon as such, it cannot be applied in practice for analyzing pixel level data from arrays which typically contain several thousand spots. The reason is simply the computational cost involved, as each spot consists of 80–100 pixels. Here, instead of attempting to model the effect of saturation on observed spot intensities, we treat the high intensity readings, which are most affected by saturation, as right censored observations. We then compensate for the resulting loss of information by utilizing the measurements obtained with a lower laser power, finally combining, within the Bayesian framework, the information from all three scan measurements to obtain the posterior distribution of the true latent intensity.
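The averaging argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Gaussian pixel scatter, its relative spread `rel_sd`, and the spot size of 90 pixels are illustrative choices, not values taken from the study; only the 16-bit ceiling of 65 535 comes from the text.

```python
import random

PIXEL_MAX = 65_535  # 16-bit ceiling quoted in the text


def observed_spot_mean(true_mean, n_pixels=90, rel_sd=0.25, rng=None):
    """Average clipped pixel readings scattered around true_mean.

    Returns (spot mean after clipping, fraction of saturated pixels).
    The Gaussian scatter and rel_sd are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    pixels = [rng.gauss(true_mean, rel_sd * true_mean) for _ in range(n_pixels)]
    # Each pixel is clipped to the range the scanner can store.
    clipped = [min(max(p, 0.0), PIXEL_MAX) for p in pixels]
    saturated = sum(p >= PIXEL_MAX for p in pixels) / n_pixels
    return sum(clipped) / n_pixels, saturated


if __name__ == "__main__":
    for true_mean in (5_000, 30_000, 50_000, 60_000, 80_000):
        spot, frac = observed_spot_mean(true_mean, rng=random.Random(1))
        print(f"true={true_mean:>6}  observed={spot:>9.1f}  saturated={frac:.0%}")
```

Under these assumptions the observed spot mean starts to fall below the true mean while it is still under 65 535, because a growing share of the pixels is clipped; this is the sense in which spot level data have no sharp saturation threshold.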
Right censored measurements are taken care of as a part of the same estimation process, by data augmentation, which effectively means that they are replaced by the corresponding conditional distributions. Applying such a process naturally still involves deciding on the level beyond which right censoring should take place. In the results reported here, we considered signals exceeding the chosen threshold (approximately 11 on the logarithmic scale) as right censored. Later, in Section 3, we consider the influence of the choice of the censoring threshold in some detail.
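The treatment of right censored readings can be written compactly. The notation here is assumed for illustration and is not taken from the original equations: $y_i$ is an observed log intensity, $\mu_i$ its mean, $\sigma^2$ the error variance, $c$ the censoring threshold, and $\phi$, $\Phi$ the standard Normal density and distribution function. Under a Normal error model, the likelihood contribution of observation $i$ is

$$
L_i(\mu_i, \sigma) =
\begin{cases}
\dfrac{1}{\sigma}\,\phi\!\left(\dfrac{y_i - \mu_i}{\sigma}\right), & y_i < c, \\[2ex]
1 - \Phi\!\left(\dfrac{c - \mu_i}{\sigma}\right), & y_i \ge c.
\end{cases}
$$

Data augmentation avoids evaluating the second case directly: at each iteration a censored reading is replaced by a draw $y_i^{*} \sim \mathrm{N}(\mu_i, \sigma^2)$ restricted to $[c, \infty)$, after which all observations are handled as if they were fully observed.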
To complete the specification of the model, we assume that the errors of the three scans are independent Normal random variables with mean $0$ and interval dependent variances $\sigma^2_{kj}$, where $k = 1, 2, 3$ indexes the scan and $j = 1, \ldots, 6$ indexes the intensity interval (`class[i]` in Algorithm 1). The interval dependent precision parameters (inverses of the variances; `tau1[j]`, `tau2[j]`, and `tau3[j]` in Algorithm 1) were assigned gamma priors with parameters $(0.001, 0.001)$. The true underlying latent intensities (`muYe1[i]` in Algorithm 1) are assigned a Uniform prior distribution over an interval that is approximately $[0, 15]$ on the logarithmic scale. The parameters (`b`, `d`) are assigned Uniform distributions over $(0, 5)$.
2.5. Bayesian Analysis
Several authors have suggested Bayesian methods for analyzing microarray data [9, 10, 11, 12, 13, 14, 15, 16]. Under the Bayesian paradigm, once the model is defined, statistical inferences can be expressed directly in terms of posterior probabilities conditional on the observed data.
A priori, the model parameters are assumed to be independent. The numerical computations were done using Markov chain Monte Carlo (MCMC), where the sampling algorithm can be summarized as follows.
(1) Specify initial values of the model parameters and of the augmented variables to be sampled for the right censored observations.
(2) Sample the latent intensities from their conditional distribution.
(3) Sample each remaining model parameter from its conditional distribution.
(4) Sample the augmented values of the right censored observations from their conditional distributions.
(5) Repeat the sampling steps above until sufficient samples are generated.
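The sampling scheme summarized above can be sketched on a deliberately reduced problem. The following is not the model of this paper: it is a minimal Gibbs sampler with data augmentation for a single Normal mean `mu` and precision `tau`, with a flat prior on `mu`, a Gamma(0.001, 0.001) prior on `tau`, and right censoring at a known `threshold`. All function names and the simulated data are hypothetical.

```python
import random
from statistics import NormalDist


def sample_truncated_normal(mu, sigma, lower, rng):
    """Draw from N(mu, sigma^2) restricted to [lower, inf) via the inverse CDF."""
    dist = NormalDist(mu, sigma)
    p_low = dist.cdf(lower)
    u = p_low + (1.0 - p_low) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < u < 1
    return max(dist.inv_cdf(u), lower)


def gibbs_censored_normal(y, censored, threshold, n_iter=2000, burn_in=500, seed=0):
    """Return posterior draws of mu after burn-in."""
    rng = random.Random(seed)
    n = len(y)
    # Initial values: censored readings start at the recorded (clipped) value.
    y_aug = list(y)
    mu, tau = sum(y) / n, 1.0
    draws = []
    for it in range(n_iter):
        # Sample mu given the augmented data and tau (flat prior on mu).
        mu = rng.gauss(sum(y_aug) / n, (n * tau) ** -0.5)
        # Sample tau given the augmented data and mu (Gamma(0.001, 0.001) prior).
        ss = sum((v - mu) ** 2 for v in y_aug)
        tau = rng.gammavariate(0.001 + n / 2, 1.0 / (0.001 + ss / 2))
        # Sample augmented values for the right censored observations.
        sigma = tau ** -0.5
        for i in range(n):
            if censored[i]:
                y_aug[i] = sample_truncated_normal(mu, sigma, threshold, rng)
        if it >= burn_in:
            draws.append(mu)
    return draws


def simulate(n=200, mu=10.0, sigma=1.0, threshold=11.0, seed=1):
    """Simulate log intensities and clip them at the threshold."""
    rng = random.Random(seed)
    raw = [rng.gauss(mu, sigma) for _ in range(n)]
    return [min(v, threshold) for v in raw], [v >= threshold for v in raw]


if __name__ == "__main__":
    y, censored = simulate()
    draws = gibbs_censored_normal(y, censored, threshold=11.0)
    print("naive mean of clipped data:", sum(y) / len(y))
    print("posterior mean of mu:      ", sum(draws) / len(draws))
```

Because every augmented value lies at or above the threshold, the posterior mean of `mu` sits above the naive mean of the clipped data, which is the bias correction that data augmentation is meant to provide.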
The model was formulated in the BUGS language and parameter estimation was performed using WinBUGS [17]. WinBUGS is free software and its newer versions can also be run from within the statistical package R. The current model runs in OpenBUGS version 2.01 on an Intel Pentium 2.80 GHz processor with 1 GB RAM and takes approximately two hours for 30 000 iterations using two chains in parallel. Convergence was monitored visually (i.e., by the mixing of two chains) and, after a burn-in of 3000 iterations, two chains of 12 000 iterations each were generated to check the convergence of the parameter estimates under consideration. Thereafter, a sample of size 15 000 was generated to make inference.
The approach described in this paper was tested on the two real data sets described in Section 2.1. For the first experiment, samples were hybridized on two arrays, and each array was scanned three times at different scanner sensitivities (see Table 1). The same samples were hybridized on both arrays, but the scanner settings chosen for the two arrays were different as the experiments were performed independently on the two arrays.
Posterior median estimate of parameters (median ± sd) using data from Cy3 dye
Here our focus has been on the systematic bias in the intensity measurements caused by intrinsic scanner noise at the lower end and pixel censoring at the upper end. These two problems cannot be handled under a single scanner setting. Moreover, guidelines are not available for choosing optimal scanner settings to address both of these issues. Therefore, it seems reasonable to do several scans on every array, some at relatively lower sensitivities (ensuring no censoring at the upper end) and others at higher sensitivity levels (to capture weakly expressed genes), and ultimately combine the information to get improved gene expression measurements at all ranges. More scans can easily be accommodated in the model but there are practical limitations like degradation of the dye and the time required for scanning. Keeping these points in mind, three scans seem to be a good compromise.
The proposed method has advantages at three levels. First, modeling under the Bayesian framework allows for missing data estimation by sampling randomly from the corresponding posterior predictive distribution. Second, it allows for joint estimation of a large number of model parameters and latent variables. Usually for analyzing microarray data, the statistical methods are applied in a sequential manner with the output of each step in the analysis serving as the input for the next. Under the sequential approach, the uncertainties in the conclusions from any earlier step make the subsequent steps dependent on the particular choice of the method and the resulting point estimate that is then used. In our model, such uncertainties are accounted for in a systematic manner as we work with distributions of all the unknown parameters, including the latent expression of the genes being considered. A third aspect of our method is that it opens up the possibility of extending the current model to accommodate features of normalization and identification of differentially expressed genes in an integrated model, which first improves the overall signal and then identifies differentially expressed genes by using such improved signals. Realization of the integrated model is in principle a straightforward modification to the model proposed here, by adding further layers to the present hierarchical model. Such additional layers would then account for between-array variations, within-array variations, and dye swap, and allow for comparing and combining data from multiple arrays. We are currently working towards such an integrated model.
See Algorithm 1.
Algorithm 1: Code written in BUGS language.
model {
  for(i in 1 : N) {
    T[i] <-
    muYe1[i] ~ dunif(0, 15)
    class[i] <- 1 + step(logye1[i] - cut1) + step(logye1[i] - cut2) + step(logye1[i] - cut3) + step(logye1[i] - cut4) + step(logye1[i] - cut5)
    A[i] <- (b * muYe1[i]) * step(cut1-logye1[i]) + (b + (b * (muYe1[i]-1))) * step(cut2-logye1[i]) * step(logye1[i]-cut1) +
    B[i] <- (b + b + b + (b * (muYe1[i]-3))) * step(cut4-logye1[i]) * step(logye1[i]-cut3) + (b + b + b + b +
    muYe3[i] <- D[i]+E[i]+F[i]
    D[i] <- (d * muYe1[i]) * step(cut1-logye1[i]) + (d + (d * (muYe1[i]-1))) * step(cut2-logye1[i]) * step(logye1[i]-cut1) +
    E[i] <- (d + d + d + (d * (muYe1[i]-3))) * step(cut4-logye1[i]) * step(logye1[i]-cut3) + (d + d + d + d +
    logye1[i] ~ dnorm(muYe1[i], tau1[class[i]]) I(logye1cen[i], )
    logye2[i] ~ dnorm(muYe2[i], tau2[class[i]]) I(logye2cen[i], )
    logye3[i] ~ dnorm(muYe3[i], tau3[class[i]]) I(logye3cen[i], )
  }
  for(j in 1 : nClass) {
    tau1[j] ~ dgamma(0.001, 0.001)
    sigma1[j] <- 1 / sqrt(tau1[j])
    tau2[j] ~ dgamma(0.001, 0.001)
    sigma2[j] <- 1 / sqrt(tau2[j])
    tau3[j] ~ dgamma(0.001, 0.001)
    sigma3[j] <- 1 / sqrt(tau3[j])
    b[j] ~ dunif(0,5)
    d[j] ~ dunif(0,5)
  }
  for(i in 1 : N) {
    residual1[i] <- Ye1[i]-muYe1[i]
    residual2[i] <- Ye2[i]-muYe2[i]
    residual3[i] <- Ye3[i]-muYe3[i]
  }
}
where N is the number of genes; logye1, logye2, and logye3 are the measurements from the three scans on logarithmic scale; and I(logye1cen[i], ), I(logye2cen[i], ), and I(logye3cen[i], ) were used to specify the lower bound for the censored measurements from the three scans.
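The interval index `class[i]` in Algorithm 1 is built from the BUGS `step` function, which returns 1 when its argument is greater than or equal to zero and 0 otherwise. A small sketch of the same computation outside BUGS may make the construction easier to follow. The cut points below are hypothetical placeholders, since the values of `cut1` to `cut5` are not given here.

```python
from bisect import bisect_right


def step(x):
    """BUGS-style step function: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def interval_class(log_intensity, cuts):
    """Mirror of: class[i] <- 1 + step(logye1[i] - cut1) + ... + step(logye1[i] - cut5)."""
    return 1 + sum(step(log_intensity - c) for c in cuts)


# Hypothetical cut points on the log scale (not the values used in the study).
HYPOTHETICAL_CUTS = [7.0, 8.0, 9.0, 10.0, 11.0]

if __name__ == "__main__":
    for value in (6.5, 7.0, 9.3, 11.5):
        print(value, "->", interval_class(value, HYPOTHETICAL_CUTS))
```

With five sorted cut points the index runs from 1 to 6, so each scan has one precision parameter per intensity interval. For sorted cuts the same index equals `1 + bisect_right(cuts, value)`.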
The authors thank Bob O'Hara for the careful reading of the manuscript and Mizanur R. Khondoker for having kindly provided the computer code for generating the data that led to our Figure 9. This work has been supported by the ComBi graduate school (RG), the Academy of Finland via its funding of the Centre of Population Genetic Analyses and via the SYSBIO Research Program (EA), and the Institute of Biotechnology (PA).
- 3. Dudley AM, Aach J, Steffen MA, Church GM: Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proceedings of the National Academy of Sciences of the United States of America 2002, 99(11):7554-7559. 10.1073/pnas.112683499
- 9. Keller AD, Schummer M, Hood L, Ruzzo WL: Bayesian classification of DNA array expression data. In Tech. Rep. UW-CSE-2000-08-01. Department of Computer Science and Engineering, University of Washington, Seattle, Wash, USA; 2000.
- 11. Dror RO, Murnick JG, Rinaldi NA: A Bayesian approach to transcript estimation from gene array data: the beam technique. In Proceedings of the 6th Annual International Conference on Computational Molecular Biology (RECOMB '02), Washington, DC, USA, April 2002. ACM Press; 137-143.
- 13. Ramoni MF, Sebastiani P: Bayesian methods for microarray data analysis. Proceedings of the IMA Workshop 1: Statistical Methods for Gene Expression: Microarrays and Proteomics, Minneapolis, Minn, USA, September-October 2003.
- 17. Spiegelhalter DJ, Thomas A, Best NG: "WinBUGS" Version 1.2. User Manual, MRC Biostatistics Unit, 199.
- 18. Yang YH, Dudoit S, Luu P, Speed TP: Normalization for cDNA microarray data. Microarrays: Optical Technologies and Informatics, Proceedings of SPIE, San Jose, Calif, USA, January 2001, 4266:141-152.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.