Inverse Langmuir method for oligonucleotide microarray analysis
- 4k Downloads
An algorithm for the analysis of Affymetrix Genechips is presented. This algorithm, referred to as the Inverse Langmuir Method (ILM), estimates the binding of transcripts to complementary probes using DNA/RNA hybridization free energies, and the hybridization between partially complementary transcripts in solution using RNA/RNA free energies. The balance between these two competing reactions allows for the translation of background-subtracted intensities into transcript concentrations.
To validate the ILM, it is applied to publicly available microarray data from a multi-lab comparison study. Here, microarray experiments are performed on samples which deviate only in few genes. The log2 fold change between these two samples, as obtained from RT-PCR experiments, agrees well with the log2 fold change as obtained with the ILM, indicating that the ILM determines changes in the expression level accurately. We also show that the ILM allows for the identification of outlying probes, as it yields independent concentration estimates per probe.
The ILM is robust and offers an interesting alternative to purely statistical algorithms for microarray data analysis.
KeywordsOutlier Probe Microarray Data Analysis Log2 Fold Change Median Absolute Deviation Complementary Probe
Thanks to DNA microarrays the gene expression profiling can nowadays be extended to a genome-wide analysis. Since the first prototypes of the mid nineties , microarray technology has advanced considerably in terms of reproducibility and cross-platforms agreement . Quite some effort was also devoted to the development of data analysis tools, which process the raw experimental fluorescence intensities through background subtraction, normalization and eventually summarization. Different microarray platforms, for instance two-color versus single-color arrays, require also different data processing. Most of the current algorithms for DNA microarrays data analysis  rely on complex statistical transformations for the above mentioned preprocessing steps. Microscopically based methods [4, 5, 6] offer an interesting alternative to purely statistical approaches. These methods use estimates of physical quantities involved in the underlying microscopic processes, as for instance the hybridization free energy, which measures the transcript-probe affinity. The fluorescent intensities are then linked to gene expression levels by means of thermodynamic functions. Input from physics and chemistry is expected to offer a simpler, but still accurate handling of the experimental data.
Here we describe a thermodynamic approach for the calculation of gene expression levels from microarray data, which will be referred to as the Inverse Langmuir Method (ILM). As will be shown below, the ILM determines changes in the expression level accurately, using a simple computational scheme involving a minimal number of adjustable parameters. In Affymetrix Genechips  transcripts are interrogated by oligo sequences (25-mers), which are referred to as the probes. The collection of 10 to 20 probes complementary to the same transcript forms a probe set. The ILM allows for the identification of "outlying" probes, for instance due to a faulty genomic annotation, or a high sensitivity for cross-hybridization. The ILM thus also provides feedback for the improvement of the microarray design.
Results and discussion
To assess the quality of the ILM, we use in this paper the publicly available data from Gene Expression Omnibus with number GSE2521. These data originate from a study  of a multi-laboratory comparison of hybridization of the same mRNA sample on three different platforms: Affymetrix oligo, two-color cDNA and two-color oligo arrays. Using mixtures of knockout human cell lines two samples were created in which the expression of few genes is expected to be altered. The study  focuses on 16 genes whose expression level was measured by RT-PCR in both samples, in order to compare the log-fold changes with those obtained from Microarray data analysis. Here we are concerned only with Affymetrix data, which were produced by five different laboratories, with two technical replicates each.
In a microarray hybridization experiment several different types of chemical reactions take place simultaneously. A transcript sequence in solution does not only bind to its complementary probe, but may be involved in, for instance, self-folding, it may bind to other partially complementary transcripts in solution, or to non-complementary probes. A review of the physical chemistry of these processes can be found in Ref. . The ILM takes into account a subset of these processes. Nevertheless, a comparison with RT-PCR data shows that it is an accurate method for the estimation of the fold changes of the transcript concentrations.
Comparison accuracies ILM and RMA
Offset ILM (p-value)
Log-fold changes of RT-PCR, ILM and RMA for lab 2 and 4
To assess the effects of different labs and genes in the analysis, we used a general linear model of the type Δ = Gene + Lab + Gene : Lab where Δ is the difference between log2 fold changes for RT-PCR and ILM data. A similar analysis was done also for RMA data. In both cases the lab and gene-lab interaction effects are not significant. The p-values are for ILM pLab = 0.14 and pGene:Lab = 0.55, while for RMA pLab = 0.87 and pGene:Lab = 0.26. The majority of the variation in both methods is explained by the gene effect.
ILM concentrations (expression levels) for each individual probe in a probe set
The absolute concentrations in Fig. 2 and Table 3 are given in picomolar (pM). The estimated values of the concentration depend in the ILM on some parameters, see Ref.  and Methods. In the present work we have taken these parameters from Ref. , where they were fitted against Affymetrix spike-in data. Hence, no adjustment of the parameters is done in the present analysis. However the absolute picomolar values should be interpreted with some care. As hybridization conditions may vary from experiment to experiment, the parameters from Ref.  may not be fully adequate for the present experiment. A retuning of the parameters should mainly affect the absolute concentration estimates, but not the fold-changes. The agreement between fold-changes of ILM and RT-PCR data indicates that this is plausible.
Most of the current algorithms for oligonucleotide microarray data analysis rely on purely statistical methods and use complex transformations for normalization, background subtraction and summarization . Due to its importance in many biotechnological applications, DNA hybridization when both strands are free in solution has been widely investigated during the past decades (see e.g. [10, 11, 12]). It is therefore natural to seek for an application of the insights obtained during those studies to the case of hybridization in DNA microarrays. The advantages of microarray data analysis algorithms based on physico-chemical properties of the underlying hybridization process were indeed emphasized in few studies [4, 5, 6].
Here we presented the Inverse Langmuir Method, a new algorithm for Affymetrix Genechips data analysis which is based upon DNA and RNA hybridization thermodynamics. The ILM consists of two steps: the background subtraction and the transcript concentration estimation for the background-subtracted intensities. These two steps, which were accurately fine-tuned and tested separately on Affymetrix spike-in data in previous publications [9, 13], are here combined into a single algorithm.
To validate the ILM, we applied it to publicly available microarray data from a multi-lab comparison study  in which microarray experiments were performed on mRNA samples where different expression levels of few genes are induced. The log2 fold change between these two samples, as obtained from RT-PCR experiments, agrees well with the log2 fold change as obtained with the ILM, indicating that the ILM determines changes in the expression level accurately. A comparison of replicated experiments shows that the ILM is not very sensitive to slight variations in the experimental handling of the microarrays, and hence gives a very good degree of reproducibility. Since the ILM yields independent concentration estimates per probe, it allows for the identification of outlier probes. Differently from other physico-chemical inspired algorithms [5, 6] the ILM uses a much smaller number of adjustable parameters. This is because the hybridization free energies ΔG and ΔG' entering in Eq. (3) are obtained from RNA/DNA and RNA/RNA tabulated stacking parameters obtained from melting experiments in solution [10, 11].
In the near future, we plan to make a freely available version of the ILM available through http://www.bioconductor.org.
In general, in a microarray experiment, the intensity measured from a given probe with sequence s can be decomposed into two contributions
I(c) = Isp(c) + I0.
The first term, Isp, is the specific part of the signal, which depends on the transcript concentration in solution c, which is the gene expression level one needs to estimate. The second term, I0, is due to cross-hybridization or other spurious effects. This is the amount of fluorescent signal one measures even in absence of the specific transcript (c = 0); it is sequence-dependent, since some probes (for instance CG-rich ones) are more prone to cross-hybridize than others. Equation (1) neglects competitive effects between specific and non-specific hybridization (see e.g. ), which arise if the intensity approaches saturation. Since typically the background intensity I0 is only a few percent of the saturation intensity, this effect only plays a significant role if the total intensity is dominated by the contribution of specific hybridization. In that case, however, the resulting overestimation of I0 does not affect the main purpose of the ILM, which is to estimate transcript concentrations.
In the ILM, the estimate of I0 is obtained as described in Ref. . This background estimation scheme combines information from the intensities of nearby probes with a purely sequence-based estimation of the affinity for non-specific binding. The latter correlates with the stacking parameters obtained from melting experiments in solution . The performance of this background estimation scheme has been extensively tested , with excellent results, on pure background data from Affymetrix spike-in experiments.
Here A is the saturation value, i.e. the limiting value of intensity reached at large concentrations. At saturation all probes are hybridized and the microarray is no longer sensitive to a further increase in transcript concentration. An analysis of Genechips intensity histograms  of several organisms reveals that there is a sharp drop at intensities in the range 10,000 ≲ I ≲ 15,000, and very few probes have a higher intensity. This suggests that A varies within this range. Recent literature [15, 16, 17] reported significant variations in A from probe to probe, attributed to the post-hybridization washing step. Lacking a detailed quantitative understanding of the washing process, we take a fixed value of A = 10,000, as done in Ref. .
with R the gas constant (= 1.99 cal/(mol·K)) and K0 is a unit constant with a dimension of inverse concentration. The hybridization free energies ΔG for transcript-probe hybridization are calculated from the nearest neighbor model, with tabulated experimental data for DNA/RNA  duplex formation in solution. Recent test experiments  on spotted arrays showed a strong correlation between microarray fluorescent intensities and hybridization free energies from stacking parameters in solution, corroborating our underlying assumption that hybridization in solution is very similar to hybridization on a microarray. The hybridization free energies ΔG' enter in the factor α and are associated to transcript-transcript hybridization in solution as well as self-folding; these are also calculated from the nearest neighbor model, with tabulated experimental data for RNA/RNA  duplex formation in solution. Both types of these hybridization free energies do not contain fitting parameters. The ILM uses therefore only four global adjustable parameters which are A, T, T' and , with values as obtained by regression analysis of Affymetrix spike-in data . In general, one may wonder whether different experimental circumstances require a refitting of these parameters. For instance, the parameter α describes hybridization between partially complementary transcripts in solution. Two different experiments with, say, human mRNAs extracted from two different tissues contain different sequences in solution and hence α could in principle be different. Also, a significant decrease in total concentration, for instance, will impact the balance between the two competing types of processes , and would result in a reduced . In a previous publication  it has been shown that the model of Eqs. (2,3) describes well spike-in HGU133 data, with the same parameters A, T, T' and as fitted on HGU95 . We expect thus that a refitting does not significantly improve the results. Indeed, the good agreement with the RT-PCR data presented in this paper is an indication that the values of parameters as fitted Ref.  are appropriate for the study presented here. Note the competing effect of the two coupled chemical reactions: an increase of the affinity of the transcript-probe hybridization (increase of ΔG) leads to an overall increase of the fluorescence signal. The effect is opposite for an increase in the affinity of transcript binding in solution (higher ΔG'), as this leads to a decrease of the signal. We also note that Eq. (3) defines an extended Langmuir model in which K can be viewed as a measurements of the effective affinity for binding of the transcript to the probe sequence, in which the effect of reaction 2) is incorporated.
This equation forms the basis of the Inverse Langmuir Method (ILM). For a probe set containing n probe sequences, Eq. (3) yields n values of K and from Eq. (4) one gets n independent estimates of the transcript concentration. We take the median estimate as our best estimate for the value of the transcript concentration.
In summary, the ILM measures the expression level (transcript concentration) as follows. The first step in the analysis is the background estimation  and its subtraction from the raw intensities to yield I sp = I - I0. Next, using Eqs. (3) and (4) one obtains a value of the concentration per probe. The median concentration within the probe set is then taken as global transcript concentration.
We acknowledge financial support from FWO (Research Foundation – Flanders) grant n. G.0311.08. We thank Joke Allemeersch for assisting us with the statistical analysis.
- 3.Speed TP: Statistical Analysis of Gene Expression Microarray Data. Chapman and Hall. 2003Google Scholar
- 8.Binder H: Thermodynamics of competitive surface adsorption on DNA microarrays. J Phys: Cond Matt. 2006, 18: S491. 10.1088/0953-8984/18/18/S02.Google Scholar
- 15.Burden CJ, Pittelkow Y, Wilson SR: Adsorption models of hybridization and post-hybridization behaviour on oligonucleotide microarrays. J Phys: Cond Matt. 2006, 18 (23): 5545. 10.1088/0953-8984/18/23/024.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.