Background

With the recent technological advances, different genome-wide studies have uncovered an unprecedented number of structural variants in the human genome [17], mainly in the form of copy number variations (CNVs). As much as 12% of the human genome has been reported to be variable among different individuals [6]. The important number of genes and other regulatory elements encompassed by those variable regions, makes it very likely for them to have functional and, ultimately, phenotypical consequences [8, 9]. In fact, several publications have already correlated the number of copies of different genes with different degrees of disease predisposition [1012]. Therefore, the identification of DNA copy number is important in understanding genesis and progression of human diseases.

Several techniques and platforms have been developed for genome-wide analysisi of DNA copy number, such as array-based comparative genomic hybridization (aCGH). The goal of the analysis of DNA copy number data using this approach is to partition the whole genome into segments where copy numbers changes between contiguous segments may be present. The ability of aCGH to discern between different number of copies is very limited, thus different kinds of techniques have been developed for targeted, and more precise, analyses of genomic regions. Multiplex Ligation-dependent Probe Amplification (MLPA) [13] is one of the most used technologies among other existing ones such us Quantitative Multiplex PCR of Short Fluorescent (QMPSF) [14, 15] or Multiplex Amplifiable Probe Hybridization (MAPH) [16].

MLPA is a recently developed semi-quantitative method that aims to detect copy number alterations at the genomic level (gains or loses) in a test DNA with respect to a control. Due to its low cost, reliability and ease of implementation it has become very popular both as a research and a diagnostic tool. After hybridization, ligation and multiplex amplification of specific probes targeting different genomic regions, the probes are electrophoresed and analyzed using a DNA analyzer [13]. Each specific probe migrates according to its molecular weight and the resulting pherograms show specific peaks that correspond to each probe (Figure 1). Relative dosage information can be obtained after the comparison of peak intensities (height or area) of the different probes between test and control samples.

Figure 1
figure 1

Example of a deletion in an MLPA assay. Panels A and B show pherograms corresponding to the electrophoresis of an MLPA assay. In the Y-asix are depicted the intensity signals (peak heights) for each probe that are depicted in the X-axis according to their length (probe size). Peaks marked with a C correspond to control probes and peaks numbered from 1 through 24 correspond to region-specific probes. Panel A corresponds to a normal individual, while panel B corresponds to an individual with a deletion at probe #13 as visible by the reduced peak intensity in this pherogram.

Due to the variation of PCR efficiencies across probes (due to their different size and nature) and across samples, a normalization method is needed before comparing dosage quotients. This step is crucial because the variation in experimental conditions may lead to differences of measured values between samples, thus hampering the correct interpretation of results. After normalization, which eliminates possible differences introduced during the experimental process, we aim to find biological differences in gene dosage (copy numbers). That is, the normalization tries to minimize the amount of systematic non-biological variation among samples. Different normalization methods have been used for analyzing MLPA data. The most common method divides each intensity by the sum of all intensities in each sample (see, for instance [17] and [18]). Other alternative approaches are based on regression methods that account for the amplification decay of larger probes. One of such regression methods uses internal control probes [19], while the another normalizes intensities based upon the statistically most probable median peak intensities using a median filter [20]. Other authors suggest to normalize peak intensities by using 4 separate peak groups according to increasing frament sizes ot the peaks (see [21] for further details). Finally, a similar approach using the mean intensity of control probes inside a normalization group as the dividing factor was considered in [22].

After an intensity normalization, the next key point is the determination of genes that are significantly altered. As an example, the peak intensity of exon 13 in Figure 1 seems to be lower in patient than in control indicating a deletion. Again, different approaches have been used to find probes that are altered. In this work we are considering ratio- [18, 23] and regression-based [19, 20] methods. In both approaches, the basis for the analysis is the comparison among normalized peak intensities from patient and control samples using a dosage ratio. In the ratio-based approach, deletions and duplications are given as outliers from the data set after defining a "biological" threshold (i.e. in a diploid genome, two copies exists of each gene). We assume that the simplest, most likely scenario, is an heterozygous gain or a loss of the material, under which ideal ratio values of 1.5 and 0.5 are expected, respectively. In such a simple scenario, and taking into account experimental noise, it is generally accepted in the literature that values below 0.7 and above 1.33 are indicative of loss and gain of a genetic material, respectively [13, 18]. The regression-based approach is based on fitting several linear regression models, considering the normalized peak intensity of a given patient as the dependent variable and the normalized peak intensities of the mean control sample as independent variables. Another method based on taking into account the individual noise for each probe (i.e., standard deviation) for all control samples is considered in [24].

Both the ratio- and regression-based methods have general drawbacks. The ratio-based approach always considers the same gain/loss threshold without taking into account the different variability among experiments and among probes. On the other hand, although the regression-based approaches solve this problem, fitting regression models without considering that the independent variable (normalized peak intensity in controls) is also subject to error may lead to wrong conclusions. In addition, the regression method uses confidence intervals for predictions which are built under the normal theory. The normality assumption may be untenable, particularly because of the limited amount of control probes that are typically used in routine experiments (< 10 probes). Finally, none of these procedures considers replicates for each individual, reducing information by averaging the replicate values.

To overcome these difficulties, we propose alternative methods to deal with the normalization and the discovery of alterations when analyzing MLPA data. Our approach is based on both nonlinear (normalization procedure) and linear (to determine alteration) mixed models. The main advantages of our methods are: (1) the variation among individuals is modelled explicitly using a random effect; and (2) the replicates may be easily incorporated in the model. The deletions and duplications will be determined using tolerance intervals. We validate and compare our methods with the existing methods using a controlled experiment. We evaluate the presence of gains and loses in a set of DNAs from normal individuals and individuals affected by known genetic disorders. The probemix we used included probes of single copy number regions located both inside and outside of the genomic disorder regions. A total of 30 samples, including 24 controls and 6 patients, were analyzed blindly.

Results

Simulation study

In this section, we investigate the performance of the two existing methods and our proposed approach through a number of simulations. We simulated an hypothetical assay with 45 probes. We also simulated internal control probes (i.e. non-altered) to be used in the iterative regression approach by setting σ γ = 0. Two different scenarios were simulated in order to investigate the performance of REX-MLPA approach given the number of control probes: 10% and 20% of the 45 probes. Finally, the ratios of the rest of probes were simulated as altered (gains or loses) depending on a different percentages: 5%, 10%, 20% and 50% of the probes. These values were generated from a normal distribution under different scenarios varying between probe variability (σ β ∈ {0.2, 0.4, 0.6}), probe-test variability (σ γ ∈ {0.2, 0.4, 0.6, 1.0, 1.5, 4.0}) and within-probe or random variability (σ ε ∈ {0.05, 0.08}). We simulated 3 replicates of each peak intensity for test and control samples. Note that the average of these replicates were considered in the case of using threshold and REX-MLPA approaches. We summarized our simulations by computing the mean number of altered probed simulated in each run, the empirical type-I error and the empirical power to detect gains and loses. These results are based on 1,000 simulations.

Tables 1 and 2 show the results for the case of having 10% of true altered probes when 10% and 20% of probes have been generated as a internal control probes, respectively. The results for the REX-MLPA method were based on building confidence limits at 99% at each iteration. Regarding the empirical type-I error, the REX-MLPA approach usually overestimate the expected 5%, while the threshold method clearly underestimates type-I error since in all simulations the simulated type-I error was closer to 0% (results omitted in the tables). On the other hand, the probe-specific model shows a good performance. As expected, the power of the three methods increases when the probe-test variability increases. That is, the power depends on the size effect of those altered probes. In general, the probe-specific mixed-model outperforms both REX-MLPA and threshold approaches. The threshold approach is clearly underpowered for those cases in which the magnitude of the effect is not large enough. Finally, the same pattern is observed when the random variability is increased (e.g. within-probe variability) but showing lower power. The simulation results for the case of having 5%, 20% and 50% of true altered probes showed similar behaviour and they can be found in the Additional file 1.

Table 1 Empirical type-I error and power obtained in 1,000 simulations using the three different approaches: REX (iterative regression), PEMM (probe-specific mixture model) and threshold.
Table 2 Empirical type-I error and power obtained in 1,000 simulations using the three different approaches: REX (iterative regression), PEMM (probe-specific mixture model) and threshold.

Validation study

In order to validate our proposed method, and compare its performace with those of the existing methods, we have designed a controlled MLPA probemix in which we have included 34 different oligonucleotides, corresponding to 16 different targeted regions with single copy number in normal individuals. Eight of the targeted regions are variable in copy number in individuals suffering from different genomic disorders (see Table 3). A total of 30 DNAs from different individuals were assayed in triplicate in the study. 24 of the DNAs came from unrelated HapMap individuals (i.e. normal general population), 2 DNAs from Prader-Willi syndrome patients (15q11-q13 deletion), 2 DNAs from DiGeorge syndrome patients (22q11 deletion) and 2 DNAs from autistic patients with a duplication in 10q region. After hybridization, ligation and PCR, probes corresponding to the different individuals were electrophoresed and peak intensities were recorded and blindly analyzed using the different methods. MLPA hybridization, ligation and PCR with universal MLPA primers was performed as described elsewhere [13]. PCR products were loaded on a capillar DNA analyzer and electrophoresed. Genescan software was used to analyze the runs and to retrieve peak intensities corresponding to each probe in the different samples.

Table 3 MLPA probemix composition

Comparison of methods of copy number estimation

Before analyzing the data we illustrate that our proposed method based on a mixed-model can be applied. Supplementary Figure S1 shows the residual error for each individual and probe indicating that the residuals are centered at zero and that the variability does not change with individuals. Table 4 shows the results obtained from the 9 test samples used in the validation study using the threshold, iterative regression and proposed probe-specific mixed-model method. For the two patients with Prader-Willi syndrome, we observed that all methods were able to detect the three deletions in 15q11-q13 region. The same conclusion was observed regarding patients diagnosed with DiGeorge syndrome. Considering patients with Autism, all methods detected gains in the 10q region. Only in one case threshold method had a false negative result and the REX-MLPA method indicated a deletion in the UBE3A, a false positive finding. These results clearly agree with our simulation studies in which we found that iterative regression has a increased type-I error rate and the threshold approach a low power. For instance, the ratio between the second individual diagnosed with Autism and the mean controls was 1.27 at probe named ZWINT, so the threshold method did not indicate that this was a probe altered in this individual. This result is due to the fact that the ratio between the probe-test variability and the random error variability was around 3 (3.09 = 0.1423 0.0461 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIWaamcqGGUaGlcqaIXaqmcqaI0aancqaIYaGmcqaIZaWmaeaacqaIWaamcqGGUaGlcqaIWaamcqaI0aancqaI2aGncqaIXaqmaaaaaa@37B1@ ) and the simulation study showed that in this case the power of threshold approach is very poor. Finally, regarding HapMap individuals who were considered as negative controls, we observed that any method found a false positive result.

Table 4 MLPA results from the validation study.

Conclusion

The MLPA method is a potentially useful semi-quantitative method to detect copy number alterations in targeted regions obtained after performing genome-wide screening using comparative methods usually based on genomic hybridization. MLPA is based on comparing peak intensities of the different probes between test and control samples through its ratio. Before determining the statistical significance of dosage ratios, a normalization procedure is needed to control variation of PCR efficiencies across probes. We have proposed a non-linear mixed-model to perform such normalization, and two other existing methods have been discussed. One of them is based on considering the sum of peak intensities in each sample, while another uses internal control probes to fit a regression line which is used as a reference.

So far, the widely used method to detect statistical significance of dosage ratio is to consider as altered those probes that are outside a given threshold. In this paper, we have described a new approach to detect altered probes based on a mixed-model. The threshold is then established by using a tolerance interval which accommodates the specific random error variability observed in each test sample. We have also discussed another approach based on an iterative regression. Through simulation studies and a controlled MLPA assay, we have shown that our proposed method outperforms the existing ones. Another important advantage by comparing our approach and the REX-MLPA method is that our method can be used even when no internal control probes are available in the assay.

A novelty of our algorithm is that it can handle information from replicate experiments. Replicates of MLPA assays are typically not necessary for commercial kits since in most cases more than one probe is used to interrogate the same region and concordance between those probes is considered to provide enough reliability of the existence of a copy number variation. Nevertheless, in cases where a single probe per region is placed in the assay, it is definitely useful and desirable (if not necessary) to have replicates that help minimise possible variations arising during the long MLPA protocol and to ensure reliability of the findings. Thus, having a method that is able to handle such information is very useful and provides extra statistical robustness to the findings.

The use of control probes as references for data normalization is one of the methods recommended by the manufacturer. Commercial and home-made MLPA assays are clearly different in terms of quality of the resulting pherograms, mainly due to the process of producing the specific probes that target regions of interest. While commercial applications produce very high, sharp and neat peaks, home-made assays typically do not perform that well. Nevertheless, in both cases, the use of control probes is always desirable from a statistical point of view and it is so regardless of the number of probes that target a determined loci. In exploratory experiments with MLPA designs targeting extremely variable and polymorphic CNVs, adding control probes whose behaviour is proven unvariable and can help in assessing reliability of single experiments. In such cases, we understand that the inclusion of evenly spaced control probes is a good practice. One key point is to know the number of replicates that should be included in a MLPA assay. Our simulations indicates that 3 replicates was enough to reach a power of 90% keeping a value of alfa around 5% in those combinations where the probe-test variability was over 8 times greater than that of random error (combinations where σ γ = 0.4 and σ ε = 0.05). Thus, if the probe-test versus random error variability ratio was lower, a lesser power would be expected. In that case the researcher could improve the expected power by increasing the number of replicates.

Although our proposed method outperforms other existing ones, it has to be kept in mind that in the case of having probes targeting regions that are highly variable among the population (i.e. in an extreme case where half individuals are normal homozygous and half are homozygously deleted), the tolerance intervals calculated for those probes might collect all existing variability and become very broad, thus making it impossible to distinguish the existence of copy number alteration. Nevertheless, the other existing methods would also show inconsistent copy number calls when analysing such regions. In statistical terms, this means that we should have homogeneity of the random error variance through the probes. This is the reason why a visual inspection of the pherograms as well as manual curation of the results might be the most reliable way to proceed. However, we can easily accomodate cases where the homogeneity of random error variance assumption cannot be assumed by introducing an interaction probe-random error allowing a different random error variance for each probe. In this case, different tolerance bands would be therefore estimated for each probe.

In conclusion, we have proposed a mixed-model which is able to determine thresholds to decide whether a region is altered. These threholds are specific for each individual, incorporating experimental variability, resulting in improved sensitivity and specificity as the examples with real data have revealed. An R language package for the three approaches discussed in this paper and data analyzed will be freely available at our web page [25].

Methods

To further illustrate the statistical methods described in this section, we are using real data provided by the United Kingdom National Genetics Reference Laboratory of Manchester [26]. In particular we are analyzing a dataset from a breast cancer study (called P002 BRCA1), in which copy number changes of different genes have previously been reported. In this example, 9 control probes and 25 analytical probes were analyzed in a total of 5 controls and 8 different test samples.

Normalization

Herein, we will focus on methods based on considering the total peak intensities and regression-based approaches. The first and the simplest method of normalization, determines the normalized signal of each probe in a given sample by dividing peak intensity by the sum of all peak intensity of the sample. This method was initially proposed by [13] and is the recommended method by MRC-Holland [27] which provides commercial probes for the copy number analysis of different genome regions for diagnostic purposes. Let us begin by giving some notations to illustrate how this method works. Let H ipk be the peak intensity for the i-th individual, i = 1, ..., C, C + 1, ..., C + T (C is the number of control samples and T denotes the number of test samples), the p-th probe, p = 1, ..., P, and the k-th replicate, k = 1, ..., K. The existing simple method designed for normalizing data considers the average of these replicates, H ˜ i p = k = 1 K H i p k / K MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaGaadaWgaaWcbaGaemyAaKMaemiCaahabeaakiabg2da9maaqadabaGaemisaG0aaSbaaSqaaiabdMgaPjabdchaWjabdUgaRbqabaGccqGGVaWlcqWGlbWsaSqaaiabdUgaRjabg2da9iabigdaXaqaaiabdUealbqdcqGHris5aaaa@3EEA@ , and divides each measured peak intensity by the sum of all peaks, S i = p = 1 P H ˜ i p MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpdaaeWaqaaiqbdIeaizaaiaWaaSbaaSqaaiabdMgaPjabdchaWbqabaaabaGaemiCaaNaeyypa0JaeGymaedabaGaemiuaafaniabggHiLdaaaa@3A32@ of that sample, i.e. Y ˜ i p = H ˜ i p / S i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmywaKLbaGaadaWgaaWcbaGaemyAaKMaemiCaahabeaakiabg2da9iqbdIeaizaaiaWaaSbaaSqaaiabdMgaPjabdchaWbqabaGccqGGVaWlcqWGtbWudaWgaaWcbaGaemyAaKgabeaaaaa@38DD@ .

In addition to experimental variability, as Figure 1 shows, peak intensity exhibits a negative correlation with probe size. To account for this effect, regression methods may be employed. If internal control probes are provided in the MLPA experiment (dark vertical lines in Figure 2A), a regression-based method using the control probes may also be employed to normalize the data. Different methods have been proposed. The first approach, known as "slope correction" [19], model the negative correlation between probe size and peak intensities using a linear regression model following the formula H ˜ p c = α + β X p c MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaGaadaWgaaWcbaGaemiCaa3aaSbaaWqaaiabdogaJbqabaaaleqaaOGaeyypa0JaeqySdeMaey4kaSIaeqOSdiMaemiwaG1aaSbaaSqaaiabdchaWnaaBaaameaacqWGJbWyaeqaaaWcbeaaaaa@39A0@ , where X pc denotes the size of the p c -th internal control probe for a given individual (Figure 2B solid blue line), or using a quadratic model H ˜ p c = α + β 1 X p c + β 2 X p c 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaGaadaWgaaWcbaGaemiCaa3aaSbaaWqaaiabdogaJbqabaaaleqaaOGaeyypa0JaeqySdeMaey4kaSIaeqOSdi2aaSbaaSqaaiabigdaXaqabaGccqWGybawdaWgaaWcbaGaemiCaa3aaSbaaWqaaiabdogaJbqabaaaleqaaOGaey4kaSIaeqOSdi2aaSbaaSqaaiabikdaYaqabaGccqWGybawdaqhaaWcbaGaemiCaa3aaSbaaWqaaiabdogaJbqabaaaleaacqaIYaGmaaaaaa@43C3@ (Figure 2B, dotted blue line).

Figure 2
figure 2

MLPA normalization procedure. Panel A shows the peak intensities depending on probe sizes. Panel B shows the regression estimates for different methods: linear regression model using control probes (solid blue line), quadratic model using control probes (dotted blue line), nonlinear mixed model (red lines) using control probes (solid line) or median filter approach (dotted line). Panel C illustrates the parameters involved in the nonlinear mixed model. Panel D shows the size-adjusted normalized peak intensities prepared to compute the dosage quotient. In all panels dark lines represents control probes while light lines are for analytical probes.

Another possibility proposed by [20] is to fit an exponential decay model, H ˜ p = α e β X p MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaGaadaWgaaWcbaGaemiCaahabeaakiabg2da9iabeg7aHjabbwgaLnaaCaaaleqabaGaeyOeI0IaeqOSdiMaemiwaG1aaSbaaWqaaiabdchaWbqabaaaaaaa@381C@ . In that case, the authors propose to consider all probes p and call this method "population normalization". Notice that in that case β encodes the rate of descend. [20] argued that the trend of peak intensities varies greatly between samples regarding the control probes. This is the reason why they propose to consider all probes (both analytical and control ones) after applying a median filter approach to remove outliers. However this method has important inconveniences. The smoothed data, obtained after applying the median filter, violates some of the assumptions of nonlinear regression such as that residuals are no longer independent or that errors are not Gaussian, among others [28].

Neither slope correction nor population normalization approaches consider replicates, e.g. they use H ˜ i p MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaGaadaWgaaWcbaGaemyAaKMaemiCaahabeaaaaa@2FED@ instead of H ipk . In addition, these methods do not model the variability among individuals either. That is, using these methods, the authors allow only β to vary among individuals fitting different models for each one. However, the pattern of decay may vary highly between subjects, which is not captured by varying β 's only. As an example, the height of the peak intensity for the first control probe (top parameter in Figure 2C) and the asymptotic value for larger probe sizes (bottom parameter in Figure 2C) are very different among test samples, as we observe in Figure 3.

Figure 3
figure 3

Regression estimates for the test samples from BRCA1 data set. Regression estimates for each of the 8 test samples given in the example provided by the NGRL-Manchester called P002 BRCA1. Red lines are estimated using the nonlinear mixed model. The solid lines are estimated using control probes, while dotted lines are obtained after using the median filter approach. The dotted blue lines are showing the regression estimates using quadratic model. These regression lines are then used to normalize the peak intensities.

To circumvent these problems, we propose the following nonlinear mixed model to normalize the data which includes replicates and variability among individuals:

H i p k = ( φ 1 i φ 2 i ) exp [ 1 φ 3 i ( X p off i ) ] + φ 2 i + ε i p k , φ i = [ φ 1 i φ 2 i φ 3 i ] = [ β 1 β 2 β 3 ] + [ b 1 i b 2 i b 3 i ] = β + b i , b i ~ N ( 0 , Ψ ) , ε i j k ~ N ( 0 , σ 2 ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabmqaaaqaaiabdIeainaaBaaaleaacqWGPbqAcqWGWbaCcqWGRbWAaeqaaOGaeyypa0JaeiikaGIaeqOXdy2aaSbaaSqaaiabigdaXiabdMgaPbqabaGccqGHsislcqaHgpGzdaWgaaWcbaGaeGOmaiJaemyAaKgabeaakiabcMcaPiGbcwgaLjabcIha4jabcchaWjabcUfaBjabgkHiTKqbaoaalaaabaGaeGymaedabaGaeqOXdy2aaSbaaeaacqaIZaWmcqWGPbqAaeqaaaaakiabcIcaOiabdIfaynaaBaaaleaacqWGWbaCaeqaaOGaeyOeI0Iaee4Ba8MaeeOzayMaeeOzay2aaSbaaSqaaiabdMgaPbqabaGccqGGPaqkcqGGDbqxcqGHRaWkcqaHgpGzdaWgaaWcbaGaeGOmaiJaemyAaKgabeaakiabgUcaRiabew7aLnaaBaaaleaacqWGPbqAcqWGWbaCcqWGRbWAaeqaaOGaeiilaWcabaGaeqOXdy2aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpdaWadaqaauaabeqadeaaaeaacqaHgpGzdaWgaaWcbaGaeGymaeJaemyAaKgabeaaaOqaaiabeA8aMnaaBaaaleaacqaIYaGmcqWGPbqAaeqaaaGcbaGaeqOXdy2aaSbaaSqaaiabiodaZiabdMgaPbqabaaaaaGccaGLBbGaayzxaaGaeyypa0ZaamWaaeaafaqabeWabaaabaGaeqOSdi2aaSbaaSqaaiabigdaXaqabaaakeaacqaHYoGydaWgaaWcbaGaeGOmaidabeaaaOqaaiabek7aInaaBaaaleaacqaIZaWmaeqaaaaaaOGaay5waiaaw2faaiabgUcaRmaadmaabaqbaeqabmqaaaqaaiabdkgaInaaBaaaleaacqaIXaqmcqWGPbqAaeqaaaGcbaGaemOyai2aaSbaaSqaaiabikdaYiabdMgaPbqabaaakeaacqWGIbGydaWgaaWcbaGaeG4mamJaemyAaKgabeaaaaaakiaawUfacaGLDbaacqGH9aqpiiWacqWFYoGycqGHRaWkcqWHIbGydaWgaaWcbaGaemyAaKgabeaakiabcYcaSaqaauaabeqabiaaaeaacqWHIbGydaWgaaWcbaGaemyAaKgabeaakiabc6ha+nrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab+1q8ojabcIcaOiabhcdaWiabcYcaSGGabiab9H6azjabcMcaPiabcYcaSaqaaiabew7aLnaaBaaaleaacqWGPbqAcqWGQbGAcqWGRbWAaeqaaOGaeiOFa4Nae4xdX7KaeiikaGIaeGimaaJaeiilaWIaeq4Wdm3aaWbaaSqabeaacqaIYaGmaaGccqGGPaqkcqGGSaalaaaaaaaa@BDA6@
(1)

where X p denotes the size for the p-th probe (note that it only depends on the probe since it is the same for each individual and each replicate) φi 1is the maximum peak intensity for control probes, φ2i, is the asymptotic peak intensity, and φ3iis the reciprocal of the rate decay constant. The term off i provides a more stable parametrization for the data and corresponds to the average value of the peak size for the first control probe, off i = k = 1 K H i 1 k MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaabmaeaacqWGibasdaWgaaWcbaGaemyAaKMaeGymaeJaem4AaSgabeaaaeaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGlbWsa0GaeyyeIuoaaaa@372F@ , i = 1, ..., C, C + 1, ..., C + T. The fixed effects, β represent the population average of subject-specific parameters, φ i , and the random effects, b i , represents the deviations of the φ i 's from their population average. The random effects are assumed to be independent and the within-group errors ε ipk are assumed to be independent for different i, p and to be independent of the random effects. The model parameters are fitted by maximizing the restricted log-likelihood (RMLE) of the data using the R library nlme [29].

After estimating this generalized exponential model, the normalization procedure is performed by dividing the peak intensities, H ipk , by the regression estimate, H ^ i p k MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaKaadaWgaaWcbaGaemyAaKMaemiCaaNaem4AaSgabeaaaaa@314D@ , obtaining a normalized, size-adjusted peak intensity for every probe, Y ipk = H ipk H ^ i p k MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaKaadaWgaaWcbaGaemyAaKMaemiCaaNaem4AaSgabeaaaaa@314D@ (Figure 2D). For the control samples we consider a unique regression line corresponding to the average of all control samples. Nonetheless, for the test samples the normalization is performed individually for each subject.

Statistical significance of dosage ratios

Ratio approach

The normalized peak intensities may be directly compared between control and test samples peak by peak. The first method for determining genes that are altered (gains or loses) is based on a threshold rule. The probes whose dosage ratios are outside of the threshold lines are considered altered. The thresholds are normally 0.7 for deletions (note that 0.5 would correspond to a half reduction in the dosage but this value is considered too stringent by some researchers) and 1.33 for duplications, although other cut-points have been used in other studies [18].

Iterative regression

The main disadvantage of defining a threshold is that the specific variability of each experiment is not considered. Some authors have proposed alternative methods using linear regression models. [19] stated that, after a square-root transformation of the data, the plot of normalized peak intensities for test samples versus control samples should be around the diagonal if no dosage imbalances are present; otherwise, probes above or below would correspond to duplications or deletions, respectively. Their proposal was to fit a linear regression without intercept using only control probes and use the standard error of the regression to determine confidence limits for outlier detections. As in the case of normalization procedures, however, a small number of control probes is a limiting factor for adopting this approach since, as the author stated, the standard error of the regression underestimates the normal variability. Thus, [19] proposed an iterative regression algorithm for this problem. Starting with the control probes, the linear regression is fitted. Then, a prediction interval is computed for every probe. If the normalized peak intensity of a non-control probe falls into this interval, the probe is considered to belong to the non-altered population and it is added to the control probes in the next step. After a few iterations, a final model is reached including only those probes that are considered as preserved. In practice, those probes that are not included in this final model are considered to be either deleted or duplicated. The authors call this method Regression-Enhanced MLPA analysis (REX-MLPA). Figure 4 illustrates the confidence limits of the final regression model. [20] proposed a similar approach but starting with all probes and retaining and rejecting points at each iteration with a given level of confidence.

Figure 4
figure 4

Normal control comparison procedure using iterative regression. Plot of normalized peak intensities in controls (calculated as the mean of all controls) against normalized peak intensities in test sample number 1 using iterative regression procedure. Dotted line correspond to linear regression among those probes that are considered non altered after applying the iterative procedure. Solid lines determine upper and lower boundaries which are used to indicate whether or not a given probe is a duplication or a deletion, respectively.

The REX-MLPA approach assumes that the X variable (e.g. squared-root of mean normalized peak intensities in controls) in the linear regression model is measured without error. This assumption is not tenable since this variable corresponds to the normalized peak intensity measured in multiple control samples. Violations of this assumption are a serious problem when trying to accurately predict Y (squared-root of normalized peak intensities in test sample) from X.

The probe-specific mixture model

The REX-MLPA approach has the limitations discussed above, specifically, the regression models are fitted without considering that the independent variable (normalized peak intensity in controls) is also subject to error. In addition, confidence intervals for predictions are built assuming the normal theory. This assumption may not hold when only control probes are considered since we typically have no more than 10 probes. Replicates for each individual are also not considered.

Based on these limitations, we propose the following model to compare controls with a given test sample:

log ( Y t p k ) = μ + α t + β p + γ t p + ε t p k β p ~ N ( 0 , σ β 2 ) , p = 1 , ... , P γ t p ~ N ( 0 , σ γ 2 ) , t = 0 , 1 , p = 1 , ... , P ε t p k ~ N ( 0 , σ ε 2 ) , t = 0 , 1 , p = 1 , ... , P , k = 1 , ... , K MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabqqaaaaabaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemywaK1aaSbaaSqaaiabdsha0jabdchaWjabdUgaRbqabaGccqGGPaqkcqGH9aqpcqaH8oqBcqGHRaWkcqaHXoqydaWgaaWcbaGaemiDaqhabeaakiabgUcaRiabek7aInaaBaaaleaacqWGWbaCaeqaaOGaey4kaSIaeq4SdC2aaSbaaSqaaiabdsha0jabdchaWbqabaGccqGHRaWkcqaH1oqzdaWgaaWcbaGaemiDaqNaemiCaaNaem4AaSgabeaaaOqaauaabeqabiaaaeaacqaHYoGydaWgaaWcbaGaemiCaahabeaakiabc6ha+nrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=1q8ojabcIcaOiabicdaWiabcYcaSiabeo8aZnaaDaaaleaacqaHYoGyaeaacqaIYaGmaaGccqGGPaqkcqGGSaalaeaacqWGWbaCcqGH9aqpcqaIXaqmcqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWGqbauaaaabaqbaeqabeWaaaqaaiabeo7aNnaaBaaaleaacqWG0baDcqWGWbaCaeqaaOGaeiOFa4Nae8xdX7KaeiikaGIaeGimaaJaeiilaWIaeq4Wdm3aa0baaSqaaiabeo7aNbqaaiabikdaYaaakiabcMcaPiabcYcaSaqaaiabdsha0jabg2da9iabicdaWiabcYcaSiabigdaXiabcYcaSaqaaiabdchaWjabg2da9iabigdaXiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdcfaqbaaaeaafaqabeqadaaabaGaeqyTdu2aaSbaaSqaaiabdsha0jabdchaWjabdUgaRbqabaGccqGG+bGFcqWFneVtcqGGOaakcqaIWaamcqGGSaalcqaHdpWCdaqhaaWcbaGaeqyTdugabaGaeGOmaidaaOGaeiykaKIaeiilaWcabaGaemiDaqNaeyypa0JaeGimaaJaeiilaWIaeGymaeJaeiilaWcabaqbaeqabeGaaaqaaiabdchaWjabg2da9iabigdaXiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdcfaqjabcYcaSaqaaiabdUgaRjabg2da9iabigdaXiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdUealbaaaaaaaaaa@BCDF@
(2)

where μ is the mean normalized peak intensity across individuals and probes, α t is a fixed effect representing a different mean for all controls (t = 0) and a given test sample (t = 1), β p is a random effect for the p th probe, γ tp is is the probe-test interaction random effect, and ε tpk is a random error variable. The index p denotes the probe, and K is the number of replicates. We assume that β p , γ tp , and ε tpk are independent, homocedastic, and normally distributed random variables with mean zero. The variances are denoted by σ β 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabek7aIbqaaiabikdaYaaaaaa@3058@ for the, β p , or the "between-probe" variability, σ γ 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabeo7aNbqaaiabikdaYaaaaaa@305E@ for γ tp , or "probe-test" variability, and σ ε 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabew7aLbqaaiabikdaYaaaaaa@305E@ for the ε tpk , or "within-probe" variability. Considering the assumptions of the proposed model, we need to verify that the within-groups errors are independent and identically normally distributed, with mean zero and variance σ ε 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabew7aLbqaaiabikdaYaaaaaa@305E@ , and that they are also independent of the random effect. It can be easily checked using different plots. This assumption makes sense for biologists since probes that behave abnormally in replicate studies or when hybridised onto DNAs that have a different quality are typically removed or redesigned to ensure assay robutness and reliability. The model parameters are fitted by maximizing the restricted log-likelihood (RMLE) using the R library lme (Pinheiro and Bates, 2000).

The criterion

For the sake of simplicity, we take control samples as the reference, i.e. α0 = 0 and α1 = α, and γ0p= 0. Therefore, α may be interpreted as an average deviation between test and control samples across all probes, and γ1pas the deviation control-test for the p-th probe.

The criterion to determine those probes that are altered for each test sample is based on the observed differences among controls and its differences with controls samples. Conditioned to p-th probe, the difference between two control measurements is distributed as

log ( Y 0 p l ) log ( Y 0 p k ) ~ N ( 0 , 2 σ ε 2 ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemywaK1aaSbaaSqaaiabicdaWiabdchaWjabdYgaSbqabaGccqGGPaqkcqGHsislcyGGSbaBcqGGVbWBcqGGNbWzcqGGOaakcqWGzbqwdaWgaaWcbaGaeGimaaJaemiCaaNaem4AaSgabeaakiabcMcaPiabc6ha+nrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=1q8ojabcIcaOiabicdaWiabcYcaSiabikdaYiabeo8aZnaaDaaaleaacqaH1oqzaeaacqaIYaGmaaGccqGGPaqkcqGGUaGlaaa@59FB@
(3)

Conversely, the difference between a test, i', and a control sample, conditioned to a given probe, is distributed as

log ( Y 0 p k ) log ( Y i p k ) ~ N ( α + γ i p , 2 σ ε 2 ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemywaK1aaSbaaSqaaiabicdaWiabdchaWjabdUgaRbqabaGccqGGPaqkcqGHsislcyGGSbaBcqGGVbWBcqGGNbWzcqGGOaakcqWGzbqwdaWgaaWcbaGafmyAaKMbauaacqWGWbaCcqWGRbWAaeqaaOGaeiykaKIaeiOFa43enfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8xdX7KaeiikaGIaeqySdeMaey4kaSIaeq4SdC2aaSbaaSqaaiqbdMgaPzaafaGaemiCaahabeaakiabcYcaSiabikdaYiabeo8aZnaaDaaaleaacqaH1oqzaeaacqaIYaGmaaGccqGGPaqkcqGGSaalaaa@60AE@
(4)

where σ ε 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabew7aLbqaaiabikdaYaaaaaa@305E@ corresponds to the variance of the error defined in equation (2). Then, we consider that a probe is not altered if the mean of the difference between test and control samples is included within a probability interval defined by differences between control samples distribution. Hence, the criterion consists in checking whether the difference between test and control samples is greater than that expected between two control samples. That is, whether the difference is included in the interval

( Z 1 κ / 2 2 σ ε 2 , Z 1 κ / 2 2 σ ε 2 ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeiikaGIaeyOeI0IaemOwaO1aaSbaaSqaaiabigdaXiabgkHiTiabeQ7aRjabc+caViabikdaYaqabaGcdaGcaaqaaiabikdaYiabeo8aZnaaDaaaleaacqaH1oqzaeaacqaIYaGmaaaabeaakiabcYcaSiabdQfaAnaaBaaaleaacqaIXaqmcqGHsislcqaH6oWAcqGGVaWlcqaIYaGmaeqaaOWaaOaaaeaacqaIYaGmcqaHdpWCdaqhaaWcbaGaeqyTdugabaGaeGOmaidaaaqabaGccqGGPaqkcqGGSaalaaa@4960@
(5)

where Z1-κ/2is the 1 - κ/2 percentile from the standard normal distribution. Hence, 1 - κ is the proportion of difference control-control that the difference test-control must exceed to declare that a probe is altered. Nonetheless, when using sample data to estimate such intervals, the uncertainty associated with the estimation estimating them has to be considered. Therefore, we propose to estimate the interval (5) through a tolerance interval over the control-control differences distribution. To build such intervals, it is assumed that ν σ ^ e 2 σ e 2 ~ χ ν 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaH9oGBcuaHdpWCgaqcamaaDaaabaGaemyzaugabaGaeGOmaidaaaqaaiabeo8aZnaaDaaabaGaemyzaugabaGaeGOmaidaaaaakiabc6ha+jabeE8aJnaaDaaaleaacqaH9oGBaeaacqaIYaGmaaaaaa@3CAB@ , where ν are the residual degrees of freedom, and σ ^ e 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4WdmNbaKaadaqhaaWcbaGaemyzaugabaGaeGOmaidaaaaa@301A@ is the REML estimator of σ e 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Wdm3aa0baaSqaaiabdwgaLbqaaiabikdaYaaaaaa@300A@ . Therefore, the tolerance interval that contains (1 - κ)% proportion of the control-control differences estimated with a confidence of (1 - α)% is

( Z 1 κ / 2 ν σ ^ ε 2 χ α , ν 2 , Z 1 κ / 2 ν σ ^ ε 2 χ α , ν 2 ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaeWaaeaacqGHsislcqWGAbGwdaWgaaWcbaGaeGymaeJaeyOeI0IaeqOUdSMaei4la8IaeGOmaidabeaakmaakaaajuaGbaWaaSaaaeaacqaH9oGBcuaHdpWCgaqcamaaDaaabaGaeqyTdugabaGaeGOmaidaaaqaaiabeE8aJnaaDaaabaGaeqySdeMaeiilaWIaeqyVd4gabaGaeGOmaidaaaaaaSqabaGccqGGSaalcqWGAbGwdaWgaaWcbaGaeGymaeJaeyOeI0IaeqOUdSMaei4la8IaeGOmaidabeaakmaakaaajuaGbaWaaSaaaeaacqaH9oGBcuaHdpWCgaqcamaaDaaabaGaeqyTdugabaGaeGOmaidaaaqaaiabeE8aJnaaDaaabaGaeqySdeMaeiilaWIaeqyVd4gabaGaeGOmaidaaaaaaSqabaaakiaawIcacaGLPaaacqGGUaGlaaa@5A27@
(6)

Once the limits of the criterion are established, the mean difference test-control, δ p for each probe, p, is estimated using the estimate of α + γ i'p , where γp 1is estimated through the best linear unbiased predictor (BLUP). Then a probe for a given test sample is considered as altered when δ p does not fall into (6). Figure 5 shows those probes that are imbalanced for each individual considering the threshold obtained using tolerance limits obtained using equation (6). In this example those limits are 0.78 for deletions and 1.29 for duplications.

Figure 5
figure 5

Normal control comparison procedure using tolerance interval criteria. Plot of ratio between normalized peak intensities in controls against normalized peak intensities in each of the 8 test samples (vertical lines). Horizontal gray lines indicate lower and upper boundaries obtained using the linear mixed model and tolerance interval criteria. Vertical green lines are indicating that those probes are duplicated for the corresponding probe.