Probe-specific mixed-model approach to detect copy number differences using multiplex ligation-dependent probe amplification (MLPA)

González, Juan R; Carrasco, Josep L; Armengol, Lluís; Villatoro, Sergi; Jover, Lluís; Yasui, Yutaka; Estivill, Xavier

doi:10.1186/1471-2105-9-261

Probe-specific mixed-model approach to detect copy number differences using multiplex ligation-dependent probe amplification (MLPA)

Methodology article
Open access
Published: 04 June 2008

Volume 9, article number 261, (2008)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

Probe-specific mixed-model approach to detect copy number differences using multiplex ligation-dependent probe amplification (MLPA)

Download PDF

Juan R González^1,2,3,
Josep L Carrasco³,
Lluís Armengol^4,2,
Sergi Villatoro^4,2,
Lluís Jover³,
Yutaka Yasui⁵ &
…
Xavier Estivill^4,2

5706 Accesses
13 Citations
Explore all metrics

Abstract

Background

MLPA method is a potentially useful semi-quantitative method to detect copy number alterations in targeted regions. In this paper, we propose a method for the normalization procedure based on a non-linear mixed-model, as well as a new approach for determining the statistical significance of altered probes based on linear mixed-model. This method establishes a threshold by using different tolerance intervals that accommodates the specific random error variability observed in each test sample.

Results

Through simulation studies we have shown that our proposed method outperforms two existing methods that are based on simple threshold rules or iterative regression. We have illustrated the method using a controlled MLPA assay in which targeted regions are variable in copy number in individuals suffering from different disorders such as Prader-Willi, DiGeorge or Autism showing the best performace.

Conclusion

Using the proposed mixed-model, we are able to determine thresholds to decide whether a region is altered. These threholds are specific for each individual, incorporating experimental variability, resulting in improved sensitivity and specificity as the examples with real data have revealed.

VARista: a free web platform for streamlined whole-genome variant analysis across T2T, hg38, and hg19

Article 12 April 2024

Comparison and evaluation of statistical error models for scRNA-seq

Article Open access 18 January 2022

Next-Generation Sequencing: Advantages, Disadvantages, and Future

Background

With the recent technological advances, different genome-wide studies have uncovered an unprecedented number of structural variants in the human genome [1–7], mainly in the form of copy number variations (CNVs). As much as 12% of the human genome has been reported to be variable among different individuals [6]. The important number of genes and other regulatory elements encompassed by those variable regions, makes it very likely for them to have functional and, ultimately, phenotypical consequences [8, 9]. In fact, several publications have already correlated the number of copies of different genes with different degrees of disease predisposition [10–12]. Therefore, the identification of DNA copy number is important in understanding genesis and progression of human diseases.

Several techniques and platforms have been developed for genome-wide analysisi of DNA copy number, such as array-based comparative genomic hybridization (aCGH). The goal of the analysis of DNA copy number data using this approach is to partition the whole genome into segments where copy numbers changes between contiguous segments may be present. The ability of aCGH to discern between different number of copies is very limited, thus different kinds of techniques have been developed for targeted, and more precise, analyses of genomic regions. Multiplex Ligation-dependent Probe Amplification (MLPA) [13] is one of the most used technologies among other existing ones such us Quantitative Multiplex PCR of Short Fluorescent (QMPSF) [14, 15] or Multiplex Amplifiable Probe Hybridization (MAPH) [16].

MLPA is a recently developed semi-quantitative method that aims to detect copy number alterations at the genomic level (gains or loses) in a test DNA with respect to a control. Due to its low cost, reliability and ease of implementation it has become very popular both as a research and a diagnostic tool. After hybridization, ligation and multiplex amplification of specific probes targeting different genomic regions, the probes are electrophoresed and analyzed using a DNA analyzer [13]. Each specific probe migrates according to its molecular weight and the resulting pherograms show specific peaks that correspond to each probe (Figure 1). Relative dosage information can be obtained after the comparison of peak intensities (height or area) of the different probes between test and control samples.

Due to the variation of PCR efficiencies across probes (due to their different size and nature) and across samples, a normalization method is needed before comparing dosage quotients. This step is crucial because the variation in experimental conditions may lead to differences of measured values between samples, thus hampering the correct interpretation of results. After normalization, which eliminates possible differences introduced during the experimental process, we aim to find biological differences in gene dosage (copy numbers). That is, the normalization tries to minimize the amount of systematic non-biological variation among samples. Different normalization methods have been used for analyzing MLPA data. The most common method divides each intensity by the sum of all intensities in each sample (see, for instance [17] and [18]). Other alternative approaches are based on regression methods that account for the amplification decay of larger probes. One of such regression methods uses internal control probes [19], while the another normalizes intensities based upon the statistically most probable median peak intensities using a median filter [20]. Other authors suggest to normalize peak intensities by using 4 separate peak groups according to increasing frament sizes ot the peaks (see [21] for further details). Finally, a similar approach using the mean intensity of control probes inside a normalization group as the dividing factor was considered in [22].

After an intensity normalization, the next key point is the determination of genes that are significantly altered. As an example, the peak intensity of exon 13 in Figure 1 seems to be lower in patient than in control indicating a deletion. Again, different approaches have been used to find probes that are altered. In this work we are considering ratio- [18, 23] and regression-based [19, 20] methods. In both approaches, the basis for the analysis is the comparison among normalized peak intensities from patient and control samples using a dosage ratio. In the ratio-based approach, deletions and duplications are given as outliers from the data set after defining a "biological" threshold (i.e. in a diploid genome, two copies exists of each gene). We assume that the simplest, most likely scenario, is an heterozygous gain or a loss of the material, under which ideal ratio values of 1.5 and 0.5 are expected, respectively. In such a simple scenario, and taking into account experimental noise, it is generally accepted in the literature that values below 0.7 and above 1.33 are indicative of loss and gain of a genetic material, respectively [13, 18]. The regression-based approach is based on fitting several linear regression models, considering the normalized peak intensity of a given patient as the dependent variable and the normalized peak intensities of the mean control sample as independent variables. Another method based on taking into account the individual noise for each probe (i.e., standard deviation) for all control samples is considered in [24].

Both the ratio- and regression-based methods have general drawbacks. The ratio-based approach always considers the same gain/loss threshold without taking into account the different variability among experiments and among probes. On the other hand, although the regression-based approaches solve this problem, fitting regression models without considering that the independent variable (normalized peak intensity in controls) is also subject to error may lead to wrong conclusions. In addition, the regression method uses confidence intervals for predictions which are built under the normal theory. The normality assumption may be untenable, particularly because of the limited amount of control probes that are typically used in routine experiments (< 10 probes). Finally, none of these procedures considers replicates for each individual, reducing information by averaging the replicate values.

To overcome these difficulties, we propose alternative methods to deal with the normalization and the discovery of alterations when analyzing MLPA data. Our approach is based on both nonlinear (normalization procedure) and linear (to determine alteration) mixed models. The main advantages of our methods are: (1) the variation among individuals is modelled explicitly using a random effect; and (2) the replicates may be easily incorporated in the model. The deletions and duplications will be determined using tolerance intervals. We validate and compare our methods with the existing methods using a controlled experiment. We evaluate the presence of gains and loses in a set of DNAs from normal individuals and individuals affected by known genetic disorders. The probemix we used included probes of single copy number regions located both inside and outside of the genomic disorder regions. A total of 30 samples, including 24 controls and 6 patients, were analyzed blindly.

Results

Simulation study

In this section, we investigate the performance of the two existing methods and our proposed approach through a number of simulations. We simulated an hypothetical assay with 45 probes. We also simulated internal control probes (i.e. non-altered) to be used in the iterative regression approach by setting σ_γ= 0. Two different scenarios were simulated in order to investigate the performance of REX-MLPA approach given the number of control probes: 10% and 20% of the 45 probes. Finally, the ratios of the rest of probes were simulated as altered (gains or loses) depending on a different percentages: 5%, 10%, 20% and 50% of the probes. These values were generated from a normal distribution under different scenarios varying between probe variability (σ_β∈ {0.2, 0.4, 0.6}), probe-test variability (σ_γ∈ {0.2, 0.4, 0.6, 1.0, 1.5, 4.0}) and within-probe or random variability (σ_ε∈ {0.05, 0.08}). We simulated 3 replicates of each peak intensity for test and control samples. Note that the average of these replicates were considered in the case of using threshold and REX-MLPA approaches. We summarized our simulations by computing the mean number of altered probed simulated in each run, the empirical type-I error and the empirical power to detect gains and loses. These results are based on 1,000 simulations.

Tables 1 and 2 show the results for the case of having 10% of true altered probes when 10% and 20% of probes have been generated as a internal control probes, respectively. The results for the REX-MLPA method were based on building confidence limits at 99% at each iteration. Regarding the empirical type-I error, the REX-MLPA approach usually overestimate the expected 5%, while the threshold method clearly underestimates type-I error since in all simulations the simulated type-I error was closer to 0% (results omitted in the tables). On the other hand, the probe-specific model shows a good performance. As expected, the power of the three methods increases when the probe-test variability increases. That is, the power depends on the size effect of those altered probes. In general, the probe-specific mixed-model outperforms both REX-MLPA and threshold approaches. The threshold approach is clearly underpowered for those cases in which the magnitude of the effect is not large enough. Finally, the same pattern is observed when the random variability is increased (e.g. within-probe variability) but showing lower power. The simulation results for the case of having 5%, 20% and 50% of true altered probes showed similar behaviour and they can be found in the Additional file 1.

Table 1 Empirical type-I error and power obtained in 1,000 simulations using the three different approaches: REX (iterative regression), PEMM (probe-specific mixture model) and threshold.

Full size table

Table 2 Empirical type-I error and power obtained in 1,000 simulations using the three different approaches: REX (iterative regression), PEMM (probe-specific mixture model) and threshold.

Full size table

Validation study

In order to validate our proposed method, and compare its performace with those of the existing methods, we have designed a controlled MLPA probemix in which we have included 34 different oligonucleotides, corresponding to 16 different targeted regions with single copy number in normal individuals. Eight of the targeted regions are variable in copy number in individuals suffering from different genomic disorders (see Table 3). A total of 30 DNAs from different individuals were assayed in triplicate in the study. 24 of the DNAs came from unrelated HapMap individuals (i.e. normal general population), 2 DNAs from Prader-Willi syndrome patients (15q11-q13 deletion), 2 DNAs from DiGeorge syndrome patients (22q11 deletion) and 2 DNAs from autistic patients with a duplication in 10q region. After hybridization, ligation and PCR, probes corresponding to the different individuals were electrophoresed and peak intensities were recorded and blindly analyzed using the different methods. MLPA hybridization, ligation and PCR with universal MLPA primers was performed as described elsewhere [13]. PCR products were loaded on a capillar DNA analyzer and electrophoresed. Genescan software was used to analyze the runs and to retrieve peak intensities corresponding to each probe in the different samples.

Table 3 MLPA probemix composition

Full size table

Comparison of methods of copy number estimation

Before analyzing the data we illustrate that our proposed method based on a mixed-model can be applied. Supplementary Figure S1 shows the residual error for each individual and probe indicating that the residuals are centered at zero and that the variability does not change with individuals. Table 4 shows the results obtained from the 9 test samples used in the validation study using the threshold, iterative regression and proposed probe-specific mixed-model method. For the two patients with Prader-Willi syndrome, we observed that all methods were able to detect the three deletions in 15q11-q13 region. The same conclusion was observed regarding patients diagnosed with DiGeorge syndrome. Considering patients with Autism, all methods detected gains in the 10q region. Only in one case threshold method had a false negative result and the REX-MLPA method indicated a deletion in the UBE3A, a false positive finding. These results clearly agree with our simulation studies in which we found that iterative regression has a increased type-I error rate and the threshold approach a low power. For instance, the ratio between the second individual diagnosed with Autism and the mean controls was 1.27 at probe named ZWINT, so the threshold method did not indicate that this was a probe altered in this individual. This result is due to the fact that the ratio between the probe-test variability and the random error variability was around 3 (3.09 = $\frac{0.1423}{0.0461}$ ) and the simulation study showed that in this case the power of threshold approach is very poor. Finally, regarding HapMap individuals who were considered as negative controls, we observed that any method found a false positive result.

Table 4 MLPA results from the validation study.

Full size table

Conclusion

The MLPA method is a potentially useful semi-quantitative method to detect copy number alterations in targeted regions obtained after performing genome-wide screening using comparative methods usually based on genomic hybridization. MLPA is based on comparing peak intensities of the different probes between test and control samples through its ratio. Before determining the statistical significance of dosage ratios, a normalization procedure is needed to control variation of PCR efficiencies across probes. We have proposed a non-linear mixed-model to perform such normalization, and two other existing methods have been discussed. One of them is based on considering the sum of peak intensities in each sample, while another uses internal control probes to fit a regression line which is used as a reference.

So far, the widely used method to detect statistical significance of dosage ratio is to consider as altered those probes that are outside a given threshold. In this paper, we have described a new approach to detect altered probes based on a mixed-model. The threshold is then established by using a tolerance interval which accommodates the specific random error variability observed in each test sample. We have also discussed another approach based on an iterative regression. Through simulation studies and a controlled MLPA assay, we have shown that our proposed method outperforms the existing ones. Another important advantage by comparing our approach and the REX-MLPA method is that our method can be used even when no internal control probes are available in the assay.

A novelty of our algorithm is that it can handle information from replicate experiments. Replicates of MLPA assays are typically not necessary for commercial kits since in most cases more than one probe is used to interrogate the same region and concordance between those probes is considered to provide enough reliability of the existence of a copy number variation. Nevertheless, in cases where a single probe per region is placed in the assay, it is definitely useful and desirable (if not necessary) to have replicates that help minimise possible variations arising during the long MLPA protocol and to ensure reliability of the findings. Thus, having a method that is able to handle such information is very useful and provides extra statistical robustness to the findings.

The use of control probes as references for data normalization is one of the methods recommended by the manufacturer. Commercial and home-made MLPA assays are clearly different in terms of quality of the resulting pherograms, mainly due to the process of producing the specific probes that target regions of interest. While commercial applications produce very high, sharp and neat peaks, home-made assays typically do not perform that well. Nevertheless, in both cases, the use of control probes is always desirable from a statistical point of view and it is so regardless of the number of probes that target a determined loci. In exploratory experiments with MLPA designs targeting extremely variable and polymorphic CNVs, adding control probes whose behaviour is proven unvariable and can help in assessing reliability of single experiments. In such cases, we understand that the inclusion of evenly spaced control probes is a good practice. One key point is to know the number of replicates that should be included in a MLPA assay. Our simulations indicates that 3 replicates was enough to reach a power of 90% keeping a value of alfa around 5% in those combinations where the probe-test variability was over 8 times greater than that of random error (combinations where σ_γ= 0.4 and σ_ε= 0.05). Thus, if the probe-test versus random error variability ratio was lower, a lesser power would be expected. In that case the researcher could improve the expected power by increasing the number of replicates.

Although our proposed method outperforms other existing ones, it has to be kept in mind that in the case of having probes targeting regions that are highly variable among the population (i.e. in an extreme case where half individuals are normal homozygous and half are homozygously deleted), the tolerance intervals calculated for those probes might collect all existing variability and become very broad, thus making it impossible to distinguish the existence of copy number alteration. Nevertheless, the other existing methods would also show inconsistent copy number calls when analysing such regions. In statistical terms, this means that we should have homogeneity of the random error variance through the probes. This is the reason why a visual inspection of the pherograms as well as manual curation of the results might be the most reliable way to proceed. However, we can easily accomodate cases where the homogeneity of random error variance assumption cannot be assumed by introducing an interaction probe-random error allowing a different random error variance for each probe. In this case, different tolerance bands would be therefore estimated for each probe.

In conclusion, we have proposed a mixed-model which is able to determine thresholds to decide whether a region is altered. These threholds are specific for each individual, incorporating experimental variability, resulting in improved sensitivity and specificity as the examples with real data have revealed. An R language package for the three approaches discussed in this paper and data analyzed will be freely available at our web page [25].

Methods

To further illustrate the statistical methods described in this section, we are using real data provided by the United Kingdom National Genetics Reference Laboratory of Manchester [26]. In particular we are analyzing a dataset from a breast cancer study (called P002 BRCA1), in which copy number changes of different genes have previously been reported. In this example, 9 control probes and 25 analytical probes were analyzed in a total of 5 controls and 8 different test samples.

Normalization

Herein, we will focus on methods based on considering the total peak intensities and regression-based approaches. The first and the simplest method of normalization, determines the normalized signal of each probe in a given sample by dividing peak intensity by the sum of all peak intensity of the sample. This method was initially proposed by [13] and is the recommended method by MRC-Holland [27] which provides commercial probes for the copy number analysis of different genome regions for diagnostic purposes. Let us begin by giving some notations to illustrate how this method works. Let H_ipkbe the peak intensity for the i-th individual, i = 1, ..., C, C + 1, ..., C + T (C is the number of control samples and T denotes the number of test samples), the p-th probe, p = 1, ..., P, and the k-th replicate, k = 1, ..., K. The existing simple method designed for normalizing data considers the average of these replicates, ${\tilde{H}}_{i p} = \sum_{k = 1}^{K} H_{i p k} / K$ , and divides each measured peak intensity by the sum of all peaks, $S_{i} = \sum_{p = 1}^{P} {\tilde{H}}_{i p}$ of that sample, i.e. ${\tilde{Y}}_{i p} = {\tilde{H}}_{i p} / S_{i}$ .

In addition to experimental variability, as Figure 1 shows, peak intensity exhibits a negative correlation with probe size. To account for this effect, regression methods may be employed. If internal control probes are provided in the MLPA experiment (dark vertical lines in Figure 2A), a regression-based method using the control probes may also be employed to normalize the data. Different methods have been proposed. The first approach, known as "slope correction" [19], model the negative correlation between probe size and peak intensities using a linear regression model following the formula ${\tilde{H}}_{p_{c}} = α + β X_{p_{c}}$ , where X_pcdenotes the size of the p_c-th internal control probe for a given individual (Figure 2B solid blue line), or using a quadratic model ${\tilde{H}}_{p_{c}} = α + β_{1} X_{p_{c}} + β_{2} X_{p_{c}}^{2}$ (Figure 2B, dotted blue line).

Another possibility proposed by [20] is to fit an exponential decay model, ${\tilde{H}}_{p} = α e^{- β X_{p}}$ . In that case, the authors propose to consider all probes p and call this method "population normalization". Notice that in that case β encodes the rate of descend. [20] argued that the trend of peak intensities varies greatly between samples regarding the control probes. This is the reason why they propose to consider all probes (both analytical and control ones) after applying a median filter approach to remove outliers. However this method has important inconveniences. The smoothed data, obtained after applying the median filter, violates some of the assumptions of nonlinear regression such as that residuals are no longer independent or that errors are not Gaussian, among others [28].

Neither slope correction nor population normalization approaches consider replicates, e.g. they use ${\tilde{H}}_{i p}$ instead of H_ipk. In addition, these methods do not model the variability among individuals either. That is, using these methods, the authors allow only β to vary among individuals fitting different models for each one. However, the pattern of decay may vary highly between subjects, which is not captured by varying β 's only. As an example, the height of the peak intensity for the first control probe (top parameter in Figure 2C) and the asymptotic value for larger probe sizes (bottom parameter in Figure 2C) are very different among test samples, as we observe in Figure 3.

To circumvent these problems, we propose the following nonlinear mixed model to normalize the data which includes replicates and variability among individuals:

\begin{matrix} H_{i p k} = (φ_{1 i} - φ_{2 i}) \exp [- \frac{1}{φ_{3 i}} (X_{p} - {off}_{i})] + φ_{2 i} + ε_{i p k}, \\ φ_{i} = [\begin{matrix} φ_{1 i} \\ φ_{2 i} \\ φ_{3 i} \end{matrix}] = [\begin{matrix} β_{1} \\ β_{2} \\ β_{3} \end{matrix}] + [\begin{matrix} b_{1 i} \\ b_{2 i} \\ b_{3 i} \end{matrix}] = β + b_{i}, \\ \begin{matrix} b_{i} ~ N (0, Ψ), & ε_{i j k} ~ N (0, σ^{2}), \end{matrix} \end{matrix}

(1)

where X_pdenotes the size for the p-th probe (note that it only depends on the probe since it is the same for each individual and each replicate) φ_{i 1}is the maximum peak intensity for control probes, φ_2i, is the asymptotic peak intensity, and φ_3iis the reciprocal of the rate decay constant. The term off_iprovides a more stable parametrization for the data and corresponds to the average value of the peak size for the first control probe, off_i= $\sum_{k = 1}^{K} H_{i 1 k}$ , i = 1, ..., C, C + 1, ..., C + T. The fixed effects, β represent the population average of subject-specific parameters, φ_i, and the random effects, b_i, represents the deviations of the φ_i's from their population average. The random effects are assumed to be independent and the within-group errors ε_ipkare assumed to be independent for different i, p and to be independent of the random effects. The model parameters are fitted by maximizing the restricted log-likelihood (RMLE) of the data using the R library nlme [29].

After estimating this generalized exponential model, the normalization procedure is performed by dividing the peak intensities, H_ipk, by the regression estimate, ${\hat{H}}_{i p k}$ , obtaining a normalized, size-adjusted peak intensity for every probe, Y_ipk= H_ipk ${\hat{H}}_{i p k}$ (Figure 2D). For the control samples we consider a unique regression line corresponding to the average of all control samples. Nonetheless, for the test samples the normalization is performed individually for each subject.

Statistical significance of dosage ratios

Ratio approach

The normalized peak intensities may be directly compared between control and test samples peak by peak. The first method for determining genes that are altered (gains or loses) is based on a threshold rule. The probes whose dosage ratios are outside of the threshold lines are considered altered. The thresholds are normally 0.7 for deletions (note that 0.5 would correspond to a half reduction in the dosage but this value is considered too stringent by some researchers) and 1.33 for duplications, although other cut-points have been used in other studies [18].

Iterative regression

The main disadvantage of defining a threshold is that the specific variability of each experiment is not considered. Some authors have proposed alternative methods using linear regression models. [19] stated that, after a square-root transformation of the data, the plot of normalized peak intensities for test samples versus control samples should be around the diagonal if no dosage imbalances are present; otherwise, probes above or below would correspond to duplications or deletions, respectively. Their proposal was to fit a linear regression without intercept using only control probes and use the standard error of the regression to determine confidence limits for outlier detections. As in the case of normalization procedures, however, a small number of control probes is a limiting factor for adopting this approach since, as the author stated, the standard error of the regression underestimates the normal variability. Thus, [19] proposed an iterative regression algorithm for this problem. Starting with the control probes, the linear regression is fitted. Then, a prediction interval is computed for every probe. If the normalized peak intensity of a non-control probe falls into this interval, the probe is considered to belong to the non-altered population and it is added to the control probes in the next step. After a few iterations, a final model is reached including only those probes that are considered as preserved. In practice, those probes that are not included in this final model are considered to be either deleted or duplicated. The authors call this method Regression-Enhanced MLPA analysis (REX-MLPA). Figure 4 illustrates the confidence limits of the final regression model. [20] proposed a similar approach but starting with all probes and retaining and rejecting points at each iteration with a given level of confidence.

The REX-MLPA approach assumes that the X variable (e.g. squared-root of mean normalized peak intensities in controls) in the linear regression model is measured without error. This assumption is not tenable since this variable corresponds to the normalized peak intensity measured in multiple control samples. Violations of this assumption are a serious problem when trying to accurately predict Y (squared-root of normalized peak intensities in test sample) from X.

The probe-specific mixture model

The REX-MLPA approach has the limitations discussed above, specifically, the regression models are fitted without considering that the independent variable (normalized peak intensity in controls) is also subject to error. In addition, confidence intervals for predictions are built assuming the normal theory. This assumption may not hold when only control probes are considered since we typically have no more than 10 probes. Replicates for each individual are also not considered.

Based on these limitations, we propose the following model to compare controls with a given test sample:

\begin{matrix} \log (Y_{t p k}) = μ + α_{t} + β_{p} + γ_{t p} + ε_{t p k} \\ \begin{matrix} β_{p} ~ N (0, σ_{β}^{2}), & p = 1, ..., P \end{matrix} \\ \begin{matrix} γ_{t p} ~ N (0, σ_{γ}^{2}), & t = 0, 1, & p = 1, ..., P \end{matrix} \\ \begin{matrix} ε_{t p k} ~ N (0, σ_{ε}^{2}), & t = 0, 1, & \begin{matrix} p = 1, ..., P, & k = 1, ..., K \end{matrix} \end{matrix} \end{matrix}

(2)

where μ is the mean normalized peak intensity across individuals and probes, α_tis a fixed effect representing a different mean for all controls (t = 0) and a given test sample (t = 1), β_pis a random effect for the p th probe, γ_tpis is the probe-test interaction random effect, and ε_tpkis a random error variable. The index p denotes the probe, and K is the number of replicates. We assume that β_p, γ_tp, and ε_tpkare independent, homocedastic, and normally distributed random variables with mean zero. The variances are denoted by $σ_{β}^{2}$ for the, β_p, or the "between-probe" variability, $σ_{γ}^{2}$ for γ_tp, or "probe-test" variability, and $σ_{ε}^{2}$ for the ε_tpk, or "within-probe" variability. Considering the assumptions of the proposed model, we need to verify that the within-groups errors are independent and identically normally distributed, with mean zero and variance $σ_{ε}^{2}$ , and that they are also independent of the random effect. It can be easily checked using different plots. This assumption makes sense for biologists since probes that behave abnormally in replicate studies or when hybridised onto DNAs that have a different quality are typically removed or redesigned to ensure assay robutness and reliability. The model parameters are fitted by maximizing the restricted log-likelihood (RMLE) using the R library lme (Pinheiro and Bates, 2000).

The criterion

For the sake of simplicity, we take control samples as the reference, i.e. α₀ = 0 and α₁ = α, and γ_0p= 0. Therefore, α may be interpreted as an average deviation between test and control samples across all probes, and γ_1pas the deviation control-test for the p-th probe.

The criterion to determine those probes that are altered for each test sample is based on the observed differences among controls and its differences with controls samples. Conditioned to p-th probe, the difference between two control measurements is distributed as

\log (Y_{0 p l}) - \log (Y_{0 p k}) ~ N (0, 2 σ_{ε}^{2}) .

(3)

Conversely, the difference between a test, i', and a control sample, conditioned to a given probe, is distributed as

\log (Y_{0 p k}) - \log (Y_{i^{'} p k}) ~ N (α + γ_{i^{'} p}, 2 σ_{ε}^{2}),

(4)

where $σ_{ε}^{2}$ corresponds to the variance of the error defined in equation (2). Then, we consider that a probe is not altered if the mean of the difference between test and control samples is included within a probability interval defined by differences between control samples distribution. Hence, the criterion consists in checking whether the difference between test and control samples is greater than that expected between two control samples. That is, whether the difference is included in the interval

(- Z_{1 - κ / 2} \sqrt{2 σ_{ε}^{2}}, Z_{1 - κ / 2} \sqrt{2 σ_{ε}^{2}}),

(5)

where Z_1-κ/2is the 1 - κ/2 percentile from the standard normal distribution. Hence, 1 - κ is the proportion of difference control-control that the difference test-control must exceed to declare that a probe is altered. Nonetheless, when using sample data to estimate such intervals, the uncertainty associated with the estimation estimating them has to be considered. Therefore, we propose to estimate the interval (5) through a tolerance interval over the control-control differences distribution. To build such intervals, it is assumed that $\frac{ν {\hat{σ}}_{e}^{2}}{σ_{e}^{2}} ~ χ_{ν}^{2}$ , where ν are the residual degrees of freedom, and ${\hat{σ}}_{e}^{2}$ is the REML estimator of $σ_{e}^{2}$ . Therefore, the tolerance interval that contains (1 - κ)% proportion of the control-control differences estimated with a confidence of (1 - α)% is

(- Z_{1 - κ / 2} \sqrt{\frac{ν {\hat{σ}}_{ε}^{2}}{χ_{α, ν}^{2}}}, Z_{1 - κ / 2} \sqrt{\frac{ν {\hat{σ}}_{ε}^{2}}{χ_{α, ν}^{2}}}) .

(6)

Once the limits of the criterion are established, the mean difference test-control, δ_pfor each probe, p, is estimated using the estimate of α + γ_i'p, where γ_{p 1}is estimated through the best linear unbiased predictor (BLUP). Then a probe for a given test sample is considered as altered when δ_pdoes not fall into (6). Figure 5 shows those probes that are imbalanced for each individual considering the threshold obtained using tolerance limits obtained using equation (6). In this example those limits are 0.78 for deletions and 1.29 for duplications.

Abbreviations

CNV:: Copy Number Variant
MLPA:: Multiplex Ligation-dependent Probe Amplification
DNA:: Deoxyribonucleic acid
aCGH:: array-based Comparative Genomic Hybridization
QMPSF:: Quantitative Multiplex PCR of Short Fluorescent
PCR:: Polymerase Chain Reaction
MAPH:: Multiplex Amplifiable Probe Hybridization
BRCA1:: Breast Cancer 1
RMLE:: Restricted maximum log-likelihood
REX-MLPA:: Regression-Enhanced MLPA
BLUP:: Best linear unbiased predictor
PEMM:: probe-specific mixture model

References

Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet 2004, 36(9):949–51. 10.1038/ng1416
Article CAS PubMed Google Scholar
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525–8. 10.1126/science.1098918
Article CAS PubMed Google Scholar
Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Fine-scale structural variation of the human genome. Nat Genet 2005, 37(7):727–32. 10.1038/ng1562
Article CAS PubMed Google Scholar
Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler EE: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 2005, 77: 78–88. 10.1086/431652
Article PubMed Central CAS PubMed Google Scholar
Locke DP, Sharp AJ, McCarroll SA, McGrath SD, Newman TL, Cheng Z, Schwartz S, Albertson DG, Pinkel D, Altshuler DM, Eichler EE: Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 2006, 79(2):275–90. 10.1086/505653
Article PubMed Central CAS PubMed Google Scholar
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444(7118):444–54. 10.1038/nature05329
Article PubMed Central CAS PubMed Google Scholar
Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE, Lam WL: A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet 2007, 80: 91–104. 10.1086/510560
Article PubMed Central CAS PubMed Google Scholar
Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85–97. 10.1038/nrg1767
Article CAS PubMed Google Scholar
Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME, Dermitzakis ET: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848–53. 10.1126/science.1136678
Article PubMed Central CAS PubMed Google Scholar
Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs RJ, Freedman BI, Quinones MP, Bamshad MJ, Murthy KK, Rovin BH, Bradley W, Clark RA, Anderson SA, O'Connell RJ, Agan BK, Ahuja SS, Bologna R, Sen L, Dolan MJ, Ahuja SK: The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 2005, 307(5714):1434–40. 10.1126/science.1101160
Article CAS PubMed Google Scholar
Rovelet-Lecrux A, Hannequin D, Raux G, Le Meur N, Laquerriere A, Vital A, Du-manchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D: APP locus duplication causes autosomal dominant early-onset Alzheimer disease with cerebral amyloid angiopathy. Nat Genet 2006, 38: 24–6. 10.1038/ng1718
Article CAS PubMed Google Scholar
Le Marechal C, Masson E, Chen JM, Morel F, Ruszniewski P, Levy P, Ferec C: Hereditary pancreatitis caused by triplication of the trypsinogen locus. Nat Genet 2006, 38(12):1372–4. 10.1038/ng1904
Article CAS PubMed Google Scholar
Schouten JP, McElgunn CJ, Waaijer R, Zwijnenburg D, Diepvens F, G P: Relative quantification of 40 nucleic acid sequences by multiplex ligation-dependent probe amplification. Nucleic Acids Res 2002, 30(12):e57. 10.1093/nar/gnf056
Article PubMed Central PubMed Google Scholar
Charbonnier F, Raux G, Wang Q, Drouot N, Cordier F, Limacher JM, Saurin JC, Puisieux A, Olschwang S, Frebourg T: Detection of exon deletions and duplications of the mismatch repair genes in hereditary nonpolyposis colorectal cancer families using multiplex polymerase chain reaction of short fluorescent fragments. Cancer Res 2000, 60(11):2760–3.
CAS PubMed Google Scholar
Casilli F, Di Rocco ZC, Gad S, Tournier I, Stoppa-Lyonnet D, Frebourg T, Tosi M: Rapid detection of novel BRCA1 rearrangements in high-risk breast-ovarian cancer families using multiplex PCR of short fluorescent fragments. Hum Mutat 2002, 20(3):218–26. 10.1002/humu.10108
Article CAS PubMed Google Scholar
Armour JA, Sismani C, Patsalis PC, Cross G: Measurement of locus copy number by hybridisation with amplifiable probes. Nucleic Acids Res 2000, 28(2):605–9. 10.1093/nar/28.2.605
Article PubMed Central CAS PubMed Google Scholar
Lai KK, Lo IF, Tong TM, Cheng LY, Lam ST: Detecting exon deletions and duplications of the DMD gene using Multiplex Ligation-dependent Probe Amplification (MLPA). Clin Biochem 2006, 39(4):367–72. 10.1016/j.clinbiochem.2005.11.019
Article CAS PubMed Google Scholar
Palomares M, Delicado A, Lapunzina P, Arjona D, Aminoso C, Arcas J, Mar-tinez Bermejo A, Fernandez L, Lopez Pajares I: MLPA vs multiprobe FISH: comparison of two methods for the screening of subtelomeric rearrangements in 50 patients with idiopathic mental retardation. Clin Genet 2006, 69(3):228–33. 10.1111/j.1399-0004.2006.00567.x
Article CAS PubMed Google Scholar
Mavrogiannis LA, Cockburn DJ: Regression-Enhanced Analysis of Multiplex Ligation-Dependent Probe Amplification (REX-MLPA). Tech rep., Yorkshire Regional DNA Laboratory 2004. [http://leedsdna.info/downloads.htm]
Google Scholar
Kellander M, Riley M, Liu C: GeneMarker Software for Multiplex Ligation-dependent Probe Amplification (MLPA). Tech rep., SoftGenetics LLC 2006. [http://www.softgenetics.com/GeneMarkerMLPA.html]
Google Scholar
Gerdes T, Kirchhoff M, Bryndorf T: Automatic analysis of multiple ligation-dependent probe amplification products (exemplified by a comercial kit for prenatal aneuploidy detection). Electrophoresis 2005, 26(22):4327–7332. 10.1002/elps.200500390
Article CAS PubMed Google Scholar
Huang C, Chang Y, Chen C, Kuo Y, Hwu W, Gerdes T, Ko T: Copy number analysis of survival motor neuron genes by multiplex ligation-dependent probe amplification. Genet Med 2007, 9(4):241–248.
Article PubMed Google Scholar
Pastrello C, Baglioni S, Tibiletti MG, Papi L, Fornasarig M, Morabito A, Agostini M, Genuardi M, Viel A: Stability of BAT26 in tumours of hereditary nonpolyposis colorectal cancer patients with MSH2 intragenic deletion. Eur J Hum Genet 2006, 14: 63–8.
CAS PubMed Google Scholar
Kirchhoff M, Gerdes T, Brunebjerg S, Bryndorf T: Investigation of patients with mental retardation and dysmorphic features using comparative genomic hybridization and subtelomeric multiplex ligation dependent probe amplification. Am J Med Genet A 2005, 139(3):231–233.
Article PubMed Google Scholar
CREAL's web-page[http://www.creal.cat/jrgonzalez/software.htm]
NGRL-Manchester[http://www.ngrl.org.uk/Manchester/Technologypubs.htm]
MRC-Holland[http://www.mlpa.com/pages/support_mlpa_analysis_normalisationpag.html]
Motulsky H, Christopoulos A: Fitting Models to Biological Data Using Linear and Nonlinear Regression: a practical guide to curve fitting. Oxford: Oxford Univesity Press; 2004.
Google Scholar
Pinheiro JC, Bates DM: Mixed-Effects Models in S and S-plus. New York: Springer Verlag; 2000.
Book Google Scholar

Download references

Acknowledgements

We would like to thank the two reviewers for their constructive comments, which have led to improvements in the manuscript. The laboratory of X.E is supported by the Departament dEducació i Universitats and the Departament de Salut of the Catalan Autonomous Government ("Generalitat de Catalunya"); the Ministry of Health, and the Ministry of Education and Science of the Spanish Government; and the European Union Sixth Framework Programme. CIBERESP is supported by the Instituto de Salud Carlos III of the Ministry of Health of the Spanish Government.

Author information

Authors and Affiliations

Center for research in environmental epidemiology (CREAL), Barcelona, Spain
Juan R González
CIBER en Epidemiología y Salud Pública (CIBERESP), Spain
Juan R González, Lluís Armengol, Sergi Villatoro & Xavier Estivill
Biostatistic Unit, Department of Public Health, University of Barcelona, Spain
Juan R González, Josep L Carrasco & Lluís Jover
Genes and Disease Program, Center for Genomic Regulation (CRG), Barcelona, Spain
Lluís Armengol, Sergi Villatoro & Xavier Estivill
Department of Public Health Sciences, School of Public Health, University of Alberta, Canada
Yutaka Yasui

Authors

Juan R González
View author publications
You can also search for this author in PubMed Google Scholar
Josep L Carrasco
View author publications
You can also search for this author in PubMed Google Scholar
Lluís Armengol
View author publications
You can also search for this author in PubMed Google Scholar
Sergi Villatoro
View author publications
You can also search for this author in PubMed Google Scholar
Lluís Jover
View author publications
You can also search for this author in PubMed Google Scholar
Yutaka Yasui
View author publications
You can also search for this author in PubMed Google Scholar
Xavier Estivill
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Juan R González.

Additional information

Authors' contributions

JRG and JLC developed the new statistical methods and performed the simulation studies. JRG wrote the R functions and the main text of the manuscript. LA and SV performed the MPLA experiment, interpreted the results and tested the programs functionality. LA, LJ and YY proposed abundant suggestions for improving the implementation of the models and participated in the design and discussion of this study. YY and XE reviewed the paper and revised its framework. All authors have read, and approved the final manuscript.

Electronic supplementary material

Additional file 1: mix_model_MLPA.pdf, 173.4K. (PDF 68 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

González, J.R., Carrasco, J.L., Armengol, L. et al. Probe-specific mixed-model approach to detect copy number differences using multiplex ligation-dependent probe amplification (MLPA). BMC Bioinformatics 9, 261 (2008). https://doi.org/10.1186/1471-2105-9-261

Download citation

Received: 21 January 2008
Accepted: 04 June 2008
Published: 04 June 2008
DOI: https://doi.org/10.1186/1471-2105-9-261

Probe-specific mixed-model approach to detect copy number differences using multiplex ligation-dependent probe amplification (MLPA)

Abstract

Background

Results

Conclusion

Similar content being viewed by others

Background

Results

Simulation study

Validation study

Comparison of methods of copy number estimation

Conclusion

Methods

Normalization

Statistical significance of dosage ratios

Ratio approach

Iterative regression

The probe-specific mixture model

The criterion

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation