Genome-wide evaluation for quantitative trait loci under the variance component model
Abstract
The identity-by-descent (IBD) based variance component analysis is an important method for mapping quantitative trait loci (QTL) in outbred populations. The interval-mapping approach and various modified versions of it may have limited use in evaluating the genetic variances of the entire genome because they require evaluation of multiple models and model selection. In this study, we developed a multiple variance component model for genome-wide evaluation using both the maximum likelihood (ML) method and the MCMC implemented Bayesian method. We placed one QTL in every few cM on the entire genome and estimated the QTL variances and positions simultaneously in a single model. Genomic regions that have no QTL usually showed no evidence of QTL while regions with large QTL always showed strong evidence of QTL. While the Bayesian method produced the optimal result, the ML method is computationally more efficient than the Bayesian method. Simulation experiments were conducted to demonstrate the efficacy of the new methods.
Keywords
Bayesian analysis, Genome selection, Markov chain Monte Carlo, Maximum likelihood
Introduction
The identity-by-descent (IBD) based variance component method is often used to map quantitative trait loci (QTL) in outbred populations (Goldgar 1990; Amos 1994). The most commonly used approach is interval mapping, in which two markers are used at a time to infer the IBD matrix for any position bracketed by the two markers (Fulker and Cardon 1994). The model usually contains one QTL and a polygenic effect, so that the QTL variance, the polygenic variance and the residual variance are the only variance components to be estimated. If multiple QTL exist, this interval mapping approach will produce a biased estimate of the QTL variance. When the entire genome is scanned, the total genetic variance (the sum of the variances of all detected QTL) is often greater than the total phenotypic variance. This phenomenon occurs in interval mapping regardless of whether the random model for an outbred population or the fixed model for a line cross is used. The reason is that QTL effects or QTL variances at different locations are estimated using different models. To scan the entire genome, multiple analyses are conducted, one for each putative location. None of the single-QTL models is correct if multiple QTL exist. Therefore, the optimal method should be a multiple variance component model in which all QTL are included in a single model.
Multiple variance components may be difficult to estimate if the number of QTL included in the model is extremely large. However, the popular MCMC implemented Bayesian method is designed to handle large saturated models and is the ideal method for multiple variance component estimation (Uimari and Hoeschele 1997). The maximum likelihood method may also be sufficient to handle large saturated models under the random model framework; we simply had not considered placing one QTL every few cM along the genome. Meuwissen et al. (2001) first attempted to evaluate the entire genome using a high-density marker map under the Bayesian approach. Their method treats the positions of QTL as fixed and estimates only the QTL variances and other parameters. Because Meuwissen et al. (2001) placed many QTL in the model, the QTL positions may not be relevant: the whole genome is already well covered by the proposed QTL. Yi and Xu (2000) used the reversible jump MCMC to infer the number of QTL under the random model framework. Only large QTL were eventually included in the model, and the entire genome may not have been evaluated thoroughly because of the slow mixing behavior of the reversible jump MCMC.
In this study, we proposed to cover the entire genome by QTL and estimated the QTL variances simultaneously within a single model. As long as the extra QTL placed in regions of the genome that do not contain QTL have estimated QTL variances close to zero, we can put as many QTL as we want to make sure that the entire genome is evaluated fairly. We investigated both the ML method and the Bayesian method and showed the pros and cons of each method.
Methods
Linear model and likelihood
Maximum likelihood estimation
The challenge for genome-wide evaluation is that, for a large genome, the number of proposed QTL can be very large and the majority of the proposed QTL should have estimated variance components close to zero. This causes problems in parameter estimation. The EM algorithm is the first candidate method for the variance component model (Thompson and Shaw 1990). However, it is sensitive to the initial values of the parameters. We cannot choose zero as the initial value for \( \sigma_{k}^{2} \), although most \( \sigma_{k}^{2} \) are in fact zero, and other initial values are hard to choose. Therefore, we decided to maximize the log likelihood function directly using a sequential approach that updates one variance component at a time, conditional on the values of all other variance components. When a single variance component is considered, maximizing the log likelihood function is a one-dimensional problem; the bisection method or any other simple algorithm can be used to update that parameter. Once all parameters have been updated, we go back to the first parameter and update its value again. The sequential algorithm therefore requires iterations nested within other iterations until a convergence criterion is satisfied. The iterations that update a single parameter are called the inner iterations, while the sweeps over all parameters are called the outer iterations. This algorithm requires more iterations than an algorithm that updates all parameters simultaneously, but choosing the initial value for the parameter of interest becomes trivial, i.e., \( \sigma_{k}^{2} = 0 \) can be used as the initial value for all \( k = 1, \ldots ,M \). The sequential approach of Xu (2007) was adopted here, where \( V_{j}^{ - 1} \) and \( |V_{j} | \) are calculated only once for each outer iteration. For large family sizes, much of the computing burden comes from calculating \( V_{j}^{ - 1} \) and \( |V_{j} | \). Therefore, the sequential algorithm can save computing time substantially, in addition to easing the choice of initial values.
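The sequential update can be illustrated with a minimal Python sketch. It is not the authors' SAS/IML program. It assumes zero-mean phenotypes, full-sib families with covariance \( V_{j} = \sum_{k} \Pi_{jk} \sigma_{k}^{2} + I\sigma^{2} \), unlinked QTL in the simulated data, and a golden-section search in place of bisection for the one-dimensional step. It recomputes the likelihood from scratch at every evaluation rather than reusing \( V_{j}^{ - 1} \) and \( |V_{j} | \) within an outer iteration as Xu (2007) does. The names `simulate`, `log_lik`, `golden_max` and `sequential_ml`, the search bound and the tolerances are all hypothetical.

```python
import numpy as np

def simulate(n_fam, n_sib, true_var, rng):
    """Simulate zero-mean phenotypes y[j] and IBD matrices ibd[j, k] for full-sib families."""
    y = rng.normal(0.0, 1.0, size=(n_fam, n_sib))  # residuals with variance 1
    ibd = np.empty((n_fam, len(true_var), n_sib, n_sib))
    for j in range(n_fam):
        for k, s2 in enumerate(true_var):
            pat = rng.integers(0, 2, n_sib)  # paternal allele inherited by each sib
            mat = rng.integers(0, 2, n_sib)  # maternal allele inherited by each sib
            share = (pat[:, None] == pat).astype(float) + (mat[:, None] == mat)
            ibd[j, k] = 0.5 * share          # proportion of alleles shared IBD
            a_pat = rng.normal(0.0, np.sqrt(s2 / 2), 2)  # effects of the two paternal alleles
            a_mat = rng.normal(0.0, np.sqrt(s2 / 2), 2)  # effects of the two maternal alleles
            y[j] += a_pat[pat] + a_mat[mat]
    return y, ibd

def log_lik(y, ibd, sig2, sig2_e):
    """Normal log likelihood summed over families, with V_j = sum_k Pi_jk sig2_k + I sig2_e."""
    n_sib = y.shape[1]
    V = np.einsum("jkab,k->jab", ibd, sig2) + sig2_e * np.eye(n_sib)
    _, logdet = np.linalg.slogdet(V)
    sol = np.linalg.solve(V, y[..., None])[..., 0]
    return -0.5 * (logdet.sum() + (y * sol).sum() + y.size * np.log(2 * np.pi))

def golden_max(f, lo, hi, tol=1e-4):
    """Golden-section search for a maximizer of f on [lo, hi] (the inner iterations)."""
    g = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def sequential_ml(y, ibd, max_outer=50, tol=1e-6):
    """Update one variance component at a time, holding all the others fixed."""
    upper = 2.0 * y.var()            # search bound for each component (assumption)
    sig2 = np.zeros(ibd.shape[1])    # every QTL variance starts at zero
    sig2_e = float(y.var())
    ll = log_lik(y, ibd, sig2, sig2_e)
    history = [ll]
    for _ in range(max_outer):       # outer iterations: one sweep over all parameters
        for k in range(sig2.size):
            def f(s, k=k):
                trial = sig2.copy()
                trial[k] = s
                return log_lik(y, ibd, trial, sig2_e)
            # Try the boundary value 0 and the interior optimum; keep only improvements.
            for cand in (0.0, golden_max(f, 0.0, upper)):
                if f(cand) > ll:
                    sig2[k], ll = cand, f(cand)
        def f_e(s):
            return log_lik(y, ibd, sig2, s)
        cand = golden_max(f_e, 1e-6, upper)
        if f_e(cand) > ll:
            sig2_e, ll = cand, f_e(cand)
        history.append(ll)
        if history[-1] - history[-2] < tol:
            break
    return sig2, sig2_e, history

rng = np.random.default_rng(1)
y, ibd = simulate(n_fam=300, n_sib=3, true_var=[2.0, 0.0, 0.5, 0.0], rng=rng)
sig2_hat, sig2_e_hat, history = sequential_ml(y, ibd)
print("QTL variances:", np.round(sig2_hat, 3), "residual:", round(sig2_e_hat, 3))
```

Because a candidate value is accepted only when it raises the log likelihood, the recorded log likelihood never decreases from one outer iteration to the next, and a component that starts at zero can stay exactly at zero.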
Estimation of QTL positions
The QTL positions can move along the genome, but the order of the QTL remains unchanged, as denoted by \( \lambda_{1} < \lambda_{2} < \cdots < \lambda_{M} \). The connection between the log likelihood function and the QTL positions is through the IBD matrices. We first calculate the IBD matrix for each putative position of the genome (Gessler and Xu 2000). If a QTL moves to a new position, the IBD matrix for the new position is used to evaluate the log likelihood function. The search for QTL positions is also sequential, i.e., we update one position at a time, given the positions of all other QTL. For the \( k \)th QTL, we use a grid search between \( \lambda_{k-1} \) and \( \lambda_{k+1} \) in 2 cM increments. When the iterations converge, all parameters, including the QTL positions, remain unchanged. We then have the MLE of all parameters, including the QTL positions.
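The position update can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the `genome_end` argument, the choice to search strictly inside the bracket so that the QTL order is preserved, and the toy objective standing in for the IBD-based log likelihood are all hypothetical.

```python
def update_position(k, positions, log_lik, genome_end, step=2.0):
    """Grid-search the position of QTL k between its two neighbours, others held fixed."""
    lo = positions[k - 1] if k > 0 else 0.0
    hi = positions[k + 1] if k + 1 < len(positions) else genome_end
    best, best_ll = list(positions), log_lik(positions)
    n_steps = int((hi - lo) / step)
    for i in range(1, n_steps + 1):
        cand = lo + i * step
        if cand >= hi:               # stay strictly inside the bracket to keep the order
            break
        trial = list(positions)
        trial[k] = cand
        ll = log_lik(trial)          # in practice: look up the IBD matrix at cand
        if ll > best_ll:
            best, best_ll = trial, ll
    return best

def sweep_positions(positions, log_lik, genome_end, step=2.0):
    """One sequential pass: update each QTL position in turn."""
    for k in range(len(positions)):
        positions = update_position(k, positions, log_lik, genome_end, step)
    return positions

# Toy objective (assumption): the log likelihood peaks when each QTL sits on its target.
targets = [20.0, 46.0, 120.0]
toy_log_lik = lambda pos: -sum((p - t) ** 2 for p, t in zip(pos, targets))

new_positions = sweep_positions([10.0, 50.0, 100.0], toy_log_lik, genome_end=200.0)
print(new_positions)
```

In a full implementation this sweep would alternate with the variance component updates until neither the positions nor the variances change.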
Bayesian estimation of parameters
The values of the hyperparameters a > 0 and b > 0 can be chosen arbitrarily, e.g., (a, b) = (0.5, 0.1).
Results
Setup of simulation experiments
Comparison of the multiple variance component model with the interval mapping approach under the standard setup
Position (true) | Variance (true) | Heritability, % (true) | Position (multiple variance components) | Variance (multiple variance components) | Power, % (multiple variance components) | Heritability, % (multiple variance components) | Position (interval mapping) | Variance (interval mapping) | Power, % (interval mapping) | Heritability, % (interval mapping) |
---|---|---|---|---|---|---|---|---|---|---|
45 | 1.980 | 40.00 | 47.0 (3.2) | 1.485 (0.296) | 100 | 29.5 (5.8) | 47.9 (2.6) | 1.856 (0.306) | 100 | 38.2 (5.5) |
235 | 0.990 | 20.00 | 234.6 (7.0) | 0.707 (0.266) | 100 | 14.1 (5.3) | 238.4 (4.3) | 1.184 (0.277) | 97 | 24.8 (5.4) |
340 | 0.495 | 10.00 | 338.9 (10.0) | 0.431 (0.206) | 85 | 8.5 (4.0) | 342.1 (9.2) | 0.969 (0.222) | 84 | 20.7 (4.6) |
440 | 0.248 | 5.00 | 439.4 (12.1) | 0.323 (0.180) | 71 | 6.4 (3.6) | 440.7 (11.4) | 0.847 (0.174) | 54 | 18.2 (3.4) |
640 | 0.124 | 2.50 | 637.7 (11.6) | 0.272 (0.189) | 63 | 5.4 (3.7) | 638.7 (13.4) | 0.868 (0.168) | 46 | 18.5 (3.2) |
740 | 0.062 | 1.25 | 737.0 (13.7) | 0.251 (0.196) | 45 | 5.0 (4.0) | 741.2 (14.3) | 0.831 (0.184) | 41 | 17.8 (3.8) |
835 | 0.031 | 0.62 | 832.5 (10.2) | 0.181 (0.123) | 52 | 3.6 (2.4) | 838.2 (12.7) | 0.825 (0.127) | 39 | 17.6 (2.7) |
940 | 0.015 | 0.30 | 937.5 (13.1) | 0.281 (0.161) | 41 | 5.5 (3.1) | 942.1 (15.3) | 0.792 (0.133) | 39 | 17.0 (2.9) |

Quantity | True value | Multiple variance components | Interval mapping |
---|---|---|---|
Residual variance | 1.000 | 0.448 (0.162) | - |
Phenotype variance | 4.945 | 5.032 (0.229) | - |
Number of iterations | | 98.6 (24.6) | - |
The simulation experiment with this setup is called the standard setup. Some parameters were eventually altered relative to the standard setup in the extended simulation experiments. For example, the marker density was later decreased from 10 cM per interval to 20 and 40 cM per interval. The sampling strategy was also extended to 750 × 2 = 1,500 and 375 × 4 = 1,500. When one experimental parameter was altered, the remaining parameters were fixed at the values in the standard setup.
Results of data analysis
Standard setup
Under the standard setup (10 cM per interval and 500 families, each with three siblings), we replicated the experiment 100 times and analyzed each replicated dataset with two methods. One method is the multiple variance component model proposed in this study, where the number of proposed QTL included in the model was 20 and the positions of the 20 proposed QTL were also estimated using the maximum likelihood method. The other method is the interval mapping of Xu and Atchley (1995), in which a single QTL and a polygenic effect were included in the model. Since the multiple variance component model has no test for a chromosome location, we simply examined the estimated QTL variance in the neighborhood of a true QTL. When the estimated QTL variance in the neighborhood (within 20 cM) of a true QTL was sufficiently large (larger than any peak appearing in a non-QTL region), the QTL was claimed to be detected. For each simulated true QTL, the mean estimate and the standard deviation across the 100 replicated simulations were calculated. The empirical statistical power for each simulated QTL was also calculated as the proportion of the replicated experiments in which the QTL was detected. This criterion may appear subjective, but the multiple variance component model usually provides very small estimated QTL variances for regions in which no true QTL was placed. Therefore, any region that has a noticeable estimated QTL variance indicates that a true QTL is nearby. For the interval mapping of Xu and Atchley (1995), the likelihood ratio test statistic was used to claim the significance of a QTL. If an estimated QTL variance near a true QTL (within 20 cM) was significant, this QTL was claimed to be detected. The estimated QTL variances and QTL positions for the interval mapping are compared with those obtained from the multiple variance component model (see Table 1 for the comparison). Overall, the multiple variance component model performs better than the interval mapping. The interval mapping provided upwardly biased estimates for all but the largest QTL variance, especially when the true QTL variance was small. Because of these large biases, the estimated QTL variances do not add up, i.e., the sum of all the estimated QTL variances is greater than the total phenotypic variance. Therefore, the multiple variance component model outperforms the interval mapping approach.
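The claim that the interval mapping estimates do not add up can be checked directly from the mean estimates reported in the comparison table for the standard setup. The short Python sketch below only sums the tabulated values; the variable names are ours. The multiple variance component total covers only the eight QTL nearest the true positions, not all 20 proposed QTL, so it is a partial sum.

```python
# Mean estimates copied from the standard-setup comparison table.
true_var = [1.980, 0.990, 0.495, 0.248, 0.124, 0.062, 0.031, 0.015]
mvc_var = [1.485, 0.707, 0.431, 0.323, 0.272, 0.251, 0.181, 0.281]  # multiple variance components
im_var = [1.856, 1.184, 0.969, 0.847, 0.868, 0.831, 0.825, 0.792]   # interval mapping

true_residual = 1.000
true_phenotypic = 4.945

true_total = sum(true_var) + true_residual  # true QTL variances plus residual variance
mvc_sum = sum(mvc_var)
im_sum = sum(im_var)

print(f"true QTL variances + residual: {true_total:.3f}")
print(f"sum of multiple variance component estimates: {mvc_sum:.3f}")
print(f"sum of interval mapping estimates: {im_sum:.3f}")
print("interval mapping sum exceeds phenotypic variance:", im_sum > true_phenotypic)
```

The true QTL variances plus the residual variance reproduce the phenotypic variance of 4.945, whereas the eight interval mapping estimates alone sum to about 8.17, well above it.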
Estimated QTL parameters by the multiple variance component model under two extended family structures
Position (true) | Variance (true) | Heritability, % (true) | Position (N × n = 750 × 2) | Variance (N × n = 750 × 2) | Power, % (N × n = 750 × 2) | Heritability, % (N × n = 750 × 2) | Position (N × n = 375 × 4) | Variance (N × n = 375 × 4) | Power, % (N × n = 375 × 4) | Heritability, % (N × n = 375 × 4) |
---|---|---|---|---|---|---|---|---|---|---|
45 | 1.980 | 40.00 | 46.5 (4.0) | 1.414 (0.326) | 100 | 28.1 (6.4) | 46.4 (2.7) | 1.506 (0.221) | 100 | 29.6 (4.3) |
235 | 0.990 | 20.00 | 234.9 (7.3) | 0.670 (0.318) | 89 | 13.3 (6.3) | 234.3 (5.4) | 0.756 (0.227) | 99 | 14.9 (4.4) |
340 | 0.495 | 10.00 | 337.2 (10.2) | 0.519 (0.277) | 80 | 10.3 (5.4) | 340.0 (8.2) | 0.445 (0.171) | 95 | 8.7 (3.3) |
440 | 0.248 | 5.00 | 438.3 (11.9) | 0.424 (0.215) | 62 | 8.4 (4.3) | 439.3 (9.3) | 0.284 (0.149) | 80 | 5.6 (2.9) |
640 | 0.124 | 2.50 | 635.6 (12.3) | 0.288 (0.217) | 46 | 5.7 (4.3) | 640.0 (10.6) | 0.209 (0.120) | 66 | 4.1 (2.3) |
740 | 0.062 | 1.25 | 737.4 (12.1) | 0.311 (0.208) | 38 | 6.2 (4.1) | 735.8 (13.3) | 0.226 (0.164) | 53 | 4.4 (3.2) |
835 | 0.031 | 0.62 | 831.0 (10.1) | 0.299 (0.220) | 35 | 5.9 (4.4) | 834.1 (9.6) | 0.164 (0.096) | 47 | 3.2 (1.9) |
940 | 0.015 | 0.30 | 937.0 (12.1) | 0.314 (0.189) | 30 | 6.3 (3.8) | 936.7 (12.9) | 0.154 (0.116) | 53 | 3.0 (2.3) |

Quantity | True value | N × n = 750 × 2 | N × n = 375 × 4 |
---|---|---|---|
Residual variance | 1.000 | 0.395 (0.165) | 0.467 (0.146) |
Phenotype variance | 4.945 | 5.026 (0.216) | 5.082 (0.220) |
Number of iterations | | 124.4 (33.2) | 101.3 (19.2) |
Estimated QTL parameters by the multiple variance component model under two extended marker densities
Position (true) | Variance (true) | Heritability, % (true) | Position (20 cM per interval) | Variance (20 cM per interval) | Power, % (20 cM per interval) | Heritability, % (20 cM per interval) | Position (40 cM per interval) | Variance (40 cM per interval) | Power, % (40 cM per interval) | Heritability, % (40 cM per interval) |
---|---|---|---|---|---|---|---|---|---|---|
45 | 1.980 | 40.00 | 43.8 (4.3) | 1.423 (0.386) | 100 | 28.6 (7.5) | 41.7 (7.1) | 1.393 (0.459) | 100 | 28.3 (9.1) |
235 | 0.990 | 20.00 | 236.7 (5.8) | 0.794 (0.282) | 99 | 16.0 (5.6) | 236.6 (7.3) | 0.891 (0.366) | 95 | 18.1 (7.2) |
340 | 0.495 | 10.00 | 340.7 (8.0) | 0.444 (0.232) | 91 | 8.9 (4.6) | 339.3 (16.6) | 0.458 (0.249) | 73 | 9.3 (5.0) |
440 | 0.248 | 5.00 | 440.2 (12.6) | 0.380 (0.246) | 73 | 7.6 (4.8) | 442.2 (16.1) | 0.385 (0.212) | 62 | 7.8 (4.3) |
640 | 0.124 | 2.50 | 642.3 (12.3) | 0.309 (0.167) | 65 | 6.3 (3.4) | 640.5 (16.7) | 0.308 (0.236) | 44 | 6.3 (4.8) |
740 | 0.062 | 1.25 | 742.3 (14.0) | 0.240 (0.159) | 52 | 4.8 (3.1) | 737.3 (17.9) | 0.247 (0.172) | 44 | 5.1 (3.6) |
835 | 0.031 | 0.62 | 834.0 (10.1) | 0.218 (0.111) | 30 | 4.4 (2.3) | 840.8 (17.0) | 0.314 (0.188) | 25 | 6.4 (3.8) |
940 | 0.015 | 0.30 | 943.0 (15.2) | 0.233 (0.131) | 42 | 4.7 (2.6) | 939.9 (18.6) | 0.297 (0.168) | 48 | 6.1 (3.5) |

Quantity | True value | 20 cM per interval | 40 cM per interval |
---|---|---|---|
Error variance | 1.000 | 0.279 (0.168) | 0.139 (0.156) |
Phenotype variance | 4.945 | 4.976 (0.208) | 4.912 (0.210) |
Number of iterations | | 126.9 (22.4) | 226.5 (123.2) |
Discussion
We examined two different methods for genome-wide evaluation of QTL in outbred populations. The ML method is an extension of the interval mapping of Xu and Atchley (1995) to handle multiple QTL. The MCMC implemented Bayesian method is an extension of the Bayesian shrinkage analysis of Wang et al. (2005) for line crosses to outbred populations. Similar random model methodology has been proposed by Yi and Xu (2000) who used the reversible jump MCMC algorithm for model selection. In Yi and Xu (2000), the QTL number was treated as a parameter and sampled along with other parameters. In this study, we emphasize genome evaluation rather than QTL mapping. The difference between genome evaluation and QTL mapping is that the former tries to evaluate the entire genome, including regions that have no QTL, while the latter emphasizes detecting regions of the genome that have QTL. We purposely placed more QTL than necessary to give the method a better chance to evaluate the entire genome. For regions of the genome that contain no QTL, the proposed QTL in those regions often have very small estimated variances. Another advantage of the genome evaluation is that it has avoided model selection, which is still a hot topic for discussion in the literature (Kadane and Lazar 2004).
We used multiple full-sib families as an example to demonstrate the method. Extension to multiple complicated pedigrees is straightforward, at least in theory, because the method requires only the IBD matrices for each putative location of the genome. Methods to calculate the IBD matrix using marker information are available for arbitrarily complicated pedigrees (Amos et al. 1990; Almasy and Blangero 1998). The programs Loki (Heath 1997) and SimWalk2 (Sobel et al. 2001) are the best-known software packages for IBD matrix calculation.
Surprisingly, the multiple variance component model has a very low false positive rate (also called the Type I error rate). Although we did not actually calculate the Type I error rate in our simulation experiments, visual inspection of the QTL variance profiles shows that regions of the genome that contain no QTL rarely show any noticeable peaks, while regions with large QTL always have strong signals. This observation suggests that the multiple variance component model has high power and a small Type I error rate. Of course, statistical power and Type I error are frequentist concepts, not Bayesian ones. Another surprising discovery is that the Bayesian method is very robust to the choice of prior for the QTL variance components. We examined the uniform prior and hierarchical priors (exponential and Gamma), and they all generated similar results.
Supplementary materials
The SAS/IML programs for the maximum likelihood method and Bayesian method are posted on the journal website along with sample data.
Notes
Acknowledgments
We thank two anonymous reviewers for their comments on an early version of the manuscript and suggestions on the improvement of the manuscript. This project was supported by the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture under the Plant Genome, Genetics and Breeding Program 2007-35300-18285.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
- Almasy L, Blangero J (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet 62:1198–1211
- Amos CI (1994) Robust variance-components approach for assessing genetic linkage in pedigrees. Am J Hum Genet 54:535–543
- Amos CI, Dawson DV, Elston RC (1990) The probabilistic determination of identity-by-descent sharing for pairs of relatives from pedigrees. Am J Hum Genet 47:842–853
- Fulker DW, Cardon LR (1994) A sib-pair approach to interval mapping of quantitative trait loci. Am J Hum Genet 54:1092–1103
- Gessler DDG, Xu S (2000) Multipoint genetic mapping of quantitative trait loci with dominant markers in outbred populations. Genetica 105:281–291
- Goldgar DE (1990) Multipoint analysis of human quantitative genetic variation. Am J Hum Genet 47:957–967
- Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109
- Heath SC (1997) Markov chain Monte Carlo segregation and linkage analysis of oligogenic models. Am J Hum Genet 61:748–760
- Kadane JB, Lazar NA (2004) Methods and criteria for model selection. J Am Statist Assoc 99:279–290
- Metropolis N, Rosenbluth AW, Rosenbluth MN et al (1953) Equations of state calculations by fast computing machines. J Chem Phys 21:1087–1091
- Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
- Sobel E, Sengul H, Weeks DE (2001) Multipoint estimation of identity-by-descent probabilities at arbitrary positions among marker loci on general pedigrees. Hum Hered 52:121–131
- Thompson EA, Shaw RG (1990) Pedigree analysis for quantitative traits: variance components without matrix inversion. Biometrics 46:399–413
- Uimari P, Hoeschele I (1997) Mapping linked quantitative trait loci using Bayesian analysis and Markov chain Monte Carlo algorithms. Genetics 146:735–743
- Wada Y, Kashiwagi N (1990) Selecting statistical models with information statistics. J Dairy Sci 73:3573–3582
- Wang H, Zhang YM, Li X et al (2005) Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics 170:465–480
- Xu S (2007) An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics 63:513–521
- Xu S, Atchley WR (1995) A random model approach to interval mapping of quantitative trait loci. Genetics 141:1189–1197
- Yi N, Xu S (2000) Bayesian mapping of quantitative trait loci under the identity-by-descent-based variance component model. Genetics 156:411–422