Abstract
Like Chap. 5, this chapter discusses an extension of association analysis to a larger set of hypotheses than the single markers genotyped in a particular study. Imputed SNP analysis is in certain respects identical to haplotype analysis, since the imputed SNPs lie on specific haplotypes or haplotype combinations. Testing imputed SNPs alongside genotyped SNPs is thus simply a more focused kind of haplotype analysis, one that extends the set of tested hypotheses to encompass known but ungenotyped variants. Imputed SNP analysis plays a special role in the post-GWAS phase, when many studies are combined in meta-analysis in an effort to find associations too weak to be detected in any one study. Imputation is necessarily relied upon when (as is generally the case) not all studies used the same genotyping platform or chip version.
This chapter discusses the basic statistical method, namely the hidden Markov model (HMM), that is used for fast, very large-scale SNP imputation in a number of high-performance programs. A brief introduction to HMM methods is provided, and the basic principles behind estimating the parameters of an HMM are illustrated with R code. The basics of a particular algorithm, patterned loosely after that implemented in the program MACH, are described.
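As a rough illustration of the kind of HMM computation referred to here, the sketch below evaluates the log likelihood of an observed signal sequence with the scaled forward recursion. It is written in Python rather than the chapter's R, it is not the MACH algorithm, and the function name `forward_loglik` and all toy parameter values are assumptions made purely for illustration.

```python
import math

def forward_loglik(pi, A, B, obs):
    """Log likelihood of a signal sequence under a discrete HMM.

    pi  : initial state probabilities, length N
    A   : N x N transition matrix, A[i][j] = P(state j next | state i now)
    B   : N x M emission matrix, B[i][k] = P(signal k | state i)
    obs : observed signal indices, each in 0..M-1
    """
    n = len(pi)
    # Forward variable at the first position: P(state i, first signal)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the chain, then weight by the emission
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        # Rescale to avoid underflow; the log scale factors sum to the log likelihood
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

# Hypothetical two-state, two-signal model (values chosen only for illustration)
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 0, 1, 1, 0]
print(forward_loglik(pi, A, B, obs))
```

The rescaling step matters in practice because the unscaled forward variables shrink geometrically with sequence length, which becomes a problem for the long marker sequences typical of imputation.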
Since nearly all large-scale SNP imputation methods require phased haplotypes for the SNPs to be imputed (or measured for the purpose of imputation), phasing algorithms are also discussed, with the details again modeled after the MACH program.
The use of imputed SNPs as independent variables in regression analysis is introduced, with the discussion focused mostly on the same approach (expectation substitution) used in haplotype analysis. The use of imputed SNPs in association analysis for a single study is described; their use in meta-analysis is deferred until Chap. 8.
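Expectation substitution can be sketched in a few lines: each subject's unknown allele count is replaced by its expected value given the imputation probabilities, and that expected count (often called a dosage) is used as the regressor. The Python sketch below assumes three genotype probabilities per subject and uses ordinary least squares on a continuous phenotype for simplicity; the helper names and all numbers are hypothetical and are not taken from the chapter.

```python
def expected_dosage(probs):
    """Expected allele count given (P(0 copies), P(1 copy), P(2 copies))."""
    p0, p1, p2 = probs
    return p1 + 2.0 * p2

def ols_slope_intercept(x, y):
    """Simple least-squares fit of y on a single regressor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up imputation probabilities and phenotypes for four subjects
posteriors = [
    (0.90, 0.09, 0.01),
    (0.10, 0.80, 0.10),
    (0.02, 0.18, 0.80),
    (0.60, 0.35, 0.05),
]
phenotype = [1.0, 1.6, 2.3, 1.2]

# Substitute the expected allele count for the unobserved genotype
dosages = [expected_dosage(p) for p in posteriors]
slope, intercept = ols_slope_intercept(dosages, phenotype)
print(dosages, slope, intercept)
```

For a binary disease outcome the same dosages would typically enter a logistic regression instead; the substitution step itself is unchanged.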
Notes
1. Strictly speaking, the EM algorithm is only guaranteed to find a local maximum; running it repeatedly from multiple starting values for the parameters in λ is recommended to assess whether a global maximum has been achieved.
2. In other settings, the EM algorithm requires calculating expectations of sufficient statistics, or of log likelihood functions in toto, rather than just of the unobserved data; see [19]. Here, because the HMM is a model for multinomial data (i.e., numbers of transitions) and the log likelihood of a multinomial is linear in the counts, whether observed or unobserved, calculating the expected number of unobserved transitions given the observed data (signals) is sufficient for estimating the model parameters.
3. The number of unique states, N, can be reduced to h(1 + h)/2 since haplotype order is not being considered here. The R code above becomes slightly less intelligible when mapping the state number j to the pair of haplotypes referred to by j, which is why this redundancy has not been removed.
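The reduction to h(1 + h)/2 states mentioned in the last note amounts to indexing unordered haplotype pairs. One possible mapping between a state number j and its pair of haplotypes is sketched below in Python with 0-based indices (the chapter's R code is 1-based); the function names and the row-by-row ordering of pairs are assumptions made for illustration, not the chapter's implementation.

```python
def pair_to_state(a, b, h):
    """State index of the unordered pair {a, b}, with 0 <= a, b < h.

    Pairs are ordered (0,0), (0,1), ..., (0,h-1), (1,1), ..., (h-1,h-1).
    """
    if a > b:
        a, b = b, a  # haplotype order is ignored
    # Pairs whose smaller member is less than a: sum of (h - i) for i < a
    return a * h - a * (a - 1) // 2 + (b - a)

def state_to_pair(j, h):
    """Inverse mapping: recover the pair (a, b), a <= b, from state index j."""
    a = 0
    while j >= h - a:  # skip over the block of pairs whose smaller member is a
        j -= h - a
        a += 1
    return a, a + j

h = 4
n_states = h * (h + 1) // 2
pairs = [state_to_pair(j, h) for j in range(n_states)]
print(n_states, pairs)
```

The extra index arithmetic is the loss of readability the note alludes to; the gain is that the state space, and hence each forward or backward pass, shrinks by nearly a factor of two.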
References
Howie, B., Marchini, J., & Stephens, M. (2011). Genotype imputation with thousands of genomes. G3 (Bethesda), 1, 457–470.
Carlson, C. S., Eberle, M. A., Rieder, M. J., Yi, Q., Kruglyak, L., & Nickerson, D. A. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. The American Journal of Human Genetics, 74, 106–120.
Stram, D. O. (2004). Tag SNP selection for association studies. Genetic Epidemiology, 27, 365–374.
de Bakker, P. I., Burtt, N. P., Graham, R. R., Guiducci, C., Yelensky, R., Drake, J. A., et al. (2006). Transferability of tag SNPs in genetic association studies in multiple populations. Nature Genetics, 38, 1298–1303.
Haiman, C. A., Hsu, C., de Bakker, P., Frasco, M., Sheng, X., Van Den Berg, D., et al. (2007). Comprehensive association testing of common genetic variation in DNA repair pathway genes in relationship with breast cancer risk in multiple populations. Human Molecular Genetics, 17(6), 825–834.
de Bakker, P. I., Yelensky, R., Pe'er, I., Gabriel, S. B., Daly, M. J., & Altshuler, D. (2005). Efficiency and power in genetic association studies. Nature Genetics, 37, 1217–1223.
Barrett, J. C., Fry, B., Maller, J., & Daly, M. J. (2005). Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263–265.
Stram, D. O., Haiman, C. A., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., et al. (2003). Choosing haplotype-tagging SNPs based on unphased genotype data from a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Human Heredity, 55(1), 27–36.
Chapman, J. M., Cooper, J. D., Todd, J. A., & Clayton, D. G. (2003). Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity, 56, 18–32.
Stephens, M., & Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics, 73, 1162–1169.
Li, Y., Willer, C. J., Ding, J., Scheet, P., & Abecasis, G. R. (2010). MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology, 34, 816–834.
Scheet, P., & Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78, 629–644.
Delaneau, O., Marchini, J., & Zagury, J.-F. (2011). A linear complexity phasing method for thousands of genomes. Nature Methods, 9, 179–181.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis. Cambridge, UK: Cambridge University Press.
Siegmund, D., & Yakir, B. (2007). The statistics of gene mapping. New York, NY: Springer.
Baum, L. E., & Eagon, J. A. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73, 360–363.
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164–171.
Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1–8.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., & Abecasis, G. R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics, 44, 955–959.
Haiman, C. A., Stram, D. O., Pike, M. C., Kolonel, L. N., Burtt, N. P., Altshuler, D., et al. (2003). A comprehensive haplotype analysis of CYP19 and breast cancer risk: The Multiethnic Cohort Study. Human Molecular Genetics, 12, 2679–2692.
Howie, B. N., Donnelly, P., & Marchini, J. (2009). Impute2: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics, 5, e1000529.
Liu, E. Y., Buyske, S., Aragaki, A. K., Peters, U., Boerwinkle, E., Carlson, C., et al. (2012). Genotype imputation of Metabochip SNPs using a study-specific reference panel of 4,000 haplotypes in African Americans from the Women’s Health Initiative. Genetic Epidemiology, 36, 107–117.
Li, L., Li, Y., Browning, S. R., Browning, B. L., Slater, A. J., Kong, X., et al. (2011). Performance of genotype imputation for rare variants identified in exons and flanking regions of genes. PLoS One, 6, e24945.
Egyud, M. R., Gajdos, Z. K., Butler, J. L., Tischfield, S., Le Marchand, L., Kolonel, L. N., et al. (2009). Use of weighted reference panels based on empirical estimates of ancestry for capturing untyped variation. Human Genetics, 125, 295–303.
Chen, Z., Pereira, M. A., Seielstad, M., Koh, W.-P., Tai, E. S., Teo, Y.-Y., et al. (2013). Joint effects of known type 2 diabetes susceptibility loci in genome-wide association study of Singapore Chinese: The Singapore Chinese Health Study. PLoS One, in press.
Hu, Y. J., & Lin, D. Y. (2010). Analysis of untyped SNPs: Maximum likelihood and imputation methods. Genetic Epidemiology, 34, 803–815.
Copyright information
© 2014 Springer Science+Business Media New York
About this chapter
Cite this chapter
Stram, D.O. (2014). SNP Imputation for Association Studies. In: Design, Analysis, and Interpretation of Genome-Wide Association Scans. Statistics for Biology and Health. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9443-0_6
DOI: https://doi.org/10.1007/978-1-4614-9443-0_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-9442-3
Online ISBN: 978-1-4614-9443-0
eBook Packages: Mathematics and Statistics (R0)