SNP Imputation for Association Studies

Stram, Daniel O.

doi:10.1007/978-1-4614-9443-0_6

Part of the book series: Statistics for Biology and Health (SBH)

Abstract

Like Chap. 5, this chapter discusses an extension of association analysis to a larger set of hypotheses than just the single markers genotyped in a particular study. Imputed SNP analysis is in certain respects identical to haplotype analysis, since the imputed SNPs lie on specific haplotypes or haplotype combinations. Testing imputed SNPs as well as genotyped SNPs is thus simply a more focused kind of haplotype analysis, and it extends the set of hypotheses tested to encompass known but ungenotyped variants. Imputed SNP analysis plays a special role in the post-GWAS phase, when many studies are combined in meta-analysis in an effort to find associations that are too small to be detectable in any one study. Imputation is necessarily relied upon when (as is generally the case) not all studies used the same genotyping platform or chip version.

This chapter discusses the basic statistical method, namely the hidden Markov model (HMM), that is used for fast and very large-scale SNP imputation in a number of high-performance programs. A brief introduction to HMM methods is provided, and the basic principles behind estimating the parameters of an HMM are illustrated with R code. The basics of a particular algorithm, patterned loosely after that implemented in the program MACH, are described.
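As a rough illustration of the kind of calculation that underlies HMM parameter estimation, the following minimal sketch computes the log likelihood of an observed signal sequence with the scaled forward recursion. It is written in Python rather than the chapter's R, is not taken from the chapter or from MACH, and the function name, argument names, and toy probabilities are all assumptions made for this example.

```python
import math

def forward_log_likelihood(init, trans, emit, signals):
    """Log likelihood of an observed signal sequence under a discrete HMM.

    init[i]     : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][s]  : probability that state i emits signal s
    signals     : list of observed signal indices
    (All names are hypothetical, chosen for this sketch.)
    """
    n_states = len(init)
    # alpha[i] is proportional to P(signals so far, current state = i)
    alpha = [init[i] * emit[i][signals[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(signals)):
        if t > 0:
            # Propagate one step through the transition matrix, then
            # weight by the probability of emitting the observed signal
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                * emit[j][signals[t]]
                for j in range(n_states)
            ]
        # Rescale at every step to avoid underflow on long sequences;
        # the log of the scale factors accumulates to the log likelihood
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

# Toy two-state model; every number here is made up for illustration
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
print(forward_log_likelihood(init, trans, emit, [0, 0, 1, 1, 0]))
```

The same forward quantities, combined with a matching backward pass, give the expected transition counts that an EM (Baum-Welch) update would use. That step is omitted here to keep the sketch short.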

Since nearly all large-scale SNP imputation methods require that phased haplotypes be provided for the SNPs to be imputed (or measured for the purpose of imputation), the use of phasing algorithms is also discussed, with the details again modeled after the MACH program.

The use of imputed SNPs as independent variables in regression analysis is introduced with the discussion mostly focused on the same approach (expectation substitution) used in haplotype analysis. Use of imputed SNPs in association analyses for a single study is described, although the use of imputed SNPs in meta-analysis is deferred until Chap. 8.
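Expectation substitution can be sketched as follows: each subject's unknown allele count is replaced by its expected value given the imputation probabilities, and that expected dosage is used as the independent variable. The example below is a minimal sketch under stated assumptions, not the chapter's code. The function names and toy data are hypothetical, and an ordinary least-squares fit to a quantitative trait stands in for whatever regression model a real analysis would use.

```python
def expected_dosage(probs):
    """Expected allele count given probabilities for genotypes 0, 1 and 2."""
    p0, p1, p2 = probs
    return p1 + 2.0 * p2

def simple_linear_regression(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up imputation probabilities (one row per subject) and trait values
genotype_probs = [
    (0.90, 0.09, 0.01),
    (0.10, 0.80, 0.10),
    (0.02, 0.18, 0.80),
    (0.60, 0.35, 0.05),
]
trait = [1.1, 2.0, 3.2, 1.4]

# Substitute the expected dosage for the unobserved genotype
dosages = [expected_dosage(p) for p in genotype_probs]
intercept, slope = simple_linear_regression(dosages, trait)
print(dosages)
print(intercept, slope)
```

For a well-imputed SNP the probabilities concentrate on one genotype and the dosage is close to an integer. For a poorly imputed SNP the dosages shrink toward the population mean, which reduces the information available to the test.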

Notes

1. Actually, the EM algorithm will generally find only a local maximum; running the EM repeatedly from multiple starting values for the parameters in λ is recommended to assess whether a global maximum has been achieved.
2. In other settings, the EM algorithm requires calculating expectations of sufficient statistics or of log likelihood functions in toto, rather than just of the unobserved data; see [19]. Because the HMM is a model for multinomial data (i.e., numbers of transitions), and because the log likelihood of a multinomial is linear in the observed or unobserved counts, calculating the expectation of the unobserved transitions, given the observed data (signals), is sufficient for estimating the model parameters.
3. The number of unique states, N, can be reduced to h(1 + h)/2 since haplotype order is not being considered here. The R code above becomes slightly less intelligible when mapping the state number j to the pair of haplotypes referred to by j, which is why this redundancy has not been removed.
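The trade-off described in note 3 can be illustrated with a short sketch. It assumes, as a guess not confirmed by the text shown here, that the redundant version uses N = h × h ordered haplotype pairs; the function names are hypothetical and the code is Python, not the chapter's R.

```python
from itertools import combinations_with_replacement

def ordered_pair(j, h):
    """Map state number j (0-based) to an ordered haplotype pair.

    With h haplotypes there are h * h such states, and the mapping is
    a single integer division.
    """
    return divmod(j, h)

def unordered_pairs(h):
    """List the h * (h + 1) / 2 unordered haplotype pairs.

    The position of a pair in the list serves as its state number.
    """
    return list(combinations_with_replacement(range(h), 2))

h = 4
pairs = unordered_pairs(h)
# Reverse lookup from a pair (smaller index first) to its state number
index_of = {pair: j for j, pair in enumerate(pairs)}

print(h * h, len(pairs))
print(ordered_pair(5, h), pairs[5], index_of[(1, 2)])
```

The ordered mapping needs no lookup table, which is roughly the readability advantage the note refers to, while the unordered version nearly halves the state count at the cost of an explicit index.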

References

1. Howie, B., Marchini, J., & Stephens, M. (2011). Genotype imputation with thousands of genomes. G3 (Bethesda), 1, 457–470.
2. Carlson, C. S., Eberle, M. A., Rieder, M. J., Yi, Q., Kruglyak, L., & Nickerson, D. A. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. The American Journal of Human Genetics, 74, 106–120.
3. Stram, D. O. (2004). Tag SNP selection for association studies. Genetic Epidemiology, 27, 365–374.
4. de Bakker, P. I., Burtt, N. P., Graham, R. R., Guiducci, C., Yelensky, R., Drake, J. A., et al. (2006). Transferability of tag SNPs in genetic association studies in multiple populations. Nature Genetics, 38, 1298–1303.
5. Haiman, C. A., Hsu, C., de Bakker, P., Frasco, M., Sheng, X., Van Den Berg, D., et al. (2007). Comprehensive association testing of common genetic variation in DNA repair pathway genes in relationship with breast cancer risk in multiple populations. Human Molecular Genetics, 17(6), 825–834.
6. de Bakker, P. I., Yelensky, R., Pe'er, I., Gabriel, S. B., Daly, M. J., & Altshuler, D. (2005). Efficiency and power in genetic association studies. Nature Genetics, 37, 1217–1223.
7. Barrett, J. C., Fry, B., Maller, J., & Daly, M. J. (2005). Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263–265.
8. Stram, D. O., Haiman, C. A., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., et al. (2003). Choosing haplotype-tagging SNPs based on unphased genotype data from a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Human Heredity, 55(1), 27–36.
9. Chapman, J. M., Cooper, J. D., Todd, J. A., & Clayton, D. G. (2003). Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity, 56, 18–32.
10. Stephens, M., & Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics, 73, 1162–1169.
11. Li, Y., Willer, C. J., Ding, J., Scheet, P., & Abecasis, G. R. (2010). MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology, 34, 816–834.
12. Scheet, P., & Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78, 629–644.
13. Delaneau, O., Marchini, J., & Zagury, J.-F. (2011). A linear complexity phasing method for thousands of genomes. Nature Methods, 9, 179–181.
14. Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis. Cambridge, UK: Cambridge University Press.
15. Siegmund, D., & Yakir, B. (2007). The statistics of gene mapping. New York, NY: Springer.
16. Baum, L. E., & Eagon, J. A. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73, 360–363.
17. Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164–171.
18. Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1–8.
19. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. JRSS-B, 39, 1–38.
20. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
21. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., & Abecasis, G. R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics, 44, 955–959.
22. Haiman, C. A., Stram, D. O., Pike, M. C., Kolonel, L. N., Burtt, N. P., Altshuler, D., et al. (2003). A comprehensive haplotype analysis of CYP19 and breast cancer risk: The Multiethnic Cohort Study. Human Molecular Genetics, 12, 2679–2692.
23. Howie, B. N., Donnelly, P., & Marchini, J. (2009). Impute2: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics, 5, e1000529.
24. Liu, E. Y., Buyske, S., Aragaki, A. K., Peters, U., Boerwinkle, E., Carlson, C., et al. (2012). Genotype imputation of Metabochip SNPs using a study-specific reference panel of 4,000 haplotypes in African Americans from the Women’s Health Initiative. Genetic Epidemiology, 36, 107–117.
25. Li, L., Li, Y., Browning, S. R., Browning, B. L., Slater, A. J., Kong, X., et al. (2011). Performance of genotype imputation for rare variants identified in exons and flanking regions of genes. PLoS One, 6, e24945.
26. Egyud, M. R., Gajdos, Z. K., Butler, J. L., Tischfield, S., Le Marchand, L., Kolonel, L. N., et al. (2009). Use of weighted reference panels based on empirical estimates of ancestry for capturing untyped variation. Human Genetics, 125, 295–303.
27. Chen, Z., Pereira, M. A., Seielstad, M., Koh, W.-P., Tai, E. S., Teo, Y.-Y., et al. (2013). Joint effects of known Type 2 diabetes susceptibility loci in genome-wide association study of Singapore Chinese: The Singapore Chinese Health Study. PLoS One, in press.
28. Hu, Y. J., & Lin, D. Y. (2010). Analysis of untyped SNPs: Maximum likelihood and imputation methods. Genetic Epidemiology, 34, 803–815.

Author information

Authors and Affiliations

Department of Preventive Medicine, University of Southern California Keck School of Medicine, Los Angeles, CA, USA
Daniel O. Stram

6.1 Electronic Supplementary Material

Below is the link to the electronic supplementary material.

chapter6 (ZIP 3.79 KB)

About this chapter

Cite this chapter

Stram, D.O. (2014). SNP Imputation for Association Studies. In: Design, Analysis, and Interpretation of Genome-Wide Association Scans. Statistics for Biology and Health. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9443-0_6

DOI: https://doi.org/10.1007/978-1-4614-9443-0_6
Published: 11 November 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-9442-3
Online ISBN: 978-1-4614-9443-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
