Part of the book series: Statistics for Biology and Health (SBH)

Abstract

Like Chap. 5, this chapter discusses an extension of association analysis to a larger set of hypotheses than the single markers genotyped in a particular study. Imputed SNP analysis is in certain respects identical to haplotype analysis, since the imputed SNPs lie on specific haplotypes or haplotype combinations. Testing imputed SNPs alongside genotyped SNPs is thus simply a more focused kind of haplotype analysis, one that extends the set of tested hypotheses to encompass known but ungenotyped variants. Imputed SNP analysis plays a special role in the post-GWAS phase, when many studies are combined in a meta-analysis to find associations too small to be detectable in any one study. Imputation is necessary whenever (as is generally the case) not all studies used the same genotyping platform or chip version.

This chapter discusses the basic statistical method, the hidden Markov model (HMM), that is used for fast, very large-scale SNP imputation in a number of high-performance programs. A brief introduction to HMM methods is provided, and the basic principles behind estimating the parameters of an HMM are illustrated with R code. The basics of a particular algorithm, patterned loosely after that implemented in the program MACH, are described.
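As a rough illustration of the kind of computation an HMM involves, the sketch below evaluates the log likelihood of an observed sequence with the scaled forward algorithm; this is the quantity that parameter estimation tries to maximize. It is written in Python rather than the chapter's R, it is not the chapter's code or MACH's implementation, and the function name `forward_loglik` and the two-state toy parameters are assumptions made only for this example.

```python
import math

def forward_loglik(init, trans, emit, obs):
    """Log P(obs | parameters) for a discrete HMM via the scaled forward pass.

    init[i]     : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][k]  : probability that state i emits symbol k
    obs         : list of observed symbol indices
    """
    n = len(init)
    # First position: joint probability of each state and the first symbol.
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    for symbol in obs[1:]:
        # Propagate one step through the transition matrix, then emit.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

# Toy two-state example (hypothetical numbers, not taken from the chapter).
toy_init = [0.5, 0.5]
toy_trans = [[0.9, 0.1], [0.2, 0.8]]
toy_emit = [[0.8, 0.2], [0.3, 0.7]]
print(forward_loglik(toy_init, toy_trans, toy_emit, [0, 1, 1]))
```

Rescaling at every position keeps the recursion stable for long sequences, which matters when the sequence is a chromosome's worth of markers.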

Since nearly all large-scale SNP imputation methods require that phased haplotypes be provided for the SNPs to be imputed (or measured for the purpose of imputation), the use of phasing algorithms is also discussed, with the details again modeled after the MACH program.

The use of imputed SNPs as independent variables in regression analysis is introduced, with the discussion focused mostly on the same approach (expectation substitution) used in haplotype analysis. The use of imputed SNPs in association analyses for a single study is described; their use in meta-analysis is deferred until Chap. 8.
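A minimal sketch of expectation substitution follows, assuming each imputed SNP comes with posterior probabilities for carrying 0, 1, or 2 copies of the allele of interest. The unknown allele count is replaced by its expected value (often called the dosage), which then enters the regression as an ordinary covariate. The helper names `expected_dosage` and `ols_fit` and all numbers are hypothetical, and a simple least-squares fit stands in for whatever regression model a real analysis would use.

```python
def expected_dosage(probs):
    """Expected allele count given (P(0 copies), P(1 copy), P(2 copies))."""
    p0, p1, p2 = probs
    return p1 + 2.0 * p2

def ols_fit(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical posterior genotype probabilities for four subjects.
posteriors = [(0.9, 0.1, 0.0), (0.1, 0.3, 0.6), (0.0, 0.2, 0.8), (0.5, 0.5, 0.0)]
# Hypothetical quantitative trait values for the same subjects.
trait = [1.1, 2.4, 2.9, 1.6]

dosages = [expected_dosage(p) for p in posteriors]
intercept, slope = ols_fit(dosages, trait)
print(dosages, intercept, slope)
```

A subject whose genotype is imputed with certainty gets a dosage of exactly 0, 1, or 2, so genotyped and imputed SNPs can be handled by the same regression code.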

Notes

  1. Actually, the EM algorithm will generally find only a local maximum; running it repeatedly from multiple starting values for the parameters in λ is recommended to assess whether a global maximum has been reached.

  2. In other settings, the EM algorithm requires calculating expectations of sufficient statistics or of log likelihood functions in toto, rather than just of the unobserved data; see [19]. Because the HMM is a model for multinomial data (i.e., numbers of transitions), and the log likelihood of a multinomial is linear in the observed or unobserved counts, calculating the expected number of unobserved transitions given the observed data (signals) is sufficient for estimating the model parameters.

  3. The number of unique states, N, can be reduced to h(1 + h)/2, since haplotype order is not being considered here. The R code above becomes slightly less intelligible when the state number j must be mapped to the pair of haplotypes it refers to, which is why this redundancy has not been removed.

References

  1. Howie, B., Marchini, J., & Stephens, M. (2011). Genotype imputation with thousands of genomes. G3 (Bethesda), 1, 457–470.

  2. Carlson, C. S., Eberle, M. A., Rieder, M. J., Yi, Q., Kruglyak, L., & Nickerson, D. A. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. The American Journal of Human Genetics, 74, 106–120.

  3. Stram, D. O. (2004). Tag SNP selection for association studies. Genetic Epidemiology, 27, 365–374.

  4. de Bakker, P. I., Burtt, N. P., Graham, R. R., Guiducci, C., Yelensky, R., Drake, J. A., et al. (2006). Transferability of tag SNPs in genetic association studies in multiple populations. Nature Genetics, 38, 1298–1303.

  5. Haiman, C. A., Hsu, C., de Bakker, P., Frasco, M., Sheng, X., Van Den Berg, D., et al. (2007). Comprehensive association testing of common genetic variation in DNA repair pathway genes in relationship with breast cancer risk in multiple populations. Human Molecular Genetics, 17(6), 825–834.

  6. de Bakker, P. I., Yelensky, R., Pe'er, I., Gabriel, S. B., Daly, M. J., & Altshuler, D. (2005). Efficiency and power in genetic association studies. Nature Genetics, 37, 1217–1223.

  7. Barrett, J. C., Fry, B., Maller, J., & Daly, M. J. (2005). Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263–265.

  8. Stram, D. O., Haiman, C. A., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., et al. (2003). Choosing haplotype-tagging SNPs based on unphased genotype data from a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Human Heredity, 55(1), 27–36.

  9. Chapman, J. M., Cooper, J. D., Todd, J. A., & Clayton, D. G. (2003). Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity, 56, 18–32.

  10. Stephens, M., & Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics, 73, 1162–1169.

  11. Li, Y., Willer, C. J., Ding, J., Scheet, P., & Abecasis, G. R. (2010). MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology, 34, 816–834.

  12. Scheet, P., & Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78, 629–644.

  13. Delaneau, O., Marchini, J., & Zagury, J.-F. (2011). A linear complexity phasing method for thousands of genomes. Nature Methods, 9, 179–181.

  14. Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis. Cambridge, UK: Cambridge University Press.

  15. Siegmund, D., & Yakir, B. (2007). The statistics of gene mapping. New York, NY: Springer.

  16. Baum, L. E., & Eagon, J. A. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73, 360–363.

  17. Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164–171.

  18. Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1–8.

  19. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

  20. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.

  21. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., & Abecasis, G. R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics, 44, 955–959.

  22. Haiman, C. A., Stram, D. O., Pike, M. C., Kolonel, L. N., Burtt, N. P., Altshuler, D., et al. (2003). A comprehensive haplotype analysis of CYP19 and breast cancer risk: The Multiethnic Cohort Study. Human Molecular Genetics, 12, 2679–2692.

  23. Howie, B. N., Donnelly, P., & Marchini, J. (2009). Impute2: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics, 5, e1000529.

  24. Liu, E. Y., Buyske, S., Aragaki, A. K., Peters, U., Boerwinkle, E., Carlson, C., et al. (2012). Genotype imputation of Metabochip SNPs using a study-specific reference panel of 4,000 haplotypes in African Americans from the Women’s Health Initiative. Genetic Epidemiology, 36, 107–117.

  25. Li, L., Li, Y., Browning, S. R., Browning, B. L., Slater, A. J., Kong, X., et al. (2011). Performance of genotype imputation for rare variants identified in exons and flanking regions of genes. PLoS One, 6, e24945.

  26. Egyud, M. R., Gajdos, Z. K., Butler, J. L., Tischfield, S., Le Marchand, L., Kolonel, L. N., et al. (2009). Use of weighted reference panels based on empirical estimates of ancestry for capturing untyped variation. Human Genetics, 125, 295–303.

  27. Chen, Z., Pereira, M. A., Seielstad, M., Koh, W.-P., Tai, E. S., Teo, Y.-Y., et al. (2013). Joint effects of known Type 2 diabetes susceptibility loci in genome-wide association study of Singapore Chinese: The Singapore Chinese Health Study. PLOS ONE, in press.

  28. Hu, Y. J., & Lin, D. Y. (2010). Analysis of untyped SNPs: Maximum likelihood and imputation methods. Genetic Epidemiology, 34, 803–815.

6.1 Electronic Supplementary Material

Below is the link to the electronic supplementary material.

chapter6 (ZIP 3.79 KB)

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Stram, D.O. (2014). SNP Imputation for Association Studies. In: Design, Analysis, and Interpretation of Genome-Wide Association Scans. Statistics for Biology and Health. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9443-0_6
