An Adaptive and Memory Efficient Algorithm for Genotype Imputation

  • Hyun Min Kang
  • Noah A. Zaitlen
  • Buhm Han
  • Eleazar Eskin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5541)


Genome-wide association studies have proven to be a highly successful method for identifying genetic loci for complex phenotypes in both humans and model organisms. These large-scale studies rely on the collection of hundreds of thousands of single nucleotide polymorphisms (SNPs) across the genome. Standard high-throughput genotyping technologies capture only a fraction of the total genetic variation. Recent efforts have shown that it is possible to “impute” with high accuracy the genotypes of SNPs that are not collected in the study, provided that they are present in a reference data set that contains both the SNPs collected in the study and other SNPs. Here we introduce a novel HMM-based technique for the imputation problem that addresses several shortcomings of existing methods. First, our method is adaptive: it estimates population genetic parameters from the data, so it can be applied to model organisms that have very different evolutionary histories. Compared to traditional methods, our method is up to ten times more accurate on model organisms such as mouse. Second, the memory usage of our algorithm scales with the number of collected markers rather than the number of known SNPs. This matters because of the size of the reference data sets currently being generated. We compare our method to existing methods on mouse and human data sets and show that on each it achieves comparable or better performance with much lower memory usage. The method is available for download at .
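To make the general idea of HMM-based imputation concrete, the following is a minimal sketch and not the method of this paper. It assumes a haploid target (a reasonable simplification for inbred strains), a small reference panel of 0/1 haplotypes, and a generic "copying" HMM in which the hidden state is the reference haplotype currently being copied. The function name `impute_haploid` and the parameters `switch` and `err` are hypothetical, and the parameters are fixed by hand rather than estimated adaptively from the data. The sketch also stores tables over all reference SNPs, so it does not reproduce the memory behavior described above.

```python
from typing import List, Optional


def impute_haploid(reference: List[List[int]],
                   target: List[Optional[int]],
                   switch: float = 0.05,
                   err: float = 0.01) -> List[float]:
    """Posterior probability of allele 1 at every SNP of a haploid target.

    reference: haplotypes over all SNPs (0/1 alleles), one list per haplotype.
    target:    observed alleles; None marks an SNP that was not genotyped.
    switch:    assumed probability of jumping to a random haplotype between SNPs.
    err:       assumed probability that an observed allele mismatches the copy.
    """
    n_hap, n_snp = len(reference), len(target)

    def emit(h: int, j: int) -> float:
        # An untyped SNP carries no information, so every state emits it equally.
        if target[j] is None:
            return 1.0
        return 1.0 - err if reference[h][j] == target[j] else err

    def normalize(values: List[float]) -> List[float]:
        total = sum(values)
        return [v / total for v in values]

    # Forward pass; each column is rescaled to sum to 1 to avoid underflow.
    fwd = [normalize([emit(h, 0) / n_hap for h in range(n_hap)])]
    for j in range(1, n_snp):
        prev = fwd[-1]
        fwd.append(normalize([
            ((1.0 - switch) * prev[h] + switch / n_hap) * emit(h, j)
            for h in range(n_hap)
        ]))

    # Backward pass with the same transition model.
    bwd = [[1.0] * n_hap for _ in range(n_snp)]
    for j in range(n_snp - 2, -1, -1):
        nxt = [bwd[j + 1][h] * emit(h, j + 1) for h in range(n_hap)]
        total = sum(nxt)
        bwd[j] = normalize([
            (1.0 - switch) * nxt[h] + switch * total / n_hap
            for h in range(n_hap)
        ])

    # Combine both passes into a state posterior, then sum over states
    # whose reference haplotype carries allele 1 at that SNP.
    result = []
    for j in range(n_snp):
        post = normalize([fwd[j][h] * bwd[j][h] for h in range(n_hap)])
        result.append(sum(post[h] for h in range(n_hap) if reference[h][j] == 1))
    return result


if __name__ == "__main__":
    panel = [[0, 0, 0, 0], [1, 1, 1, 1]]
    observed = [1, None, 1, None]  # SNPs 1 and 3 were not genotyped
    print(impute_haploid(panel, observed))
```

In this toy panel the typed markers match the second haplotype, so the untyped SNPs receive a high posterior probability of allele 1. A diploid or adaptive variant would need a richer state space and a parameter estimation step, neither of which is shown here.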


Keywords: Hidden Markov Model; Inbred Mouse Strain; Imputation Accuracy; Silent State; Genotype Imputation





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Hyun Min Kang (1)
  • Noah A. Zaitlen (2)
  • Buhm Han (1)
  • Eleazar Eskin (3)

  1. Computer Science and Engineering, University of California, San Diego, USA
  2. Bioinformatics Program, University of California, San Diego, USA
  3. Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, Los Angeles, USA
