Multivariate Imputation of Genotype Data Using Short and Long Range Disequilibrium

  • María M. Abad-Grau
  • Paola Sebastiani
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4739)


Missing values are common in genetic data. In this paper we explore several machine learning techniques for building models that impute missing genotypes from multiple genetic markers. We map these techniques to different patterns of transmission and, in particular, contrast the effect of short and long range disequilibrium between markers. Under short range disequilibrium, only physically close genetic variants are informative for reconstructing missing genotypes; long range disequilibrium relaxes this assumption, so that physically distant variants also become informative for imputation. We evaluate the accuracy of a flexible feature selection model that fits both patterns of transmission using six real datasets of single nucleotide polymorphisms (SNPs). The results show increased accuracy compared with standard imputation models. [Supplementary material]
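
The contrast between the two assumptions can be illustrated with a small sketch. This is not the authors' model: it is a minimal illustration that assumes genotypes coded as 0/1/2 with `None` for missing values, ranks candidate markers by empirical mutual information with the target SNP, and imputes by majority vote among individuals that match on the selected markers. The function names (`mutual_information`, `select_markers`, `impute`), the `window` parameter, and the toy data are all hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def select_markers(rows, target, k, window=None):
    """Return the k markers most informative about the target marker.

    window=None lets every marker compete (long range assumption);
    window=w keeps only markers within w positions of the target
    (short range assumption). Positions are column indices here.
    """
    n_markers = len(rows[0])
    candidates = [j for j in range(n_markers)
                  if j != target and (window is None or abs(j - target) <= window)]
    scores = {}
    for j in candidates:
        # Score each candidate on individuals where both genotypes are observed.
        pairs = [(r[j], r[target]) for r in rows
                 if r[j] is not None and r[target] is not None]
        scores[j] = mutual_information([p[0] for p in pairs],
                                       [p[1] for p in pairs])
    return sorted(candidates, key=lambda j: scores[j], reverse=True)[:k]


def impute(rows, target, markers):
    """Fill missing target genotypes by majority vote among matching rows."""
    observed = [r for r in rows if r[target] is not None]
    # Fallback when no observed individual shares the marker pattern.
    default = Counter(r[target] for r in observed).most_common(1)[0][0]
    table = {}
    for r in observed:
        key = tuple(r[j] for j in markers)
        table.setdefault(key, Counter())[r[target]] += 1
    result = []
    for r in rows:
        r = list(r)
        if r[target] is None:
            votes = table.get(tuple(r[j] for j in markers))
            r[target] = votes.most_common(1)[0][0] if votes else default
        result.append(r)
    return result


# Toy data (invented): marker 0 is the target; the adjacent marker 1 carries
# no information about it, while the distant marker 4 tracks it exactly.
rows = [
    [0, 1, 0, 2, 0],
    [1, 1, 2, 0, 1],
    [2, 0, 1, 1, 2],
    [0, 0, 2, 1, 0],
    [1, 0, 0, 2, 1],
    [2, 1, 1, 0, 2],
    [None, 1, 0, 1, 2],  # genotype at marker 0 is missing
]

short_range = select_markers(rows, target=0, k=1, window=1)
long_range = select_markers(rows, target=0, k=1)
imputed = impute(rows, target=0, markers=long_range)
```

In this toy setting the short range selection can only choose the adjacent marker, whereas the unrestricted selection is free to pick the distant one, which is the kind of flexibility the abstract attributes to a feature selection model that fits both patterns.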


Keywords: Bayesian networks, decision trees, imputation, missing data, SNPs, linkage disequilibrium





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • María M. Abad-Grau (1)
  • Paola Sebastiani (2)
  1. Software Engineering Department, University of Granada, Granada 18071, Spain
  2. Department of Biostatistics, Boston University, Boston, MA 02118, USA
