Abstract
Missing values are common in genetic data. In this paper, we explore several machine learning techniques for building models that impute missing genotypes from multiple genetic markers. We map these techniques to different patterns of transmission and, in particular, contrast the effects of short and long range disequilibrium between markers. Under short range disequilibrium, only physically close genetic variants are informative for reconstructing missing genotypes; long range disequilibrium relaxes this assumption, so physically distant variants also become informative for imputation. We evaluate the accuracy of a flexible feature selection model that fits both patterns of transmission on six real datasets of single nucleotide polymorphisms (SNPs). The results show increased accuracy compared with standard imputation models. [Supplementary material] http://bios.ugr.es/missingGenotypes
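The contrast between short and long range disequilibrium can be illustrated with a small sketch. It is not the model evaluated in the paper. It assumes genotypes coded as 0/1/2 with `None` for a missing call, scores candidate markers by empirical mutual information with the target marker, and imputes by majority vote among samples that match on the selected markers. The function names, the `window` parameter, and the toy data are all hypothetical.

```python
from collections import Counter
from math import log

MISSING = None  # assumed coding: genotypes are 0, 1, 2; None marks a missing call


def column(data, j):
    """Return the genotype calls of marker j across all samples."""
    return [row[j] for row in data]


def mutual_information(xs, ys):
    """Empirical mutual information (nats) between two marker columns,
    computed over samples where both calls are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not MISSING and y is not MISSING]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def select_markers(data, target, k, window=None):
    """Pick the k markers most informative about `target`.
    window=w    -> short range: only markers within w positions are candidates.
    window=None -> long range: any marker may be selected."""
    n_markers = len(data[0])
    candidates = [j for j in range(n_markers)
                  if j != target and (window is None or abs(j - target) <= window)]
    target_col = column(data, target)
    scores = {j: mutual_information(column(data, j), target_col) for j in candidates}
    return sorted(candidates, key=lambda j: scores[j], reverse=True)[:k]


def impute(data, target, markers):
    """Fill missing calls at `target` with the most common genotype among
    samples sharing the same genotypes at `markers`; fall back to the
    overall most common genotype when no matching sample exists."""
    table = {}
    for row in data:
        if row[target] is not MISSING:
            key = tuple(row[j] for j in markers)
            table.setdefault(key, Counter())[row[target]] += 1
    overall = Counter(row[target] for row in data if row[target] is not MISSING)
    fallback = overall.most_common(1)[0][0]
    result = []
    for row in data:
        if row[target] is MISSING:
            votes = table.get(tuple(row[j] for j in markers))
            result.append(votes.most_common(1)[0][0] if votes else fallback)
        else:
            result.append(row[target])
    return result


# Toy data (invented): marker 0 is the target, the adjacent marker 1 is
# uninformative, and the distant marker 3 tracks the target exactly.
data = [
    [0, 1, 0, 0],
    [2, 1, 1, 2],
    [0, 0, 1, 0],
    [2, 0, 0, 2],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [None, 1, 1, 2],
]

short_range = select_markers(data, target=0, k=1, window=1)
long_range = select_markers(data, target=0, k=1)
imputed = impute(data, target=0, markers=long_range)
```

In this toy case the short range selection can only use the adjacent marker, while the unrestricted selection picks up the distant marker that actually carries information about the target. A real evaluation would also need held-out genotypes to measure accuracy.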
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abad-Grau, M.M., Sebastiani, P. (2007). Multivariate Imputation of Genotype Data Using Short and Long Range Disequilibrium. In: Moreno Díaz, R., Pichler, F., Quesada Arencibia, A. (eds) Computer Aided Systems Theory – EUROCAST 2007. EUROCAST 2007. Lecture Notes in Computer Science, vol 4739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75867-9_24
DOI: https://doi.org/10.1007/978-3-540-75867-9_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75866-2
Online ISBN: 978-3-540-75867-9
eBook Packages: Computer Science, Computer Science (R0)