Identifying Blocks and Sub-populations in Noisy SNP Data

  • Gad Kimmel
  • Roded Sharan
  • Ron Shamir
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2812)


We study several problems arising in haplotype block partitioning. Our objective function is the total number of distinct haplotypes in blocks. We show that the problem is NP-hard when there are errors or missing data, and provide approximation algorithms for several of its variants. We also give an algorithm that solves the problem with high probability under a probabilistic model that allows noise and missing data. In addition, we study the multi-population case, where one has to partition the haplotypes into populations and seek a different block partition in each one. We provide a heuristic for that problem and use it to analyze simulated and real data. On simulated data, our blocks resemble the true partition more than the blocks generated by the LD-based algorithm of Gabriel et al. [7]. On single-population real data, we generate a more concise block description than extant approaches, with better average LD within blocks. The algorithm also gives promising results on real 2-population genotype data.
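The objective named above, the total number of distinct haplotypes summed over blocks, can be made concrete with a small sketch. It assumes complete, error-free haplotypes given as equal-length strings over {0, 1}, and it minimizes the objective with a simple dynamic program over block boundaries. The function names and the dynamic program are illustrative assumptions, not necessarily the procedure used in the paper, and the sketch does not handle the noisy or missing-data variants that the abstract states are NP-hard.

```python
from typing import List, Sequence, Tuple


def distinct_in_block(haplotypes: Sequence[str], start: int, end: int) -> int:
    """Count distinct haplotype substrings over SNP columns [start, end)."""
    return len({h[start:end] for h in haplotypes})


def best_partition(haplotypes: Sequence[str]) -> Tuple[int, List[Tuple[int, int]]]:
    """Minimize the summed number of distinct haplotypes per block.

    Hypothetical helper: assumes no errors and no missing entries.
    best[j] holds the minimum cost of partitioning SNP columns [0, j).
    """
    m = len(haplotypes[0])
    best = [0] + [float("inf")] * m
    prev = [0] * (m + 1)  # prev[j] = start of the last block ending at j
    for j in range(1, m + 1):
        for i in range(j):
            cost = best[i] + distinct_in_block(haplotypes, i, j)
            if cost < best[j]:
                best[j] = cost
                prev[j] = i
    # Walk the back-pointers to recover the block boundaries.
    blocks = []
    j = m
    while j > 0:
        blocks.append((prev[j], j))
        j = prev[j]
    blocks.reverse()
    return best[m], blocks


if __name__ == "__main__":
    # Toy data: three independent 2-SNP blocks, each with variants "00" and "11".
    from itertools import product

    haps = ["".join(p) for p in product(["00", "11"], repeat=3)]
    cost, blocks = best_partition(haps)
    print(cost, blocks)
```

On this toy input, treating all six SNPs as one block gives 8 distinct haplotypes, whereas the best partitions cost 6. Several partitions tie at that cost, so the blocks returned depend on the tie-breaking order.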


Keywords: haplotype, block, genotype, SNP, sub-population, stratification, algorithm, complexity




  1. Alon, N., Spencer, J.H.: The Probabilistic Method. John Wiley and Sons, Inc., Chichester (2000)
  2. Bafna, V., Halldorsson, B.V., Schwartz, R., Clark, A., Istrail, S.: Haplotypes and informative SNP selection algorithms: Don’t block out information. In: Proc. of RECOMB, pp. 19–27 (2003)
  3. Clark, A.: Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution 7(2), 111–122 (1990)
  4. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. MIT Press, Cambridge (1990)
  5. Daly, M.J., et al.: High-resolution haplotype structure in the human genome. Nature Genetics 29(2), 229–232 (2001)
  6. Eskin, E., Halperin, E., Karp, R.M.: Large scale reconstruction of haplotypes from genotype data. In: Proc. of RECOMB, pp. 104–113 (2003)
  7. Gabriel, S.B., et al.: The structure of haplotype blocks in the human genome. Science 296, 2225–2229 (2002)
  8. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., San Francisco (1979)
  9. Kruglyak, L., Nickerson, D.A.: Variation is the spice of life. Nature Genetics 27, 234–236 (2001)
  10. Gusfield, D.: Inference of haplotypes in samples of diploid populations: Complexity and algorithms. Journal of Computational Biology 8(3), 305–323 (2001)
  11. Gusfield, D.: Haplotyping by pure parsimony. Technical Report UCDavis CSE-2003-2. To appear in the Proceedings of the 2003 Combinatorial Pattern Matching Conference (2003)
  12. Halldorsson, B.V., et al.: Combinatorial problems arising in SNP. In: Calude, C.S., Dinneen, M.J., Vajnovszki, V. (eds.) DMTCS 2003. LNCS, vol. 2731. Springer, Heidelberg (2003)
  13. Hubbell, E.: Finding a parsimony solution to haplotype phase is NP-hard. Personal communication
  14. Koivisto, M., et al.: An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries. In: Proc. PSB 2003 (2003)
  15. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1965)
  16. Patil, N., et al.: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 1719–1723 (2001)
  17. Sachidanandam, R., et al.: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 291, 1298–2302 (2001)
  18. Venter, C., et al.: The sequence of the human genome. Science 291, 1304–1351 (2001)
  19. Waterman, M.S.: Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, Boca Raton (1995)
  20. Zhang, K., Deng, M., Chen, T., Waterman, M.S., Sun, F.: A dynamic programming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci. USA 99(11), 7335–7339 (2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Gad Kimmel (1)
  • Roded Sharan (2)
  • Ron Shamir (1)

  1. School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
  2. International Computer Science Institute, Berkeley
