Allele Identification in Assembled Genomic Sequence Datasets

  • Protocol
Data Production and Analysis in Population Genomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 888)

Abstract

Allelic variation within species provides fundamental insights into the evolution and ecology of organisms, and information about this variation is becoming increasingly available in sequence datasets of multiple and/or outbred individuals. Unfortunately, identifying true allelic variants poses a number of challenges, given the presence of both sequencing errors and alleles from other closely related loci. We outline the key considerations involved in this process, including assessing the accuracy of allele resolution in sequence assembly, clustering of alleles within and among individuals, and identifying clusters that are most likely to correspond to true allelic variants of a single locus. Our focus is particularly on the case where alleles must be identified without a fully resolved reference genome, and where sequence depth information cannot be used to infer the putative number of loci sharing a sequence, such as in transcriptome or post-assembly datasets. Throughout, we provide information about publicly available tools to aid allele identification in such cases.

Acknowledgments

We thank MS Barker, LH Rieseberg, I Mayrose, and SP Otto for insightful discussions on this topic. We also thank Z Lai and LH Rieseberg for making available the multi-individual genomic datasets that prompted our interest in this area.

Author information

Corresponding author

Correspondence to Katrina M. Dlugosch.

Copyright information

© 2012 Springer Science+Business Media New York

About this protocol

Cite this protocol

Dlugosch, K.M., Bonin, A. (2012). Allele Identification in Assembled Genomic Sequence Datasets. In: Pompanon, F., Bonin, A. (eds) Data Production and Analysis in Population Genomics. Methods in Molecular Biology, vol 888. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-61779-870-2_12

  • DOI: https://doi.org/10.1007/978-1-61779-870-2_12

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-61779-869-6

  • Online ISBN: 978-1-61779-870-2

  • eBook Packages: Springer Protocols
