Abstract
Allelic variation within species provides fundamental insights into the evolution and ecology of organisms, and information about this variation is becoming increasingly available in sequence datasets of multiple and/or outbred individuals. Unfortunately, identifying true allelic variants poses a number of challenges, given the presence of both sequencing errors and alleles from other closely related loci. We outline the key considerations involved in this process, including assessing the accuracy of allele resolution in sequence assembly, clustering of alleles within and among individuals, and identifying clusters that are most likely to correspond to true allelic variants of a single locus. Our focus is particularly on the case where alleles must be identified without a fully resolved reference genome, and where sequence depth information cannot be used to infer the putative number of loci sharing a sequence, such as in transcriptome or post-assembly datasets. Throughout, we provide information about publicly available tools to aid allele identification in such cases.
Acknowledgments
We thank MS Barker, LH Rieseberg, I Mayrose, and SP Otto for insightful discussions on this topic. We also thank Z Lai and LH Rieseberg for making available multi-individual genomic datasets that prompted our interests in this area.
Copyright information
© 2012 Springer Science+Business Media New York
About this protocol
Cite this protocol
Dlugosch, K.M., Bonin, A. (2012). Allele Identification in Assembled Genomic Sequence Datasets. In: Pompanon, F., Bonin, A. (eds) Data Production and Analysis in Population Genomics. Methods in Molecular Biology, vol 888. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-61779-870-2_12
DOI: https://doi.org/10.1007/978-1-61779-870-2_12
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-61779-869-6
Online ISBN: 978-1-61779-870-2
eBook Packages: Springer Protocols