Allele Identification in Assembled Genomic Sequence Datasets

Dlugosch, Katrina M.; Bonin, Aurélie

doi:10.1007/978-1-61779-870-2_12

Part of the book series: Methods in Molecular Biology (MIMB, volume 888)

Abstract

Allelic variation within species provides fundamental insights into the evolution and ecology of organisms, and information about this variation is becoming increasingly available in sequence datasets of multiple and/or outbred individuals. Unfortunately, identifying true allelic variants poses a number of challenges, given the presence of both sequencing errors and alleles from other closely related loci. We outline the key considerations involved in this process, including assessing the accuracy of allele resolution in sequence assembly, clustering of alleles within and among individuals, and identifying clusters that are most likely to correspond to true allelic variants of a single locus. Our focus is particularly on the case where alleles must be identified without a fully resolved reference genome, and where sequence depth information cannot be used to infer the putative number of loci sharing a sequence, such as in transcriptome or post-assembly datasets. Throughout, we provide information about publicly available tools to aid allele identification in such cases.

Acknowledgments

We thank MS Barker, LH Rieseberg, I Mayrose, and SP Otto for insightful discussions on this topic. We also thank Z Lai and LH Rieseberg for making available multi-individual genomic datasets that prompted our interests in this area.

Author information

Authors and Affiliations

Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ, USA
Katrina M. Dlugosch
Laboratoire d’Ecologie Alpine, UMR CNRS 5553, Université Joseph Fourier, Grenoble, France
Aurélie Bonin

Corresponding author

Correspondence to Katrina M. Dlugosch.

Editor information

Editors and Affiliations

Laboratoire d'Ecologie Alpine, Université Grenoble I, CNRS-UMR5553, rue de la Piscine 2233, Grenoble Cedex 09, 38041, France
François Pompanon
Laboratoire d'Ecologie Alpine, Université Grenoble 1, Rue de la Piscine 2233, Grenoble, 38041, France
Aurélie Bonin

Cite this protocol

Dlugosch, K.M., Bonin, A. (2012). Allele Identification in Assembled Genomic Sequence Datasets. In: Pompanon, F., Bonin, A. (eds) Data Production and Analysis in Population Genomics. Methods in Molecular Biology, vol 888. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-61779-870-2_12

DOI: https://doi.org/10.1007/978-1-61779-870-2_12
Published: 24 April 2012
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-61779-869-6
Online ISBN: 978-1-61779-870-2
eBook Packages: Springer Protocols
