Short Read Alignment Using SOAP2

  • Bhavna Hurgobin
Part of the Methods in Molecular Biology book series (MIMB, volume 1374)


Next-generation sequencing (NGS) technologies have evolved rapidly over the last 5 years, and a single run now generates millions of short reads. Consequently, various sequence alignment algorithms have been developed to compare these reads to an appropriate reference so that important downstream analyses can be performed. SOAP2, from the SOAP series, is one of the most commonly used alignment programs for NGS data; it handles such data efficiently, combining low memory usage with fast alignment speed. This chapter describes the protocol for aligning short reads to a reference genome using SOAP2, and highlights the importance of using the built-in command-line options to tune the behavior of the algorithm according to the inputs and the desired results.

Key words

Next-generation sequencing; Short read alignment; Read mapping; Gapped alignment; Ungapped alignment; Burrows–Wheeler transform (BWT); Nucleotides; Mismatches; Repeats; Match mode; Seed length; Genome indexing; SNP prediction; Genomics; Structural variant



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. University of Queensland, St Lucia, Australia
  2. School of Plant Biology, University of Western Australia, Perth, Australia
