NGSPanPipe: A Pipeline for Pan-genome Identification in Microbial Strains from Experimental Reads

  • Conference paper
Infectious Diseases and Nanomedicine III

Part of the book series: Advances in Experimental Medicine and Biology ((AEMB,volume 1052))

Abstract

Recent advancements in sequencing technologies have decreased both the time and the cost of sequencing a whole bacterial genome. High-throughput Next-Generation Sequencing (NGS) technology has generated an enormous amount of data on microbial populations, publicly available across various repositories. As a consequence, it has become possible to study and compare the genomes of different bacterial strains within a species or genus in terms of evolution, ecology and diversity. Studying the pan-genome provides insights into the microevolution, global composition and diversity in virulence and pathogenesis of a species. It can also assist in identifying drug targets and proposing vaccine candidates. The effective analysis of these large genome datasets necessitates the development of robust tools. Current methods for developing a pan-genome do not accept raw reads from the sequencer directly; they require the reads to be preprocessed into an assembled protein/gene sequence file or a binary matrix of orthologous genes/proteins. We have designed an easy-to-use integrated pipeline, NGSPanPipe, which can identify the pan-genome directly from short reads. The output of the pipeline is compatible with other pan-genome analysis tools. We evaluated our pipeline against other methods for developing a pan-genome, i.e. reference-based assembly and de novo assembly, using simulated reads of Mycobacterium tuberculosis. The single-script pipeline (pipeline.pl) is applicable to all bacterial strains. It integrates multiple in-house Perl scripts and is freely accessible from https://github.com/Biomedinformatics/NGSPanPipe.
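The binary matrix of orthologous genes mentioned in the abstract can be illustrated with a small sketch. The Python below is not taken from NGSPanPipe (which is written in Perl), and the function name `classify_genes`, the gene labels and the toy matrix are all hypothetical. It only shows, under the common convention that 1 means "gene present in this strain" and 0 means "absent", how core, accessory, strain-specific and pan-genome gene sets could be derived from such a matrix.

```python
from typing import Dict, List, Set


def classify_genes(matrix: Dict[str, List[int]], n_strains: int) -> Dict[str, Set[str]]:
    """Split genes into pan, core, accessory and unique sets.

    `matrix` maps a gene label to one 0/1 value per strain
    (assumed convention: 1 = present, 0 = absent).
    """
    pan: Set[str] = set()
    core: Set[str] = set()
    accessory: Set[str] = set()
    unique: Set[str] = set()

    for gene, row in matrix.items():
        if len(row) != n_strains:
            raise ValueError(f"{gene}: expected {n_strains} values, got {len(row)}")
        present_in = sum(row)
        if present_in == 0:
            continue  # absent from every strain, so not part of the pan-genome
        pan.add(gene)
        if present_in == n_strains:
            core.add(gene)        # shared by all strains
        elif present_in == 1:
            unique.add(gene)      # found in exactly one strain
        else:
            accessory.add(gene)   # found in some, but not all, strains

    return {"pan": pan, "core": core, "accessory": accessory, "unique": unique}


# Hypothetical toy matrix: four genes across three strains.
toy_matrix = {
    "geneA": [1, 1, 1],
    "geneB": [1, 0, 1],
    "geneC": [0, 1, 0],
    "geneD": [0, 0, 0],
}

result = classify_genes(toy_matrix, n_strains=3)
for category in ("pan", "core", "accessory", "unique"):
    print(category, sorted(result[category]))
```

In this toy case the pan-genome is the union of every gene seen in at least one strain, while the core genome is the subset present in all three. A real matrix would have one row per orthologous gene cluster and one column per sequenced strain.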

Acknowledgments

PK and AK acknowledge the financial support from the Indian Council of Medical Research. UK thanks the UGC for the grant of a fellowship. The authors thank Dr. Amit Katiyar for discussions. PK is grateful for the invitation to present her work at the ‘Second International Conference on Infectious Diseases and Nanomedicine—2015’ held during 15–18 December 2015 in Kathmandu, Nepal.

Author information

Corresponding author

Correspondence to Punit Kaur.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kulsum, U., Kapil, A., Singh, H., Kaur, P. (2018). NGSPanPipe: A Pipeline for Pan-genome Identification in Microbial Strains from Experimental Reads. In: Adhikari, R., Thapa, S. (eds) Infectious Diseases and Nanomedicine III. Advances in Experimental Medicine and Biology, vol 1052. Springer, Singapore. https://doi.org/10.1007/978-981-10-7572-8_4
