NGSPanPipe: A Pipeline for Pan-genome Identification in Microbial Strains from Experimental Reads

  • Umay Kulsum
  • Arti Kapil
  • Harpreet Singh
  • Punit Kaur (corresponding author)
Conference paper
Part of the Advances in Experimental Medicine and Biology book series (AEMB, volume 1052)


Recent advances in sequencing technologies have reduced both the time and the cost of sequencing a whole bacterial genome. High-throughput Next-Generation Sequencing (NGS) has generated an enormous volume of data on microbial populations, publicly available across various repositories. As a consequence, it has become possible to study and compare the genomes of different bacterial strains within a species or genus in terms of evolution, ecology and diversity. Studying the pan-genome provides insight into the microevolution, global composition and diversity in virulence and pathogenesis of a species. It can also assist in identifying drug targets and proposing vaccine candidates. The effective analysis of these large genome datasets necessitates the development of robust tools. Current methods for constructing a pan-genome do not accept raw reads from the sequencer directly; they require the reads to be preprocessed into an assembled protein/gene sequence file or a binary matrix of orthologous genes/proteins. We have designed an easy-to-use integrated pipeline, NGSPanPipe, which identifies the pan-genome directly from short reads. The output of the pipeline is compatible with other pan-genome analysis tools. We evaluated our pipeline against other methods for developing a pan-genome, i.e. reference-based assembly and de novo assembly, using simulated reads of Mycobacterium tuberculosis. The single script pipeline ( is applicable for all bacterial strains. It integrates multiple in-house Perl scripts and is freely accessible from
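To make the "binary matrix of orthologous genes/proteins" concrete, the following is a minimal sketch of how such a presence/absence matrix can be split into a core genome (genes present in every strain) and an accessory genome (genes present in only some strains). It is an illustration only and is not the NGSPanPipe implementation, which is written in Perl and starts from raw reads. The function name `partition_pan_genome` and the gene and strain labels are hypothetical.

```python
from typing import Dict, List


def partition_pan_genome(presence: Dict[str, Dict[str, int]]) -> Dict[str, List[str]]:
    """Split a gene presence/absence matrix into pan, core and accessory genes.

    `presence` maps a gene name to {strain name: 1 if present, 0 if absent}.
    A strain missing from a gene's row is treated as absent (assumption).
    """
    # Collect every strain mentioned anywhere in the matrix.
    strains = sorted({s for row in presence.values() for s in row})
    core: List[str] = []
    accessory: List[str] = []
    for gene in sorted(presence):
        row = presence[gene]
        n_present = sum(1 for s in strains if row.get(s, 0))
        if n_present == 0:
            continue  # gene found in no strain: not part of the pan-genome
        if n_present == len(strains):
            core.append(gene)       # present in every strain
        else:
            accessory.append(gene)  # present in some strains only
    return {"pan": sorted(core + accessory), "core": core, "accessory": accessory}


# Hypothetical toy matrix: four genes across three strains.
matrix = {
    "geneA": {"strain1": 1, "strain2": 1, "strain3": 1},
    "geneB": {"strain1": 1, "strain2": 0, "strain3": 1},
    "geneC": {"strain1": 0, "strain2": 0, "strain3": 1},
    "geneD": {"strain1": 0, "strain2": 0, "strain3": 0},
}
result = partition_pan_genome(matrix)
print(result["core"], result["accessory"], result["pan"])
```

In this toy input, only `geneA` is shared by all three strains, so it alone forms the core; `geneD` is absent everywhere and is excluded from the pan-genome.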


Keywords: Next-generation sequencing; Pan-genome; Core genome; Accessory genome; Bacterial species; Short reads



PK and AK acknowledge the financial support from Indian Council of Medical Research. UK thanks the UGC for grant of the fellowship. The authors thank Dr. Amit Katiyar for discussions. PK is grateful for the invitation to present her work at the ‘Second International Conference on Infectious Diseases and Nanomedicine—2015’ held during 15–18 December 2015 in Kathmandu, Nepal.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Umay Kulsum (1)
  • Arti Kapil (2)
  • Harpreet Singh (3)
  • Punit Kaur (1), corresponding author
  1. Department of Biophysics, All India Institute of Medical Sciences, New Delhi, India
  2. Department of Microbiology, All India Institute of Medical Sciences, New Delhi, India
  3. Indian Council of Medical Research, New Delhi, India
