NGSPanPipe: A Pipeline for Pan-genome Identification in Microbial Strains from Experimental Reads

Kulsum, Umay; Kapil, Arti; Singh, Harpreet; Kaur, Punit

doi:10.1007/978-981-10-7572-8_4

Part of the book series: Advances in Experimental Medicine and Biology (AEMB, volume 1052)

Abstract

Recent advances in sequencing technologies have reduced both the time and the cost of sequencing a whole bacterial genome. High-throughput Next-Generation Sequencing (NGS) technology has generated an enormous volume of data on microbial populations, publicly available across various repositories. As a consequence, it has become possible to study and compare the genomes of different bacterial strains within a species or genus in terms of evolution, ecology and diversity. Studying the pan-genome provides insights into the microevolution, global composition and diversity in virulence and pathogenesis of a species. It can also assist in identifying drug targets and proposing vaccine candidates. The effective analysis of these large genome datasets necessitates the development of robust tools. Current methods for constructing a pan-genome do not accept raw reads directly from the sequencer; they require the reads to be preprocessed into an assembled protein/gene sequence file or a binary matrix of orthologous genes/proteins. We have designed an easy-to-use integrated pipeline, NGSPanPipe, which can directly identify the pan-genome from short reads. The output from the pipeline is compatible with other pan-genome analysis tools. We evaluated our pipeline against other methods for developing a pan-genome, i.e. reference-based assembly and de novo assembly, using simulated reads of Mycobacterium tuberculosis. The single-script pipeline (pipeline.pl) is applicable to all bacterial strains. It integrates multiple in-house Perl scripts and is freely accessible from https://github.com/Biomedinformatics/NGSPanPipe.
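
To make the "binary matrix of orthologous genes/proteins" mentioned in the abstract concrete, the following is a minimal Python sketch of how such a presence/absence matrix can be split into core, accessory and unique genes. It is an illustration only and is not taken from NGSPanPipe: the function name `classify_pan_genome`, the gene labels and the toy matrix are all hypothetical, and the three-way classification follows the common convention in pan-genome analysis rather than anything stated in the chapter.

```python
from typing import Dict, List


def classify_pan_genome(matrix: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Split genes into core, accessory and unique sets.

    `matrix` maps a gene name to a row of 0/1 flags, one flag per strain
    (1 = gene present in that strain). Hypothetical helper for illustration.
    """
    groups: Dict[str, List[str]] = {"core": [], "accessory": [], "unique": []}
    for gene, row in matrix.items():
        present = sum(row)
        if present == 0:
            continue  # gene absent from every strain: not part of the pan-genome
        if present == len(row):
            groups["core"].append(gene)       # present in all strains
        elif present == 1:
            groups["unique"].append(gene)     # present in exactly one strain
        else:
            groups["accessory"].append(gene)  # present in some, but not all
    return groups


# Toy matrix (made-up data): four genes across three strains.
toy_matrix = {
    "geneA": [1, 1, 1],
    "geneB": [1, 0, 1],
    "geneC": [0, 1, 0],
    "geneD": [0, 0, 0],
}

groups = classify_pan_genome(toy_matrix)
# The pan-genome is the union of all three groups.
pan_genome = sorted(g for genes in groups.values() for g in genes)
print(groups)
print(pan_genome)
```

In this convention the pan-genome is the union of the core, accessory and unique gene sets, so a gene absent from every strain is simply left out.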

Acknowledgments

PK and AK acknowledge financial support from the Indian Council of Medical Research. UK thanks the UGC for the grant of a fellowship. The authors thank Dr. Amit Katiyar for discussions. PK is grateful for the invitation to present her work at the ‘Second International Conference on Infectious Diseases and Nanomedicine—2015’, held 15–18 December 2015 in Kathmandu, Nepal.

Author information

Authors and Affiliations

Department of Biophysics, All India Institute of Medical Sciences, New Delhi, India
Umay Kulsum & Punit Kaur
Department of Microbiology, All India Institute of Medical Sciences, New Delhi, India
Arti Kapil
Indian Council of Medical Research, New Delhi, 110029, India
Harpreet Singh

Corresponding author

Correspondence to Punit Kaur.

Editor information

Editors and Affiliations

Research Center for Applied Science and Technology (RECAST), Tribhuvan University, Kathmandu, Nepal
Rameshwar Adhikari
Department of Microbiology, Immunology and Genetics, Graduate School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas, USA
Santosh Thapa

Cite this paper

Kulsum, U., Kapil, A., Singh, H., Kaur, P. (2018). NGSPanPipe: A Pipeline for Pan-genome Identification in Microbial Strains from Experimental Reads. In: Adhikari, R., Thapa, S. (eds) Infectious Diseases and Nanomedicine III. Advances in Experimental Medicine and Biology, vol 1052. Springer, Singapore. https://doi.org/10.1007/978-981-10-7572-8_4

DOI: https://doi.org/10.1007/978-981-10-7572-8_4
Published: 22 May 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7571-1
Online ISBN: 978-981-10-7572-8
eBook Packages: Biomedical and Life Sciences (R0)
