PathOGiST: A Novel Method for Clustering Pathogen Isolates by Combining Multiple Genotyping Signals

  • Mohsen Katebi
  • Pedro Feijao
  • Julius Booth
  • Mehrdad Mansouri
  • Sean La
  • Alex Sweeten
  • Reza Miraskarshahi
  • Matthew Nguyen
  • Johnathan Wong
  • William Hsiao
  • Cedric Chauve
  • Leonid Chindelevitch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12099)


In this paper we study the problem of clustering bacterial isolates into epidemiologically related groups from next-generation sequencing data. Existing methods for this problem mainly use a single genotyping signal, and either use a distance-based method with a pre-specified number of clusters, or a phylogenetic tree-based method with a pre-specified threshold. We propose PathOGiST, an algorithmic framework for clustering bacterial isolates by leveraging multiple genotypic signals and calibrated thresholds. PathOGiST uses different genotypic signals, clusters the isolates based on these individual signals with correlation clustering, and combines the clusterings based on the individual signals through consensus clustering. We implemented and tested PathOGiST on three different bacterial pathogens - Escherichia coli, Yersinia pseudotuberculosis, and Mycobacterium tuberculosis - and we conclude by discussing further avenues to explore.
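The pipeline described above (cluster each signal separately, then merge the results) can be illustrated with a small sketch. This is not the PathOGiST implementation: it swaps in a simple greedy pivot heuristic for correlation clustering, uses an unweighted majority vote for the consensus step, and relies on hand-picked thresholds rather than calibrated ones. The signal names (`snp`, `mlst`, `cnv`), the distance values, and all function names are hypothetical and chosen only for illustration.

```python
def pivot_correlation_clustering(items, similar):
    """Greedy pivot heuristic (an assumption, not the paper's solver):
    take the first unclustered item as pivot and group it with every
    remaining item that `similar` marks as alike."""
    remaining = list(items)
    clusters = []
    while remaining:
        pivot = remaining[0]
        cluster = [pivot] + [x for x in remaining[1:] if similar(pivot, x)]
        clusters.append(cluster)
        remaining = [x for x in remaining if x not in cluster]
    return clusters


def cluster_by_threshold(dist, threshold):
    """Cluster one genotyping signal: a pair of isolates counts as
    similar when its distance is strictly below `threshold`.
    `dist` maps sorted pairs (a, b) with a < b to a distance."""
    items = sorted({i for pair in dist for i in pair})

    def similar(a, b):
        return dist[(min(a, b), max(a, b))] < threshold

    return pivot_correlation_clustering(items, similar)


def consensus(clusterings):
    """Combine clusterings: two isolates count as similar when a strict
    majority of the input clusterings place them in the same cluster."""
    items = sorted({x for c in clusterings for cl in c for x in cl})
    labels = [{x: k for k, cl in enumerate(c) for x in cl} for c in clusterings]

    def similar(a, b):
        votes = sum(lab[a] == lab[b] for lab in labels)
        return 2 * votes > len(labels)

    return pivot_correlation_clustering(items, similar)


# Made-up pairwise distances for four isolates under three hypothetical signals.
snp = {("A", "B"): 2, ("A", "C"): 40, ("A", "D"): 45,
       ("B", "C"): 38, ("B", "D"): 41, ("C", "D"): 3}
mlst = {("A", "B"): 0, ("A", "C"): 5, ("A", "D"): 6,
        ("B", "C"): 5, ("B", "D"): 6, ("C", "D"): 1}
cnv = {("A", "B"): 1, ("A", "C"): 1, ("A", "D"): 9,
       ("B", "C"): 2, ("B", "D"): 9, ("C", "D"): 8}

# Hand-picked thresholds (illustrative only, not calibrated values).
per_signal = [
    cluster_by_threshold(snp, 10),
    cluster_by_threshold(mlst, 2),
    cluster_by_threshold(cnv, 3),
]
combined = consensus(per_signal)
print(per_signal)
print(combined)
```

In this toy input the third signal disagrees with the other two about isolate C, and the majority vote resolves the disagreement in favour of the two concordant signals. A weighted vote would be a natural extension when some signals are considered more reliable than others.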


Keywords: Bacterial pathogens · Whole-genome sequencing · Correlation clustering · Microbiology · Public health



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Mohsen Katebi (1, corresponding author)
  • Pedro Feijao (1)
  • Julius Booth (1)
  • Mehrdad Mansouri (1)
  • Sean La (1)
  • Alex Sweeten (1)
  • Reza Miraskarshahi (1)
  • Matthew Nguyen (1)
  • Johnathan Wong (1)
  • William Hsiao (2)
  • Cedric Chauve (1)
  • Leonid Chindelevitch (1)

  1. Simon Fraser University, Burnaby, Canada
  2. British Columbia Centre for Disease Control, Vancouver, Canada
