1 Introduction

Partitioning the isolates of a bacterial pathogen into epidemiologically related groups is an important challenge in public health microbiology. Specifically, such a partitioning, which we will refer to as a clustering, can provide information on particularly transmissible strains (super-spreaders) and identify where an intervention such as active case finding may be particularly beneficial. In combination with additional metadata, such as geography or time of observation, such a clustering can also help identify rapidly growing groups (transmission hotspots), narrow down the potential origins of an outbreak (index case), and distinguish between recent and historical transmissions.

The clustering problem can leverage a variety of genotypic signals. Historically, fairly coarse genotypes such as VNTR (variable-number of tandem repeats, i.e. the number of copies of a set of pre-specified repeated regions in a strain) [29], PFGE (pulsed field gel electrophoresis) [15] and MLST (multi-locus sequence type, i.e. the alleles at a small number of pre-specified housekeeping genes) [18] have been the predominant mode of genotyping bacterial pathogens. These low-resolution signals, which we refer to as “fingerprints”, could lead to incorrectly clustered strains [1] since unrelated bacterial isolates may happen to share identical fingerprints. With the advent of next-generation sequencing (NGS) [17], new genotypic signals have become available. These include SNP (single-nucleotide polymorphism) profiles [6], which can be identified at the whole-genome scale, and also wgMLST (whole-genome multi-locus sequence type) [19], which contains the alleles at all of the known genes in the organism of interest.

Methodologically, existing approaches fall into one of two categories. Some methods - including those inspired by and used in metagenomics [24] - use a pure distance-based approach, whereby a sequence similarity cutoff threshold is chosen, and any pair of sequences whose similarity exceeds it are considered to be in the same cluster, with a transitive closure operator applied to ensure the result is a valid partition. Alternatively, such methods may simply apply a standard clustering method, such as hierarchical clustering, to the pairwise distance matrix; in this case, the number of clusters is typically specified in advance [5]. Other methods - which tend to be more computationally expensive - leverage a phylogenetic tree reconstructed from the data to define clusters [2, 11]. They also typically require a similarity threshold, but may be less sensitive to outlier isolates or to homoplasy, i.e. convergent evolution.

The majority of existing approaches for clustering bacterial isolates use a single genotypic signal, typically one of the higher-resolution ones, in isolation [12]. However, in this paper we argue for the principled combination of both low-resolution as well as high-resolution genotypic signals. The framework we propose here, called PathOGiST, innovates in several key ways. First, it leverages multiple genotypic signals extracted from NGS data. They can be further subdivided according to granularity into coarse and fine signals; the former get penalized only for grouping together isolates with different genotypes, not for splitting isolates with similar genotypes, while the latter get penalized for both of these. Second, it is based on a distance threshold, but does not apply a transitive closure operator to the similarity graph, or require a pre-specified number of clusters. Instead, it makes use of the correlation clustering paradigm, which tries to minimize the number of pairs of distant isolates within clusters while minimizing the number of pairs of close isolates between clusters. Third, it can be calibrated to different bacterial pathogens and genotyping signals, although we also provide an automatic threshold detector based on the distribution of pairwise distances between isolates.

Our results demonstrate that, when applied to a selection of three bacterial pathogens with annotated datasets publicly available - Escherichia coli, Yersinia pseudo-tuberculosis, and Mycobacterium tuberculosis - PathOGiST performs with a higher accuracy than recently published existing methods in most cases, both in terms of its ARI (adjusted Rand index) as well as CP (cluster purity). Our paper establishes that the use of calibrated thresholds and multiple genotypic signals can lead to an accurate clustering of bacterial isolates for public health epidemiology.

2 Methods

The goal of our approach is to cluster pathogen isolates from whole-genome sequencing data by using different genotyping approaches, alone and in combination. Each cluster should ideally represent a set of isolates related by an epidemiological transmission chain. We assume that we are given as input several matrices recording the pairwise distances between the isolates, one per genotyping signal. The algorithm proceeds in two stages. We first compute a clustering of the isolates for each distance matrix, and then compute a consensus of these separate clusterings. For the first step, we rely on correlation clustering [3], which we describe in Sect. 2.1. For the second step, we use a modified approach to the consensus clustering problem [4], also based on a correlation clustering formulation, described in Sect. 2.2. The whole process is illustrated in Fig. 1.

Fig. 1.
figure 1

PathOGiST starts by computing clusters based on single distance signals using correlation clustering. Then we run consensus clustering on the outputs of the correlation clustering.

2.1 Correlation Clustering

Let G be an undirected complete weighted graph with vertices V and edges E. Let \(W: E \rightarrow \mathbb {R}\) be the edge-weighting function, which is positive for edges connecting vertices (representing isolates) that are similar and negative for those connecting dissimilar vertices. Correlation clustering aims to partition the vertices into disjoint clusters \(C_1, C_2, \dots , C_N\) where \(N\le n\). Let I be the set of edges whose endpoints lie in the same cluster and let \({J} = E - {I}\) be the set of edges whose endpoints lie in different clusters. The goal of the minimum correlation clustering problem is to find a clustering that minimizes the total weight of the edges in I with negative weight minus the total weight of the edges in J with positive weight:

$$\begin{aligned} \begin{array}{c} {\arg \min }\\ {C_1, C_2, \dots , C_N} \end{array} \sum _{\begin{array}{c} e \in {I}\\ W(e) < 0 \end{array}} W(e) - \sum _{\begin{array}{c} e \in {J}\\ W(e) > 0 \end{array}} W(e) \end{aligned}$$

In this work, we perform the construction of the weighted graph G from a distance matrix. Given a distance matrix D on the input elements (graph vertices) such that \(d_{ij}\) is the distance between elements i and j, we define \(s_{ij} = T - d_{ij}\), where T is a distance threshold, intuitively meaning that if \(d_{ij} < T\), i and j are considered similar, while \(d_{ij} > T\) means that i and j are considered dissimilar. We use \(s_{ij}\) as the weight of the edge between vertices i and j in G.

By defining binary variables \(x_{ij}\) such that \(x_{ij} = 0\) if i and j are in the same cluster and \(x_{ij} = 1\) otherwise, we can write the minimum correlation clustering objective function as

$$\begin{aligned} f(x) = \sum _{s_{ij} > 0} s_{ij} x_{ij} - \sum _{s_{ij}<0} s_{ij}(1-x_{ij}) = \sum s_{ij} x_{ij} - \sum _{s_{ij} <0} s_{ij}. \end{aligned}$$

Since the second term is constant, the minimum correlation clustering problem can be solved optimally with the following Integer Linear Program (ILP):

$$\begin{aligned} \underset{x}{\text {minimize}} \quad \sum s_{ij} x_{ij} \end{aligned}$$
(1)
$$\begin{aligned} \text {s.t.} \qquad&x_{ik} \le x_{ij} + x_{jk} \quad \text {for all }i,j,k \\&x_{ij} \in \{0,1\} \quad \text {for all }i,j \end{aligned}$$

Here, the inequality constraints (which we call the “triangle inequality” constraints) together with the binary constraints ensure that the assignment is transitive [3]. Indeed, if \(x_{ij} = 0\) and \(x_{jk} = 0\), they enforce that \(x_{ik} = 0\).

C4: A Fast Parallel Heuristic for the Correlation Clustering Problem. Solving the ILP in Eq. (1) can be time consuming and can require a large amount of memory due to the quadratic number of variables and cubic number of constraints. For this reason, we additionally implemented the faster C4 algorithm, a parallel algorithm that guarantees a 3-approximation ratio of the optimal objective function of correlation clustering in the special case of metric distances (i.e. when the \(s_{ij}\) satisfy the triangle inequality \(s_{ik} \le s_{ij} + s_{jk}\)) [26].

Our results show that this algorithm is remarkably fast and quite accurate on the input graphs we tested. However, it is non-deterministic, as it depends on the initial permutation of the vertices. With this in mind, we run C4 multiple times with random initial permutations and compute the objective value for each solution. Then, among those solutions, we choose the one that minimizes the objective function. Our experiments show that in practice, this works very well and most of the time is able to find the optimum or near-optimum solution.

Solving the Minimum Correlation Clustering Problem Exactly. In order to solve the minimum correlation clustering problem exactly, while coping with the cubic number of linear constraints, we employed two approaches.

First, recalling that we often get a near-optimum solution from C4, we use it as a warm start to the problem by supplying it to the ILP solver.

Second, rather than creating the ILP with all the constraints right away, we iteratively add the constraints as follows. According to Eq. (1), for every index triple (ijk) the ILP has 3 constraints on the decision variables \(x_{ij}\), \(x_{jk}\), \(x_{ik}\):

$$ x_{ik} \le x_{ij} + x_{jk},\ x_{ij} \le x_{ik} + x_{jk},\ x_{jk} \le x_{ik} + x_{ij}. $$

To provide the intuition for our second heuristic, assume that all three similarities between elements ijk are positive. This implies that the elements i, j, and k are similar to each other, so are more likely to belong to the same cluster. In this case, the three variables \(x_{ik}\), \(x_{ij}\), and \(x_{jk}\) will likely be assigned the value 0 and satisfy the inequalities. On the other hand, all three similarities being negative implies that elements i, j, and k are likely to be in different clusters, which would set these variables to 1 and again satisfy the inequalities.

Taking this into account, we use an approach inspired by constraint generation [8], and start by only including constraints induced by element triples whose set of similarities contain both positive and negative edges and solve this trimmed-down ILP. We then check all the excluded constraints in the solution to see whether any of them is violated. If none is violated, then the current solution is also an optimum solution for the original ILP and we are done. Otherwise, we add all the constraints that are not satisfied by the current solution to the ILP and solve the modified ILP again. We repeat this process until no violated constraint remains.

In most experiments (225 out of 235 experiments), we observe that no violated constraints have been found. Almost all of the other cases only required one extra iteration to find a solution that satisfied all the constraints. The average number of iterations was 1.102.

2.2 Consensus Clustering

Given a set of clusterings and a measure of distance between clusterings, the consensus clustering problem aims to find a clustering minimizing the total distance to all input clusterings. A simple distance between two clusterings \(\pi _1\) and \(\pi _2\) is the number of elements clustered differently in \(\pi _1\) and \(\pi _2\), that is, the number of pairs of elements co-clustered in \(\pi _1\) but not co-clustered in \(\pi _2\), or vice versa.

Representing a clustering x by a quadratic number of binary variables (\(x_{ij} = 0\) if and only if ij are co-clustered), the distance between x and a clustering \(\pi \) is given by the formula

$$\begin{aligned} d(x,\pi ) = \sum _{\pi _{ij} = 1} w_{ij} (1 - x_{ij}) + \sum _{\pi _{ij} = 0} w_{ij} x_{ij} = \sum _{i,j} s_{ij} x_{ij} + \sum _{x_{ij}=1} w_{ij}, \end{aligned}$$
(2)

where a weight \(w_{ij}\) is assigned to each pair of elements i and j penalizing any clustering decision in x that differs from \(\pi \), and we define \(s_{ij} := (-1)^{\pi _{ij}}w_{ij}\).

Notice that solving the minimum consensus problem for a given set of clusterings \(\pi ^{(1)},\dots , \pi ^{(n)}\) is equivalent to solving a minimum correlation clustering problem with the matrix S defined as

$$\begin{aligned} s_{ij} = \sum _{\left\{ k | \pi ^{(k)}_{ij} = 0 \right\} } w^{(k)}_{ij} - \sum _{\left\{ k | \pi ^{(k)}_{ij} = 1 \right\} } w^{(k)}_{ij} = \sum _{k=1}^n (-1)^{\pi ^{(k)}_{ij}}w^{(k)}_{ij} \end{aligned}$$
(3)

Consensus Clustering with Different Granularities. An important feature of our problem is that the different genotyping signals we consider might not cluster the isolates with the same granularity. For example, it was shown in [22] that when clustering Mycobacterium tuberculosis isolates using SNPs, MLST, CNVs and spolygotypes, the latter two genotyping signals result in coarser clusters than the former two. For this reason, we assume that the input clusterings can be of different granularities. In this setting, we want to avoid penalizing the differences between a finer clustering \(\pi \) and a coarser clustering \(\pi '\), and we introduce the following asymmetric distance: \(d(\pi ,\pi ') = | \pi - \pi ' |\). In this case, assuming \(\pi \) is the coarser clustering and \(\pi '\) the finer one, we penalize only those pairs that are co-clustered in \(\pi \) but not in \(\pi '\).

Then, given the clusterings \(\pi ^{(1)}, \dots , \pi ^{(m)}\) and a subset F of these clusterings, representing the clusterings with the finer resolution, the finest consensus clustering problem is to find a clustering x that minimizes the total distance between x and all input clusterings, where

$$\begin{aligned} d(x,\pi ) = {\left\{ \begin{array}{ll} \displaystyle \sum _{\pi _{ij} = 1} w_{ij}(1 - x_{ij}) + \sum _{\pi _{ij} = 0}w_{ij} x_{ij},&{} \text {if } \pi \in F\\ \displaystyle \sum _{\pi _{ij} = 0}w_{ij} x_{ij}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

We can then reformulate this problem as a minimum correlation clustering again, with matrix S defined by

$$\begin{aligned} s_{ij} = \sum _{\left\{ k | \pi ^{(k)}_{ij} = 0 \right\} } w^{(k)}_{ij} - \sum _{\left\{ k | \pi ^{(k)}_{ij} = 1, \pi ^{(k)} \in F\right\} } w^{(k)}_{ij} \end{aligned}$$
(5)

Selecting Appropriate Weights for Consensus Clustering. There might be many meaningful ways of defining the weights \(w_{ij}^{(k)}\) used in the previous equations. If we assume that a clustering \(\pi \) was inferred based on a distance matrix D, normalized such that \(0 \le d_{ij} \le 1\), we can define \(w_{ij}\) as

$$\begin{aligned} w_{ij} = {\left\{ \begin{array}{ll} d_{ij},&{} \text {if } \pi _{ij}=1\\ 1-d_{ij}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

The reasoning behind this definition is that if \(\pi _{ij}=1\) (ij are not co-clustered in \(\pi \)), then the distance \(d_{ij}\) should be large, therefore it is a good penalty for co-clustering ij in x. On the other hand, if \(\pi _{ij}=0\), \(d_{ij}\) can be expected to be small, which means that \(1-d_{ij}\) is a better candidate for the penalty of choosing \(x_{ij}=1\). The distance between two clusterings (Eq. (2)) can then be written as

$$\begin{aligned} d(x,\pi ) = \sum _{\pi _{ij} = 1} d_{ij} (1 - x_{ij}) + \sum _{\pi _{ij} = 0} (1-d_{ij}) x_{ij} \end{aligned}$$
(7)

and Eq. (3) becomes

$$\begin{aligned} s_{ij} = \sum _{\{k | \pi ^{(k)}_{ij} = 0 \}} \left( 1-d^{(k)}_{ij}\right) - \sum _{\{k | \pi ^{(k)}_{ij} = 1 \}} d^{(k)}_{ij} = \varPi _{ij} - D_{ij} \end{aligned}$$
(8)

where \(\varPi _{ij} = \biggl | \left\{ k | \pi ^{(k)}_{ij} = 0 \right\} \biggr |\) and \(D = \sum _{k=1}^n d_{ij}^{(k)}\).

We can naturally combine the weighting with the different granularities within a single formulation. In summary, the finest consensus clustering problem with weights can be formulated as a minimum correlation clustering problem, and thus solved by the algorithms described in Sect. 2.1.

2.3 Evaluation

To evaluate our methods for clustering, we compute two measures between our clustering and a ground truth clustering: Adjusted Rand Index (ARI) and Cluster Purity (CP).

The adjusted Rand index is a measure that computes how similar the clusters are to the ground truth. It is the corrected-for-chance version of the Rand index which is the percentage of correctly clustered elements. It can be computed using the following formula:

$$ ARI = \frac{\sum _{ij} \left( {\begin{array}{c}n_{ij}\\ 2\end{array}}\right) - \left[ \sum _{i} \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) \sum _{j} \left( {\begin{array}{c}b_j\\ 2\end{array}}\right) \right] / \left( {\begin{array}{c}n\\ 2\end{array}}\right) }{ \frac{1}{2}\left[ \sum _{i} \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) +\sum _{j} \left( {\begin{array}{c}b_j\\ 2\end{array}}\right) \right] - \left[ \sum _{i} \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) \sum _{j} \left( {\begin{array}{c}b_j\\ 2\end{array}}\right) \right] / \left( {\begin{array}{c}n\\ 2\end{array}}\right) } $$

where \(n_{ij}\), \(a_{i}\), \(b_{j}\) are values, row sums, and column sums from the contingency table [13].

Cluster Purity is another measure of similarity between two data clusterings. To compute, assign each cluster to the most common ground truth cluster in it. Then, count the number of correctly assigned data points and divide by the total number of data points. Formally:

$$ CP(C, G) = \frac{1}{N} \sum _{k} \max _{j} |c_{k} \cap g_{j}| $$

where N is the number of data points, \(C = \{ c_1, c_2, \ldots , c_K \}\) is the set of clusters and \(G = \{ g_1,g_2,\ldots ,g_J \}\) is the set of ground truth clusters [20].

3 Results

3.1 Datasets and Genotyping Methods

We used three published datasets for three pathogens, Escherichia coli [14], Mycobacterium tuberculosis [10], Yersinia pseudotuberculosis [30], and a simulated dataset taken from [16]. Several genotyping signals were extracted from the WGS data: multilocus sequence typing (MLST) using MentaLiST pipeline [7], single nucleotide polymorphisms (SNP) using Snippy [28], copy number variants (CNV) using Prince [21], k-mer weighted inner products (kWIP) using kWIP [23], and spacer oligonucleotide typing (Spoligotyping) using SpoTyping [31] (Table 1).

Table 1. Datasets and genotyping summary

For each genotyping signal, in order to apply our correlation clustering algorithm, we needed to determine a threshold T to decide which pairs of isolates should be considered similar. To do so, we consider the pairwise distance distribution for each signal, choosing a threshold range that covers the first valley in the distribution, under the assumption that the first peak likely indicates distances between isolates belonging to the same cluster. The resulting threshold ranges and steps are described in Appendix (Table 4).

In our experiments, for each sample, each signal and each threshold, we ran our two algorithms for solving the minimum correlation clustering problem, the C4 approximation algorithm (with multiple runs) and the exact ILP using delayed constraint generation. Then we ran the consensus clustering algorithm, again using both methods.

3.2 Single Signal Genotyping

The E. coli dataset contains 1509 isolates collected from across England and spans an 11-year period. The distance distribution for each genotyping signal (MLST, SNP, kWIP) is shown in the Appendix (Fig. 2).

The second dataset contains 163 isolates of Y. pseudotuberculosis mostly collected from New Zealand [30]. We applied the same genotyping methods as for the E. coli dataset to this one. The results are presented in Appendix (Fig. 3). For E. coli and Y. pseudotuberculosis, we consider the MLST groups determined in their respective studies [14, 30] as the ground truth, and use them to calculate the ARI and CP values. For the simulated data, since the authors suggest [16] BAPS for building the true phylogenetic tree and clustering, we used the RhierBAPS results as the ground truth for this dataset instead of MLST.

The isolates of the M. tuberculosis dataset were obtained from pediatric patients in British Columbia, Canada, and were collected between 2005 and 2014. We used a subset of 1377 isolates, all of which underwent WGS. In addition to SNP, MLST and kWIP information, we considered two additional genotyping signals, CNV and Spolygotyping. For M. tuberculosis, due to the lack of MLST groups, we use the strain’s lineage, a proxy for its geographic origin [27]. For this dataset, the ARI would not be informative since a lineage is a coarse grouping largely uninformative of the underlying epidemiology, and should be split into multiple clusters. Thus, we only calculate and report the CP in this case. The results for this dataset are illustrated in Appendix (Fig. 4).

We can observe that for all the pathogens and genotyping signals we consider, there is a relatively clear threshold that falls within the chosen range which results in high accuracy clusters, with ARI and CP values above 0.8, and often close to 1. The only exceptions concern the M. tuberculosis dataset with SNPs, MLST and kWIP, although the CP statistics is notoriously less robust than the ARI. Moreover, most of the time, around the best thresholds, the clustering obtained with the exact ILP method results in accuracy measures that are either very close to those of the C4 method or slightly better.

3.3 Comparison of the C4 and Exact ILP Methods

The ILP generally gives more accurate results and is a deterministic method. However, its running time and memory usage depend a lot on the size of the dataset. For example, while it is able to cluster the smaller Y. pseudotuberculosis dataset in less than a minute, it takes more than three hours to find clusters for larger datasets at some threshold values. On the other hand, the C4 heusritic is significantly faster and requires much less memory even on the larger datasets, as shown in Table 2. However, it is not deterministic, and random restarts may give slightly different, incompatible results. To evaluate the C4 heuristic performance, we compared the objective values of solutions found by C4 and by the exact ILP. In most cases, C4 performs very well and finds a solution whose objective value is close to the optimal (Table 2).

Furthermore, the objective value of CPLEX is affected by the tolerance parameter; when the gap between the lower and upper bound is less than a certain fraction \(\epsilon \), set to \(10^{-6}\) by default, the optimization is stopped. In this case, we see that because the magnitude of the objective function is fairly large, it is possible for the C4 method to obtain a better objective function than CPLEX.

Table 2. Average running time (in seconds) and memory footprint (in gigabytes); ILP and C4 objective value comparison.
Table 3. ARI (Adjusted Rand Index) and CP (Cluster Purity) computed for different methods and genotyping signals

3.4 Comparison with Existing Clustering Methods

The results from PathOGiST were compared to those generated by two recent methods developed for clustering WGS datasets, Phydelity [11] and TreeCluster [2]; both of them are based on phylogenetic trees. To infer a phylogeny for our datasets, we first calculated a pair-wise distance matrix using Mash [25], then we ran the popular and widely used BIONJ [9] variant of the neighbor joining algorithm on the distance matrix. After we inferred phylogenetic trees, we ran Phydelity and TreeCluster with their default settings. In order to pick a single threshold for each genotyping signal-pathogen combination in PathOGiST, we chose the threshold resulting in the best ARI (CP for M. tuberculosis) among all the options. These thresholds are set as the default thresholds for these genotyping signal-pathogen combinations, but can be overridden by the user. Table 5 (Appendix) shows the chosen optimal threshold for each dataset and genotyping signal.

Having clustering outputs of the single signal correlation clustering algorithm with chosen default thresholds, we ran consensus clustering for each pathogen with all their available genotyping signals. We considered SNP clustering as the finest because it provides a higher resolution signal comparing to other genotyping signals. The results are described in Table 3. The main observation is that in all cases, but M. tuberculosis, the consensus clustering ARI is close to the best ARI obtained by a single genotyping signal, showing that our approach indeed removes the need to chose a single signal for clustering.

4 Conclusion

In this paper we described PathOGiST, an algorithmic framework for clustering bacterial isolates. One of our key contributions is to introduce the paradigms of correlation clustering and consensus clustering for the analysis of bacterial pathogens, together with two implementations - one exact and one heuristic - of correlation clustering algorithms, tailored to the problem at hand. Our experimental results suggest that our approach allows to compute a very accurate, often close to optimal, clustering without having to determine an optimal genotyping signal.

In the future, we hope to address several challenges. The first issue is the risk of overfitting, as the calibration of the threshold relies on the correlation clustering results’ comparison to a gold standard. However, we also provide an automatic threshold detector. Our results demonstrate that our approach has the potential to provide reliable clusters.

Second, instead of a single output, a multi-scale or hierarchical representation of the clusters may be helpful in order to provide the user with the flexibility of deciding on their own clustering granularity. Moreover, some metadata, such as collection time or geographic location, may be fruitfully incorporated into the clustering approach in order to better inform the resolution of some groups of isolates.

Finally, due to the lack of existing tools for simulating multiple genotyping signals, we considered MLST as our gold standard. This may not represent the correct clustering, but is the best available among the individual genotyping signals.

Despite these challenges, we believe that PathOGiST is a first step in the right direction, and we hope that it will generate an impetus to further explore the problem of clustering bacterial isolates.