
1 Introduction

1.1 Motivation

Fully Homomorphic Encryption (FHE) schemes can in theory perform arbitrary computations on encrypted data. Since the discovery of FHE, many applications have been proposed, ranging from medical and financial to advertising scenarios. The underlying idea is mostly the same: Suppose Alice has some confidential data \(X \) which she would like to utilize, and Bob has an algorithm \(\mathcal {A}\) which he could apply to Alice’s data for money. Conventionally, either Alice would have to hand her confidential data to Bob, or run the algorithm herself, for which she may lack the know-how or computational power. FHE allows Alice to encrypt her data to \(C:=\text {Enc} (X)\) and send it to Bob. Bob can convert his algorithm \(\mathcal {A}\) into a function \(\mathcal {A}'\) over the ciphertext space and apply it to the encrypted data, resulting in \(R:=\mathcal {A}'(C)\). He can then send this result back to Alice, who can decrypt it with her secret key. FHE guarantees that indeed \(\text {Dec} (R)=\text {Dec} (\mathcal {A}'(\text {Enc} (X))) = \mathcal {A}(X)\). Since Alice’s data was encrypted the whole time, Bob learns nothing about the data entries. Note that keeping Bob’s algorithm secret from Alice is not traditionally guaranteed by FHE, but can in practice be achieved via a property called circuit privacy, in the sense that Alice learns nothing except the result \(\mathcal {A}(X)\).

One of the most popular applications of FHE has been Machine Learning, with many works focusing on Neural Networks and different variants of regression. To our knowledge, all works in this line are concerned with supervised learning. This means that there is a training set with known outcomes, and the algorithm tries to build a model that matches the desired outputs to the inputs as well as possible. When the training phase is done, the algorithm can be applied to new instances to predict unknown outcomes. However, there is a second branch in Machine Learning that has not been touched by FHE research: Unsupervised learning. For these kinds of algorithms, there are no labeled training examples, there is simply a dataset on which some kind of analysis shall be performed. An example of this is clustering, where the aim is to group data entries that are similar in some way. The number of clusters might be a parameter that the user enters, or it may be automatically selected by the algorithm. Clustering has numerous applications like genome sequence analysis, market research, medical imaging or social network analysis, to name a few, some of which inherently involve sensitive data – making a privacy-preserving evaluation with FHE even more interesting.

1.2 Contribution

In this work, we approach this unexplored branch of Machine Learning and show how to implement the \(K\)-Means-Algorithm, an important clustering algorithm, on encrypted data. We discuss the problems that arise when trying to evaluate the \(K\)-Means-Algorithm on encrypted data, and show how to solve them. To this end, we first present a natural encoding that allows the execution of the algorithm as it is (including the usually challenging division by an encrypted value), but is not optimal in terms of performance. We then present a modification to the \(K\)-Means-Algorithm that performs comparably in terms of clustering accuracy, but is much more FHE-friendly in that it avoids division by an encrypted value. We include another modification that trades accuracy for efficiency in the involved comparison operation, and compare the runtimes of these approaches.

2 Related Work

Encryption schemes that allow one type of operation on ciphertexts have been around for some time and have a comprehensive security characterization [3]. Fully Homomorphic Encryption however, which allows both unlimited additions and multiplications, was only first solved in [19]. Since then, many other schemes have been developed, for example [8, 12,13,14,15, 18, 20, 37], to name just a few. An overview can be found in [2]. There are several libraries offering FHE implementations, like [11, 16, 23], and the one we use, [38].

Machine Learning as an application of FHE was first proposed in [35], and subsequently there have been numerous works on the subject, to our knowledge all concerned with supervised learning. The most popular of these applications seem to be (Deep) Neural Networks (see [7, 10, 21, 26, 36]) and (Linear) Regression (e.g., [4, 17, 32] or [22]), though there is also some work on other algorithm classes like decision trees and random forests [41], or logistic regression ([5, 6, 29, 30]). In contrast, our work is concerned with the clustering problem from unsupervised Machine Learning.

The \(K\)-Means-Algorithm has been a subject of interest in the context of privacy-preserving computations for some time, but to our knowledge all previous works like [9, 24, 25, 31, 42] require interaction between several parties, e.g. via Multiparty Computation (MPC). For a more comprehensive overview of the \(K\)-Means-Algorithm in the context of MPC, we refer the reader to [34]. While this interactivity may certainly be a feasible requirement in many situations, and indeed MPC is likely to be faster than FHE in these cases, we feel that there are several reasons why a non-interactive solution as we present it is an important contribution.

  1. Client Economics: In MPC, the computation is split between different parties, each performing computations every round and combining the results. In FHE computations, the entire computation is performed by the service provider. Even if this computation on encrypted data is more expensive than the total MPC computation, the client reduces his effort to zero this way, making this solution attractive to him and thus generating a demand for it.

  2. Function Privacy: Imagine the \(K\)-Means-Algorithm in this paper as a placeholder for a more complex proprietary algorithm that the service provider executes on the client’s data as a service. This algorithm could utilize building blocks from the \(K\)-Means-Algorithm that we present in this paper, or involve the \(K\)-Means-Algorithm as a whole in the context of pipelining several algorithms together, or be something completely new. Here, the service provider would want to prevent the user from learning the details of this algorithm, as it is his business secret. While FHE per se does not guarantee this functionality, all schemes today fulfill the requirement of circuit privacy needed to achieve it. Thus for this case, FHE would be the preferred solution.

  3. Future Efficiency Gain: MPC is much older than FHE, and efficiency for the latter has increased by a factor of \(10^4\) in the last six years alone. To argue that MPC is faster and thus FHE solutions are superfluous seems premature at this point, and our contributions are not specific to any implementation, but work on all FHE schemes that support a \(\{0, 1\}\) plaintext space.

Also, many of these interactive solutions rely on a vertical (in [40]) or horizontal (in [28]) partitioning of the data for security. In contrast, FHE allows a non-interactive setting with a single database owner who wishes to outsource the computation.

3 Preliminaries

In this section, we cover underlying concepts like the \(K\)-Means-Algorithm, encoding issues, our choice of implementation library, and the datasets we use.

3.1 The K-Means Algorithm

The \(K\)-Means-Algorithm is one of the most well-known clustering algorithms in unsupervised learning. Published in [33], it is considered an important benchmark algorithm and is frequently the subject of current research to this day. It takes as input the data \(X = \{x _1,\dots , x _m \} \) and a number \(K\) of clusters to be used, and begins by selecting \(K\) data entries at random as so-called cluster centroids \(c _k \). Then, in a step called Cluster Assignment, it computes for each data entry \(x _i \) which cluster centroid \(c _k \) is nearest, and assigns the data entry to that centroid. When this has been done for all data entries, the second step begins: During the Move Centroids step, the cluster centroids are moved by setting each centroid to the average of all data entries that were assigned to it in the previous step. These two steps are repeated either for a set number of iterations \(T\) or until the centroids no longer change. We use the first criterion.
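For illustration, the following minimal plaintext sketch (our own Python rendering, not the pseudocode from Appendix A; all function names are ours, and the squared Euclidean distance is used for the assignment since taking the root would not change the ordering) shows the two alternating steps on unencrypted data:

```python
# Minimal plaintext sketch of the K-Means-Algorithm described above: K random data
# entries as initial centroids, then T rounds of Cluster Assignment and Move Centroids.
import random

def kmeans(X, K, T):
    # X: list of data entries (each a list of coordinates), K: number of clusters
    centroids = random.sample(X, K)
    for _ in range(T):
        # Cluster Assignment: index of the nearest centroid for every data entry
        assign = [min(range(K),
                      key=lambda k: sum((xi - ci) ** 2
                                        for xi, ci in zip(x, centroids[k])))
                  for x in X]
        # Move Centroids: each centroid becomes the average of its assigned entries
        for k in range(K):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:  # keep the old value if no entry was assigned to this centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids
```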

The output of the algorithm is either the values of the centroids, or the cluster assignment for the data entries (which can easily be computed from the former); we opt for the first option. The pseudocode for the algorithm as we use it can be found in Appendix A, along with a visualization. Accuracy can either be measured in terms of correctly classified data entries, which assumes that the correct classification is known (there might not even exist a unique best solution), or via the so-called cost function, which measures the (average) distance of the data entries to their assigned cluster centroids. We opt for the former, because our datasets are benchmarking sets for which the labels are indeed provided, and because it allows better comparability between the different algorithms.

3.2 Encoding

FHE schemes generally have finite fields as a plaintext space, and any rational numbers (which can be scaled to integers) must be embedded into this plaintext space. There are two main approaches in the literature, which we briefly compare side by side in Table 1. Note that for absolute value computation and comparison, we need to use the digitwise encoding.

Table 1. Two mainstream encoding approaches.

3.3 FHE Library Choice

In [27], it was shown that among all bases p for digitwise p-adic encoding in FHE computations, the choice \(p=2\) is best in terms of the number of additions and multiplications to be performed on the ciphertexts. Hence, we use an FHE scheme with a plaintext space of \(\{0, 1\}\). The currently fastest FHE implementation for this plaintext space, TFHE [38], states that “an optimal circuit for TFHE is most likely a circuit with the smallest possible number of gates” – thus, this library is a perfect choice for us, and we will use the binary encoding for signed integers and tweaks presented in [26] for maximum efficiency.

3.4 Datasets

To evaluate performance, we use four datasets from the FCPS dataset [39]:

  • The Hepta dataset consists of 212 data points of 3 dimensions. There are 7 clearly defined clusters.

  • The Lsun dataset is 2-dimensional with 400 entries and 3 classes. The clusters have different variances and sizes.

  • The Tetra dataset is comprised of 400 entries in 3 dimensions. There are 4 clusters, which almost touch.

  • The Wingnut dataset has only 2 clusters, which are side-by-side rectangles in 2-dimensional space. There are 1016 entries.

For accuracy measurements, each version of the algorithm was run 1000 times (with varying starting centroids) for number of iterations \(T =5,10,...,45,50\) on each dataset. For runtimes on encrypted data, we used the Lsun dataset.

4 Approach 1: Implementing the Exact \(K\)-Means-Algorithm

We now show a method of implementing the K-Means algorithm largely as it is. To this end, we first discuss challenges that arise in the context of FHE computation of this algorithm. We then address these challenges by changing the distance metric, and then present an encoding that supports the division required in computing the average in the MoveCentroid-step. As this method is in no way restricted to the \(K\)-Means-Algorithm, the result is of independent interest. As it turns out, there are some issues with this approach, which we will also discuss.

4.1 FHE Challenges

Fully homomorphic encryption schemes can easily compute additions and multiplications on the underlying plaintext space, and most also offer subtraction. Using these operations as building blocks, more complex functionalities can be obtained. However, there are three elements in the \(K\)-Means-Algorithm that pose challenges, as it is not immediately clear how to obtain them from these building blocks. We list these (with the line numbers referring to the pseudocode on page 20 in Appendix A.2) and quickly explain how we solve them.

  • The distance metric (Line 9, \(\varDelta (x,y)=||x - y||_2:=\sqrt{\sum _i (x_i-y_i)^2}\)): To our knowledge, taking the square root of encrypted data has not been implemented yet. In Sect. 4.2, we will argue that the Euclidean norm is an arbitrary choice in this context and solve this problem by using the \(L_1\)-distance \(\varDelta (x,y)=||x - y||_1:=\sum _i(\vert x_i - y_i\vert )\) instead of the Euclidean distance.

  • Comparison (Line 10, \(\tilde{\varDelta } < \varDelta \)) in finding the centroid with the smallest distance to the data entry: This has been constructed from bit multiplications and additions in [26] for bitwise encoding, so we view this issue as solved. A detailed explanation can be found in the extended version of this paper.

  • Division (Line 25, \(c _k =c _k/d _k \)) in computing the new centroid value as the average of the assigned data points: In FHE computations, division by an encrypted value is usually not possible (whereas division by an unencrypted value is no problem). We present a way of implementing the division with a new encoding in Sect. 4.3, and propose a modified version of the Algorithm in Sect. 5 that only needs division by a constant.

4.2 The Distance Metric

Traditionally, the distance measure used with the \(K\)-Means-Algorithm is the Euclidean distance \(\varDelta (x,y)=~||x - y||_2:=\sqrt{\sum _i (x_i-y_i)^2}\), also known as the \(L_2\)-norm, as it is analytically smooth and thus reasonably well-behaved. However, in the context of \(K\)-Means clustering, smoothness is irrelevant, and we may look to other distance metrics. Concretely, we consider the \(L_1\)-norm (also known as the Manhattan metric) \(\varDelta (x,y):=~\sum _i(\vert x_i - y_i\vert )\). This has considerable advantages over the Euclidean distance: First, we do not need to take a square root, which to our knowledge has not yet been achieved on encrypted data. Second, while one could apply the standard trick of not taking the root and working with the sum of squared distances instead, this would mean a considerable efficiency loss due to the numerous multiplications and the greatly increased bitlengths of their results. These long numbers are then summed up, and the result is input into the algorithm that finds the minimum (Algorithm 2 on page 12). These two steps already constitute bottlenecks in the entire computation when working with short numbers in the \(L_1\)-norm, so an increase in the bitlengths would greatly increase computation time.

Taking the absolute value can easily be achieved through a digit-wise encoding like the binary encoding which we use: We can use the MSB as the conditional (it is 1 if the number is negative and 0 if it is positive) and apply a multiplexer (MUX) gate to the value and its negation. The concrete algorithm can be seen in the extended version of this paper. Thus, using the \(L_1\)-norm is not only justified by the arbitrariness of the Euclidean norm, but is also much more efficient. We compare the clustering accuracy in Fig. 1.
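As a plaintext illustration of this absolute-value trick (a sketch under our own naming; in the encrypted setting every bit would be a TFHE ciphertext and MUX the corresponding homomorphic gate), consider numbers in two's complement with the least significant bit first:

```python
# Plaintext simulation of the absolute value via MSB + MUX, and of the L1 distance,
# on two's-complement bit vectors (LSB first, fixed width n).
def encode(v, n):                     # integer -> n-bit two's complement, LSB first
    return [(v >> i) & 1 for i in range(n)]

def decode(bits):                     # n-bit two's complement -> signed integer
    return sum(b << i for i, b in enumerate(bits[:-1])) - (bits[-1] << (len(bits) - 1))

def neg(bits):                        # two's-complement negation (plaintext shortcut
    n = len(bits)                     # for the invert-and-add-one bit circuit)
    return encode((-decode(bits)) % (1 << n), n)

def mux(c, a, b):                     # bitwise multiplexer: a if c == 1 else b
    return [ai if c else bi for ai, bi in zip(a, b)]

def absolute(bits):
    return mux(bits[-1], neg(bits), bits)   # MSB = 1 means the number is negative

def l1_distance(xs, ys, n=16):
    # sum of |x_i - y_i| over all dimensions, computed on bit vectors
    total = 0
    for x, y in zip(xs, ys):
        diff = encode((x - y) % (1 << n), n)
        total += decode(absolute(diff))
    return total
```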

Fig. 1. Difference in percent of data points mislabeled for the \(L_1\)-norm compared to the \(L_2\)-norm: (\(\%\) mislabeled \(L_1\)) − (\(\%\) mislabeled \(L_2\)).

For both versions of the distance metric, we calculated the percentage of wrongly labeled data points for 1000 runs, which we can do because the datasets we use come with the correct labels. We plotted histograms of the difference (in percent mislabeled) between the \(L_1\)-norm and the \(L_2\)-norm for each run. Thus, a value of 0.5 means that the \(L_1\)-norm version misclassified \(0.5\%\) more data entries than the \(L_2\)-version, and \(-2\) means that the \(L_1\)-version misclassified \(2\%\) fewer entries than the \(L_2\)-version. Each subplot corresponds to one of the four datasets. We see that indeed, it is impossible to say which metric is better – for the Hepta dataset, the performance is very balanced, for the Lsun dataset, the \(L_1\)-norm performs much better, for the Tetra dataset, they nearly always perform exactly the same, and for the Wingnut dataset, the \(L_2\)-norm is consistently better.

4.3 Fractional Encoding

Suppose we have routines to perform addition, multiplication and comparison on bitwise encoded numbers. The idea is to express the number we wish to encode as a fraction and encode the numerator and denominator separately. Concretely, we choose the denominator \(a_d\) randomly in a certain range (like \(a_d \in [2^k,2^{k+1})\) for some k) and compute the numerator \(a_n\) as \(a_n = \lfloor a\cdot a_d\rceil \). We then encode both separately, so we have \(a=(a_n,a_d)\). If we then want to perform computations (including division) on values encoded in this way, we can express the operations using the subroutines from the binary encoding through the regular computation rules for fractions. The details can be seen in Appendix B.
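The following sketch (plaintext Python, with integers standing in for the bitwise-encrypted numerators and denominators; the default \(k=11\) mirrors the range used in Sect. 7, and all helper names are ours) illustrates the encoding and the computation rules for fractions:

```python
# Sketch of the Fractional Encoding: pick a random denominator a_d in [2^k, 2^(k+1)),
# set the numerator a_n = round(a * a_d), and compute on (a_n, a_d) pairs using only
# additions and multiplications of (bitwise-encoded) integers.
import random

def frac_encode(a, k=11):
    ad = random.randrange(2 ** k, 2 ** (k + 1))
    an = round(a * ad)
    return (an, ad)

def frac_decode(a):
    an, ad = a
    return an / ad

def frac_add(a, b):
    (an, ad), (bn, bd) = a, b
    return (an * bd + bn * ad, ad * bd)

def frac_mul(a, b):
    (an, ad), (bn, bd) = a, b
    return (an * bn, ad * bd)

def frac_div(a, b):          # division becomes multiplication by the reciprocal
    (an, ad), (bn, bd) = a, b
    return (an * bd, ad * bn)
```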

Controlling the Bitlength. Every single one of these operations requires a multiplication of some sort, which means that the bitlengths of the numerators and denominators double with each operation, as there is no cancellation when the data is encrypted. However, in bitwise encoding, deleting the k least significant bits corresponds to dividing by \(2^k\) and truncating. Doing this for both numerator and denominator yields roughly the same result as before, but with lower bitlengths. As an example, suppose that we have encoded our integers with 15 bits, and after a multiplication we thus have 30 bits in numerator and denominator, e.g. \(651049779/1053588274 \approx 0.617936\). Then dividing both numerator and denominator by \(2^{15}\) and truncating yields 19868 / 32152, which evaluates to \(0.617939 \approx 0.617936\). The accuracy can be set through the original encoding bitlength (15 here).
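Continuing the sketch above (and assuming non-negative components for simplicity), the shortening step simply drops the \(k\) least significant bits of both components; the worked example reproduces the numbers from the text:

```python
def frac_shorten(a, k=15):
    an, ad = a
    return (an >> k, ad >> k)     # delete the k least significant bits of each component

# Worked example from the text:
# frac_shorten((651049779, 1053588274), 15) == (19868, 32152)
# 651049779 / 1053588274 ≈ 0.617936 and 19868 / 32152 ≈ 0.61794
```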

4.4 Evaluation

While this new encoding theoretically allows us to perform the \(K\)-Means-Algorithm and solves the division problem in FHE, we now discuss the practical performance in terms of accuracy and runtime.

Accuracy. To see how the exact algorithm performs, we use the four datasets from Sect. 3.4. We ran the exact algorithm 1000 times for number of iterations \(T =5,10,...,45,50\), and for sake of completeness we include both distance metrics. The results in this section were obtained by running the algorithms in unencrypted form. We first examine the effect of \(T \) on the exact version of the algorithm by looking at the average (over the 1000 runs) misclassification rate for both metrics. The result can be seen in Fig. 2 – we see that the rate levels off after about 15 rounds in all cases, so there is no reason to iterate further.

Fig. 2. Misclassification rate with increasing rounds for the exact algorithms.

In practice, however, our Fractional Encoding does have some problems: The first issue is the procedure to shorten the bitlengths from Subsect. 4.3. While it works reasonably well for short computations, we found it nearly impossible to set the number of bits to delete such that the entire algorithm ran correctly. The reason is simple: If not enough bits are cut off, the bitlength grows, propagating with each operation and resulting in an overflow when the number becomes too large for the allocated bitlength. If too many bits are cut off, one loses too much accuracy or may even end up with a 0 in the denominator. Both of these cases result in completely arbitrary and unusable results. The reason why it is so hard to set the shortening parameter properly is that generally, numerator and denominator will not require the same number of bits. Also, because the data is encrypted, we cannot see the actual size of the underlying data, so the shortening parameter cannot be set dynamically – in fact, if this were possible, it would imply that the FHE scheme is insecure. Even setting the parameter roughly requires extensive knowledge about the encrypted data, which the data owner may not want to share with the computing party.

Runtime. The second issue with this encoding is the runtime. Even though TFHE is the most efficient FHE library, with which many computational tasks approach practically feasible runtimes, the fact that this encoding requires several multiplications on binary numbers for each elementary operation slows it down considerably. We compare the runtimes of all our algorithms in Sect. 7, and as we will see, running the \(K\)-Means-Algorithm on a real-world dataset with this Fractional Encoding would take almost 1.5 years on our computer.

4.5 Conclusion

In conclusion, this encoding is theoretically possible, but we would not recommend it for practical use due to its inefficiency and the difficulty of setting the shortening parameter (or even greater inefficiency if little to no shortening is done). However, for very flat computations (in the sense that not many successive operations are performed), this encoding, which allows division, may still be of interest. For the \(K\)-Means-Algorithm, we instead change the algorithm in a way that avoids the problematic division, which we present in the rest of this paper.

5 Approach 2: The Stabilized \(K\)-Means-Algorithm

In this section, we present a modification of the K-Means algorithm that avoids the division in the MoveCentroid-step. Recall that conventional encodings in FHE, like the binary one we will use, do not allow the computation of \(c _1/c _2\) where \(c _1\) and \(c _2\) are ciphertexts, but it is possible to compute \(c _1/a\) where a is some unencrypted number. We use this fact to exchange the ciphertext division in Line 25 of Algorithm 3 (page 20) for a constant division, resulting in a variant that can be computed with more established and efficient encodings than the one from Sect. 4.3. We present this new algorithm in Sect. 5.2, and compare the accuracy of the results to the original \(K\)-Means-Algorithm in Sect. 5.3.

5.1 Encoding

The dataset we use to evaluate our algorithms consists of rational numbers. To encode these so that we can encrypt them bit by bit, we scaled them with a factor of \(2^{20}\) and truncated to obtain an integer. We then used Two’s Complement encoding to accommodate signed numbers, and switched to Sign-Magnitude Encoding for multiplication. Note that deleting the last 20 bits corresponds to dividing the number by \(2^{20}\) and truncating, so the scaling factor can remain constant even after multiplication, where it would normally square.
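A small sketch of this fixed-point convention (our own helper names; the encrypted version operates on the individual bits of these integers rather than on Python integers):

```python
SCALE = 20                             # scaling factor 2^20

def fp_encode(x):
    return int(x * (1 << SCALE))       # scale and truncate to a (signed) integer

def fp_decode(v):
    return v / (1 << SCALE)

def fp_mul(a, b):
    # after the product the scale is 2^40; deleting the 20 least significant bits
    # (i.e. dividing by 2^20 and truncating) restores the constant scale of 2^20
    return (a * b) >> SCALE
```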

5.2 The Algorithm

Recall that in the original \(K\)-Means-Algorithm, the MoveCentroid-step consists of computing each centroid as the average of all data entries that have been assigned to it. More specifically, suppose that we have an \((m \times K)\)-dimensional cluster assignment matrix \(A\), where

$$A _{i k}={\left\{ \begin{array}{ll}1, \ \ \ \text {Data entry } x _i \text { is assigned to centroid } c _k \\ 0, \ \ \ \text {else.}\end{array}\right. }$$

Then computing the new centroid value \(c _k \) consists of multiplying the data entries \(x _i \) with the corresponding entry \(A _{i k}\) and summing up the results before dividing by the sum over the respective column \(k \) of \(A \):

$$c _k = \sum \limits _{i =1}^m x _i \cdot A _{i k}\big /\sum \limits _{i =1}^m A _{i k}. $$

Our modification now replaces this procedure with the following idea: To compute the new centroid \(c _k \), add the corresponding data entry \(x _i \) to the running sum if \(A _{i k}=1\), otherwise add the old centroid value \(\bar{c _k}\) if \(A _{i k}=0\). This can be easily done with a multiplexer gate (or more specifically, by abuse of notation, a multiplexer gate applied to each bit of the two inputs) with the entry \(A _{i k}\) as the conditional boolean variable:

$$c _k = \sum \limits _{i =1}^m \texttt {MUX}(A _{i k},x _i,\bar{c _k})\big /m.$$

The sum now always consists of \(m \) terms, so we can divide by the unencrypted constant \(m \). It is also now obvious why we call it the stabilized \(K\)-Means-Algorithm: We expect the centroids to move much more slowly, because the old centroid values stabilize the value in the computation. The details of this new algorithm can be found in Algorithm 1, with the changes compared to the original \(K\)-Means-Algorithm shaded.

Algorithm 1: The Stabilized \(K\)-Means-Algorithm.
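As a plaintext sketch of the modified MoveCentroid step (our own Python rendering of the formula above, not the shaded pseudocode of Algorithm 1; mux stands in for the bitwise multiplexer gate):

```python
def mux(c, a, b):
    return a if c == 1 else b          # MUX(c, a, b): a if c == 1, else b

def move_centroids_stabilized(X, A, old_centroids):
    # X: data entries, A: boolean (m x K) assignment matrix, old_centroids: previous values
    m, K, dim = len(X), len(old_centroids), len(X[0])
    new_centroids = []
    for k in range(K):
        acc = [0.0] * dim
        for i in range(m):
            # assigned entries contribute themselves, all others contribute the old centroid
            term = [mux(A[i][k], X[i][d], old_centroids[k][d]) for d in range(dim)]
            acc = [s + t for s, t in zip(acc, term)]
        new_centroids.append([s / m for s in acc])   # division by the unencrypted constant m
    return new_centroids
```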

Computing the Minimum. As the reader may have noticed in Line 10, we have replaced the comparison step in finding the nearest centroid for a data entry with a new function \(\texttt {FindMin}(\varDelta _1,\dots ,\varDelta _K)\) due to the change in data structure of \(A\) (from an integer vector to a boolean matrix). This new function outputs

$$A [i,\cdot ]\leftarrow \texttt {FindMin}(\varDelta _1,\dots ,\varDelta _K)$$

such that the \(i ^{th}\) row of \(A\), \(A [i,\cdot ]\), has all 0’s except at the column corresponding to the centroid with the minimum distance to \(x _i \). The idea is to run the Compare circuit to obtain a Boolean value: \(\texttt {Compare} (x,y)=1\) if \(x<y\), and 0 otherwise.

We start by comparing the first two distances \(\varDelta _1\) and \(\varDelta _2\) and setting the Boolean value as \(C:= \texttt {Compare} (\varDelta _1,\varDelta _2)\). Then we can write \(A [i,1]=C\) and \(A [i,2]=\lnot C\) and keep track of the current minimum through \(\texttt {minval}:= \texttt {MUX}(C,\varDelta _1,\varDelta _2)\). We then compare minval to \(\varDelta _3\) etc. until we have reached \(\varDelta _K \). Note that we need to modify all entries \(A [i,k ]\) with \(k \) smaller than the current index by multiplying them with the current Boolean value, preserving the indices if the minimum doesn’t change through the comparison, and setting them to 0 if it does. The exact workings can be found in Algorithm 2, and an example of how the algorithm works can be found in the extended version of this paper.

Algorithm 2: FindMin.
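The following plaintext sketch mirrors the logic of Algorithm 2 as described above (with compare standing in for the encrypted Compare circuit, Python arithmetic replacing the gate-level operations, and 0-based indexing of the row of \(A\)):

```python
def compare(x, y):
    return 1 if x < y else 0            # stands in for the encrypted Compare circuit

def find_min(distances):
    # returns one row of A: all zeros except a 1 at the index of the smallest distance
    K = len(distances)
    row = [0] * K
    c = compare(distances[0], distances[1])
    row[0], row[1] = c, 1 - c           # A[i,1] = C, A[i,2] = not C
    minval = distances[0] if c else distances[1]      # MUX(C, Delta_1, Delta_2)
    for k in range(2, K):
        c = compare(minval, distances[k])
        for j in range(k):              # keep earlier entries only if the minimum is unchanged
            row[j] = row[j] * c
        row[k] = 1 - c
        minval = minval if c else distances[k]
    return row
```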

If the encryption scheme is one where multiplicative depth is important, it is easy to modify FindMin to be depth-optimal: Instead of comparing \(\varDelta _1\) and \(\varDelta _2\), then comparing the result to \(\varDelta _3\), then comparing that result to \(\varDelta _4\) etc., we could instead compare \(\varDelta _1\) to \(\varDelta _2\) and \(\varDelta _3\) to \(\varDelta _4\) and then compare those two results etc., reducing the multiplicative depth from linear in the number of clusters \(K\) to logarithmic. Since depth is not important for our implementation choice TFHE, we implemented the function as described in Algorithm 2.
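A sketch of this depth-optimal variant, reducing the distances pairwise in a tournament (index bookkeeping omitted for brevity; min stands in for the Compare-and-MUX combination):

```python
def find_min_tree(distances):
    layer = list(distances)
    while len(layer) > 1:
        # compare disjoint pairs; each level adds only one comparison to the depth
        layer = [min(layer[i], layer[i + 1]) if i + 1 < len(layer) else layer[i]
                 for i in range(0, len(layer), 2)]
    return layer[0]
```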

5.3 Evaluation

In this section, we will investigate the performance of our Stabilized \(K\)-Means-Algorithm compared to the traditional \(K\)-Means-Algorithm.

Accuracy. The results in this section were obtained by running the algorithms in unencrypted form. As we are interested in relative rather than absolute performance, we merely care about the difference in the output of the modified and exact algorithms on the same input (i.e., datasets and starting centroids), not so much about the output itself. Recall that we obtained \(T=15\) as a good choice for the number of rounds for the exact algorithm – however, as we have already explained above, the cluster centroids converge more slowly in the stabilized version, so we will likely need more iterations here. We now compare the performance of the stabilized version to the exact version. We perform this comparison by examining the average (over the 1000 runs) difference in the misclassification rate. Thus, a value of 2 means that the stabilized version mislabeled \(2\%\) more instances than the exact version, and a difference of \(-1\) means that the stabilized version misclassified \(1\%\) fewer data points than the exact version.

Fig. 3. Average difference in misclassification rate between the stabilized and the exact algorithm: (average % mislabeled stabilized) − (average % mislabeled exact).

The results for both distance metrics can be seen in Fig. 3. We see that while behavior varies slightly depending on the dataset, \(T=40\) iterations is a reasonable choice, since the algorithms do not generally seem to converge further with more rounds. We will fix this parameter from here on, as it also exceeds the number of iterations required for the exact version to converge.

While the values in Fig. 3 do converge, they do not generally reach a difference of 0, which would imply similar performance. However, this is not surprising – we significantly modified the original algorithm, not with the intention of improving clustering accuracy, but rather to make it executable under an FHE scheme at all. This added functionality comes with a tradeoff, and we now examine the magnitude of the loss in accuracy in Fig. 4. The corresponding histogram for the \(L_2\)-norm can be found in the extended version of this paper.

Fig. 4. Distribution of the difference in misclassification rate for the stabilized vs. the exact \(K\)-Means-Algorithm: (% mislabeled stabilized) − (% mislabeled exact), \(L_1\)-norm.

We can see that in the vast majority of instances, the stabilized version performs exactly the same as the original \(K\)-Means-Algorithm. We also see that concrete performance does depend on the dataset. In some cases, the modified version even outperforms the original one: Interestingly, for the Lsun dataset, the stabilized version is actually slightly better than the original algorithm in about \(30\%\) of the cases. Most of the time, however, we expect a slight performance decrease. The fact that there are some outliers where performance is drastically worse can easily be addressed by running the algorithm several times in parallel and keeping only the best run. This can be done under homomorphic encryption much like computing the minimum in Sect. 5.2, but we do not implement it in this paper.

Runtime. While we will have a more detailed discussion of the runtime of all our algorithms in Sect. 7, we would like to already present the performance gain at this point: Recall that we estimated that running the exact algorithm from Sect. 4 would take almost 1.5 years. In contrast, our Stabilized Algorithm can be run in 25.93 days, or less than a month. This is less than \(5\%\) of the runtime of the exact version.

Conclusion. In conclusion to this section, we feel that by modifying the \(K\)-Means-Algorithm, we have traded a very small amount of accuracy for the ability to perform clustering on encrypted data in a more reasonable amount of time, which is a functionality that has not been achieved previously. The next section will deal with an idea to improve runtimes even more.

6 Approach 3: The Approximate Version

We now present another modification which trades a bit of accuracy for improved runtime. Due to space constraints, the details have been moved to Appendix C and we give only a high-level sketch at this point: Since the Compare function is linear in its input lengths, speeding up this building block would make the entire computation more efficient. First recall that we encode our numbers bitwise after having scaled them to integers. This means that we have access to the individual bits and can delete the \(S\) least significant bits, which corresponds to dividing the number by \(2^S \) and truncating. Let \(\tilde{X}\) denote this truncated version of a number X, and \(\tilde{Y}\) that of a number Y. Then \(\texttt {Compare} (\tilde{X},\tilde{Y}) = \texttt {Compare} (X,Y)\) if \(|X-Y|\ge 2^S \), and may or may not return the correct result if \(|X-Y|< 2^S \). However, correspondingly, if the result is wrong, the centroid that is wrongly assigned to the data entry is no more than \(2^S \) further from the data entry than the correct one. We propose to pick an initial \(S\) and decrease it over the course of the algorithm, so that accuracy increases as we near the end. We call this variant of the (stabilized) algorithm the approximate version.
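As a plaintext sketch (our own helper names), the approximate comparison simply truncates both operands before comparing:

```python
def compare(x, y):
    return 1 if x < y else 0            # stands in for the encrypted Compare circuit

def approx_compare(x, y, S=5):
    # deleting the S least significant bits divides by 2^S and truncates; the result
    # agrees with compare(x, y) whenever |x - y| >= 2^S
    return compare(x >> S, y >> S)
```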

In our experiments with \(S =5\), we saw that accuracy is comparable to the stabilized version, and the gain is around 210.7 min for the entire algorithm. Unfortunately, this is swallowed by the magnitude of the total computation time, as the main bottlenecks lie elsewhere. However, running just the comparison and approximate comparison functions with the same parameters as in our implementation of the \(K\)-Means-Algorithm (35 bits, 5 bits deleted for approximate comparison) yielded a drop in average runtime from 3.24 to 1.51 s. We see that this does make a big difference and may be of independent interest for computations involving many comparisons, which is why we choose to present the modification even though the effect was outweighed by other bottlenecks in the \(K\)-Means-Algorithm computation.

7 Implementation Results

We now present runtimes for the stabilized and approximate versions of the \(K\)-Means-Algorithm, and the times for the exact version using the Fractional Encoding. Computations were done in a virtual machine with 20 GB of RAM and 4 cores, running on an Intel i7-3770 processor at 3.4 GHz. We used the TFHE library [38] without the SPQLIOS_FMA option, as our processor did not support it.

The dataset we used was the Lsun dataset from [39], which consists of 400 rational data entries of 2 dimensions, and \(K =3\) clusters. We encoded the binary numbers with 35 bits and scaled to integers using \(2^{20}\). The timings measured were for one round, and the approximate version used a deletion parameter of \(S=5\). For the Fractional Encoding, the data was encoded with numerator in \([2^{11},2^{12})\) and denominator in roughly the same range. We allotted 35 bits total for numerator and denominator each to allow growth in the required bitlength, and set the shortening parameter to 12, but shortened by 11 every once in a while (we derived this approach experimentally; see the discussion of the shortcomings of this approach in Sect. 4.4). The Fractional exact version was so slow that we ran it only on the first 10 data entries of the dataset – we extrapolate the runtimes in Sect. 7.1.

7.1 Runtimes for the Entire Algorithm on a Single Core

We now present the runtimes for the entire \(K\)-Means-Algorithm on encrypted data on our specific machine with single-thread computation. There is some extrapolation involved, as the measured runtimes were for one round (so we multiplied by the round number, which differs between the exact version and the other two), and in the Fractional (exact) case, only for 10 data entries, so we multiplied that time by 40. Note that these times (which are with no parallelization) can be found in Table 2. We see that even though the stabilized version needs more rounds than the exact version, the latter is still significantly slower due to the Fractional Encoding. The approximate version (always with \(S=5\) deleted bits in the comparison) would save about 210.7 min.

Table 2. Single-thread runtimes (extrapolated) on our machine.

7.2 Further Speedup

We would now like to address the subject of parallelism. At the moment (last accessed April 24\(^{th}\) 2018), the TFHE library only supports single-threaded computation – i.e., there is no parallelism. However, version 1.5 is expected soon, and it will allegedly support multithreading. We first explain the huge difference this would make for the runtime, and then quantify the involved timings.

Parallelism. It is easy to see that all our versions of the \(K\)-Means-Algorithm are highly parallelizable: the Cluster Assignment step trivially so over the data entries (without any time needed for recombination), and the Move Centroids step similarly over the cluster centroids (or over the data entries, with very small recombination effort). Since both steps are linear in the number \(K\) of centroids, the number \(m\) of data entries, and the number \(T\) of round iterations, we present our runtimes in this subsection as per centroid, per data entry, per round, per core. This allows a flexible estimate for when multithreading is supported.
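As a rough back-of-the-envelope estimate (our own formula, assuming perfect parallelization with negligible recombination overhead), the total wall-clock time can then be obtained from the per-unit times in Table 3 as

$$t_{\text {total}} \approx \frac{t_{\text {unit}}\cdot K \cdot m \cdot T }{\#\text {cores}},$$

where \(t_{\text {unit}}\) denotes the runtime per centroid, per data entry, per round, per core.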

Round Runtimes. We now present the runtime results for each of the three variants on encrypted data per centroid, per data entry, per round, per core in Table 3. We do not include runtimes for encoding/encryption and decryption/decoding, as these would be performed on the user side, whereas the computation would be outsourced (encoding/encryption is ca. 1.5 s, and decoding/decryption is around 5 ms). We see that the Fractional Encoding is extremely slow, which motivated the Stabilized Algorithm in the first place.

Table 3. Runtimes per centroid, per data entry, per round, per core.