
1 Introduction

K-Nearest-Neighbors (KNN) graphs play a crucial role in a large number of applications, ranging from classification [22] to recommender systems [4, 15, 17]. In a KNN graph, every entity (or node) is linked to its k closest counterparts, according to a given similarity metric. Despite being one of the simplest models of machine learning, an exact KNN graph is unfortunately highly time-consuming to compute: a simple brute-force approach, for instance, has a complexity that is quadratic in the number of entities. For applications in which data freshness is more valuable than exact results, such as news recommenders, such computation times are prohibitive. To overcome these costs, most applications therefore compute an approximate KNN graph, using pre-indexing mechanisms [5, 11] or greedy incremental strategies [4, 10] to reduce the number of similarity computations. It seems hard, however, to lower that number even further.

In this paper, we focus on an orthogonal approach, and leverage sampling as a preliminary pruning step to accelerate the computation of the similarity between two entities. Our proposal stems from the observation that many KNN graph computations are performed on entities (users, documents, molecules) linked to items (e.g. the web pages a user has viewed, the terms of a document, the properties of a molecule). In these KNN graphs, the similarity function is expressed as a set similarity between (possibly weighted) bags of items, such as Jaccard’s coefficient or cosine similarity. The goal of sampling is to limit the size of these bags of items, and thus the time needed to compute the similarity.

Sampling might however degrade the resulting approximate KNN graph to the point where it becomes unusable, and must therefore be performed with care. In this paper, we propose to sample the bags of items associated with each entity down to a common fixed size s, by keeping their s least popular items. Our intuition is that less popular items are more discriminant when comparing entities than more popular or random items. For instance, the fact that Alice enjoys the original 1977 Star Wars movie tells us less about her tastes than the fact that she also loves the 9-hour version of Abel Gance’s 1927 Napoléon.

We compare this policy against three other sampling policies: (i) keeping the s most popular items of each entity, (ii) keeping s random items of each entity, and (iii) sampling the universe of items, independently of the entities. We evaluate these four sampling policies on four representative datasets. As a case study, we finally assess the effects of these strategies on recommendation, an emblematic application of KNN graphs. Our evaluation shows that our sampling policy clearly outperforms the other policies in terms of computation time and resulting quality: keeping the 25 least popular items reduces the computation time by up to 63%, while producing a KNN graph close to the ideal one. The recommendations made using the resulting KNN graphs are moreover as good as those relying on the exact KNN graph on all datasets.

The rest of this paper is organized as follows. In Sect. 2 we formally define the context of our work and our approach. The evaluation procedure is described in Sect. 3. Section 4 presents our experimental results. The related work is discussed in Sect. 5 and we conclude in Sect. 6.

2 Problem Statement: Reduce KNN Computation Time

2.1 System Model and Problem

For ease of exposition, we will speak about users rather than entities, but our approach remains applicable to any entity-item dataset. We consider a set of users \(U=\{u_1,\ldots ,u_n\}\) in which each user u is associated with a set of items (the movies this user has liked, the pages she has viewed), termed her profile, and noted \( P _{u}\). We note I the universe of all items: \(I= \cup _{u\in U} P _{u}\).

A k-nearest neighbor (KNN) graph associates each user u with the set of k other users \( knn (u) \subseteq U\) which are closest to u according to a given similarity metric on profiles:

$$\begin{aligned} sim : U \times U&\rightarrow \mathbb {R} \\ (u,v)&\mapsto sim(u,v) = f_{ sim }( P _{u}, P _{v}). \end{aligned}$$

Thus, computing the KNN graph amounts to finding \( knn (u)\) for each u such that

$$\begin{aligned} knn (u) \in \mathop {\mathrm {argmax}}_{S \in \mathcal {P}(U\backslash \{u\}): |S|=k} \sum _{v \in S} sim(u,v), \end{aligned}$$
(1)

where \(\mathcal {P}(X)\) is the powerset of a set X. We focus in this work on Jaccard similarity, a commonly used similarity metric, but our work can be applied to others. The Jaccard similarity between two users u and v is expressed as the size of the intersection of their profiles divided by the size of the union of their profiles:

$$\begin{aligned} f_{ sim }( P _{u}, P _{v}) = J( P _{u}, P _{v}) = \frac{| P _{u} \cap P _{v}|}{| P _{u} \cup P _{v}|} \end{aligned}$$
(2)

Since \(| P _{u} \cup P _{v}| = | P _{u}| + | P _{v}| - | P _{u} \cap P _{v}|\), and since we can store \(| P _{u}|\) for every user, computing the size of the intersection is the only non-trivial operation required to compute the Jaccard similarity.
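To make this concrete, here is a minimal sketch (our own, for illustration; it is not the paper’s released code) of this intersection-only evaluation of Eq. (2), with profiles stored as sorted arrays of item identifiers:

```java
public class Jaccard {
    // Size of the intersection of two sorted item-id arrays (linear merge).
    static int intersectionSize(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { count++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return count;
    }

    // J(P_u, P_v) = |inter| / (|P_u| + |P_v| - |inter|), cf. Eq. (2);
    // profile sizes come for free from the array lengths.
    static double jaccard(int[] a, int[] b) {
        int inter = intersectionSize(a, b);
        int union = a.length + b.length - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    public static void main(String[] args) {
        int[] pu = {1, 3, 5, 8};  // profile of u, sorted item ids
        int[] pv = {3, 5, 9};     // profile of v
        System.out.println(jaccard(pu, pv)); // 2 / (4 + 3 - 2) = 0.4
    }
}
```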

2.2 Gance’s Napoléon Tells Us More Than Lucas’s Star Wars

Computing the intersection \( P _{u} \cap P _{v}\) is time-consuming for large sets, and is the main bottleneck when computing Jaccard’s similarity. To reduce the cost of this operation, we propose to sample each profile \( P _{u}\) into a subset \(\widehat{ P }_{u}\) in a preparatory phase applied when the dataset is loaded into memory, and to compute an approximate KNN graph on the sampled profiles.

Although simple, this idea has surprisingly never been applied to the computation of KNN graphs on entity-item datasets. Sampling, however, carries its own risks: if the items that are most characteristic of a user’s profile get deleted, the KNN neighborhood of this user might become irremediably degraded. To avoid this situation, we adopt a constant-size sampling that strives to retain the least popular items in a profile.

The intuition is that unpopular items carry more information about a user’s tastes than other items: if Alice and Bob have both enjoyed Abel Gance’s Napoléon—a 1927 silent movie about Napoléon’s early years—they are more likely to have similar tastes than if they have both liked Star Wars: A New Hope—the 1977 first installment of the series, enjoyed by 96% of users.

2.3 Our Approach: Constant-Size Least Popular Sampling (LP)

More formally, if the size of the profile of a user u is larger than a parameter s, we only keep its s least popular items:

$$\begin{aligned} \widehat{ P }_{u} \in \mathop {\mathrm {argmin}}_{S \in \mathcal {P}^s_u} \sum _{i \in S} pop(i), \end{aligned}$$
(3)

where \(\mathcal {P}^s_u\) is the set of subsets of \( P _{u}\) of a given size s, i.e. \(\mathcal {P}^s_u = \{S \in \mathcal {P}(I): |S|=s \wedge S \subseteq P _{u}\}\), and pop(i) is the popularity of item \(i\in I\) over the entire dataset:

$$\begin{aligned} pop(i) = |\{u \in U: i \in P _{u}\}|. \end{aligned}$$
(4)

If the profile’s size is below s, the profile remains the same: \(\widehat{ P }_{u} = P _{u}\).

In terms of implementation, we compute the popularity of every item when reading the dataset from disk. We then use Eq. (3) to sample the profile of every user in a second iteration. The sampled profiles are finally used to estimate Jaccard’s similarity between users when the KNN graph is constructed:

$$\begin{aligned} \widehat{J}( P _{u}, P _{v}) = J(\widehat{ P }_{u},\widehat{ P }_{v}) = \frac{ |\widehat{ P }_{u}\cap \widehat{ P }_{v}| }{ |\widehat{ P }_{u}| + |\widehat{ P }_{v}| - |\widehat{ P }_{u}\cap \widehat{ P }_{v}| } \end{aligned}$$
(5)
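As an illustration, here is a minimal Java sketch of this two-pass preprocessing (our own simplified version of the description above, not the released implementation; the helper names are hypothetical):

```java
import java.util.*;

public class LpSampling {
    // Pass 1: pop(i) = number of users whose profile contains item i (Eq. 4).
    static Map<Integer, Integer> itemPopularity(Collection<Set<Integer>> profiles) {
        Map<Integer, Integer> pop = new HashMap<>();
        for (Set<Integer> p : profiles)
            for (int item : p) pop.merge(item, 1, Integer::sum);
        return pop;
    }

    // Pass 2: keep the s least popular items of a profile (Eq. 3);
    // profiles of size at most s are left unchanged.
    static Set<Integer> lpSample(Set<Integer> profile, int s, Map<Integer, Integer> pop) {
        if (profile.size() <= s) return profile;
        List<Integer> items = new ArrayList<>(profile);
        items.sort(Comparator.comparingInt(pop::get)); // every item is in pop
        return new HashSet<>(items.subList(0, s));
    }

    public static void main(String[] args) {
        List<Set<Integer>> profiles = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 2, 3, 4)),
            new HashSet<>(Arrays.asList(2, 3, 4)),
            new HashSet<>(Arrays.asList(3, 4)));
        Map<Integer, Integer> pop = itemPopularity(profiles);
        // With s = 2, the first profile keeps its two least popular items: [1, 2].
        System.out.println(lpSample(profiles.get(0), 2, pop));
    }
}
```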

3 Experimental Setup

3.1 Baseline Algorithms and Competitors

Our Constant-Size Least Popular sampling policy (LP for short) can be applied to any KNN graph construction algorithm [4, 5, 10]. For simplicity, we apply it to a brute-force approach that compares each pair of users and keeps the k most similar users for each user. This choice lets us focus on the raw impact of sampling on computation time and KNN quality, without any other interfering mechanism.
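For reference, such a brute-force construction could be sketched as follows (illustrative code under our own conventions, not the authors’ implementation), where the profiles passed in may or may not have been sampled:

```java
import java.util.*;
import java.util.stream.IntStream;

public class BruteForceKnn {
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        long inter = a.stream().filter(b::contains).count();
        return inter == 0 ? 0.0 : (double) inter / (a.size() + b.size() - inter);
    }

    // Compare every pair of profiles and keep, for each user,
    // the k most similar other users (Eq. 1).
    static int[][] bruteForce(List<Set<Integer>> profiles, int k) {
        int n = profiles.size();
        int[][] knn = new int[n][];
        for (int u = 0; u < n; u++) {
            final int cu = u;
            knn[u] = IntStream.range(0, n)
                .filter(v -> v != cu)
                .boxed()
                // ascending order on -sim == descending order on sim
                .sorted(Comparator.comparingDouble(
                    (Integer v) -> -jaccard(profiles.get(cu), profiles.get(v))))
                .limit(k)
                .mapToInt(Integer::intValue)
                .toArray();
        }
        return knn;
    }

    public static void main(String[] args) {
        List<Set<Integer>> profiles = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 2, 3)),
            new HashSet<>(Arrays.asList(2, 3)),
            new HashSet<>(Arrays.asList(7, 8)));
        System.out.println(Arrays.deepToString(bruteForce(profiles, 1)));
    }
}
```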

We use full profiles for our baseline, and compare our approach with three alternative sampling strategies: constant-size most popular, constant-size random, and item sampling.

Baseline: No Sampling. We use our brute-force algorithm without sampling as our baseline. This approach yields an exact result, which we use to assess the approximation introduced by sampling, and provides a reference computation time.

Constant-Size Most Popular Sampling (MP). Similarly to LP, MP only keeps the s most popular items of each profile \(P_u\):

$$\begin{aligned} \widehat{ P }_{u} \in \mathop {\mathrm {argmax}}_{S \in \mathcal {P}^s_u} \sum _{i \in S} pop(i). \end{aligned}$$
(6)

As with LP, we do not sample the profile if its size is lower than s.

Constant-Size Random Sampling (CS). This sampling policy randomly selects s items from \( P _{u}\), with a uniform probability. As above, there is no sampling if the size of the profile is lower than s. In terms of implementation, this policy only requires one iteration over the data.

Item Sampling (IS). This last policy uniformly removes items from the complete dataset. More precisely, each item \(i \in I\) is kept with a uniform probability p to construct a reduced item universe \(\hat{I}\) (i.e. \(\forall i \in I: \mathbb {P}(i \in \hat{I}) = p\)). The sampled profiles are then obtained by keeping the items of each profile that also appear in \(\hat{I}\): \(\widehat{ P }_{u} = P _{u} \cap \hat{I}\). On average, profiles are reduced by a factor of \(\frac{1}{p}\), but this policy does not adapt to the characteristics of individual profiles: small profiles risk losing too much of their content to maintain good-quality results.
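In the same style as the LP sketch of Sect. 2.3, the three alternatives can be sketched as follows (again our own illustrative code, with hypothetical helper names):

```java
import java.util.*;

public class OtherPolicies {
    // MP: keep the s most popular items of a profile (Eq. 6).
    static Set<Integer> mpSample(Set<Integer> profile, int s, Map<Integer, Integer> pop) {
        if (profile.size() <= s) return profile;
        List<Integer> items = new ArrayList<>(profile);
        items.sort(Comparator.comparingInt((Integer i) -> -pop.get(i)));
        return new HashSet<>(items.subList(0, s));
    }

    // CS: keep s items drawn uniformly at random (single pass over the data).
    static Set<Integer> csSample(Set<Integer> profile, int s, Random rng) {
        if (profile.size() <= s) return profile;
        List<Integer> items = new ArrayList<>(profile);
        Collections.shuffle(items, rng);
        return new HashSet<>(items.subList(0, s));
    }

    // IS: intersect the profile with a pre-drawn reduced universe,
    // where each item of I was kept with probability p.
    static Set<Integer> isSample(Set<Integer> profile, Set<Integer> keptUniverse) {
        Set<Integer> sampled = new HashSet<>(profile);
        sampled.retainAll(keptUniverse);
        return sampled;
    }
}
```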

3.2 Datasets

We use four publicly available datasets containing movie ratings (Table 1): three from the MovieLens project, and one from Amazon. Ratings range from disliking (0.5 or 1) to liking (5). To apply the Jaccard similarity, we binarize the datasets by keeping only the ratings that reflect a positive opinion (i.e. \(>3\)), before performing any sampling. Figure 1 shows the resulting Complementary Cumulative Distribution Functions (CCDF) of profile sizes for each dataset. For instance, more than 66% of users have profiles larger than 25 in movielens10M (ml10M). This means that a constant-size sampling with \(s=25\) on movielens10M removes more than 3 million ratings (\(-69.23\%\)).

Table 1. The datasets used in our experiments.

Fig. 1. CCDF of user profile sizes on the datasets used in the evaluation (positive ratings only). Between 77% (movielens1M) and 53% (AmazonMovies) of profiles are larger than the default cut-off value 25 (marked as a vertical bar).

The Three MovieLens Datasets. movielens1M (ml1M for short), movielens10M (ml10M) and movielens20M (ml20M) originate from GroupLens Research [13]. They contain movie reviews posted by online users from 1995 to 2015, and only include users with more than 20 ratings.

The AmazonMovies Dataset (AM). This dataset [20] aggregates movie reviews received by Amazon from 1997 to 2012. To avoid users with very few ratings (the so-called cold-start problem), we only consider users with at least 20 ratings.

3.3 Evaluation Metrics

We measure the effect of sampling along two main metrics: (i) the computation time, and (ii) the quality of the resulting KNN graph.

The time is measured from the beginning of the execution of the algorithm, until the KNN graph is computed. It does not take into account the preprocessing of the dataset, which is evaluated separately in Sect. 4.2.

When applying sampling, the resulting KNN graph is an approximation of the exact one. In many applications, such as recommender systems, this approximation should provide neighborhoods of high quality, even if these do not exactly overlap with the exact KNN sets. To gauge this quality, we introduce a similarity ratio, which measures how the average similarity of an approximate graph compares with that of an exact KNN graph. Formally, we define the average similarity of an approximate KNN graph \(\widehat{G}_{{\text {KNN}}}\) as

$$\begin{aligned} \overline{ sim }(\widehat{G}_{{\text {KNN}}}) = \frac{1}{k \times |U|} \sum _{u \in U} \sum _{v \in \widehat{ knn }(u)} sim(u,v), \end{aligned}$$
(7)

i.e. as the average similarity of the edges of \(\widehat{G}_{{\text {KNN}}}\), and we define the quality of \(\widehat{G}_{{\text {KNN}}}\) as its normalized average similarity

$$\begin{aligned} quality (\widehat{G}_{{\text {KNN}}}) = \frac{\overline{ sim }(\widehat{G}_{{\text {KNN}}})}{\overline{ sim }(G_{{\text {KNN}}})}, \end{aligned}$$
(8)

where \(G_{{\text {KNN}}}\) is an ideal KNN graph, obtained without sampling.

A quality close to 1 indicates that the approximate neighborhoods of \(\widehat{G}_{{\text {KNN}}}\) present a similarity that is very close to that of ideal neighborhoods, and can replace them with little loss in most applications, as we will show in the case of recommendations in our evaluation.
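A minimal sketch of these two metrics (our own, under the assumption that KNN graphs are stored as arrays of neighbor ids, and that sim holds similarities computed on the full, unsampled profiles):

```java
public class KnnQuality {
    // Average similarity of a KNN graph (Eq. 7): the mean of sim(u, v)
    // over the k * |U| directed edges (u, v) of the graph.
    static double avgSim(int[][] knn, double[][] sim) {
        double total = 0;
        long edges = 0;
        for (int u = 0; u < knn.length; u++)
            for (int v : knn[u]) { total += sim[u][v]; edges++; }
        return total / edges;
    }

    // Quality of an approximate graph (Eq. 8): its average similarity
    // normalized by that of the exact graph.
    static double quality(int[][] approx, int[][] exact, double[][] sim) {
        return avgSim(approx, sim) / avgSim(exact, sim);
    }
}
```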

Throughout our experiments, we use a 5-fold cross-validation procedure which creates 5 training sets, each composed of 80% of the ratings. The remaining 20%, i.e. the testing sets, are used for recommendations in Sect. 4.4. Our results are averaged over the 5 resulting runs.

3.4 Experimental Setup

We have implemented the sampling policies in Java 1.8. We ran our experiments on a 64-bit Linux server with two Intel Xeon E5420 CPUs at 2.50 GHz (8 hardware threads in total), 32 GB of memory, and a 750 GB HDD. We use all 8 threads. Our code is available online. In our experiments, we compute KNN graphs with k set to 30, a standard value.

4 Experiments

4.1 Reduction in Computing Time, and Quality/Speed Trade-Off

The baseline algorithm (without sampling) produces an exact KNN graph, with a quality of 1. To compare the different sampling policies (LP, MP, CS and IS) on an equal footing, we configure each of them on each dataset to achieve a quality of 0.9. The resulting parameter s ranges from 15 (LP on AM) to 75 (MP on movielens1M), while p (for IS) varies between 0.35 (on AmazonMovies) and 0.68 (on movielens20M). Table 2 summarizes the computation times measured on the four datasets with the percentage time reduction obtained against the baseline (\(\varDelta \) columns), while Fig. 2 shows the results on movielens10M. LP outperforms all other policies on all datasets, reaching a reduction of up to 63%.

Table 2. Computation time (s) of the baseline and the 4 sampling policies. The parameters were chosen to have a quality equal to 0.9. LP reduces computation time by 40% (ml1M) to 63% (AM), and outperforms other sampling policies on all datasets.
Fig. 2. Computation time and KNN quality of the baseline and the sampling policies on movielens10M, when quality is set to 0.9. LP yields a reduction of 44.2% in computation time, outperforming the other sampling policies.

Fig. 3. Trade-off between computation time and quality. Closer to the top-left corner is better. LP clearly outperforms all other sampling policies on all datasets.

Table 3. Preprocessing time (seconds) for each dataset, and each sampling policy, with parameters set so that the resulting KNN quality is 0.9. The preprocessing times are negligible compared to the computation times.

Because they reduce the size of profiles, sampling policies exchange quality for speed. To better understand this trade-off, Fig. 3 plots the evolution of the computation time and the resulting quality when s ranges from 5 to 200 for LP, MP, and CS (\(s \in \{5, 10, 15, 20, 30, 40, 50, 75, 100, 200\}\)), and p ranges from 0.1 to 1.0 for IS (\(p\in \{0.1, 0.2, 0.4, 0.5, 0.75, 0.9, 1.0\}\)).

For clarity, we only display points with a quality above 0.7, corresponding to the upper values of s and p. The dashed vertical line on the right shows the computation time of the baseline (producing a quality of 1), while the dotted horizontal line shows the quality threshold of 0.9 used in Table 2 and Fig. 2.

Lines closer to the top-left corner are better. The figures confirm that our contribution, LP, outperforms the other sampling policies on all datasets. There is however no clear winner among the remaining policies: IS performs well on movielens1M but arrives last on the other datasets, while the relative order of MP and CS depends on the dataset and the quality considered.

4.2 Preprocessing Overhead

As is common with KNN graph algorithms [5, 10], the previous measurements do not include the loading and preprocessing time of the datasets, which is typically dominated by I/O rather than CPU costs. Sampling adds some overhead to this preprocessing, but Table 3 shows that this extra cost (\(\varDelta \) columns) remains negligible compared to the computation times of Table 2. For instance, LP adds 3.4 s to the preprocessing of movielens20M, which only represents 0.07% of the complete execution time of the algorithm (\(4865\,\mathrm {s} + 11.95\,\mathrm {s} = 4877\,\mathrm {s}\), where 11.95 s is the total preprocessing time with LP). IS even decreases the preprocessing time on three datasets out of four, by sharply reducing the bookkeeping costs of profiles while introducing only a small extra complexity.

4.3 Influence of LP at the User’s Level

Constant-size sampling has a different influence on each user, depending on the size of this user’s profile. Profiles whose size is below the parameter s remain unchanged, while larger profiles are truncated and thus lose information.

Figure 4 investigates the impact of this loss with our approach, LP, on movielens10M with \(s=25\) (corresponding to a quality of 0.9). Figure 4a plots the distribution of the similarity error \(\epsilon =|J( P _{u}, P _{v}) - J(\widehat{ P }_{u},\widehat{ P }_{v})|\) introduced by sampling when \(\epsilon \) is computed for each pair of users (uv). The figure shows that \(35\%\) of pairs experience no error (\(\epsilon =0\)), and that \(96\%\) have an error below 0.05 (dotted vertical line), confirming that our sampling only introduces a limited distortion of similarities.

Figure 4b represents the impact of LP on the quality of users’ neighborhoods, as a function of the initial profile size of users. For every user u with an initial profile of size \(|P_u|\), we compute the average similarity of u’s approximated neighborhood \(\widehat{ knn }(u)\), and normalize it by that of u’s exact neighborhood \( knn (u)\) (the closer to 1, the better). We then average this normalized similarity over users with the same profile size. These values are displayed as a scatter plot (in black; note the log scale on the x axis), and as a moving average of width 50 (red curve). The first dashed vertical line marks the truncation parameter s (\(x=25\)). The points after the second vertical line (at \(x=1553\)) represent only 24 users (out of 69816) and are thus not statistically significant. As expected, there is a clear threshold effect around the truncation value \(s=25\), yet even users with much larger profiles retain a high neighborhood quality, which remains on average above 0.75.

Fig. 4. Influence of LP sampling with \(s=25\) on pairwise similarities and neighborhood quality on movielens10M (total KNN quality equal to 0.9) (Color figure online).

4.4 Recommendations

We want to evaluate the impact of the loss in quality on a practical use of KNN graphs. To do so, we perform item recommendations using the exact KNN graphs and the approximated graphs produced with LP. We recommend the items that a user u is most likely to enjoy. This likelihood is expressed as an average of the ratings an item received from the neighbors of u, weighted by the similarity between u and these neighbors. We use the real profiles, without sampling or binarization, to compute these predicted ratings. After computing the score of every item, we recommend to u a set \(R_u\) composed of the 30 items with the highest scores:

$$\begin{aligned} R_u \in \mathop {\mathrm {argmax}}_{R \subseteq I \backslash P_u : |R|=30} \sum _{i \in R} \sum _{v \in knn (u)} sim(u,v) \times r_{v,i}, \end{aligned}$$
(9)

where \(r_{v,i}\) is the rating given by user v to item i. We use the same 5-fold cross-validation as for the KNN graph computation. We consider a recommendation successful when a recommended item appears in the 20% of removed ratings (the testing set) with a rating above 3 (\(r_{u,i} > 3\)). The quality of the recommendations is measured using recall: the proportion of successful recommendations among all recommendations.
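A sketch of this scoring and of the recall computation (again illustrative; the data layout, with one item-to-rating map per user, is our own assumption):

```java
import java.util.*;
import java.util.stream.Collectors;

public class Recommend {
    // Score candidate items as in Eq. (9): for each item i not in u's profile,
    // score(i) = sum over neighbors v of sim(u, v) * r_{v,i}; keep the top 30.
    static List<Integer> recommend(List<Integer> neighbors, List<Double> simToU,
                                   List<Map<Integer, Double>> ratings,
                                   Set<Integer> profile) {
        Map<Integer, Double> score = new HashMap<>();
        for (int j = 0; j < neighbors.size(); j++) {
            double w = simToU.get(j); // sim(u, v) for the j-th neighbor
            for (Map.Entry<Integer, Double> e : ratings.get(neighbors.get(j)).entrySet())
                if (!profile.contains(e.getKey()))
                    score.merge(e.getKey(), w * e.getValue(), Double::sum);
        }
        return score.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .limit(30)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    // Recall as defined above: the fraction of recommended items that appear
    // in the test set with a rating strictly above 3.
    static double recall(List<Integer> recs, Map<Integer, Double> testRatings) {
        if (recs.isEmpty()) return 0.0;
        long hits = recs.stream()
            .filter(i -> testRatings.getOrDefault(i, 0.0) > 3).count();
        return (double) hits / recs.size();
    }
}
```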

Table 4. Recommendation recall without sampling (Base.) and using the Least Popular (LP) policy (total KNN quality set to 0.9).

Table 4 shows the recall obtained using the exact KNN graphs of the baseline and the approximate graphs produced by LP, when the KNN quality is set to 0.9. In spite of its approximation, LP introduces no loss in recall, and even achieves slightly better scores than the baseline, which shows that our sampling approach can be used with little impact in concrete applications.

5 Related Work

For small datasets, specific data structures can be used to compute KNN graphs very efficiently [3, 18, 21]. These solutions however do not scale, and efficiently computing exact KNN graphs on large datasets remains an open problem.

For large datasets, an approximation of the KNN graph, called an approximate nearest-neighbor (ANN) graph, is computed instead by decreasing the number of comparisons between users. Locality-Sensitive Hashing (LSH) [11, 14] hashes users into buckets, and only users within the same bucket are compared. Depending on the chosen similarity, different hashing functions are used [6,7,8]. Despite being very efficient for KNN queries, its preprocessing is too expensive to compete with other ANN graph algorithms. KIFF [5] first assigns to every user the users with which she shares at least one item. Since the Jaccard similarity is null when two users share no item, the neighbor search can be limited to these candidates. This algorithm performs particularly well on sparse datasets. Hyrec [4] and NNDescent [10] rely on the assumption that the neighbors of neighbors are more likely to be neighbors themselves than random users, to drastically decrease the number of similarities computed.

It seems difficult, however, to lower the number of similarity computations much further. An orthogonal strategy is to speed up the similarity computation itself by compacting the users’ profiles. b-bit minwise hashing [2, 16] relies on an approach similar to LSH to compact users’ profiles in order to approximate the Jaccard similarity. It is space-efficient, but at the expense of a high preprocessing time. In [9], the profiles are compacted into bit arrays: each bit represents a feature whose value has been rounded. This does not scale, and cannot be used in our case, where the items are the features. To avoid this problem, [12] uses constant-sized Bloom filters to encode the profiles; Jaccard’s similarity is then approximated with a bitwise AND operation. Despite its privacy properties and its speed-up, this approach incurs a substantial loss in precision.

As far as we know, sampling has never been used to compact users’ profiles, even though it is used in information filtering systems such as collaborative filtering, e.g. to find association rules [1], to reduce the size of the universe of items to recommend [17], or to change the distribution of the training points [19, 23]. Popularity has been used to address the cold-start problem [24], by finding items a new user is likely to rate, but not to represent her profile in a compact manner.

6 Conclusion

In this paper, we have proposed Constant-Size Least Popular Sampling (LP) to speed up the construction of KNN graphs on entity-item datasets. By keeping only the least popular items of users’ profiles, we make them shorter and thus faster to compare. Our extensive evaluation on four realistic datasets shows that LP outperforms more straightforward sampling policies. More precisely, LP is able to decrease the computation time of KNN graphs by up to 63%, while providing a KNN graph close to the ideal one, with no observable loss when used to compute recommendations.

In the future, we plan to investigate more advanced sampling policies, and to explore how sampling could be combined with orthogonal greedy techniques to accelerate KNN graph computations [4, 5, 10].