
1 Introduction

K-Nearest-Neighbors (KNN) graphs play a crucial role in a large number of applications, ranging from classification [22] to recommender systems [4, 15, 17]. In a KNN graph, every entity (or node) is linked to its k closest counterparts, according to a given similarity metric. Despite being one of the simplest models of machine learning, an exact KNN graph is unfortunately highly time-consuming to compute: a simple brute-force approach, for instance, has a complexity that is quadratic in the number of entities. For applications in which data freshness is more valuable than exact results, such as news recommenders, such computation times are prohibitive. To overcome these costs, most applications therefore compute an approximate KNN graph, using pre-indexing mechanisms [5, 11] or greedy incremental strategies [4, 10] to reduce the number of similarity computations. It seems hard, however, to lower that number even further.

In this paper, we focus on an orthogonal approach, and leverage sampling as a preliminary pruning step to accelerate the computation of the similarity between two entities. Our proposal stems from the observation that many KNN graph computations are performed on entities (users, documents, molecules) linked to items (e.g. the web pages a user has viewed, the terms of a document, the properties of a molecule). In these KNN graphs, the similarity function is expressed as a set similarity between (possibly weighted) bags of items, such as Jaccard’s coefficient or cosine similarity. The goal of sampling is to limit the size of these bags of items, and thus the time needed to compute the similarity.

Sampling might however degrade the resulting approximate KNN graph to the point where it becomes unusable, and must therefore be performed with care. In this paper, we propose to sample the bags of items associated with each entity down to a common fixed size s, by keeping their s least popular items. Our intuition is that less popular items are more discriminant when comparing entities than more popular or random items. For instance, the fact that Alice enjoys the original 1977 Star Wars movie tells us less about her tastes than the fact that she also loves the 9-hour version of Abel Gance’s 1927 Napoléon.

We compare this policy against three other sampling policies: (i) keeping the s most popular items of each entity, (ii) keeping s random items of each entity, and (iii) sampling the universe of items, independently of the entities. We evaluate these four sampling policies on four representative datasets. As a case study, we finally assess the effects of these strategies on recommendation, an emblematic application of KNN graphs. Our evaluation shows that our sampling policy clearly outperforms the other policies in terms of computation time and resulting quality: keeping the 25 least popular items reduces the computation time by up to 63%, while producing a KNN graph close to the ideal one. The recommendations made using the resulting KNN graphs are moreover as good as those relying on the exact KNN graph on all datasets.

The rest of this paper is organized as follows. In Sect. 2 we formally define the context of our work and our approach. The evaluation procedure is described in Sect. 3. Section 4 presents our experimental results. The related work is discussed in Sect. 5 and we conclude in Sect. 6.

2 Problem Statement: Reduce KNN Computation Time

2.1 System Model and Problem

For ease of exposition, we will speak about users rather than entities, but our approach remains applicable to any entity-item dataset. We consider a set of users \(U=\{u_1,\ldots ,u_n\}\) in which each user u is associated with a set of items (the movies this user has liked, the pages she has viewed), termed her profile, and noted \( P _{u}\). We note I the universe of all items: \(I= \cup _{u\in U} P _{u}\).

A k-nearest neighbor (KNN) graph associates each user u with the set of k other users \( knn (u) \subseteq U\) which are closest to u according to a given similarity metric on profiles:

$$\begin{aligned} sim : U \times U&\rightarrow \mathbb {R} \\ (u,v)&\mapsto sim(u,v) = f_{ sim }( P _{u}, P _{v}). \end{aligned}$$

Thus, computing the KNN graph amounts to finding \( knn (u)\) for each u such that

$$\begin{aligned} knn (u) \in \mathop {\mathrm {argmax}}_{S \in \mathcal {P}(U\backslash \{u\}): |S|=k} \sum _{v \in S} sim(u,v), \end{aligned}$$
(1)

where \(\mathcal {P}(X)\) is the powerset of a set X. We focus in this work on Jaccard similarity, a commonly used similarity metric, but our work can be applied to others. The Jaccard similarity between two users u and v is expressed as the size of the intersection of their profiles divided by the size of the union of their profiles:

$$\begin{aligned} f_{ sim }( P _{u}, P _{v}) = J( P _{u}, P _{v}) = \frac{| P _{u} \cap P _{v}|}{| P _{u} \cup P _{v}|} \end{aligned}$$
(2)

Since \(| P _{u} \cup P _{v}| = | P _{u}| + | P _{v}| - | P _{u} \cap P _{v}|\), and since we can store \(| P _{u}|\) for every user, computing the size of the intersection is the only non-trivial operation required to compute the Jaccard similarity.
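To make this concrete, here is a minimal sketch (our own, for illustration; it is not the paper’s released code) of this intersection-only evaluation of Eq. (2), with profiles stored as sorted arrays of item identifiers:

```java
public class Jaccard {
    // Size of the intersection of two sorted item-id arrays (linear merge).
    static int intersectionSize(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { count++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return count;
    }

    // J(P_u, P_v) = |inter| / (|P_u| + |P_v| - |inter|), cf. Eq. (2);
    // profile sizes come for free from the array lengths.
    static double jaccard(int[] a, int[] b) {
        int inter = intersectionSize(a, b);
        int union = a.length + b.length - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    public static void main(String[] args) {
        int[] pu = {1, 3, 5, 8};  // profile of u, sorted item ids
        int[] pv = {3, 5, 9};     // profile of v
        System.out.println(jaccard(pu, pv)); // 2 / (4 + 3 - 2) = 0.4
    }
}
```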

2.2 Gance’s Napoléon Tells Us More Than Lucas’s Star Wars

Computing the intersection \( P _{u} \cap P _{v}\) is time-consuming for large sets, and is the main bottleneck when computing Jaccard’s similarity. To reduce the cost of this operation, we propose to sample each profile \( P _{u}\) into a subset \(\widehat{ P }_{u}\) in a preparatory phase applied when the dataset is loaded into memory, and to compute an approximate KNN graph on the sampled profiles.

Although simple, this idea has surprisingly never been applied to the computation of KNN graphs on entity-item datasets. Sampling, however, carries its own risks: if the items that are most characteristic of a user’s profile get deleted, the KNN neighborhood of this user might become irremediably degraded. To avoid this situation, we adopt a constant-size sampling that strives to retain the least popular items in a profile.

The intuition is that unpopular items carry more information about a user’s tastes than other items: if Alice and Bob have both enjoyed Abel Gance’s Napoléon—a 1927 silent movie about Napoléon’s early years—they are more likely to have similar tastes than if they have both liked Star Wars: A New Hope—the 1977 first installment of the series, enjoyed by 96% of users.

2.3 Our Approach: Constant-Size Least Popular Sampling (LP)

More formally, if the size of the profile of a user u is larger than a parameter s, we only keep its s least popular items:

$$\begin{aligned} \widehat{ P }_{u} \in \mathop {\mathrm {argmin}}_{S \in \mathcal {P}^s_u} \sum _{i \in S} pop(i), \end{aligned}$$
(3)

where \(\mathcal {P}^s_u\) is the set of subsets of \( P _{u}\) of a given size s, i.e. \(\mathcal {P}^s_u = \{S \in \mathcal {P}(I): |S|=s \wedge S \subseteq P _{u}\}\), and pop(i) is the popularity of item \(i\in I\) over the entire dataset:

$$\begin{aligned} pop(i) = |\{u \in U: i \in P _{u}\}|. \end{aligned}$$
(4)

If the profile’s size is below s, the profile remains the same: \(\widehat{ P }_{u} = P _{u}\).

In terms of implementation, we compute the popularity of every item when reading the dataset from disk. We then use Eq. (3) to sample the profile of every user in a second iteration. The sampled profiles are finally used to estimate Jaccard’s similarity between users when the KNN graph is constructed:

$$\begin{aligned} \widehat{J}( P _{u}, P _{v}) = J(\widehat{ P }_{u},\widehat{ P }_{v}) = \frac{ |\widehat{ P }_{u}\cap \widehat{ P }_{v}| }{ |\widehat{ P }_{u}| + |\widehat{ P }_{v}| - |\widehat{ P }_{u}\cap \widehat{ P }_{v}| } \end{aligned}$$
(5)
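As an illustration, here is a minimal Java sketch of this two-pass preprocessing (our own simplified version of the description above, not the released implementation; the helper names are hypothetical):

```java
import java.util.*;

public class LpSampling {
    // Pass 1: pop(i) = number of users whose profile contains item i (Eq. 4).
    static Map<Integer, Integer> itemPopularity(Collection<Set<Integer>> profiles) {
        Map<Integer, Integer> pop = new HashMap<>();
        for (Set<Integer> p : profiles)
            for (int item : p) pop.merge(item, 1, Integer::sum);
        return pop;
    }

    // Pass 2: keep the s least popular items of a profile (Eq. 3);
    // profiles of size at most s are left unchanged.
    static Set<Integer> lpSample(Set<Integer> profile, int s, Map<Integer, Integer> pop) {
        if (profile.size() <= s) return profile;
        List<Integer> items = new ArrayList<>(profile);
        items.sort(Comparator.comparingInt(pop::get)); // every item is in pop
        return new HashSet<>(items.subList(0, s));
    }

    public static void main(String[] args) {
        List<Set<Integer>> profiles = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 2, 3, 4)),
            new HashSet<>(Arrays.asList(2, 3, 4)),
            new HashSet<>(Arrays.asList(3, 4)));
        Map<Integer, Integer> pop = itemPopularity(profiles);
        // With s = 2, the first profile keeps its two least popular items: [1, 2].
        System.out.println(lpSample(profiles.get(0), 2, pop));
    }
}
```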

3 Experimental Setup

3.1 Baseline Algorithms and Competitors

Our Constant-Size Least Popular sampling policy (LP for short) can be applied to any KNN graph construction algorithm [4, 5, 10]. For simplicity, we apply it to a brute-force approach that compares each pair of users and keeps the k most similar users for each user. This choice lets us focus on the raw impact of sampling on computation time and KNN quality, without any other interfering mechanism.
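For reference, such a brute-force construction could be sketched as follows (illustrative code under our own conventions, not the authors’ implementation), where the profiles passed in may or may not have been sampled:

```java
import java.util.*;
import java.util.stream.IntStream;

public class BruteForceKnn {
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        long inter = a.stream().filter(b::contains).count();
        return inter == 0 ? 0.0 : (double) inter / (a.size() + b.size() - inter);
    }

    // Compare every pair of profiles and keep, for each user,
    // the k most similar other users (Eq. 1).
    static int[][] bruteForce(List<Set<Integer>> profiles, int k) {
        int n = profiles.size();
        int[][] knn = new int[n][];
        for (int u = 0; u < n; u++) {
            final int cu = u;
            knn[u] = IntStream.range(0, n)
                .filter(v -> v != cu)
                .boxed()
                // ascending order on -sim == descending order on sim
                .sorted(Comparator.comparingDouble(
                    (Integer v) -> -jaccard(profiles.get(cu), profiles.get(v))))
                .limit(k)
                .mapToInt(Integer::intValue)
                .toArray();
        }
        return knn;
    }

    public static void main(String[] args) {
        List<Set<Integer>> profiles = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 2, 3)),
            new HashSet<>(Arrays.asList(2, 3)),
            new HashSet<>(Arrays.asList(7, 8)));
        System.out.println(Arrays.deepToString(bruteForce(profiles, 1)));
    }
}
```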

We use full profiles for our baseline, and compare our approach with three alternative sampling strategies: constant-size most popular, constant-size random, and item sampling.

Baseline: No Sampling. We use our brute-force algorithm without sampling as our baseline. This approach yields an exact result, which we use to assess the approximation introduced by sampling, and provides a reference computation time.

Constant-Size Most Popular Sampling (MP). Similarly to LP, MP only keeps the s most popular items of each profile \(P_u\):

$$\begin{aligned} \widehat{ P }_{u} \in \mathop {\mathrm {argmax}}_{S \in \mathcal {P}^s_u} \sum _{i \in S} pop(i). \end{aligned}$$
(6)

As with LP, we do not sample the profile if its size is lower than s.

Constant-Size Random Sampling (CS). This sampling policy randomly selects s items from \( P _{u}\), with a uniform probability. As above, there is no sampling if the size of the profile is lower than s. In terms of implementation, this policy only requires one iteration over the data.

Item Sampling (IS). This last policy uniformly removes items from the complete dataset. More precisely, each item \(i \in I\) is kept with a uniform probability p to construct a reduced item universe \(\hat{I}\) (i.e. \(\forall i \in I: \mathbb {P}(i \in \hat{I}) = p\)). The sampled profiles are then obtained by keeping the items of each profile that also appear in \(\hat{I}\): \(\widehat{ P }_{u} = P _{u} \cap \hat{I}\). On average, profiles are reduced by a factor of \(\frac{1}{p}\), but this policy does not adapt to the characteristics of individual profiles: small profiles risk losing too much of their content to maintain good-quality results.
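In the same style as the LP sketch of Sect. 2.3, the three alternatives can be sketched as follows (again our own illustrative code, with hypothetical helper names):

```java
import java.util.*;

public class OtherPolicies {
    // MP: keep the s most popular items of a profile (Eq. 6).
    static Set<Integer> mpSample(Set<Integer> profile, int s, Map<Integer, Integer> pop) {
        if (profile.size() <= s) return profile;
        List<Integer> items = new ArrayList<>(profile);
        items.sort(Comparator.comparingInt((Integer i) -> -pop.get(i)));
        return new HashSet<>(items.subList(0, s));
    }

    // CS: keep s items drawn uniformly at random (single pass over the data).
    static Set<Integer> csSample(Set<Integer> profile, int s, Random rng) {
        if (profile.size() <= s) return profile;
        List<Integer> items = new ArrayList<>(profile);
        Collections.shuffle(items, rng);
        return new HashSet<>(items.subList(0, s));
    }

    // IS: intersect the profile with a pre-drawn reduced universe,
    // where each item of I was kept with probability p.
    static Set<Integer> isSample(Set<Integer> profile, Set<Integer> keptUniverse) {
        Set<Integer> sampled = new HashSet<>(profile);
        sampled.retainAll(keptUniverse);
        return sampled;
    }
}
```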

3.2 Datasets

We use four publicly available datasets containing movie ratings (Table 1): three from the MovieLens project, and one from Amazon. Ratings range from disliking (0.5 or 1) to liking (5). To apply the Jaccard similarity, we binarize the datasets by keeping only the ratings that reflect a positive opinion (i.e. \(>3\)), before performing any sampling. Figure 1 shows the resulting Complementary Cumulative Distribution Functions (CCDF) of profile sizes for each dataset. For instance, more than 66% of users have profiles larger than 25 in movielens10M (ml10M). This means that a constant-size sampling with \(s=25\) on movielens10M removes more than 3 million ratings (\(-69.23\%\)).

Table 1. The datasets used in our experiments.

Fig. 1. CCDF of user profile sizes on the datasets used in the evaluation (positive ratings only). Between 77% (movielens1M) and 53% (AmazonMovies) of profiles are larger than the default cut-off value 25 (marked as a vertical bar).

The Three MovieLens Datasets. movielens1M (ml1M for short), movielens10M (ml10M) and movielens20M (ml20M) originate from GroupLens Research [13]. They contain movie reviews posted by online users from 1995 to 2015, and only include users with more than 20 ratings.

The AmazonMovies Dataset (AM). This dataset [20] aggregates movie reviews received by Amazon from 1997 to 2012. To avoid users with very few ratings (the so-called cold-start problem), we only consider users with at least 20 ratings.

3.3 Evaluation Metrics

We measure the effect of sampling along two main metrics: (i) the computation time, and (ii) the quality of the resulting KNN graph.

The time is measured from the beginning of the execution of the algorithm, until the KNN graph is computed. It does not take into account the preprocessing of the dataset, which is evaluated separately in Sect. 4.2.

When applying sampling, the resulting KNN graph is an approximation of the exact one. In many applications, such as recommender systems, this approximation should provide neighborhoods of high quality, even if these do not exactly overlap with the exact KNN sets. To gauge this quality, we introduce a similarity ratio, which measures how the average similarity of an approximate graph compares with that of an exact KNN graph. Formally, we define the average similarity of an approximate KNN graph \(\widehat{G}_{{\text {KNN}}}\) as

$$\begin{aligned} \overline{ sim }(\widehat{G}_{{\text {KNN}}}) = \frac{1}{k \times |U|} \sum _{u \in U} \sum _{v \in \widehat{ knn }(u)} sim(u,v), \end{aligned}$$
(7)

i.e. as the average similarity of the edges of \(\widehat{G}_{{\text {KNN}}}\), and we define the quality of \(\widehat{G}_{{\text {KNN}}}\) as its normalized average similarity

$$\begin{aligned} quality (\widehat{G}_{{\text {KNN}}}) = \frac{\overline{ sim }(\widehat{G}_{{\text {KNN}}})}{\overline{ sim }(G_{{\text {KNN}}})}, \end{aligned}$$
(8)

where \(G_{{\text {KNN}}}\) is an ideal KNN graph, obtained without sampling.

A quality close to 1 indicates that the approximate neighborhoods of \(\widehat{G}_{{\text {KNN}}}\) present a similarity that is very close to that of ideal neighborhoods, and can replace them with little loss in most applications, as we will show in the case of recommendations in our evaluation.
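A minimal sketch of these two metrics (our own, under the assumption that KNN graphs are stored as arrays of neighbor ids, and that sim holds similarities computed on the full, unsampled profiles):

```java
public class KnnQuality {
    // Average similarity of a KNN graph (Eq. 7): the mean of sim(u, v)
    // over the k * |U| directed edges (u, v) of the graph.
    static double avgSim(int[][] knn, double[][] sim) {
        double total = 0;
        long edges = 0;
        for (int u = 0; u < knn.length; u++)
            for (int v : knn[u]) { total += sim[u][v]; edges++; }
        return total / edges;
    }

    // Quality of an approximate graph (Eq. 8): its average similarity
    // normalized by that of the exact graph.
    static double quality(int[][] approx, int[][] exact, double[][] sim) {
        return avgSim(approx, sim) / avgSim(exact, sim);
    }
}
```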

Throughout our experiments, we use a 5-fold cross-validation procedure which creates 5 training sets, each composed of 80% of the ratings. The remaining 20%, i.e. the testing sets, are used for recommendations in Sect. 4.4. Our results are averaged over the 5 resulting runs.

3.4 Experimental Setup

We have implemented the sampling policies in Java 1.8. We ran our experiments on a 64-bit Linux server with two Intel Xeon E5420 CPUs at 2.50 GHz (8 hardware threads in total), 32 GB of memory, and a 750 GB HDD. We use all 8 threads. Our code is available online. In our experiments, we compute KNN graphs with k set to 30, a standard value.

4 Experiments

4.1 Reduction in Computing Time, and Quality/Speed Trade-Off

The baseline algorithm (without sampling) produces an exact KNN graph, with a quality of 1. To compare the different sampling policies (LP, MP, CS and IS) on an equal footing, we configure each of them on each dataset to achieve a quality of 0.9. The resulting parameter s ranges from 15 (LP on AM) to 75 (MP on movielens1M), while p (for IS) varies between 0.35 (on AmazonMovies) and 0.68 (on movielens20M). Table 2 summarizes the computation times measured on the four datasets with the percentage time reduction obtained against the baseline (\(\varDelta \) columns), while Fig. 2 shows the results on movielens10M. LP outperforms all other policies on all datasets, reaching a reduction of up to 63%.

Table 2. Computation time (s) of the baseline and the 4 sampling policies. The parameters were chosen to have a quality equal to 0.9. LP reduces computation time by 40% (ml1M) to 63% (AM), and outperforms other sampling policies on all datasets.
Fig. 2. Computation time and KNN quality of the baseline and the sampling policies on movielens10M, when quality is set to 0.9. LP yields a reduction of 44.2% in computation time, outperforming the other sampling policies.

Fig. 3. Trade-off between computation time and quality. Closer to the top-left corner is better. LP clearly outperforms all other sampling policies on all datasets.

Table 3. Preprocessing time (seconds) for each dataset, and each sampling policy, with parameters set so that the resulting KNN quality is 0.9. The preprocessing times are negligible compared to the computation times.

Because they reduce the size of profiles, sampling policies exchange quality for speed. To better understand this trade-off, Fig. 3 plots the evolution of the computation time and the resulting quality when s ranges from 5 to 200 for LP, MP, and CS (\(s \in \{5, 10, 15, 20, 30, 40, 50, 75, 100, 200\}\)), and p ranges from 0.1 to 1.0 for IS (\(p\in \{0.1, 0.2, 0.4, 0.5, 0.75, 0.9, 1.0\}\)).

For clarity, we only display points with a quality above 0.7, corresponding to the upper values of s and p. The dashed vertical line on the right shows the computation time of the baseline (producing a quality of 1), while the dotted horizontal line shows the quality threshold of 0.9 used in Table 2 and Fig. 2.

Lines closer to the top-left corner are better. The figures confirm that our contribution, LP, outperforms the other sampling policies on all datasets. There is however no clear winner among the remaining policies: IS performs well on movielens1M but arrives last on the other datasets, while the relative order of MP and CS depends on the dataset and the quality considered.

4.2 Preprocessing Overhead

As is common with KNN graph algorithms [5, 10], the previous measurements do not include the loading and preprocessing time of the datasets, which is typically dominated by I/O rather than CPU costs. Sampling adds some overhead to this preprocessing, but Table 3 shows that this extra cost (\(\varDelta \) columns) remains negligible compared to the computation times of Table 2. For instance, LP adds 3.4 s to the preprocessing of movielens20M, which only represents 0.07% of the complete execution time of the algorithm (\(4865\,\mathrm {s} + 11.95\,\mathrm {s} = 4877\,\mathrm {s}\), where 11.95 s is the total preprocessing time with LP). IS even decreases the preprocessing time on three datasets out of four, by sharply reducing the bookkeeping costs of profiles while introducing only a small extra complexity.

4.3 Influence of LP at the User’s Level

Constant-size sampling has a different influence on each user, depending on the size of this user’s profile. Profiles whose size is below the parameter s remain unchanged, while larger profiles are truncated and thus lose information.

Figure 4 investigates the impact of this loss with our approach, LP, on movielens10M with \(s=25\) (corresponding to a quality of 0.9). Figure 4a plots the distribution of the similarity error \(\epsilon =|J( P _{u}, P _{v}) - J(\widehat{ P }_{u},\widehat{ P }_{v})|\) introduced by sampling when \(\epsilon \) is computed for each pair of users (uv). The figure shows that \(35\%\) of pairs experience no error (\(\epsilon =0\)), and that \(96\%\) have an error below 0.05 (dotted vertical line), confirming that our sampling only introduces a limited distortion of similarities.

Figure 4b represents the impact of LP on the quality of users’ neighborhoods, as a function of the initial profile size of users. For every user u with an initial profile of size \(|P_u|\), we compute the average similarity of u’s approximated neighborhood \(\widehat{ knn }(u)\), and normalize it by that of u’s exact neighborhood \( knn (u)\) (the closer to 1, the better). We then average this normalized similarity over users with the same profile size. These values are displayed as a scatter plot (in black; note the log scale on the x axis), and as a moving average of width 50 (red curve). The first dashed vertical line marks the truncation parameter s (\(x=25\)). The points after the second vertical line (at \(x=1553\)) represent only 24 users (out of 69816) and are thus not statistically significant. As expected, there is a clear threshold effect around the truncation value \(s=25\), yet even users with much larger profiles retain a high neighborhood quality, which remains on average above 0.75.

Fig. 4. Influence of LP sampling with \(s=25\) on pairwise similarities and neighborhood quality on movielens10M (total KNN quality equal to 0.9) (Color figure online).

4.4 Recommendations

We want to evaluate the impact of the loss in quality on a practical use of KNN graphs. To do so, we perform item recommendations using the exact KNN graphs and the approximated graphs produced with LP. We recommend the items that a user u is most likely to enjoy. This likelihood is expressed as an average of the ratings an item received from the neighbors of u, weighted by the similarity between u and these neighbors. We use the real profiles, without sampling or binarization, to compute these predicted ratings. After computing the score of every item, we recommend to u a set \(R_u\) composed of the 30 items with the highest scores:

$$\begin{aligned} R_u \in \mathop {\mathrm {argmax}}_{R \subseteq I \backslash P_u : |R|=30} \sum _{i \in R} \sum _{v \in knn (u)} sim(u,v) \times r_{v,i}, \end{aligned}$$
(9)

where \(r_{v,i}\) is the rating given by user v to item i. We use the same 5-fold cross-validation as for the KNN graph computation. We consider a recommendation successful when a recommended item appears in the 20% of removed ratings (the testing set) with a rating above 3 (\(r_{u,i} > 3\)). The quality of the recommendations is measured using recall: the proportion of successful recommendations among all recommendations.
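A sketch of this scoring and of the recall computation (again illustrative; the data layout, with one item-to-rating map per user, is our own assumption):

```java
import java.util.*;
import java.util.stream.Collectors;

public class Recommend {
    // Score candidate items as in Eq. (9): for each item i not in u's profile,
    // score(i) = sum over neighbors v of sim(u, v) * r_{v,i}; keep the top 30.
    static List<Integer> recommend(List<Integer> neighbors, List<Double> simToU,
                                   List<Map<Integer, Double>> ratings,
                                   Set<Integer> profile) {
        Map<Integer, Double> score = new HashMap<>();
        for (int j = 0; j < neighbors.size(); j++) {
            double w = simToU.get(j); // sim(u, v) for the j-th neighbor
            for (Map.Entry<Integer, Double> e : ratings.get(neighbors.get(j)).entrySet())
                if (!profile.contains(e.getKey()))
                    score.merge(e.getKey(), w * e.getValue(), Double::sum);
        }
        return score.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .limit(30)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    // Recall as defined above: the fraction of recommended items that appear
    // in the test set with a rating strictly above 3.
    static double recall(List<Integer> recs, Map<Integer, Double> testRatings) {
        if (recs.isEmpty()) return 0.0;
        long hits = recs.stream()
            .filter(i -> testRatings.getOrDefault(i, 0.0) > 3).count();
        return (double) hits / recs.size();
    }
}
```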

Table 4. Recommendation recall without sampling (Base.) and using the Least Popular (LP) policy (total KNN quality set to 0.9).

Table 4 shows the recall obtained using the exact KNN graphs of the baseline and the approximate graphs produced by LP, when the KNN quality is set to 0.9. In spite of its approximation, LP introduces no loss in recall, and even achieves slightly better scores than the baseline, which shows that our sampling approach can be used with little impact in concrete applications.

5 Related Work

For small datasets, specific data structures can be used to compute KNN graphs very efficiently [3, 18, 21]. These solutions however do not scale, and efficiently computing exact KNN graphs on large datasets remains an open problem.

For large datasets, an approximation of the KNN graph, called an approximate nearest-neighbor (ANN) graph, is computed instead by decreasing the number of comparisons between users. Locality-Sensitive Hashing (LSH) [11, 14] hashes users into buckets, and only users within the same bucket are compared. Depending on the chosen similarity, different hashing functions are used [6,7,8]. Despite being very efficient for KNN queries, its preprocessing is too expensive to compete with other ANN graph algorithms. KIFF [5] first assigns to every user the users with which she shares at least one item. Since the Jaccard similarity is null when two users share no item, the neighbor search can be limited to these candidates. This algorithm performs particularly well on sparse datasets. Hyrec [4] and NNDescent [10] rely on the assumption that the neighbors of neighbors are more likely to be neighbors themselves than random users, to drastically decrease the number of similarities computed.

It seems difficult, however, to lower the number of similarity computations much further. An orthogonal strategy is to speed up the similarity computation itself by compacting the users’ profiles. b-bit minwise hashing [2, 16] relies on an approach similar to LSH to compact users’ profiles in order to approximate the Jaccard similarity. It is space-efficient, but at the expense of a high preprocessing time. In [9], the profiles are compacted into bit arrays: each bit represents a feature whose value has been rounded. This does not scale, and cannot be used in our case, where the items are the features. To avoid this problem, [12] uses constant-sized Bloom filters to encode the profiles; Jaccard’s similarity is then approximated with a bitwise AND operation. Despite its privacy properties and its speed-up, this approach incurs a substantial loss in precision.

As far as we know, sampling has never been used to compact users’ profiles, even though it is used in information filtering systems such as collaborative filtering, e.g. to find association rules [1], to reduce the size of the universe of items to recommend [17], or to change the distribution of the training points [19, 23]. Popularity has been used to address the cold-start problem [24], by finding items a new user is likely to rate, but not to represent her profile in a compact manner.

6 Conclusion

In this paper, we have proposed Constant-Size Least Popular Sampling (LP) to speed up the construction of KNN graphs on entity-item datasets. By keeping only the least popular items of users’ profiles, we make them shorter and thus faster to compare. Our extensive evaluation on four realistic datasets shows that LP outperforms more straightforward sampling policies. More precisely, LP is able to decrease the computation time of KNN graphs by up to 63%, while providing a KNN graph close to the ideal one, with no observable loss when used to compute recommendations.

In the future, we plan to investigate more advanced sampling policies, and to explore how sampling could be combined with orthogonal greedy techniques to accelerate KNN graph computations [4, 5, 10].