Sketching for Big Data Recommender Systems Using Fast Pseudo-random Fingerprints

  • Yoram Bachrach
  • Ely Porat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7966)


A key building block of collaborative filtering recommender systems is finding users with similar consumption patterns. Given full access to the items consumed by each user, one can directly compute the similarity between any two users. For massive recommender systems, however, this naive approach has a high running time and may be intractable in terms of the space required to store the full data. One way to overcome this is sketching, a technique that represents massive datasets concisely while still allowing properties of these datasets to be computed. Sketching methods maintain very short fingerprints of users' item sets, from which the similarity between the item sets of different users can be computed approximately.
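
To make the idea of a fingerprint concrete, the following is a minimal Python sketch of classic min-wise hashing in the spirit of [7, 8]. It is not the algorithm proposed in this paper. It assumes that similarity means the Jaccard coefficient of two item sets, that item IDs are non-negative integers below 2^64, and that a salted BLAKE2b digest is an acceptable stand-in for a random hash function. The names `hash_item`, `fingerprint` and `estimate_jaccard` are hypothetical.

```python
import hashlib

def hash_item(i, item):
    """i-th hash function: a salted 64-bit BLAKE2b digest of an integer item ID."""
    digest = hashlib.blake2b(
        item.to_bytes(8, "little"),
        digest_size=8,
        salt=i.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def fingerprint(items, k):
    """Keep, for each of k hash functions, the minimum hash over a non-empty item set."""
    return [min(hash_item(i, x) for x in items) for i in range(k)]

def estimate_jaccard(fp_a, fp_b):
    """Fraction of positions at which the two fingerprints agree."""
    matches = sum(u == v for u, v in zip(fp_a, fp_b))
    return matches / len(fp_a)

def exact_jaccard(a, b):
    """Exact Jaccard coefficient, computed from the full item sets."""
    return len(a & b) / len(a | b)

# Hypothetical users: item IDs 0..299 and 150..449, so the true Jaccard is 1/3.
alice = set(range(0, 300))
bob = set(range(150, 450))
k = 256  # fingerprint length (assumed value, chosen for illustration)
fp_alice = fingerprint(alice, k)
fp_bob = fingerprint(bob, k)
print("exact:   ", exact_jaccard(alice, bob))
print("estimate:", estimate_jaccard(fp_alice, fp_bob))
```

Each fingerprint here has a fixed length of `k` values regardless of how many items the user consumed, and two fingerprints are compared without touching the original item sets. Building a fingerprint this way costs `k` hash evaluations per item, which is the kind of cost that faster fingerprinting techniques aim to reduce.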

The state-of-the-art sketch [22] has a very low space complexity, and a recent technique [14] shows how to exponentially speed up the computation of the fingerprints. Unfortunately, these methods are incompatible, forcing a choice between a low running time and a small sketch size. We propose an alternative sketching approach that achieves both a low space complexity similar to that of [22] and a low time complexity similar to that of [14]. We empirically evaluate our algorithm on the Netflix dataset, analyzing the running time and the sketch size of our approach and comparing them to those of the alternatives. Further, we show that in practice the accuracy achieved by our approach is better than the accuracy guaranteed by the theoretical bounds, so even shorter fingerprints suffice to obtain high-quality results.
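
As a rough illustration of how a sketch can trade bits for accuracy, the following Python sketch keeps only the lowest `b` bits of each min-hash value, in the spirit of the b-bit approach of [22]. It is a simplified sketch under stated assumptions, not the method of [22] or of this paper: it assumes that when two minima differ, their low bits agree with probability 2^-b (reasonable when item sets are small relative to the hash range), and it leaves out both the exact correction terms of [22] and the speed-up of [14]. All function names are hypothetical.

```python
import hashlib

def min_hash(i, items):
    """Minimum of the i-th salted 64-bit hash over a non-empty set of integer item IDs."""
    salt = i.to_bytes(8, "little")
    return min(
        int.from_bytes(
            hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8, salt=salt).digest(),
            "little",
        )
        for x in items
    )

def b_bit_fingerprint(items, k, b):
    """Store only the lowest b bits of each of the k min-hash values."""
    mask = (1 << b) - 1
    return [min_hash(i, items) & mask for i in range(k)]

def estimate_jaccard_b_bit(fp_a, fp_b, b):
    """Correct the observed match rate for chance agreement of b-bit values.

    Assumption: P(match) = 2^-b + (1 - 2^-b) * J, solved for J.
    """
    match_rate = sum(u == v for u, v in zip(fp_a, fp_b)) / len(fp_a)
    chance = 2.0 ** -b
    return (match_rate - chance) / (1.0 - chance)

# Hypothetical users with true Jaccard 1/3 (75 shared items out of 225).
k, b = 512, 2  # assumed values, chosen for illustration
fp_u = b_bit_fingerprint(set(range(0, 150)), k, b)
fp_v = b_bit_fingerprint(set(range(75, 225)), k, b)
print("bits per user:", k * b)
print("estimate:", estimate_jaccard_b_bit(fp_u, fp_v, b))
```

With these assumed parameters each user is summarized in `k * b` = 1024 bits. Because the chance-agreement term is subtracted, the estimate can come out slightly negative for very dissimilar users, so a caller may want to clamp it to the range [0, 1].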




  1. Aggarwal, C.C.: Data streams: models and algorithms. Springer-Verlag New York Inc. (2007)
  2. Alon, N., Matias, Y., Szegedy, M.: The Space Complexity of Approximating the Frequency Moments. J. Computer and System Sciences 58(1), 137–147 (1999)
  3. Bachrach, Y., Herbrich, R.: Fingerprinting Ratings for Collaborative Filtering - Theoretical and Empirical Analysis. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 25–36. Springer, Heidelberg (2010)
  4. Bachrach, Y., Herbrich, R., Porat, E.: Sketching algorithms for approximating rank correlations in collaborative filtering systems. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 344–352. Springer, Heidelberg (2009)
  5. Bachrach, Y., Porat, E., Rosenschein, J.S.: Sketching techniques for collaborative filtering. In: IJCAI, Pasadena, California (July 2009)
  6. Bennett, J., Lanning, S.: The Netflix prize. In: KDD Cup and Workshop (2007)
  7. Broder, A.Z.: On the resemblance and containment of documents. Sequences (1998)
  8. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. Journal of Computer and System Sciences 60(3), 630–659 (2000)
  9. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55(1), 58–75 (2005)
  10. Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: VLDB (2005)
  11. Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW. ACM (2007)
  12. Dasgupta, A., Kumar, R., Sarlos, T.: Fast locality-sensitive hashing. In: SIGKDD (2011)
  13. Datar, M., Muthukrishnan, S.: Estimating rarity and similarity over data stream windows. In: Möhring, R., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 323–335. Springer, Heidelberg (2002)
  14. Feigenblat, G., Shiftan, A., Porat, E.: Exponential time improvement for min-wise based algorithms. In: SODA (2011)
  15. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301), 13–30 (1963)
  16. Indyk, P.: A Small Approximately Min-Wise Independent Family of Hash Functions. Journal of Algorithms 38(1), 84–90 (2001)
  17. Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM) 53(3), 323 (2006)
  18. Kane, D.M., Nelson, J., Porat, E., Woodruff, D.P.: Fast moment estimation in data streams in optimal space. In: STOC (2011)
  19. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS, pp. 41–52. ACM (2010)
  20. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28(1), 51–55 (2003)
  21. Kirsch, A., Mitzenmacher, M.: Less hashing, same performance: a better Bloom filter. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 456–467. Springer, Heidelberg (2006)
  22. Li, P., Koenig, C.: b-Bit minwise hashing. In: WWW (2010)
  23. Mulmuley, K.: Randomized geometric algorithms and pseudorandom generators. Algorithmica (1996)
  24. Pǎtraşcu, M., Thorup, M.: On the k-Independence Required by Linear Probing and Minwise Independence. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) ICALP 2010. LNCS, vol. 6198, pp. 715–726. Springer, Heidelberg (2010)
  25. Pavan, A., Tirthapura, S.: Range-efficient counting of distinct elements in a massive data stream. SIAM Journal on Computing 37(2), 359–379 (2008)
  26. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In: Computer Supported Cooperative Work (1994)
  27. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW (2001)
  28. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009, 4 (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yoram Bachrach (1)
  • Ely Porat (2)
  1. Microsoft Research, Cambridge, UK
  2. Bar-Ilan University, Ramat-Gan, Israel
