Encyclopedia of Big Data

Living Edition
| Editors: Laurie A. Schintler, Connie L. McNeely

Collaborative Filtering

  • Ashrf Althbiti
  • Xiaogang Ma
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-32001-4_274-1

Abstract

Collaborative filtering (CF) is a process to filter information or patterns through collaboration among multiple agents and data sources. The main idea of CF is to effectively extract useful information from the overwhelming amount of collected data. This article discusses the core concepts of CF techniques and explains how CF is utilized in a recommender system (RS). An RS provides recommendations to an active user based on items that other, similar users prefer. CF makes automatic predictions of a user's interests by utilizing stored data of many users, which makes it a key method for RS.

Introduction

Collaborative filtering (CF) depends entirely on users' contributions, such as ratings or reviews of items. It exploits the matrix of collected user-item ratings as the main source of input and ultimately provides recommendations as output in one of two forms: (1) a numerical prediction of how much an active user U would like each item and (2) a list of the top-N recommended items. CF assumes that similar users express similar patterns of rating behavior and, likewise, that similar items receive similar ratings. There are two primary classes of CF algorithms: (1) neighborhood-based and (2) model-based (Aggarwal 2016).

Neighborhood-based CF algorithms (aka memory-based) directly utilize stored user-item ratings to predict ratings for unseen items. They take two primary forms: (1) user-based nearest neighbor CF and (2) item-based nearest neighbor CF (Aggarwal 2016). In user-based CF, two users are similar if they have rated several items in a similar way; the algorithm therefore recommends to a user the items most preferred by similar users. In contrast, item-based CF recommends to a user the items most similar to the user's previous purchases, where two items are similar if several users have rated them in a similar way.

Model-based CF algorithms (aka learning-based models) form an alternative approach that maps both items and users into the same latent factor space. These algorithms utilize users' ratings to learn a predictive model (Ning et al. 2015). The latent factor space attempts to explain ratings by characterizing both items and users along factors automatically inferred from previous users' ratings (Koren and Bell 2015).

Methodology

Neighborhood-Based CF Algorithms

User-Based CF

User-based CF assumes that users who rated items in a similar fashion in the past will give similar ratings to new items in the future. For instance, Table 1 shows a user-item rating matrix that includes four users' ratings of four items. The task is to predict the rating of the unrated item3 by the active user Andy.
Table 1 User-item rating dataset

User name | Item1 | Item2 | Item3 | Item4
Andy      | 3     | 3     | ?     | 5
U1        | 4     | 2     | 2     | 4
U2        | 1     | 1     | 4     | 2
U3        | 5     | 2     | 3     | 4

In order to solve this task, the following notation is used. The set of users is denoted \( U=\left\{{u}_1,\dots, {u}_m\right\} \), the set of items is denoted \( I=\left\{{i}_1,\dots, {i}_n\right\} \), and the matrix of ratings is denoted R, where \( {r}_{u,i} \) is the rating of user u for item i. The set of possible ratings is denoted S and takes a range of numerical values, here {1, 2, 3, 4, 5}; most systems interpret the value 1 as "strongly dislike" and the value 5 as "strongly like." It is worth noting that each entry \( {r}_{u,i} \) holds at most one rating value.
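For concreteness, the rating data in Table 1 can be represented as a small matrix. The following minimal Python sketch (using NumPy, with np.nan marking unrated entries; variable names are illustrative) sets up the matrix and the per-user average ratings used below. Note that Andy's exact average is 11/3 ≈ 3.67; the worked example below truncates it to 3.6.

```python
import numpy as np

# Rating matrix R from Table 1; np.nan marks unrated items.
# Rows: Andy, U1, U2, U3. Columns: item1..item4.
R = np.array([
    [3, 3, np.nan, 5],   # Andy; item3 is the rating to predict
    [4, 2, 2,      4],   # U1
    [1, 1, 4,      2],   # U2
    [5, 2, 3,      4],   # U3
])

# Average rating of each user over the items that user has rated.
user_means = np.nanmean(R, axis=1)
print(np.round(user_means, 2))   # [3.67 3.   2.   3.5 ]
```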

The first step is to compute the similarity between Andy and the other three users. In this example, the similarity between two users is computed using Pearson's correlation coefficient, Eq. (1).
$$ sim\left(u,v\right)=\frac{\sum_{i\in I}\left({r}_{u,i}-\overline{r_u}\right)\left({r}_{v,i}-\overline{r_v}\right)}{\sqrt{\sum_{i\in I}{\left({r}_{u,i}-\overline{r_u}\right)}^2}\times \sqrt{\sum_{i\in I}{\left({r}_{v,i}-\overline{r_v}\right)}^2}} $$
(1)
where \( \overline{r_u} \) and \( \overline{r_v} \) are the average ratings of users u and v over the items they have rated, and the sums run over the items rated by both users.
By applying Eq. (1) to the rating data in Table 1, given that \( \overline{r_{Andy}}=\frac{3+3+5}{3}\approx 3.6 \) and \( \overline{r_{U1}}=\frac{4+2+2+4}{4}=3 \), the similarity between Andy and U1 is calculated as follows:
$$ sim\left( Andy,U1\right)=\frac{\left(3-3.6\right)\left(4-3\right)+\left(3-3.6\right)\left(2-3\right)+\left(5-3.6\right)\left(4-3\right)}{\sqrt{{\left(3-3.6\right)}^2+{\left(3-3.6\right)}^2+{\left(5-3.6\right)}^2}\times \sqrt{{\left(4-3\right)}^2+{\left(2-3\right)}^2+{\left(4-3\right)}^2}}\approx 0.49 $$
(2)

It is worth noting that Pearson's correlation coefficient takes values in the range [−1, +1], where +1 indicates a strong positive correlation and −1 a strong negative correlation. The similarities between Andy and U2 and between Andy and U3 are 0.15 and 0.19, respectively. Based on these calculations, U1 and U3 rated items most similarly to Andy in the past, so U1 and U3 are used in this example to predict Andy's rating of item3.
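A direct implementation of Eq. (1) is sketched below, continuing the NumPy setup above; the function name pearson_sim is illustrative. With exact (untruncated) user means it yields sim(Andy, U1) ≈ 0.47 and sim(Andy, U3) ≈ 0.19; the value 0.49 above comes from truncating Andy's mean to 3.6.

```python
def pearson_sim(R, u, v):
    """Pearson correlation between users u and v (Eq. 1), computed
    over the items both users have rated, centering each user's
    ratings on that user's overall mean."""
    both = ~np.isnan(R[u]) & ~np.isnan(R[v])   # co-rated items
    du = R[u, both] - np.nanmean(R[u])
    dv = R[v, both] - np.nanmean(R[v])
    return (du @ dv) / np.sqrt((du @ du) * (dv @ dv))

print(round(pearson_sim(R, 0, 1), 2))   # Andy vs U1 -> 0.47
print(round(pearson_sim(R, 0, 3), 2))   # Andy vs U3 -> 0.19
```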

The second step is to compute the prediction for item3 using the ratings of Andy's K nearest neighbors (here, U1 and U3). Eq. (3) gives the prediction, where \( \hat{r} \) denotes the predicted rating and K denotes the set of neighbors:
$$ \hat{r}\left(u,i\right)=\overline{r_u}+\frac{\sum_{v\in K} sim\left(u,v\right)\left({r}_{v,i}-\overline{r_v}\right)}{\sum_{v\in K} sim\left(u,v\right)} $$
(3)
$$ \begin{aligned} \hat{r}\left( Andy, item3\right) &= \overline{r_{Andy}}+\frac{sim\left( Andy,U1\right)\left({r}_{U1, item3}-\overline{r_{U1}}\right)+ sim\left( Andy,U3\right)\left({r}_{U3, item3}-\overline{r_{U3}}\right)}{sim\left( Andy,U1\right)+ sim\left( Andy,U3\right)} \\ &= 3.6+\frac{0.49\times \left(2-3\right)+0.19\times \left(3-3.5\right)}{0.49+0.19}\approx 2.74 \end{aligned} $$
(4)

Given the result computed by Eq. (4), the predicted rating for item3 falls below Andy's average rating, since both neighbors rated item3 below their own averages. Item3 is therefore unlikely to be a strong candidate for Andy's recommendation list.
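The same prediction can be computed with the helper below, which implements Eq. (3) on top of pearson_sim; the name predict_rating and the explicit neighbor list are illustrative. With exact means it gives ≈ 2.81, versus ≈ 2.74 with the rounded values used above.

```python
def predict_rating(R, u, item, neighbors):
    """Mean-centered weighted average of the neighbors' ratings (Eq. 3)."""
    sims = [pearson_sim(R, u, v) for v in neighbors]
    num = sum(s * (R[v, item] - np.nanmean(R[v]))
              for s, v in zip(sims, neighbors))
    return np.nanmean(R[u]) + num / sum(sims)

# Predict Andy's (row 0) rating of item3 (column 2) from U1 and U3.
print(round(predict_rating(R, 0, 2, [1, 3]), 2))   # -> 2.81
```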

Item-Based CF

Item-based CF algorithms were introduced to address serious challenges that arise when applying user-based nearest neighbor CF algorithms. The main challenge is that when the system stores massive numbers of users, the complexity of the prediction task increases sharply. Accordingly, when the number of items is smaller than the number of users, it is preferable to adopt item-based CF algorithms.

This approach computes the similarity between items instead of between an enormous number of potential neighbor users. It makes a prediction for item i by considering user u's own ratings of the items most similar to i. Users may also find recommendations derived from their own ratings more convincing than recommendations derived from other users' ratings.

Equation (5) is used to compute the similarity between two items.
$$ sim\left(i,j\right)=\frac{\sum_{u\in U}\left({r}_{u,i}-\overline{r_i}\right)\left({r}_{u,j}-\overline{r_j}\right)}{\sqrt{\sum_{u\in U}{\left({r}_{u,i}-\overline{r_i}\right)}^2}\times \sqrt{\sum_{u\in U}{\left({r}_{u,j}-\overline{r_j}\right)}^2}} $$
(5)

In Eq. (5), \( \overline{r_i} \) and \( \overline{r_j} \) are the average ratings received by items i and j, and the sums run over the users who have rated both items.

The prediction of item i for user u is then made by applying Eq. (6), where K is the set of items rated by user u that are most similar to item i.
$$ \hat{r}\left(u,i\right)=\overline{r_i}+\frac{\sum_{j\in K} sim\left(i,j\right)\left({r}_{u,j}-\overline{r_j}\right)}{\sum_{j\in K} sim\left(i,j\right)} $$
(6)
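Item-based CF can be sketched analogously. The code below implements Eqs. (5) and (6) under the same NumPy setup, with illustrative names and with item means taken over all users who rated each item:

```python
def item_sim(R, i, j):
    """Similarity between items i and j (Eq. 5), computed over the
    users who rated both, centering on each item's average rating."""
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])   # co-raters
    di = R[both, i] - np.nanmean(R[:, i])
    dj = R[both, j] - np.nanmean(R[:, j])
    return (di @ dj) / np.sqrt((di @ di) * (dj @ dj))

def item_predict(R, u, i, neighbors):
    """Predict R[u, i] from user u's ratings of the items in
    `neighbors`, the rated items most similar to i (Eq. 6)."""
    sims = [item_sim(R, i, j) for j in neighbors]
    num = sum(s * (R[u, j] - np.nanmean(R[:, j]))
              for s, j in zip(sims, neighbors))
    return np.nanmean(R[:, i]) + num / sum(sims)
```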

Model-Based CF Algorithms

Model-based CF algorithms take raw data that has been preprocessed in an offline step, where the data typically needs to be cleansed, filtered, and transformed, and then generate a learned model used to make predictions. This approach mitigates several issues that appear in neighborhood-based CF algorithms: (1) limited coverage, since neighbors can only be found among users who have rated common items, and (2) sparsity of the rating matrix, since different users rate largely different sets of items.

Model-based CF algorithms compute the similarities between users or items by developing a parametric model that captures their relationships and patterns. They are classified into two main categories: (1) factorization methods and (2) adaptive neighborhood learning methods (Ning et al. 2015).

Factorization Methods

Factorization methods characterize ratings by projecting users and items into a reduced latent vector space. This helps discover more expressive relations between each pair of users, items, or both. There are two main types: (1) factorization of a sparse similarity matrix and (2) factorization of the actual rating matrix (Jannach et al. 2010).

The factorization is done using singular value decomposition (SVD) or principal component analysis (PCA). The original sparse rating or similarity matrix is decomposed into a lower-rank approximation that captures the most highly correlated relationships. It is worth mentioning that the SVD theorem (Golub and Kahan 1965) states that a matrix M can be decomposed into a product of three matrices as follows:
$$ M=U\Sigma {V}^T $$
(7)
where U and V contain the left and right singular vectors, respectively, and the diagonal entries of \( \Sigma \) are the singular values.
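As a minimal illustration of Eq. (7), the sketch below fills the missing entries of the Table 1 matrix with item means (one simple convention among several), computes the SVD with NumPy, and keeps k = 2 latent factors to obtain a low-rank approximation from which missing ratings can be read off:

```python
# Impute missing entries with each item's mean rating, then factorize.
R_filled = np.where(np.isnan(R), np.nanmean(R, axis=0), R)

U_mat, s, Vt = np.linalg.svd(R_filled, full_matrices=False)

k = 2                                          # latent factors kept
R_hat = U_mat[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(round(R_hat[0, 2], 2))                   # estimate for Andy, item3
```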

Adaptive Neighborhood Learning Methods

This approach combines the original neighborhood-based and model-based CF methods. Its main difference from basic neighborhood-based methods is that the similarities are learned directly from the user-item rating matrix rather than computed with predefined similarity measures.
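A minimal sketch of this idea, loosely in the spirit of SLIM-style models, is shown below: an item-item weight matrix is learned by gradient descent so that each observed rating is reproduced from the user's other ratings. The loss, learning rate, and iteration count are illustrative assumptions, not a prescribed algorithm.

```python
# Learn item-item weights W directly from the rating matrix by
# gradient descent on the observed entries (illustrative sketch).
rng = np.random.default_rng(0)
n_items = R.shape[1]
W = rng.normal(scale=0.01, size=(n_items, n_items))
np.fill_diagonal(W, 0.0)          # an item must not predict itself

X = np.nan_to_num(R)              # observed ratings, 0 where missing
mask = ~np.isnan(R)               # which entries are observed

for _ in range(1000):
    err = np.where(mask, X @ W - X, 0.0)   # error on observed entries
    grad = X.T @ err + 0.1 * W             # squared loss + L2 penalty
    W -= 0.005 * grad
    np.fill_diagonal(W, 0.0)               # keep the constraint

print(round((X @ W)[0, 2], 2))    # learned estimate for Andy, item3
```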

Conclusion

This article discusses a general perception of CF. CF is one of the earliest approaches proposed for information filtering and recommendation making. Nevertheless, CF still ranks among the most popular methods employed today in research on the Web, big data, and data mining.

References

  1. Aggarwal, C. C. (2016). An introduction to recommender systems. In Recommender systems (pp. 1–28). Cham: Springer.
  2. Golub, G., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2), 205–224.
  3. Jannach, D., Zanker, M., Felfernig, A., & Friedrich, G. (2010). Recommender systems: An introduction. Cambridge, UK: Cambridge University Press.
  4. Koren, Y., & Bell, R. (2015). Advances in collaborative filtering. In Recommender systems handbook (pp. 77–118). Boston: Springer.
  5. Ning, X., Desrosiers, C., & Karypis, G. (2015). A comprehensive survey of neighborhood-based recommendation methods. In Recommender systems handbook (pp. 37–76). Boston: Springer.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, University of Idaho, Moscow, ID, USA