1 Background

In 2006, Netflix released a dataset containing 100,480,507 movie ratings (on a 1–5 scale) from \(m=480,189\) users on \(n=17,770\) movies [10]. The set was divided into 99,072,112 training points and 1,408,395 probe points for contestants to train and validate models. Netflix’s in-house algorithm Cinematch scored an RMSE (root mean square error) of 0.9514 on the probe set. The prize of one million dollars went to the first contestants to improve RMSE on a hidden test set by 10%. A combined team of “BellKor in BigChaos” and “Pragmatic Theory” accomplished this in 2009 with an RMSE of 0.8558 on the probe data [9, 61, 79]. As one of the largest real-life datasets available, the Netflix Prize data remains a benchmark for innovations in recommender systems today.

1.1 Results

Tables 1 and 2 catalogue the final results. Full details and methods follow in the subsequent sections.

Table 1. Normalized model performance under different metrics.
Table 2. Normalized ensemble performance under different metrics.

2 Problem Description

Let M be an \(m\times n\) matrix where entry \(M_{ij} \in \{1,\cdots ,5\}\) contains user i’s rating of movie j. The matrix completion problem aims to recover the matrix M from a subset \(\varOmega \subset [m]\times [n]\) of its entries. We let \(\mathcal {P}_\varOmega (M)\) denote the projection of M onto this subset, which amounts to zeroing out unobserved elements of M. We further divide the probe data randomly into a test set of 1,000,000 points and a validation set of the remaining 408,395 points. Our aim is to construct ensembles of predictors. We build individual predictors using the training set and build ensemble estimators using the validation set. We then report RMSE performance on the test set.
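To make the setup concrete, the following sketch (with illustrative index arrays; not our production pipeline) stores \(\mathcal {P}_\varOmega (M)\) sparsely, so the zeroed-out unobserved entries are never materialized, and performs the random probe split:

import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

m, n = 480_189, 17_770                       # users, movies
users = rng.integers(0, m, size=1_000)       # toy sample of observed indices
movies = rng.integers(0, n, size=1_000)
ratings = rng.integers(1, 6, size=1_000).astype(np.float64)

# P_Omega(M): observed entries kept, unobserved entries implicitly zero.
P_Omega_M = csr_matrix((ratings, (users, movies)), shape=(m, n))

# Random split of the 1,408,395 probe points into test and validation sets.
probe_idx = rng.permutation(1_408_395)
test_idx, val_idx = probe_idx[:1_000_000], probe_idx[1_000_000:]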

As the de facto standard error for regression tasks, RMSE penalizes larger errors more than MAE (mean absolute error). However, the extent to which RMSE accurately represents misclassification loss remains debatable. For example, correctly distinguishing between 3- and 5-star ratings may prove much more important than distinguishing between 1- and 3-star ratings [30]. If the ultimate goal is to produce a user’s top-N movies, [17] argue that precision- and recall-based metrics characterize success much better.

3 SVD

SVD methods play an important role in matrix completion. We discuss this approach first because many of the subsequent methods benefit greatly from a learned SVD, either using its parameters for initialization or using them to estimate user-user and movie-movie similarities.

Under the common additional assumption that M is of low rank r, where \(r\ll \min \{m,n\}\), we can consider the SVD of M given by \(M = U\varSigma V^\top \), where U is an \(m\times r\) matrix with orthonormal columns, \(\varSigma \) is an \(r\times r\) diagonal matrix of positive entries, and V is an \(n\times r\) matrix with orthonormal columns. Such a matrix has \(\mathcal {O}(mr)\) degrees of freedom (assuming, as in our case, that \(m>n\)). Considering uniform sampling with replacement as a coupon collector’s problem, we then require at least \(\mathcal {O}(mr\log m)\) entries from M for a successful reconstruction [15]. We hypothesize that users rate movies based upon r underlying features, whose relative importance is given by the singular values of M. The rows of U correspond to feature weightings for each user and the rows of V correspond to feature weightings for each item.

3.1 Low Rank Recovery

[14, 15] described sufficient conditions on the incoherence of the rows of U and V such that M could be recovered with high probability through nuclear norm-minimization. Recall that the nuclear norm is given by \(\Vert M\Vert _* = \textstyle \sum _{k=1}^r \sigma _k(M)\), where \(\sigma _k(M)\) denotes the kth singular value of M. The precise optimization problem was to minimize \(\Vert \hat{M} \Vert _*\) subject to \(\mathcal {P}_\varOmega (M) = \mathcal {P}_\varOmega (\hat{M})\), where \(\hat{M}\) denotes the estimate for M. Such a choice of objective function makes the problem convex (indeed, all norms are convex), as opposed to minimizing \({\text {rank}}(\hat{M})\). In a similar vein, [52] developed “soft-thresholded” SVD to find \(\hat{M}\) minimizing \(\tfrac{1}{2}\Vert \mathcal {P}_\varOmega (\hat{M}-M) \Vert _\text {F}^2 + \lambda \Vert \hat{M}\Vert _\text {*}\), where \(\Vert \cdot \Vert _\text {F}\) denotes the Frobenius matrix norm.
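Mazumder et al.’s objective admits a simple iterative solver (soft-impute), which repeatedly fills the unobserved entries with the current estimate and soft-thresholds the singular values. The following dense toy sketch illustrates the iteration; at Netflix scale a sparse implementation is required:

import numpy as np

def soft_impute(M, mask, lam, iters=100):
    # M: observed matrix with zeros at unobserved entries.
    # mask: boolean array, True where M is observed.
    # lam: the nuclear-norm penalty lambda.
    Z = np.zeros_like(M)
    for _ in range(iters):
        filled = np.where(mask, M, Z)   # keep observed entries, impute the rest
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)    # soft-threshold the singular values
        Z = (U * s) @ Vt
    return Z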

4 Factorization Models

We implemented numerous approaches to matrix factorization. In this approach, we estimate \(M \approx UV^\top \), where U is an \(m\times k\) matrix of user factors and V is an \(n\times k\) matrix of item factors. Unconstrained matrix factorization [75, 76] simply finds \(\mathop {\mathrm {arg\,min}}_{U,V} \Vert \mathcal {P}_\varOmega (M - UV^\top )\Vert _\text {F}^2\), where the matrix norm is the Frobenius norm, and predicts \(\hat{M} = UV^\top \). The straightforward mathematical description leaves a handful of implementation choices. We can initialize U and V either randomly or from a learned SVD model by taking, for example, \(U\sqrt{\varSigma }\) and \(V\sqrt{\varSigma }\). We can perform optimization with the whole dataset in memory or perform batch-based optimization by iterating over subsets of the training data.

We also implemented Tikhonov-regularized [77, 78] matrix factorization [66, 73] to solve \(\mathop {\mathrm {arg\,min}}_{U,V} \Vert \mathcal {P}_\varOmega (M - UV^\top )\Vert _\text {F}^2 + \lambda \big (\Vert U\Vert _\text {F}^2 + \Vert V\Vert _\text {F}^2\big )\), and non-negative matrix factorization [46, 47], with the constraint that all entries of U and V be non-negative. In our experience, model performance seemed tightly coupled with initialization.
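A minimal stochastic gradient sketch of the Tikhonov-regularized objective follows; the learning rate, penalty, and initialization scale are illustrative rather than our tuned values, and a vectorized or compiled loop is needed at full scale:

import numpy as np

def mf_sgd(rows, cols, vals, m, n, k=20, lam=0.05, lr=0.01, epochs=10, seed=0):
    # rows, cols, vals: observed (user, movie, rating) triples.
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((m, k))   # or initialize from a learned SVD
    V = 0.1 * rng.standard_normal((n, k))
    for _ in range(epochs):
        for i, j, r in zip(rows, cols, vals):
            err = r - U[i] @ V[j]           # residual on one observed entry
            U[i], V[j] = (U[i] + lr * (err * V[j] - lam * U[i]),
                          V[j] + lr * (err * U[i] - lam * V[j]))
    return U, V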

4.1 Accounting for Implicit Preferences

Hu et al. developed a weighted matrix factorization method that accounts for the implicit preference a user gives to a movie through the act of watching and rating it [38]. Called weighted alternating least squares (WALS), this method seeks \(\mathop {\mathrm {arg\,min}}_{U,V} \big \Vert W \odot \mathcal {P}_\varOmega (M - UV^\top )\big \Vert _\text {F}^2\), where W is a weight matrix whose entries reflect the number of movies each user has rated and \(\odot \) denotes element-wise multiplication. We used the contributed TensorFlow model and initialized with SVD output.

4.2 Neural Network Matrix Factorization

This estimator passes learned representation vectors for users and movies through a feedforward, fully connected neural network to predict the corresponding rating. Such an approach is commonly referred to as neural network matrix factorization [21, 32]. Learned model parameters consist of m user vectors in \(\mathbb {R}^r\), n movie vectors in \(\mathbb {R}^r\), and all parameters for the neural network. We initialized user vectors with the U matrix and the movie vectors with the V matrix from the soft-thresholded SVD model. Neural network parameters received Glorot uniform initialization [28]. For each batch, we performed three training steps, updating in turn the neural network parameters, the user representations, and the movie representations. We applied Tikhonov \(L_2\)-regularization to U and V. For all parameter updates, we used the Adam optimizer [41], which maintains a separate learning rate for each parameter like AdaGrad [20], allows these rates to sometimes increase like Adadelta [84], and adapts them based on the first two moments of recent gradient updates. We used a leaky rectified linear unit (ReLU) activation [50, 57] and applied dropout after the first hidden layer to prevent overfitting [74]. Neither Nesterov momentum-aided Adam [19] nor Batch Normalization [39] appeared to improve our results. We developed many different versions of neural networks with minor variations in architecture, initialization, and training.
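The following PyTorch sketch illustrates the architecture; our models were built in TensorFlow, and the latent dimension, layer widths, and dropout rate here are illustrative assumptions:

import torch
import torch.nn as nn

class NNMF(nn.Module):
    # User and movie embeddings are concatenated and mapped through an MLP.
    def __init__(self, m_users, n_movies, r=10, hidden=64, p_drop=0.5):
        super().__init__()
        self.user = nn.Embedding(m_users, r)    # initialize from SVD's U in practice
        self.movie = nn.Embedding(n_movies, r)  # initialize from SVD's V in practice
        self.mlp = nn.Sequential(
            nn.Linear(2 * r, hidden),
            nn.LeakyReLU(),
            nn.Dropout(p_drop),                 # dropout after the first hidden layer
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, users, movies):
        z = torch.cat([self.user(users), self.movie(movies)], dim=-1)
        return self.mlp(z).squeeze(-1)

model = NNMF(480_189, 17_770)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)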

4.3 Probabilistic Matrix Factorization

Factorization can also be performed in a probabilistic setting, by specifying a generative graphical model and then finding the maximum a posteriori (MAP) parameters [70] or by performing Gibbs sampling in a Bayesian setting [69]. In Gaussian matrix factorization, we model \(M_{ij} \sim \mathcal {N}\big (b_{ij} + \langle U_i, V_j \rangle , \sigma ^2\big )\) where, as before, U is an \(m\times k\) matrix of user factors and V is an \(n\times k\) matrix of item factors, and \(U_i, V_j\) denote their ith and jth rows. The \(b_{ij}\) denote the average of the mean rating from user i and the mean rating of movie j, and account for user- and movie-effects. We learn U and V to maximize the log likelihood of the observed data under this model. In Poisson matrix factorization [29], \(M_{ij} \sim {\text {Poisson}}\big (\langle U_i, V_j \rangle \big )\). We predict \(\hat{M}_{ij} = \mathbb {E}\big [\hat{X}_{ij} \,|\, \hat{X}_{ij} \in \{1,\cdots ,5\}\big ]\), where \(\hat{X}_{ij} \sim {\text {Poisson}}\big (\langle \hat{U}_i, \hat{V}_j \rangle \big )\) and \(\hat{U}, \hat{V}\) denote our learned model parameters.
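The Poisson prediction rule is the mean of a Poisson variable truncated to the rating scale, which can be evaluated directly from the probability mass function:

import numpy as np
from scipy.stats import poisson

def truncated_poisson_mean(rate):
    # E[X | X in {1,...,5}] for X ~ Poisson(rate), with rate = <U_i, V_j>.
    ks = np.arange(1, 6)
    p = poisson.pmf(ks, rate)
    return (ks * p).sum() / p.sum()

truncated_poisson_mean(3.2)   # approximately 2.93 stars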

4.4 Factorization Machine

A factorization machine [64, 65] of second degree learns a regression model \(\hat{y}(x) = w_0 + \textstyle \sum _{i=1}^\ell w_ix_i + \sum _{1\le i < j \le \ell } \langle v_i, v_j \rangle x_i x_j \) for parameters \(w_k \in \mathbb {R}\) and latent vectors \(v_k\in \mathbb {R}^d\), \(k=1,\cdots ,\ell \), where d is the latent dimension. Such models were designed for sparsity, and in our case, we let \(x\in \mathbb {R}^{m+n}\) denote a one-hot vector representation for the user concatenated with a one-hot vector representation for the movie.
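Although the pairwise sum appears quadratic in \(\ell \), it can be evaluated in linear time via the identity \(\sum _{i<j}\langle v_i,v_j\rangle x_ix_j = \tfrac{1}{2}\sum _{f=1}^d \big [\big (\sum _i v_{if}x_i\big )^2 - \sum _i v_{if}^2x_i^2\big ]\). A sketch, with V the \(\ell \times d\) matrix whose rows are the latent vectors:

import numpy as np

def fm_predict(x, w0, w, V):
    # Second-degree factorization machine with the linear-time pairwise term.
    linear = w0 + w @ x
    s = V.T @ x                    # sum_i v_i x_i, per latent dimension
    s2 = (V.T ** 2) @ (x ** 2)     # sum_i v_i^2 x_i^2, per latent dimension
    return linear + 0.5 * np.sum(s * s - s2)

For our concatenated one-hot inputs, the prediction reduces to \(w_0 + w_u + w_m + \langle v_u, v_m \rangle \) for user u and movie m.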

5 Neighborhood Models

Neighborhood-based techniques prove to be useful ingredients in a Netflix ensemble [42, 80]. A k-NN model estimates a function’s value at a test point by averaging the values of the k nearest training points. In our case, we can predict the rating for a (user, movie)-pair by averaging the user’s ratings for the k nearest movies or averaging the k nearest users’ ratings of the movie. To do so, we must first set k and a metric for comparing two movies (or users). Given the sparseness of the Netflix dataset, we must also account for the cases where we have no training data on any of the k nearest neighbors of a given test point. In this case, we typically fall back to the baseline global rating. A lower value of k tends to increase the accuracy of the ratings we can calculate, but also increases the number of test points for which we have insufficient training data.

In addition to comparing the performance yielded by the choice of metric, it may also be illustrative to consider how the choice of metric impacts training data availability. Suppose, given a (user, movie)-pair, the user tends to have rated many more of the 20 nearest movies under the cosine metric than of the 20 nearest under the Euclidean metric. Under the hypothesis that the act of expressing a rating indicates preference (the ratings matrix is not revealed uniformly at random), this fact might also provide us with information, independent of how closely the ratings align with the target.

In [6], Bell, Koren, and Volinsky remark that neighborhood-based kernel regression approaches may fail to account for relationships in the similarity space. They describe Lord of the Rings as an example, where a neighborhood in movie-movie similarity space might include all three movies from the trilogy, causing the underlying effect from the trilogy to be counted three times. To account for this, they optimize weights with shrinkage instead of relying on a predefined similarity metric [5, 7]. They also allow a neighborhood-based method to defer judgement when provided insufficient or low-quality neighborhood information [8]. There are also factorized versions of learned user similarity [80].

5.1 k-NN on SVD Latent Space

Consider the r-dimensional rows of U from the soft-thresholded SVD decomposition of \(\mathcal {P}_\varOmega (M)\). These vectors give a dense, low-dimensional (we let \(r=5\)) representation for each user. We use a k-d tree [11] to find the \(k=15\) nearest neighbors for each user according to the Euclidean metric. For a given (user, movie)-pair in the probe set, we determine if any of the user’s neighbors rated the queried movie, and if so, calculate a weighted average over these ratings, where the weights are proportional to the exponentiated negative distance between the user and her neighbors (cf. Nadaraya–Watson kernel regression [56, 82]).

A smaller value for k restricts to only the most similar neighbors, and so decreases the bias of this estimate. However, it also increases the chance that very few (or none) of the neighbors will have expressed a rating for the given movie. If fewer than three of the \(k=15\) nearest neighbors to a user expressed a preference for a queried movie, this method does not return a rating, and the ensemble average is taken over the remaining estimators. This allows the estimator to abstain from rating when it is not sufficiently confident and to fall back elegantly to estimators that will be more reliable for a given (user, movie)-pair. In a similar vein, we can cluster users according to k-means and use the above approach with cluster members in place of neighbors. Clustering can yield improved efficiency through memoization [54], as ratings for a given cluster need only be computed once and can then be applied to subsequent queries. We also created a similar estimator that instead operates on the latent representation for movies.
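A sketch of this estimator, including the abstention rule (the scipy k-d tree and the names below stand in for our implementation):

import numpy as np
from scipy.spatial import cKDTree

def knn_rating(user_vecs, user_id, movie_ratings, k=15, min_support=3):
    # user_vecs: m x r rows of U from the soft-thresholded SVD.
    # movie_ratings: dict mapping user id -> that user's rating of the queried movie.
    tree = cKDTree(user_vecs)                # build once and reuse in practice
    dists, idx = tree.query(user_vecs[user_id], k=k + 1)
    dists, idx = dists[1:], idx[1:]          # drop the queried user herself
    rated = [(d, movie_ratings[i]) for d, i in zip(dists, idx) if i in movie_ratings]
    if len(rated) < min_support:
        return None                          # abstain; the ensemble falls back
    w = np.exp(-np.array([d for d, _ in rated]))   # Nadaraya-Watson weights
    r = np.array([r for _, r in rated])
    return float(w @ r / w.sum())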

5.2 Crossing User Neighborhoods with Movie Neighborhoods

The original k-NN approach would find neighbors for either users or movies and then aggregate ratings along a vector in the dual dimension (movies or users, respectively). It is possible instead to use k-NN to find neighbors for both rows and columns, and then aggregate over the sub-matrix given by the Cartesian product of neighboring users and neighboring movies. In other words, to predict a rating for user i on movie j, we would find indices \(\mathcal {N}_{u_i} \subset [m]\) corresponding to the neighbors of user i, and indices \(\mathcal {N}_{v_j} \subset [n]\) corresponding to the neighbors of movie j, and compute a weighted average over the available ratings in \(\mathcal {N}_{u_i}\times \mathcal {N}_{v_j}\), where the weights account for distances in user-space, movie-space, and the difference in time between the ratings. This allows us to leverage ratings of similar movies provided by similar users.

6 Gradient Boosted Trees

Ensemble methods combine multiple weak learners (estimators) into a single strong estimator [18, 22, 31, 71]. Breiman’s bootstrap aggregating (“bagging”) approach trains multiple learners in parallel on bootstrapped samples. In contrast, boosting algorithms like AdaBoost [25] and gradient boosting [26, 27, 51] iteratively add weak learners to improve an ensemble, concentrating effort on currently misclassified examples. (Concentrating on currently misclassified examples is not required of a boosting algorithm: see, for example, Boost by Majority [23] and BrownBoost [24].) In gradient boosting, we supply a loss function \(L(\cdot ,\cdot )\) and a method to train new weak learners \(h_i\); in our case, we use regression trees [13]. For our application, we learned ratings for a (user, movie) pair as a function of their representations in thresholded SVD feature space. We used XGBoost [16] to build the estimator.
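A sketch of this estimator with the XGBoost Python interface; the features are the concatenated user and movie rows from the thresholded SVD, and the hyperparameters shown are illustrative rather than our tuned settings:

import numpy as np
import xgboost as xgb

def fit_gbt(U_svd, V_svd, users, movies, ratings):
    # One row per rating event: user factors concatenated with movie factors.
    X = np.hstack([U_svd[users], V_svd[movies]])
    dtrain = xgb.DMatrix(X, label=ratings)
    params = {"objective": "reg:squarederror", "max_depth": 6, "eta": 0.1}
    return xgb.train(params, dtrain, num_boost_round=200)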

7 Variational Autoencoding

Variational autoencoding learns parameters for an autoencoder using a common Bayesian technique known as variational inference. An autoencoder models the identity function with a neural network [33, 68]. Its architecture includes a hidden layer of relatively small dimensionality that serves as an information bottleneck. Upon training, the output from this layer yields a lower-dimensional representation of the original data. If we restrict to linear maps and impose \(L_2\) loss on our reconstruction, autoencoding solves for the principal components from PCA [37, 59]. In this way, we can consider autoencoding to be a nonlinear extension of PCA.

In Bayesian statistics, variational inference approximates intractable integrals (expectations) through optimization, by substituting the integrand (probability distribution) for the closest member of a parametrized family of distributions [40]. When optimization is batch-based, this process is known as stochastic variational inference [36, 67].

A variational autoencoder learns a probabilistic autoencoding model, as two conditional distributions described by neural networks. The encoding distribution \(q_\theta (z|x)\) describes how to sample the latent low-dimensional representation from an observation x, and the decoding distribution \(p_\varphi (x|z)\) describes how to sample a reconstructed x from the latent representation. Optimization aims to maximize \(\mathcal {L}(\theta ,\varphi ) = \mathbb {E}_{q_\theta (z|x)}[\log p_\varphi (x|z)] - D_\text {KL}\big (q_\theta (z|x)\,||\,p(z)\big )\). For computational expediency, the expectations above are often approximated via single-sample Monte Carlo integration [53]. In particular, for the ith datapoint \(X_i\), we sample \(Z_i\sim q_\theta (\cdot \,|\,X_i)\) and form the unbiased approximations \(\mathbb {E}_{q_\theta (z|x)}[\log p_\varphi (x,z)] \approx \log p_\varphi (X_i,Z_i)\) and \(\mathbb {E}_{q_\theta (z|x)}[\log q_\theta (z|x)] \approx \log q_\theta (Z_i|X_i)\).
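A minimal PyTorch sketch of a Gaussian VAE trained with this single-sample estimate; the layer widths are assumptions, and the reparameterization \(z = \mu + \sigma \epsilon \) keeps the sampling step differentiable:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_in=17_770, hidden=512, d=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, hidden), nn.LeakyReLU())
        self.mu, self.logvar = nn.Linear(hidden, d), nn.Linear(hidden, d)
        self.dec = nn.Sequential(nn.Linear(d, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, n_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def neg_elbo(x_hat, x, mu, logvar):
    # Gaussian decoder with identity covariance gives a squared-error
    # reconstruction term; the KL to the standard normal prior is closed form.
    recon = 0.5 * F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl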

Multiple authors have implemented VAEs for collaborative filtering. [48] learned item representations from known content data. [49] concentrated on implicit ratings data: they considered observations in the form of a single user’s (sparse) vector of item-consumption counts, and argued that their two adjustments, using a multinomial likelihood and modifying the VAE objective, were key to their performance.

We take \(X_i\in \mathbb {R}^{17770}\) to be user i’s ratings for each movie (more precisely, the residual ratings after subtracting off half of the user’s mean rating and half of each movie’s mean rating). We model \(q_\theta (z|x) = \eta _{10}\big (z;\, f_1(x),\, \exp (f_2(x))I_{10} \big )\) where \(\eta _{10}\) denotes a 10-dimensional Gaussian density, \(I_{10}\) is the identity matrix, and \(f_1, f_2\) are leaky ReLU-activated neural networks. Here, \(\theta \) corresponds to the parameters of \(f_1\) and \(f_2\). We model \(p_\varphi (x|z) = \eta _{m_i}\big (x;\, g(z),\, I_{m_i} \big )\), where \(m_i\) denotes the number of movies user i rated, g is a leaky ReLU-activated neural network with a single hidden layer, and \(\varphi \) denotes the parameters of g.

8 Incorporating Rating Time

The Netflix training and probe sets include a time stamp for each rating event. We consider ways to leverage the effect of time in our model.

8.1 Time-Aware Neural Factorization

Building on the success of Neural Network Matrix Factorization, this model added two time components as inputs to the neural network: (1) the time of rating, normalized to lie in [0, 1], and (2) the approximate number of years between the movie’s release and the time of rating. As updates to U and V are sparse (any given row only updates a handful of times in each pass through the data set), we used a Nesterov momentum optimizer [58, 62] to train them, while continuing to apply the Adam optimizer to the neural network parameters (all of which are updated at each training step). Newer optimizers such as Adam tweak the learning rate for each parameter depending on a window of that parameter’s previous gradients. This approach may not be best when updates to a given parameter occur only sporadically [63], so here we use Nesterov momentum for the sparsely updated factors.
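A sketch of this split-optimizer arrangement, reusing the NNMF sketch from the neural network matrix factorization section (learning rates illustrative; the time inputs are omitted for brevity):

import torch

model = NNMF(480_189, 17_770)
opt_sparse = torch.optim.SGD(
    list(model.user.parameters()) + list(model.movie.parameters()),
    lr=0.05, momentum=0.9, nesterov=True)     # sparsely updated embeddings
opt_dense = torch.optim.Adam(model.mlp.parameters(), lr=1e-3)

users = torch.tensor([0, 1]); movies = torch.tensor([10, 20])
ratings = torch.tensor([4.0, 3.0])

loss = torch.mean((model(users, movies) - ratings) ** 2)
opt_sparse.zero_grad(); opt_dense.zero_grad()
loss.backward()
opt_sparse.step(); opt_dense.step()           # both step on one backward pass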

8.2 Neural One-Hot Factorization with a Time Component

In this approach, we designed a neural network that takes user- and movie-representations, together with the time features “movie release year”, “time of rating”, and “time of user’s first rating”, and outputs a probability distribution on \(\{1,\cdots ,5\}\). Training minimizes the cross-entropy between the (point-mass, or slightly modified point-mass) distribution on the underlying label and the model’s predicted distribution. In addition to providing estimates for (user, movie, time)-ratings, this model allows us to predict the variance or uncertainty of our estimate.
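Given the network’s predicted distribution p over \(\{1,\cdots ,5\}\), the rating estimate and its uncertainty follow from the first two moments:

import numpy as np

def rating_moments(p):
    # Mean and variance of a predicted distribution over the ratings 1..5.
    ks = np.arange(1, 6)
    mean = ks @ p
    var = (ks - mean) ** 2 @ p
    return mean, var

rating_moments(np.array([0.05, 0.10, 0.20, 0.40, 0.25]))   # -> (3.70, 1.21)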

8.3 Time-Binned SVD

We can partition the training and probe data into approximately equally sized bins based on the time stamps associated to them (so ratings that occur around the same time will be placed in the same or a neighboring bin) and learn a separate SVD model for each time window. Each of these models can then be used to predict a rating for a given (user, movie, time) probe pair, and a weighted average formed over all such predictions, with a higher weight given to the bin into which the query was placed.

8.4 Tensor Factorization

After partitioning our data into time bins (in the same way as for time-binned SVD), we can view our training data as a ratings tensor, where the users by movies matrix now extends along a third, temporal dimension. This allows us to perform time-aware factorization into three factor matrices, one for each mode. Hitchcock pioneered a generalization of SVD to tensors, known as the minimal canonical polyadic (CP) decomposition [34], yielding the model \(M = \sum _{i=1}^r \lambda _i a_i^1 \otimes a_i^2 \otimes a_i^3\). We initialize with higher-order SVD [35, 45, 81] and use alternating least squares to fit the model.
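A sketch with the tensorly package, whose parafac routine fits the CP model by alternating least squares; the dense toy tensor and rank are illustrative, and the real data calls for a masked or sparse treatment:

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

T = np.random.rand(100, 50, 6)          # stand-in for the binned ratings tensor
weights, factors = parafac(tl.tensor(T), rank=5, init='svd')
users_f, movies_f, time_f = factors     # one factor matrix per mode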

9 Ensembling

Famously, the winning solution to the Netflix Prize challenge consisted of a blend of 107 different models [9, 61, 79]. Our best-performing models, too, aggregated predictions from other models. After building individual models using only data from the training set, we built ensembles of models using data only from the validation set. We distinguish ensembles with the letter e from our base models lettered m.

9.1 Average over a Selected Subset

Considering all \(\binom{19}{10}\) size-10 subsets of \(\{m_1,\cdots ,m_{19}\}\), we found the subset whose simple average produced the smallest RMSE on the validation set. The average of these models was then computed for the test set. We performed the brute-force search with the Numba package, which allows for just-in-time compilation (to LLVM) and parallelized for-loops.
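A sketch of the search; each parallel iteration writes its own RMSE and the argmin is taken outside the compiled loop, avoiding a race on a shared best value:

import numpy as np
from itertools import combinations
from numba import njit, prange

@njit(parallel=True)
def subset_rmses(preds, y, combos):
    # preds: 19 x N validation predictions; combos: C x 10 index array.
    out = np.empty(combos.shape[0])
    for c in prange(combos.shape[0]):
        avg = np.zeros(y.shape[0])
        for j in combos[c]:
            avg += preds[j]
        avg /= combos.shape[1]
        out[c] = np.sqrt(np.mean((avg - y) ** 2))
    return out

combos = np.array(list(combinations(range(19), 10)), dtype=np.int64)
# rmses = subset_rmses(preds, y, combos); best = combos[np.argmin(rmses)]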

9.2 Linear Regression

We performed stepwise variable selection using the Bayesian Information Criterion [72] (see also Akaike’s Information Criterion [2]) for a multiple linear regression model. We also used linear regression to select the most informative subset of 10 predictors [55].

9.3 Random Forests for Regression

We used Breiman’s random forest regression algorithm [12] on the 10 top predictors, as determined by linear regression in the above section. We also tried a boosting approach for ensembling, the XGBoost algorithm [16]. The importance matrix calculated from boosting gives the top 10 models, in order, as: \(m_{14}\), \(m_{16}\), \(m_{15}\), \(m_{19}\), \(m_7\), \(m_{13}\), \(m_{17}\), \(m_{12}\), \(m_{18}\), \(m_{10}\).

9.4 Neural Network Regression

We trained neural networks on the validation set using predictions from individual models as inputs and true values on the validation set as outputs. We adopted two main architectures: (1) a direct continuous-valued function trained to minimize RMSE, and (2) a distributional function trained to minimize cross-entropy between its predicted distributions over \(\{1,\cdots ,5\}\) and one-hot representations of the true values. These models were then applied to the test set to measure performance.

9.5 Impact of Objective Function on Ensemble Building

To illustrate the role that the choice of objective function plays in ensemble building, we took the boosted ensemble model and optimized it on the validation data under a range of different objective functions. See Table 3 for results. As extreme gradient boosting uses second derivatives of the objective function, we do not include \(L_1\) or Huber loss.

Table 3. RMSE performance after optimizing \(e_5\) under different objective functions.

10 Conclusions

The vast majority of benchmarks against the Netflix dataset report only RMSE performance, in line with the prize’s original objective. For estimators that aim to select a user’s top-N movies, precision- and recall-based metrics are sometimes considered. We summarize our results under numerous metrics in Table 1 for base models and Table 2 for ensembles. We normalize the \(L_p\) metrics by dividing by the loss obtained by predicting the mean training value for all test points. For example, predicting the training mean (3.67 stars) for all movies in the test set yields an RMSE of 1.127, so all reported normalized RMSEs correspond to standard RMSE divided by 1.127.

Ensembles proved essential to the winning solution for the Netflix Grand Prize. In 2009, when the competition concluded, matrix factorization and neighborhood methods provided the fundamental components from which the ensembles were built. Since 2009, researchers have introduced many new machine learning approaches for recommender systems, including neural network matrix factorization, factorization machines, extreme gradient boosting, and variational autoencoding. These methods have been tested individually, but little work has been done to consider how these new approaches can be combined to form more effective ensemble estimators. This paper provides a first step in that direction.