
1 The Formulation of a New Metric for Recommendation Systems

1.1 An Overview of This Work

First, we demonstrate the problem we want to address by using many models, data sets, and multiple metrics. Then, we propose our unified and generalized metric to address the problems we observe when using multiple separate metrics.

Fig. 1. Comparison of the execution time (in seconds) of the packages.

Fig. 2. Execution time (in seconds) by domain and package.

Fig. 3. Comparison of the execution time (in seconds) by model and package.

Fig. 4. All packages show very similar performance, with Amelia performing best, followed by missForest and impute. Note that Amelia took only a fraction of the time missForest needed to execute, so it may be better suited to large datasets. However, the Amelia package only worked for datasets that were less than 20% sparse, so the comparison is not entirely fair, since having more non-missing values improves model performance.

Fig. 5. The performance of each model as a function of the sparsity of the data.

Fig. 6. The accuracy of each model as a function of the size of the data.

Fig. 7. All packages show an upward trend: increasing the size of the matrix increases the execution time, which is an expected result. rrecsys executes the fastest, followed by impute, Amelia, and mice. At the top, softImpute, missMDA, and missForest take considerably more time to execute at every dataset size. All these differences are statistically significant for all matrix sizes.

Thus, we use several models and multiple data sets to evaluate our approach. First, we use all these data sets to evaluate the performance of the different models using different state-of-the-art performance metrics. Then, we examine the difficulties of evaluation using these metrics: because different performance metrics often lead to contradictory conclusions, it is hard to decide which model performs best (and hence which model to use for the targeting campaign in mind). Therefore, we create a performance index that produces a single, unified metric for evaluating a targeting model.

1.2 The Data

Machine learning is the science of data-driven modeling, and as such, no single model works for all types of data [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. As explained above, we use multiple models, since some models may work best only on specific data sets. Also, to cover as diverse a set of data as possible, we use 15 publicly available data sets in this work. The features of the data sets are described in Table 1.

Table 1. The features of some of the different data sets used in this work

To create a higher degree of variation in the choice of data, we created 100 bootstrap samples from each of the datasets by selecting a random number of rows, a random number of columns, and a random number of missing values. The fraction of missing values for the rrecsys package was selected at random between 1% and 97%, while for mice, softImpute, missMDA, missForest, and impute it was selected between 1% and 75%, as a higher fraction of missing values caused runtime errors. The fraction of missing values for the Amelia package was selected between 1% and 20%, with only 20 bootstrap samples per dataset, because running with sparser data or a larger number of bootstrap samples caused segmentation errors that crashed R. Before any analysis, we can already see that for very sparse matrices, all packages except rrecsys are not a good option.
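A minimal sketch of how one such bootstrap sample could be generated in R, assuming a numeric data frame `dat`; the function name and the bounds shown (those used for rrecsys, 1%–97%) are illustrative, not the exact code used in this work:

```r
# Illustrative: one bootstrap sample with random rows, random columns,
# and a random fraction of entries set to NA.
make_bootstrap_sample <- function(dat, min_miss = 0.01, max_miss = 0.97) {
  sub <- dat[sample(nrow(dat), sample(2:nrow(dat), 1), replace = TRUE),
             sample(ncol(dat), sample(2:ncol(dat), 1)), drop = FALSE]
  sub <- as.matrix(sub)

  miss_frac <- runif(1, min_miss, max_miss)             # random sparsity level
  idx <- sample(length(sub), floor(miss_frac * length(sub)))
  sub[idx] <- NA                                        # blank out random entries
  as.data.frame(sub)
}
```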

1.3 The Models for Recommendation Systems

Several models from seven R packages, rrecsys, mice, Amelia, softImpute, missMDA, missForest, and impute, have been used to compare the results. These models are [44, 46, 53, 55, 59, 60, 61]:

Rrecsys:

The package rrecsys had the following models implemented:

  1. itemAverage: imputes the average rating of the item in the missing values

  2. userAverage: imputes the average rating of the user in the missing values

  3. globalAverage: imputes the overall average rating in the missing values

  4. IBKNN: imputes the weighted average of the k-nearest neighbors of the item in the missing values
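The three average-based baselines above are simple enough to express directly in base R; a minimal sketch (not the rrecsys implementation itself), assuming a numeric user-by-item rating matrix `r` with NAs marking missing entries:

```r
# Item average: fill each missing entry with its column (item) mean.
item_average <- function(r) {
  m <- colMeans(r, na.rm = TRUE)
  for (j in seq_len(ncol(r))) r[is.na(r[, j]), j] <- m[j]
  r
}

# User average: fill each missing entry with its row (user) mean.
user_average <- function(r) {
  m <- rowMeans(r, na.rm = TRUE)
  for (i in seq_len(nrow(r))) r[i, is.na(r[i, ])] <- m[i]
  r
}

# Global average: fill every missing entry with the overall mean rating.
global_average <- function(r) {
  r[is.na(r)] <- mean(r, na.rm = TRUE)
  r
}
```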

Mice:

In mice, we varied the imputation method for numerical variables among the following:

  1. pmm: imputes via predictive mean matching in the missing values. One of its biggest advantages is that it maintains the distribution of the data; one of its biggest disadvantages is that it does not handle very sparse matrices well.

  2. norm: fits a Bayesian linear regression model and imputes the data by predicting the missing values.

  3. norm.nob: fits a linear regression model that ignores the model error.

  4. mean: imputes the mean of the column in the missing samples.

  5. sample: finds a random non-missing item in the column containing the missing value and imputes it.
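A minimal usage sketch of mice with the first of these methods; `df` is a hypothetical data frame with missing numeric values, and the argument values are illustrative defaults rather than the settings used in this work:

```r
library(mice)

# Multiple imputation by chained equations using predictive mean matching.
imp <- mice(df, method = "pmm", m = 5, printFlag = FALSE, seed = 123)

# Extract one completed dataset (the first of the m imputations).
df_completed <- complete(imp, 1)
```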

Amelia:

The Amelia package works by performing multiple imputation on bootstrap samples using the EMB algorithm (Expectation-Maximization with Bootstrapping). It allows for multiprocessing, making it fast and robust. It makes the following assumptions:

  1. All the variables follow a multivariate normal distribution.

  2. The data are missing at random (MAR).
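A minimal usage sketch, assuming a numeric data frame `df` with missing values; the number of imputations m is illustrative:

```r
library(Amelia)

# Multiple imputation via the EMB algorithm (EM on bootstrapped samples).
a_out <- amelia(df, m = 5)

# Each element of a_out$imputations is one completed dataset.
df_completed <- a_out$imputations[[1]]
```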

softImpute [44, 46]:

The package softImpute implements matrix completion using nuclear-norm regularization. It offers two algorithms:

  1. SVD: iteratively computes the soft-thresholded Singular Value Decomposition (SVD) of a filled-in matrix.

  2. ALS: uses alternating ridge regression to fill in the missing entries.
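A minimal usage sketch, assuming a numeric matrix `x` with NAs; the rank and regularization values are illustrative:

```r
library(softImpute)

# Nuclear-norm-regularized matrix completion; type is "als" or "svd".
fit <- softImpute(x, rank.max = 2, lambda = 1, type = "als")

# Fill in the missing entries of x from the low-rank fit.
x_completed <- complete(x, fit)
```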

missMDA:

The package missMDA imputes the missing values of a mixed dataset (with continuous and categorical variables) using the principal component method "factorial analysis for mixed data" (FAMD). It implements two methods:

  1. EM: an Expectation-Maximization algorithm in which missing values are first imputed with initial values, such as the mean of the variable for continuous variables and the proportion of the category for categorical variables, computed from the non-missing entries. It then runs the FAMD algorithm on the completed dataset until convergence.

  2. Regularized: similar to EM but adds a regularization parameter to avoid overfitting.
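A minimal usage sketch, assuming a mixed data frame `df`; the number of components ncp is an illustrative choice:

```r
library(missMDA)

# Impute a mixed dataset with FAMD; method is "Regularized" or "EM".
res <- imputeFAMD(df, ncp = 2, method = "Regularized")

# The completed dataset.
df_completed <- res$completeObs
```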

missForest:

The package missForest creates one Random Forest model for each column of the dataset, where the training set is given by the non-missing rows. It then uses that model to impute the missing values and iterates until convergence. Even though Random Forests can deal with missing values in theory, this implementation first imputes the column mean into the missing values.
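A minimal usage sketch, assuming a data frame `df` with missing values:

```r
library(missForest)

# Iterative Random Forest imputation; ximp is the completed dataset and
# OOBerror is the out-of-bag estimate of the imputation error.
res <- missForest(df)
df_completed <- res$ximp
res$OOBerror
```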

Impute:

The package impute has only one method, which fills in missing values using K-nearest neighbors. It only works on numerical datasets.
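A minimal usage sketch, assuming a numeric matrix `x`; k is illustrative (the package default is 10):

```r
library(impute)  # Bioconductor package

# K-nearest-neighbor imputation for a numeric matrix.
x_completed <- impute.knn(x, k = 10)$data
```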

1.4 The Results

In this section, we analyze the performance (execution time and accuracy) of the models (Sect. 1.3) with respect to different variables, such as the size and sparsity of the data sets (Sect. 1.2).

Execution Time:

We want to compare the computational cost of the packages because it is of interest to know whether a model will perform well when scaling.

We can see that two packages stand out from the rest, missMDA and missForest, followed by softImpute. This is an expected result, as both packages create one model for each column of the dataset and then iterate until convergence. Because we need to build k × m models (number of iterations × number of columns), the execution time is greater for these packages.

We can see that missMDA and missForest stand out from the rest of the packages regardless of the domain. As expected, the domains with the largest datasets (environmental, financial, and medical) took the longest to run. The high variance in some of the runs is because the execution time increased exponentially for some models, so small datasets executed relatively fast while large datasets took an exponentially larger amount of time to execute.

We can see that the EM model from missMDA was the one that took the longest to execute, followed by missForest, the Regularized method from missMDA, and the SVD method from softImpute. This is in line with the results obtained in the previous points.

Accuracy Performance:

We measure the error using the Normalized Root Mean Squared Error (NRMSE), which is similar to the RMSE but divides the error by the variance of the dataset. We used this metric to be able to compare performance across different datasets regardless of their range or variance.

$$ NRMSE = \sqrt{\frac{\mathrm{mean}\left((x_{true} - x_{pred})^{2}\right)}{\mathrm{var}(x_{true})}} $$
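A direct R transcription of this definition (a sketch; `x_true` and `x_pred` are the true and imputed values, as a vector or matrix):

```r
# NRMSE: root of the mean squared error divided by the variance of the true values.
nrmse <- function(x_true, x_pred) {
  sqrt(mean((x_true - x_pred)^2, na.rm = TRUE) /
       var(as.vector(x_true), na.rm = TRUE))
}
```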

Model Performance with Respect to Sparsity:

Below we can see how each model's performance changed with the sparsity of the matrix in every domain.

Here, the lines represent a fitted linear model and the shaded regions the 95% confidence intervals. The difference in performance between the models is not statistically significant in the demographic, medical, and environmental domains for most levels of sparsity. The worst performing model in the environmental and financial domains is IBKNN for all levels of sparsity, while missMDA-EM is the worst performing model for the sales, industrial, and sports domains, with a statistically significant difference. The best performing model in most domains and at most sparsity levels is missForest, which is an expected result, as Random Forests work extremely well in terms of accuracy but do not scale to large datasets. We can also see an upward trend in most of the models, showing that performance declines as the matrix becomes sparser. This is an expected result, as a sparser matrix implies a smaller training sample size and may cause the model to have high bias.

Model Performance with Respect to the Size of the Data:

Below we can see how each model's performance (measured by the median NRMSE) varies with the size of the data matrix.

Execution Time with Respect to the Size of the Data:

Below we can see how each model's performance (measured by the execution time in seconds) changed with the size of the matrix (rows × columns).

It is interesting that the trend for the Amelia package appears to be completely linear with a positive slope, while softImpute has the smallest slope and therefore the smallest change in execution time per change in the size of the dataset.

All these issues point to the difficulty of evaluating the performance of any content-providing (recommendation) model, and thus we need a generalized and physically interpretable measure for evaluating all these models. This is the reason this work provides such a metric, which we call GRMM.

1.5 The New Recommendation Metric, Generalized Recommender Model Metric (GRMM)

As we observed in Sect. 1.4, a major problem in the building and implementation of recommender systems (or content-providing models) is the evaluation of such systems. The difficulty is due to the presence of many different, and sometimes conflicting, performance criteria. As an example, when considering accuracy as one of the performance criteria, a recommender system may be highly accurate for some quantitative data sets while displaying less desirable accuracy for non-quantitative (qualitative) data sets, which makes it difficult to come up with a clear description of the accuracy of the recommender system model and/or implementation. The same is true for another important performance measure for recommender systems, namely the execution time. Given all the different (and often conflicting) performance measures, it is very difficult to evaluate any recommender system correctly.

This work addresses the problem by creating a unified performance measure for recommender systems, the "Generalized Recommender Model Metric" (GRMM). This metric is a normalized measure between 0 and 1, with 1 being the optimal value or the highest performance index. It is an extremely useful tool for the AI and machine learning scientist (model design), as well as for the marketer and engineer (in charge of implementing the recommender system, as different implementations also have different performances), to have a single number indicating the overall performance of any recommender system, including and encompassing all possible performance measures and sub-measures. Thus, this new recommender system index (GRMM) helps everyone involved in designing, implementing, and using recommender systems to have a single number (from 0 to 1) to evaluate the performance of their systems and to choose the best system (the one with the highest GRMM).

This index, GRMM, is a flexible and fully generalized metric that allows everyone dealing with content creation and content provision (scientists, marketers, consultants, engineers, and so on) to evaluate a content-providing model using any specific metric or set of metrics, and to assign appropriate weights (reflecting their business importance).

In general, for any data set X, we could have many performance measures, such as accuracy (for qualitative data, for quantitative data, and so on) and execution time (as a function of sparsity, dimensions, and so on).

In the most general case, for any recommender system X, we could have as many performance measures as may be required, say n different measures \( m_{i} \) for i = 1, …, n, i.e., \( \mathrm{M}(X) = [m_{1}(X), m_{2}(X), \ldots, m_{n}(X)] \).

It is important to mention that each performance measure \( m_{i}(X) \) is a sigmoid function taking continuous values between zero and one. As an example, if \( m_{1}(X) \) is the accuracy measure for the recommender system X, then

$$ m_{1}(X) = \frac{1}{1 + e^{-\sum_{j = 1}^{s} b_{j} a_{j}(X)}} $$

where each function \( a_{j}(X) \) is the specific accuracy performance (an accuracy sub-measure) of the corresponding recommender system model with respect to factor j. There could be as many as s different sub-measures or factors for accuracy here, such as the accuracy of the recommender system for continuous (quantitative) variables, the accuracy of the same system for qualitative variables, accuracy with respect to size or sparsity, and so on. The coefficients \( b_{j} \) are weights given (by scientists, marketers, or engineers) to each sub-measure and are normalized:

$$ \sum_{j = 1}^{s} b_{j} = 1 $$

Each coefficient \( b_{j} \) can be chosen depending on the specific application and/or the specific weight assigned to each factor (an accuracy factor for this measure, \( m_{1}(X) \)).
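A small sketch of this aggregation, assuming the sub-measure values \( a_{j}(X) \) and their normalized weights \( b_{j} \) are supplied as numeric vectors; the example values are hypothetical:

```r
# Aggregate s accuracy sub-measures a_j(X) into one sigmoid-valued measure m_1(X).
measure_from_submeasures <- function(a, b) {
  stopifnot(length(a) == length(b), abs(sum(b) - 1) < 1e-8)  # weights sum to 1
  1 / (1 + exp(-sum(b * a)))
}

# Example: three hypothetical accuracy sub-measures with equal weights.
a <- c(0.90, 0.70, 0.80)
b <- rep(1 / length(a), length(a))
m1 <- measure_from_submeasures(a, b)
```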

Thus, for any given recommender system X, our Generalized Recommender Model Metric, GRMM(X) is,

$$ GRMM(X) = \frac{n}{\frac{1}{m_{1}(X)} + \frac{1}{m_{2}(X)} + \ldots + \frac{1}{m_{n}(X)}} $$

However, a marketer, AI scientist, or engineer may want to give different weights to the different performance measures \( m_{i}(X) \). This is done by introducing a corresponding weight \( w_{i} \) for each performance measure \( m_{i}(X) \). These weights are normalized so that

$$ \sum_{i = 1}^{n} w_{i} = 1 $$

Hence, the most general form of our index, GRMM, becomes,

$$ GRMM(X) = \frac{1}{w_{1} \cdot \frac{1}{m_{1}(X)} + w_{2} \cdot \frac{1}{m_{2}(X)} + \ldots + w_{n} \cdot \frac{1}{m_{n}(X)}} $$

This is the most general form of our index. It covers all possible performance measures (n of them, with n being any natural number greater than or equal to 1) for any given recommender system. The index also includes the weights \( w_{i} \), allowing marketers, scientists, or engineers to choose the significance or weight of each performance measure. In this most general case, we also allow different weights for any sub-performance measure. As explained above, for the example of the accuracy measure, the weights \( b_{j} \) give the same flexibility to assign different importance or weights to different accuracy sub-measures of any recommender system.
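A sketch of this weighted form in R (with equal weights it reduces to the unweighted harmonic mean above); the measure names and values below are hypothetical:

```r
# GRMM as the weighted harmonic mean of the n performance measures m_i(X).
grmm <- function(m, w = rep(1 / length(m), length(m))) {
  stopifnot(length(m) == length(w), abs(sum(w) - 1) < 1e-8, all(m > 0))
  1 / sum(w / m)
}

# Example with three hypothetical measures and user-chosen weights.
m <- c(accuracy = 0.85, exec_time = 0.60, coverage = 0.75)
grmm(m)                        # equal weights: n / sum(1 / m_i)
grmm(m, w = c(0.5, 0.3, 0.2))  # custom weights
```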

This is the fully generalized version of the GRMM index.

There is always value in simplicity and brevity when creating any model. That is why we tried to find ways to make the index a bit simpler while still covering the main performance measures and sub-measures. In our analysis, we have found that the pseudo-optimal value for n is 2, with the two measures being accuracy and execution time (time complexity). We also found that setting all weights equal is a very good pseudo-optimal choice, that is,

$$ w_{i} = \frac{1}{n} $$

The same procedure applies to the sub-measures: in practice, all their weights are set equal as well. For the accuracy example above and its sub-measures,

$$ b_{j} = \frac{1}{s} $$
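Under this simplification (n = 2 with accuracy and execution time, all weights equal), the index collapses to a plain harmonic mean of two measures; the two scores below are hypothetical sigmoid-valued measures:

```r
# Simplified GRMM: harmonic mean of two equally weighted measures.
accuracy_score <- 0.80                                    # hypothetical m_1(X)
time_score     <- 0.60                                    # hypothetical m_2(X)
grmm_simple <- 2 / (1 / accuracy_score + 1 / time_score)  # about 0.686
```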

Obviously, we have the option of using either the most general form of our GRMM index or this "simplified version". The latter is described in more detail, with additional explanations, in the original submission.

GRMM is a new and unique metric that addresses a major shortcoming in the domain of content providing and content offering. It is physically interpretable, understandable, and easy to use for everyone involved, including marketers, product managers, engineers, and scientists.