Evaluation of session-based recommendation algorithms

Ludewig, Malte; Jannach, Dietmar

doi:10.1007/s11257-018-9209-6

Published: 01 October 2018

Volume 28, pages 331–390, (2018)
User Modeling and User-Adapted Interaction

Abstract

Recommender systems help users find relevant items of interest, for example on e-commerce or media streaming sites. Most academic research is concerned with approaches that personalize the recommendations according to long-term user profiles. In many real-world applications, however, such long-term profiles often do not exist and recommendations therefore have to be made solely based on the observed behavior of a user during an ongoing session. Given the high practical relevance of the problem, an increased interest in this problem can be observed in recent years, leading to a number of proposals for session-based recommendation algorithms that typically aim to predict the user’s immediate next actions. In this work, we present the results of an in-depth performance comparison of a number of such algorithms, using a variety of datasets and evaluation measures. Our comparison includes the most recent approaches based on recurrent neural networks like gru4rec, factorized Markov model approaches such as fism or fossil, as well as simpler methods based, e.g., on nearest neighbor schemes. Our experiments reveal that algorithms of this latter class, despite their sometimes almost trivial nature, often perform equally well or significantly better than today’s more complex approaches based on deep neural networks. Our results therefore suggest that there is substantial room for improvement regarding the development of more sophisticated session-based recommendation algorithms.

1 Introduction

Many of today’s online services use recommender systems to point their users or site visitors to additional items that might be of interest to them. In academic research, most work focuses on techniques that rely on long-term preference models to determine the items to be presented to the user. However, in many application domains of recommender systems, such long-term user models are not available for a large fraction of the users, e.g., because they are first-time visitors or because they are not logged in. Consequently, suitable recommendations have to be determined based on other types of information, usually the user’s most recent interactions with the site or application. Recommendation techniques that rely solely on the user’s actions in an ongoing session and which adapt their recommendations to the user’s actions are called session-based recommendation approaches (Quadrana et al. 2018).

Amazon’s “Customers who bought... also bought” recommendations can be considered an extreme case of such a session-based approach. In this case, the recommendations seemingly depend only on the item that is currently viewed by the user (and the purchasing patterns of the community). A number of other techniques were proposed in the research literature that do not limit themselves to the very last action, but consider some or all user actions since the session started. Some of these techniques only consider which events happened; others additionally take the order of the events into account. Besides the e-commerce domain, a number of other application fields have been the focus of the literature, in particular music, web page navigation, and travel and tourism.

In academia, sequential recommendation problems are typically operationalized as the task of predicting the next user action. Experimental evaluations are usually based on larger, time-ordered logs of user actions, e.g., on the users’ item viewing and purchase activities on an e-commerce shop or on their listening history on a music streaming site. From an algorithmic perspective, early approaches to predict the next user actions were based, for example, on sequential pattern mining techniques. Later on, different types of more sophisticated methods based on Markov models were proposed and successfully applied to the problem. Finally, in the most recent years, the use of deep learning approaches based on artificial neural networks was explored as another solution. Recurrent Neural Networks (RNN), which are capable of learning models from sequentially ordered data, are a “natural choice” for this problem, and significant advances regarding the prediction accuracy of such algorithms were reported in the recent literature (Hidasi et al. 2016a, b; Tan et al. 2016; Hidasi and Karatzoglou 2017; Devooght and Bersini 2017).

Despite the growing number of papers on the topic in recent years, no true “standard” benchmark data sets or evaluation protocols exist in the community. Therefore, it remains difficult to compare the various algorithmic proposals, in particular because the papers often use different baseline algorithms. For some of these baselines, it is also unclear whether they are particularly strong. In our previous work (Jannach and Ludewig 2017; Kamehkhosh et al. 2017), we demonstrated, for example, that a comparably simple k-nearest-neighbor method leads to similar or even better accuracy results than a modern deep learning approach.

To establish a common base for future research, we performed an in-depth performance comparison across multiple domains and datasets, which involved a number of comparably simple as well as more sophisticated algorithms from the recent literature. Our results show that computationally and conceptually simple methods often lead to predictions that are similarly accurate or even better than those of today’s most recent techniques based on deep learning models. As a consequence, we argue that researchers should take these simpler methods as alternative baselines into account when developing novel session-based recommendation algorithms. Furthermore, our results suggest that there is still substantial room for improvement regarding the development of more sophisticated session-based recommendation algorithms.

This paper extends our previous works presented in Jannach and Ludewig (2017), and Kamehkhosh et al. (2017) in a number of ways. First, we ran experiments on a larger number of datasets from different domains, using a richer set of performance measures. Second, we included recent sequential recommendation algorithms like fism and fossil (Kabbur et al. 2013; He and McAuley 2016) in the evaluation as well as the latest version of gru4rec (Hidasi and Karatzoglou 2017). Third, we designed a number of additional sequence-aware similarity measures for the previously proposed session-based nearest neighbor method, which in most cases lead to significant performance gains. Finally, we also propose a new method called Session-based Matrix Factorization (smf), which yields good results in some of the tested application domains.

The paper is organized as follows. Next, in Sect. 2, we discuss previous works and typical application areas of session-based recommendation approaches. In Sect. 3, we provide technical details about the algorithms that were compared in our work. Section 4 describes our evaluation setup and Sect. 5 the outcomes of our experiments. To foster reproducible research on the topic, we share the code of the used evaluation framework and the compared algorithms online.^{Footnote 1}

2 Review of session-based recommendation approaches

Most of the approaches for session-based recommendation proposed in the literature implement some form of sequence learning, see also Quadrana et al. (2018) for a recent survey on the more general class of sequence-aware recommenders. Early approaches were based on the identification of frequent sequential patterns, which can be used at recommendation time to predict a user’s next action. These early approaches were applied, for example, in the context of predicting the online navigation behavior of users (Mobasher et al. 2002). Later on, such pattern mining techniques were also used for next-item recommendation problems in e-commerce or the music domain (Yap et al. 2012; Hariri et al. 2012; Bonnin and Jannach 2014).

While frequent pattern techniques are easy to implement and lead to interpretable models, the mining process can be computationally demanding. At the same time, finding good algorithm parameters, in particular a suitable minimum support threshold, can be challenging. Finally, in some application domains it seems that using frequent item sequences does not lead to better recommendations than when using simpler item co-occurrence patterns (Bonnin and Jannach 2014). In the context of this work, we investigate both sequential and co-occurrence patterns in their simplest forms as baselines.

In many newer works, more sophisticated sequence learning approaches were proposed that implement some form of sequence modeling. Such sequence modeling approaches are usually based on Markov Chain (MC) models (He et al. 2009; Mcfee and Lanckriet 2011; Garcin et al. 2013; Hosseinzadeh Aghdam et al. 2015), reinforcement learning (RL) and Markov Decision Processes (MDP) (Shani et al. 2005; Moling et al. 2012; Tavakol and Brefeld 2014), or Recurrent Neural Networks (RNN) (Zhang et al. 2014; Sordoni et al. 2015; Hidasi et al. 2016a, b; Liu et al. 2016; Song et al. 2016; Twardowski 2016; Yu et al. 2016; Du et al. 2016; Soh et al. 2017). Again, the typical application scenarios of these methods include the e-commerce and the music recommendation domain.

An early approach based on an MDP model was proposed by Shani et al. (2005). It demonstrated the value of using sequential data in an e-commerce scenario, but also showed that models based on Markov Chains often cannot be directly applied due to data sparsity. Therefore, Shani et al. (2005) proposed different heuristics to overcome the problem. An additional challenge when using this type of models is to decide how many preceding interactions should be considered when predicting the next one. Some authors therefore use a mixture of Variable-order Markov Models (VMMs) or context-trees to consider sequences of different lengths (He et al. 2009; Garcin et al. 2013). Other works, for example by Hosseinzadeh Aghdam et al. (2015), rely on Hidden Markov Models (HMMs) to overcome certain limitations of plain Markov Chain models. In Shani et al. (2005), and Moling et al. (2012), reinforcement learning was implemented based on MDPs, which made it possible to also consider the reward for the shop in the recommendation process. To deal with the problem of the explosion of the state space in such scenarios, Tavakol and Brefeld (2014) proposed to model the state space based on the sequence of item attributes in order to predict the characteristics of the next item that the user will consider. In the context of the comparative analysis presented in this paper, we limit ourselves to a simple MC-based method as a baseline, in particular because some techniques like the one discussed by Tavakol and Brefeld (2014) require the existence of knowledge about certain item attributes.

The most recent works on sequence modeling are based on RNNs. Zhang et al. (2014), for example, used them for the prediction of user clicks in an advertisement scenario. Hidasi et al. (2016a) were among the first to explore Gated Recurrent Units (GRUs) as a special form of RNNs for the prediction of the next user action in a session. Their method called gru4rec was later extended in different ways by Hidasi et al. (2016b), Hidasi and Karatzoglou (2017), and Quadrana et al. (2017). While Hidasi et al. (2016a) reported substantial performance improvements over an item-based k-nearest-neighbor (kNN) method when using their first version of gru4rec, our previous work (Jannach and Ludewig 2017) showed that a session-based nearest neighbor method also leads to competitive accuracy results for the same problem setting. Since gru4rec has been substantially improved since its initial version, we include the latest version of the method proposed by Hidasi and Karatzoglou (2017) in the performance comparison reported in this paper. Furthermore, given our observations regarding the often competitive performance of conceptually simpler methods, we designed a number of variations of the basic session-based nearest neighbor method from Jannach and Ludewig (2017), which we also considered in the experiments.

Another family of sequence modeling approaches relies on distributed item representations, e.g., in the form of latent Markov embeddings (Chen et al. 2012, 2013; Wu et al. 2013; Feng et al. 2015) or distributional embeddings (Djuric et al. 2014; Baeza-Yates et al. 2015; Grbovic et al. 2015; Tagami et al. 2015; Vasile et al. 2016; Reddy et al. 2016; Zheleva et al. 2010). Embeddings are dense, lower-dimensional representations that are derived from sequentially ordered data and encode transition probabilities based on the observations in the original data. They were applied, for example, in the domains of next-track music recommendation (Zheleva et al. 2010; Chen et al. 2012), recommendation of learning courses (Reddy et al. 2016), or next point-of-interest (POI) recommendation (Feng et al. 2015). However, a general challenge when using item embeddings is that they can be computationally demanding and sometimes require substantial amounts of training data to be effective. In the context of our work, we experimented with item embeddings as an alternative representation of the user sessions. However, the usage of embeddings did not lead to an improvement in terms of the prediction accuracy for our problem settings, which is why we do not report the detailed outcomes of these experiments in this paper.

To overcome the limitations of pure sequence learning methods, a number of hybrid methods were proposed that, for instance, combine the advantages of matrix factorization techniques with sequence modeling approaches in the form of Factorized Markov Chains (Rendle et al. 2010; Lian et al. 2013; Cheng et al. 2013; He et al. 2016; He and McAuley 2016). Rendle et al. (2010) proposed the Factorized Personalized Markov Chain (fpmc) approach as an early method for next-item recommendations in e-commerce settings, where user interactions are represented as a three-dimensional tensor (user, current item, next-item). Later on, variations of fpmc were proposed and successfully applied for a variety of application problems, e.g., by Kabbur et al. (2013) and He and McAuley (2016). Other hybrid techniques that, for example, use some form of clustering or Latent Dirichlet Allocation in combination with a sequential recommendation method were proposed, e.g., in Hariri et al. (2012), Natarajan et al. (2013), and Song et al. (2015), for the problems of next-track or next-app recommendation. In our experimental evaluation, we include both the fpmc method by Rendle et al. (2010) as well as the recent variations and improvements described by Kabbur et al. (2013) (fism) and He and McAuley (2016) (fossil).

Besides pure session-based techniques, which solely consider a user’s actions in the ongoing session, there are also approaches that consider previous interactions of the same user in the recommendation process. Such techniques are called session-aware according to the terminology of Quadrana et al. (2018). Examples of such works include Baeza-Yates et al. (2015), Billsus et al. (2000), Hariri et al. (2012), Jannach et al. (2015a, 2017a), and Quadrana et al. (2017). Session-aware approaches were applied in various application domains like e-commerce, music, news, or next-app recommendation. In these papers, considering longer-term user preferences proved helpful for improving the recommendations in the current, ongoing session. In some cases, as in Jannach et al. (2015a), it however turns out that the short-term user intents are much more important than the longer-term models. In the research presented here, we therefore focus exclusively on session-based recommendation scenarios. We nevertheless consider the combination of long-term and short-term models an important area for future research.

3 Details of the investigated methods

Based on this discussion, we include the following four types of techniques in our comparison of session-based recommendation algorithms: simple heuristics as baseline methods, nearest-neighbor techniques, recurrent neural networks, and factorization-based methods. The main input to all methods is a training set of past user sessions, where each session consists of a set of sequentially ordered actions of a given type, e.g., an item view event in an online shop or a consumption event on a media streaming site. The models learned by the algorithms can then be used to predict the next event in a given user session in the test set. In our evaluations, we follow a pragmatic approach to determine user sessions (in case these are not provided in the datasets) and use user inactivity times to determine session borders. The details for each dataset are described later in this paper.
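Splitting a user's event log into sessions by inactivity time can be sketched as follows. This is a minimal illustration only: the function name `split_into_sessions`, the `(timestamp, item)` event format, and the 30-minute default threshold are assumptions made here, not values taken from the paper.

```python
def split_into_sessions(events, max_gap_seconds=1800):
    """Split one user's (timestamp, item) events into sessions.

    A new session starts whenever the time since the previous event
    exceeds max_gap_seconds (hypothetical default: 30 minutes).
    """
    sessions = []
    current = []
    last_ts = None
    # Process events in chronological order.
    for ts, item in sorted(events, key=lambda e: e[0]):
        if last_ts is not None and ts - last_ts > max_gap_seconds:
            sessions.append(current)
            current = []
        current.append(item)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


if __name__ == "__main__":
    log = [(0, "a"), (60, "b"), (4000, "c"), (4100, "a")]
    print(split_into_sessions(log))
```

The threshold is a dataset-specific choice; a shorter gap yields more and shorter sessions.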

Regarding the choice of the algorithms, we focus on collaborative filtering methods based on implicit feedback signals, e.g., item view or music listening events. Depending on the specific application, content-based and hybrid algorithms can be designed that use additional meta-data or content features. Since these features are domain specific and such features are only available for very few of our datasets, we limit ourselves to methods that do not rely on such types of data in this paper.

3.1 Baseline methods

We include the following baseline techniques in our comparison: a method that we call Simple Association Rules (ar), first-order Markov Chains (mc), and a method that we named Sequential Rules (sr). All baselines implement very simple prediction schemes, have a low computational complexity both for training and recommending, and only consider the very last item of a current user session to make the predictions. Furthermore, we include a prediction method based on Bayesian Personalized Ranking (bpr-mf) proposed by Rendle et al. (2009) as an alternative baseline.

3.1.1 Simple Association Rules (ar)

Simple Association Rules (ar) are a simplified version of the association rule mining technique (Agrawal et al. 1993) with a maximum rule size of two. The method is designed to capture the frequency of two co-occurring events, e.g., “Customers who bought ... also bought”. Algorithmically, the rules and their corresponding importance are “learned” by counting how often the items i and j occurred together in a session of any user.

Let a session s be a chronologically ordered tuple of item click events $s=(s_1,s_2,s_3,\dots ,s_m)$ and $S_p$ the set of all past sessions. Given a user’s current session s with $s_{|s|}$ being the last item interaction in s, we can define the score for a recommendable item i as follows, where the indicator function $1_{\textsc {eq}}(a,b)$ is 1 in case a and b refer to the same item and 0 otherwise.

$$\begin{aligned} score_{\textsc {ar}}(i,s)= & {} \frac{1}{\sum _{p \in S_p} \sum _{x=1}^{\vert p\vert } 1_{\textsc {eq}}(s_{\vert s\vert },p_{x}) \cdot (|p|-1)} \nonumber \\&\sum _{p \in S_p} \sum _{x=1}^{\vert p\vert } \sum _{y=1}^{\vert p\vert } 1_{\textsc {eq}}(s_{\vert s\vert },p_x) \cdot 1_{\textsc {eq}}(i,p_y) \end{aligned}$$

(1)

In Eq. 1, the sums at the right-hand side represent the counting scheme. The term at the left-hand side normalizes the score by the number of total rule occurrences originating from the current item $s_{|s|}$. A list of recommendations returned by the ar method then contains the items with the highest scores in descending order. No minimum support or confidence thresholds are applied. In our implementation, as shared online, we create the rules in one iteration over the training data and store them (sorted by weight) in nested maps to support fast lookups in the recommendation phase. With this data structure, top-n recommendations can be created almost instantaneously.
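The counting scheme of Eq. 1 with nested maps might look like the sketch below. It is not the shared implementation; the names `train_ar` and `score_ar` are hypothetical, and pairs at the same session position are skipped as an assumption so that the counts for one item add up to the normalization term of Eq. 1.

```python
from collections import defaultdict


def train_ar(sessions):
    """Count item co-occurrences within sessions in one pass.

    rules[a][b] is the number of times b occurred in a session
    together with a (pairs at the same position are skipped).
    """
    rules = defaultdict(lambda: defaultdict(int))
    for p in sessions:
        for x, a in enumerate(p):
            for y, b in enumerate(p):
                if x != y:
                    rules[a][b] += 1
    return rules


def score_ar(rules, session):
    """Normalized co-occurrence scores for the last item of the session."""
    counts = rules.get(session[-1], {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: c / total for item, c in counts.items()}


def top_n(scores, n=10):
    """Items with the highest scores in descending order."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    past = [["a", "b", "c"], ["a", "b"], ["b", "d"]]
    print(top_n(score_ar(train_ar(past), ["x", "a"])))
```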

3.1.2 Markov Chains (mc)

The mc baseline can be seen as a variant of ar with a focus on sequences in the data. Here, the rules are extracted from a first-order Markov Chain, see Norris (1997), which describes the transition probability between two subsequent events in a session. In our baseline approach, we simply count how often users viewed item q immediately after viewing item p. Technically, the score for an item i given the current session s with the last event $s_{|s|}$ can be defined as a simplified version of Eq. 1:

$$\begin{aligned} score_{\textsc {mc}}(i,s)= & {} \frac{1}{\sum _{p \in S_p} \sum _{x=1}^{\vert p\vert -1} 1_{\textsc {eq}}(s_{\vert s\vert },p_{x})} \nonumber \\&\sum _{p \in S_p} \sum _{x=1}^{\vert p\vert -1} 1_{\textsc {eq}}(s_{\vert s\vert },p_{x}) \cdot 1_{\textsc {eq}}(i,p_{x+1}) \end{aligned}$$

(2)

where the function $1_{\textsc {eq}}(a,b)$ again indicates whether a and b refer to the same item or not. Here, with the right-hand side of the formula, we count how often item i appears immediately after $s_{|s|}$. The normalization term transforms the absolute count into a relative transition probability. In line with ar, in our implementation the rules and weights are recorded in nested maps in one single iteration over the training data to ensure short training times and to support the fast generation of the recommendations.
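Under the same assumptions as for ar, the first-order transition counting of Eq. 2 can be sketched like this; `train_mc` and `score_mc` are illustrative names and not part of the shared code.

```python
from collections import defaultdict


def train_mc(sessions):
    """Count direct transitions prev -> next in one pass over the sessions."""
    transitions = defaultdict(lambda: defaultdict(int))
    for p in sessions:
        for prev, nxt in zip(p, p[1:]):
            transitions[prev][nxt] += 1
    return transitions


def score_mc(transitions, session):
    """Relative transition probabilities from the last item of the session."""
    counts = transitions.get(session[-1], {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: c / total for item, c in counts.items()}


if __name__ == "__main__":
    past = [["a", "b", "c"], ["a", "c"], ["b", "a", "b"]]
    print(score_mc(train_mc(past), ["c", "a"]))
```

Dividing by the number of transitions that leave the current item corresponds to the normalization term of Eq. 2, so the scores for one current item sum to 1.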

3.1.3 Sequential Rules (sr)

Finally, the sr method as proposed in Kamehkhosh et al. (2017) is a variation of mc or ar respectively. It also takes the order of actions into account, but in a less restrictive manner. In contrast to the mc method, we create a rule when an item q appeared after an item p in a session even when other events happened between p and q.

When assigning weights to the rules, we consider the number of elements appearing between p and q in the session. Specifically, we use the weight function $\textstyle w_{\textsc {sr}}(x) = 1/(x)$, where x corresponds to the number of steps between the two items.^{Footnote 2} Given the current session s, the sr method calculates the score for the target item i as follows:

$$\begin{aligned} score_{\textsc {sr}}(i,s)= & {} \frac{1}{\sum _{p \in S_p} \sum _{x=2}^{\vert p\vert } 1_{\textsc {eq}}(s_{\vert s\vert },p_{x}) \cdot x} \nonumber \\&\sum _{p \in S_p} \sum _{x=2}^{\vert p\vert } \sum _{y=1}^{x-1} 1_{\textsc {eq}}(s_{\vert s\vert },p_y) \cdot 1_{\textsc {eq}}(i,p_x) \cdot w_{\textsc {sr}}(x-y) \end{aligned}$$

(3)

In contrast to Eq. 1 for ar, the third inner sum only considers indices of previous item view events for each session p. In addition, the weighting function $w_{\textsc {sr}}(x)$ is added. Again, we normalize the absolute score by the total number of rule occurrences for the current item $s_{|s|}$. As for ar and mc, the algorithm was implemented using nested sorted maps, which can be created in a single iteration over the training data.
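The weighted counting of Eq. 3 can be illustrated with the following sketch, where `train_sr` and `score_sr` are hypothetical names. The normalization term is left out here because it depends only on the current item and therefore does not change the ranking of the candidates.

```python
from collections import defaultdict


def train_sr(sessions):
    """Create weighted rules p_y -> p_x for every pair with y before x.

    Each rule occurrence is weighted by 1 / (x - y), i.e. by the
    number of steps between the two items.
    """
    rules = defaultdict(lambda: defaultdict(float))
    for p in sessions:
        for x in range(1, len(p)):
            for y in range(x):
                rules[p[y]][p[x]] += 1.0 / (x - y)
    return rules


def score_sr(rules, session):
    """Unnormalized rule weights for the last item of the session."""
    return dict(rules.get(session[-1], {}))


if __name__ == "__main__":
    past = [["a", "b", "c"], ["a", "c"]]
    print(score_sr(train_sr(past), ["a"]))
```

In this toy example, item c follows a once directly and once after one intermediate step, so it receives a weight of 1 + 1/2.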

3.1.4 Bayesian Personalized Ranking (bpr-mf)

To make our results comparable with previous research, we finally include a prediction method based on bpr-mf as a baseline in our experiments.^{Footnote 3} bpr-mf proposed by Rendle et al. (2009) is a learning-to-rank method designed for implicit-feedback recommendation scenarios. The method is usually applied for matrix-completion problem formulations based on longer-term user-item interactions. In bpr-mf the matrix is factorized into two smaller matrices of latent user and item features (W and H), optimizing the following criterion:

$$\begin{aligned} { BPR}_{OPT} = \sum _{(u,i,j) \in D_S} \ln \, \sigma ( r_{u,i} - r_{u,j} ) - \lambda _{{\varTheta }}||{\varTheta }||^{2} \end{aligned}$$

(4)

In the above formula, a ranking $r_{u,i}$ for user u and item i is approximated with the dot product of the corresponding rows in the matrices W and H ($ r_{u,i} = \langle W_u, H_i \rangle $). The model parameters ${\varTheta }= (W,H)$ are learned using stochastic gradient descent in multiple iterations over the dataset $D_S$, which consists of triplets of the form (u, i, j), where (u, i) is a positive feedback pair and (u, j) is a sampled negative example. The optimization criterion in Eq. 4 aims to rank the positive sample (u, i) higher than a non-observed sample (u, j).

To apply the method for the session-based recommendation scenario—where there are no long-term user profiles—we attribute each session in the training set to a different user, i.e., each session corresponds to a user in the user-item interaction matrix. At prediction time, we use the average of the latent item vectors of the current session so far as the user vector.
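The prediction step described above, i.e., using the average of the latent item vectors of the current session as the user vector, can be sketched as follows. The dictionary `H` of item vectors and both function names are assumptions for illustration; the training of the factors is not shown.

```python
def session_vector(session, H):
    """Average the latent vectors of the items seen so far in the session.

    H maps each item to its latent feature vector (a list of floats).
    """
    dims = len(next(iter(H.values())))
    vec = [0.0] * dims
    for item in session:
        for d, value in enumerate(H[item]):
            vec[d] += value
    return [v / len(session) for v in vec]


def score_items(session, H):
    """Score every item by the dot product with the session vector."""
    u = session_vector(session, H)
    return {item: sum(a * b for a, b in zip(u, h)) for item, h in H.items()}


if __name__ == "__main__":
    # Toy item factors; in practice these come from a trained bpr-mf model.
    H = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(score_items(["a", "b"], H))
```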

Generally, BPR and other methods designed for the matrix-completion problems in their original form, i.e., without considering the short-term session context, do not lead to competitive results in session-based recommendation scenarios, as reported, e.g., in Jannach et al. (2015a). Therefore, we do not consider such algorithms, e.g., traditional matrix factorization techniques, as baselines in our experiments.

3.2 Nearest neighbors

Despite their simplicity, nearest-neighbor-based approaches often perform surprisingly well as discussed, e.g., by Verstrepen and Goethals (2014) and in our previous work (Jannach and Ludewig 2017; Kamehkhosh et al. 2017). We, therefore, include different nearest neighbor schemes in our comparison. First, we consider a more traditional item-based variant, which was also employed as a baseline method by Hidasi et al. (2016a). Furthermore, we evaluate three variations of a more recent session-based nearest neighbor technique in our experiments.

3.2.1 Item-based kNN (iknn)

The iknn method as used in Hidasi et al. (2016a) only considers the last element in a given session and then returns those items as recommendations that are most similar to it in terms of their co-occurrence in other sessions. Technically, each item is encoded as a binary vector, where each element corresponds to a session and is set to “1” in case the item appeared in the session. The similarity of two items can then be determined, e.g., using the cosine similarity measure, and the number of neighbors k is implicitly defined by the desired recommendation list length.

Conceptually, the method implements a certain form of a “Customers who bought ... also bought” scheme like the ar baseline. The use of the cosine similarity metric however makes it less susceptible to popularity biases. Although item-to-item approaches are comparably simple, they are commonly used in practice and sometimes considered a strong baseline (Linden et al. 2003; Davidson et al. 2010). In terms of the technical implementation, all similarity values can be pre-computed and sorted in the training process to ensure fast responses at recommendation time.^{Footnote 4}
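A minimal sketch of the item-to-item cosine similarity on binary session vectors is given below. Each item is represented by the set of session ids it appeared in, which is equivalent to the binary encoding; the function names are illustrative and the similarities are computed on demand here rather than pre-computed.

```python
import math
from collections import defaultdict


def item_session_sets(sessions):
    """Map each item to the set of session indices in which it occurs."""
    occ = defaultdict(set)
    for sid, p in enumerate(sessions):
        for item in p:
            occ[item].add(sid)
    return occ


def cosine_binary(a, b):
    """Cosine similarity of two binary vectors given as sets of indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend_iknn(occ, session, n=10):
    """Return the n items most similar to the last item of the session."""
    last = session[-1]
    last_sessions = occ.get(last, set())
    sims = {}
    for item, sess in occ.items():
        if item == last:
            continue
        s = cosine_binary(last_sessions, sess)
        if s > 0:
            sims[item] = s
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    past = [["a", "b"], ["a", "b", "c"], ["c", "d"]]
    print(recommend_iknn(item_session_sets(past), ["a"]))
```

The division by the square root of the item frequencies is what dampens the influence of very popular items compared to the raw co-occurrence counts of ar.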

3.2.2 Session-based kNN (sknn)

Instead of considering only the last event in the current session, the sknn method compares the entire current session with the past sessions in the training data to determine the items to be recommended, see also Hariri et al. (2012), Bonnin and Jannach (2014), and Lerche et al. (2016). Technically, given a session s, we first determine the k most similar past sessions (neighbors) $N_s$ by applying a suitable session similarity measure, e.g., the Jaccard index or cosine similarity on binary vectors over the item space (Bonnin and Jannach 2014). In our experiments, the binary cosine similarity measure led to the best results. As in Jannach and Ludewig (2017), using $k=500$ as the number of neighbors to consider led to good performance results for many datasets. Next, given the current session s, its neighbors $N_s$, and the chosen similarity function $sim(s_1,s_2)$ for two sessions $s_1$ and $s_2$, the recommendation score for each item i can be computed as defined by Bonnin and Jannach (2014):

$$\begin{aligned} score_{\textsc {sknn}}(i,s) = {\varSigma }_{n \in N_s} sim(s,n) \cdot 1_{n}(i) \end{aligned}$$

(5)

Here, the indicator function $1_{n}(i)$ returns 1 if session n contains item i and 0 otherwise.
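Eq. 5 can be illustrated with the following brute-force sketch, which scans all past sessions and uses the binary cosine similarity. The function names are hypothetical, and the scan over all sessions is only suitable for toy data.

```python
import math
from collections import defaultdict


def cosine_sessions(a, b):
    """Binary cosine similarity of two sessions given as item sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def score_sknn(current, past_sessions, k=500):
    """Sum the similarities of the k nearest sessions that contain an item."""
    cur = set(current)
    candidates = []
    for p in past_sessions:
        items = set(p)
        s = cosine_sessions(cur, items)
        if s > 0:
            candidates.append((s, items))
    # Keep the k most similar past sessions as neighbors.
    neighbors = sorted(candidates, key=lambda t: t[0], reverse=True)[:k]
    scores = defaultdict(float)
    for s, items in neighbors:
        for item in items:
            scores[item] += s
    return dict(scores)


if __name__ == "__main__":
    past = [["a", "b", "c"], ["a", "d"], ["e", "f"]]
    print(score_sknn(["a", "b"], past))
```

Items of the current session also receive a score in this sketch; whether they are filtered from the final list is an application decision.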

Scalability considerations Given a current session s, we cannot scan a potentially large set of past sessions for possible neighbors in an online recommendation scenario. Therefore, in our implementation of the algorithm, as described in Jannach and Ludewig (2017) in more detail, we rely on pre-computed in-memory index data structures and on neighborhood sampling to enable fast recommendation responses. The index is used to quickly locate past sessions that contain a certain item, i.e., the index allows us to retrieve possible neighbor sessions that contain at least one element of the current session through fast lookup operations. In addition, considering only a smaller sample of all past sessions as potential neighbors has been shown in our experiments to lead to only comparably small accuracy compromises. In fact, in some domains like e-commerce, only looking for neighbors in the most recent sessions, thereby capturing recent trends in the community, proved to be very effective (Jannach et al. 2017b) and led to even better results than when all past sessions were taken into account.

Our nearest neighbor implementations, therefore, have an additional parameter m, which determines the size of the sample from which the neighbors of a target session are taken. In the experiments reported in Jannach and Ludewig (2017), it was, for example, sufficient to consider only the 1000 most recent sessions from several million existing ones.
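The combination of an item-to-session index and recency-based sampling might be sketched as follows. The data layout (a list of `(timestamp, items)` pairs) and the function names are assumptions for illustration and do not mirror the shared implementation.

```python
from collections import defaultdict


def build_index(sessions_with_time):
    """Map each item to the ids of the past sessions that contain it.

    sessions_with_time is a list of (timestamp, [items]) pairs; the list
    position serves as the session id.
    """
    index = defaultdict(set)
    for sid, (_, items) in enumerate(sessions_with_time):
        for item in items:
            index[item].add(sid)
    return index


def candidate_neighbors(current, index, sessions_with_time, m=1000):
    """Return the ids of the m most recent sessions sharing an item."""
    candidates = set()
    for item in current:
        candidates |= index.get(item, set())
    # Recency-based sampling: keep only the m most recent candidates.
    return sorted(
        candidates,
        key=lambda sid: sessions_with_time[sid][0],
        reverse=True,
    )[:m]


if __name__ == "__main__":
    past = [(10, ["a", "b"]), (20, ["a", "c"]), (30, ["d"]), (40, ["b", "e"])]
    idx = build_index(past)
    print(candidate_neighbors(["a", "b"], idx, past, m=2))
```

The similarity computation then only has to be run on the returned candidates instead of on all past sessions.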

Sequence-aware extensions v-sknn, s-sknn, and sf-sknn The described sknn method does not consider the order of the elements in a session when using the Jaccard index or cosine similarity as a distance measure. Since the order of the elements might, however, be relevant in some domains and since the user preferences might change within a single session depending on the already seen items, we propose three variations of the sknn method.^{Footnote 5}

Vector Multiplication Session-Based kNN (v-sknn): The idea of this variant is to put more emphasis on the more recent events of a session when computing the similarities. Instead of encoding a session as a binary vector as described above, we use real-valued vectors to encode the current session. Only the very last element of the session obtains a value of “1”; the weights of the other elements are determined using a linear decay function that depends on the position of the element within the session, where elements appearing earlier in the session obtain a lower weight. As a result, when using the dot product as a similarity function between the current weight-encoded session and a binary-encoded past session, more emphasis is given to elements that appear later in the sessions.
Sequential Session-based kNN (s-sknn): This variant also puts more weight on elements that appear later in the session. This time, however, we achieve the effect with the following scoring function:
$$\begin{aligned} score_{\textsc {s}\hbox {-}\textsc {sknn}}(i,s) = {\varSigma }_{n \in N_s} sim(s,n) \cdot w_{n}(s) \cdot 1_{n}(i) \end{aligned}$$
(6)
Here the indicator function $1_{n}(i)$ is complemented with a weighting function $w_{n}(i,s)$, which takes the order of the events in the current session s into account. The weight $w_{n}(i,s)$ increases when the more recent items of the current session s also appeared in a neighboring session n. If an item $s_{x}$ is the most recent item of the current session s that also appears in the neighbor session n, then the weight will be defined as $w_{n}(s) = x / |s| $, where the index x indicates the position of $s_{x}$ within the session.^{Footnote 6} If, for example, the second-to-last item of the current session with a length of 5 is the most recent item also included in the neighbor session n, the weight would be $w_{n}(i,s)=4/5$. Items from this neighbor can, therefore, potentially obtain a higher score than, e.g., items from neighbor sessions that only include the third from last item of the current session, which are assigned a weight of 3 / 5.
Sequential Filter Session-based kNN (sf-sknn): This method also uses a modified scoring function, but in a more restrictive way. The basic idea is that given the last event (and related item $s_{|s|}$) of the current session s, we only consider items for recommendation that appeared directly after $s_{|s|}$ in the training data at least once.
$$\begin{aligned} score_{\textsc {sf}\hbox {-}\textsc {sknn}}(i,s) = {\varSigma }_{n \in N_s} sim(s,n) \cdot 1_{n}(s_{\vert s\vert },i) \end{aligned}$$
(7)
While the general scoring function is identical to the one of sknn (Eq. 5), we use a different implementation of the indicator function $1_{n}(s_{|s|},i)$. Here, 1 is only returned if there exists any past session which contains the sequence $(s_{|s|},i)$, given $s_{|s|}$ is the item currently viewed in the user’s current session s. Though the sequence $(s_{|s|},i)$ can be part of any past session, the item i obviously still has to be a part of the neighbor session n for the indicator function to return 1.
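The three variants above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. All function names are hypothetical, the similarity function `sim` is passed in by the caller, and the linear decay `position / length` for v-sknn is an assumption, since the text only states that the decay is linear and that the last element receives a weight of 1.

```python
def linear_decay_weights(session):
    """v-sknn style encoding: position p of n gets weight p / n (assumed form)."""
    n = len(session)
    weights = {}
    for pos, item in enumerate(session, start=1):
        weights[item] = pos / n  # a repeated item keeps its most recent weight
    return weights


def vsknn_similarity(session, neighbor):
    """Dot product of the weighted current session and a binary past session."""
    neighbor_items = set(neighbor)
    weights = linear_decay_weights(session)
    return sum(w for item, w in weights.items() if item in neighbor_items)


def ssknn_weight(session, neighbor):
    """w_n(s) = x / |s| for the most recent session item shared with n."""
    neighbor_items = set(neighbor)
    for pos in range(len(session), 0, -1):
        if session[pos - 1] in neighbor_items:
            return pos / len(session)
    return 0.0


def ssknn_score(item, session, neighbors, sim):
    """Eq. 6: sum of sim(s, n) * w_n(s) over neighbors that contain the item."""
    return sum(sim(session, n) * ssknn_weight(session, n)
               for n in neighbors if item in n)


def build_followers(training_sessions):
    """Map each item to the set of items that directly followed it."""
    followers = {}
    for s in training_sessions:
        for prev, nxt in zip(s, s[1:]):
            followers.setdefault(prev, set()).add(nxt)
    return followers


def sfsknn_score(item, session, neighbors, sim, followers):
    """Eq. 7: like sknn, but only for items seen directly after the last item."""
    if item not in followers.get(session[-1], set()):
        return 0.0
    return sum(sim(session, n) for n in neighbors if item in n)
```

For a session of length 5 and a neighbor whose most recent shared item sits at position 4, `ssknn_weight` returns 4/5, which matches the example given for s-sknn.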

3.3 Neural networks—gru4rec

Approaches based on Recurrent Neural Networks (RNNs), as discussed in Sect. 2, represent the most recently explored family of techniques for session-based recommendation problems. Among these methods, gru4rec is one of the latest deep learning approaches that was specifically designed for session-based recommendation scenarios (Hidasi et al. 2016a; Hidasi and Karatzoglou 2017).

gru4rec models user sessions with the help of an RNN with Gated Recurrent Units (Cho et al. 2014) in order to predict the probability of the subsequent events (e.g., item clicks) given a session beginning. Figure 1 shows the general architecture of the network, in which the embedding, the feedforward, and additional GRU layers are optional. In fact, the authors of the method found that a single GRU layer of varying width led to the best performance in their experiments.

The input of the network is formed by a single item, which is one-hot encoded in a vector representing the entire item space, and the output is a vector of similar shape that should give a ranking distribution for the subsequent item. In between, the standard GRU layer keeps track of a hidden state that encodes the previously occurring items in the same session. Therefore, while training and predicting with the help of this network architecture, the items of a session have to be fed into the network in the correct order and the hidden state of the GRUs has to be reset after a session ends. In terms of the activation functions, the authors found tanh and the sigmoid function to work best for the GRU and the ranking layer, respectively.

While the usage of RNNs for session-based, or more generally, sequential prediction problems is a natural choice, the particular network architecture, the choice of the loss functions, and the use of session-parallel mini-batches to speed up the training phase are key innovative elements of the approach.

The model can be trained with stochastic gradient descent (SGD) using established optimizations like ADAM, ADADELTA, RMSProp, or ADAGRAD (Duchi et al. 2011; Zeiler 2012; Kingma and Ba 2014). As common in practice when optimizing deep neural networks, Hidasi et al. train the network in batches. To ensure that the items or events are fed into the network in the correct order, they propose the session-parallel mini-batch training scheme, which is illustrated in Fig. 2. In the training process, each part of a batch belongs to a specific session in the training data and the network records a separate hidden state for each position. Whenever a session at a position in the batch ends, the corresponding hidden state is reset and the next batch update includes the first event of a new session at that position.
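The session-parallel batching scheme can be illustrated with a small generator. This is a simplified sketch and not the gru4rec code. The function name, the tuple layout, and the choice to stop as soon as no unused session is left to fill a free batch position are assumptions made for the illustration.

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset_flags) for each training step.

    Every batch position follows one session at a time. reset_flags[b] is
    True when position b starts a new session, i.e. when the hidden state
    kept for that position has to be reset before the step.
    """
    sessions = [s for s in sessions if len(s) >= 2]  # need input and target
    if len(sessions) < batch_size:
        return
    active = list(range(batch_size))  # session index per batch position
    cursor = [0] * batch_size         # index of the input item per position
    next_session = batch_size
    reset = [True] * batch_size
    while True:
        inputs = [sessions[active[b]][cursor[b]] for b in range(batch_size)]
        targets = [sessions[active[b]][cursor[b] + 1] for b in range(batch_size)]
        yield inputs, targets, list(reset)
        reset = [False] * batch_size
        for b in range(batch_size):
            cursor[b] += 1
            if cursor[b] + 1 >= len(sessions[active[b]]):
                # Session at position b is exhausted: load the next one.
                if next_session >= len(sessions):
                    return  # simplification: stop when no session is left
                active[b] = next_session
                next_session += 1
                cursor[b] = 0
                reset[b] = True
```

With the sessions `[1, 2, 3]`, `[4, 5]`, `[6, 7, 8]` and a batch size of 2, the second step feeds item 2 at the first position and item 6 at the second one, and only the second position is flagged for a hidden-state reset.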

As usual, a number of hyper-parameters can be tuned, including the learning rate, the layer sizes, a momentum factor, and a drop-out factor to stabilize the network. The choice of the loss function is another key to the quality of the recommendations of gru4rec. The following loss functions were designed or applied by the authors. In particular, the latest function (MAX) proposed by Hidasi and Karatzoglou (2017) led to a significant performance improvement over the previous ones.

BPR Bayesian Personalized Ranking (BPR), as discussed above, uses a pairwise ranking loss function for the task of creating top-n recommendations. In gru4rec, a generalized version of this function is applied using the following formula:
$$\begin{aligned} L_s({\hat{r}}_{s,i},S_N) = -\frac{1}{|S_N|} \cdot \sum _{j \in S_N} \log (\sigma ({\hat{r}}_{s,i}-{\hat{r}}_{s,j})) \end{aligned}$$
(8)
In the loss function, the predicted rating ${\hat{r}}_{s,i}$ for the actual next item i given the current session s is compared to a set of negative samples $S_N$ with the goal of maximizing the difference between them. Here, the sigmoid and logarithm functions are applied to represent the proportion between the ranking of the negative and the positive example.
TOP1 This loss function was introduced by the authors of gru4rec and can be seen as a regularized approximation of the relative rank of a positive sample ${\hat{r}}_{s,i}$ and the negative samples $S_N$:
$$\begin{aligned} L_s({\hat{r}}_{s,i},S_N) = \frac{1}{|S_N|} \cdot \sum _{j \in S_N} \sigma ({\hat{r}}_{s,j}-{\hat{r}}_{s,i}) + \sigma ({\hat{r}}_{s,j}^2) \end{aligned}$$
(9)
Here the proportion is approximated with the sigmoid function, and the regularization term $\sigma ({\hat{r}}_{s,j}^2)$ is added so that the score of the negative samples is directed to zero.
MAX In continuation of their work, the authors proposed a generic extension to these two loss functions, where $L_s$ stands for a loss function like BPR or TOP1 defined above:
$$\begin{aligned} L_{max}({\hat{r}}_{s,i},S_N) = L_s \left( {\hat{r}}_{s,i}, \left\{ \max _{j \in S_N} {\hat{r}}_{s,j}\right\} \right) \end{aligned}$$
(10)
Instead of using a sum of differences between the positive item’s rating ${\hat{r}}_{s,i}$ and the negative samples $S_N$, only the highest rated negative sample $\max _{j \in S_N} {\hat{r}}_{s,j}$ from $S_N$ is used to calculate the loss. As this function has to be differentiable for SGD training, $\max _{j \in S_N}$ is approximated with the softmax function. The resulting functions $\textit{BPR}_{max}$ and $\textit{TOP}1_{max}$ showed superior performance when compared to the BPR and TOP1 functions (Hidasi and Karatzoglou 2017).
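The loss functions can be written down for a single positive score and a list of negative scores as follows. This is a sketch in plain Python rather than the gru4rec code. `bpr_loss` and `top1_loss` follow Eqs. 8 and 9, with the sum in Eq. 9 read as covering both terms. The two `*_max` variants use softmax weights over the negative scores as the differentiable stand-in for the maximum, which is an assumption based on the description in Hidasi and Karatzoglou (2017); the additional regularization term of their BPR-max formulation is omitted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bpr_loss(pos, negs):
    """Eq. 8: mean negative log-sigmoid of the score differences."""
    return -sum(math.log(sigmoid(pos - r)) for r in negs) / len(negs)


def top1_loss(pos, negs):
    """Eq. 9: relative rank approximation plus regularization term."""
    return sum(sigmoid(r - pos) + sigmoid(r * r) for r in negs) / len(negs)


def bpr_max_loss(pos, negs):
    """BPR with the maximum replaced by softmax-weighted negatives."""
    weights = softmax(negs)
    return -math.log(sum(w * sigmoid(pos - r) for w, r in zip(weights, negs)))


def top1_max_loss(pos, negs):
    """TOP1 with the maximum replaced by softmax-weighted negatives."""
    weights = softmax(negs)
    return sum(w * (sigmoid(r - pos) + sigmoid(r * r))
               for w, r in zip(weights, negs))
```

With a single negative sample the softmax weight is 1, so `bpr_max_loss` and `bpr_loss` coincide; the variants only differ once several negatives with different scores are present.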

In our experiments, we used the gru4rec (v2.0) implementation that the authors shared online. The code is regularly maintained by the authors and includes the implementation of the gru4rec method, the code of their baseline algorithms, as well as the code for the evaluation procedure proposed in Hidasi et al. (2016a).

3.4 Factorization-based methods

As described in Sect. 2, a number of (hybrid) factorization-based methods were proposed in recent years for sequential recommendation problems. We include three existing methods from the literature in our experiments, Factorized Personalized Markov Chains (fpmc) proposed by Rendle et al. (2010), fism by Kabbur et al. (2013), and fossil by He and McAuley (2016). Generally, these methods aim at predicting the next actions of users, but were not designed for session-based recommendation scenarios with anonymous users. We therefore describe for each method how we applied them to our problem setting. In addition, we propose a novel factorization method called Session-based Matrix Factorization (smf), which relies on the $\textit{BPR}_{max}$ and $\textit{TOP}1_{max}$ loss functions as described above.

3.4.1 Factorized Personalized Markov Chains (fpmc)

The fpmc method was designed for the specific problem of next-basket recommendation. The problem consists of predicting the contents of the next basket of a user, given his or her history of past shopping baskets. By limiting the basket size to one item and by considering the current session as the history of baskets, the method can be directly applied for session-based recommendation problems.

Technically, fpmc combines mc and traditional user-item matrix factorization in a three-dimensional tensor factorization approach. As illustrated in Fig. 3, the third dimension captures the transition probabilities from one item to another.

Internally, a special form of the Canonical Tensor Decomposition is used to factor the cube into latent matrices, which can then be used to predict a ranking in the following way:

$$\begin{aligned} {\hat{r}}_{u,l,i} = \left\langle v_{u}^{U,I},v_{i}^{I,U}\right\rangle + \left\langle v_{i}^{I,L},v_{l}^{L,I}\right\rangle + \left\langle v_{u}^{U,L},v_{l}^{L,U}\right\rangle \end{aligned}$$

(11)

where ${\hat{r}}_{u,l,i}$ is a score for item i with the preferences of user u when he or she previously examined item l. The three-dimensional decomposition results in six latent matrices $v^{X,Y}$ representing the latent factors for dimension X regarding dimension Y, e.g., $v^{U,L}$ are the user latent factors in terms of the previously examined item and $v^{I,L}$ the item latent factors regarding the previously examined item. Accordingly, $v_{u}^{U,L}$ for example represents the factors for a single user u and $v_{i}^{I,L}$ the factors for item i, which are combined with the regular dot product ($\langle a,b \rangle $) to calculate the ranking ${\hat{r}}_{u,l,i}$. Those latent factors are learned using SGD with the pairwise ranking loss function BPR.

In our problem setting, where we have no long-term user histories, each session in the training data corresponds to a user. Once the model is trained, each new session therefore represents a user cold-start situation. To apply the model to our setting, we estimate the session latent vectors as the average of the latent factors of the individual items in the session. This approach was adopted also by Hidasi et al. (2016a) to apply bpr-mf to session-based recommendation scenarios.
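The session-based use of Eq. 11 can be sketched as follows, assuming already trained factors. The dictionary names are hypothetical, and it is an assumption of this sketch that the session vector is the mean of the $v^{I,U}$ factors of the session items. The third term of Eq. 11 is left out because it does not depend on the candidate item i and therefore does not change the ranking.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fpmc_session_score(session, item, V_IU, V_IL, V_LI):
    """Score a candidate item for an anonymous session.

    V_IU[i]: factors of item i with respect to users
    V_IL[i]: factors of item i with respect to the previous item
    V_LI[l]: factors of previous item l with respect to the next item
    """
    # Stand-in for the unknown user factors (assumed averaging scheme).
    session_vec = mean_vector([V_IU[j] for j in session])
    last = session[-1]
    return dot(session_vec, V_IU[item]) + dot(V_IL[item], V_LI[last])
```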

3.4.2 Factored Item Similarity Models (fism)

This method is based on an item-item factorization, which has the advantage of being directly applicable to our session-based cold-start scenario, where no explicit user representation can be learned. However, fism does not incorporate sequential item-to-item transitions like fpmc does. Equation 12 shows the prediction function which Kabbur et al. (2013) trained using SGD to predict ratings, e.g., for the movie domain.

$$\begin{aligned} {\hat{r}}_{u,i} = b_u + b_i + (n_u^+)^{-\alpha } \sum _{j \in R_u^+} p_j q_i^T \end{aligned}$$

(12)

Technically, for user u and item i, a score ${\hat{r}}_{u,i} $ is calculated as the sum of latent vector products $p_j q_i^T$ between item i and the items $R_u^+$ already rated by the user u. In our scenario, $R_u^+$ corresponds to the previously inspected items in a session. The terms $b_u$ and $b_i$ are bias terms and $n_u^+$ specifies the number of ratings by user u, which is combined with a parameter $\alpha $ to normalize the sum of vector products to a certain degree. Instead of using the $\textit{RMSE}$ as an error metric, we use $\textit{BPR}$’s pairwise loss function when optimizing the top-n recommendations for the given implicit feedback scenario.
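Eq. 12 translates directly into code once the items of the current session take the role of $R_u^+$. The following sketch assumes trained factor dictionaries `P` and `Q` and item biases `b_i`, all hypothetical names. Setting the user bias to 0 for anonymous sessions is an assumption of this illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fism_score(session, item, P, Q, b_i, alpha, b_u=0.0):
    """Eq. 12 with the session items as R_u^+ and n_u^+ = len(session)."""
    total = sum(dot(P[j], Q[item]) for j in session)
    return b_u + b_i[item] + len(session) ** (-alpha) * total
```

With `alpha = 0` the sum of vector products is used as is, while `alpha = 1` turns it into an average over the session items.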

3.4.3 Factorized Sequential Prediction with Item Similarity Models (fossil)

In this approach, fism is combined with factorized Markov chains to incorporate sequential information into the model. The model can be described as shown in Eq. 13 (from He and McAuley 2016):

$$\begin{aligned} {\hat{r}}_{u,l,i} = \underbrace{ \sum _{j \in R_u^+ \setminus \{i\}} p_j q_i^T }_\text {long-term preferences} + \overbrace{(w + w_u)}^\text {personalized weighting} \cdot \underbrace{n_l m_i^T}_\text {sequential dynamics} \end{aligned}$$

(13)

Again, ${\hat{r}}_{u,l,i} $ represents a rating for item i given a user u and his or her previously inspected item l. The first term represents the long-term user preferences and corresponds to the fism model in Eq. 12. Using a weighted sum with a global factor w and a personalized factor $w_u$, the model is extended by a factorized Markov chain to capture the sequential dynamics. In the last term of Eq. 13, a latent vector $n_l$ for item l is multiplied with a latent vector $m_i$ for item i to factor in the user-independent probability of item l being followed by item i.

In our scenario, again, the sessions represent the users, $R_u^+$ corresponds to the current session and $\textit{BPR}$ is used as the loss function to rank suitable items over negative examples.
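A compact sketch of Eq. 13 for the session-based setting is given below. The factor dictionaries `P`, `Q`, `N`, and `M` are hypothetical names for trained parameters. Passing the personalized weight `w_u` as an argument with a default of 0 is an assumption, since the text does not state how this weight is obtained for a new session.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fossil_score(session, item, P, Q, N, M, w, w_u=0.0):
    """Eq. 13 with the session as R_u^+ and its last item as l."""
    # Long-term part: item similarity to all session items except i itself.
    long_term = sum(dot(P[j], Q[item]) for j in session if j != item)
    # Sequential part: factorized transition from the last item l to item i.
    sequential = dot(N[session[-1]], M[item])
    return long_term + (w + w_u) * sequential
```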

3.4.4 Session-based Matrix Factorization (smf)

Finally, smf is a novel factorization-based model that we designed for the specific task of session-based recommendation. Similar to fossil, it combines factorized Markov chains with classic matrix factorization. In addition, our method considers the cold-start situation of session-based recommendation scenarios as follows.

In contrast to the traditional factorization-based prediction model $r_{u,i} = p_u q_i^T$, in the smf method, we replace the latent user vector $p_u$ with a session preference vector $s_e$, which is computed as an embedding of the current session s:

$$\begin{aligned} s_e = M_{ST} \cdot s^T \end{aligned}$$

(14)

Here the session s is represented as a binary vector similar to the representation in sknn (see Sect. 3.2.2) and $M_{ST}$ is a transformation matrix of size $|I| \cdot |s_e|$, which reduces the size of the binary session vector (number of unique items |I|) to a specific latent vector size $|s_e|$.

Based on the embedded session representation $s_e$, the prediction function is defined as shown in Eq. 15.

$$\begin{aligned} {\hat{r}}_{s,l,i} = w_i\cdot ( \underbrace{ s_e q_i^T + b_{1,i}}_\text {session preferences} ) + (1-w_i)\cdot ( \overbrace{n_l m_i^T + b_{2,i}}^\text {sequential dynamics}) \end{aligned}$$

(15)

The score ${\hat{r}}_{s,l,i}$ for a session s with the most recent item l and an item i is computed as a weighted combination of session preferences and sequential dynamics. Here, the session preferences correspond to the long-term user preferences in the traditional matrix factorization model, i.e., the embedded session latent vector $s_e$ for the current session s is multiplied with an item latent vector $q_i$ for item i to compute a relevance score of i regarding s. The sequential dynamics are captured exactly as in Eq. 13 for fossil using latent representations for the currently inspected item l and item i. Both partial scores are adjusted with a separate bias term $b_{x,i}$ and combined in a weighted sum with the factor $w_i$ dependent on item i.

To train this model, we incorporated some of the concepts from gru4rec (see Sect. 3.3). Specifically, we adopted ADAGRAD for SGD-based optimization, and used $\textit{BPR}_{max}$ and $\textit{TOP}1_{max}$ as loss functions. Furthermore, we integrated two additional concepts (and corresponding hyper-parameters) in the training phase to avoid model over-fitting: a session drop-out factor and a skip-rate. For a drop-out factor of 0.1, for example, each positive entry of the binary session input vector is set to 0 with a probability of 10%. The skip-rate, in contrast, describes how often the item after the immediate next one, rather than the immediate next item itself, is used as the positive sample in the training process. A skip-rate of 0.1 therefore means that in 10% of the cases the immediate next item is skipped.
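The prediction function of Eqs. 14 and 15 and the two regularization steps can be sketched as follows. This is an illustration under stated assumptions, not the smf implementation: all names are hypothetical, `E[item]` stands for the column of $M_{ST}$ that belongs to an item, and the fallback to the immediate next item when no later item exists is an assumption.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def embed_session(session, E):
    """Eq. 14 for a binary s: sum the M_ST columns of the session items."""
    dim = len(next(iter(E.values())))
    s_e = [0.0] * dim
    for item in set(session):  # binary encoding ignores repetitions
        for d, value in enumerate(E[item]):
            s_e[d] += value
    return s_e


def smf_score(session, item, E, Q, N, M, b1, b2, w):
    """Eq. 15: weighted mix of session preferences and sequential dynamics."""
    preferences = dot(embed_session(session, E), Q[item]) + b1[item]
    dynamics = dot(N[session[-1]], M[item]) + b2[item]
    return w[item] * preferences + (1.0 - w[item]) * dynamics


def session_dropout(session, rate, rng):
    """Drop each session item from the input with probability `rate`."""
    return [item for item in session if rng.random() >= rate]


def pick_target(session, pos, skip_rate, rng):
    """Positive sample for the prefix that ends at index `pos`."""
    if pos + 2 < len(session) and rng.random() < skip_rate:
        return session[pos + 2]  # skip the immediate next item
    return session[pos + 1]
```

A seeded `random.Random` instance can be passed as `rng` to make the drop-out and skip decisions reproducible across runs.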

4 Experiment setup

In this section, we describe the details of our algorithm comparison in terms of the evaluation protocol used, the performance measures, and the evaluation datasets. All source code and pointers to the public datasets are provided online to ensure reproducibility of our research.^{Footnote 7}

4.1 Evaluation protocol and performance measures

The general computational task in session-based recommendation problems is to generate a ranked list of objects that in some form “matches” a given session beginning. What represents a good match depends on the specific application scenario. It could be a set of alternative shopping items in an e-commerce scenario or a continuation of a given music listening session.

In offline evaluations for session-based recommendations, researchers often abstract from the underlying purpose of the system (Jannach and Adomavicius 2016), e.g., if the recommender should help discover something new or find alternatives to a currently inspected item. Instead, the recorded user sessions are typically considered as a “gold standard” for the evaluation. To measure the performance of an algorithm, researchers resort to assessing the capability of an algorithm to predict the withheld entries of a session.

Different approaches are found in the literature to withhold certain entries of a session. In some works, only the last element is hidden (Hariri et al. 2012; Bonnin and Jannach 2014), some propose to “reveal” the first n elements of a session (Jannach et al. 2015a), while others, finally, evaluate their approaches by iteratively revealing one entry after the other (Hidasi et al. 2016b). We employed the latter iterative revealing scheme in our experiments as it (i) conceptually includes both of the other techniques and (ii) reflects the user journey throughout a session in the best way.

Selection of the target item and accuracy measures We measured prediction accuracy in two ways and correspondingly report the results in separate tables.

First, to establish comparability with existing research, we use an evaluation scheme in which the task is to predict the immediate next item given the first n elements. For each session, we iteratively increment n, measure the hit rate (HR) and the mean reciprocal rank (MRR), and finally determine the average HR and MRR for all sessions for the different list lengths, as done by Hidasi et al. (2016b).
Second, instead of focusing only on the next item, we made a measurement where we considered all subsequent elements for the given session beginning, because all of them might be relevant to the user. In this scheme, we used the standard information retrieval measures precision and recall at defined list lengths. The number of given elements of the session is also iteratively incremented as in the previously described evaluation scheme.
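Both measurement schemes can be expressed with the same iterative revealing loop. The sketch below is an illustration rather than the evaluation code of the study. `recommend` is a hypothetical callable that maps a session prefix to a ranked item list, and averaging over all prediction steps instead of first averaging per session is an assumption.

```python
def evaluate_next_item(sessions, recommend, k):
    """Hit rate and MRR at list length k for the immediate next item."""
    hits, reciprocal_ranks, steps = 0, 0.0, 0
    for session in sessions:
        for n in range(1, len(session)):  # reveal the first n items
            ranked = recommend(session[:n])[:k]
            target = session[n]
            steps += 1
            if target in ranked:
                hits += 1
                reciprocal_ranks += 1.0 / (ranked.index(target) + 1)
    if steps == 0:
        return 0.0, 0.0
    return hits / steps, reciprocal_ranks / steps


def evaluate_remaining_items(sessions, recommend, k):
    """Precision and recall at k with all withheld items as relevant."""
    precision, recall, steps = 0.0, 0.0, 0
    for session in sessions:
        for n in range(1, len(session)):
            ranked = recommend(session[:n])[:k]
            relevant = set(session[n:])
            found = len(relevant & set(ranked))
            precision += found / k
            recall += found / len(relevant)
            steps += 1
    if steps == 0:
        return 0.0, 0.0
    return precision / steps, recall / steps
```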

Sessionization strategies Different strategies exist in the literature to split the user activity logs into sessions. In some of the public datasets used in our evaluation, the activity logs were already split up into sessions, i.e., each log entry was assigned a unique session ID (RSC15, Zalando). For other datasets (RETAILR, NOWPLAYING, 30MUSIC, CLEF), we used a common heuristic-based approach and considered a session as over after a defined user idle time, e.g., 30 minutes of user inactivity (Cooley et al. 1999). For the TMALL dataset, where the timestamps for the recorded events were only available at the granularity of a day, we considered all events of one day as belonging to one session. Finally, for the two playlist datasets (AOTM, 8TRACKS), we considered all elements of a playlist to be part of a session.
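The idle-time heuristic can be sketched as follows. The event layout `(user_id, timestamp_in_seconds, item_id)` and the function name are assumptions made for this illustration; the default of 1800 seconds corresponds to the 30 minutes mentioned above.

```python
def sessionize(events, idle_seconds=1800):
    """Split per-user event logs into sessions after a period of inactivity."""
    sessions = []
    last_seen = {}  # user_id -> (timestamp of last event, current session)
    for user, timestamp, item in sorted(events, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(user)
        if previous is None or timestamp - previous[0] > idle_seconds:
            current = []  # first event of the user or idle time exceeded
            sessions.append(current)
        else:
            current = previous[1]
        current.append(item)
        last_seen[user] = (timestamp, current)
    return sessions
```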

Training and test splits, repeated subsampling Hidasi et al. (2016b) used one single training-test split. In the case of an e-commerce dataset, the data was split in a way that the sessions of all six months except those of the very last day of the entire dataset were placed in the training set. The last day was used for testing. We report the results of applying this evaluation scheme to ensure comparability, e.g., with respect to the results obtained for the e-commerce dataset that was used in their experiments.

Since such single-split setups have their limitations, we focus our discussion on the results that were obtained when applying a sliding-window protocol, where we split the data into 5 slices of equal size in days. For most e-commerce data, for example, we used the data of about one month for training and the subsequent data (e.g., of one day) for testing (see Sect. 4.2 for the dataset specific configurations). This allows us to make multiple measurements with different test sets. We then evaluate the performance for each of these data samples and report the average of the performance results for all slices. This latter protocol helps us reduce the danger that the observed outcomes are the results of one particular train-test configuration.^{Footnote 8}
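One way to derive such slices is sketched below. The exact placement of the training and test windows inside each slice is not specified in the text, so starting every window at the beginning of its slice is an assumption, as are the function name and the use of integer day indices.

```python
def sliding_window_splits(first_day, last_day, n_slices, train_days, test_days):
    """Return [((train_start, train_end), (test_start, test_end)), ...].

    Days are integer indices and all ranges are inclusive. Each slice
    contributes one training window followed directly by its test window.
    """
    total_days = last_day - first_day + 1
    slice_len = total_days // n_slices
    if train_days + test_days > slice_len:
        raise ValueError("training and test window do not fit into a slice")
    splits = []
    for k in range(n_slices):
        start = first_day + k * slice_len
        train = (start, start + train_days - 1)
        test = (start + train_days, start + train_days + test_days - 1)
        splits.append((train, test))
    return splits
```

The reported result for a dataset would then be the average of the measurements obtained on the individual splits.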

For the playlist datasets 8TRACKS and AOTM no timestamp information is available. For these datasets we therefore applied a standard cross-validation procedure, where elements are randomly assigned to the training and test sets. We did not use such a time-agnostic data splitting procedure for the e-commerce and news datasets for different reasons. First, as the results will show, there are strong temporal effects that should be considered in the recommendation process. Second, in these domains, the set of items is not static and in particular in the news domain new items appear constantly. Randomly splitting the sessions would then potentially result in the effect that future interactions with not-yet-existing items would be considered in the training phase.

Additional quality factors Since accuracy is not the only relevant quality factor in practice, we made the following additional measurements, as was done by Jannach and Ludewig (2017).

Coverage We report how many different items ever appear in the top-k recommendations. This measure represents a form of catalog coverage, which is sometimes referred to as aggregate diversity (Adomavicius and Kwon 2012).
Popularity bias High accuracy values can, depending on the measurement method, correlate with the tendency of an algorithm to recommend mostly popular items (Jannach et al. 2015b). To assess the popularity tendencies of the tested algorithms, we report the average popularity score for the elements of the top-k recommendations of each algorithm. This average score is the mean of the individual popularity scores of each recommended item. We compute these scores based on the training set by counting how often each item appears in one of the training sessions and by then applying min-max normalization to obtain a score between 0 and 1.
Cold start Some methods might only be effective when a significant amount of training data is available. We, therefore, report the results of measurements where we artificially removed parts of the (older) training data to simulate such situations.
Scalability Training modern machine learning methods can be computationally challenging, and obtaining good results may furthermore require extensive parameter tuning. We, therefore, report the times that the algorithms needed to train the models and to make predictions at runtime. In addition, we report the memory requirements of the algorithms.
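The coverage and popularity measurements can be sketched as follows. Function names are hypothetical. Counting an item at most once per training session and averaging the popularity over all recommended items are assumptions about details that the description leaves open.

```python
from collections import Counter


def coverage(recommendation_lists):
    """Number of distinct items that appear in any top-k list."""
    return len({item for ranked in recommendation_lists for item in ranked})


def popularity_scores(training_sessions):
    """Min-max normalized session counts per item, in the range [0, 1]."""
    counts = Counter(item for s in training_sessions for item in set(s))
    low, high = min(counts.values()), max(counts.values())
    if high == low:
        return {item: 0.0 for item in counts}
    return {item: (c - low) / (high - low) for item, c in counts.items()}


def popularity_bias(recommendation_lists, scores):
    """Mean popularity score of all recommended items."""
    values = [scores.get(item, 0.0)
              for ranked in recommendation_lists for item in ranked]
    return sum(values) / len(values)
```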

By reporting the quality factors coverage and popularity bias, our aim is to emphasize that different recommendation strategies can lead to quite different recommendations, even if they are similar in terms of the prediction accuracy, see also Jannach et al. (2015b). Such multi-metric evaluation approaches should also help practitioners to better understand the potential side effects of the recommenders, e.g., reduced average sales diversity and additionally increased sales of top-sellers (Lee and Hosanagar 2014). It remains, however, difficult to aggregate the individual performance factors into one single score, as the relative importance of the factors can depend not only on the application domain, but also on the specific business model of the provider.

Parameter optimization Some of the algorithms that we tested, including smf and gru4rec, require extensive (hyper-)parameter tuning. Thus, we systematically optimized the parameters for those algorithms for each dataset. Due to the computational complexity of the methods, we restricted the layer size for gru4rec as well as the number of latent factors for smf to 100 and used a randomized search method with 100 iterations for the remaining parameters as described by Hidasi and Karatzoglou (2017). In each iteration, the learning rate, the drop-out factor, the momentum, and the loss function were determined in a randomized process to find the maximum hit rate for a list length of 20. All optimizations were performed on special validation splits, which were created by splitting a training set into a validation training and test set. For the simpler sknn-based approaches, we used the same validation sets to manually adjust the number of neighbors and samples when applying cosine similarity as the distance measure (except for v-sknn). The final parameters for each method and dataset are provided in “Appendix A”.

4.2 Datasets

We made measurements for datasets from three different domains: e-commerce, music, and news.

E-commerce datasets We used the following four e-commerce datasets.

RSC15 This is one of the datasets that was used in Hidasi et al. (2016a) and their later works. It was published in the context of the ACM RecSys 2015 Challenge and contains recorded click sequences (item views, purchases) for a period of six months. We use the label RSC15-S to denote the dataset and measurement where only one single train-test split is used. For RSC15, each split consists of 30 days of training and 1 day of test data.
TMALL This dataset was published in the context of the TMall competition and contains interaction logs of the tmall.com website for one year. For TMALL, each split consists of 90 days of training and 1 day of test data.
RETAILR The e-commerce personalization company retailrocket published this dataset covering six months of user browsing activities, also in the context of a competition. For RETAILR, each split consists of 25 days of training and 2 days of test data.
ZALANDO The final dataset is non-public and was shared with us by the fashion retailer Zalando. It contains user logs of their shopping platform for a period of one year. In our evaluation, we only considered the item view events as was done for the other e-commerce datasets. For ZALANDO, each split consists of 90 days of training and 1 day of test data.

Table 1 Characteristics of the e-commerce datasets
