
1 Introduction

Nowadays, abundant user-item interactions in recommender systems (RS) are recorded over time, and these records can be further used to discover patterns in users’ behaviors [3, 12]. Sequential recommendation is therefore becoming a new trend in both academic research and practical applications, because it leverages the temporal information among users’ transactions to better infer user preferences.

Dominant approaches aim to model long-term temporal information and capture holistic dependencies over a user-item sequence, yet short-term temporal information, which is essential for capturing partial dependencies, is equally significant. The long-term interaction is depicted in Fig. 1(a), where arrows indicate the dependencies within a user-item interaction sequence. As a representative of long-term dependency modeling in general RS, factorization-based methods play an important role in long-term sequential recommendation owing to their remarkable efficiency [12]. Factorization-based methods decompose the entire user-item interaction matrix into two low-rank matrices. Operating on the entire user-item interaction matrix is well suited to capturing longer-term user preference profiles, but it has limitations in capturing short-term user interests. Two main drawbacks exist in factorization-based methods for sequential recommendation: 1) they fail to fully exploit the rich transition dependencies among multiple items; 2) modeling the entire set of user-item dependencies incurs an enormous computing cost, since the user-item interaction matrix keeps growing as users generate new interactions [8, 9].

As for modeling users’ short-term interests, mainstream methods such as Markov chain-based approaches [3] leverage transition dependencies between items at the individual level. The short-term interaction at the individual level is shown in Fig. 1(b). Individual-level dependencies can capture the influence between a pair of single items, but they may neglect the collective influence [19] among three or more items, denoted by union-level dependencies, as shown in Fig. 1(c). That is, collective influence arises when a single item depends on a group of preceding items. To alleviate this issue, Yu et al. [19] leverage both individual and collective influence for better sequential recommendation performance. However, two main drawbacks exist in this method: 1) the information of individual and collective influence is simply added to the output proximity score of a factorization-based model, leveraging none of the long-term information; 2) the union-level interaction requires a group of items to be jointly modeled within a limited sequence length, which may lead to a sparsity problem.

In this paper, we propose a unified framework, Joint Relational Dependency Learning (JRD-L), which exploits long-term temporal information together with short-term temporal information at both the individual level and the union level to improve sequential recommendation. In particular, a Long Short-Term Memory (LSTM) model [5] is used to encode long-term preferences, while short-term dependencies expressed as pair relations among items are computed from the intermediate hidden states of the LSTM at both the individual level and the union level. The LSTM hidden states carry long-term dependency information and transmit it to the short-term item pairs. Meanwhile, the individual-level and union-level relations are modeled together to fully exploit the collective influence among union-level pair relations and to address the sparsity problem. The framework of JRD-L is depicted in Fig. 3. Experiments on a large-scale dataset demonstrate the effectiveness of the proposed JRD-L. The main contributions of our paper can be summarized as follows:

  • JRD-L considers users’ long-term preferences along with short-term pair-wise item relations from both the individual-level and union-level perspectives. Specifically, JRD-L involves a novel multi-pair relational LSTM model that can capture both long-term dependencies and multi-level temporal correlations for better inferring user preferences.

  • A novel attention model is combined with JRD-L to augment the individual-level and union-level pair relations by learning their contributions to the subsequent interactions between users and items. Meanwhile, the weighted outputs of the attention model are fused together, contributing more individual-level information to alleviate the sparsity problem in the union-level dependency.

Fig. 1. (a) Long-term user-item interaction; (b) Individual-level item relevance; (c) Union-level item relevance. The dependency between an item and its subsequent item is represented by a transition arrow.

2 Related Works

Many methods consider long-term temporal information to mine the sequential patterns of users’ behaviors, including factorization-based approaches [12, 14] and Markov chain-based approaches [2]. Recently, deep learning (DL)-based models have achieved significant effectiveness in long-term temporal information modeling, including multi-layer perceptron-based (MLP-based) models [16, 17], convolutional neural network-based (CNN-based) models [6, 15], and recurrent neural network-based (RNN-based) models [1]. RNN-based models stand out among these for their capacity to model sequential dependencies by transmitting long-term sequential information from the first hidden state to the last one. However, RNNs can be difficult to train due to the vanishing gradient problem [7]; advances such as Long Short-Term Memory (LSTM) [5] have made RNNs successful. LSTM is considered one of the most successful RNN variants, capable of capturing long-term relationships in a sequence while alleviating the vanishing gradient problem. So far, LSTM models have achieved tremendous success in sequence modeling tasks [20, 21].

With respect to short-term temporal information, existing works mainly model pair relations between items. The representative line of work is Markov chain (MC)-based models [3]. The objective of such models is to measure the average or weighted relevance between a given item and its next-interaction item, which only captures dependencies between two single items. Tang et al. [15] propose a method that captures collective dependencies among three or more items. However, the model in [15] suffers from data sparsity. To solve the sparsity problem that arises when merely modeling collective dependencies, Yu et al. [19] add individual (i.e., individual-level) dependencies to collective (i.e., union-level) dependencies, but their work is still insufficient because it does not leverage long-term temporal information.

3 Joint Relational Dependency Learning

Before introducing the proposed method, we provide some useful notations. Let U and I be the user and item sets, as shown in Fig. 2. A sequence of interactions between U and I can be represented as \(S = \{ {S^{u_i}_j}:u_i \in U\}\), where each \(u_i\) is associated with an interaction sequence \({S^{u_i}_j} = (S_1^{u_i},S_2^{u_i},...,S_{|{S^{u_i}_j}|}^{u_i})\). The goal of JRD-L is to predict the likelihood that the user prefers a candidate item \(e_c^{u_i}\), based on the user’s behavior sequence \({S^{u_i}_j}\).

Fig. 2. (a) \({S^{u_i}_j} = (S_1^{u_i},S_2^{u_i},...,S_{|{S^{u_i}_j}|}^{u_i})\) denotes a sequence of interactions between a user \(u_i\) and a given item set I. (b) Next-item recommendation aims to generate a ranking list exposed to users by modeling the user-item interaction sequence.

Fig. 3. The overall framework of Joint Relational Dependency Learning (JRD-L).

The overall architecture of JRD-L is shown in Fig. 3. Generally, JRD-L first models long-term dependencies over the whole user-item interaction data \(S = \{ {S^{u_i}_j}:u_i \in U\}\) in an LSTM layer. JRD-L takes the most recent n items before time point t of the whole sequence as the short-term interaction sequence. Then, JRD-L computes individual-level and union-level pair relations on this short sequence as short-term dependency modeling. Specifically, given the input \(u_i\) and \({S^{u_i}_j}\), JRD-L composes \(u_i\) and \({S^{u_i}_j}\) into a single user-item vector via an embedding layer, outputting \(e_t^{u_i}\) as the user-item interaction embedding. An LSTM layer then maps the whole sequence of user-item vectors \(e_1^{u_i},e_2^{u_i},...,e_{|{S^{u_i}_j}|}^{u_i}\) into a sequence of hidden vectors \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\). More importantly, we take one further step from \(h_{|{S^{u_i}_j}|}^{u_i}\) to derive the hidden vector \({h_{{c_i}}^{u_i}}\) encoding \(e_c^{u_i}\), so as to model long-term sequential information. Based on this, \({h_{{c_i}}^{u_i}}\) is individually paired with the hidden states of the most recent n items before time point t, i.e., \({h^{u_i}}_{t - 1},{h^{u_i}}_{t - 2},...,{h^{u_i}}_{t - n}\) (\(t-n<|{S^{u_i}_j}|\)). JRD-L then passes the corresponding hidden-state pairs of the most recent items to an attention layer, which outputs the correlation likelihoods \({S_{individual}}\) and \({S_{union}}\), from which the short-term individual-level and union-level pair relations are modeled, respectively. At last, \({S_{individual}}\) is concatenated with \({S_{union}}\) to obtain the correlation of \(e_c^{u_i}\) with the existing items for the next-item prediction task.

3.1 Skip-Gram Based Item Representation

To learn item similarities from a large number of sequential behaviors over items, we apply skip-gram with negative sampling (SGNS) [10] to generate a unified representation for each item in a given user-item interaction sequence \({S^{u_i}_j} = (S_1^{u_i},S_2^{u_i},...,S_{|{S^{u_i}_j}|}^{u_i})\). Before exploiting dependencies in users’ sequences, the first problem is to represent items numerically via an embedding layer for subsequent computations. In the embedding layer, skip-gram with negative sampling is applied to directly learn high-quality item vectors from users’ interaction sequences. SGNS [10] generates item representations by exploiting the sequence of interactions between users and items. Specifically, given an item interaction sequence \({S^{u_i}_j} = (S_1^{u_i},S_2^{u_i},...,S_{|{S^{u_i}_j}|}^{u_i})\) of user \(u_i\) from the user-item interaction sequence S, SGNS solves the following objective

$$\begin{aligned} \arg \max _{v_j,w_i} \frac{1}{K}\sum \limits _{i = 1}^K \sum \limits _{j \ne i}^K \log (\sigma (w_i^T * {v_j})\prod \limits _{k = 1}^E \sigma ( - w_i^T * {v_k})) \end{aligned}$$
(1)

where K is the length of sequence \({S^{u_i}_j}\), and \(\sigma (w_i^T * {v_j})\prod \limits _{k = 1}^E {\sigma ( - w_i^T * {v_k})}\) is computed by negative sampling. Here \(\sigma (x) = 1/(1 + \exp ( - x))\), and \({w_i} \in U( \subset {\mathbb {R}^m})\) and \(v_j \in V( \subset {\mathbb {R}^m})\) are the latent vectors corresponding to the target and context representations of items in \({S^{u_i}_j}\), respectively. The dimension m is set empirically according to the dataset size, and E is the number of negative samples per positive sample. Finally, the matrices U and V are computed to generate representations of interaction sequences.
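To make the objective concrete, the following is a minimal PyTorch sketch of one SGNS update following Eq. (1); the item IDs, embedding dimension m, and number of negatives E are illustrative assumptions, and an off-the-shelf skip-gram implementation could be used instead.

```python
# Minimal SGNS sketch for item embeddings (Eq. 1); all hyperparameters are illustrative.
import torch
import torch.nn.functional as F

num_items, m, E = 1000, 64, 5          # item vocabulary size, embedding dim, negatives per positive
W = torch.nn.Embedding(num_items, m)   # target item vectors w_i
V = torch.nn.Embedding(num_items, m)   # context item vectors v_j
opt = torch.optim.Adam(list(W.parameters()) + list(V.parameters()), lr=1e-3)

def sgns_loss(target, context, negatives):
    """target: (B,), context: (B,), negatives: (B, E) item indices."""
    w, v, v_neg = W(target), V(context), V(negatives)
    pos = F.logsigmoid((w * v).sum(-1))                                   # log sigma(w_i^T v_j)
    neg = F.logsigmoid(-(v_neg @ w.unsqueeze(-1)).squeeze(-1)).sum(-1)    # sum_k log sigma(-w_i^T v_k)
    return -(pos + neg).mean()                                            # maximizing Eq. (1) = minimizing this

# One illustrative update on a toy interaction sequence of hypothetical item IDs.
seq = torch.tensor([3, 17, 42, 8])
target, context = seq[:-1], seq[1:]                     # neighbouring items act as (target, context) pairs
negatives = torch.randint(0, num_items, (len(target), E))
loss = sgns_loss(target, context, negatives)
opt.zero_grad(); loss.backward(); opt.step()
```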

3.2 User Preference Modeling for Long-Term Pattern

To model long-term temporal information in users’ behaviors, we apply a standard LSTM [5], as shown in Fig. 3, over the whole user-item interaction sequence. For each user \(u_i\), we first generate an interaction sequence with embedded items \(x_j \in I\) based on the matrices U and V computed by Eq. (1) in the embedding layer of Fig. 3, represented as \(P_u=e_1^{u_i},e_2^{u_i},...,e_{|{S^{u_i}_j}|}^{u_i}\), where \(e_j^{u_i}\) is the d-dimensional latent vector of item \(x_j\). Given the embeddings of the user-item interaction sequence \(e_1^{u_i},e_2^{u_i},...,e_{|{S^{u_i}_j}|}^{u_i}\) and the candidate next item \(e_{{c_i}}^{u_i} \in e_c^{u_i}\), we generate a sequence of hidden vectors \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\) by recurrently feeding \(e_1^{u_i},e_2^{u_i},...,e_{|{S^{u_i}_j}|}^{u_i}\) into the LSTM. The inner hidden states of the LSTM layer are updated at each time step; they carry long-term dependency information and transmit it to item pairs. At each time step, the hidden state \(h_i^u\) is computed from the previous one by

$$\begin{aligned} h_i^u = g(e_i^u,h_{i - 1}^u,W_{LSTM}) \end{aligned}$$
(2)

where g is the LSTM output function and \(W_{LSTM}\) denotes the network weights applied to \(e_i^u\) and \(h_{i - 1}^u\). Each candidate embedding \(e_{{c_i}}^{u_i}\) is then separately appended after the sequence \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\) calculated by Eq. (2) to obtain the long-term-dependency-sensitive hidden state \(h_{{c_i}}^{u_i}\):

$$\begin{aligned} h_{{c_i}}^u = g(e_{{c_i}}^u,h_{|{S^u}|}^u,W_{LSTM}) \end{aligned}$$
(3)

Through the LSTM long-term information modeling in Fig. 3, \(h_{c_1}^{u_i},h_{c_2}^{u_i},...,h_{c_l}^{u_i}\) is output by Eq. (3), where l is the total number of candidate next items. The sequence \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\) calculated by Eq. (2) is kept for the following multi-relational dependency modeling stage.
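As an illustration of this stage, a short PyTorch sketch of Eqs. (2)–(3) follows: the LSTM is run over the embedded interaction sequence, and each candidate embedding is then fed once more, starting from the last hidden state, to obtain its long-term-dependency-sensitive state. All sizes are assumed for illustration.

```python
# Sketch of the long-term modeling stage (Eqs. 2-3); dimensions are illustrative.
import torch

d, seq_len, l = 64, 30, 10                         # embedding dim, sequence length, number of candidates
lstm = torch.nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

e_seq = torch.randn(1, seq_len, d)                 # e_1^{u_i}, ..., e_{|S|}^{u_i} from the embedding layer
e_cand = torch.randn(l, d)                         # candidate next-item embeddings e_{c_1}, ..., e_{c_l}

# Eq. (2): hidden states h_1^{u_i}, ..., h_{|S|}^{u_i} over the whole interaction sequence.
h_seq, (h_last, c_last) = lstm(e_seq)              # h_seq: (1, seq_len, d)

# Eq. (3): each candidate embedding is appended after the last hidden state to obtain h_{c_i}^{u_i}.
h_cand = []
for e_c in e_cand:
    out, _ = lstm(e_c.view(1, 1, d), (h_last, c_last))
    h_cand.append(out.squeeze())
h_cand = torch.stack(h_cand)                       # (l, d) long-term-dependency-sensitive states
```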

3.3 Multi-relational Dependency Modeling for Short-Term Pattern

Long-term dependency modeling captures long-range user preferences but neglects important pairwise relations between items, and is thus insufficient for capturing pairwise relations at multiple levels. Therefore, our proposed method unifies short-term sequential dependencies (at both the individual level and the union level) with long-term sequential dependencies. Inspired by [18], based on \(h_{c_1}^{u_i},h_{c_2}^{u_i},...,h_{c_l}^{u_i}\) and \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\) output by the LSTM long-term information modeling stage, we calculate pair relations on \({h^{u_i}}_{t - 1},{h^{u_i}}_{t - 2},...,{h^{u_i}}_{t - n}\) (\(t-n<|{S^{u_i}_j}|\)) selected from \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\). The task is then to learn the correlation between the items in the interaction sequence and the candidate items. Rather than directly applying the method of [18] for modeling short-term dependencies, we introduce an attention mechanism to calculate pair relations at the individual level and the union level, so as to fully model user preferences for different items. This is mainly because the work [18] assumes that all vectors share the same weight, discarding the fact that humans naturally hold different opinions on different items. By introducing the attention mechanism, our method assigns higher weights to the items a user likes more, thus improving recommendation performance.

Individual-Level Pairwise Relations. To capture individual-level pairwise relations, the inputs of the attention network for the individual-level relation measuring layer are \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\) and \(h_{c_1}^{u_i},h_{c_2}^{u_i},...,h_{c_l}^{u_i}\), the output vectors of the LSTM long-term information modeling layer in Fig. 3. Specifically, \(h_{{c_i}}^u \in h^{(2)}=(h_{c_1}^{u_i},h_{c_2}^{u_i},...,h_{c_l}^{u_i})\) (as in Eq. 3) is paired with the hidden states of the most recent n items before time point t, i.e., \({h^{u_i}}_{t - 1},{h^{u_i}}_{t - 2},...,{h^{u_i}}_{t - n}\) (\(t-n<|{S^{u_i}_j}|\)) from \(h_1^{u_i},h_2^{u_i},...,h_{|{S^{u_i}_j}|}^{u_i}\) calculated by Eq. (2). An attention network is used for pairwise relation measuring. Let \(H \in {\mathbb {R}^{n \times l}}\) with \({H_{ij}} = h^{(1)}_i*h^{(2)}_j\) be a matrix built from the output vectors of the last LSTM layer, where n is the size of \(h^{(1)}=(h^{u_i}_{t - 1},{h^{u_i}}_{t - 2},...,{h^{u_i}}_{t - n})\) in Eq. (2) and l is the size of \(h^{(2)}=(h_{c_1}^{u_i},h_{c_2}^{u_i},...,h_{c_l}^{u_i})\) in Eq. (3). The attentive weights \(\alpha =(\alpha _1,\alpha _2,...,\alpha _{n})\) of the items in the interaction sequence are computed from these output vectors as \(\alpha = \text {softmax} ({\omega ^T}M)\) with \(M = \tanh (H)\), where M is obtained through a fully connected layer activated by the tanh function and \(\omega ^T\) is the transposed parameter vector of the attention network. \(\alpha _j \in [0,1]\) is the weight of \(h_{t-j}^{u_i} \in h^{(1)}\). After obtaining the weight \(\alpha _j\) of each existing item \(h_{t-j}^u\), the likelihood \(S_{c_i}\), which describes how likely the existing items in the user-item interaction sequence are to interact with the candidate item \(e_{{c_i}}^{u_i}\), can be calculated by

$$\begin{aligned} \begin{aligned}&{s_j} = \text {softmax} ({\beta _1}{h_{t-j}^u} + {\beta _2}{h_{{c_i}}^u} + b) \\&{S_{individual}} = \sum \limits _{j = 1}^{n - 1} {{\alpha _j} \cdot {s_j}} \\ \end{aligned} \end{aligned}$$
(4)

where \(s_j\) is the correlation score of the pair formed by item \(h_{t-j}^u\in h^{(1)}\) and \(h_{{c_i}}^u \in h^{(2)}\), and \(S_{c_i} \in S_{individual}\) is the output of the attention network for the individual-level relation measuring layer. \(\beta _1\), \(\beta _2\) and b are model parameters.
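A sketch of this layer is given below. The exact form of the score in Eq. (4) (how the softmax is normalized, and whether \(\beta _1\), \(\beta _2\) are vectors) is not fully specified in the text, so the normalization over the n recent items and the vector-valued \(\beta \) parameters are assumptions made for illustration.

```python
# Sketch of the individual-level attention layer (alpha and Eq. 4); the softmax normalization
# over the n recent items and the vector-valued beta parameters are assumptions.
import torch

d, n, l = 64, 10, 5
h1 = torch.randn(n, d)                 # h^{(1)}: hidden states of the n most recent items
h2 = torch.randn(l, d)                 # h^{(2)}: candidate-extended hidden states h_{c_i}

omega = torch.randn(l)                 # attention parameter vector omega
beta1, beta2, b = torch.randn(d), torch.randn(d), torch.zeros(1)

H = h1 @ h2.T                          # H_ij = h^{(1)}_i . h^{(2)}_j, shape (n, l)
M = torch.tanh(H)
alpha = torch.softmax(M @ omega, dim=0)             # alpha: one weight per recent item, shape (n,)

S_individual = []
for h_c in h2:                                      # one correlation likelihood per candidate item
    logits = h1 @ beta1 + h_c @ beta2 + b           # pairwise scores s_j between h_{t-j} and h_{c_i}
    s = torch.softmax(logits, dim=0)
    S_individual.append((alpha * s).sum())          # weighted sum over the recent items
S_individual = torch.stack(S_individual)            # (l,) likelihoods S_{c_i}
```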

Union-Level Pairwise Relations. To model short-term union-level pair relations, we predefine a sliding window that determines the length of the collective item set within the existing user-item sequence. Based on the resulting item sets, the collective influence in union-level pair relations can be learned by the attention network for the union-level relation measuring layer. Union-level pairwise relations learned by our method capture collective dependencies among three or more items, which complements the individual-level relations for improving recommendation performance. In the union-level pairwise relation modeling stage, the candidate window length is chosen from \(\theta =\{2,4,6,8\}\). To learn the collective influence in union-level pair relations, we define a sequence \(Q=\{Q_1,\cdots ,Q_{n-\theta }\}\) with \(Q_i = (h_{i}^u,...,h_{\theta +i}^u)\). For example, if \(\theta =2\), we have \(Q=\{(h_{1}^u,h_{2}^u,h_{3}^u),\cdots ,(h_{n-2}^u,h_{n-1}^u,h_{n}^u)\}\). Each \(Q_i \in Q\) is then paired with \(h^{(2)}=(h_{{c_1}}^{u_i},h_{{c_2}}^{u_i},...,h_{c_l}^{u_i})\) as in Eq. (3). The union-level pairs pass through the attention network for the union-level relation measuring layer to obtain the weight \(\alpha _i\) of each window \(Q_i\), and the correlation likelihood \({S_{union}}\) is output by

$$\begin{aligned} \begin{aligned}&{s_i} = \text {softmax} ({\beta _3}{W_i} + {\beta _4}{h_{{c_i}}^u} + b) \\&{S_{union}} = \sum \limits _{i = 1}^{n - 1} {{\alpha _i} \cdot {s_i}} \\ \end{aligned} \end{aligned}$$
(5)

where \(h_{{c_i}}^u \in h^{(2)}\), \(W_i\) is the representation of the window \(Q_i\), and \(\beta _3\), \(\beta _4\) and b are model parameters. \({S_{union}}\) is thus output by the attention network for the union-level pair relation measuring layer. Finally, \({S_{union}}\) is concatenated with \({S_{individual}}\) from the attention network for the individual-level pair relation measuring layer to calculate the correlation of \({e_c^{u_i}}\) with the existing items for the next-item prediction task.
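The following sketch shows how the windows \(Q_i\) could be formed and scored. Since the text does not specify how each window is aggregated into the representation \(W_i\) used in Eq. (5), mean-pooling over the window is used here purely as an assumption.

```python
# Sketch of the union-level relation layer (Eq. 5). Mean-pooling each window Q_i into a single
# vector W_i is an assumption; the aggregation is not specified in the text.
import torch

d, n, l, theta = 64, 10, 5, 2
h_recent = torch.randn(n, d)           # hidden states of the n most recent items
h2 = torch.randn(l, d)                 # candidate-extended hidden states h_{c_i}
beta3, beta4, b = torch.randn(d), torch.randn(d), torch.zeros(1)
omega_u = torch.randn(l)               # attention parameters of the union-level layer

# Q_i = (h_i, ..., h_{i+theta}): sliding windows of theta+1 consecutive hidden states.
Q = torch.stack([h_recent[i:i + theta + 1] for i in range(n - theta)])   # (n-theta, theta+1, d)
W = Q.mean(dim=1)                                                        # window representations W_i

alpha_u = torch.softmax(torch.tanh(W @ h2.T) @ omega_u, dim=0)           # attention weights over windows

S_union = []
for h_c in h2:                                      # one correlation likelihood per candidate item
    logits = W @ beta3 + h_c @ beta4 + b            # scores s_i between window W_i and h_{c_i}
    s = torch.softmax(logits, dim=0)
    S_union.append((alpha_u * s).sum())
S_union = torch.stack(S_union)                      # (l,) union-level likelihoods
```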

3.4 Optimization

To effectively learn the parameters of the proposed JRD-L model, our training objective is to minimize the loss between the predicted labels and the true labels of the candidate items. The setup is as follows: first, we define the item with the latest timestamp in the user-item interaction sequence as the ground-truth subsequent item, and the remaining items as non-subsequent items. Second, the loss function is based on the assumption that an item the user liked (the positive sample, i.e., the ground-truth subsequent item) should receive a relatively larger value than other items (negative samples) that he/she has no interest in. The loss function is formulated as

$$\begin{aligned} \mathop {\arg \min }\limits _\varTheta \sum \limits _{i = 1}^N {({\text {concatenate}}({S_{individual}^{(i)}}}, {S_{union}^{(i)}}) - {y_i}{)^2} + \frac{\lambda }{2}||\varTheta |{|^2} \end{aligned}$$
(6)

where the parameter set is \(\varTheta =\{W_{LSTM},\omega ,\beta _1,\beta _2,\beta _3,\beta _4,b\}\). \({S_{individual}^{(i)}}\) in Eq. (4) is the correlation likelihood output by the attention network for the individual-level relation measuring layer, and \({S_{union}^{(i)}}\) in Eq. (5) is the correlation likelihood output by the attention network for the union-level relation measuring layer. \(y_i\) is the label of the candidate item and \(\lambda \) is the \(l_2\) regularization parameter. Adaptive moment estimation (Adam) [11] is used to optimize the parameters during training.
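As a rough illustration of the optimization step, the snippet below minimizes Eq. (6) with Adam on toy per-candidate likelihoods; fusing the concatenated \(S_{individual}\) and \(S_{union}\) scores through a small linear layer before comparing with the label is our reading of the objective, stated here as an assumption.

```python
# Sketch of the training objective in Eq. (6); the linear fusion of the concatenated
# likelihoods is an assumed reading, and the toy tensors stand in for the model outputs.
import torch

l, lam = 5, 0.001
fuse = torch.nn.Linear(2, 1)                        # maps [S_individual, S_union] to one score
params = list(fuse.parameters())                    # in the full model, Theta also holds W_LSTM, omega, beta, b
opt = torch.optim.Adam(params, lr=1e-3)

S_individual = torch.rand(l, requires_grad=True)    # per-candidate likelihoods from Eq. (4)
S_union = torch.rand(l, requires_grad=True)         # per-candidate likelihoods from Eq. (5)
y = torch.zeros(l); y[0] = 1.0                      # label: the ground-truth next item among the candidates

pred = fuse(torch.stack([S_individual, S_union], dim=-1)).squeeze(-1)    # (l,) predicted scores
loss = ((pred - y) ** 2).sum() + (lam / 2) * sum((p ** 2).sum() for p in params)

opt.zero_grad(); loss.backward(); opt.step()
```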

4 Experiments

4.1 Evaluation Setup

We conduct experiments to validate JRD-L on the Top-N sequential recommendation task using a real-world dataset, the Movie&TV dataset [19], which belongs to the Amazon data. Since the original dataset is sparse, we first filter out users with fewer than 10 interactions, as in [19]. Statistics of the Movie&TV dataset before and after preprocessing are shown in Table 1. Following the evaluation settings in [19], we split the data into train/test sets with a ratio of 80/20.

Table 1. Statistical information of the dataset.

We compare JRD-L with five baselines: BPR-MF [12], a widely used matrix factorization method for sequential RS; TranRec [4], which models users as translation vectors operating on item sequences; GRU4Rec [6], an RNN-based model that uses a basic Gated Recurrent Unit for sequential RS; FPMC [13], a typical Markov chain method modeling individual item interactions; and MARank [19], a multi-level item temporal dependency model that captures both individual-level and union-level interactions with a factorization model.

For fair comparison, we set the dropout rate to 0.5 [19]. The embedding size d of the embedding layer is chosen from \({\{32,64,128,256\}}\) and is set equal to the hidden size h of the LSTM. The regularization hyper-parameter \(\lambda \) is selected from \( {\{0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001\}}\). We set the learning rate of Adam to the default value 0.001 [11]. As n is the number of most recent items used for short-term dependencies, we choose n from \({\{10,20,40,60}\}\). The length l of the sliding window for union-level interactions is chosen from \(\{2,4,6,8\}\). We set the length N of the ranked list to 20. For the hardware setting, the JRD-L model is trained on a Linux server with a Tesla P100-PCIE GPU.

4.2 Effect of Parameter Selection for JRD-L

This section discusses how the parameters influence the performance of JRD-L. We first explore the impact of n, comparing different values chosen from \(\{10,20,40,60\}\). Second, we evaluate the influence of the window length l, chosen from \(\{2,4,6,8\}\). We use two metrics to evaluate model performance: MRR (Mean Reciprocal Rank), the average of the reciprocal ranks of the ground-truth candidate items, and NDCG (Normalized Discounted Cumulative Gain), which measures ranking quality with a position-dependent discount. The comparison results under different setups are shown in Fig. 4. Figure 4 shows that, with the other hyperparameters held equal, \(n = 10\) achieves the best performance, presumably because the sequential pattern does not involve a very long sequence. Besides, \(l=4\) achieves the best performance, indicating that the collective influence of 4 items is informative for the Movie&TV dataset.
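For clarity on how the two metrics are computed per test user, a small helper is sketched below; the ranked item IDs are hypothetical. With a single held-out relevant item, NDCG reduces to the reciprocal of \(\log_2(\text {rank}+1)\).

```python
# Helper for the two evaluation metrics; the item IDs in the example are hypothetical.
import math

def mrr_ndcg(ranked_items, true_item, N=20):
    """MRR and NDCG@N for one test case with a single relevant (held-out) item."""
    if true_item in ranked_items[:N]:
        rank = ranked_items.index(true_item) + 1    # 1-based rank of the true next item
        return 1.0 / rank, 1.0 / math.log2(rank + 1)
    return 0.0, 0.0                                 # the true item is not in the Top-N list

print(mrr_ndcg(ranked_items=[42, 7, 13, 99], true_item=13))   # -> (0.333..., 0.5)
```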

Fig. 4. Results of JRD-L under different settings.

4.3 Ranking Performance Comparison

Ranking performance evaluates the quality of the predicted Top-N lists. Table 2 shows the comparison between JRD-L and the baselines. Encouragingly, JRD-L performs best, with the highest MRR and NDCG scores, while the baselines fall short of JRD-L for different reasons. First, BPR-MF, as a matrix factorization-based method, is less competitive than GRU4Rec. This is mainly because BPR-MF only considers users’ intrinsic preferences over items, while GRU4Rec models item interactions along with users’ overall preferences. Second, TranRec and FPMC are two state-of-the-art methods exploiting individual-level item temporal dependencies; both outperform the other baselines, indicating that keeping the directed interaction between a pair of items is essential for sequential recommendation. Third, MARank, which considers individual-level and union-level interactions but neglects long-term dependencies, performs worse than JRD-L. Overall, BPR-MF performs the worst, mainly because it models only intrinsic preferences within short sequences of user-item interactions, neglecting long-term user preferences and item interactions at the individual and union levels.

Table 2. Ranking performance.

4.4 Influence of JRD-L Components

JRD-L contains three components, as indicated in Fig. 3: long-term user-item interaction modeling, individual-level item interaction modeling, and union-level item interaction modeling. To analyze the influence of the different components on the overall recommendation performance, we evaluate different combinations of components, with the results shown in Table 3. The full JRD-L with all three components performs best, verifying that the proposed design is effective. As for the other combinations, LSTM-only obtains lower MRR and NDCG scores than JRD-L because it models only long-term dependencies. LSTM + individual-level item interaction outperforms LSTM + union-level item interaction, mainly because union-level item interaction suffers from a sparsity problem as the length of the item set increases. Besides, both LSTM + individual-level item interaction and LSTM + union-level item interaction obtain lower scores than the full JRD-L model. This further indicates that individual-level item interaction information should be combined into the union-level interaction modeling stage to alleviate the sparsity problem.

Table 3. Ranking performance of different component combinations in JRD-L.

5 Conclusions

In this paper, we propose Joint Relational Dependency Learning (JRD-L) for sequential recommendation. JRD-L builds a novel model that unifies long-term dependencies with short-term dependencies at both the individual level and the union level. Moreover, JRD-L handles the sparsity problem in union-level relation modeling by exploiting individual-level relation information from the sequential behaviors. Extensive experiments on a benchmark dataset demonstrate the effectiveness of JRD-L.