
1 Introduction

Recently, many large-scale knowledge bases (KBs) have been constructed, including academic projects such as YAGO [8], DBpedia [2], and Elementary/DeepDive [15], and commercial projects such as those by Google [6] and Walmart [4]. These knowledge repositories hold millions of facts about the world, such as information about people, places, and things. Such information is essential for AI applications that must recognize and understand queries and their semantics in search or question-answering systems. The applications include Google search and IBM’s Watson, as well as smart mobile assistants such as Apple’s Siri and NTT docomo’s Shabette-Concier [5]. They now help users acquire meaningful knowledge in their daily activities, e.g. looking up an actor’s birthday through a question-answering system or searching for restaurants near the user’s current location with a smart mobile assistant.

Fig. 1. Creating tensors for individual services whose objects are linked by semantics

The KBs can also be used to provide background knowledge that is shared by different services [2]. Thus, beyond the above uses of the facts stored in the KBs, the semantics in those bases can be effectively used for mediating distributed users’ activity logs in different services, giving AI applications the potential to help users decide their next activities across services by analyzing heterogeneous user logs distributed over those services. In this paper, we regard services as different from each other if they do not share any objects, e.g. users, venues, or reviews. For example, in Fig. 1, the US restaurant review service Yelp and the French one, linternaute, are quite different services.

Tensor factorization methods have become popular for analyzing users’ activities, since users’ activities can be represented in terms of relationships involving three or more things (e.g. when a user tags venues on a webpage) [9, 11, 13, 17, 20]. Among the proposals made to date, Bayesian Probabilistic Tensor Factorization (BPTF) [20] is promising because of its efficient sampling of large-scale datasets and simple parameter settings. Semantic Sensitive Tensor Factorization (SSTF) [11, 13] extends BPTF and applies semantic knowledge in the form of vocabularies/taxonomies extracted from Linked Open Data (LOD) to tensor factorization to solve the sparsity problem caused by sparse observation of objects. By incorporating the semantics behind objects, SSTF achieves the best rating prediction accuracy among the existing tensor factorization methods [13].

However, SSTF cannot enhance prediction accuracy across services for two reasons: (1) SSTF suffers from the balance problem that arises when handling heterogeneous datasets; e.g. the predictions for smaller services are greatly biased by those for larger services [10]. Even if we merge user activity logs across services based on objects that appear in several services, SSTF predictions are poor when faced with such merged logs. (2) SSTF focuses only on factorizing a tensor that represents users’ activities within a single service and thus cannot solve the sparsity problem across services. Even if the logs in different services share some semantic relationships, SSTF cannot make use of them.

We think that a tensor factorization method that uses the semantics in the KBs to mediate different services is needed, since the LOD project aims to mediate distributed data in different services [1]. Thus, this is an important goal for the Semantic Web community. For example, we can simultaneously analyze logs in an American restaurant review service and those in an equivalent Japanese service by using semantics, even if they share no users, restaurant venues, or review descriptions. As a result, we can improve the prediction accuracy of the individual services, extract the implicit relationships across services, and recommend good Japanese restaurants to users in the United States (and vice versa). Accordingly, this paper enhances SSTF and proposes Semantic Sensitive Simultaneous Tensor Factorization (S\(^3\)TF), which simultaneously factorizes tensors created for different services by relying on the semantics shared among services. It overcomes the above problems through the following two ideas:

  1. (1)

    It creates tensors for individual services whose objects are linked by semantics. This means that S\(^3\)TF does not force a single tensor to be created from multiple services and then factorized to make predictions. Below, for ease of understanding, this paper uses the scenario in which there are two different restaurant review services in different countries (they share no users, restaurants, or food reviews), e.g. Yelp and linternaute in Fig. 1. Figure 1(a) presents an example of users’ activities involving three objects: a user who assigned tags about impressive foods served by restaurants, with ratings on those relationships. The restaurants and foods are linked by the semantics from the KB. In the figure, American user \(u_{1}\) assigned tag “Banana cream pie” to restaurant “Lady M”, and French user \(u_{2}\) assigned tag “Tarte aux pommes” to restaurant “Les Deux Gamins”. In Fig. 1(b), S\(^3\)TF creates tensors for the two different services while sharing semantic classes; e.g. food “Banana cream pie” is linked with food class “Sweet pies” and restaurant “Lady M” is linked with restaurant class “Bars” in the tensor for “America East Coast”, while food “Tarte aux pommes” is linked with food class “Sweet pies” and restaurant “Les Deux Gamins” is linked with restaurant class “Bars” in the tensor for “France”. As a result, S\(^3\)TF can factorize those tensors individually while sharing semantics across tensors. This solves the balance problem.

  2. (2)

    It uses the semantics shared among distributed services to bias the latent features learned in each service’s tensor factorization. Thus, it can avoid the sparsity problem of tensor factorization by using not only the semantics shared within a service but also those shared among services. This has another effect: the semantic biases are shared in the latent features for the tensors of individual services, and thus S\(^3\)TF can extract the implicit relationships among services present in the latent features. For example, in Fig. 1(a), users \(u_1\) and \(u_2\) share no foods and no restaurants with each other, though they may share almost the same tendencies in food choice (e.g. they both tend to eat “Sweet pies” at “Bars” and “Cuts of beef” at nice restaurants). If such multi-object relationships are sparsely observed in each country, they cannot be well predicted by current tensor factorization methods because of the sparsity problem. S\(^3\)TF solves this by using the semantics shared among services. It propagates the observations for “Banana cream pie” and “Tarte aux pommes” to the class “Sweet pies”, as well as the observations for “Lady M” and “Les Deux Gamins” to the class “Bars”. It then applies the semantic biases from food class “Sweet pies” to “Banana cream pie” and those from restaurant class “Bars” to restaurant “Lady M” when the tensor for the United States is factorized. It also applies the semantic biases from food class “Sweet pies” to “Tarte aux pommes” and those from restaurant class “Bars” to restaurant “Les Deux Gamins” when the tensor for France is factorized. In this way, S\(^3\)TF solves the sparsity problem by using the semantics shared across services. It can also find the implicit relationships in the latent features (e.g. the relationships shared by users \(u_1\) and \(u_2\) described above) through the mediation provided by the shared semantics.

We evaluated S\(^3\)TF using restaurant review datasets across countries. The reviews do not share any users, restaurant venues, or review descriptions as the languages are different. Thus, they are considered to be different services. The results show that S\(^3\)TF outperforms the previous methods including SSTF by sharing the semantics behind venues and review descriptions across services.

The paper is organized as follows: Sect. 2 describes related work while Sect. 3 introduces the background of this paper. Section 4 explains our method and Sect. 5 evaluates it. Finally, Sect. 6 concludes the paper.

2 Related Work

Tensor factorization methods have recently been used in various applications such as recommendation systems [11, 17] and LOD analyses [7, 14]. For example, [14] proposed methods that use tensor factorization to analyze huge volumes of LOD datasets in a reasonable amount of time. They, however, did not use the simultaneous tensor factorization approach and thus could not explicitly incorporate the semantic relationships behind multi-object relationships into the tensor factorization; in particular, they failed to use taxonomical relationships behind multi-object relationships such as “subClassOf” and “subGenreOf”, which are often seen in LOD datasets. A recent proposal, SSTF [11, 13], solves the sparsity problem by providing semantic bias from KBs to the feature vectors for sparse objects in multi-object relationships. SSTF was, however, not designed to perform cross-domain analysis even though LOD can be effectively used for mediating distributed objects in different services [2]. Generalized Coupled Tensor Factorization (GCTF) methods [22] and recent Non-negative Multiple Tensor Factorization (NMTF) [19] try to incorporate extra information into tensor factorization by simultaneously factorizing observed tensors and matrices representing extra information. They, however, do not focus on handling semantics behind objects while factorizing tensors created for different services. Furthermore, according to the evaluations in [13], they have much worse performance than SSTF.

Other than tensor methods, [23] applies embedding models including heterogeneous network embedding and deep learning embedding to automatically extract semantic representations from the KB. Then it jointly learns the latent representations in collaborative filtering as well as items’ semantic representations from the KB. There are, however, no embedding methods that analyze different services by using shared KBs.

Recent Semantic Web studies try to find missing links between entities [21] or to find an explanation for a pair of entities in KBs [16]. [12, 18] incorporate the semantic categories of items into the model and improve recommendation accuracy. They, however, do not focus on analyzing users’ activities across services, nor do they find implicit relationships between entities through such analysis.

3 Preliminary

Here, we explain Bayesian Probabilistic Tensor Factorization (BPTF) since S\(^3\)TF was implemented within the BPTF framework due to its efficiency with simple parameter settings.

This paper deals with the relationships formed by user \(u_m\), venue \(v_n\), and tag \(t_k\). A third-order tensor \(\mathcal {R}\) is used to model the relationships among objects from sets of users, venues, and tags. Here, the (m, n, k)-th element \({r}_{m, n, k}\) indicates the m-th user’s rating of the n-th venue with the k-th tag. Tensor factorization assigns a D-dimensional latent feature vector to each user, venue, and tag, denoted as \({\mathbf{u}}_m\), \({\mathbf{v}}_n\), and \({\mathbf{t}}_k\), respectively; each is a D-length “column” vector. Accordingly, each element \({r}_{m, n, k}\) in \(\mathcal {R}\) can be approximated as the inner product of the three vectors as follows:

$$\begin{aligned} {r}_{m, n, k} \approx \langle {\mathbf{u}}_{m}, {\mathbf{v}}_{n}, {\mathbf{t}}_{k} \rangle \equiv \sum _{d=1}^{D}{{{u}}_{m, d} \cdot {{v}}_{n, d} \cdot {{t}}_{k, d}} \end{aligned}$$
(1)

where index d represents the d-th “row” element of each vector.
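As an illustration, the rating approximation in Eq. (1) can be computed as in the following minimal sketch; the dimensionality and the randomly initialized vectors are toy placeholders, not learned parameters.

```python
# Sketch of Eq. (1): r_{m,n,k} is approximated by sum_d u_{m,d} * v_{n,d} * t_{k,d}.
import numpy as np

D = 5                                # latent dimensionality (toy value)
rng = np.random.default_rng(0)
u_m = rng.normal(size=D)             # feature vector of the m-th user
v_n = rng.normal(size=D)             # feature vector of the n-th venue
t_k = rng.normal(size=D)             # feature vector of the k-th tag

r_mnk = np.sum(u_m * v_n * t_k)      # three-way inner product
print(r_mnk)
```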

BPTF [20] models tensor factorization over a generative probabilistic model for ratings with Gaussian/Wishart priors over parameters. The Wishart distribution is most commonly used as the conjugate prior for the precision matrix of a Gaussian distribution.

We denote the matrix representations of \({\mathbf{u}}_m\), \({\mathbf{v}}_n\), and \({\mathbf{t}}_k\) as \({\mathbf{U}} \equiv [{\mathbf{u}}_1, {\mathbf{u}}_2, \ldots , {\mathbf{u}}_M ]\), \({\mathbf{V}} \equiv [{\mathbf{v}}_1, {\mathbf{v}}_2, \ldots , {\mathbf{v}}_N]\), and \({\mathbf{T}} \equiv [{\mathbf{t}}_1, {\mathbf{t}}_2, \ldots , {\mathbf{t}}_K]\). To account for randomness in ratings, BPTF uses the following probabilistic model for generating ratings:

$$\begin{aligned} {\mathcal {R}}|{\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}} \sim \prod _{m=1}^{M}\prod _{n=1}^{N}\prod _{k=1}^{K}{\mathcal {N}}(\langle {\mathbf{u}}_{m}, {\mathbf{v}}_{n}, {\mathbf{t}}_{k} \rangle , \alpha ^{-1}). \end{aligned}$$

This represents the conditional distribution of \({\mathcal {R}}\) given \(\mathbf{U}\), \(\mathbf{V}\), and \(\mathbf{T}\) in terms of Gaussian distributions, each with mean \(\langle {\mathbf{u}}_{m}, {\mathbf{v}}_{n}, {\mathbf{t}}_{k} \rangle \) and precision \(\alpha \).

The generative process of BPTF requires parameters \({\varvec{\mu }}_0\), \(\beta _0\), \(\mathbf{W}_0\), \(\nu _0\), \(\tilde{{W}}_0\), and \(\tilde{\nu }_0\) in the hyper-priors; they should reflect prior knowledge about a specific problem and are treated as constants during training. The process is as follows:

  1. 1.

    Generate \({\varvec{{\varLambda }}}{_{\mathbf{U}}}\), \({\varvec{{\varLambda }}}{_{\mathbf{V}}}\), and \({\varvec{{\varLambda }}}{_{\mathbf{T}}}\sim {\mathcal {W}}({\varvec{{\varLambda }}}|{\mathbf{W}}_0, \nu _0)\), where \({\varvec{{\varLambda }}}{_{\mathbf{U}}}\), \({\varvec{{\varLambda }}}{_{\mathbf{V}}}\), and \({\varvec{{\varLambda }}}{_{\mathbf{T}}}\) are the precision matrices (a precision matrix is the inverse of a covariance matrix) for Gaussians. \({\mathcal {W}}({\varvec{{\varLambda }}}|{\mathbf{W}}_0, \nu _0)\) is the Wishart distribution of a \(D \times D\) random matrix \({\varvec{{\varLambda }}}\) with \(\nu _0\) degrees of freedom and a \(D \times D\) scale matrix \(\mathbf{W}_0\): \({\mathcal {W}}({\varvec{{\varLambda }}}|\mathbf{W}_0, \nu _0)=\frac{|{\varvec{{\varLambda }}}|^{({\nu }_{0}-D-1)/2}}{c}\exp (-\frac{Tr({\mathbf{W}_0}^{-1}{\varvec{{\varLambda }}})}{2})\), where c is a normalizing constant.

  2. 2.

    Generate \({\varvec{\mu }}_{\mathbf{U}}\sim {\mathcal {N}}({\varvec{\mu }}_0, (\beta _0{\varvec{{\varLambda }}}_{\mathbf{U}})^{-1})\), where \({\varvec{\mu }}_{\mathbf{U}}\) is used as the mean vector for a Gaussian. Similarly, generate \({\varvec{\mu }}_{\mathbf{V}}\sim {\mathcal {N}}({\varvec{\mu }}_0, (\beta _0{\varvec{{\varLambda }}}_{\mathbf{V}})^{-1})\) and \({\varvec{\mu }}_{\mathbf{T}}\sim {\mathcal {N}}({\varvec{\mu }}_0, (\beta _0{\varvec{{\varLambda }}}_{\mathbf{T}})^{-1})\), where \({\varvec{\mu }}_{\mathbf{V}}\) and \({\varvec{\mu }}_{\mathbf{T}}\) are mean vectors for Gaussians.

  3. 3.

    Generate \(\alpha \sim {\mathcal {W}}(\tilde{{W}}_0, \tilde{\nu }_0)\), the precision used for the rating observations.

  4. 4.

    For each \(m \in (1 \ldots M)\), generate \({\mathbf{u}}_{m} \sim {\mathcal {N}}({\varvec{\mu }}_{\mathbf{U}}, {\varvec{{\varLambda }}}{_{\mathbf{U}}}^{-1})\).

  5. 5.

    For each \(n \in (1 \ldots N)\), generate \({\mathbf{v}}_{n} \sim {\mathcal {N}}({\varvec{\mu }}_{\mathbf{V}}, {\varvec{{\varLambda }}}{_{\mathbf{V}}}^{-1})\).

  6. 6.

    For each \(k \in (1 \ldots K)\), generate \({\mathbf{t}}_{k} \sim {\mathcal {N}}({\varvec{\mu }}_{\mathbf{T}}, {\varvec{{\varLambda }}}{_{\mathbf{T}}}^{-1})\).

  7. 7.

    For each non-missing entry (m, n, k), generate \({r}_{m, n, k}\) \(\sim \) \({\mathcal {N}}(\langle {\mathbf{u}}_{m}, {\mathbf{v}}_{n}, {\mathbf{t}}_{k}\rangle , \alpha ^{-1})\).

Parameters \({\varvec{\mu }}_0\), \(\beta _0\), \(\mathbf{W}_0\), \(\nu _0\), \(\tilde{{W}}_0\), and \(\tilde{\nu }_0\) should be set properly according to the target dataset; fortunately, varying their values has little impact on the final prediction [20].
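To make the generative process concrete, the following is a minimal sketch that draws one sample from it using NumPy and SciPy; the toy sizes and hyper-prior values are illustrative assumptions, not the settings used in the experiments.

```python
# Sketch of the BPTF generative process (steps 1-7) with toy sizes.
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
D, M, N, K = 3, 4, 5, 6                      # latent dim and object counts (toy)
mu0, beta0 = np.zeros(D), 2.0                # hyper-priors (assumed values)
W0, nu0 = np.eye(D), D                       # Wishart scale / degrees of freedom
W0_t, nu0_t = np.eye(1), 1                   # hyper-priors for the rating precision

def sample_prior():
    """Steps 1-2: draw a precision matrix and a mean vector for one object type."""
    Lam = wishart.rvs(df=nu0, scale=W0, random_state=rng)
    mu = rng.multivariate_normal(mu0, np.linalg.inv(beta0 * Lam))
    return mu, Lam

(mu_U, Lam_U), (mu_V, Lam_V), (mu_T, Lam_T) = sample_prior(), sample_prior(), sample_prior()
alpha = float(wishart.rvs(df=nu0_t, scale=W0_t, random_state=rng))   # step 3

U = rng.multivariate_normal(mu_U, np.linalg.inv(Lam_U), size=M)      # step 4
V = rng.multivariate_normal(mu_V, np.linalg.inv(Lam_V), size=N)      # step 5
T = rng.multivariate_normal(mu_T, np.linalg.inv(Lam_T), size=K)      # step 6

# Step 7: generate the rating of one (m, n, k) entry.
m, n, k = 0, 1, 2
mean = np.sum(U[m] * V[n] * T[k])
r_mnk = rng.normal(mean, np.sqrt(1.0 / alpha))
print(r_mnk)
```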

BPTF views the hyper-parameters \(\alpha \), \({\varTheta }_{\mathbf{U}} \equiv \{{\varvec{\mu }}_{\mathbf{U}}, {\varvec{{\varLambda }}}_{\mathbf{U}}\}\), \({\varTheta }_{\mathbf{V}} \equiv \{{\varvec{\mu }}_{\mathbf{V}}, {\varvec{{\varLambda }}}_{\mathbf{V}}\}\), and \({\varTheta }_{\mathbf{T}} \equiv \{{\varvec{\mu }}_{\mathbf{T}}, {\varvec{{\varLambda }}}_{\mathbf{T}}\}\) as random variables, yielding a predictive distribution for unobserved ratings \(\hat{\mathcal {R}}\), which, for observable tensor \({\mathcal {R}}\), is given by:

$$\begin{aligned} p(\hat{\mathcal {R}}|{\mathcal {R}})= & {} \int p(\hat{\mathcal {R}}|{\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, \alpha )\nonumber \\&p({\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, \alpha , {\varTheta }_{\mathbf{U}}, {\varTheta }_{\mathbf{V}}, {\varTheta }_{\mathbf{T}}|{\mathcal {R}}) d\{{\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, \alpha , {\varTheta }_{\mathbf{U}}, {\varTheta }_{\mathbf{V}}, {\varTheta }_{{\mathbf{T}}}\}. \end{aligned}$$
(2)

BPTF computes the expectation of \(p(\hat{\mathcal {R}}|{\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, \alpha )\) over the posterior distribution \(p({\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, \alpha , {\varTheta }_{\mathbf{U}}, {\varTheta }_{\mathbf{V}}, {\varTheta }_{\mathbf{T}}|{\mathcal {R}})\); it approximates the expectation by averaging samples drawn from the posterior distribution. Since the posterior is too complex to be directly sampled, it applies the Markov Chain Monte Carlo (MCMC) indirect sampling technique to infer the predictive distribution for unobserved ratings \(\hat{\mathcal {R}}\) (see [20] for details on the inference algorithm of BPTF).

The time and space complexities of BPTF are \(O(\#nz \times D^2 + (M+N+K) \times D^3)\). \(\#nz\) is the number of observation entries, and M, N, and K are all much greater than D. BPTF can also compute feature vectors in parallel while avoiding fine parameter tuning during factorization.

4 Method

We now explain S\(^3\)TF. We first explain how to create augmented tensors, which share semantics among services, from individual services’ tensors. Table 1 summarizes the notations used by our method.

Table 1. Definition of main symbols
Fig. 2. Examples of our factorization process

4.1 Creating Augmented Tensors

Following SSTF, S\(^3\)TF creates the augmented tensor \({{\mathcal {A}}^{v}}\), which has all the observations across the X services (those services do not share any object) as well as the observations for sparsely observed venues lifted into the augmented venue classes. The classes are chosen from shared KBs such as DBpedia and Freebase and thus are shared among services; e.g. for restaurant review services, the types of restaurants and the food categories are listed in detail in DBpedia or Freebase.

First, S\(^3\)TF extracts the observations for sparsely observed venues. Here, the set of sparse venues for the i-th service (\(1 \le i \le X\)), denoted as \({\mathbb {V}}_s^{i}\), is defined as the group of the most sparsely observed venues, \(v^i_s\)s, among all venues in the i-th service. We set a 0/1 flag to indicate the existence of relationships composed of user \(u^i_m\), venue \(v^i_n\), and tag \(t^i_k\) as \({o}^i_{m, n, k}\). Then, \({\mathbb {V}}_s^{i}\) is computed as follows:

  1. (1)

    S\(^3\)TF first sorts the venues from the rarest to the most common in the i-th service (\(1 \le i \le X\)) and creates a list of venues: \(\{v^i_{s(1)}, v^i_{s(2)}, \ldots , v^i_{s(N^i-1)}, v^i_{s(N^i)}\}\), where \(N^i\) is the number of venues in the i-th service. For example, \(v^i_{s(2)}\) is observed at least as frequently as \(v^i_{s(1)}\).

  2. (2)

    It iterates the following step (3) from \(j=1\) to \(j=N^i\).

  3. (3)

    If the following condition is satisfied, S\(^3\)TF adds the j-th sparse venue \(v^i_{s(j)}\) to set \({\mathbb {V}}_s^{i}\): \((|{\mathbb {V}}_s^{i}|/\sum _{m, n, k}{{o}^{i}_{m, n, k}})< \delta \), where \({\mathbb {V}}_s^{i}\) is initially empty and \(|{\mathbb {V}}_s^{i}|\) is the number of venues in set \({\mathbb {V}}_s^{i}\). Otherwise, it stops the iterations and returns the set \({\mathbb {V}}_s^{i}\) as the most sparsely observed venues in the i-th service. Here, \(\delta \) is a parameter used to determine the number of sparse venues in \({\mathbb {V}}_s^{i}\). Typically, we set \(\delta \) to range from 0.05 to 0.20 in accordance with the long-tail characteristic such that sparse venues account for 5–20% of all observations [13]. (A minimal code sketch of this selection procedure follows this list.)
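The sparse-venue selection above can be sketched as follows; the observation triples, the \(\delta \) value, and the venue names are illustrative assumptions.

```python
# Sketch of the sparse-venue selection, steps (1)-(3), for one service.
from collections import Counter

def sparse_venues(observations, delta):
    """observations: list of (user, venue, tag) triples; returns the set of
    the most sparsely observed venues under threshold delta."""
    counts = Counter(v for (_, v, _) in observations)
    total_obs = len(observations)
    ordered = sorted(counts, key=counts.get)     # step (1): rarest venue first
    sparse = set()
    for venue in ordered:                        # steps (2)-(3)
        if len(sparse) / total_obs < delta:
            sparse.add(venue)
        else:
            break
    return sparse

# Toy usage: with this delta only the rarest venue is selected.
logs = ([("u1", "Lady M", "Banana cream pie")]
        + [("u2", "Balthazar", "Steak")] * 15
        + [("u3", "Katz's", "Pastrami")] * 15)
print(sparse_venues(logs, delta=0.03))           # {'Lady M'}
```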

Second, S\(^3\)TF constructs the augmented tensor \({{\mathcal {A}}^{v}}\) as follows:

  1. (1)

    S\(^3\)TF inserts the multi-object relationship composed of user \(u^i_m\), venue \(v^i_n\), and tag \(t^i_k\), observed in the i-th service, into \({{\mathcal {A}}^{v}}\). Here, the rating \({{r^i}_{m, n, k}}\) corresponding to the above relationship is inserted into the (\((M_1^{i - 1}+m)\), \((N_1^{i- 1}+n)\), \((K_1^{i - 1}+k)\))-th element of \({{\mathcal {A}}^{v}}\), where \(M_1^{i - 1}\), \(N_1^{i - 1}\), and \(K_1^{i - 1}\) denote the total numbers of users, venues, and tags, respectively, in the services whose identifiers run from 1 to (\(i - 1\)). As a result, \({{\mathcal {A}}^{v}}\) has all users, all venues, and all tags in all services. In Fig. 2(i), all observations in \({{{\mathcal {R}}}^{1}}\) and \({{{\mathcal {R}}}^{2}}\) are inserted into \({{\mathcal {A}}^{v}}\).

  2. (2)

    S\(^3\)TF additionally inserts the multi-object relationship composed of user \(u^i_m\), a class of a sparse venue \(c^v_j\), and tag \(t^i_k\) into \({{\mathcal {A}}^{v}}\) if \(v^i_n\) is included in \({\mathbb {V}}_s^{i}\) and \(c^v_j\) is one of the classes of \(v^i_n\). Thus, the rating \({{r^i}_{m, n, k}}\) is inserted into the (\((M_1^{i - 1} + m)\), \((N_{1}^{X} + j)\), \((K_1^{i - 1} + k)\))-th element of \({{\mathcal {A}}^{v}}\). If sparse venue \(v^i_n\) has several classes, S\(^3\)TF inserts the rating \({{r^i}_{m, n, k}}\) into all corresponding elements of \({{\mathcal {A}}^{v}}\). In Fig. 2(i), the observations for the classes of sparse venues (“Lady M” in service 1 and “Les Deux Gamins” in service 2) are added to \({{\mathcal {A}}^{v}}\) (in the elements corresponding to their class “Bars”). Here, the number of classes that contain sparse venues across all services is denoted as \({S^{v}}\); it is computed as \({S^{v}}=|\bigcup _{{\mathbb {V}}_s^{i}}{f(v^i_s)}|_{(1 \le i \le X)}\), where \(f(v^i_s)\) is a function that returns the classes of sparse venue \(v^i_s\) in the i-th service.

The set of sparse tags \({\mathbb {T}}_s^{i}\) is defined as the group of the most sparsely observed tags in the i-th service and is computed using the same procedure as for \({\mathbb {V}}_s^{i}\). The augmented tensor for tags, \({{\mathcal {A}}^{t}}\), is built in the same way as \({{\mathcal {A}}^{v}}\), so we omit the details of those procedures here; a minimal sketch of the construction of \({{\mathcal {A}}^{v}}\) is shown below.
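A minimal sketch of constructing \({{\mathcal {A}}^{v}}\) as a sparse dictionary of ratings is given below; the index offsets, the class map f, and the toy ratings are illustrative assumptions rather than the actual datasets.

```python
# Sketch of building the augmented tensor A^v as a sparse dict of ratings.
def build_augmented_tensor(services, sparse_venues, venue_classes):
    """services: one dict {(m, n, k): rating} per service, with local indices;
    sparse_venues: one set of sparse venue indices per service;
    venue_classes: one dict {venue index: [class ids]} per service."""
    aug = {}
    m_off = n_off = k_off = 0
    # Class indices are appended after the venues of all services.
    n_total = sum(max(n for (_, n, _) in s) + 1 for s in services)
    for i, service in enumerate(services):
        for (m, n, k), r in service.items():
            # Step (1): copy every observation with service-wise index offsets.
            aug[(m_off + m, n_off + n, k_off + k)] = r
            # Step (2): lift observations of sparse venues to their classes.
            if n in sparse_venues[i]:
                for j in venue_classes[i].get(n, []):
                    aug[(m_off + m, n_total + j, k_off + k)] = r
        m_off += max(m for (m, _, _) in service) + 1
        n_off += max(n for (_, n, _) in service) + 1
        k_off += max(k for (_, _, k) in service) + 1
    return aug

# Toy usage: two services that only share venue class 0 ("Bars").
s1 = {(0, 0, 0): 5}          # u1 rates "Lady M" with some tag
s2 = {(0, 0, 0): 4}          # u2 rates "Les Deux Gamins" with some tag
print(build_augmented_tensor([s1, s2], [{0}, {0}], [{0: [0]}, {0: [0]}]))
```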

Tensor creation by S\(^3\)TF has the following two benefits: (1) It solves the balance problem by creating individual tensors for services and so avoids strongly biasing any particular service. (2) It overcomes the sparsity problem by propagating observations in sparse objects to their classes shared among services in the augmented tensor.

4.2 Simultaneously Factorizing Tensors Across Services

S\(^3\)TF factorizes individual services’ tensors and augmented tensors simultaneously. We first explain our approach and then the algorithm.

Approach. S\(^3\)TF employs the following three techniques in factorizing tensors.

  1. (A)

    It factorizes individual service tensors \({{\mathcal {R}}^{i}}\)s (\(1 \le i \le X\)) and augmented tensors \({{\mathcal {A}}^{v}}\) and \({{\mathcal {A}}^{t}}\) simultaneously. In particular, it creates feature vectors for users, \({{\mathbf{u}}^{i}_m}\)s, for venues, \({{\mathbf{v}}^{i}_n}\)s, and for tags, \({{\mathbf{t}}^{i}_k}\)s, by factorizing tensor \({{\mathcal {R}}^{i}}\) for each i-th service, as well as feature vectors for the venue classes, \({{\mathbf{c}}^{v}_j}\)s, from \({{\mathcal {A}}^{v}}\) and for the tag classes, \({{\mathbf{c}}^{t}_j}\)s, from \({{\mathcal {A}}^{t}}\). As a result, S\(^3\)TF factorizes the individual tensors while enabling the semantic biases from \({{\mathbf{c}}^{v}_j}\)s and \({{\mathbf{c}}^{t}_j}\)s to be shared during the factorization process. This approach of “simultaneously” factorizing individual service tensors solves the balance problem. In the example shown in Fig. 2(ii), \({{\mathcal {R}}}^{1}\), \({{\mathcal {R}}}^{2}\), and \({{\mathcal {A}}}^{v}\) are factorized simultaneously into D-dimensional “row” feature vectors.

  2. (B)

    It shares the feature vectors \({{\mathbf{u}}}^i_{m}\), \({{\mathbf{v}}}^i_{n}\), and \({{\mathbf{t}}}^i_{k}\), which are computed by factorizing \({{\mathcal {R}}^{i}}\), in the factorization of the augmented tensors \({{\mathcal {A}}^{v}}\) and \({{\mathcal {A}}^{t}}\). This means that it computes the user feature matrix for the augmented tensor, \({{\mathbf{U}}}^a\), by joining the X services’ user feature matrices, \([{{\mathbf{U}}}^1, \ldots , {{\mathbf{U}}}^i, \ldots , {{\mathbf{U}}}^X]\). Similarly, it computes the feature matrix for venues, \({{\mathbf{V}}}^a\), and that for tags, \({{\mathbf{T}}}^a\), for the augmented tensor. Then, it computes the feature matrix for venue (or tag) classes by reusing the joined feature matrices \({{\mathbf{U}}}^a\) and \({{\mathbf{T}}}^a\) (or \({{\mathbf{U}}}^a\) and \({{\mathbf{V}}}^a\)). As a result, during the factorization process, it can share the tendencies of users’ activities across services via those shared parameters. In Fig. 2(ii), \({{\mathbf{u}}_{m,d}^a}\) is computed as \([{{\mathbf{u}}_{m,d}^1},{{\mathbf{u}}_{m,d}^2}]\) and \({{\mathbf{t}}_{k,d}^a}\) as \([{{\mathbf{t}}_{k,d}^1},{{\mathbf{t}}_{k,d}^2}]\).

  3. (C)

    It updates the latent feature vectors for sparse venues (or tags) in the i-th service, \({\mathbf{v}}_{s}^{i}\)s (or \({\mathbf{t}}_{s}^{i}\)s), by incorporating semantic biases from \({\mathbf{c}}^v_{j}\)s (or \({\mathbf{c}}^t_{j}\)s) into \({\mathbf{v}}_{s}^{i}\)s (or \({\mathbf{t}}_{s}^{i}\)s). Here, \({\mathbf{c}}^v_{j}\)s (or \({\mathbf{c}}^t_{j}\)s) are the feature vectors for the classes of the sparse venues \(v_s^i\)s (or sparse tags \(t_s^i\)s). This process incorporates the semantic tendencies of users’ activities across services, captured by idea (B), into each service’s factorization; this is useful in solving the sparsity problem. In Fig. 2(iii), each row vector \({{\mathbf{c}}^{v}_{:,d}}\) has latent features for the \((N^1 + N^2)\) venues and for the \(S^{v}\) classes. The features in \({{\mathbf{c}}^{v}_{:,d}}\) share semantic knowledge of sparse venues across services. For example, the feature for “Bars” in \({{\mathbf{c}}^{v}_{:,d}}\) shares the semantic knowledge of the sparse venues “Lady M” and “Les Deux Gamins” across the US and French restaurant review services (see also Fig. 1).

Algorithm. Here we explain how to compute the predictive distribution for unobserved ratings. Unlike the BPTF model (see Eq. (2)), S\(^3\)TF considers the tensors for individual services and the augmented tensors in computing the distribution. Thus, the predictive distribution is computed as follows:

$$\begin{aligned}&p(\hat{\mathcal {R}}|{\mathcal {R}} , {\mathcal {A}}^{v} , {\mathcal {A}}^{t} )=\int p(\hat{\mathcal {R}}|{\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, {\mathbf{C}}^{v}, {\mathbf{C}}^{t}, \alpha , \alpha ^a)\nonumber \\&p({\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, {\mathbf{C}}^{v}, {\mathbf{C}}^{t}, {\varTheta }_{\mathbf{U}}, {\varTheta }_{{\mathbf{V}}}, {\varTheta }_{{\mathbf{T}}}, {\varTheta }_{{\mathbf{C}}^{v}}, {\varTheta }_{{\mathbf{C}}^{t}}, \alpha , \alpha ^a | {\mathcal {R}}, {\mathcal {A}}^{v}, {\mathcal {A}}^{t} )\nonumber \\&d\{ {\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, {\mathbf{C}}^{v}, {\mathbf{C}}^{t}, {\varTheta }_{\mathbf{U}}, {\varTheta }_{{\mathbf{V}}}, {\varTheta }_{{\mathbf{T}}}, {\varTheta }_{{\mathbf{C}}^{v}}, {\varTheta }_{{\mathbf{C}}^{t}}, \alpha , \alpha ^a \} \end{aligned}$$
(3)

where \({\mathcal {R}} \equiv \{{\mathcal {R}}^{i}\}_{i=1}^{X}\), \(\alpha \equiv \{\alpha ^i\}_{i=1}^{X}\), \({\mathbf{U}}\equiv \{{\mathbf{U}}^{i}\}_{i=1}^{X}\), \({\mathbf{V}} \equiv \{{\mathbf{V}}^{i}\}_{i=1}^{X}\), \({\mathbf{T}} \equiv \{{\mathbf{T}}^{i}\}_{i=1}^{X}\), \({\varTheta }_{{\mathbf{U}}} \equiv \{{\varTheta }_{{\mathbf{U}}^{i}}\}_{i=1}^{X}\), \({\varTheta }_{{\mathbf{V}}} \equiv \{{\varTheta }_{{\mathbf{V}}^{i}}\}_{i=1}^{X}\), and \({\varTheta }_{{\mathbf{T}}} \equiv \{{\varTheta }_{{\mathbf{T}}^{i}}\}_{i=1}^{X}\).

Equation (3) involves a multi-dimensional integral that cannot be computed analytically. Thus, S\(^3\)TF views Eq. (3) as the expectation of \(p(\hat{\mathcal {R}}|{\mathcal {R}}, {\mathcal {A}}^{v} , {\mathcal {A}}^{t})\) over the posterior distribution \(p({\mathbf{U}}, {\mathbf{V}}, {\mathbf{T}}, {\mathbf{C}}^{v}, {\mathbf{C}}^{t}, {\varTheta }_{\mathbf{U}}, {\varTheta }_{{\mathbf{V}}}, {\varTheta }_{{\mathbf{T}}}, {\varTheta }_{{\mathbf{C}}^{v}}, {\varTheta }_{{\mathbf{C}}^{t}}, \alpha , \alpha ^a | {\mathcal {R}}, {\mathcal {A}}^{v}, {\mathcal {A}}^{t})\), and approximates the expectation by MCMC with the Gibbs sampling paradigm. It collects a number of samples, L, to approximate the integral in Eq. (3) as:

$$\begin{aligned} p(\hat{\mathcal {R}}|{\mathcal {R}}, {\mathcal {A}}^{v}, {\mathcal {A}}^{t}) \approx \frac{1}{L}\sum _{l=1}^{L}{p(\hat{\mathcal {R}}|{\mathbf{U}}[l], {\mathbf{V}}[l], {\mathbf{T}}[l], {\mathbf{C}}^{v}[l], {\mathbf{C}}^{t}[l], \alpha [l], \alpha ^a[l])} \end{aligned}$$
(4)

where l represents the l-th sample.

The MCMC procedure is as follows (details are given in the supplemental material):

  1. (1)

    Initializes \({\mathbf{U}}^{i}[1]\), \({\mathbf{V}}^{i}[1]\), and \({\mathbf{T}}^{i}[1]\) (\(1 \le i \le X\)) for each i-th service, as well as \({{{\mathbf{C}}^{v}}}[1]\) and \({{{\mathbf{C}}^{t}}}[1]\) for the augmented tensors, from Gaussian distributions as in BPTF. \({{{\mathbf{C}}^{v}}}[1]\) and \({{{\mathbf{C}}^{t}}}[1]\) are used for sharing the semantics across services (see our approach (A)). It then repeats steps (2) to (8) L times.

  2. (2)

    Samples the hyperparameters for each i-th service as in BPTF, i.e.:

    • \(\alpha ^i[l + 1] \sim p(\alpha ^i[l]|{\mathbf{U}}^i[l],{\mathbf{V}}^{i}[l],{\mathbf{T}}^{i}[l],{\mathcal {R}}^i)\)

    • \({\varTheta }_{U^i}[l + 1] \sim p({\varTheta }_{U^i}[l]|{\mathbf{U}}^i[l])\)

    • \({\varTheta }_{V^{i}}[l + 1] \sim p({\varTheta }_{V^{i}}[l]|{\mathbf{V}}^{i}[l])\)

    • \({\varTheta }_{T^{i}}[l + 1] \sim p({\varTheta }_{T^{i}}[l]|{\mathbf{T}}^{i}[l])\)

    Here, \({\varTheta }_{\mathbf{X}} \equiv \{\mu _{\mathbf{X}},{\varLambda }_{\mathbf{X}}\}\) and is computed in the same way as in BPTF.

  3. (3)

    Samples the feature vectors the same way as is done in BPTF:

    • \({\mathbf{u}}^i_m[l + 1] \sim p({\mathbf{u}}^i_m|{\mathbf{V}}^{i}[l],{\mathbf{T}}^{i}[l],\alpha ^i[l + 1],{\varTheta }_{{U}^i}[l + 1],{\mathcal {R}}^i)\)

    • \({\mathbf{v}}^{i}_{n}[l + 1] \sim p({\mathbf{v}}_{n}^{i}|{\mathbf{U}}^{i}[l + 1],{\mathbf{T}}^{i}[l],\alpha ^i[l + 1],{\varTheta }_{{V}^{i}}[l + 1],{\mathcal {R}}^i)\)

    • \({\mathbf{t}}^{i}_{k}[l + 1] \sim p({\mathbf{t}}_{k}^{i}|{\mathbf{U}}^{i}[l + 1],{\mathbf{V}}^{i}[l + 1],\alpha ^i[l +1],{\varTheta }_{{T}^{i}}[l+1],{\mathcal {R}}^i)\)

  4. (4)

    Joins the services' feature matrices to reuse them as the feature matrices for the augmented tensors (see our approach (B)):

    • \({\mathbf{U}}^a[l + 1]=[{\mathbf{U}}^1[l + 1], \cdots , {\mathbf{U}}^i[l + 1], \cdots , {\mathbf{U}}^X[l + 1]]\)

    • \({\mathbf{V}}^a[l + 1]=[{\mathbf{V}}^1[l + 1], \cdots , {\mathbf{V}}^i[l + 1], \cdots , {\mathbf{V}}^X[l + 1]]\)

    • \({\mathbf{T}}^a[l + 1]=[{\mathbf{T}}^1[l + 1], \cdots , {\mathbf{T}}^i[l + 1], \cdots , {\mathbf{T}}^X[l + 1]]\)

  5. (5)

    Samples the hyperparameters for the augmented tensors similarly:

    • \(\alpha ^a[l + 1] \sim p(\alpha ^a[l]|{\mathbf{U}}^a[l + 1],{\mathbf{V}}^{a}[l + 1],{\mathbf{T}}^{a}[l + 1],{\mathcal {R}}^a)\)

    • \({\varTheta }_{{{{C}}^{v}}}[l + 1] \sim p({\varTheta }_{{{{{C}}^{v}}}}[l]|{{{\mathbf{C}}^{v}}}[l])\)

    • \({\varTheta }_{{{{C}}^{t}}}[l + 1] \sim p({\varTheta }_{{{{{C}}^{t}}}}[l]|{{{\mathbf{C}}^{t}}}[l])\)

  6. (6)

    Samples the semantically-biased feature vectors by using \(\alpha ^a[l + 1]\), \({\mathbf{U}}^a[l + 1]\), \({\mathbf{V}}^{a}[l + 1]\), and \({\mathbf{T}}^{a}[l + 1]\) as follows (see our approach (B)):

    • \({{\mathbf{c}}^{v}_j}[l + 1] \sim p({{\mathbf{c}}^{v}_j}|{\mathbf{U}}^a[l + 1],{\mathbf{T}}^{a}[l + 1],\alpha ^a[l + 1],{\varTheta }_{{C^{v}}}[l + 1],{\mathcal {A}}^{v})\)

    • \({{\mathbf{c}}^{t}_j}[l + 1] \sim p({{\mathbf{c}}^{t}_j}|{\mathbf{U}}^a[l + 1],{\mathbf{V}}^{a}[l + 1],\alpha ^a[l + 1],{\varTheta }_{{C^{t}}}[l + 1],{\mathcal {A}}^{t})\)

  7. (7)

    Samples the unobserved ratings \(\hat{r}^{i}_{m,n,k}[l + 1]\) by applying \({\mathbf{U}^i}[l + 1]\), \({\mathbf{V}^i}[l + 1]\), \({\mathbf{T}^i}[l + 1]\), \({\mathbf{C}}^{v}[l + 1]\), \({\mathbf{C}}^{t}[l + 1]\), and \(\alpha ^i[l + 1]\) to Eq. (4).

  8. (8)

    Updates \({\mathbf{v}}_{n}^{i}[l + 1]\) as follows and uses it in the next iteration (see our approach (C)):

    $$\begin{aligned} {\mathbf{v}}_n^{i}=\left\{ \begin{array}{ll} \frac{1}{2} \bigl ({\mathbf{v}}^i_{n} + \frac{\sum _{c^{v}_{j} \in f(v^i_n)}{{{ {\mathbf{c}^{v}_{j}}}}}}{|f(v^i_n)|}\bigr ) &{} (v^i_{n} \in {\mathbb {V}}_s^{i}) \\ {\mathbf{v}}^i_{n} &{} \text{(otherwise) } \end{array} \right. \end{aligned}$$
    (5)

    Updates \({\mathbf{t}}_{k}^{i}[l + 1]\) in the same way (we omit the details here). A minimal code sketch of this semantic-bias update follows this list.
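The semantic-bias update of Eq. (5) can be sketched as follows; the feature vectors and the class lookup are illustrative placeholders rather than learned factors.

```python
# Sketch of the semantic-bias update of Eq. (5) for one venue feature vector.
import numpy as np

def apply_semantic_bias(v_n, is_sparse, class_vectors):
    """v_n: feature vector of venue n; class_vectors: feature vectors of the
    classes f(v_n). Non-sparse venues are returned unchanged."""
    if not is_sparse or not class_vectors:
        return v_n
    class_mean = np.mean(class_vectors, axis=0)   # average over f(v_n)
    return 0.5 * (v_n + class_mean)               # Eq. (5), sparse case

# Toy usage: bias a sparse venue's vector toward its "Bars" class vector.
v_lady_m = np.array([0.2, -0.1, 0.4])
c_bars = np.array([1.0, 0.0, 0.0])
print(apply_semantic_bias(v_lady_m, True, [c_bars]))
```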

The complexity of S\(^3\)TF in each MCMC iteration is \(O(\#nz \times D^2 + (M_1^X+N_1^X +K_1^X+S^{v} + S^{t})\times D^3)\). Because the first term is much larger than the rest, the computation time is almost the same as that of BPTF. Parameter \(\delta \) can easily be set based on the long-tail characteristic, and the factorization parameters follow from the full Bayesian treatment inherited from the BPTF framework. S\(^3\)TF is faster than SSTF when analyzing X services, since S\(^3\)TF creates and factorizes only one set of augmented tensors (\({\mathcal {A}}^{v}\) and \({\mathcal {A}}^{t}\)) for all services, while SSTF needs X sets of augmented tensors.

5 Evaluation

We confirmed the accuracy and efficiency of the method through evaluations on real-world datasets.

5.1 Dataset

We used the Yelp ratings/reviews together with DBpedia [2] food vocabularies. The Yelp datasets contain user-made ratings of restaurant venues and user reviews of venues across four countries: the United Kingdom (UK), the United States (US), Canada, and Germany. The logs of users who appear in several countries are excluded from the datasets, so the datasets of the individual countries can be considered to come from different services. Food vocabularies are extracted from the food ontology, and categories are extracted from DBpedia article categories. We first extracted English food entries and then translated them into French or German by using BabelNet, a multilingual encyclopedic dictionary based on Wikipedia entries. Thus, the resulting food entries share the same categories. We then extracted tags from the reviews that match the instances in a DBpedia food vocabulary entry, as was done in [13]. Consequently, we extracted 988, 1,100, 1,388, and 435 tags for the UK, US, Canada, and Germany, respectively. We used the genre vocabulary provided by Yelp as the venue vocabulary; it has 179 venue classes. The tag vocabulary provided by DBpedia has 1,358 food classes. The sizes of the user-venue-tag tensors for the UK, US, Canada, and Germany were \(2,052 \times 1,398 \times 988\), \(10,736 \times 1,554 \times 1,100\), \(10,700 \times 3,085 \times 1,388\), and \(286\times 332\times 435\), respectively. The numbers of ratings in those countries were 54,774, 118,012, 172,182, and 3,062, respectively. The ratings range from 1 to 5.

5.2 Comparison Methods

We compared the accuracy of the following seven methods:

  1. 1.

    NMTF [19], which utilizes auxiliary information like GCTF. It factorizes the target tensors (user-item-tag tensors created for each country) and auxiliary matrices (an item-class matrix and a tag-class matrix) simultaneously.

  2. 2.

    BPTF proposed by [20].

  3. 3.

    SSTF, which applies Semantic Sensitive Tensor Factorization proposed by [13] to the observed relationships in each service.

  4. 4.

    SSTF_all, which combines observed relationships in different services to create a merged tensor and factorizes the merged tensor by SSTF.

  5. 5.

    S\(^3\)TFT, which utilizes only the tag vocabulary.

  6. 6.

    S\(^3\)TFV, which utilizes only the venue vocabulary.

  7. 7.

    S\(^3\)TF, which is our proposal.

5.3 Methodology and Parameter Setup

We split each dataset into two halves: a training set that holds the reviews entered in the first half of the logging period and a test set consisting of the reviews entered in the second half. We then performed evaluations for the pairwise combinations of those country datasets (six in total) to check the repeatability of the results. Following the evaluation methodology used in previous studies [3, 11, 13, 20], we computed the Root Mean Square Error (RMSE), \(\sqrt{\sum _{i=1}^{n}(P_i - R_i)^{2}/n}\), where n is the number of entries in the test set, and \(P_i\) and \(R_i\) are the predicted and actual ratings of the i-th entry, respectively. The RMSE is more appropriate for representing model performance than the Mean Absolute Error (MAE) when the error distribution is expected to be Gaussian. We varied D from 5 to 20 for each method and set it to 20 since this gave the highest accuracy for all methods. We set the iteration count, L, to 100 since all methods converged with this setting. \(\delta \) was set to 0.8 following [13].
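For reference, the RMSE used above can be computed as in the following minimal sketch; the predicted and actual ratings are placeholder values.

```python
# Sketch of the RMSE computation used in the evaluation.
import numpy as np

def rmse(predicted, actual):
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse([4.2, 3.1, 5.0], [4, 3, 5]))   # toy test-set entries
```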

Fig. 3. Distribution of venue frequencies

Fig. 4. Accuracy vs. iteration count

5.4 Results

We first investigated the sparseness of the observed objects. Figure 3 plots the distribution of venue frequencies observed in the UK dataset. From this figure, we can confirm that venue observation frequencies exhibit the long-tail characteristic. Thus, observations of multi-object relationships become very sparse with respect to the possible combinations of observed objects. The distributions of the other datasets showed the same tendencies; thus, a solution to the sparsity problem across services is required. Figure 4 presents the accuracy on the UK dataset for the simultaneous factorization of the UK and US datasets when the number of iterations, L, was varied. It confirms that the accuracy of S\(^3\)TF saturated before \(L= 100\). Results on the other datasets showed similar tendencies.

Table 2. Comparing RMSE values of the methods

We then compared the accuracy of the methods for the simultaneous factorizations on the six dataset combinations. The results shown in Table 2 are the average RMSE values computed for each country. They show that SSTF has better accuracy than BPTF. This is because SSTF uses the semantics shared within a single service (e.g. within a service in the US) and thus mitigates the sparsity problem. SSTF also has better accuracy than SSTF_all even though SSTF_all uses the entire logs. This is because SSTF_all creates a tensor by mixing the heterogeneous datasets of different countries and thus suffers from the balance problem. S\(^3\)TFT and S\(^3\)TFV had better performance than BPTF or SSTF since they can use the semantics shared across services on tags and on venues, respectively. Finally, S\(^3\)TF, which utilizes the semantic knowledge across services while performing a coupled analysis of two tensors, yielded higher accuracy than the current best method, SSTF, with statistical significance at the 0.05 level.

The RMSEs of NMTF are much worse than those of S\(^3\)TF. This is mainly because: (1) NMTF straightforwardly combines different relationships, i.e., rating relationships among users, items, and tags, link relationships between items and their classes, and link relationships between tags and their classes. Thus, it suffers from the balance problem. (2) NMTF uses the KL divergence for optimizing the predictions since its authors are interested in “discrete value observations such as stars in product reviews”, as described in [19]. Our datasets are of the kind they are interested in; however, exponential-family distributions such as the Poisson distribution do not fit our rating datasets well.

Table 3. Computation time (seconds) when \(L=100\)

Table 3 presents the computation times of BPTF, SSTF, and S\(^3\)TF when simultaneously factorizing the tensors for the German and UK datasets as well as those for the German and US datasets. All experiments were conducted on a Linux 3.33 GHz Intel Xeon (24 cores) server with 192 GB of main memory. All methods were implemented with Matlab and GCC. We can see that the computation time of S\(^3\)TF is shorter than that of SSTF. Furthermore, L can be set smaller than 100 (see Fig. 4). Thus, we conclude that S\(^3\)TF computes more accurate predictions quickly; it works better than SSTF and BPTF in real applications.

Table 4. Prediction examples for US (the upper row) and German (the lower row)

We now show examples of the differences between the predictions output by SSTF and S\(^3\)TF in Table 4. The columns “S”, “\(S^3\)”, and “Actual” list the prediction values by SSTF, those by S\(^3\)TF, and the actual ratings given by users in the test dataset, respectively. In the US dataset, the combination of tag “streusel” at Bakeries “A” and “enchilada” at Tex-Mex restaurant “C” was highly rated in the training set. In the test set, the combination of tag “bratwurst” at American restaurant “B” and “pretzel” at Breakfast restaurant “D” was highly rated. The tags “streusel”, “bratwurst”, and “pretzel” (all included in the “german cuisine” class) are sparsely observed in the US training set. In the Germany dataset, the combination of tag “marzipan” at Bakeries “E” and “nachos” at Bars “G” was highly rated in the training set. In the test set, the combination of tag “burrito” at Bars “F” and “Schnitzel” at German restaurant “H” was highly rated. The tags “nachos” and “burrito” (both included in the “mexican cuisine” class) are sparsely observed in the German training set. S\(^3\)TF accurately predicted those observations formed by sparse tags since it uses the knowledge that tags “streusel” and “marzipan” both lie in tag class “german cuisine”, as well as the knowledge that tags “enchilada” and “nachos” both lie in tag class “mexican cuisine”. Thus, S\(^3\)TF can exploit the knowledge that combinations of “german cuisine” and “mexican cuisine” are often seen in datasets across countries. SSTF predictions were inaccurate since they were not based on the semantics behind the objects being rated across services.

Fig. 5. Examples of implicit relationships extracted by S\(^3\)TF

We also show the implicit relationships extracted when we factorized the three datasets UK, US, and Canada simultaneously. The implicit relationships were computed as follows: (1) the probability that the relationship composed of \(u_m\), \(v_n\), and \(t_k\) belongs to the i-th dimension is computed as \({{{u}}_{i,m} \cdot {{v}}_{i,n} \cdot {{t}}_{i,k}}\), where \(1\le i \le D\); (2) each observed relationship is assigned to the dimension that gives the highest value among all D dimensions; (3) the relationships assigned to the same dimension are considered to form implicit relationships across services. Figure 5 presents examples of the extraction results. The first, second, and third lines in the balloons in the figure indicate the representative venues, venue classes, and foods, respectively. The relationships in the same dimension are represented by the same marks, circles (1) or triangles (2): (1) this dimension includes several local dishes with alcoholic drinks across countries; e.g. people in the UK who love “Haggis” and drink “Scotch whisky” are implicitly related to those in the US who love “Cheeseburger” and drink “Indian pale ale” as well as those in Canada who love “Lobster roll” and drink “White wine”. (2) This dimension includes several sweet dishes across countries; e.g. people in the UK who love “Shortbread” are implicitly related to those in the US who love “Sundae” as well as those in Canada who love “Maple tart”. Such implicit relationships can be used to create recommendation lists for users across services. BPTF and SSTF cannot extract such implicit relationships since they cannot share semantics, and thus latent features, across services.
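The dimension-assignment procedure above can be sketched as follows; the feature matrices here are random placeholders rather than learned S\(^3\)TF factors.

```python
# Sketch of assigning observed (user, venue, tag) triples to latent dimensions.
import numpy as np

def assign_dimensions(triples, U, V, T):
    """U, V, T: D x M, D x N, D x K feature matrices of one service."""
    groups = {}
    for (m, n, k) in triples:
        scores = U[:, m] * V[:, n] * T[:, k]   # step (1): score per dimension
        d = int(np.argmax(scores))             # step (2): best dimension
        groups.setdefault(d, []).append((m, n, k))
    return groups                              # step (3): same-dimension groups

# Toy usage with random feature matrices.
rng = np.random.default_rng(0)
D, M, N, K = 4, 3, 3, 3
U, V, T = rng.random((D, M)), rng.random((D, N)), rng.random((D, K))
print(assign_dimensions([(0, 0, 0), (1, 2, 1)], U, V, T))
```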

6 Conclusion

This is the first study to show how to incorporate the semantics behind objects into tensor factorization and thus analyze users’ activities across different services. Semantic Sensitive Simultaneous Tensor Factorization (S\(^3\)TF), proposed here, presents a new research direction for the use of shared semantics in the cross-service analysis of users’ activities. S\(^3\)TF creates individual tensors for different services and links the objects observed in each tensor to the shared semantics. Then, it factorizes the tensors simultaneously while integrating semantic biases into the factorization. Experiments using real-world datasets showed that S\(^3\)TF achieves much higher accuracy than the current best tensor method and extracts implicit relationships across services during factorization. One interesting future direction is to apply our idea to recent embedding models (e.g. [23]) and analyze different services simultaneously by using KBs.