Collaborative topic regression for online recommender systems: an online and Bayesian approach
Abstract
Collaborative Topic Regression (CTR) combines ideas from probabilistic matrix factorization (PMF) and topic modeling (such as LDA) for recommender systems, and has seen increasing success in many applications. Despite its many advantages, the existing Batch Decoupled Inference algorithm for the CTR model has some critical limitations. First, it is designed to work in a batch learning manner, making it unsuitable for streaming data or big data in real-world recommender systems. Second, in the existing algorithm, the item-specific topic proportions of LDA are fed to the downstream PMF, but the rating information is not exploited in discovering the low-dimensional representation of documents, which can result in a representation that is suboptimal for prediction. In this paper, we propose a novel inference algorithm, called the Online Bayesian Inference algorithm for the CTR model, which is efficient and scalable for learning from data streams. Furthermore, we jointly optimize the combined objective function of both PMF and LDA in an online learning fashion, in which the PMF and LDA tasks can reinforce each other during the online learning process. Our encouraging experimental results on real-world data validate the effectiveness of the proposed method.
Keywords
Topic modeling, Online learning, Recommender systems, Collaborative filtering, Latent structure interpretation
1 Introduction
With the abundance of personalized online businesses, Recommender Systems (RS) now play an important role in helping us make effective use of information. For example, CiteULike^{1} adopts RS for article recommendations, and MovieLens^{2} uses RS for movie recommendations. The core technique behind RS is a personalization algorithm (Almazro et al. 2010) that predicts the preference of each individual user by making use of different sources of information about users and items. The most popular algorithms adopt the collaborative filtering (CF) technique (Su and Khoshgoftaar 2009; Breese et al. 1998), which analyzes the relationships between users and the interdependencies among items in order to identify new user-item associations. In general, CF is a method of making predictions (“filtering”) about the interests of a user by collecting preferences from many users (“collaborating”). One of the most successful CF techniques is based on Probabilistic Matrix Factorization (PMF) (Mnih and Salakhutdinov 2007), where a partially observed user-item rating matrix is approximated by the product of two low-rank matrices (latent factors) so as to complete the rating matrix for recommendation purposes. Despite their popularity, most traditional CF methods use only the feedback matrix, which contains the ratings given to items by users. Prediction performance often drops significantly when the feedback matrix is sparse, which occurs when most items receive feedback from few users or most users give feedback on few items, since the model is then susceptible to overfitting. In real-world scenarios, however, most users provide only a little feedback, especially new users, who have yet to provide any rating information.
On the other hand, in addition to the feedback matrix, auxiliary information is sometimes readily available and could provide key information for the recommendation task, yet many existing CF methods ignore such side information or are not intrinsically capable of exploiting it.
To overcome the sparsity issue of CF methods, Collaborative Topic Regression (CTR) has been actively explored in recent years (Wang and Blei 2011). Instead of relying purely on CF approaches, CTR leverages content-based techniques to overcome the inaccurate and unreliable predictions that traditional CF methods produce under data sparsity and other challenges. More specifically, CTR combines the idea of PMF for predicting ratings with the idea of probabilistic topic modeling, such as Latent Dirichlet Allocation (LDA), for analyzing the content of items for recommendation tasks. It is a joint probabilistic graphical model that integrates the LDA model and the PMF model. CTR has been shown to be a promising method that produces more accurate and interpretable results, and it has been successfully applied in many recommender systems, such as tag recommendation (Wang et al. 2013; Lu et al. 2015) and social recommender systems (Purushotham et al. 2012; Kang and Lerman 2013).
Although CTR has been studied actively and extended to various kinds of applications (Wang and Blei 2011; Wang et al. 2013), no attempts have been made to develop efficient and scalable approximate inference algorithms for the CTR model. The existing Batch Decoupled Inference algorithm for the CTR model (bdiCTR) suffers from several critical limitations. First, it is designed to work in a batch learning fashion, assuming that all text contents of items as well as the rating training data are given prior to the learning task. During the training process, the LDA and PMF models are usually trained separately, each in a batch fashion. Such an approach suffers from a serious scalability drawback when new data (users or items) arrive sequentially and are updated frequently, as in a real-world online recommender system. Second, although the graphical model of CTR is a joint model (a two-way interaction exists between the LDA model and the PMF model), bdiCTR only leverages the content information to improve the CF task, but not the rating information to improve the topic model. It first estimates the LDA model and then feeds the document-specific topic proportions of LDA to the downstream PMF part. This two-step inference procedure is inconsistent with the joint graphical model of CTR and is suboptimal: the rating information is not used in discovering the low-dimensional representation of documents, so the two methods are not tightly coupled and the resulting representation is not the best one for prediction. Our work is motivated by the need for more efficient, scalable, and effective techniques for dealing with data streams from real-world online recommender systems.
To overcome the limitations of bdiCTR, we propose a novel approximate inference scheme, called the Online Bayesian Inference algorithm for the CTR model (obiCTR), which jointly optimizes a unified objective function combining both the PMF model and the LDA model in an online learning fashion. In contrast to bdiCTR, our approximate inference scheme achieves a much tighter coupling of PMF and LDA, where the LDA and PMF tasks influence each other naturally and gradually via the joint optimization in the online learning process. This interplay yields topic representations of each item that are more suitable for accurate and reliable rating prediction.
To the best of our knowledge, our approximate inference algorithm is the first online learning algorithm for solving CTR tasks with fully joint optimization of both the LDA model and the PMF model. Our encouraging results from extensive experiments on large-scale real-world data show that the proposed online learning algorithm is scalable and effective: it not only outperforms the state-of-the-art methods on rating prediction tasks but also yields more suitable latent topic proportions in topic modeling tasks. Moreover, our approximate inference algorithm can be applied to other variants of the CTR model (see Sect. 2.2 for more on related work).
In the following, we first review some important related work, then present a formal formulation of the CTR model and our novel approximate inference algorithm, the Online Bayesian Inference algorithm for the CTR model (obiCTR). After that, we conduct extensive empirical studies and compare the proposed algorithms with the existing techniques, and finally we set out our conclusions.
2 Related work
In this section, we provide a brief review of prior studies related to our work and some background on the CTR model.
2.1 Online learning and online Bayesian inference
Online learning has been extensively studied in the machine learning community (Cesa-Bianchi and Lugosi 2006; Shalev-Shwartz 2011; Zhao et al. 2011; Hoi et al. 2014, 2013), mainly due to its high efficiency and scalability on large-scale learning tasks. Unlike conventional batch learning methods, which assume that all training instances are available prior to the learning phase, online learning considers one instance at a time and updates the predictive model sequentially and iteratively, which is more appropriate for large-scale applications where training data often arrive sequentially. A number of online learning algorithms have been proposed in the literature. A classical online learning method is the Perceptron algorithm (Rosenblatt 1958), which adopts an additive update rule for the classifier weights when a new instance is misclassified. More recently, many new online learning algorithms have been proposed based on the concept of “max margin” (Crammer et al. 2006; Crammer and Singer 2003; Gentile 2002). One notable technique in this category is the online Passive-Aggressive (PA) algorithm (Crammer et al. 2006). On each round, passive-aggressive algorithms solve a constrained optimization problem that balances two competing goals: being conservative, in order to retain information acquired on preceding rounds, and being corrective, in order to make a more accurate prediction when a new instance is misclassified or its classification score does not exceed some predefined margin. In particular, the PA method considers loss functions that enforce max-margin constraints, and its simple update rule enjoys a closed-form solution. Motivated by the PA method, Hoi et al. (2013) and Wang et al. (2014) apply parameter confidence information to improve online learning performance.
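To make the closed-form PA update concrete, the following is a minimal sketch for binary classification with the hinge loss, following the standard formulation in Crammer et al. (2006). The function name `pa_update` and the optional aggressiveness cap `C` (the PA-I variant) are illustrative choices and do not come from this paper.

```python
import numpy as np

def pa_update(w, x, y, C=None):
    """One Passive-Aggressive step for a label y in {-1, +1}.

    Illustrative sketch: `pa_update` and `C` are hypothetical names.
    """
    # Hinge loss with unit margin: zero when the example is classified
    # correctly with a margin of at least 1.
    loss = max(0.0, 1.0 - y * float(np.dot(w, x)))
    sq_norm = float(np.dot(x, x))
    if loss == 0.0 or sq_norm == 0.0:
        return w  # "passive": keep the current weights
    # "Aggressive": smallest step that satisfies the margin constraint.
    tau = loss / sq_norm
    if C is not None:
        tau = min(C, tau)  # PA-I caps the step size at C
    return w + tau * y * x

w0 = np.zeros(2)
x = np.array([1.0, 2.0])
w1 = pa_update(w0, x, +1)           # unconstrained PA step
w1_capped = pa_update(w0, x, +1, C=0.1)  # PA-I step with a cap
```

After the uncapped step the example sits on the margin, which is the sense in which the update is the minimal change that corrects the current instance.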
Although classical research on online learning is based on decision theory (Cesa-Bianchi and Lugosi 2006) and convex optimization (Shalev-Shwartz 2011), much progress has also been made in developing online Bayesian inference (Hoffman et al. 2010, 2013; Kingma and Welling 2013; Foulds et al. 2013). Rather than obtaining a single point estimate of the parameters, as is typical in the optimization-based setting, Bayesian methods attempt to obtain the full posterior distribution over the unknown parameters and latent variables in the model, hence providing better characterizations of the uncertainties in the learning process and avoiding overfitting. There are two categories of studies on online Bayesian inference. One direction is to extend Monte Carlo methods to the online setting. A classic approach is sequential Monte Carlo, or particle filters (Robert and Casella 2013), which can approximate virtually any sequence of probability distributions. More recently, Welling and Teh (2011), Patterson and Teh (2013) and Ahn et al. (2012) proposed stochastic gradient Langevin methods, which update parameters using both stochastic gradients and additional noise and asymptotically produce samples from the posterior distribution. The other direction is online variational Bayes (Hoffman et al. 2010, 2013; Kingma and Welling 2013; Foulds et al. 2013), where on each round only a mini-batch of instances is processed to give a noisy estimate of the gradient. Although these algorithms have shown impressive results, most of them adopt a stochastic approximation of the posterior distribution by subsampling a given finite data set, which is unsuitable for many applications where the data size is unknown in advance.
2.2 The graphical model of CTR and its variants
 1. For each user i, draw user latent vector \({\mathbf {u}}_i \sim {\mathcal {N}}(0,\frac{1}{\sigma _u^2}{\mathbf {I}}_K)\).
 2. For each item j,
   (a) Draw topic proportions \(\varvec{\theta }_j \sim Dirichlet(\alpha )\).
   (b) Draw item latent offset \(\varvec{\epsilon }_j \sim {\mathcal {N}}(0,\frac{1}{\sigma _\epsilon ^2}{\mathbf {I}}_K)\) and set the item latent vector as \({\mathbf {v}}_j\,=\,\varvec{\epsilon }_j+\varvec{\theta }_j\).
   (c) For each word \(w_{jn}\) (\(1\,\le \,n\,\le \,N_j\)),
     (i) Draw topic assignment \(z_{jn}\,\sim \,Mult(\varvec{\theta }_j)\).
     (ii) Draw word \(w_{jn}\,\sim \,Mult(\varvec{\phi }_{z_{jn}})\).
 3. For each user-item pair (i, j), draw the rating \( r_{ij}\,\sim \,{\mathcal {N}}({\mathbf {u}}_i^T{\mathbf {v}}_j,\frac{1}{\sigma _r^2}). \)
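The generative process above can be simulated in a few lines, which may help in checking the shapes and roles of the variables. This is a sketch under stated assumptions: the function name `sample_ctr` and all argument names are hypothetical, the second argument of each Gaussian is read as a (co)variance (so the standard deviation is \(1/\sigma\)), every item is given the same number of words, and the topic-word distributions \(\varvec{\phi }_k\), which the process treats as given, are drawn here from a flat Dirichlet purely for illustration.

```python
import numpy as np

def sample_ctr(n_users, n_items, K, vocab_size, n_words,
               alpha=1.0, sigma_u=1.0, sigma_eps=1.0, sigma_r=1.0, seed=0):
    """Draw one synthetic data set from the CTR generative process (sketch)."""
    rng = np.random.default_rng(seed)
    # Assumption: topic-word distributions phi_k are sampled for illustration.
    phi = rng.dirichlet(np.ones(vocab_size), size=K)             # (K, V)
    # Step 1: user latent vectors with variance 1 / sigma_u^2.
    U = rng.normal(0.0, 1.0 / sigma_u, size=(n_users, K))
    # Step 2(a): per-item topic proportions.
    theta = rng.dirichlet(alpha * np.ones(K), size=n_items)      # (J, K)
    # Step 2(b): item latent vector = offset + topic proportions.
    eps = rng.normal(0.0, 1.0 / sigma_eps, size=(n_items, K))
    V = eps + theta
    # Step 2(c): a topic assignment, then a word, for each position.
    words = np.empty((n_items, n_words), dtype=int)
    for j in range(n_items):
        z = rng.choice(K, size=n_words, p=theta[j])
        words[j] = [rng.choice(vocab_size, p=phi[k]) for k in z]
    # Step 3: ratings centred on u_i^T v_j with variance 1 / sigma_r^2.
    R = rng.normal(U @ V.T, 1.0 / sigma_r)
    return {"U": U, "V": V, "theta": theta, "words": words, "R": R}

data = sample_ctr(n_users=4, n_items=3, K=5, vocab_size=20, n_words=8)
```

Note that the sketch produces a dense rating matrix, whereas in practice only a small subset of the user-item pairs is observed.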
Researchers have extended CTR models to different applications of recommender systems. Some extended CTR models by integrating other side information. In CTRsmf (Purushotham et al. 2012), the authors integrated CTR with social matrix factorization models to take the social correlation between users into account. In LACTR (Kang and Lerman 2013), the authors assumed that users divide their limited attention non-uniformly over other people. In HFT (McAuley and Leskovec 2013), the authors aligned hidden factors in product ratings with hidden topics in product reviews for product recommendation. In CSTR (Ding et al. 2013), the authors explored how to recommend celebrities to general users in the context of social networks. In CTRSR (Wang et al. 2013), the authors adapted the CTR model by combining both the item-tag matrix and item content information for tag recommendation tasks. There were also several works that attempted to extract latent topic proportions of text information in CTR via deep learning techniques (Wang et al. 2014, 2015; Van den Oord et al. 2013). However, all of these works followed the same approximate inference procedure as Wang and Blei (2011), in a batch learning mode.
Independently of our study, Gopalan et al. (2014) developed a similar graphical model (CTPF) for the article recommendation task. Unlike CTR, which applies PMF as the recommendation model and LDA as the document model, CTPF replaces both the usual Gaussian likelihood in PMF and the multinomial likelihood in LDA with a Poisson likelihood. This modification makes the graphical model a simple conditionally conjugate model, which allows scalable approximate inference via stochastic variational inference. The graphical model of the original CTR, however, is a direct combination of PMF and LDA; it is a much more complex non-conjugate model, which makes its approximate inference nontrivial and challenging. In our work, we focus on the original graphical model of CTR and jointly optimize the combined objective function by using streaming variational inference. Moreover, CTR is widely used in different applications of recommender systems and has been extended to various kinds of graphical models. Our scalable approximate inference method can also be generalized to these graphical models.
3 Online Bayesian collaborative topic regression
3.1 Inference algorithm for CTR: revisited
Before introducing our novel Online Bayesian Inference algorithm for the CTR model (obiCTR), we first review the batch decoupled approximate inference method proposed by Wang and Blei (2011) (bdiCTR), which has been applied to other variants of CTR models (see Sect. 2.2 for more on related work).
Motivated by the online LDA methods (Hoffman et al. 2010; Mimno et al. 2012), we extend the bdiCTR algorithm to an online learning version (odiCTR) by incorporating the online LDA method (Hoffman et al. 2010). Online LDA is an EM-style method. In the E-step, it approximately finds locally optimal values of \(\varvec{\theta }_j\) via an iterative method, holding \(\mathbf {\Phi }\) fixed. Then, in the M-step, online LDA updates \(\mathbf {\Phi }\) using a weighted average of its previous value and a noisy estimate corresponding to \(\varvec{\theta }_j\). If we control the learning rate such that old values are forgotten gradually, the objective with respect to the posterior distribution \(p({\mathbf {Z}},\mathbf {\Phi },\mathbf {\Theta }\mid {\mathbf {W}})\) converges to a stationary point [more details can be found in Hoffman et al. (2010)].
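The M-step weighted average can be sketched as follows. The decaying learning rate \(\rho _t = (\tau _0 + t)^{-\kappa }\) is the schedule used by Hoffman et al. (2010); the function names, the default values of `tau0` and `kappa`, and the use of `phi` for the topic parameters (a variational parameter in the original online LDA formulation) are assumptions made for illustration, not this paper's algorithm.

```python
import numpy as np

def learning_rate(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t)^(-kappa).

    Hoffman et al. (2010) require kappa in (0.5, 1] for convergence.
    """
    return (tau0 + t) ** (-kappa)

def m_step(phi, phi_hat, t, tau0=1.0, kappa=0.7):
    """Blend the previous topic parameters with the noisy estimate.

    phi     : current topic parameters, shape (K, V)
    phi_hat : estimate computed from the current document(s) only
    t       : index of the current update, starting at 0
    """
    rho = learning_rate(t, tau0, kappa)
    return (1.0 - rho) * phi + rho * phi_hat

phi_old = np.full((2, 3), 1.0)
phi_noisy = np.full((2, 3), 3.0)
phi_new = m_step(phi_old, phi_noisy, t=1, tau0=1.0, kappa=1.0)
```

Because \(\rho _t\) shrinks as `t` grows, early estimates are forgotten gradually while later documents perturb the topics less and less.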
The main challenge is the joint optimization of the CTR model in an online learning fashion. To start off, we first present the inefficient (baseline) approaches, bdiCTR and odiCTR, and then present our novel Online Bayesian Inference algorithm for the CTR model (obiCTR).
3.2 The online Bayesian inference algorithm for CTR model (obiCTR)
4 Experimental results
4.1 Dataset
Our experiments were conducted on two extended MovieLens datasets, named “MovieLens10MPlot” and “MovieLens1MPlot”,^{4} which were derived from MovieLens.^{5} Specifically, the original MovieLens 10M dataset provides a total of 10,000,053 rating records for 10,681 movies (items) by 69,878 users. However, the original dataset has very limited text content. We enriched the dataset by collecting additional text content for each movie item. Specifically, for each movie item, we first used its identifier number to find the movie on the IMDb^{6} website, and then collected its “plot summary” text. We then combined the “plot summary” text with each movie’s title and category text given in the MovieLens10M dataset into a single text document representing the movie. For text preprocessing, we follow the same procedure as the one described in Wang and Blei (2011). Finally, we form a vocabulary of 7,689 distinct words. We then randomly select 1 million rating records to form a smaller dataset named “MovieLens1MPlot”. Note that we did not consider the CiteULike dataset^{7} used in the previous study (Wang and Blei 2011), because that dataset only provides “like” and “dislike” preferences, which are a kind of implicit feedback and thus unsuitable for our regression task. By contrast, the MovieLens10M dataset has explicit feedback with ratings ranging from 1 to 5.
4.2 Experimental setup and metric
4.3 Baselines for comparison and experimental settings

PAI: an online learning algorithm for online collaborative filtering tasks that applies the popular online Passive-Aggressive (PA) algorithm (Blondel et al. 2014);

bdiCTR: the existing Collaborative Topic Regression (Wang and Blei 2011). In our context, we replace the ALS algorithm (Hu et al. 2008) with the SGD algorithm (Koren et al. 2009) since the rating data are explicit, and keep the rest the same as the original CTR (note that the LDA step is still performed in a batch manner);

odiCTR: the proposed Online Decoupled Inference algorithm for the CTR model in Algorithm 2;

obiCTR: the proposed Online Bayesian Inference algorithm for the CTR model in Algorithm 3;

OnlineLDA: an online Bayesian variational inference algorithm for the LDA model (Hoffman et al. 2010). We take it as a baseline to evaluate how well the model fits the data with the predictive distribution.
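For readers unfamiliar with the SGD step that replaces ALS in the bdiCTR baseline, the following sketch shows one stochastic gradient update of a regularized squared-error matrix factorization objective on a single explicit rating, in the style of Koren et al. (2009). The function name `sgd_step`, the learning rate `lr` and the regularization weight `reg` are hypothetical, and the sketch omits the topic-proportion term that CTR adds to the item vector.

```python
import numpy as np

def sgd_step(u, v, r, lr=0.01, reg=0.05):
    """One SGD update for a single observed rating r of item v by user u.

    Minimizes (r - u.v)^2 + reg * (|u|^2 + |v|^2), with constant
    factors absorbed into the learning rate.
    """
    err = r - float(np.dot(u, v))        # prediction error before the update
    u_new = u + lr * (err * v - reg * u)
    v_new = v + lr * (err * u - reg * v)  # uses the old u, not u_new
    return u_new, v_new, err

u = np.array([0.1, 0.1])
v = np.array([0.1, 0.1])
u1, v1, err_before = sgd_step(u, v, r=4.0)
err_after = 4.0 - float(np.dot(u1, v1))
```

In an online setting, ratings are processed one at a time in arrival order, and each rating triggers exactly one such update for the corresponding user and item vectors.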
4.4 Evaluation of online rating prediction tasks
Figure 2a–c compares the online performance of the above methods for \(K\,=\,5\), \(K\,=\,10\) and \(K=20\) on the MovieLens10MPlot and MovieLens1MPlot datasets. Note that the bdiCTR method needs to precompute the parameters \(\mathbf {\Theta }\) and \(\mathbf {\Phi }\) by a batch variational inference algorithm.^{8} Figure 2 shows only its performance in the downstream collaborative filtering phase.
RMSE results after a single pass over the MovieLens10MPlot and MovieLens1MPlot datasets

| MovieLens10MPlot | K = 5 | K = 10 | K = 20 |
|---|---|---|---|
| PAI | 0.9176 ± 0.0004 | 0.9085 ± 0.0002 | 0.9148 ± 0.0003 |
| bdiCTR | 0.8874 ± 0.0003 | 0.8812 ± 0.0005 | 0.8947 ± 0.0007 |
| odiCTR | 0.9034 ± 0.0006 | 0.9054 ± 0.0008 | 0.9085 ± 0.0002 |
| obiCTR | 0.8763 ± 0.0006 | 0.8788 ± 0.0001 | 0.8747 ± 0.0006 |

| MovieLens1MPlot | K = 5 | K = 10 | K = 20 |
|---|---|---|---|
| PAI | 0.9692 ± 0.0007 | 0.9547 ± 0.0008 | 0.9775 ± 0.0000 |
| bdiCTR | 0.9488 ± 0.0004 | 0.9488 ± 0.0003 | 0.9548 ± 0.0007 |
| odiCTR | 0.9805 ± 0.0004 | 0.9809 ± 0.0003 | 0.9826 ± 0.0003 |
| obiCTR | 0.9390 ± 0.0001 | 0.9393 ± 0.0006 | 0.9392 ± 0.0006 |
4.5 Performance on online topic modeling tasks
Interpretability of the latent structures learned

Top topics by obiCTR
1. comedy, children, romance, animal, music, fantasy, drama, friend, family
2. work, find, die, life, only, time, kill, event, end, plan, final

Top topics by bdiCTR
1. adventure, story, young, ring, king, prince, come, toy, music, world, begin, place
2. thriller, help, kill, mission, murder, lawyer, harry, evil, want, live, discover

| In user’s ratings | \(r\) | \({\hat{r}}_{obiCTR}\) | \({\hat{r}}_{bdiCTR}\) |
|---|---|---|---|
| Sound of music | 4.5 | 4.4 | 4.7 |
| 1984 (Nineteen eighty-four) | 4 | 3.9 | 4.4 |
| Fantasia | 5 | 4.2 | 4.2 |
| Finding nemo | 5 | 4.5 | 4.2 |
| Schindler’s list | 5 | 4.8 | 5 |
| Memento | 5 | 4.7 | 5 |
| Star wars: Episode IV | 4.5 | 4.6 | 4.8 |
| Matrix reloaded, The | 3 | 3.5 | 3.9 |
| Life is beautiful | 5 | 4.9 | 4.6 |
| City of God | 4.5 | 4.7 | 4.8 |
4.6 Case study
4.7 Evaluation of parameter sensitivity
Running time (in seconds) for each model size (K)

| Method | K = 5 | K = 10 | K = 20 | K = 50 | K = 100 |
|---|---|---|---|---|---|
| OnlineLDA | 725 | 892 | 1790 | 4004 | 7901 |
| bdiCTR | 5338 | 6407 | 14456 | 31125 | 63048 |
| odiCTR | 1177 | 1571 | 3085 | 6256 | 12816 |
| obiCTR | 1839 | 2372 | 4817 | 10372 | 21013 |
5 Conclusion
This paper investigated online learning algorithms that make inference for the Collaborative Topic Regression (CTR) model practical for real-world online recommender systems. Specifically, unlike bdiCTR, which loosely combines LDA and PMF, we proposed a novel Online Bayesian Inference algorithm for the CTR model (obiCTR), which performs a joint optimization of both LDA and PMF to achieve a tight coupling. Our encouraging results showed that obiCTR converges much faster than the other competing algorithms in the online learning setting and achieves the best prediction performance among all the compared algorithms. Our future work will analyze the model interpretability and theoretical performance of the proposed algorithms.
Acknowledgements
This research is supported in part by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centres in Singapore Funding Initiative, and the China Knowledge Centre for Engineering Sciences and Technology (No. CKCEST201415). This work was done when the first two authors visited Prof Hoi’s research group at Singapore Management University.
References
 Agarwal, D. & Chen, B.-C. (2010). fLDA: Matrix factorization through latent Dirichlet allocation. In Proceedings of the third ACM international conference on web search and data mining (pp. 91–100). ACM.
 Ahn, S., Korattikara, A. & Welling, M. (2012). Bayesian posterior sampling via stochastic gradient fisher scoring. arXiv preprint arXiv:1206.6380.
 Almazro, D., Shahatah, G., Albdulkarim, L., Kherees, M., Martinez, R. & Nzoukou, W. (2010). A survey paper on recommender systems. arXiv preprint arXiv:1006.5278.
 Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
 Blondel, M., Kubo, Y. & Ueda, N. (2014). Online passive-aggressive algorithms for non-negative matrix factorization and completion. In Proceedings of the seventeenth international conference on artificial intelligence and statistics (pp. 96–104).
 Breese, J. S., Heckerman, D. & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the fourteenth conference on uncertainty in artificial intelligence (pp. 43–52). Morgan Kaufmann Publishers Inc.
 Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., & Jordan, M. I. (2013). Streaming variational Bayes. Advances in Neural Information Processing Systems, 26, 1727–1735.
 Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.
 Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7, 551–585.
 Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. The Journal of Machine Learning Research, 3, 951–991.
 Ding, X., Jin, X., Li, Y., & Li, L. (2013). Celebrity recommendation with collaborative social topic regression. In Proceedings of the twenty-third international joint conference on artificial intelligence (pp. 2612–2618). AAAI Press.
 Foulds, J., Boyles, L., DuBois, C., Smyth, P. & Welling, M. (2013). Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 446–454). ACM.
 Gentile, C. (2002). A new approximate maximal margin classification algorithm. The Journal of Machine Learning Research, 2, 213–242.
 Gopalan, P. K., Charlin, L., & Blei, D. (2014). Content-based recommendations with Poisson factorization. Advances in Neural Information Processing Systems, 27, 3176–3184.
 Hoffman, M., Bach, F. R., & Blei, D. M. (2010). Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23, 856–864.
 Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1), 1303–1347.
 Hoi, S. C., Jin, R., Zhao, P., & Yang, T. (2013). Online multiple kernel classification. Machine Learning, 90(2), 289–316.
 Hoi, S. C., Wang, J., & Zhao, P. (2014). Libol: A library for online learning algorithms. The Journal of Machine Learning Research, 15(1), 495–499.
 Hoi, S. C., Zhao, P., Zhao, P. & Hoi, S. C. (2013). Cost-sensitive double updating online learning and its application to online anomaly detection. SDM, SIAM, pp. 207–215.
 Honkela, A., & Valpola, H. (2003). Online variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation (pp. 803–808).
 Hu, Y., Koren, Y. & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. In Eighth IEEE international conference on data mining (pp. 263–272). IEEE.
 Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
 Kang, J.-H. & Lerman, K. (2013). LACTR: A limited attention collaborative topic regression for social media. arXiv preprint arXiv:1311.1247.
 Kingma, D. P. & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 Koren, Y., Bell, R., Volinsky, C., et al. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
 Lu, Z., Dou, Z., Lian, J., Xie, X. & Yang, Q. (2015). Content-based collaborative filtering for news topic recommendation. In Twenty-ninth AAAI conference on artificial intelligence.
 McAuley, J. & Leskovec, J. (2013). Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on recommender systems (pp. 165–172). ACM.
 Mcauliffe, J. D., & Blei, D. M. (2008). Supervised topic models. Advances in Neural Information Processing Systems, 21, 121–128.
 McInerney, J., Ranganath, R., & Blei, D. (2015). The population posterior and Bayesian modeling on streams. Advances in Neural Information Processing Systems, 28, 1153–1161.
 Mimno, D., Hoffman, M. & Blei, D. (2012). Sparse stochastic inference for latent Dirichlet allocation. arXiv preprint arXiv:1206.6425.
 Mnih, A., & Salakhutdinov, R. (2007). Probabilistic matrix factorization. Advances in Neural Information Processing Systems, 20, 1257–1264.
 Patterson, S., & Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Advances in Neural Information Processing Systems, 26, 3102–3110.
 Purushotham, S., Liu, Y. & Kuo, C.-C. J. (2012). Collaborative topic regression with social matrix factorization for recommendation systems. arXiv preprint arXiv:1206.4684.
 Robert, C., & Casella, G. (2013). Monte Carlo statistical methods. Berlin: Springer-Verlag New York, Inc.
 Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
 Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.
 Shi, T. & Zhu, J. (2014). Online Bayesian passive-aggressive learning. In Proceedings of the 31st international conference on machine learning (pp. 378–386).
 Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009, 4.
 Van den Oord, A., Dieleman, S., & Schrauwen, B. (2013). Deep content-based music recommendation. Advances in Neural Information Processing Systems, 26, 2643–2651.
 Wang, C. & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 448–456). ACM.
 Wang, H., Chen, B., & Li, W.-J. (2013). Collaborative topic regression with social regularization for tag recommendation. In Proceedings of the twenty-third international joint conference on artificial intelligence (pp. 2719–2725). AAAI Press.
 Wang, H., Shi, X. & Yeung, D.-Y. (2015). Relational stacked denoising autoencoder for tag recommendation. In Twenty-ninth AAAI conference on artificial intelligence.
 Wang, H., Wang, N. & Yeung, D.-Y. (2014). Collaborative deep learning for recommender systems. arXiv preprint arXiv:1409.2944.
 Wang, J., Zhao, P., & Hoi, S. C. (2014). Cost-sensitive online classification. IEEE Transactions on Knowledge and Data Engineering, 26(10), 2425–2438.
 Welling, M. & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 681–688).
 Zhao, P., Hoi, S. C., & Jin, R. (2011). Double updating online learning. The Journal of Machine Learning Research, 12, 1587–1615.
 Zhu, J., Ahmed, A., & Xing, E. P. (2012). MedLDA: Maximum margin supervised topic models. The Journal of Machine Learning Research, 13(1), 2237–2278.