# DOLDA: a regularized supervised topic model for high-dimensional multi-class regression

- 91 Downloads

## Abstract

Generating user interpretable multi-class predictions in data-rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant probit model (Johndrow et al., in: Proceedings of the sixteenth international conference on artificial intelligence and statistics, 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al. in Biometrika 97:465–480, 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model’s predictive accuracy and scalability, and demonstrate DOLDA’s advantage in interpreting the generated predictions.

## Keywords

Text classification Latent Dirichlet Allocation Horseshoe prior Diagonal Orthant probit model Interpretable models## 1 Introduction

During recent decades more and more textual data has become available, creating a growing need to statistically analyze large amounts of textual data. The popular Latent Dirichlet Allocation (LDA) model introduced by Blei et al. (2003) is a generative probabilistic model in which each document is summarized by a set of latent semantic themes, often called *topics*. Formally, a topic is a probability distribution over the vocabulary. An estimated LDA model is, therefore, a compressed latent representation of the documents where each document is a mixture of topics and where each word (token) in a document belongs to a single topic. Most probabilistic topic models, such as LDA, are unsupervised, i.e. the topics are learned solely from the words in the documents without access to other document meta-data.

In many situations, though, there is other information we would like to incorporate in modeling a corpus of documents. A common example is when we have labeled documents, such as ratings of movies together with a movie description, illness categories in medical journals, or the locations of identified bugs together with bug reports in software engineering applications. The simplest approach would be to use a standard topic model and then use the estimated topic distributions per document in another model, such as a logistic regression model. This two-step approach would result in topics that are not produced for the purpose of explaining the dependent variable of interest. Alternatively, one could use a supervised topic model to find the semantic topic structure in the documents that are related to the class of interest. The difference between a supervised topic model and a two-step approach is similar to the difference between principal component regression (PCR) and partial least squares (PLS). In PCR, the principal components are first computed and then a regression model is estimated based on the estimated components. In PLS the components are estimated together with the regression model with the purpose of estimating components that have good predictive performance for the dependent variable of interest (Geladi and Kowalski 1986).

Most of the proposed supervised topic models have been designed with the objective of by trying to find good text classification models, and the focus has naturally been on the predictive performance. However, the predictive performance of most supervised topic models is similar to that of using a Support Vector Machine (SVM) with covariates based on word frequencies (Jameel et al. 2015). While predictive performance is certainly important, the real attraction of supervised topic models comes from their ability to learn semantically relevant topics and to use those topics to produce accurate *interpretable* predictions of documents or other textual data. The interpretability of a model is an often-neglected feature, but it is crucial in real-world applications. As an example, Parnin and Orso (2011) show that bug fault localization systems are quickly disregarded when the users cannot understand how the system has reached its predictive conclusion. Compared to other text classification systems, topic models are very well suited for interpretable predictions since topics are abstract entities that humans can easily grasp. The problems of interpretability in multi-class supervised topic models can be divided into three main areas.

First, most supervised topic models use a logit or probit approach, where the model is identified by the use of a *reference category* to which the effect of any covariate is compared. This defeats one of the main purposes of supervised topic models since it complicates the interpretability of the models. Instead of interpreting the effect of a topic on a class, we need to interpret it as the effect on a class compared to the reference category.

Second, to handle multi-class categorization *a topic should be able to affect multiple classes*, and some topics *may not influence any class at all*. In most supervised topic modeling approaches (such as Jiang et al. 2012; Zhu et al. 2013; Jameel et al. 2015) the multi-class problem is solved using binary classifiers in a “one-vs-all” classification approach. This approach works well in the situation of evenly distributed classes, but may not work well for skewed class distributions (Rubin et al. 2012). A one-vs-all approach also makes it more difficult to interpret the model. Estimating one model per class makes it impossible to see which classes are affected by the same topic and which topics do not predict any label. In these situations, we would like to have *one* topic model to interpret. The approach of one-vs-all is also costly from an estimation point of view since we need to estimate one model per class (Zheng et al. 2015), something that can be difficult in a situation with hundreds of classes.

Third, there can be situations with hundreds of classes and hundreds of topics [see Jonsson et al. (2016) for an example]. Without *regularization* or variable selection we would end up with a model with too many parameters to interpret and uncertain parameter estimates. In a good predictive supervised topic model, one would like to find a small set of topics that are strong determinants of a single document class label. This is especially relevant when the numbers of observations in different classes are skewed, which is a common problem in real-world situations (Rubin et al. 2012). In the more rare classes, we would like to induce more shrinkage compared to more common classes.

Multi-class regression is a non-trivial problem in Bayesian modeling. Historically, the multinomial probit model has been preferred due to the data augmentation approach proposed by Albert and Chib (1993). Augmenting the sampler using latent variables lead to straightforward Gibbs sampling with conditionally conjugate updates of the regression coefficients. The Albert-Chib sampler often tend to mix slowly, and the same holds for improved samplers such as the parameter expansion approach in Imai and van Dyk (2005). Recently, Polson et al. (2013) have proposed a similar data augmentation approach using Polya-gamma variables for the Bayesian logistic regression model. This approach preserves conditional conjugacy in the case of a Normal prior for the regression coefficients and was the foundation for the supervised topic model in Zhu et al. (2013).

In addition to the issue of interpretability of models, scalability of topic models is of crucial importance. MCMC algorithms are generally considered to be computationally costly. In the case of probabilistic topic models, this is even more prominent, since we often sample at least one parameter per word for the whole corpus. Modern corpora can be very large, with millions of documents, making efficient and parallel sampling a crucial component.

In this paper we explore a new approach to supervised topic models that produces accurate multi-class predictions from semantically interpretable topics using a fully Bayesian approach, hence solving all three of the above-mentioned problems. The model combines LDA with the recently proposed Diagonal Orthant (DO) probit model (Johndrow et al. 2013) for multi-class classification, with an efficient Horseshoe prior that achieves sparsity and interpretation by aggressive shrinkage (Carvalho et al. 2010). The new Diagonal Orthant Latent Dirichlet Allocation (DOLDA)^{1} model has been demonstrated to have competitive predictive performance, while still producing interpretable multi-class predictions from semantically relevant topics. In addition, we also derive an efficient and parallel MCMC sampler that can be used to scale up model inference to larger corpora.

The paper is organized as follows. In Sect. 2, we describe the new proposed model and in Sect. 3 we then describe the proposed scalable MCMC sampler for the proposed model. In Sect. 4, the experimental results are presented and in Sect. 5 we conclude the paper. In the “Appendix”, a full derivation of the sampler is supplied.

## 2 Related work

Incorporating supervised information or meta-data in the estimation of topic models has been done in a large number of papers, such as Rosen-Zvi et al. (2004), Griffiths et al. (2005) and Chemudugunta et al. (2007) that incorporate author information, syntax and background structures, to give a few early examples. How topic models incorporate the labeled information, such as classes, can broadly be classified into two groups, *downstream* supervised models and *upstream* supervised models. In upstream topic models, loosely defined, the topics are conditioned on the labeled information, while in downstream topic models, the labeled information is conditioned on the topics. Examples of upstream topic models are topics conditioned on authorship (Rosen-Zvi et al. 2004), topical perspectives (Ahmed and Xing 2010) and more general supervised information (Mimno and McCallum 2012).

In downstream topic models, of which our proposed model is an example, the label information is instead conditioned on the topics, similar to conditioning on covariates in a linear regression model or a logistic regression model. One of the first approaches was proposed by McAuliffe and Blei (2008) were the authors propose a supervised topic model based on the generalized linear model framework, thereby making it possible to link binary, count, and continuous response variables to topics that are inferred jointly with the regression/classification effects. This idea has been elaborated further, especially in the case of classification, in a series of paper, all closely connected to this work. The three most related approaches are Jiang et al. (2012), that propose a downstream supervised topic model using a max-margin classification, Zhu et al. (2013) that propose a logistic supervised topic model using data augmentation with Polya-gamma variates and Perotte et al. (2011) that use a hierarchical binary probit approach to model a hierarchical label structure in the form of a binary tree. All of these models are downstream supervised topic models using MCMC for inference and different forms of data augmentation approaches to model classes. Zhu et al. (2013) and Perotte et al. (2011) in combine a data augmentation strategy together with a linear model, just our proposed model. Compared to these earlier work we differ in two major ways. First, we use the diagonal orthant data augmentation scheme that is computationally attractive for many classes, essentially addressing the issue of scalability in the number of classes. In addition, unlike the work of Zhu et al. (2013), Perotte et al. (2011) and Jiang et al. (2012), we focus on interpretability by using the horseshoe prior (Carvalho et al. 2010), something not done previously to ours. We also, just like (Zheng et al. 2015), focus on scalability in supervised topic models, but unlike (Zheng et al. 2015), we do not use Metropolis–Hastings to improve computational complexity. Instead, our approach focuses on the use of exchangeability to enable parallelism and analytically updates to make computations more efficient.

## 3 Diagonal Orthant Latent Dirichlet Allocation

### 3.1 Handling the challenges of high-dimensional interpretable supervised topic models

To solve the first and second challenges identified in the Introduction, i.e., reference classes and multi-class models, we propose the use of the Diagonal Orthant (DO) probit model in Johndrow et al. (2013) as an alternative to the multinomial probit and logit models. Johndrow et al. (2013) propose a Gibbs sampler for the Diagonal Orthant model and show that it mixes well. One of the benefits of the DO model is that all classes can be independently modeled using binary probit models when conditioning on the latent variable, thereby removing the need for a reference class. The parameters of the model can be interpreted as the effect of the covariate on the marginal probability of a specific class, which makes this model especially attractive when it comes to interpreting the inferred topics. This model also includes multiple classes in an efficient way that makes it possible to estimate a multi-class linear model in parallel over the classes.

The third problem of modeling supervised topic models is that the semantic meanings of all topics do not necessarily have an effect on our label of interest; one topic may have an effect on one or more classes, and some topics may just be noise that we do not want to use in the supervision. In cases where there are many topics and many classes, we will also have a very large number of parameters to analyze. The Horseshoe prior in Carvalho et al. (2010) was specifically designed to filter out signals from massive noise. This prior uses a local-global shrinkage approach to shrink some (or most) coefficients to zero while allowing for sparse signals to be estimated without any shrinkage. This approach has shown good performance in linear regression-type situations (Castillo et al. 2015), with many predictors (Nalenz and Villani 2018), which makes it straightforward to incorporate other covariates into our model, something that is rarely done in the area of supervised topic models. Different global shrinkage parameters are used for the different classes to handle the problem with an unbalanced number of observations in different classes. This makes it possible to shrink more when there is less data for a given class and shrink less in classes with more observations.

### 3.2 Generative model

- (1)For each topic \(k=1,\ldots ,K\)
- (a)
Draw a distribution over words \(\phi _{k}\sim \text{ Dir }_{V}(\beta )\)

- (a)
- (2)For each label \(l\in L\)
- (a)
Draw a global shrinkage parameter \(\tau _{l}\sim C^{+}(0,1)\)

- (b)For each covariate and topic \(p=1,\ldots ,K+P\)
- (i)
Draw local shrinkage parameter \(\lambda _{l,p}\sim C^{+}(0,1)\)

- (ii)
Draw coefficients

^{2}\(\eta _{l,p}\sim {\mathcal {N}}(0,\tau _{l}^{2}\lambda _{l,p}^{2})\)

- (i)

- (a)
- (3)For each observation/document \(d=1,\ldots ,D\)
- (a)
Draw topic proportions \(\theta _{d}\sim \text{ Dir }_{K}(\alpha )\)

- (b)For each token \(n=1,\ldots ,N_{d}\)
- (i)
Draw topic assignment \(z_{n,d}|\theta _{d}\sim \text{ Categorical }(\theta _{d})\)

- (ii)
Draw word \(w_{n,d}|z_{n,d},\phi _{z_{n,d}}\sim \text{ Categorical }(\phi _{z_{n,d}})\)

- (i)
- (c)\(y_{d}\sim \text{ Categorical }({\mathbf {p}}_{d})\) whereand \(F_{{\mathcal {N}}}(\cdot )\) is the univariate normal CDF (Johndrow et al. 2013).$$\begin{aligned} {\mathbf {p}}_{d}=\left[ \sum _{l}^{L}cdf_{{\mathcal {N}},l}\left( (\bar{{\mathbf {z}}}, {\mathbf {x}})_{d}^{\top }\eta _{l\cdot }\right) \right] ^{-1}\left( F_{{\mathcal {N}}} \left( (\bar{{\mathbf {z}}},{\mathbf {x}})_{d}^{\top }\eta _{1\cdot }\right) ,\ldots ,F_{{\mathcal {N}}}\left( (\bar{{\mathbf {z}}},{\mathbf {x}})_{d}^{\top } \eta _{L\cdot }\right) \right) \end{aligned}$$

- (a)

DOLDA model notation

Symbol | Description | Symbol | Description |
---|---|---|---|

\({\mathcal {V}}\) | The set of word types/vocabulary | \(\beta \) | The prior for \(\Phi \): \(K\times V\) |

| The size of the vocabulary i.e \(V=|{\mathcal {V}}|\) | \(\Theta \) | Document-topic proportions: \(D\times K\) |

| Word type | \(\theta _{d}\) | Topic probability for document |

\({\mathcal {K}}\) | The set of topics | \(\alpha \) | The prior for \(\Theta \): \(D\times K\) |

| The number of topics i.e \(K=|{\mathcal {K}}|\) | \({\mathbf {M}}\) | # of topic indicators in each document: \(D\times K\) |

| The number of labels/categories | | Matrix of latent gaussian variables: \(D\times L\) |

\({\mathcal {L}}\) | The set of labels/categories | \(\eta \) | Coefficient matrix for each label and covariate: \((K+P)\times L\) |

| The number of observations/documents i.e. \(D=|{\mathcal {D}}|\) | \(\eta _{0}\) | Prior for \(\eta \): \(L\times (K+P)\) |

\({\mathcal {D}}\) | The set of observations/documents | \(z_{n,d}\) | Topic indicator for token |

| The number of non-topic covariates/features | \(\bar{{\mathbf {Z}}}\) | Proportion of topic indicators by document: \(D\times K\) |

\(F_{\mathcal {N}}(\cdot )\) | The univariate standard Normal cdf | \(\bar{{\mathbf {z}}}_d\) | Proportion of topic indicators for document |

| The total number of tokens | \(w_{n,d}\) | Token |

\(N_{d}\) | The number of tokens in document | \({\mathbf {w}}_{d}\) | Vector of tokens in document |

\({\mathbf {N}}\) | # obs topic-word type indicators: \(K\times V\) | \(y_{d}\) | Label for document |

\(\Phi \) | The matrix with word-topic probabilities : \(K\times V\) | \({\mathbf {X}}\) | Covariate/feature matrix (including intercept): \(D\times P\) |

\(\phi _{k}\) | The word probabilities for topic | \({\mathbf {x}}_{d}\) | Covariate/features for document |

## 4 Inference

### 4.1 The MCMC algorithm

- (1)
Sample the latent variables \(a_{d,l}\sim {{\mathcal {N}}}_{+}(({\mathbf {x}}\,\bar{{\mathbf {z}}})_{d}^{T}\eta _{l},1)\) for \(l=y_{d}\) and \(a_{d,l}\sim {{\mathcal {N}}}_{-}(({\mathbf {x}}\,\bar{{\mathbf {z}}})_{d}^{T}\eta _{l},1)\) for \(l\ne y_{d}\), where \({\mathcal {N}}_{+}\) and \({\mathcal {N}}_{-}\) are the positive and negative truncated normal distribution, truncated at 0.

- (2)
Sample all of the regression coefficients as in an ordinary Bayesian linear regression per class label

*l*where \(\eta _{l}\sim {\mathcal {MVN}}\left( \mu _{l},(({\mathbf {X}}\,\bar{{\mathbf {Z}}})^{T}{({\mathbf {X}}\,\bar{{\mathbf {Z}}})}+\tau _{l}^{2}\Lambda _{l})^{-1}\right) \) and \(\Lambda _{l}\) is a diagonal matrix with the local shrinkage parameters \(\lambda _{i}\) per parameter in \(\eta _{l}\) and \(\mu _{l}=(({\mathbf {X}}\, \bar{{\mathbf {z}}})^{T}({\mathbf {X}}\, \bar{{\mathbf {z}}})+\tau _{l}^{2}\Lambda _{l})^{-1}({\mathbf {X}}\, \bar{{\mathbf {z}}})^{T}{\mathbf {a}}_{l}\) - (3)Sample the global shrinkage parameters \(\tau _{l}\) at iteration
*j*using the following two step slice sampling:where$$\begin{aligned} u\sim & {} {\mathcal {U}}\left( 0,\left[ 1+\frac{1}{\tau _{l,(j-1)}}\right] {}^{-1}\right) \\ \frac{1}{\tau _{l,j}^{2}}\sim & {} {{\mathcal {G}}}\left( (p+1)/2, \frac{1}{2}\sum _{p=1}^{K+P}\left( \frac{\eta _{l,p}}{\lambda _{l,p}} \right) ^{2}\right) I\left[ \frac{1}{\tau _{l,(j-1)}^{2}}<(1-u)/u\right] \end{aligned}$$*I*indicates the truncation region for the truncated gamma. - (4)Sample each local shrinkage parameter \(\lambda _{i,l}\) at iteration
*j*aswhere$$\begin{aligned} u\sim & {} {\mathcal {U}}\left( 0,\left[ 1+\frac{1}{\lambda _{p,l,(j-1)}^{2}} \right] {}^{-1}\right) \\ \frac{1}{\lambda _{p,l,j}^{2}}\sim & {} \text {Exp}\left( \frac{1}{2}\left( \frac{\eta _{l,p}}{\tau _{l}} \right) ^{2}\right) I\left[ \frac{1}{\lambda _{p,l,(j-1)}^{2}}<(1-u)/u\right] \end{aligned}$$*I*indicates the truncation region for the truncated exponential distribution. - (5)Sample the topic indicators \({\mathbf {z}}\)where \({\mathbf {M}}\) is a \(D\times K\) count matrix containing the sufficient statistics for \(\Theta \) and \({\mathbf {M}}^{\lnot i}\) is this matrix with topic indicator \(z_{i,d}\) removed from \({\mathbf {M}}\).$$\begin{aligned} p(z_{i,d}= & {} k|w_{i},{\mathbf {z}}^{\lnot i},\eta ,{\mathbf {a}}) \propto \phi _{v,k}\cdot \left( {\mathbf {M}}_{d,k}^{\lnot i}+\alpha \right) \\&\times \exp \left( -\frac{1}{2}\sum _{l}^{L} \left[ -2\frac{\eta _{l,k}}{N_{d}}\left( a_{d,l}-(\bar{{\mathbf {z}}}_{d}^{\lnot i} \, {\mathbf {x}}_{d})\eta _{l}^{\intercal }\right) + \left( \frac{\eta _{l,k}}{N_{d}}\right) ^{2}\right] \right) \end{aligned}$$
- (6)Sample the topic-vocabulary distributions \(\Phi \)where \({\mathbf {N}}\) is a \(K\times V\) count matrix containing the sufficient statistics for \(\Phi \).$$\begin{aligned} \phi _{k}\sim \text{ Dir }(\beta +{\mathbf {N}}_{k,\cdot }) \end{aligned}$$

### 4.2 Efficient parallel sampling of \({\mathbf {z}}\)

To improve the speed of the sampler we cache the calculations done in the supervised part of the topic indicator sampler and parallelize the sampler. Very large text corpora are increasingly common, so efficient sampling of the \({\mathbf {z}}\) is absolutely crucial in practice. The basic sampler for \({\mathbf {z}}\) can be slow due to the serial nature of the collapsed sampler and the fact that the supervised part of \(p(z_{i,d})\) involves a dot product. A naive implementation would result in a complexity of \(O((K+P) \cdot L \cdot K)\) to sample just one topic indicator \(z_i\).

*d*can be expressed as \(g_{d,k}^{\lnot i}\) where

*O*(

*K*) for all but the first token per document. For details, see “Appendix B”.

To further improve the performance we parallelize the sampler and use the fact that documents are conditionally independent given \(\Phi \). By sampling \(\Phi \) instead of marginalizing it out we will gain from parallelization with the additional cost of sampling \(\Phi \). This approach to parallelizing topic models give us a sampler that correctly samples the posterior using an ergodic Markov chain, unlike other parallel approaches such as AD-LDA (Magnusson et al. 2018; Newman et al. 2009).

In summary, we propose a sampler that samples the \(z_{i,d}\) in parallel over the documents, the elements in \(\Phi \) sampled in parallel over topics, and sampling \(\eta \) can be in parallel over classes. The code is publicly available at https://github.com/lejon/DiagonalOrthantLDA.

### 4.3 Computational complexity

The computational complexity of the sampler depends on the different parameters sampled. Below we analyze the different parts of the sampler for one iteration. Sampling the regression coefficients has three parts, (1) computing \(\Lambda _{post}=(({\mathbf {X}}\, \bar{{\mathbf {z}}})^{T}({\mathbf {X}}\, \bar{{\mathbf {z}}})+\tau _{l}^{2}\Lambda _{l})\) is of complexity \(O(L \cdot D \cdot (K+P)^2)\), (2) inverting \(\Lambda _{post}\) for all classes is of complexity \(O(L \cdot (K+P)^3)\), and (3) sampling from the multivariate Gaussian distribution of each \(\eta _l\) is also of complexity \(O(L \cdot (K+P)^3)\). Sampling the topic indicators has complexity \(O((K+P) \cdot L \cdot K \cdot D)\) for the first topic indicator in each document, a complexity dominated by \(O(L \cdot D \cdot (K+P)^2)\). If we use the method for increased efficiency proposed above, all other topic indicators can be sampled with complexity \(O(K \cdot N)\). Sampling the latent variables \({\mathbf {a}}\) is of complexity \((K+P) \cdot D \cdot L\) and is hence dominated by the sampling of \(\eta \). Similarly, sampling \(\tau \) and \(\lambda \) is of complexity \((K+P) \cdot L\) and is also dominated by the sampling of \(\eta \). Sampling \(\Phi \) is of complexity \(O(K \cdot V)\), something that is generally dominated by sampling the topic indicators \(O(K \cdot N)\), since topically \(N>>V\) (Magnusson et al. 2018).

Finally, the total complexity of the sampler, with regard to the number of classes (*L*), the number of topics (*K*), the number of documents (*D*), the mean document size (\({\bar{N}}\)), and the number of covariates (*P*) is \(O(L \cdot (K+P)^3 + L \cdot D \cdot (K+P)^2 + K \cdot D \cdot {\bar{N}})\) where \({\bar{N}}=N/D\). From this analysis, we can see that as the corpus grows (\(D \rightarrow \infty \)) we see that sampling \(\eta \) and the topic indicators \({\mathbf {z}}\) will dominate the computations. But we would also expect the number of topics to grow as the number of documents grows. In this situation, the main cost of the algorithm would be sampling the \(\eta \)s and the first topic indicator of each document.

Due to the similarity of the DOLDA sampler to that of the MedLDA sampler, it is straightforward to use the cyclical Metropolis–Hastings proposals in Zheng et al. (2015) for inference in DOLDA. But, as shown in Magnusson et al. (2018), it is not obvious that the reduction in sampling complexity will result in a faster sampler when MCMC efficiency is taken into account.

### 4.4 Evaluation of convergence and prediction

## 5 Experiments

We study model performance in four different ways: the classification accuracy, the interpretability of the model, the topic quality and supervision effects on topics, and the scalability of the sampler. The experiments are performed on 2 sockets with 8-core Intel Xeon E5-2660 Sandy Bridge processors at 2.2GHz at the National Supercomputer Center (NSC) at Linköping University.

### 5.1 Corpora and priors

To study the different aspects of the DOLDA model we use multiple corpora. We collected a corpus containing the 10,810 highest-rated movies at IMDb.com. We use both the textual description and information about producers and directors to classify a given movie into a genre. We also analyze the classical 20 Newsgroup corpus.

A document with the above example online-section would get the top label “Arts” and the hierarchical (2 level) label “Arts; Dining and Wine”. For the hierarchical classification, we extracted only the articles which had at least two levels (“Arts” being the first level and “Dining and Wine” the second level in the example above). After extracting the classes for the documents we create four subsets from the corpora described above. These subsets contain 90%, 60%, 20%, and 10% of documents sampled from the above corpus. None of the documents in the 10% subset exists in the other subsets. The purpose of the NYT corpora is to show how the sampler scales, both with regard to documents (\(\sim \) 600 K) and with a large number of classes (240).Arts; Dining and Wine; Education; Books

Our companion paper (Jonsson et al. 2016) applies the DOLDA model developed here to bug localization in a large-scale software engineering context using a corpus with 15,000 bug reports, each belonging to one of 118 classes.

We include the corpora for different purposes. IMDb is a smaller corpus, but contains additional covariates. The 20 Newsgroups corpus is included to enable accuracy performance with other comparable supervised topic models. The New York Times corpus is included to show the scalability of the MCMC sampler with regard to the number of classes as well as the number of documents.

Corpora used in experiment, by the number of classes (*L*), the number of documents (*D*), the vocabulary size (*V*), and the total number of tokens (*N*)

Corpus | | | | |
---|---|---|---|---|

IMDb | 20 | 10,810 | 47,371 | 967,255 |

20 Newsgroups | 20 | 18,846 | 187,321 | 4,913,292 |

New York Times (hiearchical) | 240 | 183,751 | 1,540,464 | 129,151,602 |

New York Times (top level) | 31 | 595,635 | 3,326,778 | 339,298,734 |

In all experiments, we use a relatively vague prior setting \(\alpha =\beta =0.01\) for the LDA part of the model and \(c=100\) for the prior variance of the \(\eta \) coefficients in the normal model prior and for the intercept coefficient when using the Horseshoe prior. The accuracy experiment for IMDb uses 5-fold cross-validation and the 20 Newsgroups corpus uses the same training and test set as in Zhu et al. (2012) to enable direct comparisons of accuracy. In the interpretability analysis of the IMDb corpus we use the whole corpus, without cross-validation.

### 5.2 Results

#### 5.2.1 Classification accuracy

Figure 2 shows the accuracy on the hold-out test set for the 20 Newsgroups corpus for different numbers of topics. The accuracy of our model is slightly lower than MedLDA and SVM using only textual features, but higher than both the classical supervised multi-class LDA and the ordinary LDA together with an SVM approach.

The advantage of DOLDA is that it produces interpretable predictions with semantically relevant topics. Therefore, it is reassuring that DOLDA can compete in accuracy with other less interpretable models such as the SVM, even when the model is dramatically simplified by aggressive Horseshoe shrinkage for interpretational purposes. Our next analysis illustrates the interpretational strength of DOLDA. See also our companion paper (Jonsson et al. 2016) in the software engineering literature for further demonstrations of DOLDA’s ability to produce interpretable predictions in industrial applications.

#### 5.2.2 Model interpretability

*Romance*genre in the IMDb corpus in Fig. 5. This genre consists of relatively few observations (only 39 movies), and the Horseshoe prior, therefore, shrinks most coefficients to zero, keeping only one large signal topic that happens to have a negative effect on the Romance genre. The normal prior, on the other, hand gives a much denser, and therefore a much less interpretable solution.

*Crime*genre, and Fig. 6 shows that this is indeed the case.

Top words in topics using the Horseshoe prior

Topic 33 | Earth space planet alien human future years world time mission |

Topic 39 | Police murder detective killer case investigation crime crimes solve murdered |

We can also see from Fig. 6 that Topic 33 has a strong negative effect on the Crime genre. In Table 3 we can see that Topic 33 seems to be some sort of Sci-Fi topic. This topic has, in turn, the largest positive relationship with the Sci-Fi movie genre.

This illustrates an example how the aggressive shrinkage of the Horseshoe prior not only increases the prediction accuracy, but also simplifies interpretations since a much smaller number of topics is estimated to affect a given label - making it easier to focus on the topics that actually have an effect in the analysis. This is much more difficult in the Normal prior situation.

#### 5.2.3 Topic quality and effect of supervision

Even though the accuracy improves using a supervised approach, this raises the question of the effect of the supervision on the quality of the individual topics. How are the topics affected by the supervision and by shrinkage priors?

To study the effect on the topics, we focus on two measurements of topic quality. First, we study the effect of *topic coherence* using the measure proposed by Mimno et al. (2011). This measure has been shown in experiments to be a good approximation of topical coherence, estimated using manual annotations, such as topic intrusion (Chang et al. 2009).

We also study the *document entropy* of the topics. This measure gives us an indication of how the topics are distributed over documents. Are topics evenly distributed over documents (high entropy) or more sparsely distributed over documents (low entropy). This is an indication of the effect that supervision has on the topics. Is the supervised information making the distribution over documents more or less sparse?

*L*number of coefficients that affect each topic, we choose to study the supervised effect on topic

*k*, called \(r_k\), by looking at the sum of the absolute values of the \(\eta \) coefficients, i.e.

Figure 8 show the effect of the supervision. We can see the effect of the horseshoe prior in that the regression supervised effects are lower in general, due to the shrinkage imposed by the prior. With regard to topic coherence and document entropy, the supervision has no clear effect. Instead, it seems like the effect that the supervision has on the topics is corpus-specific. In the 20 Newsgroups corpus, we can see a small positive relationship between coherence and supervision effect, something that is not shared by the other corpora.

These results indicate that the effect of the supervision on individual topics depends on the corpora (and the label). This makes sense in that different types of labels will relate to different aspects of the text and the underlying topics. For the 20 Newsgroups, we can see a slight positive correlation between coherence and the supervised effects, and this is also a corpus with higher prediction accuracy. But overall, the results seem to indicate that there is no clear effect of the supervision on topic quality, as measured by document entropy and coherence.

#### 5.2.4 Scaling and parallelism performance

Figure 9 show the results of the scaling and parallel performance experiments. From the results, we can see that the scaling in size of the corpus is more or less linear, as we expect. More interestingly, the runtime of the two NYT corpora is very similar. The hierarchical NYT corpus has roughly ten times more classes (240 vs. 31) than the top-level NYT corpus while being roughly one third in size. Hence we can conclude that, empirically, the largest effect on runtime is the number of documents, or tokens, rather than the number of classes for a standard setting with 100 topics.

Figure 9 also shows the parallel performance of the sampler. Unlike most other large-scale MCMC samplers for topic models, this sampler is both parallel and samples using an ergodic Markov chain with the posterior as the target. This still gives good parallel performance, even on a smaller corpus, such as the IMDb corpus. It is also obvious that the concurrency with regard to the different classes is also of importance. For the hierarchical NYT corpus, the increased number of classes affects the overall sampling time, but the parallel performance is not affected. We can also see that the parallel performance is needed mainly when the number of topics is larger.

## 6 Conclusions

Several supervised topic models have been proposed with the purpose of identifying topics that can be used successfully to classify documents. We have proposed DOLDA, a supervised topic model with special emphasis on generating semantically interpretable predictions together with an efficient and scalable MCMC sampler for inference. An important component of the model to ease interpretation is the DO-probit model without a reference class. By coupling the DO-probit model with an aggressive Horseshoe prior with a shrinkage that is allowed to vary over the different classes, it is possible to create an interpretable classification model that automatically identifies the most interesting “signal” topics. At the same time, the DOLDA model comes with very few hyperparameters - only the standard LDA parameters \(\alpha \) and \(\beta \) are needed, which has been extensively studied in Wallach et al. (2009). The fact that there are so few parameters is different from most other supervised topic models (Jiang et al. 2012; Zhu et al. 2012; Li et al. 2015).

Our experiments show that the gain in interpretation from using DOLDA comes with only a small reduction in prediction accuracy compared to the state-of-the-art supervised topic models; moreover, DOLDA outperforms other fully Bayesian models such as the original supervised LDA model. We have also shown that learning the topics jointly with the classification part of the model gives more accurate predictions than a two-step approach, where a topic model is first estimated and a classifier is then trained on the learned topics, showing a general benefit of supervised topic modeling.

The horseshoe prior has also shown benefits in supervised topic models, leading to a much more clear picture of the important topics for a given label with similar, or better, prediction accuracy. The computational cost of the horseshoe is small compared to the other parts of the sampler, making it an attractive prior for use in other supervised models as well.

The supervision effect on the topics is generally small and in line with previous results. The supervision, in general, does not seem to affect the topic interpretability much, but there seems to be an indication that this is corpus (and label) dependent. In the 20 Newsgroups corpus, where the accuracy is higher, the relationship between topic coherence and supervision effects are slightly positive.

Finally, we show that the DOLDA model scales well for large corpora and many classes. Still, further improvement in scalability can be achieved with regard to the number of topics *K*. The ideas of Zheng et al. (2015) can probably improve the scalability with respect to *K*, the number of topics, but this is something we leave for future work.

## Footnotes

## Notes

### Acknowledgements

Open access funding provided by Aalto University.

## References

- Ahmed A, Xing EP (2010) Staying informed: supervised and semi-supervised multi-view topical analysis of ideological perspective. In: Proceedings of the 2010 conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 1140–1150Google Scholar
- Albert JH, Chib S (1993) Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc 88(422):669–679MathSciNetCrossRefzbMATHGoogle Scholar
- Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022zbMATHGoogle Scholar
- Carvalho C, Polson N, Scott J (2010) The horseshoe estimator for sparse signals. Biometrika 97:465–480MathSciNetCrossRefzbMATHGoogle Scholar
- Castillo I, Schmidt-Hieber J, Van der Vaart A (2015) Bayesian linear regression with sparse priors. Ann Stat 43(5):1986–2018MathSciNetCrossRefzbMATHGoogle Scholar
- Chang J, Gerrish S, Wang C, Boyd-Graber JL, Blei DM (2009) Reading tea leaves: how humans interpret topic models. In: Advances in neural information processing systems, pp 288–296Google Scholar
- Chemudugunta C, Smyth P, Steyvers M (2007) Modeling general and specific aspects of documents with a probabilistic topic model. In: Advances in neural information processing systems, pp 241–248Google Scholar
- Damlen P, Wakefield J, Walker S (1999) Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. J R Stat Soc Ser B (Stat Methodol) 61(2):331–344MathSciNetCrossRefzbMATHGoogle Scholar
- Geladi P, Kowalski BR (1986) Partial least-squares regression: a tutorial. Anal Chim Acta 185:1–17CrossRefGoogle Scholar
- Griffiths TL, Steyvers M, Blei DM, Tenenbaum JB (2005) Integrating topics and syntax. In: Advances in neural information processing systems, pp 537–544Google Scholar
- Imai K, van Dyk DA (2005) A Bayesian analysis of the multinomial probit model using marginal data augmentation. J Econom 124(2):311–334MathSciNetCrossRefzbMATHGoogle Scholar
- Jameel S, Lam W, Bing L (2015) Supervised topic models with word order structure for document classification and retrieval learning. Inf Retr J 18(4):283–330CrossRefGoogle Scholar
- Jiang Q, Zhu J, Sun M, Xing EP (2012) Monte Carlo methods for maximum margin supervised topic models. In: Advances in neural information processing systems, pp 1592–1600Google Scholar
- Johndrow J, Dunson D, Lum K (2013) Diagonal orthant multinomial probit models. In: Proceedings of the sixteenth international conference on artificial intelligence and statistics, pp 29–38Google Scholar
- Jonsson L, Broman D, Magnusson M, Sandahl K, Villani M, Eldh S (2016) Automatic localization of bugs to faulty components in large scale software systems using Bayesian classification. In: 2016 IEEE international conference on software quality, reliability and security (QRS). IEEE, pp 423–430Google Scholar
- Li X, Ouyang J, Zhou X, Lu Y, Liu Y (2015) Supervised labeled latent Dirichlet allocation for document categorization. Appl Intell 42(3):581–593CrossRefGoogle Scholar
- Magnusson M, Jonsson L, Villani M, Broman D (2018) Sparse partially collapsed mcmc for parallel inference in topic models. J Comput Graph Stat 27(2):449–463MathSciNetCrossRefGoogle Scholar
- McAuliffe JD, Blei DM (2008) Supervised topic models. In: Advances in neural information processing systems, pp 121–128Google Scholar
- Mimno D, McCallum A (2012) Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278
- Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the 2011 conference on empirical methods in natural language processing. association for computational linguistics, pp 262–272Google Scholar
- Mullen L (2016) tokenizers: a consistent interface to tokenize natural language text. R package version 0.1.4Google Scholar
- Nalenz M, Villani M (2018) Tree ensembles with rule structured horseshoe regularization. Ann Appl Stat 12(4):2379–2408MathSciNetCrossRefzbMATHGoogle Scholar
- Newman D, Asuncion A, Smyth P, Welling M (2009) Distributed algorithms for topic models. J Mach Learn Res 10(Aug):1801–1828MathSciNetzbMATHGoogle Scholar
- Parnin C, Orso A (2011) Are automated debugging techniques actually helping programmers? In: Proceedings of the 2011 international symposium on software testing and analysis. ACM, pp 199–209Google Scholar
- Perotte AJ, Wood F, Elhadad N, Bartlett N (2011) Hierarchically supervised latent Dirichlet allocation. In: Advances in neural information processing systems, pp 2609–2617Google Scholar
- Polson NG, Scott JG, Windle J (2013) Bayesian inference for logistic models using Pólya-gamma latent variables. J Am Stat Assoc 108(504):1339–1349CrossRefzbMATHGoogle Scholar
- Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: Proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, pp 487–494Google Scholar
- Rubin TN, Chambers A, Smyth P, Steyvers M (2012) Statistical topic models for multi-label document classification. Mach Learn 88(1–2):157–208MathSciNetCrossRefzbMATHGoogle Scholar
- Sandhaus E (2008) The New York Times annotated corpus LDC2008T19. Linguistic Data Consortium, PhiladelphiaGoogle Scholar
- Scott JG (2010) Parameter expansion in local-shrinkage models. arXiv preprint arXiv:1010.5265
- Wallach HM, Mimno DM, McCallum A (2009) Rethinking LDA: why priors matter. In: Advances in neural information processing systems, pp 1973–1981Google Scholar
- Zheng X, Yu Y, Xing EP (2015) Linear time samplers for supervised topic models using compositional proposals. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1523–1532Google Scholar
- Zhu J, Ahmed A, Xing EP (2012) MedLDA: maximum margin supervised topic models. J Mach Learn Res 13(1):2237–2278MathSciNetzbMATHGoogle Scholar
- Zhu J, Zheng X, Zhang B (2013) Improved Bayesian logistic supervised topic models with data augmentation. In: Proceedings of the 51st annual meeting of the association for computational linguistics, vol 1, pp 187–195Google Scholar

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.