# Tripartite-Replicated Softmax Model for Document Representations


## Abstract

Text mining tasks based on machine learning require inputs to be represented as fixed-length vectors, and effective vectors of words, phrases, sentences and even documents may greatly improve the performance of these tasks. Recently, distributed word representations based on neural networks have proved powerful in many tasks by encoding abundant semantic and linguistic information. However, document representation remains a great challenge because of the complex semantic structures in different documents. To meet this challenge, we propose two novel tripartite graphical models for document representations by incorporating word representations into the Replicated Softmax model, and we name them the Tripartite-Replicated Softmax model (TRPS) and the directed Tripartite-Replicated Softmax model (d-TRPS), respectively. We also introduce some optimization strategies for training the proposed models to learn better document representations. The proposed models can capture linear relationships among words and latent semantic information within documents simultaneously, thus learning both linear and nonlinear document representations. We examine the learned document representations in a document classification task and a document retrieval task. Experimental results show that the representations learned by our models outperform those of state-of-the-art models on both tasks.

## Keywords

Document representations, Replicated softmax model, Text mining

## 1 Introduction

Text mining tasks usually require their inputs to be fixed-length vectors that machine learning algorithms can process. With the number of documents on the Internet increasing rapidly, there is an urgent need to represent documents effectively to improve the performance of these tasks. Document representations have been studied for years, and various models have been developed [1, 2, 3, 4, 5, 6].

The vector space model, a traditional method based on the bag-of-words assumption, represents documents with feature vectors and has been widely used in information retrieval. In these vectors, each dimension stands for a word in the vocabulary and is weighted by a weighting scheme such as term frequency-inverse document frequency (tf-idf). However, these models have two weaknesses. First, word order is lost by assuming independence among words, so different documents receive the same representation whenever they contain the same words. Second, these models capture little of the semantics embedded in documents, and the dimensionality of their representations is proportional to the vocabulary size, which limits their generalization ability across tasks due to the curse of dimensionality. Effective document representation therefore remains a great challenge.
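The first weakness can be seen in a few lines of code. The following is a minimal sketch, not taken from the paper: `tfidf_vectors` is a hypothetical helper, and the smoothed idf formula is one common variant chosen here as an assumption. It shows two sentences with opposite meanings receiving identical bag-of-words vectors.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumed variant): log((1 + n) / (1 + df)) + 1.
        vectors.append(
            [tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1.0) for w in vocab]
        )
    return vocab, vectors

docs = [
    "dog bites man".split(),
    "man bites dog".split(),      # same words, different order and meaning
    "stocks fall sharply".split(),
]
vocab, vecs = tfidf_vectors(docs)
print(vecs[0] == vecs[1])  # the two orderings are indistinguishable
print(len(vocab))          # dimensionality grows with the vocabulary
```

Each vector has one dimension per vocabulary word, which also illustrates the second weakness: the representation size is tied to the vocabulary rather than to any latent semantic structure.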

Recently, neural network-based models have exhibited powerful capabilities in representing words, phrases and even documents. As one of the most effective models, Restricted Boltzmann Machines (RBMs) [7, 8] have been used for document modeling in the Replicated Softmax model (RPS) [9], which outperforms LDA [10] in different text mining tasks. The RPS model consists of an ensemble of RBMs with shared parameters and outputs the values of its hidden units as replacements for the original documents, thus learning fixed-length features for document representations. To enhance the RPS model, Srivastava et al. [11] introduced another hidden layer into the original RPS model to learn document representations more effectively and more efficiently. However, these models are still subject to the bag-of-words assumption, which may lose semantic information about words when learning document representations. We may therefore further enhance these models by taking more semantic information about words into consideration.

Inspired by the successful use of distributed word representations based on neural network language models [12, 13], some studies have attempted to integrate distributed word representations into topic models. For example, Niu et al. [14] proposed a method to train LDA-based topic models using a neural network language model [13], which outperforms the original LDA. Nguyen et al. [15] replaced the probability of generating words from the topic distribution with the probability of co-occurrence between a topic vector and a word vector, and introduced a Bernoulli distribution to optimize the learning process. In other work, distributed word representations have been widely used to generate representations of phrases, sentences and paragraphs [13]. These studies indicate that word representations may enhance document modeling with abundant semantic information.

Based on this idea, we propose two Tripartite-Replicated Softmax models (TRPS) to learn better document representations. The proposed models integrate distributed word representations into the Replicated Softmax model to encode more semantic information about words into document representations. The learned models, as tripartite graphical belief networks, can learn both linear representations and nonlinear representations for documents simultaneously. We conduct extensive experiments on publicly available datasets in a text classification task and an information retrieval task to examine the performance of the proposed models, and investigate the effectiveness of the learned representations for documents with different lengths. Experimental results show that the proposed models outperform state-of-the-art models, and achieve impressive improvements over the Replicated Softmax model and the Over-Replicated Softmax model.

## 2 Replicated Softmax Model

Before introducing the proposed models, we first introduce the undirected graphical model on which they build, the Replicated Softmax model (RPS), which can automatically extract low-dimensional latent semantic representations of documents. The RPS model comprises a family of RBMs with shared parameters, and each RBM has Softmax visible variables that can take one of several different states. The RPS model takes document vectors as inputs and produces low-dimensional representations of documents as outputs, where the input vectors are generated from the term frequencies in each document.

Let *K* be the size of the vocabulary, *N* be the length of a document (the total number of words in the document), and *F* be the number of units in the hidden layer (the dimensionality of the output representations). The values of the hidden units are defined as binary stochastic hidden topic features \( \mathbf{h} \in \{0,1\}^{F} \). Let **V** be an *N* × *K* observed binary matrix with \( v_{ik} = 1 \) if visible unit *i* takes on the *k*-th value. The energy of the RPS model is defined as follows.
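Written out with the symbols defined here, and following the formulation of the Replicated Softmax model in [9], the energy takes the form below. The index ranges are inferred from the definitions of *N*, *K* and *F*, so the exact layout should be read as a reconstruction rather than a quotation.

\[
E(\mathbf{V},\mathbf{h};\theta) = -\sum_{i=1}^{N}\sum_{j=1}^{F}\sum_{k=1}^{K} W_{ijk}\,h_{j}\,v_{ik} \;-\; \sum_{i=1}^{N}\sum_{k=1}^{K} v_{ik}\,b_{ik} \;-\; \sum_{j=1}^{F} h_{j}\,a_{j}
\]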

Here, *θ* = {**W**, **a**, **b**} are the parameters of the model, \( W_{ijk} \) is a symmetric interaction term between visible unit *i* taking value *k* and hidden unit *j*, \( b_{ik} \) is the bias of visible unit *i* with value *k*, and \( a_{j} \) is the bias of hidden unit *j*. The probability that the model assigns to a visible binary matrix **V** is defined as follows.
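In the standard Boltzmann form used for this model family [9], and assuming the sums run over all hidden configurations and all visible matrices for a document of length *N*, this probability can be written as:

\[
P(\mathbf{V};\theta) = \frac{1}{Z(\theta,N)} \sum_{\mathbf{h}} \exp\bigl(-E(\mathbf{V},\mathbf{h};\theta)\bigr),
\qquad
Z(\theta,N) = \sum_{\mathbf{V}'}\sum_{\mathbf{h}'} \exp\bigl(-E(\mathbf{V}',\mathbf{h}';\theta)\bigr)
\]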

Here, *Z*(*θ*, *N*) is a normalization factor. The RPS model creates a separate RBM for each document, with as many Softmax units as there are words in the document. All of these Softmax units share the same weights, regardless of word order, connecting them to the binary hidden units. Therefore, the energy of the state {**V**, **h**} for a document that contains *N* words can also be written as follows.
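With the weights tied across positions, so that \( W_{ijk} = W_{jk} \) and \( b_{ik} = b_{k} \), the energy depends on the document only through its word counts \( \hat{v}_{k} \). A reconstruction in the notation of [9], offered as a sketch consistent with the surrounding definitions:

\[
E(\mathbf{V},\mathbf{h};\theta) = -\sum_{j=1}^{F}\sum_{k=1}^{K} W_{jk}\,h_{j}\,\hat{v}_{k} \;-\; \sum_{k=1}^{K} \hat{v}_{k}\,b_{k} \;-\; N\sum_{j=1}^{F} h_{j}\,a_{j}
\]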

Here, \( \hat{v}_{k} = \sum\nolimits_{i = 1}^{N} {v_{ik} } \) denotes the count of the *k*-th word. The bias terms of the hidden variables are scaled up by the length of the document, which is important for the hidden units to behave sensibly when dealing with documents of different lengths. The conditional distributions are given by Softmax and logistic functions as follows.
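A minimal NumPy sketch of these two conditionals, assuming the tied-weight formulation of [9]: the hidden units follow a logistic function whose bias is scaled by the document length *N*, and every word position shares one Softmax distribution over the vocabulary. The function names and the toy sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(v_hat, W, a):
    """P(h_j = 1 | V) for word counts v_hat (K,), weights W (K, F), biases a (F,)."""
    N = v_hat.sum()                      # document length
    return sigmoid(N * a + v_hat @ W)    # hidden bias scaled by N

def p_word_given_hidden(h, W, b):
    """P(v_ik = 1 | h): one distribution over K words, shared by all N positions."""
    return softmax(b + W @ h)

rng = np.random.default_rng(0)
K, F = 6, 3                              # toy vocabulary size and hidden size
W = 0.1 * rng.standard_normal((K, F))
a, b = np.zeros(F), np.zeros(K)
v_hat = np.array([2, 0, 1, 0, 0, 1])     # word counts of a 4-word document

p_h = p_hidden_given_visible(v_hat, W, a)
h = (rng.random(F) < p_h).astype(float)  # sample binary topic features
p_v = p_word_given_hidden(h, W, b)
```

Because the visible distribution does not depend on the position *i*, sampling a reconstructed document amounts to drawing *N* words from `p_v`.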

Exact maximum likelihood learning of RPS is intractable, as computing the derivatives of the partition function takes time exponential in *min*{*N*, *F*}. In practice, learning is approximated with contrastive divergence (CD) [16].
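To make the approximation concrete, here is a minimal sketch of a single CD-1 update for one document under the tied-weight RPS formulation. It is an illustration under stated assumptions rather than the authors' training code: `cd1_step`, the learning rate, and the toy dimensions are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_hat, W, a, b, rng, lr=0.01):
    """One CD-1 update of (W, a, b) for a document given as word counts v_hat (K,)."""
    N = int(v_hat.sum())
    # Positive phase: hidden probabilities driven by the data.
    p_h_data = sigmoid(N * a + v_hat @ W)
    h = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # Negative phase: resample N words from the Softmax, then recompute hiddens.
    logits = b + W @ h
    p_v = np.exp(logits - logits.max())
    p_v /= p_v.sum()
    v_model = rng.multinomial(N, p_v).astype(float)
    p_h_model = sigmoid(N * a + v_model @ W)
    # Gradient estimate: data statistics minus one-step reconstruction statistics.
    W = W + lr * (np.outer(v_hat, p_h_data) - np.outer(v_model, p_h_model))
    b = b + lr * (v_hat - v_model)
    a = a + lr * N * (p_h_data - p_h_model)   # factor N from the scaled hidden bias
    return W, a, b, v_model

rng = np.random.default_rng(0)
K, F = 6, 3
W = 0.01 * rng.standard_normal((K, F))
a, b = np.zeros(F), np.zeros(K)
v_hat = np.array([2.0, 0.0, 1.0, 0.0, 0.0, 1.0])
W, a, b, v_model = cd1_step(v_hat, W, a, b, rng)
```

The reconstruction keeps the document length fixed, so the negative-phase statistics are comparable with the data statistics for documents of any length.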

## 3 Tripartite-Replicated Softmax Model

Before introducing distributed word representations, we first add another hidden layer **H** to the RPS model to give a complementary prior over the hidden units, which has proved effective in providing more flexibility in defining the prior [11]. As a result, the model's prior over the latent topics **h** can be viewed as the geometric mean of two probability distributions: one is defined by an RBM composed of **V** and **h**, and the other is defined by an RBM composed of **h** and **H**.

On this basis, we fully connect the visible units and the units in the second hidden layer with extra edges (the blue edges in Fig. 1). We then define the weights on these edges based on the distributed word representations. The dotted circles on the blue edges stand for a set of pseudo nodes, which output linear representations of the documents. We call them pseudo nodes because they are never activated during model training; we only read their outputs once the final model has been learned. The connections between the visible units and the units in the second hidden layer convey semantic information from the word representations during model training, and may yield more useful document representations. Like the RPS model, the TRPS model can also be interpreted as an RBM-based model that uses a single visible multinomial unit with support {1,…, *K*}, which is sampled *N* times.

Let *L* be the dimensionality of the word representations, *K* be the vocabulary size, and **E** be the word representation matrix in \( \mathbb{R}^{K \times L} \). We define the weights on the edges between the visible layer and the second hidden layer as \( \mathbf{W}^{EE} = \mathbf{W}^{E} (\mathbf{W}^{E})^{T} \), where \( w_{kl}^{\text{E}} = c_{k} E_{kl} \). Here, **c** is a vector in \( \mathbb{R}^{K} \) that can be tuned during model training, and \( (\mathbf{W}^{E})^{T} \) is the transpose of the matrix \( \mathbf{W}^{E} \). The values of the elements in **c** are forced to be non-negative and stand for the weights on each word in the vocabulary. We define the outputs of the pseudo nodes as \( \mathbf{V}\mathbf{W}^{E} \), which can be taken as a linear conversion of the original inputs by weighting each word, and can be considered as linear document representations learned by the proposed model. It is worth noting that the number of pseudo nodes equals the dimensionality of the word representations under this setting. The energy of the proposed model can be formalized as follows.

Here, *θ* = {**W**, **a**, **b**, **c**} are the parameters of the TRPS model, and **E** is the matrix of pre-trained distributed word representations. \( \hat{v}_{k} = \sum\nolimits_{i = 1}^{N} {v_{ik} } \) is the number of occurrences of the *k*-th word in the document, and \( \hat{h}^{\prime}_{k} = \sum\nolimits_{i = 1}^{M} {h^{\prime}_{ik} } \) is the number of occurrences of the *k*-th word in the second hidden layer **H**, where *M* is the number of units in that layer. Based on the energy definition, we define the joint probability distribution of our model as follows.

For *Z*(*θ*, *N*) and \( \sum\nolimits_{{\mathbf{H}}} {\sum\nolimits_{{\mathbf{h}}} {\exp ( - E({\mathbf{V}} ,{\mathbf{h}} ,{\mathbf{H}};\theta ))} } \), we can adopt the Contrastive Divergence (CD) algorithm [16] or the Persistent Contrastive Divergence (PCD) algorithm [17] to estimate these two quantities. The corresponding conditional probabilities can be derived as follows.

From these probabilities, we can see that the main difference between our model and the RPS model lies in the conditional probability \( P(v_{ik} = 1 \mid \mathbf{h}, \mathbf{H}) \). In TRPS, this probability is determined not only by the binary hidden layer but also by the second hidden layer. The information flow between the visible layer and the second hidden layer conveys semantic information based on distributed word representations. As a result, the document representations learned by our model capture both the original document information from the visible layer and semantic information from the second hidden layer. The learned weight **c** balances the importance of different words in the document, and helps to give a linear representation of documents.
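The linear representation produced by the pseudo nodes can be sketched in a few lines. The example assumes the document-level output is obtained by summing \( \mathbf{V}\mathbf{W}^{E} \) over the *N* word positions, which is equivalent to multiplying the word-count vector by \( \mathbf{W}^{E} \); the random matrix `E` is only a stand-in for pre-trained word vectors, and `linear_representation` is a hypothetical helper.

```python
import numpy as np

def linear_representation(v_hat, E, c):
    """Pseudo-node output for one document.

    v_hat: word counts (K,), E: word vectors (K, L), c: per-word weights (K,).
    Returns the L-dimensional linear representation and W^E.
    """
    c = np.maximum(c, 0.0)    # the weights on words are constrained to be non-negative
    W_E = c[:, None] * E      # w^E_{kl} = c_k * E_{kl}
    return v_hat @ W_E, W_E

rng = np.random.default_rng(0)
K, L = 5, 4                       # toy vocabulary size and embedding size
E = rng.standard_normal((K, L))   # stand-in for pre-trained word representations
c = np.ones(K)                    # initial word weights, tuned during training
v_hat = np.array([1.0, 0.0, 2.0, 0.0, 1.0])

doc_vec, W_E = linear_representation(v_hat, E, c)
W_EE = W_E @ W_E.T                # K x K weights between the visible and second hidden layers
```

With all weights in **c** equal to one, the linear representation is simply the count-weighted sum of the word vectors; training **c** lets the model emphasize or suppress individual words.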

Since the final outputs of our model are the values of the units in the binary hidden layer, we further modify the TRPS model by converting the edges from the visible units to the units in the second hidden layer into directed edges. This modification ensures that the conditional probability of the visible units depends only on the latent topic units, so that as much document information as possible is encoded in the model outputs, namely \( P(v_{ik} = 1 \mid \mathbf{h}, \mathbf{H}) = P(v_{ik} = 1 \mid \mathbf{h}) \). We refer to the new model as the directed Tripartite-Replicated Softmax model (d-TRPS).

The d-TRPS model is similar to the TRPS model except that the energy E(**V**, **h**, **H**; *θ*) is different, which produces different conditional probabilities \( P(v_{ik} = 1 \mid \mathbf{h}) \) and \( {\text{q}}(h_{i}^{\prime } = k\,|\,{\mathbf{V}}) \). In our experiments, we examine the effectiveness of both proposed models.

### 3.1 Model Training

Here, *FE*(**V**; *θ*) refers to the free energy of our model. To maximize the likelihood P(**V**; *θ*), we give the derivative with respect to the parameter *θ* as follows.

**h**, **H** | **V**; *θ*). \( \mu = \{ \boldsymbol{\mu}^{(1)}, \boldsymbol{\mu}^{(2)} \} \) are the parameters of the distribution, and can be estimated iteratively as follows.

The second term in Eq. (14) is related to the distribution defined by the model, and can be estimated using the Persistent Contrastive Divergence (PCD) algorithm [17]. Specifically, let \( x_{t} = \{ \mathbf{V}_{t}, \mathbf{h}_{t}, \mathbf{H}_{t} \} \) and \( \theta_{t} \) be the state and parameters at the current time step. To obtain a new state \( x_{t+1} \), we update \( x_{t} \) using alternating Gibbs sampling, and obtain new parameters \( \theta_{t+1} \) by taking a gradient step. After a number of iterations, we obtain a new state \( \mathbf{V}_{t'} \), which can be taken as the estimate for this term. In addition, we normalize and scale the trained word representations to form the initial weights, and tune the parameter **c** during training.

### 3.2 Pre-training

\( h_{j} \) = 1 | **V**) and the energy E(**V**, **h**; *θ*) as follows.

We conduct pre-training to achieve two goals. The first is to give the units in the second hidden layer initial values whose distribution is close to that of the visible layer, since the latent topic units are determined by the other two layers. The second is to compensate for the different numbers of units in the input layer and the second hidden layer, for which we adjust the values of the units in the second hidden layer with a normalization factor *M*/*N*. Since the TRPS model and the d-TRPS model have similar structures, we use the same pre-training method for both models.

## 4 Experiments

### 4.1 Experimental Settings

Statistics about the datasets

Datasets | #documents (train + test) | #words | Average length of docs |
---|---|---|---|
20-newsgroups | 11314 + 7532 | 1,115,342 | 99.58 |
20-newssingle | 11298 + 7516 | 344,188 | 30.46 |
Tagmynews | 29344 + 3260 | 487,152 | 14.94 |

To examine the performance of the proposed models on documents with different lengths, we take the 20-newsgroups dataset as long documents and the Tagmynews dataset as short documents. We also extract the last paragraph of each document in the 20-newsgroups dataset as a new short document to construct a new dataset with medium length, denoted as 20-newssingle. We refer to the 20-newsgroups, 20-newssingle and TagmyNews datasets as 20-NG, 20-NS, and TMN, respectively. We compare our models with the Replicated Softmax model [9] and the Over-Replicated Softmax model (ORPS) [11]. The ORPS model is a modified RPS model with a second hidden layer.

### 4.2 Training Details

In our experiments, we use the word representations provided by Google to train the TRPS model and the d-TRPS model. A validation set is held out from the training set for hyper-parameter selection. Using the validation set, we choose the dimensionality of the word representations as 300, which equals the number of pseudo nodes. We choose the value of *M* and the number of hidden units by grid search on the validation set, and finally set *M* for the 20-newsgroups dataset and *M* = 20 for the other two datasets. We also find that a large *M* does not contribute much to the performance of the proposed models. We set the number of units in the binary hidden layer to *F* = 200, since more nodes in the binary hidden layer may cause overfitting. We use Adadelta [19] in training to avoid manual tuning of the learning rate. We transform the raw term frequencies using log(1 + \( w_{i} \)) and round the resulting values to improve the performance.
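The term-frequency transform is simple enough to state directly in code. This sketch assumes the natural logarithm and rounding to the nearest integer, neither of which is specified in the text; `transform_counts` is a hypothetical helper name.

```python
import math

def transform_counts(counts):
    """Apply log(1 + w) to raw term frequencies and round to the nearest integer."""
    return [round(math.log(1 + w)) for w in counts]

raw = [0, 1, 2, 5, 20, 100]
print(transform_counts(raw))  # [0, 1, 1, 2, 3, 5]
```

The transform compresses large counts far more than small ones, so a word that occurs 100 times contributes only about five times as much as a word that occurs once.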

### 4.3 Document Classification

Performance of models for document classification task on three datasets

Dataset | Methods | Nonlinear Dim. = 200 | Linear Dim. = 300 | Combination Dim. = 500 |
---|---|---|---|---|
20-newsgroups | RPS | 79.01 | – | 79.34 |
20-newsgroups | ORPS | 80.21 | – | 81.43 |
20-newsgroups | TRPS | 77.55 | 73.29 | 89.80 |
20-newsgroups | d-TRPS | | | |
20-newssingle | RPS | 46.19 | – | 43.09 |
20-newssingle | ORPS | 48.78 | – | 49.59 |
20-newssingle | TRPS | 48.36 | | |
20-newssingle | d-TRPS | | 32.96 | 50.94 |
Tagmynews | RPS | 77.94 | – | 76.93 |
Tagmynews | ORPS | 79.29 | – | |
Tagmynews | TRPS | 77.64 | | 79.42 |
Tagmynews | d-TRPS | | 57.94 | 80.25 |

From the table, we find that the TRPS model achieves performance comparable to the baseline models, especially when combined with linear features, and that the d-TRPS model further improves the performance. This shows that converting the edges from the inputs to the hidden units into directed connections helps build more effective models.

### 4.4 Document Retrieval

Here, \( s_{1} \) is the similarity score with nonlinear features, and \( s_{2} \) is the score with linear features. We tune the parameter \( \lambda \) on the validation set, and find that the best performance is achieved with \( \lambda = 0.9 \). We evaluate the retrieval performance in terms of precision-recall curves, shown in Fig. 2.
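A minimal sketch of how two such scores might be combined. It assumes a convex combination \( s = \lambda s_{1} + (1 - \lambda) s_{2} \) and cosine similarity for both scores; both choices are assumptions made for illustration, and `combined_score` is a hypothetical helper. The feature sizes follow the dimensionalities reported for the classification experiments (200 nonlinear, 300 linear).

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_score(q_nonlin, d_nonlin, q_lin, d_lin, lam=0.9):
    """Assumed form: lam * s1 + (1 - lam) * s2, with cosine similarities s1 and s2."""
    s1 = cosine(q_nonlin, d_nonlin)   # similarity on nonlinear (hidden-layer) features
    s2 = cosine(q_lin, d_lin)         # similarity on linear (pseudo-node) features
    return lam * s1 + (1.0 - lam) * s2

rng = np.random.default_rng(0)
q_h, d_h = rng.random(200), rng.random(200)                    # toy nonlinear features
q_l, d_l = rng.standard_normal(300), rng.standard_normal(300)  # toy linear features
score = combined_score(q_h, d_h, q_l, d_l, lam=0.9)
```

Under this assumed form, \( \lambda = 0.9 \) means the ranking is driven mostly by the nonlinear features, with the linear features acting as a small correction.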

From the results, we find that in the information retrieval task the TRPS model outperforms the baseline models, and the d-TRPS model achieves the best performance on all the datasets. We also find that the proposed models achieve better results especially when modeling long documents. A possible explanation is that the TRPS model connects the input layer and the second hidden layer, which makes them compete for their contribution to the binary hidden layer while benefiting from the word representations. Furthermore, the d-TRPS model replaces the undirected connections with directed connections to reduce the influence of this competition.

### 4.5 Discussion

For documents with different lengths, the TRPS model is especially suitable for modeling long documents compared with the RPS-based models, according to the experimental results on both tasks. This may be attributed to the introduction of distributed word representations while learning the model. Distributed word representations are learned mainly from the co-occurrence of words within a certain window size in a large corpus, thus capturing much semantic information about words, while long documents contain more complicated semantic structures than short documents. Therefore, the TRPS model learns better document representations for long documents. Meanwhile, compared with the RPS model, the conditional probability of the visible units in the TRPS model is affected not only by the latent topic layer but also by the second hidden layer, which may miss some document information while learning the document representations.

The d-TRPS model further enhances the TRPS model by converting the connections from the visible units to the second hidden units into directed edges, namely \( P(v_{ik} = 1 \mid \mathbf{h}, \mathbf{H}) = P(v_{ik} = 1 \mid \mathbf{h}) \). This modification ensures that the conditional probability of the visible units depends only on the latent topic units, so that as much document information as possible is encoded in the final document representations, thus achieving better performance.

As to the complexity of the proposed models, our models only require an extra parameter **c** as the weights on word representations, whose dimensionality equals the dimensionality of the word representations. Therefore, the complexity of our models is comparable with that of the RPS-based models. Besides, since our model is general, the learned document representations can also be applied in other text mining tasks, and the learned weight vector **c** can be used for weighting words in different tasks. When the weight vectors are set to zero vectors and *M* = 0, our model reduces to the RPS model. Therefore, the proposed models can be considered a generalization of the Replicated Softmax model.

## 5 Conclusion and Future Work

In this paper, we propose two tripartite graphical models, the Tripartite-Replicated Softmax model and the directed Tripartite-Replicated Softmax model, to represent documents as fixed-length representations. The proposed models introduce distributed word representations based on neural networks to encode much semantic information about words into the document representations, and learn the linear and the nonlinear representations simultaneously. Experimental results show that the proposed models, especially the d-TRPS model, outperform the state-of-the-art models for document representations in the document classification task and the document retrieval task, which indicates the effectiveness of the learned document representations.

We will carry out our future work in two directions: one is to investigate more powerful word representations in our framework to further enhance the learned document representations, and the other is to explore effective ways to make the most of the linear and the nonlinear document representations, together with the learned weights on words, in consideration of characteristics of different tasks.

## Notes

### Acknowledgements

This work is partially supported by grant from the Natural Science Foundation of China (No. 61632011, 61572102, 61402075, 61602078, 61562080), State Education Ministry and The Research Fund for the Doctoral Program of Higher Education (No. 20090041110002), the Fundamental Research Funds for the Central Universities.

## References

- 1. Grefenstette, E., Dinu, G., Zhang, Y.Z., et al.: Multi-step regression learning for compositional distributional semantics. arXiv preprint arXiv:1301.6939 (2013)
- 2. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013)
- 3. Mitchell, J., Lapata, M.: Composition in distributional models of semantics. Cogn. Sci. **34**(8), 1388–1429 (2010)
- 4. Nam, J., Mencía, E.L., Fürnkranz, J.: All-in text: learning document, label, and word representations jointly. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
- 5. Yessenalina, A., Cardie, C.: Compositional matrix-space models for sentiment analysis. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 172–182. Association for Computational Linguistics (2011)
- 6. Zanzotto, F.M., Korkontzelos, I., Fallucchi, F., et al.: Estimating linear models for compositional distributional semantics. In: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 1263–1271. Association for Computational Linguistics (2010)
- 7. Gehler, P.V., Holub, A.D., Welling, M.: The rate adapting Poisson model for information retrieval and object recognition. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 337–344. ACM (2006)
- 8. Xing, E.P., Yan, R., Hauptmann, A.G.: Mining associated text and images with dual-wing harmoniums. arXiv preprint arXiv:1207.1423 (2012)
- 9. Hinton, G.E., Salakhutdinov, R.R.: Replicated softmax: an undirected topic model. In: Advances in Neural Information Processing Systems, pp. 1607–1614 (2009)
- 10. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. **3**, 993–1022 (2003)
- 11. Srivastava, N., Salakhutdinov, R.R., Hinton, G.E.: Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865 (2013)
- 12. Mikolov, T., Chen, K., Corrado, G., et al.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
- 13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
- 14. Niu, L.Q., Dai, X.Y.: Topic2Vec: learning distributed representations of topics. arXiv preprint arXiv:1506.08422 (2015)
- 15. Nguyen, D.Q., Billingsley, R., Du, L., et al.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. **3**, 299–313 (2015)
- 16. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. **14**(8), 1771–1800 (2002)
- 17. Tieleman, T.: Training restricted Boltzmann machines using approximations to the likelihood gradient. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1064–1071. ACM (2008)
- 18. Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: International Conference on Artificial Intelligence and Statistics, pp. 448–455 (2009)
- 19. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)