
1 Introduction

Recurrent neural networks (RNNs) are a class of artificial neural networks that have achieved remarkable results in many important applications, sequence generation being one of them [6, 10]. By adopting a special architecture such as LSTM [8] or GRU [3, 4], an RNN can learn dependencies among word tokens appearing at distant positions. When we use sequence generation in realistic situations, we may sift the sequences generated by the RNN to obtain only the useful ones. Such sifting can be realized by giving each generated sequence a score representing its usefulness for the application under consideration. While an improvement of the generated sequences can also be achieved by modifying the architecture of the RNN [5], we here consider a scoring method that can be tuned and applied separately from the RNN. In particular, this paper proposes a method that achieves diversity among the subsequences appearing in highly scored sequences.

We could score the generated sequences by their output probabilities under the RNN. However, this approach is likely to assign high scores to sequences containing subsequences that are popular among the sequences used for training the RNN. Consequently, the highly scored sequences tend to look alike and to show only limited diversity. In contrast, our scoring method uses latent Dirichlet allocation (LDA) [2]. Topic models like LDA can extract diverse topics from training documents. By using the per-topic word probabilities provided by LDA, we can assign high scores to sequences containing many words strongly relevant to some particular topic. Our scoring method is therefore expected to select sequences that are individually relevant to some particular topic and together cover diverse topics. We performed an evaluation experiment by generating Japanese Tanka poems with an RNN. After training the RNN under different settings, we chose the best setting in terms of validation perplexity and used it to generate random sequences. The generated sequences were scored either by their output probabilities under the RNN or by our LDA-based method. The results show that our LDA-based method selected more diverse sequences, in the sense that a wider variety of subsequences appeared as parts of the top-ranked Tanka poems.

2 Method

2.1 Preprocessing of Tanka Poems

Sequence generation is one of the important applications of RNNs [6, 10]. This paper considers the generation of Japanese Tanka poems. We assume that all Tanka poems used for training the RNN are given in Hiragana characters with no voicing marks. Tanka poems have a 5-7-5-7-7 syllabic structure and thus consist of five subsequences, which we call parts in this paper. Here is an example of a Tanka poem taken from The Tale of Genji: “mo no o mo hu ni/ta ti ma hu he ku mo/a ra nu mi no/so te u ti hu ri si/ko ko ro si ri ki ya.” In this paper, we use Kunrei-shiki romanization for Hiragana characters. While the first part of the standard syllabic structure consists of five syllables, that of this example consists of six. Small deviations of this kind from the standard 5-7-5-7-7 structure are often observed, and our preprocessing addresses them.

First, we put a spacing character ‘_’ between each neighboring pair of parts and also at the tail of the poem. Moreover, the first Hiragana character of each of the five parts is “uppercased,” i.e., marked as distinct from the same character appearing at other positions. The above example is then converted to: “MO no o mo hu ni _ TA ti ma hu he ku mo _ A ra nu mi no _ SO te u ti hu ri si _ KO ko ro si ri ki ya _.” Second, we represent each Tanka poem as a sequence of non-overlapping character bigrams, which are regarded as the vocabulary words composing the poem. To the parts containing an even number of Hiragana characters, however, we apply an additional modification: a special bigram is put at the tail of such parts in place of the spacing character ‘_’. Consequently, the non-overlapping bigram sequence corresponding to the above example is obtained as: “(MO no) (o mo) (hu ni) (TA ti) (ma hu) (he ku) (mo _) (A ra) (nu mi) (no _) (SO te) (u ti) (hu ri) (si _) (KO ko) (ro si) (ri ki) (ya _).” Finally, we put a token of the special words ‘BOS’ and ‘EOS’ at the head and the tail of each sequence, respectively. The sequences preprocessed in this manner were used for training both the RNN and LDA.
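As a minimal sketch, the preprocessing described above could be implemented as follows, assuming each poem is given as five lists of romanized Hiragana characters. The function name and the written form chosen for the special bigram of even-length parts (rendered here as ‘__’) are illustrative conventions of this sketch, not notation fixed by the paper.

```python
BOS, EOS, SPACE = "BOS", "EOS", "_"
EVEN_PART_BIGRAM = "__"  # hypothetical placeholder for the special bigram of even-length parts

def preprocess_poem(parts):
    """Convert a Tanka poem, given as five lists of romanized Hiragana
    characters, into the non-overlapping bigram sequence used for training."""
    tokens = [BOS]
    for part in parts:
        chars = [part[0].upper()] + list(part[1:])  # "uppercase" the first character of the part
        if len(chars) % 2 == 1:
            chars.append(SPACE)                     # odd-length part: the spacing character completes the last bigram
        bigrams = [" ".join(chars[i:i + 2]) for i in range(0, len(chars), 2)]
        if len(part) % 2 == 0:
            bigrams.append(EVEN_PART_BIGRAM)        # even-length part: special bigram replaces the spacing character
        tokens.extend(bigrams)
    tokens.append(EOS)
    return tokens

# First two parts of the example above: the first part has six characters,
# hence three bigrams followed by the (placeholder) special bigram.
print(preprocess_poem([["mo", "no", "o", "mo", "hu", "ni"],
                       ["ta", "ti", "ma", "hu", "he", "ku", "mo"]]))
# ['BOS', 'MO no', 'o mo', 'hu ni', '__', 'TA ti', 'ma hu', 'he ku', 'mo _', 'EOS']
```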

2.2 Tanka Poem Generation by RNN

We downloaded 179,225 Tanka poems from the web site of the International Research Center for Japanese Studies. In this set, 3,631 different non-overlapping bigrams were found, so the vocabulary size was 3,631. Among the 179,225 Tanka poems, 143,550 were used for training both the RNN and LDA, and 35,675 were used for validation, i.e., for tuning free parameters. We implemented the RNN in PyTorch by using LSTM or GRU modules. RMSprop [11] was used for optimization with a learning rate of 0.002. The mini-batch size was 200, the number of hidden layers was three, and the dropout probability was 0.5. Based on an evaluation in terms of validation set perplexity, the hidden layer size was set to 600. Since the validation perplexity of the GRU-RNN was slightly better than that of the LSTM-RNN, the GRU-RNN was used for generating Tanka poems.
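A minimal PyTorch sketch matching this configuration might look as follows. The class name is ours, the embedding size (set equal to the hidden layer size) is an assumption, and training-loop details such as batching are omitted.

```python
import torch
import torch.nn as nn

class TankaRNN(nn.Module):
    """GRU language model over the bigram vocabulary (3,631 words)."""
    def __init__(self, vocab_size=3631, hidden_size=600, num_layers=3, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)   # embedding size = hidden size (assumption)
        self.rnn = nn.GRU(hidden_size, hidden_size, num_layers,
                          dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h=None):
        e = self.embed(x)        # (batch, seq_len, hidden_size)
        y, h = self.rnn(e, h)    # (batch, seq_len, hidden_size)
        return self.out(y), h    # unnormalized next-token scores

model = TankaRNN()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.002)  # learning rate as in the text
criterion = nn.CrossEntropyLoss()                              # next-token prediction loss
```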

2.3 LDA-Based Sequence Scoring

This paper proposes a new method for scoring the sequences generated by an RNN. We use latent Dirichlet allocation (LDA) [2], the best-known topic model, for scoring. LDA is a Bayesian probabilistic model of documents and can model differences in the semantic contents of documents as differences in the mixing proportions of topics. Each topic is in turn modeled as a probability distribution over vocabulary words. We denote the number of documents, the vocabulary size, and the number of topics by D, V, and K, respectively. By performing inference for LDA via variational Bayesian inference [2], collapsed Gibbs sampling (CGS) [7], etc., over the training set, we can estimate two groups of parameters: \(\theta _{dk}\) and \(\phi _{kv}\), for \(d=1,\ldots ,D\), \(v=1,\ldots ,V\), and \(k=1,\ldots ,K\). The parameter \(\theta _{dk}\) is the probability of the topic k in the document d; intuitively, it quantifies the importance of each topic in each document. The parameter \(\phi _{kv}\) is the probability of the word v in the topic k; intuitively, it quantifies the relevance of each vocabulary word to each topic. For example, in autumn, people talk about fallen leaves more often than about blooming flowers. Such topic relevance of each vocabulary word is represented by \(\phi _{kv}\).

In our experiment, we regarded each Tanka poem as a document. Inference for LDA was performed by CGS on the same set of Tanka poems as that used for training the RNN. Therefore, \(D=143,550\) and \(V=3,631\) as given in Subsect. 2.2. K was set to 50, because other values gave no significant improvement. The Dirichlet hyperparameters of LDA were tuned by a grid search [1] based on validation set perplexity. Table 1 gives an example of the 20 top-ranked words in terms of \(\phi _{kv}\) for three of the \(K=50\) topics. Each row corresponds to a different topic. The three topics represent blooming flowers, the autumn moon, and singing birds, respectively, from top to bottom. For example, in the topic corresponding to the top row, the words “ha na” (flowers), “ha ru” (spring), “ni ho” (the first two Hiragana characters of the word “ni ho hi,” which means fragrance), and “u me” (plum blossom) have large probabilities.
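The following sketch illustrates this step with gensim. Note that gensim’s LdaModel performs variational Bayesian inference rather than the collapsed Gibbs sampling used in our experiment, so it is only an approximate stand-in; the toy documents, the hyperparameter grid, and the validation split are likewise illustrative.

```python
from gensim import corpora, models

# Toy documents; in the experiment each document is the bigram sequence of one training Tanka poem.
docs = [["MO no", "o mo", "hu ni"], ["A ki", "ka se", "tu ki"], ["ha na", "ha ru", "u me"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
valid_corpus = corpus[:1]                 # illustrative held-out portion for perplexity

best = None
for alpha in (0.01, 0.1, 0.5):            # illustrative grid over the Dirichlet hyperparameters
    for eta in (0.01, 0.1, 0.5):
        lda = models.LdaModel(corpus, num_topics=50, id2word=dictionary,
                              alpha=alpha, eta=eta, passes=5)
        perplexity = 2 ** (-lda.log_perplexity(valid_corpus))
        if best is None or perplexity < best[0]:
            best = (perplexity, lda)

# Top-20 words per topic (cf. Table 1); the returned word probabilities correspond to phi_{kv}.
top_words = [best[1].show_topic(k, topn=20) for k in range(50)]
```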

Table 1. An example of topic words obtained by CGS for LDA

Our sequence scoring uses the \(\phi _{kv}\)’s, i.e., the per-topic word probabilities, learned by CGS for LDA. Based on the \(\phi _{kv}\)’s learned from the training set, we can estimate the topic probabilities of unseen documents by fold-in [1]. In our case, the unseen documents are the bigram sequences generated by the RNN. When the fold-in procedure estimates \(\theta _{dk}\) for some k to be far larger than \(\theta _{dk^\prime }\) for every \(k^\prime \ne k\), we can say that the document d is exclusively related to the topic k. In this manner, LDA can be used to determine whether a given Tanka poem is exclusively related to some particular topic. Using the fold-in estimate of \(\theta _{dk}\) for a Tanka poem generated by the RNN, we compute the entropy \(- \sum _{k=1}^K \theta _{dk} \log \theta _{dk}\), which we call the topic entropy of the poem. Smaller topic entropies are regarded as better, because they correspond to situations where the poem relates to some particular topic more exclusively. In other words, we would like to select poems showing topic consistency. Since LDA can extract a wide variety of topics, our scoring method is expected to select sequences that individually show topic consistency and together show topic diversity.
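As a sketch, the topic entropy can be computed as follows, reusing the lda and dictionary objects from the previous sketch; gensim’s get_document_topics here stands in for the fold-in estimation of \(\theta _{dk}\).

```python
import math

def topic_entropy(lda, dictionary, poem_bigrams):
    """Score a generated poem: lower topic entropy means the poem is more
    exclusively related to one topic. `lda` and `dictionary` are the objects
    fitted in the previous sketch; `poem_bigrams` is a list of bigram tokens."""
    bow = dictionary.doc2bow(poem_bigrams)
    theta = lda.get_document_topics(bow, minimum_probability=0.0)  # fold-in-style estimate of theta_dk
    return -sum(p * math.log(p) for _, p in theta if p > 0.0)

# Rank generated poems in ascending order of topic entropy (smaller is better):
# ranked = sorted(generated_poems, key=lambda poem: topic_entropy(lda, dictionary, poem))
```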

3 Evaluation

The evaluation experiment compared our scoring method with the method based on RNN output probabilities. The output probability of a sequence is obtained as follows. We generate a random sequence with the RNN by starting from the special word ‘BOS’ and then randomly drawing words one by one until the special word ‘EOS’ is drawn. The output probability of the generated sequence is the product of the output probabilities of its tokens, where the probability of each token is the probability the RNN outputs at the moment the token is drawn. Our LDA-based scoring was compared with this probability-based scoring.
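A sketch of this sampling and scoring, written against the TankaRNN model sketched in Subsect. 2.2 (the token ids of ‘BOS’ and ‘EOS’ are assumed to be known), might look as follows; probabilities are accumulated in log space to avoid underflow.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_poem(model, bos_id, eos_id, max_len=60):
    """Draw one sequence from the RNN and return it with its log output probability."""
    model.eval()
    tokens, log_prob = [bos_id], 0.0
    x, h = torch.tensor([[bos_id]]), None
    for _ in range(max_len):
        logits, h = model(x, h)
        probs = F.softmax(logits[0, -1], dim=-1)     # output distribution at this moment
        next_id = torch.multinomial(probs, 1).item()  # random draw of the next token
        log_prob += torch.log(probs[next_id]).item()
        tokens.append(next_id)
        if next_id == eos_id:
            break
        x = torch.tensor([[next_id]])
    return tokens, log_prob   # exp(log_prob) is the sequence's output probability
```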

We first investigate the difference between the top-ranked Tanka poems obtained by the two scoring methods. Table 2 presents an example of the top five Tanka poems selected by our method in the left column and those selected based on RNN output probabilities in the right column. To obtain these top-ranked poems, we first generated 100,000 Tanka poems with the GRU-RNN. Since many of the generated poems contained grammatically incorrect parts, a grammar check was required. However, we could not find any good grammar check tool for archaic Japanese. Therefore, as an approximation, we regarded a poem as grammatically incorrect if it contained at least one part appearing in no training Tanka poem. After removing grammatically incorrect poems, we assigned a score to each of the remaining ones with each method. Table 2 presents the resulting top five Tanka poems for each method.
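The approximate grammar check amounts to a simple set-membership test, as in the following sketch; here each poem is assumed to be already split into its five parts, and the example sets are toy data.

```python
def is_grammatical(poem_parts, training_parts):
    """Approximate grammar check: keep a generated poem only if every one of its
    parts also appears somewhere in the training poems."""
    return all(part in training_parts for part in poem_parts)

# Toy illustration; in the experiment the set is built from the 143,550 training poems,
# each split into its five parts.
training_parts = {"hi sa ka ta no", "a ri a ke no tu ki no", "ko ro mo te sa mu ki"}
print(is_grammatical(["hi sa ka ta no", "ko ro mo te sa mu ki"], training_parts))  # True
print(is_grammatical(["hi sa ka ta no", "zz zz zz"], training_parts))              # False
```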

Table 2. Top five Tanka poems selected by compared methods

Table 2 shows that when RNN output probabilities are used (right column), it is difficult to achieve topic consistency. The fourth poem contains the words “sa ku ra” (cherry blossom) and “yu ki” (snow). The fifth one contains the words “sa ku ra” (cherry blossom) and “a ki ka se” (autumn wind). In this manner, the poems top-ranked by RNN output probabilities sometimes contain words expressing different seasons, which is undesirable in Tanka composition. In contrast, the first poem selected by our method contains the words “si ku re” (drizzling rain) and “mo mi ti” (autumnal tints). Because drizzling rain is a shower observed in late autumn or early winter, the word “mo mi ti” fits well within this context. In the third poem selected by our method, the words “ya ma” and “hu mo to” are observed; the former means mountain, and the latter the foot of a mountain. This poem also shows topic consistency. However, a slight weakness can be observed in the poems selected by our method: the same word tends to be used twice or more. While refrains are often observed in Tanka poems, future work may introduce an improvement here.

Table 3. Five most frequent parts observed in the top 200 Tanka poems

We next investigate the diversity of the selected Tanka poems. We took the 200 top-ranked poems given by each method and split each poem into its five parts, obtaining 1,000 parts in total. Since the resulting set of 1,000 parts included duplicates, we grouped identical parts and counted their occurrences. Table 3 presents the five most frequent parts for each method. When RNN output probabilities were used (right column), “a ri a ke no tu ki no” appeared 13 times, “hi sa ka ta no” 12 times, and so on, among the 1,000 parts coming from the 200 top-ranked poems. In contrast, when our LDA-based scoring was used (left column), “hi sa ka ta no” appeared nine times, “ko ro mo te sa mu ki” eight times, and so on. That is, there were fewer duplicates for our method. Moreover, while only 678 of the 1,000 parts were unique when RNN output probabilities were used, 806 were unique when our method was used. We can thus say that our method achieved greater diversity.
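The duplicate counting behind Table 3 can be sketched with collections.Counter; again each poem is assumed to be already split into its parts, and the input below is toy data.

```python
from collections import Counter

def part_statistics(top_poems):
    """Split each top-ranked poem into its parts and count duplicates,
    as done for Table 3. Each poem is given as a list of its parts."""
    parts = [part for poem in top_poems for part in poem]
    counts = Counter(parts)
    return counts.most_common(5), len(counts)   # five most frequent parts, number of unique parts

# Toy illustration; in the experiment `top_poems` holds the 200 top-ranked poems of one method.
most_common, n_unique = part_statistics([["hi sa ka ta no", "a ki no yo no"],
                                         ["hi sa ka ta no", "tu ki no hi ka ri ni"]])
print(most_common, n_unique)
```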

4 Previous Studies

While there already exist many proposals for sequence generation using RNNs, LDA is a key component of our method. Therefore, we focus on proposals using topic modeling. Yan et al. [12] utilize LDA for Chinese poetry composition. However, their LDA is used only for obtaining word similarities, not for exploring topical diversity. Combinations of topic modeling and RNNs can also be found in proposals not related to automatic poetry composition. Dieng et al. [5] propose such a combination: their model, called TopicRNN, modifies the output word probabilities of the RNN by using long-range semantic information of documents captured by an LDA-like mechanism. However, when generating random sequences with TopicRNN, we need to choose one of the existing documents as a seed, which means we can only generate sequences similar to the chosen document. While TopicRNN has this limitation, it provides a valuable guide for future work. Our method detaches sequence selection from sequence generation; it may be better to directly generate sequences having some desirable property regarding their topical contents.

5 Conclusions

This paper proposed a method for scoring sequences generated by an RNN. The proposed method was compared with scoring based on RNN output probabilities, and the experiment showed that our method selected more diverse Tanka poems. In this paper, we only considered obtaining better sequences by screening generated sequences. However, the same goal can also be pursued by modifying the architecture of the RNN. As discussed in Sect. 4, Dieng et al. [5] incorporate an idea from topic modeling into the architecture of the RNN. It is an interesting research direction to propose an RNN architecture that can directly generate sequences diverse in topics. With respect to evaluation, another possible direction is to apply evaluations using BLEU [9] or even human subjective evaluations to ensure reliability.