1 Introduction

Keywords summarize a document concisely and give a high-level description of the document content [1]. Keywords have been used in various domains, including document summarization [2], document classification, document clustering [3], document retrieval, topic search, and document analysis [1]. With the development of digital libraries and publishing, there is a need to assemble new books or resources by taking advantage of books that have already been published and stored in digital libraries. Editors at the press segment each book into hundreds of items, each devoted to its own topic. Because each item is to some extent semantically independent of the others and corresponds to one or more topics, editors previously had to assign several keywords to each item manually according to its meaning, which is a time-consuming process. As more new books arrive, the editors' workload of assigning new keywords grows heavier. An automatic keyword recommendation mechanism is therefore needed to speed up the process of producing book items.

Numerous methods have been proposed to automatically extract keywords from a text. Keyword extraction techniques try to extract the words that best summarize the text, which means the keywords must come from the text content itself.

In the digital library and publication domain, a huge number of books has accumulated that can be used as a corpus for developing an automatic keyword recommendation system. The problem is which keywords should be recommended when a new item arrives. A traditional keyword extraction method such as TextRank [4] can extract important words from an item, but in many cases the desired keyword may not even appear in the content of the item. Under such circumstances, traditional keyword extraction cannot solve the problem on its own, and a supervised method is needed to conduct the keyword recommendation process. The focus of this paper is to solve the item keyword recommendation problem using a supervised keyword recommendation algorithm.

2 Keyword Extraction Methods

Earlier techniques mainly rely on word frequencies or TFIDF values to determine the weight of candidate words [5]. Although the frequency of a word can indicate its importance in some cases, there are still cases where important words appear only a few times. To overcome this limitation of frequency-based keyword extraction, graph-based methods inspired by PageRank [6] were proposed; they build a network of words/phrases and rank the nodes using some centrality measure, with variants including [2] and HITS [7]. Semantic methods are intended to bring meaningful information into keyword extraction: semantic relations between words can be found with the help of WordNet, Wikipedia, or HowNet to recommend words semantically similar to those in the text. Topic modeling methods, including LSA, PLSA, and LDA, are used to mine hidden topics to improve the accuracy and coverage of keywords.
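To make the graph-based idea concrete, the following is a minimal TextRank-style sketch in Python; the co-occurrence window size, damping factor, and whitespace tokenization are illustrative choices of ours, not those of the cited works.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=50, top_k=5):
    """Rank candidate words by a PageRank-style score over a co-occurrence graph."""
    # Build an undirected co-occurrence graph within a sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                neighbors[w].add(tokens[j])
                neighbors[tokens[j]].add(w)

    # Plain PageRank iteration with uniform initial scores.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        new_score = {}
        for w in neighbors:
            rank_sum = sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            new_score[w] = (1 - damping) + damping * rank_sum
        score = new_score

    return sorted(score, key=score.get, reverse=True)[:top_k]

print(textrank_keywords("digital library keyword recommendation for digital publication".split()))
```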

Keyword extraction can also be formulated as a supervised classification problem. The word or phrase to be classified is represented as a vector of features, which may include tf-idf values [1], length, or occurrence position [8]. A training set annotated with positive and negative examples must be provided, and during the testing phase each candidate keyword is represented as a feature vector to be classified. Various machine learning methods have been used, including SVM [1], decision trees, and conditional random fields [9]. The shortcoming of supervised methods is that they need a manually constructed training set, which is time-consuming and hard to obtain.

3 TFIDF-Similarity Based Keyword Recommendation

Traditional extraction methods extract keywords from the content itself. But when the content of a document is not long enough, it is difficult to extract useful keywords from the document directly. Recommending existing keywords to new documents can be implemented with the tfidf-similarity based keyword recommendation technique described in [10].

Given a document set \( D = \left\{ {d_{1} ,d_{2} , \ldots ,d_{n} } \right\} \), every document is annotated with several keywords: \( d_{i} = \{ text,tagset\} \), where the tagset comprises several keywords and all the keywords together form a keyword library T. When a new document q arrives, we need to recommend suitable existing keywords based on its content. The process can be described in two steps:

Step 1. Compute P(t|q,T,D), the probability of every keyword in the keyword library T, by comparing the new document q with the documents in D.

Step 2. Sort P(t|q,T,D) in descending order and select the top k keywords as the final recommendation.

P(t|q,T,D) can be formulated as follows:

$$ P(t|q,T,D) = \frac{keyWeight(t,q,D)}{{\sum\limits_{t' \in T} {keyWeight(t',q,D)} }} $$
(1)

where \( keyWeight(t,q,D) \) is the weight of keyword t, computed from the similarity of the new document q to all documents in D:

$$ keyWeight(t,q,D) = \sum\limits_{d \in D} {DocSim(q,d) \times isTag(t,d)} $$
(2)

\( DocSim(q,d) \) is the similarity of the new document q and document d in corpus D; we use the cosine similarity measure:

$$ DocSim(q,d) = \frac{q \cdot d}{\left\| q \right\| \times \left\| d \right\|} = \frac{{\sum\limits_{i = 1}^{n} {q_{i} \times d_{i} } }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {q_{i}^{2} } } \times \sqrt {\sum\nolimits_{i = 1}^{n} {d_{i}^{2} } } }} $$
(3)
$$ isTag(t,d) = \left\{ {\begin{array}{*{20}c} 1 & {t \in d} \\ 0 & {t \notin d} \\ \end{array} } \right. $$
(4)

When d is annotated with keyword t, isTag(t,d) takes the value 1; otherwise it is 0.

The vectors q and d consist of the TF-IDF values of each word in documents q and d:

$$ q = (tfidf(w_{1} ),tfidf(w_{2} ), \ldots ,tfidf(w_{n} )) $$
(5)
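As a concrete illustration, the following Python sketch implements Eqs. (1)-(4), assuming the documents are already represented as sparse TF-IDF dictionaries; the helper names and parameters are ours, not from [10].

```python
import math
from collections import defaultdict

def cosine(q, d):
    """Cosine similarity of two sparse TF-IDF vectors (Eq. 3)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def recommend_tfidf_similarity(q_vec, corpus, top_k=5):
    """corpus: list of (tfidf_vector, tagset) pairs; returns the top-k keywords."""
    key_weight = defaultdict(float)
    for d_vec, tagset in corpus:
        sim = cosine(q_vec, d_vec)
        for t in tagset:                    # isTag(t, d) = 1 (Eq. 4)
            key_weight[t] += sim            # keyWeight(t, q, D) (Eq. 2)
    total = sum(key_weight.values()) or 1.0
    scores = {t: w / total for t, w in key_weight.items()}   # P(t|q,T,D) (Eq. 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```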

4 Probability-Based Keyword Recommendation

4.1 Problem Definition

Our keyword recommendation method for items in the dynamic publication domain can be formulated as follows. The training set is composed of items annotated with keywords by editors:

TrainingSet = {[Tags(1), Item(1), ClassId(1)], [Tags(2), Item(2), ClassId(2)], …, [Tags(i), Item(i), ClassId(i)], …, [Tags(n), Item(n), ClassId(n)]},

where Tags(i) is the keyword set assigned to item i by the editors, Tags(i) = (key(1), key(2), …, key(m)), key(j) is a keyword, and the number of keywords m varies from one to ten or more. In the digital publication domain, the number of keywords per item usually varies from 3 to 5. ClassId(i) is the category id of item i, indicating the class the item belongs to. We use text classification to classify the item first in order to narrow the range of candidate keywords: since the training set is usually very large, direct keyword recommendation would have to evaluate thousands of keywords, and finding the best among them is difficult. Because the items in the training set carry classification information, when a new item arrives it is first classified into the proper category, and then the keywords in that category are recommended to the new item based on its content.

The process can be described as follows. First we run the classification algorithm to find the category of the item, with result category k. Given a new item, our aim is to compute the probability of every keyword in category k: we compute \( p(k_{i} |item) \), where \( k_{i} \) is a keyword from category k. We then sort the list of \( p(k_{i} |item) \) values and select the top N as the candidate keywords for the item.
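The paper does not prescribe a particular classifier for the category-narrowing step; as an illustration only, a simple multinomial Naive Bayes text classifier (here via scikit-learn, an assumed dependency) could be used to restrict the candidate keywords to one category:

```python
# Illustrative category-narrowing step; scikit-learn is an assumed dependency
# and the classifier choice is a placeholder, not specified by the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_item_classifier(item_texts, class_ids):
    """item_texts: list of item texts; class_ids: their category ids."""
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(item_texts, class_ids)
    return clf

def candidate_keywords(clf, new_item, keywords_by_category):
    """Return the predicted category and its keyword library only."""
    category = clf.predict([new_item])[0]
    return category, keywords_by_category[category]
```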

4.2 Probabilistic Modeling

Bayes' theorem is used to model the probability that keyword \( k_{i} \) best represents the item. Given a new item, we need to compute the probability \( p(k_{i} |item) \):

$$ p(k_{i} |item) = \frac{{p(item|k_{i} ) \times p(k_{i} )}}{p(item)} $$
(6)

The probability \( p(item) \) is the same for every candidate keyword, so it can be ignored, and therefore

$$ p(k_{i} |item) \propto p(item|k_{i} ) \times p(k_{i} ) $$
(7)

Every item is a fragment of text composed of words/phrases, and we make the hypothesis that the words/phrases are independent of each other, which is the bag-of-words model. The probability of the item given keyword \( k_{i} \) can then be calculated as follows:

$$ p(item|k_{i} ) = \prod\limits_{j = 1}^{m} {p(w_{j} |k_{i} )} $$
(8)

where \( w_{j} \) is the j-th term of the item, and \( p(w_{j} |k_{i} ) \) is the probability of term \( w_{j} \) occurring in items annotated with keyword \( k_{i} \).

We model the probability \( p(w_{j} |k_{i} ) \) as below, which differs from the model in [11] and proved more effective in our experiments.

$$ p(w_{j} |k_{i} ) \propto \frac{{tfidf(w_{j} ) \times tf(w_{j} ,k_{i} )}}{{p(k_{i} ) \times \sqrt {\sum\limits_{j = 1}^{m} {tf^{2} (w_{j} ,k_{i} )} } }} $$
(9)

\( tfidf(w_{j} ) \) is the weight of term \( w_{j} \), which can be computed by the standard tfidf formula:

$$ tfidf(w_{j} ) = tf(w_{j} ) \times idf(w_{j} ) $$
(10)

where \( tf(w_{j} ) \) is the term frequency of term \( w_{j} \) in the item and \( idf(w_{j} ) \) is the inverse document frequency of term \( w_{j} \).

\( tf(w_{j} ,k_{i} ) \) is the term frequency of \( w_{j} \) in items annotated with keyword \( k_{i} \). \( p(k_{i} ) \) is the probability of keyword \( k_{i} \) in the training set \( TR_{j} \) of category j, and it can be computed as follows:

$$ p(k_{i} ) = \frac{{tf(k_{i} ,TR_{j} )}}{{\sum\limits_{i} {tf(k_{i} ,TR_{j} )} }} $$
(11)

where \( tf(k_{i} ,TR_{j} ) \) is the frequency of keyword \( k_{i} \) in the training set \( TR_{j} \) of category j.

In the training phase, we first calculate the probabilities \( p(k_{i} ) \) and \( p(w_{j} |k_{i} ) \) and store the results, so that \( p(k_{i} |item) \) can be computed for each candidate keyword. Finally we sort \( p(k_{i} |item) \) in descending order and select the top N keywords as the final recommendation.
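A minimal sketch of the training and scoring steps for one category is given below. Log-space scoring, the small epsilon term, the precomputed idf table, and counting \( tf(k_{i} ,TR_{j} ) \) as the number of items tagged with \( k_{i} \) are implementation assumptions of this sketch, not prescribed by the paper.

```python
import math
from collections import Counter, defaultdict

class ProbKeywordRecommender:
    """Sketch of the probability-based recommender (Eqs. 7-9, 11) for one category.

    training_items: list of (tokens, tags) pairs from the category's training set.
    idf: precomputed inverse document frequencies over the whole corpus.
    """

    def __init__(self, training_items, idf, eps=1e-9):
        self.idf = idf
        self.eps = eps
        self.tag_freq = Counter()                 # tf(k_i, TR_j), counted per tagged item
        self.term_freq = defaultdict(Counter)     # tf(w_j, k_i)
        for tokens, tags in training_items:
            counts = Counter(tokens)
            for k in tags:
                self.tag_freq[k] += 1
                self.term_freq[k].update(counts)
        total = sum(self.tag_freq.values())
        self.p_k = {k: c / total for k, c in self.tag_freq.items()}   # Eq. 11

    def recommend(self, tokens, top_n=5):
        item_tf = Counter(tokens)
        scores = {}
        for k, p_k in self.p_k.items():
            norm = math.sqrt(sum(c * c for c in self.term_freq[k].values()))
            log_score = math.log(p_k)             # p(k_i) factor in Eq. 7
            for w, tf_w in item_tf.items():
                tfidf_w = tf_w * self.idf.get(w, 0.0)                 # Eq. 10
                p_w_k = tfidf_w * self.term_freq[k][w] / (p_k * norm + self.eps)  # Eq. 9
                log_score += math.log(p_w_k + self.eps)               # product in Eq. 8
            scores[k] = log_score
        return sorted(scores, key=scores.get, reverse=True)[:top_n]
```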

5 A Hybrid Approach of Keyword Recommendation

The tfidf-similarity based method and the probability based method can exploit previously annotated keywords to recommend meaningful keywords that keyword extraction techniques cannot produce, while keyword extraction can find relevant words/phrases in the text content of items. In digital publication, the keywords describing an item often do not come from its content directly but are broader words that describe the domain and character of the item; on the other hand, if the item is quite different from the existing training data, keyword extraction is useful for recommending new keywords to the editor. The editors audit the recommended keywords, give feedback to the system on whether the keywords are appropriate, and supply the correct keywords to update the training model.

Our proposed algorithm is a hybrid approach to keyword recommendation that combines the probability based method and the traditional extraction based method described above. The reason we pair the extraction based method with the probability based method is that we hope it can extract useful words directly from the item that the probability based method may not cover (Fig. 1).

Fig. 1. Algorithm of hybrid keyword recommendation

Steps 1 to 3 constitute the training process of the probability-based keyword recommendation. Steps 4 to 6 first classify the new item into category i and then calculate the probability of every keyword in category i with respect to the new item. Steps 7 and 8 select part of the keywords from the probability-based recommendation and part from the extraction-based recommendation method.
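As an illustration of steps 7 and 8, the following sketch merges the two ranked keyword lists with a mixing proportion p (0.2 corresponds to the 20 %/80 % split discussed in Sect. 6.3); the rounding and de-duplication order are assumptions of this sketch, not details taken from Fig. 1.

```python
def hybrid_recommend(prob_keywords, extracted_keywords, top_n=5, p=0.2):
    """Take about p * top_n keywords from the extraction-based ranking and
    fill the remaining slots from the probability-based ranking."""
    n_extract = round(p * top_n)
    result = list(extracted_keywords[:n_extract])
    for k in prob_keywords:
        if len(result) >= top_n:
            break
        if k not in result:          # skip duplicates already contributed by extraction
            result.append(k)
    return result
```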

6 Experiments and Evaluation

6.1 Dataset

In the digital publication domain, we have accumulated huge amounts of items collected from books published by the press. These items are XML texts that mainly contain Chinese words together with some English terminology. Keywords are assigned to items manually by editors, with each keyword consisting of one or more words. Items have been classified into a constrained category tree, which is used in the classification process. The dataset contains 40,147 annotated items in XML format across different categories and with varying numbers of keywords. We split the dataset into a training set and a test set and use 10-fold cross-validation to test and validate our method. We evaluate our hybrid and probability based methods against a traditional keyword extraction method (TextRank [4]) and the tfidf-similarity based method. We did not perform a user study because we have enough annotated items for testing, the items were annotated by expert editors who are authoritative in tagging work, and this also saves a great deal of time. The statistics of the corpus used for training and testing are listed in Table 1.

Table 1. Statistics of the corpus for training and testing

6.2 Evaluation Metrics

This section presents the evaluation metrics used in our experiments: precision, recall, and F1. Used in combination, these metrics have proven effective for evaluating our method. Precision, recall, and F1 (F-measure) are well-known evaluation metrics in the information retrieval literature [12]. \( T_{r} \) denotes the set of keywords returned by the algorithm for a new item, and we use the original set of annotated keywords as the ground truth \( T_{g} \).

In our experiments, precision, recall, and F1 are defined as follows:

$$ precision = \frac{{\left| {T_{g} \cap T_{r} } \right|}}{{\left| {T_{r} } \right|}},\quad recall = \frac{{\left| {T_{g} \cap T_{r} } \right|}}{{\left| {T_{g} } \right|}},\quad F1 = \frac{2 \times precision \times recall}{precision + recall} $$
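For instance, evaluating one item's recommendation against its ground-truth keywords could look like the following sketch; the example keyword lists are invented for illustration.

```python
def prf1(recommended, ground_truth):
    """Set-based precision, recall and F1 for one item's keywords."""
    tr, tg = set(recommended), set(ground_truth)
    hit = len(tr & tg)
    precision = hit / len(tr) if tr else 0.0
    recall = hit / len(tg) if tg else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(prf1(["digital library", "keyword", "classification"],
           ["keyword", "classification", "recommendation"]))   # (0.667, 0.667, 0.667)
```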

6.3 Results

We vary the number of keywords recommended to a test item from 1 to 15 using four keyword recommendation algorithms, including the hybrid and probability based methods we propose. In the digital publication domain, editors usually annotate 3 to 5 keywords/phrases per item, so we pay particular attention to the results when 3 to 5 keywords are recommended.

Figures 2, 3 and 4 plot the precision, recall, and F1 values of the four keyword recommendation methods. The hybrid and probability based methods we propose outperform the tfidf-similarity based method and the traditional keyword extraction method. The probability based method performs better than the hybrid method when 3 to 5 keywords are recommended, because the hybrid result contains keywords from the traditional extraction method, which lowers precision. When one to five keywords are recommended, the probability based method achieves a precision of more than 90 %, much higher than the tfidf-similarity and extraction based methods.

Fig. 2. Precision of the four keyword recommendation methods

Fig. 3. Recall of the four keyword recommendation methods

Fig. 4. F1 value of the four keyword recommendation methods

The extraction based method performs worst because the keyword annotation of items was done by editors, and the annotated keywords mostly do not come from the content of the items directly but from a comprehensive understanding of each item. Another reason is that the average length of the items is only 182 words, and finding appropriate keywords in such short items is not easy for traditional keyword extraction methods. This does not mean that we should abandon the extraction method: there are cases where a new item is quite different from the training set and keywords recommended from statistical information may not cover the main idea of the item. Extracting keywords from the content of the item gives editors the chance to assign personalized keywords to the item.

We achieve a recall of up to 0.98 when 5 or more keywords are recommended. Recall is particularly important for keyword recommendation in the digital publication domain because most of the time the recommended keywords are not adopted automatically but require manual verification and auditing. High recall lets editors select the keywords most relevant to the new item from a wider range, while low recall limits the editor's scope; if the editor cannot find proper keywords in the recommended list, it costs a lot of time to read through the content of the item and select keywords manually. From Fig. 3 we can see that the recall value rises as more keywords are recommended, and when 5 or more keywords are recommended the recall reaches its highest, stable value.

We found that when the number of recommended keywords is five, we achieve the best F1 value (0.9) with both high precision and high recall, because a large part of the training set consists of items annotated with five keywords. Recommending fewer keywords reduces recall and F1, while recommending more keywords reduces precision.

According to the experimental results in Figs. 5, 6 and 7, combining 20 % of the result of the extraction based method with 80 % of the result of the probability based method yields the best precision, recall, and F1 values, so we select ep20 (0.2) as the parameter p in the hybrid algorithm. The series ep0, ep20, …, ep100 in Figs. 5, 6 and 7 denote the percentage of the extraction based method used in the final keyword recommendation; ep0 is equivalent to the probability based method and ep100 corresponds to the extraction based method. When editors want new keywords recommended directly from the content of the item, we can use the hybrid approach; otherwise, the probability based approach is recommended.

Fig. 5. Precision of the hybrid method with different proportions of the extraction based method

Fig. 6. Recall of the hybrid method with different proportions of the extraction based method

Fig. 7. F1 value of the hybrid method with different proportions of the extraction based method

7 Conclusions and Future Work

This paper presents a probability based and a hybrid keyword recommendation algorithm that achieve precision, recall, and F1 values of more than 90 % at best on a digital publication dataset, outperforming the traditional extraction based and tfidf-similarity based methods in keyword recommendation. The algorithm is motivated by the keyword annotation problem in digital publication: when a new, unannotated item arrives, the algorithm automatically recommends relevant keywords to the editor.

The probability based method utilizes statistical information from the annotated training set to recommend existing keywords to incoming items. The hybrid method combines the traditional extraction based method with the probability based method to exploit the advantages of both. Experiments on the dataset of book items provided by the press show that the probability based and hybrid methods outperform the traditional keyword extraction method and the tfidf-similarity based method. Future work includes experiments on other annotated datasets, improvements based on topic models, and other extraction based algorithms.