1 Introduction

Citation analysis has been the main method of measuring innovation and identifying important and/or pioneering scientific papers. It is assumed that papers with high citation counts have made a significant impact on their fields of study and are therefore innovative. This approach, however, has a number of shortcomings. Works by well-known authors and/or papers published at well-established venues tend to receive more attention and citations than others (the rich-get-richer effect) [35]. According to Merton [19], who first described this phenomenon in 1968, publications by more eminent researchers receive disproportionately more recognition than similar works by less well-known authors. This is known as the Matthew Effect, named after the biblical Gospel of Matthew. Serenko and Dumay [30] observed that old citation classics keep getting cited because they appear among the top results in Google Scholar and are automatically assumed to be credible. Some authors also assume that reviewers expect to see those classics referenced in a submission regardless of their relevance to the work being submitted. Finally, there is the problem of self-citations, which inflate a paper’s citation count without reflecting its impact on the field.

We addressed these shortcomings in our previous work [27] by proposing a machine learning-based method of measuring the innovativeness of scientific papers. Our current method involves training a Correlated Topic Model (CTM) [3] on a diachronic corpus of papers published in a conference series or journal over as many years as possible, training a model that predicts publication years from topic distributions used as feature vectors, and calculating a real-valued innovation score for each paper based on the prediction error.

We consider a paper innovative if it covers topics that will be popular in the future but have not been researched in the past. Therefore, the more recent the publication year predicted by our model compared to the actual year of publication, the higher the paper’s score. We showed in [27] that our innovation scores are positively correlated with citation counts, but there are also highly scored papers with few citations. These papers may be worth looking into as potential “hidden gems”: they cover topics researched in the future but have gone relatively unnoticed. Interestingly, we have not found any highly cited papers with low innovation scores.

2 Related Work

The development of research areas and the evolution of topics in academic conferences and journals over time have been investigated by numerous researchers. For example, Meyer et al. [20] study the Journal of Artificial Societies and Social Simulation (JASSS) by means of citation and co-citation analysis. They identify the most influential works and authors and show the multidisciplinary nature of the field. Saft and Nissen [25] also analyze JASSS, but they use a text mining approach that links documents into thematic clusters in a manner inspired by co-citation analysis. Wallace et al. [34] study trends in the ACM Conference on Computer Supported Cooperative Work (CSCW). They analyzed over 1,200 papers published between 1990 and 2015, recording data such as publication year, type of empirical research, type of empirical evaluation used, and the systems/technologies involved. The authors of [21] analyze trends in the writing style of papers from the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI) published over a 36-year period.

Recent research on identifying potential breakthrough publications includes the works of Schneider and Costas [28, 29]. Their approach is based on analyzing citation networks, focusing on highly-cited papers. Ponomarev et al. [22] predict citation count based on citation velocity, whereas Wolcott et al. [36] use random forest models on a number of features (e.g., author count, reference count, H-index) as well as citation velocity. These approaches, in contrast to ours, take non-textual features into account. They also define breakthrough publications as either highly-cited influential papers resulting in a change in research direction, or “articles that result from transformative research” [36].

A different approach to identifying novelty was proposed by Chan et al. [5]. They developed a system for finding analogies between research papers, based on the premise that “scientific discoveries are often driven by finding analogies in distant domains”. One of the examples given is the simulated annealing optimization algorithm inspired by the annealing process commonly used in metallurgy. Identifying interdisciplinary ideas as a driver for innovation was also studied by Thorleuchter and Van den Poel [33]. Several works have employed machine learning-based approaches to predict citation counts and the long-term scientific impact (LTSI) of research papers, e.g., [37] or [31].

Examples of topic-based approaches include Hall et al. [11], who trained an LDA model on the ACL Anthology and showed trends over time, such as topics increasing and declining in popularity. Unlike our approach, theirs involved hand-picking topics from the generated model and manually seeding 10 more topics to improve field coverage. More recently, Chen et al. [7] studied the evolution of topics in the field of information retrieval (IR). They trained a 5-topic LDA model on a corpus of around 20,000 papers from Web of Science. Sun and Yin [32] used a 50-topic LDA model trained on a corpus of over 17,000 abstracts of research papers on transportation published over a 25-year period to identify research trends by studying the variation of topic distributions over time. Another interesting example is the paper by Hu et al. [12], where Google’s Word2Vec model is used to enhance topic keywords with more complete semantic information, and topic evolution is analyzed using spatial correlation measures in a semantic space modeled as an urban geographic space.

Research on document dating (timestamping) is also related to our work. Typical approaches to document dating are based on changes in word usage and on language change over time, and they use features derived from temporal language models [9, 14], diachronic word frequencies [8, 26], or occurrences of named entities. Examples of approaches based on heuristic methods include [10, 15] and [16]. Jatowt and Campos [13] implemented a visual, interactive system based on n-gram frequency analysis. In our work we rely on predicting publication dates to determine paper innovativeness. Ordinal regression models trained on topic vectors can be regarded as a variation of temporal language models, reflecting vocabulary change over time. Aside from providing a means of timestamping, they also allow us to study how new ideas emerge, gain, and lose popularity.

3 Datasets

The corpora we study in this paper contain 3,577 papers published at the International World Wide Web Conference (WWW) between the years 1994 and 2019, and 835 articles published in the Journal of Artificial Societies and Social Simulation (JASSS) from 1998 to 2019. We had studied papers from the WWW Conference before [27], which is why we decided to use this corpus again after updating it with papers published after our first analysis, i.e., those from 2018 and 2019. We chose JASSS as the second corpus in order to demonstrate our method on another major publication venue, in a related but separate field, published over a long period. JASSS is publicly available in HTML, which makes it straightforward to extract text from the documents.

In an effort to extract only relevant content, we performed the following preprocessing steps on all texts before converting them to Bag-of-Words vectors (a code sketch of these steps follows the list):

  1. Discarding page headers and footers, as well as the References, Bibliography and Acknowledgments sections, as “noise” irrelevant to the main paper topic(s)

  2. Conversion to lower case

  3. Removal of stopwords and punctuation, as well as numbers, including ones spelled out, e.g. “one”, “two”, “first”, etc.

  4. Part-of-Speech tagging using the Penn Treebank POS tagger (NLTK) [2] – this step is only a prerequisite for the WordNet Lemmatizer; we do not use the POS tags in further processing

  5. Lemmatization using the WordNet Lemmatizer in NLTK.
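
These steps can be reproduced with standard NLTK components. The following is a minimal sketch, assuming the input text has already had headers, footers and reference sections removed (step 1); the preprocess helper, the Penn-to-WordNet tag mapping, and the small set of spelled-out numbers are illustrative rather than the exact implementation we used.

```python
import string

from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
SPELLED_OUT_NUMBERS = {"one", "two", "three", "first", "second", "third"}  # illustrative subset
lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the WordNet POS expected by the lemmatizer."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default: noun

def preprocess(text):
    # lower-case and tokenize
    tokens = word_tokenize(text.lower())
    # drop stopwords, punctuation, digits and spelled-out numbers
    tokens = [t for t in tokens
              if t not in STOPWORDS
              and t not in string.punctuation
              and not any(c.isdigit() for c in t)
              and t not in SPELLED_OUT_NUMBERS]
    # POS tags are used only to guide WordNet lemmatization
    return [lemmatizer.lemmatize(tok, penn_to_wordnet(tag))
            for tok, tag in pos_tag(tokens)]
```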

4 Method

4.1 Topic Model

In our previous work [27] we trained Latent Dirichlet Allocation (LDA) [4] topic models. In this paper, however, we have moved to Correlated Topic Models (CTM) [3] and built LDA models only as a baseline. Unlike LDA, which assumes topic independence, CTM allows for correlation between topics. We have found this to be better suited for modeling topics that evolve over time, including splitting or branching. We used the reference C implementation found at http://www.cs.columbia.edu/~blei/ctm-c/.

In order to choose the number of topics k, we built a k-topic model for each k in a range we consider broad enough to include the optimal number of topics. In the case of LDA this range was \(\langle 10, 60\rangle \). We then chose the models with the highest \(C_V\) topic coherence. As shown by Röder et al. [24], this measure best approximates human judgments of topic interpretability. Furthermore, according to Chang et al. [6], selecting topic models based on traditional likelihood or perplexity measures results in models that are worse in terms of human understandability. The numbers of topics we chose for our LDA models were 44 for the WWW corpus and 50 for JASSS. Because CTM supports more topics for a given corpus [3] and allows for a more granular topic model, we explored higher ranges of k than in the case of LDA: \(\langle 30, 100\rangle \) for WWW and \(\langle 40, 120\rangle \) for JASSS. As before, we chose the models with the highest \(C_V\).
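
To illustrate the selection procedure, the sketch below scores LDA candidates with gensim’s \(C_V\) coherence implementation and keeps the best one. Our CTM models were built with the reference C implementation, so the gensim LdaModel here only stands in for the general pattern; the function and variable names are illustrative.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def best_model_by_cv(tokenized_docs, k_range):
    """Train one candidate model per k and keep the one with the highest C_V coherence."""
    dictionary = Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    best = None
    for k in k_range:
        lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=k, passes=10)
        cv = CoherenceModel(model=lda, texts=tokenized_docs,
                            dictionary=dictionary, coherence="c_v").get_coherence()
        if best is None or cv > best[1]:
            best = (lda, cv, k)
    return best  # (model, coherence score, number of topics)

# e.g. for the LDA baseline on the WWW corpus: best_model_by_cv(docs, range(10, 61))
```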

4.2 Publication Year Prediction

Because publication years are ordinal rather than categorical values, we have replaced the One-vs-One and One-vs-Rest multiclass classifiers used in our previous work with ordinal regression (a.k.a. ordinal classification) based on the framework proposed by Li and Lin [17], as applied to photograph dating by Martin et al. [18]. An N-class ordinal classifier consists of \(N - 1\) before-after binary classifiers, i.e. for each pair of consecutive years a classifier is trained which assigns documents to one of two classes: “year y or before” and “year \(y+1\) or after”. Given the class membership probabilities predicted by these classifiers, the overall classifier confidence that paper p was published in the year Y is determined, as in [18], by Eq. 1:

$$\begin{aligned} conf(p, Y) = \prod _{y = Y_{min}}^{Y} P(Y_p \le y) \cdot \prod _{y = Y + 1}^{Y_{max}}(1 - P(Y_p \le y)) \end{aligned}$$
(1)

where \(Y_{min}\) and \(Y_{max}\) are the first and last year in the corpus, and \(Y_p\) is the publication year of the paper p.

We used topic probability distributions as k-dimensional feature vectors, where k is the number of topics. Due to the small size of the JASSS corpus, we trained a separate model to evaluate each document (leave-one-out cross-validation), whereas for the WWW corpus we settled for 10-fold cross-validation. We implemented ordinal regression using linear Support Vector Machine (SVM) classifiers.
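
A minimal sketch of such an ordinal classifier built from scikit-learn parts is shown below. It trains one binary linear SVM per pair of consecutive years and combines their probabilities by multiplying, for each boundary, the probability of the side consistent with the candidate year, which is the intent of Eq. 1. Wrapping LinearSVC in CalibratedClassifierCV to obtain probability estimates is our assumption; the paper only states that linear SVMs were used. All identifiers are illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

class OrdinalYearClassifier:
    """Li & Lin-style ordinal classifier: one 'year y or before' vs 'year y+1 or after'
    binary classifier for each pair of consecutive years."""

    def fit(self, X, years):
        # X: 2-D array of topic-distribution features, years: 1-D array of publication years
        self.year_range = np.arange(years.min(), years.max() + 1)
        self.boundaries = self.year_range[:-1]               # thresholds y
        self.models = {}
        for y in self.boundaries:
            labels = (years <= y).astype(int)                # 1 = "year y or before"
            clf = CalibratedClassifierCV(LinearSVC(), cv=3)  # probability calibration (assumption)
            self.models[y] = clf.fit(X, labels)
        return self

    def confidences(self, x):
        """Return conf(p, Y) for every candidate year Y of a single feature vector x."""
        x = x.reshape(1, -1)
        p_le = {y: self.models[y].predict_proba(x)[0, 1] for y in self.boundaries}
        conf = {}
        for Y in self.year_range:
            # probability of the side consistent with Y, multiplied over all boundaries (cf. Eq. 1)
            factors = [p_le[y] if Y <= y else 1.0 - p_le[y] for y in self.boundaries]
            conf[int(Y)] = float(np.prod(factors))
        return conf
```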

4.3 Paper Innovation Score

Following [27], we define our innovation score based on the classifier confidences obtained in the previous step as the weighted mean publication year prediction error, with the confidence scores as weights:

$$\begin{aligned} S_P(p) = \frac{\sum _{y} conf(p, y) \cdot (y - Y_p)}{\sum _{y} conf(p, y)} \end{aligned}$$
(2)

where \(Y_p\) is the year paper p was published in and \(conf(p, y)\) is the classifier confidence for paper p and year y. Unlike the score defined in [27], the denominator in Eq. 2 does not equal 1, since the scores \(conf(p, y)\) defined in Eq. 1 are not class membership probabilities.
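
Given the per-year confidences, Eq. 2 reduces to a few lines. A sketch with illustrative names, assuming conf maps candidate years to the conf(p, y) values from Eq. 1:

```python
def innovation_score(conf, pub_year):
    """Weighted mean publication year prediction error (Eq. 2),
    weighted by the classifier confidences."""
    numerator = sum(c * (year - pub_year) for year, c in conf.items())
    denominator = sum(conf.values())
    return numerator / denominator
```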

As illustrated in Fig. 1, the higher the publication year of paper p, the lower the minimum and maximum possible values of \(S_P(p)\). In order to make papers from different years comparable in terms of innovation scores, \(S_P(p)\) needs to be adjusted to account for the publication year of paper p.

Suppose the prediction error for papers published in the year Y is a discrete random variable \(Err_Y\). Based on the actual prediction error distributions for the WWW and JASSS corpora (see Fig. 3), let us define the expected publication year prediction error for papers published in the year Y as:

$$\begin{aligned} E(Err_Y) = \sum _{n = Y_{min} - Y}^{Y_{max} - Y} n \cdot Pr(Err_Y = n) \end{aligned}$$
(3)

where \(Y_{min}\) and \(Y_{max}\) are the minimum and maximum publication years in the corpus, and \(Pr(Err_Y = n)\) is the observed probability that the prediction error for a paper published in the year Y is n. To calculate \(Pr(Err_Y = n)\) we use the distribution from Fig. 3 truncated to the range \(\langle Y_{min} - Y, Y_{max} - Y\rangle \), i.e. the minimum and maximum possible prediction errors for papers published in the year Y.

Let us then define the adjusted innovation score as the deviation of \(S_P(p)\) from its expected value divided by its maximum absolute value:

$$\begin{aligned} S'_P(p) = {\left\{ \begin{array}{ll} \frac{S_P(p) - E(Err_{Y_p})}{E(Err_{Y_p}) - (Y_{min} - Y_p)} & \text {if } S_P(p) < E(Err_{Y_p}) \\ \frac{S_P(p) - E(Err_{Y_p})}{Y_{max} - Y_p - E(Err_{Y_p})} & \text {if } S_P(p) \ge E(Err_{Y_p}) \end{array}\right. } \end{aligned}$$
(4)

where \(Y_p\) is the publication year of the paper p.

Fig. 1.

Minimum and maximum prediction errors decrease as the publication year increases, and so does the mean unadjusted score (\(S_P\)). To make papers from different years comparable in terms of innovation score, the adjusted innovation score (\(S'_P\)) measures the deviation of the prediction error from its expected value.

\(S'_P(p)\) has the following characteristics:

  1. \(-1 \le S'_P(p) \le 1\)

  2. \(S'_P(p) = 0\) if paper p’s predicted publication year is as expected

  3. \(S'_P(p) < 0\) if paper p’s predicted publication year is earlier than expected

  4. \(S'_P(p) > 0\) if paper p’s predicted publication year is later than expected.
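
A sketch of Eqs. 3 and 4 is given below, assuming error_dist maps each observed prediction error (in years) to its relative frequency in the corpus-wide distribution of Fig. 3; truncating that distribution to the feasible error range and renormalizing it per publication year is our reading of the truncation step, and all names are illustrative.

```python
def expected_error(error_dist, pub_year, y_min, y_max):
    """E(Err_Y) from Eq. 3: mean of the corpus-wide error distribution truncated to the
    errors that are actually possible for a paper published in pub_year."""
    lo, hi = y_min - pub_year, y_max - pub_year
    truncated = {n: p for n, p in error_dist.items() if lo <= n <= hi}
    total = sum(truncated.values())            # renormalize the truncated distribution
    return sum(n * p / total for n, p in truncated.items())

def adjusted_score(s_p, pub_year, error_dist, y_min, y_max):
    """S'_P from Eq. 4: deviation of S_P from its expected value, scaled into [-1, 1]."""
    e = expected_error(error_dist, pub_year, y_min, y_max)
    if s_p < e:
        return (s_p - e) / (e - (y_min - pub_year))
    return (s_p - e) / (y_max - pub_year - e)
```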

5 Results

Figure 2 shows the relation between the number of topics k and coherence \(C_V\) for CTM models trained on each of our corpora. Topic coherence initially peaks for values of k close to the optimal values found for LDA, then after a dip, it reaches global maxima for k equal to 74 and 88 for WWW and JASSS, respectively.

Fig. 2.

\(C_V\) Topic coherence by number of topics. We chose the CTM models with the highest values of \(C_V\) coherence as described in Sect. 4.1.

Fig. 3.

Distribution of publication year prediction errors for both corpora. We use these distributions to calculate the expected prediction error for each year and adjust paper innovation scores for their publication years.

Fig. 4.

Topic popularity over time. The color of the cell in row t and column y represents the mean proportion of topic t in papers published in the year y. Bright red represents maximum values, white means zero. (Color figure online)

As shown in Table 1, publication year prediction accuracy, expressed as Mean Absolute Error (MAE), is markedly improved both by using CTM instead of LDA and by using ordinal regression instead of a standard One-vs-One (OvO) multiclass SVM classifier. The best results we achieved were an MAE of 2.56 years for the WWW corpus and 3.56 years for JASSS.

Table 3 shows the top 3 papers with the highest innovation scores in each corpus. For each of those papers we list the number of citations and some of their most significant topics. All of them have been cited, some widely. The more a paper’s topic distribution resembles the topic distributions of papers published in the future, and the less it resembles those of papers from the past, the higher the innovation score. Some examples of highly scored, fairly recently published papers with few citations include:

  • WWW, 2019: Multiple Treatment Effect Estimation using Deep Generative Model with Task Embedding by Shiv Kumar Saini et al. – no citations, \(6^{th}\) highest score (0.946), topics covered: #10, #28, #33, #57 (see: Table 2)

  • JASSS, 2017: R&D Subsidization Effect and Network Centralization: Evidence from an Agent-Based Micro-Policy Simulation by Pierpaolo Angelini et al. – 2 citations, \(20^{th}\) highest score (0.634), topics covered: #4, #48, #65 (see: Table 2)

Table 1. Mean absolute prediction errors: CTM vs. LDA and Multiclass SVM vs. Ordinal regression
Table 2. Selected latent topics described by their top 30 words.
Table 3. Top 3 papers with the highest innovation scores in both corpora with citation counts and topics covered.
Fig. 5.

Innovation score vs. Citation count for all papers (above) and papers at least 5 years old (below).

Figure 5 illustrates the correlation between innovation scores and citation counts. Because the number of citations is expected to grow exponentially [23], we have used \(log_2(citation~count + 1)\) instead of raw citation counts. The value of this expression is zero if the number of citations is zero and grows monotonically as the number of citations increases. The citation data for the WWW corpus come from ACM’s Digital Library; however, publications from the JASSS journal are not available in the ACM DL, and we were unable to scrape complete citation data from Google Scholar. We therefore manually collected citation counts for 5 randomly selected JASSS papers from each year. We calculated Spearman’s \(\rho \) correlation coefficients between the innovation scores and citation counts. The results are 0.28 with a p-value of \(1.21 \cdot 10^{-41}\) for the WWW corpus and 0.32 with a p-value of \(1.91 \cdot 10^{-6}\) for JASSS. The innovation scores are, therefore, weakly correlated with the citation counts. The correlation coefficients are slightly higher for papers at least 5 years old: 0.3 for WWW and 0.37 for JASSS. This may be explained by the fact that newer papers have not yet accumulated many citations regardless of their innovativeness.
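
The reported coefficients can be reproduced with scipy’s spearmanr. A minimal sketch, assuming parallel lists of innovation scores and citation counts; note that Spearman’s \(\rho \), being rank-based, is unaffected by the monotone log transform used for plotting:

```python
import numpy as np
from scipy.stats import spearmanr

def score_citation_correlation(scores, citations):
    """Spearman's rho between innovation scores and (log-transformed) citation counts."""
    rho, p_value = spearmanr(scores, np.log2(np.asarray(citations) + 1))
    return rho, p_value
```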

6 Conclusion and Future Work

We have shown a simple yet significant improvement to our novel method of measuring the innovativeness of scientific papers in bodies of research spanning multiple years. Scaling the innovation score proposed in our previous research enables us to directly compare the scores of papers published in different years. We have also improved the prediction accuracy by employing ordinal regression models instead of regular multiclass classifiers and Correlated Topic Models instead of LDA. It may be argued that this makes our method more reliable, as deviations of the predicted publication year from the actual one are more likely to be caused by the paper genuinely covering topics popular in the future rather than by ordinary prediction error. Moreover, CTM allowed us to better model and understand the evolution of research topics over time.

In the future we plan to explore non-linear ways of scaling the innovation scores, taking into account the observed error distribution (Fig. 3) to give more weight to larger deviations from the expected value. We also plan to use word embeddings, extracted scientific claims [1], and other means of effectively representing paper contents and the ideas they convey, in addition to topic models, as features for our method.