1 Introduction

Many types of textual documents are rich in temporal information. A specific type of such information is the temporal expression, which occurs in a wide variety of documents. Thus, in recent years, there has been growing interest in temporal tagging within the NLP community. While variation of temporal expressions across domains has become a well-established research area (Mazur and Dale 2010; Strötgen and Gertz 2012; Lee et al. 2014; Strötgen and Gertz 2016; Tabassum et al. 2016), variation of temporal expressions over time within a domain has received less attention so far. Knowing how temporal expressions might have changed over time within a domain is interesting not only in the field of NLP, e.g., for adaptation of temporal taggers to different time periods, but also in humanities studies in the fields of historical linguistics, sociolinguistics, and the like.

In this paper, we focus on temporal expressions in the scientific domain and study their diachronic development over a time frame of approx. 350 years (from the 1650s to the 2000s). While here we take an exploratory historical perspective, our findings have implications for improving temporal tagging, especially for recall.

Temporal expressions are related to situation-dependent reference (see notably, Biber et al. (1999)’s work), i.e., linguistic reference to a particular aspect of the text-external temporal context of an event (cf. Atkinson (1999, p. 120); Biber and Finegan (1989, p. 492)). While according to Biber et al.’s work, scientific writing has moved towards expressing less situation-dependent reference, to the best of our knowledge, there is no evidence of how this change has been manifested linguistically and whether the types of temporal expressions used in scientific writing have changed over time. To investigate this in more detail, we pose the following questions:

  • Do the types of temporal expressions vary diachronically in scientific writing, and if so how is this manifested linguistically?

  • What are typical temporal expressions of specific time periods and do these change over time?

  • Are different types of temporal expressions, e.g., duration expressions and date expressions referring to points in time, equally affected by a potential change over time?

To process temporal information in scientific research articles, we use HeidelTime (Strötgen and Gertz 2010), a domain-sensitive tagger to extract and normalize temporal expressions according to the TimeML standard (Pustejovsky 2005) for temporal annotation (see Sect. 4). To detect typical temporal expressions of specific time periods, we use relative entropy, more precisely Kullback-Leibler Divergence (KLD) (Dagan et al. 1999; Lafferty and Zhai 2001). By KLD we measure how typical a temporal expression is for a time period vs. another time period (see Sect. 5). The methodology has been adopted from Fankhauser et al. (2016) and successfully used in Degaetano-Ortlieb and Teich (2016) to detect typical linguistic features in scientific writing, Degaetano-Ortlieb and Teich (2017) to detect typical features of research article sections, and Degaetano-Ortlieb (2017) to observe typical features of social variables.

In the analysis, we inspect general diachronic tendencies based on relative frequency and use relative entropy to investigate more fine-grained changes in the use of temporal expressions over time in scientific writing (see Sect. 6). On a more abstract level, we observe that the use of temporal information in scientific writing reflects the paradigm change from observational to experimental science (cf. Fankhauser et al. (2016); Gleick (2010)) and moves further to descriptions of previous work (e.g., in the last decades) in contemporary scientific writing.

2 Related Work

Temporal information has often been employed to improve information retrieval (IR) approaches (see Campos et al. (2014) and Kanhabua et al. (2015) for an overview). A prerequisite for exploiting temporal information is temporal tagging, i.e., the identification, extraction, normalization, and annotation of temporal expressions based on an annotation standard such as the temporal markup language TimeML (Pustejovsky 2005). While temporal tagging was for a long time tailored towards processing news texts, domain-sensitive approaches have been developed in recent years, as it has been shown that temporal information varies significantly across domains (Mazur and Dale 2009; Strötgen and Gertz 2016). Domain-sensitive temporal taggers include UWTime (Lee et al. 2014) and HeidelTime (Strötgen and Gertz 2012). We choose HeidelTime as it has been reported to be much faster than UWTime (Agarwal and Strötgen 2017).

Recently, there has also been increasing interest in temporal information in the field of digital humanities. An early approach to operationalizing time in narratology was applied by Meister (2005). Strötgen et al. (2014) show how temporal taggers can be extended for temporal expressions referring to historical dates in the AncientTimes corpus. Fischer and Strötgen (2015) apply temporal taggers to analyze date accumulations in large literary corpora. An analysis of temporal expressions and whether they refer to the future or the past has also been performed on English and Japanese Twitter data (Jatowt et al. 2015).

The diachronic aspect of temporal information in scientific writing has mainly been investigated by considering temporal adverbs in the context of register studies. Biber and Finegan (1989) and Atkinson (1999), for example, have shown a decrease of temporal adverbs in scientific writing in terms of relative frequencies. Fischer and Strötgen (2015) also studied temporal expressions in a diachronic corpus, but considered only temporal expressions with explicit day and month information.

We use temporal tagging tailored to identifying temporal information in scientific writing to obtain a more comprehensive picture of possible diachronic changes. Moreover, besides considering changes in terms of relative frequency, we look at typical temporal expressions and patterns of temporal expressions of specific time periods.

3 Data

As a dataset, we use texts of scientific writing ranging from 1665 to 2007. The first time periods (1665 up to 1869) are covered by the Royal Society Corpus (Kermes et al. 2016a), built from the Proceedings and Transactions of the Royal Society of London – the first periodical of scientific writing – covering several topics within biological sciences, general science, and mathematics. For the later time periods (1966 to 2007), we also use scientific research articles from various disciplines (e.g., biology, linguistics, computer science) taken from the SciTex corpus (Degaetano-Ortlieb et al. 2013; Teich et al. 2013). For comparative purposes, we divide the corpus into fifty-year time periods. Table 1 shows the time periods, their coverage and the sub-corpus sizes in number of tokens and documents.

The corpus has been pre-processed in terms of OCR correction, normalization, tokenization, lemmatization, sentence segmentation, and part-of-speech tagging (cf. Kermes et al. (2016b)).

Table 1. Corpus details.

4 Processing Temporal Information

4.1 Temporal Expressions

Key characteristics. Temporal expressions have three key characteristics (cf. Alonso et al. (2011); Strötgen and Gertz (2016)). First, they can be normalized, i.e., expressions with the same meaning can be normalized to the same value. For example, March 11, 2017 and the 2nd Saturday in March of this year refer to the same point in time, even though the two expressions are realized in different ways. Second, temporal expressions are well-defined, i.e., given two points in time X and Y, the relationship between these two points can always be determined, e.g., as X is before Y (cf. Allen (1983)). Third, they can be organized hierarchically on a granularity scale (from coarser to finer granularities and vice versa, such as day, month or year). Relevant to our analysis are normalization and granularity. Normalized values are used to compare temporal expressions across time periods instead of considering only the single lexical realizations. In terms of granularity, we consider granularity scales to determine diachronic changes of temporal expressions.

Types. According to the temporal markup language TimeML (cf. Pustejovsky (2005)), there are four types of temporal expressions (cf. also Strötgen and Gertz (2016)):

  • Date expressions refer to a point in time of a granularity equal to or coarser than ‘day’ (e.g., March 11, 2017, March 2017 or 2017).

  • Time expressions refer to a point in time of any granularity smaller than ‘day’ (e.g., Saturday morning or 10:30 am).

  • Duration expressions refer to the length of a time interval and can be of different granularity (e.g., two hours, three weeks, four years).

  • Set expressions refer to the periodical aspect of an event, describing a set of times/dates (e.g., every Saturday) or a frequency within a time interval (e.g., twice a day).

In the analysis, we consider all these four types showing how their use has changed diachronically in scientific writing.

4.2 Temporal Tagging

For temporal tagging we use HeidelTime (Strötgen and Gertz 2010), a domain-sensitive temporal tagger. HeidelTime supports normalization strategies for four domains: news, narrative, colloquial, and autonomous. Although HeidelTime has been applied to process scientific documents using the autonomous domain, these scientific documents have been very specific, relatively short (biomedical abstracts) with many so-called autonomous expressions (i.e., expressions not referring to real points in time, but to references in a local time frame).

In contrast, our corpus is quite heterogeneous, containing letters and reports in the earlier time periods and full articles in the later time periods. Thus, we expect that most of the documents are written in such a way that the correct normalization of relative temporal expressions can be reached by using the document creation time as reference time. This makes the documents similar to news-style documents according to HeidelTime’s domain definitions. Thus, we apply HeidelTime with its news domain setting. Note, however, that in our analysis we use only normalized values of Duration and Set expressions, which are normalized to the length and granularity of an expression but not to an exact point in time. Thus, our findings still hold if some of the occurring temporal expressions are not normalized correctly to a point in time.

HeidelTime uses TIMEX3 tags, which are based on TimeML (Pustejovsky et al. 2010), the most widely used annotation standard for temporal expressions. In the following, we briefly explain the value attribute of TIMEX3 annotations of Duration and Set expressions, as we use their normalized values for a deeper analysis of the occurring temporal expressions. The value attribute of Duration and Set expressions contains information about the length of the duration that is mentioned, starting with P (or PT in case of time-level durations) followed by a number and an abbreviation of the granularity (e.g., years: Y, months: M, weeks: W, days: D; hours: H, minutes: M). In addition, fuzzy expressions are referred to by X instead of precise numbers, e.g., several weeks is normalized to PXW, monthly is normalized to XXXX-XX and annually to XXXX.
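As an illustration of how such Duration values can be read, the following minimal Python sketch decodes simple values of the form described above. The function name `parse_duration`, the unit table, the seconds unit `S`, and the conversion factors (30-day months, 365-day years) are assumptions made for this example; they are not part of HeidelTime, and compound values or Set values such as XXXX-XX are deliberately not handled.

```python
import re

# Approximate unit lengths in seconds. Month and year lengths are assumed
# averages chosen for illustration only.
DATE_UNITS = {
    "Y": ("year", 365 * 86400),
    "M": ("month", 30 * 86400),
    "W": ("week", 7 * 86400),
    "D": ("day", 86400),
}
# After the "PT" prefix, "M" denotes minutes rather than months.
TIME_UNITS = {
    "H": ("hour", 3600),
    "M": ("minute", 60),
    "S": ("second", 1),  # assumed unit, not listed in the text above
}

PATTERN = re.compile(r"^P(T?)(\d+(?:\.\d+)?|X)([YMWDHS])$")


def parse_duration(value):
    """Return (amount, unit_name, seconds) for a simple value such as 'P3W'.

    For fuzzy values such as 'PXW', amount is None and seconds is the
    length of a single unit.
    """
    match = PATTERN.match(value)
    if match is None:
        raise ValueError(f"unsupported duration value: {value!r}")
    is_time, number, unit = match.groups()
    table = TIME_UNITS if is_time else DATE_UNITS
    if unit not in table:
        raise ValueError(f"unit {unit!r} not valid in {value!r}")
    name, unit_seconds = table[unit]
    amount = None if number == "X" else float(number)
    seconds = unit_seconds * (amount if amount is not None else 1)
    return amount, name, seconds


if __name__ == "__main__":
    for example in ("P1D", "PT10M", "P3M", "PXW"):
        print(example, parse_duration(example))
```

Mapping every value to seconds in this way gives a single scale on which durations of different granularity can be compared.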

4.3 Extraction Quality

For a meaningful analysis and substantiated conclusions about temporal expressions in our diachronic corpus, the extraction (and normalization) quality of the temporal tagger should be reliable. Although HeidelTime has been extensively evaluated before on a variety of corpora, our corpus is quite different from standard temporal tagging corpora as it contains scientific documents from multiple scientific fields published across several centuries. Creating a proper gold standard with manual annotations covering all scientific fields across all time periods would not be feasible in an appropriate time frame. Instead, to assess temporal tagging quality on our corpus, we determine the correctness of the expressions tagged by the temporal tagger.

For this, we use precision, i.e., we randomly sample 250 instances for each time period and manually validate whether the automatically annotated temporal expressions are correctly extracted. Here, we distinguish correctly extracted instances (right) from wrongly extracted instances (wrong). The latter are either cases of ambiguity (e.g., spring as ‘season’ or ‘water spring’, or current meaning ‘now’ or ‘electric current’) or temporal expressions wrongly assigned to numbers occurring in the text. In addition, we distinguish correctly assigned but not relevant instances (other), which are due to noise in the data itself. These are, e.g., temporal expressions assigned to reference sections (especially in the 1950–2000 periods) or used within tables (mostly in the earlier time periods).

Table 2 presents precision information and the number of instances per assigned category of right, other, and wrong. We consider the other instances to be correct in terms of precision of extraction. Across periods, precision ranges from 0.89 to 0.96.
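The precision computation described above can be sketched as follows. The function name and the counts in the usage example are hypothetical and chosen only for illustration; they are not values from Table 2.

```python
def extraction_precision(right, other, wrong):
    """Precision of extraction for one sampled time period.

    Instances labelled 'other' (correctly tagged but not relevant) are
    counted as correct, following the description above.
    """
    total = right + other + wrong
    if total == 0:
        raise ValueError("no instances sampled")
    return (right + other) / total


if __name__ == "__main__":
    # Hypothetical counts for a sample of 250 instances.
    print(extraction_precision(right=200, other=30, wrong=20))
```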

Table 2. Precision across time periods.

5 Typicality of Temporal Expressions

To obtain temporal expressions typical of a time period, we use relative entropy, also known as Kullback-Leibler Divergence (KLD) (Kullback and Leibler 1951) – a well-known measure of (dis)similarity between probability distributions used in NLP, speech processing, and information retrieval. In comparison to relative frequency, i.e., the unconditioned probability of, e.g., a word over all words in a corpus, relative entropy is based on conditioned probability.

In information-theoretic parlance, relative entropy measures the average number of additional bits per feature (here: temporal expressions) needed to encode a feature of a distribution A (e.g., the 1650 time period) by using an encoding optimized for a distribution B (e.g., the 1700 time period). The more additional bits needed, the more distant A and B are. This is formalized as:

$$\begin{aligned} D(A||B) = \sum _{i} p(feature_i|A)log_2 \frac{p(feature_i|A)}{p(feature_i|B)} \end{aligned}$$
(1)

where \(p(feature_i|A)\) is the probability of a feature (i.e., a temporal expression) in a time period A, and \(p(feature_i|B)\) the probability of that feature in a time period B. The \(log_2\frac{p(feature_i|A)}{p(feature_i|B)}\) relates to the difference between both probability distributions (\(log_2p(feature_i|A)-log_2p(feature_i|B)\)), giving the number of additional bits. These are then weighted with the probability of \(p(feature_i|A)\) so that the sum over all \(feature_i\) gives the average number of additional bits per feature, i.e., the relative entropy.
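To make Eq. (1) concrete, the following minimal Python sketch computes the per-feature terms and their sum from raw feature counts. The function names, the additive smoothing parameter `alpha` (used here so that features unseen in B do not lead to a division by zero), and the example counts are assumptions made for this illustration and are not taken from the methodology described in the text.

```python
import math


def distribution(counts, vocabulary, alpha):
    """Relative frequencies over a shared vocabulary with additive smoothing."""
    total = sum(counts.get(f, 0) for f in vocabulary) + alpha * len(vocabulary)
    return {f: (counts.get(f, 0) + alpha) / total for f in vocabulary}


def kld_contributions(counts_a, counts_b, alpha=0.5):
    """Per-feature terms p(f|A) * log2(p(f|A) / p(f|B)) of Eq. (1).

    alpha is an assumed smoothing constant; with alpha=0 every feature
    seen in A must also occur in B.
    """
    vocabulary = set(counts_a) | set(counts_b)
    p_a = distribution(counts_a, vocabulary, alpha)
    p_b = distribution(counts_b, vocabulary, alpha)
    return {
        f: p_a[f] * math.log2(p_a[f] / p_b[f])
        for f in vocabulary
        if p_a[f] > 0  # terms with p(f|A) = 0 contribute nothing
    }


def kld(counts_a, counts_b, alpha=0.5):
    """D(A||B): the sum of the per-feature contributions."""
    return sum(kld_contributions(counts_a, counts_b, alpha).values())


if __name__ == "__main__":
    # Hypothetical counts of two Duration values in two time periods.
    period_a = {"P1D": 3, "P1Y": 1}
    period_b = {"P1D": 1, "P1Y": 3}
    print(kld_contributions(period_a, period_b, alpha=0.0))
    print(kld(period_a, period_b, alpha=0.0))
```

In this toy example, P1D receives a positive term because it is more probable in A than in B, while P1Y receives a negative one; ranking features by their individual terms is one way to read off which features drive the divergence.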

In terms of typicality, the more bits are used to encode a feature, the more typical that feature is for a given time period vs. another time period. Thus, in a comparison of two time periods (e.g., 1650 vs. 1700), the higher the KLD value of a feature for one time period (e.g., 1650), the more typical that feature is for that given time period. In addition, we test for significance of a feature by an unpaired Welch’s t-test. Thus, features considered typical are distinctive according to KLD and show a p-value below a given threshold (e.g., 0.05).
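As a sketch of the significance test, Welch's t statistic and its approximate degrees of freedom can be computed as below. It is an assumption of this example that the two samples are per-document relative frequencies of a feature in the two time periods; the text does not specify the sampling unit, and the numbers in the usage example are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard errors of the two sample means.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical per-document relative frequencies of one feature.
    print(welch_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))
```

The p-value then follows from the t distribution with `df` degrees of freedom, which a statistics library can supply.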

To compare typical features across several time periods, the highest-ranking features of each comparison are considered. For example, for 1650 we obtain six feature sets typical of 1650, as we have six comparisons of 1650 with each of the other six time periods (i.e., a feature set for features typical of 1650 vs. 1700, of 1650 vs. 1750, etc.). If features are shared across feature sets and rank high (e.g., in the top 5), these features are considered to be typical of 1650. In other words, these are features that rank high in terms of KLD, are significant in terms of p-value, and are typical of a time period across all or most comparisons with other time periods. As we consider seven time periods in our case, features are considered typical if they rank high for one time period in 4 to 6 feature sets (i.e., in more than half of the comparisons).
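The selection step described above can be sketched in Python as follows. The function name, the input format (one ranked feature list per comparison, already filtered for significance), and the example rankings are hypothetical; the default thresholds mirror the top 5 and the minimum of 4 out of 6 comparisons mentioned in the text.

```python
from collections import Counter


def typical_features(rankings, top_k=5, min_comparisons=4):
    """Features ranked in the top_k of at least min_comparisons feature sets.

    rankings maps each compared time period to a list of features ordered
    from highest to lowest rank for the period of interest.
    """
    votes = Counter()
    for ranked in rankings.values():
        # A set ensures a feature is counted once per comparison.
        votes.update(set(ranked[:top_k]))
    return {f: n for f, n in votes.items() if n >= min_comparisons}


if __name__ == "__main__":
    # Hypothetical rankings for 1650 compared with the six other periods.
    rankings_1650 = {
        1700: ["DT-NN", "NP"],
        1750: ["DT-NN", "CD"],
        1800: ["DT-NN", "NP"],
        1850: ["DT-NN"],
        1950: ["NP", "DT-NN"],
        2000: ["CD"],
    }
    print(typical_features(rankings_1650))
```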

6 Analysis

In the following, we analyze diachronic tendencies of temporal expressions from the period of 1650 to 2000 in terms of (1) relative frequency (i.e., unconditioned probabilities), and (2) typicality (i.e., conditioned probabilities of expressions in one vs. the other time periods as described in Sect. 5).

We show how the notion of typicality based on relative entropy leads to valuable insights on the change of temporal expressions in scientific writing w.r.t. more and less frequent expressions.

6.1 Frequency-Based Diachronic Tendencies

Comparing temporal types across fifty-year time periods in terms of frequency (see Fig. 1 showing log of frequency per million (pM)), Date is the most frequent type, followed by Duration. Set and Time expressions are less frequent. In addition, while Date remains relatively stable over time, expressions of Duration, Set and Time drop considerably from 1850 onwards, becoming relatively rare.

Fig. 1. Diachronic tendencies of temporal expression types in scientific writing.

6.2 Diachronic Tendencies of ‘Typical’ Temporal Expressions

Inspecting diachronic change through the lens of relative entropy (as described in Sect. 5) allows us to consider temporal expressions typical of one time period when compared to the other time periods. We study each type of temporal expression and carefully select the base of comparison.

Date Considering Date expressions, instead of comparing single dates (which mostly occur only once in the corpus, such as June 3, 1769), we take a level of abstraction and consider part-of-speech (POS) sequences of annotated Date expressions to better inspect the types of changes that might have occurred over time. For each Date expression, we extract POS sequences and use relative entropy to detect typical POS sequences of temporal expressions for each time period.

Table 3. Typical POS sequences of Date.
Fig. 2. Specificity (black) and interval (gray) of typical Date expressions.

Table 3 shows POS sequences typical of one time period vs. 4 to 6 other time periods (see column comp.). For example, for 1650 the POS sequences Determiner-Noun (DT-NN) and Proper Noun (NP) are quite typical; their lexical realizations are all temporal expressions referring to seasons (see Example 1). Both POS sequences are typical of 1650 vs. 1750 to 2000 (i.e., 5 comparisons). If we consider the POS sequences that are typical across time periods and their lexical realizations, there seems to be a development in terms of specificity and interval (see Fig. 2).

To capture the notion of specificity, we consider how many pieces of temporal information are given by a POS sequence to make a temporal expression most specific, with a scale from 1 to 4, where 1 is least specific (e.g., NP denoting seasons such as Winter, as we do not know of which year etc.) and 4 is most specific (e.g., NP CD, CD such as June 3, 1769, which gives us an exact date). For comparison across time periods, Fig. 2 shows the average of the specificity count over all typical POS sequences of a time period (black line). For the interval of typical Date expressions, the number of days the expressions refer to is used (shown in log in Fig. 2, gray line). The more specific an expression becomes, the smaller the interval it refers to and vice versa.

Figure 2 also shows how temporal expressions move from relatively unspecific (e.g., in the Spring in 1650) to very specific (June 18, 1784 in 1800) and back to unspecific expressions (e.g., the last decades in 2000). The interval instead moves from a wider to a smaller span and back to a wider span in 2000. Investigating the contexts in which these expressions arise gives further insights. While in the early time periods mentions of seasons are typical, from 1800 to 1850 expressions with an exact date, year or month are typical. These expressions are used to present exact dates of observations made by a researcher at several points in time, especially in the field of astronomy (see Example 2). From 1950 onwards, typical Date expressions become less explicit, relating to broader (e.g., the 1970s in Example 3) and less specific (e.g., current literature in Example 3) temporal reference. These expressions are used, e.g., in the context of previous work descriptions in introduction sections of research papers.

Example 1

  • In Winter it will need longer infusion, than in the Spring or Autumn. \(\mathrm {(1650)}\)

  • The difference between these two plants is this; the papaver corniculatum dies to the root in the winter, and sprouts again from its root in the spring; \(\mathrm {(1750)}\)

Example 2

  • March 4, 1783. With a 7-feet reflector, I viewed the nebula near the 5th Serpentis, discovered by Mr. MESSIER, in 1764. \(\mathrm {(1750)}\)

Example 3

  • In the 1970s, Rabin [38] and Solovay and Strassen [44] developed fast probabilistic algorithms for testing primality and other problems. \(\mathrm {(2000)}\)

  • There is a significant confusion in the current literature on “cellular” or “tessellation arrays” concerning the concept of a “Garden-of-Eden configuration”. \(\mathrm {(1950)}\)

Time To investigate typical Time expressions, similarly to Date expressions, we consider their POS sequences (see Table 4).

It can be seen that only the intermediate time periods show typical expressions (1750 to 1850). In terms of granularity, in the period of 1750, expressions are less granular pointing to broader sections of a day (e.g., morning, evening) mostly used to describe observations made (see Example 4). In the 1850 period, expressions point to specific hours of a day (e.g., 9 A.M.) mostly in descriptions of experiments.

Table 4. Typical POS sequences of Time.

Example 4

  • Monday morning she appeared well, her pulse was calm, and she had no particular pain. \(\mathrm {(1750)}\)

  • There being usually but one assistant, it was impossible to observe during the whole twenty-four hours; the hours of observation selected were therefore from 3 A.M. to 9 P.M. inclusive. \(\mathrm {(1850)}\)

Duration For Duration expressions, we consider their TIMEX3 value, as it directly encodes normalized information on the duration length and granularity of temporal expressions. Figure 3 shows typical TIMEX3 values (e.g., P1D for expressions such as one day) of specific time periods. The y-axis shows the duration length in seconds on a log scale. In general, duration length decreases from 1750 to 1850 (with expressions of seconds and hours, which are more granular) and increases in 1950 and 2000 (with expressions of decades, which are less granular).

Fig. 3. Diachronic tendencies of typical Duration expressions.

We then again consider the contextual environments of these typical expressions. In the earlier time periods (1650 and 1700), day and year expressions are typical, mostly relating to observations or experiment descriptions (see Example 5).

Example 5

  • After the eleven Months, the Owner having a mind to try, how the Animal would do upon Italian Earth, it died three days after it had changed the Earth. \(\mathrm {(1650)}\)

  • [...] the Opium, being cut into very thin slices, [...] is to be put into, and well mixed with, the liquor, (first made luke-warm) and fermented with a moderate Heat for eight or ten Days, [...]. \(\mathrm {(1650)}\)

From the period of 1750 to 1950, duration length is relatively low with expressions of seconds, minutes and hours being typical of these time periods. These expressions are mainly related to observations in the 1750 period and experiment descriptions from 1800 to 1950 (see Example 6).

Example 6

  • June 4, the weather continued much the same, and about 9h 30 in the evening, we had a shock of an earthquake, which lasted about four seconds, and alarmed all the inhabitants of the island. \(\mathrm {(1750)}\)

  • [...] the glass produced by this fusion was in about twelve hours dissolved, by boiling it in a proper quantity of muriatic acid. \(\mathrm {(1800)}\)

  • In a few hours a mass of fawn-coloured crystals was deposited; \(\mathrm {(1850)}\)

  • The patient is then switched to the re-breathing system containing 133 Xenon at 5 mCi/1 for a period of one minute, and then returned to room air for a period of ten minutes. \(\mathrm {(1950)}\)

Fig. 4. Diachronic tendencies of typical Set expressions in scientific writing.

In 1950, besides weeks and minutes, which relate to experiment descriptions (see Example 7), expressions of decades are typical. The latter is also true for 2000. In both periods, expressions relating to decades refer to previous work (see Example 7).

Thus, Duration shifts from being used for purposes of observational to experimental science and finally to previous work references in the latest time periods.

Example 7

  • For each speaker, performance was observed across numerous repetitions of the vocabulary set within a single session, as well as across a 2-week time period. \(\mathrm {(1950)}\)

  • It constitutes the usual drift-diffusion transport equation that has been successfully used in device modeling for the last two decades. \(\mathrm {(1950)}\)

  • Provably correct and efficient algorithms for learning DNF from random examples would be a powerful tool for the design of learning systems, and over the past two decades many researchers have sought such algorithms. \(\mathrm {(2000)}\)

Set For Set expressions, their TIMEX3 value is again considered. Figure 4 shows typical expressions with the times per year of a Set expression on the y-axis in log, also mirroring less granular (annually) and more granular (every day) expressions. As we have seen from Fig. 1, Set expressions are relatively rare in scientific writing and strongly decrease over time (see Sect. 6.1). This is also reflected in the few temporal expressions typical of each time period in Fig. 4. In terms of granularity, there is a shift from day to month expressions (see Example 8). Interestingly, for the latter, there has been a move from a noun phrase expression (every/each month) to an adverb expression (monthly). While in the intermediate periods (1800 and 1850) both expressions are typical, in 1950 only monthly is typical. From 1750 to 1850, every/each month expressions relate to observations made on a monthly basis of which the mean or average is drawn, and the same applies to monthly used with mean as a term (see Example 9). In 1950, instead, monthly is solely used as an adverb. Thus, longer noun phrase expressions (every/each month) are replaced by the shorter adverb expression monthly.

Example 8

  • Besides this, you may there see, that every day the Sun sensibly passes one degree from West to East, [...]. \(\mathrm {(1650)}\)

  • In order to determine the annual variations of the barometer, I have taken the mean of the observations in each month, [...]. \(\mathrm {(1800)}\)

Example 9

  • The mean was then taken in every month of every lunar hour (attending to the signs), and the monthly means were collected into yearly means. \(\mathrm {(1850)}\)

  • A disk resident file of all current recipient numbers is created monthly from the eligibility tape file supplied by Medical Services Administration. \(\mathrm {(1950)}\)

7 Discussion and Conclusion

We have presented an approach to investigate diachronic change in the usage of temporal expressions. First, we use temporal tagging to obtain a more comprehensive coverage of possible temporal expressions, rather than investigating specific expressions only, as was the case in previous work. Evaluation of the tagging results showed high precision (approx. 90%) across time periods.

Second, we use relative entropy to detect typical temporal expressions of specific time periods. A clear advantage over frequency-based accounts is that with relative entropy frequent as well as rare phenomena can be investigated in terms of their ‘typicality’ according to a variable (here: temporal expressions typical of specific time periods). Apart from gaining knowledge on diachronic changes specific to different types of temporal expressions, we also capture more abstract and more fine-grained shifts. On a more abstract level, while our findings confirm the paradigm shift from the more observational to the more experimental character of scientific writing (cf. Fankhauser et al. (2016); Gleick (2010)) for Date and Duration expressions, we also show the tendency towards previous work descriptions for these two temporal types in contemporary scientific writing. On a more fine-grained level, for Set (a rarely used temporal type, especially towards more contemporary time periods), there is a linguistic shift from longer noun-phrase to shorter adverb expressions.

These findings are not only interesting in historical linguistic terms, but are also relevant for improving the adaptation of temporal taggers to different time periods. Especially for recall, gold-standard annotations are needed. Since creating them is a resource- and time-consuming task, our approach can help in gaining insights into the use of typical temporal expressions in specific contexts across periods. These contexts can then be further exploited in terms of possible temporal expression occurrences to achieve better recall. In addition, temporal expressions might change in terms of linguistic realization, as with the Set type in our case. Accounting for shifts in linguistic realization will also improve recall. While this is true for diachronic variation, the approach also generalizes to domain-specific variation. In future work, we plan to work in this direction, further elaborating our methodology for diachronic and domain variation.