1 Introduction

Interdisciplinary corpus-based studies of natural languages have become a highly productive subfield of research in the last two decades. There are different approaches to computer processing of linguistic data. For example, linguistic units can be extracted from text corpora [1] and analysed from quantitative and sociolinguistic points of view [2, 3].

Corpus-based studies of words, which are the basic units of a language, are conducted most frequently. For example, Hilpert and Gries [4] use Google Books library data to perform diachronic analysis of frequency of words and study regularities of their use in different years. The issues of birth and death of words were studied in [5].

Besides words, studies of word combinations and collocations are conducted using extra-large text corpora. Juola [6] performs a diachronic study of word combinations and makes an attempt to assess the complexity of the culture and its evolution.

A quantitative analysis was performed in [7] to study the dynamics of syntactic dependencies, the rate at which new dependencies emerge, and the factors that influence this process.

This work studies the frequency dynamics of syntactic bigrams in Russian and English. The objective is to study the factors that influence the rate of change in frequency of syntactic bigrams. The particular task is to analyse how changes in the frequency of the words contained in the bigrams and changes in the co-occurrence of these words contribute to the total rate of frequency change. A further task is to compare the data obtained for the two languages and decide whether the inner processes of their development are similar or differ significantly.

By syntactic bigrams we understand primary units of a syntactic structure denoting a binary relation between a pair of words in a sentence [8, 9]. One of the words is the head, the other is its dependent. Syntactic bigrams are of greater interest than simple bigrams because a simple bigram is sometimes formed by two function words, and changes in its frequency are difficult to interpret in terms of semantics or culture. In contrast to previous works, we study not only the number of syntactic bigrams but also quantitatively analyse the rate of change in their frequency. An information metric was used in [10] to perform corpus-based (Google Books Ngram) studies of the rate of change in frequency of words in different European languages. The same method can be used to estimate the rate of change in frequency of both words and syntactic bigrams.

2 Data

The Google Books Ngram electronic library was used as the study material. Some researchers criticize this library, arguing that it cannot be regarded as a corpus because it contains texts of different genres in very unequal numbers. However, the corpus contains a large number of books. The English (common) sub-corpus includes 470 billion words for the period 1505–2008, and the Russian sub-corpus includes 67 billion words for the period 1607–2009. The corpus texts reflect language behaviour in almost all spheres of human life. The large amount of data allows one to obtain more reliable results. Moreover, the Google Books Ngram texts were POS-tagged and syntactic dependencies (we call them syntactic bigrams) were determined [11].

There are 175 million different syntactic bigrams in the English (common) sub-corpus and 65 million in the Russian sub-corpus of Google Books Ngram. A preliminary selection of bigrams was performed for the analysis, since it is impossible to obtain statistically reliable results for rare bigrams. Besides, a significant number of bigrams contain misprinted words. Syntactic bigrams that have been used systematically for a long time were selected (by analogy with the method of lexicon core selection proposed in [12]).

The raw frequency data on syntactic bigrams from the Google Books Ngram corpus were preprocessed. Only bigrams consisting of vocabulary 1-grams (containing only letters of the corresponding alphabet and possibly one apostrophe) were used for the analysis. 1-grams or bigrams that differ only by case are regarded as one and the same 1-gram or bigram. For example, History and history are the same 1-gram.
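The preprocessing described above can be illustrated by a minimal sketch. It assumes English data restricted to ASCII letters (for Russian, the alphabet check would have to be changed), and the names `is_vocabulary_word`, `merge_case_variants` and the toy counts are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

def is_vocabulary_word(token):
    # Assumption: a vocabulary 1-gram consists of ASCII letters and at most one apostrophe.
    # Drop one apostrophe; what remains must be non-empty and purely alphabetic.
    letters = token.replace("'", "", 1)
    return letters.isalpha() and letters.isascii()

def merge_case_variants(bigram_counts):
    # Keep only bigrams made of vocabulary words and merge variants differing only by case.
    merged = defaultdict(int)
    for (first, second), count in bigram_counts.items():
        if is_vocabulary_word(first) and is_vocabulary_word(second):
            merged[(first.lower(), second.lower())] += count
    return dict(merged)

# Toy counts for one year (invented for the example).
raw_counts = {
    ("History", "of"): 3,
    ("history", "of"): 5,
    ("h1story", "of"): 1,   # misprint containing a digit: discarded
    ("don't", "know"): 2,   # one apostrophe is allowed
}
clean_counts = merge_case_variants(raw_counts)
print(clean_counts)
```

In this sketch the two case variants of the same bigram are summed, and the misprinted form is removed before any frequencies are computed.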

3 Method

The Kullback-Leibler divergence \( D_{A,B} \) characterizes deviation of the probability distribution \( p_{i}^{A} \) from the distribution \( p_{i}^{B} \) [13]:

$$ D_{A,B} = \sum\nolimits_{i} {p_{i}^{A} \log_{2} p_{i}^{A} } - \sum\nolimits_{i} {p_{i}^{A} \log_{2} p_{i}^{B} } $$
(1)

The Kullback-Leibler divergence shows the degree of difference between one frequency distribution and another. Unlike many other measures, it takes into account the information content of words (the information content of a word occurrence in the text is proportional to the logarithm of its probability taken with the opposite sign). Thus, the same change in the frequency of a rare word and of a widely used word has different consequences in terms of the amount of information transmitted. For this reason, the measure is widely used in computational linguistics. In many cases, it is preferable to use the symmetrized Kullback-Leibler divergence

$$ \rho \left( {A,B} \right) = D_{A,B} + D_{B,A} = - \sum\nolimits_{i} {\left[ {p_{i}^{A} - p_{i}^{B} } \right]\log_{2} \frac{{p_{i}^{B} }}{{p_{i}^{A} }}} $$
(2)

It is a good measure for distinguishing two probability distributions (or empirical frequency distributions). The Kullback-Leibler divergence is a dimensional quantity (measured in bits). To simplify interpretation, it is convenient to normalize it by the entropy of the distribution.
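Formulas (1) and (2) and the entropy used for normalization can be sketched as follows. The sketch assumes that both distributions are given as dictionaries over the same set of items with strictly positive probabilities (as is the case for items that occur every year); the function names and the toy distributions are illustrative only.

```python
import math

def kl_divergence(p_a, p_b):
    # Formula (1): sum of p_A * log2(p_A / p_B), in bits.
    return sum(pa * math.log2(pa / p_b[k]) for k, pa in p_a.items())

def symmetrized_kl(p_a, p_b):
    # Formula (2): D_{A,B} + D_{B,A}.
    return kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a)

def entropy(p):
    # Shannon entropy in bits, used to normalize the divergence.
    return -sum(v * math.log2(v) for v in p.values())

p_a = {"a": 0.5, "b": 0.5}
p_b = {"a": 0.25, "b": 0.75}

rho = symmetrized_kl(p_a, p_b)
# The right-hand side of formula (2), computed directly for comparison.
rho_direct = -sum((p_a[k] - p_b[k]) * math.log2(p_b[k] / p_a[k]) for k in p_a)
print(rho, rho_direct, rho / entropy(p_a))
```

The two ways of computing the symmetrized divergence agree, which is a quick check of the second equality in (2).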

The symmetrized Kullback-Leibler divergence was used in [10] to study the rate of change in frequencies of words. In that work, the notion of “lexical rate of change” was introduced. It is determined by the following formula:

$$ V\left( t \right) = \frac{1}{T}\frac{{\rho \left( {p\left( t \right),p\left( {t - T} \right)} \right)}}{{\left[ { H\left( t \right) + H\left( {t - T} \right) } \right]/2}} $$
(3)

Thus, the lexical rate of change is defined as the average normalized Kullback-Leibler divergence per year between two points in time (t and t − T). By implication, this value shows the relative rate of change in frequency of lexicon from year to year, taking into account differences in the information content of words.
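Formula (3) can be sketched as below. The sketch assumes that the relative frequencies for each year are stored as a dictionary over the same fixed sample of items and sum to one within the sample; `rate_of_change` and the toy data are hypothetical names and values.

```python
import math

def symmetrized_kl(p, q):
    # Formula (2) written as a single sum over a shared set of items.
    return sum((p[k] - q[k]) * math.log2(p[k] / q[k]) for k in p)

def entropy(p):
    return -sum(v * math.log2(v) for v in p.values())

def rate_of_change(freqs_by_year, t, T=10):
    # Formula (3): symmetrized divergence between years t and t - T,
    # divided by T and by the mean entropy of the two distributions.
    p_now, p_then = freqs_by_year[t], freqs_by_year[t - T]
    mean_entropy = (entropy(p_now) + entropy(p_then)) / 2
    return symmetrized_kl(p_now, p_then) / (T * mean_entropy)

# Toy two-item "lexicon" observed in two years (invented for the example).
freqs_by_year = {
    1900: {"a": 0.5, "b": 0.5},
    1910: {"a": 0.25, "b": 0.75},
}
v = rate_of_change(freqs_by_year, 1910, T=10)
print(v)
```

Because of the normalization, the result is a dimensionless relative rate per year and can be compared across samples with different entropies.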

The rate of change in frequency of English words is shown in Fig. 1. The method differs from the one described in [10]. The frequency series were smoothed by a median filter to avoid the influence of short-term frequency spikes associated with various historical events. A filter with a window length of 9 years was used for the graphs shown in Fig. 1. This removes frequency spikes with a duration of up to 4 years, after which the frequencies return to their previous values. For example, abrupt changes in the frequency of many words are observed in the corpus during wars. However, the frequencies of most words return to their previous values after the end of the wars. The filter also reduces the effect of random frequency fluctuations. With the given window length, the values of the filtered frequency series separated by 10 years or more remain uncorrelated. Therefore, the value of the parameter T = 10 years is chosen in formula (3).
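The effect of the 9-year median filter can be illustrated with a short sketch. The treatment of the ends of the series (a window truncated at the boundaries) is an assumption made here for simplicity and is not specified in the text; the function name and toy series are illustrative.

```python
from statistics import median

def median_filter(series, window=9):
    # Replace each value by the median of the values in a centred window.
    # Assumption: near the ends of the series the window is simply truncated.
    half = window // 2
    return [median(series[max(0, i - half): i + half + 1])
            for i in range(len(series))]

# A flat series with a 4-year spike, and the same series with a 5-year spike.
short_spike = [1.0] * 10 + [5.0] * 4 + [1.0] * 10
long_spike = [1.0] * 10 + [5.0] * 5 + [1.0] * 10

print(median_filter(short_spike))
print(median_filter(long_spike))
```

A spike of 4 years never fills more than 4 of the 9 positions in the window, so it is removed completely, whereas a 5-year spike survives at its centre.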

Fig. 1. The rate of change in frequency of words and syntactic bigrams in English

The rate of change was determined in [10] using a sample of the 100 thousand most frequent words. The objective of this paper is to study the rate of change in frequencies of syntactic bigrams. For this reason, the sample was formed according to a different principle. We selected syntactic bigrams that occur in the corpus every year over a sufficiently long period of time. The number of selected English bigrams occurring every year in the period 1800–2008 was 1,026,098. These syntactic bigrams contain 37,846 unique words, which can be found in the corpus every year (in the given interval) and were used for constructing the curve in Fig. 1. These words can be regarded as belonging to the core vocabulary. There are various approaches to the definition of the core vocabulary. This issue is discussed, for example, in [12, 14]. In [12], the words that occur in the diachronic corpus every year during a 200-year period are selected and called the “core lexicon”. The procedure used in this article is similar to the one proposed in [12], with one difference: first, the syntactic bigrams found in the corpus every year are selected, and then a list of the words contained in these bigrams is made. Thus, the selection criterion for these words is stricter than in [12]. The relative frequencies of the selected words were determined from the 1-gram base of the common English sub-corpus of Google Books Ngram.
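The selection procedure can be sketched as follows, assuming that yearly counts are available as a dictionary mapping each year to a dictionary of bigram counts. The data layout, the function names and the toy counts are assumptions made for illustration.

```python
def select_core_bigrams(counts_by_year, first_year, last_year):
    # Keep only the bigrams that have a non-zero count in every year of the interval.
    core = None
    for year in range(first_year, last_year + 1):
        present = {bg for bg, c in counts_by_year.get(year, {}).items() if c > 0}
        core = present if core is None else core & present
    return core or set()

def words_in(bigrams):
    # The word list is derived from the selected bigrams, not selected independently.
    return {word for bigram in bigrams for word in bigram}

# Toy counts (invented for the example).
counts_by_year = {
    1800: {("of", "history"): 3, ("new", "idea"): 1},
    1801: {("of", "history"): 2},
    1802: {("of", "history"): 4, ("new", "idea"): 2},
}
core_bigrams = select_core_bigrams(counts_by_year, 1800, 1802)
core_words = words_in(core_bigrams)
print(core_bigrams, core_words)
```

A word enters the list only if it belongs to at least one bigram that is present every year, which is why this criterion is stricter than requiring the word itself to occur every year.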

The approach proposed in [10] for determining the rate of change in frequency of words can also be used to study changes in frequency of word combinations and collocations. To do this, one needs to replace the numerator in formula (3) by the expression for the symmetrized Kullback-Leibler divergence between the frequency distributions of word combinations in years t and t − T. For the calculations, the frequency series of the syntactic bigrams are preprocessed in the same way as the word frequency series.

The results are also shown in Fig. 1 (the dash-dotted curve). As stated above, the curve is built using a sample of 1,026,098 bigrams. As can be seen in the figure, the curves showing the rate of change in frequencies of words and syntactic bigrams are very similar. More precisely, the Spearman correlation coefficient is 0.8846 (the p-value is \( 2.61 \cdot 10^{-4} \)). As mentioned above, the frequencies of bigrams can change due to changes in the frequency of the words they contain and due to changes in word co-occurrence. To determine the contribution of each of these factors, let us consider the structure of the expression for the symmetrized Kullback-Leibler divergence for the frequencies of syntactic bigrams:

$$ D_{t,t - T}^{{\left( {12} \right)}} = \sum\nolimits_{i,j} {\left\{ {f_{ij}^{{\left( {12,t} \right)}} \log_{2} \frac{{f_{ij}^{{\left( {12,t} \right)}} }}{{f_{ij}^{{\left( {12,t - T} \right)}} }} + f_{ij}^{{\left( {12, t - T} \right)}} \log_{2} \frac{{f_{ij}^{{\left( {12,t - T} \right)}} }}{{f_{ij}^{{\left( {12,t} \right)}} }}} \right\}} $$
(4)

Here \( f_{ij}^{{\left( {12,t} \right)}} \) is the relative frequency of the word combination with the i-th word in the first place and the j-th word in the second place for the year t. In contrast to the lexical level, one encounters a new phenomenon: the divergence determined by formula (4) depends both on the change in the frequencies of individual words and on the changes in the relations between them. The next task is to rearrange the terms in (4) and divide the expression into components that are associated mainly with the first or the second of these processes.

To perform the transformations, the expression for the divergence of the distributions of words in the word combination is needed:

$$ D_{t,t - T}^{\left( q \right)} = \sum\nolimits_{i} {\left\{ {f_{i}^{{\left( {q,t} \right)}} \log_{2} \frac{{f_{i}^{{\left( {q,t} \right)}} }}{{f_{i}^{{\left( {q,t - T} \right)}} }} + f_{i}^{{\left( {q,t - T} \right)}} \log_{2} \frac{{f_{i}^{{\left( {q,t - T} \right)}} }}{{f_{i}^{{\left( {q,t} \right)}} }}} \right\}} $$
(5)

where q, which is equal to 1 or 2, is the place of the word in the word combination, and \( f_{i}^{{\left( {q,t} \right)}} \) is the relative frequency of the i-th word. To simplify the expression (4), we introduce the normalized frequency of the word combination \( M_{ij}^{t} \) and define it by the expression:

$$ M_{ij}^{t} = \frac{{f_{ij}^{{\left( {12,t} \right)}} }}{{f_{i}^{{\left( {1,t} \right)}} f_{j}^{{\left( {2,t} \right)}} }} $$
(6)

The denominator in (6) is the frequency that the word combination would have in a random text in which the words are chosen independently. The normalized frequency is related to pointwise mutual information MI, which was originally introduced in information theory and is used in linguistics to estimate the associative connection between words and to identify collocations [15], by the simple relation

$$ M = 2^{MI} $$
(7)
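Expressions (6) and (7) can be illustrated by a small sketch. It assumes that the frequencies of the first and second words are obtained as marginal sums of the bigram frequency table; the function names and the toy table are hypothetical.

```python
import math

def marginals(bigram_freq):
    # Frequencies of words in the first and second place, as marginal sums.
    first, second = {}, {}
    for (i, j), f in bigram_freq.items():
        first[i] = first.get(i, 0.0) + f
        second[j] = second.get(j, 0.0) + f
    return first, second

def normalized_frequency(bigram_freq):
    # Formula (6): observed bigram frequency divided by the frequency
    # expected if the two words were chosen independently.
    first, second = marginals(bigram_freq)
    return {(i, j): f / (first[i] * second[j])
            for (i, j), f in bigram_freq.items()}

# Toy relative frequencies of four bigrams (invented for the example).
freq = {("a", "x"): 0.4, ("a", "y"): 0.1, ("b", "x"): 0.1, ("b", "y"): 0.4}

M = normalized_frequency(freq)
MI = {pair: math.log2(m) for pair, m in M.items()}  # pointwise mutual information
print(M)
print(MI)
```

Pairs that co-occur more often than expected under independence have M above 1 (positive MI), and pairs that co-occur less often have M below 1 (negative MI).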

Thus, one can expect that the normalized frequency depends on the degree of associative connection between the words in the word combination rather than on their frequencies. Substituting (6) into (4) and performing some transformations reduces the quantity \( D_{t,t - \Delta }^{{\left( {12} \right)}} \) to the following form (here and below, Δ denotes the same time interval as T):

$$ \begin{array}{*{20}l} {D_{t,t - \Delta }^{{\left( {12} \right)}} = \left\{ {D_{t,t - \Delta }^{\left( 1 \right)} + D_{t,t - \Delta }^{\left( 2 \right)} } \right\} + \sum\nolimits_{i,j} {\frac{{f_{i}^{{\left( {1,t} \right)}} f_{j}^{{\left( {2,t} \right)}} + f_{i}^{{\left( {1,t - \Delta } \right)}} f_{j}^{{\left( {2,t - \Delta } \right)}} }}{2}\left( {M_{ij}^{t} - M_{ij}^{t - \Delta } } \right)\log_{2} \frac{{M_{ij}^{t} }}{{M_{ij}^{t - \Delta } }} + \ldots } } \hfill \\ {\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \ldots + \sum\nolimits_{i,j} {\left( {f_{i}^{{\left( {1,t} \right)}} f_{j}^{{\left( {2,t} \right)}} - f_{i}^{{\left( {1,t - \Delta } \right)}} f_{j}^{{\left( {2,t - \Delta } \right)}} } \right)\frac{{M_{ij}^{t} + M_{ij}^{t - \Delta } }}{2}\log_{2} \frac{{M_{ij}^{t} }}{{M_{ij}^{t - \Delta } }}} } \hfill \\ \end{array} $$
(8)
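The transformations leading to (8) can be sketched as follows. The sketch assumes that the word frequencies \( f_{i}^{(1,t)} \) and \( f_{j}^{(2,t)} \) are the marginal sums of the bigram frequencies; the shorthand \( a_{ij}^{t} \) is introduced here only for brevity and is not part of the original notation. Writing (4) as a single sum, as in (2), and using (6):

$$ a_{ij}^{t} = f_{i}^{(1,t)} f_{j}^{(2,t)} , \qquad f_{ij}^{(12,t)} = a_{ij}^{t} M_{ij}^{t} $$

$$ D_{t,t - \Delta }^{(12)} = \sum\nolimits_{i,j} \left( f_{ij}^{(12,t)} - f_{ij}^{(12,t - \Delta )} \right) \left[ \log_{2} \frac{a_{ij}^{t}}{a_{ij}^{t - \Delta }} + \log_{2} \frac{M_{ij}^{t}}{M_{ij}^{t - \Delta }} \right] $$

The logarithm of the ratio of \( a_{ij} \) splits into a part depending only on i and a part depending only on j. Summing the bigram frequencies over the other index gives the word frequencies, so that, for example,

$$ \sum\nolimits_{i,j} \left( f_{ij}^{(12,t)} - f_{ij}^{(12,t - \Delta )} \right) \log_{2} \frac{f_{i}^{(1,t)}}{f_{i}^{(1,t - \Delta )}} = \sum\nolimits_{i} \left( f_{i}^{(1,t)} - f_{i}^{(1,t - \Delta )} \right) \log_{2} \frac{f_{i}^{(1,t)}}{f_{i}^{(1,t - \Delta )}} = D_{t,t - \Delta }^{(1)} $$

and similarly for \( D_{t,t - \Delta }^{(2)} \). For the remaining part, the difference of the bigram frequencies is rewritten using the identity

$$ a_{ij}^{t} M_{ij}^{t} - a_{ij}^{t - \Delta } M_{ij}^{t - \Delta } = \frac{a_{ij}^{t} + a_{ij}^{t - \Delta }}{2}\left( M_{ij}^{t} - M_{ij}^{t - \Delta } \right) + \left( a_{ij}^{t} - a_{ij}^{t - \Delta } \right)\frac{M_{ij}^{t} + M_{ij}^{t - \Delta }}{2} $$

Multiplying both sides by \( \log_{2} \left( M_{ij}^{t} / M_{ij}^{t - \Delta } \right) \) and summing over i and j gives the second and third terms of (8).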

The first term in expression (8), enclosed in braces, is the sum of one-dimensional divergences and thus depends only on changes in the frequencies of the words themselves. To clarify the meaning of the second and third terms, they are expanded in a Taylor series. Assuming that the frequency changes are small, we confine ourselves to the first approximation. For the second term, the following approximate expression is obtained:

$$ \approx \frac{1}{\ln 2}\sum\nolimits_{i,j} {\frac{{f_{i}^{{\left( {1,t} \right)}} f_{j}^{{\left( {2,t} \right)}} + f_{i}^{{\left( {1,t - \Delta } \right)}} f_{j}^{{\left( {2,t - \Delta } \right)}} }}{{M_{ij}^{t} + M_{ij}^{t - \Delta } }}\left( {M_{ij}^{t} - M_{ij}^{t - \Delta } } \right)^{2} } $$
(9)

Hence, it is clear that if the changes are small, this term is always non-negative and depends primarily on the change in \( M_{ij}^{t} \) (that is, on the change in the co-occurrence of words). Similarly, the following expression is obtained for the third term:

$$ \approx \frac{1}{\ln 2}\sum\nolimits_{i,j} {\left( {f_{i}^{{\left( {1,t} \right)}} f_{j}^{{\left( {2,t} \right)}} - f_{i}^{{\left( {1,t - \Delta } \right)}} f_{j}^{{\left( {2,t - \Delta } \right)}} } \right)\left( {M_{ij}^{t} - M_{ij}^{t - \Delta } } \right)} $$
(10)

Thus, this term depends both on the increments of \( M_{ij}^{t} \) and on the increments of the word frequencies, and is similar to the covariance of these quantities. It was found that in the proposed expressions:

  1. the first term shows the contribution of the rate of change in the frequencies of words to the observed values of the rate of change in the frequencies of word combinations;

  2. the second term shows the contribution from the change in the co-occurrence of these words (if their frequencies are invariable);

  3. the third term is determined both by the change in the co-occurrence of the words and by the change in their frequencies.

This allows us to determine what percentage of the changes is caused by the change in the frequencies of the words themselves and what percentage by the change in their co-occurrence.
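The three components of expression (8) can be computed directly, as in the following sketch. It assumes that the word frequencies are the marginal sums of the bigram frequencies and that both years contain the same set of bigrams; the function names and the toy frequencies are illustrative, and the normalization of formula (3) is omitted.

```python
import math

def marginals(f):
    first, second = {}, {}
    for (i, j), v in f.items():
        first[i] = first.get(i, 0.0) + v
        second[j] = second.get(j, 0.0) + v
    return first, second

def sym_kl(p, q):
    # Symmetrized Kullback-Leibler divergence over a shared set of items.
    return sum((p[k] - q[k]) * math.log2(p[k] / q[k]) for k in p)

def decompose(f_now, f_then):
    # Returns the three terms of expression (8): word frequencies,
    # co-occurrence, and the mixed (covariance-like) term.
    w1_now, w2_now = marginals(f_now)
    w1_then, w2_then = marginals(f_then)
    words = sym_kl(w1_now, w1_then) + sym_kl(w2_now, w2_then)
    cooc = cross = 0.0
    for (i, j), f in f_now.items():
        a_now = w1_now[i] * w2_now[j]
        a_then = w1_then[i] * w2_then[j]
        m_now = f / a_now                  # formula (6), year t
        m_then = f_then[(i, j)] / a_then   # formula (6), year t - Δ
        log_ratio = math.log2(m_now / m_then)
        cooc += (a_now + a_then) / 2 * (m_now - m_then) * log_ratio
        cross += (a_now - a_then) * (m_now + m_then) / 2 * log_ratio
    return words, cooc, cross

# Toy bigram frequencies for two years (invented for the example).
f_then = {("a", "x"): 0.4, ("a", "y"): 0.1, ("b", "x"): 0.1, ("b", "y"): 0.4}
f_now = {("a", "x"): 0.35, ("a", "y"): 0.2, ("b", "x"): 0.15, ("b", "y"): 0.3}

words, cooc, cross = decompose(f_now, f_then)
total = sym_kl(f_now, f_then)  # expression (4)
print(words, cooc, cross, total)
print("share of co-occurrence components:", (cooc + cross) / total)
```

The three components add up to the total divergence (4) exactly, up to rounding, so the share of each component can be expressed as a percentage of the total.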

4 Results

Figure 2 shows the values of the three components that contribute to the rate of change in frequencies of the syntactic bigrams in different years. The values of the Kullback-Leibler divergence components calculated in accordance with expressions (5), (8), (9) and (10) are normalized by the time interval T and the entropy of the frequency distribution of syntactic bigrams in accordance with expression (3). It should be noted that the value of the 3rd component (see formulas (8) and (10)) is negative for all years. Therefore, the values of this component are shown with the opposite sign in Fig. 2. The fact that these values are less than zero indicates that the correlation between the increments of the word frequencies and the increments of \( M_{ij}^{t} \) is usually negative, although small.

Fig. 2. The rate of change in frequency of syntactic bigrams in English. (A) The contribution of each component. The 3rd component is shown with a minus sign; (B) Total rate of change. The contribution of changes in the co-occurrence of words (the sum of the 2nd and 3rd components)

The curve for the sum of the 2nd and 3rd components (Fig. 2B) therefore shows the total contribution of changes in word co-occurrence to the rate of change in frequency of syntactic bigrams. The curve showing the total rate of change in frequencies of syntactic bigrams is given for comparison (see the curve ‘total’). When comparing Figs. 1 and 2, it should be borne in mind that the word frequencies in Fig. 1 are calculated using the 1-gram base, whereas the word frequencies in Fig. 2 are calculated using the selected list of syntactic bigrams. These frequencies are slightly different for two main reasons. Firstly, there are not many rare syntactic bigrams in the list. Secondly, one word can be found in several syntactic bigrams in a sentence.

The curves in Fig. 2 are similar. Strong spikes indicate a response to significant historical events (primarily the two world wars) and social changes. The spikes are interspersed with relatively smooth sections, which are characterized by lower values of the rate of change and minor fluctuations near a smooth trend line. The curves, except the 2nd component, have some common features. It should be noted that the components associated with the change in co-occurrence of words respond to historical events less strongly than the components associated with the change in frequencies of the words belonging to syntactic bigrams. For example, the peak value of the spike associated with the First World War for the ‘total’ curve is 2.93 times higher than the values in the previous period (3.8 times for the 1st component and 1.8 times for the sum of the 2nd and 3rd components). The fluctuations of the curves associated with changes in co-occurrence are also small during periods when no major historical events occur.

One can select two intervals without rapid spikes in the figure and estimate how the rate changes over large time intervals. Let us choose the two time intervals 1875–1900 and 1945–1965. The historical events that happened within these intervals apparently did not have a significant impact on changes in frequencies of the words and bigrams. Figure 3 shows boxplots for the rate of change in frequency of syntactic bigrams in the specified time intervals (left), the component associated with changes in the frequency of words in the syntactic bigrams (middle) and the component associated with changes in word co-occurrence (right).

Fig. 3. The rate of change in frequencies of syntactic bigrams in 1875–1900 and 1945–1965 (window A), the values of the component associated with changes in frequencies of words in the syntactic bigrams (window B), and the values of the component associated with changes in the word co-occurrence (window C)

In general, the rate of change in the frequencies of syntactic bigrams tends to increase over these 70 years. However, this change is relatively small (taking into account the correlation between the samples, the change in the rate cannot be considered statistically significant). By contrast, the component associated with the change in the frequency of words in syntactic bigrams increases significantly over this time, while the component associated with the change in co-occurrence decreases to an even greater degree. Thus, the tendency for the proportion of the components associated with the change in word co-occurrence to decrease in the English corpus (see Fig. 2) is also confirmed when only the “quiet” intervals are considered.

A similar analysis was also carried out for the Russian language. We selected 539,940 syntactic bigrams which occur in the Google Books Ngram Russian sub-corpus every year in the period 1920–2009. This interval was chosen to avoid difficulties associated with the Russian spelling reform of 1918. The selected syntactic bigrams contain 81,991 unique words. Lemmatization was not used because, when calculating the average frequencies of the lemmas, distortions may occur due to homonymy. We performed the calculations for the obtained samples in the same way as for the samples of English syntactic bigrams. The results are shown in Fig. 4 (the curves are labelled in the same way as in Fig. 2).

Fig. 4. The rate of change in frequency of syntactic bigrams in Russian. (A) The contribution of each component. The 3rd component is shown with a minus sign; (B) Total rate of change. The contribution of changes in the co-occurrence of words (the sum of the 2nd and 3rd components)

The large values of the rate of change at the beginning of the target time interval are due to the relatively small size of the corpus at that time and to the significant social changes occurring during this period. A significant spike can be observed starting from the end of the 1980s, which is due to the collapse of the Soviet Union and related historical events. It can be seen that the component associated with the change in the frequency of words in the syntactic bigrams (the 1st component) responds strongly to these events. However, the component associated with the change in word co-occurrence (the sum of the 2nd and 3rd components) shows practically no response; its values even decrease slightly during this period.

Apart from the spike in the late 1980s and the 1990s, all the curves tend to decrease. It is natural to assume that this can be due to the increase of the corpus size with time. The larger the corpus, the smaller the random fluctuations of the sample frequencies. Thus, the estimates of the rate of frequency change can decrease. This effect explains the behaviour of the curves in the initial section of Fig. 4.

The corpus size has a greater impact on bigram frequencies than on word frequencies because the frequencies of bigrams are significantly lower and their relative fluctuations are larger. Therefore, special attention should be paid to the interval 1960–1991. The number of books represented in the Google Books Ngram Russian sub-corpus varies greatly from year to year and is largest in 1960–1991. In this period, 65–85 thousand books were published in the USSR every year. The corpus contains approximately 10 thousand volumes (about 1–1.25 billion words) for each year, i.e. not less than 12% of all published books. Thus, there is a 31-year time period during which the corpus size varied within small limits.

The curves in Fig. 4 have a small downward trend in 1960–1985. Thus, the observed downward tendency in the average rate of change in frequencies of syntactic bigrams cannot be explained only by the increase of the corpus size. Let us consider further how the ratio between the components of the rate of change varies over time. Figure 5 shows the percentage of the components associated with the change in word co-occurrence in the total value of the rate of change for the English and Russian sub-corpora of Google Books Ngram.

Fig. 5. Percentage of the components associated with the change in the word co-occurrence in the total value of the rate of change for English and Russian sub-corpora of Google Books Ngram. The 1960–1991 interval is marked by the dotted line. The corpus size changed insignificantly at that time

As can be seen, for English the share of the components associated with the change in word co-occurrence has fallen from 40–50% in the middle of the 19th century to 25% in the 1990s and even lower in the early 2000s. As mentioned above, one of the reasons for this tendency is the increase of the corpus size. No definite direction of change is observed in the Russian time interval of greatest interest, where the range of fluctuations is 21–26%. In the second half of the 1980s and in the 1990s, a decrease in the proportion of the components associated with the change in word co-occurrence can be observed, which is due to the increase in the rate of change. By the beginning of the 21st century, the proportion had returned to its previous level. The data obtained for the Russian language do not contradict the assumption that in “quiet” historical periods, when there are, for example, no wars or revolutions, the share of the components related to the change in word co-occurrence is approximately constant.

The curves obtained for English and Russian were compared. The share of the components associated with the change in word co-occurrence is higher for English than for Russian. However, there is a small area at the end of the graph where it is lower for English than for Russian. This can be explained by the short-term impact of historical events. Nevertheless, the values for the two languages have converged by the present time.

5 Conclusion

The rate of changes in frequency of syntactic bigrams in Russian and English was calculated in this work. It is smooth in both languages, except for the periods when major events happened in the history of these countries, for example, wars.

When calculating the rate of change, the contributions of changes in the frequency of words in the syntactic bigrams and of changes in word co-occurrence were estimated. The changes in the frequency of words contained in the syntactic bigrams contribute to the rate of change more than the changes in word co-occurrence. The obtained results are consistent with the findings in [7], where the increase in the number of unique bigrams in the corpus was considered. Both processes respond to important historical events, but differently: the response of the changes in word co-occurrence is significantly weaker than the response of the changes in word frequencies.

The rate of change associated with the changes in the word co-occurrence in English is lower than that in Russian. In conclusion it should be said that regularities associated with the use of syntactic bigrams are similar in both languages.