1 Introduction

In this chapter, we describe trackers and their uses in educational measurement. For now, we loosely define trackers as dynamic parameter estimates, adapting to possible changes in ability or item difficulty. Trackers can be especially useful in measurements that extend over a long period of time at irregular intervals, e.g., the continual measurement of abilities in computer adaptive practice (CAP) or computer adaptive learning (CAL) (Brinkhuis et al. 2018; Klinkenberg et al. 2011; Wauters et al. 2010; Veldkamp et al. 2011) or the monitoring of item difficulties in item banks (Brinkhuis et al. 2015). Many actors can be involved in changes of these parameters, including the pupils themselves, their teachers, their parents, educational reforms, etc. Moreover, change in the parameters, and in the model itself, is especially likely if the outcomes of the measurements are used for feedback, as in assessment for learning (Black and Wiliam 2003; Bennett 2011; Wiliam 2011). Since feedback is provided to many actors in education, the result is a complex dynamical system, including all sorts of interactions.

The development of these parameters is not easily modeled due to these changes and feedback loops. The application of latent growth models (McArdle and Epstein 1987; Meredith and Tisak 1990; Hox 2002), change point estimation models (Hinkley 1970; Chib 1998; Visser et al. 2009), or other models that explicitly model the development of parameters is therefore not straightforward. State space models such as the Kalman filter (KF) (Kalman 1960; Welch and Bishop 1995; van Rijn 2008), or the more general particle filters (Arulampalam et al. 2002), also include an explicit growth model and are therefore not ideal for following educational progress continually, as in CAP systems.

Historically, we see that in other fields where continual progress measurements take place, rating systems emerged. For example, in chess, rating systems were developed for the estimation of continually changing chess playing abilities, such as the widely used Elo rating system (ERS) (Elo 1978; Batchelder and Bershad 1979; Batchelder et al. 1992). Advantages of rating systems such as Elo’s are that they are computationally light and do not assume a growth model. After each new measurement, the parameter estimates can be updated using only the previous estimate and the new observation, without taking the full history into account, i.e., satisfying the Markov property, and without assuming a model for the development of the parameters. The ERS has found many uses, including applications in educational measurement (Klinkenberg et al. 2011; Wauters et al. 2010; Pelánek et al. 2017). Though there are many practical uses of rating systems, they lack certain desirable statistical properties such as convergence and unbiasedness (Brinkhuis and Maris 2009).

In tracking educational progress, we are interested in combining properties of both state space models such as KFs and rating systems such as the ERS, i.e., we would like to dynamically track changes in the abilities of individuals or in item difficulties, without having to assume specific types of development. Moreover, we require that, if ability is stable for some time, our tracker provides unbiased estimates with a known error variance. In this chapter, we describe a tracker with these properties.

2 Methods

2.1 Formalizing a Tracker

We formalize the representation of a tracker in the scheme in (8.1), where we illustrate the development of someone’s unknown true ability \(\theta \) over time t, where each column represents a consecutive time point:

$$\begin{aligned} \begin{array}{lccccccccccc} \text{ability} & & & \theta_1 & \rightarrow & \theta_2 & \rightarrow & \theta_3 & \rightarrow & \cdots & \rightarrow & \theta_t\\ & & & \downarrow & & \downarrow & & \downarrow & & & & \downarrow\\ \text{responses} & & & Y_1 & & Y_2 & & Y_3 & & \cdots & & Y_t\\ & & & \downarrow & & \downarrow & & \downarrow & & & & \downarrow\\ \text{estimates}\quad & X_0 & \rightarrow & X_1 & \rightarrow & X_2 & \rightarrow & X_3 & \rightarrow & \cdots & \rightarrow & X_t \end{array} \end{aligned}$$
(8.1)

The abilities \(\theta \) are related by horizontal arrows, since we assume that one’s true ability at time point t is related at least to one’s ability at time point \(t-1\) and likely influenced by many other factors, which we leave out of this scheme. At time point t, scored responses \(Y_t\) are obtained using a single item, or a number of items. The ability estimate \(X_t\) depends only on the previous state \(X_{t-1}\) and the current item response \(Y_t\), therefore satisfying the Markov property. The scheme in (8.1) represents Markov chains in general, including the ERS.

Since we are especially interested in the properties of unbiasedness and convergence, we present a more specific scheme in (8.2). Here, we assume for the moment that someone’s ability does not change, i.e., \(\theta _t=\theta \;\forall \;t\), and we require \(X_\infty \) to have a known distribution, for example centered around the true ability \(\theta \) with normally distributed error \(\mathcal {E}\):

$$\begin{aligned} \begin{array}{lcccccccccccl} \text{ability} & & & \theta & \rightarrow & \theta & \rightarrow & \theta & \rightarrow & \cdots & \rightarrow & \theta & \\ & & & \downarrow & & \downarrow & & \downarrow & & & & \downarrow & \\ \text{responses} & & & Y_1 & & Y_2 & & Y_3 & & \cdots & & Y_{\infty} & \\ & & & \downarrow & & \downarrow & & \downarrow & & & & \downarrow & \\ \text{estimates}\quad & X_0 & \rightarrow & X_1 & \rightarrow & X_2 & \rightarrow & X_3 & \rightarrow & \cdots & \rightarrow & X_{\infty} & \sim \theta +\mathcal{E} \end{array} \end{aligned}$$
(8.2)

We want to create a tracking algorithm that provides estimates \(X_t\) that adapt to changes in \(\theta _t\), but that have a known distribution if \(\theta _t\) is invariant for some time, as represented in (8.2). Such a tracking algorithm is similar to KFs in that its states have a known distribution (Arulampalam et al. 2002), and similar to the ERS, specifically Elo’s Current Rating Formula (Elo 1978), in that it continually adapts to changes in the underlying parameters without having to specify a growth model. An illustration of a simple tracker that conforms to this definition is given in Sect. 8.2.2, after which a proof of convergence is given in Sect. 8.2.3.

2.2 Example of a Tracker

We present a simple non-trivial case of a tracker that conforms to scheme (8.1), i.e., it dynamically adapts to change in the model parameters, and converges to a known error distribution if the scheme in (8.2) holds.

2.2.1 Coin Tossing Tracker

Consider the following coin tossing example.

$$\begin{aligned} \Pr (Y_i=1|\theta )=\theta \end{aligned}$$
(8.3)

where the probability of tossing heads, i.e., \(Y_i=1\), is \(\theta \). If we consider a bandwidth of n sequential coin flips, then we simply define the sum score \(X_+^{(n)}\) as follows:

$$\begin{aligned} X_+^{(n)}=\sum _{i=1}^n Y_i \sim \text {binom}(n,\theta ) \end{aligned}$$
(8.4)

Since \((Y_1,\dots ,Y_n)\) is independent of \(\theta \) given \(X_+^{(n)}\), i.e., \(X_+^{(n)}\) is a sufficient statistic for \(\theta \), we can define an auxiliary variable Z using this sufficient statistic:

$$\begin{aligned} \Pr (Z=1|X_+^{(n)}=x_+)=\frac{x_+}{n}=\Pr (Y_i=1|X_+^{(n)},\theta ). \end{aligned}$$
(8.5)

Using the auxiliary variable Z, a pseudo response drawn with the probability in (8.5) given the sum score \(X_+^{(n)}\), and the observed response Y, we readily find the following sequential update rule for \(X_+^{(n)}\), where the subscript t indexes time:

$$\begin{aligned} X_{t+1}^{(n)}=X_{t}^{(n)}+Y_t-Z_t\sim \text {binom}(n,\theta ) \end{aligned}$$
(8.6)

which gives us the simplest non-trivial tracker \(X_t^{(n)}/n\) for \(\theta _t\) meeting our definition in (8.2).
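For concreteness, a minimal sketch of this update rule in Python follows (our own illustration, not part of the chapter's original simulations; the function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def tracker_update(x, y, n, rng=rng):
    """One step of the coin tossing tracker in (8.6).

    x: current sum score X_t, an integer in 0..n
    y: observed response Y_t (0 or 1)
    n: bandwidth, so that X_t / n is the current estimate of theta
    """
    z = rng.binomial(1, x / n)   # auxiliary pseudo response Z_t with Pr(Z_t = 1) = X_t / n, cf. (8.5)
    return x + y - z             # add the observed response, subtract the pseudo response

# Example: one update starting from X_t = 9 with bandwidth n = 30.
x_next = tracker_update(x=9, y=1, n=30)
print(x_next, x_next / 30)       # new sum score and the corresponding estimate of theta
```

Note that the estimate stays within its bounds: if \(X_t=n\) the pseudo response is always 1, and if \(X_t=0\) it is always 0.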

We provide some illustrations to demonstrate the workings of the tracker in (8.6), using simulations.

2.2.2 Illustration of Convergence

First, we demonstrate the convergence of the sequential estimates \(X_t^{(n)}\) to the invariant distribution, \(\text {binom}(n,\theta )\). As data, 1000 coin tosses are simulated with \(\theta =.3\). Using the algorithm in (8.6) with \(n=30\), \(X_t^{(n)}\) was sequentially estimated on the data and its distribution plotted in Fig. 8.1. As a reference, the theoretical binomial distribution (\(n=30,\theta =.3\)) was added. Clearly, this tracker converged nicely to the expected distribution, as the two lines in Fig. 8.1 coincide. While the simulation used an invariant probability of the coin falling heads, \(\theta =.3\), conforming to the scheme in (8.2), we can also simulate various changes to \(\theta \) over time, conforming to the scheme in (8.1).

Fig. 8.1 Theoretical and empirical cumulative score distribution
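The convergence illustration can be reproduced along the following lines; a sketch under the settings stated above (\(\theta =.3\), \(n=30\), 1000 tosses), with the comparison to the theoretical \(\text {binom}(30,.3)\) distribution printed rather than plotted:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(seed=2)

theta, n, T = 0.3, 30, 1000
x = 0                               # deliberately start far from the invariant distribution
trace = np.empty(T, dtype=int)

for t in range(T):
    y = rng.binomial(1, theta)      # a new coin toss with Pr(heads) = theta
    z = rng.binomial(1, x / n)      # auxiliary pseudo response, cf. (8.5)
    x = x + y - z                   # update rule (8.6)
    trace[t] = x

# Compare the empirical distribution of X_t over the run with the theoretical binom(n, theta).
values, counts = np.unique(trace, return_counts=True)
for v, e, p in zip(values, counts / T, binom.pmf(values, n, theta)):
    print(f"X = {v:2d}  empirical {e:.3f}  theoretical {p:.3f}")
```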

2.2.3 Illustration of Tracking Smooth Growth

We simulate a scenario where \(\theta \) smoothly changes over time t, i.e., we generate 1000 coin tosses with an increasing \(\theta \), and evaluate the development of the tracker in Fig. 8.2. Though \(\theta \) is not stable at any time, it is clear that the tracker follows the development of \(\theta \) quite closely with little lag. The bandwidth n of the algorithm in (8.6) determines the step size, and hence how fast the tracker can adapt to changes in \(\theta \): for this specific tracker a large n corresponds to a small step size and a small n to a large step size. Since \(\theta \) is continually changing here, the tracker does not converge, but tracks the change rather well.

Fig. 8.2 Tracking smooth growth

2.2.4 Illustration of Tracking Sudden Changes

Next, we simulate a change point growth model where the probability of the coin falling heads changes from \(\theta =.3\) to \(\theta =.8\) at \(t=500\). The tracker is plotted in Fig. 8.3. Again, it can be seen that the tracker follows the development of \(\theta \) closely. The tracker always lags somewhat, i.e., it follows the development of \(\theta \) with some delay that depends on the step size of the algorithm. This lag can be observed after the change point, and its size is related to both the size of the change and the step size of the algorithm.

Fig. 8.3 Tracking a change point growth model
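A sketch of this change point scenario is given below; the change from \(\theta =.3\) to \(\theta =.8\) at \(t=500\) is taken from the text, while the bandwidth \(n=30\) is our own assumption:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

T, n = 1000, 30
theta = np.where(np.arange(T) < 500, 0.3, 0.8)   # change point at t = 500
x = rng.binomial(n, theta[0])                     # start near the invariant distribution for theta = .3
estimates = np.empty(T)

for t in range(T):
    y = rng.binomial(1, theta[t])    # coin toss with the current theta
    z = rng.binomial(1, x / n)       # auxiliary pseudo response, cf. (8.5)
    x = x + y - z                    # update rule (8.6)
    estimates[t] = x / n             # the tracker X_t / n follows theta_t

# The tracker is near .3 just before the change point and approaches .8 with some lag afterwards.
print(estimates[495:500].round(2), estimates[550:555].round(2), estimates[-5:].round(2))
```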

2.2.5 Illustration of Varying Step Sizes

In Fig. 8.4 we illustrate the effect of varying the step size. A smooth development of \(\theta \) is simulated, increasing from about .1 to just over .8. Three trackers are simulated using three different values, \(n=10\), \(n=50\), and \(n=100\), where a small n corresponds to a large step size. It can be seen that the noisiest tracker, having the largest step size and therefore the smallest n, adapts most quickly to changes in ability \(\theta \), while the tracker with the smallest step size shows less noise but quite some lag. The choice of step size is a straightforward bias-variance trade-off. Large step sizes allow for quick adaptation to possibly large changes in \(\theta \), at the cost of considerable variance if \(\theta \) is stable. Small step sizes reduce this variance at the risk of introducing bias under a changing \(\theta \).

Fig. 8.4 Trackers with step sizes \(n=10\), \(n=50\) and \(n=100\)
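A sketch comparing the three bandwidths follows; the exact trajectory of \(\theta \) used in Fig. 8.4 is not specified, so we assume, for illustration, a linear increase from .1 to just over .8:

```python
import numpy as np

rng = np.random.default_rng(seed=4)

T = 1000
theta = np.linspace(0.1, 0.82, T)            # assumed smooth development of theta
bandwidths = [10, 50, 100]                   # a small n corresponds to a large step size
state = {n: int(round(theta[0] * n)) for n in bandwidths}
paths = {n: np.empty(T) for n in bandwidths}

for t in range(T):
    y = rng.binomial(1, theta[t])            # one shared coin toss per time point
    for n in bandwidths:
        z = rng.binomial(1, state[n] / n)    # auxiliary pseudo response for this tracker
        state[n] = state[n] + y - z          # update rule (8.6)
        paths[n][t] = state[n] / n

# Small n: fast adaptation but noisy; large n: smooth but lagging behind the true theta.
for n in bandwidths:
    print(n, paths[n][-1].round(2), "vs true", round(theta[-1], 2))
```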

These examples demonstrate that this simplest non-trivial tracker follows the development of \(\theta \), cf. the scheme in (8.1), and converges to an invariant distribution if \(\theta \) is stable, cf. the scheme in (8.2). We would like to point out that, though this example is simple, i.e., for simple coin flips the tracker has the same properties as a moving average, such trackers differ substantively from both maximum likelihood estimation (MLE) and Bayesian estimation techniques. The tracker estimates continually adapt to changes in the model parameters, and convergence in distribution takes place when both the model parameter and the transition kernel are unchanging. This property of convergence under the scheme in (8.2) is generalized to all trackers in the following section.

2.3 Convergence in Kullback-Leibler Divergence

We provide a general proof of the convergence of Markov chains to an invariant distribution, given that this distribution does not change between two time points t and \(t+1\). We use the Kullback-Leibler (KL) divergence (Kullback and Leibler 1951; Eggen 1999) to quantify the divergence between the current distribution \(f_t\) and the invariant distribution \(f_\infty \).

Theorem 1

(Convergence in KL divergence) If the invariant distribution \(f_\infty (x)\) and the transition kernel \(f_\infty (x|y)\) do not change between t and \(t+1\), the KL divergence between the current distribution and the invariant distribution decreases between two time points:

$$\begin{aligned} \int _{\mathcal {R}}\ln \left( \frac{f_\infty (x)}{f_{t+1}(x)}\right) f_\infty (x)dx\le \int _{\mathcal {R}}\ln \left( \frac{f_\infty (y)}{f_t(y)}\right) f_\infty (y)dy. \end{aligned}$$
(8.7)

if

$$\begin{aligned} f_\infty (x)=\int _{\mathcal {R}}f_\infty (x|y)f_\infty (y)dy \end{aligned}$$
(8.8)

and

$$\begin{aligned} f_{t+1}(x)=\int _{\mathcal {R}}f_\infty (x|y)f_t(y)dy \end{aligned}$$
(8.9)

Proof

Using Bayes’ rule, we can rewrite (8.9) as follows:

$$\begin{aligned} f_{t+1}(x)=\int _{\mathcal {R}}\frac{f_\infty (y|x)f_\infty (x)}{f_\infty (y)}f_t(y)dy \end{aligned}$$
(8.10)

and, since \(f_\infty (x)\) does not depend on y, place it outside the integral and divide both sides by it:

$$\begin{aligned} \frac{f_{t+1}(x)}{f_\infty (x)}=\int _{\mathcal {R}}\frac{f_t(y)}{f_\infty (y)}f_\infty (y|x)dy . \end{aligned}$$
(8.11)

Taking the logarithm and integrating with respect to \(f_\infty (x)\) gives:

$$\begin{aligned} \int _{\mathcal {R}}\ln \left( \frac{f_{t+1}(x)}{f_\infty (x)}\right) f_\infty (x)dx = \int _{\mathcal {R}}\ln \left( \int _{\mathcal {R}}\frac{f_t(y)}{f_\infty (y)}f_\infty (y|x)dy\right) f_\infty (x)dx. \end{aligned}$$
(8.12)

Using Jensen’s inequality, since the logarithm is concave and \(f_\infty (y|x)\) is a density in y, we obtain:

$$\begin{aligned} \begin{aligned} \int _{\mathcal {R}}\ln \left( \int _{\mathcal {R}}\frac{f_t(y)}{f_\infty (y)}f_\infty (y|x)dy\right) f_\infty (x)dx \ge \int _{\mathcal {R}}\int _{\mathcal {R}}\ln \left( \frac{f_t(y)}{f_\infty (y)}\right) f_\infty (y|x)f_\infty (x)dydx \end{aligned} \end{aligned}$$
(8.13)

which we can use to simplify (8.12) into:

$$\begin{aligned} \int _{\mathcal {R}}\ln \left( \frac{f_{t+1}(x)}{f_\infty (x)}\right) f_\infty (x)dx\ge \int _{\mathcal {R}}\ln \left( \frac{f_t(y)}{f_\infty (y)}\right) f_\infty (y)dy. \end{aligned}$$
(8.14)

Writing (8.14) as a KL divergence, we interchange numerators and denominators and therefore reverse the direction of the inequality:

$$\begin{aligned} \int _{\mathcal {R}}\ln \left( \frac{f_\infty (x)}{f_{t+1}(x)}\right) f_\infty (x)dx\le \int _{\mathcal {R}}\ln \left( \frac{f_\infty (y)}{f_t(y)}\right) f_\infty (y)dy \end{aligned}$$
(8.15)

which concludes our proof.\(\square \)

It was proven quite generally that trackers as described by (8.2) possess an attractive quality: after every item response, the distribution of the ability estimate \(X_{t}\) converges monotonically in KL divergence to that of \(X_\infty \) (Kullback and Leibler 1951). The KL divergence is a divergence measure between two distributions, in our case the theoretical distribution of the ability estimates \(X_{t}\) and the invariant distribution of the estimates \(X_\infty \). If the KL divergence is small, the ability estimates have (almost) converged to the proper invariant distribution. Monotone convergence assures that this divergence decreases with every new response under the conditions of (8.2).

2.3.1 Illustration of Development of Kullback-Leibler (KL) Divergence

In Fig. 8.5 we provide an illustration of how the KL divergence could develop over time. If ability \(\theta \) is stable for some time, then changes, and then is stable again, we can see how the KL divergence could decrease in times of stability and increase when ability changes.

Fig. 8.5 Development of ability (dashed line) and Kullback-Leibler (KL) divergence (solid line) over time
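For the coin tossing tracker, the exact distribution \(f_t\) of \(X_t^{(n)}\) can be propagated through the transition kernel, so the KL divergence to the invariant \(\text {binom}(n,\theta )\) can be computed directly. The following sketch (our own construction, for a fixed \(\theta \) and a uniform starting distribution) illustrates the monotone decrease of Theorem 1:

```python
import numpy as np
from scipy.stats import binom

n, theta = 30, 0.3
states = np.arange(n + 1)

# Transition kernel of the chain in (8.6): X -> X + Y - Z, with
# Y ~ Bernoulli(theta) and Z ~ Bernoulli(X / n) drawn independently.
K = np.zeros((n + 1, n + 1))
for x in states:
    up = theta * (1 - x / n)              # Y = 1, Z = 0: move up by one
    down = (1 - theta) * (x / n)          # Y = 0, Z = 1: move down by one
    K[x, min(x + 1, n)] += up
    K[x, max(x - 1, 0)] += down
    K[x, x] += 1 - up - down              # Y = Z: stay put

f_inf = binom.pmf(states, n, theta)       # invariant distribution
f_t = np.full(n + 1, 1 / (n + 1))         # start from a uniform distribution over the states

for t in range(10):
    kl = np.sum(f_inf * np.log(f_inf / f_t))
    print(f"t = {t}  KL(f_inf || f_t) = {kl:.4f}")
    f_t = f_t @ K                         # propagate one step: f_{t+1}(x) = sum_y K(y, x) f_t(y)
```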

We believe this illustration shows that the convergence property is suitable for use in the practice of educational measurement, where students mostly respond to sets of items, even if they are assessed frequently (Brinkhuis et al. 2018). The assumption here is that ability is stable during the relatively short time in which a student answers a set of items, and might change between the administrations of sets. Clearly, no convergence takes place if ability is continually changing (Sosnovsky et al. 2018).

2.4 Simulating Surveys

Section 8.2.2 provides some simulations to illustrate the invariant distribution of the simple tracker and to demonstrate how the estimates track individual development under several simulation conditions. A further goal is to simulate and track the development of groups, as might be done in survey research.

We consider a simplified scenario where an entire population of 100,000 persons answers just 5 questions in a survey that is administered 4 times, for example to track the educational progress of the entire population within a year. Using the algorithm in (8.6), this corresponds to 5 flips of 100,000 uniquely biased coins for each survey. The probabilities of the coins falling heads change for every survey, and are sampled from a beta distribution. The parameters of these beta distributions were \(a=5,6\tfrac{2}{3},8\tfrac{1}{3},10\) and \(b=10\) for the 4 simulated surveys. These 4 beta distributions are plotted in Fig. 8.6 from left to right, using dashed lines. A very large step size (\(n=2\)) was used for the algorithm, to allow the estimates \(\varvec{X}\) to adapt quickly to the changes in \(\theta \). Since \(\theta \) is sampled from a beta distribution, the estimates \(\varvec{X}\) are beta-binomial distributed. Using MLE, the two parameters of the beta distribution of \(\theta \) were estimated for each of the 4 administrations, and these estimated beta distributions are also plotted in Fig. 8.6. The graph demonstrates that it is possible to accurately track the development of an entire population by administering a limited number of items to all individuals.

Fig. 8.6 Sampled (dashed) and estimated beta distributions of \(\theta \) for the 4 simulated surveys
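A sketch of this survey simulation follows; the beta parameters, the number of persons and items, and the step size \(n=2\) are taken from the text, while the starting scores and the numerical MLE routine (scipy's beta-binomial log-likelihood with a Nelder-Mead search) are our own choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

rng = np.random.default_rng(seed=6)

persons, items, n = 100_000, 5, 2             # population size, items per survey, step size n = 2
a_values, b = [5, 20 / 3, 25 / 3, 10], 10     # beta parameters per survey, as given in the text

x = rng.integers(0, n + 1, size=persons)      # arbitrary starting scores in 0..n (our choice)

for a in a_values:
    theta = rng.beta(a, b, size=persons)      # each person's probability of heads for this survey
    for _ in range(items):
        y = rng.binomial(1, theta)            # observed responses
        z = rng.binomial(1, x / n)            # auxiliary pseudo responses, cf. (8.5)
        x = x + y - z                         # update rule (8.6)

    # MLE of the beta parameters from the (approximately) beta-binomial distributed scores X.
    def negloglik(log_params):
        a_hat, b_hat = np.exp(log_params)     # optimize on the log scale to keep parameters positive
        return -betabinom.logpmf(x, n, a_hat, b_hat).sum()

    fit = minimize(negloglik, x0=np.log([2.0, 2.0]), method="Nelder-Mead")
    a_hat, b_hat = np.exp(fit.x)
    print(f"true a = {a:.2f}, b = {b}   estimated a = {a_hat:.2f}, b = {b_hat:.2f}")
```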

Note that this scenario is quite different from a more traditional sampling approach where many items are administered to complex samples of individuals. If the total number of responses is kept equal, it is beneficial for tracking the development of the entire population to administer a few questions to many individuals. Conversely, for tracking individual development, it is beneficial to administer many items to few individuals. For example, while the information in the 5 items per survey described above is too limited for tracking individual growth, especially considering the progress that is made between surveys, it is sufficient for tracking the population parameters. Though trackers can be used for tracking both individual progress and the progress of the population, the preferred design of data collection depends on the desired level of inference.

3 Discussion

In this chapter, trackers and their possible uses in educational measurement have been described. Trackers are defined as dynamic parameter estimates with specific properties. These trackers combine some of the properties of the ERS and of state space models such as KFs, both of which have strengths and weaknesses. The ERS is a feasible method for dealing with data with changing model parameters, i.e., ratings. It is simple, provides real-time results, and requires no assumptions on the type of growth. However, its estimates lack a proper distribution, and therefore no statistics can be computed on the estimates, e.g., to test for change, or to track any aggregate of estimates. KFs, on the other hand, do assume specific distributions of estimates, but need specified growth models, which are not readily available in many educational measurement applications.

Trackers should be able to adapt to changes in both model parameters and the transition kernel, cf. the scheme in (8.1). In addition, we require that the estimates converge in distribution if the model is invariant, cf. scheme (8.2). A simple example of a tracker conforming to this definition has been introduced in (8.6), with a transition kernel that creates a Markov chain with a binomial invariant distribution. A well-known technique for obtaining a transition kernel that creates a Markov chain with a specified distribution is called the Metropolis algorithm (Metropolis et al. 1953; Hastings 1970; Chib and Greenberg 1995). The Metropolis algorithm can be used to create a transition kernel that satisfies (8.2). The general proof that such Markov chains monotonically converge to their invariant distribution has been provided in Theorem 1.
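As an aside, a minimal sketch of a random-walk Metropolis step is given below; it is generic and not the tracker used in this chapter, but it shows how a transition kernel with a specified invariant density can be constructed:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def metropolis_step(x, log_f, proposal_sd=1.0, rng=rng):
    """One random-walk Metropolis step with invariant density proportional to exp(log_f).

    A generic sketch: in a tracker, log_f would encode the desired invariant
    distribution of the estimates (how it is built from item responses is not shown here).
    """
    proposal = x + rng.normal(0.0, proposal_sd)    # symmetric random-walk proposal
    log_accept = log_f(proposal) - log_f(x)        # Metropolis acceptance ratio
    if np.log(rng.uniform()) < log_accept:
        return proposal                            # accept the proposed state
    return x                                       # otherwise keep the current state

# Example: a chain whose invariant distribution is N(theta, 1) for a fixed theta.
theta = 0.5
log_f = lambda x: -0.5 * (x - theta) ** 2
x = 0.0
for _ in range(1000):
    x = metropolis_step(x, log_f)
print(round(x, 2))
```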

While the simple binomial example might not have much practical use directly, other trackers can be developed to provide estimates with a known error distribution, for example an ability estimate \(\varvec{X}\) which is distributed \(\mathcal {N}(\theta ,\sigma ^2)\). Two simple examples of such trackers are presented in Brinkhuis and Maris (2010). Such estimates could directly be used in other statistical analyses since the magnitude of the error does not depend on the ability level itself. These assumptions compare directly to the assumptions of classical test theory, where an observed score equals the sum of the true score and an uncorrelated error. Another simple application of using these estimates directly is to look at empirical cumulative distributions of ability estimates.

Trackers as defined in this chapter retain the strengths of both state space models and rating systems, and resolve some of their weaknesses. Their properties are suitable, among others, for applications in educational measurement, whether in tracking individual progress, any aggregate thereof, e.g., classes or schools, or the performance of the entire population as in survey research. The algorithms remain relatively simple and light-weight, and therefore make it possible to provide real-time results even in large applications. They are unique in that they continually adapt to a new transition kernel and converge in distribution if there is an invariant distribution, which is quite different from both MLE and Bayesian estimation techniques.