1 Introduction

In cognitive tests (achievement tests, placement tests) or aptitude tests, as well as in personality and attitude tests, the response variables for the items are discrete, having a very limited number of values: often only two in cognitive tests (representing incorrect/correct) or a few, as with the use of Likert scales in attitude or personality tests, described by expressions such as ‘strongly disagree’, ‘disagree’, ‘neutral’, ‘agree’, ‘strongly agree’. Note that in the latter case the categories are considered as ordered, meaning that, e.g., the choice of the category ‘strongly agree’ is a sign of a more positive attitude (motivation, etc.) than the answer ‘agree’, and ‘agree’ is more positive than ‘neutral’, etc., or the reverse, depending on the wording of the item stem. It is usually assumed that the direction of the order can be derived from a semantic analysis of the item stem, i.e., it is not an outcome of a statistical analysis of the response data.

Measurement models to analyze these highly discrete data are widespread and well known among psychometricians. They include models from the logistic family, such as the one-, two- or three-parameter logistic models for binary responses (Rasch 1960; Lord and Novick 1968), the (generalized) partial credit model (Masters 1982; Muraki 1997) and the graded response model (Samejima 1974, 1997), as well as the models known as ‘normal ogive’ models, which originated in biological assay (Finney 1978) but became popular in psychometric circles with the seminal paper by Albert (1992) on the use of sampling methods to estimate their parameters.

Although certainly less popular than models for discrete data, several interesting models have been proposed and studied for continuous data, almost all in the realm of attitude scaling and personality assessment. Different item formats can be used to record continuous responses. One could, for example, present a line segment whose endpoints are labeled (e.g., as ‘strongly disagree’ and ‘strongly agree’) and ask the respondent to put a cross at the position that best represents his or her attitude. But there are other formats as well. The common feature is that all answers are bounded from below and from above and that the possible answers can be regarded as a (close approximation to a) continuous response. The precise item format, although important in itself, is not considered any further in this chapter.

Samejima (1973, 1974) proposed a family of continuous IRT models, all derived from the graded response model (Samejima 1969, 1997) as a limiting case when the number of response categories goes to infinity. The model is complex as it assumes a latent response on the interval (−∞, +∞) and a transformation to an interval bounded from both sides; see Bejar (1977) for an application. One of the complicating factors in Samejima’s approach is that there are two layers of latent variables: one is the construct (attitude, self-concept) to be measured and the other is a latent response which is an unbounded continuous variable, while the observable response is continuous but bounded from below and from above. To reconcile these incompatible restrictions, a transformation from an unbounded variable to a bounded one is proposed, e.g., the logit transformation. In the model developed by Mellenbergh (1994) such a transformation is not used: the continuous response variable is modeled with a one-factor model:

$$ X_{ij} = \mu_{j} + \lambda_{j} \theta_{i} + \varepsilon_{ij}, $$

where Xij is the continuous observed response of respondent i to item j, μj is the easiness parameter of the item, λj the discrimination parameter, θi the value of the latent variable for respondent i, and εij the continuous residual, which is usually but not necessarily normally distributed with mean zero and variance σ². A comparison between the models of Samejima and Mellenbergh can be found in Ferrando (2002).
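As a small illustration, the one-factor model can be simulated in a few lines; the sketch below uses hypothetical parameter values and serves only to show the structure of the model, in particular that nothing keeps the simulated responses inside a bounded interval.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters for four items (values chosen only for illustration)
mu = np.array([0.2, 0.5, -0.1, 0.3])      # easiness parameters mu_j
lam = np.array([0.8, 1.0, 1.2, 0.9])      # discrimination parameters lambda_j
sigma = np.array([0.5, 0.4, 0.6, 0.5])    # residual standard deviations

n = 1000
theta = rng.normal(size=n)                # latent values theta_i

# X_ij = mu_j + lambda_j * theta_i + eps_ij, with normally distributed residuals;
# note that nothing keeps X_ij inside a bounded interval
X = mu + np.outer(theta, lam) + rng.normal(scale=sigma, size=(n, len(mu)))
```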

More recently, an interesting family of models has been proposed where the observed response variable, rescaled to the unit interval, follows a beta distribution, and can therefore be used to model monotone as well as single peaked densities (Noel and Dauvier 2007; Noel 2014). The work of Noel (2017) concentrated on unfolding models for personality traits, emotions and behavioral change.

None of the above models, however, is an exponential family model or allows for separation of person and item parameters. The only published model for continuous responses that is an exponential family model and allows for parameter separation is the one by Müller (1987). This model and a newly proposed one are the main topics of this chapter.

Although all the models discussed so far consider the use of continuous response variables, none of them is used for cognitive tests, like achievement tests or placement tests. For these tests considerable attention has been paid to continuous responses, especially for multiple choice tests, not in an IRT framework, but in what was called probability measurement. In this approach, the respondent does not have to pick a single alternative, but has to express his or her degree of belief or subjective probability that each of the response alternatives is the right answer (De Finetti 1965, 1970; Van Naerssen 1961; Toda 1963; Roby 1965). The attention in the sixties was directed to the problem of a good or ‘admissible’ scoring function, i.e., a scoring rule such that the continuous responses will reflect the true subjective probabilities. See Rippey (1970) for a comparative study of different scoring rules.

An important contribution in this area was made by Shuford et al. (1966) who showed that there was only one scoring rule (i) which maximized the expected score if and only if the continuous answers were equal to the subjective probabilities and (ii) where the score only depends on the response for the correct alternative and not on the distribution across the incorrect ones. This scoring rule is the logarithmic one, and of course has an evident disadvantage: if the continuous response to the correct alternative is zero, the score is minus infinity and can never be repaired by any finite number of (partially) correct responses. Shuford et al. were aware of this anomaly and proposed a slight modification of the logarithmic rule by a truncation on the observable variable: responses at or below 0.01 got a fixed penalty of −2. Dirkzwager (1997, 2001)Footnote 1 provided an elegant way to avoid very large penalties and at the same time to have a good approximation to the original logarithmic scoring rule. The approximation is dependent on a tuning parameter. Details on the scoring function and a quite extensive application of the system developed by Dirkzwager can be found in Holmes (2002).
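As an aside, the truncated logarithmic rule just described is easy to sketch in code. The base-10 logarithm used below is an assumption on our part: it makes the score of a response of 0.01 exactly −2, so the fixed penalty joins the rule continuously; the published rule may use a different base or an affine rescaling.

```python
import numpy as np

def truncated_log_score(r_correct, floor=0.01, penalty=-2.0):
    """Logarithmic score for the weight assigned to the correct alternative,
    truncated as described by Shuford et al.: weights at or below `floor` get
    the fixed `penalty`. The base-10 logarithm is an assumption; it makes
    log10(0.01) = -2, so the rule is continuous at the truncation point."""
    r = np.asarray(r_correct, dtype=float)
    return np.where(r <= floor, penalty, np.log10(r))

# A respondent spreads weights (0.7, 0.2, 0.1) over three alternatives,
# and the first alternative happens to be the correct one:
print(truncated_log_score(0.7))    # about -0.155
print(truncated_log_score(0.005))  # -2.0 (truncated)
```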

Probably the main factor that hampered the application of these ideas was the poor development of suitable computer systems. Contrary to the situation with scales for personality traits, emotions and attitudes, where a continuous answer can be elicited, for example, by asking the respondent to put a mark on a line where the two end points are indicated by labels such as ‘strongly disagree’ and ‘strongly agree’ and where the mark best expresses the respondent’s position, the continuous answers for multiple choice questions are multivariate. Nowadays modern computers are widespread and constructing interfaces for administering multiple choice tests with continuous answers is hardly a serious challenge.

In Sect. 7.2 a new model, a simple Rasch model for continuous responses, is introduced and parameter estimation is discussed. Section 7.3 treats a slight generalization of Müller’s model and gives the details of parameter estimation along the lines sketched, but not elaborated, in his 1987 article. Section 7.4 handles the problem of comparing information functions across different models, and in Sect. 7.5 a discussion is started about the relation between IRT models for continuous responses in multiple choice tests and the scoring functions that were studied in the sixties.

2 A Rasch Model for Continuous Responses

2.1 The Model

Much in the spirit of the probability measurement approaches, let us imagine that a test consisting of k multiple choice items has been answered in a continuous way by assigning a non-negative number to each alternative, under the restriction that the sum of these numbers across alternatives equals one. At this moment we leave the precise instructions to the respondents a bit vague, except for the fact that they know that the higher the number assigned to the right answer, the higher their score will be, i.e., there is a monotone increasing relationship between the answer and the score. More on this will be said in the discussion section.

In the item response function of the Rasch model for binary data, the denominator 1 + exp(θ − βi) has the role of a normalizing constant, i.e., the sum of all possible numerators, and guarantees that the sum of the probabilities of all possible answers equals one. If one considers a similar model for continuous responses in the (closed) unit interval, one arrives readily at the conditional probability density function

$$ f_{i} (r_{i} |\theta ) = \frac{{\exp \left[ {r_{i} (\theta - \eta_{i} )} \right]}}{{\int_{0}^{1} {\exp [t(\theta - \eta_{i} )]{\kern 1pt} {\text{d}}t} }},\;(r_{i} \in [0,1]), $$
(7.1)

where ri is a realization of the random variable Ri, the answer given to the correct alternative of item i, θ is the underlying ability, continuous and unbounded, and ηi is an item parameter representing the difficulty, as in the Rasch model for binary data.Footnote 2

The denominator of (7.1) has a closed form solution. If θ = ηi, the integral clearly equals one, as well as the numerator of (7.1). Using αi as a shorthand for θ − ηi, and the result from calculus that

$$ \int {\exp (t\alpha )dt = \frac{1}{\alpha }} \exp (t\alpha ), $$

(7.1) can also be written as

$$ f_{i} (r_{i} |\theta ) = \left\{ {\begin{array}{*{20}c} {\frac{{(\theta - \eta_{i} )\exp \left[r_{i} (\theta - \eta_{i} )\right]}}{{\exp \left[(\theta - \eta_{i} )\right] - 1}}} & {{\text{if}} \, \theta \ne \eta_{i} \,{\text{or}}\,\alpha_{i} \ne 0,} \\ 1 & {{\text{if}} \, \theta = \eta_{i} \,{\text{or}}\,\alpha_{i} = 0.} \\ \end{array} } \right. $$
(7.2)

The regression of the response variable Ri on θ is given by

$$ E(R_{i} |\theta ) = \frac{{\int\nolimits_{0}^{1} {t\,\exp (t\alpha_{i} ) \, dt} }}{{\int\nolimits_{0}^{1} {\exp (t\alpha_{i} ) \, dt}}} = \left\{ {\begin{array}{*{20}c} {\frac{{e^{{\alpha_{i} }} (\alpha_{i} - 1) + 1}}{{\alpha_{i} (e^{{\alpha_{i} }} - 1)}}} & {{\text{if}}\,\alpha_{i} \ne 0,} \\ {0.5} & {{\text{if}} \, \alpha_{i} = 0,} \\ \end{array} } \right. $$
(7.3)

where αi is used as a shorthand for θ − ηi. (Note that for the case αi = 0, the two integrals in (7.3) have a trivial solution.) In Fig. 7.1 (left panel) a graphical representation of the regression function is given. Notice that the horizontal axis is the difference between θ and the item parameter, and therefore the regression graph itself (with θ on the horizontal axis) will have the same form as the curve in Fig. 7.1, and for different items the curves will be shifted horizontally with respect to each other, just as the trace lines in the Rasch model for binary items. Without giving formal proofs we state some characteristics of the regression function:

Fig. 7.1 Expected value (left) and variance (right) of Ri as a function of αi = θ − ηi in the Rasch model for continuous responses

  1. \( \mathop {\lim }\limits_{\theta \to - \infty } E(R_{i} |\theta ) = 0, \)

  2. \( \mathop {\lim }\limits_{\theta \to + \infty } E(R_{i} |\theta ) = 1, \)

  3. E(Ri | θ) is monotonically increasing in θ.

One might wonder about the numbers along the horizontal axis. In the model for binary data the graph of the item response function (which is a regression function) is usually very close to zero or to one if |θ − βi| > 3, while here we see a substantial difference between the expected value of the response and its lower or upper asymptote for |α| as large as 15. In the section about the information function we will come back to this phenomenon and give a detailed account of it.

It is easy to see from (7.1) or (7.2) that the model is an exponential family density with ri as the sufficient statistic for θ. In exponential families the (Fisher) information equals the variance of the sufficient statistic, and the expression for this variance is

$$ {\text{Var}}(R_{i} |\theta ) = \left\{ {\begin{array}{*{20}c} {\frac{1}{{\alpha_{i}^{2} }} - \frac{{e^{{\alpha_{i} }} }}{{(e^{{\alpha_{i} }} - 1)^{2} }}} & {{\text{if}}\, \, \alpha_{i} \ne 0,} \\ {\frac{1}{12}} & {{\text{if }}\,\alpha_{i} = 0,} \\ \end{array} } \right. $$
(7.4)

where the value 1/12 is the limit of the expression above it in (7.4), or the variance of a uniformly distributed variable in the unit interval. In Fig. 7.1 (right panel) a graph of the variance function is displayed.
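For readers who want to experiment with the model, the density, regression and variance functions (7.2)–(7.4) can be coded directly; in the minimal sketch below (function names are ours) the closed forms are replaced by their limits near αi = 0 to avoid an indeterminate expression.

```python
import numpy as np

def density(r, alpha):
    """Density (7.2) of the Rasch model for continuous responses; alpha = theta - eta."""
    if np.isclose(alpha, 0.0):
        return np.ones_like(np.asarray(r, dtype=float))
    return alpha * np.exp(r * alpha) / (np.exp(alpha) - 1.0)

def mean(alpha):
    """Regression function (7.3): E(R | theta)."""
    if np.isclose(alpha, 0.0):
        return 0.5
    return (np.exp(alpha) * (alpha - 1.0) + 1.0) / (alpha * (np.exp(alpha) - 1.0))

def variance(alpha):
    """Variance / information function (7.4): Var(R | theta)."""
    if np.isclose(alpha, 0.0):
        return 1.0 / 12.0
    return 1.0 / alpha ** 2 - np.exp(alpha) / (np.exp(alpha) - 1.0) ** 2

# Even at |alpha| = 15 the expected response is still clearly away from 0 or 1:
for a in (-15.0, -5.0, 0.0, 5.0, 15.0):
    print(a, round(mean(a), 4), round(variance(a), 5))
```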

2.2 Parameter Estimation

To make the model complete, we have to add an assumption about the dependence structure between item responses. As is common in most IRT models, we will assume that the responses to the items are conditionally (or locally) independent. With this assumption the model is still an exponential family and the sufficient statistic for the latent variable θ is R = ΣiRi. Therefore conditional maximum likelihood (CML) estimation is in principle possible and in this section the practical feasibility of CML is explored.

The conditional likelihoodFootnote 3 of the data, as a function of the parameter vector η = (η1,…,ηk) given the value of the sufficient statistic for θ is given by

$$ L(\eta \,|\,R = r) = \frac{{\exp \left[ { - \sum\nolimits_{i} {r_{i} \eta_{i} } } \right]}}{{\int\nolimits_{A} {\prod\nolimits_{i} {\exp ( - t_{i} \eta_{i} )} \,{\text{d}}t_{i} } }} $$
(7.5)

where ∫A denotes the multiple integral over all score vectors (t1,…,tk) such that Σiti= r. As a general symbol for this multiple integral, we will use γ(r,k), where the first argument refers to the score r and the second argument denotes the number of items in the test.

To see the complexity of these γ-functions, consider first the case where k = 2. If r ≤ 1, the score on either of the two items cannot be larger than r, but each score can be zero. So, let t1 be the score on the first item; then t1 can run from 0 to r, and of course t2 = r − t1 is linearly dependent on t1. If r > 1, each score cannot be smaller than r − 1, and of course each score cannot be larger than 1. Taking these considerations into account, it is easily verified that

$$ \gamma (r,2) = \int\limits_{\max (r - 1,0)}^{\min (r,1)} {\exp \left[ - (r - t_{1} )\eta_{2} \right]} \exp ( - t_{1} \eta_{1} ) \, dt_{1} $$
(7.6)

Using similar considerations as in the case with k = 2, one can verify that

$$ \gamma (r,3) = \int\limits_{\max (r - 2,0)}^{\min (r,1)} {\exp ( - t_{2} \eta_{2} )\int\limits_{\max (r - t_{2} - 1,0)}^{\min (r - t_{2} ,1)} {\exp \left[ { - (r - t_{2} - t_{1} )\eta_{3} } \right]\exp ( - t_{1} \eta_{1} )\,dt_{1} \,dt_{2} } } , $$

and in general we can write

$$ \gamma (r,k) = \int\limits_{{A_{1} }}^{{B_{1} }} {f_{k - 1} (t_{k - 1} )} \ldots \int\limits_{{A_{k - 1} }}^{{B_{k - 1} }} {f_{1} (t_{1} )} \,g(t_{1} , \ldots ,t_{k - 1} )\,dt_{1} \ldots dt_{k - 1} , $$
(7.7)

where

$$ \begin{aligned} A_{i} & = \max \left( {0,r - (k - i) - \sum\limits_{j = 1}^{i - 1} {t_{k - j} } } \right), \\ B_{i} & = \min \left( {1,r - \sum\limits_{j = 1}^{i - 1} {t_{k - j} } } \right), \\ f_{i} (t_{i} ) & = \exp ( - t_{i} \eta_{i} ), \\ g(t_{1} , \ldots ,t_{k - 1} ) & = \exp \left[ { - \eta_{k} \left( {r - \sum\limits_{j = 1}^{k - 1} {t_{j} } } \right)} \right]. \\ \end{aligned} $$

It will be clear that evaluation of (7.7), although an explicit solution exists, is very impractical, since in every integration a distinction has to be made between two possible minima and two possible maxima. To illustrate this, consider the solution of (7.6), assuming that η1 ≠ η2:

$$ \gamma (r,2) = \left\{ {\begin{array}{*{20}c} {\frac{1}{{\eta_{2} - \eta_{1} }}\left[ {\exp ( - r\eta_{1} ) - \exp ( - r\eta_{2} )} \right],} & {(0 < r < 1),} \\ {\frac{1}{{\eta_{2} - \eta_{1} }}\left[ {\exp ( - \eta_{1} - (r - 1)\eta_{2} ) - \exp ( - \eta_{2} - (r - 1)\eta_{1} )} \right],} & {(1 \le r < 2).} \\ \end{array} } \right. $$
(7.8)
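The closed form (7.8) is easily checked against direct numerical integration of (7.6); the short sketch below does so for two hypothetical parameter values.

```python
import numpy as np
from scipy.integrate import quad

def gamma2_closed(r, eta1, eta2):
    """Closed form (7.8) for gamma(r, 2), assuming eta1 != eta2."""
    d = eta2 - eta1
    if r < 1.0:
        return (np.exp(-r * eta1) - np.exp(-r * eta2)) / d
    return (np.exp(-eta1 - (r - 1.0) * eta2) - np.exp(-eta2 - (r - 1.0) * eta1)) / d

def gamma2_numeric(r, eta1, eta2):
    """Direct numerical evaluation of the integral in (7.6)."""
    lo, hi = max(r - 1.0, 0.0), min(r, 1.0)
    f = lambda t: np.exp(-(r - t) * eta2) * np.exp(-t * eta1)
    return quad(f, lo, hi)[0]

# The two versions agree (hypothetical parameter values):
print(gamma2_closed(0.7, 0.3, -0.5), gamma2_numeric(0.7, 0.3, -0.5))
print(gamma2_closed(1.4, 0.3, -0.5), gamma2_numeric(1.4, 0.3, -0.5))
```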

If γ(r, k) is evaluated, k different expressions will result, which are too complicated to work with. Therefore, the maximization of the conditional likelihood function is abandoned; instead recourse is taken to a pseudo-likelihood method, where the product of the conditional likelihoods of all pairs of variables is maximized (Arnold and Strauss 1988, 1991; Cox and Reid 2004). This means that the function

$$ PL(\eta ) = \prod\limits_{i < j} {\frac{{\exp ( - r_{i} \eta_{i} - r_{j} \eta_{j} )}}{{\gamma_{{r_{ij} }} (\eta_{i} ,\eta_{j} )}}} $$

will be maximized. The variable rij is defined by

$$ r_{ij} = r_{i} + r_{j} ,\quad (i \ne j) $$
(7.9)

and γrij(ηi,ηj) is the explicit notation of γ(rij,2) with ηi and ηj as arguments.

At this point, it proves useful to reparametrize the model. Define

$$ \eta_{ij} = \frac{{\eta_{i} + \eta_{j} }}{2} $$
(7.10)

and

$$ \varepsilon_{ij} = \eta_{ij} - \eta_{i} = \frac{{\eta_{j} - \eta_{i} }}{2}. $$
(7.11)

It follows immediately that εij = −εji. Using definitions (7.10) and (7.11) and Eq. (7.8), the factor of the PL-function referring to the item pair (i,j) can be written as

$$PL(\varepsilon_{ij} ) = \left\{ {\begin{array}{*{5}c} {\frac{{\varepsilon_{ij} \exp [(r_{i} - r_{j} )\varepsilon_{ij} ]}}{{\sinh (r_{ij} \varepsilon_{ij} )}},} & {(0 < r_{ij} < 1),} \\ {\frac{{\varepsilon_{ij} \exp [(r_{i} - r_{j} )\varepsilon_{ij} ]}}{{\sinh [(2 - r_{ij} )\varepsilon_{ij} ]}},} & {(1 \le r_{ij} < 2),} \\ \end{array} } \right.$$
(7.12)

if εij ≠ 0, that is, if ηi ≠ ηj. If j and i are interchanged, (7.12) does not change because εij = −εji and sinh(−x) = −sinh(x). If ηi = ηj, the solution can be found directly from (7.5). It is given by

$$ PL(0) = \lim_{{\varepsilon_{ij} \to 0}} PL(\varepsilon_{ij} ) = \left\{ {\begin{array}{*{20}c} {\frac{1}{{r_{ij} }},} & {(0 < r_{ij} < 1),} \\ {\frac{1}{{2 - r_{ij} }},} & {(1 \le r_{ij} < 2).} \\ \end{array} } \right. $$
(7.13)

In the sequel, only reference will be made to (7.12), but (7.13) has to be used if appropriate.

Although the events (rij = 0) and (rij = 2) both have probability zero, they can occur in a data set. The joint conditional density of (ri, rj) given (rij = 0) or (rij = 2), however, is independent of the item parameters, and such pairs can therefore be eliminated from the data set; only values in the open interval (0, 2) are to be considered. This is similar to the fact that in the Rasch model for binary data response patterns with all zeros or all ones can be removed from the data without affecting the CML estimates of the item parameters.

To get rid of the double expression in the right-hand side of (7.12) define the indicator variable Aij, with realizations aij, as

$$ A_{ij} = \left\{ {\begin{array}{*{20}c} 0 & {{\text{if}}\;0 \le r_{ij} < 1,} \\ 1 & {{ \text{if}}\;1 \le r_{ij} \le 2,} \\ \end{array} } \right. $$

and define the random variable Bij, with realizations bij, as

$$ B_{ij} = R_{ij} + 2A_{ij} (1 - R_{ij} ) $$
(7.14)

Using (7.14), (7.12) can be rewritten as

$$ PL(\varepsilon_{ij} ) = \frac{{\varepsilon_{ij} \exp \left[ {(r_{i} - r_{j} )\varepsilon_{ij} } \right]}}{{\sinh (b_{ij} \varepsilon_{ij} )}} $$
(7.15)

and (7.13) can be rewritten as

$$ PL(0) = \lim_{{\varepsilon_{ij} \to 0}} PL(\varepsilon_{ij} ) = \frac{1}{{b_{ij} }} $$

Taking the logarithm of (7.15) and differentiating with respect to εij yields

$$ \frac{{d\,\ln [PL(\varepsilon_{ij} )]}}{{d\varepsilon_{ij} }} = (r_{i} - r_{j} ) + \frac{1}{{\varepsilon_{ij} }} - \frac{{b_{ij} }}{{\tanh (b_{ij} \varepsilon_{ij} )}}, $$
(7.16)

and differentiating a second time gives

$$ \frac{{d^{2} \ln [PL(\varepsilon_{ij} )]}}{{d\varepsilon_{ij}^{2} }} = - \frac{1}{{\varepsilon_{ij}^{2} }} + \frac{{b_{ij}^{2} }}{{\sinh^{2} (b_{ij} \varepsilon_{ij} )}} . $$
(7.17)

If εij = 0, (7.16) and (7.17) are undefined, but can be replaced by their limits:

$$ \lim_{{\varepsilon_{ij} \to 0}} \frac{{d\,\ln \left[PL(\varepsilon_{ij} )\right]}}{{d\varepsilon_{ij} }} = r_{i} - r_{j} $$

and

$$ \lim_{{\varepsilon_{ij} \to 0}} \frac{{d^{2} \ln [PL(\varepsilon_{ij} )]}}{{d\varepsilon_{ij}^{2} }} = - \frac{{b_{ij}^{2} }}{3}. $$

In order to obtain estimates of the original η-parameters, the restrictions defined by (7.11) have to be taken into account. Define the k(k − 1)/2 × k matrix K by

$$ K(ij,\ell ) = \left\{ {\begin{array}{*{20}l} { - 1/2} \hfill & {{\text{if}}\;\ell = i,} \hfill \\ {1/2} \hfill & {{\text{if}}\;\ell = j,} \hfill \\ 0 \hfill & {\text{Otherwise}} \hfill \\ \end{array} } \right. $$
(7.18)

where the subscript ij of the rows refers to the item pair (i,j). Define ε = (ε12, ε13,…, εij,…, εk−1,k) and η = (η1,…, ηk); then it follows immediately from (7.11) and (7.18) that

$$ \varepsilon = K\eta . $$
(7.19)

It is immediately clear that

$$ \frac{\partial \,\ln \,PL(\eta )}{\partial \eta } = K^{\prime } \frac{\partial \,\ln \,PL(\varepsilon )}{\partial \varepsilon }, $$
(7.20)

and

$$ \frac{{\partial^{2} \ln PL(\eta )}}{{\partial \eta {\kern 1pt} \partial \eta^{\prime } }} = K^{\prime } \frac{{\partial^{2} \ln PL(\varepsilon )}}{{\partial \varepsilon {\kern 1pt} \partial \varepsilon '}}K, $$
(7.21)

where the elements of the vector of partial derivatives with respect to ε are given by (7.16). The matrix of second partial derivatives in the right-hand side of (7.21) is a diagonal matrix, whose diagonal elements are defined by (7.17). For a sample of n response patterns, the gradient and the matrix of second order partial derivatives of the PL-function are simply the sums over response patterns of the expressions given by the right-hand members of Eqs. (7.20) and (7.21), respectively. The model, however, is not identified unless a normalization restriction is imposed on the η-parameters, e.g., ηk = 0. This amounts to dropping the last element of the gradient vector and the last row and column from the matrix of second order partial derivatives.

Initial estimates can be found by equating the right-hand member of (7.16) to zero and solving for εij as a univariate problem, i.e., ignoring the restrictions (7.19). Solving (7.19) for η in the least squares sense then yields initial estimates of the item parameters.

Standard errors can be found by the so-called sandwich method. Define gv as the vector of first partial derivatives of the pseudo-likelihood function for respondent v and H as the matrix of second partial derivatives (for all respondents jointly). All vectors gv and the matrix H are evaluated at the pseudo-likelihood estimates of the parameters. Then, the asymptotic variance-covariance matrix can be estimated by (Cox and Reid 2004, p. 733)

$$ H^{ - 1} \left[ {\sum\limits_{v} {{\mathbf{g}}_{v} {\mathbf{g}}_{v}^{\prime } } } \right]H^{ - 1} . $$
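The whole estimation procedure of this section can be sketched compactly. The code below is a minimal implementation and not the software referred to elsewhere in this chapter: it assumes the responses are stored in an n × k numpy array with values in [0, 1], starts from ηi = 0 instead of the moment-based initial estimates described above, and takes plain Newton-Raphson steps without any step-size control; it is only meant to illustrate Eqs. (7.14)–(7.21) and the sandwich estimator.

```python
import numpy as np
from itertools import combinations

def pair_matrix(k):
    """Matrix K of (7.18): one row per item pair (i, j) with i < j."""
    pairs = list(combinations(range(k), 2))
    K = np.zeros((len(pairs), k))
    for p, (i, j) in enumerate(pairs):
        K[p, i], K[p, j] = -0.5, 0.5
    return K, pairs

def pl_derivatives(R, eta, K, pairs):
    """Per-respondent gradients and the Hessian of the log pseudo-likelihood w.r.t. eta."""
    n, k = R.shape
    eps = K @ eta                                    # (7.19)
    grad_v = np.zeros((n, k))
    hess = np.zeros((k, k))
    for p, (i, j) in enumerate(pairs):
        ri, rj = R[:, i], R[:, j]
        rij = ri + rj
        keep = (rij > 0) & (rij < 2)                 # r_ij = 0 or 2 is uninformative
        d = (ri - rj)[keep]
        b = np.where(rij >= 1, 2.0 - rij, rij)[keep]          # (7.14)
        e = eps[p]
        if abs(e) < 1e-6:                            # limits as eps_ij -> 0
            g, h = d, -b ** 2 / 3.0
        else:
            g = d + 1.0 / e - b / np.tanh(b * e)              # (7.16)
            h = -1.0 / e ** 2 + b ** 2 / np.sinh(b * e) ** 2  # (7.17)
        gv = np.zeros(n)
        gv[keep] = g
        grad_v += np.outer(gv, K[p])                 # chain rule (7.20), per respondent
        hess += h.sum() * np.outer(K[p], K[p])       # (7.21)
    return grad_v, hess

def estimate(R, n_iter=50, tol=1e-8):
    """Newton-Raphson on the log pseudo-likelihood with the normalization eta_k = 0."""
    n, k = R.shape
    K, pairs = pair_matrix(k)
    eta = np.zeros(k)                                # simple start instead of least squares
    for _ in range(n_iter):
        grad_v, hess = pl_derivatives(R, eta, K, pairs)
        g, H = grad_v.sum(axis=0)[:-1], hess[:-1, :-1]
        step = np.linalg.solve(H, g)
        eta[:-1] -= step                             # plain Newton step, no step-size control
        if np.max(np.abs(step)) < tol:
            break
    # sandwich covariance matrix for the free parameters
    grad_v, hess = pl_derivatives(R, eta, K, pairs)
    H_inv = np.linalg.inv(hess[:-1, :-1])
    cov = H_inv @ (grad_v[:, :-1].T @ grad_v[:, :-1]) @ H_inv
    return eta, np.sqrt(np.diag(cov))
```

A call like eta_hat, se = estimate(R) then returns the item parameter estimates under the normalization ηk = 0 together with their sandwich standard errors.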

3 An Extension of the Müller Model

3.1 The Model

Just as Samejima derived her family of continuous models as limiting cases of the graded response model when the number of possible graded responses goes to infinity, Müller considers a limiting case of Andrich’s (1982) rating scale model when the number of thresholds tends to infinity. The category response function of the rating scale model (with possible answers 0, 1,…, m) is given by

$$ P(X_{i} = j|\theta^{*} ) = \frac{{\exp [j\alpha_{i}^{*} + j(m - j)\delta^{*} ]}}{{\sum\nolimits_{h = 0}^{m} {\exp [h\alpha_{i}^{*} + h(m - h)\delta^{*} ]} }} $$
(7.22)

where \( \alpha_{i}^{*} = \theta^{*} - \eta_{i}^{*} \), the difference between the latent value and an item-specific location parameter, while δ* is half the (constant) difference between any two consecutive thresholds.Footnote 4,Footnote 5 If the answer Ri (with realizations ri) is elicited by asking the respondent to put a mark on a line segment of length d (>0) and midpoint c, where the two endpoints are labeled, then the response can be modeled by the density

$$ f_{{_{i} }}^{*} (r_{i} |\theta ) = \frac{{\exp \left[ {r_{i} \alpha_{i} + r_{i} (2c - r_{i} )\delta } \right]}}{{\int_{c - d/2}^{c + d/2} {\exp \left[ {t\alpha_{i} + t(2c - t)\delta } \right]{\kern 1pt} {\text{d}}{\kern 1pt} t} }},\quad \left( {r_{i} \in \left[ {c - \frac{d}{2},c + \frac{d}{2}} \right]} \right), $$
(7.23)

with \( \alpha_{i} = \theta - \eta_{i} \). Moreover, Müller shows that the thresholds are uniformly distributed in the interval \( [\eta_{i} - \delta d,\eta_{i} + \delta d] \).

Of course, the length of the line segment in the response format can be expressed in arbitrary units and with an arbitrary reference point, so that we can assume without loss of generality that c = 0.5 and d = 1. As a slight extension of Müller’s model, it will be assumed that the δ-parameter can vary across items. This gives the density we will be using in the present section:

$$ f_{i} (r_{i} |\theta ) = \frac{{\exp \left[ {r_{i} \alpha_{i} + r_{i} (1 - r_{i} )\delta_{i} } \right]}}{{\int_{0}^{1} {\exp \left[ {t\alpha_{i} + t(1 - t)\delta_{i} } \right]dt} }}, $$
(7.24)

and by completing the square, the numerator of (7.24) can be written as

$$ \exp \left[ {r_{i} \alpha_{i} + r_{i} (1 - r_{i} )\delta_{i} } \right] = \exp \left[ { - \delta_{i} \left( {r_{i} - \frac{{\alpha_{i} + \delta_{i} }}{{2\delta_{i} }}} \right)^{2} } \right] \times \exp \left[ {\frac{{(\alpha_{i} + \delta_{i} )^{2} }}{{4\delta_{i} }}} \right]. $$
(7.25)

The second factor in the right-hand side of (7.25) is independent of ri and will cancel when we substitute the right-hand side of (7.25) in (7.24). Defining

$$ \mu_{i} (\theta ) = \frac{{\alpha_{i} + \delta_{i} }}{{2\delta_{i} }}{\text{ and}}\, \, \sigma_{i}^{2} = \frac{1}{{2\delta_{i} }}, $$

(7.24) can be written as

$$ f_{i} (r_{i} |\theta ) = \frac{{\sigma_{i}^{ - 1} \varphi \left[ {\frac{{r_{i} - \mu_{i} (\theta )}}{{\sigma_{i} }}} \right]}}{{\Phi \left[ {\frac{{1 - \mu_{i} (\theta )}}{{\sigma_{i} }}} \right] -\Phi \left[ {\frac{{ - \mu_{i} (\theta )}}{{\sigma_{i} }}} \right]}} $$
(7.26)

with φ(.) and Φ(.) denoting the standard normal density and probability functions, respectively. One easily recognizes (7.26) as the probability density function of the truncated normal distribution (Johnson and Kotz 1970). The regression of the response on the latent variable and the variance of the response are given next. Using Di as shorthand for the denominator of (7.26), i.e.,

$$ D_{i} =\Phi \left[ {\frac{{1 - \mu_{i} (\theta )}}{{\sigma_{i} }}} \right] -\Phi \left[ {\frac{{ - \mu_{i} (\theta )}}{{\sigma_{i} }}} \right], $$

and

$$ z_{0i} = \frac{{ - \mu_{i} (\theta )}}{{\sigma_{i} }}{\text{ and }}z_{1i} = \frac{{1 - \mu_{i} (\theta )}}{{\sigma_{i} }}, $$

the regression function is given by

$$ E(R_{i} |\theta ) = \mu_{i} (\theta ) + \frac{{\varphi (z_{0i} ) - \varphi (z_{1i} )}}{{D_{i} }}\sigma_{i} , $$

and the variance by

$$ Var(R_{i} |\theta ) = \sigma_{i}^{2} \left[ {1 + \frac{{z_{0i} \varphi (z_{0i} ) - z_{1i} \varphi (z_{1i} )}}{{D_{i} }} - \left( {\frac{{\varphi (z_{0i} ) - \varphi (z_{1i} )}}{{D_{i} }}} \right)^{2} } \right]. $$

The regression function is displayed in the left panel of Fig. 7.2 for three different values of the δ-parameter. Some comments are in order here:

Fig. 7.2 Expected value (left) and variance (right) of Ri as a function of αi = θ − ηi in Müller’s model

  1. The computation of the regression and variance is tricky for extreme values of z0i and z1i. Using the standard implementation of the normal probability function in EXCEL or R generates gross errors. Excellent approximations are given in Feller (1957, p. 193).

  2. Three values of the δ-parameter have been used, resulting in flatter curves for the higher values. Therefore, the parameter δ can be interpreted as a discrimination parameter, lower values yielding higher discrimination. A discussion on the model properties when δ → 0 can be found in Müller (1987).

  3. Just as in the Rasch model for continuous responses, the numbers along the horizontal axis are quite different from the ones usually displayed for the item response functions of the common Rasch model. Further comments will be given in Sect. 7.4.

The right-hand panel of Fig. 7.2 displays the variance of the response for the same three values of δ and for the same values of α. The curves are symmetric around the vertical axis at α = 0. As this model is an exponential family, the variance of the response is also the Fisher information.
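Because (7.26) is simply a doubly truncated normal density, the regression and variance functions can also be computed with standard statistical software. The sketch below uses scipy's truncnorm, which handles the standardized truncation points reasonably well in recent versions; for very extreme values of z0i and z1i the approximations referred to in comment 1 above may still be preferable.

```python
import numpy as np
from scipy.stats import truncnorm

def regression_and_variance(alpha, delta):
    """E(R | theta) and Var(R | theta) in the (extended) Mueller model,
    via the truncated normal representation (7.26); alpha = theta - eta."""
    sigma = np.sqrt(1.0 / (2.0 * delta))
    mu = (alpha + delta) / (2.0 * delta)
    z0, z1 = -mu / sigma, (1.0 - mu) / sigma        # standardized truncation points
    tn = truncnorm(z0, z1, loc=mu, scale=sigma)     # normal restricted to [0, 1]
    return tn.mean(), tn.var()

# Flatter regression curves for larger delta (lower discrimination):
for delta in (2.0, 5.0, 10.0):
    means = [round(regression_and_variance(a, delta)[0], 3) for a in (-10.0, 0.0, 10.0)]
    print(delta, means)
```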

To gain more insight into the information function, one can use a reparameterization of the rating scale model in which the scores are brought back to the unit interval, i.e., by dividing the original integer-valued scores by the original maximum score m. Müller shows that this results in the discrete model with category response functionFootnote 6

$$ f_{i}^{*} (r_{i} |\theta ) = \frac{{\exp \left[ {r_{i} \alpha_{i} + r_{i} (1 - r_{i} )\delta_{i} } \right]}}{{\sum\limits_{j = 0}^{m} {\exp \left[ {r_{j} \alpha_{i} + r_{j} (1 - r_{j} )\delta_{i} } \right]} }},\quad \left( {r_{i} = 0,\frac{1}{m},\frac{2}{m}, \ldots ,1;\;r_{j} = \frac{j}{m}} \right). $$
(7.27)

It is easy to see that (7.24) is the limiting case of (7.27) as m → ∞. In Fig. 7.3 the information functions are plotted for some finite values of m and for the limiting case, labeled as ‘continuous’. In the left-hand panel, the discrimination parameter δi equals 2 (high discrimination), in the right-hand panel it equals 10 (low discrimination); the value of ηi is zero in both cases such that αi = θ.

Fig. 7.3 Information functions for different values of m (left: δi = 2; right: δi = 10)

The collection of curves can be described briefly as follows:

  1. the maxima of the curves are lower with increasing values of m;

  2. the tails are thicker with increasing values of m;

  3. for low values of m the curves are not necessarily unimodal: see the case m = 2 and δi = 10;

  4. for m = 64 and m → ∞, the curves are barely distinguishable.
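These observations are easy to reproduce numerically: in both the discrete model (7.27) and its continuous limit the information equals the variance of the (rescaled) score, since the score is the sufficient statistic for θ. The sketch below (function names are ours) computes both for αi = 0 and δi = 2.

```python
import numpy as np
from scipy.stats import truncnorm

def info_discrete(alpha, delta, m):
    """Information of the rescaled rating scale model (7.27): the variance of the
    score r_j = j/m under the category probabilities (score = sufficient statistic)."""
    r = np.arange(m + 1) / m
    w = np.exp(r * alpha + r * (1.0 - r) * delta)
    p = w / w.sum()
    mean = np.sum(p * r)
    return np.sum(p * (r - mean) ** 2)

def info_continuous(alpha, delta):
    """Limiting (m -> infinity) information: the variance of the truncated normal (7.26)."""
    sigma = np.sqrt(1.0 / (2.0 * delta))
    mu = (alpha + delta) / (2.0 * delta)
    return truncnorm(-mu / sigma, (1.0 - mu) / sigma, loc=mu, scale=sigma).var()

# At alpha = 0 and delta = 2 the maxima decrease with m towards the continuous limit:
for m in (2, 4, 16, 64):
    print(m, round(info_discrete(0.0, 2.0, m), 5))
print("cont.", round(info_continuous(0.0, 2.0), 5))
```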

One should be careful with conclusions drawn from the graphs in Fig. 7.3: it does not follow from this figure, for example, that the discrimination or the location parameter will remain invariant if one switches from a discrete item format to a continuous one. An example will be given in Sect. 7.4.

3.2 Parameter Estimation

Müller (1987) proposes parameter estimation using only pairs of items, but does not give details. In the present section the method is explained; the technical expressions needed to compute the gradient and the matrix of second partial derivatives are given in the Appendix to this chapter.

As with the continuous Rasch model, definition (7.9), rij = ri + rj, is used, together with the reparameterization

$$ \varepsilon_{i} = \delta_{i} - \eta_{i} . $$

The problems with obtaining CML estimates of the parameters are the same as with the Rasch model for continuous responses (augmented with numerical problems in evaluating the probability function of the truncated normal distribution for extreme values of the argument). Therefore the pseudo-likelihood function, considering all pairs of items, is studied here.

The conditional likelihood function for one pair of items is given by

$$ f_{ij} (r_{i} ,r_{j} |r_{ij} ) = \frac{{\exp (r_{i} \varepsilon_{i} + r_{j} \varepsilon_{j} - r_{i}^{2} \delta_{i} - r_{j}^{2} \delta_{j} )}}{{\int_{{m_{ij} }}^{{M_{ij} }} {\exp \left[ {t\varepsilon_{i} + (r_{ij} - t)\varepsilon_{j} - t^{2} \delta_{i} - (r_{ij} - t)^{2} \delta_{j} } \right]{\kern 1pt} d{\kern 1pt} t} }} $$
(7.28)

where the bounds of the integral in the denominator depend on the value of rij:

$$ \begin{aligned} m_{ij} & = \max (r_{ij} - 1,0), \\ M_{ij} & = \min (r_{ij} ,1). \\ \end{aligned} $$

Along the same lines of reasoning as followed in Sect. 3.1, (7.28) can also be shown to be a density of the truncated normal distribution, i.e.,

$$ f_{ij} (r_{i} ,r_{j} |r_{ij} ) = \frac{{\exp \left[ { - \frac{{(r_{i} - \mu_{ij} )^{2} }}{{2\sigma_{ij}^{2} }}} \right]}}{{\int_{{m_{ij} }}^{{M_{ij} }} {\exp \left[ { - \frac{{(t - \mu_{ij} )^{2} }}{{2\sigma_{ij}^{2} }}} \right]{\kern 1pt} d{\kern 1pt} t} }} = \frac{{\sigma_{ij}^{ - 1} \varphi (x_{ij} )}}{{\Phi (Z_{ij} ) - \Phi (z_{ij} )}}, $$
(7.29)

where

$$ \begin{aligned} \sigma_{ij}^{2} & = \frac{1}{{2(\delta_{i} + \delta_{j} )}}, \\ \mu_{ij} & = \sigma_{ij}^{2} (\varepsilon_{i} - \varepsilon_{j} + 2r_{ij} \delta_{j} ), \\ x_{ij} & = (r_{i} - \mu_{ij} )/\sigma_{ij} , \\ Z_{ij} & = (M_{ij} - \mu_{ij} )/\sigma_{ij} , \\ z_{ij} & = (m_{ij} - \mu_{ij} )/\sigma_{ij} . \\ \end{aligned} $$

Although the function μij is not symmetric in i and j, it follows readily that μij+ μji = rij= rji. Using this, it is not hard to show that the conditional response density (7.29) does not change if i and j are interchanged.

The function to be maximized is the logarithm of the pseudo-likelihood function:

$$ \sum\limits_{v} {\ln \,PL_{v} \left[ {\varepsilon ,\delta ;(r_{v1} , \ldots ,r_{vk} )} \right]} = \sum\limits_{v} {\sum\limits_{i < j} {\ln \,f_{ij} \left( {r_{vi} ,r_{vj} \left| {r_{vij} } \right.} \right)} } , $$
(7.30)

where ε and δ denote the vectors of ε- and δ-parameters, rvi is the response of respondent v to item i, and k is the total number of items. Deriving the expressions for the first and second partial derivatives of (7.30) is not difficult, but rather tedious; the expressions are given in the Appendix to this chapter.
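A minimal sketch of this pseudo-likelihood estimation is given below. Instead of coding the analytic gradient and Hessian from the Appendix, it simply hands the negative log pseudo-likelihood (7.30), written via the truncated normal form (7.29), to a generic quasi-Newton optimizer; the δi are kept positive by optimizing their logarithms, εk is fixed at zero for identification, and no special precautions are taken for extreme values of the standardized truncation points. It is meant as an illustration, not as the estimation software referred to in the discussion.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm
from scipy.optimize import minimize

def neg_log_pl(params, R):
    """Negative log pseudo-likelihood (7.30) for the extended Mueller model.
    params = (eps_1 .. eps_{k-1}, log delta_1 .. log delta_k); eps_k = 0."""
    n, k = R.shape
    eps = np.append(params[:k - 1], 0.0)            # normalization eps_k = 0
    delta = np.exp(params[k - 1:])                  # keeps every delta_i positive
    total = 0.0
    for i, j in combinations(range(k), 2):
        ri, rij = R[:, i], R[:, i] + R[:, j]
        keep = (rij > 0) & (rij < 2)                # r_ij = 0 or 2 is uninformative
        ri, rij = ri[keep], rij[keep]
        m_ij, M_ij = np.maximum(rij - 1.0, 0.0), np.minimum(rij, 1.0)
        s2 = 1.0 / (2.0 * (delta[i] + delta[j]))    # sigma_ij^2
        s = np.sqrt(s2)
        mu = s2 * (eps[i] - eps[j] + 2.0 * rij * delta[j])
        x, Z, z = (ri - mu) / s, (M_ij - mu) / s, (m_ij - mu) / s
        # log of the doubly truncated normal density (7.29); no safeguards for
        # extreme standardized bounds, where the cdf difference may underflow
        total += np.sum(norm.logpdf(x) - np.log(s) - np.log(norm.cdf(Z) - norm.cdf(z)))
    return -total

def fit(R):
    """Maximize the pseudo-likelihood with a generic quasi-Newton optimizer."""
    n, k = R.shape
    start = np.zeros(2 * k - 1)                     # eps = 0, delta = 1
    res = minimize(neg_log_pl, start, args=(R,), method="BFGS")
    eps = np.append(res.x[:k - 1], 0.0)
    delta = np.exp(res.x[k - 1:])
    return eps, delta
```

For an n × k response matrix R with values in [0, 1], fit(R) returns the estimated ε- and δ-parameters.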

4 Comparison of Information Functions Across Models

4.1 The Unit of the Latent Variable

Müller (1987, p. 173) compares the information functions for the rating scale model and his own model for continuous data, assuming that the latent variable underlying the two models is the same, but this assumption need not be correct, for two quite different reasons.

  1. The first and most important threat to a meaningful comparison is the silent assumption that the same trait is measured because the items are the same and only the item format differs. For example, in a personality inventory, in the discrete case the respondent may be asked to express the applicability of a statement to him/herself on a discrete five-point scale where each possible answer is labeled by descriptions such as ‘certainly not applicable’, ‘not applicable’, ‘neutral’ (whatever this may mean), ‘applicable’ and ‘certainly applicable’, while in the continuous case the respondent is asked to express the applicability of the statement by putting a mark on a line segment where only the two end points are labeled, e.g., as ‘not applicable at all’ and ‘strongly applicable’. In the discrete case a tendency to the middle may strongly influence respondents to prefer the middle category, while this tendency may be a less important determinant in the continuous case. In principle there is only one way to obtain a good answer to this important question of construct validity: one has to collect data under both item formats and estimate the correlation (disattenuated for unreliability) between the traits measured with the two formats.

  2. But even if one can show that the traits measured with the different item formats are the same (i.e., their latent correlation equals one), there is still the problem of the origin and the unit of the scale. It is easily understood that the origin of the scale is arbitrary, as most IRT models define their item response functions using a difference between the trait value and a location parameter, such that adding an arbitrary constant to both does not change response probabilities (or densities, as is easily seen from (7.1)). But understanding that the unit of the scale is arbitrary as well is not so easy. A historic example of this is the claim in Fischer (1974) that the scale identified by the Rasch model is a ‘difference scale’, a scale where the origin is arbitrary but the unit fixed, while the same author (Fischer 1995) came, in a complicated chapter, to the conclusion that the scale measured by the Rasch model is an interval scale, with arbitrary origin and unit. In the common Rasch model the unit is chosen by fixing the discrimination parameter (common to all items) at the value of 1, but any other positive value may be chosen. Suppose that in the discrete model one chooses c ≠ 1; then one can replace \( \theta - \beta_{i} \) with \( c(\theta^{*} - \beta_{i}^{*} ) \), where \( \theta^{*} = \theta /c \), \( \beta_{i}^{*} = \beta_{i} /c \) and c is the common (arbitrarily chosen) discrimination value.

With the continuous models, however, the choice of the unit of measurement (of the latent variable) is also influenced by the bounds of the integral. We illustrate this with the Rasch model for continuous responses. Suppose data are collected by instructing the respondents to distribute M tokens (with M large enough such that the answer can safely be considered as continuous) among the alternatives of a multiple choice question and let Ui (with realizations ui) be the number of tokens assigned to the correct alternative of item i. Then, the model that is equivalent to (7.1) is given by

$$ f_{i} (u|\theta^{*} ) = \frac{{\exp \left[ {u(\theta^{*} - \eta_{i}^{*} )} \right]}}{{\int_{0}^{M} {\exp \left[ {y(\theta^{*} - \eta_{i}^{*} )} \right]{\kern 1pt} {\text{d}}y} }},\quad \left( {u \in [0,M]} \right). $$

If we want to rescale the response variable by changing it to a relative value, i.e., Ri= Ui/M, then we find

$$ \begin{aligned} f_{i} (u_{i} |\theta^{*} ) & = \frac{{\exp \left[ {r_{i} M(\theta^{*} - \eta_{i}^{*} )} \right]}}{{M\int_{0}^{1} {\exp \left[ {tM(\theta^{*} - \eta_{i}^{*} )} \right]{\kern 1pt} {\text{d}}{\kern 1pt} t} }} \\ & = \frac{{\exp \left[ {r_{i} (\theta - \eta_{i} )} \right]}}{{M\int_{0}^{1} {\exp \left[ {t(\theta - \eta_{i} )} \right]{\kern 1pt} {\text{d}}{\kern 1pt} t} }} \\ & = \frac{1}{M}f_{i} (r_{i} |\theta ),\quad \left( {r_{i} = u_{i} /M \in [0,1]} \right) \\ \end{aligned} $$

with \( \theta = M\theta^{*} \;{\text{and}}\;\eta_{i} = M\eta_{i}^{*} \). So we see that changing the bounds of the integration in (7.1) changes the density by a constant and at the same time changes the unit of the underlying scale. The constant itself is unimportant, as it only affects the likelihood by a factor that does not depend on the parameters, but the change of unit is not. Choosing zero and one as the integration bounds in (7.1) is an arbitrary choice, and therefore the unit of measurement is arbitrary as well.
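The equivalence just derived is easily verified numerically; the sketch below evaluates the density of the token variable on [0, M] and the rescaled unit-interval density at the corresponding point, using hypothetical numbers.

```python
import numpy as np
from scipy.integrate import quad

def density_on(x, theta, eta, upper):
    """Rasch-type density exp[x(theta - eta)] normalized over the interval [0, upper]."""
    a = theta - eta
    Z = quad(lambda y: np.exp(y * a), 0.0, upper)[0]
    return np.exp(x * a) / Z

M, theta_star, eta_star = 100.0, 0.02, -0.01   # hypothetical values
u = 37.0                                       # tokens placed on the correct alternative
lhs = density_on(u, theta_star, eta_star, M)                      # density on [0, M]
rhs = density_on(u / M, M * theta_star, M * eta_star, 1.0) / M    # (1/M) x unit-interval density
print(lhs, rhs)                                # identical up to quadrature error
```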

Exactly the same reasoning holds for Müller’s model where in Eq. (7.23) the constants c and d (>0) are arbitrary, but will influence the unit of measurement of the latent variable.

4.2 An Example

A simple example to see how the information function depends on the unit of measurement is provided by the Rasch model for binary data. Assuming, as in the previous section, that the common discrimination parameter is indicated by c, the information function for a single item is given by

$$ I_{i} (\theta ) = c^{2} f_{i} (\theta )\left[ {1 - f_{i} (\theta )} \right], $$

meaning that doubling c will quadruple the information, which gives the impression that the information measure is arbitrary. But one should keep in mind that doubling the value of c will at the same time halve the standard deviation (SD) of the distribution of θ, or divide its variance by a factor of four. This means that the information measure only has meaning when compared to the variance of the θ-distribution. So, if we want to compare information functions across models, we must make sure that the latent variables measured by the models are not only congeneric, i.e., their correlation must be one, but also identical, i.e., having the same mean and variance. This is not easy to show empirically, but an acceptable approximation can be obtained, as will be explained by the following example.
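Before turning to that example, the scaling argument can be made concrete with a small numerical sketch (hypothetical values): doubling the common discrimination quadruples the item information, but the standard error of measurement expressed per SD of the θ-distribution is unchanged.

```python
import numpy as np

def irf(theta, beta, c):
    """Item response function of the binary Rasch model with common discrimination c."""
    return 1.0 / (1.0 + np.exp(-c * (theta - beta)))

def item_info(theta, beta, c):
    return c ** 2 * irf(theta, beta, c) * (1.0 - irf(theta, beta, c))

# Two parameterizations of the same model (hypothetical values):
theta, beta, sd = 1.2, 0.4, 1.0                     # discrimination fixed at 1
c = 2.0
theta2, beta2, sd2 = theta / c, beta / c, sd / c    # same model, rescaled unit

i1, i2 = item_info(theta, beta, 1.0), item_info(theta2, beta2, c)
print(i2 / i1)                                      # 4.0: information is quadrupled ...
print(1 / np.sqrt(i1) / sd, 1 / np.sqrt(i2) / sd2)  # ... but the SE per SD of theta is unchanged
```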

At the department of Psychology of the University of Amsterdam, all freshmen participate (compulsorily) in the so-called test weekFootnote 7: during one week they fill in a number of tests and questionnaires and take part as subjects in experiments run at the department. One of the questionnaires presented is the Adjective Check List (ACL, Gough & Heilbrunn, 1983; translated into Dutch by Hendriks et al. 1985), where a number of adjectives (363) is presented to the test takers. The task for the student is to indicate the degree to which each adjective applies to him/herself. The standard administration of the test asks the students to indicate the degree of applicability on a five point scale. In 1990, however, two different test forms were administered, each to about half of the students. In the first format, only a binary response was asked for (not applicable/applicable); in the second format, the student was asked to mark the degree of applicability on a line of 5.7 cm, with the left end corresponding to ‘not applicable at all’ and the right end to ‘completely applicable’. The score obtained by a student is the number of millimeters (m) of the mark from the left end of the line segment. This number m was transformed into a response (in model terms) by the transformation

$$ r = \frac{m + 0.5}{58}. $$

Seven of these adjectives, fitting in the framework of the Big Five, were taken as relevant for the trait ‘Extraversion’ and these seven are used in the present example. For the polytomous data, the responses collected in the years 1991 through 1993 were pooled, yielding 1064 records. For the continuous data 237 records were available. Dichotomous data are not used here. All these records are completely filled in; some 2% of the original records with one or more omissions have been removed. The polytomous data have been analyzed with the Partial Credit Model (PCM) using CML estimation for the category parameters for ordered categories running from 0 to 4; the continuous data were analyzed with the extension of Müller’s model as discussed in Sect. 7.3, using the maximum pseudo-likelihood method. After the estimation of the item parameters, the mean and SD of the θ-distribution were estimated, while keeping the item parameters constant at their previous estimates. For both models a normal distribution of the latent variable was assumed.

Parameter estimates are displayed in Table 7.1. The adjectives were presented to the students in Dutch; the English translation is only given as information and did not have any influence on the actual answers. As there are four parameters per item in the polytomous model, and their estimates are not of much use in judging the item locations, they are not reported in Table 7.1. Instead, the value of the latent variable yielding two (half of the maximum) as expected value is reported, under the symbol β*.

Table 7.1 Parameter estimates of the ‘extraversion’ scale

As is seen from the table, the estimated SD for the continuous model is much larger than for the PCM, and this is consistent with the large numbers along the horizontal axis in Fig. 7.2. The correlation between the η- and β*-parameters is 0.85.

The latent variable estimated by the PCM will be indicated by θp; the one estimated by the continuous model by θc. Their means and SDs are indicated by the subscripts ‘p’ and ‘c’ as well. The variable θp will be linearly transformed such that the transformed variable has the same mean and SD as θc, i.e.,

$$ T(\theta_{p} ) = B\theta_{p} + A. $$

It is easy to check that

$$ B = \frac{{\sigma_{c} }}{{\sigma_{p} }}\;{\text{and}}\;A = \mu_{c} - B\mu_{p} $$

In Fig. 7.4, the two information functions for the seven-item checklist are given, for values of the latent variable brought to the same scale, where the population mean is 2.5 and the SD about 5.22. The vertical dashed line indicates the mean of the latent value distribution, and the thick black line indicates the range that encompasses 95% of the distribution. So, for the great majority of the population the information provided by the continuous model is larger than that provided by the discrete PCM.

Fig. 7.4 Information functions for the PCM and for Müller’s model for the same adjective checklist

Of course the procedure we followed can be criticized: no account has been taken of the standard errors of the estimates, and the data do not come from a sample that has been tested twice. So, silently, we have assumed that the distribution of the latent trait has not changed (very much) over the period in which the data were collected. As in all these cases the population consists of psychology freshmen at the University of Amsterdam, it is fairly unlikely that the distribution has changed in any important way.

Even with all the criticism that may be put forward, it seems quite convincing that the continuous model is more informative than the PCM and therefore certainly deserves more attention than it has received thus far.

5 Discussion

Two IRT models for continuous responses that allow for separation between item and person parameters have been studied. Although CML estimation of the item parameters is possible, it leads to unwieldy formulae, and only the case of two items has been considered for the two models. This, however, proves to be sufficient to obtain consistent estimates of the item parameters, using the theoretical contributions in the area of pseudo-likelihood estimation.

Using the technical details provided in this chapter, developing appropriate software for parameter estimation is not hard. For the extension of Müller’s model, the software developed applies the Newton-Raphson procedure directly to the initial estimates and works very well, although, admittedly, it was only used on a scale with no more than 7 items. For the extension of the Rasch model to continuous responses, software has yet to be developed.

But there are more aspects in the technical realm that deserve attention:

  1. The estimation of the individual person parameter, not only as a technical problem but also as a real statistical one: is the maximum likelihood estimator of θ, used with known values of the item parameters, unbiased, and if it is not, can one develop an unbiased estimator along the lines of the Weighted Maximum Likelihood or Warm estimator?

  2. The set-up of goodness-of-fit tests which are informative, e.g., for the contrast between the two models discussed here, as well as techniques to detect non-fitting items, all based on sound statistical reasoning, are areas which deserve attention.

  3. The extension of Müller’s model to allow for different discriminations turned out to be rather simple and, as the example in Table 7.1 shows, worthwhile, as the estimates differ considerably. Therefore, it also seems worthwhile to think about an extension of the Rasch model for continuous items that allows for different discriminations. A promising route would be to explore the possibilities of Haberman’s (2007) interaction model.

There are, however, aspects of testing that cannot be solved by sophisticated techniques but which are fundamental in the discussion of the validity of the test and the conclusions its use leads to. Take the small scale on extraversion which was discussed in Sect. 7.4. If the answers are collected using a Likert scale, there is ample evidence in the literature for certain tendencies (like the tendency to avoid extreme answers) which will create irrelevant variance in the test scores. Suppose we have two respondents who are quite extravert, i.e., who would on average score at or above the middle of the five categories, but one, A, has a strong tendency to avoid extreme answers while the other, B, has a preference for extreme answers. Then on average B will obtain a higher test score than A, while the difference cannot be unambiguously attributed to a difference in extraversion: some of the variance of the observed scores is due to a variable which is irrelevant for the trait to be studied and therefore forms a threat to the correct interpretation of the results, i.e., to the validity.

More in general, there are always determinants of behavior which are responsible for part of the variance in the responses, and which are a kind of nuisance in the interpretation of the test scores, but which cannot easily be avoided. If these extra variables are associated with the format of the items, then they are hard to discover if the use of this format is ubiquitous, like the multiple choice (MC) format. It is highly probable that ‘blind guessing’, a much-used model for explaining behavior on MC tests, is highly unrealistic; students use all kinds of strategies to improve the result when they are not sure of the correct response, and some of these strategies will give better results than others, so that the result on an MC test is a contamination of the intended construct and the cleverness in using choice strategies.

As long as there is no variation in the format of the items, this contamination will not do much harm, as the irrelevant construct will be absorbed into the intended construct. But the risk of incorrect interpretations occurs if at some point a drastic change in the format is introduced, like switching from MC to probability measurement.

When Dirkzwager was developing his system of multiple evaluation—the name he gave to probability measurement—he was aware of a serious validity threat: students can show a lack of ‘realism’ when assigning continuous responses to each of the alternatives of a multiple choice question, either by being overconfident and giving too high a weight to one of the alternatives, or by being cautious and tending to a uniform distribution over the alternatives. Here is a quotation from Holmes (2002, p. 48): Considering one item in a test, we can represent the student’s uncertainty as to the correct answer by a probability distribution p = (p1, p2,…, pk) over the set {1, 2,…, k} where 0 ≤ pj and Σpj = 1. The student’s response can be represented by r = (r1, r2,…, rk). For a perfectly realistic student the response r is equal to p. In effect, such a student is stating: “Given my knowledge of the subject, this item is one of the many items for which my personal probability is p = (p1, p2,…, pk). For these items, answer one is correct in a proportionFootnote 8 of p1, answer 2 in a proportion of p2, etcetera.”

This means that a student is realistic if his continuous responses match his subjective probabilities, behavior which is already encouraged by the (approximate) logarithmic scoring function. In addition, Holmes developed a measure of realism (see his Chap. 4) and showed that using this measure as feedback was very effective in changing the behavior of ‘unrealistic’ students, which makes the multiple evaluation approach as originally developed by Dirkzwager a good and powerful instrument for formative assessment. Details of the application can be found in Holmes (2002).

Finally, one might ask why an IRT model is needed at all if good use can be made of probability measurement. There are, however, aspects of formative assessment which are very hard to develop at the school level. Progress assessment, for example, is an important one, as is the choice of items whose difficulty matches the level of the students. The construction of an item bank at the national level, and making it available to the schools together with the necessary software, could be a task for a testing agency having the necessary psychometric and IT competences.