Optimal Item Calibration for Computerized Achievement Tests
Abstract
Item calibration is a technique to estimate characteristics of questions (called items) for achievement tests. In computerized tests, item calibration is an important tool for maintaining, updating and developing new items for an item bank. To efficiently sample examinees with specific ability levels for this calibration, we use optimal design theory, assuming that the probability of answering correctly follows an item response model. Locally optimal unrestricted designs usually have only a few design points for ability. In practice, it is hard to sample examinees from a population at these specific ability levels due to unavailability or limited availability of examinees. To counter this problem, we use the concept of optimal restricted designs and show that this concept fits item calibration naturally. We prove an equivalence theorem needed to verify the optimality of a design. Locally optimal restricted designs provide intervals of ability levels for the optimal calibration of an item. Assuming a two-parameter logistic model, we present several scenarios with D-optimal restricted designs for the calibration of a single item and for the simultaneous calibration of several items. These scenarios show that the naive way of sampling examinees around the unrestricted design points is not optimal.
Keywords
achievement tests · computerized tests · item calibration · optimal restricted design · two-parameter logistic model

1 Introduction
Achievement tests are an important part, e.g., of higher education to quantify the proficiency of examinees. An alternative of growing importance to traditional paper-and-pencil tests is the computerized adaptive test (CAT). Examinees take the achievement test at a computer, and each one receives an individual sequence of questions, called items. The advantage of CAT is that the items administered can depend on the answers to previous items; e.g., examinees with many correct answers can subsequently be given more difficult questions, which can then characterize their ability in more detail. In this way, questions which are too hard or too easy for an examinee are avoided and “a high-quality estimate of the examinee’s proficiency can be made using as few as half as many items than in a fixed-form test” (Buyske, 2005).
A prerequisite for administering a CAT is the existence of a collection of items, an item bank. Based on the item bank, the CAT algorithm can choose appropriate items for the examinees. This means that the characteristics of items, e.g., their difficulty, need to be determined before they are included in a CAT. This determination of item characteristics is called calibration of items. A common situation is that achievement tests are given periodically, e.g., year by year. The task is then to update an item bank continuously with new items. Zheng (2014) pointed out the importance of this item replenishment and stressed the need for efficient and accurate calibration of the new items.
In principle, one could perform separate calibration studies in which voluntary test takers answer the new items. However, this is usually a quite costly option, and it can be more feasible to instead add a small calibration part to an ordinary achievement test. The items from the calibration part are then available for achievement tests in future examination periods. This principle is applied, e.g., in the Swedish Scholastic Assessment Test (Universitets- och högskolerådet, 2019), which is administered as a paper-and-pencil test. Adding new items to be calibrated to a CAT in a similar way has been called online calibration (Stocking, 1988), and Zheng (2014) reviews methods for it. Irrespective of whether it is added to a paper-and-pencil test, a CAT, or a non-adaptive computerized test, the calibration part has to be quite small so that the burden of this add-on part on the examinees is negligible.
We assume that an ordinary computerized test is performed (CAT or non-adaptive) and that the abilities of the examinees are well determined by their answers to a larger set of operational items. In this work, we focus on the calibration part for new items, which are seeded into the later part of the operational items in a computerized test. A set of new items should be tested in the calibration, and we consider the situation where we can allocate a small, fixed number of these new items to each examinee. Our aim is to allocate these items to examinees in a good way such that we obtain high-quality estimates of the item characteristics.
For designing the calibration part, we will apply optimal design theory, see e.g., Atkinson, Donev and Tobias (2007). The use of optimal design theory for item calibration has been discussed previously and designs have been elaborated, see e.g., Berger (1992), Buyske (2005), Lu (2014), Zheng (2014), van der Linden and Ren (2015), Ren, van der Linden and Diao (2017).
In contrast to the traditional optimal design setup, in this context we do not have the possibility to select examinees with any desired proficiency freely within a design space. This would theoretically require access to a large number of examinees with specific abilities, a problem discussed, e.g., by Zheng (2014), van der Linden and Ren (2015) and Ren et al. (2017). The problem is avoided if sequential optimization is done: for a given examinee, the best calibration item is then chosen. Some achievement tests, however, test examinees in parallel, and a sequential optimal design cannot be applied. In the Swedish Scholastic Assessment Test, for example, more than 60,000 examinees participate on each of two test dates per year. We consider here such a parallel testing situation, where we have at one test date a given population of examinees for the item calibration: the examinees participating in the computerized test. Based on an assumed proficiency distribution of these examinees, we apply in this work restricted optimization with respect to this distribution. Restricted optimization (also called constrained or bounded optimization) has been discussed in contexts other than achievement tests by Wynn (1982) and Sahm and Schwabe (2000). To our knowledge, this type of restricted optimal design has not been applied to item calibration, even though it is the natural adaptation of traditional optimal design to finite populations (Wynn, 1982). With this method, we are able to gain general insights into how item calibration can be optimized.
In the following Sect. 2, we describe the assumed model and the optimal design theory used. We then present a new equivalence theorem which provides a condition to check whether a given restricted design is optimal or not. This theorem is very general and applies, e.g., to general item response models. In Sect. 2, we also describe the algorithm developed for the computation of optimal designs. In Sects. 3 and 4, we compute optimal designs in several scenarios, with up to three items to calibrate. In real applications, the number of items is usually much larger; in Sect. 5, we discuss an easy way to apply our results to such realistic situations. We summarize our insights and conclude with a discussion (Sect. 6) where we point out directions for future research. The proof of our equivalence theorem is given in an “Appendix”.
2 Model for Optimal Item Calibration
2.1 Model for Item Calibration
Example 1
2.2 Optimal Unrestricted Design
A design for item calibration is a rule for how to sample desired ability levels of examinees for the estimation of unknown item parameters. We have n different items to calibrate and assume that each examinee can calibrate at most one of them (see Sect. 5 for the case when each examinee calibrates \(k>1\) items). First, we are interested in unrestricted designs, meaning that there are no restrictions on the availability of examinees with specific ability levels; the space of examinees’ abilities is \(\Theta = {\mathbb {R}}\). Using continuous designs [see Chapter 9 in Atkinson et al. (2007)], we represent designs by probability measures \(\xi \) over the design space \(\chi =\Theta \times \{1,\dots ,n\}\). A point \((\theta ,i) \in \chi \) means that examinees with ability \(\theta \) are sampled for item i. The restriction \(\xi _i\) of \(\xi \) to \(\Theta \times \{i\}\) describes how the abilities of examinees should be chosen for item i.
We assume to sample examinees with \(m_i\) distinct ability levels \(\theta _{i1},\,\theta _{i2}, \dots ,\,\theta _{im_i}\) in \(\Theta \) with sample proportions (weights) \(w_{i1}, \dots , w_{im_i} \ge 0\) for all items i, such that \(\sum \nolimits _{i=1}^n\sum \nolimits _{j=1}^{m_i}w_{ij}=1\). Here, \(w_{ij}\) is the sample proportion of examinees assigned to each distinct ability level \(\theta _{ij}\) for item i and \(\sum \nolimits _{j = 1}^{m_i} {w_{ij}}\) is the proportion of examinees assigned to item i.
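For concreteness, this discrete design representation can be sketched in code. The following is a minimal illustration (not from the paper's software): each item maps to \((\theta_{ij}, w_{ij})\) pairs, normalized so that all weights sum to 1; the ability levels shown are purely illustrative.

```python
# Minimal sketch of a discrete calibration design: for each item i,
# a list of (ability level theta_ij, weight w_ij) pairs.  The weights of
# all items together must sum to 1; illustrative numbers only.
design = {
    1: [(-1.043, 0.25), (2.043, 0.25)],  # item 1: two equally weighted levels
    2: [(-0.5, 0.25), (1.5, 0.25)],      # item 2: two other levels
}

def total_weight(design):
    """Sum of all weights w_ij over items and ability levels."""
    return sum(w for pts in design.values() for _, w in pts)

def item_proportion(design, i):
    """Proportion of examinees assigned to item i (sum_j w_ij)."""
    return sum(w for _, w in design[i])

assert abs(total_weight(design) - 1.0) < 1e-12  # normalization check
```

Each design point \((\theta, i)\) of the measure \(\xi\) thus appears once, and `item_proportion` returns \(\sum_j w_{ij}\), the share of examinees allocated to item i.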
In order to search for a good item calibration design \(\xi \), we follow classical optimal design theory and focus on the design’s Fisher information matrix for the item parameters. This matrix indicates the precision of the estimators of the model parameters.
In optimal design theory, we optimize some appropriate convex function \(\Psi \) of \(M(\xi )\). A design \({\xi ^*}\) is called \(\Psi \)-optimal if \({\xi ^*} = \arg \min _{\xi } \,\Psi \{ M(\xi )\}\). The information matrix \(M(\xi )\) depends on the unknown model parameters \(\beta _i, i=1, \dots ,n\). If a researcher has best guesses or initial values for the model parameters, the optimal design can be constructed based on these initial values. Such an optimal design is referred to as a locally \(\Psi \)-optimal design (Atkinson et al., 2007).
The following three statements are equivalent:

- The design \({\xi ^*}\) minimizes \(\Psi \{ M(\xi )\} \).
- The minimum over \((\theta ,i) \in \chi \) of \(F_\Psi ({\xi ^*},\theta , i )\) is \(\ge 0\).
- The minimum over \((\theta ,i) \in \chi \) of \(F_\Psi ({\xi ^*},\theta , i )\) is 0, and it is achieved at the support points \((\theta ,i)\) of the design \({\xi ^*}\).
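These conditions can be checked numerically. The sketch below is our own illustration, assuming the two-parameter logistic model with Fisher information \(I(\theta) = p(1-p)\bigl[\begin{smallmatrix} (\theta-b)^2 & -a(\theta-b) \\ -a(\theta-b) & a^2 \end{smallmatrix}\bigr]\) for \((a,b)\); for the D-criterion, checking \(F_\Psi \ge 0\) amounts to checking that the sensitivity \(\mathrm{tr}\{M^{-1}I(\theta)\}\) never exceeds the number of parameters (here 2), with equality at the support points.

```python
import numpy as np

a, b = 1.0, 0.5          # best-guess item parameters (Item 1 of Sect. 3)

def fisher_info(theta):
    """2PL Fisher information matrix for (a, b) at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    u = theta - b
    return p * (1.0 - p) * np.array([[u * u, -a * u], [-a * u, a * a]])

# Unrestricted locally D-optimal design: two equally weighted points.
t1, t2 = b - 1.543 / a, b + 1.543 / a
M = 0.5 * fisher_info(t1) + 0.5 * fisher_info(t2)
Minv = np.linalg.inv(M)

def sensitivity(theta):
    """tr(M^{-1} I(theta)); F_Psi = 2 - sensitivity for the D-criterion."""
    return np.trace(Minv @ fisher_info(theta))

grid = np.linspace(-6.0, 6.0, 2001)
vals = np.array([sensitivity(t) for t in grid])
# sensitivity equals 2 at the support points and stays (essentially) below
# 2 everywhere else, confirming D-optimality of the two-point design
print(round(sensitivity(t1), 3), round(vals.max(), 3))
```

This check is the numerical counterpart of the plots of the directional derivative used in Sects. 3 and 4.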
Example 2
2.3 Optimal Restricted Design
Our aim is to select the best subsamples of examinees for each of n new items in order to optimize item calibration. Since we cannot sample a large number of examinees with specific abilities, we cannot apply directly the optimal design based on the method described in Sect. 2.2. However, we can use the main optimal design ideas described before but restrict the set of available designs using an approach initially described by Wynn (1982).
Theorem 1
We provide a formal proof in “Appendix A”.
In applications, this equivalence theorem is used as follows: to check whether a given candidate design \(h^*\) is optimal, compute and plot the n directional derivatives \(F_\Psi (h^*,\theta ,i), i=1,\dots ,n\), over \(\Theta \). The design is optimal if sampling occurs only for items where their directional derivative is smallest and (in case \(s<1\)) if it is below some constant \(c^*\) which separates the regions of sampling (directional derivative \(\le c^*\)) from the regions of non-sampling (directional derivative \(\ge c^*\)). We will use this theorem for the examples in Sects. 3 and 4.
A consequence of the theorem is that the optimal design usually samples the full available population on ability intervals for a single item. Only if two (or more) directional derivatives coincide on an interval can it be optimal to sample these two (or more) items on the same ability interval.
Example 3
2.4 Optimization Algorithm
The optimization algorithm for optimal restricted designs for the one- and two-item cases is presented below. An idea of how to extend it to larger numbers of items n will be visible from the case \(n=2\); however, the complexity will increase.
2.4.1 Optimization Algorithm for \(n=1\) Item

Step 1: Choose a starting design \(h^0=g \cdot \mathbf{1}_{[\theta ^0_{1L}, \theta ^0_{1U}] \cup [\theta ^0_{2L}, \theta ^0_{2U}]}\) which has density g on two intervals \([{\theta ^0 _{1L}},\,{\theta ^0 _{1U}}]\) and \([\,{\theta ^0 _{2L}},\,{\theta ^0 _{2U}}]\) and density 0 otherwise. One may choose the intervals in the starting design around the optimal unrestricted design points which are shown in Sect. 3.
Step 2: Solve the constrained optimization problem: maximize \(\left| M(\xi ) \right| \) or minimize \( -\log \left| M(\xi ) \right| \) over the parameters \({\theta _{1L}},{\theta _{1U}}, {\theta _{2L}}, {\theta _{2U}}\)

subject to the equality constraint \(\int _{\theta _{1L}}^{\theta _{1U}} g(\theta )\,\mathrm{d}\theta + \int _{\theta _{2L}}^{\theta _{2U}} g(\theta )\,\mathrm{d}\theta = s\) [based on (3) and (4)] and

subject to the inequality constraint \({\theta _{1L}} \le {\theta _{1U}} \le {\theta _{2L}} \le {\theta _{2U}}\).


Step 3: Finally, for assurance, check whether this two-interval design is really D-optimal by computing the directional derivative of the D-criterion and checking whether the condition in the Equivalence Theorem for Item Calibration is fulfilled.
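The three steps for \(n=1\) can be sketched in Python. This is our own illustration, not the authors' implementation: we use scipy's SLSQP routine in place of the nloptr routine mentioned in Sect. 2.4, and we assume the 2PL Fisher information for \((a,b)\), a standard normal density g, and the Item 1 guesses \(a=1\), \(b=0.5\) with sample proportion \(s=0.1\) from Sect. 3.

```python
import numpy as np
from scipy import integrate, optimize, stats

a, b, s = 1.0, 0.5, 0.1   # Item 1 best guesses; sample proportion s

def fisher_info(theta):
    """2PL Fisher information matrix for (a, b) at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    u = theta - b
    return p * (1.0 - p) * np.array([[u * u, -a * u], [-a * u, a * a]])

def info_matrix(x):
    """M of the design sampling all N(0,1) examinees on [x0,x1] and [x2,x3]."""
    out = np.zeros((2, 2))
    for lo, hi in ((x[0], x[1]), (x[2], x[3])):
        for i in range(2):
            for j in range(2):
                out[i, j] += integrate.quad(
                    lambda t, i=i, j=j: fisher_info(t)[i, j] * stats.norm.pdf(t),
                    lo, hi)[0]
    return out

def neg_logdet(x):                     # objective: -log |M(xi)|
    return -np.linalg.slogdet(info_matrix(x))[1]

def coverage(x):                       # population share on the two intervals
    return (stats.norm.cdf(x[1]) - stats.norm.cdf(x[0])
            + stats.norm.cdf(x[3]) - stats.norm.cdf(x[2]))

# Step 1: start with intervals around the unrestricted design points;
# Step 2: SLSQP with the equality (coverage = s) and ordering constraints.
res = optimize.minimize(
    neg_logdet, x0=[-1.2, -0.9, 1.6, 2.5], method="SLSQP",
    constraints=[{"type": "eq", "fun": lambda x: coverage(x) - s},
                 {"type": "ineq", "fun": lambda x: np.diff(x)}])
print(np.round(res.x, 3))   # approximately (-1.215, -0.984, 1.600, 2.577)
```

The solution approximately reproduces the two intervals reported for Item 1 in Sect. 3.1; Step 3 (the equivalence-theorem check) can then be run on the fitted intervals as in the sketch of Sect. 2.2.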
2.4.2 Optimization Algorithm for \(n=2\) Items
 Step 1: Choose a starting design \(\xi ^0\) which has density g on K intervals

for \(1\mathrm{st}\) item: \({I_{11}}=[{\theta ^0 _{11L}},\,{\theta ^0 _{11U}}], \dots , {I_{1K}}=[{\theta ^0 _{1KL}},\,{\theta ^0 _{1KU}}]\),

for \(2\mathrm{nd}\) item: \({I_{21}}=[{\theta ^0 _{21L}},\,{\theta ^0 _{21U}}], \dots , {I_{2K}}=[{\theta ^0 _{2KL}},\,{\theta ^0 _{2KU}}]\) and density 0 otherwise.

 Step 2: Solve the constrained optimization problem: maximize \(\left| M(\xi ) \right| = \left| M_1(\xi _1 ) \right| \cdot \left| M_2(\xi _2 ) \right| \) or minimize
$$\begin{aligned} -\log \left| {M_1}(\xi _1 ) \right| - \log \left| {M_2}(\xi _2 ) \right| \end{aligned}$$
(9)

subject to the equality constraint \(\sum \limits _{r = 1}^n \sum \limits _{t = 1}^K \int _{I_{rt}} g(\theta )\,\mathrm{d}\theta = s \) and

subject to the inequality constraint \({I_{11}}\dots {I_{1K}}{I_{21}}\dots {I_{2K}}\Leftrightarrow \theta _{11L}\le \theta _{11U} \le \dots \le \theta _{12L} \le \theta _{12U} \le \theta _{21L}\le \theta _{21U} \le \dots \le \theta _{22L} \le \theta _{22U}\).

Similarly, we check the other possible orderings of the intervals. For \(K=2\), we have six inequality constraints: \({I_{11}}{I_{12}}{I_{21}}{I_{22}}\), \({I_{11}}{I_{21}}{I_{12}}{I_{22}}\), \({I_{11}}{I_{21}}{I_{22}}{I_{12}}\), \({I_{21}}{I_{22}}{I_{11}}{I_{12}}\), \({I_{21}}{I_{11}}{I_{22}}{I_{12}}\), \({I_{21}}{I_{11}}{I_{12}}{I_{22}}\). We select the inequality constraint which gives the minimum value in (9).


Step 3: Finally, for assurance, it is essential to check whether this K-interval design is really D-optimal by computing the directional derivative of the D-criterion and checking whether the condition in the Equivalence Theorem for Item Calibration is fulfilled. If the design is not optimal, set \(K=K+1\) and go to Step 1.
Since two or more interval boundaries are allowed to coincide, designs which need fewer than K intervals are special cases of a K-interval design. Hence, when increasing K, (9) cannot increase; it decreases until the right K is found and would then stay constant for larger K. If some interval boundaries coincide in the determined optimal design, we can finally reduce its number of intervals.
The constrained optimization problem in our algorithm was solved using the R package nloptr (Borchers, 2013) with the Sequential (least-squares) Quadratic Programming (SQP) algorithm. We use this algorithm for the examples presented in Sects. 3 and 4. The number of iterations of the SQP algorithm can vary considerably from case to case, but a final solution was obtained very quickly, within one minute, in all cases.
2.5 Relative Efficiency of Designs
We could assign the items randomly, irrespective of ability, such that each examinee has probability s/n of calibrating a specific item. This so-called random design has densities \(h_i^r=sg/n\) for \(i=1,\dots ,n\). In the examples, we will be interested, e.g., in the relative efficiency \(\mathrm{RE}_\mathrm{D}(h^r,h^*)\) of the random design compared to the D-optimal restricted design \(h^*\). Many researchers have compared optimal design efficiency with a random design in item calibration studies, see e.g., Buyske (2005). We consider the random design as a benchmark for comparison.
Another important design for comparison is the symmetric design. It first divides the sample proportion s equally among all, say m, unrestricted design points, so a proportion s/m should be sampled around each design point \(\theta ^*\). A value d is computed such that the desired proportion s/m of examinees lies in the symmetric interval \((\theta ^*-d,\theta ^*+d)\), i.e., \(\int _{\theta ^*-d}^{\theta ^*+d} g(x)~\mathrm{d}x=s/m\). The symmetric design is only well defined as long as the intervals around the design points do not overlap.
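Finding d is a one-dimensional root-finding problem. A small sketch (our own, assuming a standard normal g as in Sects. 3 and 4) using Brent's method:

```python
from scipy import optimize, stats

def symmetric_halfwidth(theta_star, share):
    """Half-width d with P(theta* - d < X < theta* + d) = share, X ~ N(0,1)."""
    f = lambda d: (stats.norm.cdf(theta_star + d)
                   - stats.norm.cdf(theta_star - d) - share)
    return optimize.brentq(f, 0.0, 20.0)   # the mass is increasing in d

# Item 1 of Sect. 3: unrestricted points -1.043 and 2.043, s = 0.1, m = 2,
# so each interval should capture s/m = 5% of the population.
for point in (-1.043, 2.043):
    print(point, round(symmetric_halfwidth(point, 0.1 / 2), 3))
```

As expected, the interval around the extreme point 2.043 must be much wider than the one around \(-1.043\), because fewer examinees are available in the tail.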
3 Results for Calibration of One Item

Item 1: Discrimination \(a_1=1\), difficulty \(b_1=0.5\);

Item 2: Discrimination \(a_2=1.5\), difficulty \(b_2=-1.2\);

Item 3: Discrimination \(a_3=1.6\), difficulty \(b_3=2\).
Abdelbasit and Plackett (1983) showed that for the two-parameter logistic model the locally D-optimal design for a single item with best-guess values a and b for discrimination and difficulty has two distinct, equally weighted design points or ability levels \({\theta } = b \pm \frac{1.543}{a}\). The corresponding probabilities of correctly answering the question at these points, \({\theta _1}\) and \({\theta _2}\) say, are \(p({\theta _1}) = 0.176\) and \(p({\theta _2}) = 0.824\). This means that the unrestricted locally D-optimal design recommends choosing half of the examinees for calibration with ability \(\theta _1= b - \frac{1.543}{a}\) and half with ability \(\theta _2= b + \frac{1.543}{a}\).
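A quick numerical check of this result: at \(b \pm 1.543/a\), the 2PL success probability is always 0.176 and 0.824, independent of the item parameters. The pairs \((a,b)\) below are illustrative.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

for a, b in [(1.0, 0.5), (1.5, -1.2), (1.6, 2.0)]:   # illustrative (a, b) pairs
    lo, hi = b - 1.543 / a, b + 1.543 / a
    print(round(p_correct(lo, a, b), 3), round(p_correct(hi, a, b), 3))
# each line prints: 0.176 0.824 - independent of (a, b)
```

The invariance holds because \(a(\theta-b) = \pm 1.543\) at both points, so the logistic function is always evaluated at \(\pm 1.543\).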
In the examples in Sects. 3 and 4, we assume that the examinees in the computerized test have standard normally distributed abilities. We compute locally D-optimal restricted designs with restriction \(g(\theta ) = \frac{1}{\sqrt{2\pi }}e^{-\theta ^2/2}\). However, the method, including the Equivalence Theorem in Sect. 2.3, is valid even if another assumption for the abilities is preferred. Since we compute locally optimal designs here, we have investigated robustness in the supplementary materials, where we see that the designs are robust as long as the parameters are not severely misspecified.
3.1 Calibration of Item 1
It is hard to select a sample of examinees with these specific ability levels, as there might be no such examinees available, or only a limited number. Instead, we sample the examinees from the available distribution in an optimal way using the techniques described in Sect. 2.3. For the restricted optimal design, we assume that the population of examinees has a standard normal ability distribution and that we sample a proportion \(s=0.1\) of this population. The calculated optimal restricted design recommends sampling 5% of the examinees from the population with ability levels between (\(-1.215\), \(-0.984\)) and 5% between (1.600, 2.577), see the middle panel of Fig. 1a. The intervals are not equal in length: there are fewer available examinees around the high unrestricted ability level 2.043 than around the low level \(-1.043\), so we need a longer interval around 2.043 to select 5% of the population. The intervals are also asymmetric around the unrestricted design points and extend more toward the extreme abilities, since fewer examinees are available there.
The directional derivative for this two-interval design is shown in the lower panel of Fig. 1a as a black line, with the interval limits marked by red dots. Since these four points of the two-interval design lie on one blue reference line and the sampled intervals have directional derivative below this line, the Equivalence Theorem for Item Calibration described in Sect. 2.3 confirms the optimality of this two-interval design (the blue reference line corresponds to \(c^*\) in the theorem).
We also computed the optimal restricted design for sample proportions other than \(s=0.1\). We show the results in Fig. 1b for \(s=0, 0.05, 0.1,\dots ,0.95\), where \(s=0\) is the limiting case of the unrestricted optimal design. We see there that we still have a two-interval design if we want to sample 95% of the population; it becomes a one-interval design if we sample 96%. Figure 2 shows the determinant of the information matrix of the locally D-optimal restricted design for Item 1 for sample proportions \(s=0,0.05,\dots ,0.95,1\). The case \(s=0\) corresponds to the locally D-optimal unrestricted design, and \(s=1\) to the random design. The loss of information for Item 1 is moderate if the population proportion is between 0.0 and 0.2.
3.2 Calibration of Item 2
3.3 Calibration of Item 3
In the third scenario, we want to sample 35% of the examinee population (\(s=0.35\)) in order to calibrate Item 3 with best guesses \(b=2\) for the difficulty parameter and \(a=1.6\) for the discrimination parameter. The unrestricted optimal design recommends choosing 17.5% of the population at each of the ability levels 1.035 and 2.965. The restricted optimal design samples 21.23% of the population of examinees between the ability levels (0.043, 0.611) and 13.76% between (1.091, 5.417), see Fig. 5a with the design and directional derivative plots. The two intervals have unequal lengths and different sample proportions. The lower limit of the upper interval is quite close to the lower point of the optimal unrestricted design. This seems reasonable, as few examinees are available around the high ability level 2.965, so this lower limit moves toward the left, where more examinees are available. To compensate, the lower interval lies well below the lower point of the optimal unrestricted design. As for Item 2, the lower interval here does not contain the lower unrestricted design point. This effect occurs for items with difficulty b not in the center of the ability distribution; the value of the difficulty where this effect starts depends on the discrimination a. In the supplementary materials, we provide figures showing the combinations of a and b for which an unrestricted design point is not contained in the restricted optimal design.
Table 1. Relative efficiency of the random design versus the D-optimal restricted design.

| Proportion (%) | Item 1 \(\mathrm{RE}_\mathrm{D}\) | Item 1 \(\mathrm{RE}_\mathrm{SS}\) (%) | Item 2 \(\mathrm{RE}_\mathrm{D}\) | Item 2 \(\mathrm{RE}_\mathrm{SS}\) (%) | Item 3 \(\mathrm{RE}_\mathrm{D}\) | Item 3 \(\mathrm{RE}_\mathrm{SS}\) (%) |
|---:|---:|---:|---:|---:|---:|---:|
| 0 | 0.7451 | 34.2076 | 0.6360 | 57.2429 | 0.3616 | 176.5380 |
| 5 | 0.7463 | 33.9931 | 0.6407 | 56.0771 | 0.4094 | 144.2796 |
| 10 | 0.7498 | 33.3776 | 0.6531 | 53.1182 | 0.4562 | 119.2079 |
| 15 | 0.7551 | 32.4376 | 0.6695 | 49.3729 | 0.4964 | 101.4377 |
| 20 | 0.7618 | 31.2638 | 0.6875 | 45.4549 | 0.5325 | 87.7791 |
| 25 | 0.7697 | 29.9238 | 0.7062 | 41.6082 | 0.5658 | 76.7462 |
| 30 | 0.7785 | 28.4552 | 0.7251 | 37.9183 | 0.5969 | 67.5348 |
| 35 | 0.7882 | 26.8741 | 0.7440 | 34.4095 | 0.6263 | 59.6613 |
| 40 | 0.7988 | 25.1845 | 0.7629 | 31.0821 | 0.6544 | 52.8130 |
| 45 | 0.8105 | 23.3856 | 0.7817 | 27.9273 | 0.6813 | 46.7767 |
| 50 | 0.8232 | 21.4783 | 0.8004 | 24.9334 | 0.7072 | 41.4018 |
| 55 | 0.8370 | 19.4692 | 0.8191 | 22.0882 | 0.7329 | 36.4384 |
| 60 | 0.8520 | 17.3720 | 0.8377 | 19.3804 | 0.7563 | 32.2257 |
| 65 | 0.8680 | 15.2055 | 0.8562 | 16.8001 | 0.7807 | 28.0906 |
| 70 | 0.8850 | 12.9922 | 0.8746 | 14.3393 | 0.8063 | 24.0296 |
| 75 | 0.9029 | 10.7552 | 0.8929 | 11.9919 | 0.8331 | 20.0334 |
| 80 | 0.9215 | 8.5176 | 0.9111 | 9.7539 | 0.8614 | 16.0870 |
| 85 | 0.9407 | 6.3018 | 0.9292 | 7.6245 | 0.8915 | 12.1685 |
| 90 | 0.9603 | 4.1296 | 0.9472 | 5.5690 | 0.9238 | 8.2444 |
| 95 | 0.9802 | 2.0222 | 0.9692 | 3.1792 | 0.9592 | 4.2536 |
Table 2. Relative efficiency of the symmetric design versus the D-optimal restricted design (cells are empty where the symmetric design is not defined because its intervals would overlap).

| Proportion (%) | Item 1 \(\mathrm{RE}_\mathrm{D}\) | Item 1 \(\mathrm{RE}_\mathrm{SS}\) (%) | Item 2 \(\mathrm{RE}_\mathrm{D}\) | Item 2 \(\mathrm{RE}_\mathrm{SS}\) (%) | Item 3 \(\mathrm{RE}_\mathrm{D}\) | Item 3 \(\mathrm{RE}_\mathrm{SS}\) (%) |
|---:|---:|---:|---:|---:|---:|---:|
| 0 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 1.0000 | 0.0000 |
| 5 | 0.9999 | 0.0063 | 0.9993 | 0.0661 | 0.9633 | 3.8065 |
| 10 | 0.9994 | 0.0646 | 0.9954 | 0.4620 | 0.9252 | 8.0835 |
| 15 | 0.9980 | 0.1990 | 0.9887 | 1.1451 | 0.8998 | 11.1370 |
| 20 | 0.9961 | 0.3953 | 0.9806 | 1.9777 | 0.8853 | 12.9498 |
| 25 | 0.9937 | 0.6335 | 0.9722 | 2.8640 | | |
| 30 | 0.9911 | 0.8937 | 0.9639 | 3.7426 | | |
| 35 | 0.9886 | 1.1562 | 0.9563 | 4.5690 | | |
| 40 | 0.9862 | 1.4028 | 0.9496 | 5.3106 | | |
| 45 | 0.9841 | 1.6155 | 0.9439 | 5.9429 | | |
| 50 | 0.9825 | 1.7809 | 0.9394 | 6.4461 | | |
| 55 | 0.9814 | 1.8914 | 0.9363 | 6.8045 | | |
| 60 | 0.9809 | 1.9450 | | | | |
| 65 | 0.9809 | 1.9463 | | | | |
| 70 | 0.9813 | 1.9018 | | | | |
| 75 | 0.9821 | 1.8195 | | | | |
| 80 | 0.9832 | 1.7072 | | | | |
| 85 | 0.9845 | 1.5726 | | | | |
| 90 | 0.9860 | 1.4229 | | | | |
| 95 | 0.9875 | 1.2639 | | | | |
Table 3. Relative efficiency \(\mathrm{RE}_\mathrm{D}\) versus the D-optimal restricted design for calibration of two or more items.

(a) Relative efficiency of the random design versus the D-optimal restricted design

| Proportion (%) | Items 1, 2 \(\mathrm{RE}_\mathrm{D}\) | \(\mathrm{RE}_\mathrm{SS}\) (%) | Items 1, 3 \(\mathrm{RE}_\mathrm{D}\) | \(\mathrm{RE}_\mathrm{SS}\) (%) | Items 2, 3 \(\mathrm{RE}_\mathrm{D}\) | \(\mathrm{RE}_\mathrm{SS}\) (%) |
|---:|---:|---:|---:|---:|---:|---:|
| 0 | 0.6884 | 45.2694 | 0.5191 | 92.6486 | 0.4796 | 108.5273 |
| 10 | 0.6915 | 44.6164 | 0.5528 | 80.8993 | 0.5113 | 95.5664 |
| 20 | 0.6997 | 42.9275 | 0.5896 | 69.6201 | 0.5444 | 83.6903 |
| 30 | 0.7107 | 40.7010 | 0.6226 | 60.6242 | 0.5748 | 73.9856 |
| 40 | 0.7238 | 38.1549 | 0.6531 | 53.1150 | 0.6032 | 65.7839 |
| 50 | 0.7404 | 35.0594 | 0.6821 | 46.5996 | 0.6304 | 58.6238 |
| 60 | 0.7597 | 31.6309 | 0.7104 | 40.7728 | 0.6585 | 51.8552 |
| 70 | 0.7810 | 28.0483 | 0.7399 | 35.1514 | 0.6871 | 45.5363 |
| 80 | 0.8037 | 24.4270 | 0.7715 | 29.6245 | 0.7137 | 40.1228 |
| 90 | 0.8275 | 20.8469 | 0.8048 | 24.2492 | 0.7392 | 35.2876 |
| 100 | 0.8525 | 17.2983 | 0.8398 | 19.0805 | 0.7665 | 30.4547 |

(b) Relative efficiency of the symmetric design versus the D-optimal restricted design

| Proportion (%) | Items 1, 2 \(\mathrm{RE}_\mathrm{D}\) | \(\mathrm{RE}_\mathrm{SS}\) (%) | Items 2, 3 \(\mathrm{RE}_\mathrm{D}\) | \(\mathrm{RE}_\mathrm{SS}\) (%) |
|---:|---:|---:|---:|---:|
| 0 | 1.0000 | 0.0000 | 1.0000 | 0.0000 |
| 10 | 0.9996 | 0.0370 | 0.9796 | 2.0784 |
| 20 | 0.9972 | 0.2772 | 0.9571 | 4.4781 |
| 30 | 0.9930 | 0.7070 | 0.9404 | 6.3430 |
| 40 | 0.9885 | 1.1670 | 0.9289 | 7.6589 |
3.4 Relative Efficiency of the Optimal Design
Table 1 shows the relative efficiency of the random design versus the D-optimal restricted design for each of the three items. The D-optimal restricted design is generally much more efficient than the random design, gaining up to 34% in sample size for Item 1, up to 56% for Item 2, and up to 144% for Item 3 while giving the same precision of the estimates. Additionally, for the D-optimal restricted and random designs, we provide figures with the determinants of the information matrices for the three items in the supplementary materials. Table 2 shows the efficiencies for the symmetric design. For Item 1, which has a difficulty close to the mean ability of the population, the symmetric design is quite efficient, needing only up to 1.95% more examinees than the restricted D-optimal design. For Items 2 and 3, the intervals of the symmetric design would overlap for larger s; therefore, this design is only possible for \(s\le 0.55\) and \(s\le 0.2\), respectively. For these items, which each have one unrestricted optimal design point where only few examinees are available, the sample size gain of the optimal over the symmetric design is larger for some s (up to 6.80% for Item 2 and 12.95% for Item 3).
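The two efficiency measures in Tables 1-3 are linked: since the information matrix scales linearly with the number of examinees, a design with D-efficiency \(\mathrm{RE}_\mathrm{D}\) needs a factor \(1/\mathrm{RE}_\mathrm{D}\) more examinees for the same precision, i.e., \(\mathrm{RE}_\mathrm{SS} = (1/\mathrm{RE}_\mathrm{D} - 1)\cdot 100\%\). A minimal check of this relation (assuming this standard sample-size interpretation of \(\mathrm{RE}_\mathrm{D}\)) against a few entries of Table 1:

```python
def sample_size_gain(re_d):
    """Percent extra examinees the competing design needs: (1/RE_D - 1) * 100."""
    return (1.0 / re_d - 1.0) * 100.0

# (RE_D, RE_SS) pairs for s = 0 from Table 1 (Items 1, 2 and 3)
for re_d, re_ss in [(0.7451, 34.2076), (0.6360, 57.2429), (0.3616, 176.5380)]:
    print(round(sample_size_gain(re_d), 1), re_ss)
```

The recomputed gains agree with the tabulated \(\mathrm{RE}_\mathrm{SS}\) values up to rounding of \(\mathrm{RE}_\mathrm{D}\).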
4 Results for Calibration of Two or More Items
We now present scenarios for the calibration of two items. We start by briefly mentioning the locally D-optimal unrestricted design. One can show that it is D-optimal to sample exactly half of the examinees for each of the two items. Within each item, the one-item optimal design mentioned in Sect. 3 is the best choice. This means that the locally D-optimal design for the calibration of two items is to sample 25% of the examinees at each of the ability levels \(\theta = b_1 \pm \frac{1.543}{a_1}\) for Item 1 and 25% at each of \(\theta = b_2 \pm \frac{1.543}{a_2}\) for Item 2.
We now compute locally D-optimal restricted designs assuming that the examinees participating in the computerized test have standard normally distributed abilities. We use Items 1, 2, and 3 from Sect. 3 and compute the optimal design when at least two of these three items should be calibrated simultaneously.
In a first case (Sect. 4.1), the optimal designs for the two items do not overlap. In more challenging cases (Sects. 4.2 and 4.3), some examinees would be needed for both items – the items compete with each other. Then, the optimal design determines the best allocation to either of the items. The result can be a two-interval solution for both items (Sect. 4.2); in this case, the algorithm in Sect. 2.4.2 found the optimal design using \(K=2\). In Sect. 4.3, \(K=3\) intervals were needed for each item. Finally, we compute optimal designs when all examinees are sampled (\(s=1\); Sects. 4.4 and 4.5). In Table 3a, the relative efficiencies are calculated for the random design. Considerable sample size gains exist in all cases (17.30% to 95.57%). Table 3b shows the efficiencies for the symmetric design in the cases where the intervals do not overlap. In many cases, including all cases for calibrating Items 1 and 3, we cannot apply the symmetric design directly due to overlapping.
4.1 Calibration for Non-competing Items
In this first situation, we consider Item 1 (\(a=1\), \(b=0.5\)) and Item 2 (\(a=1.5\), \(b=-1.2\)) for calibration. We want to sample 10% of the population of examinees to calibrate these two items for the item bank.
The unrestricted optimal design suggests sampling 2.5% of the examinees at each of the ability levels \(-1.043\) and 2.043 (for Item 1) and \(-2.229\) and \(-0.171\) (for Item 2). Since it is in practice hard to sample examinees at these specific ability levels, due to unavailability or limited availability of examinees, we use the restricted optimal design to sample the examinees on intervals of ability levels in an optimal way. The restricted optimal design recommends sampling 2.52% and 2.51% of the examinees from the population between the ability levels (\(-1.114\), \(-1.004\)) and (1.804, 2.308), respectively, for Item 1. For Item 2, it suggests choosing 2.47% and 2.50% of the examinees between the ability levels (\(-2.617\), \(-1.893\)) and (\(-0.167\), \(-0.103\)), see Fig. 7a. The directional derivative plot in the lower panel of Fig. 7a confirms that the design with these interval limits is optimal: the blue reference line (corresponding to the value \(c^*\) in the Equivalence Theorem for Item Calibration) separates the sampling regions from the non-sampling regions. Further, the sampling for item \(i, i=1,2,\) corresponds to the region where the respective item has the smallest directional derivative. We show optimal designs for other values of \(s \in \{0.1, 0.2, \dots , 1\}\) in Fig. 7b.
4.2 Calibration for Competing Items
In this case, we want to select a sample of 50% of the examinees from the population in order to calibrate Item 1 (\(a=1\), \(b=0.5\)) and Item 3 (\(a=1.6\), \(b=2\)) for the item bank. The unrestricted optimal design would select 12.5% of the examinees at each of the ability levels \(-1.043\) and 2.043 for Item 1 and 12.5% at each of the ability levels 1.035 and 2.965 for Item 3. Selecting examinees around the unrestricted design points in a naive manner faces the problem that there are only few examinees around the ability levels 2.043 and 2.965. The restricted design recommends choosing 15.1% and 13.6% of the population of examinees on the ability intervals (\(-2.158\), \(-0.967\)) and (0.836, 1.511) for Item 1 and 14.7% and 6.5% of the examinees for Item 3 on the intervals (0.299, 0.721) and (1.511, 5.197), see Fig. 8a. The directional derivative plot in the lower panel confirms, based on the Equivalence Theorem for Item Calibration, that this restricted design is optimal. In each region, the item with the lowest directional derivative is sampled. The two upper intervals follow directly after each other with boundary point 1.511, which shows that the two items compete for examinees around the ability \(\theta =1.511\); the directional derivatives of both items are equal at this point. Examinees with such \(\theta \) would be good for both items, since both directional derivatives are well below the reference line, but in order to maximize the overall information, this cut point was determined. Figure 8b shows optimal designs for other values of \(s \in \{0.1, 0.2, \dots , 1\}\).
4.3 Calibration for Items with Several Intervals
In this scenario, we want to choose a sample of 80% of the examinees from the population to calibrate Item 2 (\(a=1.5\), \(b=-1.2\)) and Item 3 (\(a=1.6\), \(b=2\)) in the item bank. The unrestricted design recommends choosing 20% of the examinees at each of the abilities \(-2.229\) and \(-0.171\) for Item 2 and 1.035 and 2.965 for Item 3. The restricted optimal design suggests selecting 18.78%, 10.14% and 14.29% of the population of examinees on the ability intervals (\(-4.069\), \(-0.886\)), (\(-0.329\), \(-0.069\)) and (0.338, 0.757), respectively, for Item 2. It also recommends choosing 16.01%, 5.05% and 15.72% of the examinees from the population on the ability intervals (\(-0.069\), 0.338), (0.757, 0.938) and (1.006, 5.508) for Item 3, see Fig. 9. The directional derivative plot in the lower panel of Fig. 9 shows, together with the Equivalence Theorem for Item Calibration, that this design is optimal for the selection of examinees: We select examinees on the intervals below the blue line for Item 2 or Item 3 depending on which item's directional derivative is smallest there. In contrast to the preceding example, the competition between the items here leads to three intervals for each item. Note that in the region from \(\theta = 0.3\) to 0.9, the two directional derivatives are quite close but do not exactly coincide; the minimum is unique except at the crossing points. Optimal designs for other values of \(s \in \{0.1, 0.2, \dots , 1\}\) are presented in Fig. 10.
4.4 Calibration of Two Items Using the Whole Population
Now we want to select all available examinees in order to calibrate Items 2 and 3 in the item bank. The optimal unrestricted design suggests choosing 25% of all available examinees at each of the ability levels \(-2.229\) and \(-0.171\) for Item 2 and 25% at each of the ability levels 1.035 and 2.965 for Item 3. When we use the restricted optimal design, we should choose 30.98% of the examinees on the ability interval (\(-\infty\), \(-0.496\)) and 23.28% on (0.081, 0.723) for Item 2. For Item 3, it suggests choosing 22.27% of the examinees on (\(-0.496\), 0.081) and 23.47% on (0.723, \(\infty\)). (With an exact computation, the last interval is obtained as (0.723, 10); examinees with higher ability should receive Item 2. However, the probability of an ability \(\ge 10\) is essentially 0.) The directional derivative in the third panel of Fig. 9 shows that this design is optimal for the selection of examinees: We choose examinees for Item 2 or Item 3 whenever the respective directional derivative is smallest. The random design requires 30.45% more examinees to be as efficient as the locally D-optimal restricted design.
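The "30.45% more examinees" statement can be translated into a D-efficiency: since the information matrix scales linearly with the number of examinees, a design with D-efficiency \(e\) needs \(1/e\) times the sample size of the optimal design. A minimal sketch of this standard conversion (the function name is illustrative):

```python
def extra_sample_share(d_efficiency: float) -> float:
    """Relative increase in sample size needed for a design with the
    given D-efficiency to match the reference (optimal) design.
    Uses the linear scaling of the information matrix in sample size."""
    return 1.0 / d_efficiency - 1.0

# A random design needing 30.45% more examinees corresponds to a
# D-efficiency of 1 / 1.3045 relative to the restricted optimum.
print(extra_sample_share(1.0 / 1.3045))
```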
4.5 Calibration of All Three Items Using the Whole Population
Finally, we calibrate Items 1, 2 and 3 simultaneously using the population of examinees participating in the computerized test. The optimal unrestricted design recommends selecting approximately 16.67% of all available examinees at each of the six optimal unrestricted design points of ability. For the optimal restricted design, we remarked in Sect. 2.3 that examinees with very high abilities should be assigned to the item with the lowest discrimination, here Item 1. For the numerical computation, we therefore assign the intervals \(I_{10}=(-\infty ,\theta _{10U}]\) and \(I_{1(K+1)}=[\theta _{1(K+1)L},\infty )\) to Item 1; between these intervals, we calculate an optimal K-interval design. It turns out that \(K=2\) is sufficient here. The optimal restricted design suggests choosing 11.97% and 23.17% of the total available examinees on the ability intervals (\(-5.147\), \(-1.176\)) and (\(-0.424\), 0.170) for Item 2. Besides the intervals \((-\infty ,-5.147)\) and \((5.975,\infty )\), in which almost no examinee falls, we choose 21.62% and 16.02% of the examinees for Item 1 on the intervals (\(-1.176\), \(-0.424\)) and (0.754, 1.513). Lastly, on the intervals (0.170, 0.754) and (1.513, 5.975) we select 20.70% and 6.52% of the examinees for Item 3, see the fourth panel of Fig. 11. According to the Equivalence Theorem for Item Calibration, the directional derivatives in the last panel of Fig. 11 show that the restricted design is optimal for the selection of examinees based on their estimated abilities. The random design needs 32.04% more examinees to have the same efficiency as the restricted D-optimal design.
5 Scaling Up the Method for Large Banks of New Items
An assumption we made was that each examinee can calibrate (at most) one item. Our examples had an item bank of three items. Realistic situations often have large banks of new items, and it is desired that each examinee calibrates several items. We now show how the described methods can easily be used in such situations. We assume that the maximal number k of items which an examinee can calibrate is given by practical circumstances, e.g., the time available for the test. The number n of new items to calibrate is larger than k, so we need to allocate them to different examinees. Let us assume for simplicity that n is a multiple of k. We divide the n items into k blocks of n/k items each. Each examinee is supposed to calibrate exactly one item per block. The blocking might be done taking the content of the items into account or simply at random. We then compute the optimal restricted design separately for each block of n/k items. This gives us the optimal calibration under the additional restriction of this blocking.
We can compute the D-efficiency of the random design compared to the restricted optimal design for each block. It follows from formula (10) that the overall efficiency of the random design compared to the blocked restricted optimal design is the geometric mean of the block efficiencies.
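The aggregation step above can be sketched in a few lines; this is a minimal illustration of the geometric-mean rule, assuming equal block sizes as in the blocking scheme described, with illustrative function names:

```python
from math import prod

def overall_efficiency(block_effs: list[float]) -> float:
    """Overall D-efficiency of a blocked design as the geometric mean
    of the per-block D-efficiencies (equal blocks of n/k items each)."""
    k = len(block_effs)
    return prod(block_effs) ** (1.0 / k)

# Hypothetical per-block efficiencies of a random design vs. the
# blocked restricted optimum:
print(overall_efficiency([0.76, 0.80, 0.72]))
```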
6 Discussion and Conclusion
Item calibration is an important tool for maintaining, updating and developing new items for an item bank. In the case of a two-parameter logistic model, the unrestricted D-optimal design for the calibration of one new item has two optimal ability levels of examinees (\(\theta = b \pm \frac{1.543}{a}\)), at which one should sample equal proportions of the examinee population. In practice, it is impossible to sample equal proportions of examinees at these optimal ability levels due to unavailability or limited availability of examinees. Sampling symmetrically around the optimal ability levels works in some situations. But in many cases, it is not clear how to define such symmetric designs, e.g., if the optimal ability levels are too close to each other. To avoid possibly inefficient ad hoc solutions, we have used restricted optimal designs to calibrate new items, where we used optimal intervals instead of points to sample the examinees from the population.
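The two-point rule above is easy to evaluate; a minimal sketch for the three items used in the examples (note that 1.543 is the rounded constant from the text, so results agree with the reported levels only to about three decimals):

```python
def unrestricted_optimal_abilities(a: float, b: float) -> tuple[float, float]:
    """Unrestricted D-optimal ability levels theta = b -/+ 1.543/a for a
    2PL item with discrimination a and difficulty b (equal weights)."""
    return b - 1.543 / a, b + 1.543 / a

# Items 1-3 from the examples (a, b):
for a, b in [(1.0, 0.5), (1.5, -1.2), (1.6, 2.0)]:
    lo, hi = unrestricted_optimal_abilities(a, b)
    print(f"a={a}, b={b}: theta = {lo:.3f}, {hi:.3f}")
```

For Item 1 this reproduces the design points \(-1.043\) and 2.043 quoted in Sect. 4.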
In this paper, we derived locally optimal designs. Their quality might depend on the quality of the prior guess about the item parameters. If the true item parameters differ only a little from the prior guess, we have observed robustness; however, if the difference is large, the locally optimal design might be a bad choice. Therefore, alternatives to local optimality, such as Bayesian or maximin optimality, can be applied, see e.g., Atkinson et al. (2007), Chapters 17 and 18. Combining these general optimal design approaches with the restricted optimality considered here could be an area of future research.
Further, an opportunity in computerized calibration is to re-estimate the item parameters from the ongoing calibration and to apply a sequential optimal design, see Lu (2014), van der Linden and Ren (2015) and Ren et al. (2017). This sequential approach and the Bayesian (or minimax) approach can also be combined. However, in some tests, e.g., the Swedish Scholastic Assessment Test, all examinees are tested more or less simultaneously. We therefore think that if calibration items are added to tests in which all examinees are tested more or less in parallel, a minimax or Bayesian approach should be used in a non-sequential context.
In this manuscript, we assume that the abilities of the examinees are well determined in the operational part of the test before it is decided which item they calibrate. We ignore here the fact that we use estimated rather than true abilities and that there is some uncertainty around the estimates: The examinee might be a bit better or worse than the estimated ability (the examinee might have had bad or good luck in the examination). However, the abilities should be reasonably well estimated if the operational part of the achievement test is large and the calibration items are added toward the end of this test. Note that Ren et al. (2017) suggested seeding the new items into the final part of the test and He et al. (2019) concluded that in their situation even middle positions worked equally well. Nevertheless, for handling the uncertainty in abilities, it is conceptually possible to use the restricted optimal design approach described here in connection with posterior distributions of abilities [see e.g., Section 2.1 of Ren et al. (2017)] rather than point estimates.
While the theory we developed applies generally to item response models and to convex and differentiable optimality criteria, we have considered in the examples a two-parameter logistic model together with D-optimality. It might be interesting to explore the structure of optimal designs for other models. For example, including a third parameter that models a guessing probability has been advocated in the context of achievement tests, see e.g., van der Linden and Ren (2015). Further, the examinees' abilities might not be adequately characterized by a one-dimensional ability parameter; then a multidimensional IRT model might be considered. Optimal restricted designs for these models will be considered in future research, where other optimality criteria will be considered as well.
Finally, we assumed in the Equivalence Theorem for Item Calibration that each examinee can calibrate at most one item. We described how this can be applied in a situation where everyone calibrates several items. This leads, however, to an optimal design under a blocking restriction. When there is no content-related reason for a specific blocking and the blocks are created randomly, it might be desirable to improve the design even more and to drop the blocking restriction. An extension of the Equivalence Theorem such that the optimization can be done without the blocking restriction is a task for future research.
Acknowledgements
We would like to thank Professor Daniel Thorburn from our department for discussions about this topic and two anonymous referees for their comments, which improved our manuscript.
References
 Abdelbasit, K. M., & Plackett, R. L. (1983). Experimental design for binary data. Journal of the American Statistical Association, 78(381), 90–98.
 Atkinson, A. C., Donev, A. N., & Tobias, R. D. (2007). Optimum experimental designs, with SAS. Oxford: Oxford University Press.
 Berger, M. P. F. (1992). Sequential sampling designs for the two-parameter item response theory model. Psychometrika, 57(4), 521–538.
 Berger, M. P. F., King, C. J., & Wong, W. K. (2000). Minimax D-optimal designs for item response theory models. Psychometrika, 65(3), 377–390.
 Berger, M. P. F., & Wong, W. K. (2009). An introduction to optimal designs for social and biomedical research (Vol. 83). Chichester: Wiley.
 Borchers, H. W. (2013). nloptwrap: Wrapper for package nloptr. R package version 0.5-1. http://CRAN.R-project.org/package=nloptwrap. Accessed 30 July 2018.
 Buyske, S. (1998). Optimal design for item calibration in computerized adaptive testing: The 2PL case. In N. Flournoy, W. F. Rosenberger, & W. K. Wong (Eds.), New developments and applications in experimental design. Berkeley: Institute of Mathematical Statistics.
 Buyske, S. (2005). Optimal designs in educational testing. In M. P. F. Berger & W. K. Wong (Eds.), Applied optimal designs. Chichester: Wiley.
 Chang, Y.-C. I., & Lu, H.-Y. (2010). Online calibration via variable length computerized adaptive testing. Psychometrika, 75(1), 140–157.
 He, Y., Chen, P., & Li, Y. (2019). New efficient and practicable adaptive designs for calibrating items online. Applied Psychological Measurement. https://doi.org/10.1177/0146621618824854.
 Jones, D. H., & Jin, Z. (1994). Optimal sequential designs for on-line item estimation. Psychometrika, 59(1), 59–75.
 Kiefer, J., & Wolfowitz, J. (1960). The equivalence of two extremum problems. Canadian Journal of Mathematics, 12(5), 363–365.
 Lu, H.-Y. (2014). Application of optimal designs to item calibration. PLoS ONE, 9(9), e106747.
 Pukelsheim, F. (2006). Optimal design of experiments. Philadelphia: SIAM.
 Ren, H., van der Linden, W. J., & Diao, Q. (2017). Continuous online item calibration: Parameter recovery and item utilization. Psychometrika, 82(2), 498–522.
 Sahm, M., & Schwabe, R. (2000). A note on optimal bounded designs. In A. Atkinson, B. Bogacka, & A. A. Zhigljavsky (Eds.), Optimum Design 2000 (pp. 131–140). Dordrecht: Kluwer Academic Publishers.
 Silvey, S. D. (1980). Optimal design. Monographs on applied probability and statistics (1st ed., Vol. 1). London: Chapman and Hall.
 Stocking, M. L. (1988). Scale drift in on-line calibration. New York: Wiley Online Library.
 Universitets- och högskolerådet (2019). Studera.nu. Provdagen – så fungerar det. http://www.studera.nu/hogskoleprov/infor-hogskoleprovet/provdagen-sa-fungerar-det/. Accessed 30 July 2018.
 van der Linden, W. J., & Ren, H. (2015). Optimal Bayesian adaptive design for test-item calibration. Psychometrika, 80(2), 263–288.
 Whittle, P. (1973). Some general points in the theory of optimal experimental design. Journal of the Royal Statistical Society. Series B (Methodological), 35(1), 123–130.
 Wynn, H. P. (1982). Optimum submeasures with applications to finite population sampling. Statistical decision theory and related topics III (pp. 485–495). New York: Academic Press.
 Zheng, Y. (2014). New methods of online calibration for item bank replenishment. PhD thesis, University of Illinois at Urbana-Champaign.
 Zhu, R. (2006). Implementation of optimal design for item calibration in computerized adaptive testing (CAT). PhD thesis, Champaign, IL, USA.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.