1 Introduction

Computerized classification tests, like computerized adaptive tests, adapt some test elements to the student. Test length and/or item selection is tailored during test administration. The goal of a computerized classification test is to classify the examinee into one of two levels (e.g., master/nonmaster) or into one of multiple levels (e.g., basic/proficient/advanced). Classification methods stop testing when enough confidence has been established to make the decision. An item selection method selects items for the examinee on the fly.

Computerized classification testing was developed several decades ago for tests intended to measure one construct or dimension. Unidimensional item response theory (UIRT) is used to model the student’s performance, to select the items, to decide whether testing can be stopped, and to make the classification decisions. More recently, classification methods were developed for tests measuring multiple constructs or dimensions (i.e., multidimensionality; Seitz and Frey 2013a, b; Spray et al. 1997; Van Groen et al. 2014b, c, 2016) using multidimensional item response theory (MIRT). Classification decisions can be made for the entire test or for one or more dimensions or subsets of items.

Two types of multidimensionality (Wang and Chen 2004) are discussed in the next section. In the case of between-item multidimensionality, each item is intended to measure just one ability, whereas the test as a whole measures multiple abilities. For example, a test contains items for mathematics and language, and each item measures either mathematics or language ability. In the case of within-item multidimensionality, items are intended to measure multiple abilities. Here, items measure both mathematics and language. The relative contribution of language and mathematics must vary over items; if not, no distinction can be made between the dimensions, and the test should be modeled using UIRT. In what follows, adaptations of two unidimensional classification methods to multidimensional classification testing are discussed. These methods are based on Wald’s (1947/1973) sequential probability ratio test (SPRT) and Kingsbury and Weiss’s (1979) confidence interval method. Subsequently, three item selection methods are discussed. The efficiency and accuracy of the classification and item selection methods are investigated in the simulations section. Limitations and recommendations for further research are discussed in the final section.

2 Multidimensional Item Response Theory

Multidimensional computerized classification testing requires a statistical framework to model the student’s performance, to obtain item parameters, to select the items, to make the classification decisions, and to decide whether testing can be stopped before reaching the maximum test length. MIRT (Reckase 2009) provides such a framework. In a calibrated item bank, model fit is established, item parameter estimates are available, and items with undesired characteristics are removed (Van Groen et al. 2014a). During testing, it is assumed that the item parameters have been estimated with enough precision to consider them known (Veldkamp and Van der Linden 2002).

A vector \(\varvec{\theta }\) of p person abilities is used in MIRT to describe the skills and knowledge required for answering an item (Reckase 2009). The dichotomous two-parameter logistic model is used here. The probability of a correct answer, \(x_i=1\), to item i is given by Reckase (2009)

$$\begin{aligned} \mathrm {P}_i(\varvec{\theta }) =\mathrm {P}_i\left( x_i=1|\mathbf {a}_i,d_i,\varvec{\theta }\right) =\frac{\exp (\mathbf {a}_i^\prime \varvec{\theta }+d_i)}{1+\exp (\mathbf {a}_i^\prime \varvec{\theta }+d_i)}, \end{aligned}$$
(14.1)

where \(\mathbf {a}_i\) is the vector of discrimination parameters and \(d_i\) is the easiness parameter of item i.
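
To make the model concrete, a minimal Python sketch of Eq. 14.1 follows. The function name and the example parameter values are illustrative, not taken from the chapter.

```python
import numpy as np

def prob_correct(a, d, theta):
    """P(x_i = 1 | a_i, d_i, theta) for the compensatory M2PL model (Eq. 14.1).

    a     : (p,) vector of discrimination parameters for item i
    d     : scalar easiness parameter for item i
    theta : (p,) ability vector
    """
    z = a @ theta + d
    return 1.0 / (1.0 + np.exp(-z))

# Example: a two-dimensional item answered by a student with theta = (0.5, -0.2)
print(prob_correct(np.array([1.2, 0.0]), 0.3, np.array([0.5, -0.2])))
```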

Ability estimates can be used during testing by the item selection and classification methods in computerized classification testing. Ability is estimated using the likelihood function. The likelihood of a vector of observed responses \(\varvec{x}_j=(x_{1j},\cdots ,x_{kj})\) to items \(i=1,\cdots ,k\) for an examinee j with ability \(\varvec{\theta }_j\) equals the product of the probabilities of the responses to the administered items (Segall 1996):

$$\begin{aligned} \text {L}\left( \varvec{\theta }_j|\varvec{x}_j\right) =\prod _{i=1}^k \mathrm {P}_i\left( \varvec{\theta }_j\right) ^{x_{ij}}\mathrm {Q}_i\left( \varvec{\theta }_j\right) ^{1-x_{ij}}, \end{aligned}$$
(14.2)

where \(\mathrm {Q}_i(\varvec{\theta }_j)=1-\mathrm {P}_i(\varvec{\theta }_j)\). This likelihood can be used due to the local independence assumption, and it uses the fixed item parameters from the item bank.

The vector of values, \(\hat{\varvec{\theta }}=(\hat{\theta }_1, \cdots ,\hat{\theta }_p)\), that maximizes Eq. 14.2 is used as the estimate of \(\varvec{\theta }_j\) (Segall 1996). Because the equations used to find the maximum likelihood (ML) estimates have no closed-form solution, an iterative search procedure, such as Newton-Raphson, is used. Weighted maximum likelihood estimation (Tam 1992; Van Groen et al. 2016; Warm 1989) reduces the bias in the ML estimates. Alternatively, Bayesian ability estimation approaches are also available (Segall 1996).
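
A sketch of ML estimation under Eq. 14.2 is given below. The chapter refers to Newton-Raphson; for brevity, this sketch substitutes a general-purpose optimizer, and all item parameters and responses are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, A, d, x):
    """Negative log of Eq. 14.2. A: (k, p) discriminations; d: (k,) easiness; x: (k,) 0/1 responses."""
    p = 1.0 / (1.0 + np.exp(-(A @ theta + d)))
    p = np.clip(p, 1e-10, 1 - 1e-10)  # guard against log(0) during the search
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Invented between-item example: items 1-2 load on dimension 1, items 3-4 on dimension 2.
A = np.array([[1.0, 0.0], [1.2, 0.0], [0.0, 0.8], [0.0, 1.0]])
d = np.array([0.2, -0.3, 0.1, -0.2])
x = np.array([1, 0, 1, 0])

theta_hat = minimize(neg_log_likelihood, x0=np.zeros(2), args=(A, d, x)).x
print(theta_hat)  # ML estimate of the two abilities
```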

The two types of multidimensionality can be distinguished by the structure of their item parameters. If more than one discrimination parameter is nonzero for one or more items, within-item multidimensionality is present (Wang and Chen 2004), and items are intended to measure multiple (or all) abilities. Using a within-item multidimensional model, complex domains can be modeled while taking into account several abilities simultaneously (Hartig and Höhler 2008). Different combinations of abilities can be represented for different items (Hartig and Höhler 2008). If just one discrimination parameter is nonzero per item in the test, the test is considered to have between-item multidimensionality (Wang and Chen 2004). Items are then intended to measure just one ability, and the test contains several unidimensional subscales (Hartig and Höhler 2008).
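
The distinction is easy to check programmatically. A small sketch with invented discrimination matrices:

```python
import numpy as np

# Between-item: every row (item) has exactly one nonzero discrimination.
A_between = np.array([[1.2, 0.0],
                      [0.0, 0.9],
                      [0.7, 0.0]])

# Within-item: at least one item loads on more than one dimension.
A_within = np.array([[1.2, 0.4],
                     [0.3, 0.9],
                     [0.7, 0.0]])

def is_between_item(A):
    """True if every item measures exactly one dimension."""
    return bool(np.all((A != 0).sum(axis=1) == 1))

print(is_between_item(A_between), is_between_item(A_within))  # True False
```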

3 Classification Methods

Multidimensional classification testing requires a method that decides whether enough evidence is available to make a classification decision (i.e., testing is stopped and a decision is made). The decision can be a classification into one of two levels (e.g., master/nonmaster) or into one of multiple levels (e.g., basic/proficient/advanced). Two well-known unidimensional methods, the SPRT (Eggen 1999; Reckase 1983; Spray 1993) and the confidence interval method (Kingsbury and Weiss 1979), were adapted to multidimensional tests. Both methods stop testing when a prespecified amount of confidence has been established to make the classification decision, but the way the decision is made differs. The methods can make a decision based on the entire test and all dimensions, or based on parts of the tests or dimensions. Different implementations of the methods are required for between-item and within-item multidimensionality.

3.1 The SPRT for Between-Item Multidimensionality

The SPRT (Wald 1947/1973) was applied to unidimensional classification testing by Eggen (1999), Reckase (1983), and Spray (1993). Seitz and Frey (2013a) applied the SPRT to between-item multidimensionality by making a classification decision per dimension. Van Groen et al. (2014b) extended the SPRT to make classification decisions on the entire between-item multidimensional test.

3.1.1 The SPRT for Making a Decision Per Dimension

When making a decision per dimension, a cutoff point, \(\theta _{cl}\), is set for each dimension l, \(l=1,...,p\) (Seitz and Frey 2013a). The cutoff point is used by the SPRT to make the classification decision. Indifference regions are set around the cutoff points and account for the measurement error in decisions for examinees with an ability close to the cutoff point (Eggen 1999). The SPRT compares two hypotheses for each dimension (Seitz and Frey 2013a):

$$\begin{aligned} \mathrm {H}_{0l}:\theta _{jl}<\theta _{cl}-\delta ,\end{aligned}$$
(14.3)
$$\begin{aligned} \mathrm {H}_{al}:\theta _{jl}>\theta _{cl}+\delta , \end{aligned}$$
(14.4)

where \(\delta \) is the distance between the cutoff point and the end of the indifference region.

For the SPRT, the likelihood ratio between the two hypotheses for dimension l is calculated after each item administration (Seitz and Frey 2013a). After k items have been administered, the ratio is given by

$$\begin{aligned} \text {LR}\left( \theta _{cl}+\delta ;\theta _{cl}-\delta \right) =\frac{\text {L}\left( \theta _{cl}+\delta ;\mathbf {x}_{jl}\right) }{\text {L}\left( \theta _{cl}-\delta ;\mathbf {x}_{jl}\right) },\quad l=1,\cdots ,p, \end{aligned}$$
(14.5)

in which \(\text {L}\left( \theta _{cl}+\delta ;\mathbf {x}_{jl}\right) \) and \(\text {L}\left( \theta _{cl}-\delta ;\mathbf {x}_{jl}\right) \) are calculated using Eq. 14.2 with those items included that load on dimension l.

Decision rules are then applied to the likelihood ratio to decide whether to continue testing or to make a classification decision (Seitz and Frey 2013a; Van Groen et al. 2014b):

$$\begin{aligned}&\mathrm {administer\ another\ item\ if}&\beta /(1-\alpha )<\text {LR}(\theta _{cl}+\delta ;\theta _{cl}-\delta )<(1-\beta )/\alpha ;\nonumber \\&\mathrm {ability\ below\ }\theta _{cl}\mathrm {\ if}&\text {LR}(\theta _{cl}+\delta ;\theta _{cl}-\delta )\le \beta /(1-\alpha );\nonumber \\&\mathrm {ability\ above\ }\theta _{cl}\mathrm {\ if}&\text {LR}(\theta _{cl}+\delta ;\theta _{cl}-\delta )\ge (1-\beta )/\alpha , \end{aligned}$$
(14.6)

where \(\alpha \) and \(\beta \) specify the acceptable classification error rates (Spray et al. 1997). Previous research has shown that the size of \(\alpha \), \(\beta \), and sometimes also \(\delta \) has a limited effect on classification accuracy (Van Groen et al. 2014a, c). When classifications need to be made into one of multiple levels, the decision rules are applied sequentially for each of the cutoffs until a decision can be made (Eggen and Straetmans 2000; Van Groen et al. 2014b).
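
A minimal sketch of the per-dimension SPRT (Eqs. 14.5 and 14.6) follows; all parameter values and responses are invented.

```python
import numpy as np

def likelihood_1d(theta, a, d, x):
    """Likelihood of responses x at ability theta (Eq. 14.2, one dimension)."""
    p = 1.0 / (1.0 + np.exp(-(a * theta + d)))
    return np.prod(p**x * (1 - p)**(1 - x))

def sprt_decision(a, d, x, cutoff, delta=0.4, alpha=0.05, beta=0.05):
    """Eqs. 14.5-14.6: returns 'continue', 'below', or 'above' for one dimension."""
    lr = (likelihood_1d(cutoff + delta, a, d, x)
          / likelihood_1d(cutoff - delta, a, d, x))
    if lr <= beta / (1 - alpha):
        return "below"
    if lr >= (1 - beta) / alpha:
        return "above"
    return "continue"

# Invented parameters for the items administered so far on dimension l:
a = np.array([1.0, 1.2, 0.8])
d = np.array([0.2, -0.1, 0.4])
print(sprt_decision(a, d, x=np.array([1, 1, 0]), cutoff=0.0))
```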

3.1.2 The SPRT for Making a Decision on All Dimensions

In many testing situations, a pass/fail decision is required on the entire test in addition to classifications on dimensions of the test. The likelihood then includes all dimensions and items (Van Groen et al. 2014b):

$$\begin{aligned} \text {LR}\left( \varvec{\theta }_c+\varvec{\delta };\varvec{\theta }_c-\varvec{\delta }\right) =\frac{\text {L}\left( \varvec{\theta }_c+\varvec{\delta };\mathbf {x}_j\right) }{\text {L}\left( \varvec{\theta }_c-\varvec{\delta };\mathbf {x}_j\right) }, \end{aligned}$$
(14.7)

where \(\varvec{\theta }_c\) and \(\varvec{\delta }\) include all dimensions. The decision rules for the entire test then become (Van Groen et al. 2014b)

$$\begin{aligned}&\mathrm {administer\ another\ item\ if}&\beta /(1-\alpha )<\text {LR}(\varvec{\theta }_c+\varvec{\delta };\varvec{\theta }_c-\varvec{\delta })<(1-\beta )/\alpha ;\nonumber \\&\mathrm {ability\ below\ }\varvec{\theta }_c\mathrm {\ if}&\text {LR}(\varvec{\theta }_c+\varvec{\delta };\varvec{\theta }_c-\varvec{\delta })\le \beta /(1-\alpha );\nonumber \\&\mathrm {ability\ above\ }\varvec{\theta }_c\mathrm {\ if}&\text {LR}(\varvec{\theta }_c+\varvec{\delta };\varvec{\theta }_c-\varvec{\delta })\ge (1-\beta )/\alpha . \end{aligned}$$
(14.8)

The SPRT can also make decisions based on a subset of the dimensions or a subset of the items (Van Groen et al. 2014b). This implies that only the selected dimensions and items are included in the likelihoods. This makes it possible to include an item in several decisions.

3.2 The Confidence Interval Method for Between-Item Multidimensionality

An alternative unidimensional classification method, developed by Kingsbury and Weiss (1979), uses the confidence interval surrounding the ability estimate. The method stops testing as soon as the cutoff point falls outside the confidence interval. Seitz and Frey (2013b) applied this method to tests with between-item multidimensionality to make a classification decision per dimension. Van Groen et al. (2014c) extended the method to make decisions based on some or all dimensions within the test.

3.2.1 The Confidence Interval Method for Making a Decision Per Dimension

Seitz and Frey (2013b) used the fact that the likelihood function of a between-item multidimensional test reduces to a unidimensional function for each dimension. This unidimensional likelihood function is used to make the classification decisions. The method uses the following decision rules (Van Groen et al. 2014c):

$$\begin{aligned}&\mathrm {administer\ another\ item\ if}&\hat{\theta }_{jl}-\gamma \cdot \text {se}(\hat{\theta }_{jl})<\theta _{cl}<\hat{\theta }_{jl}+\gamma \cdot \text {se}(\hat{\theta }_{jl});\nonumber \\&\mathrm {ability\ below\ }\theta _{cl}\mathrm {\ if}&\hat{\theta }_{jl}+\gamma \cdot \text {se}(\hat{\theta }_{jl})<\theta _{cl};\\&\mathrm {ability\ above\ }\theta _{cl}\mathrm {\ if}&\hat{\theta }_{jl}-\gamma \cdot \text {se}(\hat{\theta }_{jl})>\theta _{cl}\nonumber , \end{aligned}$$
(14.9)

where \(\hat{\theta }_{jl}\) is the examinee’s current ability estimate for dimension l, \(\gamma \) is a constant related to the required accuracy (Eggen and Straetmans 2000), and \(\text {se}(\hat{\theta }_{jl})\) is the estimate’s standard error (Hambleton et al. 1991):

$$\begin{aligned} \text {se}(\hat{\theta }_{jl})=\frac{1}{\sqrt{\text {I}(\hat{\theta }_{jl})}}, \end{aligned}$$
(14.10)

where \(\text {I}(\hat{\theta }_{jl})\) is the Fisher information available to estimate \(\theta _{jl}\) (Mulder and Van der Linden 2009). Fisher information is given by Tam (1992)

$$\begin{aligned} \mathrm {I}(\hat{\theta }_{jl}) =\sum _{i=1}^k a_{il}^2\mathrm {P}_i(\hat{\theta }_{jl})\mathrm {Q}_i(\hat{\theta }_{jl}). \end{aligned}$$
(14.11)
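
A sketch of this stopping rule follows, combining Eqs. 14.9–14.11 for a single dimension. The item parameters are invented, and \(\gamma = 1.96\) corresponds to a 95% confidence interval.

```python
import numpy as np

def ci_decision(theta_hat, a, d, cutoff, gamma=1.96):
    """Eqs. 14.9-14.11 for one dimension; gamma = 1.96 gives a 95% interval."""
    p = 1.0 / (1.0 + np.exp(-(a * theta_hat + d)))
    info = np.sum(a**2 * p * (1 - p))   # Fisher information, Eq. 14.11
    se = 1.0 / np.sqrt(info)            # standard error, Eq. 14.10
    if theta_hat + gamma * se < cutoff:
        return "below"
    if theta_hat - gamma * se > cutoff:
        return "above"
    return "continue"

a = np.array([1.0, 1.2, 0.8])
d = np.array([0.2, -0.1, 0.4])
print(ci_decision(theta_hat=0.9, a=a, d=d, cutoff=0.0))
```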

3.2.2 The Confidence Interval for Making a Decision on All Dimensions

Seitz and Frey’s (2013b) application of the confidence interval method can be used to make decisions per dimension, but not to make decisions on the entire test. An approach developed by Van der Linden (1999) can be used as a starting point for applying the confidence interval method to between-item multidimensional classification testing (Van Groen et al. 2014c).

Van der Linden (1999) considered the parameter of interest to be a linear combination of the abilities, \(\varvec{\lambda }^\prime \varvec{\theta }\), where \(\varvec{\lambda }\) is a vector of p nonnegative weights. In contrast to Van Groen et al. (2014c), we set each element of \(\varvec{\lambda }\) equal to the proportion of items intended to measure the corresponding dimension at maximum test length. The composite ability is given by Van der Linden (1999)

$$\begin{aligned} \zeta =\sum _{l=1}^p\theta _l\lambda _l. \end{aligned}$$
(14.12)

The confidence interval method requires the standard error of the composite ability estimate (Yao 2012):

$$\begin{aligned} \text {se}(\zeta )=\sqrt{\text {V}(\zeta )}, \end{aligned}$$
(14.13)

with variance

$$\begin{aligned} \text {V}(\zeta )=\varvec{\lambda }^\prime \text {V}(\varvec{\theta })\varvec{\lambda }, \end{aligned}$$
(14.14)

and \(\text {V}(\varvec{\theta })= \text {I}(\varvec{\theta })^{-1}\). Because of the between-item structure, the off-diagonal elements of the information matrix \(\text {I}(\varvec{\theta })\) are zero, and the diagonal elements can be calculated using Eq. 14.11.

The decision rules are (Van Groen et al. 2014c)

$$\begin{aligned}&\mathrm {administer\ another\ item\ if}&\hat{\zeta }_j-\gamma \cdot \text {se}(\hat{\zeta }_j)<\zeta _c<\hat{\zeta }_j+\gamma \cdot \text {se}(\hat{\zeta }_j);\nonumber \\&\mathrm {ability\ below\ }\zeta _c\mathrm {\ if}&\hat{\zeta }_j+\gamma \cdot \text {se}(\hat{\zeta }_j)<\zeta _c;\\&\mathrm {ability\ above\ }\zeta _c\mathrm {\ if}&\hat{\zeta }_j-\gamma \cdot \text {se}(\hat{\zeta }_j)>\zeta _c\nonumber , \end{aligned}$$
(14.15)

where \(\zeta _c\) denotes the cutoff point and \(\hat{\zeta }_j\) is the estimated composite for the student (Van Groen et al. 2014c). The cutoff point is determined with Eq. 14.12 using the cutoff points for the dimensions. If a decision on several, but not all, dimensions simultaneously is required, only those dimensions should be included in the equations.
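
A sketch of the composite computation (Eqs. 14.12–14.14) for a between-item test follows. The weights, estimates, and information values are invented; the diagonal variance matrix reflects the between-item structure noted above.

```python
import numpy as np

lam = np.array([0.5, 0.3, 0.2])          # weights: intended item proportions per dimension
theta_hat = np.array([0.4, -0.1, 0.8])   # per-dimension ability estimates
info_diag = np.array([6.2, 4.1, 3.5])    # I(theta_hat_l) per dimension, via Eq. 14.11

zeta_hat = lam @ theta_hat               # composite estimate, Eq. 14.12
V_theta = np.diag(1.0 / info_diag)       # V(theta) = I(theta)^{-1}, diagonal here
se_zeta = np.sqrt(lam @ V_theta @ lam)   # Eqs. 14.13-14.14
print(zeta_hat, se_zeta)
```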

3.3 The SPRT for Within-Item Multidimensionality

The SPRT can also be used for tests with within-item multidimensionality. This does require the computation of the so-called reference composite (Van Groen et al. 2016). The reference composite reduces the multidimensional space to a unidimensional line (Reckase 2009; Wang 1985, 1986). This line is then used to make the classification decision (Van Groen et al. 2016).

The reference composite describes the characteristics of the matrix of discrimination parameters for the items in the item bank (Reckase 2009) or test. The direction of the line is given by the eigenvector of the \(\mathbf {a}^\prime \mathbf {a}\) matrix that corresponds to the largest eigenvalue of this matrix (Reckase 2009). The elements of this eigenvector determine the direction cosines, \(\cos \alpha _{\xi l}\), of the angles \(\alpha _{\xi l}\) between the reference composite and the dimension axes.

\(\varvec{\theta }\)-points can be projected onto the reference composite (Reckase 2009). A higher value on the reference composite, \(\xi _j\), denotes a more proficient student than does a lower value (Van Groen et al. 2014c). Proficiency on the reference composite can be calculated using an additional line through the \(\varvec{\theta }_j\)-point and the origin. The length of this line is given by Reckase (2009)

$$\begin{aligned} \text {L}_j=\sqrt{\sum _{l=1}^p\hat{\theta }_{jl}^2}, \end{aligned}$$
(14.16)

and the direction cosines, \(\alpha _{jl}\), for this line and dimension axis, l, are calculated using (Reckase 2009)

$$\begin{aligned} \cos \alpha _{jl}=\frac{\hat{\theta }_{jl}}{L_j},\quad l=1,\cdots ,p. \end{aligned}$$
(14.17)

The angle, \(\alpha _{j\xi }=\alpha _{jl}-\alpha _{\xi l}\), between the reference composite and the student’s line is used to calculate the estimated proficiency, \(\hat{\xi }_j\), on the reference composite:

$$\begin{aligned} \hat{\xi }_j=L_j\cos \alpha _{j\xi }. \end{aligned}$$
(14.18)
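
A sketch of the reference composite and the projection in Eqs. 14.16–14.18 follows. Because the direction cosines form a unit vector, the projection reduces to an inner product; the discrimination matrix is invented for illustration.

```python
import numpy as np

# Invented k x p discrimination matrix of a within-item bank.
A = np.array([[1.2, 0.4],
              [0.8, 0.9],
              [0.5, 1.1]])

# Direction cosines of the reference composite: principal eigenvector of A'A.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
cos_alpha_xi = eigvecs[:, np.argmax(eigvals)]
if cos_alpha_xi.sum() < 0:               # fix the arbitrary sign of the eigenvector
    cos_alpha_xi = -cos_alpha_xi

# Eqs. 14.16-14.18 reduce to an inner product with this unit direction vector:
theta_hat = np.array([0.6, -0.2])
xi_hat = theta_hat @ cos_alpha_xi
print(cos_alpha_xi, xi_hat)
```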

The reference composite can now be used to make classification decisions with the SPRT (Van Groen et al. 2016). The cutoff point for the SPRT, \(\xi _c\), and \(\delta ^\xi \) are specified on the reference composite. The boundaries of the indifference region need to be transformed to their \(\varvec{\theta }\)-points using

$$\begin{aligned} \varvec{\theta }_{\xi _{c+\delta }}=\cos \varvec{\alpha }_\xi \times (\xi _c+\delta ^\xi );\end{aligned}$$
(14.19)
$$\begin{aligned} \varvec{\theta }_{\xi _{c-\delta }}=\cos \varvec{\alpha }_\xi \times (\xi _c-\delta ^\xi ), \end{aligned}$$
(14.20)

where \(\varvec{\alpha }_\xi \) includes the angles between the reference composite and all dimension axes. The likelihood ratio for the SPRT becomes (Van Groen et al. 2016)

$$\begin{aligned} \text {LR}\left( \varvec{\theta }_{\xi _{c+\delta }};\varvec{\theta }_{\xi _{c-\delta }}\right) =\frac{\text {L}\left( \varvec{\theta }_{\xi _{c+\delta }};\mathbf {x}_j\right) }{\text {L}\left( \varvec{\theta }_{\xi _{c-\delta }};\mathbf {x}_j\right) }, \end{aligned}$$
(14.21)

which can be used to make multidimensional classification decisions with the following decision rules (Van Groen et al. 2016):

$$\begin{aligned}&\mathrm {administer\ another\ item\ if}&\beta /(1-\alpha )<\text {LR}(\varvec{\theta }_{\xi _{c+\delta }};\varvec{\theta }_{\xi _{c-\delta }})<(1-\beta )/\alpha ;\nonumber \\&\mathrm {ability\ below\ }\xi _{c}\mathrm {\ if}&\text {LR}(\varvec{\theta }_{\xi _{c+\delta }};\varvec{\theta }_{\xi _{c-\delta }})\le \beta /(1-\alpha );\\&\mathrm {ability\ above\ }\xi _{c}\mathrm {\ if}&\text {LR}(\varvec{\theta }_{\xi _{c+\delta }};\varvec{\theta }_{\xi _{c-\delta }})\ge (1-\beta )/\alpha \nonumber . \end{aligned}$$
(14.22)

Decisions can be made using the reference composite for different subsets of items and for more than two decision levels (Van Groen et al. 2016).
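
A sketch combining Eqs. 14.19–14.22 follows; the bank, responses, and direction cosines are invented for illustration.

```python
import numpy as np

def likelihood(theta, A, d, x):
    """Multidimensional likelihood, Eq. 14.2."""
    p = 1.0 / (1.0 + np.exp(-(A @ theta + d)))
    return np.prod(p**x * (1 - p)**(1 - x))

def sprt_composite(A, d, x, cos_alpha_xi, xi_c, delta=0.4, alpha=0.05, beta=0.05):
    theta_hi = cos_alpha_xi * (xi_c + delta)   # Eq. 14.19
    theta_lo = cos_alpha_xi * (xi_c - delta)   # Eq. 14.20
    lr = likelihood(theta_hi, A, d, x) / likelihood(theta_lo, A, d, x)  # Eq. 14.21
    if lr <= beta / (1 - alpha):
        return "below"
    if lr >= (1 - beta) / alpha:
        return "above"
    return "continue"

A = np.array([[1.2, 0.4], [0.8, 0.9], [0.5, 1.1]])
d = np.array([0.1, -0.2, 0.3])
cos_alpha_xi = np.array([0.62, 0.78])          # illustrative direction cosines
print(sprt_composite(A, d, np.array([1, 0, 1]), cos_alpha_xi, xi_c=0.0))
```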

3.4 The Confidence Interval Method for Within-Item Multidimensionality

The confidence interval method (Kingsbury and Weiss 1979) can also be used for tests with within-item multidimensionality. Again, the reference composite is used to make a classification decision (Van Groen et al. 2014c).

To determine whether testing can be stopped after an item is administered, the examinee’s ability is estimated after each item. This estimate is projected onto the reference composite. The proficiency on the reference composite is then transformed to the corresponding point in the multidimensional space (Van Groen et al. 2014c):

$$\begin{aligned} \varvec{\theta }_{\hat{\xi }_j}=\cos \varvec{\alpha }_\xi \times \hat{\xi }_j. \end{aligned}$$
(14.23)

The reference composite is considered to be a combination of abilities. The weights between the abilities, \(\varvec{\lambda }_\xi \), are based on the angles between the reference composite and the dimension axes; \(\lambda _{\xi l}=1/\alpha _{\xi l}\) (Van Groen et al. 2014c).

The standard error for the confidence interval method is given by Yao (2012)

$$\begin{aligned} \mathrm {se}(\xi )=\sqrt{\mathrm {V}(\xi )}, \end{aligned}$$
(14.24)

with

$$\begin{aligned} \mathrm {V}(\xi )=\varvec{\lambda }^{\prime }_\xi \mathrm {V}(\varvec{\theta }_\xi )\varvec{\lambda }_\xi . \end{aligned}$$
(14.25)

The variance at \(\varvec{\theta }_{\hat{\xi }_j}\) is approximated by the inverse of the information matrix at \(\varvec{\theta }_{\hat{\xi }_j}\). The decision rules are (Van Groen et al. 2014c)

$$\begin{aligned}&\mathrm {administer\ another\ item\ if}&\hat{\xi }_j-\gamma \cdot \text {se}(\hat{\xi }_j)<\xi _c<\hat{\xi }_j+\gamma \cdot \text {se}(\hat{\xi }_j);\nonumber \\&\mathrm {ability\ below\ }\xi _c\mathrm {\ if}&\hat{\xi }_j+\gamma \cdot \text {se}(\hat{\xi }_j)<\xi _c;\\&\mathrm {ability\ above\ }\xi _c\mathrm {\ if}&\hat{\xi }_j-\gamma \cdot \text {se}(\hat{\xi }_j)>\xi _c\nonumber , \end{aligned}$$
(14.26)

where \(\xi _c\) denotes the cutoff point on the reference composite. The cutoff point is determined based on the cutoff points for each dimension. Again, decisions can be made using the reference composite for different subsets of items and for more than two decision levels (Van Groen et al. 2016).

4 Item Selection Methods

Computerized classification testing requires a method to select the items during test administration. Most methods select an optimal item given some statistical criterion. Limited knowledge is available for selecting items in multidimensional classification testing (Seitz and Frey 2013a; Van Groen et al. 2014b, c, 2016). Van Groen et al. (2014c) based their item selection on knowledge about item selection methods from unidimensional classification testing (Eggen 1999; Spray and Reckase 1994; Van Groen et al. 2014a, c) and multidimensional computerized adaptive testing for ability estimation (Reckase 2009; Segall 1996; Yao 2012).

Following Van Groen et al. (2014c), three different types of item selection methods are included here for between-item and within-item multidimensionality. In unidimensional classification testing, items are often selected that maximize information at the current ability estimate or at the cutoff point (Eggen 1999; Spray 1993; Thompson 2009). Maximization at the cutoff point becomes more complicated when multiple cutoff points are included. One of the methods available for selection with multiple cutoff points, the weighting method, is also used here (Van Groen et al. 2014a). The third item selection method selects items at random; this provides a benchmark against which the efficiency and accuracy of the more complex methods can be compared. The method that selects items based on the ability estimate and the weighting method have different implementations for between- and within-item multidimensionality.

4.1 Item Selection Methods for Between-Item Multidimensionality

Unidimensional item selection methods can be used for tests with between-item multidimensionality (Van Groen et al. 2014b), but items must be selected per dimension. Here this is accomplished using the Kingsbury and Zara (1989) approach. This approach selects the items from the dimension for which the difference between the desired and achieved percentage is the largest (Van Groen et al. 2014c) and can be applied to the three described item selection methods. The unidimensional item selection method that maximizes information at the ability estimate is described first. The unidimensional weighting method is described thereafter.
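
Before turning to those methods, a minimal sketch of this content-balancing step may help; the target shares and counts are invented.

```python
import numpy as np

def next_dimension(target_share, administered_counts):
    """Pick the dimension whose achieved share lags furthest behind its target."""
    total = administered_counts.sum()
    achieved = administered_counts / total if total > 0 else np.zeros_like(target_share)
    return int(np.argmax(target_share - achieved))

# Invented targets (50/30/20%) and counts after ten administered items:
print(next_dimension(np.array([0.5, 0.3, 0.2]), np.array([4, 3, 3])))  # -> 0
```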

4.1.1 Item Selection Using the Ability Estimate

In unidimensional classification testing, Fisher information is often maximized at the current ability estimate (Eggen 1999; Spray 1993). The aim of maximizing information at the current ability estimate is to reduce the estimate’s standard error. In between-item multidimensional testing, this boils down to (Van Groen et al. 2014b)

$$\begin{aligned} \max ~\text {I}_i(\hat{\theta }_{l}),\quad \text {for}\ i\in \text {V}_{al}, \end{aligned}$$
(14.27)

where \(V_{al}\) is the set of items available for selection for dimension l.
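
A sketch of Eq. 14.27 follows, with invented item parameters; `available` holds the indices of the not-yet-administered items on dimension l.

```python
import numpy as np

def select_max_info(theta_hat_l, a, d, available):
    """Eq. 14.27: item with maximum Fisher information at the current estimate."""
    p = 1.0 / (1.0 + np.exp(-(a[available] * theta_hat_l + d[available])))
    info = a[available]**2 * p * (1 - p)
    return available[int(np.argmax(info))]

# Invented items on dimension l:
a = np.array([0.8, 1.4, 1.0])
d = np.array([0.0, -0.3, 0.6])
print(select_max_info(theta_hat_l=0.2, a=a, d=d, available=[0, 1, 2]))  # -> 1
```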

4.1.2 Item Selection Using the Weighting Method

Information is often maximized at the cutoff point in unidimensional classification testing (Eggen 1999; Spray 1993). In between-item multidimensional tests, items can be selected that maximize the information at the cutoff point using

$$\begin{aligned} \max ~\text {I}_i(\theta _{cl}),\quad \text {for}\ i\in \text {V}_{al} \end{aligned}$$
(14.28)

as a consequence of the selection per dimension in such tests.

This method can be used if just one cutoff point is specified per dimension. Several methods have been developed for item selection with multiple cutoff points in unidimensional classification testing (Eggen and Straetmans 2000; Van Groen et al. 2014a; Wouda and Eggen 2009). The weighting method combines the objective functions per cutoff point into one weighted objective function. The weight for each cutoff point depends on the distance of the cutoff point to the current ability estimate (Van Groen et al. 2014a). As a result, the test is adapted to the individual student’s ability. The item is selected for dimension l that fulfills (Van Groen et al. 2014b)

$$\begin{aligned} \max ~\sum _{c=1}^C\frac{1}{|\hat{\theta }_{jl}-\theta _{cl}|}\text {I}_i(\theta _{cl}),\text { for }i\in \text {V}_{al}. \end{aligned}$$
(14.29)
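
A sketch of Eq. 14.29 follows. The small `eps` term is an added safeguard against a zero distance when the estimate sits exactly on a cutoff; it is not part of the chapter’s formula, and all parameters are invented.

```python
import numpy as np

def select_weighted(theta_hat_l, cutoffs, a, d, available, eps=1e-6):
    """Eq. 14.29: inverse-distance-weighted information over all cutoffs."""
    a_av, d_av = a[available], d[available]
    crit = np.zeros(len(available))
    for c in cutoffs:
        p = 1.0 / (1.0 + np.exp(-(a_av * c + d_av)))
        crit += (a_av**2 * p * (1 - p)) / (abs(theta_hat_l - c) + eps)
    return available[int(np.argmax(crit))]

a = np.array([0.8, 1.4, 1.0])
d = np.array([0.0, -0.3, 0.6])
print(select_weighted(0.2, cutoffs=[-0.5, 0.5], a=a, d=d, available=[0, 1, 2]))
```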

4.2 Item Selection Methods for Within-Item Multidimensionality

In contrast to between-item multidimensional tests, item selection for within-item multidimensional tests has to take the multidimensional structure of the test into account. This implies that multidimensional item selection methods have to be used. However, these methods were developed for multidimensional computerized adaptive testing, with the aim of obtaining an efficient and precise ability estimate. Nevertheless, Segall’s (1996) method, which selects items based on the current ability estimate, can be used for classification testing. This method was adapted by Van Groen et al. (2016) for selection at a weighted combination of the cutoff points.

4.2.1 Item Selection Using the Ability Estimate

Segall (1996) developed an item selection method focused on estimating ability as precisely as possible. Precision is reflected by the size of the confidence ellipsoid surrounding the estimate. The item is selected that results in the largest decrement of the volume of the confidence ellipsoid. The size of the ellipsoid can be approximated by the inverse of the information matrix, so the item is selected that maximizes (Segall 1996)

$$\begin{aligned} \max ~\det {\left( \sum _{i=1}^k \mathrm {I}(\hat{\varvec{\theta }}_j,x_{ij})+\mathrm {I}(\hat{\varvec{\theta }}_j,x_{k+1,j})\right) },\quad \text {for }k+1\in V_{k+1}, \end{aligned}$$
(14.30)

where \(\sum _{i=1}^k \mathrm {I}(\hat{\varvec{\theta }}_j,x_{ij})\) denotes the information that has been collected thus far and \(\mathrm {I}(\hat{\varvec{\theta }}_j,x_{k+1,j})\) is the information that a potential item provides. Because the item is selected that results in the largest determinant of the information matrix, the volume of the confidence ellipsoid is minimized (Reckase 2009).
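
A sketch of this criterion follows, using the standard M2PL result that a single item’s information matrix at \(\varvec{\theta }\) is \(\mathrm {P}_i(\varvec{\theta })\mathrm {Q}_i(\varvec{\theta })\,\mathbf {a}_i\mathbf {a}_i^\prime \); all parameters are invented.

```python
import numpy as np

def item_info_matrix(a, d, theta):
    """Information matrix of one M2PL item at theta: p(1-p) * a a'."""
    p = 1.0 / (1.0 + np.exp(-(a @ theta + d)))
    return p * (1 - p) * np.outer(a, a)

def select_d_optimal(theta_hat, A, d, administered, available):
    """Eq. 14.30: maximize the determinant of the updated information matrix."""
    I_sofar = sum(item_info_matrix(A[i], d[i], theta_hat) for i in administered)
    dets = [np.linalg.det(I_sofar + item_info_matrix(A[i], d[i], theta_hat))
            for i in available]
    return available[int(np.argmax(dets))]

A = np.array([[1.2, 0.4], [0.8, 0.9], [0.5, 1.1], [1.0, 1.0]])
d = np.array([0.1, -0.2, 0.3, 0.0])
print(select_d_optimal(np.zeros(2), A, d, administered=[0], available=[1, 2, 3]))
```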

4.2.2 Item Selection Using the Weighting Method

The weighting method for within-item multidimensional classification testing is based on maximization at a weighted combination of the cutoff points (Van Groen et al. 2016). The weights for the dimensions depend on the distance between the cutoff points and the student’s proficiency. The weighting method then selects the item using (Van Groen et al. 2016)

$$\begin{aligned} \max \sum _{c=1}^C\frac{1}{|\xi _{\hat{\theta }_j}-\xi _c|}\text {I}_i(\theta _{\xi _c}),\text { for }i\in \text {V}_{al}. \end{aligned}$$
(14.31)

5 Examples

The accuracy and efficiency of the classification and item selection methods are illustrated using two example simulation studies. The first example deals with a test in which between-item multidimensionality is present. The second example shows a test with a within-item multidimensional structure. The examples demonstrate the differences in accuracy and efficiency in specific situations with specific settings for the classification methods.

5.1 Example 1: Between-Item Multidimensionality

The End of Primary School Test (Cito 2012) in the Netherlands can be calibrated with a between-item multidimensional model. The test provides pupils with a recommendation on the most suitable level of secondary education: the reported scale score relates to a recommendation on the most appropriate level (Van Boxtel et al. 2011). The test is typically modeled using six unidimensional scales: four scales for Language (100 items), one scale for Mathematics (60 items), and one scale for Study Skills (40 items).

5.1.1 Simulation Design for Example 1

The example is based on the item parameters, ability distribution, and correlation structure from a NOHARM 4 (Fraser and McDonald 2012) calibration with 147,099 students. A dataset of 10,000 students was generated with abilities drawn from the observed correlation structure. The same dataset was used for all simulations. The generated abilities were used to determine the true classification for each student. The minimum test length was set at 20 items, with a maximum of 200 items. Three cutoffs were specified based on the 2012 ability and score distributions. Simulations were run with the SPRT and the confidence interval method. For the SPRT, the values for \(\alpha \) and \(\beta \) were set equal and varied between 0.1 and 0.2, and \(\delta \) was set to 0.4. Other values for \(\delta \) resulted in similar test lengths and accuracy. For the confidence interval method, \(\gamma \) varied between values corresponding to 90–97.5% confidence intervals for each of the individual cutoff points. Items were selected using random selection, maximization at the ability estimate, and the weighting method. The content structure of the item pool was respected in the item selection using the Kingsbury and Zara (1989) approach.
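
To make the average-test-length and proportion-of-correct-decisions bookkeeping concrete, the following is a drastically simplified stand-in for the simulation just described: one dimension, one cutoff, random selection, equal discriminations, and invented easiness parameters. It is a sketch of the general loop, not a reproduction of the actual study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 200
a = np.ones(n_items)                       # illustrative: equal discriminations
d = rng.normal(0.0, 1.0, n_items)          # illustrative easiness parameters
cutoff, delta, alpha, beta = 0.0, 0.4, 0.1, 0.1
log_lo, log_hi = np.log(beta / (1 - alpha)), np.log((1 - beta) / alpha)

correct, lengths = 0, []
for _ in range(1000):
    theta = rng.normal()                   # true ability -> true classification
    log_lr, decision, k = 0.0, None, 0
    for i in rng.permutation(n_items):     # random item selection
        k += 1
        x = rng.random() < 1.0 / (1.0 + np.exp(-(a[i] * theta + d[i])))
        p_hi = 1.0 / (1.0 + np.exp(-(a[i] * (cutoff + delta) + d[i])))
        p_lo = 1.0 / (1.0 + np.exp(-(a[i] * (cutoff - delta) + d[i])))
        log_lr += np.log(p_hi / p_lo) if x else np.log((1 - p_hi) / (1 - p_lo))
        if k >= 20 and not (log_lo < log_lr < log_hi):   # min length 20, max 200
            decision = "above" if log_lr >= log_hi else "below"
            break
    if decision is None:                   # truncation rule at maximum length
        decision = "above" if log_lr > 0 else "below"
    correct += (decision == "above") == (theta > cutoff)
    lengths.append(k)

print("ATL:", np.mean(lengths), "PCD:", correct / 1000)
```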

For simplicity and comparability with example 2, the choices made here differ from those made in actual test administration. The number of cutoffs was limited to three in example 1, whereas the actual test provides eight different recommendations and also reports at the subject level. This implies that the presented simulation results cannot be translated to actual test administration.

5.1.2 Simulation Results, Example 1

The results of the simulations for example 1 are presented in Figs. 14.1 and 14.2. The former indicates the average test length (ATL) over all students per condition. The latter denotes the proportion of correct decisions (PCD) as an indication of the accuracy of the classification decisions. The PCD compares the true classification and the decision made by the classification methods.

The plot for average test length for the SPRT shows that higher values for \(\alpha \) and \(\beta \) result in shorter tests. The plot for the confidence interval method shows that the test length increases when a wider confidence interval is required. For these settings, the SPRT results in longer tests than the confidence interval method. Random item selection results in the longest tests; maximization of information at the ability estimate results in the shortest tests.

Fig. 14.1 Average test length, example 1. Left plot: SPRT; right plot: confidence interval method. \(\delta \) = 0.4. Dots: random selection; triangles: maximization at the ability estimate; diamonds: weighting method

The plot for the proportion of correct decisions for the SPRT indicates that 80–87% of the classifications are accurate. The plot for the confidence interval method indicates that the settings for \(\gamma \) have a limited influence on classification accuracy. The plot also shows that random selection results in slightly less accurate decisions. The SPRT results in more accurate decisions than the confidence interval method.

Based on these simulations, one might conclude that the SPRT results in longer tests with more accurate decisions. However, this might be caused by the settings chosen for the SPRT and the confidence interval method in this example.

Fig. 14.2 Proportion of correct decisions, example 1. Left plot: SPRT; right plot: confidence interval method. \(\delta \) = 0.4. Dots: random selection; triangles: maximization at the ability estimate; diamonds: weighting method

5.2 Example 2: Within-Item Multidimensionality

A within-item multidimensional ACT mathematics item pool, as in Ackerman (1994) and Veldkamp and Van der Linden (2002), was used in the second example. The item parameters had been obtained previously using an orthogonal two-dimensional NOHARM II (Fraser and McDonald 1988) calibration. Items are intended to measure coordinate geometry, elementary algebra, intermediate algebra, pre-algebra, plane geometry, and trigonometry, and they require three skill categories: analysis, application, and basic skills (Veldkamp and Van der Linden 2002).

5.2.1 Simulation Design, Example 2

A multivariate standard normal ability distribution was used to generate 10,000 students. The cutoff points were specified on the reference composite based on the 33rd and 66th percentiles of the underlying dimensions. Simulations were run with \(\alpha =\beta \) between 0.025 and 0.10, and \(\delta \) was set at 0.40. \(\gamma \) varied between values corresponding to 90–97.5% confidence intervals. A maximum of 50 items was selected using random selection, maximization at the ability estimate, and the weighting method. No content restrictions were imposed in the simulations.

5.2.2 Simulation Results, Example 2

The results of the simulations for example 2 can be seen in Figs. 14.3 and 14.4. The plot for average test length for the SPRT shows that higher values for \(\alpha \) and \(\beta \) result in shorter tests. As expected, the average test length increases when a larger confidence interval is required. The plots also suggest that test length is often shorter for the confidence interval method, although this depends on the settings for the SPRT and the confidence interval method. Random selection results in the longest tests, and the weighting method results in the shortest tests. The effect of the item selection method appears to be consistent regardless of the classification method.

Fig. 14.3 Average test length, example 2. Left plot: SPRT; right plot: confidence interval method. \(\delta \) = 0.4. Dots: random selection; triangles: maximization at the ability estimate; diamonds: weighting method

The plot for the proportion of correct decisions for the SPRT indicates that the values for \(\alpha \) and \(\beta \) have little influence on the accuracy of the decisions. In contrast, the size of the confidence interval does influence the accuracy. As expected, random selection results in the least accurate decisions. The confidence interval method results in less accurate decisions than the SPRT, which might be caused by the shorter tests for the confidence interval method.

Based on the current settings for the SPRT and confidence interval method, one might conclude that the SPRT results in longer tests for this example dataset. However, those longer tests yield more accurate classification decisions. This finding clearly illustrates the well-known trade-off between accuracy and test length.

Fig. 14.4 Proportion of correct decisions, example 2. Left plot: SPRT; right plot: confidence interval method. \(\delta \) = 0.4. Dots: random selection; triangles: maximization at the ability estimate; diamonds: weighting method

6 Conclusions and Discussion

Multidimensional computerized classification testing can be used when classification decisions are required for constructs that have a multidimensional structure. Two methods for making those decisions were discussed here for two types of multidimensionality. In the case of between-item multidimensionality, each item is intended to measure just one dimension. By contrast, in the case of within-item multidimensionality, items are intended to measure multiple or all dimensions.

Wald’s (1947/1973) sequential probability ratio test has previously been applied to multidimensional classification testing by Seitz and Frey (2013a) and Van Groen et al. (2014b) for between-item multidimensionality and by Van Groen et al. (2016) for within-item multidimensionality. Kingsbury and Weiss’s (1979) confidence interval method was applied to between-item multidimensionality by Seitz and Frey (2013b) and Van Groen et al. (2014c), and to within-item multidimensionality by Van Groen et al. (2014c). The present study’s implementation of the confidence interval method for between-item multidimensionality differs from the implementation in Van Groen et al. (2014c): here, different weights were used that reflect the number of items per dimension relative to the test length.

Three methods were included for selecting the items: random item selection, maximization at the current ability estimate, and the weighting method. The last method maximizes information based on a combination of the cutoff points weighted by their distance to the ability estimate. Two examples illustrated the use of the classification and item selection methods.

Classifications were made into one of three classification levels on the entire test. Van Groen et al. (2014c) demonstrated classifications both on the entire test and on subsets of items or dimensions. They also required that all partial classification decisions be made before a classification decision could be made on the entire test. Here, only one decision was made per test.

Examples were included to demonstrate the use of the classification and item selection methods. The simulations included only two item banks, a limited number of settings for the classification methods, and a limited number of students. More thorough studies are needed in order to draw conclusions regarding the most effective settings and classification methods. Van Groen et al. (2014c) provided some further suggestions for comparing the classification methods. Moreover, only three item selection methods were included here; many more item selection methods (Reckase 2009; Segall 1996; Yao 2012) exist for multidimensional adaptive testing, and these should be investigated as well. Finally, \(\alpha \), \(\beta \), \(\delta \), and \(\gamma \) were set equal for all cutoff points. The use of different values for different cutoff points should also be explored.

Two classification methods were included here. A third unidimensional classification method, the generalized likelihood ratio test developed by Bartroff et al. (2008), has never been applied to multidimensional classification testing. Because this method is based on the sequential probability ratio test, it should be feasible to extend it to multidimensional classification testing.

Research on multidimensional classification testing is still limited to the work of Seitz and Frey (2013a, b), Spray et al. (1997), and the current authors (Van Groen et al. 2014b, c, 2016). Further studies will be required before multidimensional classification testing can be applied in practice. To date, too little is known about the classification methods, the item selection methods in this context, and the settings for the classification methods to administer multidimensional classification tests with confidence.