
In this chapter, we first outline the models used and the estimation and testing procedures employed, and then summarize the results revealed by these models.

4.1 Estimation and Testing Procedures

The procedures we used for parameter estimation and evaluation of model fit are based on marginal maximum likelihood (MML). Most of the procedures we discuss are documented in more detail elsewhere (see Bock and Aitkin 1981; Bock et al. 1988; Gibbons and Hedeker 1992; Glas 1999; Adams and Wu 2006; De Jong et al. 2007; Jennrich and Bentler 2011; Glas and Jehangir 2014). We used the public domain software package MIRT (Glas 2010) in the calculations. Additional estimation and testing procedures were used for the bi-factor model, with unidimensional models as special cases, and random item parameters as a generalization.

4.1.1 MML Estimation

The bi-factor model used in this study has two parts: a measurement model (i.e., an IRT model) and a structural model. The measurement model pertains to a polytomously-scored response of a student \( n \) to an item \( i \). The possible item scores range from 0 to \( m_{i} \), and the score of student \( n \) on item \( i \) is denoted by the variables \( x_{nij} \) (\( j = 1, \ldots, m_{i} \)), where \( x_{nij} = 1 \) if the response is in category \( j \) and zero otherwise. Note that \( m_{i} \) has an index \( i \), which indicates that the maximum score can differ across items.

We describe the procedure for the bi-factor model, combined with the partial credit model (PCM; Masters 1982) and generalized partial credit model (GPCM; Muraki 1992) as IRT models, since these two models were the ones we selected for the present study. However, the theory also applies to other IRT models, such as the unidimensional PCM and GPCM, the graded response model (Samejima 1969), the sequential model (Tutz 1990), and other versions of these models with random item parameters instead of fixed item parameters.

In the bi-factor GPCM, the probability of scoring in category \( j \) (\( j = 0, \ldots, m_{i} \)) is given by

$$ p_{ij} (\theta_{n} ) = p(x_{nij} = 1|\theta_{n} ,a,b) = \frac{\exp \left( \sum\limits_{h = 1}^{j} \left( a_{i0} \theta_{n0} + a_{ig(n)} \theta_{ng(n)} - b_{ih} \right) \right)}{1 + \sum\limits_{k = 1}^{m_{i}} \exp \left( \sum\limits_{h = 1}^{k} \left( a_{i0} \theta_{n0} + a_{ig(n)} \theta_{ng(n)} - b_{ih} \right) \right)} $$
(4.1)

where \( \theta_{n0} \) is the score of student \( n \) on the latent scale pertaining to all countries, \( \theta_{ng(n)} \) is the score on a country-specific latent dimension, and the index \( g(n) \) indicates the country to which student \( n \) belongs. Further, \( a_{i0} \) and \( a_{ig(n)} \) are the factor loadings of item \( i \) on these two dimensions, and \( b_{ih} \) (\( h = 1, \ldots, m_{i} \)) is the item location parameter, that is, a position on the latent scale. It is assumed that empty summations, such as a sum running from \( h = 1 \) to 0, are equal to zero, so that the numerator for category \( j = 0 \) equals one. The unidimensional GPCM lacks the country-specific dimensions \( \theta_{ng(n)} \) and the associated factor loadings \( a_{ig(n)} \). Further, the PCM is obtained by fixing all item parameters \( a_{i0} \) to one.

The formula for the response probability and subsequent derivations can be simplified by introducing the re-parametrization \( d_{ij} = \sum_{h = 1}^{j} b_{ih} \) and by defining \( a_{ig}^{t} \theta_{n} \) as the inner product of the vectors \( (a_{i0}, a_{ig(n)}) \) and \( (\theta_{n0}, \theta_{ng(n)}) \). Thus, Eq. (4.1) becomes

$$ p_{ij} (\theta_{n} ) \, = \, \frac{{\exp \left( {ja_{ig}^{t} \theta_{n} - d_{ij} } \right)}}{{1 + \sum\limits_{k = 1}^{{m_{i} }} {\exp \left( {ka_{ig}^{t} \theta_{n} - d_{ik} } \right)} }} $$
(4.2)
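The response function in Eq. (4.2) can be illustrated with a minimal Python sketch. It is not the MIRT implementation; the function and argument names (`bifactor_gpcm_probs`, `cumulative_locations`, `a_general`, `a_country`) are hypothetical and chosen only for this illustration.

```python
import math


def cumulative_locations(b):
    """Re-parametrization d_ij = sum_{h=1}^{j} b_ih for j = 1, ..., m_i."""
    d, running = [], 0.0
    for b_h in b:
        running += b_h
        d.append(running)
    return d


def bifactor_gpcm_probs(a_general, a_country, theta_general, theta_country, d):
    """Category probabilities p_i0, ..., p_im for one item, following Eq. (4.2).

    d is the list [d_i1, ..., d_im] of cumulative location parameters.
    """
    # Inner product a_ig^t theta_n of the loadings and the latent scores.
    eta = a_general * theta_general + a_country * theta_country
    # Category 0 has exponent 0, which yields the leading 1 in the denominator.
    exponents = [0.0] + [(j + 1) * eta - d_j for j, d_j in enumerate(d)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative (made-up) parameter values for a three-category item.
example = bifactor_gpcm_probs(1.2, 0.4, 0.5, -0.3, cumulative_locations([-0.5, 0.8]))
```

Setting `a_country` to zero gives the unidimensional GPCM, and additionally setting `a_general` to one gives the PCM, in line with the restrictions described above.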

The \( \theta_{0} \)-dimension is the general dimension that pertains to all countries and is the basis for the comparison of the countries. The \( \theta_{g} \)-dimensions are the country-specific dimensions, and the factor loadings on these dimensions give an indication of country-by-item interaction. It is assumed that within each country, the dimensions \( \theta_{0} \) and \( \theta_{g} \) have a bivariate normal distribution \( N(\theta_{n0} ,\theta_{ng} ;\mu_{g} ,\varSigma_{g} ) \). The country mean \( \mu_{g} \) is two-dimensional: its first component, \( \mu_{g0} \), is the country mean on the general dimension, and its second component, the mean on the country-specific dimension, is fixed at zero. The covariance matrix is given by

$$ \varSigma_{g} = \left[ {\begin{array}{*{20}c} {\sigma_{g}^{2} } & 0 \\ 0 & 1 \\ \end{array} } \right] $$

In the unidimensional GPCM and PCM, the latent student parameters \( \theta_{0} \) have a univariate normal distribution with a mean \( \mu_{g} \) and a variance \( \sigma_{g}^{2} \). Finally, random item parameters are obtained by introducing independent multivariate normal distributions on the parameters for each item (for further details, please consult De Jong et al. 2007).

The present application of the bi-factor model is not standard, but an extension of the basic model. Thus, the technical details on the estimation equations, expressions for the covariance matrix of the estimates, and tests of model fit, are also provided (see Appendix A).
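To make the term "marginal" concrete, the following sketch approximates the marginal probability of one response pattern under the unidimensional GPCM by integrating the latent variable out over its normal distribution. It is an illustration under simplifying assumptions, not the estimation procedure of Appendix A: it uses a plain equally spaced grid instead of a more efficient quadrature rule, and all names (`gpcm_probs`, `marginal_pattern_probability`, `n_points`, `width`) are hypothetical.

```python
import math


def gpcm_probs(a, theta, d):
    """Unidimensional GPCM category probabilities (Eq. 4.2 with one dimension)."""
    exponents = [0.0] + [(j + 1) * a * theta - d_j for j, d_j in enumerate(d)]
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def marginal_pattern_probability(responses, items, mu, sigma, n_points=81, width=6.0):
    """Approximate the integral of P(x | theta) * N(theta; mu, sigma^2) over theta.

    responses: observed category per item, e.g. [2, 0]
    items: list of (a_i, [d_i1, ..., d_im]) tuples, one per item
    """
    # Grid from mu - width*sigma to mu + width*sigma (an assumed integration range).
    step = 2.0 * width * sigma / (n_points - 1)
    total = 0.0
    for q in range(n_points):
        theta = mu - width * sigma + q * step
        density = math.exp(-0.5 * ((theta - mu) / sigma) ** 2) / (
            sigma * math.sqrt(2.0 * math.pi)
        )
        # Local independence: multiply the item probabilities given theta.
        likelihood = 1.0
        for score, (a, d) in zip(responses, items):
            likelihood *= gpcm_probs(a, theta, d)[score]
        total += likelihood * density * step
    return total
```

MML estimation maximizes the product of such marginal probabilities over all students with respect to the item parameters and the country means and variances; the sketch only evaluates one term of that product for given parameter values.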

4.1.2 Detection and Modeling of Differential Item Functioning

Part of the process of establishing the construct validity of a scale may consist of showing that the scale fits an IRT model. In the present study, the focus is on country-specific, or cultural, differential item functioning (CDIF). CDIF can be detected using Lagrange multiplier (LM) test statistics (Rao 1947; see also Aitchison and Silvey 1958), and it can be modeled using country-specific item parameters. Glas and Jehangir (2014) showed the feasibility of the method using PISA data, although in the slightly simpler framework of one-dimensional IRT models. The method is implemented in the public domain software package MIRT (Glas 2010). LM tests have previously been applied in IRT frameworks (Glas 1999; Glas and Falcón 2003; Glas and Dagohoy 2007). Our primary interest is not in the actual outcome of the LM test because, given the very large sample sizes in educational surveys, even the smallest model violation, that is, the smallest amount of differential item functioning (DIF), will be significant. The reason for adopting the framework of the LM test is that it clarifies the connection between the model violations and the observations and expectations used to detect DIF. Further, because it produces comprehensible and well-founded expressions for model expectations, the value of the LM test statistic can be used as a measure of the effect size of DIF, and the procedure can be easily generalized to a broad class of IRT models.

To define the test and the associated residuals, we define a background variable

$$ y_{nc} = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if person }}n{\text{ belongs to country }}c ,} \hfill \\ 0 \hfill & {{\text{if person }}n{\text{ does not belong to country }}c .} \hfill \\ \end{array} } \right. $$

The LM test targets the null-hypothesis of no DIF, namely the null-hypothesis \( \delta_{i} = 0 \), where \( \delta_{i} \) denotes a country-specific shift parameter for item \( i \) (defined formally in the generalized model below). The LM test statistic is computed using the MML estimates of the null-model, where \( \delta_{i} \) is not estimated. The test is based on the first-order derivative of the marginal likelihood with respect to \( \delta_{i} \), evaluated at \( \delta_{i} = 0 \) (see Glas 1999). If the first-order derivative at this point is large, the MML estimate of \( \delta_{i} \) is far removed from zero, and the test is significant. If the first-order derivative at this point is small, the MML estimate of \( \delta_{i} \) is probably close to zero and the test is not significant. The actual LM statistic is the squared first-order derivative divided by its estimated variance, and it has an asymptotic chi-squared distribution with one degree of freedom. However, as already discussed, the primary interest is not so much in the test itself, but in the information it provides regarding the fit between the data and the model.
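As a small numerical illustration of the one-degree-of-freedom statistic just described, the sketch below forms the squared first-order derivative divided by its variance and converts it to a p-value. It assumes the derivative and its estimated variance have already been obtained from the fitted null-model; the function name `lm_test_one_df` is hypothetical.

```python
import math


def lm_test_one_df(first_derivative, variance):
    """LM statistic and asymptotic p-value for a single restricted parameter.

    first_derivative: derivative of the marginal log-likelihood w.r.t. delta_i,
                      evaluated at delta_i = 0 (assumed to be given)
    variance: its estimated variance (assumed to be given)
    """
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    statistic = first_derivative ** 2 / variance
    # For one degree of freedom, the chi-squared upper tail probability
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

With very large samples the variance in the denominator becomes small, so even a minor shift produces a large statistic, which is why the statistic is better read as an indicator of effect size than as a significance test.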

For a general definition of the approach, which also pertains to polytomously-scored items, the covariates \( y_{nc} \) (\( c = 1, \ldots, C \)) should be defined. Special cases leading to specific DIF statistics are given later. The covariates may be separately observed person characteristics, or they may depend on the observed response pattern, provided that the response to the targeted item \( i \) is excluded.

The LM approach can be outlined using the bi-factor GPCM; the special cases for the unidimensional PCM and GPCM are obtained if the restrictions denoted above are invoked. The probability of a response is given by a generalization of the bi-factor GPCM, namely,

$$ p_{ij} (\theta_{n} ) \, = \, \frac{{\exp \left( {ja_{ig}^{t} \theta_{n} - d_{ij} + j\sum\limits_{c}^{{}} {y_{nc} \delta_{ic} } } \right)}}{{1 + \sum\limits_{k = 1}^{{m_{i} }} {\exp \left( {ka_{ig}^{t} \theta_{n} - d_{ik} + k\sum\limits_{c}^{{}} {y_{nc} \delta_{ic} } } \right)} }} $$

For one so-called reference country, all covariates \( y_{nc} \) are equal to zero. This country serves as a baseline where the bi-factor GPCM with item parameters \( a \) and \( b \) holds. For students in each of the other \( C - 1 \) countries, the covariate \( y_{nc} \) for their own country is equal to one. It can be shown (see Glas 1999) that the test statistic is based on the residuals

$$ \frac{\sum\limits_{n = 1}^{N} \sum\limits_{j = 1}^{m_{i}} y_{nc} \, j \, x_{nij}}{\sum\limits_{n = 1}^{N} y_{nc}} - \frac{\sum\limits_{n = 1}^{N} \sum\limits_{j = 1}^{m_{i}} y_{nc} \, j \, E\left( P_{ij} (\theta_{n} ) \mid x_{n} ; \lambda \right)}{\sum\limits_{n = 1}^{N} y_{nc}} $$
(4.3)

for \( c = 1, \ldots, C - 1 \). Because both terms are divided by the number of respondents \( \sum_{n} y_{nc} \) in country \( c \), the residuals are the differences between the observed and expected average item-total score in country \( c \). The residual gauges so-called uniform DIF; in other words, it indicates whether the item total function (ITF) \( \sum_{j} j P_{ij} (\theta ) \) is shifted for the item in that country, namely whether there is item-by-country interaction.
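The residual in Eq. (4.3) can be sketched as follows for a single item. The sketch assumes that the posterior expected item scores, \( \sum_{j} j\,E(P_{ij}(\theta_{n}) \mid x_{n}; \lambda) \), have already been computed for every student under the fitted null-model; computing them is the substantive part and is not shown. The function name and the country codes in the example are hypothetical.

```python
def country_residual(scores, expected_scores, countries, country):
    """Observed minus expected average item score in one country (cf. Eq. 4.3).

    scores[n]          : observed item score of student n, i.e. sum_j j * x_nij
    expected_scores[n] : posterior expected item score of student n (assumed given)
    countries[n]       : country label of student n
    """
    observed_sum = 0.0
    expected_sum = 0.0
    count = 0
    for x, e, g in zip(scores, expected_scores, countries):
        if g == country:  # corresponds to y_nc = 1
            observed_sum += x
            expected_sum += e
            count += 1
    if count == 0:
        raise ValueError("no respondents in country %r" % (country,))
    return observed_sum / count - expected_sum / count


# Made-up data: four students, two countries, item scored 0-2.
scores = [2, 1, 0, 2]
expected = [1.5, 1.0, 0.5, 1.9]
labels = ["A", "A", "B", "B"]
residual_a = country_residual(scores, expected, labels, "A")  # observed above expected
residual_b = country_residual(scores, expected, labels, "B")  # observed below expected
```

A positive residual means the item is endorsed more strongly in that country than the common item parameters predict; a negative residual means the opposite.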

The LM statistic for the null-hypothesis \( \delta_{ic} = 0 \) (\( c = 1, \ldots, C - 1 \)) is a quadratic form in the \( (C - 1) \)-dimensional vector of residuals and the inverse of their covariance matrix (for details, see Glas 1999). It has an asymptotic chi-squared distribution with \( C - 1 \) degrees of freedom.

A special case of this procedure is obtained if one country serves as the focal country and all other countries serve as reference. Then the model under the alternative hypothesis has only one additional parameter, \( \delta_{i} \), and the associated LM statistic has an asymptotic chi-squared distribution with one degree of freedom.

Items that show the worst misfit, based on their value of the LM statistic and residuals, are given country-specific item parameters. From a practical point of view, defining country-specific item parameters is equivalent to defining an incomplete design where the DIF item is split into a number of virtual items, and where each virtual item is considered as administered in a specific country. The resulting design can be analyzed using IRT software that supports the analysis of data collected in an incomplete design. We here refer to items with country-specific parameters as split items.
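The equivalence between country-specific item parameters and an incomplete design can be illustrated with a small data-restructuring sketch. It assumes responses are stored as one dictionary per student and uses `None` to mark "not administered"; the function name, the `item@country` naming scheme, and the country codes are hypothetical conventions for this example, not those of any particular IRT package.

```python
def split_item(rows, countries, item, split_countries):
    """Replace `item` by virtual items for the listed countries.

    rows            : list of dicts mapping item name -> observed score
    countries       : country label per row
    item            : name of the item that shows CDIF
    split_countries : countries that receive their own virtual copy of the item
    """
    virtual_names = {c: "%s@%s" % (item, c) for c in split_countries}
    result = []
    for row, country in zip(rows, countries):
        new_row = dict(row)
        score = new_row[item]
        # Start with every version of the item marked as not administered.
        new_row[item] = None
        for name in virtual_names.values():
            new_row[name] = None
        # Store the observed score under the one version this student "received":
        # the country-specific copy if there is one, otherwise the common item.
        new_row[virtual_names.get(country, item)] = score
        result.append(new_row)
    return result


rows = [{"i1": 2, "i2": 1}, {"i1": 0, "i2": 2}, {"i1": 1, "i2": 0}]
design = split_item(rows, ["NLD", "DEU", "NLD"], "i2", ["NLD"])
```

In the resulting data set, the common version of the item keeps a single set of parameters for all countries that were not split, while each virtual item is estimated only from the responses in its own country.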

The method is motivated by the assumption that a substantial part of the items function the same in all countries and a limited number of items have CDIF. In the IRT model, it is assumed that all items pertain to the same latent variable θ. Items without CDIF have the same item parameters in every country. However, items with CDIF have item parameters that differ across countries. These items refer to the same latent variable θ as all the other items, but their location on the scale differs across countries. For instance, the number of cars in the family may be a good indicator of wealth, but the actual number of cars at a certain level of wealth may vary across countries, or even within countries. Having a car in the inner city of Amsterdam is clearly a sign of wealth, but, in the rural eastern part of the Netherlands, an equivalent level of wealth would probably result in the ownership of three cars.

The number of items given country-specific item parameters is a matter of choice where two considerations are relevant. First, there should remain a sufficient number of anchor items in the scale. Second, the model including the split items should fit the data. DIF statistics no longer apply to the split items. However, the fit of the item response curve of an individual item, say item i, can be evaluated using the test for non-uniform DIF described earlier, but using a model including country-specific item parameters. So, in this application too, test-score ranges are used as proxies for locations on the θ scale, and the test evaluates whether the model with the country-specific item parameters can properly predict the ITF.

4.2 Results of Modeling Country-Specific Differential Item Functioning

We here provide descriptive statistics at country level for each of the five parental involvement components under the PCM and GPCM, including sample size and estimated global reliability (Tables 4.1, 4.2, 4.3, 4.4 and 4.5). Sample sizes for the first four components (early literacy activities, help with homework, school practices on parental involvement from a parental perspective, and parental involvement from a student perspective) were taken from the PIRLS home and student data, providing a considerably larger sample than that available for the last component (school practices on parental involvement, school perspective), where data were derived from the PIRLS school questionnaire. The GPCM rarely improved global reliability. Components 1 (early literacy activities), 2 (help with homework), and 5 (school practices on parental involvement, school perspective) were evaluated using nine, eight, and 15 items, respectively (see also Table 3.2). Their global reliability is generally above 0.70, which is an acceptable level for country inferences. A value of 0.80 is generally considered an acceptable reliability level for individual inferences, and for many combinations of components and countries, this level was attained. Components 3 (school practices on parental involvement, parental perspective) and 4 (parental involvement from a student perspective) were evaluated using three items and five items, respectively; the global reliability of these estimates was thus correspondingly lower.

Table 4.1 Country characteristics component 1: early literacy activities before beginning primary school
Table 4.2 Country characteristics component 2: help with homework
Table 4.3 Country characteristics component 3: school practices on parental involvement, parent perspective
Table 4.4 Country characteristics component 4: student perception of parental involvement
Table 4.5 Country characteristics component 5: school practices on parental involvement, school perspective

We also investigated the item characteristics for each component (Tables 4.6, 4.7, 4.8, 4.9 and 4.10). Local reliability, namely the extent to which different θ-values can be distinguished, was assessed using the “slope” parameter. The relatively high value for PIRLS item ASBH02A (“read books”) indicates that this item of the scale performed best in this respect. Local reliability is further supported if the item location parameters agree closely with the mean of the latent distribution. Because the latent distributions of the countries were normed to an overall mean of zero, locations close to zero are most favorable, and in this respect item ASBH02G (“play word games”) performed best. Together the intercept and slope parameters determine the information value of an item. Higher values for the information value of an item at θ = 0, namely I(0), indicate that the item made a higher contribution to the local reliability of the component.
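How slope and intercepts jointly determine I(0) can be illustrated with the standard Fisher information of a GPCM item, which equals the squared slope times the conditional variance of the item score. The sketch below uses that textbook expression with the parametrization of Eq. (4.2); whether the tabulated values were computed in exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def gpcm_information(a, d, theta=0.0):
    """Fisher information of one unidimensional GPCM item at a given theta.

    a : slope parameter
    d : cumulative location parameters [d_i1, ..., d_im]
    """
    exponents = [0.0] + [(j + 1) * a * theta - d_j for j, d_j in enumerate(d)]
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Conditional mean and variance of the item score given theta.
    mean = sum(j * p for j, p in enumerate(probs))
    mean_sq = sum(j * j * p for j, p in enumerate(probs))
    return a * a * (mean_sq - mean * mean)
```

For a given slope, the information at θ = 0 is largest when the response categories are well spread at that point, which is why items with locations near the overall mean of zero contribute most to local reliability there.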

Table 4.6 Response frequencies and item parameter estimates under the generalized partial credit model for items in component 1: early literacy activities
Table 4.7 Response frequencies and item parameter estimates under the generalized partial credit model for items in component 2: help with homework
Table 4.8 Response frequencies and item parameter estimates under the generalized partial credit model for items in component 3: school practices on parental involvement, parent perspective
Table 4.9 Response frequencies and item parameter estimates under the generalized partial credit model for items in component 4: student perception of parental involvement
Table 4.10 Response frequencies and item parameter estimates under the generalized partial credit model for items in component 5: school practices on parental involvement, school perspective

For component 1 (early literacy activities), the item ASBH02C (“sing songs”) has a lower information value than the other items. This should be taken into account when redesigning the instrument for future surveys; in other words, this item may be the first candidate for replacement. Compared to component 1 (early literacy activities), the items in component 2 (helping with homework) were more informative, while items in component 3 (school practices on parental involvement, parent perspective) performed poorly. Components 4 (parental involvement from a student perspective) and 5 (school practices on parental involvement, school perspective) provided differing results; notably, the last two items of component 5 (“parental support for student achievement within school” and “parental involvement in school activities”) performed particularly poorly.

Comparing the parameter estimates in the GPCM and the GPCM with random item parameters (henceforth the random GPCM) revealed that the agreement between the slopes and intercepts under the GPCM and the means of the slopes and intercepts under the random GPCM was high (Tables 4.11, 4.12, 4.13, 4.14 and 4.15). A higher variance provides an initial indication that the item functions differently in different countries, a topic we address in more detail later. Here, the effects are global over countries and thus only permit global inferences. For instance, for component 1, the last item, ASBH02I (“read aloud signs and labels”), has the lowest CDIF because the variance of the intercepts and slopes across the countries is the lowest among the items (Table 4.11). A low variance indicates that the item parameters do not vary much across countries. Evaluating the relative CDIF of the other eight items is more difficult, because of the trade-off between the standard deviation for the slope and the intercept.

Table 4.11 Item parameter estimates under the generalized partial credit model (GPCM) and GPCM with random item parameters for items in component 1: early literacy activities
Table 4.12 Item parameter estimates under the generalized partial credit model (GPCM) and GPCM with random item parameters for items in component 2: help with homework
Table 4.13 Item parameter estimates under the generalized partial credit model (GPCM) and GPCM with random item parameters for items in component 3: school practices on parental involvement, parent perspective
Table 4.14 Item parameter estimates under the generalized partial credit model (GPCM) and GPCM with random item parameters for items in component 4: student perception of parental involvement
Table 4.15 Item parameter estimates under the generalized partial credit model (GPCM) and GPCM with random item parameters for items in component 5: school practices on parental involvement, school perspective

This pattern is repeated for component 2; the items ASBH09F (“helping child practice reading”) and ASBH09G (“helping child practice math skills”) performed slightly better than the other items (Table 4.12). Conversely, component 3 showed a substantial difference between the item parameters estimated with the GPCM and those estimated using the random GPCM (Table 4.13), indicating this short scale was quite unstable.

The analyses of components 4 and 5 indicated all the items performed comparably with respect to CDIF (Tables 4.14 and 4.15), although questions surrounding specific item-by-country interaction and the influence of the inferences on country means and latent regression remain unanswered.

We compared CDIF as identified by the random GPCM with CDIF as identified using the latent residuals defined by Eq. (4.3) and aggregated over countries (Tables 4.16, 4.17, 4.18, 4.19 and 4.20). Overall the agreement between the methods was high. For instance, item ASBH02I performed strongly in all methods, as did item ASBH02G (Table 4.16). In general, the residuals with the GPCM are smaller than those with the PCM, because the latter model has fewer parameters. Other studies (see e.g., Glas and Jehangir 2014) confirm this expectation. However, we found that differences between the PCM and the GPCM were very small. We tentatively conclude that the PCM fits the data quite well. A striking exception, again, was component 3. Here the fit of the GPCM was worse than the fit of the PCM, which leads to the conclusion that the slopes are very hard to estimate. This is in agreement with the reported low global reliability. Apparently, the variance in the θ-distribution is too small to support a proper estimate of the slope parameters.

Table 4.16 Absolute differential item functioning (DIF) under the partial credit model (PCM) and the generalized partial credit model (GPCM) and standard deviation random item parameters on items in component 1: early literacy activities
Table 4.17 Absolute differential item functioning (DIF) under the partial credit model (PCM) and the generalized partial credit model (GPCM) and standard deviation random item parameters on items in component 2: help with homework
Table 4.18 Absolute differential item functioning (DIF) under the partial credit model (PCM) and the generalized partial credit model (GPCM) and standard deviation random item parameters on items in component 3: school practices on parental involvement, parent perspective
Table 4.19 Absolute differential item functioning (DIF) under the partial credit model (PCM) and the generalized partial credit model (GPCM) and standard deviation random item parameters on items in component 4: student perception of parental involvement
Table 4.20 Absolute differential item functioning (DIF) under the partial credit model (PCM) and the generalized partial credit model (GPCM) and standard deviation random item parameters on items in component 5: school practices on parental involvement, school perspective

We then addressed the distribution of country-by-item interaction across countries and items, to determine whether the sizes and directions of the residuals were randomly distributed across all countries and items, or whether they exhibited notable patterns of interaction (Tables 4.21, 4.22, 4.23, 4.24 and 4.25). Residuals were defined by Eq. (4.3), estimated under the GPCM, and calculated for every country, with that country as the focal country and all other countries as the reference. For simplicity, we do not consider the specific values of the residuals here, but concentrate on the outlying values. For example, if we examine results obtained for the Republic of Azerbaijan and Australia for component 1 (early literacy activities, Table 4.21), it is clear that, aggregated over the items, the mean absolute residual for the Republic of Azerbaijan is much larger than the mean absolute residual for Australia. The responses were coded 0, 1 and 2, so the residuals, which are the differences between a mean observed and a mean expected response, lie between -2 and 2, and the absolute residuals are on a scale from 0 to 2. Closer inspection at the item level for the Republic of Azerbaijan reveals that items 3 and 5 have residuals among the 10 % most positive among the countries, while items 6 and 8 have residuals among the 10 % most negative among the countries. Australia, however, has only one negative residual, and this is among the 20 % most negative residuals among the countries. Checking the absolute residuals further reveals that Poland fits the model best, with the lowest CDIF, while Indonesia has the largest CDIF.
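The flagging of outlying residuals described above can be sketched as a simple rank-based rule applied per item across countries. The exact rule used for the tables (for example, how ties or rounding of the tail size were handled) is not stated in the text, so the choices below are assumptions; the function name and the country labels are hypothetical.

```python
def flag_extremes(residuals, fraction=0.10):
    """Flag the most negative and most positive residuals for one item.

    residuals : dict mapping country -> residual for this item
    fraction  : share of countries in each tail (0.10 or 0.20 in the text)
    Returns a dict mapping country -> '-', '+' or '' (not outlying).
    """
    n = len(residuals)
    # Number of countries in each tail; at least one (an assumed convention).
    k = max(1, round(fraction * n))
    order = sorted(residuals, key=residuals.get)
    lowest = set(order[:k])
    highest = set(order[-k:])
    return {
        c: "-" if c in lowest else ("+" if c in highest else "")
        for c in residuals
    }


# Made-up residuals for ten hypothetical countries.
example = {
    "C01": -0.30, "C02": -0.05, "C03": 0.02, "C04": 0.10, "C05": -0.12,
    "C06": 0.25, "C07": 0.00, "C08": 0.04, "C09": -0.02, "C10": 0.08,
}
flags_10 = flag_extremes(example, 0.10)
flags_20 = flag_extremes(example, 0.20)
```

Because a fixed share of countries is flagged for every item by construction, the flags are a screening device rather than evidence of CDIF in themselves.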

Table 4.21 Residual analysis for country-by-item interactions for component 1: early literacy activities
Table 4.22 Residual analysis for country-by-item interactions for component 2: help with homework
Table 4.23 Residual analysis for country-by-item interactions for component 3: school practices on parental involvement, parent perspective
Table 4.24 Residual analysis for country-by-item interactions for component 4: student perception of parental involvement
Table 4.25 Residual analysis for country-by-item interactions for component 5: school practices on parental involvement, school perspective

In a similar way, component 2 (helping with homework) functions very differently in the Netherlands than in other countries (Table 4.22), probably because giving students homework is not a daily practice in Dutch primary schools. This different item functioning is indicated by both the high mean for the absolute values of the residuals and the large number of outliers among the residuals. Canada fits the model best, having the lowest CDIF for this component. For component 3 (school practices on parental involvement, parent perspective), the highest mean absolute residual was found for Germany. However, the scale for measuring school practices on parental involvement from the school perspective (component 5) showed relatively little evidence of CDIF.

We undertook a marginal count of the outliers for the items aggregated over the countries (Table 4.26). No single item count was prominent, although the first item in component 3 (“my child’s school includes me in my child’s education”) seemed more susceptible to CDIF than other items, since this item had the greatest number of residual outliers among countries: 13 in the 10 % outliers region and 15 in the 20 % outliers region. Items 5 (“volunteering”) and 13 (“organize workshops or seminars for parents on learning or pedagogical issues”) within component 5 also had more outliers than other items in the component. However, this does not, of course, mean that these items have CDIF: if the 10 and 20 % most extreme values are considered, then by construction 10 and 20 % of the residuals fall in these regions. Such counts therefore only serve as a tool to further scrutinize the items.

Table 4.26 Distribution of cultural differential item functioning (CDIF) across items on parental involvement

We also calculated country-specific factor loadings for the bi-factor model, where we first transformed country-specific factor loadings to standard normals, and then identified the 2.5 and 5 % most extreme outlying values (Tables 4.27, 4.28, 4.29, 4.30 and 4.31). This distribution of country-specific factor loadings gives an indication of the extent to which items load on a country-specific factor in addition to the general factor of the item, and can, as in our earlier residual analysis, be used to determine whether the sizes and directions of the factor loadings are randomly distributed across all countries and items, or whether they exhibit notable patterns of interaction.

Table 4.27 Outliers of country-specific factor loadings in the bi-factor model for component 1: early literacy activities
Table 4.28 Outliers of country-specific factor loadings in the bi-factor model for component 2: help with homework
Table 4.29 Outliers of country-specific factor loadings in the bi-factor model for component 3: school practices on parental involvement, parent perspective
Table 4.30 Outliers of country-specific factor loadings in the bi-factor model for component 4: student perception of parental involvement
Table 4.31 Outliers of country-specific factor loadings in the bi-factor model for component 5: school practices on parental involvement, school perspective

For component 1, the greatest number of outliers of the country-specific factor loadings and the highest mean absolute factor loading were found for Colombia (Table 4.27), suggesting a high level of CDIF. Interestingly, in the residual analysis for this component, a total of 15 countries showed a higher mean absolute residual (Table 4.21). Regarding help with homework (component 2), Malta was identified as having the highest number of outliers in country-specific factor loadings (Table 4.28), while the Netherlands, which we earlier identified as exhibiting CDIF for component 2 (Table 4.22), also had a high number of outliers. For component 3, counting the number of outliers provided little information, as only three outliers were counted in the 2.5 % region (Table 4.29). Hungary did show a high mean absolute country-specific factor loading on this component, though the questionable reliability of the scale must be kept in mind. Student perception of parental involvement (component 4) was measured with the least CDIF in Denmark, whereas the school practices on parental involvement from the school perspective showed the least CDIF for Italy (Tables 4.30 and 4.31).

Aggregating the items over the countries provides a tool for further investigation of items (Table 4.32), with the same caveats as before: if the 2.5 and 5 % most extreme values are considered, then by construction 2.5 and 5 % of the factor loadings fall in these regions, but this does not imply that 2.5 and 5 % of the items have CDIF. No item count is prominent. Item 5 (“talk about things you had done”) in component 1 did seem more susceptible to CDIF than other items, since this item revealed the greatest number of outliers in country-specific factor loadings over countries.

Table 4.32 Distribution of outliers of country-specific factor loadings in the bi-factor model across items on parental involvement

We then addressed whether the residual analyses using the GPCM and the bi-factor GPCM analyses led to the same conclusions (see Table 4.33). A priori, this would be unexpected. The residual analyses target so-called uniform CDIF, namely a shift in the item location (item intercept) parameters over countries. The bi-factor analyses target non-uniform CDIF, namely differences in the slopes and the dimensionality across items. The correlations for components 2, 4 and 5 were moderate, while for component 1, the correlation was much lower, and for component 3, the correlation completely vanished. The result for component 3 is probably because both the residuals and the country-specific factor loadings are poorly estimated for a test containing only three items.

Table 4.33 Relation between residuals under the generalized partial credit model (GPCM) and country-specific factor loadings in the bi-factor GPCM

Though the correlation between the residuals and the country-specific factor loadings is a reasonable estimate of the association between the two measures, it does not properly indicate to what extent the two measures have the same outliers. To investigate this, we ordered the residuals and country-specific factor loadings and classified them into three categories according to their size (a category with negative values, a category with positive values, and a middle category). Further, we varied the definition of what constituted an outlying value by varying the size of the middle group (assigning it variously as 33, 40, or 80 % of values). We then calculated Kappa to establish the agreement in categorization between the residual analyses using the GPCM and the bi-factor GPCM. This revealed that agreement was poor throughout for component 3, while, for component 1, the agreement was poor in the 33 % category; for other categories in component 1 the agreement was only fair to moderate. In general, the results indicate that it is not a good policy to rely on one approach for the investigation of CDIF.
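The agreement analysis can be sketched as follows. The sketch assumes that Kappa refers to Cohen's kappa for two categorizations of the same country-by-item cells, and that the three categories are formed by rank so that the middle group contains the stated share of values; both are assumptions about details the text does not spell out, and the function names are hypothetical.

```python
def classify(values, middle_fraction):
    """Label each value 'low', 'middle' or 'high' by rank.

    middle_fraction is the share of values assigned to the middle category
    (e.g. 0.33, 0.40 or 0.80); the remainder is split evenly over both tails.
    """
    n = len(values)
    tail = round(n * (1.0 - middle_fraction) / 2.0)
    order = sorted(range(n), key=lambda i: values[i])
    labels = ["middle"] * n
    for i in order[:tail]:
        labels[i] = "low"
    for i in order[n - tail:]:
        labels[i] = "high"
    return labels


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two categorizations."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals one")
    return (observed - expected) / (1.0 - expected)
```

Applying `classify` to the residuals and to the country-specific factor loadings with the same `middle_fraction`, and passing both label lists to `cohen_kappa`, yields one agreement coefficient per component and cut-off.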

We investigated the influence of CDIF by calculating the correlation and rank correlation between country means estimated with no, 10, and 20 % CDIF parameters, and with random item parameters (Table 4.34). Estimates of the means using the unidimensional GPCM without country-specific item parameters and using the bi-factor GPCM could not be distinguished, so we exclude them from further discussion. In general, correlations were high, indicating that, in the estimation of the country means and the rank order of the country means, CDIF had little impact. Component 3 remained the exception; both correlations and rank correlations were low. Further, for components 2 and 4, the correlations between the means estimated using the GPCM with random item parameters and the other three models were also low; however this was not the case for the rank correlations. This is because the relationship between means is not linear. We discuss the possible influence of CDIF further in the next chapter.
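The difference between the two coefficients can be illustrated with a short sketch: when two sets of country means are related monotonically but not linearly, the rank correlation stays at one while the product-moment correlation drops. The data below are made up for illustration and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result


def spearman(x, y):
    """Rank correlation: the product-moment correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up country means under two models, related monotonically but not linearly.
means_model_a = [-2.0, -1.0, 0.0, 1.0, 2.0]
means_model_b = [math.exp(3.0 * m) for m in means_model_a]
linear_corr = pearson(means_model_a, means_model_b)
rank_corr = spearman(means_model_a, means_model_b)
```

In this example the ordering of the countries is identical under both models, so `rank_corr` equals one, whereas `linear_corr` is clearly lower because the means are not linearly related.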

Table 4.34 Correlation and rank correlation between country means estimated with no, 10 and 20 % cultural differential item functioning (CDIF) parameters, and random item parameters