Abstract
This chapter presents a psychometric framework for identifying and modeling cultural differential item functioning (CDIF) in several ways. One line of modeling uses a residual approach to identify CDIF and then introduces country-specific and random item parameters for the affected items. A second approach uses a non-standard application of the bi-factor model. The results of all approaches for each of the five parental involvement components provide insights into the extent to which each component is affected by CDIF.
In this chapter, we first outline the models used and the estimation and testing procedures employed, and then summarize the results revealed by these models.
4.1 Estimation and Testing Procedures
The procedures we used for parameter estimation and evaluation of model fit are based on marginal maximum likelihood (MML). Most of the procedures we discuss are documented in more detail elsewhere (see Bock and Aitkin 1981; Bock et al. 1988; Gibbons and Hedeker 1992; Glas 1999; Adams and Wu 2006; De Jong et al. 2007; Jennrich and Bentler 2011; Glas and Jehangir 2014). We used the public domain software package MIRT (Glas 2010) in the calculations. Additional estimation and testing procedures were used for the bi-factor model, with unidimensional models as special cases, and random item parameters as a generalization.
4.1.1 MML Estimation
The bi-factor model used in this study has two parts: a measurement model (i.e., an IRT model) and a structural model. The measurement model pertains to a polytomously-scored response of a student n to an item i. The possible item scores range from 0 to \( m_i \), and the score of student n on item i is denoted by the variables \( x_{nij} \) (j = 1, …, \( m_i \)), where \( x_{nij} = 1 \) if the response is in category j and zero otherwise. Note that \( m_i \) has an index i, which indicates that the maximum score can differ across items.
We describe the procedure for the bi-factor model, combined with the partial credit model (PCM; Masters 1982) and generalized partial credit model (GPCM; Muraki 1992) as IRT models, since these two models were the ones we selected for the present study. However, the theory also applies to other IRT models, such as the unidimensional PCM and GPCM, the graded response model (Samejima 1969), the sequential model (Tutz 1990), and other versions of these models with random item parameters instead of fixed item parameters.
In the bi-factor GPCM, the probability of scoring in category j (j = 0, …, \( m_i \)) is given by
\( P(x_{nij} = 1 \mid \theta_{n0}, \theta_{ng(n)}) = \frac{\exp \left( j \left( a_{i0} \theta_{n0} + a_{ig(n)} \theta_{ng(n)} \right) - \sum_{h=1}^{j} b_{ih} \right)}{\sum_{k=0}^{m_i} \exp \left( k \left( a_{i0} \theta_{n0} + a_{ig(n)} \theta_{ng(n)} \right) - \sum_{h=1}^{k} b_{ih} \right)} , \)     (4.1)
where \( \theta_{n0} \) is the score of student n on the latent scale pertaining to all countries, \( \theta_{ng(n)} \) is the score on a country-specific latent dimension, and the index g(n) indicates the country to which student n belongs. Further, \( a_{i0} \) and \( a_{ig(n)} \) are the factor loadings of item i on these two dimensions, and \( b_{ih} \) (h = 1, …, \( m_i \)) is the item location parameter, that is, a position on the latent scale. Empty summations, such as a sum running from h = 1 to 0, are defined as zero. The unidimensional GPCM lacks the country-specific dimensions \( \theta_{ng(n)} \) and the associated factor loadings \( a_{ig(n)} \). Further, the PCM is obtained by fixing all item parameters \( a_{i0} \) to one.
The formula for the response probability and subsequent derivations can be simplified by introducing the re-parametrization \( d_{ij} = \sum_{h=1}^{j} b_{ih} \) and by defining \( \mathbf{a}_{ig}^{t} \boldsymbol{\theta}_{n} \) as the inner product of the vectors \( (a_{i0}, a_{ig(n)}) \) and \( (\theta_{n0}, \theta_{ng(n)}) \). Thus, Eq. (4.1) becomes
\( P(x_{nij} = 1 \mid \boldsymbol{\theta}_{n}) = \frac{\exp \left( j\, \mathbf{a}_{ig}^{t} \boldsymbol{\theta}_{n} - d_{ij} \right)}{\sum_{k=0}^{m_i} \exp \left( k\, \mathbf{a}_{ig}^{t} \boldsymbol{\theta}_{n} - d_{ik} \right)} . \)     (4.2)
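To make the measurement model concrete, the category probabilities of a single item under the bi-factor GPCM can be computed as sketched below. This is a minimal illustration in Python; the function name and all parameter values are hypothetical, not estimates from the study or the MIRT package.

```python
import numpy as np

def bifactor_gpcm_probs(theta0, theta_g, a0, ag, b):
    """Category probabilities for one item under the bi-factor GPCM.

    theta0  : score on the general dimension theta_n0
    theta_g : score on the country-specific dimension theta_ng(n)
    a0, ag  : factor loadings of the item on the two dimensions
    b       : sequence of m_i location parameters b_i1 .. b_im
    Returns an array of m_i + 1 probabilities for categories 0 .. m_i.
    """
    # d_ij = sum_{h=1}^{j} b_ih, with the empty sum for j = 0 equal to zero
    d = np.concatenate(([0.0], np.cumsum(b)))
    j = np.arange(len(d))
    # numerator exp(j * a' theta - d_ij), normalized over all categories
    logits = j * (a0 * theta0 + ag * theta_g) - d
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

# illustrative values only
probs = bifactor_gpcm_probs(0.5, -0.2, a0=1.2, ag=0.4, b=[-1.0, 0.5])
```

Setting `ag=0` recovers the unidimensional GPCM, and additionally fixing `a0=1` recovers the PCM, mirroring the restrictions described above.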
The \( \theta_0 \)-dimension is the general dimension that pertains to all countries and is the basis for the comparison of the countries. The \( \theta_g \)-dimensions are the country-specific dimensions, and the factor loadings on these dimensions give an indication of country-by-item interaction. It is assumed that, within each country, the dimensions \( \theta_0 \) and \( \theta_g \) have a bivariate normal distribution \( N(\theta_{n0}, \theta_{ng}; \mu_{g}, \varSigma_{g}) \). For the two-dimensional country mean \( \mu_{g} = (\mu_{g0}, \mu_{g1}) \), the mean on the second dimension is fixed at zero, that is, \( \mu_{g1} = 0 \). The covariance matrix is given by
\( \varSigma_{g} = \begin{pmatrix} \sigma_{g0}^{2} & \sigma_{g01} \\ \sigma_{g01} & \sigma_{g1}^{2} \end{pmatrix} . \)
In the unidimensional GPCM and PCM, the latent student parameters \( \theta_0 \) have a univariate normal distribution with a mean \( \mu_g \) and a variance \( \sigma_g^2 \). Finally, random item parameters are obtained by introducing independent multivariate normal distributions on the parameters of each item (for further details, please consult De Jong et al. 2007).
The present application of the bi-factor model is not standard, but an extension of the basic model. Thus, the technical details on the estimation equations, the expressions for the covariance matrix of the estimates, and the tests of model fit are also provided (see Appendix A).
4.1.2 Detection and Modeling of Differential Item Functioning
Part of the process of establishing the construct validity of a scale may consist of showing that the scale fits an IRT model. In the present study, the focus is on country-specific CDIF. CDIF can be detected using Lagrange multiplier (LM) test statistics (Rao 1947; see also Aitchison and Silvey 1958), and CDIF can be modeled using country-specific item parameters. Glas and Jehangir (2014) showed the feasibility of the method using PISA data, although in the slightly simpler framework of unidimensional IRT models. The method is implemented in the public domain software package MIRT (Glas 2010). LM tests have previously been applied in IRT frameworks (Glas 1999; Glas and Falcón 2003; Glas and Dagohoy 2007). Our primary interest is not in the actual outcome of the LM test, because, given the very large sample sizes in educational surveys, even the smallest model violation, that is, the smallest amount of differential item functioning (DIF), will be significant. The reason for adopting the framework of the LM test is that it clarifies the connection between the model violations and the observations and expectations used to detect DIF. Further, because it produces comprehensible and well-founded expressions for model expectations, the value of the LM test statistic can be used as a measure of the effect size of DIF, and the procedure can be easily generalized to a broad class of IRT models.
To define the test and the associated residuals, we define a background variable
\( y_{n} = \begin{cases} 1 & \text{if student } n \text{ belongs to the focal group,} \\ 0 & \text{if student } n \text{ belongs to the reference group,} \end{cases} \)
and, under the alternative hypothesis, the item location of student n is shifted by \( \delta_{i} y_{n} \).
The LM test targets the null-hypothesis of no DIF, namely the null-hypothesis where \( \delta_{i} = 0 \). The LM test statistic is computed using the MML estimates of the null-model, in which \( \delta_{i} \) is not estimated. The test is based on evaluation of the first-order derivative of the marginal likelihood with respect to \( \delta_{i} \), evaluated at \( \delta_{i} = 0 \) (see Glas 1999). If the first-order derivative at this point is large, the MML estimate of \( \delta_{i} \) is far removed from zero, and the test is significant. If the first-order derivative at this point is small, the MML estimate of \( \delta_{i} \) is probably close to zero, and the test is not significant. The actual LM statistic is the squared first-order derivative divided by its estimated variance, and it has an asymptotic chi-squared distribution with one degree of freedom. However, as already discussed, the primary interest is not so much in the test itself as in the information it provides regarding the fit between the data and the model.
For a general definition of the approach, which also pertains to polytomously-scored items, the covariates \( y_{nc} \) (c = 1, …, C) should be defined. Special cases leading to specific DIF statistics are given later. The covariates may be separately observed person characteristics, but they may also depend on the observed response pattern, excluding the response to the targeted item i.
The LM approach can be outlined using the bi-factor GPCM; the special cases for the unidimensional PCM and GPCM are obtained if the restrictions denoted above are invoked. The probability of a response is given by a generalization of the bi-factor GPCM, namely,
\( P(x_{nij} = 1 \mid \boldsymbol{\theta}_{n}) = \frac{\exp \left( j \left( \mathbf{a}_{ig}^{t} \boldsymbol{\theta}_{n} + \sum_{c=1}^{C-1} y_{nc} \delta_{ic} \right) - d_{ij} \right)}{\sum_{k=0}^{m_i} \exp \left( k \left( \mathbf{a}_{ig}^{t} \boldsymbol{\theta}_{n} + \sum_{c=1}^{C-1} y_{nc} \delta_{ic} \right) - d_{ik} \right)} . \)
For one so-called reference country, the covariates \( y_{nc} \) are equal to zero. This country serves as a baseline where the bi-factor GPCM with item parameters a and b holds. In the other C−1 countries, the covariates \( y_{nc} \) are equal to one. It can be shown (see Glas 1999) that the test statistic is based on the residuals
\( R_{ic} = \sum_{n} y_{nc} \left( \sum_{j=1}^{m_i} j\, x_{nij} - E \left( \sum_{j=1}^{m_i} j\, x_{nij} \right) \right) \)     (4.3)
for c = 1, …, C−1. Dividing this residual by the number of respondents \( \sum_{n} y_{nc} \) produces residuals that are the differences between the observed and expected average item-total score in country c = 1, …, C−1. The residual gauges so-called uniform DIF; in other words, the residual indicates whether the item total function (ITF) \( \sum_{j} j P_{ij}(\theta) \) is shifted for the item, namely whether there is item-by-country interaction.
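The residual described above can be sketched as follows. The function and its inputs are hypothetical; it assumes the model-expected item scores have already been computed from the fitted model.

```python
import numpy as np

def uniform_dif_residual(x, expected, y):
    """Residual for one item in one country: the difference between the
    observed and model-expected average item score for respondents with
    the country covariate y_nc = 1.

    x        : observed item scores, shape (n_students,)
    expected : model-expected scores E(sum_j j * x_nij), same shape
    y        : 0/1 country covariate y_nc, same shape
    """
    y = np.asarray(y, dtype=float)
    raw = np.sum(y * (np.asarray(x, dtype=float) - np.asarray(expected, dtype=float)))
    return raw / y.sum()   # average over the respondents with y_nc = 1
```

A positive value indicates the country scores higher on the item than the common model predicts, a first sign of uniform DIF for that item in that country.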
The LM statistic for the null-hypothesis \( \delta_{ic} = 0 \) (c = 1, …, C−1) is a quadratic form in the (C−1)-dimensional vector of residuals and the inverse of their covariance matrix (for details, see Glas 1999). It has an asymptotic chi-squared distribution with C−1 degrees of freedom.
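The quadratic form underlying the LM statistic can be sketched as below. This is a hypothetical helper: in practice the residual vector and its covariance matrix come out of the MML estimation (e.g., as implemented in MIRT); here they are simply taken as inputs.

```python
import numpy as np

def lm_statistic(residuals, cov):
    """LM statistic as a quadratic form r' V^{-1} r.

    residuals : (C-1)-dimensional vector of residuals
    cov       : their (C-1) x (C-1) covariance matrix V
    Returns the statistic and its degrees of freedom (C-1); the statistic
    is referred to a chi-squared distribution with that many degrees of
    freedom.
    """
    r = np.asarray(residuals, dtype=float)
    v = np.asarray(cov, dtype=float)
    stat = float(r @ np.linalg.solve(v, r))   # solve avoids an explicit inverse
    return stat, len(r)
```

With a single focal country (C−1 = 1) this reduces to the squared residual divided by its variance, the one-degree-of-freedom case described earlier.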
A special case of this procedure is obtained if one country serves as the focal country and all other countries serve as reference. Then the model under the alternative hypothesis has only one additional parameter, \( \delta_{i} \), and the associated LM statistic has an asymptotic chi-squared distribution with one degree of freedom.
Items that show the worst misfit, based on their value of the LM statistic and residuals, are given country-specific item parameters. From a practical point of view, defining country-specific item parameters is equivalent to defining an incomplete design where the DIF item is split into a number of virtual items, and where each virtual item is considered as administered in a specific country. The resulting design can be analyzed using IRT software that supports the analysis of data collected in an incomplete design. We here refer to items with country-specific parameters as split items.
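The splitting of a DIF item into virtual items can be illustrated with a small sketch. The function and data layout are hypothetical and show only the bookkeeping that creates the incomplete design, not the IRT estimation itself.

```python
import numpy as np

def split_dif_item(responses, country, dif_countries):
    """Split one DIF item's responses into country-specific virtual items.

    responses     : response vector for the item, shape (n_students,)
    country       : country label per student, shape (n_students,)
    dif_countries : labels of the countries that get their own virtual item
    Returns a dict mapping virtual-item names to response vectors, where
    np.nan marks 'not administered' in the resulting incomplete design.
    """
    responses = np.asarray(responses, dtype=float)
    country = np.asarray(country)
    design = {}
    common = responses.copy()
    for g in dif_countries:
        mask = country == g
        virtual = np.full_like(responses, np.nan)
        virtual[mask] = responses[mask]   # administered only in country g
        common[mask] = np.nan             # removed from the common item
        design[f"item_{g}"] = virtual
    design["item_common"] = common
    return design
```

Each virtual item then receives its own parameters when the incomplete design is analyzed with IRT software that supports such designs.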
The method is motivated by the assumption that a substantial proportion of the items function in the same way in all countries, while a limited number of items have CDIF. In the IRT model, it is assumed that all items pertain to the same latent variable θ. Items without CDIF have the same item parameters in every country. However, items with CDIF have item parameters that differ across countries. These items refer to the same latent variable θ as all the other items, but their location on the scale differs across countries. For instance, the number of cars in the family may be a good indicator of wealth, but the actual number of cars at a certain level of wealth may vary across countries, or even within countries. Having a car in the inner city of Amsterdam is clearly a sign of wealth, but, in the rural eastern part of the Netherlands, an equivalent level of wealth would probably result in the ownership of three cars.
The number of items given country-specific item parameters is a matter of choice in which two considerations are relevant. First, a sufficient number of anchor items should remain in the scale. Second, the model including the split items should fit the data. DIF statistics no longer apply to the split items. However, the fit of the item response curve of an individual item, say item i, can be evaluated using the test for non-uniform DIF described earlier, but using a model including country-specific item parameters. So, in this application too, test-score ranges are used as proxies for locations on the θ scale, and the test evaluates whether the model with the country-specific item parameters properly predicts the ITF.
4.2 Results of Modeling Country-Specific Differential Item Functioning
We here provide descriptive statistics at country level for each of the five parental involvement components under the PCM and GPCM, including sample size and estimated global reliability (Tables 4.1, 4.2, 4.3, 4.4 and 4.5). Sample sizes for the first four components (early literacy activities, help with homework, school practices on parental involvement from a parental perspective, and parental involvement from a student perspective) were taken from the PIRLS home and student data, providing a significantly larger sample than that available for the last component (school practices on parental involvement, school perspective), where data were derived from the PIRLS school questionnaire. The GPCM rarely improved global reliability. Components 1 (early literacy activities), 2 (help with homework), and 5 (school practices on parental involvement, school perspective) were evaluated using nine, eight, and 15 items, respectively (see also Table 3.2). Their global reliability is generally >0.70, which is an acceptable level for country inferences. A value of 0.80 is generally considered an acceptable reliability level for individual inferences, and for many combinations of components and countries, this level was attained. Components 3 (school practices on parental involvement, parental perspective) and 4 (parental involvement from a student perspective), were evaluated using three items and five items, respectively; the global reliability of these estimates was thus correspondingly lower.
We also investigated the item characteristics for each component (Tables 4.6, 4.7, 4.8, 4.9 and 4.10). Local reliability, namely the extent to which different θ-values can be distinguished, was assessed using the “slope” parameter. The relatively high value for PIRLS item ASBH02A (“read books”), indicates that this item of the scale performed best in this respect. Local reliability is further supported if the item location parameters agree closely with the mean of a latent distribution. In this respect, item ASBH02G (“play word games”) performed best, because the latent distributions of the countries were normed to an overall mean of zero. Together the intercept and slope parameters determine the information value of an item. Higher values for the information value of an item at θ = 0, namely I(0), indicate the item made a higher contribution to the local reliability of the component.
For component 1 (early literacy activities), the item ASBH02C (“sing songs”) has a lower information value than the other items. This should be taken into account when redesigning the instrument for future surveys; in other words, this item may be the first candidate for replacement. Compared to component 1 (early literacy activities), the items in component 2 (helping with homework) were more informative, while items in component 3 (school practices on parental involvement, parent perspective) performed poorly. Components 4 (school practices for parental involvement from a student perspective) and 5 (school practices for parental involvement from a school perspective) provided differing results; in particular, the last two items of component 5 (“parental support for student achievement within school” and “parental involvement in school activities”) performed particularly poorly.
Comparing the parameter estimates in the GPCM and the GPCM with random item parameters (henceforth the random GPCM) revealed that the agreement between the slopes and intercepts under the GPCM and the means of the slopes and intercepts under the random GPCM was high (Tables 4.11, 4.12, 4.13, 4.14 and 4.15). A higher variance provides an initial indication that the item functions differently in different countries, a topic we address in more detail later. Here, the effects are global over countries and thus only permit global inferences. For instance, for component 1, the last item, ASBH02I (“read aloud signs and tables”) has the lowest CDIF because the variance of the intercepts and slopes across the countries is the lowest among the items (Table 4.11). A low variance indicates that the item parameters do not vary much across countries. Evaluating the relative CDIF of the other eight items is more difficult, because of the trade-off between the standard deviation for the slope and the intercept.
This pattern is repeated for component 2; the items ASBH09F (“helping child practice reading”) and ASBH09G (“helping child practice math skills”) performed slightly better than the other items (Table 4.12). Conversely, component 3 showed a substantial difference between the item parameters estimated with the GPCM and those estimated using the random GPCM (Table 4.13), indicating this short scale was quite unstable.
The analyses of components 4 and 5 indicated all the items performed comparably with respect to CDIF (Tables 4.14 and 4.15), although questions surrounding specific item-by-country interaction and the influence of the inferences on country means and latent regression remain unanswered.
We compared CDIF as identified by the random GPCM with CDIF as identified using the latent residuals defined by Eq. (4.3) and aggregated over countries (Tables 4.16, 4.17, 4.18, 4.19 and 4.20). Overall, the agreement between the methods was high. For instance, item ASBH02I performed strongly in all methods, as did item ASBH02G (Table 4.16). In general, the residuals under the GPCM are smaller than those under the PCM, because the latter model has fewer parameters. Other studies (see e.g., Glas and Jehangir 2014) confirm this expectation. However, we found that the differences between the PCM and the GPCM were very small. We tentatively conclude that the PCM fits the data quite well. A striking exception, again, was component 3. Here the fit of the GPCM was worse than the fit of the PCM, which leads to the conclusion that the slopes are very hard to estimate. This is in agreement with the reported low global reliability. Evidently, the variance in the θ-distribution is too small to support a proper estimate of the slope parameters.
We then addressed the distribution of country-by-item interaction across countries and items, to determine whether the sizes and directions of the residuals were randomly distributed across all countries and items, or whether they exhibited notable patterns of interaction (Tables 4.21, 4.22, 4.23, 4.24 and 4.25). Residuals were defined by Eq. (4.3), estimated under the GPCM, and calculated for every country, with that country as the focal country and all other countries as the reference. To simplify, here we shall not consider the specific values of the residuals, but instead concentrate on the outlying values. For example, if we examine the results obtained for the Republic of Azerbaijan and Australia for component 1 (early literacy activities, Table 4.21), it is clear that, aggregated over the items, the mean absolute residual for the Republic of Azerbaijan is much larger than the mean absolute residual for Australia. The responses were coded 0, 1, and 2, so the residuals, which are the differences between a mean observed and a mean expected response, are bounded by this range. Closer inspection at the item level for the Republic of Azerbaijan reveals that items 3 and 5 have residuals among the 10 % most positive among the countries, while items 6 and 8 have residuals among the 10 % most negative. Australia, however, has only one negative residual, and this is among the 20 % most negative residuals among the countries. Checking the absolute residuals further reveals that Poland fits the model best, with the lowest CDIF, while Indonesia shows the largest CDIF.
In a similar way, component 2 (helping with homework) functions very differently in the Netherlands than in other countries (Table 4.22), probably because giving students homework is not a daily practice in Dutch primary schools. This different item functioning is indicated by both the high mean of the absolute values of the residuals and the large number of outliers among the residuals. Canada fits the model best, having the lowest CDIF for this component. For component 3 (school practices on parental involvement, parental perspective), the highest mean absolute residual was found for Germany. However, the scale for measuring school practices on parental involvement from the school perspective (component 5) showed relatively little evidence of CDIF.
We undertook a marginal count of the outliers for the items aggregated over the countries (Table 4.26). No one item count was prominent, although the first item in component 3 (“my child’s school includes me in my child’s education”) seemed more susceptible to CDIF than other items, since this item had the greatest number of residual outliers among countries: 13 in the 10 % outliers region and 15 in the 20 % outliers region. Items 5 (“volunteering”) and 13 (“organize workshops or seminars for parents on learning or pedagogical issues”) within component 5 also scored more highly than other items in the component. However, this does not of course mean that these items have CDIF; if 10 and 20 % extreme values are considered, then 10 and 20 % of the residuals must be included, thus such information only serves as a tool to further scrutinize the items.
We also calculated country-specific factor loadings for the bi-factor model, where we first transformed country-specific factor loadings to standard normals, and then identified the 2.5 and 5 % most extreme outlying values (Tables 4.27, 4.28, 4.29, 4.30 and 4.31). This distribution of country-specific factor loadings gives an indication of the extent to which items load on a country-specific factor in addition to the general factor of the item, and can, as in our earlier residual analysis, be used to determine whether the sizes and directions of the factor loadings are randomly distributed across all countries and items, or whether they exhibit notable patterns of interaction.
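One plausible reading of this outlier-flagging step is sketched below, under the assumption that the transformation to standard normals is an ordinary z-standardization (the chapter does not spell out the exact transformation, so this is illustrative only).

```python
import numpy as np

def flag_extreme(values, tail_pct=2.5):
    """Standardize values and flag the tail_pct % most extreme in each tail.

    values   : e.g., country-specific factor loadings for one item
    tail_pct : percent of values flagged in each tail (2.5 or 5 here)
    Returns the z-scores and a boolean mask marking the outlying values.
    """
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()               # assumed standardization
    lo, hi = np.percentile(z, [tail_pct, 100 - tail_pct])
    return z, (z < lo) | (z > hi)
```

Counting the flags per country (or per item) then gives tables of outlier counts like those summarized for Tables 4.27 through 4.31.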
For component 1, the greatest number of outliers of the country-specific factor loadings and the highest mean absolute factor loading were found for Colombia (Table 4.27), suggesting a high level of CDIF. Interestingly, in the residual analysis for this component, a total of 15 countries showed a higher mean absolute residual (Table 4.21). Regarding help with homework (component 2), Malta was identified as having the highest number of outliers in country-specific factor loadings (Table 4.28), while The Netherlands, which we earlier identified as exhibiting CDIF for component 2 (Table 4.22), also had a high number of outliers. For component 3, counting the number of outliers provided little information, as only three outliers were counted in the 2.5 % region (Table 4.29). Hungary did show a high mean absolute country-specific factor loading on this component, though the questionable reliability of the scale must be kept in mind. Student perception of parental involvement (component 4) was measured with the least CDIF in Denmark, whereas the school practices on parental involvement from the school perspective showed the least CDIF for Italy (Tables 4.30 and 4.31).
Aggregating the items over the countries provides a tool for further investigation of items (Table 4.32), with the same caveats as before: if the 2.5 and 5 % most extreme values are considered, then by construction 2.5 and 5 % of the factor loadings fall in these regions, which does not imply that 2.5 or 5 % of the items have CDIF. No item count was prominent. Item 5 (“talk about things you had done”) in component 1 did seem more susceptible to CDIF than other items, since this item revealed the greatest number of outliers in country-specific factor loadings over countries.
We then addressed whether the residual analyses using the GPCM and the bi-factor GPCM analyses led to the same conclusions (see Table 4.33). A priori, close agreement would not necessarily be expected. The residual analyses target so-called uniform CDIF, namely a shift in the item location (item intercept) parameters over countries. The bi-factor analyses target non-uniform CDIF, namely differences in the slopes and the dimensionality across items. The correlations for components 2, 4 and 5 were moderate, while for component 1, the correlation was much lower, and for component 3, the correlation vanished completely. The result for component 3 is probably because both the residuals and the country-specific factor loadings are poorly estimated for a test containing only three items.
Though the correlation between the residuals and the country-specific factor loadings gives a reasonable estimate of the association between the two measures, it does not properly indicate to what extent the two measures flag the same outliers. To investigate this, we ordered and classified the residuals and country-specific factor loadings into three categories according to their size (a category with negative values, a category with positive values, and a middle category). Further, we varied the definition of what constituted an outlying value by varying the size of the middle group (assigning it variously 33, 40, or 80 % of the values). The calculation of the kappa coefficient establishes the agreement in categorization between the residual analyses using the GPCM and the bi-factor GPCM. This revealed that agreement was poor throughout for component 3, while, for component 1, the agreement was poor in the 33 % category; for the other categories in component 1, the agreement was only fair to moderate. In general, the results indicate that it is not good policy to rely on a single approach for the investigation of CDIF.
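The categorization and kappa computation can be sketched as follows. The quantile-based three-way split is an assumption about how the middle group was formed; the helper names are hypothetical.

```python
import numpy as np

def categorize(values, middle_share):
    """Assign each value to a negative (0), middle (1), or positive (2)
    category, with the middle category holding `middle_share` of the
    ordered values (e.g., 0.33, 0.40, or 0.80)."""
    v = np.asarray(values, dtype=float)
    lo = np.quantile(v, (1 - middle_share) / 2)
    hi = np.quantile(v, 1 - (1 - middle_share) / 2)
    return np.where(v < lo, 0, np.where(v > hi, 2, 1))

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.union1d(a, b)
    po = np.mean(a == b)                                   # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in cats)  # chance agreement
    return (po - pe) / (1 - pe)
```

Applying `categorize` to the residuals and to the country-specific factor loadings, and then `cohens_kappa` to the two category vectors, reproduces the type of agreement analysis reported here.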
We investigated the influence of CDIF by calculating the correlation and rank correlation between country means estimated with no, 10, and 20 % CDIF parameters, and with random item parameters (Table 4.34). Estimates of the means using the unidimensional GPCM without country-specific item parameters and using the bi-factor GPCM could not be distinguished, so we exclude them from further discussion. In general, correlations were high, indicating that, in the estimation of the country means and the rank order of the country means, CDIF had little impact. Component 3 remained the exception; both correlations and rank correlations were low. Further, for components 2 and 4, the correlations between the means estimated using the GPCM with random item parameters and the other three models were also low; however this was not the case for the rank correlations. This is because the relationship between means is not linear. We discuss the possible influence of CDIF further in the next chapter.
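The correlation and rank correlation between two vectors of country means can be computed with a self-contained numpy sketch such as the one below (ties in the rank correlation are not handled, which suffices for illustration; the function names are our own).

```python
import numpy as np

def pearson(x, y):
    """Pearson product-moment correlation between two vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank orders
    (no tie correction in this sketch)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))
```

A high `pearson` value with a high `spearman` value indicates the country means agree both in level and in ordering; the pattern noted for components 2 and 4 (low correlation, high rank correlation) corresponds to a monotone but non-linear relationship between the two sets of means.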
References
Adams, R., & Wu, M. (2006). The mixed-coefficients multinomial logit model: A generalized form of the Rasch model. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 57–75). New York: Springer.
Aitchison, J., & Silvey, S. D. (1958). Maximum likelihood estimation of parameters subject to restraints. Annals of Mathematical Statistics, 29, 813–828.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM-algorithm. Psychometrika, 46, 443–459.
Bock, R. D., Gibbons, R. D., & Muraki, E. (1988). Full-information factor analysis. Applied Psychological Measurement, 12, 261–280.
De Jong, M. G., Steenkamp, J. B. E. M., & Fox, J. P. (2007). Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. Journal of Consumer Research, 34(2), 260–278. doi:10.1086/518532.
Gibbons, R. D., & Hedeker, D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423–436. doi:10.1007/BF02295430.
Glas, C. A. W. (1999). Modification indices for the 2-PL and the nominal response model. Psychometrika, 64(3), 273–294. doi:10.1007/bf02294296.
Glas, C. A. W. (2010). Multidimensional item response theory (MIRT), manual and computer program. Retrieved from http://www.utwente.nl/gw/omd/Medewerkers/temp_test/mirt_package.zip, http://www.utwente.nl/gw/omd/Medewerkers/temp_test/mirt-manual.pdf.
Glas, C. A. W., & Dagohoy, A. V. T. (2007). A person fit test for IRT models for polytomous items. Psychometrika, 72(2), 159–180. doi:10.1007/s11336-003-1081-5.
Glas, C. A. W., & Falcón, J. C. S. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106. doi:10.1177/0146621602250530.
Glas, C. A. W., & Jehangir, K. (2014). Modeling country-specific differential item functioning. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (pp. 97–115). New York: Springer.
Jennrich, R. I., & Bentler, P. M. (2011). Exploratory bi-factor analysis. Psychometrika, 76(4), 537–549. doi:10.1007/s11336-011-9218-4.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Rao, C. R. (1947). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44, 50–57.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 34(4, Pt. 2), 1–100.
Tutz, G. (1990). Sequential item response models with an ordered response. British Journal of Mathematical and Statistical Psychology, 43, 39–55.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license, and any changes made are indicated. The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt, or reproduce the material.
© 2016 The Author(s)
Punter, R.A., Glas, C.A.W., Meelissen, M.R.M. (2016). Modeling Parental Involvement. In: Psychometric Framework for Modeling Parental Involvement and Reading Literacy. IEA Research for Education, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-319-28064-6_4
Print ISBN: 978-3-319-28710-2
Online ISBN: 978-3-319-28064-6