A Comparison of the Hierarchical Generalized Linear Model, Multiple-Indicators Multiple-Causes, and the Item Response Theory-Likelihood Ratio Test for Detecting Differential Item Functioning

  • Mei Ling Ong
  • Laura Lu
  • Sunbok Lee
  • Allan Cohen
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 89)


The purpose of this study was to compare the DIF detection performance of the hierarchical generalized linear model (HGLM), the multiple-indicators multiple-causes (MIMIC) method, and the IRT likelihood ratio (IRT-LR) test in simulated hierarchical data. Conditions in the simulation study included the number of clusters, the cluster sizes, and the intraclass correlation coefficient (ICC). The methods were compared in terms of their Type I error rates, which should be close to the nominal 0.05 when the significance level is set at 0.05. Results show that the HGLM maintained Type I error rates near the nominal level. The MIMIC model controlled the Type I error rate better than the other two methods when cluster sizes were small; as cluster size and the intraclass correlation ρ increased, however, its Type I error rates inflated. The IRT-LR test maintained marginal Type I error control for small cluster sizes but failed to do so for larger cluster sizes.


Keywords: DIF · MIMIC · HGLM · IRT-LR test · Rasch model · Type I error rates
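The simulation logic summarized in the abstract (hierarchical Rasch data with a chosen ICC, no true DIF, and an empirical Type I error rate computed as the rejection proportion at α = .05) can be sketched as follows. This is a hypothetical illustration, not the authors' code: for brevity it substitutes a naive single-level logistic-regression DIF test for the HGLM, MIMIC, and IRT-LR procedures, and every function name and parameter value here is an assumption.

```python
# Hypothetical sketch of the simulation design: multilevel Rasch data with a
# given ICC, no DIF, and the empirical Type I error rate of a naive
# single-level logistic-regression DIF test (a stand-in, not the studied methods).
import numpy as np
from math import erfc, sqrt

def simulate_responses(n_clusters, cluster_size, icc, difficulties, rng):
    """Rasch responses; ability = cluster effect + person effect (total variance 1)."""
    u = rng.normal(0.0, sqrt(icc), n_clusters)              # cluster random effects
    theta = np.repeat(u, cluster_size) + rng.normal(
        0.0, sqrt(1.0 - icc), n_clusters * cluster_size)    # person abilities
    logits = theta[:, None] - difficulties[None, :]
    return (rng.random(logits.shape) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

def logistic_wald_p(X, y, coef_index, n_iter=25):
    """Newton-Raphson logistic regression; two-sided Wald p-value for one coefficient."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p) + 1e-10
        H = X.T @ (X * w[:, None])                          # observed information
        beta += np.linalg.solve(H, X.T @ (y - p))
    se = sqrt(np.linalg.inv(H)[coef_index, coef_index])
    return erfc(abs(beta[coef_index] / se) / sqrt(2.0))

def type1_error_rate(n_reps=200, n_clusters=20, cluster_size=20,
                     icc=0.4, n_items=10, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    difficulties = np.linspace(-1.5, 1.5, n_items)          # identical for both groups: no DIF
    rejections = 0
    for _ in range(n_reps):
        y = simulate_responses(n_clusters, cluster_size, icc, difficulties, rng)
        group = np.repeat(rng.integers(0, 2, n_clusters), cluster_size)  # cluster-level grouping
        rest = y[:, 1:].sum(axis=1)                         # matching score without the studied item
        X = np.column_stack([np.ones(len(rest)), rest, group])
        if logistic_wald_p(X, y[:, 0], coef_index=2) < alpha:
            rejections += 1
    return rejections / n_reps

rate = type1_error_rate()
print(f"Empirical Type I error rate at alpha = .05: {rate:.3f}")
```

Because the grouping variable varies only at the cluster level, a test that ignores the clustering tends to reject too often as the ICC and cluster size grow, which is exactly the inflation pattern the abstract reports for the single-level-style analyses.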



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Mei Ling Ong (1)
  • Laura Lu (2)
  • Sunbok Lee (3)
  • Allan Cohen (4)
  1. Quantitative Methods, Department of Educational Psychology, University of Georgia, Athens, USA
  2. Department of Educational Psychology, University of Georgia, Athens, USA
  3. Center for Family Research, Athens, USA
  4. Department of Educational Psychology, University of Georgia, Athens, USA
