Vertical Comparison Using Reference Sets

  • Anton A. Béguin
  • Saskia Wools
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 89)


When tests administered to different populations are compared, vertical item response theory (IRT) linking procedures can be used. However, the validity of the linking may be compromised when items in the procedure show differential item functioning (DIF), violating the procedure's assumption that item parameters are stable across populations. This article presents a procedure that is robust against DIF while still exploiting the advantages of IRT linking. This procedure, called comparisons using reference sets, is a variation of the scaling test design. Under this design, an anchor test (the reference set) is administered in all populations of interest, and a separate IRT scale is then estimated for each population. To link an operational test to the reference sets, a sample of items from the reference set is administered together with the operational test. A simulation study compares linking through reference sets with linking through a direct anchor; the results indicate that the procedure using reference sets has an advantage over other vertical linking procedures.
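To make the common-item linking step concrete, the sketch below shows a generic mean-sigma transformation, one standard way to place item difficulties estimated on an operational test's scale onto a reference scale via shared anchor items. This is an illustrative example only, not the authors' procedure (the article's analyses use dedicated IRT software such as OPLM and BILOG-MG); the scale-distortion constants and item counts are hypothetical.

```python
import random
import statistics

def mean_sigma_link(b_ref, b_new):
    """Mean-sigma linking: find (A, B) so that the anchor-item
    difficulties on the new scale, transformed as A * b + B,
    match those on the reference scale in mean and spread."""
    A = statistics.stdev(b_ref) / statistics.stdev(b_new)
    B = statistics.mean(b_ref) - A * statistics.mean(b_new)
    return A, B

# Hypothetical anchor (reference-set) difficulties on the reference scale.
random.seed(1)
b_ref = [random.gauss(0.0, 1.0) for _ in range(20)]

# The same items as estimated with the operational test, whose scale is
# distorted by hypothetical constants A_true, B_true relative to the
# reference scale: b_ref = A_true * b_new + B_true.
A_true, B_true = 0.8, -0.5
b_new = [(b - B_true) / A_true for b in b_ref]

# Recover the transformation from the anchor items alone, then use it
# to place any operational-test parameter on the reference scale.
A, B = mean_sigma_link(b_ref, b_new)
linked = [A * b + B for b in b_new]  # ≈ b_ref
```

Because separate scales are estimated per population, a transformation of this kind is what connects an operational test back to its population's reference set; DIF in the anchor items would perturb the estimated (A, B), which is the vulnerability the reference-set design is meant to mitigate.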


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Cito, Institute for Educational Measurement, Arnhem, The Netherlands