High-Stakes Testing Case Study: A Latent Variable Approach for Assessing Measurement and Prediction Invariance

Culpepper, Steven Andrew; Aguinis, Herman; Kern, Justin L.; Millsap, Roger

doi:10.1007/s11336-018-9649-2


Published: 22 January 2019

Psychometrika, Volume 84, pages 285–309 (2019)

Steven Andrew Culpepper¹,²,
Herman Aguinis³ (ORCID: orcid.org/0000-0002-3485-9484),
Justin L. Kern⁴ &
Roger Millsap⁵


“It is extremely rare to find an empirical prediction invariance study that also examines measurement invariance empirically, using the same data. No particular barrier exists to conducting such studies however.”

Roger E. Millsap (2007, p. 472)

Abstract

The existence of differences in prediction systems involving test scores across demographic groups continues to be a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. Thus, we examined differences across groups based on latent as opposed to observed scores, using data for 176 colleges and universities from The College Board. Measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5% and 29.9% of cohorts for Black versus White and Hispanic versus White comparisons, respectively. Also, on average, Black students with the same standing on a common factor had observed SAT-M scores that were nearly a third of a standard deviation lower than those of comparable White students. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, nearly a quarter of the statistically significant observed intercept differences were no longer statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is not simply to assess whether observed group-based differences exist and the size and direction of such differences. Rather, the goal of this research agenda is to assess the causal chain starting with underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.


Notes

Throughout our article, we use the term cohort to refer to “institution-cohort” because in some cases there is more than one cohort of students per institution (i.e., up to three cohorts for some institutions given that data were collected in 2006, 2007, and 2008).
But, please see the Potential Limitations and Additional Future Directions section for additional commentary regarding this issue.
The interaction of HSGPA and the grouping variable (i.e., $X_4 g$) was included as an instrument in the measurement models, but not HSGPA (i.e., $X_4$), alone. Preliminary analyses provided evidence of significant J-statistics for many cohorts when including $X_4$ as an instrument in the measurement models. One explanation as to why $X_4 g$ is a valid instrument, but not $X_4$, relates to the orthogonality of these variables with error terms. That is, the J-statistics provided evidence that $E(X_4 g\delta _j)=E(X_4 \delta _j |g=1)=0$ (for $j=1,2,3$), which suggests the orthogonality condition is satisfied for Blacks and Hispanics. The J tests that included $X_4$ suggested that $E(X_4 \delta _j )\ne 0$, which, given evidence that $E(X_4 \delta _j |g=1)=0$, suggests the orthogonality condition may not be satisfied for Whites.

References

Aguinis, H. (2004). Regression analysis for categorical moderators. New York: Guilford.
Aguinis, H. (2019). Performance management (4th ed.). Chicago, IL: Chicago Business Press.
Aguinis, H., Cortina, J. M., & Goldberg, E. (1998). A new procedure for computing equivalence bands in personnel selection. Human Performance, 11, 351–365.
Aguinis, H., Culpepper, S. A., & Pierce, C. A. (2010a). Revival of test bias research in preemployment testing. Journal of Applied Psychology, 95, 648–680.
Aguinis, H., Culpepper, S. A., & Pierce, C. A. (2016). Differential prediction generalization in college admissions testing. Journal of Educational Psychology, 108, 1045–1059.
Aguinis, H., Werner, S., Abbott, J. L., Angert, C., Park, J. H., & Kohlhausen, D. (2010b). Customer-centric science: Reporting significant research results with rigor, relevance, and practical impact in mind. Organizational Research Methods, 13, 515–539.
Albano, A. D., & Rodriguez, M. C. (2013). Examining differential math performance by gender and opportunity to learn. Educational and Psychological Measurement, 73, 836–856.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Aronson, J., & Dee, T. (2012). Stereotype threat in the real world. In T. Schmader & M. Inzlicht (Eds.), Stereotype threat: Theory, process, and application (pp. 264–278). Oxford: Oxford University Press.
Bernerth, J., & Aguinis, H. (2016). A critical review and best-practice recommendations for control variable usage. Personnel Psychology, 69, 229–283.
Berry, C. M., & Zhao, P. (2015). Addressing criticisms of existing predictive bias research: Cognitive ability test scores still overpredict African Americans’ job performance. Journal of Applied Psychology, 100, 162–179.
Birnbaum, Z. W., Paulson, E., & Andrews, F. C. (1950). On the effect of selection performed on some coordinates of a multi-dimensional population. Psychometrika, 15, 191–204.
Bollen, K. A. (1996). An alternative two stage least squares (2SLS) estimator for latent variable equations. Psychometrika, 61, 109–121.
Bollen, K. A., Kolenikov, S., & Bauldry, S. (2014). Model-implied instrumental variable—generalized method of moments (MIIV-GMM) estimators for latent variable models. Psychometrika, 79, 20–50.
Bollen, K. A., & Maydeu-Olivares, A. (2007). A polychoric instrumental variable (PIV) estimator for structural equation models with categorical variables. Psychometrika, 72, 309–326.
Bollen, K. A., & Paxton, P. (1998). Two-stage least squares estimation on interaction effects. In R. E. Schumacker & G. A. Marcoulides (Eds.), Interaction and nonlinear effects in structural equation modeling (pp. 125–151). Mahwah, NJ: Lawrence Erlbaum Associates.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425–440.
Borsboom, D., Romeijn, J. W., & Wicherts, J. M. (2008). Measurement invariance versus selection invariance: Is fair selection possible? Psychological Methods, 13, 75–98.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21, 230–258.
Bryant, D. (2004). The effects of differential item functioning on predictive bias (Unpublished doctoral dissertation). University of Central Florida, Orlando, Florida.
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5, 115–124.
Coyle, T. R., & Pillow, D. R. (2008). SAT and ACT predict college GPA after removing g. Intelligence, 36, 719–729.
Coyle, T. R., Purcell, J. M., Snyder, A. C., & Kochunov, P. (2013). Non-g residuals of the SAT and ACT predict specific abilities. Intelligence, 41, 114–120.
Coyle, T. R., Purcell, J. M., Snyder, A. C., & Richmond, M. C. (2014). Ability tilt on the SAT and ACT predicts specific abilities and college majors. Intelligence, 46, 18–24.
Culpepper, S. A. (2010). Studying individual differences in predictability with gamma regression and nonlinear multilevel models. Multivariate Behavioral Research, 45, 153–185.
Culpepper, S. A. (2012a). Using the criterion-predictor factor model to compute the probability of detecting prediction bias with ordinary least squares regression. Psychometrika, 77, 561–580.
Culpepper, S. A. (2012b). Evaluating EIV, OLS, and SEM estimators of group slope differences in the presence of measurement error: The single indicator case. Applied Psychological Measurement, 36, 349–374.
Culpepper, S. A. (2016). An improved correction for range restricted correlations under extreme, monotonic quadratic nonlinearity and heteroscedasticity. Psychometrika, 81, 550–564.
Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16, 166–178.
Culpepper, S. A., & Davenport, E. C. (2009). Assessing differential prediction of college grades by race/ethnicity with a multilevel model. Journal of Educational Measurement, 46, 220–242.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indexes to misspecified structural or measurement model components: Rationale of two-index strategy revisited. Structural Equation Modeling, 12, 343–367.
Fischer, F. T., Schult, J., & Hell, B. (2013a). Sex-specific differential prediction of college admission tests: A meta-analysis. Journal of Educational Psychology, 105, 478–488.
Fischer, F., Schult, J., & Hell, B. (2013b). Sex differences in secondary school success: Why female students perform better. European Journal of Psychology of Education, 28, 529–543.
Gottfredson, L. S. (1988). Reconsidering fairness: A matter of social and ethical priorities. Journal of Vocational Behavior, 33, 293–319.
Gottfredson, L. S., & Crouse, J. (1986). Validity versus utility of mental tests: Example of the SAT. Journal of Vocational Behavior, 29, 363–378.
Hägglund, G. (1982). Factor analysis by instrumental variables methods. Psychometrika, 47, 209–222.
Hayashi, F. (2000). Econometrics. Princeton, NJ: Princeton University Press.
Hausman, J. A., Newey, W. K., Woutersen, T., Chao, J. C., & Swanson, N. R. (2012). Instrumental variable estimation with heteroskedasticity and many instruments. Quantitative Economics, 3, 211–255.
Hong, S., & Roznowski, M. (2001). An investigation of the influence of internal test bias on regression slope. Applied Measurement in Education, 14, 351–368.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55.
Humphreys, L. G. (1952). Individual differences. Annual Review of Psychology, 3, 131–150.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Jöreskog, K. G. (1998). Interaction and nonlinear modeling: Issues and approaches. In R. E. Schumacker & G. A. Marcoulides (Eds.), Interaction and nonlinear effects in structural equation modeling (pp. 239–250). Mahwah, NJ: Lawrence Erlbaum Associates Inc.
Keiser, H. N., Sackett, P. R., Kuncel, N. R., & Brothen, T. (2016). Why women perform better in college than admission scores would predict: Exploring the roles of conscientiousness and course-taking patterns. Journal of Applied Psychology, 101, 569–581.
Kling, K. C., Noftle, E. E., & Robins, R. W. (2012). Why do standardized tests underpredict women’s academic performance? The role of conscientiousness. Social Psychological and Personality Science, 4, 600–606.
Lance, C. E., Beck, S. S., Fan, Y., & Carter, N. T. (2016). A taxonomy of path-related goodness-of-fit indices and recommended criterion values. Psychological Methods, 21, 388–404.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635–694.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Charlotte: Information Age Publishing Inc.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130–149.
Marsh, H. W., Wen, Z., & Hau, K. (2004). Structural equation models of latent interactions: Evaluation of alternative estimation strategies and indicator construction. Psychological Methods, 9, 275–300.
Mattern, K. D., & Patterson, B. F. (2013). Test of slope and intercept bias in college admissions: A response to Aguinis, Culpepper, and Pierce (2010). Journal of Applied Psychology, 98, 134–147.
McDonald, R. P., & Ho, M. H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7, 64–82.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Millsap, R. E. (1995). Measurement invariance, predictive invariance, and the duality paradox. Multivariate Behavioral Research, 30, 577–605.
Millsap, R. E. (1997). Invariance in measurement and prediction: Their relationship in the single-factor case. Psychological Methods, 2, 248–260.
Millsap, R. E. (1998). Group differences in regression intercepts: Implications for factorial invariance. Multivariate Behavioral Research, 33, 403–424.
Millsap, R. E. (2007). Invariance in measurement and prediction revisited. Psychometrika, 72, 461–473.
Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York: Routledge.
Moulder, B. C., & Algina, J. (2002). Comparison of methods for estimating and testing latent variable interactions. Structural Equation Modeling, 9, 1–19.
Muthén, B. O. (1989). Factor structure in groups selected on observed scores. British Journal of Mathematical and Statistical Psychology, 42, 81–90.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52, 431–462.
Nestler, S. (2014). How the 2SLS/IV estimator can handle equality constraints in structural equation models: A system-of-equations approach. British Journal of Mathematical and Statistical Psychology, 67, 353–369.
Nguyen, H. H. D., & Ryan, A. M. (2008). Does stereotype threat affect test performance of minorities and women? A meta-analysis of experimental evidence. Journal of Applied Psychology, 93, 1314–1334.
Nye, C. D., & Drasgow, F. (2011). Assessing goodness of fit: Simple rules of thumb simply do not work. Organizational Research Methods, 14, 548–570.
Oczkowski, E. (2002). Discriminating between measurement scales using nonnested tests and 2SLS: Monte Carlo evidence. Structural Equation Modeling, 9, 103–125.
Olea, M. M., & Ree, M. J. (1994). Predicting pilot and navigator criteria: Not much more than g. Journal of Applied Psychology, 79, 845–851.
Ployhart, R. E., Schmitt, N., & Tippins, N. T. (2017). Solving the supreme problem: 100 years of recruitment and selection research. Journal of Applied Psychology, 102, 291–304.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). Generalized multilevel structural equation modeling. Psychometrika, 69, 167–190.
Ree, M. J., & Earles, J. A. (1991). Predicting training success: Not much more than g. Personnel Psychology, 44, 321–332.
Ree, M. J., Earles, J. A., & Teachout, M. S. (1994). Predicting job performance: Not much more than g. Journal of Applied Psychology, 79, 518–524.
Sackett, P. R., & Ryan, A. M. (2011). Concerns about generalizing stereotype threat research findings to operational high-stakes testing settings. In T. Schmader & M. Inzlicht (Eds.), Stereotype threat: Theory, process, and application (pp. 246–259). Oxford: Oxford University Press.
Schmitt, N., Keeney, J., Oswald, F. L., Pleskac, T., Quinn, A., Sinha, R., et al. (2009). Prediction of 4-year college student performance using cognitive and noncognitive predictors and the impact of demographic status on admitted students. Journal of Applied Psychology, 94, 1479–1497.
Schult, J., Hell, B., Päßler, K., & Schuler, H. (2013). Sex-specific differential prediction of academic achievement by German ability tests. International Journal of Selection and Assessment, 21, 130–134.
Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). Washington, DC: American Psychological Association.
Sörbom, D. (1974). A general method for studying differences in factor means and factor structure between groups. British Journal of Mathematical and Statistical Psychology, 27, 229–239.
Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika, 43, 381–396.
Steele, C. M. (2011). Whistling Vivaldi: How stereotypes affect us and what we can do. New York: WW Norton & Company.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70.
Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557–574.
Walton, G. M., Murphy, M. C., & Ryan, A. M. (2015). Stereotype threat in organizations: Implications for equity and performance. Annual Review of Organizational Psychology and Organizational Behavior, 2, 523–550.
Wicherts, J. M., & Millsap, R. E. (2009). The absence of underprediction does not imply the absence of measurement bias. American Psychologist, 64, 281–283.
Wicherts, J. M., Dolan, C. V., & Hessen, D. J. (2005). Stereotype threat and group differences in test performance: A question of measurement invariance. Journal of Personality and Social Psychology, 89, 696–716.
Widaman, K. F., & Thompson, J. S. (2003). On specifying the null model for incremental fit indices in structural equation modeling. Psychological Methods, 8, 16–37.
Wu, W., West, S. G., & Taylor, A. B. (2009). Evaluating model fit for growth curve models: Integration of fit indices from SEM and MLM frameworks. Psychological Methods, 14, 183–201.
Young, J. W. (1991a). Gender bias in predicting college academic performance: A new approach using item response theory. Journal of Educational Measurement, 28, 37–47.
Young, J. W. (1991b). Improving the prediction of college performance of ethnic minorities using the IRT-based GPA. Applied Measurement in Education, 4, 229–239.
Zwick, R., & Himelfarb, I. (2011). The effect of high school socioeconomic status on the predictive validity of SAT scores and high school grade-point average. Journal of Educational Measurement, 48, 101–121.


Author information

R. Millsap: deceased.

Authors and Affiliations

Department of Statistics, University of Illinois at Urbana–Champaign, Champaign, IL, USA
Steven Andrew Culpepper
Department of Psychology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana–Champaign, Urbana, USA
Steven Andrew Culpepper
Department of Management, School of Business, George Washington University, Washington, USA
Herman Aguinis
Department of Educational Psychology, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Justin L. Kern
Department of Psychology, Arizona State University, Tempe, USA
Roger Millsap


Corresponding author

Correspondence to Steven Andrew Culpepper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Roger Millsap passed away unexpectedly on May 9, 2014 due to a brain hemorrhage. This article is the product of our collective work involving conceptualization, data collection and analysis, and writing. We dedicate the article to him.

We thank Alberto Maydeu-Olivares, a Psychometrika associate editor, and two anonymous reviewers for their excellent recommendations, which allowed us to improve our manuscript in a substantial manner.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (csv 1757 KB)

Supplementary material 2 (csv 188 KB)

Appendices

Appendix A Parameter Inference and Assessing Validity of Instruments

1.1 Parameter Inference

Under the assumption of constant error variance, asymptotic theory (e.g., see Hayashi, 2000) implies that,

$$\begin{aligned} \sqrt{n}(\hat{{{\varvec{b}}}}_j -{{\varvec{b}}}_j) \sim \mathcal {N}\left( 0,\hat{\sigma }_j^2 (\mathbf{S}_{vzj}^{\prime } \mathbf{S}_{vvj}^{-1} \mathbf{S}_{vzj})^{-1}\right) \end{aligned}$$

(A1)

where the estimator for the conditional error variance is,

$$\begin{aligned} \hat{\sigma }_j^2 =\frac{n-1}{n}\left( s_j^2 -2\mathbf{S}_{xzj}^{\prime } \hat{{{\varvec{b}}}}_j +\hat{{{\varvec{b}}}}_j^{\prime } \mathbf{S}_{zz} \hat{{{\varvec{b}}}}_j\right) \end{aligned}$$

(A2)

and $s_j^2$ is the sample variance of $X_j$, n is the sample size, $\mathbf{S}_{xzj}$ is a vector of covariances between $X_j$ and ${{\varvec{Z}}}$, and $\mathbf{S}_{zz}$ is the variance–covariance matrix of ${{\varvec{Z}}}$.
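The quantities in Equations A1 and A2 can be computed directly from sample covariance matrices. The following is a minimal sketch under stated assumptions, not the authors' implementation: it treats a single equation, works with covariances (so intercepts drop out), and uses a hypothetical function name `two_sls` and simulated data chosen only for illustration.

```python
import numpy as np

def two_sls(x, Z, V):
    """2SLS for one equation from sample covariances (illustrative sketch).

    x : (n,) dependent variable X_j
    Z : (n, k) explanatory variables
    V : (n, q) instruments, with q >= k
    Returns the coefficient estimates, the error variance of Equation A2,
    and standard errors implied by the asymptotic covariance in Equation A1.
    """
    n, k = len(x), Z.shape[1]
    S = np.cov(np.column_stack([x, Z, V]), rowvar=False)  # divisor n - 1
    s2 = S[0, 0]
    S_xz = S[0, 1:1 + k]
    S_zz = S[1:1 + k, 1:1 + k]
    S_vx = S[1 + k:, 0]
    S_vz = S[1 + k:, 1:1 + k]
    S_vv = S[1 + k:, 1 + k:]

    A = S_vz.T @ np.linalg.solve(S_vv, S_vz)            # S_vz' S_vv^{-1} S_vz
    b = np.linalg.solve(A, S_vz.T @ np.linalg.solve(S_vv, S_vx))
    sigma2 = (n - 1) / n * (s2 - 2 * S_xz @ b + b @ S_zz @ b)  # Equation A2
    cov_b = sigma2 * np.linalg.inv(A) / n               # Equation A1, rescaled by 1/n
    return b, sigma2, np.sqrt(np.diag(cov_b))

# Hypothetical data: z is correlated with the error of x through u,
# while the three instruments in V are not.
rng = np.random.default_rng(0)
n = 5000
V = rng.normal(size=(n, 3))
u = rng.normal(size=n)
z = V @ np.array([0.6, 0.5, 0.4]) + u
x = 0.8 * z + 0.5 * u + rng.normal(size=n)

b, sigma2, se = two_sls(x, z[:, None], V)
print(b, sigma2, se)
```

With centered variables, the expression in Equation A2 equals the mean squared residual, which offers a quick check on any implementation.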

1.2 Assessing Validity of Instruments

The question of whether a latent structure is adequate is generally translated into a statistical question as to whether the model fits the data. There is a vast body of work on the development and evaluation of model fit indices for structural equation models (e.g., Browne & Cudeck, 1992; Fan & Sivo, 2005; Hu & Bentler, 1999; Lance, Beck, Fan, & Carter, 2016; MacCallum, Browne, & Sugawara, 1996; McDonald & Ho, 2002; Nye & Drasgow, 2011; Vandenberg & Lance, 2000; Widaman & Thompson, 2003; Wu, West, & Taylor, 2009). Much prior research developed fit indices for ML estimators, although there are also formal tests of model fit for the instrumental variables (IV) estimator. Assessing model fit for the IV estimator is based upon assessing the quality of the instruments used to estimate model parameters. We employ Sargan’s J test for overidentification (Hayashi, 2000) to evaluate the adequacy of the 2SLS model fit. As Bollen et al. (2014) noted, the J test is used for a hypothesis test where, “The null hypothesis is that all IVs for each equation are uncorrelated with the disturbance of the same equation and this is true for each equation in the system. Rejection of the null hypothesis means that at least one IV in at least one equation is invalid” (p. 31). In the particular case of MI&PI studies, the J test statistics can be used to infer whether the measurement or structural models are misspecified. Note that the J tests do not detect misspecifications in the latent variable variance and covariance structure (e.g., missing covariance parameters between residual terms), which differs from typical SEM fit indices.

For the 2SLS estimator, Sargan’s omnibus J test of overidentification is,

$$\begin{aligned} J=n\mathop {\sum }\limits _j \frac{(\mathbf{S}_{vxj} -\mathbf{S}_{vzj} \hat{{{\varvec{b}}}}_j)^{\prime }{} \mathbf{S}_{vvj}^{-1} (\mathbf{S}_{vxj} -\mathbf{S}_{vzj} \hat{{{\varvec{b}}}}_j)}{\hat{\sigma }_j^2}, \end{aligned}$$

(A3)

which is evaluated using an asymptotic Chi-square distribution with degrees of freedom equal to the number of instruments less the number of unrestricted coefficients.
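To make Equation A3 concrete, the sketch below computes the J statistic for a single equation (one summand of the omnibus sum) together with its degrees of freedom and an asymptotic p-value. It is an illustration under assumptions rather than the authors' code: the names `sargan_j` and `chi2_sf` and the simulated data are hypothetical, and the chi-square tail probability is obtained from the standard recurrence for the regularized incomplete gamma function so that only the standard library and NumPy are needed.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Upper-tail chi-square probability for a positive integer df."""
    h = x / 2.0
    if df % 2:                       # odd df: start from df = 1
        q, a = math.erfc(math.sqrt(h)), 0.5
    else:                            # even df: start from df = 2
        q, a = math.exp(-h), 1.0
    while 2 * a < df:                # Q(a + 1, h) = Q(a, h) + h^a e^{-h} / Gamma(a + 1)
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return q

def sargan_j(x, Z, V):
    """J statistic, degrees of freedom, and p-value for one equation."""
    n, k, q = len(x), Z.shape[1], V.shape[1]
    S = np.cov(np.column_stack([x, Z, V]), rowvar=False)
    s2, S_xz, S_zz = S[0, 0], S[0, 1:1 + k], S[1:1 + k, 1:1 + k]
    S_vx, S_vz, S_vv = S[1 + k:, 0], S[1 + k:, 1:1 + k], S[1 + k:, 1 + k:]
    A = S_vz.T @ np.linalg.solve(S_vv, S_vz)
    b = np.linalg.solve(A, S_vz.T @ np.linalg.solve(S_vv, S_vx))  # 2SLS estimate
    sigma2 = (n - 1) / n * (s2 - 2 * S_xz @ b + b @ S_zz @ b)     # Equation A2
    m = S_vx - S_vz @ b                                            # instrument-residual covariances
    J = n * m @ np.linalg.solve(S_vv, m) / sigma2                  # one summand of Equation A3
    df = q - k                                                     # instruments minus coefficients
    return J, df, chi2_sf(J, df)

# Hypothetical data: three instruments, one explanatory variable.
rng = np.random.default_rng(0)
n = 5000
V = rng.normal(size=(n, 3))
z = V @ np.array([0.6, 0.5, 0.4]) + rng.normal(size=n)
x_valid = 0.8 * z + rng.normal(size=n)                     # all instruments valid
x_invalid = 0.8 * z + 0.5 * V[:, 2] + rng.normal(size=n)   # third instrument enters the error

J_ok, df_ok, p_ok = sargan_j(x_valid, z[:, None], V)
J_bad, df_bad, p_bad = sargan_j(x_invalid, z[:, None], V)
print(J_ok, p_ok, J_bad, p_bad)
```

For the omnibus test across several equations, the per-equation J values and degrees of freedom would be summed before computing the p-value.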

Appendix B Monte Carlo Simulation Study Assessing the Accuracy of the 2SLS Estimator

1.1 Overview

We conducted a Monte Carlo simulation study to assess the accuracy of the 2SLS estimator for MI&PI studies because prior research (e.g., Marsh, Wen, & Hau, 2004; Moulder & Algina, 2002) recommends against using the 2SLS approach (Bollen & Paxton, 1998) to estimate latent interaction effects involving continuous variables. Thus, our Monte Carlo study is necessary to evaluate the performance of the 2SLS estimator for latent interaction effects between categorical and continuous variables. We also compared the performance of the 2SLS estimator to the traditional multigroup ML procedure (e.g., see Jöreskog, 1971; Sörbom, 1974, 1978).

We based the Monte Carlo study upon the model in Fig. 1 where there are three observed variables ($X_1$, $X_2$, and $X_3$) as measures of a common factor $\xi $. Additionally, we assess parameter recovery for the structural relationship between $\xi $ and a single criterion variable, Y. Note that we fixed the correlation between $X_4$ and $\xi $ and the slope relating $X_4$ to Y to zero to focus on the accuracy of estimating group differences in measurement intercepts, prediction intercepts, and prediction slopes.

We chose parameter values for the Monte Carlo simulation based on values used in prior PI research (e.g., Aguinis et al., 2010; Culpepper & Aguinis, 2011; Culpepper & Davenport, 2009; Moulder & Algina, 2002) and estimates from the application reported in the main body of our article. We manipulated the following seven parameters: sample size (i.e., $n = 250$, 500, and 1000), proportion of the sample in the focal group (i.e., $p= 0.1$, 0.3, and 0.5), observed variable reliabilities (i.e., $r_{xx} = 0.5$, 0.7, and 0.9), group latent mean differences (i.e., $\kappa _1 -\kappa _0 = 0, -0.25$, and $-0.5$), measurement intercept differences for $X_2$ (i.e., $\tau _{21} -\tau _{20} = 0, -0.25$, and $-0.5$), latent prediction intercept differences (i.e., $\beta _{01} -\beta _{00} = 0, -0.25$, and $-0.5$), and latent slope differences (i.e., $\beta _{11} -\beta _{10} = 0, -0.125$, and $-0.25$). The remaining parameters were fixed across the simulation conditions; i.e., the loadings were defined as $\lambda _1 =\lambda _2 =\lambda _3 =1$, the latent intercept and slope for group $g = 0$ were $\beta _{00} =0$ and $\beta _{10} =\sqrt{0.5}$, measurement intercepts for both groups were set to zero (i.e., $\tau _{10} =\tau _{11} =\tau _{20} =\tau _{30} =\tau _{31} =0)$, and the criterion residual variance was $\psi =0.5$. Note that the unique factor variances for $X_1$, $X_2$, and $X_3$ (i.e., $\theta _1$, $\theta _2$, and $\theta _3$) were determined by values for $r_{xx}$.
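The data-generating model described above can be sketched as follows for a single simulation condition. Several details are assumptions made only for illustration and are not stated in the text: a common factor variance of 1, a reference-group latent mean of 0, normally distributed factors and errors, and a fixed (rather than random) number of focal-group members. Under the unit loadings, a reliability of $r_{xx}$ then implies a unique variance of $(1-r_{xx})/r_{xx}$. The function name `simulate_condition` is hypothetical.

```python
import numpy as np

def simulate_condition(n, p, r_xx, d_kappa, d_tau2, d_beta0, d_beta1, rng):
    """Draw one data set for a single simulation condition (illustrative)."""
    phi = 1.0                                  # ASSUMPTION: common factor variance
    theta = phi * (1.0 - r_xx) / r_xx          # unique variance implied by r_xx (loadings = 1)

    n1 = int(round(n * p))                     # ASSUMPTION: fixed focal-group size
    g = np.concatenate([np.ones(n1), np.zeros(n - n1)])

    # Latent scores: the reference-group mean is ASSUMED to be 0.
    xi = d_kappa * g + np.sqrt(phi) * rng.normal(size=n)

    # Indicators X_1..X_3 with unit loadings; only X_2 has a group intercept difference.
    tau = np.zeros((n, 3))
    tau[:, 1] = d_tau2 * g
    X = tau + xi[:, None] + np.sqrt(theta) * rng.normal(size=(n, 3))

    # Criterion: beta_00 = 0, beta_10 = sqrt(0.5), residual variance psi = 0.5.
    beta0 = d_beta0 * g
    beta1 = np.sqrt(0.5) + d_beta1 * g
    y = beta0 + beta1 * xi + np.sqrt(0.5) * rng.normal(size=n)
    return g, X, y

rng = np.random.default_rng(1)
g, X, y = simulate_condition(1000, 0.3, 0.7, -0.25, -0.5, -0.25, -0.125, rng)
print(X.shape, int(g.sum()), y.shape)
```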

Table 8 Type I error and power rates of ML and 2SLS estimators for measurement intercept differences, $\tau _{21} -\tau _{20}$, by n, p, $r_{xx}$.

1.2 Results

We performed the simulation study with a total of 2187 combinations of parameter values (i.e., $3^7$, three levels for each of the seven manipulated parameters). The outcomes of interest for the ML and 2SLS estimators were bias, Type I error rates, and power rates for $\tau _{21} -\tau _{20}$ (i.e., measurement intercept differences), $\beta _{01} -\beta _{00}$ (i.e., latent intercept differences), and $\beta _{11} -\beta _{10}$ (i.e., latent slope differences). We estimated the outcomes from 5000 replications and employed an a priori Type I error rate of 0.05 for all tests.
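As a small illustration of the design, the fully crossed grid of the seven manipulated parameters can be enumerated with the standard library, and bias and rejection rates can be summarized per condition. The dictionary keys and the helper `summarize` are hypothetical names introduced here; the rejection rate is read as a Type I error rate when the true difference is zero and as power otherwise.

```python
from itertools import product
from statistics import mean

# Levels of the seven manipulated parameters, as listed in the Overview.
grid = {
    "n": [250, 500, 1000],
    "p": [0.1, 0.3, 0.5],
    "r_xx": [0.5, 0.7, 0.9],
    "d_kappa": [0.0, -0.25, -0.5],
    "d_tau2": [0.0, -0.25, -0.5],
    "d_beta0": [0.0, -0.25, -0.5],
    "d_beta1": [0.0, -0.125, -0.25],
}
conditions = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(conditions))  # 3**7 fully crossed conditions

def summarize(estimates, p_values, true_value, alpha=0.05):
    """Bias and rejection rate across the replications of one condition."""
    bias = mean(estimates) - true_value
    rejection_rate = mean(1.0 if pv < alpha else 0.0 for pv in p_values)
    return bias, rejection_rate

# Toy usage with made-up numbers (not results from the study).
bias, rate = summarize([-0.24, -0.26, -0.25, -0.25], [0.01, 0.20, 0.03, 0.04], -0.25)
```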

Overall, the 2SLS estimator provided accurate estimates for all combinations of parameter values. More specifically, the mean bias for the 2SLS estimator across conditions and parameter values was $-0.001$, 0.000, and $-0.001$ for $\tau _{21} -\tau _{20} $, $\beta _{01} -\beta _{00} $, and $\beta _{11} -\beta _{10} $, respectively, and bias for the parameter values was less than 0.01 in absolute value for 99% of conditions. In contrast, the ML estimator failed to converge for some of the conditions with small n and p. The ML estimator demonstrated bias similar to that of the 2SLS estimator after removing 119 of the 2187 conditions for which the ML estimator did not converge. Table 8 reports Type I error rates and power for the ML and 2SLS tests of group measurement intercept differences, $\tau _{21} -\tau _{20}$, by values of n, p, and $r_{xx}$. Note that “a” in Table 8 denotes conditions where ML failed to converge for all replications. Table 8 provides evidence that the ML and 2SLS estimators effectively controlled Type I error rates. Furthermore, the power to detect group measurement intercept differences was affected by n, p, and $r_{xx}$. In general, power was larger for ML than 2SLS, but the difference between the methods declined as $\tau _{21} -\tau _{20}$, n, p, and $r_{xx}$ increased.

Tables 9 and 10 report Type I error rates and power for the ML and 2SLS tests of group differences in latent prediction intercepts (i.e., $\beta _{01} -\beta _{00}$) and latent slopes (i.e., $\beta _{11} -\beta _{10}$). Similar to the results in Table 8, the ML and 2SLS estimators controlled the Type I error rate at the a priori level and ML tended to be more powerful than 2SLS across parameter values. Additionally, the power to detect latent prediction intercept differences tended to be larger than the power to detect latent slope differences.

Table 9 Type I error and power rates of ML and 2SLS estimators for latent prediction intercept difference, $\beta _{01} -\beta _{00}$, by n, p, $r_{xx}$.


Table 10 Type I error and power rates of ML and 2SLS estimators for latent score slope differences, $\beta _{11} -\beta _{10}$, by n, p, $r_{xx}$.


In short, results summarized in Tables 8, 9, and 10 support the use of the 2SLS estimator to perform MI&PI studies. Reassuringly, statistical power for the 2SLS estimator was satisfactory for parameter conditions typically found in high-stakes testing contexts (e.g., $n > 500$ and $r_{xx} > 0.7$).


About this article

Cite this article

Culpepper, S.A., Aguinis, H., Kern, J.L. et al. High-Stakes Testing Case Study: A Latent Variable Approach for Assessing Measurement and Prediction Invariance. Psychometrika 84, 285–309 (2019). https://doi.org/10.1007/s11336-018-9649-2


Received: 18 October 2017
Published: 22 January 2019
Issue Date: 15 March 2019
DOI: https://doi.org/10.1007/s11336-018-9649-2
