A panel study is a powerful longitudinal design in which data are gathered from exactly the same people, groups, or organizations across multiple time points (Neuman, 2009). Panel studies allow researchers to investigate a moving picture of the observed units over time (i.e., a trajectory), rather than the single snapshot offered by cross-sectional studies. During the past few decades, two-stage cluster sampling (TCS) has been widely adopted for most large-scale panel studies (e.g., the Education Longitudinal Study of 2002, Ingels et al., 2013; or the Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999, Tourangeau, Nord, Lê, Sorongon, & Najarian, 2009). Briefly, TCS is conducted by randomly selecting clusters (e.g., schools) and then randomly selecting individuals (e.g., students) within the selected clusters (Lohr, 2009). Incorporating TCS in panel studies not only makes the research design more cost-efficient (Scheaffer, Mendenhall, & Ott, 2005), but also generates three-level data (e.g., school/cluster, student/individual, and time point) that permit a comprehensive investigation of trajectories at both the individual (e.g., student) and cluster (e.g., school) levels.

The multilevel latent growth curve model (MLGCM), which is subsumed by the multilevel structural equation modeling (MSEM) framework, has been advocated as a means of investigating individual and cluster trajectories (for further discussion, see B. O. Muthén & Asparouhov, 2011). In an MLGCM, the time dimension is converted into a multivariate vector, which allows three-level data to be analyzed with a two-level model in which individual-related parameters are estimated in the within model and cluster-related parameters are evaluated in the between model. An example of using an MLGCM to investigate individual and cluster trajectories can be found in B. O. Muthén’s (1997) study, in which he analyzed data drawn from the Longitudinal Study of American Youth (LSAY; Miller, Kimmel, Hoffer, & Nelson, 2000), a national panel study of mathematics and science education in US public schools.

Still, how to evaluate the goodness of fit of MLGCMs has not been well addressed. Model evaluation is required in order to examine the extent to which hypothesized models, proposed on the basis of solid theories or empirical findings, represent the relationships among the variables, given the data (Kaplan, 2009; Kline, 2011). One common approach to model evaluation uses fit indices (e.g., the root mean square error of approximation [RMSEA], comparative fit index [CFI], Tucker–Lewis index [TLI], and standardized root mean square residual [SRMR]) to assess model fit. However, because a traditional MLGCM comprises both between and within models, it has been suggested that the models at the different levels be evaluated separately with level-specific fit indices (Hox, 2010; Hsu, Kwok, Acosta, & Lin, 2015; Ryu, 2014; Ryu & West, 2009). Studies contributing to our understanding of the performance of level-specific fit indices in MSEM have been conducted in the contexts of multilevel confirmatory factor analysis (MCFA; e.g., Hsu, Lin, Kwok, Acosta, & Willson, 2016; Ryu & West, 2009), multilevel path models (Ryu, 2014), and multilevel nonlinear models (Schermelleh-Engel, Kerwer, & Klein, 2014). Among these three model types, MCFA is the most similar to the MLGCM, and some researchers have recommended using the aforementioned level-specific fit indices on the basis of simulation studies conducted in the context of MCFA. In particular, Ryu and West (2009) investigated whether level-specific CFI and RMSEA could detect a lack of fit at both the within and between levels in MCFA, and they found that these fit indices correctly indicated poor model fit at both levels, regardless of the sample size. However, we argue that this recommendation for using level-specific fit indices is hard to generalize from MCFA to MLGCMs, because Ryu and West's simulation considered only one misspecification, occurring in the covariance structure of the MCFA: the covariance between factors was 0.3 in the two-factor population model but was misspecified as 1.0. This misspecification was meaningful for MCFA, but it limits the generalizability of their recommendations to MLGCMs, because MLGCMs typically estimate a mean structure in addition to the covariance structure in order to describe trajectories comprehensively (W. Wu & West, 2010). Consequently, the current recommendations for using level-specific fit indices cannot effectively guide applied researchers in evaluating their hypothesized MLGCMs. For this reason, we attempted to address this gap in the literature by systematically investigating the sensitivity of level-specific fit indices to misspecifications occurring in the different structures of MLGCMs.

In addition to level-specific fit indices, our study also evaluated the performance of target-specific fit indices, originally computed for single-level latent growth curve models (SLGCMs) that have both a covariance structure and a mean structure. If any misspecifications occur in SLGCMs, traditional fit indices cannot tell which structure in the model is misspecified. To obtain more informative model evaluation results, W. Wu and West (2010) suggested that researchers consider evaluating these two structures separately. In their study, W. Wu and West generated and evaluated fit indices targeting the covariance structure and the mean structure separately. In the present study, we extended their investigation from the context of SLGCMs to MLGCMs. In MLGCMs, both the between and within models contain a covariance structure and a mean structure. However, the parameters in the within-mean structure are fixed to zero (dummy zero means; B. O. Muthén, 1997), and therefore within-level-specific fit indices are sufficient to describe the misspecification occurring in the within model. Consequently, we evaluated two kinds of target-specific fit indices in our study: target-specific fit indices for (a) the between-covariance structure (T_S_COV fit indices) and (b) the between-mean structure (T_S_MEAN fit indices). Following W. Wu and West (2010) as well as Ryu and West (2009), we created T_S_COV fit indices by saturating the within model as well as the mean structure of the between model. On the other hand, we created T_S_MEAN fit indices by saturating the within model as well as the covariance structure of the between model.

It should be noted that the extent to which target-specific fit index findings from SLGCMs can be generalized to MLGCMs remains an open question, because target-specific fit indices differ in nature between SLGCMs and MLGCMs. In an SLGCM there is only a single-level model, and saturation can occur only in either the mean structure or the covariance structure. However, as is shown in Appendix A, which outlines a practical way to derive target-specific fit indices, the computation of target-specific fit indices in an MLGCM requires that the within model be saturated as well. To the best of our knowledge, the sensitivity of target-specific fit indices in MLGCMs has not been investigated, and this study was intended to close this gap in the literature.

In summary, the purpose of this study was to conduct a systematic Monte Carlo simulation in order to carefully investigate the effectiveness of (a) level-specific fit indices and (b) target-specific fit indices in MLGCMs across varying conditions. We evaluated the extent to which fit indices could be independent of sampling error due to small sample sizes when a hypothesized model was correctly specified (Gerbing & Anderson, 1992; Marsh, Hau, & Grayson, 2005). Moreover, we evaluated the extent to which different fit indices could reflect the discrepancy between correctly specified models and misspecified hypothesized models (i.e., the indices’ sensitivity). We expected desirable fit indices to be less impacted by sampling error and to demonstrate reasonable sensitivity to misspecifications. This study contributes to the MSEM literature in two ways. First, it adds an understanding of the performance of level-specific and target-specific fit indices in MSEMs. Second, it makes recommendations regarding model evaluation practices for MLGCMs.

Multilevel latent growth curve models

This section presents a two-level latent growth curve model capturing quadratic growth at both levels as an example. The featured clustered longitudinal data include repeated measures for each individual nested within groups, thus forming a three-level structure. Consider a multilevel dataset with T waves of repeated measures for each of N individuals nested within G groups. For the ith individual within the gth group, y_ig is a multivariate normally distributed random vector whose T elements are the repeated measures y_tig, t = 1, 2, …, T, which can be expressed as

\( \mathbf{y}_{ig}={\left[{y}_{1ig},{y}_{2ig},\ldots,{y}_{Tig}\right]}_{T\times 1}^{\prime},\quad i=1,2,\ldots,N;\ g=1,2,\ldots,G. \)

The random vector y_ig can be decomposed into its between-level (B) and within-level (W) components:

$$ \begin{aligned} \mathbf{y}_{ig} &= \mathbf{y}_{B\cdot g}+\mathbf{y}_{W\cdot ig}\\ &= \boldsymbol{\mu}_B+\boldsymbol{\Lambda}_B\boldsymbol{\eta}_{B\cdot g}+\boldsymbol{\varepsilon}_{B\cdot g}+\boldsymbol{\mu}_W+\boldsymbol{\Lambda}_W\boldsymbol{\eta}_{W\cdot ig}+\boldsymbol{\varepsilon}_{W\cdot ig}. \end{aligned} $$
(1)

Here, two random vectors representing the unique variances of the repeated measures, ε_B·g and ε_W·ig, are specified separately for the two levels and are assumed to be uncorrelated with each other; ε_B·g is multivariate normally distributed with mean zero and covariance matrix Θ_B, and ε_W·ig is multivariate normally distributed with mean zero and covariance matrix Θ_W. The random vectors of latent growth factors, η_B·g and η_W·ig, comprise the latent growth factors I (intercept factor), L (linear slope factor), and Q (quadratic slope factor), and the corresponding factor loading matrices Λ_B and Λ_W for T (e.g., five) waves of measurement are set as:

$$ \boldsymbol{\eta}_{B\cdot g}={\begin{bmatrix} I_B\\ L_B\\ Q_B \end{bmatrix}}_{3\times 1}\ \text{with}\ \boldsymbol{\Lambda}_B={\begin{bmatrix} 1 & 0 & 0\\ 1 & 1 & 1\\ 1 & 2 & 4\\ 1 & 3 & 9\\ 1 & 4 & 16 \end{bmatrix}}_{5\times 3},\quad \text{and}\quad \boldsymbol{\eta}_{W\cdot ig}={\begin{bmatrix} I_W\\ L_W\\ Q_W \end{bmatrix}}_{3\times 1}\ \text{with}\ \boldsymbol{\Lambda}_W={\begin{bmatrix} 1 & 0 & 0\\ 1 & 1 & 1\\ 1 & 2 & 4\\ 1 & 3 & 9\\ 1 & 4 & 16 \end{bmatrix}}_{5\times 3}. $$

The variance–covariance matrix of y_ig is presented in Eq. 2:

$$ \operatorname{Cov}\left(\mathbf{y}_{ig}\right)=\boldsymbol{\Lambda}_B\boldsymbol{\Psi}_B\boldsymbol{\Lambda}_B^{\prime}+\boldsymbol{\Theta}_B+\boldsymbol{\Lambda}_W\boldsymbol{\Psi}_W\boldsymbol{\Lambda}_W^{\prime}+\boldsymbol{\Theta}_W. $$
(2)
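Equations 1 and 2 can be checked numerically. The short sketch below is our own illustration in Python/NumPy, not part of the original study: it builds the quadratic loading matrix defined above and returns the implied covariance matrix of Eq. 2, with placeholder identity matrices standing in for Ψ and Θ.

```python
import numpy as np

T = 5
t = np.arange(T)
# Quadratic growth loadings for I, L, and Q; identical at the two levels here
Lam_B = Lam_W = np.column_stack([np.ones(T), t, t ** 2])

def implied_cov(Psi_B, Theta_B, Psi_W, Theta_W):
    """Model-implied Cov(y_ig) per Eq. 2."""
    return (Lam_B @ Psi_B @ Lam_B.T + Theta_B
            + Lam_W @ Psi_W @ Lam_W.T + Theta_W)

# Placeholder inputs, just to show the shapes involved
print(implied_cov(np.eye(3), np.eye(T), np.eye(3), np.eye(T)).shape)  # (5, 5)
```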

Model evaluation in multilevel structural equation modeling

Level-specific fit indices

Previous studies have indicated that traditional global fit indices (e.g., RMSEA) can only reveal the model fit of the within models, and thus cannot be used to evaluate the between models (Hsu et al., 2015; Ryu & West, 2009). Hox (2010) drew attention to the need to develop level-specific (l-s) fit indices that evaluate the within model and the between model separately. Ryu and West (2009) initiated the evaluation of l-s fit indices, and several recently published works have recommended that researchers apply l-s fit indices to evaluate the corresponding models at different levels in MSEMs (Hsu et al., 2016; Ryu, 2014; Ryu & West, 2009; Schermelleh-Engel et al., 2014).

According to Ryu and West (2009), the partially saturated model method (PS method) can be used straightforwardly to compute l-s fit indices. For example, using the PS method, the between-level-specific (b-l-s) \( \chi^2 \) test statistic (\( \chi_{PS\_B}^2 \)) can be obtained by specifying a hypothesized between model and saturating the within model (Hox, 2010). A saturated model can be seen as a just-identified model with zero degrees of freedom, and thus has a \( \chi^2 \) test statistic equal to zero. As a result, the b-l-s \( \chi^2 \) test statistic reflects only the model fit of the hypothesized between model (Hox, 2010). After \( \chi_{PS\_B}^2 \) is obtained, b-l-s fit indices (e.g., RMSEA_PS_B, CFI_PS_B, and TLI_PS_B) can be computed, because these fit indices are functions of the \( \chi^2 \) test statistic. In the same way, the within-level-specific (w-l-s) \( \chi^2 \) test statistic (\( \chi_{PS\_W}^2 \)) can also be derived by using the PS method, reflecting the model fit of the hypothesized within model. After \( \chi_{PS\_W}^2 \) is obtained, w-l-s fit indices (e.g., RMSEA_PS_W, CFI_PS_W, and TLI_PS_W) can be computed. In addition, alternative l-s fit indices, SRMR_B and SRMR_W, which are not computed on the basis of l-s \( \chi^2 \) test statistics, are also available for evaluating the models at different levels in some statistical packages (e.g., Mplus). The formulas for computing l-s fit indices are introduced in Appendix B. Generally, b-l-s and w-l-s fit indices are expected to detect any misspecifications occurring in the between model and the within model, respectively.
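To make this logic concrete, the sketch below applies the usual chi-square-based definitions of RMSEA, CFI, and TLI to the test statistic of a partially saturated model. The exact formulas adopted in the article are given in Appendix B; this sketch uses the common textbook definitions, hypothetical input values, and our assumption that the number of clusters serves as the between-level sample size.

```python
import numpy as np

def chisq_fit_indices(chisq, df, chisq_null, df_null, n):
    """Common chi-square-based definitions of RMSEA, CFI, and TLI."""
    rmsea = np.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))
    cfi = 1.0 - max(chisq - df, 0.0) / max(chisq_null - df_null, chisq - df, 0.0)
    tli = ((chisq_null / df_null) - (chisq / df)) / ((chisq_null / df_null) - 1.0)
    return rmsea, cfi, tli

# Hypothetical numbers for a partially saturated model whose within part is
# saturated, so chisq reflects only the hypothesized between model
rmsea_ps_b, cfi_ps_b, tli_ps_b = chisq_fit_indices(
    chisq=14.2, df=8,              # chi-square and df of the PS model
    chisq_null=310.5, df_null=14,  # between-level independence model (hypothetical)
    n=100,                         # treating the number of clusters as n (assumption)
)
print(round(rmsea_ps_b, 3), round(cfi_ps_b, 3), round(tli_ps_b, 3))
```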

Target-specific fit indices

We also evaluated target-specific (t-s) fit indices in this study. The idea of t-s fit indices originated in W. Wu and West’s (2010) study, which investigated the performance of fit indices in SLGCMs. SLGCMs contain a covariance structure and a mean structure. As W. Wu and West pointed out, traditional fit indices, such as RMSEA, can reflect the overall fit of an SLGCM, but fail to detect the structure in which the misspecification occurs. Thus, the results from an SLGCM evaluation using global fit indices do not provide sufficient information to substantive researchers for further model modification. Accordingly, W. Wu and West asserted that there is a need for fit indices that target evaluating the fit of one specific structure (covariance or mean structure) of the model.

More specifically, t-s fit indices for the mean structure only (e.g., RMSEA_T_S_MEAN, CFI_T_S_MEAN, TLI_T_S_MEAN, and SRMR_T_S_MEAN) can be derived by saturating the covariance structure of the SLGCM, whereas t-s fit indices for the covariance structure only (e.g., RMSEA_T_S_COV, CFI_T_S_COV, TLI_T_S_COV, and SRMR_T_S_COV) can be derived by saturating the mean structure of the SLGCM. W. Wu and West (2010) found that RMSEA_T_S_MEAN, CFI_T_S_MEAN, TLI_T_S_MEAN, and SRMR_T_S_MEAN were more sensitive to misspecifications in the mean structure than were traditional fit indices. However, RMSEA_T_S_COV, CFI_T_S_COV, TLI_T_S_COV, and traditional fit indices performed similarly in terms of their sensitivity to misspecifications in the covariance structure. Leite and Stapleton (2011) concurred with W. Wu and West's findings by confirming that SRMR_T_S_MEAN has greater power for rejecting misspecifications in the mean structure than does the traditional SRMR, whereas RMSEA_T_S_COV could not improve on the detection power of the traditional RMSEA. W. Wu and West (2010) explained that saturating the covariance structure can dramatically reduce the degrees of freedom, which in turn increases the power of the t-s-mean fit indices to detect misspecified mean structures. On the other hand, saturating the mean structure usually decreases the degrees of freedom by only a small amount, and thus cannot substantially change the power of t-s-cov fit indices. The degrees-of-freedom arithmetic below makes this asymmetry concrete.
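As a back-of-the-envelope count, consider a hypothetical five-wave quadratic SLGCM in which all three growth-factor covariances are freely estimated. The data supply 5 means and 15 (co)variances (20 moments), and the model spends 3 factor means, 6 factor (co)variances, and 5 residual variances (14 parameters), leaving df = 20 − 14 = 6. Saturating each structure in turn gives

$$ df_{\text{t-s-mean}}=20-\left(15+3\right)=2,\qquad df_{\text{t-s-cov}}=20-\left(5+6+5\right)=4. $$

Saturating the covariance structure thus removes four of the six df and concentrates all remaining misfit in the two df of the mean structure, whereas saturating the mean structure removes only two df, leaving the power to detect covariance misfit largely unchanged.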

Although t-s fit indices were recommended as a useful strategy for evaluating SLGCMs, as of yet there has been no empirical evidence in the literature to support the use of t-s fit indices for MLGCMs. To contribute to the understanding of how t-s fit indices perform, we examined the effectiveness of t-s fit indices in the context of MLGCMs. As previously mentioned, we evaluated two kinds of t-s fit indices in this study: the target-specific fit indices for (a) the between-covariance structure (RMSEA_T_S_COV, CFI_T_S_COV, TLI_T_S_COV, and SRMR_T_S_COV) and (b) the between-mean structure (RMSEA_T_S_MEAN, CFI_T_S_MEAN, TLI_T_S_MEAN, and SRMR_T_S_MEAN). We note that we did not evaluate t-s fit indices for the within model, because the means of the growth factors are fixed at zero (dummy zero means; B. O. Muthén, 1997), and it appears self-evident that any misspecifications detected by the w-l-s fit indices could be attributed to the within-covariance structure.

Following W. Wu and West (2010) and Ryu and West (2009), we created RMSEA_T_S_COV, CFI_T_S_COV, TLI_T_S_COV, and SRMR_T_S_COV by saturating the within model as well as the mean structure of the between model. More specifically, to saturate the mean structure, we freely estimated the intercepts for all repeated measures and fixed the means of the growth factors (i.e., I_B, L_B, and Q_B in Fig. 1) to zero. On the other hand, we created RMSEA_T_S_MEAN, CFI_T_S_MEAN, TLI_T_S_MEAN, and SRMR_T_S_MEAN by saturating the within model as well as the covariance structure of the between model. Appendix A outlines a practical way to derive t-s fit indices.

Fig. 1 A two-level MLGCM

More considerations when computing CFI- and TLI-related fit indices

Both CFI and TLI are used to evaluate model fit by comparing the hypothesized model to an independence model (Bentler, 1990; Tucker & Lewis, 1973). Note that the independence model must be nested within the hypothesized model. We created the CFI- and TLI-related fit indices on the basis of Widaman and Thompson's (2003) approach, in which the independence model is an intercept-only growth model in which only the mean of the intercept factor and the residual variances are freely estimated.

Method

We conducted a Monte Carlo study to assess the performance both of l-s fit indices (RMSEA_PS_B, CFI_PS_B, TLI_PS_B, SRMR_B, RMSEA_PS_W, CFI_PS_W, TLI_PS_W, and SRMR_W) and of t-s fit indices (RMSEA_T_S_COV, CFI_T_S_COV, TLI_T_S_COV, SRMR_T_S_COV, RMSEA_T_S_MEAN, CFI_T_S_MEAN, TLI_T_S_MEAN, and SRMR_T_S_MEAN) in an MLGCM, in terms of their independence from the influence of sample size and their sensitivity to misspecifications occurring in the between-covariance, between-mean, or within-covariance structure. The design factors we considered included the number of clusters, the cluster size, and the model specification. We used Mplus 7.4 (L. K. Muthén & Muthén, 1998–2017) to generate the simulated replications and estimate each of the models.

Population model

We adopted an MLGCM, shown in Fig. 1, as the population model. In line with previous simulation studies (W. Wu & West, 2010; W. Wu, West, & Taylor, 2009), we used a quadratic-trajectory population model to generate the simulated data. The repeated measures, denoted V1–V5, are assumed to be on a standardized scale (i.e., M = 0 and SD = 1). The quadratic growth pattern is modeled at both the within and between levels. The factor loadings of the intercept factors (I_W and I_B) are fixed at 1.0, and those of the linear slope factors (L_W and L_B) are set to 0, 1, 2, 3, and 4 as part of the growth model parameterization. To model a quadratic growth pattern, we specified the quadratic slope factors (Q_W and Q_B) with factor loadings set to 0, 1, 4, 9, and 16.

To mimic realistic conditions from an empirical dataset, we adopted parameter settings from the LSAY (Miller et al., 2000), which used a two-stage stratified probability sampling approach: representative schools were randomly selected, and students within the selected schools were then randomly sampled. The LSAY has been widely used to study growth in mathematics and science performance (e.g., Ma & Ma, 2004; Ma & Wilkins, 2007). Following B. O. Muthén's (2004) study, we analyzed the Cohort 2 data, which contain 3,102 students nested within 52 schools. We used an MLGCM, as presented in Fig. 1, to analyze the students' grade 7 to grade 11 mathematics achievement scores obtained by item response theory equating. The intraclass correlation coefficients (ICCs) of the five repeated measures ranged from .15 to .19. The ICC herein is a cluster statistic (e.g., at the school level) defined as the ratio of cluster-level variance to total variance (Cohen, Cohen, West, & Aiken, 2003; B. O. Muthén & Satorra, 1995). ICCs of this magnitude are common in educational research and suggest that the clustering should not be ignored (Hox, 2010).

The parameter settings for the mean structure (αW) and covariance structure (ΦW) of the population within model are presented in the following matrices:

$$ \boldsymbol{\alpha}_{W}=\begin{bmatrix}0\\ 0\\ 0\end{bmatrix},\qquad \boldsymbol{\Phi}_{W}=\begin{bmatrix}71.453 & 6.762\ \left(\tau_{01}\right) & 0\\ 6.762\ \left(\tau_{10}\right) & 14.755 & 0\\ 0 & 0 & 0.703\ \left(\tau_{22}\right)\end{bmatrix}. $$

In α_W, the means of I_W, L_W, and Q_W, referred to as dummy zero means by B. O. Muthén (1997), are fixed at zero. In Φ_W, the diagonal values are the variances of I_W (71.453), L_W (14.755), and Q_W (0.703), and the off-diagonal values are the covariances among the three factors. Following W. Wu and West's (2010) simulation design, the covariances between I_W and Q_W and between L_W and Q_W are constrained to be zero for simplicity. The error variances are 11.906, 15.249, 10.321, 12.592, and 1.931 for the repeated measures V1–V5, respectively, and are uncorrelated over time.

In this simulation, the between model has the same structure as the within model. The parameter settings for the mean structure and covariance structure of the between model are presented in matrices α_B and Φ_B, respectively:

$$ \boldsymbol{\alpha}_{B}=\begin{bmatrix}49.956\\ 4.324\\ -0.127\ \left(\gamma_{200}\right)\end{bmatrix},\qquad \boldsymbol{\Phi}_{B}=\begin{bmatrix}16.200 & 2.819\ \left(\beta_{01}\right) & 0\\ 2.819\ \left(\beta_{10}\right) & 0.609 & 0\\ 0 & 0 & 0.018\ \left(\beta_{22}\right)\end{bmatrix}. $$

The means of I_B, L_B, and Q_B are set to 49.956, 4.324, and −0.127, respectively. In Φ_B, the diagonal values are the variances of I_B (16.200), L_B (0.609), and Q_B (0.018), and the off-diagonal values are the covariances among the three factors. The covariances between I_B and Q_B and between L_B and Q_B are constrained to be zero. The error variances of the repeated measures V1–V5 are independent and set to 1.800, 1.277, 0.059, 0.541, and 0.305, respectively.
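The replications themselves were generated and analyzed in Mplus 7.4 (see below); purely to illustrate how these population values combine under Eq. 1, the following is a minimal NumPy sketch of the data-generating process (our own code and variable names, not the authors' scripts).

```python
import numpy as np

rng = np.random.default_rng(1)

T = 5
t = np.arange(T)
Lam = np.column_stack([np.ones(T), t, t ** 2])  # loadings for I, L, Q (both levels)

# Population values reported above
alpha_B = np.array([49.956, 4.324, -0.127])
Phi_B = np.array([[16.200, 2.819, 0.0],
                  [2.819, 0.609, 0.0],
                  [0.0, 0.0, 0.018]])
Theta_B = np.diag([1.800, 1.277, 0.059, 0.541, 0.305])
Phi_W = np.array([[71.453, 6.762, 0.0],
                  [6.762, 14.755, 0.0],
                  [0.0, 0.0, 0.703]])
Theta_W = np.diag([11.906, 15.249, 10.321, 12.592, 1.931])

def simulate(nc, cs):
    """Draw V1-V5 for nc clusters of cs individuals each, following Eq. 1."""
    rows = []
    for g in range(nc):
        # Between component, shared by all individuals in cluster g
        eta_b = rng.multivariate_normal(alpha_B, Phi_B)
        y_b = Lam @ eta_b + rng.multivariate_normal(np.zeros(T), Theta_B)
        for i in range(cs):
            # Within component, with dummy zero means for I_W, L_W, and Q_W
            eta_w = rng.multivariate_normal(np.zeros(3), Phi_W)
            y_w = Lam @ eta_w + rng.multivariate_normal(np.zeros(T), Theta_W)
            rows.append(y_b + y_w)
    return np.asarray(rows)

data = simulate(nc=100, cs=10)  # e.g., the NC = 100, CS = 10 condition
print(data.shape)               # (1000, 5)
```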

Design factors

We took three design factors into account: the number of clusters (NC), the cluster size (CS), and the model specification. NC is a critical factor when estimating MSEMs (Hox & Maas, 2001; Hox, Maas, & Brinkhuis, 2010). Therefore, we considered three NCs, in accordance with Hox and Maas's (2001) and J.-Y. Wu, Kwok, and Willson's (2015) simulation studies: 50, 100, and 200. CS was manipulated at three levels, 5, 10, and 20, in line with the simulation design of Hox and Maas (2001). This CS range is also consistent with the CSs found in two large-scale educational databases: the Early Childhood Longitudinal Study, Kindergarten Class of 1998–99 (CS = 18; Tourangeau et al., 2009), and the Early Childhood Longitudinal Study, Kindergarten Class of 2010–11 (CS = 15; Tourangeau et al., 2015).

The last design factor, the model specification, had two scenarios: correct specification and misspecification. Correct specification meant that a hypothesized model identical to the population model was specified to fit each of the simulated replications. An ideal fit index would successfully identify the correctly specified models. Conversely, if a hypothesized model with an intentionally imposed misspecification were fitted to the simulated replications, the ideal fit index would indicate that the hypothesized model was misspecified. The following section presents detailed information regarding the different types of intentional misspecifications.

Intentional misspecifications in the hypothesized models

We created a total of five types of model misspecification in this study. In line with previous simulation studies (W. Wu & West, 2010; W. Wu et al., 2009), we considered three intentional misspecifications in the hypothesized between model: (1) misspecifying the covariance between the intercept factor and the linear slope factor at the between level as 0 (β10/β01 = 0; MIS_COVB), (2) misspecifying the variance of the quadratic slope factor at the between level as 0 (β22 = 0; MIS_VARB), and (3) misspecifying the mean of the quadratic factor at the between level as 0 (γ200 = 0; MIS_MEANB). Additionally, we considered two intentional misspecifications in the hypothesized within model: (4) misspecifying the covariance between the intercept factor and the linear slope factor at the within level as 0 (τ10/τ01 = 0; MIS_COVW), and (5) misspecifying the variance of the quadratic slope factor at the within level as 0 (τ22 = 0; MIS_VARW). Consistent with previous studies, we considered only underparameterized misspecifications, implemented by fixing at zero targeted parameters whose population values were nonzero (Hu & Bentler, 1998). In each case, only one misspecification was imposed on the hypothesized model. Between-model-related fit indices were expected to detect the intentional misspecifications occurring in the between model (MIS_COVB, MIS_VARB, and MIS_MEANB), and within-model-related fit indices were expected to detect the intentional misspecifications occurring in the within model (MIS_COVW and MIS_VARW).

In this study, we did not consider misspecifications in the residual (co)variances at the between or within levels, for two reasons. First, the residual variances at the between level are often trivial (Hox, 2010). Therefore, misspecification in the residual (co)variances at the between level is less likely to be of practical concern for researchers. Second, the structure of residual (co)variances at the within level (i.e., within-subject residuals) can be very complicated, and an independent simulation study on this issue is therefore warranted (e.g., Kwok, West, & Green, 2007).

Population parameters

To ensure that the severities of the five intentional misspecifications were the same (Fan & Sivo, 2005), we adjusted the magnitudes of key parameters in the population models so that each misspecification corresponded to a power of .80, given a number of clusters of 100 and a cluster size of 10 (for further discussion, see W. Wu & West, 2010, pp. 427–428), before generating the simulated replications. Appendix C presents the key parameter settings in the population models for each of the five misspecified conditions. For example, population model M1 in Appendix C was used to generate the simulated data for the MIS_COVB condition (i.e., the intentional misspecification β10/β01 = 0). The value of the population parameter β10/β01 was adjusted to 4.390 rather than 2.819, so that the severity of the misspecification β10/β01 = 0 reached a power of .80. As a result, the severity of the misspecifications would not confound the performance of the fit indices of interest. A sketch of this calibration logic appears below.
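The calibration itself requires fitting each misspecified model to the population moments. As a rough sketch of the logic, the following finds the chi-square noncentrality at which a test of the misspecification reaches a power of .80; the df value is hypothetical, and the final mapping back to a parameter value such as β10 is only indicated in a comment.

```python
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def power(lam, df, alpha=0.05):
    """Power of a chi-square test of fit given noncentrality lam."""
    crit = chi2.ppf(1 - alpha, df)
    return ncx2.sf(crit, df, lam)

df = 8  # hypothetical df of the between-model test
target_lam = brentq(lambda lam: power(lam, df) - 0.80, 1e-6, 100.0)
print(round(target_lam, 2))  # noncentrality needed for power = .80

# One would then adjust the population parameter (e.g., beta_10) until the
# noncentrality implied by fitting the misspecified model to the population
# moments, n * F_min, matched target_lam for NC = 100 and CS = 10.
```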

Analysis

The two design factors related to sample size (NC: 50, 100, and 200; CS: 5, 10, and 20) and the six specification conditions (correct specification, MIS_COVB, MIS_VARB, MIS_MEANB, MIS_COVW, and MIS_VARW) were crossed to form 54 conditions. For each condition, replications with convergence problems or improper solutions (e.g., negative unique variances) were excluded and replaced until 1,000 usable replications had been generated. The fit indices of interest produced by both the correctly specified and the misspecified models were saved for further analyses.

We conducted two sets of analyses. The first set evaluated whether the fit indices of interest could be independent of the sampling error that arises from small sample sizes when a hypothesized model was correctly specified. Following Marsh, Hau, and Grayson (2005), we analyzed the fit index values (i.e., the outcome variables) under the condition in which the hypothesized models were correctly specified (i.e., the correct-specification condition). We conducted a series of factorial analyses of variance (ANOVAs) on the fit index values to examine the effects of the design factors NC and CS on the performance of the fit indices. Fit indices with lower effect sizes (indicated by eta-squared, discussed below) for the factors NC and CS were less influenced by sampling error. Because the RMSEA-, CFI-, and TLI-related fit indices are functions of the \( \chi^2 \) test statistic, we conducted similar ANOVAs on the \( \chi^2 \) test statistics to inform readers of the extent to which the \( \chi^2 \) test statistic could be influenced by the design factors.

The second set of analyses evaluated the extent to which the fit indices were able to reflect the discrepancy between correctly specified and misspecified models (i.e., their sensitivity). Ideal fit indices should reflect the misfit arising from the imposed misspecification. Therefore, the sensitivity of a fit index can be captured as the discrepancy between the values derived from the misspecified model and those from the correctly specified model; a larger discrepancy indicates higher sensitivity. Analytically, we combined the fit index values derived from a misspecified model and from the correctly specified model into one dataset and then analyzed the fit index values (outcome variables) with ANOVAs to determine sensitivity. The factors in these ANOVAs included sensitivity (SEN; i.e., replications fitted to misspecified vs. correctly specified models), type of misspecification (MIS; e.g., MIS_COVB vs. MIS_VARB), NC, CS, and all interaction terms. Similar ANOVAs were conducted on the \( \chi^2 \) test statistics for comparison purposes. On the basis of the results of these two sets of analyses, we were able to make recommendations for practical and theoretical research, which appear in the Results and Conclusion sections.

As was mentioned in the Method section, we adjusted key parameters in the population models (see Appendix C, population models M1–M5) to ensure that the findings derived from the five intentional misspecification conditions would be comparable (i.e., not confounded by the severity of the different misspecifications). Population models M1 and M2 were designed to generate simulated replications for evaluating the sensitivity of fit indices to the misspecified between-covariance structure; the evaluation of the b-l-s and t-s-cov fit indices was based on replications generated by these two models. In contrast, population model M3 was designed to evaluate the sensitivity of the fit indices to the misspecified between-mean structure, so we evaluated the t-s-mean fit indices on replications generated by population model M3. Finally, we evaluated the w-l-s fit indices on replications generated by population models M4 and M5. Furthermore, for each factorial ANOVA, the total sum of squares (SOS) of each fit index captured the variability of the fit index values across all replications under the given simulation conditions. We computed eta-squared (η2) by dividing the Type III SOS of a particular predictor or interaction effect by the corrected total SOS, which gives the proportion of variance accounted for by that design factor or interaction term. Following Cohen's (1988, 1992) suggestion, we adopted a moderate η2 of .0588 as the threshold for identifying design factors with a practically significant influence on the fit index values. Note that when a fit index had a corrected total SOS close to 0 (i.e., a variance close to 0), the impact of the design factors on that fit index was self-evidently trivial, even if the η2s were larger than .0588. When fit indices had extremely low variability, we further clarify the interpretation of their design factors' η2s in the Results section.
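The η2 computation can be reproduced along the following lines. This is a sketch with randomly generated placeholder values and hypothetical column names; only the layout, the Type III SOS, and the division by the corrected total SOS are the point.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Placeholder layout: one row per converged replication under correct
# specification, with the fit-index value and its NC/CS condition
df = pd.DataFrame({
    "rmsea_ps_b": rng.uniform(0.0, 0.08, 9000),
    "NC": np.tile(np.repeat([50, 100, 200], 1000), 3),
    "CS": np.repeat([5, 10, 20], 3000),
})

# Sum-to-zero contrasts so that Type III sums of squares are meaningful
model = smf.ols("rmsea_ps_b ~ C(NC, Sum) * C(CS, Sum)", data=df).fit()
aov = sm.stats.anova_lm(model, typ=3)

ss_total = ((df["rmsea_ps_b"] - df["rmsea_ps_b"].mean()) ** 2).sum()
eta_sq = aov["sum_sq"].drop(["Intercept", "Residual"]) / ss_total
print(eta_sq.round(4))  # flag design factors with eta^2 >= .0588
```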

Results

Three tables present the simulation results. Table 1 shows the ANOVA results (η2) with the \( \chi^2 \) test statistics or fit index values as the dependent variables, to evaluate the sensitivity to sampling error, whereas Table 2 shows the ANOVA results used to investigate the sensitivity to misspecifications. To clarify the differences in performance between the targeted fit indices and the \( \chi^2 \) test statistics, we also present descriptive statistics for the \( \chi^2 \) test statistics, including means, standard deviations, and Type I error rates/power, in Table 3. Our intent in presenting Table 3 is to inform researchers under which conditions we encourage the \( \chi^2 \) test statistics to be used (given reasonable Type I error rates or power) along with the fit indices.
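For reference, the Type I error rates and power reported in Table 3 are rejection rates of the α = .05 chi-square test across replications. A minimal sketch of that computation, with stand-in chi-square draws in place of the actual replication output:

```python
import numpy as np
from scipy.stats import chi2

df = 8                             # hypothetical test df
crit = chi2.ppf(0.95, df)          # critical value at alpha = .05

rng = np.random.default_rng(2)
chisq_correct = rng.chisquare(df, 1000)                    # stand-in: correct models
chisq_misspec = rng.noncentral_chisquare(df, 10.0, 1000)   # stand-in: misspecified models

type1 = np.mean(chisq_correct > crit)  # Type I error rate
power = np.mean(chisq_misspec > crit)  # power
print(type1, power)
```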

Table 1 ANOVA results (η2) with \( \chi^2 \) test statistics and fit index values as the dependent variable, indicating the sensitivity to sampling error (left side), and the NC and CS required to accurately identify correct models (right side)
Table 2 ANOVA results (η2) with \( \chi^2 \) test statistics and fit index values as the dependent variable, indicating the sensitivity to misspecifications
Table 3 Descriptive statistics of the \( \chi^2 \) test statistics by NC and CS for the correctly specified and misspecified conditions

Convergence rates

The convergence rates for each simulation condition were approximately 100%. These results suggest that even the smallest sample size considered in this study (NC = 50, CS = 5) was unlikely to produce convergence problems when an MLGCM with five repeated measures was specified and analyzed.

First set of analyses: Sensitivity to sampling error

The first set of analyses aimed to evaluate whether the fit indices of interest could be independent of the sampling error arising from small sample sizes when a hypothesized model was correctly specified. The left side of Table 1 presents the ANOVA results (η2) with the \( \chi^2 \) test statistics or fit index values as the dependent variables. Note that the main effects of NC and CS were practically significant for some fit indices (described below), but none of the NC × CS interaction effects were practically significant. We provide a visual representation of the main effects of NC and CS with boxplots for each fit index in Figs. 2 and 3, respectively. The horizontal dashed lines in the figures denote the traditional cutoff criteria for the fit indices (RMSEA-related fit indices < .06; CFI- and TLI-related fit indices > .95; SRMR-related fit indices < .08; Hu & Bentler, 1999). These lines facilitate judging whether the fit indices were able to accurately identify correct models across NCs and CSs if the cutoff criteria were applied.

Fig. 2 (a) Box plots of RMSEA- and SRMR-related fit index values derived from correctly specified models, by number of clusters (50, 100, and 200). (b) Box plots of CFI- and TLI-related fit index values derived from correctly specified models, by number of clusters. Horizontal dashed lines indicate the traditional cutoff criteria for fit indices (RMSEA-related fit indices < .06; SRMR-related fit indices < .08; CFI- and TLI-related fit indices > .95; Hu & Bentler, 1999)

Fig. 3 (a) Box plots of RMSEA- and SRMR-related fit index values derived from correctly specified models, by cluster size (5, 10, and 20). (b) Box plots of CFI- and TLI-related fit index values derived from correctly specified models, by cluster size. Horizontal dashed lines indicate the traditional cutoff criteria for fit indices (RMSEA-related fit indices < .06; SRMR-related fit indices < .08; CFI- and TLI-related fit indices > .95; Hu & Bentler, 1999)

\( \chi^2 \) test statistics

Table 1 provides results for the various \( \chi^2 \) test statistics, including \( \chi_{PS\_B}^2 \), \( \chi_{PS\_W}^2 \), \( \chi_{T\_S\_COV}^2 \), and \( \chi_{T\_S\_MEAN}^2 \), whose η2s ranged from .000 to .037. In other words, these \( \chi^2 \) test statistics were not practically significantly impacted by NC, CS, or NC × CS. The top of Table 3 presents the means, standard deviations, and Type I error rates of these four \( \chi^2 \) test statistics. In general, \( \chi_{PS\_B}^2 \) (means ranging from 6.897 to 21.886), \( \chi_{PS\_W}^2 \) (means ranging from 1.464 to 17.477), \( \chi_{T\_S\_COV}^2 \) (means ranging from 5.082 to 13.893), and \( \chi_{T\_S\_MEAN}^2 \) (means ranging from 9.520 to 10.097) had means approaching their degrees of freedom when NC = 200 and CS = 20.

The Type I error rates of the four \( \chi^2 \) test statistics tended to be closer to the nominal α level (.05) as the sample size increased; however, the Type I error rates were inflated across the various NC and CS conditions. More specifically, the Type I error rates of the three between-level-related \( \chi^2 \) test statistics (\( \chi_{PS\_B}^2 \), \( \chi_{T\_S\_COV}^2 \), and \( \chi_{T\_S\_MEAN}^2 \)) were not satisfactory: \( \chi_{PS\_B}^2 \) had Type I error rates ranging from .094 to .421 across all sample conditions; \( \chi_{T\_S\_COV}^2 \)'s error rates ranged from .116 to .354; and \( \chi_{T\_S\_MEAN}^2 \)'s error rates ranged from .065 to .285, where .065 occurred when the sample size was 4,000 (NC/CS = 200/20). On the other hand, \( \chi_{PS\_W}^2 \) had Type I error rates ranging from .051 to .426, where increasing the sample size to 2,000 or above resulted in a reduced Type I error rate. For example, given a sample size of 2,000 (NC/CS = 100/20 or 200/10), the Type I error rates of \( \chi_{PS\_W}^2 \) were between .062 and .077; given a sample size of 4,000 (NC/CS = 200/20), the Type I error rate was .051.

Between-level-specific fit indices

RMSEA_PS_B was practically significantly influenced by NC only (η2 = .124), whereas CFI_PS_B and TLI_PS_B were influenced by CS only (η2s = .063 and .059, respectively). As Fig. 2a suggests, RMSEA_PS_B approached values indicative of good model fit (i.e., 0) as NC increased; thus, NC = 200 is recommended when RMSEA_PS_B is used to identify a correct between model. On the other hand, as shown in Fig. 3b, CFI_PS_B and TLI_PS_B approached values indicative of good model fit (i.e., 1) as CS increased. CFI_PS_B was able to identify correct between models given CS = 5. However, as compared with CFI_PS_B, the values of TLI_PS_B were more spread out and thus required CS = 10 in order to identify correct between models.

At first glance, SRMR_B appears to be strongly impacted by both NC (η2 = .088) and CS (η2 = .189). However, the boxplots in Figs. 2a and 3a clearly show that the variability of SRMR_B is extremely low. Indeed, the variance of SRMR_B, computed on the basis of all simulated replications, was close to 0. Therefore, the noted magnitudes of the η2s for NC and CS do not mean that NC and CS had any practical impact on SRMR_B.

Within-level-specific fit indices

RMSEA_PS_W was impacted by both NC (η2 = .086) and CS (η2 = .124); on the basis of the magnitudes of the η2s, RMSEA_PS_W was more influenced by CS than by NC. Both CFI_PS_W and TLI_PS_W had low variability and were not practically significantly influenced by NC or CS. On the other hand, although SRMR_W had practically significant η2s for NC and CS, it had very little variability, so NC and CS did not have practical impacts on SRMR_W. For all w-l-s fit indices, NC = 50 (as indicated by Figs. 2a and 2b) and CS = 5 (as indicated by Figs. 3a and 3b) were sufficient to identify correct within models.

Target-specific fit indices for the between-covariance structure

RMSEA_T_S_COV was practically significantly influenced by NC (η2 = .158) but not by CS (η2 = .051). As Fig. 2a suggests, RMSEA_T_S_COV required NC = 200 in order to accurately identify correct between-covariance structures. On the other hand, both CFI_T_S_COV and TLI_T_S_COV were practically significantly affected by both NC (η2s = .069 and .062, respectively) and CS (η2s = .071 and .064, respectively). Nevertheless, we found that NC = 50 (as indicated by Fig. 2b) and CS = 5 (as indicated by Fig. 3b) were sufficient to accurately identify correct between-covariance structures. As for SRMR_T_S_COV, its variance computed on the basis of all simulated replications was close to 0; therefore, its larger noted η2s for NC and CS do not mean that it was practically affected by NC and CS.

Target-specific fit indices for the between-mean structure

RMSEA_T_S_MEAN was practically significantly influenced by both NC (η2 = .133) and CS (η2 = .114), and so was TLI_T_S_MEAN (η2: NC = .061, CS = .120). RMSEA_T_S_MEAN placed a higher demand on both NC (200, as indicated by Fig. 2a) and CS (> 20, as indicated by Fig. 3a) in order to accurately identify correct between-mean structures, whereas TLI_T_S_MEAN placed a moderate demand (NC = 100, as indicated by Fig. 2b, and CS = 10, as indicated by Fig. 3b). CFI_T_S_MEAN was practically significantly affected by CS (η2 = .094), but we nonetheless found that CS = 5 was sufficient. On the other hand, SRMR_T_S_MEAN had a variance close to 0, and thus NC and CS had no practical effect on it.

Required NC and CS in order to accurately identify correct models/structures

We have summarized the required values of NC and CS for the \( \chi^2 \) test statistics and each fit index on the right side of Table 1. As mentioned earlier, the Type I error rates of the four \( \chi^2 \) test statistics were inflated across all sample conditions (see Table 3). The results showed that \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_COV}^2 \) could not reasonably identify correct models even when the sample size was as large as 4,000. In contrast, \( \chi_{PS\_W}^2 \) and \( \chi_{T\_S\_MEAN}^2 \) had Type I error rates close to .05 when the sample sizes were 2,000 or above (NC/CS = 100/20, 200/10, or 200/20) and 4,000 (NC/CS = 200/20), respectively. In general, all of the fit indices placed low demands on NC (50) and CS (5), with the following exceptions. Three of the RMSEA-related fit indices, namely RMSEA_PS_B, RMSEA_T_S_COV, and RMSEA_T_S_MEAN, placed high demands on NC (= 200) and CS (> 20) in comparison with the other fit indices. Moreover, two of the TLI-related fit indices, TLI_PS_B and TLI_T_S_MEAN, placed medium demands on NC (100) and CS (10). Our results suggest that the fit indices (except for the RMSEA-related ones) were more effective in identifying correct models/structures than were the \( \chi^2 \) test statistics.

Second set of analyses: Sensitivity to misspecification

In this section, we evaluate the performance of the fit indices under the various scenarios in which the hypothesized models were misspecified. We present the η2s for the main effects of sensitivity (SEN), NC, CS, and type of misspecification (MIS), as well as for some of their second-order interaction effects, in Table 2. The remaining second-, third-, and higher-order interaction effects are not presented in the table, because none of their η2s were practically significant. The means of the fit indices across the different sample size conditions are available from the first author on request.

Misspecified between-covariance structure

The top ten rows of Table 2 contain the η2s of the design factors for both the b-l-s and t-s-cov \( \chi^2 \) test statistics and fit indices. Note that the factor MIS included two types of misspecification, MIS_COVB and MIS_VARB, and a practically significant η2 for MIS indicated that the fit index was more sensitive to one type of misspecification (e.g., MIS_COVB) than to the other (e.g., MIS_VARB).

The results suggested that both \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_COV}^2 \) had practically significant η2s of sensitivity (.179 and .189, respectively) and were not impacted by the other design factors. The means, standard deviations, and powers of \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_COV}^2 \) for the MIS_COVB and MIS_VARB conditions are reported in Table 3. Although the values of \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_COV}^2 \) were not practically sensitive to our design factors, Table 3 suggests that these two \( \chi^2 \) test statistics were able to detect a misspecified between-covariance structure when the sample sizes were large. Specifically, in the MIS_COVB condition, both \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_COV}^2 \) had adequate power (≥ .800) given a sample size of 1,000 (NC/CS = 50/20, 100/10, or 200/5), whereas in the MIS_VARB condition, a sample size over 1,000 was required in order to reach an adequate power level. The results also suggested that \( \chi_{T\_S\_COV}^2 \) outperformed \( \chi_{PS\_B}^2 \) in most sample size scenarios. In summary, a sample size over 1,000 was suggested for \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_COV}^2 \) to appropriately detect misspecified between-covariance structures. Moreover, \( \chi_{T\_S\_COV}^2 \) was favored over \( \chi_{PS\_B}^2 \) due to its relatively higher power in most sample size conditions.

On the other hand, the t-s-cov fit indices (saturating both the within model and the between-mean structure) had relatively larger sensitivity η2s than did the b-l-s fit indices (saturating the within model only), except for SRMR. For example, the sensitivity η2 for RMSEA_T_S_COV was .280, which was larger than that of RMSEA_PS_B (.148). Similarly, TLI_T_S_COV had a larger sensitivity η2 (.082) than did TLI_PS_B (.012). Although CFI_T_S_COV had a larger sensitivity η2 (.054) than did CFI_PS_B (.026), neither η2 was practically significant.

Alternatively, SRMR_B and SRMR_T_S_COV had similarly high sensitivity η2s (.307 and .305, respectively), suggesting that both fit indices performed equally well in terms of their sensitivity to the misspecified between-covariance structure. Nevertheless, both SRMR_B and SRMR_T_S_COV were also practically significantly affected by CS (η2s = .069). In other words, when the between-covariance structure was misspecified, the values of SRMR_B and SRMR_T_S_COV reflected not only the severity of the misspecification, but also the size of CS. Our further data analysis showed that the means of SRMR_B and SRMR_T_S_COV approached values indicative of poor model fit (i.e., 1) as CS decreased.

In summary, SRMR_B and SRMR_T_S_COV acted comparably: Both had the highest sensitivity to misspecified between-covariance structures relative to all the other fit indices, but they were also influenced by CS. Therefore, researchers are encouraged to use either SRMR_B or SRMR_T_S_COV, but they need to be aware that a small CS (e.g., 5) can inflate the values of SRMR_B and SRMR_T_S_COV (leading to values indicative of poor model fit). Computing the t-s-cov forms of RMSEA and TLI (i.e., RMSEA_T_S_COV and TLI_T_S_COV) was a favorable strategy for between-covariance structure evaluation. Strategically, researchers should rely more on RMSEA_T_S_COV than on TLI_T_S_COV, because RMSEA_T_S_COV had a larger sensitivity η2. The results did not support CFI_PS_B or CFI_T_S_COV as having practically significant sensitivity. In addition, on the basis of the performance of \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_COV}^2 \) discussed earlier, we encourage researchers to use \( \chi_{PS\_B}^2 \) or \( \chi_{T\_S\_COV}^2 \) (especially the latter, because of its higher power) in combination with the fit indices when the sample size is over 1,000.

Misspecified between-mean structure

The middle ten rows of Table 2 contain the η2s of the design factors for both the b-l-s and t-s-mean \( \chi^2 \) test statistics and fit indices. By design, the factor MIS for the between-mean structure included only one type of misspecification (MIS_MEANB); thus this factor had no variation, and its η2 could not be computed. Our original simulation design included three levels of NC (50, 100, and 200) and of CS (5, 10, and 20). Our preliminary analysis showed that 12.77% of the replications with CS = 5 produced TLI_T_S_MEAN < 0 (descriptive statistics for these replications: M = –4.52, SD = 18.75, min = –369.56, max = 0.00). Moreover, a high percentage (7.11%) of the replications with CS = 10 also produced TLI_T_S_MEAN < 0 (M = –4.25, SD = 19.70, min = –230.47, max = 0.00), and most of these replications (57.36%) came from the NC = 50 conditions. To more comprehensively evaluate the impacts of NC and CS on the b-l-s and t-s-mean fit indices, we therefore expanded on our original design by considering larger levels of NC and CS: NC = 100, 200, and 300, and CS = 10, 20, and 30. Changing the levels of the NC and CS design factors did not compromise our findings, because we analyzed the values of the fit indices produced by the M3-model-simulated replications separately.

The results indicated that both \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_MEAN}^2 \) had practically significant sensitivity η2s (.645 and .393, respectively). However, \( \chi_{PS\_B}^2 \) was also impacted by NC (η2 = .079), and its sensitivity was moderated by NC (SEN × NC interaction, η2 = .090): increasing NC from 200 to 300 produced a larger increase in the means of \( \chi_{PS\_B}^2 \) than did increasing NC from 100 to 200. On the other hand, \( \chi_{T\_S\_MEAN}^2 \) was not practically significantly impacted by either NC or CS.

We report the means, standard deviations, and power of \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_MEAN}^2 \) in the MIS_MEANB condition in Table 3. For simplicity, we do not report this information for NC = 300 or CS = 30, given that the patterns of the means, standard deviations, and power of \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_MEAN}^2 \) across NC = 100 to 200 and CS = 10 to 20 were already quite clear. The results suggested that \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_MEAN}^2 \) had adequate power to detect misspecified between-mean structures when the sample size was 500 (NC/CS = 50/10 or 100/5) or above. In addition, the two \( \chi^2 \) test statistics performed similarly in most sample size conditions. In sum, considering that \( \chi_{PS\_B}^2 \) is simpler to compute, researchers might use \( \chi_{PS\_B}^2 \) when the sample size is approximately 500 or above.

RMSEA_PS_B and RMSEA_T_S_MEAN performed comparably: Both had outstanding sensitivity (η2s = .805 and .827, respectively) and were not practically significantly impacted by the design factors. On the basis of the magnitudes of the sensitivity η2s, using RMSEA_T_S_MEAN instead of RMSEA_PS_B did not gain much sensitivity. On the other hand, CFI_PS_B was superior to CFI_T_S_MEAN: CFI_PS_B had a larger sensitivity η2 (.605) than did CFI_T_S_MEAN (.335). In addition, unlike CFI_T_S_MEAN, which was practically significantly impacted by CS (η2 = .074), CFI_PS_B was not practically significantly impacted by any design factor. A similar pattern emerged when we compared the performance of TLI_PS_B against that of TLI_T_S_MEAN.

As for SRMR, the results suggested that SRMR_T_S_MEAN (sensitivity η2 = .223) outperformed SRMR_B (η2 = .164) in sensitivity to the misspecified between-mean structure. However, both SRMR_B and SRMR_T_S_MEAN had means and variances close to 0 across the different combinations of NC and CS, given the misspecified between-mean structure. In other words, neither SRMR_B nor SRMR_T_S_MEAN was sensitive to the misspecified between-mean structure, so their sensitivity η2s have no practical meaning and can be ignored. On the basis of this finding, we suggest that neither SRMR_B nor SRMR_T_S_MEAN be used for between-mean structure evaluation.

To sum up, RMSEA_T_S_MEAN slightly outperformed RMSEA_PS_B in terms of sensitivity to the misspecification. However, considering that RMSEA_PS_B (saturating the within model only) can be computed in a less complicated way than RMSEA_T_S_MEAN (saturating both the within model and the between-covariance structure), RMSEA_PS_B was favored. For CFI and TLI, the b-l-s forms (CFI_PS_B and TLI_PS_B) were preferred as well, because they had larger sensitivity to the misspecification and were not practically significantly impacted by the other design factors. Although the sensitivity η2s for SRMR_B and SRMR_T_S_MEAN were practically significant (.164 and .223, respectively), both indices had means and variances close to 0 across the different combinations of NC and CS, given the misspecified between-mean structure; these η2s therefore have no practical meaning, and neither SRMR_B nor SRMR_T_S_MEAN is recommended. Comparing the sensitivity η2s of RMSEA_PS_B, CFI_PS_B, and TLI_PS_B, which were useful for between-mean structure evaluation, we found that RMSEA_PS_B was the most sensitive to the misspecified between-mean structure and therefore should be used preferentially; CFI_PS_B and TLI_PS_B performed alike and can be used interchangeably. In addition, according to our findings on \( \chi_{PS\_B}^2 \) and \( \chi_{T\_S\_MEAN}^2 \), we suggest that researchers use the aforementioned fit indices along with \( \chi_{PS\_B}^2 \) (because it can be computed easily and was as effective as \( \chi_{T\_S\_MEAN}^2 \)) when the sample size is 500 or above.

Misspecified within-covariance structure

The last five rows of Table 2 contain the η2s of the design factors for the w-l-s \( \chi^2 \) test statistic and fit indices. Note that the factor MIS included two types of misspecification, MIS_COVW and MIS_VARW. The results suggested that \( \chi_{PS\_W}^2 \) had a practically significant sensitivity η2 (.195) and was not impacted by the other design factors. The means, standard deviations, and power of \( \chi_{PS\_W}^2 \) for the MIS_COVW and MIS_VARW conditions are reported in Table 3. In the MIS_COVW condition, \( \chi_{PS\_W}^2 \) had adequate power given a sample size of 2,000 or higher (NC/CS = 100/20, 200/10, or 200/20), whereas in the MIS_VARW condition, a sample size larger than 2,000, and closer to 4,000, was needed.

The results showed that none of the w-l-s fit indices had a practically significant sensitivity η2, and all were impacted by CS or by both NC and CS. Nevertheless, a closer investigation of the data showed that CFI_PS_W, TLI_PS_W, and SRMR_W had variances approaching 0 across the different combinations of NC and CS, given the misspecified within-covariance structure; that is, the impacts of NC and CS on CFI_PS_W, TLI_PS_W, and SRMR_W were not practically meaningful. These findings raise the concern that using w-l-s fit indices might yield invalid conclusions about the fit of the within-covariance structure. On the basis of our findings regarding \( \chi_{PS\_W}^2 \), researchers might use \( \chi_{PS\_W}^2 \) to evaluate the within-covariance structure. In general, however, a large sample size (larger than 2,000 and closer to 4,000) was needed for \( \chi_{PS\_W}^2 \) to reach an adequate power level, which will not always be practical.

Discussion

Sample size

In this study, we evaluated the performance of l-s and t-s fit indices in terms of their independence from the influence of sample size and their sensitivity to misspecifications in an MLGCM. We expected ideal fit indices to be less influenced by the sampling error arising from a small sample size and to be more sensitive to misspecifications. Accordingly, we investigated the extent to which the fit indices of interest could be influenced by sampling error, based on simulated data derived from correctly specified models. Table 1 presents the influences (in terms of η2) of NC and CS (left side). The results showed that most of the fit indices were practically significantly influenced by NC or CS. Specifically, the fit indices indicated poor model fit as the sample size (a function of NC and CS) decreased, even though the hypothesized models were correctly specified (i.e., the hypothesized model was the same as the population model). This finding is in line with previous research (Marsh, Balla, & McDonald, 1988; Marsh et al., 2005; McDonald & Marsh, 1990; W. Wu & West, 2010; W. Wu et al., 2009) on the sample size dependency of fit indices. As Marsh et al. (2005) pointed out, the discrepancy between the covariance matrix reproduced by a correctly specified model and a sample covariance matrix can vary systematically with the sample size: a sample covariance matrix derived from a small sample no longer approaches the population covariance matrix, because of sampling error, which in turn increases the discrepancy between the two matrices.

We note, however, that not all small values of NC or CS raised practical concerns when fit indices were applied. Specifically, the traditional cutoff criteria for the fit indices (RMSEA-related fit indices < .06; CFI- and TLI-related fit indices > .95; SRMR-related fit indices < .08; Hu & Bentler, 1999) were evaluated to determine the NC and CS required to accurately identify correct models (see Figs. 2a, 2b, 3a, and 3b). We summarize the required NC and CS for each fit index on the right side of Table 1. We found that the CFI- and SRMR-related fit indices were not practically affected by sampling errors resulting from a small NC or CS, because they were able to identify correct models even when a small NC (50) and CS (5) were given. The performance of the SRMR-related fit indices was noteworthy: the ANOVAs on their values had a corrected total sum of squares (SOS) close to 0 (i.e., a variance close to 0), so the impact of NC and CS on the SRMR-related fit indices was clearly trivial. In contrast, the RMSEA-related fit indices for between-model evaluation, including \( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{RMSEA}_{T\_S\_COV} \), and \( \mathrm{RMSEA}_{T\_S\_MEAN} \), were very likely to be affected by sampling errors in practice; a large NC (200) and CS (> 20) are recommended when these fit indices are used. Two TLI-related fit indices for between-model evaluation, namely \( \mathrm{TLI}_{PS\_B} \) and \( \mathrm{TLI}_{T\_S\_MEAN} \), needed a moderate NC (100) and CS (10). Substantive researchers need to be aware of these characteristics of the fit indices and strive for a sufficient sample size to obtain a more accurate model evaluation. Future studies could further investigate the necessary sample sizes for different population models, such as a model with more time-point measures or with a different type of trajectory (e.g., a piecewise-linear trajectory).
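For reference, these joint cutoffs are simple to operationalize. Below is a minimal helper (the function and key names are our own) that flags whether a set of index values satisfies the Hu and Bentler (1999) criteria used above.

```python
def meets_cutoffs(rmsea, cfi, tli, srmr):
    """Apply the conventional joint cutoff criteria (Hu & Bentler, 1999)."""
    return {
        "rmsea_ok": rmsea < .06,
        "cfi_ok":   cfi > .95,
        "tli_ok":   tli > .95,
        "srmr_ok":  srmr < .08,
    }

# Hypothetical index values for a single fitted model
print(meets_cutoffs(rmsea=.045, cfi=.97, tli=.96, srmr=.03))
```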

Furthermore, NC and CS, the two sample-size parameters in MLGCMs, might also impact the performance of fit indices in different ways. For fit indices used in between-model evaluation, NC determines the sample size of the between model. As we discussed earlier, NC governs the magnitude of the sampling error carried into the model: a smaller NC increases the amount of sampling error in model estimation. Consequently, some between-level fit indices can fail to identify correctly specified between models unless NC is large (e.g., NC = 200 for \( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{RMSEA}_{T\_S\_COV} \), and \( \mathrm{RMSEA}_{T\_S\_MEAN} \)). On the other hand, CS can influence the quality of the scores of the indicators (e.g., V1g–V5g in Fig. 1) in the between model of an MLGCM (Lüdtke, Marsh, Robitzsch, & Trautwein, 2011). Note that in our simulation (and in practice), the scores of the between-level indicators were estimated from the scores of the within-level indicators. As Lüdtke et al. pointed out, the quality of the estimated between-level indicator scores can be influenced by CS: a smaller CS increases the amount of sampling error in the between-level indicator scores. As a result, some between-level fit indices can fail to identify correctly specified between models unless CS is large (e.g., CS > 20 for \( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{RMSEA}_{T\_S\_COV} \), and \( \mathrm{RMSEA}_{T\_S\_MEAN} \)). In contrast, NC and CS jointly determine the sample size of the within model, and they therefore influence the performance of fit indices for within-model evaluation in similar ways.
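Lüdtke et al.'s (2011) point can be made concrete with the standard reliability of an observed cluster mean as an estimate of the latent cluster mean, which depends on both the ICC and CS. The sketch below computes this reliability for several cluster sizes under an illustrative ICC (the value .15 is our assumption, not a result from the study); note how slowly the reliability approaches 1 when the ICC is modest.

```python
def cluster_mean_reliability(icc, cs):
    """Reliability of an observed cluster mean as an estimate of the
    latent cluster mean; a larger CS reduces the sampling error carried
    into the between-level indicator scores."""
    return (cs * icc) / (1 + (cs - 1) * icc)

icc = 0.15  # illustrative intraclass correlation
for cs in (5, 10, 20, 30):
    print(cs, round(cluster_mean_reliability(icc, cs), 3))
```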

Finally, in exploring the performance of the four \( \chi^2 \) test statistics (\( {\chi}_{PS\_B}^2 \), \( {\chi}_{PS\_W}^2 \), \( {\chi}_{T\_S\_COV}^2 \), and \( {\chi}_{T\_S\_MEAN}^2 \)), we found that their Type I error rates were inflated unless the sample size was extremely large. These findings are consistent with Schermelleh-Engel et al.'s (2014) study, which investigated the effectiveness of l-s \( \chi^2 \) test statistics (\( {\chi}_{PS\_B}^2 \) and \( {\chi}_{PS\_W}^2 \)) under sample sizes of 6,000, 15,000, and 30,000 (NC/CS = 200/30, 500/30, and 1,000/30), all much larger than our largest sample size condition (NC/CS = 200/20). Schermelleh-Engel et al. found that \( {\chi}_{PS\_B}^2 \) and \( {\chi}_{PS\_W}^2 \) had Type I error rates lower than .05 when the sample sizes were 30,000 and 15,000, respectively. It was therefore not surprising to see inflated Type I error rates for the four \( \chi^2 \) test statistics in our study. Given these findings, researchers need to be aware that using these \( \chi^2 \) test statistics alone will very likely lead to overrejection of correctly specified MLGCMs; using fit indices jointly to evaluate model fit is therefore highly recommended.
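The empirical Type I error rate in such a study is simply the proportion of replications of a correctly specified model whose test statistic exceeds the .05 critical value. A sketch of that bookkeeping follows, with simulated statistics standing in for real replication output; the inflation factor is illustrative only.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)

# Stand-in for replication output: if the test statistic behaved as
# advertised, p-values would be uniform; "inflation" means the observed
# statistics run stochastically larger than the reference chi-square.
df = 12
stats = rng.chisquare(df, size=1000) * 1.15  # mild inflation, illustrative
p_values = chi2.sf(stats, df)

type1 = np.mean(p_values < .05)
print(f"empirical Type I error rate: {type1:.3f}")  # exceeds .05 when inflated
```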

Fit indices for evaluating between-covariance structures

Our results in Table 2 show that RMSEA, CFI, and TLI in the t-s-cov form (\( \mathrm{RMSEA}_{T\_S\_COV} \), \( \mathrm{CFI}_{T\_S\_COV} \), and \( \mathrm{TLI}_{T\_S\_COV} \)) exhibited higher sensitivity to the misspecified between-covariance structure than did the same statistics in the b-l-s form (\( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{CFI}_{PS\_B} \), and \( \mathrm{TLI}_{PS\_B} \)). This finding supports computing RMSEA, CFI, and TLI by saturating the within model as well as the between-mean structure as a favorable strategy for evaluating the between-covariance structure, and it suggests that researchers should prioritize \( \mathrm{RMSEA}_{T\_S\_COV} \), \( \mathrm{CFI}_{T\_S\_COV} \), and \( \mathrm{TLI}_{T\_S\_COV} \). Specifically, researchers can rely more on \( \mathrm{RMSEA}_{T\_S\_COV} \) than on \( \mathrm{TLI}_{T\_S\_COV} \), because \( \mathrm{RMSEA}_{T\_S\_COV} \) had a larger η2 of sensitivity. \( \mathrm{CFI}_{T\_S\_COV} \), on the other hand, had no practically significant sensitivity and is therefore not recommended. That fit indices in the t-s-cov form were superior to those in the b-l-s form was expected, because the results in Table 3 suggested that \( {\chi}_{T\_S\_COV}^2 \) was favored over \( {\chi}_{PS\_B}^2 \) owing to its relatively higher power in most sample size conditions. Since the t-s-cov fit indices are a function of \( {\chi}_{T\_S\_COV}^2 \), it was not surprising that fit indices in the t-s-cov form outperformed those in the b-l-s form in terms of their sensitivity to misspecified between-covariance structures.

We also found that \( \mathrm{SRMR}_B \) and \( \mathrm{SRMR}_{T\_S\_COV} \) performed comparably. That is, computing SRMR in the t-s-cov form rather than the b-l-s form made no difference. On the basis of this finding, we recommend saturating only the within model to obtain \( \mathrm{SRMR}_B \) as the simpler strategy for using SRMR. Both \( \mathrm{SRMR}_B \) and \( \mathrm{SRMR}_{T\_S\_COV} \) had the largest sensitivity to the misspecification of all the fit indices compared. However, the results also indicated that both were influenced by CS: a small CS (e.g., 5) resulted in \( \mathrm{SRMR}_B \) and \( \mathrm{SRMR}_{T\_S\_COV} \) values indicative of poor model fit. Overall, we recommend that SRMR be used along with the other t-s-cov fit indices, especially when CS is small.

Our findings are not consistent with W. Wu and West's (2010) study, which concluded that saturating the mean structure did not influence the sensitivity of fit indices other than SRMR (p. 446) in the context of a single-level latent growth curve model. We consider our findings reasonable for the following reasons. First, as W. Wu et al. (2009) noted, RMSEA is based on the chi-squared statistic (\( \chi^2 \)) for the hypothesized model, and CFI and TLI are likewise defined in terms of \( \chi^2 \). These three fit indices can therefore reflect the fit of a model to the mean structure; in other words, saturating the between-mean structure can influence the sensitivity of RMSEA, CFI, and TLI. SRMR, by contrast, is a weighted function of the model residuals, and it need not take into account the residuals of the means (i.e., the deviations of the sample means from the model-implied means; W. Wu et al., 2009, p. 193). That is, SRMR disregards the information from the between-mean structure, so saturating the between-mean structure has no influence on SRMR.
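This distinction is visible in the formula itself: SRMR aggregates standardized covariance residuals only, so mean residuals never enter the computation. A minimal covariance-only sketch follows (the example matrices are hypothetical, and variants of SRMR differ in details such as whether diagonal elements are included).

```python
import numpy as np

def srmr(sample_cov, implied_cov):
    """SRMR as the root mean square of standardized covariance residuals.
    Note that no mean residuals appear anywhere in the computation."""
    s, sigma = np.asarray(sample_cov), np.asarray(implied_cov)
    d = np.sqrt(np.outer(np.diag(s), np.diag(s)))  # sd_i * sd_j
    resid = (s - sigma) / d                        # standardized residuals
    idx = np.tril_indices_from(resid)              # unique elements, i >= j
    return float(np.sqrt(np.mean(resid[idx] ** 2)))

sample = [[1.00, 0.42],
          [0.42, 1.00]]
implied = [[1.00, 0.50],
           [0.50, 1.00]]
print(round(srmr(sample, implied), 4))
```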

Fit indices for evaluating between-mean structures

When evaluating misspecified between-mean structures, we found that (a) RMSEA, CFI, and TLI in the t-s-mean form (\( \mathrm{RMSEA}_{T\_S\_MEAN} \), \( \mathrm{CFI}_{T\_S\_MEAN} \), and \( \mathrm{TLI}_{T\_S\_MEAN} \)) did not necessarily display higher sensitivity to the misspecifications than did the same statistics in the b-l-s form (\( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{CFI}_{PS\_B} \), and \( \mathrm{TLI}_{PS\_B} \)), and (b) \( \mathrm{SRMR}_B \) and \( \mathrm{SRMR}_{T\_S\_MEAN} \) had both means and variances close to 0 across the different combinations of NC and CS, given the misspecified between-mean structure. Their practically significant η2s of sensitivity therefore had no practical meaning, and neither \( \mathrm{SRMR}_B \) nor \( \mathrm{SRMR}_{T\_S\_MEAN} \) is recommended. The first finding suggests that researchers would not gain substantially by using the t-s-mean fit indices (\( \mathrm{RMSEA}_{T\_S\_MEAN} \), \( \mathrm{CFI}_{T\_S\_MEAN} \), and \( \mathrm{TLI}_{T\_S\_MEAN} \)). As is shown in Table 2, \( \mathrm{RMSEA}_{PS\_B} \) demonstrated high sensitivity to misspecified between-mean structures (.805), very close to that of \( \mathrm{RMSEA}_{T\_S\_MEAN} \) (.827). Moreover, \( \mathrm{CFI}_{PS\_B} \) and \( \mathrm{TLI}_{PS\_B} \) outperformed \( \mathrm{CFI}_{T\_S\_MEAN} \) and \( \mathrm{TLI}_{T\_S\_MEAN} \), respectively, in terms of their sensitivity to the misspecifications. As a result, we recommend that researchers use \( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{CFI}_{PS\_B} \), and \( \mathrm{TLI}_{PS\_B} \) to evaluate between-mean structures, with \( \mathrm{RMSEA}_{PS\_B} \) given the highest priority because it was the most sensitive to the misspecified between-mean structure. These findings are not consistent with W. Wu and West's (2010) study, which concluded that saturating the covariance structure could increase the sensitivity of fit indices in the context of single-level latent growth curve models. We consider our findings reasonable because \( {\chi}_{PS\_B}^2 \) and \( {\chi}_{T\_S\_MEAN}^2 \) performed similarly in most sample size conditions in terms of their power to detect the misspecified between-mean structures (see Table 3). Because RMSEA, CFI, and TLI are computed from \( \chi^2 \), the fit indices in the t-s-mean form were expected to perform comparably to those in the b-l-s form. We encourage future studies to validate our findings in a multilevel context.

On the other hand, SRMR need not reflect model fit with respect to the means. Leite and Stapleton (2011) confirmed the superior sensitivity of RMSEA, and the limited sensitivity of SRMR, for identifying misspecified mean structures in latent growth models. In line with Leite and Stapleton's findings, we found that both \( \mathrm{SRMR}_B \) and \( \mathrm{SRMR}_{T\_S\_MEAN} \) had means close to zero as well as trivial variability across all sample size conditions.

We note that the η2s shown in Table 2 were derived from larger values of NC (100, 200, and 300) and CS (10, 20, and 30). Our original simulation design (NC = 50, 100, and 200; CS = 5, 10, and 20) resulted in a high percentage of replications producing \( \mathrm{TLI}_{T\_S\_MEAN} \) < 0, especially when (a) CS = 5, regardless of NC, or (b) CS = 10 and NC = 50. We extracted those replications and found that they produced extremely large chi-squared values when we computed the t-s-mean fit indices (i.e., saturating the within model as well as the between-covariance structure) for the model with a misspecified between-mean structure. Indeed, saturating the between-covariance structure in order to obtain t-s-mean fit indices raises the number of estimated parameters (by four parameters in our study), and in such cases a small sample size can yield a larger-than-expected chi-squared value (Bentler & Dudgeon, 1996; Jackson, 2003). On the basis of our simulation results, NC = 100 and CS = 10 were required for computing the t-s-mean fit indices, but we encourage future studies to investigate this issue further.

Last but not least, on the basis of the η2s presented in Table 2, the fit indices of interest tended to be more sensitive to the misspecified between-mean structure than to the misspecified between-covariance structure. Nevertheless, as mentioned above, the η2s in this section were derived from an alternative NC and CS design (NC = 100, 200, and 300; CS = 10, 20, and 30), so they are not necessarily comparable to the η2s derived from the original design (NC = 50, 100, and 200; CS = 5, 10, and 20). Additional work will be needed to verify this tendency.

Fit indices for evaluating within-covariance structures

None of the w-l-s fit indices demonstrated promise in detecting a misspecified within-covariance structure. We did not expect to observe such low or near-zero sensitivity values for the w-l-s fit indices, because previous simulation studies (e.g., Ryu & West, 2009) had shown that w-l-s fit indices can successfully detect misspecified within models in the context of MCFA models. We therefore wondered whether traditional fit indices (e.g., RMSEA, CFI, TLI, and SRMR) might be more effective in identifying misspecified within-covariance structures. After comparing the performance of the w-l-s fit indices with that of the traditional fit indices, we found that the two types of fit indices acted almost identically, showing little to no sensitivity to the misspecifications. We validated this finding against W. Wu and West's (2010) study, which evaluated fit indices in the context of a single-level latent growth curve model. More specifically, according to the information presented in their Fig. 3 (W. Wu & West, 2010, p. 437), they observed an RMSEA close to .04, an SRMR close to .06, and CFI/TLI > .99 across sample sizes from 125 to 1,000, given a moderate severity of misspecification (defined as power = .80, a definition we also adopted) in the covariance structures. Their findings are consistent with ours, except that they found slightly higher SRMR values. Moreover, the findings on the w-l-s fit indices are corroborated by our findings on \( {\chi}_{PS\_W}^2 \) (see Table 3): \( {\chi}_{PS\_W}^2 \) had little power to detect misspecified within-covariance structures unless the sample size was large (greater than 2,000 and closer to 4,000). For the w-l-s fit indices that are a function of \( {\chi}_{PS\_W}^2 \) (\( \mathrm{RMSEA}_{PS\_W} \), \( \mathrm{CFI}_{PS\_W} \), and \( \mathrm{TLI}_{PS\_W} \)), it was therefore not surprising that they were insensitive to the misspecification. In summary, our findings, together with those of W. Wu and West, suggest that evaluating within-covariance structures can be challenging. Substantive researchers might be overly optimistic about the fit of the within model in MLGCMs, and future researchers are encouraged to validate our findings and to look for an optimal strategy for within-model evaluation.

Limitations and future research direction

Because it is not possible to consider all plausible scenarios in a single simulation study, generalizations beyond the set of conditions investigated here should be made with caution. First, we adopted the MLGCM shown in Fig. 1 for data generation, so our findings can only be generalized to studies that use similar MLGCMs. Further studies are encouraged to investigate whether our findings can be replicated with different models (e.g., piecewise-linear trajectory models). Second, we did not consider misspecifications in the residual (co)variances of the between and within models. In practice, the residual variances at the between level are often low (Hox, 2010), and freely estimating the between-level residual variances while constraining their covariances to zero seems a reasonable approach; we therefore did not consider misspecifications in the residual (co)variances at the between level. The structure of the residual (co)variances at the within level (i.e., the within-subject residuals), on the other hand, can be more complicated (Kwok et al., 2007). Misspecifications of the within-subject residuals are possible and deserve systematic investigation; future simulation studies could evaluate fit indices in scenarios in which the within-subject residuals are misspecified. Third, we considered a limited number of design factors. Additional scenarios created by adopting other design factors, such as unbalanced designs (i.e., unequal cluster sizes), the number of time-point measures, and the ICCs of the repeated measures, will be needed in future studies.

Last but not least, in our simulation design the variance of the quadratic slope factor in the between-level (or within-level) population model was nonzero but was constrained to 0 as a type of misspecification (i.e., the MIS_VARB and MIS_VARW conditions). In each misspecification condition, only one parameter was misspecified. In practice, however, when the variance of the quadratic slope factor is misspecified (i.e., constrained to 0), two other parameters are automatically constrained to 0 as well: (a) the covariance between the quadratic slope factor and the linear slope factor, and (b) the covariance between the quadratic slope factor and the intercept factor. Consequently, the findings on the sensitivity of fit indices to a misspecified quadratic slope variance may be confounded with these two additional, potentially misspecified parameters. To address this issue, W. Wu and West's (2010) specification (setting the two aforementioned covariance parameters to 0 in the population model as well) could be applied to control for the confounding. Nevertheless, that specification of the population model might decrease the generalizability of our findings to empirical research. Future studies will be needed to validate our findings using a population in which the aforementioned covariance parameters are not equal to 0.

Conclusion

Previous simulation studies have investigated the performance of level-specific fit indices in the context of MCFA (Hsu et al., 2016; Ryu, 2014; Ryu & West, 2009). Our study has extended this line of research by systematically examining the effectiveness of level-specific fit indices (\( \mathrm{RMSEA}_{PS\_W} \), \( \mathrm{CFI}_{PS\_W} \), \( \mathrm{TLI}_{PS\_W} \), \( \mathrm{SRMR}_W \), \( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{CFI}_{PS\_B} \), \( \mathrm{TLI}_{PS\_B} \), and \( \mathrm{SRMR}_B \)) and target-specific fit indices (\( \mathrm{RMSEA}_{T\_S\_COV} \), \( \mathrm{CFI}_{T\_S\_COV} \), \( \mathrm{TLI}_{T\_S\_COV} \), \( \mathrm{SRMR}_{T\_S\_COV} \), \( \mathrm{RMSEA}_{T\_S\_MEAN} \), \( \mathrm{CFI}_{T\_S\_MEAN} \), \( \mathrm{TLI}_{T\_S\_MEAN} \), and \( \mathrm{SRMR}_{T\_S\_MEAN} \)) in terms of their independence from sample size and their sensitivity to misspecifications in MLGCMs, with the severity of misspecification appropriately controlled when generating the simulated replications. On the basis of our simulation results, we recommend applying \( \mathrm{RMSEA}_{T\_S\_COV} \) and \( \mathrm{TLI}_{T\_S\_COV} \) along with \( \mathrm{SRMR}_B \) to maximize the capacity to detect misspecifications in the between-covariance structure, and using \( \mathrm{RMSEA}_{PS\_B} \), \( \mathrm{CFI}_{PS\_B} \), and \( \mathrm{TLI}_{PS\_B} \) to detect misspecifications in the between-mean structure. Evaluation of the within-covariance structure turned out to be unexpectedly challenging, as none of the w-l-s fit indices (\( \mathrm{RMSEA}_{PS\_W} \), \( \mathrm{CFI}_{PS\_W} \), \( \mathrm{TLI}_{PS\_W} \), and \( \mathrm{SRMR}_W \)) had a practically significant sensitivity. Future researchers are encouraged to validate our findings and to look for an optimal strategy for within-model evaluation.