A Bayesian approach to estimating variance components within a multivariate generalizability theory framework
Abstract
In many behavioral research areas, multivariate generalizability theory (mG theory) has typically been used to investigate the reliability of multidimensional assessments. However, traditional mG-theory estimation, namely via frequentist approaches, has limitations that prevent researchers from taking full advantage of the information that mG theory can offer about the reliability of measurements. Bayesian methods, alternatively, provide more information than frequentist approaches can. This article presents instructional guidelines on how to implement mG-theory analyses in a Bayesian framework; in particular, BUGS code is presented to fit commonly seen designs from mG theory, including single-facet designs, two-facet crossed designs, and two-facet nested designs. In addition to concrete examples closely tied to the selected designs and the corresponding BUGS code, a simulated dataset is provided to demonstrate the utility and advantages of the Bayesian approach. This article is intended to serve as a tutorial reference for applied researchers and methodologists conducting mG-theory studies.
Keywords
Multivariate statistics · Generalizability theory · Measurement · Bayesian inference · Markov chain Monte Carlo · Reliability

In this article, the authors present BUGS code to fit several commonly seen designs from multivariate generalizability theory (mG theory). Before delving into mG theory, it is necessary to introduce univariate generalizability theory (G theory).
G theory has been developed over several decades since Cronbach and his associates introduced it (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965). Similar to well-known classical test theory (CTT), which decomposes observed variance into several components, G theory provides a framework for separating the factors underlying various measurement designs. It is primarily used to inform users of the quality of a measurement and of the decisions that can be made on the basis of the analyses. More specifically, G theory offers a comprehensive understanding of the different sources of error variance within a measurement system (Brennan, 2001a). Thus, for any study that requires a measurement instrument, G theory is essential for informing researchers of the level of reliability the study possesses.
For illustration, assume that there are scores from an administration of a third-grade writing test. It is known that the writing performance scores do not reflect test takers' true abilities with perfect reliability, as raters and/or test formats will also affect the scores. Rater and format effects introduce systematic variability into observed scores, which ultimately impedes the validity of those scores. These effects must be understood before the writing ability levels of the test takers can be estimated, so that fair decisions based on the scores can be made. In addition to studies on educational testing, work across other fields has been conducted under the framework of G theory: for example, student evaluations of instruction by Gillmore, Kane, and Naccarato (1978); teaching behavior measurement by Shavelson and Dempsey-Atwood (1976); the dependability of psychotherapy process ratings by Wasserman, Levy, and Loken (2009); clinical child and adolescent research by Lakes and Hoyt (2009); and the generalizability of Big Five Personality Inventory scores by Arterberry, Martens, Cadigan, and Rohrer (2014).
In addition to the person effect θ_p and the item effect θ_i, Eq. 3 also contains a rater effect θ_h and interaction terms for any two random components. Equation 4 provides the variance components in accordance with Eq. 3. Once one obtains the variance components, generalizability coefficients (G coefficients; a G-theory-based analogue to the CTT reliability coefficient) may be calculated.
Estimating variance components as well as G coefficients is typically called a generalizability (G) study, whereas subsequent research dealing with the practical application of a measurement procedure is called a decision (D) study; that is, a D study helps researchers use the results of a G study to design a measurement procedure that minimizes error for the selected facet(s). Sound D studies can provide practitioners with useful information involving logistical and cost considerations as well as changes in G coefficients. For example, a D study could show that increasing the number of items from, say, 10 to 20, on the basis of findings from the corresponding G study, would increase the G coefficient from .8 to .9 (how much the G coefficient would increase depends directly on the sizes of the estimated variance components and the number of facets in the study).
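This D-study projection can be sketched in a few lines of Python (the article's own code is in BUGS; this sketch uses hypothetical variance components chosen only to reproduce the .8-to-.9 pattern described above, for a single-facet p × i design):

```python
# Hedged sketch: projecting the G coefficient in a D study for a
# single-facet p x i design. The variance components below are
# hypothetical and chosen only to illustrate the .8-to-.9 pattern.

def g_coefficient(var_p, var_pie, n_i):
    """Universe-score variance divided by itself plus relative error,
    where relative error shrinks as the number of items n_i grows."""
    return var_p / (var_p + var_pie / n_i)

var_p = 0.5     # person (universe-score) variance, hypothetical
var_pie = 1.25  # person-by-item interaction/residual variance, hypothetical

g_10_items = g_coefficient(var_p, var_pie, n_i=10)  # 0.80
g_20_items = g_coefficient(var_p, var_pie, n_i=20)  # about 0.89
```

Doubling the number of items halves the relative error variance, which is what drives the G coefficient upward.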
As the name indicates, mG theory is an extension of univariate G theory. Similarly, the G coefficient can be extended to the multivariate case (Joe & Woodward, 1976; Marcoulides, 1995; Woodward & Joe, 1973). Statistically, mG theory assumes that scores from the universe of levels of the fixed multivariate facets are not orthogonal, meaning that there are correlations within the facets. In contrast to univariate G theory, in which universe scores and error scores are estimated via variance components, analyzing measurement procedures via a multivariate approach "can provide information about facets that contribute to covariations among the multiple scores" (Raykov & Marcoulides, 2006, p. 213). A simple example is the Medical College Admission Test (MCAT), which is composed of four subtests: biological sciences (biology MCAT), physical sciences (physical MCAT), verbal reasoning (verbal MCAT), and a writing sample (written MCAT). The subtest design of the MCAT lends itself naturally to mG theory, in which the proficiencies measured by the various dimensions are presumed to be correlated. Thus, in addition to the variance components that are estimated in univariate G theory, the multivariate version also accounts for covariance components among subtests. One could theoretically conduct four separate univariate G-theory analyses for these subtests; however, doing so assumes zero correlation between the subtests. Marcoulides (1990) points out that such a flawed assumption can potentially influence the accuracy of the estimated variance components.
Overall, mG theory for the MCAT and other measures of this kind provides information at the profile level. The relation between univariate G theory and mG theory is akin to that between analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA): mG theory gives the reliability of profile scores, not the reliability of individual variables. If the reliability of individual variables were also of interest, one could run an mG-theory analysis and follow it with individual univariate G-theory analyses, much as one might follow a MANOVA with separate ANOVAs for each dependent variable. For instance, if the physical MCAT is defined to be the attribute of interest, one can use univariate G theory to conduct further studies.
To summarize, when the research design and/or the nature of the data are multidimensionally structured, mG theory can be implemented to (1) calculate the reliability of difference scores, observable correlations, or universe-score and error correlations (Brennan, 2001a); (2) obtain score profile reliability estimates (Brennan, 2001a); and (3) estimate composite scores with the highest level of generalizability (Shavelson & Webb, 1981). Most published works using mG theory demonstrate its application to testing structures. For example, Nußbaum (1984) implemented mG theory to study students' ability to paint in watercolors; Wu and Tzou (2015) used mG theory to analyze standard-setting data; and Clauser and colleagues have published several mG-theory studies using National Board of Medical Examiners data (Clauser, Harik, & Margolis, 2006; Clauser, Margolis, & Swanson, 2002; Clauser, Swanson, & Harik, 2002). Details about the relevant designs and mathematical expressions of mG theory are provided in the following sections. Further information about theoretical concerns with subtest dimensionality, for example, the level of correlation between subtests needed to support the application of mG theory, can be found in the subscore literature (see Puhan, Sinharay, Haberman, & Larkin, 2008; Sinharay, 2010, 2013; Sinharay & Haberman, 2014, for details). Researchers can also use the techniques provided in the present article to conduct simulation studies addressing more in-depth subscore inquiries.
Multivariate generalizability theory designs
Table 1. Six commonly seen multivariate generalizability theory designs

Single-facet designs: p• × i°; p• × i•
Two-facet crossed designs: p• × i• × h•; p• × i• × h°
Two-facet nested designs: p• × (i•: h•); p• × (i°: h°)
Estimation
Although G-theory analyses have been conducted on different statistical software platforms, such as R (R Development Core Team, 2017), SAS (SAS Institute, 1990), and SPSS (Norusis, 1990), current estimation practice for mG-theory analyses relies primarily on the mGENOVA software (Brennan, 2001b). The computation in mGENOVA is based on mean square error, an extension of the traditional MANOVA approach. Although Brennan (2001a) provides solutions for handling missing data, sparse matrices, and other problems caused by violating model assumptions, a large number of these solutions are not available in mGENOVA. One of the most important criticisms of the traditional estimation approach is that no assessment of model-data fit is provided (Gessaroli, 2003). If a proposed model (here, mG theory) fits the data poorly, further analyses based on the estimates are not trustworthy. In addition, traditional MANOVA-based mG-theory estimation does not provide standard errors for the variance and covariance component estimates, potentially masking the precision of those estimates.
Using a Bayesian approach, however, can minimize the aforementioned problems. In this article we demonstrate a Bayesian approach to mG theory using Markov chain Monte Carlo (MCMC) techniques to estimate the parameters of interest. MCMC techniques are able to deal with complex models and sparse data matrices that many traditional methods cannot handle (see Muthén & Asparouhov, 2012, for example). Furthermore, users can incorporate prior information to make estimation more robust (Lynch, 2007, p. 2). For example, incorporating prior distributions for variance components allows users to restrict the estimates to the permissible numeric space, thereby avoiding issues such as Heywood cases (negative variance estimates). If prior information is unavailable, uninformative priors may be specified; posterior estimation is then essentially equivalent to traditional estimation (Jeffreys, 1946). Note that uninformative prior distributions are used throughout this article. Another advantage of a Bayesian framework is that missing-data imputation can be accommodated simultaneously: the MCMC samples can be applied to the missing values as part of the algorithm (Little & Rubin, 2014). The literature has shown that Bayesian methods are an effective and consistently stable way to impute missing values (Enders, 2010). Lastly, similar to confidence intervals from frequentist approaches, Bayesian methods provide credible intervals for each posterior estimate, so that users can investigate the probable range of the parameters at a specified probability level (e.g., 95%). This feature circumvents asymptotic assumptions, which may not always be reasonable (Jackman, 2009).
The following sections provide additional details on how to implement a Bayesian version of mG theory using BUGS software. In particular, the content (1) covers how to specify BUGS code, (2) provides specific information about popular mG-theory designs, and (3) sheds light on further applications of the proposed approach.
The BUGS language and software
The use of BUGS software to perform analytical tasks has grown in popularity, as it is straightforward to alter existing code to fit new variations of current models. For example, Curtis (2010) illustrates a family of item response theory (IRT) models that can be fit by modifying existing BUGS code from simpler models. The BUGS software offers multiple MCMC sampling methods customizable to different needs, including derivative-free adaptive rejection sampling (Gilks, 1992), slice sampling (Neal, 1997), current-point Metropolis, and direct sampling using standard algorithms (Lunn, Spiegelhalter, Thomas, & Best, 2009). As a result, a wide range of statistical models can be appropriately estimated, including the mG-theory models in the present article.
Three statistical packages are frequently used for running BUGS syntax: WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000; Spiegelhalter, Thomas, Best, & Lunn, 2003), OpenBUGS (Thomas, O'Hara, Ligges, & Sturtz, 2006), and JAGS (Plummer, 2003, 2010). Details about each of these software packages can be found on their websites. Although some differences exist among these options, they generally produce similar results when given identical tasks. OpenBUGS is used in the present article, as it has been actively developed over the past decade.
BUGS code
We begin with an example of a single-facet design, in which the construction of the corresponding statistical model is illustrated. In addition, we describe how the input data are structured and loaded. Throughout these instructional guidelines, the same number of observations, N_p, and number of dimensions, V, are used in every design, whereas the levels of the random effects, N_i and N_h, vary across designs.
p• × i° design
The left side of the equation is the observed covariance matrix, whereas the matrices on the right are the covariance matrices for the person, p, item, i, and residual, e, effects, respectively. Note that the off-diagonal elements for the item effect i and the residual effect e are zeros. This assumption matches the design: different subtests contain different items, so there is no way to estimate covariances, since the items are not crossed.

When data are read directly into BUGS, they need to be specified as a matrix by using the structure function so that the data can be loaded as seen in Fig. 1. Note that BUGS places data into matrices or arrays by row, which differs from the R environment, which by default places data into matrices by column (R Development Core Team, 2017).
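The row-versus-column filling difference can be illustrated outside of BUGS. This hypothetical NumPy sketch mimics the two conventions: C order behaves like BUGS's by-row filling, and Fortran order like R's by-column default:

```python
import numpy as np

# The same six values yield different matrices depending on fill order.
values = [1, 2, 3, 4, 5, 6]

by_row = np.array(values).reshape(2, 3, order="C")  # fills by row, as BUGS does
by_col = np.array(values).reshape(2, 3, order="F")  # fills by column, as R does by default

# by_row is [[1, 2, 3], [4, 5, 6]]
# by_col is [[1, 3, 5], [2, 4, 6]]
```

Getting this ordering wrong silently scrambles which observation lands in which cell, so it is worth checking a few entries of the loaded matrix against the raw data.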

Certain observed variables need to be created manually in addition to the data. Np = 100 is the number of persons, or examinees in the current example; Ni = 24 is the total number of items; V = 2 is the number of subtests; Ncol = 24 is the number of columns of the input data, which in this case is equivalent to Ni; zero.vector = c(0,0) supplies the means of the random effects, which are multivariate normally distributed; grand.mu = c(50,60) contains the grand means, calculated by averaging the data points of each subtest; and R = diagonal(1) is used to specify an unstructured, uninformative prior distribution for the precision matrix (the inverse of the covariance matrix), which is assumed to follow a Wishart distribution (Gelman, 2006).

Other variables are created for model formatting so that the loaded data can be mapped to corresponding random effects. i.format = c(1, 2, . . . , 24) is simply an identification vector for items; iv.format = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2) is a vector to show the subtest to which the items belong.
Code for a p• × i° design model
p• × i• design
The left side of the equation is the observed covariance matrix, whereas the matrices on the right side are the covariance matrices for the person effects, p, item effects, i, and residual effects, e, respectively. Note that for each matrix, neither the off-diagonal nor the diagonal elements are fixed to zero. This setting matches the nature of a fully crossed design: all effects are present in each dimension v.

Unlike in the p• × i° design, when data are read directly into BUGS, they need to be specified as an array instead of a matrix. Following the order of an N_p × N_i × V array, the current data are formatted with the command structure(.Data = c(x_{1p1i1}, x_{1p1i2}, …, x_{1p100i12}, x_{2p1i1}, …, x_{2p100i12}), .Dim = c(100, 12, 2)). The subscripts of x indicate the identities of the criterion (or dimension), person, and item, respectively. This means that BUGS forms the array by loading the v1 data matrix and then the v2 data matrix in sequential order.

As we have shown previously, users need to manually create some observed variables in addition to the data. Np, V, zero.vector, grand.mu, and R are identical to those in the p• × i° design, except that Ni = 12 in the present example.

There is no need to specify iv.format, because loading the data as an array already identifies each dimension. Meanwhile, a new identification vector for items needs to be specified accordingly: i.format = c(1, 2, . . . , 12), since the present example contains only 12 items in total instead of 24.
Code for a p• × i• design model
p• × i• × h• design
Since the p• × i• × h• design is fully crossed, each element of all the matrices in Eq. 9 needs to be estimated, as an extension of the p• × i• design. Note that the total number of estimates is 21 in the present example. Generally, models with a large number of parameters require a larger sample size, but a Bayesian framework can use informative prior distributions to circumvent that challenging requirement (Dunson, 2001).
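The count of 21 can be verified with a quick tally (assuming, as in the examples, V = 2 dimensions and the seven effect matrices of a fully crossed two-facet design: p, i, h, pi, ph, ih, and the residual):

```python
# Hedged tally: with V = 2 dimensions, a fully crossed p x i x h design
# has seven effect covariance matrices (p, i, h, pi, ph, ih, residual),
# each a symmetric V x V matrix with V*(V+1)/2 unique elements.
V = 2
n_effects = 7
unique_per_matrix = V * (V + 1) // 2   # 3 unique elements when V = 2
total_estimates = n_effects * unique_per_matrix  # 21
```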

Similar to the p• × i• design, when data are read directly into BUGS, they need to be specified as an array instead of a matrix. Again, BUGS fills the array matrix by matrix; thus, in this example the v1 data points and then the v2 data points are loaded in order.

Np, V, zero.vector, grand.mu, and R are identical to what was specified in the p• × i• design. Ni = 4 and Nh = 3 indicate that there are four items and three raters. Ncol = 24 is the number of columns of the input data, but in this case it is no longer equivalent to Ni.

The formatting variables are straightforward; each matches the corresponding header in Fig. 3: i.format = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4) and h.format = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3). As in the p• × i• design, the formatting variables do not have to run through both dimensions, because the formats of v1 and v2 are identical.
Code for a p• × i• × h• design model
p• × i• × h° design
In Eq. 10, several matrices' off-diagonal elements are fixed at zero; this pattern, again, matches the nature of a non-fully-crossed design. A straightforward way to view this equation, in line with Fig. 4, is that the rater effects h are not defined identically across the grading criteria v. Thus, the matrices related to h, including the residual effect, do not have covariance components.

The data are entered as a matrix, as Fig. 4 demonstrates.

Np, V, zero.vector, grand.mu, Ncol, and R are identical to those in the previous designs, whereas some new and different variables are required in the present example: Ni is now 3, since there are only three items; Nh.v1 = 3 and Nh.v2 = 5 represent the numbers of raters within each dimension v.

There are more formatting variables than in the previous designs, due to the complexity of the model: i.format = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3) is the identification vector of items matching the data columns; iv.format = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2) indicates that the first nine columns of the input matrix belong to dimension v1 and the remaining 15 columns to dimension v2; ih.format = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8) contains the structure for the item i and rater h effects (for example, the item responses in the first three columns of the input data are associated with Rater 1, h1); and hv.format = c(1, 1, 1, 2, 2, 2, 2, 2) connects the dimensions v to the rater effects h.
Code for a p• × i• × h° design model
p• × (i°: h•) design
As one can see, compared with the aforementioned crossed designs, the nested design has fewer parameters to estimate. This is because (1) the item effects and the interaction effects between item i and rater h are bundled, and (2) the interaction effects between person p and item i are now entangled with the residual effects.

The data are entered as a matrix, as Fig. 5 shows.

Np, V, zero.vector, grand.mu, Ncol, and R are identical to those variables in the previous designs. The only exception is that Nh = 3, because the present example has three raters in total.

Specifying i.format = c(1, 2, . . . , 24) is not necessary, because looping from 1 to Ncol already handles the item indexing; for the sake of consistency, however, i.format is kept in the example code. iv.format = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2) is a vector showing the subtest to which each item belongs. Finally, ih.format = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3) indicates the relationship between items and raters.
Code for a p• × (i°: h•) design model
p• × (i°: h°) design
From Eq. 13, one can tell that only one full matrix (for the person effects) is estimated. Thus, among all the two-facet designs presented in this article, the present example has the fewest parameters to estimate.

The data are entered as a matrix, as Fig. 6 shows.

Np, V, zero.vector, grand.mu, Ncol, and R are identical to those variables in the previous designs. The only exception is that Nh = 5, since the present example has five raters in total.

As we mentioned earlier, i.format = c(1, 2, . . . , 24) is not necessary but is still specified in order to maintain consistency. iv.format = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2) is a vector showing the subtest to which each item belongs: Items i1 to i15 are under subtest v1, and the remaining items are nested within subtest v2. ih.format = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5) indicates the relations between items and raters; for example, items i20 to i24 are graded by rater h5. Finally, hv.format = c(1, 1, 1, 2, 2, 2, 2, 2) represents the nested structure of raters and subtests.
Code for a p• × (i°: h°) design model
Example
Σ_p is the person-effect covariance matrix, whereas Σ_δ is the error covariance matrix, and n_i is the number of items. The a vector is a weighting scheme; that is, it defines the importance level of each subtest. There are various approaches to the estimation of weights, which are beyond the scope of this tutorial (see Brennan, 2001a; Marcoulides, 1994; and Srinivasan & Shocker, 1973, for details). It is evident that different weighting schemes can lead to different G coefficients. One can use \( \frac{1}{V} \) to deploy an equal-weight scheme if no prior assumptions are made, where V is the number of subtest dimensions. The equal-weight scheme is applied throughout this article, in both the simulation study and the real-data analysis. Here, in the example, the true \( {\rho}_{\delta}^2 \) is .904.
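For readers who want to see the composite computation end to end, the following Python sketch applies the equal-weight scheme. The covariance matrices, the item count, and the per-item error structure are all hypothetical, not the article's simulated values:

```python
# Hedged sketch of the composite G coefficient under the equal-weight
# scheme: rho^2 = a' Sigma_p a / (a' Sigma_p a + a' Sigma_delta a).
# All numeric values here are hypothetical.
import numpy as np

sigma_p = np.array([[4.0, 3.0],
                    [3.0, 5.0]])    # person (universe-score) covariance
sigma_pie = np.array([[4.1, 3.5],
                      [3.5, 5.1]])  # relative-error covariance per item, assumed
n_i = 12                            # number of items, assumed
sigma_delta = sigma_pie / n_i       # error covariance of the mean score

V = 2                               # number of subtest dimensions
a = np.full(V, 1.0 / V)             # equal weights, a = 1/V

universe = a @ sigma_p @ a          # weighted universe-score variance
error = a @ sigma_delta @ a         # weighted error variance
g_composite = universe / (universe + error)
```

Changing a reweights both the numerator and the denominator, which is why different weighting schemes yield different composite G coefficients.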
Table 2. Estimation of the simulated dataset

Name         True value   mGENOVA   Non-missing situation           10% missing situation
                                    BUGS est.   2.5%     97.5%      BUGS est.   2.5%     97.5%
σ²_{1p}      4            3.28      3.28        2.64     4.07       3.37        2.82     3.76
σ_{1p,2p}    3            2.83      2.82        2.15     3.61       2.68        1.92     3.76
σ_{1p,3p}    2            1.78      1.76        1.20     2.41       1.70        1.08     1.75
σ²_{2p}      5            4.46      4.45        3.58     5.52       4.14        3.84     6.33
σ_{2p,3p}    1            1.42      1.41        0.81     2.10       1.17        0.66     1.57
σ²_{3p}      4.5          3.66      3.66        2.97     4.48       4.48        3.32     4.50
σ²_{1i}      6.5          8.77      8.92        3.64     21.29      9.45        3.24     21.89
σ_{1i,2i}    3.3          5.86      5.89        1.39     16.08      5.90        1.27     16.62
σ_{1i,3i}    4.2          4.04      4.05        1.48     9.93       3.59        2.22     9.03
σ²_{2i}      8.5          8.17      8.37        3.35     19.97      8.79        3.31     20.17
σ_{2i,3i}    2.4          3.57      3.61        1.20     8.93       3.21        1.25     8.71
σ²_{3i}      4.5          2.26      2.39        0.98     5.75       2.61        0.85     6.25
σ²_{1e}      4.1          4.10      4.10        3.84     4.39       4.74        4.88     4.06
σ_{1e,2e}    3.5          3.45      3.45        3.20     3.72       4.27        3.16     3.83
σ_{1e,3e}    2.2          2.20      2.20        2.03     2.39       2.01        2.70     2.35
σ²_{2e}      5.1          5.01      5.02        4.71     5.36       5.44        3.82     5.79
σ_{2e,3e}    2.1          2.06      2.06        1.87     2.25       2.90        1.40     2.55
σ²_{3e}      2.3          2.34      2.34        2.19     2.50       2.61        3.26     2.62
Meanwhile, in the missing-data situation, BUGS provided results similar to those in the complete-data situation, but mGENOVA was not able to handle the missing-data problems. In addition to the variance component estimates, the G coefficient (reliability) can be calculated on the fly, forming a posterior distribution. The deviance information criterion (DIC), a model fit index, is also provided by BUGS, so that model comparisons become available.
Discussion
The value of the proposed approach primarily lies in quantifying the uncertainty of the estimates, which other methods do not provide. In practice, practitioners often rely on the point estimate when using G coefficients to make decisions, but this can be problematic if the estimate lacks precision. In the example section, the 95% credible interval of the G coefficient ranged from .873 to .916. If one uses .88 as a criterion, one may conclude that the test is reliable, because the point estimate of .896 is higher than the criterion. On the other hand, if one realizes that there is a good chance that uncertainty could pull the G coefficient below the criterion, the decision could differ from one based only on point estimates.
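This decision logic is easy to operationalize once posterior draws are available. In the sketch below, the draws are simulated from a normal distribution purely for illustration; in a real analysis they would come from the MCMC output:

```python
# Hedged sketch: judging a G coefficient against a criterion with the
# whole posterior rather than a single point estimate. The draws are
# simulated here for illustration only; real draws would come from MCMC.
import random

random.seed(7)
draws = [random.gauss(0.896, 0.011) for _ in range(10_000)]

point_estimate = sum(draws) / len(draws)                # near .896
prob_below = sum(d < 0.88 for d in draws) / len(draws)  # posterior mass below the criterion
# Even with a point estimate above .88, a nontrivial share of the
# posterior can fall below the criterion.
```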
In the example section, we used plots to verify the status of convergence. Relying only on visualizations, although they are straightforward and useful, cannot guarantee that an MCMC chain has converged. To provide supplementary support for monitoring the MCMC process, researchers have proposed various ways to quantify convergence: For example, (1) Geweke (1991) compared means obtained from different Markov chains; (2) Raftery and Lewis (1992) estimated the minimum chain length needed to estimate a percentile to a certain precision level; (3) Gelman and Rubin's (1992) \( \widehat{R} \) compares variances between chains; and (4) Brooks and Gelman (1998) modified \( \widehat{R} \) into \( \widehat{R_c} \) so that it would account for sampling variability. Here we briefly discuss \( \widehat{R} \) and its modified version, since they have been widely used in practice (Woodard, 2007).
Note that σ_{wc} and σ_{bc} represent the within-chain and between-chain variances, respectively. According to Gelman and Rubin (1992), if convergence has been reached, \( \widehat{R} \) will be close to 1. Furthermore, to correct for the influence of sampling variability in the MCMC process, Brooks and Gelman (1998) proposed \( \widehat{R_c} \), obtained by multiplying \( \widehat{R} \) by \( \sqrt{\frac{\widehat{d}+3}{\widehat{d}+1}} \), where \( \widehat{d} \), the estimated degrees of freedom of a Student t distribution, can be estimated by:
\( \widehat{d}\approx \frac{2{\widehat{V}}^2}{\widehat{\mathit{\operatorname{var}}}\left(\widehat{V}\right)},\ \mathrm{where}\ \widehat{V}=\frac{n-1}{n}{\sigma}_{wc}+\frac{1}{n}{\sigma}_{bc} \), with n denoting the number of iterations per chain.
In practice, model parameters are considered converged if \( \widehat{R_c}<1.2 \), and models overall are considered converged if \( \widehat{R_c}<1.2 \) for all model parameters.
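The basic \( \widehat{R} \) computation can be sketched as follows; the chains are simulated here for illustration only, and a real check would use the MCMC output:

```python
# Hedged sketch of the Gelman-Rubin diagnostic: sigma_wc is the pooled
# within-chain variance, sigma_bc the between-chain variance, and R-hat
# compares the two. The chains are simulated for illustration only.
import random
import statistics

def r_hat(chains):
    n = len(chains[0])  # iterations per chain
    chain_means = [statistics.fmean(c) for c in chains]
    sigma_wc = statistics.fmean(statistics.variance(c) for c in chains)
    sigma_bc = n * statistics.variance(chain_means)
    v_hat = (n - 1) / n * sigma_wc + sigma_bc / n
    return (v_hat / sigma_wc) ** 0.5

random.seed(1)
# two well-mixed chains drawn from the same distribution
chains = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(2)]
converged = r_hat(chains)  # should be close to 1
```

Chains stuck in different regions would inflate the between-chain variance and push \( \widehat{R} \) well above 1.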
DIC values are automatically calculated and available from BUGS analyses (except for a few highly complex models). Although it is not discussed in detail here, the DIC is beneficial when one is examining whether a model fits the data better than an alternative model. For instance, if the data in the example section were fitted with a univariate G-theory model, the DIC would be able to indicate the inappropriateness of this wrong model when it was compared with the correctly specified model.
In addition to the capacity to handle missing-data problems, the Bayesian approach to G theory has other beneficial features. Traditional methods can produce negative estimated variance components due to sampling error or model misspecification, even though the sample space is theoretically truncated to the range from zero to positive infinity (Searle, 1971). Using bounded priors in the Bayesian approach can avoid this problem: One can set a lower bound of zero on the estimates, such that values in the impermissible numeric space are prohibited (Box & Tiao, 1973; Fyans, 1977). For example, one can use a uniform prior distribution on (0, ∞) to define the sampling space of the variance estimates, such that the draws from the posterior distributions are bounded above zero. This "brute force" practice, however, is always improper (i.e., it has infinite total mass) and therefore yields unsatisfactory results in many situations. Instead, Gelman, Jakulin, Pittau, and Su (2008) suggest using weakly informative priors to solve the problem; that is, specifying priors that are proper, for analytical convenience, but contain less prior knowledge than fully subjective priors would. To illustrate, setting a half-Cauchy distribution with scale 25 as the prior for variance parameters in hierarchical linear models is more accurate than using a uniform (0, ∞) prior, although neither distribution is informative (see Gelman, 2006, for details). In the present article, inverse-Wishart distributions have been used in several models to define the prior information for the covariance matrices. The reason for using inverse-Wishart distributions is that if a sample is randomly drawn from a multivariate normal distribution with covariance matrix Σ, the sample scatter matrix can be shown mathematically to follow a Wishart distribution.
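The distributional fact motivating the (inverse-)Wishart family can be checked numerically; this hedged sketch, with illustrative values, verifies that the scatter matrix of multivariate normal draws has expectation n · Σ, as a Wishart(n, Σ) variate should:

```python
# Hedged numerical check: the scatter matrix X'X of n draws from
# N(0, Sigma) follows a Wishart distribution with n degrees of freedom
# and scale Sigma, so its expectation is n * Sigma. Values illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
n = 50       # sample size per scatter matrix
reps = 2000  # number of scatter matrices to average

mean_scatter = np.zeros((2, 2))
for _ in range(reps):
    x = rng.multivariate_normal(np.zeros(2), sigma, size=n)
    mean_scatter += x.T @ x
mean_scatter /= reps
# mean_scatter should be close to n * sigma
```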
Specifying a Wishart distribution with the identity matrix (mean) and a positive scalar degrees-of-freedom parameter V (dispersion), as outlined in Tables 2, 3, 4, 5, 6, and 7, is uninformative and does not guarantee nonnegative estimates. In addition, smaller degrees of freedom, as discussed in the present article, can lead to larger sampling variability in some situations. Chung, Gelman, Rabe-Hesketh, Liu, and Dorie (2015) proposed a class of Wishart priors whose (1) degrees of freedom are set equal to the number of varying coefficients plus 2, and (2) scale matrix is set to the identity matrix multiplied by a value that is large relative to the scale of the problem; this solution ensures that the posterior distributions of the covariance matrices are strictly positive definite, as well as smaller errors for the fixed-coefficient estimates.
Although weakly informative priors can be used to constrain the numeric space of the posterior distributions, they do not necessarily "quantify one's prior beliefs about the likely values for the unknowns independently of the data from the current study" (Dunson, 2001). Using informative priors instead, one can formally incorporate information that is not present in the current data. This practice can be controversial, because subjective judgments, especially uneducated guesses or unverified beliefs, can impede the integrity of a study. To address this criticism, one reasonable strategy is to adopt empirical Bayes methods, in which the prior distribution is estimated from the data or from previous analyses. For instance, given that mG-theory variance and covariance components can be estimated by minimizing mean square error (Brennan, 2001a), and that some non-fully-crossed designs can be estimated via a factor-analytic approach (Woehr, Putka, & Bowler, 2012), researchers can specify informative priors based on the results obtained from these frequentist methods.
Summary
In this article, several examples of BUGS code are provided to fit many of the common mG-theory designs found in the literature. The examples are presented alongside the theoretical benefits of estimating variance components in a fully Bayesian framework. Note that Brennan's (2001a) generalizability theory book covers additional designs that are not presented in the BUGS code section; however, readers should find it straightforward to extend the present code to those designs.
Footnotes
 1.
The simulated dataset and mGENOVA syntax can be downloaded via: https://s3-us-west-1.amazonaws.com/zjiang4/data_code.zip.
References
 Arterberry, B. J., Martens, M. P., Cadigan, J. M., & Rohrer, D. (2014). Application of generalizability theory to the Big Five Inventory. Personality and Individual Differences, 69, 98–103.
 Box, G. E., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. New York, NY: Wiley.
 Brennan, R. L. (2001a). Generalizability theory. New York, NY: Springer.
 Brennan, R. L. (2001b). Manual for mGENOVA. Iowa City, IA: Iowa Testing Programs, University of Iowa.
 Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455.
 Chung, Y., Gelman, A., Rabe-Hesketh, S., Liu, J., & Dorie, V. (2015). Weakly informative prior for point estimation of covariance matrices in hierarchical models. Journal of Educational and Behavioral Statistics, 40, 136–157.
 Clauser, B. E., Harik, P., & Margolis, M. J. (2006). A multivariate generalizability analysis of data from a performance assessment of physicians’ clinical skills. Journal of Educational Measurement, 43, 173–191.
 Clauser, B. E., Margolis, M., & Swanson, D. B. (2002). An examination of the contribution of computer-based case simulations to the USMLE Step 3 examination. Academic Medicine, 77, 80–82.
 Clauser, B. E., Swanson, D. B., & Harik, P. (2002). Multivariate generalizability analysis of the impact of training and examinee performance information on judgments made in an Angoff-style standard-setting procedure. Journal of Educational Measurement, 39, 269–290.
 Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91, 883–904.
 Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: Wiley.
 Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163.
 Curtis, S. M. (2010). BUGS code for item response theory. Journal of Statistical Software, 36, 1–34.
 R Development Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from www.R-project.org/
 Dunson, D. B. (2001). Commentary: Practical advantages of Bayesian analysis of epidemiologic data. American Journal of Epidemiology, 153, 1222–1226.
 Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.
 Fyans, L. J., Jr. (1977). A new multiple level approach to cross-cultural psychological research. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.
 Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1, 515–533.
 Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics, 1360–1383.
 Gelman, A., & Rubin, D. B. (1992). A single series from the Gibbs sampler provides a false sense of security. Bayesian Statistics, 4, 625–631.
 Gessaroli, M. E. (2003). Addressing generalizability theory via structural modelling: Interesting relationships and practical implications. Paper presented at the annual meeting of the National Council on Measurement in Education, Philadelphia, PA.
 Geweke, J. (1991). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments (Vol. 196). Minneapolis, MN: Federal Reserve Bank of Minneapolis, Research Department.
 Gilks, W. R. (1992). Derivative-free adaptive rejection sampling for Gibbs sampling. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 641–649). Oxford, UK: Oxford University Press.
 Gillmore, G. M., Kane, M. T., & Naccarato, R. W. (1978). The generalizability of student ratings of instruction: Estimation of the teacher and course components. Journal of Educational Measurement, 15, 1–13.
 Gleser, G., Cronbach, L., & Rajaratnam, N. (1965). Generalizability of scores influenced by multiple sources of variance. Psychometrika, 30, 395–418.
 Jackman, S. (2009). Bayesian analysis for the social sciences. New York, NY: Wiley.
 Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society, 186, 453–461. https://doi.org/10.1098/rspa.1946.0056
 Joe, G. W., & Woodward, J. A. (1976). Some developments in multivariate generalizability. Psychometrika, 41, 205–217.
 Lakes, K. D., & Hoyt, W. T. (2009). Applications of generalizability theory to clinical child and adolescent psychology research. Journal of Clinical Child & Adolescent Psychology, 38, 144–165.
 Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data. New York, NY: Wiley.
 Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28, 3049–3067.
 Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337. https://doi.org/10.1023/A:1008929526011
 Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists. New York, NY: Springer.
 Marcoulides, G. A. (1990). An alternative method for estimating variance components in generalizability theory. Psychological Reports, 66(2), 379–386.
 Marcoulides, G. A. (1994). Selecting weighting schemes in multivariate generalizability studies. Educational and Psychological Measurement, 54, 3–7.
 Marcoulides, G. A. (1995). Designing measurement studies under budget constraints: Controlling error of measurement and power. Educational and Psychological Measurement, 55(3), 423–428.
 Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313–335. https://doi.org/10.1037/a0026802
 Neal, R. M. (1997). Markov chain Monte Carlo methods based on “slicing” the density function (Technical Report No. 97-22). Toronto, ON, Canada: University of Toronto, Department of Statistics and Department of Computer Science.
 Norusis, M. J. (1990). SPSS: Statistical data analysis. SPSS.
 Nußbaum, A. (1984). Multivariate generalizability theory in educational measurement: An empirical study. Applied Psychological Measurement, 8, 219–230.
 Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing (Vol. 124, p. 125). Vienna, Austria: Technische Universität Wien. Retrieved from https://www.r-project.org/conferences/DSC-2003/Proceedings/
 Plummer, M. (2010). JAGS version 2.2.0 manual. Available from http://mcmc-jags.sourceforge.net.
 Puhan, G., Sinharay, S., Haberman, S., & Larkin, K. (2008). Comparison of subscores based on classical test theory methods (ETS Research Report No. RR-08-54). Princeton, NJ: Educational Testing Service.
 Raftery, A. E., & Lewis, S. M. (1992). How many iterations in the Gibbs sampler? In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 763–773). Oxford, UK: Oxford University Press.
 Raykov, T., & Marcoulides, G. A. (2006). Estimation of generalizability coefficients via a structural equation modelling approach to scale reliability evaluation. International Journal of Testing, 6, 81–95.
 SAS Institute. (1990). SAS/STAT user’s guide: Version 6 (Vol. 2). Cary, NC: Author.
 Searle, S. R. (1971). Linear models. New York, NY: Wiley.
 Shavelson, R., & Dempsey-Atwood, N. (1976). Generalizability of measures of teaching behavior. Review of Educational Research, 46, 553–611.
 Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973–1980. British Journal of Mathematical and Statistical Psychology, 34, 133–166.
 Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.
 Sinharay, S. (2013). A note on assessing the added value of subscores. Educational Measurement: Issues and Practice, 32, 38–42.
 Sinharay, S., & Haberman, S. J. (2014). An empirical investigation of population invariance in the value of subscores. International Journal of Testing, 14, 22–48.
 Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2003). WinBUGS user manual. Cambridge, UK: MRC Biostatistics Unit.
 Srinivasan, V., & Shocker, A. D. (1973). Estimating the weights for multiple attributes in a composite criterion using pairwise judgments. Psychometrika, 38, 473–493.
 Thomas, A., O’Hara, B., Ligges, U., & Sturtz, S. (2006). Making BUGS open. R News, 6, 12–17.
 Wasserman, R. H., Levy, K. N., & Loken, E. (2009). Generalizability theory in psychotherapy research: The impact of multiple sources of variance on the dependability of psychotherapy process ratings. Psychotherapy Research, 19, 397–408.
 Woehr, D. J., Putka, D. J., & Bowler, M. C. (2012). An examination of G-theory methods for modeling multitrait–multimethod data: Clarifying links to construct validity and confirmatory factor analysis. Organizational Research Methods, 15, 134–161.
 Woodard, D. B. (2007). Detecting poor convergence of posterior samplers due to multimodality (Discussion Paper 2008-05). Durham, NC: Duke University, Department of Statistical Science.
 Woodward, J. A., & Joe, G. W. (1973). Maximizing the coefficient of generalizability in multifacet decision studies. Psychometrika, 38, 173–181.
 Wu, Y. F., & Tzou, H. (2015). A multivariate generalizability theory approach to standard setting. Applied Psychological Measurement, 39, 507–524.