
Behavior Research Methods, Volume 51, Issue 2, pp. 651–662

Using Hamiltonian Monte Carlo to estimate the log-linear cognitive diagnosis model via Stan

  • Zhehan Jiang
  • Richard Carter

Abstract

The Bayesian literature has shown that the Hamiltonian Monte Carlo (HMC) algorithm is powerful and efficient for statistical model estimation, especially for complicated models. Stan, a software program built upon HMC, has been introduced for psychometric model estimation. However, there are no systematic guidelines for implementing Stan with the log-linear cognitive diagnosis model (LCDM), which is the saturated version of many cognitive diagnostic model (CDM) variants. This article bridges the gap between Stan application and Bayesian LCDM estimation: Both the modeling procedures and the Stan code are demonstrated in detail, so that the strategy can be extended straightforwardly to other CDMs.

Keywords

Markov chain Monte Carlo (MCMC) · Bayesian · Cognitive diagnostic model · LCDM · Stan · Hamiltonian Monte Carlo (HMC)

The popularity of Bayesian inference has grown in the past decade. The proportion of publications indexed by Google Scholar containing the terms "Bayes" and/or "Bayesian" has increased from roughly 5% in the early 2000s to above 25% (a similar finding can be found in M. D. Lee & Wagenmakers, 2014, p. 7). The core of Bayesian inference, Markov chain Monte Carlo (MCMC) techniques, can provide accurate estimation for models of high complexity where many traditional methods cannot (see, e.g., Muthén & Asparouhov, 2012). At the same time, the availability of software programs such as WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), JAGS (Plummer, 2003), and Stan (Carpenter et al., 2017) has improved the usability of Bayesian estimation in practice. Among the MCMC algorithms, Gibbs sampling (Geman & Geman, 1984), the Metropolis algorithm (MH; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953), and hybrid variants such as Metropolis Hastings-within-Gibbs (Gilks, 1998) are widely known; these algorithms are incorporated in WinBUGS and JAGS. In the current context, examples of using MH or Gibbs sampling to estimate the parameters of several simplified cognitive diagnostic models can be found in Culpepper (2015), de la Torre (2009), de la Torre and Douglas (2004), and Junker and Sijtsma (2001). In addition, DeCarlo (2012) uses a Bayesian approach to recognize uncertainty in the DINA Q-matrix.

These algorithms have a high chance of failing to converge due to inefficiencies in the MCMC process, especially for models with many correlated parameters and models lacking posterior conjugacy. Sorensen, Hohenstein, and Vasishth (2016) note that "a model with 14 fixed effects predictors and two crossed random effects by subject and item, each involving a 14×14 variance-covariance matrix" cannot be fit in WinBUGS or JAGS. In contrast, Stan (Carpenter et al., 2017) is capable of fitting the aforementioned model and other arbitrarily complex models, because it implements the no-U-turn sampler (Hoffman & Gelman, 2014), an extension of the Hamiltonian Monte Carlo (HMC; Neal, 2011) algorithm. Simulation studies have shown that HMC is more efficient than counterparts such as the MH algorithm and Gibbs samplers in many situations (Girolami & Calderhead, 2011). On the other hand, simulation studies conducted by Almond (2014) show that for a simple model JAGS can converge in a shorter time, although Stan provides larger effective sample sizes. Da Silva, de Oliveira, von Davier, and Bazán (2017) report that a tailored Gibbs sampler, incorporated in the R package dina (Culpepper, 2015), can be more efficient than HMC in Stan, whereas OpenBUGS is less efficient. Note that the outcome of such efficiency comparisons can vary from model to model and from condition to condition; therefore, no general conclusion can be drawn at this point.

HMC is the default sampler in the Stan software program, which has been introduced to psychometric modeling; for example, Luo and Jiao (2017) provide a tutorial on how to apply Stan to item response theory (IRT) models. In another tutorial, Annis, Miller, and Palmeri (2017) delineate the steps of specifying customized distributions in Stan via linear ballistic accumulator models. Furthermore, Sorensen, Hohenstein, and Vasishth (2016) illustrate the procedures for fitting two-level hierarchical linear models in Stan. Finally, Merkle and Wang (2018) apply Stan to structural equation models. With respect to the family of cognitive diagnosis models (CDMs), S. T. Lee (2016) presents the details of specifying Stan code for the deterministic inputs, noisy "and" gate (DINA) model (Haertel, 1989; Junker & Sijtsma, 2001; Macready & Dayton, 1977), and similar work can be found in a recent contribution by da Silva, de Oliveira, von Davier, and Bazán (2017). To date, no instructions are available for estimating other, more advanced CDM variants via Stan. Zhan (2017) demonstrates JAGS code to fit some common CDMs, including DINA, the deterministic input, noisy "or" gate (DINO) model (Templin & Henson, 2006), the linear logistic model (LLM; Maris, 1999), and, finally, the saturated version of these models: the log-linear CDM (LCDM; Henson, Templin, & Willse, 2009). Although Zhan specifies the Bayesian LCDM via JAGS comprehensively, the algorithm is not based upon HMC.

The LCDM has gained emphasis because it provides more information than other CDM variants. In addition, it is relatively robust when the hierarchy of the structural model is misspecified (Templin & Bradshaw, 2014). Note that the LCDM is essentially equivalent to the generalized DINA model (GDINA; de la Torre, 2011). For consistency, the abbreviation "LCDM" is used to represent the saturated model in the following sections. Essentially, the present article aligns LCDM estimation with HMC via the Stan program. Readers are assumed to have fundamental knowledge of Bayesian inference and statistical programming languages (see Sorensen et al., 2016, for the prerequisites for understanding the present article). The LCDM is briefly revisited in the coming section.

The log-linear cognitive diagnostic model

Unlike item response theory (IRT), which assumes a continuum for the latent variables of interest, cognitive diagnostic models (CDMs; DiBello & Stout, 2007) have been developed to evaluate strengths and weaknesses in a preverified content domain by identifying the presence or absence of multiple fine-grained attributes (or skills). This feature of CDMs naturally places the latent variables of interest on a binary scale, such that "mastery" or "nonmastery" of the attributes can be determined. As a result, CDMs can yield an attribute profile for each respondent that essentially serves as a diagnostic report. Similar to IRT, CDMs possess high interpretability of the attribute structure, which can be used to support or verify a certain theory, but CDMs have higher reliability and can offer richer diagnostic information to aid decision making (Rupp & Templin, 2008). Well-known CDM variants include the aforementioned DINA and DINO models, the noisy input, deterministic "and" gate model (NIDA; Junker & Sijtsma, 2001), and the reparameterized unified model (RUM; Hartz, 2002). Subsequent advances in model development have produced general diagnostic models; examples include the generalized deterministic input, noisy "and" gate model (G-DINA; de la Torre, 2011), the general diagnostic model (GDM; von Davier, 2005), and the aforementioned LCDM. The LCDM (G-DINA with a logit link) provides great flexibility: (1) it subsumes most common CDM variants, (2) it enables both additive and nonadditive relationships between attributes and items simultaneously, and (3) it can be linked with other psychometric models, increasing interpretability. Given these advantages, this article extends the DINA-HMC work by da Silva et al. (2017) to a more widely applicable situation via an LCDM strategy.

In sum, CDMs are a class of probabilistic, confirmatory, multidimensional latent-class models (von Davier, 2009). The LCDM is distinct from other CDM variants in terms of its parameterization of the measurement component. We start with the formulas for a general latent class model and then move on to the specific parameterization of the LCDM. As a member of the latent class model family, the LCDM is mathematically defined as
$$ P\left(\boldsymbol{Y}_p = \boldsymbol{y}_p\right) = \sum_{c=1}^{C} v_c \prod_{i=1}^{I} \pi_{ci}^{y_{pi}} \left(1 - \pi_{ci}\right)^{1 - y_{pi}}, $$
(1)
where yp = (yp1, yp2, …, ypI) is the correct/incorrect response vector of respondent p on a test comprising I items, and element ypi is the corresponding response on item i. vc is the probability of membership in latent class c, and πci is the probability of a correct response to item i by a respondent in class c. Extending Eq. 1, the log-likelihood function for a random sample of size N can be expressed as
$$ L = \sum_{p=1}^{N} \log \left\{ \sum_{c=1}^{C} \left( v_c \prod_{i=1}^{I} \pi_{ci}^{y_{pi}} \left(1 - \pi_{ci}\right)^{1 - y_{pi}} \right) \right\}. $$
(2)
To simplify computational efforts, Eq. 2 is often rewritten as:
$$ L = \sum_{p=1}^{N} \log \left\{ \sum_{c=1}^{C} \exp \left( \log \left(v_c\right) + \log \left( \prod_{i=1}^{I} \pi_{ci}^{y_{pi}} \left(1 - \pi_{ci}\right)^{1 - y_{pi}} \right) \right) \right\}, $$
(3)
where \( \log \left( \prod_{i=1}^{I} \pi_{ci}^{y_{pi}} \left(1 - \pi_{ci}\right)^{1 - y_{pi}} \right) \) can be further converted to \( \sum_{i=1}^{I} \log \left( \pi_{ci}^{y_{pi}} \left(1 - \pi_{ci}\right)^{1 - y_{pi}} \right) \).
Suppose there are A attributes. The cognitive state of a respondent is denoted by the attribute vector α = (α1, α2, …, αA), where each element αa is a binary (1/0) variable indicating whether the respondent has mastered the ath attribute. In total there are 2^A possible attribute patterns (i.e., classes). To illustrate, a respondent p with pattern α = (0, 1, 1, 0) has mastered the second and third attributes, but not the first and fourth ones. Similarly, if the pattern is α = (1, 1, 1, 1), the respondent has mastered all attributes. To identify the attributes required to solve each item, content experts provide a Q-matrix of size I × A, where I and A are the numbers of items and attributes in a test, respectively. The (i, a) entry of the Q-matrix, qia, is 1 when item i is associated with attribute a, and qia = 0 otherwise. Given that respondent p's attribute pattern is αc, the conditional probability of a correct response to item i can be stated as
$$ \pi_{ci} = P\left(y_{pi} = 1 \mid \boldsymbol{\alpha}_c\right) = \frac{\exp \left(\lambda_{i,0} + \boldsymbol{\lambda}_i^{T} \boldsymbol{h}\left(\boldsymbol{\alpha}_c, \boldsymbol{q}_i\right)\right)}{1 + \exp \left(\lambda_{i,0} + \boldsymbol{\lambda}_i^{T} \boldsymbol{h}\left(\boldsymbol{\alpha}_c, \boldsymbol{q}_i\right)\right)}, $$
(4)
where qi is the set of Q-matrix entries for item i, λi, 0 is the intercept parameter, λi is a vector of size (2^A − 1) × 1 that contains the main-effect and interaction-effect parameters of item i, and h(αc, qi) is a vector of size (2^A − 1) × 1 containing linear combinations of αc and qi. In particular, \( \boldsymbol{\lambda}_i^{T} \boldsymbol{h}\left(\boldsymbol{\alpha}_c, \boldsymbol{q}_i\right) \) inside the exponential function can be expressed as
$$ \boldsymbol{\lambda}_i^{T} \boldsymbol{h}\left(\boldsymbol{\alpha}_c, \boldsymbol{q}_i\right) = \sum_{a=1}^{A} \lambda_{i,1,(a)} \alpha_{ca} q_{ia} + \sum_{a=1}^{A-1} \sum_{a' > a}^{A} \lambda_{i,2,\left(a, a'\right)} \alpha_{ca} \alpha_{ca'} q_{ia} q_{ia'} + \dots, $$
(5)
where λi, 1, (a) is the main effect for the ath attribute αa, and λi, 2, (a, a′) is the two-way interaction effect for αa and αa′. Since the elements of αc and qi are binary, h(αc, qi) contains binary elements that indicate which effects need to be estimated. For an item measuring n attributes, up to n-way interaction effects should be specified in h(αc, qi). Table 1 shows a concrete example for a measure with three attributes: Item 1, which measures only α1, has two estimates, whereas Item 3, measuring all three attributes, has eight estimates in total. The item parameters, however, do require monotonicity constraints; otherwise, the LCDM estimation is likely to encounter label-switching problems (Lao & Templin, 2016). In particular, the expectation maximization (EM; Dempster, Laird, & Rubin, 1977) algorithm, a dominant method in CDM estimation, could produce results leading to unreasonable interpretations of the item parameters, as well as disruption of the convergence process. Rupp, Templin, and Henson (2010) outlined the parameter-constraint approach: for example, ensuring the positiveness of λi, 1 in Eq. 5 and forcing the two-way interaction effect λi, 2, (a, a′) to be larger than the corresponding negated main effects −λi, 1, (a) and −λi, 1, (a′). The evidence suggests that the parameter-constraint approach keeps the labels consistent (Lao & Templin, 2016); however, the constrained true sampling space remains unknown, due to mathematical complexity. Note that the use of monotonicity constraints can introduce bias in the estimates of parameters with small (population) values, and therefore the constraints remain controversial in the literature. After introducing the measurement component, we next describe the structural component.
Table 1

Formula expression example of a log-linear cognitive diagnosis model

Item | α1 | α2 | α3 | Complete \( \lambda_{i,0} + \boldsymbol{\lambda}_i^{T}\boldsymbol{h}(\boldsymbol{\alpha}_c, \boldsymbol{q}_i) \) Expression | Simplified Expression
1 | 1 | 0 | 0 | λ1, 0 + λ1, 1(1) + λ1, 2(0) + λ1, 3(0) + λ1, 12(1 ∗ 0) + λ1, 13(1 ∗ 0) + λ1, 23(0 ∗ 0) + λ1, 123(1 ∗ 0 ∗ 0) | λ1, 0 + λ1, 1(1)
2 | 0 | 1 | 1 | λ2, 0 + λ2, 1(0) + λ2, 2(1) + λ2, 3(1) + λ2, 12(0 ∗ 1) + λ2, 13(0 ∗ 1) + λ2, 23(1 ∗ 1) + λ2, 123(0 ∗ 1 ∗ 1) | λ2, 0 + λ2, 2(1) + λ2, 3(1) + λ2, 23(1)
3 | 1 | 1 | 1 | λ3, 0 + λ3, 1(1) + λ3, 2(1) + λ3, 3(1) + λ3, 12(1 ∗ 1) + λ3, 13(1 ∗ 1) + λ3, 23(1 ∗ 1) + λ3, 123(1 ∗ 1 ∗ 1) | λ3, 0 + λ3, 1(1) + λ3, 2(1) + λ3, 3(1) + λ3, 12(1) + λ3, 13(1) + λ3, 23(1) + λ3, 123(1)

In addition to the item parameters, researchers also focus on the class membership of respondents. Given a response vector yp, let P(C = c | Yp = yp) be the posterior class probability for a respondent, where, again, the subscripts i, c, and p represent the item, latent class, and person, respectively. Following the EM derivation (Dempster et al., 1977), it can be shown that
$$ P\left(C = c \mid \boldsymbol{Y}_p = \boldsymbol{y}_p\right) = \frac{v_c \prod_{i=1}^{I} \pi_{ci}^{y_{pi}} \left(1 - \pi_{ci}\right)^{1 - y_{pi}}}{\sum_{c'=1}^{C} v_{c'} \prod_{i=1}^{I} \pi_{c'i}^{y_{pi}} \left(1 - \pi_{c'i}\right)^{1 - y_{pi}}}. $$
(6)

This estimate provides diagnostic information for each individual. The derivation details can be found in Knott and Bartholomew (1999, pp. 90–92). The posterior probability of mastery of attribute a can be further derived by summing \( \sum_{c'} P\left(C = c' \mid \boldsymbol{Y}_p = \boldsymbol{y}_p\right) \) over all latent classes c′ whose attribute pattern has a 1 in the ath position. Note that Eqs. 1–6 are the foundation for constructing the Stan code: the likelihood function coded in Stan aligns directly with these equations.

Hamiltonian Monte Carlo and no-U-turn sampler

Under the umbrella of MCMC methods, HMC augments the Markov chain with Hamiltonian dynamics so that the target distribution can be explored more efficiently. That is, HMC extends the MH algorithm by generating more precise proposal values using Hamiltonian dynamics. In each iteration of the algorithm, the parameter values "leapfrog" to states closer to regions of high posterior density, shortcutting the time the MH algorithm would take by avoiding proposal values that are ultimately rejected. Once new values are proposed, the HMC algorithm uses a Metropolis step to accept or reject them. Therefore, compared with MH algorithms, the HMC algorithm yields a more efficient Monte Carlo sampler (Neal, 2011).

The original version of HMC has two tuning parameters: the trajectory length and the step size. Specifically, the step size is the distance between successive candidate points along a single leapfrog trajectory, whereas the trajectory length defines how long such a trajectory should be. Hoffman and Gelman (2014) show that HMC performance depends heavily on these two parameters, which creates a risk of poor estimation results. For example, while larger step sizes accelerate computation, they come at the cost of rejecting more proposals. Similarly, if the trajectory is not sufficiently long, HMC can degenerate into an inefficient random walk; on the other hand, computational power is wasted if the trajectory is unnecessarily long. The no-U-turn sampler was proposed to reduce HMC's dependency on these two parameters (Hoffman & Gelman, 2014). Instead of using a fixed step size, the no-U-turn sampler first pre-explores the sampling space and tunes the step size toward a target acceptance rate (0.8 by default) that has been shown to be near-optimal (Betancourt, Byrne, & Girolami, 2014). The no-U-turn sampler then adopts a recursive tree-building algorithm that keeps doubling the trajectory until a U-turn is encountered; the U-turn is a sign that further exploration is not computationally worthwhile. The trajectory length is therefore adapted to an approximately optimal range. Technical details can be found in Hoffman and Gelman (2014) and Betancourt, Byrne, and Girolami (2014).

Stan software program

Why Stan? The primary reason is that it implements the no-U-turn HMC sampler, which has been shown to be efficient. In addition to HMC, Stan provides other estimation options, such as variational inference and penalized maximum likelihood. The second reason is that, among software programs of its kind, Stan arguably has the most straightforward modeling syntax, as demonstrated in the later sections; users with experience in BUGS code tend to find it easy to switch to Stan. The third reason is that Stan allows users to insert customized functions compiled in C++, resulting in more flexible statistical model construction and faster model estimation. The fourth reason is that the Stan community has been contributing to Stan's improvement since its debut; Stan's functions and flexibility receive responsive updates on an open-source platform. For example, the LKJ distribution (Lewandowski, Kurowicka, & Joe, 2009), a recently developed prior on correlation matrices, is a built-in choice in Stan. Last but not least, Stan interfaces easily with many mainstream programming environments, such as R, Python, Stata, and MATLAB, giving it great accessibility. Throughout the article, the rstan interface in the R software (R Core Team, 2018) is used to call the Stan functions as well as to simulate response data. The interface becomes available by loading the rstan package (Stan Development Team, 2016a) in R; see http://mc-stan.org/users/interfaces/rstan for details about installing rstan. Implementing Stan in other environments follows a similar approach.

Following the article structure of Luo and Jiao (2017) and Jiang and Skorupski (2017), this section introduces the components of a Stan program called via R. The R code consists of two parts: (1) the .stan file specification and (2) the R syntax for execution. The .stan file is made up of six Stan code blocks, as can be seen in Table 2; of these blocks, data, parameters, and model are mandatory, whereas the remaining blocks are need-based (a minimal skeleton illustrating the six blocks is sketched after Table 2). A few general rules apply to each code block of the .stan file: (1) the symbol // introduces a comment, (2) the data type must be specified for both the data records and the model variables (including those before transformations), (3) value-range constraints can be placed on variables to restrict sampling to the permissible numerical space, and (4) the symbol ; must end each code line, before any trailing // comment. After the .stan file is saved, R syntax is executed to call the rstan package, compile the model, push the data records into the analysis, save the estimation results, and finally produce the output. In the following section, this workflow, including setting the arguments within the functions, is explained step by step.
Table 2

Six code blocks and their definitions in a Stan program

Stan Code Block | Code Block Function
data | declare the data to be pushed into the Stan program
transformed data | transform the data passed in above
parameters | define the unknowns to be estimated and corresponding constraints
transformed parameters | transform the parameters passed in above
model | specify prior distributions and likelihood functions
generated quantities | generate outputs from the model, such as posterior predictions
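To make the block structure and the syntax rules above concrete, the following is a minimal, self-contained .stan skeleton. It is a toy example (a single Bernoulli probability), not part of the LCDM specification; the block names match Table 2, and the comments illustrate the general rules listed above. The array declarations follow the older Stan syntax (e.g., int y[N]) that was current for the rstan version cited in this article.

// toy.stan: a minimal illustration of the six Stan code blocks
data {
  int<lower=1> N;                    // number of observations (type and constraint declared)
  int<lower=0, upper=1> y[N];        // binary responses
}
transformed data {
  int total_correct;                 // transform the data passed in above
  total_correct = sum(y);
}
parameters {
  real<lower=0, upper=1> theta;      // unknown to be estimated, with a value-range constraint
}
transformed parameters {
  real logit_theta;                  // transform the parameter passed in above
  logit_theta = logit(theta);
}
model {
  theta ~ beta(1, 1);                // prior distribution
  y ~ bernoulli(theta);              // likelihood
}
generated quantities {
  int y_rep[N];                      // posterior predictions
  for (n in 1:N)
    y_rep[n] = bernoulli_rng(theta);
}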

Stan code for the LCDM

To explain how the Stan code should be created, assume there is a dataset containing 2,000 respondents (N = 2,000). Note that parameter recovery and estimation accuracy are not discussed here; that investigation is reserved for the later simulation study section. The dataset was generated based on the Q-matrix listed in Table 3. The classes were labeled "000," "001," "010," "011," "100," "101," "110," and "111." The labels indicate mastery of the attributes in attribute order; for example, "010" means that only the second attribute is mastered. Furthermore, following the practice shown in Table 1, \( \lambda_{i,0} + \boldsymbol{\lambda}_i^{T}\boldsymbol{h}(\boldsymbol{\alpha}_c, \boldsymbol{q}_i) \), namely the kernel function, is outlined for each class in Fig. 1.
Table 3

A Q-matrix for response generation

Item No. | Attribute1 | Attribute2 | Attribute3
1 | 1 | 0 | 0
2 | 1 | 0 | 0
3 | 0 | 1 | 0
4 | 0 | 1 | 0
5 | 0 | 0 | 1
6 | 0 | 0 | 1
7 | 1 | 1 | 0
8 | 0 | 1 | 1
9 | 1 | 0 | 1

Fig. 1

Kernel function expressions

The expression rules in Fig. 1 follow the conventions proposed by Rupp, Templin, and Henson (2010, p. 206): (1) l simply represents λ; (2) the number before the symbol _ indicates the item number; (3) the first number after the symbol _ indicates the effect type, where 0, 1, and n label the intercept, main effects, and n-way interaction effects, respectively; and (4) the remaining numbers identify the attributes involved in the effect (and, for interactions, which attributes interact). To illustrate, in the last cell of Fig. 1, l9_0 represents the intercept of Item 9, and l9_213 represents the two-way interaction effect between the first and third attributes. According to Rupp, Templin, and Henson (2010), in addition to ensuring the nonnegativity of the main effects, the following constraints are also required: (1) l9_213 > – l9_13 and l9_213 > – l9_11, (2) l8_223 > – l8_13 and l8_223 > – l8_12, and (3) l7_212 > – l7_12 and l7_212 > – l7_11.

In total, there are four steps for specifying the syntax: (1) the data and parameters blocks, (2) the transformed parameters block, (3) the model block, and (4) the generated quantities block. Figure 2 shows the specification of the data and parameters blocks in the .stan file. The data block defines the data records (inputs) that Stan absorbs into the algorithm. Since N, I, C, A, and O are all integers, int is specified for each. Note that O is the total number of observations (i.e., responses across items and respondents). Y is the generated dataset in long format, such that each row contains a response, the respondent index, and the item index: the first row is the observation of the first respondent on the first item, the second row is that of the first respondent on the second item, and the remaining rows follow this pattern. Alternatively, one can split Y into three separate vectors and loop through the response vector, the item ID vector, and the respondent ID vector. Alpha is the attribute pattern matrix for the classes, as shown in Table 4; Y and Alpha are defined as matrices. In the parameters block, Line 11 specifies the class membership probabilities vc = [v1, …, vC] as a simplex instead of a vector. According to the Stan user manual (Stan Development Team, 2016b), simplex maps (K – 1) unconstrained variables onto a K-simplex. Since the sum of [v1, …, vC] is constrained to 1 in a mixture model, simplex is a reasonable choice for specifying vc. Lines 12 to 35 define all item parameters as real numbers; however, <lower = 0> is added to the main effects so that their numerical space is nonnegative. Adding the nonnegativity constraint effectively avoids label switching, a common problem in mixture-model estimation (Stephens, 2000). This practice essentially truncates the priors, and therefore the posteriors. More constraints are placed in the model block.
Fig. 2

Stan code for the data and parameters blocks
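Because Fig. 2 itself is not reproduced in this text-only version, the following sketch reconstructs data and parameters blocks consistent with the description above and with the naming convention of Fig. 1 and the Q-matrix in Table 3. The declarations and their ordering are an assumption; the line numbers cited in the text refer to the original figure, not to this sketch.

data {
  int<lower=1> N;                    // number of respondents
  int<lower=1> I;                    // number of items
  int<lower=1> C;                    // number of latent classes (2^A)
  int<lower=1> A;                    // number of attributes
  int<lower=1> O;                    // total number of observations (N * I)
  int Y[O, 3];                       // long format: response, respondent index, item index
  matrix[C, A] Alpha;                // attribute pattern of each class (Table 4)
}
parameters {
  simplex[C] Vc;                     // class membership probabilities v_1, ..., v_C
  // intercepts (unconstrained)
  real l1_0;  real l2_0;  real l3_0;
  real l4_0;  real l5_0;  real l6_0;
  real l7_0;  real l8_0;  real l9_0;
  // main effects, constrained to be nonnegative
  real<lower=0> l1_11;  real<lower=0> l2_11;
  real<lower=0> l3_12;  real<lower=0> l4_12;
  real<lower=0> l5_13;  real<lower=0> l6_13;
  real<lower=0> l7_11;  real<lower=0> l7_12;
  real<lower=0> l8_12;  real<lower=0> l8_13;
  real<lower=0> l9_11;  real<lower=0> l9_13;
  // two-way interaction effects
  real l7_212;  real l8_223;  real l9_213;
}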

Table 4

An attribute pattern matrix

Class | Attribute1 | Attribute2 | Attribute3
000 | 0 | 0 | 0
001 | 0 | 0 | 1
010 | 0 | 1 | 0
011 | 0 | 1 | 1
100 | 1 | 0 | 0
101 | 1 | 0 | 1
110 | 1 | 1 | 0
111 | 1 | 1 | 1

Figure 3 shows the specification of the transformed parameters block. Note that a few lines are skipped due to space limitations. PImat, an I × C matrix, is created to hold the transformed parameters that follow the structure in Fig. 1. Essentially, PImat is the result of applying the inverse-logit transformation to the kernel functions in Fig. 1: the terms inside the inv_logit expressions in Fig. 3 are identical to those in each cell of Fig. 1. Matching Eq. 4, inv_logit(x) is the Stan function that yields \( \frac{\exp(x)}{1+\exp(x)} \), and therefore PImat is essentially the π matrix.
Fig. 3

Stan code for the transformed parameters block
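Again, the original figure is not reproduced here. The sketch below is a compact version of the transformed parameters block described above: rather than writing one inv_logit expression per item–class cell, as the figure reportedly does, it loops over the classes and reads the attribute indicators from Alpha, which yields the same PImat under the Q-matrix of Table 3.

transformed parameters {
  matrix[I, C] PImat;                // pi_ci: probability of a correct response per item and class
  for (c in 1:C) {
    // Items 1-6 each measure a single attribute (Q-matrix in Table 3)
    PImat[1, c] = inv_logit(l1_0 + l1_11 * Alpha[c, 1]);
    PImat[2, c] = inv_logit(l2_0 + l2_11 * Alpha[c, 1]);
    PImat[3, c] = inv_logit(l3_0 + l3_12 * Alpha[c, 2]);
    PImat[4, c] = inv_logit(l4_0 + l4_12 * Alpha[c, 2]);
    PImat[5, c] = inv_logit(l5_0 + l5_13 * Alpha[c, 3]);
    PImat[6, c] = inv_logit(l6_0 + l6_13 * Alpha[c, 3]);
    // Items 7-9 each measure two attributes, so a two-way interaction term is added
    PImat[7, c] = inv_logit(l7_0 + l7_11 * Alpha[c, 1] + l7_12 * Alpha[c, 2]
                            + l7_212 * Alpha[c, 1] * Alpha[c, 2]);
    PImat[8, c] = inv_logit(l8_0 + l8_12 * Alpha[c, 2] + l8_13 * Alpha[c, 3]
                            + l8_223 * Alpha[c, 2] * Alpha[c, 3]);
    PImat[9, c] = inv_logit(l9_0 + l9_11 * Alpha[c, 1] + l9_13 * Alpha[c, 3]
                            + l9_213 * Alpha[c, 1] * Alpha[c, 3]);
  }
}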

The model block is specified in Fig. 4. Both contributionsC and contributionsI are temporary vectors for accumulating the log-likelihood; the two temporary vectors are created for the looping process of HMC. Lines 5 to 29 specify the prior distributions for all item parameters. Note that the priors are relatively uninformative (diffuse), since the mean and standard deviation of the normal distribution are 0 and 15, respectively, except in Lines 18 to 20, which carry extra constraints. When no prejudgment or educated guess is available, it is not uncommon to use uninformative distributions. Lines 18 to 20 implement the aforementioned constraints, that is, (1) l9_213 > – l9_13 and l9_213 > – l9_11, (2) l8_223 > – l8_13 and l8_223 > – l8_12, and (3) l7_212 > – l7_12 and l7_212 > – l7_11. Instead of using two lines per constraint, the fmax function lets the prior truncation start at whichever bound is larger; for example, if – l9_13 > – l9_11, l9_213 only needs to be larger than – l9_13. Line 30 specifies a Dirichlet prior for the class membership probabilities. The function rep_vector(x, t) forms a vector by replicating x t times; therefore, according to Line 30, the Dirichlet prior contains parameters that are all equal to 1. Setting all Dirichlet parameters equal in this way essentially places a uniform prior on a multinomial distribution (Ishwaran & Zarepour, 2002). Lines 32–44 define the (log-)likelihood function illustrated in Eqs. 2 and 3. Lines 33–35 construct the loops over respondents, classes, and items: referring to Eqs. 2 and 3, one needs to multiply, or add on the log scale, all of these units in order to compute the (log-)likelihood. Lines 38 to 41 calculate \( \log \left( \pi_{ci}^{y_{pi}} \left(1-\pi_{ci}\right)^{1-y_{pi}} \right) \). When ypi = 1, \( \pi_{ci}^{y_{pi}} \left(1-\pi_{ci}\right)^{1-y_{pi}} \) reduces to πci; otherwise, it becomes 1 − πci: this conditional mechanism is handled by the if–else statements in Lines 38 and 40. The function bernoulli_lpmf(ypi ∣ πci) provides the log of the probability mass function, \( \log \left( \pi_{ci}^{y_{pi}} \left(1-\pi_{ci}\right)^{1-y_{pi}} \right) \). Therefore, when the item loop (iteri) closes, the temporary vector contributionsI holds \( \left[ \log \left( \pi_{c1}^{y_{p1}} \left(1-\pi_{c1}\right)^{1-y_{p1}} \right), \log \left( \pi_{c2}^{y_{p2}} \left(1-\pi_{c2}\right)^{1-y_{p2}} \right), \dots, \log \left( \pi_{cI}^{y_{pI}} \left(1-\pi_{cI}\right)^{1-y_{pI}} \right) \right] \). Line 43 corresponds to the calculation inside the exponential function in Eq. 3: \( \log(v_c) \) and \( \sum_{i=1}^{I} \log \left( \pi_{ci}^{y_{pi}} \left(1-\pi_{ci}\right)^{1-y_{pi}} \right) \) are realized by log(Vc[iterc]) and sum(contributionsI), respectively, within the model block. Accordingly, \( \log(v_c) + \log \left( \prod_{i=1}^{I} \pi_{ci}^{y_{pi}} \left(1-\pi_{ci}\right)^{1-y_{pi}} \right) \) is saved in the temporary vector contributionsC. Last but not least, Line 45 exponentiates all elements of contributionsC, sums the exponentiated values, takes the logarithm of the sum, and finally adds the respondent's log-likelihood to the total log-likelihood. In other words, target += log_sum_exp(x) executes the core part of Eq. 3, \( \sum_{p=1}^{N} \log \left\{ \sum_{c=1}^{C} \exp(x) \right\} \).
Fig. 4

Stan code for the model block
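The sketch below follows the structure described for Fig. 4: diffuse normal(0, 15) priors, fmax-based truncation for the three interaction terms, a uniform Dirichlet prior on Vc, and the marginal log-likelihood of Eqs. 2 and 3 accumulated via log_sum_exp. Encoding the monotonicity constraints with Stan's T[,] truncation syntax is one possible reading of the description, not a quotation of the original figure; for strict correctness, matching lower bounds could also be declared on the three interaction parameters in the parameters block. For brevity, only a few of the normal priors are written out.

model {
  vector[C] contributionsC;          // per-class contributions for one respondent
  vector[I] contributionsI;          // per-item contributions for one class

  // diffuse priors on intercepts and main effects (the remaining items follow the same pattern)
  l1_0 ~ normal(0, 15);
  l1_11 ~ normal(0, 15);
  l9_0 ~ normal(0, 15);
  l9_11 ~ normal(0, 15);
  l9_13 ~ normal(0, 15);
  // truncated priors: each interaction must exceed the larger of the two negated main effects
  l7_212 ~ normal(0, 15) T[fmax(-l7_11, -l7_12), ];
  l8_223 ~ normal(0, 15) T[fmax(-l8_12, -l8_13), ];
  l9_213 ~ normal(0, 15) T[fmax(-l9_11, -l9_13), ];
  // uniform Dirichlet prior on the class membership probabilities
  Vc ~ dirichlet(rep_vector(1, C));

  // marginal log-likelihood (Eqs. 2 and 3)
  for (iterp in 1:N) {
    for (iterc in 1:C) {
      for (iteri in 1:I) {
        int ypi;
        // long-format data: row (iterp - 1) * I + iteri holds respondent p's response to item i
        ypi = Y[(iterp - 1) * I + iteri, 1];
        contributionsI[iteri] = bernoulli_lpmf(ypi | PImat[iteri, iterc]);
      }
      contributionsC[iterc] = log(Vc[iterc]) + sum(contributionsI);
    }
    target += log_sum_exp(contributionsC);
  }
}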

The last block is generated quantities (Fig. 5), which is used to produce the posterior quantities of interest. Line 2 declares the posterior probability matrix for attribute mastery, corresponding to P(αa = 1 | Yp = yp), and Line 3 defines P(C = c | Yp = yp), such that each respondent has a C-element class probability vector. Lines 4 to 6 are, once again, temporary containers for the algorithm's loops. Lines 8–21 are fairly similar to what is specified in Fig. 4, except that Line 16 calculates \( v_c \prod_{i=1}^{I} \pi_{ci}^{y_{pi}} \left(1-\pi_{ci}\right)^{1-y_{pi}} \) and Line 20 implements Eq. 6. Lines 23–31 implement the attribute-mastery computation described below Eq. 6, generating the posterior probabilities of mastery of each attribute. At this point the .stan file is complete and can be saved as LCDM.stan.
Fig. 5

Stan code for the generated quantities block
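As before, the following generated quantities sketch is reconstructed from the description rather than copied from the original figure; prob_attr and prob_class are assumed names for the attribute-mastery and class-membership probability containers mentioned in the text.

generated quantities {
  matrix[N, A] prob_attr;            // posterior P(attribute a mastered | y_p), per respondent
  matrix[N, C] prob_class;           // posterior P(C = c | y_p), per respondent (Eq. 6)
  vector[C] contributionsC;          // temporary containers, as in the model block
  vector[I] contributionsI;

  for (iterp in 1:N) {
    for (iterc in 1:C) {
      for (iteri in 1:I) {
        int ypi;
        ypi = Y[(iterp - 1) * I + iteri, 1];
        contributionsI[iteri] = bernoulli_lpmf(ypi | PImat[iteri, iterc]);
      }
      // numerator of Eq. 6 on the log scale: log(v_c) + sum_i log f(y_pi | class c)
      contributionsC[iterc] = log(Vc[iterc]) + sum(contributionsI);
    }
    // normalize across classes (Eq. 6), then exponentiate back to the probability scale
    prob_class[iterp] = to_row_vector(exp(contributionsC - log_sum_exp(contributionsC)));
    // marginal mastery probability of each attribute: sum of class probabilities over
    // the classes whose attribute pattern has a 1 in position a
    for (itera in 1:A)
      prob_attr[iterp, itera] = dot_product(prob_class[iterp], col(Alpha, itera));
  }
}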

Stan outputs

Before analyzing data, R needs to load the rstan interface by calling library(rstan). The .stan file is compiled by executing model <- stan_model("LCDM.stan"). The first time the model is executed, LCDM.stan will take some time to compile before sampling; on subsequent runs, it recompiles only if the .stan file is altered. In addition, the data and other input variables need to be organized in a list and passed to Stan via the syntax data = list(Y = Y, A = A, N = N, I = I, C = C, Alpha = Alpha); these inputs were predefined prior to the analysis. Finally, the estimation task is performed by executing a command such as stan.result <- sampling(model, data = data, iter = 8000). If the number of iterations or the number of chains is not specified, the defaults (2,000 and 4, respectively) are used; the commands are gathered in the sketch below. The machine used to run rstan was a Lenovo IdeaPad with 8 GB of RAM and a 2.6-GHz sixth-generation four-core Intel i7 processor. All four cores were used in this article for faster computing.
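Collecting the commands just described into one place, a typical R session might look like the sketch below. The object names follow the text; the rstan_options and mc.cores lines are standard rstan settings (not quoted from the article) that cache the compiled model and run the four chains in parallel, and the data objects Y, A, N, I, C, and Alpha are assumed to exist in the workspace.

library(rstan)
rstan_options(auto_write = TRUE)              # cache the compiled model on disk
options(mc.cores = parallel::detectCores())   # run chains on all available cores

# compile the model (recompiles only if LCDM.stan has changed)
model <- stan_model("LCDM.stan")

# bundle the inputs expected by the data block
data <- list(Y = Y, A = A, N = N, I = I, C = C, Alpha = Alpha)

# sample: 4 chains by default, 8,000 iterations per chain here
stan.result <- sampling(model, data = data, iter = 8000)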

Before using the Stan results, convergence should be confirmed via the Gelman–Rubin convergence diagnostic statistic (represented by Rhat in the rstan output; Gelman & Rubin, 1992). Values close to 1 indicate a high chance that the multiple chains have converged to the same distribution and therefore that the mixing is sufficient. In addition, the effective sample size (ESS), an estimate of the number of effectively independent draws from the posterior distribution, is recommended to accompany Rhat when assessing convergence (Gelman, Lee, & Guo, 2015): a larger ESS implies a lower risk of autocorrelation. Finally, Stan provides posterior summaries of the parameters of interest, including point estimates, standard deviations, and quantile values; a short R sketch of these checks follows.
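As a brief illustration (not taken from the article), the Rhat and n_eff columns can be inspected directly from the fitted object; the parameter names below follow the sketches given earlier.

# posterior summary: means, SDs, quantiles, effective sample sizes, and Rhat
print(stan.result, pars = c("Vc", "l1_0", "l1_11"), probs = c(0.025, 0.5, 0.975))

# flag any parameter whose Rhat drifts away from 1
fit.summary <- summary(stan.result)$summary
fit.summary[fit.summary[, "Rhat"] > 1.05, c("n_eff", "Rhat"), drop = FALSE]

# trace plots provide a visual check of mixing across chains
traceplot(stan.result, pars = c("l1_0", "l1_11"))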

Simulation study

Simulation design

To demonstrate the utility of estimating an LCDM via Stan, a small simulation study is provided here. Note that a comprehensive investigation of how Stan performs for this kind of estimation is beyond the scope of this article, as Bayesian studies often involve intricate trials of priors and tuning strategies; we leave a more comprehensive study to future work. For the purposes of this study, we adopt the simulation design of Templin and Bradshaw (2014). In this design, there were 3,000 respondents, 30 items, and three attributes, resulting in eight classes in total. To avoid a confound potentially caused by the magnitudes of the item parameters, the main effects were all set to 2, the interactions were all set to 1, and the intercept was set to −0.5 · \( \boldsymbol{\lambda}_i^{T}\boldsymbol{h}(\boldsymbol{\alpha}_{\mathbb{1}}, \boldsymbol{q}_i) \), where \( \boldsymbol{\alpha}_{\mathbb{1}} \) = (1, 1, . . . , 1). According to Templin and Bradshaw, this setting of the intercept makes the attribute misclassification rates roughly equal. Based on an LCDM analysis of the Examination for the Certificate of Proficiency in English (ECPE), the class membership probabilities were set to [0.301, 0.129, 0.012, 0.175, 0.009, 0.018, 0.011, 0.346] for the classes labeled "000," "001," "010," "011," "100," "101," "110," and "111," respectively. Finally, a balanced Q-matrix with each item measuring either one or two attributes was used. The study was replicated 300 times.

For reference, the CDM package (George, Robitzsch, Kiefer, Groß, & Ünlü, 2016) in R was used to estimate the model in addition to rstan. To estimate an LCDM, the CDM package uses the EM algorithm tailored by de la Torre (2011). Similar estimation functionality can be found in the GDINA package, as detailed by Ma and de la Torre (2016). In addition, the Mplus software (Muthén & Muthén, 2013) can also be used to estimate an LCDM (see Templin & Hoffman, 2013, for syntax instructions). Both the CDM and GDINA packages allow users to set the monotonicity constraints via a command line, whereas Mplus requires specifying them one by one. With respect to the Stan configuration, both informative and uninformative priors were specified for the item parameters. Concretely, both prior sets use 0 as the mean of each item-parameter prior; the informative set uses 5 as the standard deviation, and the uninformative set uses 15. The number of chains was left at its default, the number of HMC iterations was set to 5,000, and the thinning interval was set to 5. We used the posterior mean as the estimate of each parameter (see the call sketched below).
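For reference, the sampling call for the simulation conditions described above might look as follows; the assumption here is that the informative and uninformative runs differ only in the prior standard deviations hard-coded in their respective .stan files.

# 5,000 HMC iterations per chain, default number of chains, keeping every fifth draw
stan.result <- sampling(model, data = data, iter = 5000, thin = 5)

# posterior means serve as the point estimates of the parameters
post.means <- summary(stan.result)$summary[, "mean"]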

Simulation results

A pilot study was conducted to ensure the convergence of the Bayesian estimations: ESSs and Rhats were recorded and investigated for all parameters across 20 replications. The ESSs ranged from 94 to 200 out of 200 retained samples (i.e., the effective ratios were between .47 and 1.00). In addition, the Rhats ranged from 0.99 to 1.02. This pilot study validated that 1,000 iterations were sufficient for reaching convergence; therefore, the simulation results could be considered trustworthy. In addition, the average elapsed time, including warm-up and sampling, was 43 min for four chains. Note that the computation time can vary substantially with (1) the number of iterations, (2) the number of chains, (3) the choice of priors, (4) the complexity of the model, and (5) the configuration of the machine.

Two generic outcomes are presented here: (1) parameter recovery, measured by bias and root mean squared error (RMSE), and (2) classification accuracy, measured by the attribute-profile classification rate and the marginal attribute classification rate. In particular, the attribute-profile classification rate is the proportion of respondents correctly assigned to their true class (i.e., attribute pattern), whereas the marginal attribute classification rate is the proportion of times a single attribute was classified correctly, across all respondents and attributes.

Figure 6 presents the relative biases and RMSEs in box plots that arrange the results by item-parameter category: intercept, main effect, and interaction effect. The dark line inside each box represents the median; the top of the box is the 75th percentile, and the bottom is the 25th percentile. The whiskers extend to a distance of 1.5 × IQR, where the interquartile range (IQR) is the distance between the 25th and 75th percentiles. Points outside the whiskers are marked as dots and are normally considered extreme points. In addition, the means and standard deviations of the estimates in each category are appended. As the upper panel indicates, the HMC method with informative priors (labeled staninf) yielded more precise results than its competitors. In particular, the HMC method with informative priors was about 0.2 more precise than the EM approach in terms of the relative biases of both the intercepts and the interaction effects. All three methods produced scattered relative biases for the interaction effects; however, the HMC method with informative priors had fewer and less-extreme outliers. Note that with uninformative priors, the HMC method yielded results similar to those from the EM approach, except in the main-effect category. The lower panel contains the RMSE results. The HMC method with informative priors had the lowest mean RMSEs for both the intercept and interaction terms, while the EM approach had the lowest mean RMSE for the main effects. The RMSEs tended to vary more for the HMC method with uninformative priors. Overall, most point estimates of the main-effect parameters were close to the true values, some intercept point estimates deviated from their true values substantively, and the point estimates of the interaction terms were less accurate. This finding is not uncommon, because in a latent-modeling context the estimation accuracy for interaction effects and their related main effects tends to be lower than for intercepts (Jiang, Wang, & Weiss, 2016).
Fig. 6

Box plots of simulation results

The class probability estimates were very similar across all three approaches. The means of the relative biases (across the eight classes) were 0.013, 0.016, and 0.014 for the EM approach and for the HMC with uninformative and informative priors, respectively. Correspondingly, the RMSEs were 0.018, 0.021, and 0.019 for the three approaches, in the same order. For the three attributes, the profile classification rates of the EM approach were .98, .98, and .97; those of the HMC with uninformative priors were .97, .97, and .96; and those of the HMC with informative priors were .99, .97, and .97. To emphasize, the priors were varied for the item parameters only; the priors for the class probabilities were identical in the two HMC estimations. These results show that the classification quality, in terms of both the class probability estimates and the attribute assignments, was relatively high for all three estimations, with the EM approach performing slightly better.

Discussion and conclusion

Unlike other works of this kind, the present article uses Stan to estimate the LCDM only. The reason is that the LCDM is the parent model of other common CDM variants (i.e., its specification is the most complicated). Tuning the code to other CDMs simply requires constraining and/or transforming certain parameters. For instance, allowing the intercepts and interaction effects to be estimated while setting the main effects to zero transforms an LCDM into a DINA model (see Rupp et al., 2010, pp. 159–167, for details on converting the LCDM to other CDMs); a brief sketch of this idea follows below. In addition, once one has mastered applying Stan to a complicated model, it is straightforward to extend the strategy to simpler variants. Overall, estimating an LCDM through Stan is straightforward, and the code can be tuned to estimate other variants by adding zero constraints. This practice allows users to exploit Bayesian features for model estimation and is therefore recommended.
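As a sketch of the zero-constraint idea (not code from the article), suppose Item 7 measures attributes 1 and 2. Removing its main-effect parameters and omitting them from the kernel, i.e., replacing the Item 7 line in the transformed parameters sketch shown earlier, leaves only the intercept and the two-way interaction, which is the DINA-style parameterization of that item:

// DINA-style kernel for Item 7: main effects removed, so the item probability takes
// only two values: inv_logit(l7_0) for respondents missing either required attribute,
// and inv_logit(l7_0 + l7_212) for respondents who have mastered both attributes
PImat[7, c] = inv_logit(l7_0 + l7_212 * Alpha[c, 1] * Alpha[c, 2]);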

Although the instructions provided in the present article function appropriately, there may be room to specify the .stan code and the model configuration more efficiently. Stan code allows substantial variety, such that an identical model can be specified in multiple ways, resulting in different computational speeds and estimates. In addition, changing Stan's input arguments, such as the number of draws per chain and the number of iterations, may cause differences in the final results, although truly converged models should show no systematic discrepancies between replications with nonidentical algorithm settings.

Like other Bayesian techniques, the results provided by Stan need model-fit assessment. There are four typical problems that an insufficient estimation may encounter: (1) lack of mixing, (2) nonstationarity, (3) autocorrelation, and (4) divergent transitions. The first problem can be handled by providing different starting values and running multiple chains. As demonstrated earlier, when the Gelman–Rubin convergence diagnostic statistics approach 1, the samples can be claimed to come from a well-specified posterior. Note that a poorly specified model can cause the mixing to fail repeatedly, even when the multiple-starting-values strategy is adopted. Nonstationarity occurs when a model is misspecified and/or the number of iterations is insufficient; therefore, if time is not a concern, a larger number of iterations is always preferred. Autocorrelation is commonly seen in weakly identified models; that is, the draws of certain parameters depend highly on the previous draws, resulting in unreliable estimates. Thinning and reparameterizing are two solutions to autocorrelation; arguably, thinning a Markov chain results in a loss of information, whereas reparameterizing in many situations complicates the modeling procedure. The last problem, divergent transitions, concerns insufficient sampling under irregular posterior densities. This issue can be addressed by manually increasing the target average acceptance rate toward 1 (e.g., .99). In rstan, this tuning is executed by specifying stan(. . . , control = list(adapt_delta = 0.99), . . .). As a trade-off, the sampling process will be slowed down.

Stan does not provide fit statistics such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the deviance information criterion (DIC). In particular, providing DIC is a standard feature of WinBUGS, JAGS, and other Bayesian software programs. However, DIC is known to have some weaknesses, since it is based on a point estimate; for example, it can produce negative estimates of the effective number of parameters, which should not occur (Plummer, 2008; van der Linde, 2005). Instead of these statistics, the Stan ecosystem supports approximating leave-one-out (LOO) cross-validation error from the posterior draws. This practice generates probabilistic estimates of how well the model is expected to predict new observations. As a result, LOO can be used as an information criterion (LOOIC) for model comparisons. Similar to AIC, BIC, and DIC, a smaller LOOIC means that the model generates more plausible predictions than one with a larger LOOIC (see Vehtari, Gelman, & Gabry, 2017, for details on LOO); a sketch of this computation in R follows below. In addition, this approach can be further adapted for more trustworthy invariance tests (Shi et al., 2018, manuscript submitted for publication; Shi, Song, Liao, Terry, & Snyder, 2017).
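The article does not show the LOO computation itself; the sketch below illustrates one common route via the loo R package, and it assumes that the generated quantities block has been extended with a per-respondent log-likelihood vector named log_lik (which is not part of the code described above).

library(loo)

# extract the pointwise log-likelihood kept as "log_lik" in generated quantities
log_lik <- extract_log_lik(stan.result, parameter_name = "log_lik", merge_chains = FALSE)

# adjust for within-chain autocorrelation, then compute PSIS-LOO
r_eff <- relative_eff(exp(log_lik))
loo.result <- loo(log_lik, r_eff = r_eff)
print(loo.result)   # reports elpd_loo, p_loo, and looic (the LOOIC used for model comparison)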

This article is intended as a tutorial, so that the utility of HMC for LCDM estimation can be realized. The simulation provides evidence that HMC can appropriately and accurately estimate an LCDM. However, as an initial illustrative work, this article does not provide comprehensive instructions on how to improve estimation speed and quality. In addition, the choice of prior distributions is always an important concern in Bayesian inference. If a researcher has a scientific judgment and/or a well-educated guess, this "non-data-based" information can be blended into the estimation, as the previous example demonstrated. Alternatively, to study the robustness of a model estimation, researchers can conduct sensitivity analyses by trying different prior distributions. Although this task is beyond the scope of the present article, future studies are encouraged to vary the priors and thereby examine the sensitivity of the Bayesian LCDM. Note that incorporating prior distributions can also be realized in other algorithms, such as the EM approach, in which case ML estimation essentially becomes MAP estimation. Further research could focus on diverse situations, such as the impact of Q-matrix designs on Bayesian estimation (Liu, 2017) and the selection of prior distributions with missing responses in Q-matrix validation (Dai, Svetina, & Chen, 2018).

References

  1. Almond, R. (2014). Comparison of two MCMC algorithms for hierarchical mixture models. In Bayesian Modeling Application Workshop at the Uncertainty in Artificial Intelligence Conference (pp. 1–19). Corvallis, OR: AUAI Press.
  2. Annis, J., Miller, B. J., & Palmeri, T. J. (2017). Bayesian inference with Stan: A tutorial on adding custom distributions. Behavior Research Methods, 49, 863–886. doi: https://doi.org/10.3758/s13428-016-0746-9
  3. Betancourt, M. J., Byrne, S., & Girolami, M. (2014). Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv:1411.6669
  4. Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 20, 1–37.
  5. Culpepper, S. A. (2015). Bayesian estimation of the DINA model with Gibbs sampling. Journal of Educational and Behavioral Statistics, 40, 454–476.
  6. Dai, S., Svetina, D., & Chen, C. (2018). Investigation of missing responses in Q-matrix validation. Applied Psychological Measurement. Advance online publication. doi: https://doi.org/10.1177/0146621618762742
  7. de la Torre, J. (2009). DINA model and parameter estimation: A didactic. Journal of Educational and Behavioral Statistics, 34, 115–130.
  8. de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179–199.
  9. de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.
  10. DeCarlo, L. T. (2012). Recognizing uncertainty in the Q-matrix via a Bayesian extension of the DINA model. Applied Psychological Measurement, 36, 447–468.
  11. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1–38.
  12. Gelman, A., Lee, D., & Guo, J. (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40, 530–543.
  13. Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–472. doi: https://doi.org/10.2307/2246093
  14. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6, 721–741. doi: https://doi.org/10.1109/TPAMI.1984.4767596
  15. George, A. C., Robitzsch, A., Kiefer, T., Groß, J., & Ünlü, A. (2016). The R package CDM for cognitive diagnosis models. Journal of Statistical Software, 74(2), 1–24.
  16. Gilks, W. R. (1998). Full conditional distributions. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 75–88). Boca Raton, FL: Chapman & Hall.
  17. Girolami, M., & Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73, 123–214.
  18. Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. Journal of Educational Measurement, 26, 301–321.
  19. Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality (Doctoral dissertation). University of Illinois at Urbana-Champaign, IL.
  20. Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.
  21. Hoffman, M. D., & Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.
  22. Ishwaran, H., & Zarepour, M. (2002). Dirichlet prior sieves in finite normal mixtures. Statistica Sinica, 941–963.
  23. Jiang, S., Wang, C., & Weiss, D. J. (2016). Sample size requirements for estimation of item parameters in the multidimensional graded response model. Frontiers in Psychology, 7, 109. doi: https://doi.org/10.3389/fpsyg.2016.00109
  24. Jiang, Z., & Skorupski, W. (2017). A Bayesian approach to estimating variance components within a multivariate generalizability theory framework. Behavior Research Methods. Advance online publication. doi: https://doi.org/10.3758/s13428-017-0986-3
  25. Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
  26. Knott, M., & Bartholomew, D. J. (1999). Latent variable models and factor analysis (No. 7). Edward Arnold.
  27. Lao, H., & Templin, J. (2016, April). Estimation of diagnostic classification models without constraints: Issues with class label switching. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Washington, DC.
  28. Lee, M. D., & Wagenmakers, E. J. (2014). Bayesian cognitive modeling: A practical course. New York, NY: Cambridge University Press.
  29. Lee, S. T. (2016, November 21). DINA model with independent attributes. Retrieved from http://mc-stan.org/documentation/case-studies/dina_independent.html
  30. Lewandowski, D., Kurowicka, D., & Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100, 1989–2001.
  31. Liu, R. (2017). Misspecification of attribute structure in diagnostic measurement. Educational and Psychological Measurement. doi: https://doi.org/10.1177/0013164417702458
  32. Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
  33. Luo, Y., & Jiao, H. (2017). Using the Stan program for Bayesian item response theory. Educational and Psychological Measurement, 78, 384–408. doi: https://doi.org/10.1177/0013164417693666
  34. Ma, W., & de la Torre, J. (2016). GDINA: The generalized DINA model framework (R package version 0.13.0). Available online at http://CRAN.R-project.org/package=GDINA
  35. Macready, G. B., & Dayton, C. M. (1977). The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 2, 99–120.
  36. Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.
  37. Merkle, E. C., & Wang, T. (2018). Bayesian latent variable models for the analysis of experimental psychology data. Psychonomic Bulletin & Review, 25, 256–270. doi: https://doi.org/10.3758/s13423-016-1016-7
  38. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
  39. Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313–335. doi: https://doi.org/10.1037/a0026802
  40. Muthén, L. K., & Muthén, B. O. (2013). Mplus user's guide (Version 7.1) [Computer software and manual]. Los Angeles, CA: Muthén & Muthén.
  41. Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In S. Brooks (Ed.), Handbook of Markov chain Monte Carlo (pp. 113–162). Boca Raton, FL: CRC Press/Taylor & Francis.
  42. Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Paper presented at the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria.
  43. Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9, 523–539.
  44. R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from www.R-project.org/
  45. Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
  46. Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspectives, 6, 219–262.
  47. Shi, D., Song, H., Liao, X., Terry, R., & Snyder, L. A. (2017). Bayesian SEM for specification search problems in testing factorial invariance. Multivariate Behavioral Research, 52, 430–444.
  48. da Silva, M. A., de Oliveira, E. S. B., von Davier, A. A., & Bazán, J. L. (2017). Estimating the DINA model parameters using the No-U-Turn Sampler. Biometrical Journal. Advance online publication. doi: https://doi.org/10.1002/bimj.201600225
  49. Sorensen, T., Hohenstein, S., & Vasishth, S. (2016). Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. Quantitative Methods for Psychology, 12, 175–200.
  50. Stan Development Team. (2016a). rstan: R interface to Stan (R package version 2.0.3). Retrieved from http://mc-stan.org
  51. Stan Development Team. (2016b). Stan: A C++ library for probability and sampling (Version 2.8.0). Retrieved from http://mc-stan.org
  52. Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B, 62, 795–809.
  53. Templin, J., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79, 317–339.
  54. Templin, J., & Hoffman, L. (2013). Obtaining diagnostic classification model estimates using Mplus. Educational Measurement: Issues and Practice, 32, 37–50.
  55. Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305. doi: https://doi.org/10.1037/1082-989X.11.3.287
  56. van der Linde, A. (2005). DIC in variable selection. Statistica Neerlandica, 59, 45–56.
  57. Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
  58. von Davier, M. (2009). Some notes on the reinvention of latent structure models as diagnostic classification models. Measurement: Interdisciplinary Research and Perspectives, 7, 67–74.
  59. Zhan, P. (2017). Using JAGS for Bayesian cognitive diagnosis models: A tutorial. arXiv:1708.02632

Copyright information

© Psychonomic Society, Inc. 2018

Authors and Affiliations

  1. University of Alabama, Tuscaloosa, USA
  2. University of Wyoming, Laramie, USA
