1 Introduction

In psychology, the social sciences, and many other fields, researchers are usually interested in “latent” variables that cannot be measured directly, e.g., depression, anxiety, or intelligence. To get a grip on such a latent concept, a commonly used strategy is to construct a measurement model for it: domain experts design multiple “items” or “questions” that are considered to be indicators of the latent variable. For exploring evidence of construct validity in theory-based instrument construction, confirmatory factor analysis (CFA) has been widely studied (Jöreskog 1969; Castro et al. 2015; Li 2016). In CFA, researchers start with several hypothesized latent variable models that are fitted to the data individually, after which the one that fits the data best is picked to explain the observed phenomenon. In this process, the fundamental task is to learn the parameters of a hypothesized model from observed data, which is the focus of this paper. For convenience, we simply refer to these hypothesized latent variable models as CFA models from now on.

The most common method for parameter estimation in CFA models is maximum likelihood (ML), because of its attractive statistical properties (consistency, asymptotic normality, and efficiency). The ML method, however, relies on the assumption that the observed variables follow a multivariate normal distribution (Jöreskog 1969). When the normality assumption is not empirically tenable, ML may not only reduce the accuracy of parameter estimates, but may also yield misleading conclusions drawn from empirical data (Li 2016). To address this, a robust version of ML was introduced for CFA models in which the normality assumption is slightly or moderately violated (Kaplan 2008), but it still requires the observations to be continuous. In the real world, the indicator data in questionnaires are usually measured on an ordinal scale (resulting in a set of ordered categorical variables, or simply ordinal variables) (Poon and Wang 2012), for which neither normality nor continuity is plausible (Lubke and Muthén 2004). In this case, Item Response Theory (IRT) models (Embretson and Reise 2013) are widely used, in which a mathematical item response function links an item to its corresponding latent trait. However, the likelihood of the observed ordinal random vector has no closed form and is considerably complex due to the presence of a multi-dimensional integral, so that learning the model from just the ordinal observations is typically intractable, especially when the number of latent variables and the number of categories of the observed variables are large. Another class of methods designed for ordinal observations is diagonally weighted least squares (DWLS), which has been suggested to be superior to the ML method and is usually considered preferable over other methods (Barendse et al. 2015; Li 2016). Various implementations of DWLS are available in popular software packages, e.g., LISREL (Jöreskog 2005), Mplus (Muthén 2010), lavaan (Rosseel 2012), and OpenMx (Boker et al. 2011).

However, there are two major issues that the existing approaches do not consider. One is the mixture of continuous and ordinal data. As mentioned above, ordinal variables are omnipresent in questionnaires, whereas sensor data are usually continuous. Therefore, a more realistic case in real applications is mixed continuous and ordinal data. A second important issue concerns missing values. In practice, all branches of experimental science are plagued by missing values (Little and Rubin 1987), e.g., due to failure of sensors or unwillingness to answer certain questions in a survey. A straightforward idea in this case is to combine missing-data techniques with existing parameter estimation approaches, e.g., performing listwise or pairwise deletion on the original data and then applying DWLS to learn the parameters of a CFA model. However, such deletion methods are only consistent when the data are missing completely at random (MCAR), which is a rather strong assumption (Rubin 1976), and they cannot transfer the sampling variability incurred by missing values to follow-up studies. The two modern missing-data techniques, maximum likelihood and multiple imputation, are valid under a less restrictive assumption, missing at random (MAR) (Schafer and Graham 2002), but they require the data to be multivariate normal.

Therefore, there is a strong demand for an approach that is not only valid under MAR but also works for mixed continuous and ordinal data. For this purpose, we propose a novel Bayesian Gaussian copula factor (BGCF) approach, in which a Gibbs sampler iteratively draws pseudo Gaussian data in a latent space restricted by the observed data (unrestricted where a value is missing) and draws posterior samples of the parameters given the pseudo data. We prove that this approach is consistent under MCAR and empirically show that it works quite well under MAR.

The rest of this paper is organized as follows. Section 2 reviews background knowledge and related work. Section 3 gives the definition of a Gaussian copula factor model and presents our novel inference procedure for this model. Section 4 compares our BGCF approach with two alternative approaches on simulated data, and Sect. 5 gives an illustration on the ‘Holzinger & Swineford 1939’ dataset. Section 6 concludes this paper and provides some discussion.

2 Background

This section reviews basic missingness mechanisms and related work on parameter estimation in CFA models.

2.1 Missingness mechanism

Following Rubin (1976), let \(\varvec{Y} = (y_{ij}) \in \mathbb {R}^{n \times p}\) be a data matrix with the rows representing independent samples, and \( \varvec{R} = (r_{ij}) \in \{0,1\}^{n \times p}\) be a matrix of indicators, where \(r_{ij} = 1\) if \(y_{ij}\) was observed and \(r_{ij} = 0\) otherwise. \(\varvec{Y}\) consists of two parts, \(\varvec{Y}_\mathrm{obs}\) and \(\varvec{Y}_\mathrm{miss}\), representing the observed and missing elements in \(\varvec{Y}\), respectively. When the missingness does not depend on the data, i.e., \(P(\varvec{R}|\varvec{Y}, \theta ) = P(\varvec{R}|\theta )\) with \(\theta \) denoting unknown parameters, the data are said to be missing completely at random (MCAR), which is a special case of a more realistic assumption called missing at random (MAR). MAR allows the missingness to depend on observed values, i.e., \(P(\varvec{R}|\varvec{Y}, \theta ) = P(\varvec{R}|\varvec{Y}_\mathrm{obs},\theta )\). For example, suppose all people in a group are required to take a blood pressure test at time point 1, while only those whose values at time point 1 lie in the abnormal range need to take the test at time point 2. This results in missing values at time point 2 that are MAR.
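As an illustration of the distinction, the following small R sketch (with hypothetical variable names and an arbitrary cut-off of 140) mimics the blood-pressure example: the second measurement is missing depending only on the observed first measurement, hence MAR but not MCAR.

```r
set.seed(1)
n   <- 1000
bp1 <- rnorm(n, mean = 120, sd = 15)   # blood pressure at time point 1 (always observed)
bp2 <- 0.8 * bp1 + rnorm(n, 0, 10)     # blood pressure at time point 2
# Only subjects with an abnormal value at time point 1 are re-tested at time point 2:
bp2[bp1 <= 140] <- NA                  # missingness depends on the observed bp1 only, hence MAR
Y <- data.frame(bp1, bp2)
R <- !is.na(Y)                         # the indicator matrix R of this subsection
mean(R[, "bp2"])                       # fraction of observed values at time point 2
```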

2.2 Parameter estimation in CFA models

When the observations follow a multivariate normal distribution, maximum likelihood (ML) is the most widely used method. It is equivalent to minimizing the discrepancy function \(F_{\mathrm{ML}}\) (Jöreskog 1969):

$$\begin{aligned} F_{\mathrm{ML}} = \ln |\varSigma (\theta )| + \mathrm{trace}\big [S\varSigma ^{-1}(\theta )\big ] - \ln |S| - p, \end{aligned}$$

where \(\theta \) is the vector of model parameters, \(\varSigma (\theta )\) is the model-implied covariance matrix, S is the sample covariance matrix, and p is the number of observed variables in the model. When the normality assumption is violated either slightly or moderately, robust ML (MLR) offers an alternative. Here, parameter estimates are still obtained using the asymptotically unbiased ML estimator, but standard errors are statistically corrected to enhance the robustness of ML against departures from normality (Kaplan 2008; Muthén 2010). Another method for continuous nonnormal data is the so-called asymptotically distribution-free method, a weighted least squares (WLS) method using the inverse of the asymptotic covariance matrix of the sample variances and covariances as a weight matrix (Browne 1984).
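As a concrete illustration, a minimal R implementation of the discrepancy \(F_{\mathrm{ML}}\) above could look as follows; minimizing it over the free parameters in \(\theta \) yields the ML estimates.

```r
# ML discrepancy between a model-implied covariance matrix Sigma_theta
# and the sample covariance matrix S (both p x p and positive definite)
F_ML <- function(Sigma_theta, S) {
  p <- nrow(S)
  log(det(Sigma_theta)) + sum(diag(S %*% solve(Sigma_theta))) - log(det(S)) - p
}
```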

When the observed data are on ordinal scales, Muthén (1984) proposed a three-stage approach. It assumes that a normal latent variable \(x^*\) underlies an observed ordinal variable x, i.e.,

$$\begin{aligned} x = m, \text{ if } \tau _{m-1}< x^* < \tau _m , \end{aligned}$$
(1)

where \(m \, (=1,2,\ldots ,c)\) denotes the observed values of x, \(\tau _m\) are thresholds \((-\infty = \tau _0 < \tau _1 < \tau _2 < \cdots < \tau _c = +\infty )\), and c is the number of categories. The thresholds and polychoric correlations are estimated from the bivariate contingency table in the first two stages (Olsson 1979; Jöreskog 2005). Parameter estimates and the associated standard errors are then obtained by minimizing the weighted least squares fit function \(F_{\mathrm{WLS}}\):

$$\begin{aligned} F_{\mathrm{WLS}} = [s-\sigma (\theta )]^\mathrm{T}\varvec{W}^{-1}[s-\sigma (\theta )], \end{aligned}$$

where \(\theta \) is the vector of model parameters, \(\sigma (\theta )\) is the model-implied vector containing the nonredundant vectorized elements of \(\varSigma (\theta )\), s is the vector containing the estimated polychoric correlations, and the weight matrix \(\varvec{W}\) is the asymptotic covariance matrix of the polychoric correlations. A mathematically simple form of the WLS estimator, the unweighted least squares (ULS), arises when the matrix \(\varvec{W}\) is replaced with the identity matrix \(\varvec{I}\). Another variant of WLS is the diagonally weighted least squares (DWLS), in which only the diagonal elements of \(\varvec{W}\) are used in the fit function (Muthén et al. 1997; Muthén 2010), i.e.,

$$\begin{aligned} F_{\mathrm{DWLS}} = [s-\sigma (\theta )]^\mathrm{T}\varvec{W}^{-1}_{\mathrm{D}}[s-\sigma (\theta )], \end{aligned}$$

where \(\varvec{W}_{\mathrm{D}} = \mathrm {diag}(\varvec{W})\) is the diagonal weight matrix containing only the diagonal elements of \(\varvec{W}\). Various recent simulation studies have shown that DWLS compares favorably with WLS, ULS, and the ML-based methods for ordinal data (Barendse et al. 2015; Li 2016).

3 Method

In this section, we introduce the Gaussian copula factor model and propose a Bayesian inference procedure for this model. Then, we theoretically analyze the identifiability and prove the consistency of our procedure.

3.1 Gaussian copula factor model

Definition 1

(Gaussian copula factor model) Consider a latent random (factor) vector \(\varvec{\eta } = (\eta _1,\ldots ,\eta _k)^\mathrm{T}\), a response random vector \(\varvec{Z} = (Z_1,\ldots ,Z_p)^\mathrm{T}\) and an observed random vector \(\varvec{Y}=(Y_1,\ldots ,Y_p)^\mathrm{T}\), satisfying

$$\begin{aligned} \varvec{\eta } \sim \mathscr {N}(0,C), \end{aligned}$$
(2)
$$\begin{aligned} \varvec{Z} = \varLambda \varvec{\eta } + \varvec{\epsilon }, \end{aligned}$$
(3)
$$\begin{aligned} Y_j = F_j^{-1}\big (\varPhi \big [Z_j/\sigma (Z_j)\big ]\big ), \quad \forall j = 1,\ldots ,p, \end{aligned}$$
(4)

with C a correlation matrix over factors, \(\varLambda = (\lambda _{ij})\) a \(p \times k\) matrix of factor loadings (\(k \le p\)), \(\varvec{\epsilon } \sim \mathscr {N}(0,D)\) residuals with \(D = \mathrm {diag}(\sigma _1^2,\ldots ,\sigma _p^2)\), \(\sigma (Z_j)\) the standard deviation of \(Z_j\), \(\varPhi (\cdot )\) the cumulative distribution function (CDF) of the standard Gaussian, and \({F_{j}}^{-1}(t) = \inf \{ x: F_{j}(x) \ge t\}\) the pseudo-inverse of a CDF \(F_j(\cdot )\). Then, this model is called a Gaussian copula factor model.

Fig. 1: Gaussian copula factor model

The model is also defined in Murray et al. (2013), but those authors restrict the factors to be independent of each other, while we allow for their interactions. Our model is a combination of a Gaussian factor model (from \(\varvec{\eta }\) to \(\varvec{Z}\)) and a Gaussian copula model (from \(\varvec{Z}\) to \(\varvec{Y}\)). The factor model allows us to grasp the latent concepts that are measured by multiple indicators. The copula model provides a good way to conduct multivariate data analysis for two reasons. First, it provides a theoretical framework in which multivariate associations can be modeled separately from the univariate distributions of the observed variables (Nelsen 2007). In particular, when we use a Gaussian copula, the multivariate associations are uniquely determined by the covariance matrix because of the elliptically symmetric joint density, which makes the dependency analysis very simple. Second, the use of copulas is advocated to model multivariate distributions involving diverse types of variables, say binary, ordinal, and continuous (Dobra and Lenkoski 2011). A variable \(Y_j\) that takes a finite number of ordinal values \(\{1,\, 2,\, \ldots ,\, c\}\) with \(c \ge 2\) is incorporated into our model by introducing a latent Gaussian variable \(Z_j\), which complies with the well-known standard assumption for an ordinal variable (Muthén 1984) (see Eq. 1). Figure 1 shows an example of the model. Note that we allow the special case of a factor having a single indicator, e.g., \(\eta _1 \rightarrow Z_1 \rightarrow Y_1\), because this allows us to incorporate other (explicit) variables (such as age and income) into our model. In this special case, we set \(\lambda _{11} = 1\) and \(\epsilon _1 = 0\), thus \(Y_1 = F_1^{-1}(\varPhi [\eta _1])\).
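To make Definition 1 concrete, the following R sketch draws data from a toy one-factor instance of the model (the loadings, thresholds, and margins are illustrative choices, not those used later in the paper): a factor \(\eta \) via Eq. (2), responses \(Z_j\) via Eq. (3), and observed variables \(Y_j\) via Eq. (4) with an ordinal margin for \(Y_1\) and continuous nonnormal margins for \(Y_2\) and \(Y_3\).

```r
set.seed(42)
n      <- 1000
lambda <- c(0.7, 0.7, 0.7)                       # loadings of one factor on its 3 indicators
sigma2 <- 1 - lambda^2                           # residual variances (standardized solution)
eta <- rnorm(n)                                  # Eq. (2): eta ~ N(0, C), here C = 1
Z   <- sapply(1:3, function(j)                   # Eq. (3): Z = Lambda * eta + eps
         lambda[j] * eta + rnorm(n, 0, sqrt(sigma2[j])))
U   <- pnorm(sweep(Z, 2, apply(Z, 2, sd), "/"))  # Phi[Z_j / sigma(Z_j)] in Eq. (4)
Y1  <- cut(U[, 1], breaks = c(0, 0.3, 0.6, 0.8, 1), labels = FALSE)  # ordinal margin, 4 categories
Y2  <- qchisq(U[, 2], df = 8)                    # continuous chi-square margin
Y3  <- qexp(U[, 3], rate = 1)                    # continuous exponential margin
Y   <- data.frame(Y1, Y2, Y3)
```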

In the typical design for questionnaires, one tries to get a grip on a latent concept through a particular set of well-designed questions (Martínez-Torres 2006; Byrne 2013), which implies that a factor (latent concept) in our model is connected to multiple indicators (questions) while an indicator is only used to measure a single factor, as shown in Fig. 1. This kind of measurement model is called a pure measurement model (Definition 8 in Silva et al. (2006)). Throughout this paper, we assume that all measurement models are pure, which indicates that there is only a single non-zero entry in each row of the factor loadings matrix \(\varLambda \). This inductive bias about the sparsity pattern of \(\varLambda \) is fully motivated by the typical design of a measurement model.

In what follows, we transform the Gaussian copula factor model into an equivalent model that is used for inference in the next subsection. We consider an integrated \((p + k)\)-dimensional random vector \(\varvec{X} = (\varvec{Z}^\mathrm{T}, \varvec{\eta }^\mathrm{T})^\mathrm{T}\), which is still multivariate Gaussian, and obtain its covariance matrix

$$\begin{aligned} \varSigma = \begin{bmatrix} \varLambda C \varLambda ^\mathrm{T} + D&\varLambda C \\ C \varLambda ^\mathrm{T}&C \\ \end{bmatrix} , \end{aligned}$$
(5)

and precision matrix

$$\begin{aligned} \varOmega = \varSigma ^{-1} = \begin{bmatrix} D^{-1}&-D^{-1} \varLambda \\ -\varLambda ^\mathrm{T} D^{-1}&C^{-1} + \varLambda ^\mathrm{T} D^{-1} \varLambda \\ \end{bmatrix} . \end{aligned}$$
(6)

Since D is diagonal and \(\varLambda \) only has one non-zero entry per row, \(\varOmega \) contains many intrinsic zeros. The sparsity pattern of such \(\varOmega = (\omega _{ij})\) can be represented by an undirected graph \(G = (\varvec{V}, \varvec{E})\), where \((i,j) \not \in \varvec{E}\) whenever \(\omega _{ij} = 0\) by construction. Then, a Gaussian copula factor model can be transformed into an equivalent model controlled by a single precision matrix \(\varOmega \), which in turn is constrained by G, i.e., \(P(\varvec{X}|C,\varLambda ,D) = P(\varvec{X}|\varOmega _G)\).
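A small R sketch of this construction for an illustrative two-factor model (with arbitrarily chosen loadings and interfactor correlation) is given below; the zero pattern of \(\varOmega \) directly yields the adjacency matrix of the undirected graph G.

```r
k <- 2; p <- 4
C      <- matrix(c(1, 0.3, 0.3, 1), k, k)             # correlation matrix over the factors
Lambda <- rbind(c(0.7, 0), c(0.8, 0),                  # one non-zero loading per row
                c(0, 0.6), c(0, 0.9))
D      <- diag(1 - rowSums(Lambda^2))                  # residual variances (standardized)
Sigma  <- rbind(cbind(Lambda %*% C %*% t(Lambda) + D, Lambda %*% C),  # Eq. (5)
                cbind(C %*% t(Lambda),                C))
Omega  <- solve(Sigma)                                 # Eq. (6), contains structural zeros
G_adj  <- 1 * (abs(Omega) > 1e-10)                     # sparsity pattern of Omega = edges of G
diag(G_adj) <- 0
```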

Definition 2

(G-Wishart distribution) Given an undirected graph \(G = (\varvec{V},\varvec{E})\), a zero-constrained random matrix \(\varOmega \) has a G-Wishart distribution, if its density function is

$$\begin{aligned} p(\varOmega |G) = \frac{|\varOmega |^{(\nu - 2)/2}}{I_G(\nu , \varPsi )} \exp \bigg [-\frac{1}{2} \mathrm{trace}(\varPsi \varOmega )\bigg ] \mathbb {1}_{\varOmega \in M^+(G)}, \end{aligned}$$

with \(M^+(G)\) the space of symmetric positive definite matrices with off-diagonal elements \(\omega _{ij} = 0\) whenever \((i,j) \not \in \varvec{E}\), \(\nu \) the number of degrees of freedom, \(\varPsi \) a scale matrix, \(I_G(\nu , \varPsi )\) the normalizing constant, and \(\mathbb {1}\) the indicator function (Roverato 2002).

The G-Wishart distribution is the conjugate prior of precision matrices \(\varOmega \) that are constrained by a graph G (Roverato 2002). That is, given the G-Wishart prior, i.e., \(P(\varOmega |G) = \mathcal{W}_G(\nu _0, \varPsi _0)\) and data \(\varvec{X} = (\varvec{x_1},\ldots ,\varvec{x_n})^\mathrm{T}\) drawn from \(\mathscr {N}(0,\varOmega ^{-1})\), the posterior for \(\varOmega \) is another G-Wishart distribution:

$$\begin{aligned} P(\varOmega | G, \varvec{X}) = \mathcal{W}_G (\nu _0 + n, \varPsi _0 + \varvec{X}^\mathrm{T} \varvec{X}). \end{aligned}$$
(7)

When the graph G is fully connected, the G-Wishart distribution reduces to a Wishart distribution (Murphy 2007). Placing a G-Wishart prior on \(\varOmega \) is equivalent to placing an inverse-Wishart on C, a product of multivariate normals on \(\varLambda \), and an inverse-gamma on the diagonal elements of D. With a diagonal scale matrix \(\varPsi _0\) and the number of degrees of freedom \(\nu _0\) equal to the dimension of \(\varvec{X}\) plus one, the implied marginal prior density of each pairwise correlation is uniform on \([-1,1]\) (Barnard et al. 2000).
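A minimal sketch of this conjugate update, assuming the rgwish sampler provided by the BDgraph package (its availability and argument names are an assumption, not part of our method) and the matrices Sigma and G_adj from the sketch above:

```r
library(BDgraph)   # assumed to provide rgwish() for sampling from a G-Wishart distribution
library(MASS)

n    <- 500
X    <- mvrnorm(n, mu = rep(0, nrow(Sigma)), Sigma = Sigma)  # pseudo data X = (Z, eta)
nu0  <- ncol(X) + 1                                          # prior degrees of freedom
Psi0 <- diag(ncol(X))                                        # prior scale matrix
# One draw from the posterior G-Wishart in Eq. (7):
Omega_draw <- rgwish(n = 1, adj = G_adj, b = nu0 + n, D = Psi0 + crossprod(X))
Omega_draw <- matrix(Omega_draw, ncol(X), ncol(X))           # ensure a plain matrix shape
Sigma_draw <- cov2cor(solve(Omega_draw))                     # rescale to a correlation matrix
```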

3.2 Inference for Gaussian copula factor model

We first introduce the inference procedure for complete mixed data and incomplete Gaussian data, respectively, based on which the procedure for mixed data with missing values is then derived. From this point on, we use S to denote the correlation matrix over the response vector \(\varvec{Z}\).

3.2.1 Mixed data without missing values

For a Gaussian copula model, Hoff (2007) proposed a likelihood that only concerns the ranks among observations, which is derived as follows. Since the transformation \(Y_j = F_j^{-1}\big (\varPhi \big [Z_j\big ]\big )\) is non-decreasing, observing \(\varvec{y}_j = (y_{1,j},\ldots ,y_{n,j})^\mathrm{T}\) implies a partial ordering on \(\varvec{z}_j = (z_{1,j},\ldots ,z_{n,j})^\mathrm{T}\), i.e., \(\varvec{z}_j\) lies in the space restricted by \(\varvec{y}_j\):

$$\begin{aligned} \mathscr {D}(\varvec{y}_j) = \left\{ \varvec{z}_j \in \mathbb {R}^n: y_{i,j}< y_{k,j} \Rightarrow z_{i,j} < z_{k,j}\right\} . \end{aligned}$$

Therefore, observing \(\varvec{Y}\) suggests that \(\varvec{Z}\) must be in

$$\begin{aligned} \mathscr {D}(\varvec{Y}) = \{\varvec{Z} \in \mathbb {R}^{n \times p}: \varvec{z}_j \in \mathscr {D}(\varvec{y}_j), \forall j = 1,\ldots ,p\} . \end{aligned}$$

Taking the occurrence of this event as the data, one can compute the following likelihood (Hoff 2007):

$$\begin{aligned} P(\varvec{Z} \in \mathscr {D}(\varvec{Y})|S,F_1,\ldots ,F_p) = P(\varvec{Z} \in \mathscr {D}(\varvec{Y})|S). \end{aligned}$$

Following the same argumentation, the likelihood in our Gaussian copula factor model reads

$$\begin{aligned} P(\varvec{Z} \in \mathscr {D}(\varvec{Y})|\varvec{\eta },\varOmega ,F_1,\ldots ,F_p) = P(\varvec{Z} \in \mathscr {D}(\varvec{Y})|\varvec{\eta },\varOmega ), \, \end{aligned}$$

which is independent of the margins \(F_j\).

For the Gaussian copula factor model, inference for the precision matrix \(\varOmega \) of the vector \(\varvec{X} = (\varvec{Z}^\mathrm{T}, \varvec{\eta }^\mathrm{T})^\mathrm{T}\) can now proceed via construction of a Markov chain having its stationary distribution equal to \(P(\varvec{Z},\varvec{\eta },\varOmega |\varvec{Z} \in \mathscr {D}(\varvec{Y}),G)\), where we ignore the values for \(\varvec{\eta }\) and \(\varvec{Z}\) in our samples. The prior graph G is uniquely determined by the sparsity pattern of the loading matrix \(\varLambda = (\lambda _{ij})\) and the residual matrix D (see Eq. 6), which in turn is uniquely decided by the pure measurement models. The Markov chain can be constructed by iterating the following three steps:

1. Sample \(\varvec{Z}\): \(\varvec{Z} \sim P(\varvec{Z}|\varvec{\eta },\varvec{Z} \in \mathscr {D}(\varvec{Y}),\varOmega )\).
   Since each coordinate \(Z_j\) directly depends on only one factor, i.e., the \(\eta _q\) such that \(\lambda _{jq} \ne 0\), we can sample each of them independently through \( Z_j \sim P(Z_j|\eta _q,\varvec{z}_j \in \mathscr {D}(\varvec{y}_j),\varOmega ) \) (a minimal sketch of this step is given after the list).
2. Sample \(\varvec{\eta }\): \(\varvec{\eta } \sim P(\varvec{\eta }|\varvec{Z},\varOmega )\).
3. Sample \(\varOmega \): \(\varOmega \sim P(\varOmega |\varvec{Z},\varvec{\eta },G)\).
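A minimal sketch of the rank-restricted update in Step 1, for a single (ordinal or continuous) margin \(y_j\), assuming the rtruncnorm sampler from the truncnorm package; the arguments lambda_jq and sigma2_j denote the loading and residual variance of \(Z_j\).

```r
library(truncnorm)   # assumed to provide rtruncnorm() for truncated-normal draws

# One Gibbs update of the latent column z_j, restricted by the ordering of the observed y_j
update_z <- function(z_j, y_j, eta_q, lambda_jq, sigma2_j) {
  mu <- lambda_jq * eta_q                          # conditional mean of Z_j given its factor
  for (i in seq_along(z_j)) {
    lo <- max(z_j[y_j < y_j[i]], -Inf)             # largest z among smaller observed values
    hi <- min(z_j[y_j > y_j[i]],  Inf)             # smallest z among larger observed values
    z_j[i] <- rtruncnorm(1, a = lo, b = hi, mean = mu[i], sd = sqrt(sigma2_j))
  }
  z_j
}
```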

3.2.2 Gaussian data with missing values

Suppose that we have Gaussian data \(\varvec{Z}\) consisting of two parts, \(\varvec{Z}_\mathrm{obs}\) and \(\varvec{Z}_\mathrm{miss}\), denoting observed and missing values in \(\varvec{Z}\), respectively. The inference for the correlation matrix of \(\varvec{Z}\) in this case can be done via the so-called data augmentation technique that is also a Markov chain Monte Carlo procedure and has been proven to be consistent under MAR (Schafer 1997). This approach iterates the following two steps to impute missing values (Step 1) and draw correlation matrix samples from the posterior (Step 2):

1. \(\varvec{Z}_\mathrm{miss} \sim P(\varvec{Z}_\mathrm{miss}|\varvec{Z}_\mathrm{obs},S)\);
2. \(S \sim P(S|\varvec{Z}_\mathrm{obs},\varvec{Z}_\mathrm{miss})\).
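The following R sketch illustrates one sweep of this data augmentation scheme for Gaussian data with missing entries; a plain inverse-Wishart draw (via a Wishart draw on the precision matrix) suffices here because the graph over \(\varvec{Z}\) is complete, and rows that are entirely missing are not handled in this simplified illustration.

```r
library(MASS)

one_sweep <- function(Z, M, S, nu0, Psi0) {
  # Z: n x p data matrix (missing entries hold current imputations); M: logical mask of missingness
  for (i in 1:nrow(Z)) {
    m <- which(M[i, ]); o <- which(!M[i, ])
    if (length(m) > 0 && length(o) > 0) {
      B  <- S[m, o, drop = FALSE] %*% solve(S[o, o, drop = FALSE])
      mu <- as.vector(B %*% Z[i, o])                              # conditional mean of Z_miss
      V  <- S[m, m, drop = FALSE] - B %*% S[o, m, drop = FALSE]   # conditional covariance
      Z[i, m] <- mvrnorm(1, mu, V)                                # Step 1: impute Z_miss
    }
  }
  W <- rWishart(1, df = nu0 + nrow(Z), Sigma = solve(Psi0 + crossprod(Z)))[, , 1]
  S <- cov2cor(solve(W))                                          # Step 2: draw the correlation matrix
  list(Z = Z, S = S)
}
```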

3.2.3 Mixed data with missing values

For the most general case of mixed data with missing values, we combine the procedures of Sects. 3.2.1 and 3.2.2 into the following four-step inference procedure:

1. \(\varvec{Z}_\mathrm{obs} \sim P(\varvec{Z}_\mathrm{obs}|\varvec{\eta },\varvec{Z}_\mathrm{obs} \in \mathscr {D}(\varvec{Y}_\mathrm{obs}),\varOmega )\);
2. \(\varvec{Z}_\mathrm{miss} \sim P(\varvec{Z}_\mathrm{miss}|\varvec{\eta },\varvec{Z}_\mathrm{obs},\varOmega )\);
3. \(\varvec{\eta } \sim P(\varvec{\eta }|\varvec{Z}_\mathrm{obs},\varvec{Z}_\mathrm{miss},\varOmega )\);
4. \(\varOmega \sim P(\varOmega |\varvec{Z}_\mathrm{obs},\varvec{Z}_\mathrm{miss},\varvec{\eta },G)\).

A Gibbs sampler that realizes this Markov chain is summarized in Algorithm 1 and implemented in R. Note that we put Step 1 and Step 2 together in the actual implementation since they share some common computations (lines 2–4). The difference between the two steps is that the values in Step 1 are drawn from a space restricted by the observed data (lines 5–13), while the values in Step 2 are drawn from an unrestricted space (lines 14–17). Another important point is that we need to recenter the data such that the mean of each coordinate of \(\varvec{Z}\) is zero (line 20). This is necessary for the algorithm to be sound because the mean may shift when missing values depend on the observed data (MAR).

[Algorithm 1: Gibbs sampler for inference in the Gaussian copula factor model]

By iterating the steps in Algorithm 1, we can draw correlation matrix samples over the integrated random vector \(\varvec{X}\), denoted by \(\{\varSigma ^{(1)},\ldots , \varSigma ^{(m)}\}\). The mean over all the samples is a natural estimate of the true \(\varSigma \), i.e.,

$$\begin{aligned} \hat{\varSigma } = \dfrac{1}{m}\sum _{i = 1}^{m} \varSigma ^{(i)} . \end{aligned}$$
(8)

Based on Eqs. (5) and (8), we obtain estimates of the parameters of interest:

$$\begin{aligned} \hat{C} = \hat{\varSigma }_{[\varvec{\eta }, \varvec{\eta }]}; \qquad \hat{\varLambda } = \hat{\varSigma }_{[\varvec{Z}, \varvec{\eta }]} \hat{C}^{-1}; \qquad \hat{D} = \hat{S} - \hat{\varLambda }\hat{C}\hat{\varLambda }^\mathrm{T}, \text{ with } \hat{S} = \hat{\varSigma }_{[\varvec{Z}, \varvec{Z}]}. \end{aligned}$$
(9)

We refer to this procedure as a Bayesian Gaussian copula factor approach (BGCF).
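Given the averaged matrix \(\hat{\varSigma }\) and index vectors marking which coordinates of \(\varvec{X}\) correspond to responses and to factors, the block computations in Eq. (9) amount to a few lines of R (the argument names are illustrative):

```r
# Sigma_hat: (p + k) x (p + k) averaged correlation matrix over X = (Z, eta)
# idx_Z: indices of the p response coordinates; idx_eta: indices of the k factor coordinates
recover_params <- function(Sigma_hat, idx_Z, idx_eta) {
  C_hat <- Sigma_hat[idx_eta, idx_eta]
  L_hat <- Sigma_hat[idx_Z, idx_eta] %*% solve(C_hat)
  S_hat <- Sigma_hat[idx_Z, idx_Z]
  D_hat <- S_hat - L_hat %*% C_hat %*% t(L_hat)
  list(C = C_hat, Lambda = L_hat, D = D_hat)
}
```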

3.2.4 Discussion on prior specification

For the default choice of the prior G-Wishart distribution, we set the degrees of freedom \(\nu _0 = \dim (\varvec{X}) + 1\) and the scale matrix \(\varPsi _0 = \epsilon \mathbb {1}\) in the limit \(\epsilon \downarrow 0\), where \(\dim (\varvec{X})\) is the dimension of the integrated random vector \(\varvec{X}\) and \(\mathbb {1}\) is the identity matrix. This specification results in a non-informative prior, in the sense that the posterior only depends on the data and the prior is ignorable. We recall Eq. (7) and take the posterior expectation as an example. The expectation of the covariance matrix is

$$\begin{aligned} {\mathbb {E}}\,(\varSigma ) = {\mathbb {E}}\,(\varOmega ^{-1}) = \dfrac{\varPsi _0+ \varvec{X}^\mathrm{T}\varvec{X}}{\nu _0 + n - \dim (\varvec{X}) - 1} = \dfrac{\varPsi _0+ \varvec{X}^\mathrm{T}\varvec{X}}{n}, \end{aligned}$$

which reduces to the maximum likelihood estimate in the limit \(\epsilon \downarrow 0\). In the actual implementation, we simply set \(\varPsi _0 = \mathbb {1}\), which is accurate enough when the sample size is not too small. In the case of a very small sample size, one needs to make \(\varPsi _0\) smaller than the identity matrix.

To incorporate prior knowledge into the inference procedure, our model enjoys some flexibility. As mentioned in Sect. 3.1, placing a G-Wishart prior on \(\varOmega \) is equivalent to placing an inverse-Wishart on C, a product of multivariate normals on \(\varLambda \), and an inverse-gamma on the diagonal elements of D. Therefore, one could choose one’s favorite informative priors on C, \(\varLambda \), and D separately, and then derive the resulting G-Wishart prior on \(\varOmega \). While the inverse-Wishart and inverse-gamma distributions have been criticized as unreliable when the variances are close to zero (Schuurman et al. 2016), our model does not suffer from this issue. This is because in our model the response variables (i.e., the Z variables) depend only on the ranks of the observed data, and in our sampling process we always set the variances of the response variables and latent variables to one, which makes the procedure invariant to the scale of the observed data.

One limitation of the current inference procedure is that one has to choose the prior on C from the inverse-Wishart family, on \(\varLambda \) from the normal family, and on D from the inverse-gamma family in order to preserve conjugacy, so that inference remains fast and concise. When the prior is chosen from other families, sampling \(\varOmega \) from the posterior distribution (Step 4 in Algorithm 1) is no longer straightforward. In this case, a different strategy such as a Metropolis–Hastings algorithm might be needed to implement Step 4.

3.3 Theoretical analysis

3.3.1 Identifiability of C

Without additional constraints, C is non-identifiable (Anderson and Rubin 1956). More precisely, given a decomposable matrix \(S = \varLambda C \varLambda ^\mathrm{T} + D\), we can always replace \(\varLambda \) with \(\varLambda U\) and C with \(U^{-1} C U^{-T}\) to obtain an equivalent decomposition \(S = (\varLambda U)(U^{-1} C U^{-T})(U^\mathrm{T} \varLambda ^\mathrm{T}) + D\), where U is a \(k \times k\) invertible matrix. Since \(\varLambda \) only has one non-zero entry per row in our model, U can only be diagonal to ensure that \(\varLambda U\) has the same sparsity pattern as \(\varLambda \) (see Lemma 1 in “Appendix”). Thus, from the same S, we get a class of solutions for C, i.e., \(U^{-1} C U^{-T}\), where U can be any invertible diagonal matrix. In order to get a unique solution for C, we impose two sufficient identifying conditions: (1) restrict C to be a correlation matrix; (2) force the first non-zero entry in each column of \(\varLambda \) to be positive. See Lemma 2 in “Appendix” for a proof. Condition 1 is implemented via line 31 in Algorithm 1. As for the second condition, we force the covariance between a factor and its first indicator to be positive (line 27), which is equivalent to Condition 2. Note that these conditions are not unique; one could choose other conditions to identify C, e.g., setting the first loading of each factor to 1. The reason for our choice is to keep it consistent with our model definition, where C is a correlation matrix.

3.3.2 Identifiability of \(\varLambda \) and D

Under the two conditions for identifying C, the factor loadings \(\varLambda \) and residual variances D are also identified, except for the case in which there exists a factor that is independent of all the others and has only two indicators. For such a factor, we have 4 free parameters (2 loadings, 2 residuals) while we only have 3 available equations (2 variances, 1 covariance), which yields an underdetermined system. See Lemmas 3 and 4 in “Appendix” for a detailed analysis. If this happens, one could impose additional constraints to guarantee a unique solution, e.g., by setting the variance of the first residual to zero. However, we would recommend leaving such an independent factor out (especially in association analysis) or studying it separately from the other factors.

Under sufficient conditions for identifying C, \(\varLambda \), and D, our BGCF approach is consistent even with MCAR missing values. This is shown in Theorem 1, whose proof is provided in “Appendix”.

Theorem 1

(Consistency of the BGCF approach) Let \(\varvec{Y}_n=(\varvec{y}_1,\ldots ,\varvec{y}_n)^\mathrm{T}\) be independent observations drawn from a Gaussian copula factor model. If \(\varvec{Y}_n\) is complete (no missing data) or contains missing values that are missing completely at random, then

$$\begin{aligned} \lim \limits _{n \rightarrow \infty } P\big (\hat{C}_n = C_0\big ) = 1, \quad \lim \limits _{n \rightarrow \infty } P\big (\hat{\varLambda }_n = \varLambda _0\big ) = 1, \quad \lim \limits _{n \rightarrow \infty } P\big (\hat{D}_n = D_0\big ) = 1, \end{aligned}$$

where \(\hat{C}_n\), \(\hat{\varLambda }_n\), and \(\hat{D}_n\) are parameters learned by BGCF, while \(C_0\), \(\varLambda _0\), and \(D_0\) are the true ones.

4 Simulation study

In this section, we compare our BGCF approach with alternative approaches via simulations.

4.1 Setup

4.1.1 Model specification

Following typical simulation studies on CFA models in the literature (Yang-Wallentin et al. 2010; Li 2016), we consider a correlated 4-factor model in our study. Each factor is measured by 4 indicators, since Marsh et al. (1998) concluded that the accuracy of parameter estimates appeared to be optimal when the number of indicators per factor was four and marginally improved as the number increased. The interfactor correlations (off-diagonal elements of the correlation matrix C over factors) are randomly drawn from [0.2, 0.4], which is considered a reasonable and empirical range in the applied literature (Li 2016). For ease of reproducibility, we construct our C as follows.

[Specification of the interfactor correlation matrix C]

In the majority of empirical research and simulation studies (DiStefano 2002), reported standardized factor loadings range from 0.4 to 0.9. To facilitate interpretability and, again, reproducibility, each factor loading is set to 0.7. Each corresponding residual variance is then automatically 0.51 under a standardized solution in the population model, as done in Li (2016).

4.1.2 Data generation

Given the specified model, one can generate data in the response space (the \(\varvec{Z}\) in Definition 1) via Eqs. (2) and (3). When the observed data (the \(\varvec{Y}\) in Definition 1) are ordinal, we discretize the corresponding margins into the desired number of categories. When the observed data are nonparanormal, we set the \(F_j(\cdot )\) in Eq. (4) to the CDF of a \(\chi ^2\)-distribution with degrees of freedom df. The reason for choosing a \(\chi ^2\)-distribution is that we can easily use df to control the extent of non-normality: a higher df implies a distribution closer to a Gaussian. To fill in a certain percentage \(\beta \) of missing values (we only consider MAR), we follow the procedure in Kolar and Xing (2012), i.e., for \(j = 1,\ldots ,\lfloor p/2 \rfloor \) and \(i = 1,\ldots ,n\): \(y_{i,2j}\) is missing if \(z_{i,2j-1} < \varPhi ^{-1}(2\beta )\).
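A sketch of the discretization, the nonparanormal transform, and the MAR mechanism described above, assuming a standardized response matrix Z has already been generated from the specified 4-factor model via Eqs. (2) and (3); the equally probable category thresholds are an illustrative choice.

```r
make_observed <- function(Z, ordinal_cols, chisq_cols, n_cat = 4, df = 8, beta = 0.1) {
  U <- pnorm(Z)                                   # assumes each Z_j has unit variance
  Y <- Z
  for (j in ordinal_cols)                         # ordinal margins with n_cat categories
    Y[, j] <- cut(U[, j], breaks = seq(0, 1, length.out = n_cat + 1), labels = FALSE)
  for (j in chisq_cols)                           # nonparanormal chi-square margins, Eq. (4)
    Y[, j] <- qchisq(U[, j], df = df)
  for (j in 1:floor(ncol(Z) / 2))                 # MAR: y_{i,2j} missing if z_{i,2j-1} < qnorm(2 * beta)
    Y[Z[, 2 * j - 1] < qnorm(2 * beta), 2 * j] <- NA
  Y
}
```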

4.1.3 Evaluation metrics

We use average relative bias (ARB) and root mean squared error (RMSE) to examine the parameter estimates, which are defined as

$$\begin{aligned} \mathrm{ARB}= \dfrac{1}{r}\sum _{i = 1}^{r} \dfrac{\hat{\theta _i} - \theta _i}{\theta _i}, \,\, \mathrm{RMSE}= \sqrt{\dfrac{1}{r}\sum _{i = 1}^{r} (\hat{\theta _i} - \theta _i)^2} , \end{aligned}$$

where \(\hat{\theta _i}\) and \(\theta _i\) represent the estimated and true values, respectively. An ARB value less than 5% is interpreted as a trivial bias, between 5 and 10% as a moderate bias, and greater than 10% as a substantial bias (Curran et al. 1996). Note that ARB gives an overall picture of average bias, i.e., biases in the positive and negative directions can cancel each other out. A smaller absolute value of ARB indicates better performance on average.
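Both metrics are straightforward to compute; a small R helper (with est and truth being vectors of estimated and true parameter values) might be:

```r
arb  <- function(est, truth) mean((est - truth) / truth)    # average relative bias
rmse <- function(est, truth) sqrt(mean((est - truth)^2))    # root mean squared error
```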

4.2 Ordinal data without missing values

In this subsection, we consider complete ordinal data, since this matches the assumptions of the diagonally weighted least squares (DWLS) method; we set the number of ordinal categories to 4. We also include robust maximum likelihood (MLR) as an alternative approach, which has been shown to be empirically tenable when the number of categories is more than 5 (Rhemtulla et al. 2012; Li 2016). See Sect. 2 for details of the two approaches.
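For reference, the two alternative estimators can be run with the lavaan package roughly as follows; the model string and the variable names Y1–Y16 are illustrative, and dat is assumed to be a data frame holding the 16 ordinal indicators.

```r
library(lavaan)

model <- '
  F1 =~ Y1  + Y2  + Y3  + Y4
  F2 =~ Y5  + Y6  + Y7  + Y8
  F3 =~ Y9  + Y10 + Y11 + Y12
  F4 =~ Y13 + Y14 + Y15 + Y16
'
# DWLS: treat the indicators as ordered categorical variables
fit_dwls <- cfa(model, data = dat, ordered = paste0("Y", 1:16),
                estimator = "DWLS", std.lv = TRUE)
# MLR: treat the 4-category indicators as if they were continuous
fit_mlr <- cfa(model, data = dat, estimator = "MLR", std.lv = TRUE)
standardizedSolution(fit_dwls)   # loadings and interfactor correlations
```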

Before conducting comparisons, we first check the convergence of the Gibbs sampler used in our BGCF approach. We randomly generate a dataset of sample size \(n=500\). With this dataset, we run our Gibbs sampler five times independently (with different starting values), collecting 2000 successive samples for each chain. Table 1 shows the Potential Scale Reduction Factor (PSRF) (Gelman and Rubin 1992) with the 95% upper confidence limit (within parentheses) of the 6 interfactor correlations and 16 factor loadings over the 5 chains. From Table 1, we see quite good convergence of the Gibbs sampler. Figure 2 shows the RMSE of the estimated interfactor correlations (left panel) and factor loadings (right panel) over the first 100 iterations of the first chain. We see that the sampler converges very fast, with a burn-in period of only around 10 iterations. Further experiments for different numbers of categories and different random datasets show that the burn-in is less than 20 across various conditions. Figure 3 shows the autocorrelation function of the Gibbs samples in the first chain, for 3 randomly selected interfactor correlations and 3 randomly selected factor loadings. We see that the autocorrelations almost disappear at a lag of 10. Based on these results, in the following simulations we run a single chain, set the burn-in to 50, thin the chain with step size 10, and collect 100 Gibbs samples.

Table 1 Potential Scale Reduction Factor (PSRF) with 95% upper confidence limit of the 6 interfactor correlations and 16 factor loadings over 5 chains
Fig. 2: Convergence of the Gibbs sampler over 100 iterations. Left panel: RMSE of interfactor correlations; right panel: RMSE of factor loadings

Fig. 3: Autocorrelation function (ACF) of Gibbs samples for (a) three randomly selected interfactor correlations (out of six) and (b) three randomly selected factor loadings (out of sixteen)

Now we evaluate the three approaches. Figure 4 shows the performance of BGCF, DWLS, and MLR over different sample sizes \(n \in \{100, 200, 500, 1000\}\), providing the mean ARB (left panel) and the mean RMSE with 95% confidence interval (right panel) over 100 experiments. From Fig. 4a, interfactor correlations are, on average, trivially biased (within the two dashed lines) for all three methods, which in turn give indistinguishable RMSE regardless of sample size. From Fig. 4b, MLR moderately underestimates the factor loadings and performs worse than DWLS w.r.t. RMSE, especially for larger sample sizes, which confirms the conclusion of previous studies (Barendse et al. 2015; Li 2016).

Fig. 4: Results obtained by the Bayesian Gaussian copula factor (BGCF) approach, diagonally weighted least squares (DWLS), and robust maximum likelihood (MLR) on complete ordinal data (4 categories) over different sample sizes, showing the mean of ARB (left panel) and the mean of RMSE with 95% confidence interval (right panel) over 100 experiments for (a) interfactor correlations and (b) factor loadings; dashed and dotted lines in the left panels denote \(\pm \, 5\%\) and \(\pm \,10\%\) bias, respectively

4.3 Mixed data with missing values

In this subsection, we consider mixed nonparanormal and ordinal data with missing values, since some latent variables in real-world applications are measured by sensors that usually produce continuous but not necessarily Gaussian data. The 8 indicators of the first 2 factors (4 per factor) are transformed into a \(\chi ^2\)-distribution with \(df = 8\), which yields a slightly nonnormal distribution (skewness 1, excess kurtosis 1.5) (Li 2016). The 8 indicators of the last 2 factors are discretized into ordinal variables with 4 categories.

One alternative approach in such cases is DWLS with pairwise deletion (\(\hbox {DWLS} + \hbox {PD}\)), in which heterogeneous correlations (Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables) are first computed from pairwise complete observations, and then DWLS is used to estimate the model parameters. A second alternative is DWLS with multiple imputation (\(\hbox {DWLS} + \hbox {MI}\)), where we use 20 imputed datasets for the follow-up study. Specifically, we use the R package mice (Buuren and Groothuis-Oudshoorn 2010), in which the default imputation method “predictive mean matching” is applied. A third alternative is full information maximum likelihood (FIML) (Arbuckle 1996; Rosseel 2012), which first applies an EM algorithm to impute missing values and then uses MLR to learn the model parameters.
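With the same illustrative model string as in Sect. 4.2, the MI and FIML alternatives can be sketched with mice and lavaan as follows; pooling across imputations is simplified to averaging here, and dat_miss is assumed to be the incomplete data frame with ordinal indicators Y9–Y16.

```r
library(lavaan)
library(mice)

# FIML: full information maximum likelihood with robust corrections
fit_fiml <- cfa(model, data = dat_miss, estimator = "MLR",
                missing = "fiml", std.lv = TRUE)

# DWLS + MI: impute 20 datasets with predictive mean matching, fit DWLS on each, and average
imp  <- mice(dat_miss, m = 20, method = "pmm", printFlag = FALSE)
fits <- lapply(1:20, function(i)
          cfa(model, data = complete(imp, i), ordered = paste0("Y", 9:16),
              estimator = "DWLS", std.lv = TRUE))
est  <- rowMeans(sapply(fits, function(f) standardizedSolution(f)$est.std))
```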

Figure 5 shows the performance of BGCF, \(\hbox {DWLS} + \hbox {PD}\), \(\hbox {DWLS} + \hbox {MI}\), and FIML for \(n = 500\) over different percentages of missing values \(\beta \in \{0\%, 10\%, 20\%, 30\%\}\). First, despite a good performance on complete data (\(\beta = 0\%\)), \(\hbox {DWLS} + \hbox {PD}\) deteriorates significantly with an increasing percentage of missing values, especially for factor loadings. \(\hbox {DWLS} + \hbox {MI}\) works better than \(\hbox {DWLS} + \hbox {PD}\), but still does not perform well when there are more missing values. Second, our BGCF approach overall outperforms FIML: indistinguishable for interfactor correlations but better for factor loadings.

Fig. 5: Results for \(n = 500\) obtained by BGCF, \(\hbox {DWLS} + \hbox {PD}\) (pairwise deletion), \(\hbox {DWLS} + \hbox {MI}\) (multiple imputation), and full information maximum likelihood (FIML) on mixed nonparanormal (\(\textit{df} = 8\)) and ordinal (4 categories) data with different percentages of missing values, for the same experiments as in Fig. 4

Two more experiments are provided in “Appendix”. One concerns incomplete ordinal data with different numbers of categories, showing that BGCF is preferable to the alternatives for learning factor loadings. The other considers incomplete nonparanormal data with different extents of deviation from a Gaussian, indicating that FIML is rather sensitive to this deviation and only performs well for a slightly nonnormal distribution, while the deviation has no influence on BGCF at all. See “Appendix” for more details.

5 Application to real-world data

In this section, we illustrate our approach on the ‘Holzinger & Swineford 1939’ dataset (Holzinger and Swineford 1939), a classic dataset widely used in the literature and publicly available in the R package lavaan (Rosseel 2012). The data consist of mental ability test scores of 301 students, of which we focus on 9 out of the original 26 tests, as done in Rosseel (2012). A latent variable model that is often proposed to explore these 9 variables is the correlated 3-factor model shown in Fig. 6, where we rename the observed variables to “Y1, Y2, ..., Y9” for simplicity in visualization and to stay consistent with our definition of observed variables (Definition 1). The interpretation of these variables is given in the following list.

  • Y1: Visual perception;

  • Y2: Cubes;

  • Y3: Lozenges;

  • Y4: Paragraph comprehension;

  • Y5: Sentence completion;

  • Y6: Word meaning;

  • Y7: Speeded addition;

  • Y8: Speeded counting of dots;

  • Y9: Speeded discrimination straight and curved capitals.

Fig. 6: Path diagram for the Holzinger & Swineford data, in which latent variables are shown in ovals and observed variables in squares; bidirected edges between latent variables denote interfactor correlations, directed edges denote factor loadings, and self-referring arrows denote residual variances. The edge weights in the graph are the model parameters learned by our BGCF approach

A summary of the 9 variables in this dataset is provided in Table 2, showing the number of unique values, skewness, and (excess) kurtosis of each variable (this dataset contains no missing values). From the column of unique values, we notice that the data are approximately continuous. The averages of the absolute skewness and absolute excess kurtosis over the 9 variables are around 0.40 and 0.54, respectively, which is considered slightly nonnormal (Li 2016). Therefore, we choose MLR as the alternative to be compared with our BGCF approach, since these conditions match the assumptions of MLR.

We run our Bayesian Gaussian copula factor approach on this dataset. The learned parameter estimates are shown in Fig. 6, in which interfactor correlations are on the bidirected edges, factor loadings are on the directed edges, and the residual variance of each variable is attached to its self-referring arrow. The parameters learned by the MLR approach are not shown here: since we do not know the ground truth, it is hard to compare the two sets of estimates directly.

In order to compare the BGCF approach with MLR quantitatively, we consider answering the question: “What is the value of \(Y_j\) when we observe the values of the other variables, denoted by \(\varvec{Y}_{\backslash j}\), given the population model structure in Fig. 6?”

Table 2 The number of unique values, skewness, and (excess) kurtosis of each variable in the ‘HolzingerSwineford1939’ dataset

This is a regression problem but with additional constraints to obey the population model structure. The difference from a traditional regression problem is that we should learn the regression coefficients from the model-implied covariance matrix rather than the sample covariance matrix over observed variables.

  • For MLR, we first learn the model parameters on the training set, from which we extract the linear regression intercept and coefficients of \(Y_j\) on \(\varvec{Y}_{\backslash j}\). Then, we predict the value of \(Y_j\) based on the values of \(\varvec{Y}_{\backslash j}\). See Algorithm 2 for pseudo code of this procedure; an R sketch of this step is also given below.

  • For BGCF, we first estimate the correlation matrix \(\hat{S}\) over the response variables (the \(\varvec{Z}\) in Definition 1) and the empirical CDF \(\hat{F}_j\) of \(Y_j\) on the training set. Then we draw latent Gaussian data \(Z_j\) given \(\hat{S}\) and \(\varvec{Y}_{\backslash j}\), i.e., \(P(Z_j |\hat{S}, \varvec{Z}_{\backslash j} \in \mathscr {D}(\varvec{Y}_{\backslash j}))\). Lastly, we obtain the value of \(Y_j\) from \(Z_j\) via \(\hat{F}_j\), i.e., \(Y_j = \hat{F}_j^{-1} \big (\varPhi [Z_j]\big )\). See Algorithm 3 for pseudo code of this procedure. Note that in the actual implementation we iterate the prediction stage (lines 7–8) multiple times to obtain multiple draws of \(Y_j^{(new)}\); the average over these draws is taken as the final predicted value of \(Y_j^{(new)}\). This idea is quite similar to multiple imputation.

[Algorithm 2: prediction of \(Y_j\) with MLR]
[Algorithm 3: prediction of \(Y_j\) with BGCF]
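As a concrete illustration of the MLR-based prediction step (Algorithm 2), the following sketch extracts the model-implied covariance matrix from a fitted lavaan model and uses it to regress \(Y_j\) on the remaining variables; the names fit, train, test, and j are placeholders.

```r
library(lavaan)

predict_from_model <- function(fit, train, test, j) {
  Sigma <- lavInspect(fit, what = "cov.ov")       # model-implied covariance of the observed variables
  vars  <- colnames(Sigma)
  mu    <- colMeans(train[, vars])                # training means, used for the intercept
  beta  <- solve(Sigma[vars[-j], vars[-j]], Sigma[vars[-j], vars[j]])  # regression coefficients
  X_new <- sweep(as.matrix(test[, vars[-j]]), 2, mu[-j])               # centered predictors
  as.numeric(mu[j] + X_new %*% beta)
}

# e.g.: fit <- cfa(model, data = train, estimator = "MLR"); predict_from_model(fit, train, test, 5)
```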
Fig. 7: MSE obtained by BGCF and MLR when each \(Y_j\) is taken as the outcome variable (the others as predictors) in turn, showing the mean over 100 experiments (10 repetitions of tenfold cross-validation) with error bars representing one standard error

The mean squared error (MSE) is used to evaluate the prediction accuracy, where we repeat tenfold cross-validation 10 times (thus 100 MSE estimates in total). Also, we take each \(Y_j\) as the outcome variable in turn while treating the others as predictors (thus 9 tasks in total). Figure 7 provides the results of BGCF and MLR for all 9 tasks, showing the mean MSE over the 100 estimates with a standard error represented by error bars. We see that BGCF outperforms MLR for Tasks 5 and 6, although they perform indistinguishably for the other tasks. The advantage of BGCF over MLR is encouraging, considering that the experimental conditions match the assumptions of MLR. Further experiments (not shown) in which we make the data moderately or substantially nonnormal suggest that BGCF is then clearly favorable to MLR, as expected.

6 Summary and discussion

In this paper, we proposed a novel Bayesian Gaussian copula factor (BGCF) approach for learning parameters of CFA models that can handle mixed continuous and ordinal data with missing values. We analyzed the separate identifiability of interfactor correlations C, factor loadings \(\varLambda \), and residual variances D, since different researchers may care about different parameters. For instance, it is sufficient to identify C for researchers interested in learning causal relations among latent variables (Silva and Scheines 2006; Silva et al. 2006; Cui et al. 2016), with no need to worry about additional conditions to identify \(\varLambda \) and D. Under sufficient identification conditions, we proved that our approach is consistent for MCAR data and empirically showed that it works quite well for MAR data.

In the experiments, our approach outperforms DWLS even under the assumptions of DWLS. Apparently, the approximations inherent in DWLS, such as the use of the polychoric correlation and its asymptotic covariance, incur a small loss in accuracy compared to an integrated approach like BGCF. When the data follow a more complicated distribution and contain missing values, the advantage of BGCF over its competitors becomes more prominent. Another highlight of our approach is that the Gibbs sampler converges quite fast, with a rather short burn-in period. To further reduce the time complexity, a potential optimization of the sampling process is available (Kalaitzis and Silva 2013).

There are various generalizations of our inference approach. While our focus in this paper is on correlated k-factor models, it is straightforward to extend the current procedure to other classes of latent variable models that are often considered in CFA, such as bi-factor models and second-order models, by simply adjusting the sparsity structure of the prior graph G.

Also, one may consider models with impure measurement indicators, e.g., a model with an indicator measuring multiple factors (cross-loadings) or a model with residual covariances (Bollen 1989), which can be easily solved with BGCF by changing the sparsity pattern of \(\varLambda \) and D. However, two critical issues might arise in this case: the non-identification problems due to a large number of parameters and the slow convergence problem of MCMC algorithms because of dependencies in D. The first issue can be solved by introducing strongly-informative priors (Muthén and Asparouhov 2012), e.g., putting small-variance priors on all cross-loadings. The caveat here is that one needs to choose such priors very carefully to reach a good balance between incorporating correct information and avoiding non-identification. See Muthén and Asparouhov (2012) for more details about the choice of priors on cross-loadings and correlated residuals. Once having the priors on C, \(\varLambda \), and D, one can derive the prior on \(\varOmega \). The second issue can be alleviated via the parameter expansion technique (Ghosh and Dunson 2009; Merkle and Rosseel 2018), in which the residual covariance matrix is decomposed into a couple of simple components through some phantom latent variables, resulting in an equivalent model called a working model. Then, our inference procedure can proceed based on the working model.

It is possible to extend the current approach to multiple groups, to accommodate cross-national research, or to incorporate a multilevel structure, although this is not quite straightforward. Then, one might not be able to draw the precision matrix directly from a G-Wishart distribution (Step 4 in Algorithm 1), since different groups may have different C and D while sharing the same \(\varLambda \). However, this step can be implemented by drawing C, \(\varLambda \), and D separately.

Another line of future work is to analyze standard errors and confidence intervals, while this paper concentrates on the accuracy of parameter estimates. Our conjecture is that BGCF is still favorable because it naturally transfers the extra variability incurred by missing values to the posterior Gibbs samples: in our simulations we indeed observed a growing variance of the posterior distribution as the proportion of missing values increases. On top of the posterior distribution, one could conduct further studies, e.g., causal discovery over latent factors (Silva et al. 2006; Cui et al. 2018), regression analysis (as we did in Sect. 5), or other machine learning tasks. Instead of a Gaussian copula, other choices of copulas are available to model advanced properties in the data such as tail dependence and tail asymmetry (Krupskii and Joe 2013, 2015).