TEST, Volume 27, Issue 1, pp 197–220

On the correspondence from Bayesian log-linear modelling to logistic regression modelling with g-priors

  • Michail Papathomas
Open Access
Original Paper

Abstract

Consider a set of categorical variables where at least one of them is binary. The log-linear model that describes the counts in the resulting contingency table implies a specific logistic regression model, with the binary variable as the outcome. Within the Bayesian framework, the g-prior and mixtures of g-priors are commonly assigned to the parameters of a generalized linear model. We prove that assigning a g-prior (or a mixture of g-priors) to the parameters of a certain log-linear model designates a g-prior (or a mixture of g-priors) on the parameters of the corresponding logistic regression. By deriving an asymptotic result, and with numerical illustrations, we demonstrate that when a g-prior is adopted, this correspondence extends to the posterior distribution of the model parameters. Thus, it is valid to translate inferences from fitting a log-linear model to inferences within the logistic regression framework, with regard to the presence of main effects and interaction terms.

Keywords

Categorical variables, Contingency tables, Mixtures of g-priors, Prior correspondence, Posterior correspondence

Mathematics Subject Classification

62F15 

1 Introduction

Consider observations \({\varvec{v}}=\{v_1,\ldots ,v_n\}\), parameters \(\varvec{\theta }=\{\theta _1,\ldots ,\theta _n\}\), and known quantities or nuisance parameters \(\varvec{\phi }=\{\phi _1,\ldots ,\phi _n \}\). Following standard notation, \(v_i, i=1,\ldots ,n\), follows a distribution that is a member of the exponential family when its probability function can be written as,
$$\begin{aligned} f(v_i | \theta _i, \phi _i) = \text{ exp } \left\{ \frac{w_i}{\phi _i} \left[ v_i \theta _i - b(\theta _i)\right] +c(v_i,\phi _i) \right\} , \end{aligned}$$
where \(\varvec{w}=\{w_1,\ldots ,w_n \}\) are known weights, and \(\phi _i\) is described as the dispersion or scale parameter. With regard to first- and second-order moments, \(\mu _i\equiv E(v_i)=b^{'}(\theta _i)\) and \(\text{ Var }(v_i)=\frac{\phi _i}{w_i} b^{''}(\theta _i)\). The variance function is defined as \(V(\mu _i) = b^{''}(\theta _i)\). A generalized linear model relates \(\varvec{\mu }=\{\mu _1,\ldots ,\mu _n \}\) to covariates by setting \(\zeta (\varvec{\mu })=X_d \varvec{\gamma }\), where \(\zeta \) denotes the link function, \(X_d\) the covariate design matrix, and \(\varvec{\gamma }\) a vector of parameters. For a single \(\mu _i\), we write \(\zeta (\mu _i)=X_{d(i)} \varvec{\gamma }\), where \(X_{d(i)}\) denotes the ith row of \(X_d\). So, \(\zeta \) is defined as a vector function \(\zeta \equiv \{\zeta _1,\ldots ,\zeta _n \}\) with n elements.

Denote with \(\mathcal {P}\) a finite set of P categorical variables. Observations from \(\mathcal {P}\) can be arranged as counts in a P-way contingency table. Denote the cell counts as \(n_i, i=1,\ldots ,n_{ll}\). We use the ‘ll’ indicator to allude to the log-linear model that will describe these counts. A Poisson distribution is assumed for the counts so that \(E(n_i)=\mu _i\). A Poisson log-linear interaction model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\) is a generalized linear model that relates the expected counts to \(\mathcal {P}\). Assuming that one of the categorical variables, denoted with Y, is binary, a logistic regression can also be fitted with Y as the outcome, and all or some of the remaining \(P-1\) variables as covariates. We write, \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }, \varvec{p}=(p_1,\ldots ,p_{n_{lt}})\), using the ‘lt’ indicator for the logistic model. Here, \(p_i\) denotes the conditional probability that \(Y=1\) given covariates \(X_{lt(i)}\), and \(\varvec{\beta }\) is a vector of parameters.

Within the Bayesian framework, a prior distribution \(f(\varvec{\gamma })\) is assigned to the parameters of the log-linear or logistic regression model. This can be an informative prior that incorporates prior information on the magnitude of the effect of the different covariates or interactions. Eliciting such a prior distribution is not straightforward, especially for the coefficients of interaction terms (Consonni and Veronese 2008). Typically, lack of information for the parameters of a generalized linear model leads to a relatively flat but proper prior distribution, so that model determination based on Bayes factors is valid (O’Hagan 1995). A popular choice among Bayesian statisticians is the g-prior or a mixture of g-priors, described in detail in Sect. 2. These are flexible priors designed to carry very little information so that inferences are driven by the observed data. See, for example, Wang and George (2007), Sabanès Bovè and Held (2011), Overstall and King (2014a, b) and Mukhopadhyay and Samantha (2016). This type of prior was first proposed by Zellner (1986) for general linear models. In this context, it is known as Zellner’s g-prior. Other priors have been proposed, especially for analyses where the focus is on model comparison and variable selection. Examples include the Jeffreys prior (Liang et al. 2008), the generalized hyper-g prior (Sabanès Bovè and Held 2011), and the expected-posterior priors and power-expected-posterior priors (Fouskakis et al. 2015). Our manuscript concerns the g-prior and mixtures of g-priors. After data are collected, the prior \(f(\varvec{\gamma })\) is updated to the posterior distribution \(f(\varvec{\gamma }|\text{ Data })\) via the conditional probability formula and Bayes’ theorem, so that,
$$\begin{aligned} f(\varvec{\gamma }|\text{ Data }) = \frac{f(\text{ Data }|\varvec{\gamma }) f(\varvec{\gamma })}{f(\text{ Data })}. \end{aligned}$$
For the prior distributions discussed above, closed-form expressions for the posterior distribution \(f(\varvec{\gamma }|\text{ Data })\) do not exist. The posterior is typically calculated using Markov chain Monte Carlo stochastic simulation, or Normal approximations (O’Hagan and Forster 2004).
It is known (Agresti 2002) that when \(\mathcal {P}\) contains a binary Y, a log-linear model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\) implies a specific logistic regression model with parameters \(\varvec{\beta }\) defined uniquely by \(\varvec{\lambda }\). The logistic regression model for the conditional odds ratios for Y implies an equivalent log-linear model with arbitrary interaction terms between the covariates in the logistic regression, plus arbitrary main effects for these covariates. We provide a simple example to illustrate this result and clarify additional notation. Assume three categorical variables X, Y, and Z, with Y binary. Let i, j, k be integer indices that describe the level of X, Y, and Z, respectively. For instance, as Y is binary, \(j=0,1\). Consider the log-linear model,
$$\begin{aligned} \text{ log }(\mu _{ijk})=\lambda + \lambda _{i}^{X} + \lambda _{j}^{Y} + \lambda _{k}^{Z} + \lambda _{ij}^{XY} + \lambda _{ik}^{XZ} + \lambda _{jk}^{YZ}, \end{aligned}$$
(M1)
where the superscript denotes the main effect or interaction term. The corresponding logistic regression model for the conditional odds ratios for Y is derived as follows,
$$\begin{aligned} \text{ log } \left( \frac{P(Y=1 | X,Z)}{P(Y=0 | X,Z)} \right)= & {} \text{ log } \left( \frac{P(Y=1,X,Z)}{P(Y=0,X,Z)} \right) \\= & {} \text{ log }(\mu _{i1k}) - \text{ log }(\mu _{i0k})\\= & {} \lambda _{1}^{Y} - \lambda _{0}^{Y} + \lambda _{i1}^{XY} - \lambda _{i0}^{XY}+ \lambda _{1k}^{YZ} - \lambda _{0k}^{YZ}. \end{aligned}$$
This is a logistic regression with parameters, \(\varvec{\beta }=(\beta ,\beta _{i}^{X},\beta _{k}^{Z})\), so that, \(\beta =\lambda _{1}^{Y} - \lambda _{0}^{Y}, \beta _{i}^{X}=\lambda _{i1}^{XY} - \lambda _{i0}^{XY}\), and \(\beta _{k}^{Z}=\lambda _{1k}^{YZ} - \lambda _{0k}^{YZ}\). Considering identifiability corner point constraints, all elements in \(\varvec{\lambda }\) with a zero subscript are set to zero. Then, \(\beta =\lambda _{1}^{Y}, \beta _{i}^{X}=\lambda _{i1}^{XY} \) and \(\beta _{k}^{Z}=\lambda _{1k}^{YZ}\). This scales in a straightforward manner to larger log-linear models. For instance, if (M1) contained the three-way interaction XYZ, then the corresponding logistic regression model would contain the XZ interaction, so that, \(\beta _{ik}^{XZ}=\lambda _{i1k}^{XYZ} - \lambda _{i0k}^{XYZ}\), and under corner point constraints, \(\beta _{ik}^{XZ}= \lambda _{i1k}^{XYZ}\). If a factor does not interact with Y in the log-linear model, then this factor disappears from the corresponding logistic regression model. To demonstrate that the correspondence between log-linear and logistic models is not bijective, it is straightforward to show that, for example, the log-linear model, \(\text{ log }(\mu _{ijk})=\lambda + \lambda _{i}^{X} + \lambda _{j}^{Y} + \lambda _{k}^{Z} + \lambda _{ij}^{XY} + \lambda _{jk}^{YZ}\), implies the same logistic regression as (M1). More generally, the relation between \(\varvec{\beta }\) and \(\varvec{\lambda }\) can be described as \(\varvec{\beta }=\varvec{T}\varvec{\lambda }\), where \(\varvec{T}\) is an incidence matrix (Bapat 2011). In the context of this manuscript, matrix \(\varvec{T}\) has one row for each element of \(\varvec{\beta }\), and one column for each element of \(\varvec{\lambda }\). The elements of \(\varvec{T}\) are zero, except in the case where the element of \(\varvec{\beta }\) is defined by the corresponding element of \(\varvec{\lambda }\). 
The number of rows of \(\varvec{T}\) cannot be greater than the number of columns. To simplify the analysis and notation, for the remainder of this manuscript we consider models specified under corner point constraints. Then, every logistic regression model parameter is defined uniquely by the corresponding log-linear model parameter, and the correspondence from a log-linear to a logistic regression model is direct.
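The mapping \(\varvec{\beta }=\varvec{T}\varvec{\lambda }\) for model (M1) can be sketched numerically. The snippet below is a minimal illustration, assuming X has three levels and Z two; the parameter labels and values are hypothetical and chosen only for the example. It builds the incidence matrix and checks that \(\text{ log }(\mu _{i1k}) - \text{ log }(\mu _{i0k})\) agrees with the logistic linear predictor in every covariate cell.

```python
# Hypothetical labels and values for model (M1) under corner point constraints,
# with X on three levels (i = 0, 1, 2), Y binary (j = 0, 1) and Z binary (k = 0, 1).
lam = {
    "const": 1.2,
    "X1": 0.3, "X2": -0.4,
    "Y1": 0.5,
    "Z1": -0.2,
    "X1Y1": 0.7, "X2Y1": -0.6,
    "X1Z1": 0.1, "X2Z1": 0.25,
    "Y1Z1": -0.9,
}


def log_mu(i, j, k):
    """log(mu_ijk) under (M1); any term with a zero index is constrained to zero."""
    total = lam["const"]
    if i:
        total += lam[f"X{i}"]
    if j:
        total += lam["Y1"]
    if k:
        total += lam["Z1"]
    if i and j:
        total += lam[f"X{i}Y1"]
    if i and k:
        total += lam[f"X{i}Z1"]
    if j and k:
        total += lam["Y1Z1"]
    return total


# Each logistic parameter is defined by exactly one log-linear parameter.
beta_from_lambda = {"const": "Y1", "X1": "X1Y1", "X2": "X2Y1", "Z1": "Y1Z1"}

lam_names = list(lam)
beta_names = list(beta_from_lambda)

# Incidence matrix T: one row per element of beta, one column per element of lambda.
T = [[1 if beta_from_lambda[b] == name else 0 for name in lam_names] for b in beta_names]
beta = dict(zip(
    beta_names,
    [sum(t * lam[name] for t, name in zip(row, lam_names)) for row in T],
))


def logit_from_beta(i, k):
    """Linear predictor of the implied logistic regression for covariate cell (i, k)."""
    value = beta["const"]
    if i:
        value += beta[f"X{i}"]
    if k:
        value += beta["Z1"]
    return value


# Largest discrepancy between the log-linear contrast and the logistic predictor.
max_gap = max(
    abs((log_mu(i, 1, k) - log_mu(i, 0, k)) - logit_from_beta(i, k))
    for i in range(3)
    for k in range(2)
)
print(T)
print(max_gap)
```

Here \(\varvec{T}\) has four rows and ten columns, in line with the remark that the number of rows cannot exceed the number of columns.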

The contribution of our manuscript is twofold. First, Theorem 1 states that assigning to \(\varvec{\lambda }\) the g-prior that is specific to log-linear modelling implies the g-prior specific to logistic modelling on the parameters \(\varvec{\beta }\) of the corresponding logistic regression. The log-linear model has to be the largest model that corresponds to the logistic regression, i.e. the model that contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). Second, under the reasonable assumption that an investigator who chooses a g-prior for \(\varvec{\lambda }\) would also choose a g-prior for \(\varvec{\beta }\) if they were to fit a logistic regression directly, inferences on the parameters of a log-linear model translate to inferences on the parameters of the corresponding logistic regression. Closed-form expressions for the posterior distributions do not exist. Wang and George (2007) utilize the Laplace approximation for generalized linear models, focusing on the approximation of the marginal likelihood for the purpose of variable selection. Theorem 2 shows that, asymptotically, the matching between the prior distributions of the corresponding parameters extends to the posterior distributions. It is then demonstrated by numerical illustrations that the presence or absence of interaction terms in the log-linear model can inform on the relation between the binary Y and the other variables as described by logistic regression. For example, assume that after fitting the log-linear model, the credible interval for an element of \(\varvec{\lambda }\) contains zero. When fitting the corresponding logistic regression model, the investigator will anticipate that the credible interval for the corresponding element of \(\varvec{\beta }\) will also contain zero. 
Importantly, for this translation to hold, it is essential that the prior distribution for \(\varvec{\beta }\) implied by the prior on \(\varvec{\lambda }\) is the same as the distribution the investigator would assign to \(\varvec{\beta }\) if they were to fit the logistic model directly. If the implied prior on \(\varvec{\beta }\) is not the same as a directly assigned prior then, with regard to \(\varvec{\beta }\), the correspondence from the Bayesian log-linear analysis to the logistic one becomes dubious. In both illustrations in Sect. 4, we observe that the credible intervals of the corresponding \(\varvec{\lambda }\) and \(\varvec{\beta }\) parameters are virtually identical considering simulation error.

In Sect. 2, we provide the definition of the g-prior and mixtures of g-priors and describe how the g-prior is derived for log-linear and logistic regression models. Section 3 contains the main contributions in this manuscript. In Sect. 4, the correspondence from a log-linear to a logistic regression model is illustrated using simulated and real data. We conclude with a discussion.

2 The g-prior and mixtures of g-priors

A g-prior for the parameters \(\varvec{\gamma }\) of a generalized linear model is a multivariate Normal distribution \(N(\varvec{m}_{\gamma },g \varSigma _{\gamma })\), constructed so that the prior variance is a multiple of the inverse Fisher information matrix by a scalar g. See Liang et al. (2008) for a discussion on the choice of g. In accordance with Ntzoufras et al. (2003) and Ntzoufras (2009), the g-prior for the parameters of log-linear and logistic regression models is specified so that, \(\varvec{m}_{\gamma }=(m_{\gamma _1},0,\ldots ,0)^{\top }\), where \(m_{\gamma _{1}}\) corresponds to the intercept and can be nonzero, and,
$$\begin{aligned} \varSigma _{\gamma } = V(m^{*}) \zeta ^{'}(m^{*})^{2} \left[ X_{d}^{\top } \text{ diag }\left( \frac{1}{\phi _i}\right) X_{d}\right] ^{-1}, \end{aligned}$$
where \(\text{ diag }(1/\phi _i)\) denotes a diagonal \(n\times n\) matrix with nonzero elements \(1/\phi _i\), and \(m^{*}=\zeta ^{-1}(m_{\gamma _1})\).
The unit information prior is a special case of the g-prior, obtained by setting \(g=N\), where N denotes the total number of observations. It is constructed so that the information contained in the prior is equal to the amount of information in a single observation (Kass and Wasserman 1995). Assuming that g is a random variable, with prior f(g), leads to a mixture of g-priors, so that,
$$\begin{aligned} \varvec{\gamma }| g \sim N(\varvec{m}_{\gamma },g \varSigma _{\gamma }), \quad g \sim f(g). \end{aligned}$$
Mixtures of g-priors are also called hyper-g priors (Sabanès Bovè and Held 2011).
Log-linear regression. Consider counts \(n_i\), \(i=1,\ldots ,n_{ll}\). Now, \(N=\sum \nolimits _{i=1}^{n_{ll}} n_i\), and,
$$\begin{aligned} f(n_i | \mu _i) = \frac{\mathrm{e}^{-\mu _i} \mu _{i}^{n_i}}{n_i !}, \end{aligned}$$
with \(\theta _i = \text{ log }(\mu _i), b(\theta _i)=\mathrm{e}^{\theta _i}\) and \(c(n_i,\phi _i)=-\text{ log } (n_i !)\). Also, \(w_i \phi _{i}^{-1}=1\), so that \(w_i=1\) implies \(\phi _i=1\). Note that,
$$\begin{aligned} \mu _i=b^{'}(\theta _i)=\mathrm{e}^{\theta _i}, \quad \text{ Var }(n_i)=\phi _i w_{i}^{-1} b^{''}(\theta _i)=\mathrm{e}^{\theta _i}, \quad \text{ and } \quad V(\mu _i)=\mu _i. \end{aligned}$$
For the log-linear model, \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\), and \(\zeta (\mu _i) = \text{ log }(\mu _i)\) so that \(\zeta ^{'}(\mu _i)=\mu _{i}^{-1}\). The g-prior is constructed as \(N(\varvec{m}_{\lambda },g \varSigma _{\lambda })\), where \(\varvec{m}_{\lambda }=(\text{ log }(\bar{n}),0,\ldots ,0)\). Here, \(\bar{n}\) denotes the average cell count. The prior mean for the log-linear model intercept is also often set to zero (Dellaportas et al. 2012). (Note that altering the prior mean for the log-linear model intercept does not affect the validity of the theoretical results in Sect. 3. This is straightforward to deduce from the proof of Theorem 1 given in ‘Appendix’, as the prior mean for the log-linear intercept does not affect the implied distribution of the logistic regression parameters.) In addition,
$$\begin{aligned} \varSigma _{\lambda } = \bar{n} \frac{1}{(\bar{n})^2} \left( X_{ll}^{\top } X_{ll}\right) ^{-1} = \frac{1}{\bar{n}} \left( X_{ll}^{\top } X_{ll}\right) ^{-1} = \frac{n_{ll}}{N} \left( X_{ll}^{\top } X_{ll}\right) ^{-1}. \end{aligned}$$
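The quantities above can be checked numerically. The following sketch, with hypothetical cell counts and an arbitrary mean of 3.7, confirms that the Poisson probability coincides with its exponential-family form, and that the scalar factor \(V(m^{*}) \zeta ^{'}(m^{*})^{2}\) evaluated at \(m^{*}=\bar{n}\) reduces to \(n_{ll}/N\).

```python
import math


def poisson_pmf(n, mu):
    """Poisson probability written in its usual form."""
    return math.exp(-mu) * mu ** n / math.factorial(n)


def exponential_family_pmf(n, mu):
    """Same probability as exp{(w/phi)[n*theta - b(theta)] + c(n, phi)} with w/phi = 1."""
    theta = math.log(mu)                  # canonical parameter
    b = math.exp(theta)                   # b(theta) = e^theta
    c = -math.log(math.factorial(n))      # c(n, phi) = -log(n!)
    return math.exp(1.0 * (n * theta - b) + c)


counts = [12, 7, 30, 11]                  # hypothetical cell counts
n_ll, N = len(counts), sum(counts)
n_bar = N / n_ll                          # average cell count, i.e. m* for the log link
scale = n_bar * (1.0 / n_bar) ** 2        # V(m*) * zeta'(m*)^2

largest_gap = max(
    abs(poisson_pmf(n, 3.7) - exponential_family_pmf(n, 3.7)) for n in range(15)
)
print(largest_gap, scale, n_ll / N)
```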
Logistic regression. Assume that \(y_i, i=1,\ldots ,n_{lt}\), is the proportion of successes out of \(t_i\) trials. Now, \(N=\sum \nolimits _{i=1}^{n_{lt}} t_i\), and,
$$\begin{aligned} f(t_i y_i | p_i) = {t_i \atopwithdelims ()t_i y_i} p_i^{t_i y_i} (1-p_i)^{t_i - t_i y_i}, \end{aligned}$$
where \(\theta _i=\text{ logit }(p_i), b(\theta _i)=\text{ log }(1+\mathrm{e}^{\theta _i})\), and \(c(y_i,\phi _i)=\text{ log } {t_i \atopwithdelims ()t_i y_i}\). Also, \(w_i \phi _{i}^{-1}=t_i\), so that \(w_i=1\) implies \(\phi _i=t_{i}^{-1}\). Note that,
$$\begin{aligned} E(y_i)=b^{'}(\theta _i)=\frac{\mathrm{e}^{\theta _i}}{1+\mathrm{e}^{\theta _i}}=p_i, \quad \text{ Var }(y_i)=\frac{\phi _i}{w_i} b^{''}(\theta _i)=\frac{1}{t_i} \frac{\mathrm{e}^{\theta _i}}{(1+\mathrm{e}^{\theta _i})^2}=\frac{p_i(1-p_i)}{t_i}, \end{aligned}$$
and,
$$\begin{aligned} V(p_i)=p_i(1-p_i). \end{aligned}$$
The logistic regression model is defined as \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }\), so that \(X_{lt}\) is a \(n_{lt} \times n_{\beta }\) design matrix, and \(\zeta (p_i) = \text{ logit }(p_i)\) so that \(\zeta ^{'}(p_i)=[p_i(1-p_i)]^{-1}\). The g-prior is \(N(\varvec{m}_{\beta },g \varSigma _{\beta })\), where \(\varvec{m}_{\beta }=(0,0,\ldots ,0)\), and,
$$\begin{aligned} \varSigma _{\beta } = p^{*} (1-p^{*}) \frac{1}{\left[ p^{*} (1-p^{*})\right] ^2} \left[ X_{lt}^{\top } \text{ diag }(t_i) X_{lt}\right] ^{-1} = \frac{1}{0.25} \left[ X_{lt}^{\top } \text{ diag }(t_i) X_{lt}\right] ^{-1}. \end{aligned}$$
Here, \(p^{*}\) corresponds to \(m^{*}\) in the general definition of the g-prior at the start of this section, so that \(p^{*}=\zeta ^{-1}(m_{\gamma _{1}})\), where \(m_{\gamma _{1}}\) is the first element of \(\varvec{m}_{\beta }\) which is zero. Thus, we obtain that \(p^{*}=\mathrm{e}^{0}/(\mathrm{e}^{0}+1)=0.5\). By approximating each \(t_i\) with the average number of trials \(\bar{t}\), as suggested by Ntzoufras et al. (2003),
$$\begin{aligned} \varSigma _{\beta } \simeq 4 \frac{1}{\bar{t}} \left( X_{lt}^{\top } X_{lt}\right) ^{-1} = 4 \frac{n_{lt}}{\sum _{i=1}^{n_{lt}} t_i} \left( X_{lt}^{\top } X_{lt}\right) ^{-1} = 4 \frac{n_{lt}}{N} \left( X_{lt}^{\top } X_{lt}\right) ^{-1}. \end{aligned}$$
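To see what the approximation \(t_i \approx \bar{t}\) does to \(\varSigma _{\beta }\), the sketch below compares the exact and approximate matrices for a logistic regression with an intercept and one binary covariate. The trial counts are hypothetical, and the helper names are introduced only for this illustration. The two matrices agree when the design is balanced and differ otherwise.

```python
from fractions import Fraction


def inv2(m):
    """Exact inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]


def weighted_gram(X, w):
    """X^T diag(w) X for a design matrix with two columns."""
    return [
        [sum(wi * row[p] * row[q] for row, wi in zip(X, w)) for q in range(2)]
        for p in range(2)
    ]


def sigma_beta_exact(X, t):
    """4 [X^T diag(t_i) X]^{-1}, using p* = 0.5."""
    return [[4 * v for v in row] for row in inv2(weighted_gram(X, t))]


def sigma_beta_approx(X, t):
    """4 (1 / t_bar) (X^T X)^{-1}, replacing each t_i by the average t_bar."""
    t_bar = Fraction(sum(t), len(t))
    return [[4 * v / t_bar for v in row] for row in inv2(weighted_gram(X, [1] * len(t)))]


X_lt = [[1, 0], [1, 1]]          # intercept and one binary covariate
balanced = [100, 100]            # hypothetical trial counts
unbalanced = [40, 160]           # same N = 200, unequal t_i

print(sigma_beta_exact(X_lt, balanced) == sigma_beta_approx(X_lt, balanced))
print(sigma_beta_exact(X_lt, unbalanced))
print(sigma_beta_approx(X_lt, unbalanced))
```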

3 Correspondence from log-linear to logistic regression models

Consider a set of categorical variables \(\mathcal {P}\) that includes a binary variable Y. Assume a log-linear model that, in addition to the terms that involve Y, contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). We show that, given that a g-prior is assigned to the log-linear model parameters \(\varvec{\lambda }\), the implied prior for \(\varvec{\beta }\) is a g-prior for logistic regression models, i.e. the one that would be assigned if the investigator considered the logistic regression model directly.

Theorem 1

Assume a g-prior \(\varvec{\lambda }\sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda })\) on the parameters of a log-linear model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\), that contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). This prior implies a g-prior \(N(\varvec{m}_{\beta }, g \varSigma _{\beta })\) for the parameters \(\varvec{\beta }\) of the corresponding logistic regression \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }\).

Proof

The proof is based on rearranging the rows and columns of \(X_{ll}\), and partitioning so that one part of \(X_{ll}\) consists of the logistic design matrix \(X_{lt}\), or replications of \(X_{lt}\). We then show that the prior mean and variance of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\) are those of the prior that would be assigned to \(\varvec{\beta }\) if the logistic regression were fitted directly. The complete proof is given in ‘Appendix’.\(\square \)
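Theorem 1 can be checked by hand in the smallest non-trivial case. The sketch below is an illustration rather than part of the proof: it assumes binary X and Y, the saturated log-linear model under corner point constraints, the approximated \(\varSigma _{\beta }\) of Sect. 2, and a hypothetical sample size. It verifies in exact arithmetic that the block of \(\varSigma _{\lambda }\) for \((\lambda ^{Y}, \lambda ^{XY})\) equals \(\varSigma _{\beta }\).

```python
from fractions import Fraction
from itertools import product


def gram(X):
    """X^T X for a design matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(c1, c2)) for c2 in cols] for c1 in cols]


def inverse(M):
    """Gauss-Jordan inverse in exact rational arithmetic."""
    n = len(M)
    A = [
        [Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
        for r, row in enumerate(M)
    ]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]


# Log-linear design: one row per cell (x, y); columns are intercept, X, Y, XY.
X_ll = [[1, x, y, x * y] for x, y in product([0, 1], repeat=2)]
# Logistic design with Y as outcome: one row per level of X; columns are intercept, X.
X_lt = [[1, x] for x in [0, 1]]

N = 200  # hypothetical total number of observations
n_ll, n_lt = len(X_ll), len(X_lt)

sigma_lambda = [[Fraction(n_ll, N) * v for v in row] for row in inverse(gram(X_ll))]
sigma_beta = [[4 * Fraction(n_lt, N) * v for v in row] for row in inverse(gram(X_lt))]

# beta = lambda^Y and beta^X = lambda^{XY}: columns 2 and 3 of X_ll.
idx = [2, 3]
sigma_lambda_sub = [[sigma_lambda[r][c] for c in idx] for r in idx]

print(sigma_lambda_sub == sigma_beta)
```

The g multiplier is common to both priors, so it is omitted from the comparison.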

Corollary 1

A unit information prior \(\varvec{\lambda }\sim N(\varvec{m}_{\lambda }, N \varSigma _{\lambda })\) implies a unit information prior \(N(\varvec{m}_{\beta }, N \varSigma _{\beta })\) for the parameters \(\varvec{\beta }\) of the corresponding logistic regression.

Corollary 1 follows directly from Theorem 1 by setting \(g=N\). The following Corollary concerns mixtures of g-priors. It is implicitly assumed that the investigator would adopt the same prior density f(g) for both modelling approaches.

Corollary 2

A mixture of g-priors so that \(\varvec{\lambda }| g \sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda }), g\sim f(g)\), implies a mixture of g-priors for the parameters \(\varvec{\beta }\) of the corresponding logistic regression, so that \(\varvec{\beta }| g \sim N(\varvec{m}_{\beta }, g \varSigma _{\beta }), g\sim f(g)\).

This also follows from Theorem 1, which states that when \(\varvec{\lambda }| g \sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda })\), the conditional prior for \(\varvec{\beta }\) is \(\varvec{\beta }| g \sim N(\varvec{m}_{\beta }, g \varSigma _{\beta })\).

When the g-prior is utilized, it is common to assign a locally uniform Jeffreys prior (\(\propto 1\)) on the intercept, after the covariate columns of the design matrix have been centred to ensure orthogonality with the intercept (Liang et al. 2008). If one decides to adopt the approach where a flat prior is assigned to the intercept in both log-linear and logistic formulations, the correspondence between log-linear and logistic regression breaks, but only with regard to the intercept of the logistic regression. The prior on the log-linear intercept does not have a bearing on the implied prior for the logistic regression parameters, because the log-linear intercept does not contribute to the formation of the logistic regression parameters, as described in Sect. 1. After assigning a flat prior on the intercept of the log-linear model, all \(\varvec{\beta }\) parameters (including the intercept) are still Normal as linear combinations of Normal random variables, and the distribution of \(\varvec{\beta }\) is the one given by Theorem 1. For details, see the additional material in the proof of Theorem 1 in ‘Appendix’. For an illustration, see Table 3 in Sect. 4.2.

Closed-form expressions for the posterior distribution of the parameters of a generalized linear model do not exist. However, it is known (O’Hagan and Forster 2004) that a Normal approximation applies. Consider a g-prior for the parameters \(\varvec{\gamma }\) of the generalized linear model, \(\zeta (\varvec{\mu })=X_d \varvec{\gamma }\), so that, for fixed g,
$$\begin{aligned} \varvec{\gamma }\sim N(\varvec{m}_{\gamma },g \varSigma _{\gamma }). \end{aligned}$$
Given observations \({\varvec{v}}=\{ v_1,\ldots ,v_n \}\), the posterior distribution of \(\varvec{\gamma }\) is approximated by a Normal density, so that,
$$\begin{aligned} \varvec{\gamma }| {\varvec{v}}\sim N\left( \left[ g^{-1} \varSigma _{\gamma }^{-1} + \mathcal{I}(\hat{\varvec{\gamma }})\right] ^{-1} \times \left[ g^{-1} \varSigma _{\gamma }^{-1} \varvec{m}_{\gamma } + \mathcal{I}(\hat{\varvec{\gamma }}) \hat{\varvec{\gamma }}\right] , \left[ g^{-1} \varSigma _{\gamma }^{-1} + \mathcal{I}(\hat{\varvec{\gamma }})\right] ^{-1} \right) .\nonumber \\ \end{aligned}$$
(1)
Here, \(\hat{\varvec{\gamma }}\) is the maximum likelihood estimate of \({\varvec{\gamma }}\), and \(\mathcal{I}(\hat{\varvec{\gamma }})\) is the information matrix \(X_d^{\top } \mathcal{V} X_d\). For the log-linear model, the diagonal matrix \(\mathcal{V}\) (denoted by \(\mathcal{V}_{\text {log-linear}}\)) has diagonal elements \(\text{ exp }\{ X_{ll(i)} \hat{\varvec{\lambda }} \}, i=1,\ldots ,n_{ll}\). When the logistic regression is fitted, \(\mathcal{V}_{\mathrm{logistic}}\) has diagonal elements \(t_i \text{ exp }\{ X_{lt(i)} \hat{\varvec{\beta }} \} \left[ 1 + \text{ exp }\{ X_{lt(i)} \hat{\varvec{\beta }} \} \right] ^{-2}, i=1,\ldots ,n_{lt}\), that is, \(t_i \hat{p}_i (1-\hat{p}_i)\). Within the Bayesian framework, when fitting a generalized linear model, a large sample \((n \rightarrow \infty )\) will swamp the prior distribution, rendering it irrelevant for deriving posterior inferences (O’Hagan and Forster 2004). In practice, this can be true even for moderate sample sizes (say, of order \(10^2\) or larger), especially when the prior is not informative, which is typically the case with g-priors.
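A scalar instance of the Normal approximation (1) shows how quickly the prior is swamped. The sketch below assumes an intercept-only logistic regression with a single Binomial observation, so that \(n_{lt}=1\), \(N=t\), and \(\varSigma _{\beta }=4/N\) with prior mean zero. The function name and the data are hypothetical.

```python
import math


def normal_approx_posterior(successes, trials, g):
    """Posterior mean and variance from (1) for an intercept-only logistic model."""
    sigma = 4.0 / trials                            # Sigma_beta with n_lt = 1, N = trials
    prior_precision = 1.0 / (g * sigma)
    p_hat = successes / trials
    beta_hat = math.log(p_hat / (1.0 - p_hat))      # maximum likelihood estimate
    info = trials * p_hat * (1.0 - p_hat)           # X^T V X for a single row
    variance = 1.0 / (prior_precision + info)
    mean = variance * (prior_precision * 0.0 + info * beta_hat)  # prior mean is zero
    return mean, variance, beta_hat


# Unit information prior (g = N) with 30% observed successes at three sample sizes.
for trials in (20, 200, 2000):
    mean, variance, mle = normal_approx_posterior(3 * trials // 10, trials, g=trials)
    print(trials, round(mean, 4), round(mle, 4), round(variance, 5))
```

With \(g=N\) the prior precision stays at 0.25 while the information grows linearly in N, so the gap between the posterior mean and the maximum likelihood estimate shrinks as the sample size increases.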

Theorem 2

Consider a g-prior \(\varvec{\lambda }\sim N(\varvec{m}_{\lambda }, g \varSigma _{\lambda })\) on the parameters of a log-linear model \(\text{ log }(\varvec{\mu }) = X_{ll} \varvec{\lambda }\), that contains all possible interaction terms between the categorical factors in \(\mathcal {P} {\setminus } \{ Y \}\). Consider also the analogous g-prior \(N(\varvec{m}_{\beta }, g \varSigma _{\beta })\) for the parameters \(\varvec{\beta }\) of the corresponding logistic regression \(\text{ logit }(\varvec{p}) = X_{lt} \varvec{\beta }\). For fixed g, and for a large sample, the posterior distribution of \(\varvec{\beta }\), as given in (1), is approximately equal to the posterior distribution of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\).

Proof

A partitioning similar to the one adopted for the proof of Theorem 1 is utilized. First, we show that, asymptotically, the posterior variance of \(\varvec{\beta }\) is the posterior variance of the elements of \(\varvec{\lambda }\) that correspond to \(\varvec{\beta }\). Then, we do the same for the posterior means. The proof is based on the crucial assumption that for a large sample the contribution of the prior in deriving the posterior moments can be ignored. A standard result utilized in the proof is that, asymptotically, the Binomial distribution for a data point can be approximated by a Poisson distribution. The complete proof is given in ‘Appendix’.\(\square \)
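As a worked check in the simplest setting (an illustration only, not a substitute for the proof in ‘Appendix’), consider a \(2\times 2\) table for binary X and Y with observed counts \(n_{ij}\), where i indexes X and j indexes Y. Take the saturated log-linear model under corner point constraints and the saturated logistic regression of Y on X, and ignore the prior contribution in (1). Writing \(t_0=n_{00}+n_{01}\) and \(\hat{p}_0=n_{01}/t_0\), the maximum likelihood estimates and the corresponding diagonal elements of the inverse information matrices satisfy,

$$\begin{aligned} \hat{\lambda }_{1}^{Y}&= \text{ log }(n_{01}) - \text{ log }(n_{00}) = \text{ logit }(\hat{p}_0) = \hat{\beta }, \\ \left[ \mathcal{I}(\hat{\varvec{\lambda }})^{-1}\right] _{\lambda _{1}^{Y}}&= \frac{1}{n_{00}} + \frac{1}{n_{01}} = \frac{t_0}{n_{00}\, n_{01}} = \frac{1}{t_0\, \hat{p}_0 (1-\hat{p}_0)} = \left[ \mathcal{I}(\hat{\varvec{\beta }})^{-1}\right] _{\beta }. \end{aligned}$$

In this saturated case the approximate posterior mean and variance of the logistic intercept therefore coincide with those of \(\lambda _{1}^{Y}\) once the prior terms are dropped.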

In the next section, we demonstrate with numerical illustrations that, for fixed g, the correspondence between the priors extends to posterior distributions, so that the posterior distribution of the logistic regression parameters matches the one of the corresponding log-linear model parameters. This is true even for relatively moderate sample sizes N, say a few hundred, and for standard choices of g such as \(g=N\).

4 Illustrations

Unit information priors were adopted for the model parameters (\(g=N\)). The size of the burn-in sample was \(10^4\), followed by \(5\times 10^5\) iterations.

4.1 A simulation study

We simulate data from 1000 subjects, on six binary variables \(\{Y,A,B,C,D,E\}\). Probabilities that correspond to the cells of the \(2^6\) contingency table are generated in accordance with the log-linear model, \(\text{ log }(\varvec{\mu })=YAB+YCD+YE\). Adopting the notation in Agresti (2002), a single letter denotes the presence of a main effect, two-letter terms denote the presence of the implied first-order interaction, and so on. The presence of an interaction between a set of variables implies the presence of all lower-order interactions plus main effects for that set. Cell counts are simulated according to the generated cell probabilities. Parameter values and the design matrix of the log-linear model used to generate the cell probabilities are given in Supplemental material, Section S2.

We fit to the simulated data the log-linear model,
$$\begin{aligned} \text{ log }(\varvec{\mu })=YAB+YCD+YE+ABCDE. \end{aligned}$$
(M2)
According to the discussion and results in Sects. 1 and 3, the corresponding logistic regression where Y is treated as the outcome only contains the first-order interactions AB and CD plus the main effect for E,
$$\begin{aligned} \text{ logit }(\varvec{p}) = AB+CD+E. \end{aligned}$$
(M3)
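The step from (M2) to (M3) follows a mechanical rule: keep the terms that involve the outcome, remove the outcome letter, and drop whatever no longer contains a covariate. A minimal sketch of this rule is given below. The function name is hypothetical, and the sketch assumes each variable is denoted by a single letter, as in the hierarchical notation used here.

```python
def logistic_terms(loglinear_generators, outcome):
    """Map log-linear generators to the generators of the implied logistic model.

    Terms use the hierarchical notation of the text, so "YAB" stands for the
    YAB interaction together with all lower-order terms it implies.
    """
    candidates = []
    for term in loglinear_generators.split("+"):
        term = term.strip()
        if outcome in term:
            rest = term.replace(outcome, "")
            if rest:                      # a bare outcome term maps to the intercept
                candidates.append(rest)
    # Drop any term that is already implied by a higher-order term.
    maximal = [t for t in candidates if not any(set(t) < set(u) for u in candidates)]
    # "1" denotes an intercept-only logistic regression.
    return "+".join(dict.fromkeys(maximal)) or "1"


print(logistic_terms("YAB+YCD+YE+ABCDE", "Y"))
print(logistic_terms("XY+XZ+YZ", "Y"))
```

The first call reproduces the right-hand side of (M3); the second corresponds to model (M1) in Sect. 1, where only the main effects of X and Z remain.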
In Table 1, we present credible intervals (CIs) for the parameters of (M3) and the relevant parameters of (M2). The CIs for the corresponding \(\varvec{\lambda }\) and \(\varvec{\beta }\) parameters are almost identical, considering simulation error. For example, the CI for \(\lambda _{1,1,1}^{YCD}\) is \((-2.01,-0.85)\), whilst the CI for \(\beta _{1,1}^{CD}\) is \((-2.00,-0.84)\).
Table 1 Simulated data illustration

Log-linear model (M2), \(\text{ log }(\varvec{\mu }) = YAB + YCD + YE + ABCDE\)

| Y | YA | YB | YC | YD | YE | YAB | YCD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (0.21, 1.07) | (−0.57, 0.26) | (−0.44, 0.43) | (−0.24, 0.63) | (−0.38, 0.50) | (−0.84, −0.27) | (−1.66, −0.50) | (−2.01, −0.85) |

Outcome is Y (M3), \(\text{ logit }(\varvec{p}) = AB+CD+E\)

| Intercept | A | B | C | D | E | AB | CD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (0.21, 1.08) | (−0.58, 0.27) | (−0.45, 0.43) | (−0.23, 0.61) | (−0.38, 0.49) | (−0.84, −0.27) | (−1.66, −0.50) | (−2.00, −0.84) |

Credible intervals (CIs) for the relevant parameters of log-linear model (M2), plus the parameters of the corresponding logistic regression (M3)

In Table 2, we present minimum, maximum and quantile values for the \(t_i\) observations, for the logistic regression in Table 1. It is clear that the simulated data do not represent balanced Binomial experiments where \(t_i=\bar{t}\). The credible intervals listed in Table 1 demonstrate that the correspondence studied in this manuscript is very robust to departures from \(t_i = \bar{t}\). This is also demonstrated in the real data analysis presented in the next subsection, where the collected data do not represent balanced Binomial experiments when one of the factors is treated as the outcome. In Supplemental material, we present additional analyses on simulated data sets, including results on smaller samples, roughly one quarter the size of the data set analysed in this section. Inferences on the correspondence between the posterior distributions remain unchanged.
Table 2 Simulated data illustration

| Outcome | Minimum | 25% quantile | Median | 75% quantile | Maximum |
| --- | --- | --- | --- | --- | --- |
| Y | 11 | 17 | 21 | 41.5 | 124 |
| A | 12 | 19 | 23 | 30 | 144 |
| B | 10 | 18 | 22.5 | 31 | 165 |
| C | 12 | 18.5 | 23 | 26.5 | 151 |
| D | 11 | 19.5 | 23 | 27.5 | 147 |
| E | 10 | 17.5 | 22 | 27 | 191 |

Maximum, minimum, and quantiles for \(t_i, i=1,\ldots , n_{lt}\), for different outcomes for the simulated data in Sect. 4.1

4.2 A real data illustration

Edwards and Havránek (1985) presented a \(2^{6}\) contingency table in which 1841 men were cross-classified by six binary risk factors \(\{A, B, C, D, E, F\}\) for coronary heart disease. The data were also analysed in Dellaportas and Forster (1999), where the top hierarchical model was, \(\text{ log }(\varvec{\mu })=AC+AD+AE+BC+CE+DE+F\), with posterior model probability 0.28. In Table 3, we present CIs for the parameters of the log-linear model,
$$\begin{aligned} AC+AD+AE+BCDEF. \end{aligned}$$
(M4)
We also present CIs for the parameters of the corresponding logistic regression model when A is treated as the outcome,
$$\begin{aligned} \text{ logit }(\varvec{p}) = C+D+E. \end{aligned}$$
(M5)
We performed this analysis twice: once under the g-priors described in Sect. 2 (\(g=N\)), as in the previous illustration, and once after adopting a g-prior with a locally flat prior for the intercept. Under the g-prior described in Sect. 2, the CIs for the corresponding \(\varvec{\lambda }\) and \(\varvec{\beta }\) parameters (including the intercept) are almost identical, considering simulation error. For instance, the CI for both the coefficient of A in the log-linear model and the intercept in the logistic regression is \((-0.59, -0.24)\). Under the flat prior for the intercepts, the correspondence breaks down with regard to the intercept in the logistic regression model. The CI for the coefficient of A in the log-linear model is \((-0.59, -0.24)\), whilst the CI for the intercept of the corresponding logistic regression model is \((-0.17, 0.02)\). Concurrently, the credible intervals for the coefficients of C, D and E in the logistic regression model are almost identical to the corresponding CIs for AC, AD and AE in the log-linear model, with differences due to simulation error.
Table 3 Real data illustration

Log-linear model (M4), \(\text{ log }(\varvec{\mu }) = AC + AD + AE + BCDEF\) (g-prior in Sect. 2)

| A | AC | AD | AE |
| --- | --- | --- | --- |
| (−0.59, −0.24) | (0.36, 0.74) | (−0.56, −0.18) | (0.30, 0.68) |

Outcome is A (M5), \(\text{ logit }(\varvec{p}) = C+D+E\) (g-prior in Sect. 2)

| Intercept | C | D | E |
| --- | --- | --- | --- |
| (−0.59, −0.24) | (0.37, 0.74) | (−0.56, −0.18) | (0.30, 0.68) |

Log-linear model (M4), \(\text{ log }(\varvec{\mu }) = AC + AD + AE + BCDEF\) (flat prior on intercept)

| A | AC | AD | AE |
| --- | --- | --- | --- |
| (−0.59, −0.24) | (0.35, 0.76) | (−0.55, −0.19) | (0.29, 0.67) |

Outcome is A (M5), \(\text{ logit }(\varvec{p}) = C+D+E\) (flat prior on intercept)

| Intercept | C | D | E |
| --- | --- | --- | --- |
| (−0.17, 0.02) | (0.35, 0.75) | (−0.56, −0.19) | (0.30, 0.68) |

Relevant credible intervals for the parameters of log-linear model (M4) and the corresponding logistic regression model when A is treated as the outcome. Intervals are shown under the g-priors in Sect. 2 (\(g=N\)), and after considering a locally flat prior on the intercepts

5 Discussion

The correspondence we investigated is not unexpected, given the results in Agresti (2002) discussed in the Introduction, and also the link between the g-prior and Fisher’s information matrix (Held et al. 2015), although this link is stronger for general linear models. Our investigation is also related to Consonni and Veronese (2008), who discuss specifying a prior for the parameters of one model and then transferring this specification to the parameters of another. Of the four strategies considered in Consonni and Veronese (2008), the one directly linked to our manuscript is ‘Marginalization’, as the prior derived for the parameters of the logistic regression is the marginal prior of the relevant parameters of the log-linear model. Results on the relation between different statistical models are of interest, as they improve understanding and enhance the models’ utility. Often, developments for one modelling framework are not readily available for the other. For example, Papathomas and Richardson (2016) comment on the relation between log-linear modelling and variable selection within clustering, in particular with regard to marginal independence, without examining logistic regression models.

Our numerical illustrations concern the g-prior, where the parameter g is fixed. To further explore the correspondence between the two modelling frameworks, we also considered the two hyper priors that are prominent in Liang et al. (2008). These are the Zellner–Siow prior [IG(0.5, N/2)], and the prior introduced in the aforementioned manuscript in Sect. 4.2, with the suggested specification \(\alpha =3\). Furthermore, the two data sets were analysed after adopting a mixture of g-priors such that \(g\sim \text{ IG }(a_{g}, b_{g})\). We set \(a_g = 2+\text{ mean }(g)^2/\text{ var }(g)\) and \(b_g = \text{ mean }(g) + \text{ mean }(g)^3/\text{ var }(g)\), so that the Inverse Gamma density has the specified prior moments \(\text{ mean }(g)\) and \(\text{ var }(g)\). We considered distinct Inverse Gamma densities with markedly different expectations and variances, as well as the vague prior IG(0.1, 0.1). We observed that the correspondence does not hold exactly when a mixture of g-priors is adopted. This seems to be because the posterior distribution for g differs under the two modelling frameworks, which affects, to a small but noticeable degree, the posterior credible intervals for the model parameters. For more details, see the analyses presented in the Supplementary material.
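The moment-matching expressions for \(a_g\) and \(b_g\) can be checked numerically. The sketch below assumes the shape-scale parametrisation of the Inverse Gamma, with mean \(b/(a-1)\) for \(a>1\) and variance \(b^2/\{(a-1)^2(a-2)\}\) for \(a>2\). The function names and the values mean(g) = 10, var(g) = 25 are hypothetical and chosen only for illustration.

```python
def inverse_gamma_from_moments(mean_g, var_g):
    """Return (a_g, b_g) so that IG(a_g, b_g) has the requested mean and variance."""
    if mean_g <= 0 or var_g <= 0:
        raise ValueError("mean and variance must be positive")
    a_g = 2.0 + mean_g ** 2 / var_g
    b_g = mean_g + mean_g ** 3 / var_g
    return a_g, b_g


def inverse_gamma_moments(a, b):
    """Mean and variance of IG(a, b) in the shape-scale form; requires a > 2."""
    if a <= 2:
        raise ValueError("the variance is finite only for a > 2")
    mean = b / (a - 1.0)
    var = b ** 2 / ((a - 1.0) ** 2 * (a - 2.0))
    return mean, var


# Hypothetical prior moments for g.
a_g, b_g = inverse_gamma_from_moments(mean_g=10.0, var_g=25.0)
print(a_g, b_g)                         # 6.0 50.0
print(inverse_gamma_moments(a_g, b_g))  # (10.0, 25.0)
```

Because \(a_g > 2\) by construction, both prior moments are finite for any positive mean(g) and var(g). The vague IG(0.1, 0.1) prior falls outside this construction, since its mean and variance are not finite.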

Theoretical results in this manuscript refer to a specific log-linear model and the corresponding logistic regression model, for a given set of covariates. Therefore, our results should not be misinterpreted as licence to readily translate log-linear model selection inferences to inferences concerning logistic regression models. When performing model selection in a space of log-linear models, the prominent log-linear model describes a certain dependence structure between the categorical factors, including the relation of the binary Y with all other factors. The logistic regression that corresponds to the prominent log-linear model describes the dependence structure between Y and the other factors that is supported by the data in accordance with the log-linear analysis. Therefore, under reasonable expectation, results from a single log-linear model determination analysis may translate, at the very least, to interesting logistic regressions for any of the binary factors that formed the contingency table. However, the mapping between log-linear and logistic regression model spaces is not bijective. Furthermore, posterior model probabilities depend on the prior on the model space, with various different approaches for defining such a prior discussed in Dellaportas et al. (2012). For the simulated data analysed in Sect. 4.1, log-linear model \(YAB+YCD+YE\) has posterior probability 0.98, whilst the posterior probability of the corresponding logistic regression model (M3) is 0.59. Similar results from analysing the real data in Sect. 4.2, not presented here, also support this note of caution. In all model determination analyses, the Reversible Jump MCMC algorithm proposed in Papathomas et al. (2011) was employed. All possible graphical log-linear models were assumed equally likely a priori, as were all possible logistic graphical models for some given outcome.
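The lack of a bijective mapping can be seen on the models of Sect. 4.2. The sketch below assumes that a hierarchical log-linear model is written as a list of generator strings with single-letter factor names, and that the implied logistic regression is obtained by keeping the generators that contain the outcome and deleting the outcome letter from each. The function name `implied_logistic_terms` is hypothetical, and the main effect of the outcome is taken to correspond to the logistic intercept, which is not listed.

```python
def implied_logistic_terms(generators, outcome):
    """Generators of the logistic regression implied by a log-linear model.

    Each generator containing `outcome` contributes the remaining factors;
    generators without the outcome only describe covariate dependence and
    drop out of the logistic regression.
    """
    terms = set()
    for generator in generators:
        if outcome in generator:
            rest = generator.replace(outcome, "")
            if rest:  # an empty remainder is the intercept
                terms.add(rest)
    return sorted(terms)


# Log-linear model (M4) and the top hierarchical model reported in
# Dellaportas and Forster (1999), both taken from Sect. 4.2.
model_m4 = ["AC", "AD", "AE", "BCDEF"]
model_top = ["AC", "AD", "AE", "BC", "CE", "DE", "F"]

print(implied_logistic_terms(model_m4, "A"))   # ['C', 'D', 'E']
print(implied_logistic_terms(model_top, "A"))  # ['C', 'D', 'E']
```

Both log-linear models lead to the logistic regression (M5) when A is the outcome, so distinct points in the log-linear model space can share one logistic regression. This is one reason why posterior model probabilities need not agree across the two model spaces.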

Notes

Acknowledgements

The author wishes to thank Professor Petros Dellaportas and Dr. Antony Overstall for useful discussions during the preparation of this manuscript. We would also like to thank two reviewers and the editors for comments that helped to improve the manuscript.

Supplementary material

Supplementary material 1 (pdf 117 KB)

References

  1. Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, Hoboken
  2. Bapat RB (2011) Graphs and matrices. Springer, Hindustan Book Agency, New Delhi
  3. Consonni G, Veronese P (2008) Compatibility of prior specifications across linear models. Stat Sci 23:232–353
  4. Dellaportas P, Forster JJ (1999) Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika 86:615–633
  5. Dellaportas P, Forster JJ, Ntzoufras I (2012) Joint specification of model space and parameter space prior distributions. Stat Sci 27:232–246
  6. Edwards D, Havránek T (1985) A fast procedure for model search in multi-dimensional contingency tables. Biometrika 72:339–351
  7. Fouskakis D, Ntzoufras I, Draper D (2015) Power-expected-posterior priors for variable selection in Gaussian linear models. Bayesian Anal 10:75–107
  8. Held L, Sabanès Bovè D, Gravestock I (2015) Approximate Bayesian model selection with the deviance statistic. Stat Sci. http://www.imstat.org/sts/future_papers.html. Accessed 17 Mar 2016
  9. Kass RE, Wasserman L (1995) A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J Am Stat Assoc 90:928–934
  10. Liang F, Paulo R, Molina G, Clyde MA, Berger JO (2008) Mixtures of \(g\)-priors for Bayesian variable selection. J Am Stat Assoc 103:410–423
  11. Lutkepohl H (1996) Handbook of matrices. Wiley, Chichester
  12. Mukhopadhyay M, Samantha T (2016) A mixture of \(g\)-priors for variable selection when the number of regressors grows with the sample size. Test. doi: 10.1007/s11749-016-0516-0
  13. Ntzoufras I, Dellaportas P, Forster JJ (2003) Bayesian variable and link determination for generalized linear models. J Stat Plan Inference 111:165–180
  14. Ntzoufras I (2009) Bayesian modelling using WinBugs. Wiley, Hoboken
  15. O’Hagan A (1995) Fractional Bayes factors for model comparison. J R Stat Soc Ser B 57:99–138
  16. O’Hagan A, Forster JJ (2004) Bayesian inference, 2nd edn. vol 2B of ‘Kendall’s Advanced Theory of Statistics’. Arnold, London
  17. Overstall A, King R (2014a) A default prior distribution for contingency tables with dependent factor levels. Stat Methodol 16:90–99
  18. Overstall A, King R (2014b) Conting: an R package for Bayesian analysis of complete and incomplete contingency tables. J Stat Softw 58:1–27
  19. Papathomas M, Richardson S (2016) Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. J Stat Plan Inference 173:47–63
  20. Papathomas M, Dellaportas P, Vasdekis VGS (2011) A novel reversible jump algorithm for generalized linear models. Biometrika 98:231–236
  21. Rohatgi VK (1976) An introduction to probability theory and mathematical statistics. Wiley, New York
  22. Sabanès Bovè D, Held L (2011) Hyper-g priors for generalized linear models. Bayesian Anal 6:387–410
  23. Wang X, George GI (2007) Adaptive Bayesian criteria in variable selection for generalized linear models. Stat Sinica 17:667–690
  24. Wood SN (2006) Generalized additive models. An introduction with R, Chapman and Hall/CRC, New York
  25. Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with \(g\)-prior distributions. In: Goel PK, Zellner A (eds) Bayesian inference and decision techniques: essays in honor of Bruno de Finetti. North-Holland/Elsevier, Amsterdam, pp 233–243

Copyright information

© The Author(s) 2017

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. School of Mathematics and Statistics, University of St Andrews, St Andrews, UK
