Bayesian Models

Thomas Otter

Abstract

Bayesian models have become a mainstay in the tool set for marketing research in academia and industry practice. In this chapter, I discuss the advantages the Bayesian approach offers to researchers in marketing, the essential building blocks of a Bayesian model, Bayesian model comparison, and useful algorithmic approaches to fully Bayesian estimation. I show how to achieve feasible Bayesian inference to support marketing decisions under uncertainty using the Gibbs sampler and the Metropolis-Hastings algorithm, and point to more recent developments – specifically the no-U-turn implementation of Hamiltonian Monte Carlo sampling available in Stan. The emphasis is on the development of an appreciation of Bayesian inference techniques supported by references to implementations in the open source software R, and not on the discussion of individual models. The goal is to encourage researchers to formulate new, more complete, and useful prior structures that can be updated with data for better marketing decision support.

Keywords

Marketing decision-making · Bayesian inference · Gibbs sampling · Metropolis-Hastings · Hamiltonian Monte Carlo · R · bayesm · Stan

Introduction: Why Use Bayesian Models?

Bayesian models have gained popularity over the past 30 years, both among academics in marketing and among marketing research practitioners. There are several reasons for this popularity. First, many marketing problems involve data in the form of relatively short panels but with many observational units (large N, small T). Each observational unit, e.g., a respondent, a customer, or a store, supplies only a limited amount of data, but there are many observational units in the data set. In the vast majority of these applications, decision makers know a priori that observational units are heterogeneous in their underlying, at least partially unobserved characteristics that generated the data. And the successful marketing of differentiated goods, which involves market segmentation, targeting, and positioning, requires measures of heterogeneity in the population of observational units. Estimating separate, independent models for each observational unit results in unreliable estimates, and in many applications, individual level time series are too sparse for individual level maximum likelihood estimates to be defined. Hierarchical Bayes models offer a convenient and practical solution to this problem.

Second, the overwhelming majority of marketing data sets involve so-called limited dependent variables, e.g., choices, ratings, rankings, or generally dependent variables that have strongly noncontinuous features. Although a number of non-Bayesian estimators are available for models with such dependent variables (see e.g., Amemiya 1985; Long 1997), the assessment of statistical uncertainty in the estimates relies on large sample asymptotic arguments. In marketing, large samples that allow for inference based on asymptotic arguments are the exception, even in an era where big data has become a ubiquitous buzzword. Big data, by definition, involve large data sets. However, the size of the data set usually does not translate into more statistical information about individual target parameters. Big data are "big" because their dimensionality spans, e.g., tens of thousands of customers, products, and time points, and includes a myriad of potentially useful conditioning arguments. The dimensionality of the data, at the very source of its size or "bigness," regularly translates into similarly high-dimensional models and estimation problems, such that the amount of statistical information about individual target parameters is small yet again. Bayesian models allow for coherent inference even in small samples, or more generally in situations where there is little data-based information about individual parameters. Moreover, a number of relatively simple yet powerful computational algorithms facilitate the estimation of limited dependent variable models.

Third, in marketing, inference about model parameters or more generally about different models, i.e., the statistical assessment of the likely mechanisms that bring about consumers’ and competitors’ behaviors in a market is usually not an end in itself but input to the decisions of marketing managers in companies. The likely benefit from various alternatives for, e.g., product design, product line composition, pricing, or advertising schedules can be expressed as a function of a model and its parameters. However, knowledge of model parameters and generally the model that generated the observed market behaviors will never be perfect. Bayesian modeling facilitates the accurate incorporation of any remaining uncertainty about the mechanism behind observed market behaviors in managerial decisions.

Fourth, computational resources become more powerful and affordable every year, facilitating the estimation of ever more realistic and thus complex models in academic and industry applications. In addition, freely available software such as the R-package bayesm (see Rossi et al. 2005) makes a collection of Bayesian models useful for marketing applications readily accessible (the latest version of bayesm is written for speed using the R-package Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013); the last complete version mostly written in plain R is version 2.2-5, whose R-files are available from the CRAN archives and are often a useful start when developing your own routines). In fact, one reason for the popularity of Bayesian modeling among market research practitioners has been the adoption of hierarchical Bayes models for inference by companies like Sawtooth Software (Orme 2017), which revolutionized how market research consultants approach the analysis of, for example, choice-based conjoint experiments. Finally, Stan (Carpenter et al. 2017) appears to be a big step towards freeing creative modeling from having to invest substantial amounts of time in the development of efficient Bayesian estimation routines.

Fifth, because Bayesian estimation is simply the exact reverse of the data generating process (DGP), it is naturally attractive to researchers who are interested in the development and the empirical test of their own marketing models. Some researchers view the need to specify a complete DGP as a drawback. The argument is that theory is never precise enough to do so, and that this requirement leads to arbitrary choices that unduly impact the inference for quantities the data are more or less directly informative about. The Bayesian response to this criticism is to specify highly flexible DGPs in instances where theory is lacking. This strategy is facilitated by algorithms that adaptively determine a reasonable dimensionality of a flexibly formulated model. This determination is based on statistical evidence that potentially favors a lower dimensional, simpler model, and not just fails to reject that model as in classical hypothesis testing.

All that said, it usually still takes longer to estimate a fully Bayesian model than it takes to compute maximum likelihood estimates, when they exist. I have also heard people "complain" about the amount of information contained in large samples from posterior distributions as produced by modern numerical Bayesian inference tools, compared to a collection of maximum likelihood estimates and their standard errors. However, it seems natural to wait somewhat longer for a more complete answer to a decision problem. And many interesting decision problems cannot be properly addressed based on a collection of maximum likelihood estimates (should they even exist), especially upon realizing that their standard errors cannot be reliably estimated with the data at hand.

Bayesian Essentials

A Bayesian model consists of a likelihood function p(y|θ) that fully specifies the probability of the data y given parameters θ, i.e., the process that generates the data for known parameters. In fact, if the researcher only wants to work with one likelihood function and is not interested in comparing across different mechanisms that may have generated the data, any function that is proportional to p(y|θ) will do, i.e., all functions that differ from p(y|θ) only by an arbitrary positive constant c are likelihood functions, ℓ(y|θ) ≡ c ⋅ p(y|θ). We will revisit this point later. A simple example is the linear regression model \( {y}_i={\mathbf{x}}_i^{\prime}\boldsymbol{\beta} +{\varepsilon}_i,\ {\varepsilon}_i\sim iid\ \mathcal{N}\left(0,{\sigma}_{\varepsilon}^2\right) \), which implies the following likelihood for the data: \( p\left(\mathbf{y}|\boldsymbol{\beta}, {\sigma}_{\varepsilon}^2\right)={\prod}_{i=1}^N\mathcal{N}\left({y}_i|{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, {\sigma}_{\varepsilon}^2\right) \).

The second component of a Bayesian model is a prior distribution for the parameters indexing the likelihood, p(θ). The notation p(θ) means "the density p evaluated at the value θ". Further, defining the prior distribution as p(θ) implies that θ ~ p, i.e., that θ is (a priori) distributed according to the density p, or simply is p-distributed. The notation p(θ) is short-hand because it omits the (subjective prior) parameters indexing the prior distribution. For example, in an application the statement that the prior is a multivariate normal distribution is incomplete. We need to add the information about the prior mean and variance, e.g., \( p\left(\boldsymbol{\theta} \right)=\mathcal{N}\left(\boldsymbol{\theta} |{\boldsymbol{\theta}}^0,{\boldsymbol{\Sigma}}^0\right) \), where \( \mathcal{N}\left(\boldsymbol{\theta} |{\boldsymbol{\theta}}^0,{\boldsymbol{\Sigma}}^0\right) \) is the multivariate normal distribution with mean θ0 and variance-covariance Σ0 evaluated at θ. The multivariate normal density can be evaluated in R using the command dmvn from the R-package mvnfast (Fasiolo 2016) or the command dmvnorm from the R-package mvtnorm (Genz et al. 2018). Both commands support computations on the log scale, which are essential for numerical accuracy. For example, a log-likelihood value of −2000 can only be numerically distinguished from a log-likelihood value of, say, −2050 on the log scale, because both likelihoods, i.e., exp(−2000) and exp(−2050), evaluate to an "exact" machine zero at currently available machine accuracies.
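A minimal sketch in R (the values of theta, theta0, and Sigma0 below are made-up illustrative choices):

library(mvtnorm)

theta  <- c(0.5, -1.2)                # point at which to evaluate the prior
theta0 <- c(0, 0)                     # illustrative prior mean
Sigma0 <- diag(2) * 10                # illustrative prior covariance
dmvnorm(theta, mean = theta0, sigma = Sigma0, log = TRUE)  # log prior density
exp(-2000) == 0                       # TRUE: underflows to machine zero in double precision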

The need to specify prior distributions for Bayesian analysis is often viewed as a drawback of the Bayesian approach. There are several aspects to the specification and the role of the prior distribution in a Bayesian model. First, as suggested by the name, the prior distribution is the formal vehicle to bring prior substantive knowledge to bear on the analysis. And it is sometimes overlooked by critics of the Bayesian approach that such knowledge is already required when specifying the likelihood function. Second, from a purely technical point of view, prior distributions improve the statistical properties of estimators derived from the model (see e.g., Robert 1994, p. 75).

In the regression example, a useful way to probe into prior knowledge is to think about expected changes in yi as a function of changes in xi. Unless the substantive domain the data originate from is completely unknown, the analyst will very likely have some substantive idea about the DGP that should be used in the formulation of prior distributions. In the event that the analysis is a follow-up on previous statistical analyses in the same or a related domain, the choice of prior can build on these results. An example would be market research companies that more or less continuously study demand in a set of markets.

With the specification of a prior distribution, the analyst expresses his beliefs about what parameter values are more likely than other parameter values and by how much, based on his existing substantive understanding of the DGP. If the analyst specifies a prior such that parameters in a relatively small subset of the parameter space are much more likely than other parameters, the prior is usually referred to as an informative prior. The most extreme case of an informative prior is a distribution that concentrates all its mass on a single parameter value. Such a prior is called degenerate. Degenerate priors constrain parameters to take particular values known a priori. Conversely, the prior is weakly informative or diffuse if there is no discernible concentration of prior mass on subsets of the parameter space. However, unless the parameter space is bounded in all directions as, e.g., in the case of a parameter measuring a probability, it is impossible to put exactly equal prior weight on all parameter values without violating the requirement that the prior needs to be in the form of a probability density function (a function p(θ) is a probability density function if ∫p(θ)dθ = 1). Priors that fulfill this requirement are also referred to as proper priors, and priors that do not are improper or literally noninformative. Finally, if the prior puts zero mass on subsets of the parameter space, e.g., zero mass on positive price coefficients in a demand model, it is called a constrained prior.

Bayesian models then apply Bayes’ theorem to derive the posterior distribution of model parameters given the data:
$$ p\left(\boldsymbol{\theta} |\mathbf{y}\right)=\frac{p\left(\mathbf{y}|\boldsymbol{\theta} \right)\, p\left(\boldsymbol{\theta} \right)}{\int p\left(\mathbf{y}|\boldsymbol{\theta} \right)\, p\left(\boldsymbol{\theta} \right)\, d\boldsymbol{\theta}}=\frac{p\left(\mathbf{y},\boldsymbol{\theta} \right)}{p\left(\mathbf{y}\right)} $$
(1)
Equation 1 identifies the goal of a Bayesian model: to make probability statements about quantities of interest, θ. More specifically, a Bayesian model extracts information in the data y via the likelihood function p(y|θ) to update prior knowledge about these quantities summarized in the prior distribution p(θ). The updated knowledge is then used to compare among marketing actions a with payoffs that depend on θ. If we define the loss from an action a given θ as \( \mathcal{L}\left(a,\boldsymbol{\theta}\right) \), the optimal Bayes action minimizes the posterior expected loss:
$$ \mathcal{L}\left(a|\mathbf{y}\right)=\int \mathcal{L}\left(a,\boldsymbol{\theta} \right)\, p\left(\boldsymbol{\theta} |\mathbf{y}\right)\, d\boldsymbol{\theta} $$
(2)

In marketing applications, the loss usually does not directly depend on θ but on the implied data \( \hat{\mathbf{y}} \), usually some manifestation of demand, i.e., \( \mathcal{L}\left(a,\boldsymbol{\theta} \right)=\int \mathcal{L}\left(a,\hat{\mathbf{y}}\right)p\left(\hat{\mathbf{y}}|\boldsymbol{\theta}, a\right)\, d\hat{\mathbf{y}} \). The notation \( p\left(\hat{\mathbf{y}}|\boldsymbol{\theta}, a\right) \) covers the relevant case where the actions under investigation are conditioning arguments to the DGP. A well-known example is finding the coupon strategy that maximizes net revenues, i.e., minimizes the loss defined as negative net revenues, in Rossi et al. (1996).

The denominator in Eq. 1, p(y), is known as the marginal likelihood of the data y or the normalizing constant of the posterior distribution p(θ|y). As we will see in section "Bayesian Estimation", knowledge of this quantity is not required for Bayesian inference given a particular model. However, statements about quantities of interest θ in probability form require that 0 < p(y) < ∞. Only if this condition is met will the posterior p(θ|y) be in the form of a probability density function, i.e., ∫p(θ|y)dθ = 1.

In addition, the marginal likelihood of the data p(y) is needed for comparisons across different models for the same data, where models may be arbitrarily different in terms of the likelihood function, the prior distribution, or both. In fact, based on the marginal likelihood of the data given a particular model ℳ, i.e., p(y|ℳ), the decision theoretic framework in Eq. 2 can be extended to cover decisions about the DGP itself, and to take uncertainty about the data generating model into account when choosing a marketing action. The optimal action given a set of possible data generating models ℳ1, … , ℳK and the data minimizes
$$ \mathcal{L}\left(a|\mathbf{y},{\mathcal{M}}_1,\dots, {\mathcal{M}}_K\right)=\sum \limits_k p\left(\mathbf{y}|{\mathcal{M}}_k\right)\Pr \left({\mathcal{M}}_k\right)\int \mathcal{L}\left(a,\boldsymbol{\theta} \right)p\left(\boldsymbol{\theta} |\mathbf{y},{\mathcal{M}}_k\right)\, d\boldsymbol{\theta} $$
(3)
where Pr(ℳk) is the subjective prior probability that model k is the true model; in the absence of better knowledge, it is often chosen to be 1/K. A marketing application following this general idea is presented in Montgomery and Bradlow (1999).

The fundamental appeal of being able to make probability statements about quantities of interest θ is the seamless integration with decision-making based on the expected utility from a set of possible actions. Note that the posterior expected loss in Eq. 2 will only usefully distinguish between different actions a if the posterior p(θ|y) integrates to 1, i.e., is a valid probability density function. It should be recognized that a proper prior distribution p(θ) essentially guarantees that we can make these probability statements, independent of any data deficiencies that may be present. A Bayesian model therefore quantifies how much the data, through the likelihood, add to our prior understanding of a DGP by comparing the prior distribution p(θ) to the posterior distribution p(θ|y). This is different from the classical question what models or model parameters the data can identify.

Consider the following illustrative example. Let us assume that someone measured the preferences for various credit cards on a linear, continuous scale. The cards vary in terms of brand: Mastercard, Visa, Discover; interest rate on outstanding balances: 18%, 15%, 12%; annual fee: no annual fee, $10, $20; and finally the credit limit: $1000, $2500, $5000. The researcher has preference measures for the following eight cards in Table 1, where "1s" indicate which attribute levels are present.
Table 1

Credit card design matrix

      Brand                  Interest       Annual fee     Credit limit
 #    Master Visa Discover   18%  15%  12%  $0   $10  $20  $1000  $2500  $5000
 1      1     0      0        1    0    0    1    0    0     1      0      0
 2      1     0      0        0    0    1    0    0    1     0      0      1
 3      0     1      0        1    0    0    0    1    0     0      0      1
 4      0     0      1        1    0    0    0    0    1     0      1      0
 5      0     0      1        0    0    1    0    1    0     1      0      0
 6      0     0      1        0    1    0    1    0    0     0      0      1
 7      0     1      0        0    0    1    1    0    0     0      1      0
 8      1     0      0        0    1    0    0    1    0     0      1      0

Dummy coding using the brand Mastercard, 18% interest, no annual fee, and a credit limit of $1000 as baselines, and adding a constant, we obtain the matrix corresponding to the linear regression model \( {y}_i={\beta}_0+{x}_{1,i}{\beta}_1+\dots +{x}_{8,i}{\beta}_8+{\varepsilon}_i,\ {\varepsilon}_i\sim \mathcal{N}\left(0,{\sigma}_{\varepsilon}^2\right) \) for cards i = 1, … , 8 in Table 2.
Table 2

Credit card model matrix

      Constant  Brand          Interest    Annual fee   Credit limit
                Visa Discover  15%   12%   $10   $20    $2500  $5000
 #    x0        x1   x2        x3    x4    x5    x6     x7     x8
 1    1         0    0         0     0     0     0      0      0
 2    1         0    0         0     1     0     1      0      1
 3    1         1    0         0     0     1     0      0      1
 4    1         0    1         0     0     0     1      1      0
 5    1         0    1         0     1     1     0      0      0
 6    1         0    1         1     0     0     0      0      1
 7    1         1    0         0     1     0     0      1      0
 8    1         0    0         1     0     1     0      1      0
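As an aside, the model matrix in Table 2 can be reproduced in R from the attribute levels in Table 1 via dummy (treatment) coding; a minimal sketch (the data frame cards and its variable names are my own construction):

cards <- data.frame(
  brand    = factor(c("Master", "Master", "Visa", "Discover",
                      "Discover", "Discover", "Visa", "Master"),
                    levels = c("Master", "Visa", "Discover")),
  interest = factor(c("18%", "12%", "18%", "18%", "12%", "15%", "12%", "15%"),
                    levels = c("18%", "15%", "12%")),
  fee      = factor(c("$0", "$20", "$10", "$20", "$10", "$0", "$0", "$10"),
                    levels = c("$0", "$10", "$20")),
  limit    = factor(c("$1000", "$5000", "$5000", "$2500",
                      "$1000", "$5000", "$2500", "$2500"),
                    levels = c("$1000", "$2500", "$5000"))
)
X <- model.matrix(~ brand + interest + fee + limit, data = cards)
X  # reproduces Table 2: first level of each factor serves as the baseline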

It is easy to verify that the nine β-coefficients in this model are not jointly likelihood-identified, because there are only eight observations. This can be viewed as a toy example of the increasingly common situation where the number of (potential) explanatory variables exceeds the number of observations, including big data sets that owe their size to the number of variables in addition to the number of observations (that are "broader" than they are "long"). In such data sets, a purely data-based distinction between the connections from explanatory variables to the dependent variable is no longer possible, even if all explanatory variables come from independent processes a priori.

Inspecting the bivariate correlations between the covariates in Table 2, depicted in Table 3, we can see that these correlations are individually not too strong. However, we also see that no two design columns are perfectly orthogonal. I further investigate the model in Table 2 using regression analysis. Specifically, I regress each column in Table 2 (excluding the constant) on the remaining columns. Every one of these eight regressions results in a perfect prediction, because each has 9 − 1 = 8 predictors (the remaining seven covariates plus a constant) for eight observations. The rows in Table 4 report the coefficients from regressing the covariate indicated by the row name on the remaining seven covariates in addition to a constant. A dash indicates that the covariate indicated by the column label in Table 4 is the dependent variable. The "NAs" result from perfect predictions of the covariates "Discover," "12% interest rate," and "$10 annual fee" before including the covariate "$5,000 credit limit" as a predictor.
Table 3

Correlations between design columns

          Visa    Discover  15%     12%     $10     $20     $2500   $5000
Visa       1
Discover  −0.45    1
15%       −0.33    0.15      1
12%        0.15   −0.07     −0.45    1
$10        0.15   −0.07      0.15   −0.07    1
$20       −0.33    0.15     −0.33    0.15   −0.45    1
$2500      0.15   −0.07      0.15   −0.07   −0.07    0.15    1
$5000      0.15   −0.07      0.15   −0.07   −0.07    0.15   −0.6     1

Table 4

Design column dependence – regression analysis

          Constant  Visa   Discover  15%    12%    $10    $20    $2500  $5000
Visa       0         –      0        −1      0      0     −1      1      1
Discover   0.5      −0.5    –         0      0      0      0      0      NA
15%        0        −1      0         –      0      0     −1      1      1
12%        0.5       0      0        −0.5    –      0      0      0      NA
$10        0.5       0      0         0      0      –     −0.5    0      NA
$20        0        −1      0        −1      0      0      –      1      1
$2500      0         1      0         1      0      0      1      –     −1
$5000      0         1      0         1      0      0      1     −1      –

For example, the last line of Table 4 implies the following deterministic equation from regressing the covariate "$5,000 credit limit" on the remaining covariates in Table 2: \( x_8 = 0 + 1x_1 + 0x_2 + 1x_3 + 0x_4 + 0x_5 + 1x_6 - 1x_7 \). The contrasts x1, x3, and x6 involving "Visa," "15% interest," and "$20 annual fee" are therefore positively confounded with the contrast involving "$5,000 credit limit," and this latter contrast is negatively confounded with the contrast x7 involving "$2,500 credit limit."
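Assuming X is the model matrix constructed above (constant in column 1, x1 through x8 in columns 2 through 9), this deterministic relation is easy to check in R:

all(X[, 9] == X[, 2] + X[, 4] + X[, 7] - X[, 8])  # x8 = x1 + x3 + x6 - x7; TRUE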

Now, what are the implications for modeling the variation in the preference measures y as a function of covariates? In order to arrive at a likelihood-identified regression model, we need to reduce the number of covariates (the number of columns in Table 2) such that the resulting X-matrix is of full column rank, and the inverse of X′X is well defined. As a general rule, we can always throw out covariates that are independent of all covariates we would like to keep in the model, without biasing our inference for the influence of the latter. Throwing out such covariates, at worst, increases the unexplained variance. In this example, no covariate fulfills this criterion by the mere fact that we have too many covariates to choose from, relative to the number of observations.

As a second general rule, we can eliminate covariates from the model which we strongly believe (know a priori) to have no (direct) effect on the dependent variable. We can do so regardless of how such covariates are related to covariates we would like to keep in the model, for unbiased inference about the influence of the latter.

However, if we eliminate a covariate that actually has a direct effect on the dependent variable that is not independent of all covariates we would like to keep in the model, the resulting inference will be biased. For example, whatever the true preference contribution of “$5,000 credit limit” relative to the baseline of only “$1,000 credit limit,” the coefficients associated with “Visa,” “15% interest,” and “$20 annual fee” will be biased upward by this amount, and the coefficient associated with “$2,500 credit limit” will be biased downward by the same amount upon deleting column x8 (“$5,000 credit limit”) for identification in this example. Also, note that the confounds identified here are not automatically resolved upon collecting more data. In fact, even an infinite number of observations from the model in Table 2 will exhibit the same problem. What is required for improved data based identification is not only more but also “different” data, i.e., data generated by X-configurations different from those in Table 2. However, more data will necessarily be “suitably different” if the processes that generate the covariates are independent, at least conditionally.

In this particular example, there is no obvious choice of covariates that could be omitted based on strong prior beliefs that their direct effect is equal to zero. In fact, a prior understanding of preferences for credit cards would suggest that all covariates likely causally relate to the observed preferences for the different cards. Thus, any likelihood identified model obtained by omitting covariates from Table 2 is likely to yield substantially biased inferences regarding the influence of covariates retained in the model.

At this point, it is useful to relate likelihood-identification by omitting covariates to the formulation of a prior. In a sense, omitting covariates to achieve likelihood-identification corresponds to a degenerate prior concentrated on zero for the effects of omitted covariates, coupled with an improper prior for the effects of covariates retained in the model. In contrast, a Bayesian model for this data defined through a proper prior over all observed covariates expresses the belief that these covariates contributed causally independently to the observed preferences, with some prior uncertainty about the size of the individual contributions.

From the perspective of different (implied) priors, I believe that essentially nobody would prefer one of the many possible likelihood identified models in this example to the Bayesian model that keeps with the prior causal structure. Mutilating the prior causal structure to overcome data deficiencies and to achieve likelihood-identification (and more generally statistical efficiency) does not seem to be a generally useful strategy. Obviously, one often can (and should) try to obtain more informative data. However, completely discounting the information in only partially informative data seems to be a wasteful strategy.

Importantly, a prior that expresses the belief in invariant structural aspects of the data generating process will eventually translate into accurate posterior measures of the strength of structural relationships, once more likelihood information becomes available. A model (or prior structure) that is formulated in response to observed data deficiencies will not. Thus, the findings from such a model are generally not useful as prior input to future analysis of data from the same process, be it informative, or again deficient per se, potentially in a different way. We will revisit this topic when we discuss and numerically illustrate hierarchical Bayesian models that manage to extract information about the distribution of parameters from a collection of likelihoods that individually fail likelihood-identification (a collection of “deficient” data sets).

A big intellectual step is thus to acknowledge the limits of a perspective that literally asks "for the data to speak." The decisions that go into "making the data speak," be it in the form of simple summaries or complicated (likelihood identified) models, always involve prior knowledge. In this context, trading beliefs about an underlying structure for the ability to relate parameters to well-determined functions of the data alone regularly strips the thus-identified parameters of the meaning sought by the analyst in the first place. In contrast, updating a structurally intact prior with deficient data preserves the structural interpretation of parameters, at the expense of "purely" data-based identification (I put "purely" in quotes, because the decision about how to arrive at a model that can be identified only based on the data at hand always involves subjective, i.e., non-data based prior knowledge).

Now back to our example. When the matrix in Table 2 is passed to R's lm function, for example, lm automatically deletes the last column, yielding a model that just identifies the remaining β-coefficients. This model computes eight parameters from eight observations and thus trivially fits the data perfectly. Because of the perfect fit of every member of the class of just-identified models, the data cannot distinguish among models in this class. However, as mentioned earlier, prior knowledge strongly suggests that no likelihood-identified model obtained by deleting covariates makes much structural sense in this example.

For illustration, I simulate 1000 data sets using the model matrix in Table 2, a coefficient vector β = (4, 2, 0, 1, 1.5, −1, −1.5, 2, 3), and \( {\sigma}_{\varepsilon}^2=1 \). For each data set, I estimate the regression model in Table 2 dropping column x8 for identification, which corresponds to the default in R's lm function. I also estimate a fully conjugate (conjugacy refers to mathematical properties of a prior in combination with a particular likelihood function; so-called conjugate priors result in posteriors of the same distributional form as the prior, e.g., a normal prior is the conjugate prior for the parameters in a normal likelihood with known variance, i.e., a likelihood that implies (conditionally) normally distributed data) Bayesian regression model with conditional prior \( \boldsymbol{\beta} \sim \mathcal{N}\left(\mathbf{0},\boldsymbol{I}{\sigma}_{\varepsilon}^2100\right) \), without dropping any columns from Table 2, using the routine runireg in the R-package bayesm (Rossi et al. 2005) (the marginal prior for \( {\sigma}_{\varepsilon}^2 \) is inverse Gamma with 3 degrees of freedom and scale equal to the observed variance of y in each data set, i.e., the default in the R-package bayesm).
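A minimal sketch of one replication of this experiment, assuming X is the nine-column model matrix built earlier (the number of draws R and the seed are arbitrary; runireg's default prior settings match those described above):

library(bayesm)

set.seed(1)
beta.true <- c(4, 2, 0, 1, 1.5, -1, -1.5, 2, 3)
y <- as.vector(X %*% beta.true + rnorm(nrow(X)))   # sigma_eps^2 = 1

ols <- lm(y ~ X - 1)                               # aliased column reported as NA

out <- runireg(Data  = list(y = y, X = X),
               Prior = list(betabar = rep(0, 9), A = 0.01 * diag(9)),
               Mcmc  = list(R = 2000))
colMeans(out$betadraw)                             # posterior means, all nine coefficients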

Table 5 reports the data generating true β-values, the means of the OLS and Bayes estimates across the 1000 data replications, as well as the corresponding standard deviations. The comparison between the data generating values and the mean of the OLS estimates clearly illustrates the bias analyzed theoretically earlier. The coefficients associated with "Visa," "15% interest," and "$20 annual fee" are biased upward by about 3, which corresponds to the data generating preference contribution of x8 = 1, i.e., "$5,000 credit limit," which was dropped from estimation for identification. The coefficient associated with "$2,500 credit limit" is biased downward by the same amount. Taking into account the standard deviations, these biases appear to be statistically significant, despite the small samples of eight observations. In contrast, the mean of the Bayes estimates for the same coefficients is much closer to the data generating values. In addition, the standard deviations show that especially the parameters affected by bias in the OLS regression are estimated with more statistical precision in the Bayesian model.
Table 5

Sampling experiment

                         OLS                     Bayes
          True values    Mean     Std. dev.      Mean     Std. dev.
Constant    4.0           3.96    0.91            3.95    0.88
Visa        2.0           5.04    1.29            2.72    0.53
Discover    0.0           0.01    0.70            0.01    0.69
15%         1.0           4.02    1.33            1.70    0.54
12%         1.5           1.49    0.68            1.49    0.66
$10        −1.0          −1.00    0.70           −0.98    0.68
$20        −1.5           1.53    1.31           −0.77    0.55
$2500       2.0          −0.98    0.68            1.33    0.38
$5000       3.0           0       –               2.31    0.42
The main difference between the classical OLS approach and the Bayesian approach here lies in the assumptions that enable the extraction of information from the data. While classical estimation requires prior information about how to reduce the dimensionality of the inferential problem to deliver estimates, the Bayesian approach allows us to retain the original dimensionality at the expense of assumptions that make regression parameters outside of some range very unlikely. In applications where the form, and thus the dimensionality, of the likelihood function derives from causal reasoning, i.e., theory, the Bayesian approach thus facilitates inference without having to compromise on what is the core of existing beliefs about the DGP in response to data deficiencies.

The rapidly developing field of machine learning provides alternative approaches to flexibly “regularize” a likelihood function (see e.g., Hastie et al. 2001). On a formal level, the regularization techniques employed in machine learning can be re-expressed as prior assumptions about parameters or likely model structures. And while the machine learning approach may have advantages in applications where the analyst has minimal to no prior knowledge about the DGP, the Bayesian approach excels when such knowledge is available.

The prior employed in our illustrative example certainly is closer to a common sense understanding of preferences for credit cards than the model implied by deleting x8 ("$5,000 credit limit"), or any other likelihood-identified model obtained by deleting covariates in this example. However, it is still in the spirit of regularization without much attention to detail and, incidentally, essentially corresponds to a ridge-regression approach (Hoerl and Kennard 1970).

To illustrate further, Fig. 1 depicts the joint posterior of the coefficients associated with "Visa" and "$5,000 credit limit" obtained from one of the 1000 simulated data sets. It illustrates a strong one-to-one trade-off between the "Visa" and "$5,000 credit limit" coefficients (compare the 45-degree downward sloping solid line through the origin). When the draw of the "Visa" coefficient suggests an exceedingly positive preference for Visa relative to the baseline brand Mastercard, the "$5,000 credit limit" coefficient suggests a pronounced distaste for the $5,000 credit limit relative to the baseline, and vice versa. Without the prior, this distribution would collapse to a line with equal support for all coefficients from (β_Visa = −∞, β_$5000 = ∞) to (β_Visa = ∞, β_$5000 = −∞), and consequently zero support for any finite set of coefficients. This line is the graphical analogue to nonidentifiability. The prior essentially allows for point identification by concentrating posterior support away from the endpoints (−∞, ∞) and (∞, −∞). I believe that essentially everybody would view this as a reasonable assumption after pondering combinations of, say, "infinite" preference for Visa with "infinite" distaste for a credit limit of $5,000.
Fig. 1

Posterior correlation of the “Visa” and the “$5,000 credit limit” coefficient in one simulated data set

A more elaborate prior could, for example, harness the (weak) prior preference ordering of the levels of interest rate, annual fee, and credit limit, or specific knowledge about the person rating the credit cards (see e.g., Allenby et al. 1995).

Finally, many marketing applications such as, for example, conjoint experiments or the analysis of scanner panel data are characterized by a collection of small data sets that individually are similarly problematic as the one corresponding to Table 2. In such settings, so-called hierarchical Bayes models are useful. Hierarchical Bayes models learn the form of the prior to apply to each individual data set from the collection of data sets. In a hierarchical model, the prior that regularizes each individual level likelihood is therefore itself an object of statistical inference (see e.g., Lenk et al. 1996).

Even in settings where a data set formally identifies the parameters in a likelihood function, Bayes' theorem (Eq. 1) implies that the prior distribution will "bias away" the posterior from the information in the data. At least in small samples, or generally in the context of data that do not contain much information about target parameters, the optimal Bayes action (see Eq. 2) may thus be different from the action that only conditions on likelihood information. And analysts trained in classical frequentist statistics often point out that an objective assessment of, for example, the statistical relevance of a parameter is no longer possible once a subjectively formulated prior enters the inferential procedure.

This criticism is certainly valid. However, the quest for objective inference comes at the price of not being able to use some data sets at all, or only subject to assumptions that likely are less defensible or further removed from a common understanding of the DGP than can be incorporated in a prior distribution. Furthermore, when only finite amounts of data are available, the frequentist assessment of statistical uncertainty in estimates or about models often relies on large sample asymptotic arguments in all but simple linear models. Large sample asymptotic arguments are certainly objective but may or may not hold in a particular application that has to rely on finite data.

Finally, the posterior distribution from priors that have positive support over the entire parameter space as defined by the likelihood function, i.e., are neither degenerate nor constrained, will converge to the maximum likelihood estimate as the data become more and more informative. In this sense, priors that are neither degenerate nor constrained result in large-sample consistent inferences.

Bayesian Estimation

For the purpose of inference given a particular Bayesian model, knowledge of the marginal likelihood p(y) is not required, because as long as p(y) is finite and positive, we have
$$ p\left(\boldsymbol{\theta} |\mathbf{y}\right)\propto p\left(\mathbf{y}|\boldsymbol{\theta} \right)p\left(\boldsymbol{\theta} \right) $$
(4)
i.e., the posterior distribution is proportional to the product of the likelihood times the prior. This proportionality follows from elementary probability calculus upon recognizing that the product of likelihood times the prior defines the joint density of the data y and parameters θ, i.e., the conditional distribution of θ given the data y is proportional to the joint distribution of parameters and the data.

Another way to appreciate this proportionality is to think about the graphical representation of the posterior distribution of a scalar parameter. It is obvious that the linear scaling of the y-axis in this graph does not matter for relative probability statements of the form p (θi|y) /p (θj|y), because any finite multiplicative constant would cancel from this ratio. For the same reason, posterior Bayesian inference given a model is invariant to rescaling the likelihood, the prior, or both by multiplicative constants. Similarly, the relative expected loss from two actions ak and al given a particular model \( \mathcal{L} \)(ak|y)/\( \mathcal{L} \)(al|y) does not depend on multiplicative constants. However, to compute the expected loss in Eq. 2, we need absolute probability statements about θ, i.e., we need to normalize the product c1p (y|θ) c2p(θ), where c1 and c2 are arbitrary positive “rescaling” constants.

I first discuss two examples where it is relatively obvious how to compute the normalizing constant ∫c1p(y|θ)c2p(θ)dθ in closed form. When the normalizing constant is available in closed form, the posterior p(θ|y) will usually be in the form of a known distribution. For known distributions, random number generators are implemented as part of statistical programming languages such as, for example, R, or can easily be constructed. Based on r = 1, … , R draws from such a random number generator, we can approximate the posterior expected loss in Eq. 2 to an arbitrary degree of precision and for arbitrarily complicated nonlinear loss functions as
$$ \mathcal{L}\left(a|\mathbf{y}\right)\approx \frac{1}{R}\sum \limits_{r=1}^R\mathcal{L}\left(a,{\boldsymbol{\theta}}^r\right),\quad {\boldsymbol{\theta}}^r\sim p\left(\boldsymbol{\theta} |\mathbf{y}\right), $$
(5)
because \( \underset{R\to \infty }{\lim}\frac{1}{R}{\sum}_{r=1}^R\mathcal{L}\left(a,{\boldsymbol{\theta}}^r\right)=\int \mathcal{L}\left(a,\boldsymbol{\theta} \right)\;p\left(\boldsymbol{\theta} |\mathrm{y}\right)\;d\boldsymbol{\theta} \) by the law of large numbers provided that \( \mathcal{L} \)(a| y) is known to be finite (Compare this to the definition of the (posterior) mean, i.e., ∫θp(θ|y)dθ and its estimator from a sample θ1, … , θr, … , θR, i.e., \( \frac{1}{R}{\sum}_{r=1}^R{\boldsymbol{\theta}}^r \)). This condition will always hold if the loss function evaluates to finite values over the definitional range of θ, formally \( -\infty <\underset{\boldsymbol{\theta}}{\min}\left(\mathcal{L}\left(a,\boldsymbol{\theta} \right)\right)\le \underset{\boldsymbol{\theta}}{\max}\left(\mathcal{L}\left(a,\boldsymbol{\theta} \right)\right)<\infty \), or more generally if nonfinite \( \mathcal{L} \)(a, θ) is an event of probability measure zero.
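For a concrete toy illustration of Eq. 5, assume a squared-error loss and let an arbitrary Beta distribution stand in for the posterior p(θ|y) of a scalar θ (both the loss function and the Beta(4, 8) posterior are made up for illustration):

loss <- function(a, theta) (a - theta)^2     # toy loss function
draws <- rbeta(10000, 4, 8)                  # stand-in posterior draws of theta
mean(loss(0.2, draws))                       # approximate expected loss of action a = 0.2
mean(loss(0.5, draws))                       # ... of action a = 0.5; choose the smaller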

I then move to models where the posterior distribution cannot be computed in closed form and introduce Gibbs sampling facilitated by data augmentation and the Metropolis-Hastings algorithm as solutions to Bayesian inference in this case.

Examples of Posterior Distributions in Closed Form

Beta-binomial model. Consider a Bernoulli experiment that yields independently, identically distributed (iid) observations yi taking one of two values, say "1" and "0", with probabilities θ and 1 − θ. Repeating the Bernoulli experiment n times results in \( s={\sum}_{i=1}^n{y}_i \) "1s" and \( n-{\sum}_{i=1}^n{y}_i \) "0s". The probability of observing s in n trials given θ is then
$$ p\left(s|n,\theta \right)=\binom{n}{s}{\theta}^s{\left(1-\theta \right)}^{n-s}=\frac{\Gamma \left(n+1\right)}{\Gamma \left(s+1\right)\Gamma \left(n-s+1\right)}{\theta}^s{\left(1-\theta \right)}^{n-s} $$
(6)
where Γ is the Gamma-function (The relation Γ(n + 1) = n! provides some useful intuition for the Gamma-function).
As we will see, a convenient prior for the unobserved p (yi = 1) = θ is in the form of a Beta density:
$$ p\left(\theta |a,b\right)=\frac{\Gamma \left(a+b\right)}{\Gamma (a)\Gamma (b)}{\theta}^{a-1}\;{\left(1-\theta \right)}^{b-1} $$
(7)

The parameters a and b can be interpreted as the number of “1 s” and “0 s” in a hypothetical prior experiment and serve to express prior beliefs about θ. However, all real valued a,b > 0 result in proper priors for the probability θ over its definitional range, i.e., \( {\int}_0^1p\left(\theta |a,b\right)\; d\theta =1 \). For example, setting both a and b equal to 1 yields the uniform density over the unit interval expressing the absence of prior knowledge about what θ-values are more likely than others. Setting a and b equal to the same value larger than 1 yields a density that in the limit of a, b → ∞ degenerates to a point mass at 0.5, which corresponds to various degrees of prior belief strength about θ being equal to 0.5. The mean and mode of the Beta density are given by a/ (a + b) and (a − 1) / (a + b − 2). Therefore, a > b (a < b) expresses prior beliefs that θ > 0.5 (θ < 0.5). Finally, for 0 < a, b < 1, the Beta density takes a bathtub shape that piles up mass at the borders of the parameter space 0 and 1.
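These shapes are easy to inspect in R (the particular a, b values below are arbitrary examples of the cases just described):

curve(dbeta(x, 1, 1), 0, 1, ylim = c(0, 4), ylab = "prior density")  # uniform
curve(dbeta(x, 5, 5), add = TRUE, lty = 2)      # mass concentrated around 0.5
curve(dbeta(x, 8, 2), add = TRUE, lty = 3)      # a > b: belief that theta > 0.5
curve(dbeta(x, 0.5, 0.5), add = TRUE, lty = 4)  # bathtub shape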

Conditional on the data y1,…, yn, the binomial coefficient that forms the first factor in Eq. 6 is a fixed constant. Similarly, the normalizing constant of the Beta density, i.e., the first factor on the right hand side of Eq. 7 is fixed for a given choice of a,b.

Defining c1 = (Γ (n + 1))−1 Γ (s + 1) Γ (n − s + 1) and c2 = (Γ (a + b))−1 Γ (a) Γ (b) and making use of the proportionality in Eq. 4, we thus have
$$ {\displaystyle \begin{array}{l}p\left(\theta |a,b,s,n\right)\propto {c}_1p\left(s|n,\theta \right){c}_2p\left(\theta |a,b\right)\\ {}\qquad \qquad \propto {\theta}^s{\left(1-\theta \right)}^{n-s}{\theta}^{a-1}{\left(1-\theta \right)}^{b-1}={\theta}^{s+a-1}{\left(1-\theta \right)}^{n-s+b-1}\end{array}} $$
(8)
Comparing the rightmost expression in Eq. 8 to Eq. 7, we see that this product is in the form of a (non-normalized) Beta density with parameters \( \tilde{a}=s+a \) and \( \tilde{b}=n-s+b \), and therefore
$$ p\left(\theta |a,b,s,n\right)=\frac{\Gamma \left(\tilde{a}+\tilde{b}\right)}{\Gamma \left(\tilde{a}\right)\Gamma \left(\tilde{b}\right)}{\theta}^{\tilde{a}-1}{\left(1-\theta \right)}^{\tilde{b}-1} $$
(9)

The fact that the posterior distribution in Eq. 9 is of the same known distributional form as the Beta-prior makes the Beta-prior very convenient in the context of a binomial likelihood function. Technically, the Beta-prior is the conjugate prior to the binomial likelihood.

Moving from Eqs. 6 and 7 to Eq. 8 we dropped all multiplicative constants from the likelihood and the prior that do not depend on θ and then normalized the result from Eq. 8 to arrive at Eq. 9. As discussed following Eq. 4 above, we can do so for the purpose of inference given a particular model that consists of a specific likelihood function and prior. I will address the role of these model-specific constants in the context of formal comparisons between different models further below.

Finally, a useful exercise for first-time acquaintances with Bayesian inference is to simulate binomial data, for example, using R's rbinom command, or simply by making up n and s, and then to simulate from the posterior in Eq. 9 using R's rbeta command for different specifications of a and b. Observe how the posterior changes as you use more or less (informative) data and more or less informative priors.
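A minimal version of this exercise (sample size, data generating probability, and prior settings are arbitrary choices to experiment with):

set.seed(1)
n <- 50
y <- rbinom(n, size = 1, prob = 0.3)        # simulate Bernoulli data
s <- sum(y)
a <- 1; b <- 1                              # uniform prior; try other values
draws <- rbeta(10000, s + a, n - s + b)     # draws from the posterior in Eq. 9
quantile(draws, c(0.025, 0.5, 0.975))       # posterior median and 95% interval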

Another intellectually useful exercise is to think about different finite amounts of Bernoulli data that consist of only "1s" (or only "0s"). Clearly, the maximum likelihood estimate of the data generating probability is one (zero) in this case, and a purely data-based assessment of the uncertainty in this estimate is impossible. A question at the core of statistical decision theory then is the following: Is a decision maker better off taking the maximum likelihood probability estimate of one (zero) for granted, or should he rather base his decisions on a proper posterior distribution (obtained using a proper prior distribution with positive support over the unit interval)? A general answer to this question, which we will not attempt to prove here, is that any proper prior will translate into better decisions than taking the maximum likelihood estimate for granted. The only exception is the case where prior knowledge itself implies a deterministic process.

Normal-Normal model. The second example is a normal regression likelihood with a known observation error variance coupled with a normal prior for the regression coefficients. This example is of limited direct practical value. However, it showcases another important conjugate relationship. Moreover, this model serves as a useful building block for Bayesian inference in the binomial probit model discussed later, and numerous other models. Consider the following regression model and implied likelihood function
$$ {\displaystyle \begin{array}{c}{y}_i={\mathbf{x}}_i^{\prime}\boldsymbol{\beta} +{\varepsilon}_i\quad {\varepsilon}_i\sim iid\ \mathcal{N}\left(0,1\right)\\ {}p\left({y}_1,\dots, {y}_n|\boldsymbol{\beta}\right)={\left(2\pi \right)}^{-n/2}{\prod}_{i=1}^n\exp \left(-\frac{1}{2}{\left({y}_i-{\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)}^2\right),\end{array}} $$
(10)
and a multivariate normal prior distribution for the k regression coefficients corresponding to the entries in xi, i.e.,
$$ p\left(\boldsymbol{\beta} |{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)={\left(2\pi \right)}^{-k/2}{\left|{\boldsymbol{\Sigma}}^0\right|}^{-1/2}\exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)}^{\prime }{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)\right). $$
(11)
Defining y = (y1, … , yn)′ and X = (x1, … , xn)′ the posterior distribution is then proportional to (see Eq. 4):
$$ {\displaystyle \begin{array}{l}p\left(\boldsymbol{\beta} |\mathbf{y},{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)\propto \exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)}^{\prime }{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)\right)\prod \limits_{i=1}^n\exp \left(-\frac{1}{2}{\left({y}_i-{\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)}^2\right)\\ {}\qquad \qquad \quad \propto \exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)\right)\exp \left(-\frac{\tilde{s}}{2}\right)\\ {}\qquad \qquad \quad \propto \exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)\right),\end{array}} $$
(12)
where
$$ \tilde{\boldsymbol{\beta}}={\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)}^{-1}\left({\mathbf{X}}^{\prime}\mathbf{y}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}{\boldsymbol{\beta}}^0\right) $$
(13)
$$ \tilde{s}={\left(\mathbf{y}-\mathbf{X}\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left(\mathbf{y}-\mathbf{X}\tilde{\boldsymbol{\beta}}\right)+{\left(\tilde{\boldsymbol{\beta}}-{\boldsymbol{\beta}}^0\right)}^{\prime }{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\left(\tilde{\boldsymbol{\beta}}-{\boldsymbol{\beta}}^0\right) $$
(14)
See Rossi et al. (2005) or Zellner (1971) for the details of the transformations in Eq. 12 and note that the posterior mean \( \tilde{\boldsymbol{\beta}} \) in Eq. 13 will converge to the ordinary least squares or maximum likelihood estimate as the sample size (the information in the data) increases, for all nondegenerate prior settings (i.e., |Σ0| > 0). Smaller (larger) variances in Σ0 put more (less) prior weight behind the prior guess β0. For a well-defined ordinary least squares estimate \( \hat{\boldsymbol{\beta}} \) (a well-defined inverse of X′X), we can write Eq. 13 as
$$ {\displaystyle \begin{array}{c}\tilde{\boldsymbol{\beta}}={\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)}^{-1}\left({\mathbf{X}}^{\prime}\mathbf{X}{\left({\mathbf{X}}^{\prime}\mathbf{X}\right)}^{-1}{\mathbf{X}}^{\prime}\mathbf{y}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}{\boldsymbol{\beta}}^0\right)\\ {}={\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)}^{-1}\left({\mathbf{X}}^{\prime}\mathbf{X}\hat{\boldsymbol{\beta}}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}{\boldsymbol{\beta}}^0\right)\end{array}} $$
which illustrates that the posterior mean \( \tilde{\boldsymbol{\beta}} \) is a weighted combination of the ordinary least squares or maximum likelihood estimate \( \hat{\boldsymbol{\beta}} \) and the prior mean β0, where the weights are the information from the data, X′X, and the amount of prior information, (Σ0)−1, respectively. Thus, the posterior mean will be somewhere "in between" the ordinary least squares estimate and the prior mean.
When combining the normal likelihood (Eq. 10) with the normal prior (Eq. 11) in Eq. 12, we dropped the multiplicative constants (2π)−n/2 and (2π)−k/2|Σ0|−1/2 from the likelihood and the prior, respectively. Again, this is fine as long as we are only interested in inference given this specific model. Upon recognizing that the last line of Eq. 12 is the so-called kernel of a multivariate normal distribution (the kernel of a distribution drops all factors that do not directly depend on both the unobserved parameters and the data or variables the distribution is for) and thus using
$$ {\displaystyle \begin{array}{l}\int \exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)\right)\;d\boldsymbol{\beta} =\\ {}\qquad \qquad ={\left(2\pi \right)}^{k/2}{\left|{\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right|}^{-1/2}\end{array}} $$
(15)
we obtain the joint posterior distribution of the k regression coefficients in closed form:
$$ p\left(\boldsymbol{\beta} |\mathbf{y},{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)={\left(2\pi \right)}^{-k/2}{\left|{\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right|}^{1/2}\;\exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)\right)=\mathcal{N}\left(\boldsymbol{\beta} |\tilde{\boldsymbol{\beta}},{\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)}^{-1}\right) $$
(16)
We can directly sample from this distribution using, for example, the command rmvnorm in the R-package mvtnorm (Genz et al. 2018) or the faster version rmvn available in the R-package mvnfast (Fasiolo 2016). The bayesm (Rossi et al. 2005) routine corresponding to this model is breg.
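A self-contained sketch of sampling from Eq. 16 (the simulated data set and the prior settings beta0 and Sigma0 are made up for illustration; the error variance is known to be 1, as in Eq. 10):

library(mvtnorm)

set.seed(1)
n <- 100
X <- cbind(1, rnorm(n))
y <- as.vector(X %*% c(1, -2) + rnorm(n))      # error variance 1

beta0  <- rep(0, 2)                            # prior mean
Sigma0 <- diag(2) * 100                        # diffuse prior covariance
V <- solve(crossprod(X) + solve(Sigma0))       # posterior covariance, Eq. 16
beta.tilde <- V %*% (crossprod(X, y) + solve(Sigma0) %*% beta0)  # Eq. 13

draws <- rmvnorm(5000, mean = as.vector(beta.tilde), sigma = V)
colMeans(draws)                                # close to the OLS estimate here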

Posterior Distributions Not in Closed Form

Next, I discuss the model defined by the combination of a binomial probit likelihood and a multivariate normal prior for the regression coefficients (see Eq. 11). Bayesian inference for this model is considerably more challenging than for the two models discussed already, because the normalizing constant of the posterior distribution is not available in closed form. The binomial probit likelihood is, similar to the binomial likelihood in Eq. 6, a DGP for independently distributed observations yi taking one of two values, say "1" and "0". The probit likelihood defines the probability of observing yi = 1 as a function of covariates xi and (probit) regression parameters β as follows:
$$ p\left({y}_i=1|\boldsymbol{\beta} \right)=\Phi \left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)={\int}_{-\infty}^{{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}}\mathcal{N}\left(z|0,1\right)\, dz $$
(17)
$$ p\left({y}_i=0|\boldsymbol{\beta} \right)=\Phi \left(-{\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)={\int}_{{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}}^{\infty}\mathcal{N}\left(z|0,1\right)\, dz $$
(18)
Thus, observations y = (y1, … , yn)′ are not identically distributed but, conditional on the covariates X = (x1, … , xn)′, provide information about β exchangeably. "Exchangeably" essentially means that we do not need to keep track of the order or sequence of the data for proper inference. Exchangeability here is a consequence of conditional independence given the data generating parameters and the observed covariates (see e.g., Bernardo and Smith 2001). The data y then have probit likelihood:
$$ {\displaystyle \begin{array}{l}p\left(\mathbf{y}|\boldsymbol{\beta} \right)=\prod \limits_{i=1}^n{\left(\Phi \left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)\right)}^{y_i}{\left(\Phi \left(-{\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)\right)}^{1-{y}_i}\\ {}\, =\prod \limits_{i=1}^n{\left({\int}_{-\infty}^{{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}}\mathcal{N}\left(z|0,1\right)\, dz\right)}^{y_i}{\left({\int}_{{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}}^{\infty}\mathcal{N}\left(z|0,1\right) dz\right)}^{1-{y}_i}\end{array}} $$
(19)
By Eq. 4, the posterior distribution of β is proportional to:
$$ p\left(\boldsymbol{\beta} |\mathbf{y},{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)\propto \exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)}^{\prime }{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)\right)\prod \limits_{i=1}^n{\left(\Phi \left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)\right)}^{y_i}{\left(\Phi \left(-{\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)\right)}^{1-{y}_i} $$
(20)
As already mentioned, the normalizing constant of the right hand side in Eq. 20 cannot be computed in closed form and we thus cannot derive the posterior distribution directly, unlike in the previous examples. I will introduce Gibbs sampling as one solution to Bayesian inference in this model. To this end, an alternative interpretation of the probit likelihood suggested by the integral on the right hand side of Eq. 17 will be useful. Taking advantage of the symmetry of the normal distribution, we can rewrite:
$$ p\left({y}_i=1|\boldsymbol{\beta} \right)={\int}_{-\infty}^{{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}}\mathcal{N}\left(z|0,1\right) dz={\int}_0^{\infty}\mathcal{N}\left(z|{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) dz $$
(21)
$$ p\left({y}_i=0|\boldsymbol{\beta} \right)={\int}_{{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}}^{\infty}\mathcal{N}\left(z|0,1\right) dz={\int}_{-\infty}^0\mathcal{N}\left(z|{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) dz $$
(22)
and interpret the binomial probit model as a random utility model in which latent utilities zi are independently normally distributed with means xi′β and standard deviations equal to 1. A latent utility draw zi from \( \mathcal{N}\left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) \) larger than 0 generates an observed yi = 1 and a draw smaller than 0 an observed yi = 0, i.e., yi = 1 (zi > 0). This is exactly equivalent to generating a y-observation using the probability in Eq. 21 as the parameter of a Bernoulli distribution (Draw a random uniform number u from the interval [0, 1], e.g., using runif(1) in R, and compare it to the probability in Eq. 21. Set yi = 1 (yi = 0) when u is smaller (larger) than this probability, or use the R-command rbinom) because, e.g., \( \underset{R\to \infty }{\mathit{\lim}}\frac{1}{R}{\sum}_{r=1}^R\mathbf{1}\left({z}^r>0\right)={E}_z\mathbf{1}\left(z>0\right)={\int}_0^{\infty}\mathcal{N}\left(z|{\mathbf{x}}^{\prime}\boldsymbol{\beta}, 1\right)\; dz \).
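The following minimal sketch in R illustrates this equivalence for a single observation, assuming a covariate vector x_i and a coefficient vector beta are given (object names are illustrative):

mu <- sum(x_i * beta)                                       # x_i' beta
y_latent    <- as.integer(rnorm(1, mean = mu, sd = 1) > 0)  # latent utility route
y_bernoulli <- rbinom(1, size = 1, prob = pnorm(mu))        # Bernoulli route, Eq. 21
# over many replications, both routes produce y_i = 1 with probability pnorm(mu)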

If we had access to the latent utilities z = (z1, … ,zn)′ that generated the observed binomial data y = (y1, … ,yn)′, we could comfortably rely on the closed form results in Eq. 16 for Bayesian inference. Conditional on the data generating z, we would in fact learn more about the regression coefficients than we ever could from the corresponding y.

Conversely, if we knew the regression coefficients β that generated the data, we could make an informed guess about the corresponding data generating z. Based on the y-data, we know that z that correspond to observed 1’s must have been larger than zero and those corresponding to observed 0’s smaller than zero. Based on β and the random utility interpretation of the probit likelihood, we know that the zi came independently from \( \mathcal{N}\left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) \). Putting these insights together, we arrive at the following conditional distribution for a zi corresponding to observed yi = 1, and that for a zj corresponding to observed yj = 0 given β:
$$ p\left({z}_i|\boldsymbol{\beta}, {y}_i=1\right)=\frac{\mathcal{N}\left({z}_i|{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right)\, \mathbf{1}\, \left({z}_i>0\right)}{\int_0^{\infty}\mathcal{N}\left(z|{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) dz}=\mathcal{TN}\left({z}_i|{\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1,0,\infty \right) $$
(23)
$$ p\left({z}_j|\boldsymbol{\beta}, {y}_j=0\right)=\frac{\mathcal{N}\left({z}_j|{\mathbf{x}}_j^{\prime}\boldsymbol{\beta}, 1\right)\mathbf{1}\, \left({z}_j<0\right)}{\int_{-\infty}^0\mathcal{N}\left(z|{\mathbf{x}}_j^{\prime}\boldsymbol{\beta}, 1\right)\, dz}=\mathcal{TN}\left({z}_j|{\mathbf{x}}_j^{\prime}\boldsymbol{\beta}, 1,-\infty, 0\right) $$
(24)

Here 1(·) is an indicator function that evaluates to one if its argument is true and to zero otherwise, and \( \mathcal{TN}\left(a,b,c,d\right) \) is short for a normal distribution with mean a, variance b, truncated from below at c and from above at d. We can simulate from these distributions using a trick known as the inverse CDF-transformation (see e.g., Rossi et al. 2005), or rely on the command rtruncnorm in the R-package truncnorm (Mersmann et al. 2018) which builds on Geweke (1991).
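A minimal sketch of the inverse CDF-transformation for the truncated normals in Eqs. 23 and 24; the function name rtnorm_icdf is illustrative, and rtruncnorm provides equivalent functionality:

rtnorm_icdf <- function(n, mu, a, b) {
  # uniform draws restricted to the normal CDF-range of the truncation region (a, b)
  u <- runif(n, pnorm(a - mu), pnorm(b - mu))
  mu + qnorm(u)  # map back through the standard normal quantile function
}
# e.g., a draw of z_i given y_i = 1 (Eq. 23):
# rtnorm_icdf(1, mu = sum(x_i * beta), a = 0, b = Inf)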

Based on the results in Eq. 16 the conditional distribution of β given the z and the y is:
$$ p\left(\boldsymbol{\beta} |\mathbf{z},\mathbf{y},{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)=p\left(\boldsymbol{\beta} |\mathbf{z},{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)=\mathcal{N}\left(\tilde{\boldsymbol{\beta}},{\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)}^{-1}\right), $$
(25)
where
$$ \tilde{\boldsymbol{\beta}}={\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)}^{-1}\left({\mathbf{X}}^{\prime}\mathbf{z}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}{\boldsymbol{\beta}}^0\right) $$
(26)

Note that once we condition on the z in Eq. 25, the y are no longer required as conditioning argument. A particular set of z transmits all the information, and in fact more information than contained in the y, to β (I will discuss general rules for the derivation of conditional distributions later and for now concentrate on what can be achieved based on conditional distributions).

Gibbs sampler. Our goal is thus to derive the marginal posterior distribution p(β|y, β0, Σ0) that is free of the extra, but merely virtual, information about β that comes with each particular set of z we may condition on in Eq. 25. However, as we already know, this posterior is not available in closed form. A convenient solution to this problem is the Gibbs sampler. The Gibbs sampler allows us to generate draws from p(β, z|y, β0, Σ0) based on knowledge of \( p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)={\prod}_{i=1}^np\left({z}_i|\boldsymbol{\beta}, {y}_i\right) \) and p(β|z, β0, Σ0), i.e., conditional distributions only. Once we have draws from p(β, z|y, β0, Σ0), each draw of β in that sample is a draw from our target distribution p(β|y, β0, Σ0) (Recall that the joint distribution p(β, z|y, β0, Σ0) can be decomposed into the product of the marginal distribution p(β|y, β0, Σ0) and the conditional distribution p(z|y, β) by elementary probability calculus. If we have access to a sample from the joint distribution, drawing a β with no regard to the companion z and then looking at the companion z in the sample is equivalent to drawing from p(β|y, β0, Σ0) and then from p(z|y, β)).

The Gibbs sampler is an application of the fact that the joint distribution p (β, z|y, β0, Σ0) is uniquely determined by corresponding complete sets of conditional distributions (Besag 1974). The correspondence between the conditional distributions p (β|z, β0, Σ0) and p (z|y, β) and the joint posterior distribution is illustrated in Eq. 27 which is an instance of the Hammersley-Clifford theorem. For clarity of notation, I abbreviate the subjective prior parameters β0, Σ0 to “•” in the following.
$$ {\displaystyle \begin{array}{l}p\left(\boldsymbol{\beta}, \mathbf{z}|\mathbf{y},\bullet \right)=p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right)p\left(\mathbf{z}|\mathbf{y},\bullet \right)\\ {}\qquad \qquad =p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right){\left(\int \frac{p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right)}{p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)}d\boldsymbol{\beta} \right)}^{-1}\end{array}} $$
(27)
Proof:
$$ {\displaystyle \begin{array}{c}p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)p\left(\boldsymbol{\beta} |\mathbf{y},\bullet \right)=p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right)p\left(\mathbf{z}|\mathbf{y},\bullet \right)\\ {}\frac{p\left(\boldsymbol{\beta} |\mathbf{y},\bullet \right)}{p\left(\mathbf{z}|\mathbf{y},\bullet \right)}=\frac{p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right)}{p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)}\\ {}\int \frac{p\left(\boldsymbol{\beta} |\mathbf{y},\bullet \right)}{p\left(\mathbf{z}|\mathbf{y},\bullet \right)}d\boldsymbol{\beta} =\int \frac{p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right)}{p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)}d\boldsymbol{\beta} \\ {}\quad \frac{1}{p\left(\mathbf{z}|\mathbf{y},\bullet \right)}=\int \frac{p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right)}{p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)}d\boldsymbol{\beta} \end{array}} $$
(28)
Based on r = 1, … , R draws from p (β|z, •), we can therefore estimate the marginal distribution:
$$ p\left(\mathbf{z}|\mathbf{y},\bullet \right)={\left(\int \frac{p\left(\boldsymbol{\beta} |\mathbf{z},\bullet \right)}{p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)}d\boldsymbol{\beta} \right)}^{-1}\approx {\left(\frac{1}{R}\sum \limits_{r=1}^R\frac{1}{p\left(\mathbf{z}|\mathbf{y},{\boldsymbol{\beta}}^r\right)}\right)}^{-1} $$
(29)
and thus compute the joint distribution p(β, z|y, •) based only on knowledge of the conditional distributions p(z|y, β) and p(β|z, •). The Gibbs sampler, which builds on this fundamental relationship, proceeds as follows:
1. Based on a starting value for β, draw z from p(z|y, β) as given in Eqs. 23 and 24.
2. Use the most recent draw of z as conditioning argument in p(β|z, •) (Eq. 25) and draw a new β.
3. Use the most recent draw of β as conditioning argument in p(z|y, β) (Eqs. 23 and 24) and draw new z.
4. Return to step 2, until completing R cycles through step 2 and step 3, and then stop.

Each completed cycle through steps 2 and 3 delivers a pair (β, z)r where r = 1, … ,R indexes the cycle or iteration number of the Gibbs sampler. Under rather general conditions for the conditional distributions involved, these pairs will represent draws from the joint distribution after some initial iterations, independently of the choice of starting value. The initial iterations serve to “make the Gibbs sampler forget” the arbitrary starting value in step 1 above. This is often referred to as the “burn-in” period of the Gibbs sampler. Intuitively, the choice of starting value does not matter, because the Gibbs sampler will forget it, no matter which value was chosen (However, the choice of starting value may influence how many iterations it takes before the Gibbs sampler converges, i.e., delivers pairs (β, z)r in proportion to their joint posterior density in a finite sample of R draws. Another practical concern for the choice of starting values is the numerical stability of the techniques used to draw from the conditional distributions).
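Putting the pieces together, the following is a minimal sketch of this Gibbs sampler in plain R, assuming data y, a design matrix X, and subjective prior parameters beta0 and Sigma0; the function name is illustrative, and rbprobitGibbs in bayesm, discussed below, provides a full implementation:

library(truncnorm)

probitGibbsSketch <- function(y, X, beta0, Sigma0, R = 5000) {
  n <- nrow(X); K <- ncol(X)
  Sigma0inv <- solve(Sigma0)
  V <- solve(t(X) %*% X + Sigma0inv)  # posterior covariance in Eq. 25
  betadraw <- matrix(0, R, K)
  beta <- rep(0, K)                   # step 1: arbitrary starting value
  for (r in 1:R) {
    mu <- as.vector(X %*% beta)
    # step 3: draw latent utilities from the truncated normals in Eqs. 23 and 24
    z <- rtruncnorm(n, a = ifelse(y == 1, 0, -Inf),
                    b = ifelse(y == 1, Inf, 0), mean = mu, sd = 1)
    # step 2: draw beta from the conditional normal in Eqs. 25 and 26
    betatilde <- V %*% (t(X) %*% z + Sigma0inv %*% beta0)
    beta <- as.vector(betatilde + t(chol(V)) %*% rnorm(K))
    betadraw[r, ] <- beta
  }
  betadraw  # discard an initial burn-in before summarizing
}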

Steps 2 and 3 above are often referred to as “blocks of the sampler.” Note that step 3 itself consists of n subblocks that each draw from the conditional distribution of a particular zi. However, because all zi are conditionally independent, i.e., \( p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)={\prod}_{i=1}^np\left({z}_i|\boldsymbol{\beta}, {y}_i\right) \) (see Eqs. 23 and 24), step 3 effectively draws from the joint conditional posterior distribution of z. Similarly, step 2 draws from the joint conditional posterior distribution of all elements in β.

To further strengthen the intuition for the Gibbs sampler, it is useful to think about each iteration as an exploration of the joint distribution in some neighborhood defined by the respective conditioning arguments. Because the conditioning arguments are themselves resampled and updated at every iteration, the Gibbs sampler is, however, not going to stay in this neighborhood but will move away from it and eventually return.

Each time it returns to some fixed neighborhood of β-values, for example, it will do so from a different constellation of z. Returns from z-constellations that are closer to this β-neighborhood in the sense of Eqs. 25 and 26 will occur more often than returns from z-constellations that are further away. Thus, looking at pairs (β, z)r in this neighborhood, it is impossible to distinguish between moves “from β to z” and moves “from z to β,” and this will be true of every β-neighborhood and z-neighborhood supported by the posterior distribution. In addition, by successively sampling from conditional distributions which are, by definition, proportional to the joint distribution, the Gibbs sampler is going to spend relatively more (fewer) iterations in areas of higher (lower) density under the joint distribution.

In other words, successive pairs (β, z)1 , … , (β, z)r , … , (β, z)R produced by iterations of the Gibbs sampler are locally dependent in the sense that pairs produced in successive iterations are more similar to each other than pairs produced further apart, where distance is measured in iteration counts of the Gibbs sampler. However, all pairs provide exchangeable information about the joint posterior distribution. We can therefore approximate posterior expected loss (see Eq. 5), and any other aspect of the posterior distribution we may be interested in, by the corresponding expectation computed from the Gibbs output. For example, the posterior probability that a particular regression coefficient is larger than zero, i.e., \( P\left({\beta}_k>0|\mathbf{y},\bullet \right)={\int}_0^{\infty }p\left({\beta}_k|\mathbf{y},\bullet \right)d{\beta}_k \), would be estimated from the Gibbs output as \( \frac{1}{R}{\sum}_{r=1}^R\mathbf{1}\, \left({\beta}_k^r>0\right) \). Note that we control the degree of accuracy of these approximations by the length of the Gibbs sample R.

The particular Gibbs sampler described here is implemented as routine rbprobitGibbs in the R-package bayesm (Rossi et al. 2005) and dates back to Albert and Chib (1993). The routine comes with an example that illustrates input and output (Another bayesm routine, rbiNormGibbs, nicely illustrates how the Gibbs sampler explores a two-dimensional joint distribution by successively sampling from the corresponding two conditional distributions).

Data augmentation. In this application of the Gibbs sampler, the interest really is in the marginal posterior distribution of probit regression coefficients, i.e., p(β|y, •), and Gibbs sampling from the joint posterior distribution of β and z is just a means to obtaining the marginal distribution of interest. Drawing from p(z|y, β) is therefore referred to as “data augmentation” in the literature. Data augmentation often helps transform Bayesian inference problems that involve “unknown” distributions, i.e., distributions without a normalizing constant in closed form, into problems that only involve sampling from distributions with known normalizing constants through conditioning. Canonical examples for the successful application of this technique are the multinomial probit model (McCulloch and Rossi 1994), the multivariate probit model (Edwards and Allenby 2003), mixture models (see e.g., Allenby et al. 1998; Frühwirth-Schnatter et al. 2004; Lenk and DeSarbo 2000; Otter et al. 2004), and hierarchical models in general.

From the perspective of Gibbs sampling, there is no distinction between (unobserved) aspects of the data, unobserved parameters, or any unobservable we can derive a conditional distribution for, within the confines of the Bayesian model under investigation. However, before one gets too excited about the possibilities of inference about any unobservable, it is useful to reflect about how much we can learn about β and z from the data in this example.

While it is possible to attain perfect posterior knowledge about β in this model in the limit of an infinitely large sample, it is impossible to ever learn the particular set of z’s that generated the data. This information is lost forever when moving from the data generating z to the observed y based on the indicator function yi = 1(zi > 0). We have one observation yi to learn about each zi. This observation only set-identifies zi, i.e., indicates whether zi < 0 or zi > 0. In addition, \( \mathcal{N}\left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) \), which can be viewed as a hierarchical prior for the zi, cannot degenerate, i.e., cannot deliver a perfect prediction, by the definition of the probit likelihood. Any finite valued xi′β allows for yi = 1 and yi = 0, even if one of the two outcomes is extremely unlikely.

As such, we are severely limited in what we can learn about the data generating z no matter how many probit observations become available or what subjective prior parameters β0 and Σ0 we use. Thus, it is generally useful to distinguish between unobservables that can be consistently estimated in a particular model and unobservables that cannot, before further using the output from the Gibbs sampler. Here “consistently” means that we can think of amounts of data, i.e., likelihood information, or a subjective prior setting that translates into a degenerate posterior distribution which concentrates all its mass in one point. For example, it would be foolish to believe that using the posterior distribution of z could somehow further improve decisions informed by the data y and the model at hand, which depend on p(β|y), only.

Blocking. One could replace step 2 in the Gibbs sampler above by a Gibbs cycle through the full conditional distributions of each element βk in β, i.e., p(βk|β−k, z, •), where β−k is short for all but the k-th element (These conditional densities are easily derived from the joint conditional normal distribution in Eq. 16 using linear regression theory).

Because any corresponding complete set of conditional distributions uniquely determines the joint distribution, this alternative sampler again delivers draws from the same joint posterior distribution p(β, z|y, •). However, the local dependence between successive pairs (β, z)1, … , (β, z)r, … , (β, z)R produced by iterations of this alternative Gibbs sampler is relatively higher. This is because two successive cycles through p(βk|β−k, z, •) for all k elements deliver draws of β that are more similar in expectation than two draws from p(β|z, •), which are independently distributed.

Replacing a cycle like that through p(βk|β−k, z, •) for all k elements by a direct draw from the corresponding conditional joint distribution, in this case p(β|z, β0, Σ0), in a Gibbs sampler is referred to as “blocking,” or “grouping” (e.g., Chen et al. 2000). In general, blocked Gibbs samplers deliver more additional information about the posterior distribution per incremental iteration than unblocked samplers, which is intuitive considering direct iid-sampling from the joint posterior distribution as the theoretical limit of blocking. As such, blocked samplers also deliver pairs (β, z)r in proportion to their joint posterior density in a finite sample based on fewer iterations, i.e., they converge faster from arbitrary starting values.

Another technical aspect is the order in which to successively draw from the blocks of a Gibbs sampler. The theory of Gibbs sampling implies that the order does not matter and in fact a random ordering is easiest to motivate theoretically (see e.g., Roberts 1996, p. 51). However, in our particular example, repeated draws from step 2, i.e., p(β|z, •), or step 3, i.e., p(z|y, β), without switching to the respective other block in between are a perfect waste of time because these draws are conditionally iid. Furthermore, randomly switching to step 2 before updating all elements of z in step 3 is inefficient because step 2 pools information across all z. The updated pooled information is then “redistributed” across all z when drawing from p (z|y, β) in step 3.

Conditional posterior distributions. Next I show how to derive the full conditional distributions that define the Gibbs sampler for the probit model above (see also Gilks 1996). Recall that by specifying a prior distribution and a likelihood function, we implicitly specify the joint distribution of unobservables and the data (see Eq. 1). Starting from the joint distribution of the data and unobservables in our example, i.e.,
$$ p\left(\mathbf{y},\mathbf{z},\boldsymbol{\beta} |{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)=p\left({y}_1,\dots, {y}_n,{z}_1,\dots, {z}_n,{\beta}_1,\dots, {\beta}_K|{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right) $$
we can derive any conditional distribution of interest using elementary probability calculus. Omitting the conditioning arguments β0 and Σ0 for clarity of notation we have for example
$$ {\displaystyle \begin{array}{l}p\left({z}_1|{y}_1,\dots, {y}_n,{z}_2,\dots, {z}_n,{\beta}_1,\dots, {\beta}_K\right)\\ {}\quad =\frac{p\left({y}_1,\dots, {y}_n,{z}_1,\dots, {z}_n,{\beta}_1,\dots, {\beta}_K\right)}{\int p\left({y}_1,\dots, {y}_n,{z}_1,\dots, {z}_n,{\beta}_1,\dots, {\beta}_K\right)\;{dz}_1}\end{array}} $$
(30)
which does not look simple or useful yet. However, based on an understanding of how the model operates as a DGP, we can greatly simplify this expression. It is in this sense that Bayesian inference exactly reverses the steps that we believe generated the data.
Recall the latent utility interpretation of the probit likelihood function. Given β, latent utilities z are generated independently from \( \mathcal{N}\left(\mathbf{X}\boldsymbol{\beta }, {\mathbf{I}}_n\right)={\prod}_{i=1}^n\mathcal{N}\left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) \). Then the signs of the elements in z independently determine the data y according to indicator functions yi = 1 (zi > 0) for all i = 1, … , n. Based on this understanding of the conditional independence relationships in the DGP, we can rewrite and simplify Eq. 30 as follows:
$$ {\displaystyle \begin{array}{l}p\left({z}_1|{y}_1,\dots, {y}_n,{z}_2,\dots, {z}_n,{\beta}_1,\dots, {\beta}_K\right)\\ {}\quad =\frac{p\left({\beta}_1,\dots {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)}{\int p\left({\beta}_1,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right){dz}_1}\\ {}\quad =\frac{p\left({\beta}_1,\dots, {\beta}_K\right){\prod}_{i=2}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)p\left({y}_1|{z}_1\right)p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right)}{p\left({\beta}_1,\dots, {\beta}_K\right){\prod}_{i=2}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)\int p\left({y}_1|{z}_1\right)p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right){dz}_1}\\ {}\quad =\frac{p\left({y}_1|{z}_1\right)p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right)}{\int p\left({y}_1|{z}_1\right)p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right){dz}_1}=\frac{p\left({y}_1|{z}_1\right)p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right)}{p\left({y}_1|{\beta}_1,\dots, {\beta}_K\right)}\\ {}\quad \propto p\left({y}_1|{z}_1\right)p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right)\propto p\left({z}_1|{y}_1,{\beta}_1,\dots, {\beta}_K\right)\end{array}} $$
(31)

The last line in Eq. 31 follows from the fact that both y1 and β1,…,βK are conditioning arguments, i.e., fixed (for the moment). A useful interpretation of the final result, and in fact a way to derive the result almost instantly, is that the (conditional) posterior of z1 is proportional to the “likelihood” of z1, i.e., \( p\left({y}_1|{z}_1\right)=\mathbf{1}{\left({z}_1>0\right)}^{y_1}\mathbf{1}{\left({z}_1<0\right)}^{1-{y}_1} \), times a “prior probability” of z1, i.e., \( p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right)=\mathcal{N}\left({z}_1|{\mathbf{x}}_1^{\prime}\boldsymbol{\beta}, 1\right) \). In other words, the (conditional) posterior is proportional to the probability of everything that directly depends on z1, i.e., the probability of z1’s “children,” times the probability of z1 given everything z1 directly depends on, i.e., z1’s “parents.” (The terminology “children” and “parents” is owed to the representation of joint distributions and their conditional independence relationships in the form of directed acyclic graphs (see e.g., Pearl 2009, p. 12))

Using the same logic, we can derive the full conditional density of, e.g., the first element in β:
$$ {\displaystyle \begin{array}{l}p\left({\beta}_1|{y}_1,\dots, {y}_n,{z}_1,\dots, {z}_n,{\beta}_2,\dots, {\beta}_K\right)\\ {}\quad =\frac{p\left({\beta}_1,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)}{\int p\left({\beta}_1,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)d{\beta}_1}\\ {}\quad =\frac{p\left({\beta}_2,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)p\left({\beta}_1|{\beta}_2,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)}{p\left({\beta}_2,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)\int p\left({\beta}_1|{\beta}_2,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)d{\beta}_1}\\ {}\quad =\frac{p\left({\beta}_1|{\beta}_2,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)}{\int p\left({\beta}_1|{\beta}_2,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)d{\beta}_1}\\ {}\quad =\frac{p\left({\beta}_1|{\beta}_2,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)}{\prod_{i=1}^np\left({z}_i|{\beta}_2,\dots, {\beta}_K\right)}\\ {}\quad \propto p\left({\beta}_1|{\beta}_2,\dots, {\beta}_K\right)\prod \limits_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)\propto p\left({\beta}_1|{z}_1,\dots, {z}_n,{\beta}_2,\dots, {\beta}_K\right)\end{array}} $$
(32)

Therefore, the full conditional posterior of β1 does not depend on the observed data y, conditional on z. Again we find that the conditional posterior is proportional to the product of the (conditional) prior p(β1|β2, … , βK) times the “likelihood,” i.e., the probability of everything that directly depends on β1 in the DGP, i.e., \( {\prod}_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right) \). Note that both factors in this product involve normal distributions, and drawing all elements of β jointly from p(β1, … , βK|z1, … , zn), as in Eq. 25, is simple if the joint prior distribution of β is multivariate normal.

Bayesian prediction. We just saw that in a Bayesian model conditional posterior distributions derive from the joint density of the data and the parameters defined by the Bayesian model, i.e., the combination of a likelihood function with a prior distribution for its parameters. Now consider the problem of making predictions from the perspective of expanding this joint density to include the unobserved data response yu. In the context of our exemplary Bayesian model, we move from p (y, z, β) to p (yu, zu, y, z, β) noting that the former is obtained from the latter by integration with respect to (yu, zu).
$$ {\displaystyle \begin{array}{ll}& p\left({y}^u,{z}^u|{y}_1,\dots, {y}_n,{z}_1,\dots, {z}_n,{\beta}_1,\dots, {\beta}_K\right)\\ {}& =\frac{p\left({\beta}_1,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)p\left({y}^u|{z}^u\right)p\left({z}^u|{\beta}_1,\dots, {\beta}_K\right)}{\int p\left({\beta}_1,\dots, {\beta}_K\right){\prod}_{i=1}^np\left({y}_i|{z}_i\right)p\left({z}_i|{\beta}_1,\dots, {\beta}_K\right)p\left({y}^u|{z}^u\right)p\left({z}^u|{\beta}_1,\dots, {\beta}_K\right)d\left({y}^u,{z}^u\right)}\\ {}& =\frac{p\left({y}^u|{z}^u\right)p\left({z}^u|{\beta}_1,\dots, {\beta}_K\right)}{\int p\left({y}^u|{z}^u\right)p\left({z}^u|{\beta}_1,\dots, {\beta}_K\right)d\left({y}^u,{z}^u\right)}\\ {}& =p\left({y}^u|{z}^u\right)p\left({z}^u|{\beta}_1,\dots, {\beta}_K\right)\end{array}} $$
(33)

For predicting a pair (yu, zu) conditional on β, we are thus back at data generation, i.e., we get a draw zu from \( \mathcal{N}\left({z}^u|{\left({\mathbf{x}}^u\right)}^{\prime}\boldsymbol{\beta}, 1\right) \) and determine yu according to the sign of zu. The predictive probability p(yu = 1|β) can be simulated as \( \frac{1}{R}{\sum}_{r=1}^R\mathbf{1}\left({\left({z}^u\right)}^r>0\right) \) or computed using Eq. 17.

However, predictions conditional on a particular value of β are rarely of interest or relevant because, with finite data and nondegenerate priors, β will only be known up to a posterior distribution. As a consequence, p(yu|y, β) ≠ p(yu|y), where the latter derives from p(yu, y) = ∫p(yu, y, β) dβ, which in turn is defined as ∫p(yu, zu, y, z, β)d(zu, z, β). The corresponding predictive probability marginalized with respect to the latent utility zu and the parameters β, i.e., p(yu = 1|y), can be simulated as:
$$ p\left({y}^u=1|\mathbf{y}\right)\approx \frac{1}{R}\sum \limits_{r=1}^R\mathbf{1}\left({\left({z}^u\right)}^r>0\right),\quad {\left({z}^u\right)}^r\sim \mathcal{N}\left({z}^u|{\left({\mathbf{x}}^u\right)}^{\prime }{\boldsymbol{\beta}}^r, 1\right) $$
(34)
or more efficiently as:
$$ p\left({y}^u=1|\mathbf{y}\right)\approx \frac{1}{R}\sum \limits_{r=1}^R\Phi \left({\left({\mathbf{x}}^u\right)}^{\prime }{\beta}^r\right) $$
(35)
in the sense that the approximation to p(yu = 1|y) in Eq. 35 delivers the same accuracy as that in Eq. 34 based on a relatively smaller R. The sample (β1, … , βR) to be averaged over is obtained by Gibbs sampling from the posterior distribution p(β|y). Note that because of the nonlinearity of the probit likelihood, \( p\left({y}^u=1|\mathbf{y}\right)\ne p\left({y}^u=1|\hat{\boldsymbol{\beta}}\right) \), where \( \hat{\boldsymbol{\beta}} \) is some point estimate. Specifically, probabilities larger (smaller) than 0.5 will be over- (under-) estimated if posterior uncertainty in β is ignored.

To better appreciate this generally important point, it is useful to simulate probit data following the example given with the rbprobitGibbs routine in the R-package bayesm, to sample from the corresponding posterior using rbprobitGibbs, and then to simulate and compare predictions for different xu as explained above. For a comparison with predictions at a frequentist point estimate, the R-command glm(…, family=binomial(link="probit"), …) is useful.
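A hedged sketch of these computations, assuming betadraw holds R post burn-in draws of β (as rows) from rbprobitGibbs and xu is a new covariate vector (object names are illustrative):

eta <- as.vector(betadraw %*% xu)                 # (x^u)' beta^r for r = 1, ..., R
mean(rnorm(length(eta), mean = eta, sd = 1) > 0)  # Eq. 34: simulate latent utilities
mean(pnorm(eta))                                  # Eq. 35: average choice probabilities
pnorm(sum(colMeans(betadraw) * xu))               # plug-in at the posterior mean: more extreme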

Conditional posterior distributions in hierarchical models. Hierarchical models estimate a distribution of response coefficients, e.g., \( {\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N\sim p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right) \), from a collection of i = 1, … , N time series Y = (y1, … ,yN)′ where \( {\mathbf{y}}_i={\left({y}_{i,1},\dots, {y}_{i,t},\dots, {y}_{i,{T}_i}\right)}^{\prime } \). \( p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right) \) forms a hierarchical prior distribution. The difference from a purely subjective prior distribution is that the sample of time series observations contains likelihood information about the parameters τ that index the hierarchical prior. In other words, upon placing a subjective prior distribution on τ, the likelihood information contained in the collection of time series will update this prior distribution to the posterior distribution p(τ|Y).

It should be noted that in these models marginal posteriors for individual level coefficients, i.e., p(βi|Y), will be biased or “shrunk” towards the hierarchical prior distribution when Ti is relatively small or, more precisely, when the individual level likelihood information in p(yi|βi) is limited relative to the information about βi in the hierarchical prior. And it is precisely this situation that motivates the use of hierarchical models in the first place.

However, parameters τ indexing the hierarchical prior can be estimated consistently, and in many marketing applications where the behavior of the particular consumers in the estimation sample is just a means to learning about optimal actions in the population these consumers belong to, p(τ|Y) is the main target of inference.

The currently popular algorithms for Bayesian inference in a hierarchical model take advantage of the following decomposition of the joint distribution of the data and the parameters which is characteristic, if not definitive of a hierarchical model:
$$ p\left(\mathbf{Y},{\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N,\boldsymbol{\tau} \right)=p\left(\mathbf{Y}|{\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N\right)p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right)p\left(\boldsymbol{\tau} \right) $$
(36)
An important consequence of this decomposition is that, by the rules developed earlier, the conditional posterior distribution of τ does not involve the data Y as conditioning argument:
$$ p\left(\boldsymbol{\tau} |\mathbf{Y},{\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N\right)=p\left(\boldsymbol{\tau} |{\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N\right)\propto p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right)p\left(\boldsymbol{\tau} \right) $$
(37)

For many popular and useful choices of p(τ), Eq. 37 results in a conjugate update, i.e., a conditional distribution in the form of a known distribution we can directly sample from. Perhaps the most prominent example is the model that takes \( p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right)={\prod}_{i=1}^N\mathcal{N}\left({\boldsymbol{\beta}}_i|\overline{\boldsymbol{\beta}},{\mathbf{V}}_{\boldsymbol{\beta}}\right) \) and uses a so-called Normal-Inverse Wishart prior for \( p\left(\overline{\boldsymbol{\beta}},{\mathbf{V}}_{\boldsymbol{\beta}}\right) \), a model that is sometimes rather confusingly referred to as “the H(ierarchical)B(ayes)-model.” Examples are the routines rhierBinLogit, rhierLinearModel, rhierMnlRwMixture, and rhierNegbinRw in the R-package bayesm (Rossi et al. 2005) that implement this hierarchical prior (or its finite mixture generalization in the case of rhierMnlRwMixture) for collections of time series of binomial logit, linear, multinomial logit, and negative binomial observations, respectively.
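To illustrate, a hedged sketch of the conjugate draw in Eq. 37 for this model, using the bayesm routine rmultireg with an intercept-only design; betas denotes the N × K matrix of current draws of β1, … , βN, and the prior settings shown are illustrative:

library(bayesm)
N <- nrow(betas); K <- ncol(betas)
out <- rmultireg(Y = betas, X = matrix(1, N, 1),
                 Bbar = matrix(0, 1, K),   # prior mean for betabar
                 A = 0.01 * diag(1),       # prior precision for betabar
                 nu = K + 3, V = diag(K))  # Inverse Wishart prior for V_beta
betabar <- out$B      # one draw of the mean of the hierarchical prior
Vbeta   <- out$Sigma  # one draw of the covariance of the hierarchical prior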

One interpretation of this approach towards inference for the parameters in the hierarchical prior is that it relies on the so-called random effects \( {\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N \) as augmented data, similar to the augmentation of latent utilities in the probit model discussed earlier. Different authors have argued that this approach may be suboptimal depending on the amount of likelihood information at the individual level and the amount of unobserved heterogeneity in (β1, … , βN) (see e.g., Chib and Carlin 1999; Frühwirth-Schnatter et al. 2004). However, practical alternative approaches that apply beyond the special case of conditionally normal individual level likelihood functions coupled with a (conditionally) normal hierarchical prior have yet to be developed.

In the common situation where \( p\left(\mathbf{Y}|{\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N\right)={\prod}_{i=1}^Np\left({\mathbf{y}}_i|{\boldsymbol{\beta}}_i\right) \) and similarly \( p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right)={\prod}_{i=1}^Np\left({\boldsymbol{\beta}}_i|\boldsymbol{\tau} \right) \), we obtain the following conditional posterior distribution for βi.
$$ p\left({\boldsymbol{\beta}}_i|{\mathbf{y}}_i,\boldsymbol{\tau} \right)\propto p\left({\mathbf{y}}_i|{\boldsymbol{\beta}}_i\right)p\left({\boldsymbol{\beta}}_i|\boldsymbol{\tau} \right) $$
(38)
p(βi|τ) acts as a usually rather informative prior for βi here. However, as already discussed, τ is not subjectively set but estimated from the data.

For many individual level likelihood functions of interest in marketing, and perhaps most prominently for the multinomial logit likelihood, the product on the right hand side of Eq. 38 does not translate into a known distribution. The Metropolis-Hastings algorithm, a solution to generating draws from distributions with unknown normalizing constants, is discussed next. Finally, if sampling from the distribution in Eq. 38 is computationally expensive, the combination of Eqs. 37 and 38 suggests scope for parallel sampling from the latter for i = 1, … , N, feeding back the updated (β1, … , βN) as conditioning arguments into Eq. 37, and so on.

Metropolis-Hastings. The Gibbs sampler solves the problem posed by a (joint) posterior distribution with unknown normalizing constant if there is a corresponding set of conditional posterior distributions with known normalizing constants. The Gibbs sampler is extremely powerful and in some sense universal if one is content with approximations to the posterior on a discrete grid (Ritter and Tanner 1992). However, a general technique to sample from distributions with unknown normalizing constants, known as the Metropolis-Hastings (MH) algorithm, further facilitates real world applications of Bayesian inference substantially. A practically important example in marketing is Bayesian inference for models defined by type-I extreme value error (T1EV) likelihoods, e.g., logit-models, coupled with normal prior distributions for the (regression) coefficients in the likelihood.

The MH-sampler generates a dependent sample from some posterior p(θ|y) according to the following transition rule:
$$ \alpha =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast}\right)p\left({\boldsymbol{\theta}}^{\ast}\right)q\left({\boldsymbol{\theta}}^r\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^r\right)q\left({\boldsymbol{\theta}}^{\ast}\right)}\right),\qquad {\boldsymbol{\theta}}^{\ast}\sim q $$
(39)
$$ p\left({\boldsymbol{\theta}}^{r+1}|\mathbf{y},{\boldsymbol{\theta}}^r\right)=\left\{\begin{array}{ll}\alpha & {\boldsymbol{\theta}}^{r+1}={\boldsymbol{\theta}}^{\ast}\\ {}1-\alpha & {\boldsymbol{\theta}}^{r+1}={\boldsymbol{\theta}}^r\end{array}\right. $$
(40)

On iteration r, the MH-sampler thus transitions from the current “state” or parameter value θr to a new state θ* with probability α. With probability 1 – α, the current state at iteration r + 1 equals that at iteration r, i.e., θr+1 = θr (Compute α according to Eq. 39, preferably on the log-scale, exponentiate, and compare the result to a draw u from a standard uniform distribution. If u < α, move to θ*; else stay at θr to obtain θr+1). The so-called candidate value or state θ* is sampled from the known “candidate generating” or “proposal” density q. Note that the unknown normalizing constant p(y) =  ∫ p(y| θ)p(θ)dθ cancels from Eq. 39.
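A minimal sketch of one such transition in R, computed on the log scale; logpost and logq are assumed user-supplied functions returning log p(y|θ) + log p(θ) and log q(θ), respectively, and rq draws one candidate from q (names are illustrative):

mh_step <- function(theta_r, rq, logpost, logq) {
  theta_star <- rq()  # draw a candidate from the proposal density q
  log_alpha <- logpost(theta_star) - logpost(theta_r) +
               logq(theta_r) - logq(theta_star)
  # log(runif(1)) <= 0, so the min(1, .) in Eq. 39 is handled implicitly
  if (log(runif(1)) < log_alpha) theta_star else theta_r
}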

A remarkable property of this transition rule is that it defines a Markov chain or process with invariant or stationary distribution equal to the posterior distribution p(θ|y). (A Markov process is a stochastic process in which the future, i.e., the (r + 1)-th value, only depends on the value attained in the r-th iteration. All values taken before, at the (r − 1)-th, (r − 2)-th, and so on iteration, are irrelevant for predicting or generating the (r + 1)-th value.) In practice, this implies that subject to rather weak conditions for the proposal density q, repeated application of the transition rule in Eq. 40 eventually delivers draws from the posterior distribution of the model under investigation, independently of the choice of initial or starting value θr=0. In other words, after discarding, say, the first b values θ1, … , θr, … , θb generated by b applications of Eq. 40 starting from θ0, we can use the remaining R − b draws as a representative sample of the posterior distribution.

To better appreciate this point, define the parameter space as countable such that we can replace integration by summation over (a potentially infinite number of) countable sets (This is a technicality to avoid measure theoretic complications associated with events “of probability measure zero,” and without loss of generality. The event that a continuous parameter takes a particular value, for example, is an event of probability measure zero because any ε-environment around that value – no matter how small – contains uncountably infinitely many values), and consider a condition known as “detailed balance”:
$$ {\displaystyle \begin{array}{rcl}p\left({\boldsymbol{\theta}}_i|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_j\right)\alpha \left({\boldsymbol{\theta}}_i\to {\boldsymbol{\theta}}_j\right)&=& p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_i\right)\alpha \left({\boldsymbol{\theta}}_j\to {\boldsymbol{\theta}}_i\right)\\ {}p\left({\boldsymbol{\theta}}_i|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_j\right)\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}_j\right)p\left({\boldsymbol{\theta}}_j\right)q\left({\boldsymbol{\theta}}_i\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}_i\right)p\left({\boldsymbol{\theta}}_i\right)q\left({\boldsymbol{\theta}}_j\right)}\right)&=& p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_i\right)\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}_i\right)p\left({\boldsymbol{\theta}}_i\right)q\left({\boldsymbol{\theta}}_j\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}_j\right)p\left({\boldsymbol{\theta}}_j\right)q\left({\boldsymbol{\theta}}_i\right)}\right)\\ {}\min \left(p\left({\boldsymbol{\theta}}_i|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_j\right),p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_i\right)\right)&=& \min \left(p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_i\right),p\left({\boldsymbol{\theta}}_i|\mathbf{y}\right)q\left({\boldsymbol{\theta}}_j\right)\right)\end{array}} $$
(41)
where the last line establishes that the first two equalities hold. Now rewrite the first line of Eq. 41 as follows:
$$ \frac{p\left({\boldsymbol{\theta}}_i|\mathbf{y}\right)}{p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)}=\frac{q\left({\boldsymbol{\theta}}_i\right)\alpha \left({\boldsymbol{\theta}}_j\to {\boldsymbol{\theta}}_i\right)}{q\left({\boldsymbol{\theta}}_j\right)\alpha \left({\boldsymbol{\theta}}_i\to {\boldsymbol{\theta}}_j\right)} $$
(42)
Equation 42 makes apparent that the probability of proposing and accepting the move from θj to θi relative to the probability of proposing and accepting the reverse move in the MH algorithm is equal to the ratio of posterior probabilities of the respective target values. Because Eq. 42 holds for all θi, θj ∈ Θ, where Θ is the parameter space defined by the model under investigation, we have:
$$ {\displaystyle \begin{array}{l}\sum \limits_{{\boldsymbol{\theta}}_i}\frac{p\left({\boldsymbol{\theta}}_i|\mathbf{y}\right)}{p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)}=\sum \limits_{{\boldsymbol{\theta}}_i}\frac{q\left({\boldsymbol{\theta}}_i\right)\alpha \left({\boldsymbol{\theta}}_j\to {\boldsymbol{\theta}}_i\right)}{q\left({\boldsymbol{\theta}}_j\right)\alpha \left({\boldsymbol{\theta}}_i\to {\boldsymbol{\theta}}_j\right)}\\ {}\quad p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)={\left(\sum \limits_{{\boldsymbol{\theta}}_i}\frac{q\left({\boldsymbol{\theta}}_i\right)\alpha \left({\boldsymbol{\theta}}_j\to {\boldsymbol{\theta}}_i\right)}{q\left({\boldsymbol{\theta}}_j\right)\alpha \left({\boldsymbol{\theta}}_i\to {\boldsymbol{\theta}}_j\right)}\right)}^{-1}\\ {}\sum \limits_{{\boldsymbol{\theta}}_j}p\left({\boldsymbol{\theta}}_j|\mathbf{y}\right)=\sum \limits_{{\boldsymbol{\theta}}_j}{\left(\sum \limits_{{\boldsymbol{\theta}}_i}\frac{q\left({\boldsymbol{\theta}}_i\right)\alpha \left({\boldsymbol{\theta}}_j\to {\boldsymbol{\theta}}_i\right)}{q\left({\boldsymbol{\theta}}_j\right)\alpha \left({\boldsymbol{\theta}}_i\to {\boldsymbol{\theta}}_j\right)}\right)}^{-1}\\ {}\qquad \qquad =1\end{array}} $$
(43)

Equation 43 makes it intuitive that the collection of moves away from θj and moves returning to θj by the MH sampler eventually represents the posterior support for θj and, because this holds for all values θj, the entire posterior support. The “eventual” part of this statement comes from the fact that we may start off the sampler at a parameter value θj = θ0 in a region of the parameter space Θ with extremely small posterior probability, i.e., in some extreme tail of the posterior distribution. As the MH sampler, perhaps very slowly, i.e., using many iterations depending on the proposal density q, navigates the posterior, moving into regions of the parameter space with higher posterior support, the draws along the path to that region over-represent the posterior support for these draws in any finite MH sample. This explains why the first b iterations of the MH sampler that deliver the sequence θ1, … , θr, … , θb from the arbitrary initial starting value θ0 need to be discarded as burn-in for the sequence θb+1, … , θb+r, … , θR to be representative of the posterior distribution.

Convergence. Unfortunately, there is no simultaneously practical and reliable way to assess the length of the burn-in sample b. I strongly recommend that users of so-called Markov-Chain-Monte-Carlo (MCMC) techniques, which encompass the Gibbs sampler and the MH sampler, as well as collections and combinations of these techniques, always take the time to check the convergence behavior of a particular algorithm using simulated data, no matter if the algorithm was designed by someone else or is newly developed and coded from scratch. In this process, three additional advantages emerge from working with simulated data. First, it forces the researcher to be absolutely clear about his understanding of the data generating process. Second, it delivers an understanding of what informative and less informative data are. Third, it helps with assessing the influence of subjective prior choices.

The investigation of convergence behavior relies on time series plots of posterior quantities of interest where “time” is measured in iterations of the MCMC sampler. We want these time series plots to look stationary, at least after projecting to the loss from different actions. In other words, at least time series plots of \( \mathcal{L} \)(a, θr) need to have converged to stationary sequences over the first b iterations of the sampler. Obviously, the series of \( \mathcal{L} \)(a, θr) will converge if the series of parameter draws θr converges. However, it sometimes may be easier to assess convergence in \( \mathcal{L} \)(a, θr) than in θr because the latter often is a high-dimensional object in applied work. In addition, strong posterior dependence between elements of the parameter vector θ may mask convergence to a stable predictive distribution. Interesting examples are “fundamentally over-parameterized” models in the sense that even an infinite amount of data only likelihood-identifies lower dimensional projections of the parameters (see e.g., McCulloch and Rossi 1994; Edwards and Allenby 2003; Wachtel and Otter 2013) (As discussed in section “Bayesian Essentials” above, a proper prior distribution effectively guarantees that the posterior distribution is proper, independent of what can be identified from the likelihood). However, strong posterior dependence between elements of θr is not limited to fundamentally over-parameterized models.

If an MCMC explores the posterior distribution quickly (“mixes well”), it will yield a representative sample of the posterior distribution in fewer iterations than an MCMC that explores the posterior distribution more slowly (“does not mix well”). The mixing behavior of an MCMC has implications for the required length of the burn-in sample b. If a chain mixes well, we can choose vastly different starting values and we will quickly lose the ability to distinguish among chains that use different starting values based on summaries of draws. The information in the draws from the posterior all chains converge to will swamp the initial differences between chains. Reliable formal tests of convergence implemented in the R-package CODA (Plummer et al. 2006), for example, build on this idea. However, when a chain mixes well, the researcher will (almost always) see this when exploring the posterior sample generated by the MCMC graphically. And because chains that mix well converge quickly, this limits the need for formal testing. In applied work, it thus is a priority to make sure that the MCMC employed mixes well. This brings us back to the role of simulated data in the development and testing of numerically intensive inference routines such as MCMC. I will give practical examples further below.

Construction of proposal densities. The proposal density q needs to be known in the sense that we need to generate draws from it. In general, we also need to be able to evaluate the proposal density, i.e., to compute q(θ), when computing α in Eq. 39. However, normalizing constants can be omitted because they cancel from the ratio in α. The best proposal density possible is the posterior distribution itself. Setting q(θ) = p(θ|y), the acceptance probability α becomes
$$ {\displaystyle \begin{array}{l}\alpha =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast}\right)p\left({\boldsymbol{\theta}}^{\ast}\right)p\left({\boldsymbol{\theta}}^r|\mathbf{y}\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^{\ast }|\mathbf{y}\right)}\right)\quad {\boldsymbol{\theta}}^{\ast}\sim p\left(\boldsymbol{\theta} |\mathbf{y}\right)\\ {}\, =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast}\right)p\left({\boldsymbol{\theta}}^{\ast}\right)p\left(\mathbf{y}|{\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^r\right)p\left(\mathbf{y}\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^r\right)p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast}\right)p\left({\boldsymbol{\theta}}^{\ast}\right)p\left(\mathbf{y}\right)}\right)\\ {}\, =\min \left(1,\frac{p\left(\mathbf{y}\right)}{p\left(\mathbf{y}\right)}\right)\\ {}\, =1\end{array}} $$
(44)

However, the reason for using the MH sampler in the first place is that we cannot directly sample from the posterior distribution (Note that one can think of the Gibbs sampler as a cycle through MH steps with conditional proposal densities equal to the conditional posterior distributions). Nevertheless, it is sometimes possible to construct proposal densities as close approximations to the posterior distribution. An example is the routine rmnlIndepMetrop in the R-package bayesm (Rossi et al. 2005) that uses a normal approximation to the likelihood to construct a multivariate t-distributed proposal centered at a penalized maximum likelihood estimate.

An obvious requirement for the proposal density is that the parameter set over which the proposal density q has positive support Θq is equal to, or a superset of the parameter set over which the posterior distribution has positive support, i.e., Θp(θ| y) ⊆ Θq. If the proposal density q is such that parameter values that have positive support under the posterior distribution can never be reached, an MH sampler using this proposal density cannot possibly deliver draws that are representative of the posterior distribution.

Conversely, if the proposal density extends beyond the support of the posterior, i.e., Θp(θ| y) ⊂ Θq, proposals to move into a region of the parameter space that is not supported under the posterior will simply be rejected. The corresponding acceptance probability α is equal to zero (see Eq. 39).

A related, less obvious but nevertheless practically important requirement for the proposal density is that it should have more mass in its tails relative to the posterior distribution. The reason is that a concentrated proposal density may effectively fail to navigate the entire posterior distribution, in a way similar to a proposal that is only defined over a subset of the parameter space. A tricky aspect of thin-tailed proposal densities concentrated in an area where the posterior distribution is relatively flat is that time series plots of any finite number of MH draws may fail to indicate that the sampler has not converged, i.e., the plots may indicate convergence over a range of parameters that is not representative of the entire posterior distribution.

A simple recipe for specifying a proposal that necessarily has more mass in the tails relative to the posterior distribution is to define q as a random walk (RW), i.e., θ* = θr + ϵ with q(ϵ) defined such that q(ϵ) = q(−ϵ) for all ϵ. This recipe works for continuous and discrete distributions, and both for multivariate and univariate posterior distributions, in principle. Based on a RW proposal, the MH acceptance probability α simplifies to
$$ {\displaystyle \begin{array}{l}\alpha =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast}\right)p\left({\boldsymbol{\theta}}^{\ast}\right)q\left({\boldsymbol{\theta}}^r\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^r\right)q\left({\boldsymbol{\theta}}^{\ast}\right)}\right),\, {\boldsymbol{\theta}}^{\ast }={\boldsymbol{\theta}}^r+\boldsymbol{\epsilon}, \boldsymbol{\epsilon} \sim q\\ {}\, =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\mathrm{r}}+\boldsymbol{\epsilon} \right)p\left({\boldsymbol{\theta}}^r+\boldsymbol{\epsilon} \right)q\left({\boldsymbol{\theta}}^{\ast }-{\boldsymbol{\theta}}^r\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast }-\boldsymbol{\epsilon} \right)p\left({\boldsymbol{\theta}}^{\ast }-\boldsymbol{\epsilon} \right)q\left({\boldsymbol{\theta}}^r-{\boldsymbol{\theta}}^{\ast}\right)}\right)\\ {}\, =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\mathrm{r}}+\boldsymbol{\epsilon} \right)p\left({\boldsymbol{\theta}}^r+\boldsymbol{\epsilon} \right)q\left(\boldsymbol{\epsilon} \right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast }-\boldsymbol{\epsilon} \right)p\left({\boldsymbol{\theta}}^{\ast }-\boldsymbol{\epsilon} \right)q\left(-\boldsymbol{\epsilon} \right)}\right)\\ {}\, =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}^{\ast}\right)p\left({\boldsymbol{\theta}}^{\ast}\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^r\right)}\right)\end{array}} $$
(45)
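In R, one such RW step reduces to the following sketch, with logpost as before and sigma_rw the tuning parameter discussed below (names are illustrative):

theta_star <- theta_r + rnorm(length(theta_r), 0, sigma_rw)  # symmetric proposal
if (log(runif(1)) < logpost(theta_star) - logpost(theta_r)) theta_r <- theta_star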

However, in many applications, the dimensionality of the parameter space is too large for a RW proposal that attempts to move all parameters simultaneously in one “big” MH step to work well. Conditional independence relationships in the DGP can be exploited to break one big MH step into a collection of MH steps of smaller dimensionality following the same logic that we used earlier to decompose the joint posterior distribution into a set of more manageable conditional posterior distributions for the Gibbs sampler.

In fact, the MH sampler delivers draws from conditional posterior distributions automatically if we propose to only change an individual element of the parameter vector, say θk:
$$ {\displaystyle \begin{array}{l}\alpha =\min \left(1,\frac{p\left(\mathbf{y}|{\boldsymbol{\theta}}_{-k}^r,{\theta}_k^{\ast}\right)p\left({\boldsymbol{\theta}}_{-k}^r,{\theta}_k^{\ast}\right)q\left({\boldsymbol{\theta}}^r\right)}{p\left(\mathbf{y}|{\boldsymbol{\theta}}^r\right)p\left({\boldsymbol{\theta}}^r\right)q\left({\boldsymbol{\theta}}_{-k}^r,{\theta}_k^{\ast}\right)}\right),\quad {\theta}_k^{\ast}\sim q\left({\theta}_k|{\boldsymbol{\theta}}_{-k}\right)\\ {}\, =\min \left(1,\frac{p\left({\theta}_k^{\ast }|\mathbf{y},{\boldsymbol{\theta}}_{-k}^r\right)q\left({\theta}_k^r|{\boldsymbol{\theta}}_{-k}^r\right)q\left({\boldsymbol{\theta}}_{-k}^r\right)}{p\left({\theta}_k|\mathbf{y},{\boldsymbol{\theta}}_{-k}^r\right)q\left({\theta}_k^{\ast }|{\boldsymbol{\theta}}_{-k}^r\right)q\left({\boldsymbol{\theta}}_{-k}^r\right)}\right)\\ {}\, =\min \left(1,\frac{p\left({\theta}_k^{\ast }|\mathbf{y},{\boldsymbol{\theta}}_{-k}^r\right)q\left({\theta}_k^r|{\boldsymbol{\theta}}_{-k}^r\right)}{p\left({\theta}_k|\mathbf{y},{\boldsymbol{\theta}}_{-k}^r\right)q\left({\theta}_k^{\ast }|{\boldsymbol{\theta}}_{-k}^r\right)}\right)\end{array}} $$
(46)

The second line in Eq. 46 follows from the application of Bayes’ theorem (see Eq. 1 and note that normalizing constants \( \int p\left(\mathbf{y}|{\boldsymbol{\theta}}_{-k}^r,{\theta}_k\right)p\left({\boldsymbol{\theta}}_{-k}^r,{\theta}_k\right)\;d{\theta}_k \) cancel) and the decomposition of the joint proposal density into a conditional times a marginal. However, it is wasteful not to exploit conditional independence relationships that often vastly simplify the computation of the ratio in Eq. 46 for particular conditional posterior distributions (see e.g., the conditional posterior distribution in Eq. 31).

Moreover, unobservables that are conditionally independent a posteriori should always be drawn in separate MH steps, upon introducing the respective conditioning argument. It would be wasteful to constrain the sampler to either accept a joint move of all these unobservables to the respective candidate values or to reject the entire move and to repeat all respective values from iteration r. The conditional posterior \( p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)={\prod}_{i=1}^np\left({z}_i|\boldsymbol{\beta}, {y}_i\right) \) discussed earlier in the context of the binomial probit likelihood serves as an example.

The practical advantage of working with full conditional distributions as the basis for MH-RW sampling is that the proposal densities qk(ϵ) are univariate. As a consequence, we only need to determine the concentration of these distributions around ϵk = 0, which corresponds to \( {\theta}_k^{\ast }={\theta}_k^r \). When attempting to make multivariate proposals with the goal to move more than one element of the parameter vector in one step, a simple multivariate RW proposal of the form \( q\left(\boldsymbol{\upepsilon} \right)={\prod}_{k=1}^Kq\left({\upepsilon}_k\right) \) may suggest moves in directions with minimal support under the posterior, which will result in θr + 1 = θr for many iterations. Thus, setting up an MCMC as a repeated cycle through conditional MH steps facilitates the definition of suitable proposal densities. This is analogous to conditioning leading to known distributions in the Gibbs sampler, which can be viewed as a special case of MH sampling (see Eq. 44).

For continuous parameters the default choice for qk (ϵ) is \( N\left(0,{\sigma}_k^2\right) \), where the parameter \( {\sigma}_k^2 \) is subject to “tuning” by the analyst. For an integer parameter, ϵ = (η + 1)s could be used, where η is Poisson distributed with tuning parameter λ, and s takes values in {−1, 1} with probability 0.5 each (For strictly categorical parameters with no ordering among their values, the notion of a random walk is not defined. However, because of the finite prior support of such parameters, it is possible to use discrete uniform proposal distributions. Because all values have the same probability under a uniform distribution, the proposal distributions again cancel from the ratio in the acceptance probability α).
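
To make these two proposal types concrete, the following minimal R sketch implements both (the function names are mine, not from the Appendix). Both proposals are symmetric, so q cancels from the acceptance ratio:

```r
# Univariate Gaussian random-walk proposal for a continuous parameter:
# theta* = theta^r + epsilon, epsilon ~ N(0, sigma_k^2)
propose_continuous <- function(theta_r, sigma_k) {
  theta_r + rnorm(1, mean = 0, sd = sigma_k)
}

# Random-walk proposal for an integer parameter:
# epsilon = (eta + 1) * s, eta ~ Poisson(lambda), s in {-1, 1} with prob. 0.5
propose_integer <- function(theta_r, lambda) {
  theta_r + (rpois(1, lambda) + 1) * sample(c(-1L, 1L), 1)
}
```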

The tuning parameter implicitly specifies an average size of ϵ and thus an average distance between \( {\theta}_k^{\ast } \) and \( {\theta}_k^r \) (also known as the step-size of the proposal distribution). Values of ϵ that are small in absolute value result in \( {\theta}_k^{\ast } \) close to \( {\theta}_k^r \) that are more likely to be accepted, i.e., \( {\theta}_k^{r+1}={\theta}_k^{\ast } \), whereas values of ϵ that are large in absolute value will more likely result in \( {\theta}_k^{r+1}={\theta}_k^r \) when applying Eq. 40. If the total number of iterations R to run the MH sampler were of no concern, any setting of the tuning parameters that results in nondegenerate qk (ϵ) would result in valid posterior inferences based on applications of Eq. 40.

However, both ϵ that are too small on average and ϵ that are too large on average will result in MH samplers that require a larger number of total iterations R to deliver the same amount of information about the posterior distribution as “optimally sized” ϵ. The situation is analogous to studying a population based on sampling. Larger samples result in more reliable inference, and some sampling techniques result in higher statistical efficiency than others based on the same number of observations. Here, the population is the posterior distribution, the proposal density plays the role of the sampling plan, and importantly the sample size R is under our control, within the limits set by computational speed and time.

When the tuning parameter is set such that ϵ is too small on average, the MH sampler will explore the posterior extensively in local neighborhoods and navigate the entire posterior over many, many small steps, creating “large swings” such that time series plots look like those of financial indices that can move in one direction for extended periods of time, in this case potentially for tens of thousands of iterations. The consequence is that the chain may appear as if it does not converge to a stationary distribution at all.

When the tuning parameter is set such that ϵ is too large on average, the chain will remain at the same value for many iterations and may fail to move at all, i.e., never accept setting \( {\theta}_k^{r+1}={\theta}_k^{\ast } \). However, if it at least moves sometimes, such a chain will arrive at a region of relatively large posterior support in large jumps and tend to stay there. In that sense, ϵ that are too large – provided that the chain moves at all – are the lesser evil. However, any reliable statements about posterior uncertainty based on a finite number of MH draws require decently tuned proposal densities. In practice, some experimentation is required; this experimentation is again supported by the analysis of simulated data.

To illustrate, I simulated 500 observations from a binomial-probit model with data generating parameter vector β = (−3, 2, 4). The first coefficient is an intercept and the remaining two are slope coefficients for two randomly uniformly distributed covariates (see the Appendix for the corresponding R-script). The script calls a simple, stylized RW-MH-sampler for a binomial probit model coupled with a multivariate normal prior for the probit coefficients implemented in plain R (see the function rbprobitRWMetropolis in the Appendix).
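
The Appendix contains the actual implementation; the following condensed sketch (with illustrative names and an illustrative weakly informative prior precision of 0.01) conveys the core logic such a sampler might use. Working on the log scale avoids numerical underflow of the likelihood, and the symmetric RW proposal cancels from the acceptance ratio:

```r
# Stylized RW-MH sampler for a binomial probit model with prior beta ~ N(betabar, Aprior^-1)
rwmh_probit <- function(y, X, R, sigma_k, betabar = rep(0, ncol(X)),
                        Aprior = diag(0.01, ncol(X))) {
  k <- ncol(X)
  logpost <- function(beta) {
    eta <- drop(X %*% beta)
    # probit log-likelihood plus multivariate normal log-prior (up to a constant)
    sum(ifelse(y == 1, pnorm(eta, log.p = TRUE), pnorm(-eta, log.p = TRUE))) -
      0.5 * drop(t(beta - betabar) %*% Aprior %*% (beta - betabar))
  }
  draws <- matrix(NA_real_, R, k)
  beta  <- rep(0, k)                           # initialize at beta^{r=0} = (0, ..., 0)
  lp    <- logpost(beta)
  naccept <- 0
  for (r in 1:R) {
    beta_star <- beta + rnorm(k, 0, sigma_k)   # symmetric RW proposal
    lp_star   <- logpost(beta_star)
    if (log(runif(1)) < lp_star - lp) {        # MH accept/reject on the log scale
      beta <- beta_star; lp <- lp_star; naccept <- naccept + 1
    }
    draws[r, ] <- beta
  }
  list(draws = draws, acceptance_rate = naccept / R)
}
```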

I ran the MCMC for 200,000 iterations using a weakly informative prior and initializing the chain at βr=0 = (0,0,0). Fig. 2 shows MCMC-traces of β for four different \( q\left(\boldsymbol{\upepsilon} \right)={\prod}_{k=1}^KN\left(0,{\sigma}_k^2\right) \), i.e., σk = 0.001, σk = 0.005, σk = 0.2, and finally σk = 3 for all k = 1,2,3. These step-sizes translate into average acceptance rates α of RW-proposals of 99%, 97%, 25%, and 0.05% (see Eq. 45). The black, red, and green MCMC-traces correspond to the first, second, and third element of the parameter vector, respectively.
Fig. 2

MH-sampling – different step-sizes, 200,000 iterations

The top-left plot in Fig. 2 depicts the MCMCs that use the smallest step-size investigated here. It presents an example of an MCMC-trace from a sampler that has not converged to delivering samples from the posterior distribution. All three traces exhibit a trend away from zero over the entire course of the 200,000 iterations the sampler was run. Looking at the y-axis, we see that the individual traces are nowhere near the data generating values and reflective of the starting values, even in the last iteration. In an application to real data, we would not know the data generating parameter values to compare against. However, upon seeing something similar to the top-left plot, we would conclude that the sampler has not converged to a stationary distribution yet. Thus, summaries of the full set or any subset of the 200,000 draws in the top-left plot do not represent the posterior distribution.

The traces in the top-right plot were generated with a step-size σk five times larger than that in the top-left plot. We see that the three traces appear to converge to stationarity around iteration 50,000 or so, and we could use summaries of the last 150,000 draws to learn about the posterior distribution. With an even larger σk = 0.2, convergence to the stationary distribution is much quicker (see the bottom-left plot). Finally, when we use σk = 3, the largest MH step-size investigated here, we see that the MCMC relatively quickly jumps into the neighborhood of the data generating β, but sticks to the same parameter value, often for thousands of iterations.

From theory we have that all four MCMC chains investigated here will eventually represent the posterior distribution p(β|y) equally well, when run for an infinite number of iterations. The concept of an infinite number of iterations is not helpful in practice. However, to illustrate convergence of traces even with poorly tuned MH-steps, I ran each chain for 400,000 more iterations. Figure 3 depicts MCMC-traces obtained by stringing together the first 200,000 iterations from Fig. 2 with the subsequent 400,000 for a total of 600,000 iterations. It can be seen that all four MH-samplers converge eventually, even the sampler that uses σk = 0.001.
Fig. 3

MH-sampling – different step-sizes, 600,000 iterations

However, convergence of the MCMC to its stationary distribution is a necessary but not a sufficient criterion for high-quality inferences about the posterior distribution based on any finite sample of MCMC draws. To illustrate this point, Fig. 4 zooms into the last 50,000 iterations of the 600,000 total iterations from each sampler. Intuitively, the collection of draws in the bottom-left contain most information about the posterior, followed by that in the top-right. It is harder to order the collection of draws in the top-left and the bottom-right according to their information content by visual inspection.
Fig. 4

MH-sampling – different step-sizes, last 50,000 of 600,000 iterations

Table 6 summarizes the traces depicted in Fig. 4 numerically, i.e., the last 50,000 draws from each chain. We see reasonable agreement between the chains operating with step-sizes of 0.005, 0.2, and 3 in terms of posterior means. However, the chains with step-sizes of 0.005 and 3 underestimate the posterior standard deviations relative to that with a step-size of 0.2 based on the last 50,000 draws (The posterior standard deviations of MCMC draws measure the posterior uncertainty in the knowledge about the parameters to be estimated. The analogy to frequentist standard errors applies. However, while reasonable estimates of frequentist standard errors may be hard to come by in finite samples, posterior standard deviations are well defined automatically by virtue of using proper priors. In addition, based on a sample from the posterior distribution, posterior standard deviations of functions of parameters are easily computed as the standard deviation of functional values computed at each draw from the posterior). The chain with step-size 0.001, which required about 350,000 draws to converge to stationarity (see the top-left plot in Fig. 3), results in different means and dramatically smaller posterior standard deviations when looking at the last 50,000 draws.
Table 6

Posterior means and standard deviations from the last 50,000 iterations

| Step-size | Mean β0 | Mean β1 | Mean β2 | Std. dev. β0 | Std. dev. β1 | Std. dev. β2 |
|-----------|---------|---------|---------|--------------|--------------|--------------|
| 0.001     | −2.32   | 1.41    | 3.28    | 0.06         | 0.05         | 0.09         |
| 0.005     | −2.86   | 1.85    | 3.84    | 0.16         | 0.25         | 0.21         |
| 0.2       | −2.89   | 1.89    | 3.83    | 0.24         | 0.26         | 0.30         |
| 3         | −2.85   | 1.87    | 3.82    | 0.17         | 0.18         | 0.25         |

Table 7 reports analogous summaries, but now based on the last 250,000 draws (compare Fig. 3). Based on these five times larger samples from the posterior, we see reasonable agreement between chains with step-sizes 0.005, 0.2, and 3 both in terms of posterior means and posterior standard deviations. This again illustrates that MCMC will “always work” if we run the chains for long enough. However, it also illustrates that some MCMCs deliver more information about the posterior than others, holding the number of iterations fixed, and that a valid MCMC chain can be practically useless if it explores the posterior too slowly.
Table 7

Posterior means and standard deviations from the last 250,000 iterations

| Step-size | Mean β0 | Mean β1 | Mean β2 | Std. dev. β0 | Std. dev. β1 | Std. dev. β2 |
|-----------|---------|---------|---------|--------------|--------------|--------------|
| 0.001     | −2.28   | 1.34    | 3.23    | 0.10         | 0.10         | 0.13         |
| 0.005     | −2.92   | 1.90    | 3.89    | 0.23         | 0.27         | 0.29         |
| 0.2       | −2.88   | 1.88    | 3.83    | 0.24         | 0.26         | 0.30         |
| 3         | −2.90   | 1.94    | 3.82    | 0.23         | 0.23         | 0.32         |

Finally, I illustrate the notion of exploring the posterior distribution more quickly (more efficiently) and more slowly (less efficiently) by comparing the RW-MH-chains with step-sizes 0.005, 0.2, and 3 to each other, and to posterior draws from the Gibbs-sampler that relies on data-augmentation discussed earlier (rbprobitGibbs in the R-package bayesm). I focus on the first slope coefficient (the red trace in the figures above), and compute means and standard deviations from batches of 1000 consecutive draws starting from iteration 50,001 until iteration 600,000. The histograms in Figs. 5 and 6 summarize the resulting distributions of 550 (= (600,000–50,000)/1000) batch means and batch standard deviations for RW-MH-chains with step-sizes 0.005, 0.2, and 3 and the Gibbs-sampler.
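
Such batch summaries can be computed along the following lines (a sketch; beta1_draws stands for the saved draws of the first slope coefficient and is not an object defined in the Appendix):

```r
# beta1_draws: vector of 600,000 saved draws of the first slope coefficient
post_burnin <- beta1_draws[50001:600000]          # discard the first 50,000 draws
batches     <- matrix(post_burnin, nrow = 1000)   # 550 columns of 1000 consecutive draws
batch_means <- colMeans(batches)
batch_sds   <- apply(batches, 2, sd)
hist(batch_means, breaks = 40, main = "Batch means")
hist(batch_sds,   breaks = 40, main = "Batch standard deviations")
```
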
Fig. 5

MH-sampling – distribution of batch means from different step-sizes compared to Gibbs-sampling

Fig. 6

MH-sampling – distribution of batch standard deviations from different step-sizes compared to Gibbs-sampling

Assuming the true posterior standard deviation (from a hypothetical infinite run of the MCMC) to be about 0.26 (see Table 7), we would expect the batch means to be distributed normally around the true mean with standard deviation \( .26/\sqrt{1000}\approx .008 \), simply because we cannot learn the exact mean of a nondegenerate posterior distribution from a finite sample. This translates into an interval of ±5 standard errors around the mean with a length of about 0.08. Any excess variation in batch means is evidence of the inefficiency of the employed sampling technology relative to a hypothetical iid-sampler. From the x-axes in Fig. 5, we can see that batch means are distributed much more widely. Intuitively, a single 1000-iterations batch from each of the MCMCs is less informative about the posterior (more likely to summarize information from only parts of the posterior) than 1000 draws from a hypothetical iid-sampler. In addition, if someone had to bet on the inference from a randomly drawn single batch, they would prefer a draw from the Gibbs-sampler (in the bottom-right), followed by a draw from the RW-MH-sampler with step-size 0.2 (in the top-right). The decision between step-sizes 0.005 and 3 is less clear. However, from the wider distribution of batch means, it is obvious that MCMCs with these step-sizes explore the posterior less efficiently.

Finally, the batch standard deviations in Fig. 6 again identify the Gibbs-sampler as most efficient, followed by the RW-MH-chain with step-size 0.2. A randomly drawn batch of 1000 consecutive draws from these samplers is likely to yield a posterior standard deviation close to the posterior standard deviation estimated from all 600,000 − 50,000 = 550,000 draws. In addition, the top-left plot in Fig. 6 demonstrates that each and every single 1000 consecutive iterations batch from the chain with step-size 0.005 substantially underestimates the posterior standard deviation. In contrast, the chain with the (too) large step-size of 3 often suggests no posterior uncertainty at all – when no proposal is accepted in the batch – but does not uniformly underestimate the posterior standard deviation. This again suggests that chains with step-sizes that are too small are potentially more misleading than chains with step-sizes that are too large.

The examples discussed here nicely showcase that the emphasis in applied work should be on using, or devising, sampling schemes that mix well, before even considering the formal assessment of convergence. In a sense, it is almost always obvious from a graphical inspection of MCMC-trace plots whether a sampler that mixes well has converged or not.

For first-time acquaintances with MH sampling, I suggest the following additional coding exercises to develop an intuition for MH-sampling based on personal experience:

  1. Change the function rbprobitRWMetropolis in the Appendix so that it cycles through MH-steps that update individual elements of the parameter vector one at a time from their conditional posterior distributions. Experiment with tuning RW-proposals for each element of the parameter vector independently.

  2. Obtain a copy of the “plain R” version of rbprobitGibbs (version 2.2–5 of bayesm available from the CRAN-archives), replace the part that generates latent utilities z in line 141 with RW-MH steps, and verify with simulated data that this new algorithm works. The setup is generally interesting, because it is a toy version of a hierarchical model with MH-updates at the lower level and conjugate updates of parameters that form the hierarchical prior.

  3. Modify this sampler such that you propose candidate values \( {z}_i^{\ast } \) from their (hierarchical) prior distribution \( \mathcal{N}\left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) \). Note that the proposal and the prior distribution will cancel from the ratio in the MH-acceptance probability α. You will likely see that this sampler does not converge to a posterior distribution p(β|y) anywhere near the data generating values, even though the time series of β1, … , βr, … , βR suggests immediate convergence and superior mixing! This is an example of the drawbacks of a (collection of) proposal densities that do not have enough mass in their tails.

Recent developments. An important recent development in the context of making numerically intensive Bayesian analysis more practical is the No U-turn Sampler (NUTS) by Hoffman and Gelman (2014) which is a self-tuning Hamiltonian-Monte-Carlo sampler (see e.g., Neal 2011). This technique has been implemented in Stan (Carpenter et al. 2017) which interfaces with many popular software environments including R, Python, Matlab, and Stata, for example.

The basic principle of Hamiltonian-Monte-Carlo (HMC) is to leverage Hamiltonian dynamics for a more effective exploration of the posterior. In physics, Hamiltonian dynamics describe the change in location and momentum of an object by differential equations. The solutions to the differential equations yield the location and the momentum of an object at any particular point in time.

In HMC, the locations correspond to values of the q-element parameter vector to be estimated. Each location is associated with a potential energy, and the statistical analogue is the negative of the log-posterior evaluated at these values (Thus, the posterior mode is the point of lowest potential energy we would gravitate to in the absence of “extra” kinetic energy that enables movements away from this point). The analogue to the momentum comes from expanding the parameter space by p additional parameters (where p = q), the negative log-density of which is the statistical analogue of kinetic energy (Thus, again the mode of this density is the point (the momentum vector) with the lowest kinetic energy). Usually, these additional parameters are assumed to be standard normally distributed. However, it should be noted that the p additional parameters and their density are purely technical devices to complete the Hamiltonian, in the same way that proposal distributions in the context of MH-sampling are technical devices to accomplish MH-sampling.

The algorithm first draws a p-element “momentum” vector from standard normal distributions. The momentum vector both defines the direction of the movement away from the current location (parameter value), and the maximum distance that can be realized, as explained next. HMC obeys the principle that the total energy, i.e., the sum of the potential and the kinetic energy is constant in the closed system described by the Hamiltonian, when deriving a new location (and a new momentum) at any point in time (see Eq. 2.12 in Neal 2011). Here time refers to some arbitrary time point after the onset of the momentum that generates a movement away from the current location.

The location-change is a function of the change in kinetic energy and the momentum-change a function of the change in potential energy. Note that the change in potential energy corresponds to the gradient of the negative log-posterior, and the change in kinetic energy to the gradient of the negative log-density of auxiliary momentum variables respectively, in statistical applications. If the differential equations describing the change in position and momentum could be solved exactly, one could solve for the location that is furthest away from the current location that can be reached in the direction of the current draw of the momentum, given its associated kinetic energy, define this as the new location, draw a new momentum vector, and so on.

It is useful to contemplate how such a procedure would explore the posterior. With a fixed distribution of momentum vectors (and corresponding kinetic energies), it would tend to move away more slowly from a pronounced posterior mode, i.e., in smaller steps in expectation, because of the steep increase in potential energy (defined as the negative of the log-posterior) around this mode. Here, the expectation is with respect to the fixed distribution of momentum vectors (and corresponding kinetic energies). Only outlying momentum vectors would supply sufficient kinetic energy to move far into directions of (much) higher potential energy. Conversely, it would tend to move more quickly, i.e., in larger steps in expectation, through areas of high-potential energy (small values of the log-posterior), and in the direction of low potential energy, in expectation. It is therefore somewhat intuitive that such a procedure would result in direct draws from the posterior that could represent the posterior effectively based on a relatively small number of draws. In contrast to RW-MH-sampling, the distance between two successive draws from this procedure would automatically reflect the concentration of the posterior at every value of the parameter space.

However, in practice, the solutions to the differential equations defining the Hamiltonian dynamics need to be approximated in discretized time. Again, time here refers to the time after the onset of the momentum that generates a movement away from the current location, i.e., the current parameter value. A discrete approximation that can be tuned to high accuracy (relative to the exact solution) is leapfrog integration. At each iteration of the HMC, L leapfrog steps that each correspond to a discrete time step of length ϵ are performed. Ideally, the number of steps L and the length of each step ϵ are chosen so that the new location (a new parameter value) is as far away as possible from the current parameter value, given the current draw of the p-element momentum vector and its associated kinetic energy, while keeping the approximation error low. Any remaining approximation error is controlled in a MH-step that compares the value of the Hamiltonian at the new position and the momentum at this position to the value of the Hamiltonian at the old position and the momentum vector that initiated the movement to the new position (In other words, the potential energy at the new location and the (remaining) kinetic energy are compared to the potential energy at the old location and the kinetic energy that brought about the movement to the new location). By the law of conservation of energy in the closed system described by the Hamiltonian, the Hamiltonian would evaluate to the same value if the discrete time approximation were exact.
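
To make the leapfrog recursion concrete, here is a minimal sketch of a single HMC transition in R, assuming a unit mass matrix; U and grad_U denote user-supplied functions for the negative log-posterior and its gradient, and the fixed step-size epsilon and number of steps L are exactly the quantities that NUTS, discussed next, tunes automatically:

```r
hmc_step <- function(q_current, U, grad_U, epsilon, L) {
  q <- q_current
  p <- rnorm(length(q))                  # draw momentum ~ N(0, I); kinetic energy p'p/2
  p_current <- p
  p <- p - 0.5 * epsilon * grad_U(q)     # initial half step for the momentum
  for (l in 1:L) {
    q <- q + epsilon * p                 # full leapfrog step for the position
    if (l < L) p <- p - epsilon * grad_U(q)  # full momentum step, except after last position step
  }
  p <- p - 0.5 * epsilon * grad_U(q)     # final half step for the momentum
  # MH correction for the discretization error: compare total energies (Hamiltonians)
  H_current  <- U(q_current) + 0.5 * sum(p_current^2)
  H_proposed <- U(q) + 0.5 * sum(p^2)
  if (log(runif(1)) < H_current - H_proposed) q else q_current
}
```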

NUTS automatically tunes L, ϵ, and additional parameters that rescale the kinetic energy in different dimensions of the log-posterior to arrive at a highly effective HMC-sampler that does not normally require user intervention. Thus, the researcher can fully concentrate on specifying the model, i.e., the likelihood and the prior, knowing that high quality numerical inference from the implied posterior is available through NUTS. A limitation is that the gradient of the log-posterior needs to be defined, which excludes discrete variables as direct objects of inference. However, in many models, discrete latent variables are introduced as augmented data, such as in models defining a discrete mixture of distributions. In these cases, NUTS could be used to sample from the posterior marginalized with respect to discrete latent variables. Based on the marginal posterior, the posterior distribution of discrete latent variables can be easily derived.
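
To convey how little user intervention is required, the following sketch fits the binomial probit example via the rstan interface; the prior scale of 10 is an illustrative choice of mine, not a Stan default:

```r
library(rstan)

probit_code <- "
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  vector[K] beta;
}
model {
  beta ~ normal(0, 10);            // weakly informative prior
  y ~ bernoulli(Phi(X * beta));    // binomial probit likelihood
}
"

# X, y: design matrix and binary outcomes from the simulation
fit <- stan(model_code = probit_code,
            data = list(N = nrow(X), K = ncol(X), X = X, y = y),
            iter = 2000, chains = 4)
print(fit)
```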

To numerically illustrate the performance of NUTS, I revisit the binomial probit example discussed earlier. I run the NUTS implemented in Stan for 600,000 iterations and compute 550 batch means and batch standard deviations of the first slope coefficient (the red trace in Figs. 2 to 4) from the last 550,000 iterations (see Fig. 7). A comparison between Fig. 7 and Figs. 5 and 6 shows that a randomly drawn batch of 1000 consecutive iterations from NUTS is likely to be a better representation of the posterior than a randomly drawn batch of 1000 consecutive iterations from the samplers discussed earlier, including the Gibbs-sampler. However, it should be noted that each NUTS-iteration is more computationally intensive than one iteration of the MH-sampler investigated. The computational intensity of Gibbs-sampling relative to NUTS in this model depends on the sample size, where larger samples are likely to favor NUTS because of the need to augment latent utilities for all observations when Gibbs-sampling.
Fig. 7

No U-Turn-sampling – distribution of batch means and batch standard deviations

Model Comparison

In the introduction, I mentioned the possibility of determining the dimensionality of a flexibly formulated model using the Bayesian approach. I also alluded to the possibility of making comparisons across different models for the same data, where models may arbitrarily differ in terms of likelihood functions, prior specifications, or both. Here, I will briefly describe the basic principles to this end. Specifically, I will show how the Bayesian approach can deliver consistent evidence for a more parsimonious model. As usual, consistency means convergence to the data generating truth as the sample size increases (When the set of models compared does not contain the model that in fact corresponds to the data generating truth, consistency means convergence to the model that is closest to the data generating truth in a predictive sense).

This contrasts with the classical frequentist approach, where we can only “fail to reject” relatively simpler descriptions of the world, i.e., more parsimonious theories and models in comparison to more complex models. I personally see this as a drawback of the classical frequentist approach because theory aimed at understanding the underlying causal mechanisms of observed associations generally thrives on establishing that particular (direct) causal effects do not exist.

The Bayesian approach towards comparing between two or more alternative models builds – as one may expect – on Bayes’ theorem. Consider a set of models ℳ1, … , ℳK formulated for the same observed data y. Note that this encompasses the possibility that models use different sets of covariates, different likelihood functions, different priors, or may be calibrated including additional or even different data y, as long as they define a predictive density for the same y (For example, Otter et al. (2011) show how to derive a marginal likelihood for demand data in a model that specifies a joint density for supply side variables (that enter the demand model as conditioning arguments) and demand data). Bayesian model comparisons then rest on the posterior probabilities of a model given the (focal) data (Eq. 47).
$$ \Pr \left({\mathrm{\mathcal{M}}}_j|\mathbf{y}\right)=\frac{p\left(\mathbf{y}|{\mathrm{\mathcal{M}}}_j\right)\Pr \left({\mathrm{\mathcal{M}}}_j\right)}{\sum_{k=1}^Kp\left(\mathbf{y}|{\mathrm{\mathcal{M}}}_k\right)\Pr \left({\mathrm{\mathcal{M}}}_k\right)} $$
(47)
Here Pr(ℳk) is the subjective prior probability that model k is the true model, which is often chosen to be 1/K in the absence of better knowledge, and p(y|ℳk) is the so-called marginal likelihood of the data given model k, defined as ∫pk(y|θ)pk(θ)dθ. The subscript k indicates that the likelihood and the prior and thus the “content” of θ can be model dependent. If Pr(ℳk) can be reduced to one and the same constant for all models under consideration, this constant can obviously be ignored in Eq. 47. Then, the comparison between any two models k and j in the set can be based on so-called Bayes’ factors, defined as ratios of marginal likelihoods (Eq. 48).
$$ {BF}_{k,j}=\frac{p\left(\mathbf{y}|{\mathrm{\mathcal{M}}}_k\right)}{p\left(\mathbf{y}|{\mathrm{\mathcal{M}}}_j\right)} $$
(48)

By convention, Bayes Factors larger than 3 count as weak but sufficient evidence in favor of the model in the numerator; Bayes Factors larger than 20 count as strong evidence (Kass and Raftery 1995). I will comment more on this convention later.
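
On the computational side, marginal likelihoods are usually only available on the log scale, and Eq. 47 is then best evaluated with the log-sum-exp trick; a small sketch (the function name is mine):

```r
# Posterior model probabilities (Eq. 47) from log marginal likelihoods
posterior_model_probs <- function(log_marglik, prior_probs = NULL) {
  K <- length(log_marglik)
  if (is.null(prior_probs)) prior_probs <- rep(1 / K, K)   # default: Pr(M_k) = 1/K
  lw <- log_marglik + log(prior_probs)
  lw <- lw - max(lw)               # log-sum-exp trick avoids numerical underflow
  exp(lw) / sum(exp(lw))
}
```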

For example, it would be perfectly alright to compare model k with marginal likelihood ∫pk(y|θ)pk(θ)dθ to a model j that introduces observed conditioning arguments (predictors, covariates) X, i.e., ∫pj(y|X, θ)pj(θ)dθ, or to include model i that uses additional data y′ for calibration in the comparison, based on ∫pi(y|θ)pi(θ|y′)dθ (However, note that ∫pi(y, y′|θ)pi(θ)dθ ≠ ∫pi(y|θ)pi(θ|y′)dθ. The former is a marginal likelihood for the data (y, y′) and not for the data y. Marginal likelihoods for different models can only be directly compared as long as they pertain to the same data). A useful intuition for marginal likelihoods is that they reduce radically different, per se incomparable “stories” about what may have generated the data to densities for the data, which are directly comparable in the same way as we can compare predictions for the same event completely independent of the considerations that gave rise to the prediction.

However, we still need to establish the intuition for how Bayesian model comparisons can possibly consistently support the more parsimonious model. I will do this by returning to the regression example from Eq. 10. Recall that we were able to derive the posterior distribution analytically in this example (see Eq. 16). Exploiting this fact, we obtain an analytical expression for the marginal likelihood of the data under this model as follows:
$$ \begin{aligned} p\left(\mathbf{y}|{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right) &=\frac{p\left(\mathbf{y}|\boldsymbol{\beta} \right)p\left(\boldsymbol{\beta} |{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)}{p\left(\boldsymbol{\beta} |\mathbf{y},{\boldsymbol{\beta}}^0,{\boldsymbol{\Sigma}}^0\right)}\\ &=\frac{{\left(2\pi \right)}^{-n/2}{\prod}_{i=1}^n\exp \left(-\frac{1}{2}{\left({y}_i-{\mathbf{x}}_i^{\prime}\boldsymbol{\beta} \right)}^2\right){\left(2\pi \right)}^{-k/2}{\left|{\boldsymbol{\Sigma}}^0\right|}^{-1/2}\exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)}^{\prime }{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\left(\boldsymbol{\beta} -{\boldsymbol{\beta}}^0\right)\right)}{{\left(2\pi \right)}^{-k/2}{\left|{\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right|}^{1/2}\exp \left(-\frac{1}{2}{\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left({\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right)\left(\boldsymbol{\beta} -\tilde{\boldsymbol{\beta}}\right)\right)}\\ &={\left(2\pi \right)}^{-n/2}{\left|{\boldsymbol{\Sigma}}^0\right|}^{-1/2}{\left|{\mathbf{X}}^{\prime}\mathbf{X}+{\left({\boldsymbol{\Sigma}}^0\right)}^{-1}\right|}^{-1/2}\exp \left(-\frac{\tilde{s}}{2}\right) \end{aligned} $$
(49)

Here, we exploited the fact that p(y| β0, Σ0)p(β| y, β0, Σ0) = p(y| β)p(β| β0, Σ0), by elementary rules of probability. Also note that with the intent to eventually compare across models defined by different likelihoods and priors, we kept track of all normalizing constants that we conveniently ignored before, when deriving the posterior distribution in Eq. 16. Specifically, we previously ignored the factors (2π)−n/2 and (2π)−k/2|Σ0|−1/2 in the likelihood p(y|β) and the prior p(β|β0, Σ0), respectively.

Recall that \( \tilde{s} \) in the last line of Eq. 49 is a deterministic function of the subjective prior parameters β0, Σ0, and the data y (see Eq. 14). For all nondegenerate prior choices, \( \tilde{s} \) is going to be dominated by the term \( {\left(\mathbf{y}-\mathbf{X}\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left(\mathbf{y}-\mathbf{X}\tilde{\boldsymbol{\beta}}\right) \), where \( \tilde{\boldsymbol{\beta}} \) converges to the maximum likelihood or ordinary least squares estimate as more data become available (assuming regular X).
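
The last line of Eq. 49 translates directly into a few lines of R; the following sketch computes the log marginal likelihood for the unit-error-variance regression considered here (the function name is mine):

```r
# Log marginal likelihood of Eq. 49 for y = X beta + e, e ~ N(0, I),
# with subjective prior beta ~ N(beta0, Sigma0)
log_marglik_reg <- function(y, X, beta0, Sigma0) {
  n  <- length(y)
  A0 <- solve(Sigma0)                                  # prior precision (Sigma0)^-1
  A  <- crossprod(X) + A0                              # X'X + (Sigma0)^-1
  beta_tilde <- solve(A, crossprod(X, y) + A0 %*% beta0)
  s_tilde <- drop(crossprod(y) + t(beta0) %*% A0 %*% beta0 -
                    t(beta_tilde) %*% A %*% beta_tilde)
  -n / 2 * log(2 * pi) -
    0.5 * as.numeric(determinant(Sigma0)$modulus) -    # log |Sigma0|
    0.5 * as.numeric(determinant(A)$modulus) -         # log |X'X + (Sigma0)^-1|
    0.5 * s_tilde
}
```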

Now consider the comparison between two models. Model ℳ0 happens to employ the p-column X-matrix that collects all covariates that systematically influenced y, when the data was generated – the true model. Model ℳ1 uses a model matrix that features the same p covariates in X plus s additional covariates in Xs that did not contribute to the variation in y, when the data was generated. The Bayes’ factor BF0,1 is then:
$$ {\displaystyle \begin{array}{l}{BF}_{0,1}=\frac{p\left(\mathbf{y}|{\mathrm{\mathcal{M}}}_0\right)}{p\left(\mathbf{y}|{\mathrm{\mathcal{M}}}_1\right)}\\ {}\qquad =\frac{{\left|{\boldsymbol{\Sigma}}_0^0\right|}^{-1/2}{\left|{\mathbf{X}}^{\prime}\mathbf{X}+{\left({\Sigma}_0^0\right)}^{-1}\right|}^{-1/2}\exp \left(-\frac{{\tilde{s}}_0}{2}\right)}{{\left|{\boldsymbol{\Sigma}}_1^0\right|}^{-1/2}{\left|{\left(\mathbf{X},{\mathbf{X}}^{\mathrm{s}}\right)}^{\prime}\Big(\mathbf{X},{\mathbf{X}}^{\mathrm{s}}\Big)+{\left({\boldsymbol{\Sigma}}_1^0\right)}^{-1}\right|}^{-1/2}\exp \left(-\frac{{\tilde{s}}_1}{2}\right)}\end{array}} $$
(50)
where \( {\boldsymbol{\Sigma}}_0^0 \) and \( {\boldsymbol{\Sigma}}_1^0 \) are of dimension p × p and (p + s) × (p + s), respectively. In the limit of more and more data, \( {\tilde{s}}_0 \) and \( {\tilde{s}}_1 \) will converge to the same value, as the data determine that the elements in \( {\tilde{\boldsymbol{\beta}}}_1 \) that correspond to Xs are equal to zero. Then, the limit of the ratio in Eq. 50 only depends on:
$$ \begin{aligned} &{\left(\frac{\left|{\left(\mathbf{X},{\mathbf{X}}^{\mathbf{s}}\right)}^{\prime}\left(\mathbf{X},{\mathbf{X}}^{\mathbf{s}}\right)\right|}{\left|{\mathbf{X}}^{\prime}\mathbf{X}\right|}\right)}^{1/2}\\ &={\left({n}^{p+s}\left|{n}^{-1}{\left(\mathbf{X},{\mathbf{X}}^{\mathbf{s}}\right)}^{\prime}\left(\mathbf{X},{\mathbf{X}}^{\mathbf{s}}\right)\right|{\left|{n}^{-1}{\mathbf{X}}^{\prime}\mathbf{X}\right|}^{-1}{n}^{-p}\right)}^{1/2}\\ &\approx {n}^{s/2} \end{aligned} $$
which is easily seen to converge to infinity in the limit of more and more data (larger n), for regular (X, Xs) (The expressions n−1(X, Xs)′(X, Xs) and n−1X′X define covariance matrices that converge to fixed matrices as the sample size n grows, for covariates with finite variance). Thus, the Bayes’ factor can in fact produce infinitely strong evidence for the more parsimonious model, if it is the data generating mechanism.

If in contrast ℳ1 were the true model, or just closer to the truth, the coefficients in \( {\tilde{\boldsymbol{\beta}}}_1 \) that correspond to Xs do not converge to zero. As a consequence, \( {\tilde{s}}_0 \) would grow faster in n than \( {\tilde{s}}_1 \), and BF0,1 would converge to zero (Note that \( \exp \left(\frac{-{\tilde{s}}_0+{\tilde{s}}_1}{2}\right)=\exp \left(n\frac{-{\tilde{s}}_0/n+{\tilde{s}}_1/n}{2}\right) \) converges to zero faster than ns/2 grows because of the exponential function, where \( -{\tilde{s}}_0/n+{\tilde{s}}_1/n \) converges to the true difference in average squared errors between ℳ1 and ℳ0). Thus, the Bayes’ factor can both produce increasing evidence for the more parsimonious model, when the constraints imposed by this model hold exactly, and increasing evidence against it, when they do not (consider BF1,0 instead of BF0,1 in this case) (In this case, the conventional classifications of weak and strong evidence in favor of the model in the numerator of the Bayes’ factor often align with the usual cut-off values for rejecting a more constrained model based on p-values). In contrast, p-values can reliably reject a parsimonious model but are incapable of producing increasing evidence for such a model. By construction, the probability of rejecting a true, more parsimonious model in favor of a larger, over-parameterized model is equal to the chosen significance level in repeated applications of the frequentist testing procedure, and independent of the sample size (the amount of information in the data).

Numerical Illustrations

A Brief Note on Software Implementation

Researchers interested in adopting the Bayesian approach nowadays have considerable choice among software implementations. More recently, established products for data analysis such as SPSS, STATA, or SAS have started to include options for Bayesian estimation of well established “standard” statistical models such as ANOVA and generalized linear regression models (Advanced users can certainly use these tools to estimate “their own” models too, and STATA specifically emphasizes this possibility). In contrast, WINBUGS is an example of an attempt to automate Bayesian inference, with the idea that the user should be able to exclusively concentrate on the specification of a model – likely outside of the set of “standard” statistical models implemented elsewhere – aided by a graphical user interface.

Much if not the vast majority of “Bayesian-papers” published in marketing to this day have relied on coding up the model and the (invariably) MCMC-routine to perform Bayesian inference “from scratch,” starting with some example code and taking advantage of components that repeat themselves across different models, e.g., conditionally conjugate updating of parameters indexing hierarchical priors. The programming languages used in this context include compiled languages such as C or Fortran, and interpreted languages such as Matlab, R, and Gauss. Here, the former are by construction less interactive when coding and the latter slower in the execution of code “that works.” Recently, Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013) emerged as an extremely useful compromise between the speed of compiled and the coder-friendliness of interpreted languages.

I am currently relying heavily on Rcpp in my own research. However, I view the advent of the No U-turn Sampler (NUTS) by Hoffman and Gelman (2014) as implemented in Stan (Carpenter et al. 2017) as a major breakthrough towards the goal of focusing on the specification of innovative models (almost) exclusively.

A Hierarchical Bayesian Multinomial Logit Model

At least in marketing, no treatment of Bayesian modeling would be complete without illustrating the benefits from a hierarchical Bayesian model in the context of large N, small T data. I consider the stylized case of multinomial logit choice from choice sets with two inside alternatives, say brands A and B, and an outside option with expected utility normalized to zero. The utility of the two inside alternatives stems, in addition to alternative specific constants, from a uniformly distributed covariate x, i.e., UAit = βAi + βixAit + εAit and UBit = βBi + βixBit + εBit. Here, i = 1, … , N indexes heterogeneous individuals and t = 1, … , T choice occasions. Population preferences are distributed according to:
$$ {\boldsymbol{\beta}}_i=\left(\begin{array}{l}{\beta}_{Ai}\\ {}{\beta}_{Bi}\\ {}\beta \end{array}\right)\sim \mathcal{N}\left(\left[\begin{array}{l}.3\\ {}-2\\ {}-1\end{array}\right],\, \left[\begin{array}{ccc}3& -2.99& 0\\ {}-2.99& 3& 0\\ {}0& 0& .1\end{array}\right]\right) $$

Thus, brand A is slightly preferred to the outside good on average, whereas brand B is less attractive than the outside good to the average consumer in this market. However, there is a fair amount of heterogeneity in brand preferences in this market. For example, about 12.4% of consumers in this market prefer brand B to the outside good, and around 43% prefer the outside good to brand A at x = 0. Moreover, consumers that have an above average preference for brand A are likely to have a below average preference for brand B in this market, as per the strongly negatively correlated brand coefficients in the population (\( \rho =-.997 \)). The tastes for the covariate x are relatively more homogenous, and only consumers in the extreme tail of the preference distribution exhibit a higher preference for larger values of x in this population. I simulate N = 2,000 individuals from this population and have each individual make T = 5 choices from complete choice sets that randomly vary in the x-values for brands A and B, both across t = 1, … , T and i = 1, … , N. I use these data to calibrate a Bayesian hierarchical MNL-model. I rely on the default subjective prior distributions implemented in bayesm’s estimation routine rhierMnlRwMixture and run this RW-MH-sampler with automatic tuning of proposal densities for 100,000 iterations, saving every 10th draw (in bayesm: R = 100,000, keep = 10). The complete posterior is a 6009-dimensional object (3 means plus 3 variances plus 3 covariances plus 2,000 times 3 individual level random effects). Because of the high dimensionality of the posterior, saving every draw from a long MCMC run can easily produce an object that taxes a computer’s RAM heavily. Saving every keep-th draw increases the information content in a posterior sample limited by a computer’s RAM. For a given maximum number of draws that can be saved, we can increase the number of MCMC-iterations R when we simultaneously increase the number of iterations between saved draws (keep − 1). The information content in the resulting sample is increased because saved draws separated by keep − 1 MCMC iterations will tend to be more independent from each other and replicate less of the information contained in the preceding saved draw.
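
A condensed sketch of the estimation call follows; the full data simulation is in the Appendix, and lgtdata is assumed to hold each individual's choices and stacked design matrices in bayesm's format:

```r
library(bayesm)

# lgtdata: list of N = 2000 elements, each list(y = ..., X = ...), where y is the
# T-vector of chosen alternatives (1 = A, 2 = B, 3 = outside good) and X stacks
# the T choice sets' attribute matrices (p rows per choice occasion)
out <- rhierMnlRwMixture(
  Data  = list(p = 3, lgtdata = lgtdata),
  Prior = list(ncomp = 1),             # one-component (multivariate normal) hierarchical prior
  Mcmc  = list(R = 100000, keep = 10)  # save every 10th of 100,000 draws
)

dim(out$betadraw)   # N x 3 x (R / keep): individual-level coefficient draws
```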

Figure 8 exhibits individual level posteriors for individuals 3, 99, and 2000 in our simulated panel data. For this purpose, I use the last 9000 draws of the 10,000 draws I saved. The three rows in Fig. 8 correspond to βA, βB, and β, respectively. Each individual’s posterior is depicted in two different ways in two adjacent columns of Fig. 8, each. The first column summarizes marginal posterior densities using density plots obtained by Gaussian-kernel-smoothing. The second column shows the MCMC-trace plot of draws underlying the density plots in the respective first column. Green solid bars indicate data generating parameter values. Red-dashed bars indicate individual level maximum likelihood estimates obtained from numerical maximization of the likelihood function using the R-function optim (I report numerical estimates regardless of the existence of a finite maximum likelihood estimate given the 5 choices from a specific individual and the corresponding design matrices). If no red-dashed bar is showing, this indicates that the maximum likelihood estimate falls outside of the range of parameter values plotted. To still give an impression of maximum likelihood estimates, the MCMC-trace plots in the respective second columns have the maximum likelihood estimates in the title.
Fig. 8

Individual level posterior distributions

Looking at the maximum likelihood estimates and comparing them to the green bars, we can see that they are extremely inaccurate. Clearly, individual level posterior inference benefits tremendously from the information in the hierarchical prior distribution that the model learns by pooling information across the 2000 consumers in our simulated short panel.

Finally, Fig. 9 illustrates how the hierarchical Bayesian MNL-model recovers the joint distribution of preferences for brands A and B in the population of consumers. We recognize the strongly negative relationship between preferences for brands A and B in the population (However, the posterior mean correlation of −0.88 (0.037) overestimates the data generating correlation of −0.997, which can be traced back to the finite information in the data available for calibration and the subjective priors for population level parameters employed here. See the documentation of rhierMnlRwMixture for details). Thus, if a particular individual level likelihood is only informative about the preference for brand A (B), the corresponding preference for brand B (A) can be inferred rather accurately from the hierarchical prior distribution. Dashed lines in Fig. 9 indicate posterior population means, which nicely recover the data generating values from the information in all N × T choice observations. The code to replicate this illustration is again available in the Appendix.
Fig. 9

Joint posterior distribution of {βAi} and {βBi}

Note that we estimated the model that was used to generate the data here. In applications, it is very likely that some or all subjective choices that go into the formulation of the model result in systematic differences from the data generating mechanism, including the choice of the hierarchical prior distribution that was (implicitly) chosen to be multivariate normal in this illustration. However, it is also clear that even misspecified hierarchical prior distributions can strike a beneficial bias-variance trade-off in applications where individual level maximum likelihood estimates are extremely noisy or may not exist at all. In fact, this bias-variance trade-off is at the source of the inroads Bayesian hierarchical models have made into applications in marketing. For a discussion of how to imbue hierarchical prior distributions with subjective knowledge about ordinal relationships, see Pachali et al. (2018).

Mediation Analysis: A Case for Bayesian Model Comparisons

In this section, I borrow from Otter et al. (2018). Mediation analysis has developed in psychology as a tool to empirically establish the process by which an experimental manipulation brings about its effect on the dependent variable of interest. An important distinction in this context is that between full and partial mediation at a causal theory level. I will not discuss the related model specification questions here but focus on the fact that if an experimentally manipulated cause X and a measured consequence Y become independent when conditioned on a measured mediator M, evidence for (full) mediation is established (This is because conditional independence would only result from models where full mediation is not the causal mechanism at work under very particular, essentially zero-probability circumstances. Results that do not establish some form of conditional independence, which are often interpreted as “partial mediation,” are actually ambiguous with regard to their interpretation (Otter et al. 2018)). The original test for mediation proposed by Baron and Kenny (1986) builds on the connection between full mediation and conditional independence and tests conditional mean independence. Their test rests on the following set of regression equations, where the t’s denote intercepts (see also Fig. 10).
$$ {Y}_i={t}_1+c{X}_i+{\varepsilon}_{Y,i} $$
(51)
$$ {M}_i={t}_2+a{X}_i+{\varepsilon}_{M,i} $$
(52)
$$ {Y}_i={t}_3+{c}^{\ast }{X}_i+{bM}_i+{\varepsilon}_{Y,i}^{\ast } $$
(53)
Fig. 10

Mediation according to Baron and Kenny (1986)

The first equation regresses Y on the randomly assigned experimental variable X. A statistically significant coefficient c establishes empirical support for the total effect from X to Y (see Fig. 10, Panel a). Because of random assignment of X, the coefficient c necessarily measures a causal effect. The second equation regresses M on X. A statistically significant coefficient a establishes empirical support for the effect from X to M that is again causal by experimental design. The third equation regresses Y on randomly assigned X and on observed M. Finding that the effect from X on Y vanishes, when conditioned on M (i.e., that there is no direct effect c*), unequivocally establishes (full) mediation as the causal data generating model (see Fig. 10, Panel b) (In the limit of an infinite amount of data, the estimate of c* will only converge to exactly zero under full mediation. The only alternative process that yields c* = 0 in the limit features M as a joint cause of X and Y without another connection between X and Y. This process is ruled out a priori, when X is experimentally manipulated). Usually, empirical support for the hypothesis of c* = 0 is established based on p-values larger than some subjectively chosen significance level. An obvious drawback of this approach is that p-values, by construction, fail to measure the strength of empirical support for conditional independence, which in turn establishes full mediation. Based on p-values, we can only “fail to reject” the null-hypothesis.
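
In R, the Baron and Kenny procedure amounts to three least-squares regressions (a sketch; Y, X, and M denote the observed vectors):

```r
fit_total    <- lm(Y ~ X)       # Eq. 51: total effect c
fit_mediator <- lm(M ~ X)       # Eq. 52: effect a of X on M
fit_direct   <- lm(Y ~ X + M)   # Eq. 53: direct effect c* and mediator effect b
summary(fit_direct)             # full mediation: "fail to reject" c* = 0
```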

Next, I illustrate the differences between the classical and the Bayesian approach in the context of c* = 0 using a sampling experiment. I thus consider the case of full mediation as DGP. Accordingly, I set t2 = t3 = 1, a = 4, c* = 0, b = 0.5, and σm = σY* = 1 in Eqs. 52 and 53 and generate artificial data sets of different sizes: N1 = 50, N2 = 200, as well as N3 = 2000 (Xi is drawn from a uniform distribution for each i ∈ {1, … , N}). I conduct 1000 replications for each data set size and compute Bayes’ factors defined as ratios of marginal likelihoods of the model \( {\mathrm{\mathcal{M}}}_0:{Y}_i={t}_3+{bM}_i+{\varepsilon}_{Y,i}^{\ast } \) and the model \( {\mathrm{\mathcal{M}}}_1:{Y}_i={t}_3+{c}^{\ast }{X}_i+{bM}_i+{\varepsilon}_{Y,i}^{\ast } \). Note that the former is more restricted than the latter and implies that the coefficient c* is equal to zero in the latter model (see Otter et al. (2018) for the computational details and R-scripts).
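
For a single replication, the Bayes’ factor can be computed with the log_marglik_reg sketch from above, given that the simulation fixes the error variance at one (the prior settings below are illustrative; see Otter et al. (2018) for the actual scripts):

```r
set.seed(1)
N <- 200
X <- runif(N)
M <- 1 + 4 * X + rnorm(N)            # Eq. 52 with t2 = 1, a = 4, sigma_m = 1
Y <- 1 + 0.5 * M + rnorm(N)          # Eq. 53 with t3 = 1, c* = 0, b = 0.5

X0 <- cbind(1, M)                    # design matrix of M0 (no direct effect)
X1 <- cbind(1, X, M)                 # design matrix of M1 (includes c*)
lml0 <- log_marglik_reg(Y, X0, beta0 = rep(0, 2), Sigma0 = diag(100, 2))
lml1 <- log_marglik_reg(Y, X1, beta0 = rep(0, 3), Sigma0 = diag(100, 3))
exp(lml0 - lml1)                     # BF_{0,1}: evidence for c* = 0
```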

Table 8 illustrates the distribution of estimated Bayes Factors over the 1000 simulation replications testing the hypothesis of c* = 0. The results in Table 8 verify that the Bayes Factor correctly favors ℳ0 over ℳ1 for the vast majority of sampling replications. Importantly, this table also illustrates that the Bayes Factor provides increasingly stronger evidence for ℳ0 (i.e., c* = 0) as the sample size increases.
Table 8

Distribution of Bayes’ factors in simulation

|          | Pr(BF > 3) | Pr(BF > 20) | Pr(BF > 100) |
|----------|------------|-------------|--------------|
| N = 50   | 0.94       | 0.04        | 0.00         |
| N = 200  | 0.97       | 0.71        | 0.00         |
| N = 2000 | 0.99       | 0.93        | 0.43         |

The classical testing framework based on p-values fails to measure the strength of evidence in favor of c* = 0. In line with how they are defined, p-values are uniformly distributed on the interval (0, 1) over sampling replications (see Table 9). The probability of observing a p-value smaller than the specified significance level is equal to this level, and independent of the sample size, when the null-hypothesis is actually true. In contrast, the probability of obtaining a Bayes Factor larger than 20 in support of c* = 0 increases in the sample size and, for example, reaches 0.93 for N = 2000 (see Table 8).
Table 9

Distribution of p-values in simulation

|          | Pr(p-value > 0.01) | Pr(p-value > 0.05) | Pr(p-value > 0.10) |
|----------|--------------------|--------------------|--------------------|
| N = 50   | 0.99               | 0.96               | 0.90               |
| N = 200  | 0.99               | 0.96               | 0.90               |
| N = 2000 | 0.99               | 0.96               | 0.90               |

Thus, when the data generating process implies conditional independence (c* = 0), the Bayesian approach is the superior measure of empirical evidence for this process, compared to the approach based on p-values.

Conclusion

Writing a chapter like this one certainly involves many trade-offs. I have chosen to emphasize general principles of Bayesian decision-making and inference in the hope of interesting and exciting readers who have an inclination towards quantitative methodology and are serious about improving marketing decisions. The promise of a deeper appreciation of the Bayesian paradigm, both in terms of its foundations in (optimal) decision-making and in terms of its computational approaches, is better tailored quantitative approaches that can be developed and implemented as required by a new decision problem, or for the purpose of extracting (additional) knowledge from a new data source.

A drawback of this orientation is that the plethora of existing models that are usefully implemented in a fully Bayesian estimation framework, including commonplace prior distributions, is not even enumerated in this chapter. However, I believe that the full appreciation of individual, concrete applications requires a more general understanding of the Bayesian paradigm. Once this understanding develops, an understanding of different individual models follows naturally.

Acknowledgments

I would like to thank Anocha Aribarg, Albert Bemmaor, Joachim Büschken, Arash Laghaie, anonymous reviewers, the editors, and participants in my class on “Bayesian Modeling for Marketing” for helpful comments and feedback. All remaining errors are obviously mine.

References

  1. Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. http://www.jstor.org/stable/2290350
  2. Allenby, G. M., Arora, N., & Ginter, J. L. (1995). Incorporating prior knowledge into the analysis of conjoint studies. Journal of Marketing Research, 32(2), 152–162. http://www.jstor.org/stable/3152044
  3. Allenby, G. M., Arora, N., & Ginter, J. L. (1998). On the heterogeneity of demand. Journal of Marketing Research, 35(3), 384–389. http://www.jstor.org/stable/3152035
  4. Amemiya, T. (1985). Advanced econometrics. Cambridge, MA: Harvard University Press.Google Scholar
  5. Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.  https://doi.org/10.1037/0022-3514.51.6.1173.CrossRefGoogle Scholar
  6. Bernardo, J. M., & Smith, A. F. M. (2001). Bayesian theory. Measurement Science and Technology, 12(2), 221. http://stacks.iop.org/0957-0233/12/i=2/a=702.Google Scholar
  7. Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2), 192–236. http://www.jstor.org/stable/2984812 Google Scholar
  8. Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1), 1–32.  https://doi.org/10.18637/jss.v076.i01. https://www.jstatsoft.org/v076/i01.CrossRefGoogle Scholar
  9. Chen, M.-H., Shao, Q.-M., & Ibrahim, J. G. (2000). Monte Carlo methods in Bayesian computation. New York: Springer. http://gateway.library.qut.edu.au/login?url=http://link.springer.com/openurl?genre=book&isbn=978-1-4612-1276-8.CrossRefGoogle Scholar
  10. Chib, S., & Carlin, B. P. (1999). On MCMC sampling in hierarchical longitudinal models. Statistics and Computing, 9(1), 17–26.  https://doi.org/10.1023/A:1008853808677.CrossRefGoogle Scholar
  11. Eddelbuettel, D. (2013). Seamless R and C+ + integration with Repp. New York: Springer.CrossRefGoogle Scholar
  12. Eddelbuettel, D., & François, R. (2011). Repp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18.  https://doi.org/10.18637/jss.v040.i08. http://www.jstatsoft.org/v40/i08/
  13. Edwards, Y. D., & Allenby, G. M. (2003). Multivariate analysis of multiple response data. Journal of Marketing Research, 40(3), 321–334.  https://doi.org/10.1509/jmkr.40.3.321.19233.CrossRefGoogle Scholar
  14. Fasiolo, M. (2016). An introduction to mvnfast. R package version 0.1.6. https://CRAN.R-project.org/package=mvnfast
  15. Frühwirth-Schnatter, S., Tüchler, R., & Otter, T. (2004). Bayesian analysis of the heterogeneity model. Journal of Business & Economic Statistics, 22(1), 2–15.  https://doi.org/10.1198/073500103288619331.CrossRefGoogle Scholar
  16. Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., & Hothorn, T. (2018). mvtnorm: Multivariate normal and t distributions. https://CRAN.R-project.org/package=mvtnorm. R package version 1.0-8.
  17. Geweke, John. (1991). Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints and the evaluation of constraint probabilities. In: E. M. Keramidas (Ed.), Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 571–578.Google Scholar
  18. Gilks, W. R. (1996). Full conditional distributions. In S. (Sylvia) Richardson, D. J Spiegelhalter, & W. R. (Walter R.) Gilks (Eds.), Markov chain Monte Carlo in practice (pp. 75–88). London/Melbourne: Chapman & Hall.Google Scholar
  19. Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: data mining, inference, and prediction. New York: Springer.CrossRefGoogle Scholar
  20. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.  https://doi.org/10.1080/00401706.1970.10488634.CrossRefGoogle Scholar
  21. Hoffman, M. D., & Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623. http://jmlr.org/papers/vl5/hoffmanl4a.html.Google Scholar
  22. Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.  https://doi.org/10.1080/01621459.1995.10476572.CrossRefGoogle Scholar
  23. Lenk, P. J., & DeSarbo, W. S. (2000). Bayesian inference for finite mixtures of generalized linear models with random effects. Psychometrika, 65(1), 93–119.  https://doi.org/10.1007/BF02294188.CrossRefGoogle Scholar
  24. Lenk, P. J., DeSarbo, W. S., Green, P. E., & Young, M. R. (1996). Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2), 173–191.  https://doi.org/10.1287/mksc.15.2.173.CrossRefGoogle Scholar
  25. Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks: Sage Publications. https://uk.sagepub.com/en-gb/eur/regression-models-for-categorical-and-limited-dependent-variables/book6071.Google Scholar
  26. McCulloch, R., & Rossi, P. (1994). An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1–2), 207–240. https://EconPapers.repec.org/RePEc:eee:econom:v:64:y:1994:i:1-2:p:207-240.CrossRefGoogle Scholar
  27. Mersmann, O., Trautmann, H., Steuer, D., & Bornkamp, B. (2018). truncnorm: Truncated normal distribution. https://CRAN.R-project.org/package=truncnorm. R package version 1.0-8.
  28. Montgomery, A. L., & Bradlow, E. T. (1999). Why analyst overconfidence about the functional form of demand models can lead to overpricing. Marketing Science, 18(4), 569–583. http://www.jstor.org/stable/193243.
  29. Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L. Jones, & X.-L. Meng (Eds.), Handbook of Markov chain Monte Carlo (Chap. 5). Chapman & Hall/CRC. http://arxiv.org/abs/1206.1901.
  30. Orme, B. (2017). The CBC system for choice-based conjoint analysis. Technical Report. https://sawtoothsoftware.com/download/techpap/cbctech.pdf.
  31. Otter, T., Tüchler, R., & Frühwirth-Schnatter, S. (2004). Capturing consumer heterogeneity in metric conjoint analysis using Bayesian mixture models. International Journal of Research in Marketing, 21(3), 285–297. https://doi.org/10.1016/j.ijresmar.2003.11.002.
  32. Otter, T., Gilbride, T. J., & Allenby, G. M. (2011). Testing models of strategic behavior characterized by conditional likelihoods. Marketing Science, 30(4), 686–701. http://www.jstor.org/stable/23012019.
  33. Otter, T., Pachali, M. J., Mayer, S., & Landwehr, J. R. (2018). Causal inference using mediation analysis or instrumental variables – Full mediation in the absence of conditional independence. Marketing ZFP, 40(2), 41–57. https://doi.org/10.15358/0344-1369-2018-2-41.
  34. Pachali, M. J., Kurz, P., & Otter, T. (2018). How to generalize from a hierarchical model? Technical Report. https://ssrn.com/abstract=3018670.
  35. Pearl, J. (2009). Causality: Models, reasoning and inference (2nd ed.). New York: Cambridge University Press.
  36. Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7–11. https://journal.r-project.org/archive/.
  37. Ritter, C., & Tanner, M. A. (1992). Facilitating the Gibbs sampler: The Gibbs stopper and the Griddy-Gibbs sampler. Journal of the American Statistical Association, 87(419), 861–868. https://doi.org/10.1080/01621459.1992.10475289.
  38. Robert, C. P. (1994). The Bayesian choice: A decision-theoretic motivation. New York: Springer.
  39. Roberts, G. O. (1996). Markov chain concepts related to sampling algorithms. In S. Richardson, D. J. Spiegelhalter, & W. R. Gilks (Eds.), Markov chain Monte Carlo in practice (pp. 45–58). London/Melbourne: Chapman & Hall.
  40. Rossi, P. E., McCulloch, R. E., & Allenby, G. M. (1996). The value of purchase history data in target marketing. Marketing Science, 15(4), 321–340. https://doi.org/10.1287/mksc.15.4.321.
  41. Rossi, P. E., Allenby, G. M., & McCulloch, R. E. (2005). Bayesian statistics and marketing. Chichester: Wiley.
  42. Wachtel, S., & Otter, T. (2013). Successive sample selection and its relevance for management decisions. Marketing Science, 32(1), 170–185. https://doi.org/10.1287/mksc.1120.0754.
  43. Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Goethe University Frankfurt am Main, Frankfurt am Main, Germany