Handbook of Market Research, pp. 1-64
Bayesian Models
Abstract
Bayesian models have become a mainstay in the tool set for marketing research in academia and industry practice. In this chapter, I discuss the advantages the Bayesian approach offers to researchers in marketing, the essential building blocks of a Bayesian model, Bayesian model comparison, and useful algorithmic approaches to fully Bayesian estimation. I show how to achieve feasible Bayesian inference to support marketing decisions under uncertainty using the Gibbs sampler, the Metropolis Hastings algorithm, and point to more recent developments – specifically the no-U-turn implementation of Hamiltonian Monte Carlo sampling available in Stan. The emphasis is on the development of an appreciation of Bayesian inference techniques supported by references to implementations in the open source software R, and not on the discussion of individual models. The goal is to encourage researchers to formulate new, more complete, and useful prior structures that can be updated with data for better marketing decision support.
Keywords
Marketing decision-making · Bayesian inference · Gibbs sampling · Metropolis-Hastings · Hamiltonian Monte Carlo · R · bayesm · Stan

Introduction: Why Use Bayesian Models?
Bayesian models have gained popularity over the past 30 years, both among academics in marketing and among marketing research practitioners. There are several reasons for this popularity. First, many marketing problems involve data in the form of relatively short panels with many observational units (large N, small T). Each observational unit, e.g., a respondent, a customer, or a store, supplies only a limited amount of data, but there are many observational units in the data set. In the vast majority of these applications, decision makers know a priori that observational units are heterogeneous in the underlying, at least partially unobserved characteristics that generated the data. Moreover, the successful marketing of differentiated goods, which involves market segmentation, targeting, and positioning, requires measures of heterogeneity in the population of observational units. Estimating separate, independent models for each observational unit results in unreliable estimates, and in many applications individual-level time series are too sparse for individual-level maximum likelihood estimates to even be defined. Hierarchical Bayes models offer a convenient and practical solution to this problem.
Second, the overwhelming majority of marketing data sets involve so-called limited dependent variables, e.g., choices, ratings, rankings, or generally dependent variables that have strongly noncontinuous features. Although a number of non-Bayesian estimators are available for models with such dependent variables (see e.g., Amemiya 1985; Long 1997), the assessment of statistical uncertainty in estimates relies on large sample asymptotic arguments. In marketing, large samples that allow for inference based on asymptotic arguments are the exception, even in an era where big data has become a ubiquitous buzzword. Big data, by definition, involves large data sets. However, the size of the data set usually does not translate into more statistical information about individual target parameters. Big data are always “big” because of their dimensionality spanning across, e.g., tens of thousands of customers, products, and time points, and include a myriad of potentially useful conditioning arguments. The dimensionality of the data at the very source of its size, or “bigness,” regularly translates into similarly high-dimensional models and estimation problems, such that the amount of statistical information about individual target parameters is small yet again. Bayesian models allow for coherent inference even in small samples, or more generally in situations where there is little data-based information about individual parameters. Moreover, a number of relatively simple yet powerful computational algorithms facilitate the estimation of limited dependent variable models.
Third, in marketing, inference about model parameters or more generally about different models, i.e., the statistical assessment of the likely mechanisms that bring about consumers’ and competitors’ behaviors in a market is usually not an end in itself but input to the decisions of marketing managers in companies. The likely benefit from various alternatives for, e.g., product design, product line composition, pricing, or advertising schedules can be expressed as a function of a model and its parameters. However, knowledge of model parameters and generally the model that generated the observed market behaviors will never be perfect. Bayesian modeling facilitates the accurate incorporation of any remaining uncertainty about the mechanism behind observed market behaviors in managerial decisions.
Fourth, computational resources become more powerful and affordable every year, facilitating the estimation of ever more realistic, and thus more complex, models in academic and industry applications. In addition, freely available software such as the R-package bayesm (see Rossi et al. 2005) makes a collection of Bayesian models useful for marketing applications readily accessible. (The latest version of bayesm is written for speed using the R-package Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013). The last complete version mostly written in plain R is version 2.2-5. The R-files are available from the CRAN archives and are often a useful start when developing your own routines.) In fact, one reason for the popularity of Bayesian modeling among market research practitioners has been the adoption of hierarchical Bayes models for inference by companies like Sawtooth Software (Orme 2017), which revolutionized how market research consultants approach the analysis of, for example, choice-based conjoint experiments. Finally, Stan (Carpenter et al. 2017) represents a big step toward freeing creative modeling from having to invest substantial amounts of time in the development of efficient Bayesian estimation routines.
Fifth, because Bayesian estimation simply reverses the data generating process (DGP), it is naturally attractive to researchers who are interested in the development and empirical testing of their own marketing models. Some researchers view the need to specify a complete DGP as a drawback. The argument is that theory is never precise enough to do so, and that this requirement leads to arbitrary choices that unduly impact the inference for quantities the data are more or less directly informative about. The Bayesian response to this criticism is to specify highly flexible DGPs in instances where theory is lacking. This strategy is facilitated by algorithms that adaptively determine a reasonable dimensionality of a flexibly formulated model. This determination is based on statistical evidence that potentially favors a lower-dimensional, simpler model, and not just on a failure to reject that model as in classical hypothesis testing.
All that said, it usually still takes longer to estimate a fully Bayesian model than to compute maximum likelihood estimates, in case the latter exist. I have also heard people "complain" about the amount of information contained in large samples from posterior distributions as produced by modern numerical Bayesian inference tools (compared to a collection of maximum likelihood estimates and their standard errors). However, it seems natural to wait somewhat longer for a more complete answer to a decision problem. And many interesting decision problems cannot be properly addressed based on a collection of maximum likelihood estimates (should they even exist), especially upon realizing that their standard errors cannot be reliably estimated with the data at hand.
Bayesian Essentials
A Bayesian model consists of a likelihood function p(y|θ) that fully specifies the probability of the data y given parameters θ, i.e., the process that generates the data for known parameters. In fact, if the researcher only wants to work with one likelihood function, and is not interested in comparing across different mechanisms that may have generated the data, any function that is proportional to p(y|θ) will do, i.e., all functions that differ from p(y|θ) only by an arbitrary positive constant c are likelihood functions, ℓ(y|θ) ≡ c ⋅ p(y|θ). We will revisit this point later. A simple example is the linear regression model \( y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i \), \( \varepsilon_i \sim \text{iid}\ \mathcal{N}(0, \sigma_{\varepsilon}^2) \), which implies the following likelihood for the data: \( p(\mathbf{y}|\boldsymbol{\beta}, \sigma_{\varepsilon}^2) = \prod_{i=1}^N \mathcal{N}(y_i|\mathbf{x}_i'\boldsymbol{\beta}, \sigma_{\varepsilon}^2) \).
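This likelihood is easy to evaluate directly; below is a minimal sketch, written in Python (the chapter's own examples use R) with made-up data and a known error variance. The function name `reg_loglik` and all numbers are illustrative, not part of the chapter.

```python
import numpy as np

def reg_loglik(y, X, beta, sigma2):
    """Log-likelihood of y_i = x_i' beta + eps_i, eps_i ~ iid N(0, sigma2)."""
    resid = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

# Illustrative (made-up) data: N = 5 observations, intercept plus one covariate
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=5)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=5)

# Any positive rescaling of the likelihood only shifts this value by a
# constant, which is irrelevant for inference given the model.
print(reg_loglik(y, X, beta_true, sigma2=0.25))
```

Working on the log scale here anticipates the numerical-accuracy point made below for the prior density.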
The second component of a Bayesian model is a prior distribution p(θ) for the parameters indexing the likelihood. The notation p(θ) means "the density p evaluated at the value θ". Further, defining the prior distribution as p(θ) implies that θ ~ p, i.e., that θ is (a priori) distributed according to density p, or simply is p-distributed. The notation p(θ) is shorthand because it omits the (subjective prior) parameters indexing the prior distribution. For example, in an application the statement that the prior is a multivariate normal distribution is incomplete. We need to add the information about the prior mean and variance, e.g., \( p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}|{\boldsymbol{\theta}}^0, {\boldsymbol{\Sigma}}^0) \), where \( \mathcal{N}(\boldsymbol{\theta}|{\boldsymbol{\theta}}^0, {\boldsymbol{\Sigma}}^0) \) is the multivariate normal distribution with mean θ^{0} and variance-covariance matrix Σ^{0}, evaluated at θ. The multivariate normal density can be evaluated in R using the command dmvn from the R-package mvnfast (Fasiolo 2016) or the command dmvnorm from the R-package mvtnorm (Genz et al. 2018). Both commands support computations on the log scale, which are essential for numerical accuracy. For example, a log-likelihood value of −2000 can only be distinguished numerically from a log-likelihood value of, say, −2050 on the log scale, because both likelihoods, i.e., exp(−2000) and exp(−2050), evaluate to an "exact" machine zero at currently available machine accuracies.
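For readers outside R, the same log-scale evaluation can be sketched in plain NumPy; `logdmvnorm` below is a hypothetical helper standing in for what dmvn/dmvnorm compute with log = TRUE, and the prior settings are illustrative.

```python
import numpy as np

def logdmvnorm(theta, mean, cov):
    """Log multivariate normal density log N(theta | mean, cov); a NumPy
    stand-in for mvnfast::dmvn(..., log = TRUE) in R."""
    k = len(mean)
    diff = theta - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + quad)

# Log prior density under theta ~ N(theta0, Sigma0), illustrative values
theta0 = np.zeros(3)
Sigma0 = 4.0 * np.eye(3)
print(logdmvnorm(np.array([0.5, -1.0, 2.0]), theta0, Sigma0))

# Why the log scale matters: both of these underflow to an exact machine
# zero in double precision and become numerically indistinguishable ...
print(np.exp(-2000.0) == 0.0, np.exp(-2050.0) == 0.0)  # True True
# ... while their logs differ by 50 and compare without any trouble.
```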
The need to specify prior distributions for Bayesian analysis is often viewed as a drawback of the Bayesian approach. There are several aspects to the specification and the role of the prior distribution in a Bayesian model. First, as suggested by the name, the prior distribution is the formal vehicle to bring prior substantive knowledge to bear on the analysis. And it is sometimes overlooked by critics of the Bayesian approach that such knowledge is already required when specifying the likelihood function. Second, from a purely technical point of view, prior distributions improve the statistical properties of estimators derived from the model (see e.g., Robert 1994, p. 75).
In the regression example, a useful way to probe into prior knowledge is to think about expected changes in y_{i} as a function of changes in x_{i}. Unless the substantive domain the data originates from is unknown, it is extremely likely that the analyst will have some substantive idea about the DGP that should be used in the formulation of prior distributions. In the event where the analysis is a follow-up on previous statistical analyses in the same or a related domain, the choice of prior can build on these results. An example would be market research companies that more or less continuously study demand in a set of markets.
With the specification of a prior distribution, the analyst expresses his beliefs about which parameter values are more likely than others, and by how much, based on his existing substantive understanding of the DGP. If the analyst specifies a prior such that parameters in a relatively small subset of the parameter space are much more likely than other parameters, the prior is usually referred to as an informative prior. The most extreme case of an informative prior is a distribution that concentrates all its mass on a single parameter value. Such a prior is called degenerate. Degenerate priors constrain parameters to take particular values known a priori. Conversely, the prior is weakly informative or diffuse if there is no discernible concentration of prior mass on subsets of the parameter space. However, unless the parameter space is bounded in all directions, as, e.g., in the case of a parameter measuring a probability, it is impossible to put exactly equal prior weight on all parameter values without violating the requirement that the prior needs to be in the form of a probability density function. (A function p(θ) is a probability density function if ∫p(θ)dθ = 1.) Priors that fulfill this requirement are also referred to as proper priors, and priors that do not are improper or literally noninformative. Finally, if the prior puts zero mass on subsets of the parameter space, e.g., zero mass on positive price coefficients in a demand model, it is called a constrained prior.
In marketing applications, the loss usually does not directly depend on θ but on the implied data \( \hat{\mathbf{y}} \), usually some manifestation of demand, i.e., \( \mathcal{L}(a, \boldsymbol{\theta}) = \int \mathcal{L}(a, \hat{\mathbf{y}})\, p(\hat{\mathbf{y}}|\boldsymbol{\theta}, a)\, d\hat{\mathbf{y}} \). The notation \( p(\hat{\mathbf{y}}|\boldsymbol{\theta}, a) \) covers the relevant case where the actions under investigation are conditioning arguments to the DGP. A well-known example is finding the coupon strategy that maximizes net revenues, i.e., minimizes the loss defined as negative net revenues, in Rossi et al. (1996).
The denominator in Eq. 1, p(y), is known as the marginal likelihood of the data y or the normalizing constant of the posterior distribution p(θ|y). As we will see in section "Bayesian Estimation", knowledge of this quantity is not required for Bayesian inference given a particular model. However, statements about quantities of interest θ in probability form require that 0 < p(y) < ∞. Only if this condition is met will the posterior p(θ|y) be in the form of a probability density function, i.e., ∫p(θ|y)dθ = 1.
The fundamental appeal of being able to make probability statements about quantities of interest θ is the seamless integration with decision-making based on the expected utility from a set of possible actions. Note that the posterior expected loss in Eq. 2 will only usefully distinguish between different actions a if the posterior p(θ|y) integrates to 1, i.e., is a valid probability density function. It should be recognized that a proper prior distribution p(θ) essentially guarantees that we can make these probability statements, independent of any data deficiencies that may be present. A Bayesian model therefore quantifies how much the data, through the likelihood, add to our prior understanding of a DGP by comparing the prior distribution p(θ) to the posterior distribution p(θ|y). This is different from the classical question of what models or model parameters the data can identify.
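A toy grid calculation makes the link between posterior probabilities and expected loss concrete. The sketch below (Python; the chapter works in R) uses an illustrative normal-mean setup with known unit variance, a standard normal prior, and quadratic loss; none of the numbers come from the chapter.

```python
import numpy as np

# Grid approximation to p(theta | y) ∝ p(y | theta) p(theta): normal mean
# theta with known unit variance, standard normal prior, three observations.
theta = np.linspace(-5.0, 5.0, 2001)
y = np.array([0.8, 1.2, 1.0])

log_prior = -0.5 * theta**2                        # N(0, 1) prior, up to a constant
log_lik = sum(-0.5 * (yi - theta)**2 for yi in y)  # likelihood, up to a constant
post = np.exp(log_prior + log_lik)
post /= post.sum()                                 # normalize: probabilities sum to 1

def expected_loss(a):
    """Posterior expected quadratic loss of action a."""
    return np.sum((a - theta)**2 * post)

# Under quadratic loss the optimal action is the posterior mean, which for
# this conjugate setup is (sum of y) / (n + 1) = 3 / 4.
print(np.sum(theta * post))                        # ~0.75
print(expected_loss(0.75) < expected_loss(0.0))    # True
```

The normalization step is exactly where the requirement ∫p(θ|y)dθ = 1 enters: without it, expected losses of different actions could not be compared on a common scale.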
Table 1 Credit card design matrix
 | Brand | | | Interest | | | Annual fee | | | Credit limit | |
---|---|---|---|---|---|---|---|---|---|---|---|---
# | Master | Visa | Discover | 18% | 15% | 12% | $0 | $10 | $20 | $1000 | $2500 | $5000
1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
5 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
6 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
7 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
8 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
Table 2 Credit card model matrix
 | | Brand | | Interest | | Annual fee | | Credit limit |
---|---|---|---|---|---|---|---|---|---
 | Constant | Visa | Discover | 15% | 12% | $10 | $20 | $2500 | $5000
# | x_{0} | x_{1} | x_{2} | x_{3} | x_{4} | x_{5} | x_{6} | x_{7} | x_{8}
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
5 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
6 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
7 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
8 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
It is easy to verify that the nine β-coefficients in this model are not jointly likelihood-identified, because there are only eight observations. This can be viewed as a toy example of the increasingly common situation where the number of (potential) explanatory variables exceeds the number of observations, including big data that owe their size to the number of variables in addition to the number of observations (are "broader" than they are "long"). In such data sets, a purely data-based distinction between connections from explanatory variables to the dependent variable is no longer possible, even if all explanatory variables come from independent processes a priori.
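The rank deficiency is easy to check numerically; a quick sketch (in Python rather than the chapter's R) using the model matrix from Table 2:

```python
import numpy as np

# Model matrix from Table 2: constant x0 plus the eight contrasts x1..x8
X = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1, 0],
])

# Nine columns but only eight observations: the rank cannot exceed 8,
# so the nine beta-coefficients are not jointly likelihood-identified.
print(X.shape, np.linalg.matrix_rank(X))   # (8, 9) 8
```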
Table 3 Correlations between design columns
 | Visa | Discover | 15% | 12% | $10 | $20 | $2500 | $5000
---|---|---|---|---|---|---|---|---
Visa | 1 | |||||||
Discover | −0.45 | 1 | ||||||
15% | −0.33 | 0.15 | 1 | |||||
12% | 0.15 | −0.07 | −0.45 | 1 | ||||
$10 | 0.15 | −0.07 | 0.15 | −0.07 | 1 | |||
$20 | −0.33 | 0.15 | −0.33 | 0.15 | −0.45 | 1 | ||
$2500 | 0.15 | −0.07 | 0.15 | −0.07 | −0.07 | 0.15 | 1 | |
$5000 | 0.15 | −0.07 | 0.15 | −0.07 | −0.07 | 0.15 | −0.6 | 1 |
Table 4 Design column dependence – regression analysis
 | Constant | Visa | Discover | 15% | 12% | $10 | $20 | $2500 | $5000
---|---|---|---|---|---|---|---|---|---
Visa | 0 | – | 0 | −1 | 0 | 0 | −1 | 1 | 1 |
Discover | 0.5 | −0.5 | – | 0 | 0 | 0 | 0 | 0 | NA |
15% | 0 | −1 | 0 | – | 0 | 0 | −1 | 1 | 1 |
12% | 0.5 | 0 | 0 | −0.5 | – | 0 | 0 | 0 | NA |
$10 | 0.5 | 0 | 0 | 0 | 0 | – | −0.5 | 0 | NA |
$20 | 0 | −1 | 0 | −1 | 0 | 0 | – | 1 | 1 |
$2500 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | – | −1 |
$5000 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | −1 | – |
For example, the last line of Table 4 implies the following deterministic equation from regressing the covariate “$5,000 credit limit” on the remaining covariates in Table 2: x_{8} = 0 + 1x_{1} + 0x_{2} + 1x_{3} + 0x_{4} + 0x_{5} + 1x_{6} – 1x_{7}. The contrasts x_{1}, x_{3}, and x_{6} involving “Visa,” “15% interest,” and “$20 annual fee” are therefore positively confounded with the contrast involving “$5,000 credit limit,” and this latter contrast is negatively confounded with the contrast x_{7} involving “$2,500 credit limit.”
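This deterministic relation can be checked directly against the relevant columns of Table 2; a short verification (Python sketch):

```python
import numpy as np

# Columns x1 ("Visa"), x3 ("15%"), x6 ("$20"), x7 ("$2500"), and
# x8 ("$5000"), read off the eight rows of Table 2
x1 = np.array([0, 0, 1, 0, 0, 0, 1, 0])
x3 = np.array([0, 0, 0, 0, 0, 1, 0, 1])
x6 = np.array([0, 1, 0, 1, 0, 0, 0, 0])
x7 = np.array([0, 0, 0, 1, 0, 0, 1, 1])
x8 = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# The regression relation from the last line of Table 4 holds exactly:
# x8 = x1 + x3 + x6 - x7
print(np.array_equal(x8, x1 + x3 + x6 - x7))   # True
```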
Now, what are the implications for modeling the variation in the preference measures y as a function of covariates? In order to arrive at a likelihood-identified regression model, we need to reduce the number of covariates (the number of columns in Table 2) such that the resulting X-matrix is of full column rank, and the inverse of X′X is well defined. As a general rule, we can always throw out covariates that are independent of all covariates we would like to keep in the model, without biasing our inference for the influence of the latter. Throwing out such covariates, at worst, increases the unexplained variance. In this example, no covariate fulfills this criterion by the mere fact that we have too many covariates to choose from, relative to the number of observations.
As a second general rule, we can eliminate covariates from the model which we strongly believe (know a priori) to have no (direct) effect on the dependent variable. We can do so regardless of how such covariates are related to covariates we would like to keep in the model, for unbiased inference about the influence of the latter.
However, if we eliminate a covariate that actually has a direct effect on the dependent variable that is not independent of all covariates we would like to keep in the model, the resulting inference will be biased. For example, whatever the true preference contribution of “$5,000 credit limit” relative to the baseline of only “$1,000 credit limit,” the coefficients associated with “Visa,” “15% interest,” and “$20 annual fee” will be biased upward by this amount, and the coefficient associated with “$2,500 credit limit” will be biased downward by the same amount upon deleting column x_{8} (“$5,000 credit limit”) for identification in this example. Also, note that the confounds identified here are not automatically resolved upon collecting more data. In fact, even an infinite number of observations from the model in Table 2 will exhibit the same problem. What is required for improved data based identification is not only more but also “different” data, i.e., data generated by X-configurations different from those in Table 2. However, more data will necessarily be “suitably different” if the processes that generate the covariates are independent, at least conditionally.
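Because the confound is deterministic, these bias amounts can be reproduced exactly in a noise-free calculation. The sketch below (Python; the chapter works in R) uses the Table 2 model matrix and the true coefficient vector from the chapter's sampling experiment, with the error variance set to zero so that any distortion is pure omitted-variable bias rather than sampling error.

```python
import numpy as np

# Model matrix from Table 2 and the chapter's coefficient vector
# beta = (4, 2, 0, 1, 1.5, -1, -1.5, 2, 3)
X = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1, 0],
], dtype=float)
beta = np.array([4.0, 2.0, 0.0, 1.0, 1.5, -1.0, -1.5, 2.0, 3.0])

# Noise-free preferences generated by the full model
y = X @ beta

# Drop x8 ("$5,000 credit limit") and solve the just-identified 8x8 system.
# Via the confound x8 = x1 + x3 + x6 - x7, beta8 = 3 is absorbed as +3 on
# the x1, x3, and x6 coefficients and -3 on the x7 coefficient.
b = np.linalg.solve(X[:, :8], y)
print(b)
```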
In this particular example, there is no obvious choice of covariates that could be omitted based on strong prior beliefs that their direct effect is equal to zero. In fact, a prior understanding of preferences for credit cards would suggest that all covariates likely causally relate to the observed preferences for the different cards. Thus, any likelihood identified model obtained by omitting covariates from Table 2 is likely to yield substantially biased inferences regarding the influence of covariates retained in the model.
At this point, it is useful to relate likelihood-identification by omitting covariates to the formulation of a prior. In a sense, omitting covariates to achieve likelihood-identification corresponds to a degenerate prior concentrated on zero for the effects of omitted covariates, coupled with an improper prior for the effects of covariates retained in the model. In contrast, a Bayesian model for this data defined through a proper prior over all observed covariates expresses the belief that these covariates contributed causally independently to the observed preferences, with some prior uncertainty about the size of the individual contributions.
From the perspective of different (implied) priors, I believe that essentially nobody would prefer one of the many possible likelihood identified models in this example to the Bayesian model that keeps with the prior causal structure. Mutilating the prior causal structure to overcome data deficiencies and to achieve likelihood-identification (and more generally statistical efficiency) does not seem to be a generally useful strategy. Obviously, one often can (and should) try to obtain more informative data. However, completely discounting the information in only partially informative data seems to be a wasteful strategy.
Importantly, a prior that expresses the belief in invariant structural aspects of the data generating process will eventually translate into accurate posterior measures of the strength of structural relationships, once more likelihood information becomes available. A model (or prior structure) that is formulated in response to observed data deficiencies will not. Thus, the findings from such a model are generally not useful as prior input to future analysis of data from the same process, be it informative, or again deficient per se, potentially in a different way. We will revisit this topic when we discuss and numerically illustrate hierarchical Bayesian models that manage to extract information about the distribution of parameters from a collection of likelihoods that individually fail likelihood-identification (a collection of “deficient” data sets).
A big intellectual step is thus to acknowledge the limits of a perspective that literally asks “for the data to speak.” The decisions that go into “making the data speak,” be it in the form of simple summaries or complicated (likelihood identified) models, always involve prior knowledge. In this context, trading beliefs about an underlying structure for the ability to relate parameters to well-determined functions of the data only regularly voids the thus identified parameters from the meaning sought by the analyst in the first place. In contrast, updating a structurally intact prior with deficient data preserves the structural interpretation of parameters, at the expense of “purely” data-based identification (I put “purely” in quotes, because the decision about how to arrive at a model that can be identified only based on the data at hand always involves subjective, i.e., non-data based prior knowledge).
Now back to our example. When the model matrix in Table 2 is passed to R's lm function, lm automatically deletes the last column, resulting in a model that just identifies the remaining β-coefficients. This model computes eight parameters from eight observations and thus trivially fits the data perfectly. Because of the perfect fit of every member of the class of just-identified models, the data cannot distinguish among models in this class. However, as mentioned earlier, prior knowledge strongly suggests that no likelihood-identified model obtained by deleting covariates makes much structural sense in this example.
For illustration, I simulate 1000 data sets using the model matrix in Table 2, a coefficient vector β = (4, 2, 0, 1, 1.5, −1, −1.5, 2, 3), and \( \sigma_{\varepsilon}^2 = 1 \). For each data set, I estimate the regression model in Table 2 dropping column x_{8} for identification, which corresponds to the default in R's lm function. I also estimate a fully conjugate Bayesian regression model with conditional prior \( \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, 100\,\sigma_{\varepsilon}^2\mathbf{I}) \), without dropping any columns from Table 2, using the routine runireg in the R-package bayesm (Rossi et al. 2005). (Conjugacy refers to mathematical properties of a prior in combination with a particular likelihood function. So-called conjugate priors result in posteriors of the same distributional form as the prior. For example, a normal prior is the conjugate prior for the parameters in a normal likelihood with known variance, i.e., a likelihood that implies (conditionally) normally distributed data. The marginal prior for \( \sigma_{\varepsilon}^2 \) is inverse gamma with 3 degrees of freedom and scale equal to the observed variance of y in each data set, i.e., the default in the R-package bayesm.)
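For a single simulated data set, the posterior mean of β under this conjugate prior, conditional on the error variance, has a simple closed form. The sketch below (Python; the chapter uses bayesm::runireg in R) shows that form for one illustrative draw; the seed and the resulting numbers are arbitrary.

```python
import numpy as np

# Model matrix from Table 2 and the chapter's true coefficient vector
X = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1, 0],
], dtype=float)
beta = np.array([4.0, 2.0, 0.0, 1.0, 1.5, -1.0, -1.5, 2.0, 3.0])

rng = np.random.default_rng(42)
y = X @ beta + rng.normal(size=8)      # one data set with sigma2 = 1

# Conjugate prior beta | sigma2 ~ N(0, 100 * sigma2 * I) gives the
# conditional posterior mean (X'X + A)^{-1} X'y with A = I / 100:
# a ridge-like shrinkage that keeps all nine coefficients in the model,
# even though X'X alone is singular.
A = np.eye(9) / 100.0
beta_tilde = np.linalg.solve(X.T @ X + A, X.T @ y)
print(beta_tilde.round(2))
```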
Table 5 Sampling experiment
Coefficient | True value | OLS mean | OLS std. dev. | Bayes mean | Bayes std. dev.
---|---|---|---|---|---
Constant | 4.0 | 3.96 | 0.91 | 3.95 | 0.88 |
Visa | 2.0 | 5.04 | 1.29 | 2.72 | 0.53 |
Discover | 0.0 | 0.01 | 0.70 | 0.01 | 0.69 |
15% | 1.0 | 4.02 | 1.33 | 1.70 | 0.54 |
12% | 1.5 | 1.49 | 0.68 | 1.49 | 0.66 |
$10 | −1.0 | −1.00 | 0.70 | −0.98 | 0.68 |
$20 | −1.5 | 1.53 | 1.31 | −0.77 | 0.55 |
$2500 | 2.0 | −0.98 | 0.68 | 1.33 | 0.38 |
$5000 | 3.0 | 0 | – | 2.31 | 0.42 |
The main difference between the classical OLS approach and the Bayesian approach here is the set of assumptions that enables the extraction of information from the data. While classical estimation requires prior information about how to reduce the dimensionality of the inferential problem to deliver estimates, the Bayesian approach allows us to retain the original dimensionality at the expense of assumptions that make regression parameters outside of some range very unlikely. In applications where the form, and thus the dimensionality, of the likelihood function derives from causal reasoning, i.e., theory, the Bayesian approach thus facilitates inference without having to compromise on the core of existing beliefs about the DGP in response to data deficiencies.
The rapidly developing field of machine learning provides alternative approaches to flexibly “regularize” a likelihood function (see e.g., Hastie et al. 2001). On a formal level, the regularization techniques employed in machine learning can be re-expressed as prior assumptions about parameters or likely model structures. And while the machine learning approach may have advantages in applications where the analyst has minimal to no prior knowledge about the DGP, the Bayesian approach excels when such knowledge is available.
The prior employed in our illustrative example is certainly closer to a common-sense understanding of preferences for credit cards than the model implied by deleting x_{8} ("$5,000 credit limit"), or any other likelihood-identified model obtained by deleting covariates in this example. However, it is still in the spirit of generic regularization without much attention to detail and, incidentally, essentially corresponds to a ridge-regression approach (Hoerl and Kennard 1970).
A more elaborate prior could, for example, harness the (weak) prior preference ordering of the levels of interest rate, annual fee, and credit limit, or specific knowledge about the person rating the credit cards (see e.g., Allenby et al. 1995).
Finally, many marketing applications such as, for example, conjoint experiments or the analysis of scanner panel data are characterized by a collection of small data sets that individually are similarly problematic as the one corresponding to Table 2. In such settings, so-called hierarchical Bayes models are useful. Hierarchical Bayes models learn the form of the prior to apply to each individual data set from the collection of data sets. In a hierarchical model, the prior that regularizes each individual level likelihood is therefore itself an object of statistical inference (see e.g., Lenk et al. 1996).
Even in settings where a data set formally identifies the parameters in a likelihood function, Bayes' theorem (Eq. 1) implies that the prior distribution will "bias" the posterior away from the information in the data. At least in small samples, or generally in the context of data that do not contain much information about target parameters, the optimal Bayes action (see Eq. 2) may thus be different from the action that only conditions on likelihood information. And analysts trained in classical frequentist statistics often point out that an objective assessment of, for example, the statistical relevance of a parameter is no longer possible once a subjectively formulated prior enters the inferential procedure.
This criticism is certainly valid. However, the quest for objective inference comes at the price of not being able to use some data sets at all, or only subject to assumptions that likely are less defensible or further removed from a common understanding of the DGP than can be incorporated in a prior distribution. Furthermore, when only finite amounts of data are available, the frequentist assessment of statistical uncertainty in estimates or about models often relies on large sample asymptotic arguments in all but simple linear models. Large sample asymptotic arguments are certainly objective but may or may not hold in a particular application that has to rely on finite data.
Finally, the posterior distribution from priors that put positive mass on the entire parameter space as defined by the likelihood function, i.e., priors that are neither degenerate nor constrained, will concentrate around the maximum likelihood estimate as the data become more and more informative. In this sense, priors that are neither degenerate nor constrained result in large-sample consistent inferences.
Bayesian Estimation
Another way to appreciate this proportionality is to think about the graphical representation of the posterior distribution of a scalar parameter. It is obvious that the linear scaling of the y-axis in this graph does not matter for relative probability statements of the form p(θ_{i}|y)/p(θ_{j}|y), because any finite multiplicative constant would cancel from this ratio. For the same reason, posterior Bayesian inference given a model is invariant to rescaling the likelihood, the prior, or both by multiplicative constants. Similarly, the relative expected loss from two actions a_{k} and a_{l} given a particular model, \( \mathcal{L}(a_k|\mathbf{y})/\mathcal{L}(a_l|\mathbf{y}) \), does not depend on multiplicative constants. However, to compute the expected loss in Eq. 2, we need absolute probability statements about θ, i.e., we need to normalize the product c_{1}p(y|θ) ⋅ c_{2}p(θ), where c_{1} and c_{2} are arbitrary positive "rescaling" constants.
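A quick numerical check of this invariance on a grid, with illustrative likelihood and prior kernels (Python sketch; the constants c1 and c2 are arbitrary, as in the text):

```python
import numpy as np

theta = np.linspace(-5.0, 5.0, 1001)
lik = np.exp(-0.5 * (theta - 1.0)**2)   # illustrative likelihood kernel
prior = np.exp(-0.5 * theta**2)         # illustrative prior kernel

# Normalized posterior from the unscaled product ...
post = lik * prior
post /= post.sum()

# ... and from the product rescaled by arbitrary positive constants
c1, c2 = 17.3, 0.004
post_scaled = (c1 * lik) * (c2 * prior)
post_scaled /= post_scaled.sum()

# The constants cancel in the normalization, so the posteriors agree
print(np.allclose(post, post_scaled))   # True
```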
I then move to models where the posterior distribution cannot be computed in closed form and introduce Gibbs sampling facilitated by data augmentation and the Metropolis-Hastings algorithm as solutions to Bayesian inference in this case.
Examples of Posterior Distributions in Closed Form
The parameters a and b can be interpreted as the number of “1s” and “0s” in a hypothetical prior experiment and serve to express prior beliefs about θ. Any real valued a, b > 0 results in a proper prior for the probability θ over its definitional range, i.e., \( {\int}_0^1p\left(\theta |a,b\right)\; d\theta =1 \). For example, setting both a and b equal to 1 yields the uniform density over the unit interval, expressing the absence of prior knowledge about what θ-values are more likely than others. Setting a and b equal to the same value larger than 1 yields a density that concentrates mass around 0.5, expressing various degrees of prior belief strength in θ being equal to 0.5; in the limit of a, b → ∞, the density degenerates to a point mass at 0.5. The mean and mode of the Beta density are given by a/ (a + b) and (a − 1) / (a + b − 2), respectively. Therefore, a > b (a < b) expresses prior beliefs that θ > 0.5 (θ < 0.5). Finally, for 0 < a, b < 1, the Beta density takes a bathtub shape that piles up mass at the borders of the parameter space, 0 and 1.
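These shapes are easy to verify numerically. The following sketch (in Python rather than R, purely for illustration) evaluates the Beta density directly from its definition; all parameter values are made up.

```python
import math

def beta_pdf(theta, a, b):
    """Density of the Beta(a, b) prior at theta."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * theta ** (a - 1) * (1 - theta) ** (b - 1)

# a = b = 1 recovers the uniform density on (0, 1)
print(beta_pdf(0.3, 1, 1))                                  # 1.0

# a = b = 5 concentrates prior mass around 0.5
print(beta_pdf(0.5, 5, 5) > beta_pdf(0.2, 5, 5))            # True

# 0 < a, b < 1 gives the bathtub shape piling mass near 0 and 1
print(beta_pdf(0.05, 0.5, 0.5) > beta_pdf(0.5, 0.5, 0.5))   # True

# mean a/(a+b) and mode (a-1)/(a+b-2), here for a = 4, b = 2
a, b = 4.0, 2.0
print(a / (a + b), (a - 1) / (a + b - 2))
```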
Conditional on the data y_{1},…, y_{n}, the binomial coefficient that forms the first factor in Eq. 6 is a fixed constant. Similarly, the normalizing constant of the Beta density, i.e., the first factor on the right hand side of Eq. 7 is fixed for a given choice of a,b.
The fact that the posterior distribution in Eq. 9 is of the same known distributional form as the Beta-prior makes the Beta-prior very convenient in the context of a binomial likelihood function. Technically, the Beta-prior is the conjugate prior to the binomial likelihood.
Moving from Eqs. 6 and 7 to Eq. 8 we dropped all multiplicative constants from the likelihood and the prior that do not depend on θ and then normalized the result from Eq. 8 to arrive at Eq. 9. As discussed following Eq. 4 above, we can do so for the purpose of inference given a particular model that consists of a specific likelihood function and prior. I will address the role of these model-specific constants in the context of formal comparisons between different models further below.
Finally, a useful exercise for first time acquaintances with Bayesian inference is to simulate binomial data, for example, using R’s rbinom command, or simply by making up n and s, and then to simulate from the posterior in Eq. 9 using R’s rbeta command for different specifications of a and b. Observe how the posterior changes as you use more or less informative data and more or less informative priors.
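The same exercise can also be sketched outside of R; the following Python snippet (made-up n and θ) simulates Bernoulli data and then draws from the Beta posterior of Eq. 9.

```python
import random

random.seed(42)

# simulate n Bernoulli observations with a made-up success probability
n, theta_true = 200, 0.3
y = [1 if random.random() < theta_true else 0 for _ in range(n)]
s = sum(y)

# the posterior under a Beta(a, b) prior is Beta(a + s, b + n - s) (Eq. 9)
a, b = 1.0, 1.0          # uniform prior; try, e.g., a = 30, b = 10 as well
post_draws = [random.betavariate(a + s, b + n - s) for _ in range(10_000)]

post_mean = sum(post_draws) / len(post_draws)
print(s / n, round(post_mean, 3))   # posterior mean close to s/n here
```

Rerunning with smaller n or a more informative prior (larger a + b) shows how the posterior is pulled toward the prior as the data become less informative.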
Another intellectually useful exercise is to think about different finite amounts of Bernoulli data that consist of only “1s” (or only “0s”). Clearly, the maximum-likelihood estimate of the data generating probability is one (zero) in this case, and a purely data-based assessment of uncertainty in this estimate is impossible. A question at the core of statistical decision theory then is the following: Is a decision maker better off taking the maximum likelihood probability estimate of one (zero) for granted, or should he rather base his decisions on a proper posterior distribution? (Obtained using a proper prior distribution with positive support over the unit interval.) A general answer to this question, which we will not attempt to prove here, is that any proper prior will translate into better decisions than taking the maximum likelihood estimate for granted. The only exception is the case where prior knowledge itself implies a deterministic process.
Posterior Distributions Not in Closed Form
If we had access to the latent utilities z = (z_{1}, … ,z_{n})′ that generated the observed binomial data y = (y_{1}, … ,y_{n})′, we could comfortably rely on the closed form results in Eq. 16 for Bayesian inference. Conditional on the data generating z, we would in fact learn more about the regression coefficients than we ever could from the corresponding y.
Here 1(·) is an indicator function that evaluates to one if its argument is true and else to zero, and \( \mathcal{TN}\left(a,b,c,d\right) \) is short for a normal distribution with mean a, variance b, truncated below at c and above at d. We can simulate from these distributions using a trick known as the inverse CDF-transformation (see e.g., Rossi et al. 2005), or rely on the command rtruncnorm in the R-package truncnorm (Mersmann et al. 2018) which builds on Geweke (1991).
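A minimal sketch of the inverse CDF-transformation, written in Python with the standard library's NormalDist in place of rtruncnorm; the parameter values are made up.

```python
import random
from statistics import NormalDist

def rtruncnorm(mu, sigma2, c, d, rng=random):
    """One draw from TN(mu, sigma2, c, d) via the inverse-CDF trick:
    map a uniform draw through the normal quantile function,
    restricted to the probability interval covered by (c, d)."""
    nd = NormalDist()
    sigma = sigma2 ** 0.5
    p_c = nd.cdf((c - mu) / sigma)
    p_d = nd.cdf((d - mu) / sigma)
    u = rng.uniform(p_c, p_d)
    u = min(max(u, 1e-12), 1 - 1e-12)   # guard the quantile function
    return mu + sigma * nd.inv_cdf(u)

random.seed(1)
# e.g., the conditional draw of z_i when y_i = 1: truncated below at 0
draws = [rtruncnorm(-0.5, 1.0, 0.0, float("inf")) for _ in range(5_000)]
print(min(draws) > 0.0)                  # the truncation is respected
print(round(sum(draws) / len(draws), 2)) # mean pulled above mu = -0.5
```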
Note that once we condition on the z in Eq. 25, the y are no longer required as conditioning argument. A particular set of z transmits all the information, and in fact more information than contained in the y, to β (I will discuss general rules for the derivation of conditional distributions later and for now concentrate on what can be achieved based on conditional distributions).
Gibbs sampler. Our goal is thus to derive the marginal posterior distribution p (β| y, β^{0}, Σ^{0}) that is free from the extra, but virtual information about β that comes with each particular set of z we may condition on in Eq. 25. However, as we already know, this posterior is not available in closed form. A convenient solution to this problem is the Gibbs sampler. The Gibbs sampler allows us to generate draws from p (β, z|y, β^{0}, Σ^{0}) based on knowledge of \( p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)={\prod}_{i=1}^np\left({z}_i|\boldsymbol{\beta}, {y}_i\right) \) and p (β|z, β^{0}, Σ^{0}), i.e., conditional distributions only. Once we have draws from p (β, z|y, β^{0}, Σ^{0}), each draw of β in that sample is a draw from our target distribution p (β|y, β^{0}, Σ^{0}) (Recall that the joint distribution p (β, z|y, β^{0}, Σ^{0}) can be decomposed into the product of the marginal distribution p (β|y, β^{0}, Σ^{0}) and the conditional distribution p (z|y, β) by elementary probability calculus. If we have access to a sample from the joint distribution, drawing a β with no regard to the companion z and then looking at the companion z in the sample is equivalent to drawing from p (β|y, β^{0}, Σ^{0}) and then from p(z|y, β)).
Each completed cycle through steps 2 and 3 delivers a pair (β, z)^{r} where r = 1, … ,R indexes the cycle or iteration number of the Gibbs-sampler. Under rather general conditions for the conditional distributions involved, these pairs will represent draws from the joint distribution after some initial iterations, and independent of the choice of starting value. The initial iterations serve to “make the Gibbs sampler forget” the arbitrary starting value in step 1 above. This is often referred to as the “burn-in” period of the Gibbs-sampler. Intuitively, the choice of starting value does not matter, because the Gibbs sampler will forget it, no matter which value was chosen (However, the choice of starting value may influence how many iterations it takes before the Gibbs sampler converges, i.e., delivers pairs (β, z)^{r} in proportion to their joint posterior density in a finite sample of R draws. Another practical concern for the choice of starting values is the numerical stability of the techniques used to draw from the conditional distributions).
Steps 2 and 3 above are often referred to as “blocks of the sampler.” Note that step 2 itself consists of n-subblocks that each draw from the conditional distribution of a particular z_{i}. However, because all z_{i} are conditionally independent, i.e., \( p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)={\prod}_{i=1}^np\left({z}_i|\boldsymbol{\beta}, {y}_i\right) \) (see Eqs. 23 and 24), step 2 effectively draws from the joint conditional posterior distribution of z. Similarly, step 3 draws from the joint conditional posterior distribution of all elements in β.
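The steps above can be sketched compactly in code. The following Python sketch (simulated, made-up data; a single scalar coefficient so that everything fits in the standard library, whereas the full multivariate version is implemented in bayesm's rbprobitGibbs) cycles through the two blocks of the sampler.

```python
import random
from statistics import NormalDist

random.seed(0)
nd = NormalDist()

# simulate binary probit data (made-up example: one covariate, no intercept)
n, beta_true = 500, 1.2
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1 if beta_true * xi + random.gauss(0.0, 1.0) > 0 else 0 for xi in x]

# subjective prior beta ~ N(0, 100)
b0, s0sq = 0.0, 100.0

def draw_z(mu, y_i):
    """Step 2: z_i | beta, y_i ~ N(mu, 1) truncated at 0 (inverse CDF)."""
    p0 = nd.cdf(-mu)                       # P(z_i < 0 | beta)
    u = random.uniform(p0, 1.0) if y_i == 1 else random.uniform(0.0, p0)
    u = min(max(u, 1e-12), 1 - 1e-12)      # guard the quantile function
    return mu + nd.inv_cdf(u)

R, burn = 2_000, 500
beta, draws = 0.0, []                      # step 1: arbitrary starting value
sum_x2 = sum(xi * xi for xi in x)
for r in range(R):
    # step 2: augment the latent utilities z
    z = [draw_z(beta * xi, yi) for xi, yi in zip(x, y)]
    # step 3: conjugate normal update for beta | z
    prec = sum_x2 + 1.0 / s0sq
    mean = (sum(xi * zi for xi, zi in zip(x, z)) + b0 / s0sq) / prec
    beta = random.gauss(mean, prec ** -0.5)
    if r >= burn:
        draws.append(beta)

print(round(sum(draws) / len(draws), 2))       # posterior mean near 1.2
print(sum(b > 0 for b in draws) / len(draws))  # estimate of P(beta > 0 | y)
```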
To further strengthen the intuition for the Gibbs sampler, it is useful to think about each iteration as an exploration of the joint distribution in some neighborhood defined by the respective conditioning arguments. Because it samples and then updates its conditioning arguments, however, the Gibbs sampler is not going to stay in this neighborhood but will move away from it and eventually return.
Each time it returns to some fixed neighborhood of β-values, for example, it will do so from a different constellation of z. Returns from z-constellations that are closer to this β-neighborhood in the sense of Eqs. 25 and 26 will occur more often than returns from z-constellations that are further away. Thus, looking at pairs (β, z)^{r} in this neighborhood, it is impossible to distinguish between moves “from β to z” and moves “from z to β,” and this will be true of every β-neighborhood and z-neighborhood supported by the posterior distribution. In addition, by successively sampling from conditional distributions which are, by definition, proportional to the joint distribution, the Gibbs sampler is going to spend relatively more (fewer) iterations in areas of higher (lower) density under the joint distribution.
In other words, successive pairs (β, z)^{1} , … , (β, z)^{r} , … , (β, z)^{R} produced by iterations of the Gibbs sampler are locally dependent in the sense that pairs produced in successive iterations are more similar to each other than pairs produced further apart from each other, where distance is measured in iteration counts of the Gibbs sampler. However, all pairs provide exchangeable information about the joint posterior distribution. We can therefore use the output from the Gibbs sampler to approximate posterior expected loss (see Eq. 5) and any aspect of the posterior distribution we may be interested in by the corresponding expectation computed from the Gibbs output. For example, the posterior probability that a particular regression coefficient is larger than zero, i.e., \( P\left({\beta}_k>0|\mathbf{y},\bullet \right)={\int}_0^{\infty }p\left({\beta}_k|\mathbf{y},\bullet \right)\,d{\beta}_k \), would be estimated from the Gibbs output as \( \frac{1}{R}{\sum}_{r=1}^R\mathbf{1}\, \left({\beta}_k^r>0\right) \). Note that we control the degree of accuracy of these approximations through the length of the Gibbs sample R.
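How the accuracy of such Monte Carlo approximations is controlled by R can be illustrated with a stand-in for the Gibbs output; here I simply posit, hypothetically, that the posterior of β_k is N(0.5, 1), so that the exact answer Φ(0.5) ≈ 0.691 is known.

```python
import random

random.seed(7)

# stand-in for Gibbs output: suppose the posterior of beta_k happens
# to be N(0.5, 1), so that P(beta_k > 0 | y) = Phi(0.5) ~ 0.691
for R in (100, 1_000, 100_000):
    draws = [random.gauss(0.5, 1.0) for _ in range(R)]
    est = sum(d > 0 for d in draws) / R
    print(R, round(est, 3))   # the estimate tightens as R grows
```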
The particular Gibbs sampler described here is implemented as routine rbprobitGibbs in the R-package bayesm (Rossi et al. 2005) and dates back to Albert and Chib (1993). The routine comes with an example that illustrates input and output (Another bayesm routine, rbiNormGibbs, nicely illustrates how the Gibbs sampler explores a two-dimensional joint distribution by successively sampling from the corresponding two conditional distributions).
Data augmentation. In this application of the Gibbs sampler, the interest really is in the marginal posterior distribution of probit regression coefficients, i.e., p(β|y, •), and Gibbs sampling from the joint posterior distribution of β and z is just a means to obtaining the marginal distribution of interest. Drawing from p(z|y, β) is therefore referred to as “data augmentation” in the literature. Data augmentation often helps transform Bayesian inference problems that involve “unknown” distributions, i.e., distributions without a normalizing constant in closed form, into problems that only involve sampling from distributions with known normalizing constants through conditioning. Canonical examples for the successful application of this technique are the multinomial probit model (McCulloch and Rossi 1994), the multivariate probit model (Edwards and Allenby 2003), mixture models (see e.g., Allenby et al. 1998; Frühwirth-Schnatter et al. 2004; Lenk and DeSarbo 2000; Otter et al. 2004), and hierarchical models in general.
From the perspective of Gibbs sampling, there is no distinction between (unobserved) aspects of the data, unobserved parameters, or any unobservable we can derive a conditional distribution for, within the confines of the Bayesian model under investigation. However, before one gets too excited about the possibilities of inference about any unobservable, it is useful to reflect on how much we can learn about β and z from the data in this example.
While it is possible to attain perfect posterior knowledge about β in this model in the limit of an infinitely large sample, it is impossible to ever learn the particular set of z’s that generated the data. This information is lost forever when moving from the data generating z to the observed y based on the indicator function y_{i} = 1(z_{i} > 0). We have one observation y_{i} to learn about each z_{i}. This observation only set identifies z_{i}, i.e., indicates whether z_{i} < 0 or z_{i} > 0. In addition, \( \mathcal{N}\left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) \), which can be viewed as a hierarchical prior for the z_{i}, cannot degenerate, i.e., cannot deliver a perfect prediction, by the definition of the probit likelihood. Any finite valued x^{′}_{i}β allows for y_{i} = 1 and y_{i} = 0, even if one of the two outcomes is extremely unlikely.
As such, we are severely limited in what we can learn about the data generating z no matter how many probit observations become available or what subjective prior parameters β^{0} and Σ^{0} we use. Thus, it is generally useful to distinguish between unobservables that can be consistently estimated in a particular model and unobservables that cannot, before further using the output from the Gibbs sampler. Here “consistently” means that we can think of amounts of data, i.e., likelihood information, or a subjective prior setting that translates into a degenerate posterior distribution which concentrates all its mass in one point. For example, it would be foolish to believe that using the posterior distribution of z could somehow further improve decisions informed by the data y and the model at hand, which depend on p(β|y) only.
Blocking. One could replace step 2 in the Gibbs sampler above by a Gibbs cycle through the full conditional distributions of each element β_{k} in β, i.e., p (β_{k} | β_{−k}, z, •), where β_{−k} is short for all but the k-th element (These conditional densities are easily derived from the joint conditional normal distribution in Eq. 16 using linear regression theory).
Because any corresponding complete set of conditional distributions uniquely determines the joint distribution, this alternative sampler again delivers draws from the same joint posterior distribution p(β, z|y, •). However, the local dependence between successive pairs (β, z)^{1}, … , (β, z)^{r}, … , (β, z)^{R} produced by iterations of this alternative Gibbs sampler is relatively higher. This is because two successive cycles through p (β_{k}|β_{−k}, z, •) for all k-elements deliver draws of β that are more similar in expectation than two draws from p(β|z,•), which are independently distributed.
Replacing a cycle like that through p (β_{k} | β_{−k}, z, •) for all k-elements by a direct draw from the corresponding conditional joint distribution, in this case p (β|z, β^{0}, Σ^{0}), in a Gibbs sampler is referred to as “blocking,” or “grouping” (e.g., Chen et al. 2000). In general, blocked Gibbs samplers deliver more additional information about the posterior distribution per incremental iteration than unblocked samplers, which is intuitive considering direct iid-sampling from the joint posterior distribution as the theoretical limit of blocking. As such, blocked samplers also deliver pairs (β, z)^{r} in proportion to their joint posterior density in a finite sample based on fewer iterations, i.e., converge faster from arbitrary starting values.
Another technical aspect is the order in which to successively draw from the blocks of a Gibbs sampler. The theory of Gibbs sampling implies that the order does not matter and in fact a random ordering is easiest to motivate theoretically (see e.g., Roberts 1996, p. 51). However, in our particular example, repeated draws from step 2, i.e., p(β|z, •), or step 3, i.e., p(z|y, β), without switching to the respective other block in between are a perfect waste of time because these draws are conditionally iid. Furthermore, randomly switching to step 2 before updating all elements of z in step 3 is inefficient because step 2 pools information across all z. The updated pooled information is then “redistributed” across all z when drawing from p (z|y, β) in step 3.
The last line in Eq. 31 follows from the fact that both y_{1} and β_{1},…,β_{K} are conditioning arguments, i.e., fixed (for the moment). A useful interpretation of the final result, and in fact a way to derive the result almost instantly, is that the (conditional) posterior of z_{1} is proportional to the “likelihood” of z_{1}, i.e., \( p\left({y}_1|{z}_1\right)=\mathbf{1}{\left({z}_1>0\right)}^{y_1}\mathbf{1}{\left({z}_1<0\right)}^{1-{y}_1} \) times a “prior probability” of z_{1}, i.e., \( p\left({z}_1|{\beta}_1,\dots, {\beta}_K\right)=\mathcal{N}\left({z}_1|{\mathbf{x}}_1^{\prime}\boldsymbol{\beta},1\right) \). In other words, the (conditional) posterior is proportional to the probability of everything that directly depends on z_{1}, i.e., the probability of z_{1}’s “children,” times the probability of z_{1} given everything z_{1} directly depends on, i.e., z_{1}’s “parents.” (The terminology “children” and “parents” is owed to the representation of joint distributions and their conditional independence relationships in the form of directed acyclic graphs (see e.g., Pearl 2009, p. 12))
Therefore, the full conditional posterior of β_{1} does not depend on the observed data y, conditional on z. Again we find that the conditional posterior is proportional to the product of the (conditional) prior p (β_{1}|β_{2}, … , β_{K}) times the “likelihood,” i.e., the probability of everything that directly depends on β_{1} in the DGP, i.e., \( {\prod}_{i=1}^np\left({z}_i|{\beta}_1,\dots, {\beta}_K\right) \). Note that both factors in this product involve normal distributions, and drawing all elements of β jointly from p (β_{1}, … , β_{K}|z_{1}, … , z_{n}), as in Eq. 25, is simple if the joint prior distribution of β is multivariate normal.
For predicting a pair y^{u}, z^{u} conditional on β, we are thus back at data generation, i.e., get a draw z^{u} from \( \mathcal{N}\left({z}^u|{\left({\mathbf{x}}^u\right)}^{\prime}\boldsymbol{\beta},1\right) \) and determine y^{u} according to the sign of z^{u}. The predictive probability p(y^{u} = 1|β) can be simulated as \( \frac{1}{R}{\sum}_{r=1}^R\mathbf{1}\left({\left({z}^u\right)}^r>0\right) \) or computed using Eq. 17.
To better appreciate this generally important point, it is useful to simulate probit data following the example given with the rbprobitGibbs routine in the R-package bayesm, to sample from the corresponding posterior using rbprobitGibbs, and then to simulate and compare predictions for different x^{u} as explained above. For a comparison with predictions at a frequentist point estimate, the R-command glm(…, family=binomial(link=”probit”),…) is useful.
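The agreement between the closed form predictive probability and its simulation counterpart can be sketched as follows (Python for illustration; the coefficient draw and covariate vector are made up).

```python
import random
from statistics import NormalDist

random.seed(3)
nd = NormalDist()

beta = [0.5, -1.0]    # a hypothetical posterior draw of the coefficients
x_u = [1.0, 0.8]      # covariates of an unobserved unit u

mu = sum(b_k * x_k for b_k, x_k in zip(beta, x_u))

# closed form: p(y^u = 1 | beta) = Phi(x_u' beta)
p_exact = nd.cdf(mu)

# by simulation: draw z^u ~ N(x_u' beta, 1) and count the positive signs
R = 100_000
p_sim = sum(random.gauss(mu, 1.0) > 0 for _ in range(R)) / R
print(round(p_exact, 3), round(p_sim, 3))   # the two closely agree
```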
Conditional posterior distributions in hierarchical models. Hierarchical models estimate a distribution of response coefficients, e.g., \( {\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N\sim p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right) \) from a collection of i = 1, … , N time series Y = (y_{1}, … ,y_{N})′ where \( {\mathbf{y}}_i={\left({y}_{i,1},\dots, {y}_{i,t},\dots, {y}_{i,{T}_i}\right)}^{\prime } \) . \( P\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right) \) forms a hierarchical prior distribution. The difference from a purely subjective prior distribution is that the sample of time series observations contains likelihood information about parameters τ that index the hierarchical prior. In other words, upon placing a subjective prior distribution on τ, the likelihood information contained in the collection of time series will update this prior distribution to the posterior distribution p(τ|Y).
It should be noted that in these models, marginal posteriors for individual level coefficients, i.e., p(β_{i}|Y), will be biased or “shrunk” towards the hierarchical prior distribution for relatively small T_{i} or, more precisely, whenever the individual level likelihood information in p(y_{i}|β_{i}) is limited relative to the information about β_{i} in the hierarchical prior. And it is precisely this situation that motivates the use of hierarchical models in the first place.
However, parameters τ indexing the hierarchical prior can be estimated consistently, and in many marketing applications where the behavior of the particular consumers in the estimation sample is just a means to learning about optimal actions in the population these consumers belong to, p (τ|Y) is the main target of inference.
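A simplified sketch of these ideas, a scalar hierarchical normal means model with the heterogeneity variance τ² held fixed (so, deliberately, not the full model with a prior on all hierarchical parameters), illustrates both the shrinkage of individual level estimates and the consistent recovery of the hierarchical mean from many short time series. All data are simulated and all parameter values made up.

```python
import random

random.seed(5)

# hypothetical large-N, small-T panel: N units with T = 3 observations,
# y_it ~ N(beta_i, 1) and beta_i ~ N(mu, tau^2); tau is held fixed here
N, T = 200, 3
mu_true, tau = 1.0, 0.5
beta_true = [random.gauss(mu_true, tau) for _ in range(N)]
Y = [[random.gauss(b, 1.0) for _ in range(T)] for b in beta_true]
ybar = [sum(row) / T for row in Y]

R, burn = 3_000, 500
mu, mu_draws = 0.0, []
beta_sum = [0.0] * N
for r in range(R):
    # beta_i | mu, y_i: precision-weighted combination of ybar_i and mu
    prec_b = T + 1.0 / tau ** 2
    betas = [random.gauss((T * yb + mu / tau ** 2) / prec_b, prec_b ** -0.5)
             for yb in ybar]
    # mu | betas under a diffuse subjective N(0, 100) prior
    prec_mu = N / tau ** 2 + 1.0 / 100.0
    mu = random.gauss((sum(betas) / tau ** 2) / prec_mu, prec_mu ** -0.5)
    if r >= burn:
        mu_draws.append(mu)
        beta_sum = [bs + b for bs, b in zip(beta_sum, betas)]

mu_hat = sum(mu_draws) / len(mu_draws)
beta_hat = [bs / len(mu_draws) for bs in beta_sum]

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

print(round(mu_hat, 2))             # recovers mu_true = 1.0
print(var(beta_hat) < var(ybar))    # shrinkage toward the hierarchical mean
```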
For many popular and useful choices of p(τ), Eq. 37 results in a conjugate update, i.e., a conditional distribution in the form of a known distribution we can directly sample from. Perhaps the most prominent example is the model that takes \( p\left({\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N|\boldsymbol{\tau} \right)={\prod}_{i=1}^N\mathcal{N}\left({\boldsymbol{\beta}}_i|\overline{\boldsymbol{\beta}},{\mathbf{V}}_{\boldsymbol{\beta}}\right) \) and uses a so-called Normal-Inverse Wishart prior for \( p\left(\overline{\boldsymbol{\beta}},{\mathbf{V}}_{\boldsymbol{\beta}}\right) \) that is sometimes rather confusingly referred to as “the H(ierarchical)B(ayes)-model.” Examples are the routines rhierBinLogit, rhierLinearModel, rhierMnlRwMixture and rhierNegbinRw in the R-package bayesm (Rossi et al. 2005) that implement this hierarchical prior (or its finite mixture generalization in the case of rhierMnlRwMixture) for collections of time series of binomial logit, linear, multinomial logit, and negative binomial observations, respectively.
One interpretation of this approach towards inference for the parameters in the hierarchical prior is that it relies on the so-called random effects \( {\left\{{\boldsymbol{\beta}}_i\right\}}_{i=1}^N \) as augmented data, similar to the augmentation of latent utilities in the probit model discussed earlier. Different authors have argued that this approach may be suboptimal depending on the amount of likelihood information at the individual level and the amount of unobserved heterogeneity in (β_{1}, … , β_{Ν}) (see e.g., Chib and Carlin 1999; Frühwirth-Schnatter et al. 2004). However, practical alternative approaches that apply beyond the special case of conditionally normal individual level likelihood functions coupled with a (conditionally) normal hierarchical prior have yet to be developed.
For many individual level likelihood functions of interest in marketing, and perhaps most prominently so for the multinomial logit likelihood, the product on the right hand side of Eq. 38 does not translate into a known distribution. The Metropolis-Hastings algorithm, a general solution to generating draws from distributions with unknown normalizing constants, is discussed next. Finally, if sampling from the distribution in Eq. 38 is computationally expensive, the combination of Eqs. 37 and 38 suggests scope for parallel sampling from the latter for i = 1, … , N and then feeding back the updated (β_{1}, … , β_{Ν}) as conditioning arguments into Eq. 37 and so on.
Metropolis-Hastings. The Gibbs sampler solves the problem posed by a (joint) posterior distribution with unknown normalizing constant if there is a corresponding set of conditional posterior distributions with known normalizing constants. The Gibbs sampler is extremely powerful and in some sense universal if one is content with approximations to the posterior on a discrete grid (Ritter and Tanner 1992). However, a general technique to sample from distributions with unknown normalizing constants known as the Metropolis-Hastings (MH) algorithm further substantially facilitates real world applications of Bayesian inference. A practically important example in marketing is Bayesian inference for models defined by type-I extreme value error (T1EV) likelihoods, e.g., logit-models, coupled with normal prior distributions for the (regression) coefficients in the likelihood.
On iteration r, the MH-sampler thus transitions from the current “state” or parameter value θ^{r} to a new state θ* with probability α. With probability 1 – α, the current state at iteration r + 1 equals that at iteration r, i.e., θ^{r+1} = θ^{r} (Compute α according to Eq. 39, preferably on the log-scale, exponentiate, and compare the result to a draw u from a standard uniform distribution. If u < α, move to θ*; else stay at θ^{r}, to obtain θ^{r+1}). The so-called candidate value or state θ* is sampled from the known “candidate generating” or “proposal” density q. Note that the unknown normalizing constant p(y) = ∫ p(y| θ)p(θ)dθ cancels from Eq. 39.
A remarkable property of this transition rule is that it defines a Markov chain or process with invariant or stationary distribution equal to the posterior distribution p (θ|y). (A Markov process is a stochastic process in which the future, i.e., the (r + 1)-th value only depends on the value attained in the r-th iteration. All values taken before at the (r − 1)-th, (r − 2)-th, and so on iteration are irrelevant for predicting or generating the (r + 1)-th value.) In practice, this implies that subject to rather weak conditions for the proposal density q, repeated application of the transition rule in Eq. 40 eventually delivers draws from the posterior distribution of the model under investigation, independent of the choice of initial or starting value θ^{r=0}. In other words, after discarding, say the first b values θ^{1}, … , θ^{r}, … , θ^{b} generated by b applications of Eq. 40 starting from θ^{0}, we can use the remaining R − b draws as a representative sample of the posterior distribution.
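A minimal random walk MH sampler for a made-up one-parameter Bernoulli-logit posterior (Python for illustration; the step size and prior are arbitrary choices) makes the transition rule concrete:

```python
import math
import random

random.seed(0)

# made-up Bernoulli data: s successes in n trials, logit link,
# N(0, 10^2) subjective prior on the single coefficient theta
n, s = 50, 35

def log_post(theta):
    """Log of likelihood times prior; the normalizing constant p(y)
    is unknown but cancels from the acceptance probability."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return (s * math.log(p) + (n - s) * math.log(1.0 - p)
            - theta ** 2 / (2.0 * 100.0))

R, step = 20_000, 0.8
theta, draws = 0.0, []                  # arbitrary starting value theta^0
for r in range(R):
    cand = theta + random.gauss(0.0, step)     # symmetric RW proposal q
    log_alpha = min(0.0, log_post(cand) - log_post(theta))
    if random.random() < math.exp(log_alpha):  # compare u to alpha
        theta = cand                           # transition to theta*
    draws.append(theta)                        # else repeat theta^r

burn = 2_000
post = draws[burn:]                     # discard the burn-in sample
print(round(sum(post) / len(post), 2))  # near logit(35/50) = log(35/15)
```

Because the proposal is symmetric, the q-terms cancel and the acceptance probability reduces to the ratio of unnormalized posterior densities, which is why this random walk variant is so widely used.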
Equation 43 makes it intuitive that the collection of moves away from θ_{j} and moves returning to θ_{j} by the MH sampler eventually represents the posterior support for θ_{j} and, because this holds for all values θ_{j}, the entire posterior support. The “eventual” part of this statement comes from the fact that we may start off the sampler at a parameter value θ_{j} = θ^{0} in a region of the parameter space Θ with extremely small posterior probability, i.e., in some extreme tail of the posterior distribution. As the MH sampler perhaps very slowly navigates the posterior, i.e., using many iterations depending on the proposal density q, moving into regions of the parameter space with higher posterior support, the draws along the path to that region over-represent the posterior support for these draws in any finite MH sample. This explains why the first b-iterations of the MH sampler that deliver the sequence θ^{1}, … , θ^{r}, … , θ^{b} from the arbitrary initial starting value θ^{0} need to be discarded as burn-in for the sequence θ^{b+1}, … , θ^{b+r}, … , θ^{R} to be representative of the posterior distribution.
Convergence. Unfortunately, there is no simultaneously practical and reliable way to assess the length of the burn-in sample b. I strongly recommend that users of so-called Markov-Chain-Monte-Carlo (MCMC) techniques, which encompass the Gibbs sampler and the MH sampler as well as collections and combinations of these techniques, always take the time to check the convergence behavior of a particular algorithm using simulated data, no matter if the algorithm was designed by someone else or is newly developed and coded from scratch. In this process, three additional advantages emerge from working with simulated data. First, it forces the researcher to be absolutely clear about his understanding of the data generating process. Second, it delivers an understanding of what informative and less informative data are. Third, it helps with assessing the influence of subjective prior choices.
The investigation of convergence behavior relies on time series plots of posterior quantities of interest, where “time” is measured in iterations of the MCMC sampler. We want these time series plots to look stationary, at least after projecting the draws onto the loss from different actions. In other words, at least time series plots of \( \mathcal{L} \)(a, θ^{r} ) need to have converged to stationary sequences over the first b iterations of the sampler. Obviously, the series of \( \mathcal{L} \)(a, θ^{r} ) will converge if the series of parameter draws θ^{r} converges. However, it sometimes may be easier to assess convergence in \( \mathcal{L} \)(a, θ^{r}) than in θ^{r} because the latter often is a high-dimensional object in applied work. In addition, strong posterior dependence between elements of the parameter vector θ may mask convergence to a stable predictive distribution. Interesting examples are “fundamentally over-parameterized” models in the sense that even an infinite amount of data only likelihood-identifies lower dimensional projections of the parameters (see e.g., McCulloch and Rossi 1994; Edwards and Allenby 2003; Wachtel and Otter 2013) (As discussed in section “Bayesian Essentials” above, a proper prior distribution effectively guarantees that the posterior distribution is proper, independent of what can be identified from the likelihood). However, strong posterior dependence between elements of θ^{r} is not limited to fundamentally over-parameterized models.
If an MCMC explores the posterior distribution quickly (“mixes well”), it will yield a representative sample of the posterior distribution in fewer iterations than an MCMC that explores the posterior distribution more slowly (“does not mix well”). The mixing behavior of an MCMC has implications for the required length of the burn-in sample b. If a chain mixes well, we can choose vastly different starting values and we will quickly lose the ability to distinguish among chains that use different starting values based on summaries of draws. The information in the draws from the posterior all chains converge to will swamp the initial differences between chains. Reliable formal tests of convergence implemented in the R-package CODA (Plummer et al. 2006), for example, build on this idea. However, when a chain mixes well, the researcher will (almost always) see this when exploring the posterior sample generated by the MCMC graphically. And because chains that mix well converge quickly, this limits the need for formal testing. In applied work, it thus is a priority to make sure that the MCMC employed mixes well. This brings us back to the role of simulated data in the development and testing of numerically intensive inference routines such as MCMC. I will give practical examples further below.
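The connection between mixing and the memory of starting values can be illustrated with AR(1) series as stand-ins for MCMC output; this is only a caricature of a real chain, but the lag-one autocorrelation contrast carries the intuition:

```python
import random

random.seed(11)

def ar1_chain(rho, R, start):
    """AR(1) series as a simple stand-in for MCMC output: rho near 1
    mimics a slowly mixing chain, rho near 0 a well-mixing one."""
    x, out = start, []
    sd = (1.0 - rho ** 2) ** 0.5
    for _ in range(R):
        x = rho * x + random.gauss(0.0, sd)
        out.append(x)
    return out

def lag1_autocorr(xs):
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

good = ar1_chain(rho=0.1, R=20_000, start=10.0)   # forgets its start fast
bad = ar1_chain(rho=0.99, R=20_000, start=10.0)   # retains memory long
print(round(lag1_autocorr(good), 2), round(lag1_autocorr(bad), 2))
```

The slowly mixing series carries the dispersed starting value of 10 through many iterations, which is exactly why such chains require a longer burn-in b and more draws for a representative posterior sample.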
However, the reason for using the MH sampler in the first place is that we cannot directly sample from the posterior distribution (Note that one can think of the Gibbs sampler as a cycle through MH steps with conditional proposal densities equal to the conditional posterior distributions). Nevertheless, it is sometimes possible to construct proposal densities as close approximations to the posterior distributions. An example is the routine rmnlIndepMetrop in the R-package bayesm (Rossi et al. 2005) that uses a normal approximation to the likelihood to construct a multivariate t-distributed proposal centered at a penalized maximum likelihood estimate.
An obvious requirement for the proposal density is that the parameter set over which the proposal density q has positive support Θ_{q} is equal to, or a superset of the parameter set over which the posterior distribution has positive support, i.e., Θ_{p(θ| y)} ⊆ Θ_{q}. If the proposal density q is such that parameter values that have positive support under the posterior distribution can never be reached, an MH sampler using this proposal density cannot possibly deliver draws that are representative of the posterior distribution.
Conversely, if the proposal density extends beyond the support of the posterior, i.e., Θ_{p(θ| y)} ⊂ Θ_{q}, proposals to move into a region of the parameter space that is not supported under the posterior will simply be rejected. The corresponding acceptance probability α is equal to zero (see Eq. 39).
A related, less obvious but practically important requirement for the proposal density is that it should have more mass in its tails than the posterior distribution. The reason is that a concentrated proposal density may effectively fail to navigate the entire posterior distribution, in a way similar to a proposal that is only defined over a subset of the parameter space. A tricky aspect of thin-tailed proposal densities that are concentrated in an area where the posterior distribution is relatively flat is that time series plots of any finite number of MH draws may fail to indicate that the sampler has not converged, i.e., the plots may indicate convergence over a range of parameters that is not representative of the entire posterior distribution.
However, in many applications, the dimensionality of the parameter space is too large for a RW proposal that attempts to move all parameters simultaneously in one “big” MH step to work well. Conditional independence relationships in the DGP can be exploited to break one big MH step into a collection of MH steps of smaller dimensionality following the same logic that we used earlier to decompose the joint posterior distribution into a set of more manageable conditional posterior distributions for the Gibbs sampler.
The second line in Eq. 46 follows from the application of Bayes’ theorem (see Eq. 1 and note that normalizing constants \( \int p\left(\mathbf{y}|{\boldsymbol{\theta}}_{-k}^r,{\theta}_k\right)p\left({\boldsymbol{\theta}}_{-k}^r,{\theta}_k\right)\;d{\theta}_k \) cancel) and the decomposition of the joint proposal density into a conditional times a marginal. However, it is wasteful not to exploit conditional independence relationships that often vastly simplify the computation of the ratio in Eq. 46 for particular conditional posterior distributions (see e.g., the conditional posterior distribution in Eq. 31).
Moreover, unobservables that are conditionally independent a posteriori should always be drawn in separate MH steps, upon introducing the respective conditioning argument. It would be wasteful to constrain the sampler to either accept a joint move of all these unobservables to the respective candidate values or to reject the entire move and to repeat all respective values from iteration r. The conditional posterior \( p\left(\mathbf{z}|\mathbf{y},\boldsymbol{\beta} \right)={\prod}_{i=1}^np\left({z}_i|\boldsymbol{\beta}, {y}_i\right) \) discussed earlier in the context of the binomial probit likelihood serves as an example.
The practical advantage of working with full conditional distributions as the basis for MH-RW sampling is that the proposal densities q_{k} (ϵ) are univariate. As a consequence, we only need to determine the concentration of these distributions around ϵ_{k} = 0, which corresponds to \( {\theta}_k^{\ast }={\theta}_k^r \). When attempting to make multivariate proposals with the goal to move more than one element of the parameter vector in one step, a simple multivariate RW proposal of the form \( q\left(\boldsymbol{\upepsilon} \right)={\prod}_{k=1}^Kq\left({\upepsilon}_k\right) \) may suggest moves into directions with minimal support under the posterior which will result in θ^{r + 1} = θ^{r} for many iterations. Thus, setting up an MCMC as a repeated cycle through conditional MH steps facilitates the definition of suitable proposal densities. This is analogous to conditioning leading to known distributions in the Gibbs sampler, which can be viewed as a special case of MH sampling (see Eq. 44).
For continuous parameters the default choice for q_{k} (ϵ) is \( N\left(0,{\sigma}_k^2\right) \) where the parameter \( {\sigma}_k^2 \) is subject to “tuning” by the analyst. For an integer parameter ϵ = (η + 1) s could be used, where η is distributed Poisson with tuning parameter λ, and s takes values from {−1,1} with probability 0.5 (For strictly categorical parameters with no ordering among their values, the notion of a random walk is not defined. However, because of the finite prior support of such parameters, it is possible to use discrete uniform proposal distributions. Because all values have the same probability under a uniform distribution, the proposal distributions again cancel from the ratio in the acceptance probability α).
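The integer random-walk proposal described above can be sketched directly; the helper name `integer_rw_proposal` is hypothetical, but the construction ϵ = (η + 1)s with η distributed Poisson(λ) and s = ±1 is the one from the text. Because the proposal depends only on |ϵ|, it is symmetric and cancels from the acceptance ratio:

```python
import numpy as np

def integer_rw_proposal(theta, lam, rng):
    """Random-walk proposal for an integer parameter:
    theta* = theta + (eta + 1) * s, with eta ~ Poisson(lam) and
    s in {-1, +1} with probability 0.5 each. The resulting step
    distribution is symmetric, so it cancels from the MH
    acceptance probability."""
    eta = rng.poisson(lam)
    s = rng.choice([-1, 1])
    return theta + (eta + 1) * s

rng = np.random.default_rng(1)
# Steps proposed from the current value 0; the average absolute
# step size is lam + 1, so lam acts as the tuning parameter.
steps = np.array([integer_rw_proposal(0, 2.0, rng) for _ in range(10000)])
```

The "+1" guarantees that the chain always proposes a move away from the current value; η = 0 alone would propose staying in place.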
The tuning parameter implicitly specifies an average size of ϵ and thus an average distance between \( {\theta}_k^{\ast } \) and \( {\theta}_k^r \) (also known as the step-size of the proposal distribution). Values of ϵ that are small in absolute value result in \( {\theta}_k^{\ast } \) close to \( {\theta}_k^r \) that are more likely to be accepted, i.e., \( {\theta}_k^{r+1}={\theta}_k^{\ast } \), than values of ϵ that are large in absolute value, which will more likely result in \( {\theta}_k^{r+1}={\theta}_k^r \) when applying Eq. 40. If the number of total iterations R to run the MH sampler were of no concern, any setting of the tuning parameters that results in nondegenerate q_{k} (ϵ) would result in valid posterior inferences based on applications of Eq. 40.
However, both ϵ that are too small on average and ϵ that are too large on average will result in MH samplers that require a larger number of total iterations R to deliver the same amount of information about the posterior distribution as “optimally sized” ϵ. The situation is analogous to studying a population based on sampling. Larger samples result in more reliable inference, and some sampling techniques result in higher statistical efficiency than others based on the same number of observations. Here, the population is the posterior distribution, the proposal density plays the role of the sampling plan, and importantly the sample size R is under our control, within the limits set by computational speed and time.
When the tuning parameter is set such that ϵ is too small on average, the MH sampler will explore the posterior in local neighborhoods extensively and navigate the entire posterior over many, many small steps, creating “large swings” such that time series plots look like those of financial indices that can move in one direction for extended periods of time, in this case potentially for tens of thousands of iterations. The consequence is that the chain may appear as if it does not converge to a stationary distribution at all.
When the tuning parameter is set such that ϵ is too large on average, the chain will remain at the same value for many iterations and may fail to move at all, i.e., never accept to set \( {\theta}_k^{r+1}={\theta}_k^{\ast } \). However, if it at least moves sometimes, such a chain will arrive at a region of relatively large posterior support in large jumps and tend to stay there. In that sense, ϵ that are too large – provided that the chain moves at all – are the lesser evil. However, any reliable statements about posterior uncertainty based on a finite number of MH draws require decently tuned proposal densities. In practice, some experimentation is required that again is supported by the analysis of simulated data.
To illustrate, I simulated 500 observations from a binomial-probit model with data generating parameter vector β = (−3, 2, 4). The first coefficient is an intercept and the remaining two are slope coefficients for two randomly uniformly distributed covariates (see the Appendix for the corresponding R-script). The script calls a simple, stylized RW-MH-sampler for a binomial probit model coupled with a multivariate normal prior for the probit coefficients implemented in plain R (see the function rbprobitRWMetropolis in the Appendix).
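The chapter's rbprobitRWMetropolis is implemented in R in the Appendix. The following is an independent minimal sketch of the same idea in Python (all function names here are illustrative, not the chapter's code): simulate binomial-probit data with β = (−3, 2, 4) and uniform covariates, then run a joint random-walk MH sampler under a diffuse normal prior:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Simulate 500 observations: intercept plus two uniform covariates.
beta_true = np.array([-3.0, 2.0, 4.0])
X = np.column_stack([np.ones(500), rng.uniform(size=(500, 2))])
y = (X @ beta_true + rng.normal(size=500) > 0).astype(int)

def log_post(beta, X, y, prior_sd=10.0):
    """Probit log-likelihood plus an independent N(0, prior_sd^2) prior."""
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)  # guard the log
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loglik - 0.5 * np.sum((beta / prior_sd) ** 2)

def rw_metropolis(X, y, step_size, R, rng):
    """Joint random-walk MH over the full coefficient vector."""
    beta = np.zeros(X.shape[1])
    lp = log_post(beta, X, y)
    draws = np.empty((R, X.shape[1]))
    for r in range(R):
        cand = beta + step_size * rng.normal(size=beta.shape)
        lp_cand = log_post(cand, X, y)
        if np.log(rng.uniform()) < lp_cand - lp:  # symmetric proposal
            beta, lp = cand, lp_cand
        draws[r] = beta  # repeated values record rejections
    return draws

draws = rw_metropolis(X, y, step_size=0.2, R=20000, rng=rng)
post_mean = draws[10000:].mean(axis=0)
```

Rerunning this sketch with step sizes such as 0.001 or 3 reproduces, qualitatively, the mixing pathologies discussed in the text.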
The top-left plot in Fig. 2 depicts the MCMCs that use the smallest step-size investigated here. It presents an example of an MCMC-trace from a sampler that has not converged to delivering samples from the posterior distribution. All three traces exhibit a trend away from zero over the entire course of the 200,000 iterations the sampler was run. Looking at the y-axis, we see that the individual traces are nowhere near the data generating values and reflective of the starting values, even in the last iteration. In an application to real data, we would not know what the data generating parameter values are to compare against. However, upon seeing something similar to the top-left plot, we would conclude that the sampler has not converged to a stationary distribution yet. Thus, summaries of the full set or any subset of the 200,000 draws in the top-left plot do not represent the posterior distribution.
The traces in the top-right plot result from a step-size σ_{k} that is five times larger than that in the top-left plot. We see that the three traces appear to converge to stationarity around iteration 50,000 or so, and we could use summaries of the last 150,000 draws to learn about the posterior distribution. With an even larger σ_{k} = 0.2, convergence to the stationary distribution is much quicker (see the bottom-left plot). Finally, when we use σ_{k} = 3, the largest MH step-size investigated here, we see that the MCMC relatively quickly jumps into the neighborhood of the data generating β, but sticks to the same parameter value, often for thousands of iterations.
Posterior means and standard deviations from the last 50,000 iterations

| Step-size | Mean β_{0} | Mean β_{1} | Mean β_{2} | SD β_{0} | SD β_{1} | SD β_{2} |
|---|---|---|---|---|---|---|
| 0.001 | −2.32 | 1.41 | 3.28 | 0.06 | 0.05 | 0.09 |
| 0.005 | −2.86 | 1.85 | 3.84 | 0.16 | 0.25 | 0.21 |
| 0.2 | −2.89 | 1.89 | 3.83 | 0.24 | 0.26 | 0.30 |
| 3 | −2.85 | 1.87 | 3.82 | 0.17 | 0.18 | 0.25 |
Posterior means and standard deviations from the last 250,000 iterations

| Step-size | Mean β_{0} | Mean β_{1} | Mean β_{2} | SD β_{0} | SD β_{1} | SD β_{2} |
|---|---|---|---|---|---|---|
| 0.001 | −2.28 | 1.34 | 3.23 | 0.10 | 0.10 | 0.13 |
| 0.005 | −2.92 | 1.90 | 3.89 | 0.23 | 0.27 | 0.29 |
| 0.2 | −2.88 | 1.88 | 3.83 | 0.24 | 0.26 | 0.30 |
| 3 | −2.90 | 1.94 | 3.82 | 0.23 | 0.23 | 0.32 |
Assuming the true posterior standard deviation (from a hypothetical infinite run of the MCMC) to be about 0.26 (see Table 7), we would expect the batch means to be distributed normally around the true mean with standard deviation \( .26/\sqrt{1000}\approx .008 \) simply because we cannot learn the exact mean of a nondegenerate posterior distribution from a finite sample. This translates into a 5-σ interval around the mean with a length of about 0.08. Any excess variation in batch means is evidence of the inefficiency of the employed sampling technologies relative to a hypothetical iid-sampler. From the x-axes in Fig. 5, we can see that batch means are distributed much more widely. Intuitively, a single 1000-iterations batch from each of the MCMCs is less informative about the posterior (more likely to summarize information from only parts of the posterior) than 1000 draws from a hypothetical iid-sampler. In addition, if someone had to bet on the inference from a randomly drawn single batch, he would prefer a draw from the Gibbs-sampler (in the bottom-right), followed by a draw from the RW-MH-sampler with step-size 0.2 (in the top-right). The decision between step-sizes 0.005 and 3 is less clear. However, from the wider distribution of batch means, it is obvious that MCMCs with these step-sizes explore the posterior less efficiently.
Finally, the batch standard deviations in Fig. 6 again identify the Gibbs-sampler as most efficient, followed by the RW-MH-chain with step-size 0.2. A randomly drawn batch of 1000 consecutive draws from these samplers is likely to yield a posterior standard deviation close to the posterior standard deviation estimated from all 600,000 − 50,000 = 550,000 draws. In addition, the top-left plot in Fig. 6 demonstrates that every single batch of 1000 consecutive iterations from the chain with step-size 0.005 substantially underestimates the posterior standard deviation. In contrast, the chain with the (too) large step-size of 3 often suggests no posterior uncertainty at all – when no proposal is accepted in the batch – but does not uniformly underestimate the posterior standard deviation. This again suggests that chains with step-sizes that are too small are potentially more misleading than chains with step-sizes that are too large.
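The batch-means logic used in this comparison is easy to implement. The following Python sketch (the helper `batch_summaries` is illustrative, not the chapter's code) contrasts a hypothetical iid sampler with a "sticky" AR(1) chain standing in for a poorly mixing MCMC; the excess spread of the AR(1) batch means relative to sd/√(batch size) is the inefficiency measure discussed above:

```python
import numpy as np

def batch_summaries(chain, batch_size=1000):
    """Split a chain into consecutive batches and return batch means and
    batch standard deviations. For an iid sampler the batch means scatter
    with standard deviation sd(chain)/sqrt(batch_size); excess scatter
    signals inefficiency of the MCMC relative to iid sampling."""
    n_batches = len(chain) // batch_size
    batches = chain[:n_batches * batch_size].reshape(n_batches, batch_size)
    return batches.mean(axis=1), batches.std(axis=1, ddof=1)

rng = np.random.default_rng(0)
iid = rng.normal(size=100000)            # hypothetical iid benchmark

rho, ar = 0.99, np.empty(100000)         # persistent chain, same marginal
ar[0] = 0.0
for t in range(1, 100000):
    ar[t] = rho * ar[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

iid_means, iid_sds = batch_summaries(iid)
ar_means, ar_sds = batch_summaries(ar)
```

A single batch from the persistent chain is much more likely to summarize only part of the target distribution, exactly as argued for the poorly tuned RW-MH chains.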
The examples discussed here nicely showcase that the emphasis in applied work should be on using, or devising, sampling schemes that mix well before even considering the formal assessment of convergence. In a sense, it is almost always obvious from a graphical inspection of MCMC-trace plots whether a sampler that mixes well has converged or not.
- 1.
Change the function rbprobitRWMetropolis in the Appendix to cycling through MH-steps that update individual elements of the parameter vector one at a time from their conditional posterior distributions. Experiment with tuning RW-proposals for each element of the parameter vector independently.
- 2.
Obtain a copy of the “plain R” version of rbprobitGibbs (version 2.2–5 of bayesm available from the CRAN-archives), replace the part that generates latent utilities z in line 141 with RW-MH steps, and verify with simulated data that this new algorithm works. The setup is generally interesting, because it is a toy version of a hierarchical model with MH-updates at the lower level and conjugate updates of parameters that form the hierarchical prior.
- 3.
Modify this sampler such that you propose candidate values \( {z}_i^{\ast } \) from their (hierarchical) prior distribution \( \mathcal{N}\left({\mathbf{x}}_i^{\prime}\boldsymbol{\beta}, 1\right) \). Note that the proposal and the prior distribution will cancel from the ratio in the MH-acceptance probability α. You will likely see that this sampler does not converge to a posterior distribution p (β|y) anywhere near the data generating values, even though the time series of β^{1}, … , β^{r}, … , β^{R} suggests immediate convergence and superior mixing! This is an example of the drawbacks of a (collection of) proposal densities that do not have enough mass in their tails.
Recent developments. An important recent development in the context of making numerically intensive Bayesian analysis more practical is the No U-turn Sampler (NUTS) by Hoffman and Gelman (2014) which is a self-tuning Hamiltonian-Monte-Carlo sampler (see e.g., Neal 2011). This technique has been implemented in Stan (Carpenter et al. 2017) which interfaces with many popular software environments including R, Python, Matlab, and Stata, for example.
The basic principle of Hamiltonian-Monte-Carlo (HMC) is to leverage Hamiltonian dynamics for a more effective exploration of the posterior. In physics, Hamiltonian dynamics describe the change in location and momentum of an object by differential equations. The solutions to the differential equations yield the location and the momentum of an object at any particular point in time.
In HMC, the locations correspond to values of the q-element parameter vector to be estimated. Each location is associated with a potential energy, and the statistical analogue is the negative of the log-posterior evaluated at these values (Thus, the posterior mode is the point of lowest potential energy we would gravitate to in the absence of “extra” kinetic energy that enables movements away from this point). The analogue to the momentum comes from expanding the parameter space by p additional parameters (where p = q), the negative log-density of which is the statistical analogue of kinetic energy (Thus, again, the mode of this density is the point (the momentum vector) with the lowest kinetic energy). Usually, these additional parameters are assumed to be standard normally distributed. However, it should be noted that the p additional parameters and their density are purely technical devices to complete the Hamiltonian, much like proposal distributions are technical devices to accomplish MH-sampling.
The algorithm first draws a p-element “momentum” vector from standard normal distributions. The momentum vector both defines the direction of the movement away from the current location (parameter value), and the maximum distance that can be realized, as explained next. HMC obeys the principle that the total energy, i.e., the sum of the potential and the kinetic energy is constant in the closed system described by the Hamiltonian, when deriving a new location (and a new momentum) at any point in time (see Eq. 2.12 in Neal 2011). Here time refers to some arbitrary time point after the onset of the momentum that generates a movement away from the current location.
The location-change is a function of the change in kinetic energy and the momentum-change a function of the change in potential energy. Note that the change in potential energy corresponds to the gradient of the negative log-posterior, and the change in kinetic energy to the gradient of the negative log-density of auxiliary momentum variables respectively, in statistical applications. If the differential equations describing the change in position and momentum could be solved exactly, one could solve for the location that is furthest away from the current location that can be reached in the direction of the current draw of the momentum, given its associated kinetic energy, define this as the new location, draw a new momentum vector, and so on.
It is useful to contemplate how such a procedure would explore the posterior. With a fixed distribution of momentum vectors (and corresponding kinetic energies), it would tend to move away more slowly from a pronounced posterior mode, i.e., in smaller steps in expectation, because of the steep increase in potential energy (defined as the negative of the log-posterior) around this mode. Here, the expectation is with respect to the fixed distribution of momentum vectors (and corresponding kinetic energies). Only outlying momentum vectors would supply sufficient kinetic energy to move far into directions of (much) higher potential energy. Conversely, it would tend to move more quickly, i.e., in larger steps in expectation, through areas of high-potential energy (small values of the log-posterior), and in the direction of low potential energy, in expectation. It is therefore somewhat intuitive that such a procedure would result in direct draws from the posterior that could represent the posterior effectively based on a relatively small number of draws. In contrast to RW-MH-sampling, the distance between two successive draws from this procedure would automatically reflect the concentration of the posterior at every value of the parameter space.
However, in practice, the solutions to the differential equations defining the Hamiltonian dynamics need to be approximated in discretized time. Again, time here refers to the time after the onset of the momentum that generates a movement away from the current location, i.e., the current parameter value. A discrete approximation that can be tuned to high accuracy (relative to the exact solution) is leapfrog integration. At each iteration of the HMC, L leapfrog steps that each correspond to a discrete time step of length ϵ are performed. Ideally, the number of steps L and the length of each step ϵ are chosen so that the new location (a new parameter value) is as far away as possible from the current parameter value, given the current draw of the p-element momentum vector and its associated kinetic energy, while keeping the approximation error low. Any remaining approximation error is controlled in an MH-step that compares the value of the Hamiltonian at the new position and the momentum at this position to the value of the Hamiltonian at the old position and the momentum vector that initiated the movement to the new position (In other words, the potential energy at the new location and the (remaining) kinetic energy are compared to the potential energy at the old location and the kinetic energy that brought about the movement to the new location). By the law of conservation of energy in the closed system described by the Hamiltonian, the Hamiltonian would evaluate to the same value if the discrete time approximation were exact.
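A single HMC iteration with leapfrog integration and the MH correction can be sketched compactly. The Python code below (illustrative names; a bivariate standard normal serves as a stand-in posterior) follows the scheme just described: half momentum step, full position step, half momentum step, repeated L times, then accept or reject based on the change in the Hamiltonian:

```python
import numpy as np

def hmc_draw(theta, log_post, grad_log_post, eps, L, rng):
    """One HMC iteration: draw a momentum vector, take L leapfrog steps
    of size eps, then accept/reject based on the change in the
    Hamiltonian (potential energy = -log posterior, kinetic
    energy = 0.5 * p'p for standard normal momenta)."""
    p = rng.normal(size=theta.shape)              # fresh momentum draw
    H_old = -log_post(theta) + 0.5 * p @ p
    q, g = theta.copy(), grad_log_post(theta)
    for _ in range(L):                            # leapfrog integration
        p = p + 0.5 * eps * g                     # half step: momentum
        q = q + eps * p                           # full step: position
        g = grad_log_post(q)
        p = p + 0.5 * eps * g                     # half step: momentum
    H_new = -log_post(q) + 0.5 * p @ p
    # MH-step controls the leapfrog discretization error; with exact
    # integration H_new = H_old and the move is always accepted.
    if np.log(rng.uniform()) < H_old - H_new:
        return q
    return theta

log_post = lambda th: -0.5 * th @ th              # bivariate standard normal
grad_log_post = lambda th: -th                    # its gradient

rng = np.random.default_rng(3)
theta = np.zeros(2)
draws = np.empty((5000, 2))
for r in range(5000):
    theta = hmc_draw(theta, log_post, grad_log_post, eps=0.2, L=10, rng=rng)
    draws[r] = theta
```

In this sketch ϵ and L are fixed by hand; tuning them (and the scaling of the momenta) automatically is precisely what NUTS contributes.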
NUTS automatically tunes L, ϵ, and additional parameters that rescale the kinetic energy in different dimensions of the log-posterior to arrive at a highly effective HMC-sampler that does not normally require user intervention. Thus, the researcher can fully concentrate on specifying the model, i.e., the likelihood and the prior, knowing that high quality numerical inference from the implied posterior is available through NUTS. A limitation is that the gradient of the log-posterior needs to be defined, which excludes discrete variables as direct objects of inference. However, in many models, discrete latent variables are introduced as augmented data, such as in models defining a discrete mixture of distributions. In these cases, NUTS could be used to sample from the posterior marginalized with respect to discrete latent variables. Based on the marginal posterior, the posterior distribution of discrete latent variables can be easily derived.
Model comparison
In the introduction, I mentioned the possibility of determining the dimensionality of a flexibly formulated model using the Bayesian approach. I also alluded to the possibility of making comparisons across different models for the same data, where models may arbitrarily differ in terms of likelihood functions, prior specifications, or both. Here, I will briefly describe the basic principles to this end. Specifically, I will show how the Bayesian approach can deliver consistent evidence for a more parsimonious model. As usual, consistency means convergence to the data generating truth as the sample size increases (When the set of models compared does not contain the model that in fact corresponds to the data generating truth, consistency means convergence to the model that is closest to the data generating truth in a predictive sense).
This contrasts with the classical frequentist approach, where we can only “fail to reject” relatively simpler descriptions of the world, i.e., more parsimonious theories and models in comparison to more complex models. I personally see this as a drawback of the classical frequentist approach because theory aimed at understanding the underlying causal mechanisms of observed associations generally thrives on establishing that particular (direct) causal effects do not exist.
By convention, Bayes Factors larger than 3 count as weak but sufficient evidence in favor of the model in the numerator; Bayes Factors larger than 20 count as strong evidence (Kass and Raftery 1995). I will comment more on this convention later.
For example, it would be perfectly alright to compare model k with marginal likelihood ∫p_{k}(y|θ)p_{k}(θ)dθ to a model j that introduces observed conditioning arguments (predictors, covariates) X, i.e., ∫p_{j}(y|X, θ)p_{j}(θ)dθ, or to include model i that uses additional data y′ for calibration in the comparison, based on ∫p_{i} (y|θ)p_{i}(θ|y^{′})dθ (However, note that ∫p_{i}(y, y^{′}|θ)p_{i}(θ)dθ ≠ ∫ p_{i}(y|θ)p_{i}(θ| y^{′})dθ. The former is a marginal likelihood for the data (y, y') and not for the data y. Marginal likelihoods for different models can only be directly compared as long as they pertain to the same data). A useful intuition for marginal likelihoods is that they reduce radically different, per se incomparable “stories” about what may have generated the data to densities for the data, which are directly comparable in the same way as we can compare predictions for the same event completely independent of the considerations that gave rise to the prediction.
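This intuition can be made concrete in the simplest conjugate setting. The Python sketch below (model and prior values are illustrative, not from the chapter) computes exact log marginal likelihoods for two radically different "stories" about normally distributed data with known variance, differing only in their priors, and compares them purely as densities for the same data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marglik(y, mu0, tau0, sigma=1.0):
    """Log marginal likelihood for y_i ~ N(theta, sigma^2) with prior
    theta ~ N(mu0, tau0^2): integrating theta out analytically leaves
    y ~ N(mu0 * 1, sigma^2 * I + tau0^2 * J), a density for the data."""
    n = len(y)
    cov = sigma**2 * np.eye(n) + tau0**2 * np.ones((n, n))
    return multivariate_normal.logpdf(y, mean=np.full(n, mu0), cov=cov)

rng = np.random.default_rng(7)
y = rng.normal(loc=2.0, scale=1.0, size=50)  # data centered around theta = 2

# Two competing "stories" (priors centered at +2 versus -2), reduced to
# densities for the same data y and compared via the log Bayes factor:
log_bf = log_marglik(y, mu0=2.0, tau0=1.0) - log_marglik(y, mu0=-2.0, tau0=1.0)
```

Both marginal likelihoods pertain to the same data y, so their ratio is a valid Bayes factor; here the model whose prior is consistent with the data receives strong support.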
Here, we exploited the fact p(y| β^{0}, Σ^{0})p(β| y, β^{0}, Σ^{0}) = p(y| β)p(β| β^{0}, Σ^{0}), by elementary rules of probability. Also note that with the intent to eventually compare across models defined by different likelihoods and priors, we kept track of all normalizing constants that we conveniently ignored before, when deriving the posterior distribution in Eq. 16. Specifically, we previously ignored the factors \( 1/\sqrt{2\pi } \) and (2π)^{−k/2}|Σ^{0}|^{−1/2} in the likelihood p(y|β) and the prior p(β|β^{0}, Σ^{0}), respectively.
Recall that \( \tilde{s} \) in the last line of Eq. 49 is a deterministic function of the subjective prior parameters β^{0}, Σ^{0}, and the data y (see Eq. 14). For all nondegenerate prior choices, \( \tilde{s} \) is going to be dominated by the term \( {\left(\mathbf{y}-{\mathbf{X}}^{\prime}\tilde{\boldsymbol{\beta}}\right)}^{\prime}\left(\mathbf{y}-{\mathbf{X}}^{\prime}\tilde{\boldsymbol{\beta}}\right) \), where \( \tilde{\boldsymbol{\beta}} \) converges to the maximum likelihood or ordinary least squares estimate as more data become available (assuming regular X).
If in contrast ℳ_{1} were the true model, or just closer to the truth in this case, the coefficients in \( {\tilde{\boldsymbol{\beta}}}_1 \) that correspond to X^{s} do not converge to zero. As a consequence, \( {\tilde{s}}_0 \) would grow faster in n than \( {\tilde{s}}_1 \), and BF_{0,1} would converge to zero (Note that \( \exp \left(\frac{-{\tilde{s}}_0+{\tilde{s}}_1}{2}\right)=\exp \left(n\frac{-{\tilde{s}}_0/n+{\tilde{s}}_1/n}{2}\right) \) converges to zero faster than n^{s/2} grows because of the exponential function, where \( -{\tilde{s}}_0/n+{\tilde{s}}_1/n \) converges to the true difference in average squared errors between ℳ_{1} and ℳ_{0}). Thus, the Bayes’ factor can both produce increasing evidence for the more parsimonious model, when the constraints imposed by this model hold exactly, and increasing evidence against it, when they do not (consider BF_{1,0} instead of BF_{0,1} in this case) (In this case, the conventional classifications of weak and strong evidence in favor of the model in the numerator of the Bayes’ factor often align with the usual cut-off values for rejecting a more constrained model based on p-values). In contrast, p-values can reliably reject a parsimonious model but are incapable of producing increasing evidence for such a model. By construction, the probability of rejecting a true, more parsimonious model in favor of a larger, over-parameterized model is equal to the chosen significance level in repeated applications of the frequentist testing procedure, and independent of the sample size (the amount of information in the data).
Numerical Illustrations
A Brief Note on Software Implementation
Researchers interested in adopting the Bayesian approach nowadays have considerable choice among software packages and implementations of the Bayesian approach. More recently, established products for data analysis such as SPSS, STATA, or SAS have started to include options for Bayesian estimation of well-established “standard” statistical models such as ANOVA and generalized linear regression models (Advanced users can certainly use these tools to estimate “their own” models too, and STATA specifically emphasizes this possibility). In contrast, WinBUGS is an example of an attempt to automate Bayesian inference, with the idea that the user should be able to concentrate exclusively on the specification of a model, likely outside the set of “standard” statistical models implemented elsewhere, aided by a graphical user interface.
Many, if not the vast majority, of “Bayesian” papers published in marketing to this day have relied on coding up the model and the (invariably) MCMC-routine to perform Bayesian inference “from scratch,” starting from some example code and taking advantage of components that repeat themselves across different models, e.g., conditionally conjugate updating of parameters indexing hierarchical priors. The programming languages used in this context include compiled languages such as C or Fortran, and interpreted languages such as Matlab, R, and Gauss. The former are by construction less interactive when coding, and the latter slower in the execution of code “that works.” Recently, Rcpp (Eddelbuettel and François 2011; Eddelbuettel 2013) emerged as an extremely useful compromise between the speed of compiled languages and the coder-friendliness of interpreted languages.
I am currently relying heavily on Rcpp in my own research. However, I view the advent of the No U-turn Sampler (NUTS) by Hoffman and Gelman (2014) as implemented in Stan (Carpenter et al. 2017) as a major breakthrough towards the goal of focusing on the specification of innovative models (almost) exclusively.
A Hierarchical Bayesian Multinomial Logit Model
Thus, brand A is slightly preferred to the outside good on average, whereas brand B is less attractive than the outside good to the average consumer in this market. However, there is a fair amount of heterogeneity in brand preferences in this market. For example, about 12.4% of consumers in this market prefer brand B to the outside good and around 43% prefer the outside good to brand A at x = 0. Moreover, consumers that have an above average preference for brand A are likely to have a below average preference for brand B in this market, as per the strongly negatively correlated brand coefficients in the population (\( \rho =-.997 \)). The tastes for the covariate x are relatively more homogenous, and only consumers in the extreme tail of the preference distribution exhibit a higher preference for larger values of x in this population. I simulate N = 2,000 individuals from this population and have each individual make T = 5 choices from choice sets that randomly vary in the x-values for brands A and B, both across t = 1,…,T and i = 1,…,N. I use these data to calibrate a Bayesian hierarchical MNL-model. I rely on the default subjective prior distributions implemented in \( \mathrm{bayesm} \)’s estimation routine rhierMnlRwMixture and run this RW-MH-sampler with automatic tuning of proposal densities for 100,000 iterations, saving every 10th draw (in \( \mathrm{bayesm} \): R = 100,000, keep = 10). The complete posterior is a 6009-dimensional object (3 means plus 3 variances plus 3 covariances plus 2000 times 3 individual level random effects). Because of the high dimensionality of the posterior, saving every draw from a long MCMC run can easily produce an object that taxes a computer’s RAM heavily. Saving every keep-th draw increases the information content in a posterior sample limited by a computer’s RAM.
For a given maximum number of draws that can be saved, we can increase the number of MCMC-iterations \( \mathrm{R} \) when we simultaneously increase the number of iterations between saved draws \( \left(\mathrm{keep}-1\right) \). The information content in the resulting sample increases because saved draws separated by \( \mathrm{keep}-1 \) MCMC iterations will tend to be more independent of each other, replicating less of the information contained in the preceding saved draw.
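The effect of thinning can be illustrated with a simple stand-in for autocorrelated MCMC output. In the Python sketch below (illustrative, not the chapter's code), a persistent AR(1) chain plays the role of the sampler's raw output; saving every 10th draw (keep = 10) sharply reduces the lag-one autocorrelation among the saved draws:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-one autocorrelation of a chain of draws."""
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

rng = np.random.default_rng(5)
rho, R = 0.95, 200000
chain = np.empty(R)        # persistent AR(1) chain standing in for
chain[0] = 0.0             # correlated MCMC output
for t in range(1, R):
    chain[t] = rho * chain[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

thinned = chain[::10]      # save every 10th draw (keep = 10)
```

For an AR(1) chain the autocorrelation between saved draws drops from ρ to roughly ρ^keep, so a RAM-limited sample of thinned draws carries more information about the stationary distribution than the same number of consecutive draws.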
Looking at the maximum likelihood estimates and comparing them to the green bars, we can see that they are extremely inaccurate. Clearly, individual level posterior inference benefits tremendously from the information in the hierarchical prior distribution that the model learns by pooling information across the 2000 consumers in our simulated short panel.
Note that we estimated the model that was used to generate the data here. In applications, it is very likely that some or all subjective choices that go into the formulation of the model result in systematic differences from the data generating mechanism, including the choice of the hierarchical prior distribution that was (implicitly) chosen to be multivariate normal in this illustration. However, it is also clear that even misspecified hierarchical prior distributions can strike a beneficial bias-variance trade-off in applications where individual level maximum likelihood estimates are extremely noisy or may not exist at all. In fact, this bias-variance trade-off is at the source of the inroads Bayesian hierarchical models have made into applications in marketing. For a discussion of how to imbue hierarchical prior distributions with subjective knowledge about ordinal relationships, see Pachali et al. (2018).
Mediation Analysis: A Case for Bayesian Model Comparisons
The first equation regresses Y on the randomly assigned experimental variable X. A statistically significant coefficient c establishes empirical support for the total effect from X to Y (see Fig. 10, Panel a). Because of the random assignment of X, the coefficient c necessarily measures a causal effect. The second equation regresses M on X. A statistically significant coefficient a establishes empirical support for the effect from X to M, which is again causal by experimental design. The third equation regresses Y on randomly assigned X and on observed M. Finding that the effect from X on Y vanishes when conditioned on M (i.e., that there is no direct effect c*) unequivocally establishes (full) mediation as the causal data generating model (see Fig. 10, Panel b). (In the limit of an infinite amount of data, the estimate of c* will converge to exactly zero only under full mediation. The only alternative process that yields c* = 0 in the limit features M as a joint cause of X and Y without any other connection between X and Y. This process is ruled out a priori when X is experimentally manipulated.) Usually, empirical support for the hypothesis c* = 0 is established based on p-values larger than some subjectively chosen significance level. An obvious drawback of this approach is that p-values, by construction, fail to measure the strength of empirical support for conditional independence, which in turn establishes full mediation. Based on p-values, we can only “fail to reject” the null hypothesis.
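The three regressions can be illustrated on simulated data under full mediation. The Python sketch below uses illustrative parameter values of my own choosing (a = 4, b = 0.5, c* = 0, unit intercepts and error variances), not estimates from the chapter:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5_000

# Full-mediation DGP: X -> M -> Y, with no direct effect of X on Y.
X = rng.uniform(size=N)
M = 1.0 + 4.0 * X + rng.normal(size=N)            # a = 4
Y = 1.0 + 0.0 * X + 0.5 * M + rng.normal(size=N)  # c* = 0, b = 0.5

def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    Z = np.column_stack((np.ones_like(y),) + regressors)
    return np.linalg.lstsq(Z, y, rcond=None)[0]

c_total = ols(Y, X)[1]            # total effect, near a*b = 2
a_hat = ols(M, X)[1]              # effect of X on M, near 4
c_star, b_hat = ols(Y, X, M)[1:]  # direct effect near 0, b near 0.5

print(c_total, a_hat, c_star, b_hat)
```

In large samples the estimated direct effect hovers near zero, but a p-value for c* can only “fail to reject” that it is zero; it does not quantify the evidence for conditional independence.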
Next, I illustrate the differences between the classical and the Bayesian approach in the context of c* = 0 using a sampling experiment. I thus consider the case of full mediation as the DGP. Accordingly, I set t_{2} = t_{3} = 1, a = 4, c* = 0, b = 0.5, and σ_{m} = σ_{Y*} = 1 in Eqs. 52 and 53 and generate artificial data sets of different sizes: N_{1} = 50, N_{2} = 200, and N_{3} = 2,000 (X_{i} is drawn from a uniform distribution for each i ∈ {1, … , N}). I conduct 1,000 replications for each data set size and compute Bayes factors defined as ratios of marginal likelihoods of the model \( {\mathrm{\mathcal{M}}}_0:{Y}_i={t}_3+{bM}_i+{\varepsilon}_{Y,i}^{\ast } \) and the model \( {\mathrm{\mathcal{M}}}_1:{Y}_i={t}_3+{c}^{\ast }{X}_i+{bM}_i+{\varepsilon}_{Y,i}^{\ast } \). Note that the former is more restricted than the latter: it implies that the coefficient c* in the latter model is equal to zero (see Otter et al. (2018) for the computational details and R-scripts).
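A rough sense of how such a model comparison behaves can be obtained from the BIC approximation to the Bayes factor (Kass and Raftery 1995). The Python sketch below substitutes this crude approximation for the exact marginal-likelihood computations in Otter et al. (2018), using one simulated data set of size N = 2,000:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2_000

# Full-mediation DGP from the sampling experiment: c* = 0, a = 4, b = 0.5.
X = rng.uniform(size=N)
M = 1.0 + 4.0 * X + rng.normal(size=N)
Y = 1.0 + 0.5 * M + rng.normal(size=N)

def bic(y, Z):
    """BIC of a Gaussian linear model y = Z beta + eps (MLE of sigma^2)."""
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    sigma2 = resid @ resid / len(y)
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    k = Z.shape[1] + 1  # regression coefficients plus error variance
    return -2 * loglik + k * np.log(len(y))

ones = np.ones(N)
bic0 = bic(Y, np.column_stack([ones, M]))      # M0: c* restricted to zero
bic1 = bic(Y, np.column_stack([ones, X, M]))   # M1: c* estimated freely

# Kass & Raftery: 2*log(BF_01) is approximated by BIC(M1) - BIC(M0).
bf_01 = np.exp(0.5 * (bic1 - bic0))
print(bf_01)  # values above 1 favor the restricted, full-mediation model
```

Because the restriction c* = 0 holds in the DGP, the penalty term log(N) dominates the small likelihood gain of the unrestricted model, so the approximate Bayes factor accumulates evidence for \( {\mathrm{\mathcal{M}}}_0 \) as N grows, mirroring the pattern in the table below.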
Distribution of Bayes’ factors in simulation

|  | Pr(BF > 3) | Pr(BF > 20) | Pr(BF > 100) |
|---|---|---|---|
| N = 50 | 0.94 | 0.04 | 0.00 |
| N = 200 | 0.97 | 0.71 | 0.00 |
| N = 2000 | 0.99 | 0.93 | 0.43 |
Distribution of p-values in simulation

|  | Pr(p-value > 0.01) | Pr(p-value > 0.05) | Pr(p-value > 0.10) |
|---|---|---|---|
| N = 50 | 0.99 | 0.96 | 0.90 |
| N = 200 | 0.99 | 0.96 | 0.90 |
| N = 2000 | 0.99 | 0.96 | 0.90 |
Thus, when the data generating process implies conditional independence (c* = 0), the Bayesian approach is the superior measure of empirical evidence for this process compared to the approach based on p-values: the Bayes factor increasingly favors the restricted model as N grows, whereas the distribution of p-values remains unchanged across sample sizes.
Conclusion
Writing a chapter like this one certainly involves many trade-offs. I have chosen to emphasize general principles of Bayesian decision-making and inference in the hope of interesting and exciting readers who have an inclination toward quantitative methodology and are serious about improving marketing decisions. The promise of a deeper appreciation of the Bayesian paradigm, both in terms of its foundations in (optimal) decision-making and in terms of its computational approaches, is better tailored quantitative methods that can be developed and implemented as required by a new decision problem, or for the purpose of extracting (additional) knowledge from a new data source.
A drawback of this orientation is that the plethora of existing models that are usefully implemented in a fully Bayesian estimation framework, including commonplace prior distributions, is not even enumerated in this chapter. However, I believe that the full appreciation of individual, concrete applications requires a more general understanding of the Bayesian paradigm. Once this general understanding develops, that of different individual models follows naturally.
Acknowledgments
I would like to thank Anocha Aribarg, Albert Bemmaor, Joachim Büschken, Arash Laghaie, anonymous reviewers, the editors, and participants in my class on “Bayesian Modeling for Marketing” for helpful comments and feedback. All remaining errors are obviously mine.
References
- Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. http://www.jstor.org/stable/2290350
- Allenby, G. M., Arora, N., & Ginter, J. L. (1995). Incorporating prior knowledge into the analysis of conjoint studies. Journal of Marketing Research, 32(2), 152–162. http://www.jstor.org/stable/3152044
- Allenby, G. M., Arora, N., & Ginter, J. L. (1998). On the heterogeneity of demand. Journal of Marketing Research, 35(3), 384–389. http://www.jstor.org/stable/3152035
- Amemiya, T. (1985). Advanced econometrics. Cambridge, MA: Harvard University Press.
- Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182. https://doi.org/10.1037/0022-3514.51.6.1173
- Bernardo, J. M., & Smith, A. F. M. (2001). Bayesian theory. Measurement Science and Technology, 12(2), 221. http://stacks.iop.org/0957-0233/12/i=2/a=702
- Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2), 192–236. http://www.jstor.org/stable/2984812
- Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32. https://doi.org/10.18637/jss.v076.i01
- Chen, M.-H., Shao, Q.-M., & Ibrahim, J. G. (2000). Monte Carlo methods in Bayesian computation. New York: Springer.
- Chib, S., & Carlin, B. P. (1999). On MCMC sampling in hierarchical longitudinal models. Statistics and Computing, 9(1), 17–26. https://doi.org/10.1023/A:1008853808677
- Eddelbuettel, D. (2013). Seamless R and C++ integration with Rcpp. New York: Springer.
- Eddelbuettel, D., & François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18. https://doi.org/10.18637/jss.v040.i08
- Edwards, Y. D., & Allenby, G. M. (2003). Multivariate analysis of multiple response data. Journal of Marketing Research, 40(3), 321–334. https://doi.org/10.1509/jmkr.40.3.321.19233
- Fasiolo, M. (2016). An introduction to mvnfast. R package version 0.1.6. https://CRAN.R-project.org/package=mvnfast
- Frühwirth-Schnatter, S., Tüchler, R., & Otter, T. (2004). Bayesian analysis of the heterogeneity model. Journal of Business & Economic Statistics, 22(1), 2–15. https://doi.org/10.1198/073500103288619331
- Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., & Hothorn, T. (2018). mvtnorm: Multivariate normal and t distributions. R package version 1.0-8. https://CRAN.R-project.org/package=mvtnorm
- Geweke, J. (1991). Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints and the evaluation of constraint probabilities. In E. M. Keramidas (Ed.), Computing science and statistics: Proceedings of the 23rd symposium on the interface (pp. 571–578).
- Gilks, W. R. (1996). Full conditional distributions. In S. Richardson, D. J. Spiegelhalter, & W. R. Gilks (Eds.), Markov chain Monte Carlo in practice (pp. 75–88). London/Melbourne: Chapman & Hall.
- Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
- Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. https://doi.org/10.1080/00401706.1970.10488634
- Hoffman, M. D., & Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623. http://jmlr.org/papers/v15/hoffman14a.html
- Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572
- Lenk, P. J., & DeSarbo, W. S. (2000). Bayesian inference for finite mixtures of generalized linear models with random effects. Psychometrika, 65(1), 93–119. https://doi.org/10.1007/BF02294188
- Lenk, P. J., DeSarbo, W. S., Green, P. E., & Young, M. R. (1996). Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2), 173–191. https://doi.org/10.1287/mksc.15.2.173
- Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks: Sage Publications.
- McCulloch, R., & Rossi, P. (1994). An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1–2), 207–240. https://EconPapers.repec.org/RePEc:eee:econom:v:64:y:1994:i:1-2:p:207-240
- Mersmann, O., Trautmann, H., Steuer, D., & Bornkamp, B. (2018). truncnorm: Truncated normal distribution. R package version 1.0-8. https://CRAN.R-project.org/package=truncnorm
- Montgomery, A. L., & Bradlow, E. T. (1999). Why analyst overconfidence about the functional form of demand models can lead to overpricing. Marketing Science, 18(4), 569–583. http://www.jstor.org/stable/193243
- Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L. Jones, & X.-L. Meng (Eds.), Handbook of Markov chain Monte Carlo (Chap. 5). Chapman & Hall/CRC. http://arxiv.org/abs/1206.1901
- Orme, B. (2017). The CBC system for choice-based conjoint analysis. Technical Report. https://sawtoothsoftware.com/download/techpap/cbctech.pdf
- Otter, T., Tüchler, R., & Frühwirth-Schnatter, S. (2004). Capturing consumer heterogeneity in metric conjoint analysis using Bayesian mixture models. International Journal of Research in Marketing, 21(3), 285–297. https://doi.org/10.1016/j.ijresmar.2003.11.002
- Otter, T., Gilbride, T. J., & Allenby, G. M. (2011). Testing models of strategic behavior characterized by conditional likelihoods. Marketing Science, 30(4), 686–701. http://www.jstor.org/stable/23012019
- Otter, T., Pachali, M. J., Mayer, S., & Landwehr, J. R. (2018). Causal inference using mediation analysis or instrumental variables – Full mediation in the absence of conditional independence. Marketing ZFP, 40(2), 41–57. https://doi.org/10.15358/0344-1369-2018-2-41
- Pachali, M. J., Kurz, P., & Otter, T. (2018). How to generalize from a hierarchical model? Technical Report. https://ssrn.com/abstract=3018670
- Pearl, J. (2009). Causality: Models, reasoning and inference (2nd ed.). New York: Cambridge University Press.
- Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). Coda: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7–11. https://journal.r-project.org/archive/
- Ritter, C., & Tanner, M. A. (1992). Facilitating the Gibbs sampler: The Gibbs stopper and the Griddy-Gibbs sampler. Journal of the American Statistical Association, 87(419), 861–868. https://doi.org/10.1080/01621459.1992.10475289
- Robert, C. P. (1994). The Bayesian choice: A decision-theoretic motivation. New York: Springer.
- Roberts, G. O. (1996). Markov chain concepts related to sampling algorithms. In S. Richardson, D. J. Spiegelhalter, & W. R. Gilks (Eds.), Markov chain Monte Carlo in practice (pp. 45–58). London/Melbourne: Chapman & Hall.
- Rossi, P. E., McCulloch, R. E., & Allenby, G. M. (1996). The value of purchase history data in target marketing. Marketing Science, 15(4), 321–340. https://doi.org/10.1287/mksc.15.4.321
- Rossi, P. E., Allenby, G. M., & McCulloch, R. E. (2005). Bayesian statistics and marketing. Chichester: Wiley.
- Wachtel, S., & Otter, T. (2013). Successive sample selection and its relevance for management decisions. Marketing Science, 32(1), 170–185. https://doi.org/10.1287/mksc.1120.0754
- Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.