1 Introduction

Conditional independence is a probabilistic approach to causality (Suppes 1970; Dawid 1979, 2004, 2007; Spohn 1980, 1994; Pearl 2009; Chalak and White 2012), whereas, for instance, correlation is not, as it is a symmetric relationship. Features of conditional independence are

  • Conditionally independent random variables are conditionally uncorrelated.

  • Conditionally independent random variables may be significantly correlated or not.

  • Independence does not imply conditional independence and vice versa.

  • Pairwise conditional independence does not imply joint conditional independence.

Statistical tests for pairwise conditional independence of random variables have been devised, e.g., Bergsma (2004), Su and White (2007), Su and White (2008), Song (2009), Bergsma (2010), Huang (2010), Zhang et al. (2011), Bouezmarni et al. (2012), Györfi and Walk (2012), Doran et al. (2014), Ramsey (2014), Huang et al. (2016); testing joint conditional independence of several random variables, however, remains a challenge in general. For the special case of dichotomous variables, the “omnibus test” (Bonham-Carter 1994) and the “new omnibus test” (Agterberg and Cheng 2002) have been suggested.

Weak conditional independence of random variables was introduced in Wong and Butz (1999), and elaborated on in Butz and Sanscartier (2002). Extended conditional independence has recently been introduced in Constantinou and Dawid (2015). The definition of weak conditional independence given in Cheng (2015) refers to conditionally independent random events, and rephrases conditional independence in terms of ratios of conditional probabilities rather than conditional probabilities themselves to avoid the distinction between conditional independence given a conditioning event and given its complement. This definition becomes irrelevant when proceeding from elementary probabilities of events to probabilities of random variables, and to the general definition of conditionally independent random variables.

Conditional independence is an issue in a Bayesian approach to estimating posterior (conditional) probabilities of a dichotomous random target variable in terms of weights-of-evidence (Good 1950, 1960, 1985). In turn, conditional independence is the major mathematical assumption of potential modeling with weights of evidence, cf. (Bonham-Carter et al. 1989; Agterberg and Cheng 2002; Schaeben 2014b), e.g., applied to prospectivity modeling of mineral deposits. The method requires a training data set laid out in regular cells (pixels, voxels) of equal physical size representing the support of probabilities. The sum of posterior probabilities over all cells equals the sum of the target variable over all cells. Deviations indicate a violation of the assumption of conditional independence, and are used as the statistic of a test (Agterberg and Cheng 2002) which involves a normality assumption. Oddly enough, ArcSDM calculates so-called normalized probabilities, i.e., posterior probabilities rescaled so that the overall measure of conditional independence is satisfied (ESRI 2018); of course, this rescaling does not fix any problem. Violation of the assumption of conditional independence corrupts not only the posterior (conditional) probabilities estimated with weights of evidence, but also their ranks, cf. (Schaeben 2014b), which is worse. Thus, the method of weights-of-evidence requires the mathematical modeling assumption of conditional independence to yield reasonable predictions. However, conditional independence is an issue with respect to logistic regression, too.

2 From Contingency Tables to Log-Linear Models

A comprehensive exposition of log-linear models is given in Christensen (1997). Let \({\varvec{ Z}}\) be a random vector of categorical random variables \(\mathsf Z_\ell , \ell =0,\ldots ,m\), i.e., \({\varvec{ Z}} = (\mathsf Z_0, \mathsf Z_1, \ldots , \mathsf Z_m)^{\mathsf {T}}\). It is completely characterized by its distribution

$$\begin{aligned} p_{\kappa } = P_{{\varvec{ Z}}} ({\varvec{s}}_{\kappa }) = P({\varvec{ Z}} = {\varvec{s}}_{\kappa }) = P \left( (\mathsf Z_0,\ldots ,\mathsf Z_m) = (s_{k_0}, \ldots , s_{k_m}) \right) \end{aligned}$$

with the multi-index \(\kappa = (k_0, \ldots , k_m)\), where \(s_{k_\ell }\) with \(k_\ell = 1,\ldots ,K_\ell \) denotes all possible categories of the categorical random variable \(\mathsf Z_\ell , \ell =0,\ldots ,m\). Since it is assumed that there is a total of \(K_\ell \) different categories with \(P_{\mathsf Z_\ell }(s_{k_\ell }) > 0\), there is a total of \(\prod _{\ell =0}^m K_\ell \) different categorical states for \(\varvec{ Z} = \bigotimes _{\ell =0}^m \mathsf Z_\ell \).

The distribution of a categorical random vector may initially be thought of as being provided by contingency tables. More conveniently, the distribution of a categorical random vector \({\varvec{ Z}}\) can generally be written in terms of a log-linear model as

$$\begin{aligned} \log p_{\kappa } = \sum _{\kappa } w_{\kappa } \; f_{{\varvec{ Z}}}^{\kappa } ({\varvec{z}}) \end{aligned}$$

with coefficients \(w_{\kappa } \in {\mathbb R}\) and functions \(f_{{\varvec{ Z}}}^{\kappa }\) indicating the categorical states, i.e., \(f_{{\varvec{ Z}}}^{\kappa }({\varvec{z}}) = 1\) if \({\varvec{z}} = {\varvec{s}}_{\kappa }\) and \(f_{{\varvec{ Z}}}^{\kappa }({\varvec{z}}) = 0\) otherwise.
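
As a minimal numerical sketch of such a representation (the probabilities are purely illustrative, and the sum-to-zero effect coding shown here is one common parameterization, not necessarily the one used in the sequel), the saturated log-linear model of a \(2 \times 2\) table reproduces the table exactly:

```python
import numpy as np

# p[k0, k1] = P(Z_0 = s_k0, Z_1 = s_k1); the numbers are purely illustrative
p = np.array([[0.40, 0.20],
              [0.10, 0.30]])
logp = np.log(p)

# effect-coding (sum-to-zero) parameterization of log p_{k0 k1}
u   = logp.mean()                             # overall term
u0  = logp.mean(axis=1) - u                   # one-variable terms of Z_0
u1  = logp.mean(axis=0) - u                   # one-variable terms of Z_1
u01 = logp - u - u0[:, None] - u1[None, :]    # two-variable (interaction) terms

# the saturated log-linear model reproduces the table exactly
assert np.allclose(logp, u + u0[:, None] + u1[None, :] + u01)
```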

3 Independence, Conditional Independence of Random Variables

If the random variables \(\mathsf Z_\ell , \ell =1,\ldots ,m\), are independent, then the joint probability of any subset of random variables \(\mathsf Z_{\ell }\) can be factorized into the product of the individual probabilities, i.e.,

$$\begin{aligned} P_{ \bigotimes _{\ell \in M} \mathsf Z_\ell } = \bigotimes _{\ell \in M} P_{\mathsf Z_\ell }, \end{aligned}$$

where M denotes any non-empty subset of the set \(\{1,\ldots ,m \}\). In particular

$$\begin{aligned} P_{\varvec{ Z}} = P_{ \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \bigotimes _{\ell =1}^m P_{\mathsf Z_\ell }. \end{aligned}$$

If the random variables \(\mathsf Z_\ell , \ell =1,\ldots ,m\), are conditionally independent given \(\mathsf Z_0\), then the joint conditional probability of any subset of random variables \(\mathsf Z_\ell \) given \(\mathsf Z_0\) can be factorized into the product of the individual conditional probabilities, i.e.,

$$\begin{aligned} P_{ \bigotimes _{\ell \in M} \mathsf Z_\ell \mid \mathsf Z_0 } = \bigotimes _{\ell \in M} P_{\mathsf Z_\ell \mid \mathsf Z_0}, \end{aligned}$$
(3.1)

and in particular

$$\begin{aligned} P_{ \bigotimes _{\ell =1}^m \mathsf Z_\ell \mid \mathsf Z_0 } = \bigotimes _{\ell =1}^m P_{\mathsf Z_\ell \mid \mathsf Z_0}. \end{aligned}$$
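
A minimal numerical check of this factorization (all probabilities purely illustrative, assuming NumPy): a joint pmf constructed from a marginal for \(\mathsf Z_0\) and conditional pmfs for \(\mathsf Z_1, \mathsf Z_2\) given \(\mathsf Z_0\) satisfies Eq. (3.1) by construction.

```python
import numpy as np

# marginal of Z_0 and conditional pmfs of Z_1, Z_2 given Z_0 (illustrative numbers)
p_z0 = np.array([0.7, 0.3])
p_z1_given_z0 = np.array([[0.9, 0.1],
                          [0.4, 0.6]])
p_z2_given_z0 = np.array([[0.8, 0.2],
                          [0.5, 0.5]])

# joint pmf p[z0, z1, z2] built to be conditionally independent given Z_0
p = (p_z0[:, None, None]
     * p_z1_given_z0[:, :, None]
     * p_z2_given_z0[:, None, :])

# the joint conditional pmf of (Z_1, Z_2) given Z_0 equals the product of the
# individual conditional pmfs, i.e., Eq. (3.1) holds
p_z1z2_given_z0 = p / p.sum(axis=(1, 2), keepdims=True)
assert np.allclose(p_z1z2_given_z0,
                   p_z1_given_z0[:, :, None] * p_z2_given_z0[:, None, :])
```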

4 Logistic Regression, and Its Special Case of Weights-of-Evidence

Conditional expectation of a dichotomous random target variable \(\mathsf Z_0\) given an m-variate random predictor vector \(\varvec{ Z} = (\mathsf Z_1, \ldots , \mathsf Z_m)^{\mathsf {T}}\) is equal to a conditional probability, i.e.,

$$\begin{aligned} \mathrm {E}(\mathsf Z_0 \mid \varvec{ Z}) = P (\mathsf Z_0 = 1 \mid \varvec{ Z}). \end{aligned}$$

Then the ordinary logistic regression model (without interaction terms) neglecting the error term yields

$$\begin{aligned} \mathrm {logit} P(\mathsf Z_0 = 1 \mid \varvec{ Z}) = \beta _0 + \varvec{\beta }^{\mathsf {T}} \varvec{ Z}, \beta _0 \in \mathbb R, \varvec{\beta } \in \mathbb R^m. \end{aligned}$$

It can be rewritten in terms of a probability as

$$\begin{aligned} P \left( \mathsf Z_0 = 1 \mid \varvec{ Z} \right) = \varLambda \left( \beta _0 + \varvec{\beta }^{\mathsf {T}} \varvec{ Z} \right) , \end{aligned}$$

where \(\varLambda \) denotes the logistic function. The logistic regression model with interaction terms reads in terms of a logit transformed probability

$$\begin{aligned} \mathrm {logit} P(\mathsf Z_0 = 1 \mid \varvec{ Z}) = \beta _0 + \sum _\ell \beta _\ell \mathsf Z_\ell + \sum _{\ell _i, \ldots , \ell _j} \beta _{\ell _i, \ldots , \ell _j} \mathsf Z_{\ell _i} \ldots \mathsf Z_{\ell _j}, \end{aligned}$$
(3.2)

and in terms of a probability

$$\begin{aligned} P \left( \mathsf Z_0 = 1 \mid \varvec{ Z} \right) = \varLambda \left( \beta _0 + \sum _\ell \beta _\ell \mathsf Z_\ell + \sum _{\ell _i, \ldots , \ell _j} \beta _{\ell _i, \ldots , \ell _j} \mathsf Z_{\ell _i} \ldots \mathsf Z_{\ell _j} \right) . \end{aligned}$$

If all predictor variables are dichotomous and conditionally independent given the target variable, then the parameters of the ordinary logistic regression model simplify to

$$\begin{aligned} \beta _0 = \mathrm {logit}P(\mathsf Z_0=1) + W^{(0)}, \quad \beta _\ell = C_\ell , \ell =1,\ldots ,m, \end{aligned}$$

with contrasts

$$\begin{aligned} C_\ell = W_{\ell }^{(1)} - W_{\ell }^{(0)}, \ell = 1,\ldots , m, \end{aligned}$$

defined as differences of weights of evidence

$$\begin{aligned} W_{\ell }^{(1)} = \ln {\frac{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 0 )}}, \quad W_{\ell }^{(0)} = \ln {\frac{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 0 )}}, \end{aligned}$$

and with \(W^{(0)} = \sum _{\ell =1}^m W_\ell ^{(0)}\) provided all conditional probabilities are different from 0 (Schaeben 2014b). Obviously the model parameters become independent of one another, and can be estimated by mere counting. This special case of a logistic regression model is usually referred to as the method of “weights-of-evidence”. In turn, the canonical generalization of Bayesian weights-of-evidence is logistic regression.
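
As a minimal sketch of these counting estimates (the cell counts are purely illustrative, the single-predictor case is shown, and the variable names are ours), the weights of evidence, the contrast, and the intercept can be computed as follows:

```python
import numpy as np

# cell counts n[z0, zl] of a single dichotomous predictor Z_l versus the
# target Z_0; the numbers are purely illustrative
n = np.array([[900.0, 60.0],     # Z_0 = 0: (Z_l = 0, Z_l = 1)
              [ 25.0, 15.0]])    # Z_0 = 1: (Z_l = 0, Z_l = 1)

p_zl_given_z0 = n / n.sum(axis=1, keepdims=True)   # P(Z_l = zl | Z_0 = z0)

W1 = np.log(p_zl_given_z0[1, 1] / p_zl_given_z0[0, 1])   # weight W_l^(1)
W0 = np.log(p_zl_given_z0[1, 0] / p_zl_given_z0[0, 0])   # weight W_l^(0)
C  = W1 - W0                                             # contrast C_l = beta_l

# intercept: prior logit plus the sum of the W^(0) (here a single predictor)
p_z0  = n.sum(axis=1) / n.sum()
beta0 = np.log(p_z0[1] / p_z0[0]) + W0
print(W1, W0, C, beta0)
```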

That the contrasts of the weights of evidence agree with the logistic regression parameters \(\beta _\ell \) in case of joint conditional independence becomes obvious when recalling

$$\begin{aligned} C_\ell= & {} W_{\ell }^{(1)} - W_{\ell }^{(0)} \\= & {} \ln {\frac{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 1 \mid \mathsf Z_0 = 0 )}} - \ln {\frac{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 1 )}{P(\mathsf Z_\ell = 0 \mid \mathsf Z_0 = 0 )}} \\= & {} \ln \left( \frac{\mathrm {O}(\mathsf Z_0 = 1 \mid \mathsf Z_\ell = 1)}{\mathrm {O}(\mathsf Z_0 = 1 \mid \mathsf Z_\ell = 0)} \right) = \beta _\ell , \end{aligned}$$

which is the log odds ratio, the usual interpretation of \(\beta _\ell \) (Hosmer and Lemeshow 2000).

If \(\varvec{ Z}\) comprises m dichotomous predictor variables \(\mathsf Z_\ell , \ell =1,\ldots ,m\), there are \(2^m\) possible different realizations \(\varvec{z}_k, k=1,\ldots , 2^m\), of \(\varvec{ Z}\). Then

$$\begin{aligned} \sum _{i=1}^n \widehat{P} \bigl ( \mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z} \left( i \right) \bigr )= & {} \sum _{k=1}^{2^m} \widehat{P}(\mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z}_k) \; H(\varvec{ Z} = \varvec{z}_k) \\= & {} \sum _{k=1}^{2^m} \widehat{P}(\mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z}_k) \; n \, \widehat{P}(\varvec{ Z} = \varvec{z}_k) \\= & {} n \widehat{P}(\mathsf Z_0=1) = \sum _{i=1}^n z_0(i), \end{aligned}$$

where \(H(\varvec{ Z} = \varvec{z}_k)\) denotes the absolute frequency of the realization \(\varvec{z}_k\) in the training data set, and the last equation is an application of the formula of total probability. It is a constitutive equation to estimate the parameters of a logistic regression model and always holds for fitted logistic regression models. With respect to weights-of-evidence, the test statistic of the so-called “new omnibus test” of conditional independence (Agterberg and Cheng 2002) is

$$\begin{aligned} t = \sum _{i=1}^n \left( \widehat{P} \left( \mathsf Z_0=1 \mid \varvec{ Z} = \varvec{z} \left( i \right) \right) - z_0(i) \right) \end{aligned}$$

and should not be too large for conditional independence to be reasonably assumed.
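
A minimal sketch of this statistic (the function name is ours; the predicted posterior probabilities and the observed target values are assumed to be available, e.g., from a weights-of-evidence fit):

```python
import numpy as np

def new_omnibus_statistic(p_hat, z0):
    """t = sum_i ( P_hat(Z_0 = 1 | Z = z(i)) - z_0(i) ) over the training cells."""
    p_hat = np.asarray(p_hat, dtype=float)   # predicted posterior probabilities
    z0 = np.asarray(z0, dtype=float)         # observed values of the target variable
    return float(np.sum(p_hat - z0))
```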

5 Hammersley–Clifford Theorem

Stated informally (cf. Lauritzen 1996 for the proper statement), the Hammersley–Clifford theorem says that a probability distribution with a positive density satisfies one of the Markov properties with respect to an undirected graph G if and only if its density can be factorized over the cliques of the graph. Since the distribution of a categorical random vector can be represented in terms of a log-linear model, the Hammersley–Clifford theorem applies. Given \((m+1)\) random variables \(\mathsf Z_0, \dots , \mathsf Z_m\), there is a total of \(\left( {\begin{array}{c}m+1\\ \ell +1\end{array}}\right) \) different product terms each involving \((\ell +1)\) variables, \(\ell =0,\ldots ,m\), summing to a total of \(\sum _{\ell =0}^{m} \left( {\begin{array}{c}m+1\\ \ell +1\end{array}}\right) = 2^{m+1}-1\) different terms. Thus there is a total of \((m+1)\) single-variable terms, and a total of \(2^{m+1}-(m+2)\) multi-variable terms.

The full log-linear model encompasses all terms and reads

$$\begin{aligned} \log p_{\kappa } = \sum _{\ell =0}^{m} \; \sum _{\alpha \in C_{\ell +1}^{m+1}} \phi _{\kappa (\alpha )}, \end{aligned}$$
(3.3)

where \(\alpha \in C_{\ell +1}^{m+1}\) denotes an \((\ell +1)\)-combination of the set \(\{1, \ldots , m+1 \} \subset {\mathbb N}\), and \(\kappa (\alpha ) = ( k_{i_1}, \ldots , k_{i_{\ell +1}} )\) denotes a multi-index with \((\ell +1)\) entries \(k_{i_\ell } = 1,\ldots ,K_{i_\ell }\), for \(\ell =0,\ldots ,m\). The random vector \({\varvec{ Z}}_{\kappa (\alpha )}\) is the product of any tuple of \((\ell +1)\) components of \(\varvec{ Z}\), the total number of which is \(\left( {\begin{array}{c}m+1\\ \ell +1\end{array}}\right) \).

Assumptions of independence or conditional independence simplify the distribution of \({\varvec{ Z}}\), i.e., its full log-linear model, considerably. Assuming independence of all its components \(\mathsf Z_\ell , \ell =0,\ldots ,m\), the log-linear model simplifies according to the independence factorization of Sect. 3 to

$$\begin{aligned} \log p_{\kappa } = \sum _{\ell =0}^{m} \phi _{k_\ell }, \end{aligned}$$
(3.4)

where \(\phi _{k_\ell } = \log p_{k_\ell }\).

Assuming joint conditional independence of all components \(\mathsf Z_\ell , \ell =1,\ldots ,m\), given \(\mathsf Z_0\), the log-linear model, Eq. (3.3), simplifies according to Eq. (3.1) to

$$\begin{aligned} \log p_{\kappa } = \phi _{k_0} + \sum _{\ell =1}^{m} \phi _{k_\ell } + \sum _{\ell =1}^{m} \phi _{k_0, k_\ell }. \end{aligned}$$
(3.5)

Thus the latter model, Eq. (3.5), assuming conditional independence, differs from the model assuming independence, Eq. (3.4), by the additional product terms \(\mathsf Z_0 \otimes \mathsf Z_\ell , \ell =1,\ldots ,m\).
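
For illustration, with \(m = 2\) predictor variables the model assuming independence, Eq. (3.4), the model assuming conditional independence given \(\mathsf Z_0\), Eq. (3.5), and the full model, Eq. (3.3), read cell by cell

$$\begin{aligned} \log p_{k_0 k_1 k_2}= & {} \phi _{k_0} + \phi _{k_1} + \phi _{k_2}, \\ \log p_{k_0 k_1 k_2}= & {} \phi _{k_0} + \phi _{k_1} + \phi _{k_2} + \phi _{k_0, k_1} + \phi _{k_0, k_2}, \\ \log p_{k_0 k_1 k_2}= & {} \phi _{k_0} + \phi _{k_1} + \phi _{k_2} + \phi _{k_0, k_1} + \phi _{k_0, k_2} + \phi _{k_1, k_2} + \phi _{k_0, k_1, k_2}, \end{aligned}$$

respectively.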

Any violation of joint conditional independence given \(\mathsf Z_0\) results in additional cliques of the graph and hence in additional product terms. Assuming that conditional independence given \(\mathsf Z_0\) does not hold for a particular subset \(\mathsf Z_{\ell _1}, \ldots , \mathsf Z_{\ell _k}\) of the variables \(\mathsf Z_\ell \) enlarges the log-linear model of Eq. (3.5) by additional terms referring to \(\mathsf Z_0 \otimes \bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\) and \(\bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\), respectively.

6 Testing Joint Conditional Independence of Categorical Random Variables

The statistic of the likelihood ratio test (Neyman and Pearson 1933; Casella and Berger 2001) is the ratio of the maximized likelihood of a restricted model and the maximized likelihood of the full model. The assumption of the likelihood ratio test concerns the choice of the model family of distributions.

The null-hypothesis is that a given log-linear model is sufficiently large to represent the joint distribution. If the random variables are categorical, the full log-linear model is always sufficiently large, as was explicitly shown above. More interesting are tests of whether a smaller log-linear model is sufficiently large. Testing the null-hypothesis that a log-linear model encompassing all one-variable terms and only those two-variable terms that involve \(\mathsf Z_0\) is sufficiently large provides a test of conditional independence of all \(\mathsf Z_\ell , \ell =1,\ldots ,m\), given \(\mathsf Z_0\), because this log-linear model is sufficiently large in case of conditional independence given \(\mathsf Z_0\). Thus, a reasonable rejection of the initial null-hypothesis implies a reasonable rejection of the assumption of conditional independence given \(\mathsf Z_0\).
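
The following Python sketch (assuming NumPy and SciPy) implements such a test of the null-hypothesis of joint conditional independence given \(\mathsf Z_0\); it exploits the fact that the fitted cell counts of the log-linear model with all one-variable terms and exactly the two-variable terms involving \(\mathsf Z_0\) have a closed form. The function name and interface are ours, the table of counts is assumed to carry the target variable on axis 0, and the asymptotic \(\chi ^2\) approximation presumes sufficiently large cell counts.

```python
import numpy as np
from scipy.stats import chi2

def g2_conditional_independence(table):
    """Likelihood ratio (G^2) test of joint conditional independence of
    Z_1, ..., Z_m given Z_0 for a contingency table of counts whose axis 0
    indexes the categories of Z_0 and axes 1, ..., m those of Z_1, ..., Z_m.

    Under the log-linear model with all one-variable terms and exactly the
    two-variable terms involving Z_0, the fitted cell counts have the closed
    form  E(z_0, z_1, ..., z_m) = prod_l n(z_0, z_l) / n(z_0)^(m-1),
    provided all margins n(z_0) are positive.
    """
    table = np.asarray(table, dtype=float)
    m = table.ndim - 1                                   # number of predictors
    n_z0 = table.sum(axis=tuple(range(1, m + 1)))        # margins of Z_0

    expected = n_z0.reshape((-1,) + (1,) * m)            # start with n(z_0)
    for l in range(1, m + 1):
        other = tuple(a for a in range(1, m + 1) if a != l)
        n_z0_zl = table.sum(axis=other)                  # two-way margin of (Z_0, Z_l)
        shape = [1] * (m + 1)
        shape[0], shape[l] = table.shape[0], table.shape[l]
        expected = expected * (n_z0_zl / n_z0[:, None]).reshape(shape)

    mask = table > 0                                     # 0 * log(0/E) := 0
    g2 = 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))

    # degrees of freedom: saturated model minus conditional-independence model
    K = table.shape
    df = (np.prod(K) - 1) - ((K[0] - 1)
                             + sum(k - 1 for k in K[1:])
                             + sum((K[0] - 1) * (k - 1) for k in K[1:]))
    return g2, int(df), chi2.sf(g2, df)
```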

7 Conditional Distribution, Logistic Regression

Since the joint distribution implies all marginal and conditional distributions, the conditional distribution

$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }}{P_{\bigotimes _{\ell =1}^m \mathsf Z_\ell }} \end{aligned}$$
(3.6)

is explicitly given here by

$$\begin{aligned} \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(s_{k_0}, \dots , s_{k_m})}{P_{\bigotimes _{\ell =1}^m \mathsf Z_\ell }(s_{k_1}, \dots , s_{k_m})} = \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(s_{k_0}, \dots , s_{k_m})}{\sum _{k_0=1}^{K_0} P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(s_{k_0}, s_{k_1}, \dots , s_{k_m})}. \end{aligned}$$

Assuming independence, Eq. (3.6) immediately reveals

$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = P_{\mathsf Z_0}. \end{aligned}$$

Assuming conditional independence of all \(\mathsf Z_\ell , \ell =1,\ldots ,m\), given \(\mathsf Z_0\) and further that \(\mathsf Z_0\) is dichotomous, then

$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } (1 \mid s_{k_1}, \dots , s_{k_m}) = \frac{P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(1, s_{k_1}, \dots , s_{k_m})}{\sum _{i =0}^1 P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(i, s_{k_1}, \dots , s_{k_m})} \end{aligned}$$
(3.7)

with

$$\begin{aligned} P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(1, s_{k_1}, \dots , s_{k_m}) = \exp \left( \phi _{1} + \sum _{\ell =1}^m \phi _{k_\ell } + \sum _{\ell =1}^m \phi _{1, k_\ell } \right) \end{aligned}$$

and

$$\begin{aligned} \sum _{i=0}^1 P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }(i, s_{k_1}, \dots , s_{k_m}) = \sum _{i =0}^1 \exp \left( \phi _{i} + \sum _{\ell =1}^m \phi _{k_\ell } + \sum _{\ell =1}^m \phi _{i, k_\ell } \right) . \end{aligned}$$

Thus,

$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } (1 \mid s_{k_1}, \dots , s_{k_m}) = \frac{1}{1 + \exp \left( \phi _{0} - \phi _{1} + \sum _{\ell =1}^m ( \phi _{0, k_\ell } - \phi _{1, k_\ell } ) \right) } = \varLambda \left( \phi _{1} - \phi _{0} + \sum _{\ell =1}^m ( \phi _{1, k_\ell } - \phi _{0, k_\ell } ) \right) . \end{aligned}$$

Finally,

$$\begin{aligned} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \varLambda \Big ( \beta _0 + \sum _{\ell =1}^m \beta _\ell \mathsf Z_\ell \Big ), \end{aligned}$$

which is obviously logistic regression

$$\begin{aligned} \mathrm {logit} P_{\mathsf Z_0 \mid \bigotimes _{\ell =1}^m \mathsf Z_\ell } = \beta _0 + \sum _{\ell =1}^m \beta _\ell \mathsf Z_\ell . \end{aligned}$$
(3.8)

It should be noted that additional product terms in the joint probability \(P_{\bigotimes _{\ell =0}^m \mathsf Z_\ell }\) on the right-hand side of Eq. (3.7) of the form \(\bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\) involving only \(\mathsf Z_\ell , \ell =1,\ldots ,m\), i.e., not involving \(\mathsf Z_0\), would not affect the form of the conditional probability, Eq. (3.8). Additional product terms of the form \(\mathsf Z_0 \otimes \bigotimes _{\ell =1}^k \bigotimes _{\ell _i \in C_{\ell }^k} \mathsf Z_{\ell _i}\), i.e., involving \(\mathsf Z_0\), result in a logistic regression model with interaction terms, Eq. (3.2).

Ordinary logistic regression is optimum if the joint probability of the (dichotomous) target variable and the predictor variables is of log-linear form and all predictor variables are jointly conditionally independent given the target variable; in particular, it is optimum if the predictor variables are categorical and jointly conditionally independent given the target variable (Schaeben 2014a). Logistic regression with interaction terms is optimum if the joint probability of the (dichotomous) target variable and the predictor variables is of log-linear form and the interaction terms correspond to the lacking conditional independence given the target variable; for categorical predictor variables, interaction terms can compensate for any lack of conditional independence exactly. Thus logistic regression with interaction terms is optimum in case of lacking conditional independence (Schaeben 2014a).
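
A minimal simulation sketch of the agreement of fitted slopes and contrasts under conditional independence (assuming the statsmodels package; all probabilities and the sample size are purely illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000

# simulate a dichotomous target and two predictors that are conditionally
# independent given the target (all probabilities purely illustrative)
p0 = 0.2                                    # P(Z_0 = 1)
p1 = np.array([0.1, 0.6])                   # P(Z_1 = 1 | Z_0 = z0)
p2 = np.array([0.3, 0.7])                   # P(Z_2 = 1 | Z_0 = z0)

z0 = rng.binomial(1, p0, n)
z1 = rng.binomial(1, p1[z0])
z2 = rng.binomial(1, p2[z0])

# ordinary logistic regression of Z_0 on (Z_1, Z_2), without interaction terms
X = sm.add_constant(np.column_stack([z1, z2]))
fit = sm.Logit(z0, X).fit(disp=0)

# contrasts C_l = W_l^(1) - W_l^(0) from the true conditional probabilities
C1 = np.log(p1[1] / p1[0]) - np.log((1 - p1[1]) / (1 - p1[0]))
C2 = np.log(p2[1] / p2[0]) - np.log((1 - p2[1]) / (1 - p2[0]))

print(fit.params[1:], [C1, C2])   # the fitted slopes approximate the contrasts
```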

8 Practical Applications

The practical application of the log-likelihood ratio test of joint conditional independence generally includes the following steps; a sketch of the decisive final step is given after the list:

  • test the null-hypothesis that the full log-linear model is sufficiently large to represent the joint probability of all predictor variables and the target variable;

  • if the first null-hypothesis is not reasonably rejected, test the null-hypotheses that smaller log-linear models are sufficiently large; in particular,

  • test the null hypothesis that the log-linear model without any interaction term is sufficiently large;

  • if the final null-hypothesis is rejected, then the predictor variables must not be assumed to be jointly conditionally independent given the target variable.
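
A sketch of the decisive final step, reusing the g2_conditional_independence sketch from Sect. 6; the contingency table of counts (target variable on axis 0) is purely illustrative:

```python
import numpy as np

# purely illustrative 2 x 2 x 2 table of counts, target variable Z_0 on axis 0
table = np.array([[[30., 10.],
                   [12.,  8.]],
                  [[ 5.,  6.],
                   [ 7., 22.]]])

# g2_conditional_independence is the sketch from Sect. 6
g2, df, p_value = g2_conditional_independence(table)
if p_value < 0.05:
    print("reject joint conditional independence of the predictors given Z_0")
else:
    print("no evidence against joint conditional independence given Z_0")
```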

8.1 Practical Application with Fabricated Indicator Data

8.1.1 The Data Set BRY

The data set bry is derived from the example given at https://en.wikipedia.org/wiki/Conditional_independence. Initially it comprises three random events B, R, Y, denoting the subsets of the set of all 49 pixels which are blue, red, or yellow, with given probabilities \(P(B) = \tfrac{18}{49} = 0.367, P(R) = \tfrac{16}{49} = 0.326, P(Y) = \tfrac{12}{49} = 0.244\). The random events B, R, Y are distinguished from their corresponding random indicator variables \(\mathsf B, \mathsf R, \mathsf Y\) defined as usual, e.g.,

$$\begin{aligned} \mathsf B = \mathbf 1_{B}, \end{aligned}$$

where \(\mathbf 1\) denotes the indicator variable of a random event. They are assigned to pixels of a \(7 \times 7\) digital map image, Fig. 3.1.

Fig. 3.1 Map images of random events B, R, Y.

It should be noted that in this example any spatial references solely serve the purpose of visualization as map images, and that the test itself does not take any spatial references or spatially induced dependences into account.

Checking independence according to its definition in reference to random events, the figures

$$\begin{aligned} P(B \cap R) = 0.122, \quad P(B) \; P(R) = 0.119 \end{aligned}$$

indicate that the random events B and R are not independent. However, the deviation is small.

Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events B and R given Y does not imply conditional independence of the random events B and R given the complement \(\complement Y\), two checks are required. The results are

$$\begin{aligned} P(B \cap R \mid Y)= & {} \frac{1}{6} = P(B \mid Y) \; P(R \mid Y) \\ P(B \cap R \mid \complement Y)= & {} \frac{4}{37} \not = {\Bigl (\frac{12}{37}\Bigr )}^2 = P(B \mid \complement Y) \; P(R \mid \complement Y), \end{aligned}$$

and indicate that the random events B and R are conditionally independent given the random event Y, but that they are not conditionally independent given the complement \(\complement Y\). It should be noted that the deviation of the joint conditional probability from the product of the two individual conditional probabilities, expressed as their ratio, is 1.027. In fact, the events B and R are conditionally independent given either Y or \(\complement Y\) if one white pixel, e.g., pixel (1,7) with \(\mathsf B = \mathsf R = \mathsf Y = 0\), is omitted.

Generalizing the view to random variables \(\mathsf B, \mathsf R, \mathsf Y\) and their unique joint realization as shown in Fig. 3.1, Pearson’s \(\chi ^2\) test with Yates’ continuity correction of the null-hypothesis of independence of the random variables \(\mathsf B\) and \(\mathsf R\) given the data returns a p-value of 1 indicating that the null-hypothesis cannot reasonably be rejected.
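
This result can be reproduced with a standard implementation, e.g., SciPy (a minimal sketch); the \(2 \times 2\) cell counts used below are those implied by the stated probabilities, i.e., \(|B \cap R| = 6\), \(|B| = 18\), \(|R| = 16\) out of 49 pixels:

```python
from scipy.stats import chi2_contingency

# counts of the 2 x 2 table of B versus R implied by the stated probabilities
observed = [[ 6, 12],     # B:     R, not R
            [10, 21]]     # not B: R, not R
stat, p_value, dof, expected = chi2_contingency(observed, correction=True)
print(stat, p_value)      # the corrected statistic is 0, hence a p-value of 1
```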

The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence and results in a p-value of 0.996 indicating that the null-hypothesis cannot reasonably be rejected.

Thus, given the data, the tests suggest that the random variables \(\mathsf B\) and \(\mathsf R\) are independent and conditionally independent given the random variable \(\mathsf Y\).

8.1.2 The Data Set SCCI

The next data set scci comprises three random events \(B_1,B_2,T\) with given probabilities \(P(B_1) = P(B_2) = P(T) = \tfrac{7}{49} = 0.142\). They are assigned to pixels of a \(7 \times 7\) digital map image, Fig. 3.2.

Fig. 3.2 Map images of random events \(B_1, B_2, T\) with \(P(B_1) = P(B_2) = P(T) = \tfrac{7}{49} = 0.142\).

Checking independence according to its definition for random events, the figures

$$\begin{aligned} P(B_1 \cap B_2) = 0.102, \quad P(B_1) \; P(B_2) = 0.020 \end{aligned}$$

indicate that the random events \(B_1\) and \(B_2\) are not independent.

Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events \(B_1\) and \(B_2\) given T does not imply conditional independence of the random events \(B_1\) and \(B_2\) given \(\complement T\), two checks are required. The results are

$$\begin{aligned} P(B_1 \cap B_2 \mid T) = 0.714 \not =&0.734 = P(B_1 \mid T) \; P(B_2 \mid T) \\ P(B_1 \cap B_2 \mid \complement T) = 0 \not =&0.0005 = P(B_1 \mid \complement T) \; P(B_2 \mid \complement T), \end{aligned}$$

and indicate that the random events \(B_1\) and \(B_2\) are neither conditionally independent given the random event T nor given the complement \(\complement T\).

Testing the null-hypothesis of independence of the random variables \(\mathsf B_1\) and \(\mathsf B_2\) with Pearson's \(\chi ^2\) test with Yates' continuity correction given the data returns a p-value practically equal to 0, indicating that the null-hypothesis should be rejected.

The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence and results in a p-value of 0.825 indicating that the null-hypothesis cannot reasonably be rejected.

Thus, given the data the tests imply that the random variables \(\mathsf B_1\) and \(\mathsf B_2\) are not independent but conditionally independent given the random variable \(\mathsf T\).

9 Discussion and Conclusions

Since pairwise conditional independence does not imply joint conditional independence, the \(\chi ^2\)-test (Bonham-Carter 1994) of independence given \(\mathsf Z_0=1\) does not apply to checking the modeling assumption of weights-of-evidence. The disadvantage of both the “omnibus” test (Bonham-Carter 1994) and the “new omnibus” test (Agterberg and Cheng 2002) is twofold. First, they involve an assumption of normality which itself should be subject to a test. Second, weights-of-evidence has to be applied to calculate the test statistic, which is the sum of all predicted conditional probabilities within the training data set. If the test actually suggests rejection of the null-hypothesis of conditional independence, the user learns that the application of weights-of-evidence to predict the conditional probabilities was not mathematically justified. The standard likelihood ratio test suggested here resolves both shortcomings.