Swamping and masking in Markov boundary discovery
Abstract
This paper considers the problems of swamping and masking in Markov boundary discovery for a target variable. There are two potential reasons for swamping and masking: one is incorrectness of some conditional independence (CI) tests, and the other is violation of local composition. First, we explain why the incorrectness of CI tests may lead to swamping and masking, analyze how to reduce the incorrectness of CI tests, and build an algorithm called LRH under local composition. For convenience, we integrate the two existing algorithms, IAMB and KIAMB, and our LRH into an algorithmic framework called LCMB. Second, since LCMB may prematurely stop searching if local composition is violated, a theoretical improvement on LCMB is made as follows: we analyze how to resume the stopped search of LCMB, construct a corresponding algorithmic framework called WLCMB, and show that its correctness only needs a more relaxed condition than that of LCMB. Finally, we apply LCMB and WLCMB to a number of Bayesian networks. The experimental results reveal that LRH is much more efficient than the existing two LCMB algorithms and that WLCMB can further improve LCMB.
Keywords
Bayesian network · Markov blanket · Markov boundary · Masking · Swamping

1 Introduction
Markov blankets (Mb) and Markov boundaries (MB) are two basic concepts in Bayesian networks (BNs). For a target variable T, its Mb is a variable set conditioned on which all other variables are probabilistically independent of T, and its MB is a minimal Mb; that is, an MB is the smallest set containing all variables carrying the information about T that cannot be obtained from other variables (Pearl 1988).
The discovery of MBs plays a central role in feature selection (Pellet and Elisseeff 2008; Aliferis et al. 2010a, b; Fu and Desmarais 2010). Feature selection aims to identify the minimal subset of features required for probabilistic classification, with the following threefold objective (Guyon and Elisseeff 2003): improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and facilitating a better understanding of the underlying process that generated the data. Pearl (1988) showed that the conditional probability of the target variable given all other variables can be replaced by the one with an MB as the conditioning set. Pellet and Elisseeff (2008) proved that an MB is the theoretically optimal set of features if the faithfulness condition is satisfied. Further, under certain assumptions about the learner and the loss function, an MB is the solution to the feature selection problem (Tsamardinos and Aliferis 2003; Masegosa and Moral 2012; Statnikov et al. 2013). Hence, MB discovery techniques have been receiving more and more attention in recent years.
In the literature, there are many MB discovery approaches, including independence-based and score-based ones, as well as some hybrid methods. This paper focuses on the former.
The Koller–Sahami (KS) algorithm, put forward by Koller and Sahami (1996), was the first to create a framework defining the theoretically optimal filter method for a feature selection problem. It provides no theoretical guarantee of soundness (Tsamardinos et al. 2003a). The grow-shrink (GS) algorithm, which was proposed by Margaritis and Thrun (1999, 2000), consists of a growing phase and a shrinking phase. In its growing phase, as long as there exists a variable conditionally dependent on the target given the candidate Markov blanket (CMb), this variable is added to the CMb, until no more such variables exist. All members of an MB, as well as some false positives, have entered the CMb by the end of the growing phase. The shrinking phase detects those false positives and removes them. The GS algorithm was theoretically proven by Margaritis and Thrun (1999) to be correct under the assumption that all the conditional independence (CI) tests are correct. Here, a CI test for a hypothesis is said to be correct if the corresponding statistical decision is correctly made by using a testing method.
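The two-phase scheme just described can be sketched as follows. This is an illustrative reading of GS, not the authors' implementation; `dep(T, X, cond)` stands for an abstract CI-test oracle returning True when X is conditionally dependent on T given the set `cond`.

```python
def grow_shrink(T, variables, dep):
    """Sketch of the GS scheme: grow a CMb, then shrink away false positives."""
    cmb = set()                       # candidate Markov blanket (CMb)
    # Growing phase: keep adding any variable still dependent on T given cmb.
    changed = True
    while changed:
        changed = False
        for X in variables:
            if X != T and X not in cmb and dep(T, X, cmb):
                cmb.add(X)
                changed = True
    # Shrinking phase: remove members that are independent of T
    # given the rest of the CMb.
    for X in list(cmb):
        if not dep(T, X, cmb - {X}):
            cmb.remove(X)
    return cmb
```

With a toy oracle in which the true MB of T is {A, B} and C is dependent on T only until A enters the conditioning set, C may enter the CMb in the growing phase but is removed in the shrinking phase.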
Tsamardinos et al. (2003a) pointed out that GS uses a static and potentially inefficient heuristic in the growing phase, and then they presented a variant of GS called the incremental association Markov boundary (IAMB) algorithm by employing a dynamic heuristic: IAMB reorders the remaining variables by means of an association function at each iteration such that the spouses of the target can enter the CMb early and thus fewer false positives are added to the CMb during the growing phase. HITON (Aliferis et al. 2003) also uses a similar static but slightly more efficient heuristic compared to GS.
Similar dynamic heuristics are employed by some variants of IAMB (Tsamardinos and Aliferis 2003; Yaramakala and Margaritis 2005; Zhang et al. 2010). This strategy is also used by divide-and-conquer search techniques, such as the max–min Markov boundary algorithm (Tsamardinos et al. 2003b), the parents-and-children-based Markov boundary (PCMB) algorithm (Peña et al. 2007), the breadth-first search of Markov boundary algorithm (Fu and Desmarais 2007), and the algorithms included in the algorithmic framework called GLL (Aliferis et al. 2010a).
Under the faithfulness condition, most of these algorithms efficiently retrieve an approximate MB. Peña et al. (2007) relaxed the faithfulness condition to the composition assumption. Based on this relaxation, they put forward a stochastic version of IAMB called KIAMB by introducing a randomization parameter \(K\in [0,1]\). Here, K specifies the trade-off between greediness and randomness in the search: KIAMB with \(K=1\) coincides with IAMB, which is completely greedy, while KIAMB with \(K=0\) is a completely random approach expected to discover all the MBs of the target variable with non-zero probability if run repeatedly enough times. Further, Statnikov et al. (2013) relaxed the condition for IAMB (also suitable for KIAMB) to be correct to local composition. Another stochastic search technique is the Bayesian stochastic search of Markov boundaries algorithm (Masegosa and Moral 2012), which tries to obtain all MBs by running a large number of times; it provides some alternative results by scoring the different obtained solutions.
(P1) Incorrect CI tests may lead to swamping and masking. Every MB discovery algorithm assumes that all CI tests are correct; meeting this assumption requires an algorithm to be data efficient. The parents-and-children-based algorithms, such as PCMB and the algorithms in the GLL framework, are data efficient but not time efficient; in contrast, IAMB and KIAMB are time efficient but not data efficient (Schlüter 2014). Once one or more false positives with spuriously high dependence on the target enter the CMb, the cascading errors (Bromberg and Margaritis 2009) they cause may lead to the exclusion of some true positives. Example 1 provides an illustration.
(P2) Violation of the faithfulness condition (or the local composition assumption) may also lead to swamping and masking. The faithfulness condition is usually required by the parents-and-children-based algorithms (Peña et al. 2007; Aliferis et al. 2010a), while the relaxed assumption, local composition, is needed by IAMB and KIAMB. However, both the faithfulness condition and the local composition assumption may be violated in practice. Example 2 illustrates this possibility.
Example 1
Yaramakala (2004) considered the following scenario: in a BN over \(\{T,X,Y_1,Y_2,Z\}\) with the graph given in (a) of Fig. 1 as its directed acyclic graph (DAG), the node Z is a non-member of the MB for the target T, but it may have the highest association with T because there exist multiple paths for the flow of information between T and Z: \(T\rightarrow Y_1\rightarrow Z\) and \(T\rightarrow Y_2\rightarrow Z\). In this case, Z becomes the first node entering the CMb of IAMB. Peña et al. (2007) instantiated the same scenario as a problem of signal transmission and reception. Yaramakala (2004) and Peña et al. (2007) observed that some false positives may enter the CMb in the growing phase and thus increase the time cost. This is natural. However, a more important but neglected problem is that these false positives may bring cascading errors (Bromberg and Margaritis 2009), which may further cause incorrectness of CI tests and thus the exclusion of some true positives. For example, \(Y_1\) or \(Y_2\) may eventually become a false negative. Hence, it is meaningful to consider problem (P1), and what we can do is to prevent too many false positives with spuriously high dependence on the target from entering the CMb in the growing phase.
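The effect described in Example 1 can be checked exactly on a small instance. The CPTs below are our own illustrative choice, not Yaramakala's: T is a uniform bit, each \(Y_i\) is a noisy copy of T (flip probability 0.2), and \(Z=Y_1+Y_2\) is a deterministic collider. Exact enumeration then shows \(\mathbb {I}(T;Z)>\max \{\mathbb {I}(T;Y_1),\mathbb {I}(T;Y_2)\}\) even though \(\mathbb {I}(T;Z\mid Y_1,Y_2)=0\), i.e., Z is outside the MB \(\{Y_1,Y_2\}\) of T in this sub-network.

```python
import itertools
import math

def joint():
    """Exact joint over (T, Y1, Y2, Z) with Yi a noisy copy of T, Z = Y1 + Y2."""
    p = {}
    for t, y1, y2 in itertools.product([0, 1], repeat=3):
        pr = 0.5                              # T uniform
        pr *= 0.8 if y1 == t else 0.2         # Y1 | T, flip prob. 0.2
        pr *= 0.8 if y2 == t else 0.2         # Y2 | T, flip prob. 0.2
        p[(t, y1, y2, y1 + y2)] = pr          # Z = Y1 + Y2 deterministically
    return p

def mi(p, a, b, cond=()):
    """I(A;B|C) in bits; a, b, cond are tuples of indices into the joint's keys."""
    def marg(idx):
        out = {}
        for k, v in p.items():
            key = tuple(k[i] for i in idx)
            out[key] = out.get(key, 0.0) + v
        return out
    pabc, pac, pbc, pc = (marg(a + b + cond), marg(a + cond),
                          marg(b + cond), marg(cond))
    total = 0.0
    for k, v in p.items():
        ka = tuple(k[i] for i in a)
        kb = tuple(k[i] for i in b)
        kc = tuple(k[i] for i in cond)
        total += v * math.log2(pabc[ka + kb + kc] * pc[kc]
                               / (pac[ka + kc] * pbc[kb + kc]))
    return total
```

With these CPTs, `mi(p, (0,), (3,))` (i.e., \(\mathbb {I}(T;Z)\)) is about 0.46 bits, while \(\mathbb {I}(T;Y_i)\approx 0.28\) bits, so a greedy growing phase would admit Z first.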
Example 2
Consider a target variable T with three potential features X, Y, and Z. As is known, the total information about T carried by X and Y can be decomposed into: (a) the unique information carried by X, (b) the unique information carried by Y, (c) the redundant information shared by X and Y, and (d) the synergistic information carried jointly by X and Y (Williams and Beer 2010; Rauh et al. 2014). Assume Z carries all of (a)–(c) and some (but not all) of (d). It follows that: (1) Z has the highest association with T; (2) T is conditionally independent of X given Z; (3) T is conditionally independent of Y given Z; (4) T is conditionally dependent on \(\{X,Y\}\) given Z; (5) T is conditionally independent of Z given X and Y. Then, \(\{X,Y\}\) is the unique MB of T in \(\{T,X,Y,Z\}\). However, IAMB cannot find this MB correctly. Specifically, in the growing phase of IAMB, Z enters the CMb and then excludes X and Y; in the shrinking phase, Z remains in the CMb. Similarly, it follows that KIAMB fails to find \(\{X,Y\}\) with probability no less than \(66.67\,\%\) for any value of \(K\in [0,1]\). We no longer consider the other above-mentioned algorithms because of the violation of the faithfulness condition. Therefore, it is meaningful to consider problem (P2).
These two examples indicate that both the incorrectness of some CI tests and the violation of local composition may lead to swamping and masking. This motivates us to build novel algorithms which are expected to (1) reduce the incorrectness of CI tests, and (2) overcome swamping and masking to a large extent in the case of violating the local composition assumption.
The remainder of this paper is organized as follows. Section 2 provides necessary preliminaries. Section 3 presents the IAMB and KIAMB algorithms, relaxes the notions of Mb and MB, and proves some new results for IAMB and KIAMB. Section 4 addresses problem (P1), puts forward a method of including as few false positives as possible in the growing phase, and builds an algorithm called LRH, which is proven to be correct under the relaxed local composition assumption. The ALARM network is employed to show the data efficiency and time efficiency of LRH. In addition, this section gives a post-processing technique to reduce the incorrectness of the CI tests involved in the shrinking phase. For convenience, IAMB, KIAMB, and LRH are integrated into an algorithmic framework called LCMB. To resume the search stopped in the growing phase of LCMB, Sect. 5 considers (P2) and constructs an efficient algorithmic framework called WLCMB. The application to ALARM indicates that WLCMB can further improve LCMB in data efficiency. Section 6 applies LCMB and WLCMB to several large networks. Section 7 concludes this paper.
2 Preliminary
In this paper, we denote a variable and its value by uppercase and lowercase letters in italics (e.g., X, x), and a set of variables and its value by uppercase and lowercase bold letters in italics (e.g., \(\varvec{X}\), \(\varvec{x}\)). The set difference between \(\varvec{X}\) and \({\varvec{Y}}\) is denoted by \(\varvec{X}{\setminus }{\varvec{Y}}\). For brevity, we write \((\varvec{X}{\setminus }{\varvec{Y}}){\setminus }{\varvec{Z}}\) as \(\varvec{X}{\setminus }{\varvec{Y}}{\setminus }{\varvec{Z}}\). In addition, we use \(|\varvec{X}|\) to denote the number of variables involved in \(\varvec{X}\).
Suppose we have a joint probability distribution \(\mathbb {P}\) over \(\varvec{V}\triangleq \{X_1,\ldots ,X_p\}\) and a DAG \(\mathbb {G}\) with the variables in \(\varvec{V}\) as its nodes. We say \((\mathbb {G},\mathbb {P})\) satisfies the Markov condition if every \(X\in \varvec{V}\) is conditionally independent of its nondescendants given its parents. Further, \((\mathbb {G},\mathbb {P})\) is called a BN if it satisfies the Markov condition. Furthermore, \((\mathbb {G},\mathbb {P})\) satisfies the faithfulness condition if, based on the Markov condition, \(\mathbb {G}\) entails all and only CIs in \(\mathbb {P}\) (Pearl 1988; Neapolitan 2004).
Conditional mutual information (CMI) is one of the basic tools for testing CIs. Denote the CMI between \(\varvec{X}\) and \({\varvec{Y}}\) conditioned on \({\varvec{Z}}\) by \(\mathbb {I}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\). Then \(\mathbb {I}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\geqslant 0\), with equality holding if and only if \(\varvec{X}\perp \!\!\!\perp {\varvec{Y}}\mid {\varvec{Z}}\) (Zhang and Guo 2006). For a practical problem, we cannot access the true CMI; instead, we use its empirical estimate, denoted by \(\mathbb {I}_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\), based on the data \(\varvec{D}\) (Cheng et al. 2002). Note that \(\mathbb {I}_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\geqslant 0\) also holds for any \(\varvec{X},{\varvec{Y}},{\varvec{Z}}\subseteq \varvec{V}\). Denote the \(G^2\) statistic by \(G^2(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}}) \triangleq 2n\cdot \mathbb {I}_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\), which approximately follows the chi-square distribution with \(r\triangleq (r_{\varvec{X}}-1)(r_{{\varvec{Y}}}-1)r_{{\varvec{Z}}}\) degrees of freedom, namely \(G^2(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\mathop {\sim }\limits ^{\centerdot \centerdot }\chi ^2(r)\), where \(r_{\varvec{\xi }}\) represents the number of configurations for \(\varvec{\xi }\) (de Campos 2006). Denote the p value by \(p_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})=\mathbb {P}\{\chi ^2(r)\geqslant G^2(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\}\). Then, the \(G^2\) test asserts \(\varvec{X}\perp \!\!\!\perp {\varvec{Y}}\mid {\varvec{Z}}\) if \(p_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})>\alpha \) for a significance level \(\alpha \), and concludes \(\varvec{X}\not \perp \!\!\!\perp {\varvec{Y}}\mid {\varvec{Z}}\) if \(p_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\leqslant \alpha \). In this paper, \(\alpha \) is set to 0.05. Accordingly, the negative p value is used as the association function, \(f_{\varvec{D}}\), as Tsamardinos et al. (2006), Aliferis et al. (2010a, b), and Statnikov et al. (2013) did: \(f_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\triangleq -p_{\varvec{D}}(\varvec{X};{\varvec{Y}}\mid {\varvec{Z}})\).
The chain rule for CMI (Cover and Thomas 2006) is useful for proving the main results of this paper: \(\mathbb {I}(\varvec{X};{\varvec{Y}}_1\cup {\varvec{Y}}_2\mid {\varvec{Z}})= \mathbb {I}(\varvec{X};{\varvec{Y}}_1\mid {\varvec{Z}})+ \mathbb {I}(\varvec{X};{\varvec{Y}}_2\mid {\varvec{Z}}\cup {\varvec{Y}}_1)\) holds for any four sets of variables \(\varvec{X}\), \({\varvec{Y}}_1\), \({\varvec{Y}}_2\), and \({\varvec{Z}}\) from \(\varvec{V}\). This formula remains valid if we replace \(\mathbb {I}(\cdot )\) with \(\mathbb {I}_{\varvec{D}}(\cdot )\).
In what follows, the concepts of Mb and MB are presented (Pearl 1988; Neapolitan 2004).
Definition 1
For \(T\in \varvec{V}\), we call \({\varvec{M}}\subseteq \varvec{V}{\setminus }\{T\}\) a Markov blanket (Mb) of T if \(T\perp \!\!\!\perp \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}\mid {\varvec{M}}\). Further, a Markov boundary (MB) of T is any Mb such that none of its proper subsets is an Mb of T.
According to Definition 1, an Mb, say \({\varvec{M}}\), of T is a set of variables that can shield T from all other variables, while an MB is a minimal Mb. Moreover, by means of the chain rule for CMI, it can easily be shown that \(\mathbb {I}(T;{\varvec{M}})= \max \nolimits _{\varvec{N}\subseteq \varvec{V}{\setminus }\{T\}}\mathbb {I}(T;\varvec{N})= \mathbb {I}(T;\varvec{V}{\setminus }\{T\})\), so \({\varvec{M}}\) carries all the information about T carried by all the variables. Furthermore, the following results are well known in the literature (Pearl 1988; Neapolitan 2004; Statnikov et al. 2013): (a) if \((\mathbb {G},\mathbb {P})\) is a BN, then for \(T\in \varvec{V}\) the set of all its parents, children, and spouses is an Mb of T (we denote it by \({\varvec{M}}_T\)); (b) if \(\mathbb {P}\) satisfies the intersection property, then T has a unique MB; (c) if \((\mathbb {G},\mathbb {P})\) satisfies the faithfulness condition, then \({\varvec{M}}_T\) is the unique MB of T.
Consider again the BN with the graph presented in Fig. 2 as its DAG. In this BN, it is seen that \({\varvec{M}}_{X_4}\triangleq \{X_2,X_6,X_3\}\) is an Mb of \(X_4\); further, \({\varvec{M}}_{X_4}\) is the unique MB of \(X_4\) if the faithfulness condition is satisfied. Similarly, \({\varvec{M}}_{X_2}\triangleq \{X_4,X_5\}\) is the unique MB of \(X_2\) under the faithfulness condition.
Based on the notion of MB, we give the definition for swamping and masking:
Definition 2
(Swamping and masking) For \(T\in \varvec{V}\), let \({\varvec{M}}\subseteq \varvec{V}{\setminus }\{T\}\) be a true MB of T, \({\varvec{M}}_{\mathbb {A}}\triangleq ({\varvec{M}}{\setminus }\varvec{X})\cup {\varvec{Y}}\) be the output of an MB discovery algorithm, \(\mathbb {A}\), with \(\varvec{X}\subseteq {\varvec{M}}\) and \({\varvec{Y}}\subseteq \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}\). Assume \({\varvec{M}}_{\mathbb {A}}\) is not an MB of T. Then, we say (1) swamping occurs with respect to \({\varvec{M}}\), if \(\varvec{X}\ne {\varnothing }\); and (2) masking occurs with respect to \({\varvec{M}}\), if \({\varvec{Y}}\ne {\varnothing }\).
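Definition 2 translates directly into a small diagnostic: given a true MB \({\varvec{M}}\) and an algorithm's output \({\varvec{M}}_{\mathbb {A}}\) (assumed, as in the definition, not to be an MB itself), it reports the swamped set \(\varvec{X}={\varvec{M}}{\setminus }{\varvec{M}}_{\mathbb {A}}\) and the masking set \({\varvec{Y}}={\varvec{M}}_{\mathbb {A}}{\setminus }{\varvec{M}}\). The function name and return layout are illustrative.

```python
def swamping_masking(true_mb, output):
    """Report swamping/masking of an output against a true MB (Definition 2)."""
    swamped = set(true_mb) - set(output)   # X: true positives lost
    masking = set(output) - set(true_mb)   # Y: false positives kept
    return {"swamping": bool(swamped), "swamped_set": swamped,
            "masking": bool(masking), "masking_set": masking}
```

Applied to the IAMB run of Sect. 4.1 (true MB \(\{X_{23},X_{27},X_{29}\}\), output \(\{X_4,X_{18},X_{23}\}\)), both swamping and masking occur.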
The MB of a target may not be unique; this is why we use “a” or “an” in Definition 2, which is applicable whether the MB is unique or not. Lemeire (2007) provided a case of violating the uniqueness of MB called information equivalence: \(\varvec{X}\) and \({\varvec{Y}}\) are called information equivalent with respect to T given \({\varvec{Z}}\subseteq \varvec{V}{\setminus }\varvec{X}{\setminus }{\varvec{Y}}{\setminus }\{T\}\) if the following four conditions hold: \(T\not \perp \!\!\!\perp \varvec{X}\mid {\varvec{Z}}\), \(T\not \perp \!\!\!\perp {\varvec{Y}}\mid {\varvec{Z}}\), \(T\perp \!\!\!\perp \varvec{X}\mid {\varvec{Y}}\cup {\varvec{Z}}\), and \(T\perp \!\!\!\perp {\varvec{Y}}\mid \varvec{X}\cup {\varvec{Z}}\).
3 Two typical algorithms and a further discussion
IAMB is an enhanced variant of GS. Tsamardinos et al. (2003a) showed the correctness of IAMB under the faithfulness condition; Peña et al. (2007) relaxed the condition to the composition assumption; Statnikov et al. (2013) further relaxed the condition to the local composition assumption. The pseudo code for IAMB is described in Algorithm 1. In the algorithm, the function \(f_{\varvec{D}}\) denotes a heuristic measurement of the association between variables based on the data \(\varvec{D}\) (Tsamardinos et al. 2003a; Peña et al. 2007). Two widely used selections for \(f_{\varvec{D}}\) are CMI (Cheng et al. 2002; Tsamardinos et al. 2003a) and the negative p value (Tsamardinos et al. 2006; Aliferis et al. 2010a, b; Statnikov et al. 2013). This paper employs the latter. Yaramakala (2004) also suggested an equivalent version of the negative p value.
KIAMB is a stochastic extension of IAMB. It embeds a randomization parameter \(K\in [0,1]\) used to trade off greediness against randomness. Taking \(K=1\) reduces KIAMB to IAMB. Peña et al. (2007) proved the correctness of KIAMB under the composition assumption; by the same proof, the local composition assumption is sufficient for this algorithm to be correct. Its pseudo code is also described in Algorithm 1. In the growing phase of KIAMB, \(K^{\,\!*}=\max \{1,\lfloor |{\varvec{M}}_1|\cdot K\rfloor \}\).
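Under these conventions, the shared growing phase of IAMB and KIAMB can be sketched as follows. This is our reading of Algorithm 1, not the authors' code: `assoc` stands for the association \(f_{\varvec{D}}\) (the negative p value), so dependence means \(f_{\varvec{D}}\geqslant -\alpha\), and \(K=1\) recovers IAMB's fully greedy choice.

```python
import random

def grow_kiamb(T, variables, assoc, alpha, K=1.0, rng=random):
    """Growing phase of KIAMB (K=1 gives IAMB) with an association oracle."""
    mb = set()
    while True:
        # M1: remaining variables still dependent on T given the CMb.
        m1 = [X for X in variables
              if X != T and X not in mb and assoc(T, X, mb) >= -alpha]
        if not m1:
            return mb                          # no dependent variable remains
        k_star = max(1, int(len(m1) * K))      # K* = max{1, floor(|M1| * K)}
        m2 = rng.sample(m1, k_star)            # random K*-subset; all of M1 if K=1
        # Greedily add the most associated candidate from the sampled subset.
        mb.add(max(m2, key=lambda X: assoc(T, X, mb)))
```

A shrinking phase as in GS would then remove false positives; with \(K<1\), repeated runs may return different (weak) Markov boundaries.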
It is noted here that Algorithm 1 predefines a whitelist \(\varvec{W}\) and a blacklist \(\varvec{B}\), which can be determined by virtue of expert knowledge or empirical information. In the original IAMB and KIAMB, both \(\varvec{W}\) and \(\varvec{B}\) are taken as the empty set by default.
Recall that a CI test for a hypothesis is said to be correct if the corresponding statistical decision is correctly made by using a testing method. Based on this terminology, the correctness of IAMB and KIAMB is presented as follows (Tsamardinos et al. 2003a; Peña et al. 2007; Statnikov et al. 2013).
Theorem 1
(Correctness of IAMB and KIAMB) Assume T satisfies the local composition assumption, and all CI tests are correct. Then \(\mathrm {(i)}\) IAMB outputs an MB of T; \(\mathrm {(ii)}\) KIAMB outputs an MB of T for any \(K\in [0,1]\).
By this theorem and the two examples presented in Sect. 1, IAMB and KIAMB may fail to output an MB when some CI tests are incorrect or local composition is violated. In what follows, we give a naive definition for the outputs of these algorithms and then make a further discussion. Note that, in view of the contraction property and the decomposition property, an MB can be equivalently defined as any Mb \({\varvec{M}}\) such that \(T\not \perp \!\!\!\perp \varvec{N}\mid {\varvec{M}}{\setminus }\varvec{N}\) holds for any nonempty \(\varvec{N}\subseteq {\varvec{M}}\).
Definition 3
For \(T\in \varvec{V}\), we call \({\varvec{M}}\subseteq \varvec{V}{\setminus }\{T\}\) a weak Markov blanket (WMb) of T if \(T\perp \!\!\!\perp X\mid {\varvec{M}}\) for any \(X\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}\). Further, a weak Markov boundary (WMB) of T is any WMb \({\varvec{M}}\) such that \(T\not \perp \!\!\!\perp \varvec{N}\mid {\varvec{M}}{\setminus }\varvec{N}\) holds for any nonempty \(\varvec{N}\subseteq {\varvec{M}}\).
This definition is introduced to characterize the true output of an existing MB discovery algorithm (such as IAMB or KIAMB) in the case that local composition is violated. Such a definition was not needed in the early literature because the faithfulness condition or the composition property (and thus local composition) was usually assumed as a precondition of an MB algorithm; it becomes necessary, however, once we try to explore what factors influence the efficiency of the existing MB discovery algorithms. “Appendix 1” gives a further explanation of why we define the notion of WMB in this way. Clearly, a WMb is an Mb under local composition, while a WMB is an MB under the same assumption. The following theorem describes the relation between Definition 3 and Algorithm 1.
Theorem 2
Assume all CI tests are correct. Then IAMB or KIAMB for any \(K\in [0,1]\) outputs a WMB of T.
Proof
Based on the notion of WMb, we relax the local composition assumption as follows:
Definition 4
(Markov local composition) We say \(T\in \varvec{V}\) satisfies the Markov local composition property, if T satisfies the local composition property with respect to any WMb of T or, equivalently, if every WMb of T in \(\varvec{V}\) is an Mb.
As seen, IAMB and KIAMB remain correct under the Markov local composition assumption. This is why we call them both LCMB algorithms.
4 LRH algorithm: lessen swamping, resist masking, and highlight the true positives
This section addresses problem (P1) posed in Sect. 1. First, we exemplify situations in which some CI tests are incorrect, even when the data size is large. Then, we analyze how to add as few false positives as possible to the CMb and thus reduce the incorrectness of CI tests so that swamping and masking are alleviated. Finally, we present the resulting algorithm, called LRH, which can lessen swamping, resist masking, and highlight the true positives.
4.1 An exemplification
Details of IAMB for discovering the MB, \(\{X_{23},X_{27},X_{29}\}\), of \(T\triangleq X_{2}\) in the ALARM network with \(\alpha =0.05\), based on a data set of size 5000:

Growing phase:

Iteration 1: \(\varvec{M}={\varnothing }\); \(f_{\varvec{D}}(T;X_{1}\mid \varvec{M})= \max _{X\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}} f_{\varvec{D}}(T;X\mid {\varvec{M}})\geqslant -\alpha \). Conclusion: \({\varvec{M}}\leftarrow {\varvec{M}}\cup \{X_{1}\}\).

Iteration 2: \({\varvec{M}}=\{X_{1}\}\); \(f_{\varvec{D}}(T;X_{4}\mid {\varvec{M}})= \max _{X\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}} f_{\varvec{D}}(T;X\mid {\varvec{M}})\geqslant -\alpha \). Conclusion: \({\varvec{M}}\leftarrow {\varvec{M}}\cup \{X_{4}\}\).

Iteration 3: \({\varvec{M}}=\{X_{1},X_4\}\); \(f_{\varvec{D}}(T;X_{18}\mid {\varvec{M}})= \max _{X\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}} f_{\varvec{D}}(T;X\mid {\varvec{M}})\geqslant -\alpha \). Conclusion: \({\varvec{M}}\leftarrow {\varvec{M}}\cup \{X_{18}\}\).

Iteration 4: \({\varvec{M}}=\{X_{1},X_4,X_{18}\}\); \(f_{\varvec{D}}(T;X_{23}\mid {\varvec{M}})= \max _{X\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}} f_{\varvec{D}}(T;X\mid {\varvec{M}})\geqslant -\alpha \). Conclusion: \({\varvec{M}}\leftarrow {\varvec{M}}\cup \{X_{23}\}\).

Iteration 5: \({\varvec{M}}=\{X_{1},X_4,X_{18},X_{23}\}\); \(\max _{X\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}}f_{\varvec{D}}(T;X\mid {\varvec{M}})\approx -1.0000<-\alpha \). Conclusion: the growing phase ends, and the obtained Mb is \(\{X_{1},X_4,X_{18},X_{23}\}\).

Shrinking phase:

Iteration 1: \({\varvec{M}} = \{X_{1},X_4,X_{18},X_{23}\}\); \(f_{\varvec{D}}(T;X_{1}\mid {\varvec{M}}{\setminus }\{X_{1}\})\approx -1.0000 < -\alpha \). Conclusion: \(X_{1}\) is removed from the CMb.

Iteration 2: \({\varvec{M}} = \{X_4,X_{18},X_{23}\}\); \(f_{\varvec{D}}(T;X_{i}\mid {\varvec{M}}{\setminus }\{X_{i}\})\approx -0.0000\geqslant -\alpha \), for \(i=4,18,23\). Conclusion: the shrinking phase ends, and the obtained MB is \(\{X_4,X_{18},X_{23}\}\).

Conclusion: IAMB outputs \(\{X_4,X_{18},X_{23}\}\) as the MB of T, which is incorrect.

The false positives \(X_1\), \(X_4\), and \(X_{18}\) gain spuriously high associations with T because multiple information channels connect them to T:

\(T\leftarrow X_{23}\) \((\hbox {or}~X_{27})\rightarrow X_{22}\rightarrow X_{1}\) and \(T\leftarrow X_{29}\rightarrow X_{1}\) connect T and \(X_1\);

\(T\leftarrow X_{23}\) \((\hbox {or}~X_{27})\rightarrow X_{22}\rightarrow X_{4}\) and \(T\leftarrow X_{29}\rightarrow X_{21}\rightarrow X_{15}\rightarrow X_4\) connect T and \(X_4\);

\(T\leftarrow X_{29}\rightarrow X_{28}\rightarrow X_{18}\) and \(T\leftarrow X_{23}\) \((\hbox {or}~X_{27})\rightarrow X_{22}\rightarrow X_{21}\rightarrow X_{19}\rightarrow X_{18}\) connect T and \(X_{18}\).

The true CMI, \(\mathbb {I}(T;X_{27}\mid X_1,X_4,X_{18},X_{23})\approx 0.0331 > 0\), indicates \(T\not \perp \!\!\!\perp X_{27}\mid \{X_1,X_4,X_{18},X_{23}\}\), but the p value of the \(G^2\) test, \(p_{\varvec{D}}(T;X_{27}\mid X_1,X_4,X_{18},X_{23})\approx 1.0000\), is far larger than \(\alpha \), meaning the opposite assertion \(T\perp \!\!\!\perp X_{27}\mid \{X_1,X_4,X_{18},X_{23}\}\);

The true CMI, \(\mathbb {I}(T;X_{29}\mid X_1,X_4,X_{18},X_{23})\approx 0.0352 > 0\), indicates \(T\not \perp \!\!\!\perp X_{29}\mid \{X_1,X_4,X_{18},X_{23}\}\). On the other hand, \(p_{\varvec{D}}(T;X_{29}\mid X_1,X_4,X_{18},X_{23})\approx 1.0000\gg \alpha \) asserts \(T\perp \!\!\!\perp X_{29}\mid \{X_1,X_4,X_{18},X_{23}\}\).
This analysis shows that the incorrectness of CI tests may bring swamping and masking. However, we have to use “all CI tests are correct” as a precondition for an MB algorithm. Hence, what we can do is to reduce the incorrectness of CI tests as far as possible. Considering that an incorrect CI test is usually a case of accepting a false null hypothesis (Cochran 1954; Bromberg and Margaritis 2009), a good MB algorithm should add as few false positives as possible to the CMb in the growing phase, because too many false positives can make the detection of a true dependence hard.
4.2 Method
Example 1 presents a simplified scenario in which swamping and masking happen due to the incorrectness of CI tests. By the graphical structure illustrated in (a) of Fig. 1, the target T propagates its information to X, \(Y_1\), and \(Y_2\); then, \(Y_1\) and \(Y_2\) transmit the information to Z. In other words, Z collects the information about T through \(Y_1\) and \(Y_2\), so it may carry more information about T than either \(Y_1\) or \(Y_2\). Mathematically, \(\mathbb {I}(T;Z)\geqslant \max \{\mathbb {I}(T;Y_1),\mathbb {I}(T;Y_2)\}\) may hold. This indicates that Z has a spuriously high association with T. For a larger BN such as the ALARM network, there may be many nodes similar to Z. Hence, we can add as few false positives as possible to the CMb by identifying such nodes.
Suppose the transmission via \(Y_2\) is blocked, as (b) of Fig. 1 shows. That is, \(T\rightarrow Y_1\rightarrow Z\) becomes the only remaining channel between T and Z. In this case, the data-processing inequality (Cover and Thomas 2006) gives \(\mathbb {I}(T;Z\mid Y_2)\leqslant \mathbb {I}(T;Y_1\mid Y_2)\). Similarly, if the transmission via \(Y_1\) is blocked as shown in (c) of Fig. 1, then \(\mathbb {I}(T;Z\mid Y_1)\leqslant \mathbb {I}(T;Y_2\mid Y_1)\). This means Z can no longer effectively collect the information about T once one or more channels between T and Z are blocked, so \(Y_1\) or \(Y_2\) will enter the CMb before Z. Without loss of generality, suppose the CMb is obtained as \({\varvec{M}}\triangleq \{X,Y_1\}\) after two steps of the growing phase. Then, further blocking implies \(T\not \perp \!\!\!\perp Y_2\mid {\varvec{M}}\) and \(T\perp \!\!\!\perp Z\mid {\varvec{M}}\). Hence, \(Y_2\) enters \({\varvec{M}}\) and thus \({\varvec{M}}=\{X,Y_1,Y_2\}\). Finally, \(T\perp \!\!\!\perp Z\mid \{X,Y_1,Y_2\}\), meaning the growing phase ends.
As seen, the method of blocking one or more information channels adds as few false positives as possible to the CMb in the growing phase, because the remaining information about T carried by a node (after blocking) is closer to the true unique information about T carried by that node. Therefore, this method can reduce the swamping and masking caused by (P1).
(a) Selection: Let \({\varvec{M}}_1\) be the set of all nodes having information channels reaching T other than those through \({\varvec{M}}\). The nodes in \({\varvec{M}}_1\) are the candidates preparing to enter the CMb in the current step.
(b) Exclusion: If \({\varvec{M}}_1\) is empty, the growing phase ends; if \(|{\varvec{M}}_1|=1\), add the only node in \({\varvec{M}}_1\) to \({\varvec{M}}\) and then go to (a) of the next iteration; otherwise, the method of blocking information channels is used. Put \({\varvec{M}}_2\triangleq \{X\in {\varvec{M}}_1: g_{\varvec{D}}(T;X\mid {\varvec{M}},\varvec{N}_X)\geqslant -\alpha \}\) and \({\varvec{M}}_3\triangleq {\varvec{M}}_1{\setminus }{\varvec{M}}_2\), in which \(\varvec{N}_X\) denotes the set of all nodes having information channels reaching T and X other than those through \({\varvec{M}}\), and \(g_{\varvec{D}}\) is given in Eq. (2) below. This heuristic is inspired by the notion of 1-step dependence coefficient (de Campos 2006; Martínez-Rodríguez et al. 2008; Lee et al. 2012). If \({\varvec{M}}_2={\varnothing }\), modify it as \({\varvec{M}}_2\triangleq \{Y\}\) with \(Y=\arg \max _{X\in {\varvec{M}}_1}f_{\varvec{D}}(T;X\mid {\varvec{M}})\). All nodes in \(\varvec{M}_3\) (with spuriously high dependence on T) are excluded. This step can effectively reduce the possibility of adding too many false positives to the CMb. A further discussion of the exclusion procedure is given in Sect. 7.
(c) Inclusion: Let \({\varvec{Y}}\) be a set of \(k^{\,\!*}\triangleq \min \{k,|{\varvec{M}}_2|\}\) nodes from \({\varvec{M}}_2\) with the highest associations with T: take
$$\begin{aligned} g_{\varvec{D}}(T;X\mid {\varvec{M}},\varvec{N}_{X})=\min \limits _{Z\in \varvec{N}_X} f_{\varvec{D}}(T;X\mid {\varvec{M}}\cup \{Z\}) \end{aligned}$$
(2)
and let \({\varvec{Y}}=\{X_{(1)},\ldots ,X_{(k^{*})}\}\), with \(g_{\varvec{D}}(T;X_{(1)}\mid {\varvec{M}},\varvec{N}_{X_{(1)}})\geqslant \cdots \geqslant g_{\varvec{D}}(T;X_{(|{\varvec{M}}_2|)}\mid {\varvec{M}},\varvec{N}_{X_{(|{\varvec{M}}_2|)}})\). Add the nodes in \({\varvec{Y}}\) to \({\varvec{M}}\). Here, \(k~(\geqslant 1)\) is the maximal number of nodes entering the CMb at each iteration. This paper uses \(k=3\).
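One growing-phase iteration of the selection-exclusion-inclusion (SEI) procedure might be sketched as follows. This is our reading of steps (a)-(c), not the authors' pseudo code: `assoc` stands for \(f_{\varvec{D}}\), `channels` is an assumed oracle returning \(\varvec{N}_X\), and the \(-\alpha\) threshold used in the exclusion step is our interpretation.

```python
def sei_iteration(T, variables, mb, assoc, channels, alpha, k=3):
    """One SEI iteration; mutates and returns mb, or None when growing ends."""
    # (a) Selection: candidates still dependent on T given the current CMb.
    m1 = [X for X in variables
          if X != T and X not in mb and assoc(T, X, mb) >= -alpha]
    if not m1:
        return None                      # growing phase ends
    if len(m1) == 1:
        mb.add(m1[0])
        return mb
    # (b) Exclusion: score by Eq. (2), the association after blocking one
    # information channel through a node of N_X (unblocked if N_X is empty).
    def g(X):
        n_x = channels(T, X, mb)
        return min((assoc(T, X, mb | {Z}) for Z in n_x),
                   default=assoc(T, X, mb))
    m2 = [X for X in m1 if g(X) >= -alpha]
    if not m2:                           # fall back to the single best candidate
        m2 = [max(m1, key=lambda X: assoc(T, X, mb))]
    # (c) Inclusion: add the (at most k) highest-scoring survivors.
    for X in sorted(m2, key=g, reverse=True)[:min(k, len(m2))]:
        mb.add(X)
    return mb
```

On a toy oracle mimicking Example 1 (Z dependent on T only while its channels through \(Y_1\), \(Y_2\) are unblocked), the exclusion step keeps Z out and the first iteration admits the true positives.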
This is the basic method of designing the new algorithm, LRH (presented in the next subsection). It will be seen that the algorithm performs well in lessening swamping, resisting masking, and highlighting the true positives. This is why we call it the LRH algorithm.
4.3 LRH algorithm with application to the ALARM network
By the description given in Sect. 4.2, we present the LRH algorithm in Algorithm 1. LRH consists of two phases: in the growing phase, the SEI procedure is iteratively implemented to search for an Mb that contains as few false positives as possible; in the shrinking phase, the Mb is refined into an MB. Specifically, the selection, exclusion, and inclusion procedures of SEI are implemented in Line 3, Line 5, and Line 7, respectively, of the growing phase of LRH. As the following theorem shows, LRH is correct under the local composition assumption or the Markov local composition assumption; hence, LRH is also an LCMB algorithm. The proof of this theorem is similar to that of Theorem 2, so we omit it here.
Theorem 3
(Correctness of LRH) Assume all CI tests are correct. Then LRH outputs a WMb of T for any \(k\geqslant 1\). Further, if T satisfies the (Markov) local composition assumption, then LRH outputs an MB of T.
Details of LRH for discovering the MB of \(T\triangleq X_{2}\) on the ALARM network with \(k=3\) and \(\alpha =0.05\), based on a data set of size 5000
Phase  Iteration  Results of LRH  

Growing  1  Selection  \(\varvec{M}={\varnothing }\) 
Open image in new window \(=\{X_i\!: i= 1,4,7,8,10,12,\ldots \!,15,18,19,21,\ldots \!,29,31,34,36\}\)  
Exclusion  Open image in new window , in which the nodes are sorted according to \(g_{\varvec{D}}\) from high to low. See Eq. (2) for details  
Conclusion: \(X_{i}\) is excluded from \({\varvec{M}}_1\), for \(i= 1,7,10,12,13,14,15,19,22,25,26,28,31,34\)  
Inclusion  \({\varvec{Y}}\) = “a set of at most \(k\) nodes from \(\varvec{M}_2\) with the highest associations with \(T\)” \(= \{X_{29},X_{23},X_{21}\}\)  
Conclusion: \(X_{29}\), \(X_{23}\), and \(X_{21}\) are included into \({\varvec{M}}\)  
2  Selection  \({\varvec{M}}=\{X_{29},X_{23},X_{21}\}\)  
\({\varvec{M}}_1=\{X_{27}\}\)  
Exclusion  \({\varvec{M}}_2=\{X_{27}\}\)  
Conclusion: no node is excluded from \({\varvec{M}}_1\) since \(|{\varvec{M}}_1|=1\)  
Inclusion  \({\varvec{Y}} =\{X_{27}\}\)  
Conclusion: \(X_{27}\) is included into \({\varvec{M}}\)  
3  \({\varvec{M}}=\{X_{29},X_{23},X_{21},X_{27}\}\)  
\({\varvec{M}}_1={\varnothing }\), so the search stops  
Conclusion: the growing phase ends, and the obtained Mb is \(\{X_{21},X_{23},X_{27},X_{29}\}\)  
Shrinking  1  \({\varvec{M}} = \{X_{21},X_{23},X_{27},X_{29}\}\)  
\(f_{\varvec{D}}(T;X_{21}\mid {\varvec{M}}{\setminus }\{X_{21}\})=1.0000 \geqslant \alpha \); \(f_{\varvec{D}}(T;X_{23}\mid {\varvec{M}}{\setminus }\{X_{23}\})=0.0000 < \alpha \); \(f_{\varvec{D}}(T;X_{27}\mid {\varvec{M}}{\setminus }\{X_{27}\})=0.0000 < \alpha \); \(f_{\varvec{D}}(T;X_{29}\mid {\varvec{M}}{\setminus }\{X_{29}\})= 0.0000 < \alpha \)  
Conclusion: \(X_{21}\) is removed from the CMb  
2  \({\varvec{M}} = \{X_{23},X_{27},X_{29}\}\)  
\(f_{\varvec{D}}(T;X_{23}\mid {\varvec{M}}{\setminus }\{X_{23}\})=0.0000 < \alpha \); \(f_{\varvec{D}}(T;X_{27}\mid {\varvec{M}}{\setminus }\{X_{27}\})=0.0000 < \alpha \); \(f_{\varvec{D}}(T;X_{29}\mid {\varvec{M}}{\setminus }\{X_{29}\})=0.0000 < \alpha \)  
Conclusion: the shrinking phase ends, and the obtained MB is \(\{X_{23},X_{27},X_{29}\}\)  
Conclusion: LRH outputs \(\{X_{23},X_{27},X_{29}\}\) as the MB of T, which is correct 
To demonstrate how LRH works, we apply the algorithm to the ALARM network. The detailed operating steps of LRH for discovering the MB of \(T\triangleq X_2\) are presented in Table 2. Following the steps in the table, LRH first adds \(\{X_{29},X_{23},X_{21}\}\) and \(\{X_{27}\}\) to the CMb, and then removes the only false positive, \(X_{21}\). As expected, the two nodes \(X_4\) and \(X_{18}\), with spuriously high dependence on the target, are successfully identified before the inclusion procedure of SEI; therefore, they no longer swamp the true positives \(X_{27}\) and \(X_{29}\). In comparison, IAMB adds these two nodes, plus another nonmember of the true MB, namely \(X_1\), \(X_4\), and \(X_{18}\); the two true positives \(X_{27}\) and \(X_{29}\) are then swamped. Although \(X_1\) is finally removed, \(X_4\) and \(X_{18}\) continue to mask themselves, so IAMB gives an incorrect output.
Results of the LCMB algorithms applied to the ALARM network based on a data set of size 5000 (\(\alpha = 0.05\))


LRH retrieves 21 and 3 Mbs; IAMB retrieves 8 and 5 Mbs; and KIAMB with \(K=0.5\) or \(K=0.8\) performs nearly as well as IAMB, while KIAMB with \(K=0.2\) performs poorly. Further, LRH attains 34 of the 37 maximal REs; IAMB attains 14; and KIAMB with K taken as 0.2, 0.5, and 0.8 attains 5, 12, and 12 maximal REs, respectively. This indicates LRH improves greatly on IAMB and KIAMB. The results also show that RE is a reasonable measure of how well a potential MB carries the information about the target. In addition, it should be mentioned that each LCMB algorithm outputs several Mbs (supersets of the true MBs) after implementing the BW function (see Algorithm 1 for details). This type of masking is a consequence of the incorrectness of some associated CI tests. The next subsection discusses this issue and provides an effective post-processing technique, called PostBW, to alleviate this type of masking.

There are 12 cases where LRH outputs a proper subset of the true MB, and 9 such cases for IAMB. In every such case for IAMB, LRH outputs either a proper subset of the true MB (5 cases) or the true MB itself (4 cases), but not vice versa. Therefore, LRH performs better than IAMB in lessening swamping. Taking \(X_{12}\) as the target, for example, IAMB adds \(X_{5}\), \(X_{7}\), \(X_{8}\), \(X_{9}\), \(X_{10}\), \(X_{13}\), and \(X_{34}\) to the CMb in its growing phase, and then removes \(X_{5}\), \(X_{7}\), \(X_{8}\), and \(X_{9}\) in its shrinking phase, outputting \(\{X_{10},X_{13},X_{34}\}\) as the MB of \(X_{12}\); LRH adds \(X_{10}\), \(X_{13}\), \(X_{16}\), and \(X_{34}\) to the CMb in the growing phase, and no nodes are removed in the shrinking phase. As seen, \(X_{16}\) is a spouse node of \(X_{12}\), but IAMB fails to include it. This is because IAMB adds so many false positives (i.e., \(X_{5}\), \(X_{7}\), \(X_{8}\), and \(X_{9}\)) in its growing phase that the CI test for the true dependence Open image in new window incorrectly accepts the null hypothesis of independence. LRH outputs the true MB of \(X_{12}\). Similarly, taking \(X_{25}\) as the target, IAMB also fails to include the spouse node \(X_{24}\), while LRH outputs the true MB, \(\{X_{23},X_{24},X_{26}\}\).

For IAMB, there are 15 cases in which the output excludes one or more true positives and includes some false positives. This means IAMB suffers severely from swamping and masking. In comparison, LRH yields only one such output; by virtue of the SEI procedure, it successfully prevents spuriously dependent variables from entering the CMb in most situations. Therefore, the heuristic involved in SEI is effective in lessening swamping, resisting masking, and highlighting the true positives.
Average REs and RTs of the LCMB algorithms applied to the ALARM network based on 10 data sets of different sizes (from 500 to 5000): each result is averaged over all the 37 nodes as targets

Additionally, the average running time (RT; in seconds) of each LCMB algorithm is listed in the second part of Table 4. As the table shows, there is no significant difference between the RT of LRH and those of IAMB and KIAMB. Therefore, all three LCMB algorithms are time efficient.
4.4 PostBW: a postprocessing technique
Theorem 4
Proof
We first prove the necessity. Suppose there are some \(X\in {\varvec{M}}\) and some \(Y\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}\) such that (4) and (5) hold simultaneously. Then Open image in new window , in view of the contraction property, so (6) follows from the decomposition property. Consider the case that \({\varvec{M}}\) is a WMb, that is, Open image in new window holds for any \(Z\in \varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}\). Equivalently, we have Open image in new window . Combined with (6) and the contraction property, this implies Open image in new window . Hence, Open image in new window holds for any \(U\in (\varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\})\cup \{X\}= \varvec{V}{\setminus }({\varvec{M}}{\setminus }\{X\}){\setminus }\{T\}\). This contradicts that \({\varvec{M}}\) is a WMB. In the case that \({\varvec{M}}\) is an Mb, we can similarly verify Open image in new window , which contradicts that \({\varvec{M}}\) is an MB. The proof of the necessity is completed.
To examine the performance of PostBW, we apply this procedure to the Mbs of \(X_3\), \(X_6\), \(X_8\), \(X_{11}\), and \(X_{32}\) outputted by IAMB and LRH (see Table 3 for details). All the false positives accepted by BW are identified by PostBW and all the true MBs for these five targets are correctly discovered, except that the MB of \(X_{11}\) is obtained as \(\{X_{34},X_{36}\}\). This shows that PostBW improves on BW substantially.
The computational complexity will increase if PostBW is used: the procedure needs \(O(|\varvec{V}|\cdot |{\varvec{M}}_T|)\) extra CI tests. A feasible way to alleviate the resulting computational cost is to interleave PostBW with BW. Following this idea, we first implement BW in each iteration, and then activate PostBW whenever BW stops. For convenience, we call this interleaved procedure InterPostBW and present its pseudo code in Algorithm 2. Finally, we apply InterPostBW to the Mbs of \(X_3\), \(X_6\), \(X_8\), \(X_{11}\), and \(X_{32}\) outputted by IAMB and LRH. The results indicate InterPostBW has the same performance as PostBW in the sense of RE but needs less RT in most situations.
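The interleaving idea can be sketched as follows; `bw_pass` and `postbw_pass` are hypothetical stand-ins for one converged BW shrink and one PostBW sweep of extra pairwise CI tests, respectively (the actual procedures are given in Algorithm 2):

```python
def inter_postbw(M, V, bw_pass, postbw_pass):
    # Alternate a full BW shrink with a PostBW sweep until the CMb
    # stabilizes; each PostBW pass can unmask further false positives.
    while True:
        M = bw_pass(M)               # standard backward shrinking
        M_after = postbw_pass(M, V)  # extra CI tests against nodes outside M
        if M_after == M:             # no further change: converged
            return M
        M = M_after
```

The benefit over running PostBW once at the end is that each unmasked false positive is removed before the next round of tests, so later CI tests condition on a smaller, cleaner set.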
5 WLCMB algorithmic framework
Section 4 considered problem (P1) and proposed the LRH algorithm. As we saw, LRH is time efficient and much more data efficient than IAMB and KIAMB. However, as Example 2 shows, the Markov local composition assumption may be violated in practice; if this is the case, LRH and the other two LCMB algorithms will stop searching before finding a true MB. In this section, we consider problem (P2) as follows: analyze why swamping and masking occur when the Markov local composition assumption is violated, discuss how to overcome them by resuming the stopped search of LCMB, and build a corresponding algorithmic framework.
Recalling Example 2 from Sect. 1, IAMB incorrectly outputs \(\{Z\}\) as the MB of T, meaning the two true positives (i.e., X and Y) are swamped by Z, and the false positive (i.e., Z) successfully masks itself. This indicates the dynamic heuristic in the growing phase of IAMB may lead to swamping, which may in turn bring masking. Similarly, LRH incorrectly outputs \(\{Z\}\) as the MB of T, so LRH is also vulnerable to this type of swamping and masking. We also find that KIAMB, as a randomized version of IAMB, may discover the true MB if implemented repeatedly, but the probability is low. In addition, GS may find the true MB, but this depends on the preassigned priority of variables checked in every search; swamping and masking will happen if, for example, the priority is “Z, X, Y” or “Z, Y, X”. Thus, LCMB may prematurely terminate the growing phase if the CMb shields T from every remaining single variable.
Let \({\varvec{M}}\) be a true MB of T in \(\varvec{V}\), and let \({\varvec{M}}_{\mathbb {A}}\triangleq ({\varvec{M}}{\setminus }\varvec{X})\cup {\varvec{Y}}\) be the output of an MB discovery algorithm, \(\mathbb {A}\). Under the assumption that all CI tests are correct, Theorems 2 and 3 show that \({\varvec{M}}_{\mathbb {A}}\) is a WMB of T in \(\varvec{V}\). Further, \({\varvec{M}}_{\mathbb {A}}\) is not an MB (and thus also not an Mb) if the local composition assumption with respect to \({\varvec{M}}_{\mathbb {A}}\) is violated. In this case, \(\varvec{X}\ne {\varnothing }\), so swamping must occur. The questions are then: (1) why can some useful information about T carried by variables in \(\varvec{V}{\setminus }{\varvec{M}}{\setminus }\{T\}\) not be captured by \(\mathbb {A}\)? (2) how can the stopped search of \(\mathbb {A}\) be resumed?
 Open image in new window : On the one hand, \({\varvec{M}}\) is an MB, meaning Open image in new window , so Open image in new window in view of the weak union property. Equivalently, we have Open image in new window . On the other hand, \({\varvec{M}}_{\mathbb {A}}\) is not an Mb. Suppose Open image in new window ; then the contraction property would imply that \({\varvec{M}}_{\mathbb {A}}\) is an Mb of T, which contradicts that \({\varvec{M}}_{\mathbb {A}}\) is not an Mb of T. Hence, Open image in new window .

\(k\geqslant 2\), and Open image in new window holds for any \(\ell =1,\ldots ,k\): \({\varvec{M}}_{\mathbb {A}}\) is a WMB (and thus a WMb) of T.

Open image in new window holds for any nonempty \(\varvec{N}\subseteq {\varvec{M}}_{\mathbb {A}}\): This is because \({\varvec{M}}_{\mathbb {A}}\) is a WMB of T.
Definition 5
(WMBsupplementary) For \(T\in \varvec{V}\), let \({\varvec{M}}_{\mathbb {A}}\) be a WMB of T in \(\varvec{V}\). For \(\varvec{S}\subseteq {\varvec{M}}_{\mathbb {A}}\), we call \(\varvec{N}_{\varvec{S}}\) \((\subseteq \varvec{V}{\setminus }{\varvec{M}}_{\mathbb {A}}{\setminus }\{T\})\) a WMBsupplementary of \(\varvec{S}\), if the following two conditions hold: (1) \(({\varvec{M}}_{\mathbb {A}}{\setminus }\varvec{S})\cup \varvec{N}_{\varvec{S}}\) is a WMb of T in \(\varvec{V}{\setminus }\varvec{S}\); and (2) Open image in new window holds for any nonempty \(\varvec{N}\subseteq \varvec{N}_{\varvec{S}}\), if \(\varvec{N}_{\varvec{S}}\ne {\varnothing }\).
The analysis before Definition 5 provides a method of resuming the search of LCMB: if we temporarily put a set of some (or all) nodes from \({\varvec{M}}_{\mathbb {A}}\), say \(\varvec{S}\), into a blacklist, some swamped information may be detected, so the search can continue. For convenience, we call \(\varvec{S}\) a swamping set. Observing Example 2 again, X and Y are no longer swamped once Z is temporarily removed. Example 3 in the appendix gives a similar inspiration. If we can find a swamping set \(\varvec{S}\), we remove it temporarily and search for the variables swamped by \(\varvec{S}\).
In this way, we construct an LCMB-based algorithmic framework called WLCMB. Here, “W” stands for “weak”; we call it WLCMB because it can output an MB under the weak Markov local composition assumption defined below.
Definition 6
(Weak Markov local composition) We say T satisfies the weak Markov local composition property, if every WMB, \({\varvec{M}}_{\mathbb {A}}\), of T satisfying the following condition is an MB of T: any \(\varvec{S}\subseteq {\varvec{M}}_{\mathbb {A}}\) has a WMBsupplementary \(\varvec{N_S}\) such that Open image in new window .
The pseudo code of WLCMB is described in Algorithm 3. By the algorithm, WLCMB interleaves LCMB (Line 7 of Algorithm 3) with the search-resuming procedure (Lines 9 and 10 of Algorithm 3) by virtue of the ImpWMB function. If \(\mathbb {A}\) is taken as IAMB, KIAMB, or LRH, the corresponding WLCMB algorithm is called WIAMB, WKIAMB, or WLRH, respectively. Moreover, the BW procedure in Algorithm 3 can also be replaced with PostBW or InterPostBW.
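A minimal sketch of the search-resuming loop, restricted to singleton swamping sets for brevity; `lcmb` and `find_supplement` are hypothetical stand-ins for the LCMB call and the WMB-supplementary search of Algorithm 3, not its actual pseudo code:

```python
def wlcmb(T, V, lcmb, find_supplement):
    M = lcmb(T, V, set())               # initial LCMB run from scratch
    while True:
        M_next = M
        # Try each singleton swamping set S: blacklist it temporarily
        # and re-run LCMB seeded with the supplementary variables.
        for S in (frozenset({x}) for x in sorted(M)):
            N_S = find_supplement(T, V, M, S)  # WMB-supplementary of S
            if N_S:                            # swamped information found
                M_next = lcmb(T, V, (M - S) | set(N_S))
                break
        if M_next == M:                 # no swamping set helps: done
            return M
        M = M_next
```

On the Example 2 pattern, where an initial run returns \(\{Z\}\), blacklisting \(Z\) recovers \(\{X, Y\}\), and a further pass finds no helpful swamping set, so the loop terminates with the true MB.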
Theorem 5
(Correctness of WLCMB) Assume all CI tests are correct. Then WLCMB outputs a WMB of T for any LCMB algorithm taken from \(\{\mathtt{IAMB},\mathtt{KIAMB},\mathtt{LRH}\,\!\}\). Further, if T satisfies the weak Markov local composition assumption, then WLCMB outputs an MB of T.
Proof

IAMB and WIAMB: First, IAMB outputs \({\varvec{M}}^{(0)}=\{Z\}\), which is not the MB of T. Taking \(\varvec{S}=\{Z\}\subseteq {\varvec{M}}^{(0)}\), we obtain \({\varvec{M}}^{(1)}_{\varvec{S}}=\{X,Y\}\), meaning Open image in new window (i.e., Open image in new window ). Further, \({\varvec{M}}_{\texttt {FW}}^{(1)}=\{X,Y,Z\}\) and \({\varvec{M}}^{(1)}=\{X,Y\}\). Similarly, \({\varvec{M}}^{(2)}=\{X,Y\}={\varvec{M}}^{(1)}\). Thus, WIAMB ends, outputting \(\{X,Y\}\) correctly.

KIAMB and WKIAMB: The output of KIAMB may be \({\varvec{M}}^{(0)}=\{X,Y\}\) or \({\varvec{M}}^{(0)}=\{Z\}\). In either case, WKIAMB can output the correct MB. The details are omitted here.

LRH and WLRH: First, LRH selects \(\{X,Y,Z\}\) and excludes \(\{X,Y\}\) in its SEI procedure. Therefore, LRH outputs \({\varvec{M}}^{(0)}=\{Z\}\). The remaining process of WLRH is similar to that of WIAMB. Finally, WLRH outputs \(\{X,Y\}\) correctly.
Results of the WLCMB algorithms applied to the ALARM network based on a data set of size 5000 (\(\alpha = 0.05\))

Average REs and RTs of the WLCMB algorithms applied to the ALARM network based on 10 data sets of different sizes (from 500 to 5000): each result is averaged over all the 37 nodes as targets

We mention that WLCMB has a higher computational complexity than LCMB: in the average case, the complexity of WLCMB is that of the associated LCMB multiplied by \(2^{|{\varvec{M}}|}\). Hence, WLCMB usually needs a longer RT to yield a better output than LCMB. This can be seen from the second part of Table 6, which provides the average RT of the three WLCMB algorithms applied to the ALARM network. The experimental results on several large networks given in Sect. 6 also support this assertion. This means we should trade off the expected RE against the RT before deciding which MB discovery algorithm to select in practice.
6 Experimental results on large networks
Results of LCMBs and WLCMBs applied to the six large BNs based on two data sets of sizes 5000 and 2500: average AUCs, average REs, and average RTs (in seconds)

For each network, 10 nodes are randomly selected as targets; they are listed in Table 7. Besides the REs and RTs, we also compute the weighted area under the ROC curve (AUC) based on the naive Bayes classifier. For the case of size 5000, we randomly select 4000 instances as the training set and use the rest as the testing set; for the case of size 2500, we randomly select 2000 and 500 of the 5000 instances as the training set and the testing set, respectively. Table 7 presents all the results. For each case, the AUC of the true MB, \({\varvec{M}}_T\), of the target is provided to show how close the performance of an algorithm is to the best possible. Each result is averaged over the 10 targets. In addition, following the recommendation of Peña et al. (2007) and the results given in Sects. 4 and 5, we take \(K=0.8\) in KIAMB and WKIAMB.
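The weighted AUC itself is straightforward to compute. The sketch below implements a one-vs-rest AUC weighted by class prevalence in the test set; the naive Bayes classifier producing the per-class scores is left abstract, since its details are not relevant here:

```python
def binary_auc(y_true, scores):
    # AUC as the normalized Mann-Whitney U statistic (ties count 1/2):
    # the probability that a positive instance outscores a negative one.
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_auc(y_true, class_scores, classes):
    # One-vs-rest AUC per class, weighted by the class's prevalence.
    n = len(y_true)
    total = 0.0
    for c in classes:
        y_bin = [y == c for y in y_true]
        weight = sum(y_bin) / n
        total += weight * binary_auc(y_bin, class_scores[c])
    return total
```

For a perfectly separated two-class test set the weighted AUC equals 1.0, so the AUC of the true MB \({\varvec{M}}_T\) serves as the natural reference point for each algorithm's output.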
Table 7 indicates our algorithms are applicable to large BNs. From the table, we conclude that: (1) among the three LCMB algorithms, LRH performs best in the senses of RE and AUC; (2) among the three WLCMB algorithms, WLRH performs best in both senses; and (3) each WLCMB improves the data efficiency of its corresponding LCMB. In addition, we note the natural conclusion that the results for the larger data size are in most situations more desirable than those for the smaller one. In brief, LRH and WLRH perform best in solving (P1) and (P2), respectively. Considering that WLRH usually needs a longer RT than LRH, as analyzed in Sect. 5, we should first trade off RE against RT in practice and then choose between these two algorithms.
7 Conclusion and discussions
This paper considered two potential reasons for swamping and masking. For problem (P1), that incorrect CI tests may lead to swamping and masking, we proposed the LRH algorithm to alleviate their influence under the local composition assumption. The application to the ALARM network shows the superiority of LRH over the other two LCMB algorithms. For problem (P2), that the violation of local composition may also lead to swamping and masking, we put forward the WLCMB algorithmic framework. Theoretically, WLCMB can improve LCMB: LCMB stops searching once local composition is violated with respect to the WMb obtained in the growing phase, while WLCMB may break this premature exit and continue to search for the swamped true positives. The further application to the ALARM network supports this theoretical argument.
Frequencies of swamping and masking for LCMB and WLCMB when applied to the ALARM network
Frequency  IAMB  LRH  WIAMB  WLRH 

Swamping  24/37  13/37  19/37  11/37 
Masking  20/37  4/37  13/37  1/37 
In the sense of (8), we say \(X_1,\ldots ,X_{\kappa }\) are multiple information equivalent with respect to T given \({\varvec{M}}\) if Open image in new window (\(i=1,\ldots ,\kappa \)) and the CI statements contained in (7) hold. In the case of \(\kappa =2\), the notion of multiple information equivalence reduces to that of information equivalence proposed by Lemeire et al. (2012). Note that multiple information equivalence may exist in \(\varvec{M}_{1}\) even when \({\varvec{M}}_2\) needs no modification.
If multiple information equivalence exists, an alternative is to randomly take one variable from each such equivalence class to constitute a new \(\varvec{M}_{2}\), leaving the other procedures of SEI unchanged. This idea may improve the original SEI and thus LRH. Considering that this operation incurs extra computational cost and that occurrences of multiple information equivalence are rare in practice, we discuss it no further.
References
 Aliferis, C. F., Tsamardinos, I., & Statnikov, A. (2003). HITON: A novel Markov blanket algorithm for optimal variable selection. In AMIA 2003 annual symposium proceedings (pp. 21–25). American Medical Informatics Association.
 Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010a). Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11, 171–234.
 Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010b). Local causal and Markov blanket induction for causal discovery and feature selection for classification part II: Analysis and extensions. Journal of Machine Learning Research, 11, 235–284.
 Beinlich, I. A., Suermondt, H. J., Chavez, R. M., & Cooper, G. F. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Second European conference on artificial intelligence in medicine (pp. 247–256). London: Springer.
 Ben-Gal, I. (2005). Outlier detection. In O. Maimon & L. Rokach (Eds.), Data mining and knowledge discovery handbook (pp. 131–146). New York: Springer. doi:10.1007/0-387-25465-X_7.
 Bromberg, F., & Margaritis, D. (2009). Improving the reliability of causal discovery from small data sets using argumentation. Journal of Machine Learning Research, 10, 301–340.
 Cheng, J., Greiner, R., Kelly, J., Bell, D., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1–2), 43–90.
 Cochran, W. G. (1954). Some methods for strengthening the common \(\chi ^2\) tests. Biometrics, 10(4), 417–451.
 Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken: Wiley.
 de Campos, L. M. (2006). A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. Journal of Machine Learning Research, 7, 2149–2187.
 Fu, S., & Desmarais, M. (2007). Local learning algorithm for Markov blanket discovery. In AI 2007: Advances in artificial intelligence (pp. 68–79). Berlin, Heidelberg: Springer.
 Fu, S., & Desmarais, M. C. (2010). Markov blanket based feature selection: A review of past decade. In Proceedings of the world congress on engineering.
 Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
 Hadi, A. S., Rahmatullah, I. A. H. M., & Mark, W. (2009). Detection of outliers. Wiley Interdisciplinary Reviews: Computational Statistics, 1(1), 57–70.
 Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Thirteenth international conference on machine learning. Stanford InfoLab.
 Lee, C.-P., Leu, Y., & Yang, W.-N. (2012). Constructing gene regulatory networks from microarray data using GA/PSO with DTW. Applied Soft Computing, 12(3), 1115–1124.
 Lemeire, J. (2007). Learning causal models of multivariate systems and the value of it for the performance modeling of computer programs. PhD thesis, ASP/VUBPRESS/UPA.
 Lemeire, J., Meganck, S., Cartella, F., & Liu, T. (2012). Conservative independence-based causal structure learning in absence of adjacency faithfulness. International Journal of Approximate Reasoning, 53(9), 1305–1325.
 Margaritis, D., & Thrun, S. (1999). Bayesian network induction via local neighborhoods. Technical Report CMU-CS-99-134, Carnegie Mellon University.
 Margaritis, D., & Thrun, S. (2000). Bayesian network induction via local neighborhoods. In Advances in neural information processing systems, Vol. 12 (pp. 505–511). Morgan Kaufmann.
 Martínez-Rodríguez, A. M., May, J. H., & Vargas, L. G. (2008). An optimization-based approach for the design of Bayesian networks. Mathematical and Computer Modelling, 48(7–8), 1265–1278.
 Masegosa, A. R., & Moral, S. (2012). A Bayesian stochastic search method for discovering Markov boundaries. Knowledge-Based Systems, 35, 211–223.
 Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River: Prentice Hall.
 Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.
 Pellet, J.-P., & Elisseeff, A. (2008). Using Markov blankets for causal structure learning. Journal of Machine Learning Research, 9, 1295–1342.
 Peña, J. M., Nilsson, R., Björkegren, J., & Tegnér, J. (2007). Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45(2), 211–232.
 Rauh, J., Bertschinger, N., Olbrich, E., & Jost, J. (2014). Reconsidering unique information: Towards a multivariate information decomposition. In IEEE international symposium on information theory (ISIT) (pp. 2232–2236). IEEE.
 Schlüter, F. (2014). A survey on independence-based Markov networks learning. Artificial Intelligence Review, 42, 1069–1093.
 Statnikov, A., Lytkin, N. I., Lemeire, J., & Aliferis, C. F. (2013). Algorithms for discovery of multiple Markov boundaries. Journal of Machine Learning Research, 14(1), 499–566.
 Tsamardinos, I., & Aliferis, C. F. (2003). Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the ninth international workshop on artificial intelligence and statistics.
 Tsamardinos, I., Aliferis, C. F., & Statnikov, A. (2003a). Algorithms for large scale Markov blanket discovery. In Proceedings of the sixteenth international Florida artificial intelligence research society conference (FLAIRS) (pp. 376–381).
 Tsamardinos, I., Aliferis, C. F., & Statnikov, A. (2003b). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 673–678).
 Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max–min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), 31–78.
 Williams, P. L., & Beer, R. D. (2010). Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515.
 Yaramakala, S. (2004). Fast Markov blanket discovery. MS thesis.
 Yaramakala, S., & Margaritis, D. (2005). Speculative Markov blanket discovery for optimal feature selection. In Proceedings of the fifth IEEE international conference on data mining.
 Zhang, L., & Guo, H. (2006). Introduction to Bayesian networks. Beijing: Science Press.
 Zhang, Y., Zhang, Z., Liu, K., & Qian, G. (2010). An improved IAMB algorithm for Markov blanket discovery. Journal of Computers, 5, 1755–1761.