Scoring Bayesian networks of mixed variables
Abstract
In this paper we outline two novel scoring methods for learning Bayesian networks in the presence of both continuous and discrete variables, that is, mixed variables. While much work has been done in the domain of automated Bayesian network learning, few studies have investigated this task in the presence of both continuous and discrete variables while focusing on scalability. Our goal is to provide two novel and scalable scoring functions capable of handling mixed variables. The first method, the Conditional Gaussian (CG) score, provides a highly efficient option. The second method, the Mixed Variable Polynomial (MVP) score, allows for a wider range of modeled relationships, including nonlinearity, but it is slower than CG. Both methods calculate log likelihood and degrees of freedom terms, which are incorporated into a Bayesian Information Criterion (BIC) score. Additionally, we introduce a structure prior for efficient learning of large networks and a simplification in scoring the discrete case which performs well empirically. While the core of this work focuses on applications in the search and score paradigm, we also show how the introduced scoring functions may be readily adapted as conditional independence tests for constraint-based Bayesian network learning algorithms. Lastly, we describe ways to simulate networks of mixed variable types and evaluate our proposed methods on such simulations.
Keywords
Bayesian network structure learning · Mixed variables · Continuous and discrete variables

1 Introduction
Bayesian networks are a widely used graphical framework for representing probabilistic relationships among variables. In general, a Bayesian network consists of two components, a structure component and a distribution component. The structure component encodes conditional independence relationships between variables allowing for an efficient factorization of the joint distribution, while the distribution component parameterizes the probabilistic relationships among the variables. In this paper, our interests lie in learning the structure component of Bayesian networks, represented by a Directed Acyclic Graph (DAG). Learning a DAG over a set of variables is of particular interest, because under assumptions a DAG can be interpreted as a causal model [26].
Automated Bayesian network learning from data is an important and active area of research. However, relatively few researchers have investigated this task in the presence of both continuous and discrete variables [3, 8, 13, 15, 21, 24, 25]. In the limited work that has been done, researchers either ignore the case where continuous variables are parents of discrete variables, or do not provide solutions that scale much beyond 100 variables. The goal of this paper is to provide solutions for researchers working with datasets containing hundreds of variables.
Most methods for learning Bayesian networks fall into one of two categories: search and score or constraint-based. Search and score methods heuristically search the space of possible structures using an objective function to evaluate fitness, while constraint-based methods use conditional independence tests to find patterns of independence that are consistent with a set of DAGs. The core of this paper focuses on the search and score paradigm; however, we also show how the scoring functions we propose may be readily adapted as conditional independence tests for constraint-based methods. For additional background information on Bayesian networks and learning their structures, see [6].
The remainder of this paper is organized as follows. Section 2 discusses general properties of scoring functions and the Bayesian Information Criterion (BIC). Sections 3 and 4 introduce the Conditional Gaussian (CG) score and the Mixed Variable Polynomial (MVP) score, respectively. Section 5 details several adaptations of the introduced methods. Section 6 reports empirical results of the CG and MVP methods on data generated using simulation. Section 7 provides discussion and conclusions.
2 Scoring Bayesian networks
Note that several DAGs can encode the same set of conditional independence relationships. A set of DAGs which encodes the same independencies is known as a Markov Equivalence Class (MEC). If a scoring function S scores all DAGs in the same MEC equally, then S is score equivalent. To clarify, let \({\mathcal {G}}\) and \({\mathcal {G}}'\) be DAGs over the variables in dataset \({\mathcal {D}}\). If \({\mathcal {G}}\) and \({\mathcal {G}}'\) encode the same conditional independence relationships and S is score equivalent, then \(S({\mathcal {G}}, {\mathcal {D}}) = S({\mathcal {G}}', {\mathcal {D}})\). This can be a desirable trait because it allows search algorithms, such as Greedy Equivalent Search (GES) [5], to search over MECs directly.
Another common trait of scoring functions, which algorithms such as GES require for optimality, is consistency. Let \({\mathcal {D}}\) be a dataset and \({\mathcal {G}}\) and \({\mathcal {G}}'\) be DAGs. A scoring function S is consistent if, in the large sample limit, either of the following two conditions implies \({\mathcal {G}}\) will score higher than \({\mathcal {G}}'\), i.e., \(S({\mathcal {G}}, {\mathcal {D}}) > S({\mathcal {G}}', {\mathcal {D}})\): (1) there exists a parameterization \(\theta \) which allows \({\mathcal {G}}\) to represent the generating distribution of \({\mathcal {D}}\) and no such parameterization \(\theta '\) exists for \(\mathcal {G'}\), or (2) there exist parameterizations \(\theta \) and \(\theta '\) which allow \({\mathcal {G}}\) and \({\mathcal {G}}'\) each to represent the generating distribution of \({\mathcal {D}}\), but \({\mathcal {G}}\) contains fewer parameters.
2.1 The Bayesian information criterion
3 The conditional Gaussian score
Assumption 1
The data were generated from a Gaussian mixture where each Gaussian component exists for a particular setting of the discrete variables.
This assumption allows for efficient calculations, but also assumes that the discrete variables take part in generating the continuous variables by defining the Gaussian mixture components, e.g., \(p(C_{1}, C_{2}, D_{1}, D_{2})\) is a Gaussian mixture with a Gaussian component for each setting of \(D_{1}\) and \(D_{2}\). Therefore, when scoring a discrete variable as the child of a continuous variable, our model assumption will inherently encode the reverse relationship. In Sect. 6, we see that even with this assumption, CG performs quite well.
Assumption 2
The instances in the data are independent and identically distributed.
The data are assumed to be i.i.d. so that we can calculate the log likelihood as a sum over the marginal log probabilities for each instance in the data.
It is important to note that if we treat \(p(C_{1}, C_{2}, D_{1}, D_{2})\) as a mixture distribution with a Gaussian component for each setting of \(D_{1}\) and \(D_{2}\), then to calculate \(p(C_{1}, C_{2}, D_{2})\) correctly, we must marginalize \(D_{1}\) out and treat \(p(C_{1}, C_{2}, D_{2})\) as a mixture distribution with a Gaussian mixture component for each setting of \(D_{2}\).
Assumption 3
All Gaussian mixtures are approximately Gaussian.
For computational efficiency, we approximate all Gaussian mixtures resulting from marginalizing discrete variables out as single Gaussian distributions. In Sect. 6.1, we evaluate this approximation experimentally and find that it performs well.
Under mild conditions, BIC is consistent for Gaussian mixture models [10]. Since CG assumes the data are generated according to a Gaussian mixture, under the same mild assumptions, CG is consistent. Additionally, CG is score equivalent; see Appendix A for a proof.
In the remainder of the current section, we provide a high-level overview of the CG method; Sects. 3.1 and 3.2 provide details. Let \(Y_{i}\) be the ith variable in a DAG \({\mathcal {G}}\) with the set \(Pa_{i}\) containing the parents of \(Y_{i}\). Furthermore, let \(Pa_{i}\) consist of two mutually exclusive subsets \(Pc_{i}\) and \(Pd_{i}\) such that \(Pc_{i}\) and \(Pd_{i}\) hold the continuous and discrete parents of \(Y_{i}\), respectively. To evaluate the parent–child relationship between a variable \(Y_{i}\) and its parents \(Pa_{i}\), CG calculates the log likelihood and degrees of freedom for the joint distributions of two sets of variables, \(Y_{i} \cup Pa_{i}\) and \(Pa_{i}\). The log likelihood of \(Y_{i}\) given its parents \(Pa_{i}\) is computed as the difference between the log likelihood terms for \(Y_{i} \cup Pa_{i}\) and \(Pa_{i}\). Similarly, the degrees of freedom are calculated as the difference in parameters used to fit \(Y_{i} \cup Pa_{i}\) and \(Pa_{i}\).
When evaluating the two sets of variables, the dataset \({\mathcal {D}}\) is first partitioned according to the discrete variables in each set. That is, we divide \({\mathcal {D}}\) using a partitioning set \(\varPi _{i}\) over all the instances in \({\mathcal {D}}\). \(\varPi _{i}\) contains a partition for each combination of values the discrete variables take on in \({\mathcal {D}}\). Further, we form a design matrix \({\varvec{X}}_{p}\) for each partition \(p \in \varPi _{i}\). \({\varvec{X}}_{p}\) holds the data corresponding to the instances of the continuous variables in partition p. Gaussian and multinomial distributions are fit according to the continuous and discrete variables, respectively, to calculate log likelihood and degrees of freedom terms which BIC uses to compute the score.
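The local score described above can be sketched in Python. This is an illustrative simplification rather than the authors' implementation: the helper names (`cg_set_score`, `cg_local_bic`) are ours, we assume a continuous child, and we fit each partition's continuous block with a full-covariance Gaussian alongside a multinomial over partition membership.

```python
import numpy as np

def gaussian_loglik(X):
    """Log likelihood of the rows of X under an MLE multivariate Gaussian."""
    n, c = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True).reshape(c, c)  # MLE covariance
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return -0.5 * (n * (c * np.log(2 * np.pi) + logdet) + quad.sum())

def cg_set_score(cont, disc):
    """Log likelihood and parameter count for one set of variables.

    cont: (n, c) array of continuous columns; disc: length-n array of joint
    discrete configurations, or None if the set has no discrete variables.
    """
    n = cont.shape[0]
    ll, df = 0.0, 0
    parts = [np.arange(n)] if disc is None else [
        np.flatnonzero(disc == v) for v in np.unique(disc)]
    for idx in parts:
        w = len(idx)
        if disc is not None:
            ll += w * np.log(w / n)          # multinomial term for the partition
        if cont.shape[1] > 0 and w > cont.shape[1]:
            ll += gaussian_loglik(cont[idx])
            c = cont.shape[1]
            df += c + c * (c + 1) // 2       # mean + covariance parameters
    df += len(parts) - 1                      # multinomial parameters
    return ll, df

def cg_local_bic(y_cont, pa_cont, disc, n):
    """BIC local score for Y given its parents: the difference of the
    set scores for {Y} u Pa and Pa, penalized by the df difference."""
    ll1, df1 = cg_set_score(np.column_stack([y_cont, pa_cont]), disc)
    ll0, df0 = cg_set_score(pa_cont, disc)
    return (ll1 - ll0) - 0.5 * (df1 - df0) * np.log(n)
```

With strongly related variables, the true parent receives a much higher local score than an unrelated one, which is the behavior the search exploits.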
3.1 Modeling a set of variables
When using CG, we have three different kinds of sets to model: \(Y_{i} \cup Pa_{i}\) where \(Y_{i}\) is continuous, \(Y_{i} \cup Pa_{i}\) where \(Y_{i}\) is discrete, and \(Pa_{i}\). They all follow the same generic format, so we will describe the process in general while pointing out any subtle differences where they apply.
First we partition the data with respect to a partitioning set \(\varPi _i\) generated according to the discrete variables \(Pd_{i}\). Note that if our set includes a discrete child \(Y_{i}\), then the discrete variables are comprised of \(Y_{i} \cup Pd_{i}\) and we partition according to these variables. \(\varPi _{i}\) contains a partition for every combination of values in the discrete variables. We define the partitioning set \(\varPi _{i}\) using a Cartesian product of the discrete variables. Let \(|Pd_{i}| = d\), then partitioning set \(\varPi _{i} = (Y_{i}) \times Pd_{i}(1) \times Pd_{i}(2) \times \dots \times Pd_{i}(d)\) where \(Y_{i}\) is the set of values for the child (included only if \(Y_{i}\) is discrete), \(Pd_{i}(1)\) is the set of values for the first discrete parent, \(Pd_{i}(2)\) is the set of values for the second discrete parent, and so forth.
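The Cartesian-product partitioning can be sketched as follows; `partition_indices` is a name we introduce, and we keep only the cells that are actually observed in \({\mathcal {D}}\), matching the definition of \(\varPi _{i}\) above.

```python
from itertools import product
import numpy as np

def partition_indices(disc_cols):
    """Map each cell of the Cartesian product of the discrete columns'
    observed value sets to the row indices that fall in that cell."""
    value_sets = [np.unique(col) for col in disc_cols]
    parts = {}
    for combo in product(*value_sets):
        mask = np.ones(len(disc_cols[0]), dtype=bool)
        for col, v in zip(disc_cols, combo):
            mask &= (col == v)
        if mask.any():                  # drop cells with no observations
            parts[combo] = np.flatnonzero(mask)
    return parts
```

Each resulting index array then selects the rows of the design matrix \({\varvec{X}}_{p}\) for its partition.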
3.2 Calculating the log likelihood and degrees of freedom
4 The mixed variable polynomial score
The Mixed Variable Polynomial (MVP) score uses higher order polynomial functions to approximate relationships between any number of continuous and discrete variables. Since MVP uses BIC as a framework to evaluate its approximations, the score is decomposable into a sum of parent–child relationships. The MVP method scores the decomposed local components of a DAG \({\mathcal {G}}\) using approximating polynomial functions. To motivate the ideas underlying this approach, we note the implications of the Weierstrass Approximation Theorem for consistency.
Weierstrass Approximation Theorem Suppose f is a continuous real-valued function defined on the real interval [a, b]. For every \(\epsilon > 0\), there exists a polynomial p such that for all \(x \in [a, b]\), we have \(|f(x) - p(x)| < \epsilon \).
In short, as long as a function f is continuous and the contributing variables exist within a bounded interval, then there exists a polynomial function which approximates f to an arbitrary degree of accuracy [11]. This brings us to our first two assumptions.
Assumption 1
The sample space of each variable is finite.
To shed some light on this assumption, we note that MVP’s approximations are functions of continuous variables in the data. Thus, the motivation for Assumption 1 becomes apparent as a prerequisite of the previously stated theorem; finite sample spaces are bounded.
Assumption 2
Each continuous variable is defined by continuous functions of its continuous parents plus additive Gaussian noise. The probability mass function of each discrete variable is defined by positive continuous functions of its continuous parents.
The motivation for this assumption follows from the Weierstrass Approximation Theorem, since f, the function to be approximated, must be continuous. However, along with assuming continuity, we restrict the model class in the continuous child case to have additive Gaussian noise. This assumption allows us to use least squares regression to obtain efficient maximum likelihood estimates. Additionally, we assume positive functions in the discrete case since we are estimating probability mass functions. It is worth noting that, unlike other commonly used scores, we do not assume linearity.
Assumption 3
There are no interaction terms between continuous parents.
We make this assumption for tractability. Modeling all interactions among the continuous parents is a combinatorial problem. Thus, we forgo such interaction terms.
Assumption 4
The instances in the data are independent and identically distributed.
The data are assumed to be i.i.d. so that we can calculate the log likelihood as a sum over the marginal log probabilities for each instance in the data.
Under these assumptions, the MVP score is consistent in the large sample limit given an adequate choice of maximum polynomial degree; see Appendix A for a proof. However, due to the use of nonlinear functions, it is not score equivalent for any maximum polynomial degree greater than 1. In Sect. 6, we see that even without this property, the MVP score still performs quite well. Moreover, causal relationships are not in general symmetric, so a framework that requires score equivalence is not necessarily desirable. As an example of previous work suggesting that asymmetric scores can be beneficial in inferring causation, see [12].
4.1 Partitioned regression
Let \(Y_{i}\) be the ith variable in a DAG \({\mathcal {G}}\) and \(Pa_{i}\) be the set containing the parents of \(Y_{i}\) in \({\mathcal {G}}\). Furthermore, let \(Pa_{i}\) consist of two mutually exclusive subsets \(Pc_{i}\) and \(Pd_{i}\) such that \(Pc_{i}\) and \(Pd_{i}\) hold the continuous and discrete parents of \(Y_{i}\), respectively. In general, to evaluate the local score component between \(Y_{i}\) and its parents \(Pa_{i}\), MVP first partitions the data with respect to the discrete parents \(Pd_{i}\) and performs least squares regression using the continuous parents \(Pc_{i}\). The log likelihood and degrees of freedom for the model are calculated depending on the variable type of \(Y_{i}\). BIC uses the log likelihood and degrees of freedom terms to compute the score.
A partitioning set \(\varPi _{i}\) partitions \({\mathcal {D}}\) with respect to the discrete parents \(Pd_{i}\) and contains a partition for every combination of values in the discrete parents. We define \(\varPi _{i}\) using a Cartesian product of the discrete parents \(Pd_{i}\). Let \(|Pd_{i}| = d\), then partitioning set \(\varPi _{i} = Pd_{i}(1) \times Pd_{i}(2) \times \dots \times Pd_{i}(d)\) where \(Pd_{i}(1)\) is the set of values for the first discrete parent, \(Pd_{i}(2)\) is the set of values for the second discrete parent, and so forth.
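Anticipating the continuous child case of Sect. 4.2, the partitioned polynomial regression can be sketched as follows. The function names are ours, and per Assumption 3 the design matrix contains per-parent powers with no cross-parent interaction terms; residuals are Gaussian per Assumption 2.

```python
import numpy as np

def poly_design(pc, degree):
    """Design matrix [1, x, x^2, ..., x^degree] per continuous parent,
    with no interaction terms between parents (Assumption 3)."""
    cols = [np.ones(pc.shape[0])]
    for j in range(pc.shape[1]):
        for d in range(1, degree + 1):
            cols.append(pc[:, j] ** d)
    return np.column_stack(cols)

def mvp_continuous_ll(y, pc, disc, degree):
    """Log likelihood and df for a continuous child: per-partition least
    squares on polynomial terms with additive Gaussian noise."""
    ll, df = 0.0, 0
    for v in np.unique(disc):
        idx = np.flatnonzero(disc == v)
        X = poly_design(pc[idx], degree)
        beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        resid = y[idx] - X @ beta
        s2 = max(resid @ resid / len(idx), 1e-12)   # MLE variance, floored
        ll += -0.5 * len(idx) * (np.log(2 * np.pi * s2) + 1)
        df += X.shape[1] + 1                        # coefficients + variance
    return ll, df
```

On quadratic data, a degree-2 fit attains a much higher log likelihood than a degree-1 fit at the cost of one extra coefficient per partition, which is exactly the trade-off BIC arbitrates.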
4.2 Modeling a continuous child
4.3 Modeling a discrete child
In the case where \(Y_{i}\) is discrete, for partition p with design matrix \({\varvec{X}}_{p}\) and target vector \({\varvec{y}}_{p}\), we calculate the log likelihood \(\ell _{p}(\varvec{\theta }_p | {\varvec{X}}_{p}, {\varvec{y}}_p)\) and degrees of freedom \(df_{p}(\varvec{\theta }_p | {\varvec{X}}_{p}, {\varvec{y}}_p)\) using least squares regression. Suppose \(Y_{i}\) consists of d categories. Let \(f_{p,h}\) calculate the probability of the \(h\)th category in \(Y_{i}\) given the values of the continuous parents \(Pc_{i}\) where \(h \in \{1, \dots , d\}\). By Assumption 2 and the Weierstrass Approximation Theorem, there exists a polynomial function \({\hat{f}}_{p,h}\) which approximates \(f_{p,h}\) arbitrarily well. With this in mind, we aim to approximate each \(f_{p,h}\) with the polynomial function \({\varvec{X}}_p \hat{\varvec{\beta }}_{p,h}\) where \(\hat{\varvec{\beta }}_{p,h}\) are the polynomial coefficients calculated from least squares regression. Our end goal is to use the approximations of each \(f_{p,h}\) as components of a conditional probability mass function in order to calculate the log likelihood and degrees of freedom terms.
To yield a valid conditional probability mass function, the estimates must satisfy two constraints:

1. \(\sum _{h=1}^{d} {\varvec{X}}_p \hat{\varvec{\beta }}_{p,h} = {\varvec{1}}_p\)
2. \({\varvec{x}}_{p,j} \hat{\varvec{\beta }}_{p,h} \ge 0, \; \forall \; j \in \{{1, \dots , n_p}\}, h \in \{1, \dots , d\}\).
When the raw estimates violate constraint 2, we adjust them in three steps:

1. Shift the estimates such that they are centered about a noninformative center by subtracting \(\frac{1}{d}\) (line 6).
2. Scale the estimates such that the smallest final value will be at least \(\frac{1}{n_{p}}\) (line 7).
3. Shift the scaled estimates back to the original center by adding \(\frac{1}{d}\) (line 8).
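The three-step adjustment can be sketched as follows. This is our reading of the steps, assuming the raw least squares estimates already satisfy constraint 1 (each row sums to one), each partition has at least d instances, and a single scale factor is applied to the whole partition.

```python
import numpy as np

def adjust_probabilities(P):
    """Shift/scale least squares pmf estimates so every entry is at least
    1/n while preserving row sums of 1. P is (n, d): one row of category
    'probabilities' per instance; entries may initially be negative."""
    n, d = P.shape
    centered = P - 1.0 / d                  # step 1: noninformative center
    m = centered.min()
    if m < 1.0 / n - 1.0 / d:               # scale only when constraint 2 fails
        alpha = (1.0 / n - 1.0 / d) / m     # 0 < alpha < 1: shrink toward center
        centered = centered * alpha
    return centered + 1.0 / d               # step 3: shift back
```

Because each centered row sums to zero, scaling preserves the zero sum, and the final shift restores row sums of one while the smallest entry becomes exactly \(1/n_p\) when scaling was needed.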
5 Implementation details and adaptations
In this section we consider various adaptations of the two proposed scores. In Sect. 5.1, we discuss a binomial structure prior which allows for efficient learning of large networks. In Sect. 5.2, we discuss a simplification for scoring discrete children which performs well empirically. In Sect. 5.3, we discuss how to adapt our scores into conditional independence tests for constraint-based methods.
5.1 Binomial structure prior
We calculate q as \(q = \frac{r}{(m-1)}\), where r represents a user-specified upper bound on the expected number of parents of any given node. In Sect. 6, we let \(m' = m\) in order to calculate the binomial structure prior more efficiently.
Usually, BIC assumes the prior probability of models in Eq. (1) is distributed uniformly. By using the binomial structure prior instead, we adapt BIC to further penalize networks with complex structure. Other approaches also use a nonuniform prior with BIC, notably the extended BIC (EBIC) [4], a similar modification to BIC which aims to address the small-n-large-P situation. In Sect. 6.2, we compare both the binomial structure prior and EBIC against the use of a uniform prior.
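One plausible reading of the binomial structure prior can be sketched as a per-node log-prior term added to the BIC score. We emphasize that this is an assumption on our part: we treat each of the \(m - 1\) candidate parents as included independently with probability q, and the exact form used by the authors may differ (for example, it may include the binomial coefficient).

```python
import math

def binomial_log_prior(k, m, r):
    """Hypothetical per-node log prior for a node with k parents chosen
    from m - 1 candidates, each included independently with probability
    q = r / (m - 1), where r is the expected number of parents."""
    q = r / (m - 1)
    return k * math.log(q) + (m - 1 - k) * math.log(1 - q)
```

Under this reading, each additional parent costs \(\log q - \log (1-q)\) in log-prior, so small r (hence small q) pushes the search toward sparser structures, consistent with the behavior described above.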
5.2 Multinomial scoring with continuous parents
Both scores presented in this paper reduce to multinomial scoring in the case of a discrete child with exclusively discrete parents. As an alternative, we explore the use of multinomial scoring when there are discrete children and any combination of parents. Before starting a search, we create discretized versions of each continuous variable using equal frequency binning with a predefined number of bins b. Whenever scoring a discrete child, we replace any continuous parents with the precomputed discretized versions of those variables. This allows us to quickly and efficiently perform multinomial scoring for all discrete children. We will henceforth refer to this adaptation as the discretization heuristic and report our finding when choosing \(b = 3\) as a modification to CG in Sect. 6.4.
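The binning step of the discretization heuristic could be implemented as below; `equal_frequency_bins` is a name we introduce, and ties in the data can make the bin counts unequal in practice.

```python
import numpy as np

def equal_frequency_bins(x, b):
    """Discretize x into b bins of (approximately) equal counts,
    returning integer labels 0..b-1."""
    # interior quantile cut points; duplicate values can unbalance bins
    edges = np.quantile(x, np.linspace(0, 1, b + 1)[1:-1])
    return np.searchsorted(edges, x, side='right')
```

These labels would be precomputed once per continuous variable before the search begins, so scoring a discrete child never re-discretizes data.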
5.3 As a conditional independence test
We can readily adapt CG and MVP to produce conditional independence tests; to do so, we calculate the log likelihood and degrees of freedom as usual, but perform a likelihood ratio test instead of scoring with BIC. Suppose we wish to test whether \(Y_0 \perp Y_1 \mid Z\), where \(Y_0\) and \(Y_1\) are variables (nodes) and Z is a conditioning set of variables. Define \(\ell _{0}\) and \(df_{0}\), respectively, as the log likelihood and degrees of freedom for \(Y_{0}\) given \(Pa_{0}\) where \(Pa_{0} = Y_1 \cup Z\). Further, define \(\ell '_{0}\) and \(df'_{0}\), respectively, as the log likelihood and degrees of freedom for \(Y_{0}\) given \(Pa'_{0}\) where \(Pa'_{0} = Z\). Perform a likelihood ratio test with test statistic \(2(\ell _{0} - \ell '_{0})\) and \(df_{0} - df'_{0}\) degrees of freedom. This tests whether the model encoding \(Y_0 \not \perp Y_1 \mid Z\) or the model encoding \(Y_0 \perp Y_1 \mid Z\) fits the data better. If the scoring method used is not score equivalent, then we must also perform a likelihood ratio test with test statistic \(2(\ell _{1} - \ell '_{1})\) and \(df_{1} - df'_{1}\) degrees of freedom where \(Y_{0}\) and \(Y_{1}\) are swapped. In this case, we decide the variables are dependent if there is enough evidence in either test to support that hypothesis.
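The test procedure above can be sketched as follows; the function names are ours, and for a self-contained example we use the closed-form chi-square survival function, which holds for even degrees of freedom only (a library routine would handle the general case).

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function, closed form; valid for even df > 0."""
    assert df > 0 and df % 2 == 0
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):           # Poisson series: sum_{i<k} (x/2)^i / i!
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def lr_independence_test(ll_full, df_full, ll_reduced, df_reduced):
    """p value of the likelihood ratio test described above: statistic
    2(l - l') on df - df' degrees of freedom."""
    stat = 2.0 * (ll_full - ll_reduced)
    return chi2_sf(stat, df_full - df_reduced)

def dependent(tests, alpha=0.001):
    """For non-score-equivalent methods, run the test in both directions
    and declare dependence if either direction rejects."""
    return any(lr_independence_test(*t) < alpha for t in tests)
```

Here `tests` would hold the (log likelihood, df) quadruples for both orderings of \(Y_0\) and \(Y_1\) when the underlying score is not score equivalent, and just one quadruple otherwise.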
6 Simulation studies
To simulate mixed data, we first randomly generate a DAG \({\mathcal {G}}\) and designate each variable in \({\mathcal {G}}\) as either discrete or continuous. \({\mathcal {G}}\) is generated by randomly defining a causal order and adding edges between the variables. Edges are added between randomly chosen pairs of nodes such that the connections are true to the pre-specified ordering; they are continually added until the average degree of the graph reaches a user-specified amount. Variables in the network without parents are generated according to Gaussian and multinomial distributions. We create temporary discretized versions of each continuous variable using equal frequency binning with 2–5 bins uniformly chosen, for reasons described below. In causal order, we simulate the remaining variables as follows. Continuous variables are generated by partitioning on the discrete parents and randomly parameterizing the coefficients of a linear regression for each partition. Discrete variables are generated via randomly parameterized multinomial distributions of the variable being simulated, the discrete parents, and the discretized versions of the continuous parents. All temporary variables are removed after the simulation is completed. For all simulations, each variable is assigned either continuous or discrete with equal probability. Additionally, discrete variables will have a uniformly chosen number of categories between 2 and 5, inclusive.
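The graph-generation step of this procedure can be sketched as follows; `random_dag` is our name, and we stop once the target average degree (two times the edge count over the node count) is reached, as described above.

```python
import numpy as np

def random_dag(num_vars, avg_degree, rng):
    """Random DAG: fix a causal order, then repeatedly add edges between
    randomly chosen pairs, oriented to respect that order, until the
    target average degree (2 * #edges / #nodes) is reached."""
    order = [int(v) for v in rng.permutation(num_vars)]
    pos = {v: i for i, v in enumerate(order)}
    target = int(avg_degree * num_vars / 2)
    edges = set()
    while len(edges) < target:
        a, b = (int(v) for v in rng.choice(num_vars, size=2, replace=False))
        if pos[a] > pos[b]:
            a, b = b, a                 # orient earlier -> later in the order
        edges.add((a, b))
    return order, sorted(edges)
```

Because every edge points from an earlier node to a later node in the causal order, the result is acyclic by construction; the remaining simulation steps then parameterize each node given its parents in that order.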
We report the following statistics:

- AP (adjacency precision): the ratio of correctly predicted adjacencies to all predicted adjacencies
- AR (adjacency recall): the ratio of correctly predicted adjacencies to all true adjacencies
- AHP (arrowhead precision): the ratio of correctly predicted arrowheads to all predicted arrowheads
- AHR (arrowhead recall): the ratio of correctly predicted arrowheads to all true arrowheads (among found adjacencies)
- T: elapsed time (seconds)
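As one concrete reading of these definitions, the statistics could be computed as below. This simplified sketch (function names are ours) assumes fully directed graphs, whereas fGES actually outputs equivalence classes that may contain undirected edges; handling those requires additional bookkeeping we omit.

```python
def adjacency_stats(true_edges, est_edges):
    """AP and AR: compare the skeletons (unordered pairs) of the true
    and estimated graphs; assumes both edge lists are nonempty."""
    t = {frozenset(e) for e in true_edges}
    p = {frozenset(e) for e in est_edges}
    return len(t & p) / len(p), len(t & p) / len(t)

def arrowhead_stats(true_edges, est_edges):
    """AHP and AHR: compare directed edges; recall is restricted to true
    arrowheads whose adjacency was found, per the definition above."""
    t, p = set(true_edges), set(est_edges)
    found_adj = {frozenset(e) for e in est_edges}
    t_found = {e for e in t if frozenset(e) in found_adj}
    return len(t & p) / len(p), len(t_found & p) / len(t_found)
```

For example, with true edges 1→2, 2→3, 3→4 and estimated edges 1→2, 3→2, 4→5, two of three estimated adjacencies are correct, but only one of the estimated arrowheads matches.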
6.1 The conditional Gaussian approximation
Table 1 Compares the approximate method to the exact method for CG on graphs of average degree 2 and 4 with 100 measured variables

| Sample size | | 200 | | | | | 1000 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Statistic | | AP | AR | AHP | AHR | T (s) | AP | AR | AHP | AHR | T (s) |
| Avg Deg 2 | Exact | 0.56 | 0.56 | 0.37 | 0.31 | 1.03 | 0.79 | 0.79 | 0.65 | 0.55 | 2.94 |
| | Approx | 0.82 | 0.53 | 0.75 | 0.19 | 0.31 | 0.91 | 0.81 | 0.85 | 0.49 | 0.59 |
| Avg Deg 4 | Exact | 0.59 | 0.39 | 0.44 | 0.26 | 0.73 | 0.84 | 0.64 | 0.73 | 0.51 | 3.73 |
| | Approx | 0.82 | 0.36 | 0.69 | 0.23 | 0.17 | 0.92 | 0.62 | 0.84 | 0.51 | 0.99 |
Table 2 Compares the use of different priors for CG, MVP 1, and MVP \(\log n\) on graphs of average degree 2 with 100 measured variables

| Sample size | | 200 | | | | | 1000 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Statistic | | AP | AR | AHP | AHR | T (s) | AP | AR | AHP | AHR | T (s) |
| CG | Uniform | 0.54 | 0.59 | 0.40 | 0.38 | 0.29 | 0.76 | 0.83 | 0.66 | 0.59 | 0.63 |
| | EBIC | 0.85 | 0.45 | 0.80 | 0.12 | 0.26 | 0.93 | 0.78 | 0.88 | 0.43 | 0.53 |
| | Binomial | 0.82 | 0.53 | 0.75 | 0.19 | 0.18 | 0.91 | 0.81 | 0.85 | 0.49 | 0.55 |
| MVP 1 | Uniform | 0.36 | 0.57 | 0.24 | 0.35 | 9.32 | 0.70 | 0.81 | 0.56 | 0.57 | 6.43 |
| | EBIC | 0.84 | 0.39 | 0.69 | 0.17 | 1.35 | 0.85 | 0.71 | 0.74 | 0.46 | 3.53 |
| | Binomial | 0.53 | 0.53 | 0.35 | 0.31 | 1.53 | 0.83 | 0.77 | 0.70 | 0.52 | 3.93 |
| MVP \(\log n\) | Uniform | 0.37 | 0.55 | 0.23 | 0.31 | 7.51 | 0.77 | 0.79 | 0.60 | 0.50 | 14.47 |
| | EBIC | 0.84 | 0.31 | 0.65 | 0.09 | 2.50 | 0.87 | 0.65 | 0.73 | 0.37 | 7.25 |
| | Binomial | 0.52 | 0.51 | 0.33 | 0.28 | 2.51 | 0.84 | 0.76 | 0.68 | 0.47 | 8.54 |
Table 3 Compares the use of different priors for CG, MVP 1, and MVP \(\log n\) on graphs of average degree 4 with 100 measured variables

| Sample size | | 200 | | | | | 1000 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Statistic | | AP | AR | AHP | AHR | T (s) | AP | AR | AHP | AHR | T (s) |
| CG | Uniform | 0.66 | 0.41 | 0.52 | 0.31 | 0.23 | 0.87 | 0.66 | 0.79 | 0.57 | 1.09 |
| | EBIC | 0.85 | 0.30 | 0.70 | 0.16 | 0.15 | 0.94 | 0.57 | 0.86 | 0.43 | 0.92 |
| | Binomial | 0.82 | 0.36 | 0.69 | 0.23 | 0.16 | 0.92 | 0.62 | 0.84 | 0.51 | 0.85 |
| MVP 1 | Uniform | 0.45 | 0.42 | 0.34 | 0.30 | 16.85 | 0.85 | 0.66 | 0.77 | 0.56 | 7.24 |
| | EBIC | 0.84 | 0.27 | 0.69 | 0.16 | 0.98 | 0.92 | 0.54 | 0.84 | 0.43 | 4.29 |
| | Binomial | 0.53 | 0.36 | 0.39 | 0.25 | 6.24 | 0.90 | 0.61 | 0.83 | 0.51 | 4.90 |
| MVP \(\log n\) | Uniform | 0.44 | 0.37 | 0.30 | 0.23 | 20.70 | 0.89 | 0.62 | 0.78 | 0.49 | 16.74 |
| | EBIC | 0.85 | 0.18 | 0.64 | 0.08 | 2.06 | 0.94 | 0.46 | 0.84 | 0.34 | 9.25 |
| | Binomial | 0.52 | 0.33 | 0.36 | 0.20 | 6.50 | 0.93 | 0.58 | 0.84 | 0.46 | 11.55 |
Table 4 Compares the use of CG, CGd, MVP 1, and MVP \(\log n\) in the constraint-based paradigm with \(\alpha \) set to 0.001 on graphs of average degree 2 and 4, respectively, with 100 measured variables

| Sample size | | 200 | | | | | 1000 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Statistic | | AP | AR | AHP | AHR | T (s) | AP | AR | AHP | AHR | T (s) |
| Avg Deg 2 | CG | 0.91 | 0.40 | 0.96 | 0.04 | 0.57 | 0.93 | 0.68 | 0.93 | 0.24 | 1.26 |
| | CGd | 0.92 | 0.42 | 0.98 | 0.05 | 0.46 | 0.95 | 0.69 | 0.94 | 0.26 | 1.15 |
| | MVP 1 | 0.77 | 0.27 | 0.67 | 0.01 | 1.37 | 0.88 | 0.60 | 0.67 | 0.15 | 6.63 |
| | MVP \(\log n\) | 0.97 | 0.29 | 0.75 | 0.01 | 2.03 | 0.92 | 0.67 | 0.68 | 0.23 | 16.82 |
| Avg Deg 4 | CG | 0.93 | 0.26 | 0.93 | 0.05 | 0.44 | 0.94 | 0.50 | 0.93 | 0.24 | 4.47 |
| | CGd | 0.93 | 0.26 | 0.89 | 0.05 | 0.44 | 0.95 | 0.50 | 0.95 | 0.24 | 8.55 |
| | MVP 1 | 0.81 | 0.14 | 0.43 | 0.01 | 1.28 | 0.92 | 0.41 | 0.65 | 0.16 | 11.21 |
| | MVP \(\log n\) | 0.98 | 0.15 | 0.56 | 0.01 | 2.01 | 0.93 | 0.48 | 0.60 | 0.21 | 48.73 |
Table 5 Compares the use of CG, CGd, MVP 1, LR 1, MVP \(\log n\), LR \(\log n\), and MN using linear data from graphs of average degree 2 and 4, respectively, with 100 measured variables

| Sample size | | 200 | | | | | 1000 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Statistic | | AP | AR | AHP | AHR | T (s) | AP | AR | AHP | AHR | T (s) |
| Avg Deg 2 | CG | 0.82 | 0.53 | 0.75 | 0.19 | 0.29 | 0.91 | 0.81 | 0.85 | 0.49 | 0.56 |
| | CGd | 0.90 | 0.45 | 0.82 | 0.17 | 0.19 | 0.95 | 0.77 | 0.93 | 0.47 | 0.53 |
| | MVP 1 | 0.53 | 0.53 | 0.35 | 0.31 | 1.92 | 0.83 | 0.77 | 0.70 | 0.52 | 4.01 |
| | LR 1 | 0.53 | 0.53 | 0.36 | 0.32 | 18.55 | 0.84 | 0.78 | 0.71 | 0.51 | 52.34 |
| | MVP \(\log n\) | 0.52 | 0.51 | 0.33 | 0.28 | 2.63 | 0.84 | 0.76 | 0.68 | 0.47 | 8.55 |
| | LR \(\log n\) | 0.53 | 0.51 | 0.34 | 0.28 | 34.86 | 0.87 | 0.78 | 0.71 | 0.49 | 165.53 |
| | MN | 0.93 | 0.41 | 0.85 | 0.07 | 0.09 | 0.97 | 0.72 | 0.90 | 0.36 | 0.48 |
| Avg Deg 4 | CG | 0.82 | 0.36 | 0.69 | 0.23 | 0.16 | 0.92 | 0.62 | 0.84 | 0.51 | 0.90 |
| | CGd | 0.92 | 0.32 | 0.80 | 0.18 | 0.15 | 0.96 | 0.58 | 0.91 | 0.48 | 0.73 |
| | MVP 1 | 0.53 | 0.36 | 0.39 | 0.25 | 8.82 | 0.90 | 0.61 | 0.83 | 0.51 | 5.20 |
| | LR 1 | 0.53 | 0.36 | 0.39 | 0.25 | 27.22 | 0.91 | 0.63 | 0.83 | 0.52 | 62.08 |
| | MVP \(\log n\) | 0.53 | 0.33 | 0.36 | 0.20 | 6.25 | 0.93 | 0.58 | 0.84 | 0.46 | 12.00 |
| | LR \(\log n\) | 0.53 | 0.33 | 0.36 | 0.20 | 45.97 | 0.93 | 0.59 | 0.84 | 0.47 | 215.93 |
| | MN | 0.93 | 0.26 | 0.86 | 0.07 | 0.06 | 0.98 | 0.51 | 0.84 | 0.36 | 0.19 |
Table 6 Compares the use of CG, CGd, MVP 1, LR 1, MVP \(\log n\), LR \(\log n\), and MN using nonlinear data from graphs of average degree 2 and 4, respectively, with 100 measured variables

| Sample size | | 200 | | | | | 1000 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Statistic | | AP | AR | AHP | AHR | T (s) | AP | AR | AHP | AHR | T (s) |
| Avg Deg 2 | CG | 0.53 | 0.54 | 0.35 | 0.33 | 0.47 | 0.58 | 0.67 | 0.45 | 0.45 | 1.91 |
| | CGd | 0.82 | 0.52 | 0.61 | 0.23 | 0.23 | 0.83 | 0.68 | 0.67 | 0.39 | 1.03 |
| | MVP 1 | 0.67 | 0.55 | 0.45 | 0.32 | 0.82 | 0.76 | 0.69 | 0.58 | 0.41 | 3.37 |
| | LR 1 | 0.67 | 0.55 | 0.45 | 0.31 | 10.93 | 0.76 | 0.68 | 0.58 | 0.41 | 46.03 |
| | MVP \(\log n\) | 0.75 | 0.54 | 0.51 | 0.29 | 1.42 | 0.87 | 0.67 | 0.71 | 0.42 | 6.90 |
| | LR \(\log n\) | 0.75 | 0.54 | 0.51 | 0.29 | 22.07 | 0.87 | 0.67 | 0.71 | 0.42 | 158.09 |
| | MN | 0.95 | 0.49 | 0.96 | 0.05 | 0.11 | 0.96 | 0.65 | 0.83 | 0.31 | 0.24 |
| Avg Deg 4 | CG | 0.48 | 0.34 | 0.34 | 0.24 | 0.57 | 0.70 | 0.51 | 0.60 | 0.41 | 1.38 |
| | CGd | 0.81 | 0.32 | 0.64 | 0.19 | 0.33 | 0.86 | 0.51 | 0.77 | 0.39 | 1.00 |
| | MVP 1 | 0.71 | 0.35 | 0.53 | 0.23 | 1.03 | 0.83 | 0.52 | 0.73 | 0.41 | 3.82 |
| | LR 1 | 0.72 | 0.35 | 0.54 | 0.23 | 12.50 | 0.82 | 0.52 | 0.72 | 0.40 | 48.62 |
| | MVP \(\log n\) | 0.81 | 0.33 | 0.63 | 0.23 | 1.88 | 0.93 | 0.52 | 0.84 | 0.41 | 8.29 |
| | LR \(\log n\) | 0.81 | 0.34 | 0.63 | 0.23 | 35.04 | 0.93 | 0.53 | 0.84 | 0.42 | 191.39 |
| | MN | 0.95 | 0.27 | 0.85 | 0.05 | 0.06 | 0.95 | 0.44 | 0.75 | 0.29 | 0.20 |
Table 7 Compares the use of CG, CGd, MVP 1, MVP \(\log n\), and MN using linear data from graphs of average degree 2 and 4, respectively, with 500 measured variables

| Sample size | | 200 | | | | | 1000 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Statistic | | AP | AR | AHP | AHR | T (s) | AP | AR | AHP | AHR | T (s) |
| Avg Deg 2 | CG | 0.67 | 0.51 | 0.48 | 0.28 | 2.11 | 0.88 | 0.77 | 0.81 | 0.50 | 7.28 |
| | CGd | 0.86 | 0.44 | 0.75 | 0.19 | 1.74 | 0.94 | 0.71 | 0.89 | 0.46 | 8.42 |
| | MVP 1 | 0.40 | 0.49 | 0.24 | 0.27 | 56.83 | 0.81 | 0.73 | 0.67 | 0.51 | 44.68 |
| | MVP \(\log n\) | 0.40 | 0.46 | 0.23 | 0.24 | 81.20 | 0.84 | 0.72 | 0.68 | 0.46 | 96.91 |
| | MN | 0.93 | 0.39 | 0.84 | 0.07 | 1.79 | 0.97 | 0.67 | 0.89 | 0.36 | 14.74 |
| Avg Deg 4 | CG | 0.75 | 0.35 | 0.59 | 0.25 | 2.21 | 0.91 | 0.61 | 0.84 | 0.51 | 9.70 |
| | CGd | 0.88 | 0.31 | 0.78 | 0.19 | 2.22 | 0.95 | 0.58 | 0.90 | 0.48 | 16.59 |
| | MN | 0.93 | 0.26 | 0.77 | 0.07 | 1.32 | 0.98 | 0.51 | 0.86 | 0.37 | 12.43 |
6.2 Binomial structure prior
We tested the usefulness of the binomial structure prior by simulating 200 and 1000 samples from graphs of average degree 2 and 4 with 100 measured variables using fGES. We compare our scoring functions with and without the binomial structure prior. Additionally, we compare against extended BIC (EBIC). In these experiments, the binomial structure prior's parameter is set to 1 and EBIC's gamma parameter is set to 0.5 per the suggestion of the authors [4]. Tables 2 and 3 report findings when the average degrees of the graphs are 2 and 4, respectively.
While we set the binomial structure prior's parameter to 1 for the experiments presented in this paper, it is important to note that this parameter can be any value greater than 0. By varying the expected number of parents, we can influence how sparse or dense the output graph will be: a low value results in a relatively sparse graph and a high value in a denser one.
From Tables 2 and 3, we see that both the binomial structure prior and EBIC boost precision at the cost of some recall. Additionally, we see vast reductions in computation time. In general, EBIC seems to work better with small sample sizes, which makes sense since EBIC is aimed at the small-n-large-P situation. However, for 1000 samples, we find the binomial structure prior performs relatively well. We use the binomial structure prior for the remainder of our score-based experiments.
6.3 Conditional independence tests
We tested the usefulness of the CG and MVP scores as conditional independence tests by simulating 200 and 1000 samples from graphs of average degree 2 and 4 with 100 measured variables. As a search algorithm, we used CPC Stable [19], a modified version of PC [26] that treats ambiguous triples as noncolliders. For independence testing, we set the significance level \(\alpha = 0.001\). Here we also use the discretization heuristic with \(b = 3\) for CG, denoted CGd; however, we do not use a structure prior, since we are no longer scoring a full Bayesian network in this paradigm. We did not include results for a version of MVP which uses the discretization heuristic because it had little effect. The results are shown in Table 4.
In general, we find that our methods perform better as scores, but still perform reasonably well as conditional independence tests. This is promising for use in algorithms, such as FCI, that model the possibility of latent confounding [26].
6.4 Tests against baseline scores
We used two simple baseline scores as a point of comparison for our methods. The first, which we denote MN, uses multinomial scoring for all cases. In order to do so, we essentially extend the discretization heuristic to the continuous child case so that we are always scoring with a multinomial. The second, which we denote LR, uses partitioned linear regression in the continuous child case and partitioned logistic regression in the discrete child case. In our experiments, we applied LIBLINEAR [7], a widely used and efficient toolkit for logistic regression that uses truncated Newton optimization [9]. In a recent paper, Zaidi et al. [27] note that, among the many optimization methods that have been evaluated, the truncated Newton method has been shown to converge the fastest, which supports LIBLINEAR as a competitive, state-of-the-art baseline for our evaluation. As with MVP, the number appended to LR denotes the maximum polynomial degree of the regressors.
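For the continuous child case of the LR baseline, "partitioned" regression means splitting the rows by the joint configuration of the discrete parents and fitting a separate ordinary-least-squares model of the continuous parents within each partition. The following is a minimal sketch under those assumed conventions; the variance floor and function name are illustrative choices, not details of our implementation.

```python
import numpy as np

def partitioned_linear_ll(y, cont_parents, disc_parents):
    """Gaussian log likelihood of a continuous child under partitioned OLS.

    Rows are partitioned by the joint configuration of the discrete
    parents; within each partition, y is regressed on the continuous
    parents and the maximized Gaussian log likelihood is accumulated.
    """
    ll = 0.0
    for cfg in np.unique(disc_parents, axis=0):
        mask = np.all(disc_parents == cfg, axis=1)
        yk = y[mask]
        # Design matrix: intercept column plus the continuous parents.
        Xk = np.column_stack([np.ones(mask.sum()), cont_parents[mask]])
        beta, *_ = np.linalg.lstsq(Xk, yk, rcond=None)
        resid = yk - Xk @ beta
        sigma2 = max(resid @ resid / len(yk), 1e-12)  # MLE variance, floored
        # Maximized Gaussian log likelihood: -n/2 * (log(2*pi*sigma2) + 1).
        ll += -0.5 * len(yk) * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return ll
```

The discrete child case is analogous, with a logistic (softmax) model fit per partition in place of OLS.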
We compared CG, CGd, MVP 1, LR 1, MVP \(\log n\), LR \(\log n\), and MN by simulating 200 and 1000 samples from graphs of average degree 2 and 4 with 100 measured variables. As a search algorithm, we again used fGES. Here we also use the discretization heuristic with \(b = 3\) for CGd and the binomial structure prior set to 1 for all scores. Additionally, boldface text highlights the best-performing score for each statistic for each group in the table. The results are shown in Tables 5 and 6. For the results in Table 6, we extended our method for simulating data. Since MVP is designed to handle nonlinearity, while CG is not, we modified the continuous child phase of data generation to allow for nonlinearities. To do so, we additionally generate second- and third-order polynomial terms. However, because of the nature of these nonlinear functions, the values of the data often become unmanageably large. To correct for this issue, we resample a variable with square-root and cube-root relationships if the values are too large. Appendix B contains details about how the data were simulated and the parameters used.
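The modified continuous-child phase can be sketched as follows. This is a hedged illustration only: the coefficient ranges, the overflow cutoff, and the exact resampling scheme are stand-in choices, not the parameters of Appendix B.

```python
import numpy as np

def simulate_continuous_child(parents, rng, max_abs=1e4):
    """Generate a continuous child as a cubic polynomial of its parents.

    Illustrative sketch: each parent contributes first-, second-, and
    third-order terms with random coefficients, plus Gaussian noise.
    If the values become unmanageably large, the child is resampled
    using square-root and cube-root relationships instead.
    """
    n, k = parents.shape
    signs = rng.choice([-1.0, 1.0], size=(k, 3))
    coefs = rng.uniform(0.5, 1.5, size=(k, 3)) * signs
    child = rng.normal(size=n)  # additive noise term
    for j in range(k):
        p = parents[:, j]
        child = child + coefs[j, 0] * p + coefs[j, 1] * p**2 + coefs[j, 2] * p**3
    if np.max(np.abs(child)) > max_abs:
        # Values blew up: regenerate with root relationships, which are
        # bounded much more tightly for large parent values.
        child = rng.normal(size=n)
        for j in range(k):
            p = parents[:, j]
            child = child + coefs[j, 0] * np.sign(p) * np.abs(p) ** 0.5
            child = child + coefs[j, 1] * np.sign(p) * np.abs(p) ** (1.0 / 3.0)
    return child
```

Because the root relationships grow sublinearly, resampling this way keeps downstream variables from compounding the blow-up as the generation proceeds through the DAG in topological order.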
Table 5 shows the results when using linearly generated data and 100 variables. As a general pattern, MN had better precision than the CG methods which had better precision than the MVP and LR methods. For recall, just the opposite pattern tended to occur. In terms of timing, in general, MN was faster than the CG methods, which were faster than the MVP methods, which were considerably faster than LR.
Table 6 shows the results when using nonlinearly generated data and 100 variables. MN tended to have a higher precision than the MVP and LR methods, which often had higher precision than the CG methods. The relatively good performance of MN is surprising; although multinomial distributions can represent nonlinear relationships, the process of discretizing continuous variables loses information; the manner in which we generated the data (see the beginning of Sect. 6) when there is a discrete child and continuous parents may play a role in producing this result. The relatively better precision performance of the MVP methods compared to CG methods is not surprising, given that MVP can model nonlinear relationships and CG cannot. In terms of recall, MVP and CG performed comparably, while both performed better than MN. The relative timing results in Table 6 are similar to those in Table 5.
In Tables 5 and 6, there is almost no difference in precision and recall performance between MVP and LR. This result is understandable, since MVP uses an approximation to logistic regression in the case of a discrete child with continuous parents and handles all other cases identically. However, MVP is often ten or more times faster than LR.
Table 7 shows the results of assessing the scalability of the methods. We simulated linear data on 500 variables. For average degree 4, no MVP results are shown because our machine ran out of memory while searching. Also, LR is not included at all in Table 7, because LR (as implemented) cannot scale to networks of this size due to time complexity. Table 7 shows that the CG methods had similar precision to MN, which generally had better precision than MVP. For the results shown, the recall of the MVP and CG methods were similar, which were generally better than the recall for MN. MN and the CG methods had similar timing results, which were faster than those of MVP.
In Table 7, we see that the CG and MVP methods are capable of scaling to graphs containing 500 measured variables, albeit sparse ones. CG was able to scale to a slightly denser graph of 500 variables. In general, we see the same performance on these larger networks as before on the networks of 100 measured variables. Additionally, for the smaller sample size of 200, MN performed comparably to CGd, but with a slightly higher precision and lower recall.
7 Conclusions
This paper introduces two novel scoring methods for learning Bayesian networks in the presence of both continuous and discrete variables. One of the methods scales to networks of 500 variables or more on a laptop. We introduce a structure prior for learning large networks and find that using a structure prior with BIC generally leads to relatively good network discovery performance, while requiring considerably less computation time. We showed how the CG and MVP scoring methods are readily adapted as conditional independence tests for constraint-based methods to support future use in algorithms such as FCI.
The MVP and LR methods had precision and recall results that were almost identical; however, MVP was considerably faster than LR. Such a speed difference is particularly important when performing Bayesian network learning, where the scoring method must be applied thousands of times in the course of learning a network. Using a different implementation of LR might affect the magnitude of this speed difference, but for the reasons we give in Sect. 4.3, we would not expect it to lead to LR becoming faster than MVP.
The fully discrete approach, MN, performed surprisingly well in our experiments in terms of precision and speed, although recall was often lower, and sometimes much lower, than that of CG and MVP.
The results of the experiments reported here support using CG when recall is a priority and the relationships are linear. If the relationships are likely to be nonlinear and recall remains a priority, then we suggest using MVP when there are 100 or fewer variables and using CG when there are 500 variables or more. If precision is a priority, then our results support using MN.
All algorithms and simulations reported here were implemented in the Tetrad system [22], and the code is available in the Tetrad repository on GitHub.^{1}
There are several directions for future work. First, we would like to apply the methods to real datasets for which knowledge of the causal relationships is available. Second, we would like to expand the CG and MVP methods to model ordinal discrete variables. Although the nominal discrete variables that these methods currently model can represent ordinal variables, we would expect the methods to have greater power when they take advantage of knowledge about particular discrete variables being ordinal versus nominal. Third, we would like to further explore how to adaptively discretize variables in the MN method in order to improve its recall, while not substantially reducing its precision. Fourth, we would like to investigate alternative basis functions to polynomials for the MVP method.
Acknowledgements
We thank Clark Glymour, Peter Spirtes, Takis Benos, Dimitrios Manatakis, and Vineet Raghu for helpful discussions about the topics in this paper. We also thank the reviewers for their helpful comments.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
References
- 1. Anderson, T., Taylor, J.B.: Strong consistency of least squares estimates in normal linear regression. Ann. Stat., pp. 788–790 (1976)
- 2. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)
- 3. Bøttcher, S.G.: Learning Bayesian networks with mixed variables. Ph.D. thesis, Aalborg University (2004)
- 4. Chen, J., Chen, Z.: Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp. 555–574 (2012)
- 5. Chickering, D.M.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2002)
- 6. Daly, R., Shen, Q., Aitken, S.: Review: learning Bayesian networks: approaches and issues. Knowl. Eng. Rev. 26(2), 99–157 (2011)
- 7. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
- 8. Heckerman, D., Geiger, D.: Learning Bayesian networks: a unification for discrete and Gaussian domains. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 274–284. Morgan Kaufmann Publishers Inc. (1995)
- 9. Hsia, C.Y., Zhu, Y., Lin, C.J.: A study on trust region update rules in Newton methods for large-scale linear classification. In: Asian Conference on Machine Learning, pp. 33–48 (2017)
- 10. Huang, T., Peng, H., Zhang, K.: Model selection for Gaussian mixture models. Statistica Sinica 27(1), 147–169 (2017)
- 11. Jeffreys, H., Jeffreys, B.: Weierstrass's theorem on approximation by polynomials. Methods of Mathematical Physics, pp. 446–448 (1988)
- 12. Peters, J., Janzing, D., Schölkopf, B.: Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, Cambridge (2017)
- 13. McGeachie, M.J., Chang, H.H., Weiss, S.T.: CGBayesNets: conditional Gaussian Bayesian network learning and inference with mixed discrete and continuous data. PLoS Comput. Biol. 10(6), e1003676 (2014)
- 14. Meek, C.: Complete orientation rules for patterns (1995)
- 15. Monti, S., Cooper, G.F.: A multivariate discretization method for learning Bayesian networks from mixed data. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 404–413. Morgan Kaufmann Publishers Inc. (1998)
- 16. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Burlington (1988)
- 17. Raftery, A.E.: Bayesian model selection in social research. Sociol. Methodol., pp. 111–163 (1995)
- 18. Ramsey, J., Glymour, M., Sanchez-Romero, R., Glymour, C.: A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. Int. J. Data Sci. Anal., pp. 1–9 (2016)
- 19. Ramsey, J., Zhang, J., Spirtes, P.: Adjacency-faithfulness and conservative causal inference. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 401–408. AUAI Press, Arlington, Virginia (2006)
- 20. Ramsey, J.D., Malinsky, D.: Comparing the performance of graphical structure learning algorithms with Tetrad. arXiv preprint arXiv:1607.08110 (2016)
- 21. Romero, V., Rumí, R., Salmerón, A.: Learning hybrid Bayesian networks using mixtures of truncated exponentials. Int. J. Approx. Reason. 42(1–2), 54–68 (2006)
- 22. Scheines, R., Spirtes, P., Glymour, C., Meek, C., Richardson, T.: The TETRAD project: constraint based aids to causal model specification. Multivar. Behav. Res. 33(1), 65–117 (1998)
- 23. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
- 24. Sedgewick, A.J., Shi, I., Donovan, R.M., Benos, P.V.: Learning mixed graphical models with separate sparsity parameters and stability-based model selection. BMC Bioinform. 17(Suppl 5), 175 (2016)
- 25. Sokolova, E., Groot, P., Claassen, T., Heskes, T.: Causal discovery from databases with discrete and continuous variables. In: European Workshop on Probabilistic Graphical Models, pp. 442–457. Springer (2014)
- 26. Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cambridge (2000)
- 27. Zaidi, N.A., Webb, G.I.: A fast trust-region Newton method for softmax logistic regression. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 705–713. SIAM (2017)