Dirichlet Bayesian network scores and the maximum relative entropy principle
Abstract
A classic approach for learning Bayesian networks from data is to identify a maximum a posteriori (MAP) network structure. In the case of discrete Bayesian networks, MAP networks are selected by maximising one of several possible Bayesian–Dirichlet (BD) scores; the most famous is the Bayesian–Dirichlet equivalent uniform (BDeu) score from Heckerman et al. (Mach Learn 20(3):197–243, 1995). The key properties of BDeu arise from its uniform prior over the parameters of each local distribution in the network, which makes structure learning computationally efficient; it does not require the elicitation of prior knowledge from experts; and it satisfies score equivalence. In this paper we will review the derivation and the properties of BD scores, and of BDeu in particular, and we will link them to the corresponding entropy estimates to study them from an information theoretic perspective. To this end, we will work in the context of the foundational work of Giffin and Caticha (Proceedings of the 27th international workshop on Bayesian inference and maximum entropy methods in science and engineering, pp 74–84, 2007), who showed that Bayesian inference can be framed as a particular case of the maximum relative entropy principle. We will use this connection to show that BDeu should not be used for structure learning from sparse data, since it violates the maximum relative entropy principle; and that it is also problematic from a more classic Bayesian model selection perspective, because it produces Bayes factors that are sensitive to the value of its only hyperparameter. Using a large simulation study, we found in our previous work [Scutari in J Mach Learn Res (Proc Track PGM 2016) 52:438–448, 2016] that the Bayesian–Dirichlet sparse (BDs) score seems to provide better accuracy in structure learning; in this paper we further show that BDs does not suffer from the issues above, and we recommend using it for sparse data instead of BDeu.
Finally, we will show that these issues are in fact different aspects of the same problem and a consequence of the distributional assumptions of the prior.
Keywords
Bayesian networks; Structure learning; Bayesian posterior estimation; Maximum relative entropy principle; Discrete data
1 Introduction and background
Structure learning can be implemented in several ways, based on many results from probability, information and optimisation theory; algorithms for this task can be broadly grouped into constraint-based, score-based and hybrid.
Score-based algorithms are closer to model selection techniques developed in classic statistics and information theory. Each candidate network is assigned a score reflecting its goodness of fit, which is then taken as an objective function to maximise. This is often done using heuristic optimisation algorithms, from local search to genetic algorithms (Russell and Norvig 2009); but the availability of computational resources and advances in learning algorithms have recently made exact learning possible (Cussens 2012). Common choices for the network score include the Bayesian Information Criterion (BIC) and the marginal likelihood \({P}({\mathcal {G}} \mid {\mathcal {D}})\) itself; for an overview see again Scutari and Denis (2014). We will cover both in more detail for discrete BNs in Sect. 2.
Hybrid algorithms use both statistical tests and score functions, combining the previous two families of algorithms. Their general formulation is described for BNs in Friedman et al. (1999); they have proved to be some of the top performers to date (see for instance MMHC in Tsamardinos et al. 2006).
As for parameter learning, the parameters \(\varTheta _{X_i}\) can be estimated independently for each node following (1) since its parents are assumed to be known from structure learning. Both maximum likelihood and Bayesian posterior estimators are in common use, with the latter being preferred due to their smoothness and superior predictive power (Koller and Friedman 2009).
In this paper we focus on score-based structure learning in a Bayesian framework, in which we aim to identify a maximum a posteriori (MAP) DAG \({\mathcal {G}}\) that directly maximises \({P}({\mathcal {G}} \mid {\mathcal {D}})\). For discrete BNs, this means maximising a Bayesian–Dirichlet (BD) marginal likelihood: the most common choice is the Bayesian–Dirichlet equivalent uniform (BDeu) score from Heckerman et al. (1995). We will show that the uniform prior distribution over each \(\varTheta _{X_i}\) that underlies BDeu can be problematic from a Bayesian perspective, resulting in wildly different Bayes factors (and thus structure learning outcomes) depending on the value of its only hyperparameter, the imaginary sample size. We will further investigate this problem from an information theoretic perspective, on the grounds that Bayesian posterior inference can be framed as a particular case of the maximum relative entropy principle (ME; Shore and Johnson 1980; Skilling 1988; Caticha 2004). We find that BDeu is not a reliable network score when applied to sparse data because it can select overly complex networks over simpler ones given the same information in the prior and in the data; and that in the process it violates the maximum relative entropy principle. That does not appear to be the case for other BD scores, which arise from different priors.
The paper is organised as follows: In Sect. 2 we will review Bayesian score-based structure learning and BD scores. In Sect. 3 we will focus on BDeu, covering its underlying assumptions and issues reported in the literature. In particular, we will show with simple examples that BDeu can produce Bayes factors that are sensitive to the choice of its only hyperparameter, the imaginary sample size. In Sect. 4 we will derive the posterior expected entropy associated with a DAG \({\mathcal {G}}\), which we will further explore in Sect. 5. Finally, in Sect. 6 we will analyse BDeu using ME, and we will compare its behaviour with that of other BD scores.
2 Bayesian–Dirichlet marginal likelihoods

\(r_i\) is the number of states of \(X_i\);

\(q_i\) is the number of possible configurations of values of \(\varPi _{X_i}^{{\mathcal {G}}}\), taken to be equal to 1 if \(X_i\) has no parents;

\(n_{ij} = \sum _{k = 1}^{r_i} n_{ijk}\);

\(\alpha _{ij}= \sum _{k = 1}^{r_i} \alpha _{ijk}\);

and \(\varvec{\alpha }= \{\alpha _1, \ldots , \alpha _N\}\), \(\alpha _i = \sum _{j = 1}^{q_i} \alpha _{ij}\) are the imaginary sample sizes associated with each \(X_i\).
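The display that these definitions annotate is not reproduced in this text. As an aid to the reader, and assuming the standard form of the BD marginal likelihood from Heckerman et al. (1995) with the counts \(n_{ijk}\) and hyperparameters \(\alpha _{ijk}\) defined above, the score can be written as:

```latex
\[
\mathrm{BD}({\mathcal {G}}, {\mathcal {D}}; \varvec{\alpha })
  = \prod _{i = 1}^{N} \prod _{j = 1}^{q_i}
    \left[
      \frac{\Gamma (\alpha _{ij})}{\Gamma (\alpha _{ij} + n_{ij})}
      \prod _{k = 1}^{r_i}
      \frac{\Gamma (\alpha _{ijk} + n_{ijk})}{\Gamma (\alpha _{ijk})}
    \right]
\]
```

Each factor in the outer products depends on a single node and a single configuration of its parents, which is what makes the score decomposable.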

for \(\alpha _{ijk}= 1\) we obtain the K2 score from Cooper and Herskovits (1991);

for \(\alpha _{ijk}= {1}/{2}\) we obtain the BD score with Jeffreys' prior (BDJ; Suzuki 2016);

for \(\alpha _{ijk}= \alpha / (r_i q_i)\) we obtain the BDeu score from Heckerman et al. (1995), which is the most common choice in the BD family and has \(\alpha _i = \alpha\) for all \(X_i\);

for \(\alpha _{ijk}= \alpha / (r_i \tilde{q}_i)\), where \(\tilde{q}_i\) is the number of configurations of \(\varPi _{X_i}^{{\mathcal {G}}}\) such that \(n_{ij} > 0\), we obtain the BD sparse (BDs) score recently proposed in Scutari (2016);

for the set \(\alpha _{ijk}^s = s / (r_i q_i)\), \(s \in S_L = \{2^{-L}, 2^{-L+1}, \ldots , 2^{L-1}, 2^{L}\}\), \(L \in {\mathbb {N}}\) we obtain the locally averaged BD score (BDla) from Cano et al. (2013).
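The choices of \(\alpha _{ijk}\) listed above can be made concrete with a minimal Python sketch. It assumes the standard Gamma-function form of the BD marginal likelihood for a single node, and that BDs assigns no prior mass to unobserved parent configurations; the function names `bd_hyperparameters` and `log_bd_local` and the counts table are hypothetical, introduced only for illustration. BDla is omitted because it averages over several BDeu scores.

```python
from math import lgamma


def bd_hyperparameters(counts, score="bdeu", alpha=1.0):
    """Return the alpha_ijk of one node as a q_i x r_i table.

    counts[j][k] holds n_ijk: rows are parent configurations,
    columns are the states of the node.
    """
    q_i, r_i = len(counts), len(counts[0])
    if score == "k2":
        return [[1.0] * r_i for _ in counts]
    if score == "bdj":
        return [[0.5] * r_i for _ in counts]
    if score == "bdeu":
        return [[alpha / (r_i * q_i)] * r_i for _ in counts]
    if score == "bds":
        # q_tilde counts only the parent configurations with n_ij > 0.
        q_tilde = sum(1 for row in counts if sum(row) > 0)
        return [
            [alpha / (r_i * q_tilde) if sum(row) > 0 else 0.0] * r_i
            for row in counts
        ]
    raise ValueError(f"unknown score: {score}")


def log_bd_local(counts, alphas):
    """Log BD marginal likelihood of one node given its parents."""
    total = 0.0
    for n_row, a_row in zip(counts, alphas):
        a_ij, n_ij = sum(a_row), sum(n_row)
        if a_ij == 0.0:
            continue  # configuration without prior mass (BDs, unobserved)
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        total += sum(lgamma(a + n) - lgamma(a) for n, a in zip(n_row, a_row))
    return total


if __name__ == "__main__":
    # Hypothetical counts: binary node, two parent configurations,
    # the second of which is never observed.
    counts = [[3, 1], [0, 0]]
    for name in ("k2", "bdj", "bdeu", "bds"):
        alphas = bd_hyperparameters(counts, name, alpha=1.0)
        print(name, alphas, round(log_bd_local(counts, alphas), 4))
```

Working on the log scale with `lgamma` avoids overflow in the Gamma functions for realistic sample sizes.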
3 BDeu and Bayesian model selection
The uniform prior associated with BDeu has been justified by the lack of prior knowledge on the \(\varTheta _{X_i}\), as well as its computational simplicity and score equivalence; and it was widely assumed to be noninformative (e.g. Silander et al. 2007; Heckerman et al. 1995).
Finally, Suzuki (2016) studied the asymptotic properties of BDeu by contrasting it with BDJ. He found that BDeu is not regular in the sense that it may learn DAGs in a way that is not consistent with either the MDL principle (through BIC) or the ranking of those DAGs given by their entropy. Whether this happens depends on the values of the underlying \(\pi _{ik \mid j}\), even if the positivity assumption holds and if n is large. This agrees with the observations in Ueno (2010), who also observed that BDeu is not necessarily consistent for any finite n, but only asymptotically for \(n \rightarrow \infty\).
Example 1
The Bayes factors for BDeu and BDs are shown for \(\alpha \in [10^{-4}, 10^{4}]\) in the left panel of Fig. 2. The former converges to 1 for both \(\alpha _{ijk}\rightarrow 0\) and \(\alpha _{ijk}\rightarrow \infty\), but varies between 1 and 2.5 for finite \(\alpha\); whereas the latter is equal to 1 for all values of \(\alpha\), never showing a preference for either \({\mathcal {G}}^{-}\) or \({\mathcal {G}}^{+}\). The Bayes factor for BDeu neither diverges nor converges to zero, which is consistent with (7) from Steck and Jaakkola (2003) as \(d_{\mathrm {EP}}^{({\mathcal {G}}^{+})} - d_{\mathrm {EP}}^{({\mathcal {G}}^{-})} = 0 - 0 = 0\). However, it varies most quickly for \(\alpha \in [1, 10]\), exactly the range of the most common values used in practical applications. This provides further evidence supporting the conclusions of Steck and Jaakkola (2003), Steck (2008) and Silander et al. (2007).
Example 2
The Bayes factors for BDeu and BDs are shown in the right panel of Fig. 2. BDeu results in wildly different values depending on the choice of \(\alpha\), with Bayes factors that vary between 0.05 and 1 for small changes of \(\alpha \in [1, 10]\); BDs always gives a Bayes factor of 1. Again \(d_{\mathrm {EP}}^{({\mathcal {G}}^{+})} - d_{\mathrm {EP}}^{({\mathcal {G}}^{-})} = 4 - 4 = 0\), which agrees with the fact that the Bayes factor for BDeu does not diverge or converge to zero; and \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) have the same BIC score, so BDeu (but not BDs) violates the MDL principle in this example as well. \(\square\)
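The sensitivity discussed in these examples can be reproduced qualitatively with a short Python sketch. The counts below are a hypothetical toy data set, not the data behind Examples 1 and 2 or Fig. 2: a binary node whose two parent configurations in the simpler DAG are split into four by an extra parent in the denser DAG, two of which are never observed. The sketch assumes a uniform prior over DAGs and a decomposable score, so that the Bayes factor reduces to the ratio of the two local scores of that node; the names `log_bd_local` and `bayes_factor` are illustrative.

```python
from math import exp, lgamma


def log_bd_local(counts, alpha, sparse):
    """Log BD local score with BDeu (sparse=False) or BDs (sparse=True)."""
    q_i, r_i = len(counts), len(counts[0])
    observed = [row for row in counts if sum(row) > 0]
    if not observed:
        return 0.0  # no data for this node: marginal likelihood is 1
    # BDeu spreads alpha over all q_i configurations, BDs only over the
    # observed ones.
    a = alpha / (r_i * (len(observed) if sparse else q_i))
    total = 0.0
    for row in observed:  # unobserved rows contribute log(1) = 0
        total += lgamma(r_i * a) - lgamma(r_i * a + sum(row))
        total += sum(lgamma(a + n) - lgamma(a) for n in row)
    return total


def bayes_factor(counts_plus, counts_minus, alpha, sparse):
    """Score of the denser DAG divided by the score of the simpler DAG."""
    return exp(
        log_bd_local(counts_plus, alpha, sparse)
        - log_bd_local(counts_minus, alpha, sparse)
    )


# Hypothetical toy counts (rows: parent configurations, columns: states).
COUNTS_MINUS = [[2, 1], [1, 2]]
COUNTS_PLUS = [[2, 1], [0, 0], [0, 0], [1, 2]]

if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0, 100.0):
        bdeu = bayes_factor(COUNTS_PLUS, COUNTS_MINUS, alpha, sparse=False)
        bds = bayes_factor(COUNTS_PLUS, COUNTS_MINUS, alpha, sparse=True)
        print(f"alpha={alpha:>6}: BDeu={bdeu:.4f}  BDs={bds:.4f}")
```

In this toy setting both DAGs see exactly the same observed counts, so BDs returns a Bayes factor of 1 for every \(\alpha\), while the BDeu value moves with \(\alpha\) because the unobserved configurations still absorb part of the imaginary sample size.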
4 Bayesian structure learning and entropy
5 The posterior marginal entropy
The posterior expectation of the entropy for a given \(\text{Dirichlet}(\alpha _{ijk})\) prior in (11), despite having a form that looks very different from the marginal posterior entropy in (10), can be written in terms of the latter as we show in the following lemma.
Lemma 1
Proof of Lemma 1
Therefore, \({E}({H}^{{\mathcal {G}}}(X_i) \mid {\mathcal {D}}; \alpha _{ijk})\) is well approximated by the marginal posterior entropy \({H}^{{\mathcal {G}}}(X_i \mid {\mathcal {D}}, \alpha _{ijk})\) from (10) plus a bias term that depends on the augmented counts \(\alpha _{ij}+ n_{ij}\) for the \(q_i\) configurations of \(\varPi _{X_i}^{{\mathcal {G}}}\). A similar result was derived in Miller (1955) for the empirical entropy estimator and is the basis of the Miller–Madow entropy estimator.
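The size of this bias term can be checked numerically for a single parent configuration. The Python sketch below does not reproduce Lemma 1; it assumes the standard closed form for the expected Shannon entropy under a Dirichlet distribution with parameters \(a_k = \alpha _{ijk} + n_{ijk}\), namely \(\psi (A + 1) - \sum _k (a_k / A) \psi (a_k + 1)\) with \(A = \sum _k a_k\), and compares it with the entropy of the posterior mean. The helper names and the hand-rolled `digamma` (recurrence plus asymptotic series, to avoid a dependency outside the standard library) are illustrative choices. Entropies are in nats.

```python
from math import log


def digamma(x):
    """Digamma function for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + log(x) - 0.5 / x - series


def plug_in_entropy(a):
    """Entropy of the posterior mean, with a[k] = alpha_ijk + n_ijk."""
    total = sum(a)
    return -sum((x / total) * log(x / total) for x in a if x > 0)


def expected_entropy(a):
    """Posterior expectation of the entropy under Dirichlet(a)."""
    total = sum(a)
    return digamma(total + 1) - sum((x / total) * digamma(x + 1) for x in a)


if __name__ == "__main__":
    # Hypothetical augmented counts for one parent configuration.
    for a in ([1.0, 1.0], [5.0, 5.0], [50.0, 50.0]):
        gap = expected_entropy(a) - plug_in_entropy(a)
        first_order = -(len(a) - 1) / (2 * sum(a))
        print(a, round(gap, 5), round(first_order, 5))
```

Expanding the digamma terms to first order gives a gap of about \(-(r_i - 1) / (2 (\alpha _{ij} + n_{ij}))\), which has the same shape as the Miller–Madow correction and shrinks as the augmented counts grow.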
6 BDeu and the principle of maximum entropy
Example 1
Example 2
Based on these results and the examples above, we state the following theorem.
Theorem 1
Using BDeu and the associated uniform prior over the parameters of a BN for structure learning violates the maximum relative entropy principle if any candidate parent configuration of any node is not observed in the data.
Example 1
A side effect of not violating ME is that the choice between \({\mathcal {G}}^{-}\) and \({\mathcal {G}}^{+}\) is no longer sensitive to the value of \(\alpha\); we can see from the left panels of Figs. 3 and 4 that both the difference between \({E}({H}^{{\mathcal {G}}^{-}}(X) \mid {\mathcal {D}}, \frac{1}{8})\) and \({E}({H}^{{\mathcal {G}}^{+}}(X) \mid {\mathcal {D}}, \frac{1}{8})\) and the difference between \({E}({H}^{{\mathcal {G}}^{-}}(X) \mid {\mathcal {D}})\) and \({E}({H}^{{\mathcal {G}}^{+}}(X) \mid {\mathcal {D}})\) are equal to zero for all \(\alpha\). \(\square\)
Example 2
It is easy to show that the theorem we just stated does not apply to K2 or BDJ, since under their priors \(\alpha _{ijk}\) is not a function of \(q_i\); but it does apply to BDla since its formulation is essentially a mixture of BDeu scores.
7 Conclusions and discussion
Bayesian network learning follows an inherently Bayesian workflow in which we first learn the structure of the DAG \({\mathcal {G}}\) from a data set \({\mathcal {D}}\), and then we learn the values of the parameters \(\varTheta _{X_i}\) given \({\mathcal {G}}\). In this paper we studied the properties of the Bayesian posterior scores used to estimate \({P}({\mathcal {G}} \mid {\mathcal {D}})\) and to learn the \({\mathcal {G}}\) that best fits the data. For discrete Bayesian networks, these scores are Bayesian–Dirichlet (BD) marginal likelihoods that assume different Dirichlet priors for the \(\varTheta _{X_i}\) and, in the most general formulation, a hyperprior over the hyperparameters \(\alpha _{ijk}\) of the prior. We focused on the most common BD score, BDeu, which assumes a uniform prior over each \(\varTheta _{X_i}\); and we studied the impact of that prior on structure learning from a Bayesian and an information theoretic perspective. After deriving the form of the posterior expected entropy for \({\mathcal {G}}\) given \({\mathcal {D}}\), we found that BDeu may select models in a way that violates the maximum relative entropy principle. Furthermore, we showed that it produces Bayes factors that are very sensitive to the choice of the imaginary sample size. Both issues are related to the uniform prior assumed by BDeu for the \(\varTheta _{X_i}\), and can lead to the selection of overly dense DAGs when the data are sparse. In contrast, the BDs score proposed in Scutari (2016) does not, even though it converges to BDeu asymptotically; and neither do other BD scores in the literature. In the simulation study we performed in Scutari (2016), we found that BDs leads to more accurate structure learning; hence we recommend its use over BDeu for sparse data.
Notes
Compliance with ethical standards
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD (2010a) Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: algorithms and empirical evaluation. J Mach Learn Res 11:171–234
Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD (2010b) Local causal and Markov blanket induction for causal discovery and feature selection for classification part II: analysis and extensions. J Mach Learn Res 11:235–284
Anderson G, Qiu S (1997) A monotonicity property of the gamma function. Proc Am Math Soc 125(11):3355–3362
Archer E, Park IM, Pillow JW (2014) Bayesian entropy estimation for countable discrete distributions. J Mach Learn Res 15:2833–2868
Berger JO (1985) Statistical decision theory and Bayesian analysis, 2nd edn. Springer, Berlin
Cano A, Gómez-Olmedo M, Masegosa AR, Moral S (2013) Locally averaged Bayesian Dirichlet metrics for learning the structure and the parameters of Bayesian networks. Int J Approx Reason 54(4):526–540
Castelo R, Siebes A (2000) Priors on network structures. Biasing the search for Bayesian networks. Int J Approx Reason 24(1):39–57
Caticha A (2004) Relative entropy and inductive inference. In: Bayesian inference and maximum entropy methods in science and engineering. AIP, New York, pp 75–96
Chickering DM (1995) A transformational characterization of equivalent Bayesian network structures. In: Proceedings of the 11th conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Francisco, pp 87–98
Chickering DM, Heckerman D (2000) A comparison of scientific and engineering criteria for Bayesian model selection. Stat Comput 10:55–62
Cooper GF, Herskovits E (1991) A Bayesian method for constructing Bayesian belief networks from databases. In: Proceedings of the 7th conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Francisco, pp 86–94
Cussens J (2012) Bayesian network learning with cutting planes. In: Proceedings of the 27th conference on uncertainty in artificial intelligence. AUAI Press, pp 153–160
Dawid AP (1984) Present position and potential developments: some personal views: statistical theory: the prequential approach. J R Stat Soc Ser A 147(2):278–292
Feder M (1986) Maximum entropy as a special case of the minimum description length criterion. IEEE Trans Inf Theory 32(6):847–849
Friedman N, Nachman I, Pe'er D (1999) Learning Bayesian network structure from massive datasets: the “sparse candidate” algorithm. In: Proceedings of the 15th conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Francisco, pp 206–221
Geiger D, Heckerman D (1994) Learning Gaussian networks. In: Proceedings of the 10th conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Francisco, pp 235–243
Giffin A, Caticha A (2007) Updating probabilities with data and moments. In: Proceedings of the 27th international workshop on Bayesian inference and maximum entropy methods in science and engineering. AIP, New York, pp 74–84
Hausser J, Strimmer K (2009) Entropy inference and the James–Stein estimator, with application to nonlinear gene association networks. J Mach Learn Res 10:1469–1484
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20(3):197–243 (available as Technical Report MSR-TR-94-09)
Johnson NL, Kotz S, Balakrishnan N (1997) Discrete multivariate distributions. Wiley, New York
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge
Koski TJT, Noble JM (2012) A review of Bayesian networks and structure learning. Math Appl 40(1):53–103
Lam W, Bacchus F (1994) Learning Bayesian belief networks: an approach based on the MDL principle. Comput Intell 10:269–293
Lauritzen SL, Wermuth N (1989) Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann Stat 17(1):31–57
Mackay DJC (2003) Information theory, inference and learning algorithms. Cambridge University Press, Cambridge
Miller GA (1955) Note on the bias of information estimates. In: Information theory in psychology IIB. Free Press, Glencoe, Illinois, New York, pp 95–100
Mukherjee S, Speed TP (2008) Network inference using informative priors. Proc Natl Acad Sci 105(38):14313–14318
Nemenman I, Shafee F, Bialek W (2002) Entropy and inference, revisited. In: Proceedings of the 14th advances in neural information processing systems (NIPS) conference. MIT Press, Cambridge, Massachusetts, pp 471–478
Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, Burlington
Pearl J (2009) Causality: models, reasoning and inference, 2nd edn. Cambridge University Press, Cambridge
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–568
Rissanen J (2007) Information and complexity in statistical models. Springer, Berlin
Russell SJ, Norvig P (2009) Artificial intelligence: a modern approach, 3rd edn. Prentice Hall, Upper Saddle River
Scutari M (2016) An empirical-Bayes score for discrete Bayesian networks. J Mach Learn Res (Proc Track PGM 2016) 52:438–448
Scutari M, Denis JB (2014) Bayesian networks with examples in R. Chapman & Hall, London
Shore JE, Johnson RW (1980) Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans Inf Theory IT–26(1):26–37
Silander T, Kontkanen P, Myllymäki P (2007) On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In: Proceedings of the 23rd conference on uncertainty in artificial intelligence. AUAI Press, pp 360–367
Skilling J (1988) The axioms of maximum entropy. In: Maximum-entropy and Bayesian methods in science and engineering. Springer, Berlin, pp 173–187
Steck H (2008) Learning the Bayesian network structure: Dirichlet prior versus data. In: Proceedings of the 24th conference on uncertainty in artificial intelligence. AUAI Press, pp 511–518
Steck H, Jaakkola TS (2003) On the Dirichlet prior and Bayesian regularization. Adv Neural Inf Process Syst 15:713–720
Suzuki J (2015) Consistency of learning Bayesian network structures with continuous variables: an information theoretic approach. Entropy 17:5752–5770
Suzuki J (2016) A theoretical analysis of the BDeu scores in Bayesian network structure learning. Behaviormetrika 44:97–116
Tsamardinos I, Brown LE, Aliferis CF (2006) The max-min hill-climbing Bayesian network structure learning algorithm. Mach Learn 65(1):31–78
Ueno M (2010) Learning networks determined by the ratio of prior and data. In: Proceedings of the 26th conference on uncertainty in artificial intelligence. AUAI Press, pp 598–605
Ueno M (2011) Robust learning of Bayesian networks for prior belief. In: Proceedings of the 27th conference on uncertainty in artificial intelligence. AUAI Press, pp 698–707
Weatherburn CE (1961) A first course in mathematical statistics. Cambridge University Press, Cambridge
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.