Bayesian network classifiers using ensembles and smoothing

Abstract

Bayesian network classifiers are, functionally, an interesting class of models because they can be learnt out-of-core, i.e. without needing to hold the whole training data in main memory. The selective K-dependence Bayesian network classifier (SKDB) is the state of the art in this class of models and has been shown to rival random forest (RF) on problems with categorical data. In this paper, we introduce an ensembling technique for SKDB, called ensemble of SKDB (ESKDB). We show that ESKDB significantly outperforms RF on categorical and numerical data, and rivals XGBoost. ESKDB combines three main components: (1) an effective strategy to vary the networks that are built by the individual classifiers (to make it an ensemble), (2) a stochastic discretization method that both handles numerical data and further increases the variance between the components of the ensemble, and (3) a superior smoothing technique to ensure proper calibration of ESKDB’s probabilities. We conduct a large set of experiments with 72 datasets to study the properties of ESKDB (through a sensitivity analysis) and show its competitiveness with the state of the art.
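To make components (2) and (3) concrete, the following is a minimal illustrative sketch and not the ESKDB implementation. It makes several simplifying assumptions: each ensemble member is a plain naive Bayes classifier standing in for SKDB, so the network structure is not varied as in component (1); the stochastic discretization is approximated by randomly jittered quantile cut points; and simple additive (Laplace) smoothing stands in for the smoothing technique used in the paper. All function names and parameters (`random_cut_points`, `fit_member`, `alpha`, and so on) are hypothetical.

```python
import numpy as np

def random_cut_points(x, n_bins, rng):
    """Return n_bins - 1 cut points at randomly jittered quantiles of x."""
    base = np.arange(1, n_bins) / n_bins
    # Jitter each quantile level by at most half a bin width, so the levels
    # stay ordered and inside (0, 1) while differing between members.
    jitter = rng.uniform(-0.5, 0.5, size=n_bins - 1) / n_bins
    return np.quantile(x, base + jitter)

def fit_member(X, y, n_classes, n_bins, alpha, rng):
    """Fit one naive Bayes member (an assumed stand-in for SKDB)."""
    n_features = X.shape[1]
    cuts = [random_cut_points(X[:, j], n_bins, rng) for j in range(n_features)]
    class_counts = np.bincount(y, minlength=n_classes).astype(float)
    prior = (class_counts + alpha) / (class_counts.sum() + alpha * n_classes)
    # cond[j, c, b] starts at alpha (additive smoothing) and accumulates the
    # count of training rows with class c whose feature j falls in bin b.
    cond = np.full((n_features, n_classes, n_bins), alpha, dtype=float)
    for j in range(n_features):
        bins = np.digitize(X[:, j], cuts[j])
        np.add.at(cond[j], (y, bins), 1.0)
    cond /= cond.sum(axis=2, keepdims=True)
    return cuts, prior, cond

def predict_member(member, X):
    """Class probabilities from a single member, computed in log space."""
    cuts, prior, cond = member
    log_p = np.tile(np.log(prior), (X.shape[0], 1))
    for j in range(X.shape[1]):
        bins = np.digitize(X[:, j], cuts[j])
        log_p += np.log(cond[j][:, bins]).T
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def predict_ensemble(members, X):
    """Average the members' probability estimates."""
    return np.mean([predict_member(m, X) for m in members], axis=0)

# Toy data: two well-separated Gaussian classes with two numerical features.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)

members = [fit_member(X, y, n_classes=2, n_bins=5, alpha=1.0, rng=rng)
           for _ in range(10)]
proba = predict_ensemble(members, X)
accuracy = (proba.argmax(axis=1) == y).mean()
```

Because every member draws its own cut points, the members disagree slightly even though they share the same training data, and averaging their probabilities is what turns them into an ensemble. The additive smoothing keeps every estimated probability strictly positive, which is a far cruder device than the calibration-oriented smoothing the abstract refers to.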

Notes

  1. The more common representation \(\mathrm{Dir}(\alpha_1, \ldots, \alpha_C)\) is not used here.

  2. https://github.com/icesky0125/ESKDB-on-numerical-data

Author information

Corresponding author

Correspondence to He Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was partially supported by the China Scholarship Council under Award 201506300081 and by the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (Projects DP190100017 and DE170100037).

About this article

Cite this article

Zhang, H., Petitjean, F. & Buntine, W. Bayesian network classifiers using ensembles and smoothing. Knowl Inf Syst 62, 3457–3480 (2020). https://doi.org/10.1007/s10115-020-01458-z

Keywords

  • Bayesian network classifier
  • Ensemble learning
  • Probability smoothing
  • Hierarchical Dirichlet process
  • Attribute discretization