Bayesian Network Structure Learning from Big Data: A Reservoir Sampling Based Ensemble Method

  • Yan Tang
  • Zhuoming Xu
  • Yuanhang Zhuang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9645)


Bayesian network (BN) learning from big datasets is potentially more valuable than learning from conventional small datasets, because big data contain more comprehensive probability distributions and richer causal relationships. However, learning BNs from big datasets is computationally expensive and prone to failure, especially when the learning task is performed on a conventional computation platform. This paper addresses BN structure learning from a big dataset on a conventional computation platform and proposes a reservoir sampling based ensemble method (RSEM). In RSEM, a greedy algorithm first determines an appropriate size for the sub-datasets to be extracted from the big dataset. A fast reservoir sampling method then extracts the sub-datasets efficiently in one pass. Lastly, a weighted adjacency matrix based ensemble method combines the structures learned from the sub-datasets into the final BN structure. Experimental results on both synthetic and real-world big datasets show that RSEM can perform BN structure learning accurately and efficiently.
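
The two mechanical steps named in the abstract, one-pass sampling and a weighted vote over adjacency matrices, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sampler is the basic reservoir algorithm (Vitter's Algorithm R), not the faster variant the paper adopts. The vote assumes one weight per learned structure and keeps an edge when its weighted support exceeds a fixed fraction of the total weight. The function names, the `threshold` parameter, and the weighting scheme are all hypothetical.

```python
import random
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Draw up to k items uniformly at random from `stream` in a single pass.

    Basic Algorithm R: the first k items fill the reservoir; the item at
    0-based position i >= k replaces a random slot with probability k / (i + 1).
    """
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def weighted_vote(
    adjacency: Sequence[Sequence[Sequence[int]]],
    weights: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[List[int]]:
    """Combine 0/1 adjacency matrices into one by weighted voting.

    Edge a -> b is kept when the summed weight of the matrices containing it
    exceeds `threshold` times the total weight. This sketch does not check
    that the combined graph is acyclic.
    """
    n = len(adjacency[0])
    total = sum(weights)
    combined = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            support = sum(w * m[a][b] for m, w in zip(adjacency, weights))
            if support > threshold * total:
                combined[a][b] = 1
    return combined
```

In use, each sub-dataset would come from a call such as `reservoir_sample(rows, k, random.Random(seed))`, a structure learner of choice would turn each sub-dataset into an adjacency matrix, and `weighted_vote` would merge the matrices. How the weights and the sub-dataset size are chosen is specific to the paper and is not reproduced here.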


Bayesian network structure learning; Reservoir sampling; Ensemble method; Probabilistic approximation; Big data



This work was supported by the Natural Science Foundation of Jiangsu Province, China (Grant No. BK20141420 and Grant No. BK20140857) and the “Six Talent Peaks Program” of Jiangsu Province, China (Grant No. 2008135).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. College of Computer and Information, Hohai University, Nanjing, China
