Naive Bayes and Decision Tree Classifier for Streaming Data Using HBase

  • Aradhita Mukherjee (email author)
  • Sudip Mondal
  • Nabendu Chaki
  • Sunirmal Khatua
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 897)


Classifying streaming data in a real-time environment is among the most challenging current research areas. Data streaming arises in real-time environments where a massive volume of data is generated in small chunks that must be processed very quickly. HBase is well suited to storing such massive collections of small, heterogeneous data files while preserving scalability and availability. In a real-time environment the data volume grows exponentially, so storing the auto-incremented data requires dynamic splitting, which HBase supports. We chose a data set of tobacco-affected student records and observed that the Naive Bayes classifier is less complex and more accurate than the decision tree. In a real-time environment, Naive Bayes also remains effective relative to the other classifiers when the training sample is very large, with HBase handling the storage of that sample. The key-value store in HBase gives the classifiers an additional edge by improving their performance in terms of time.
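
To illustrate why a key-value store fits Naive Bayes on streamed chunks, the following is a minimal sketch and not the authors' implementation. It assumes categorical features and keeps only counters, so each arriving chunk updates the model without revisiting earlier data. The in-memory dictionary stands in for an HBase table, and the class name, feature names, and labels are hypothetical.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Categorical Naive Bayes whose only state is a set of counters."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # Laplace smoothing constant
        self.counts = defaultdict(int)      # assumption: stand-in for a key-value table
        self.labels = set()                 # class labels seen so far
        self.values = defaultdict(set)      # feature name -> values seen so far

    def update(self, chunk):
        """Fold one chunk of (features, label) records into the counters."""
        for features, label in chunk:
            self.labels.add(label)
            self.counts[("label", label)] += 1
            for name, value in features.items():
                self.values[name].add(value)
                self.counts[("feat", label, name, value)] += 1

    def predict(self, features):
        """Return the label with the highest smoothed log-posterior."""
        total = sum(self.counts[("label", lab)] for lab in self.labels)
        best_label, best_score = None, -math.inf
        for label in sorted(self.labels):
            n_label = self.counts[("label", label)]
            score = math.log(n_label / total)
            for name, value in features.items():
                hits = self.counts.get(("feat", label, name, value), 0)
                # Count the queried value too, so unseen values stay well defined.
                n_values = len(self.values.get(name, set()) | {value})
                score += math.log(
                    (hits + self.alpha) / (n_label + self.alpha * n_values)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical records; the field names are illustrative only.
model = StreamingNaiveBayes()
model.update([
    ({"family_use": "yes", "peer_use": "yes"}, "affected"),
    ({"family_use": "no", "peer_use": "no"}, "unaffected"),
])
model.update([
    ({"family_use": "yes", "peer_use": "no"}, "affected"),
    ({"family_use": "no", "peer_use": "yes"}, "unaffected"),
])
prediction = model.predict({"family_use": "yes", "peer_use": "yes"})
```

Because every counter is addressed by a composite key, the same updates could in principle be issued as increments against rows of a key-value table; how the paper maps its model onto HBase is not stated in this abstract.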


Big data, HBase, Naive Bayes classifier, Real-time classification, Data streaming, Scalability



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Aradhita Mukherjee (1, email author)
  • Sudip Mondal (1)
  • Nabendu Chaki (1)
  • Sunirmal Khatua (1)
  1. Department of Computer Science & Engineering, University of Calcutta, Kolkata, India
