Journal of Central South University, Volume 26, Issue 1, pp 1–12

Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark

  • Peng Liu (刘鹏)
  • Hui-han Zhao (赵慧含)
  • Jia-yu Teng (滕家雨)
  • Yan-yan Yang (仰彦妍)
  • Ya-feng Liu (刘亚峰)
  • Zong-wei Zhu (朱宗卫) (corresponding author)

Abstract

The sharp increase in the amount of Chinese text data on the Internet has significantly prolonged the time needed to classify it. To address this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification on Spark, a parallel in-memory computing platform for big data. The algorithm parallelizes the entire training and prediction process of the naive Bayes classifier, mainly by adopting the resilient distributed dataset (RDD) programming model. For comparison, a PNBA based on Hadoop is also implemented. The test results show that, in the same computing environment and on the same text sets, the Spark PNBA clearly outperforms the Hadoop PNBA on key indicators such as speedup ratio and scalability. Spark-based parallel algorithms can therefore better meet the requirements of large-scale Chinese text data mining.
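The idea of parallelizing naive Bayes training can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library in place of Spark, and it assumes documents are already segmented into tokens, a multinomial model, and Laplace smoothing, none of which the abstract specifies. The names `train_partition`, `merge`, `train`, and `predict` are hypothetical. Each partition is counted independently (the role that an RDD operation such as `mapPartitions` would play), and the partial counts are then summed (the role of `reduce`).

```python
import math
from collections import Counter
from functools import reduce


def train_partition(docs):
    """Map step: count documents per class and tokens per class in one partition."""
    class_counts, token_counts = Counter(), Counter()
    for label, tokens in docs:
        class_counts[label] += 1
        for tok in tokens:
            token_counts[(label, tok)] += 1
    return class_counts, token_counts


def merge(a, b):
    """Reduce step: add the partial counts of two partitions."""
    return a[0] + b[0], a[1] + b[1]


def train(partitions):
    """Combine per-partition counts into one model (assumed multinomial naive Bayes)."""
    class_counts, token_counts = reduce(merge, map(train_partition, partitions))
    vocab = {tok for (_, tok) in token_counts}
    n_docs = sum(class_counts.values())
    totals = Counter()  # total number of tokens seen in each class
    for (label, _), n in token_counts.items():
        totals[label] += n
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    return log_prior, token_counts, totals, vocab


def predict(model, tokens):
    """Return the class with the highest log-posterior, using Laplace smoothing."""
    log_prior, token_counts, totals, vocab = model

    def score(c):
        denom = totals[c] + len(vocab)
        return log_prior[c] + sum(
            math.log((token_counts[(c, t)] + 1) / denom)
            for t in tokens
            if t in vocab  # tokens never seen in training are ignored
        )

    return max(log_prior, key=score)


# Toy data: two partitions of (label, pre-segmented tokens) pairs.
partitions = [
    [("sports", ["足球", "比赛"]), ("finance", ["股票", "市场"])],
    [("sports", ["比赛", "球队"]), ("finance", ["市场", "银行"])],
]
model = train(partitions)
```

Because the counts are simple sums, the merged model does not depend on how the documents are split across partitions, which is what makes the training step straightforward to distribute. Prediction over a large test set could be distributed in the same way, since each document is scored independently once the model is available to every worker.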

Key words

Chinese text classification; naive Bayes; Spark; Hadoop; resilient distributed dataset; parallelization

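A parallel naive Bayes algorithm on Spark for large-scale Chinese text classification

Abstract (translated from the Chinese)

To address the problem that the surge in Chinese text data on the Internet significantly prolongs the time needed to classify it, a parallel naive Bayes algorithm for Chinese text classification is proposed and implemented on Spark, an in-memory computing model. The algorithm mainly uses the resilient distributed dataset programming model to parallelize the entire training and prediction processes of the naive Bayes classifier. For comparison, a parallel naive Bayes version based on Hadoop MapReduce is also implemented. Experimental results show that, in the same computing environment and for Chinese text sets of the same size, the Spark-based parallel algorithm is clearly superior to the Hadoop-based implementation on key indicators such as speedup ratio and scalability, and can therefore better meet the requirements of large-scale Chinese text data mining.

Key words (translated from the Chinese)

Chinese text classification; naive Bayes; Spark; Hadoop; resilient distributed dataset; parallelization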

Copyright information

© Central South University Press and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Peng Liu (刘鹏) 1, 2
  • Hui-han Zhao (赵慧含) 3
  • Jia-yu Teng (滕家雨) 4
  • Yan-yan Yang (仰彦妍) 3
  • Ya-feng Liu (刘亚峰) 1, 2
  • Zong-wei Zhu (朱宗卫) 5 (corresponding author)

  1. Internet of Things Perception Mine Research Centre, China University of Mining and Technology, Xuzhou, China
  2. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, Xuzhou, China
  3. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
  4. Communication Division, NARI Technology Co., Ltd., Nanjing, China
  5. Suzhou Institute of University of Science and Technology of China, Suzhou, China
