Large Scale Text Clustering Method Study Based on MapReduce

  • Zhanquan Sun
  • Feng Li
  • Yanling Zhao
  • Lifeng Song
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9377)


Text clustering is an important research topic in data mining. Many text clustering methods have been proposed and have achieved satisfactory results. The Information Bottleneck algorithm, which is based on information loss, can measure complicated relationships between variables. It is regarded as one of the most informative text clustering methods and has been widely applied in practice. With the development of information technology, the scale of text data keeps growing. The classical Information Bottleneck clustering method cannot process large-scale datasets because of its high computational cost. To address the large-scale text clustering problem, a novel clustering method based on MapReduce is proposed. In this method, the dataset is divided into sub-datasets that are distributed to different computational nodes, and each node processes only its own sub-dataset, which markedly reduces the computational cost. The efficiency of the method is illustrated with a practical text clustering problem.
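The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes the agglomerative form of Information Bottleneck clustering, in which the information loss of merging two clusters is their combined weight times the weighted Jensen-Shannon divergence between their word distributions. The names `map_step` and `reduce_step`, the toy data, and the choice to re-cluster the pooled local clusters in the reduce phase are illustrative assumptions, not the authors' implementation and not the API of any MapReduce framework.

```python
import math
from itertools import combinations

# A cluster is a pair (weight, word_distribution); weights must be positive.

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; zero-probability terms of p are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(w1, p1, w2, p2):
    """Assumed information loss of a merge: (w1 + w2) * weighted Jensen-Shannon divergence."""
    total = w1 + w2
    a, b = w1 / total, w2 / total
    mix = [a * x + b * y for x, y in zip(p1, p2)]
    return total * (a * kl(p1, mix) + b * kl(p2, mix))

def merge(c1, c2):
    """Merge two clusters into one with summed weight and weight-averaged distribution."""
    (w1, p1), (w2, p2) = c1, c2
    total = w1 + w2
    return (total, [(w1 * x + w2 * y) / total for x, y in zip(p1, p2)])

def agglomerate(clusters, k):
    """Greedily merge the pair with the smallest information loss until k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(*clusters[ij[0]], *clusters[ij[1]]),
        )
        merged = merge(clusters[i], clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters

def map_step(partition, k):
    """Hypothetical map phase: one node clusters only its own sub-dataset."""
    return agglomerate(partition, k)

def reduce_step(partial_results, k):
    """Hypothetical reduce phase: pool the local clusters and cluster them again."""
    pooled = [c for part in partial_results for c in part]
    return agglomerate(pooled, k)

# Toy data: six documents over a three-word vocabulary, split into two sub-datasets.
partitions = [
    [(1/6, [0.9, 0.1, 0.0]), (1/6, [0.8, 0.2, 0.0]), (1/6, [0.0, 0.1, 0.9])],
    [(1/6, [0.85, 0.15, 0.0]), (1/6, [0.0, 0.2, 0.8]), (1/6, [0.1, 0.1, 0.8])],
]
partial = [map_step(part, k=2) for part in partitions]  # would run on separate nodes
result = reduce_step(partial, k=2)
for weight, dist in result:
    print(round(weight, 3), [round(x, 3) for x in dist])
```

The greedy search compares every pair of clusters at each merge, so its cost grows quickly with the number of items. In this sketch each node searches only the pairs within its own partition, and the reduce phase sees only a few clusters per node, which is one way the reduction in computational cost described above could arise.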


Text clustering, Large Scale, MapReduce, Information Bottleneck, Feature selection





Copyright information

© Springer International Publishing Switzerland 2015

Open Access. This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License, which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Zhanquan Sun (1)
  • Feng Li (2)
  • Yanling Zhao (1)
  • Lifeng Song (3)
  1. Shandong Provincial Key Laboratory of Computer Networks, Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan, China
  2. Department of History, College of Liberal Arts, Shanghai University, Shanghai, China
  3. Information Communication Section, Shandong Provincial Public Security Bureau, Jinan, China
