Large Scale Text Clustering Method Study Based on MapReduce
Text clustering is an important research topic in data mining. Many text clustering methods have been proposed and have achieved satisfactory results. The Information Bottleneck algorithm, which is based on information loss, can measure complicated relationships between variables. It is regarded as one of the most informative text clustering methods and has been widely applied in practice. With the development of information technology, text collections keep growing in scale. The classical Information Bottleneck based clustering method cannot process large-scale datasets because of its expensive computational cost. To deal with the large-scale text clustering problem, a novel clustering method based on MapReduce is proposed. In this method, the dataset is divided into sub-datasets that are distributed to different computational nodes. Each computational node processes only its own sub-dataset, so the computational cost is reduced markedly. The efficiency of the method is illustrated with a practical text clustering problem.
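The partition-then-cluster idea described above can be sketched roughly as follows. This is a minimal single-process illustration, not the method of the paper: the abstract does not say how the partial results from the nodes are combined, so the reduce step here (pooling the local clusters and merging them again) is an assumption. The merge cost is the standard agglomerative Information Bottleneck cost, the combined cluster weight times the weighted Jensen-Shannon divergence between the clusters' word distributions. All function names (`js_merge_cost`, `local_cluster`, `cluster_partitioned`) are hypothetical.

```python
import math
from itertools import combinations


def js_merge_cost(p_i, p_j, dist_i, dist_j):
    """Information lost by merging two clusters (agglomerative IB merge cost)."""
    total = p_i + p_j
    w_i, w_j = p_i / total, p_j / total
    mix = [w_i * a + w_j * b for a, b in zip(dist_i, dist_j)]

    def kl(p, q):
        # Terms with p == 0 contribute nothing; q > 0 wherever p > 0.
        return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

    return total * (w_i * kl(dist_i, mix) + w_j * kl(dist_j, mix))


def merge(a, b):
    """Merge two clusters given as (weight, word distribution, member ids)."""
    w = a[0] + b[0]
    dist = [(a[0] * x + b[0] * y) / w for x, y in zip(a[1], b[1])]
    return (w, dist, a[2] + b[2])


def local_cluster(clusters, k):
    """Greedily merge the cheapest pair until at most k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: js_merge_cost(
                clusters[ij[0]][0], clusters[ij[1]][0],
                clusters[ij[0]][1], clusters[ij[1]][1],
            ),
        )
        merged = merge(clusters[i], clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


def cluster_partitioned(docs, n_parts, k):
    """docs: list of non-empty word-count vectors of equal length."""
    n = len(docs)
    singletons = [
        (1.0 / n, [x / sum(d) for x in d], [idx]) for idx, d in enumerate(docs)
    ]
    # Split into sub-datasets; on a cluster each part would go to one node.
    parts = [singletons[i::n_parts] for i in range(n_parts)]
    # "Map": every part is clustered independently of the others.
    local = [local_cluster(part, k) for part in parts]
    # "Reduce" (assumed): pool the local clusters and merge them down to k.
    pooled = [c for part in local for c in part]
    return local_cluster(pooled, k)
```

Because each part only compares pairs within its own sub-dataset, the quadratic pair search runs over far fewer items per node than it would over the full collection, which is where the cost reduction in this sketch comes from.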
Keywords: Text clustering, Large scale, MapReduce, Information Bottleneck, Feature selection
Open Access: This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.