Cluster Computing, Volume 18, Issue 2, pp 933–948

A parallel text document clustering algorithm based on neighbors

  • Yanjun Li
  • Congnan Luo
  • Soon M. Chung


In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbors (Guha et al. in Inf Syst 25(5):345–366, 2000). If two documents are similar enough, they are considered neighbors of each other. The new algorithm is named parallel k-means based on neighbors (PKBN), and it is a parallel version of the sequential k-means based on neighbors (SKBN) algorithm that we proposed in Luo et al. (Data Knowl Eng 68(11):1271–1288, 2009). PKBN fully exploits the data parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix. This pair-generating method incurs less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems, and we implemented it on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is very efficient and scales well with respect to both the number of processors and the size of the data set.
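To make the neighbor concept concrete, the following is a minimal sequential sketch of building a neighbor matrix. It is not the PKBN pair-generating method, which the abstract does not detail. It assumes cosine similarity as the similarity measure, a user-chosen threshold `theta`, documents stored as sparse term-weight dictionaries, and the convention that every document is its own neighbor; the function names `cosine` and `neighbor_matrix` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbor_matrix(docs, theta):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 when documents i and j
    are neighbors, i.e. their similarity is at least theta (assumed rule)."""
    n = len(docs)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1  # assumed convention: a document is its own neighbor
        # The matrix is symmetric, so only pairs with i < j are evaluated.
        for j in range(i + 1, n):
            if cosine(docs[i], docs[j]) >= theta:
                m[i][j] = m[j][i] = 1
    return m


# Toy data (hypothetical): two overlapping documents and one unrelated one.
docs = [
    {"cluster": 1.0, "parallel": 1.0},
    {"cluster": 1.0, "parallel": 1.0, "text": 1.0},
    {"matrix": 1.0},
]
M = neighbor_matrix(docs, theta=0.5)
```

The double loop enumerates n(n-1)/2 document pairs, and this pair generation is the work that a parallel version has to divide among processors.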


Keywords: Document clustering; Text mining; k-Means; Parallel algorithm; Cluster computing; Performance analysis


References

  1. Aboutabl, A.E., Elsayed, M.N.: A novel parallel algorithm for clustering documents based on the hierarchical agglomerative approach. Int. J. Comput. Sci. Inf. Technol. (IJCSIT) 3(2), 152–163 (2011)
  2. Bobda, C., Steenbock, N.: Singular value decomposition on distributed reconfigurable systems. In: Proceedings of the 12th International Workshop on Rapid System Prototyping, pp. 38–43 (2001)
  3. Brent, R.P., Luk, F.T.: The solution of singular-value and symmetric eigen-value problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6, 69–84 (1985)
  4. Cao, Z., Zhou, Y.: Parallel text clustering based on MapReduce. In: Proceedings of the 2nd International Conference on Cloud and Green Computing, pp. 226–229 (2012)
  5. Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329 (1992)
  6. Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Large-Scale Parallel Data Mining, LNCS, vol. 1759, pp. 245–260. Springer, Heidelberg (2000)
  7. Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl. 2(2), 34–38 (2000)
  8. Garey, M.R., Johnson, D.S., Witsenhausen, H.S.: Complexity of the generalized Lloyd–Max problem. IEEE Trans. Inf. Theory 28(2), 256–257 (1982)
  9. Garg, A., Mangla, A., Gupta, N., Bhatnagar, V.: PBIRCH: a scalable parallel clustering algorithm for incremental data. In: Proceedings of the International Database Engineering and Applications Symposium, pp. 315–316 (2006)
  10. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)
  11. Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)
  12. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
  13. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
  14. Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22 (1999)
  15. Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). Information about RCV1. Accessed 20 Oct 2014
  16. Li, X., Fang, Z.: Parallel clustering algorithms. Parallel Comput. 11(3), 275–290 (1989)
  17. Li, Y., Chung, S.M.: Parallel bisecting K-means with prediction clustering algorithm. J. Supercomput. 39(1), 19–37 (2007)
  18. Li, Y., Chung, S.M., Holt, J.D.: Text document clustering based on frequent word meaning sequences. Data Knowl. Eng. 64(1), 381–404 (2008)
  19. Li, Y., Luo, C., Chung, S.M.: Text clustering with feature selection by using statistical data. IEEE Trans. Knowl. Data Eng. 20(5), 641–652 (2008)
  20. Liu, G., Wang, Y., Zhao, T., Li, D.: Research on the parallel text clustering algorithm based on the semantic tree. In: Proceedings of the 6th International Conference on Computer Sciences and Convergence Information Technology, pp. 400–403 (2011)
  21. Luo, C., Li, Y., Chung, S.M.: Text document clustering based on neighbors. Data Knowl. Eng. 68(11), 1271–1288 (2009)
  22. Mogill, J.A., Haglin, D.J.: Toward parallel document clustering. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS) Workshops and PhD Forum, pp. 1700–1709 (2011)
  23. Moore, A.W.: The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 397–405 (2000)
  24. Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21, 1313–1325 (1995)
  25. Ordonez, C., Omiecinski, E.: Efficient disk-based K-means clustering for relational databases. IEEE Trans. Knowl. Data Eng. 16(8), 909–921 (2004)
  26. Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. Parallel Distrib. Syst. 2(2), 129–137 (1991)
  27. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)
  28. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. KDD Workshop on Text Mining (2000)
  29. Zhang, Y., Sun, J., Zhang, Y., Zhang, X.: Parallel implementation of CLARANS using PVM. In: Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1646–1649 (2004)
  30. Zhao, Y., Karypis, G.: Comparison of agglomerative and partitional document clustering algorithms. Technical Report #TR 02-014, Department of Computer Science, University of Minnesota, Minneapolis (2002)

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Computer and Information Science, Fordham University, Bronx, USA
  2. Teradata Corporation, San Diego, USA
  3. Department of Computer Science and Engineering, Wright State University, Dayton, USA
