Survey on Clustering Algorithms for Unstructured Data

  • R. S. M. Lakshmi Patibandla
  • N. Veeranjaneyulu
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 695)


In modern applications, clustering algorithms have emerged as a learning aid for generating and analyzing huge volumes of data. The foremost objective of clustering is to group data of the same type into the same cluster, where similarity is judged according to precise metrics. For many applications, clustering is one of the techniques used to classify and analyze large amounts of data. However, applying clustering algorithms to big data raises issues that cause uncertainty among practitioners: there is no consensus on the definition of the algorithms' properties, and a proper classification of them is lacking. In this paper, we study various existing clustering methods that are suitable for large, semi-structured, and unstructured data, and how the same algorithms can be applied in a distributed environment such as Hadoop.
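As a rough illustration of the clustering objective described above (placing similar items in the same cluster according to a metric), the following is a minimal k-means sketch in plain Python. It is not taken from the paper: the function names `assign`, `recompute`, and `kmeans`, the Euclidean metric, the sample points, and the starting centroids are all assumptions made for the example. The split into an assignment step and an averaging step loosely mirrors how k-means is often expressed as map and reduce phases on Hadoop, but no Hadoop API is used here.

```python
from collections import defaultdict
from math import dist


def assign(point, centroids):
    """Map-like step: return (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def recompute(points):
    """Reduce-like step: average the points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/recompute rounds and return the centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            idx, value = assign(p, centroids)
            groups[idx].append(value)
        # A centroid with no assigned points keeps its previous position.
        centroids = [
            recompute(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids


# Hypothetical sample data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In a distributed setting, the assignment step would typically run independently on each data partition, and the averaging step would combine the per-cluster results. The sketch runs both in a single process for clarity.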


Clustering, Data algorithms, Semi-structured, Unstructured



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • R. S. M. Lakshmi Patibandla (1)
  • N. Veeranjaneyulu (2)
  1. Department of CSE, Vignan’s Foundation for Science, Technology & Research, Vadlamudi, India
  2. Department of IT, VFSTR University, Vadlamudi, India
