A Survey of Machine Learning Methods for Big Data

  • Zoila Ruiz
  • Jaime Salvador
  • Jose Garcia-Rodriguez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10338)


Studies in many fields now aim to extract relevant information on trends, challenges and opportunities, and they share one feature: they work with large volumes of data. This work analyzes studies on the use of Machine Learning (ML) for processing large volumes of data (Big Data). Most of these datasets are complex and come from various sources, with structured or unstructured data. It is therefore necessary to find mechanisms that classify the data and, to some extent, organize them so that users can more easily extract the information they need. Processing these data requires classification techniques, which are also reviewed.


Keywords: Big Data, Machine learning, Classification, Clustering



This work has been funded by the Spanish Government TIN2016-76515-R grant for the COMBAHO project, supported with FEDER funds.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Zoila Ruiz (1)
  • Jaime Salvador (1)
  • Jose Garcia-Rodriguez (2), corresponding author
  1. Universidad Central Del Ecuador, Ciudadela Universitaria, Quito, Ecuador
  2. Universidad de Alicante, Alicante, Spain
