Big Data Infrastructure: A Survey

  • Jaime Salvador
  • Zoila Ruiz
  • Jose Garcia-Rodriguez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10338)


In recent years, the volume of information has grown faster than ever before, moving from small datasets to huge volumes of data. This growth has forced researchers to look for new ways to process and store data, since traditional techniques are limited by the size and structure of the information. At the same time, the parallel computing power of processors has gradually increased, from single-processor architectures to multiple processors, cores, and threads. This has enabled machine learning techniques to exploit the parallel processing capabilities of new architectures on large volumes of data. This paper reviews works that apply parallel machine learning approaches to large volumes of data and proposes a classification of them based on the hardware infrastructure they use.


Keywords: Machine learning, Big data, Hadoop, MapReduce, GPU



This work has been funded by the Spanish Government grant TIN2016-76515-R for the COMBAHO project, supported with FEDER funds.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Jaime Salvador (1)
  • Zoila Ruiz (1)
  • Jose Garcia-Rodriguez (2)

  1. Universidad Central del Ecuador, Ciudadela Universitaria, Quito, Ecuador
  2. Universidad de Alicante, Alicante, Spain
