Big Data: Technologies and Tools

  • Julián Luengo
  • Diego García-Gil
  • Sergio Ramírez-Gallego
  • Salvador García
  • Francisco Herrera


The fast-evolving Big Data landscape has given rise to a myriad of tools, paradigms, and techniques for tackling different use cases in industry and science. Precisely because of this abundance, practitioners and experts often find it difficult to analyze the options and select the right tool for their problem. In this chapter we present an introductory overview of the Big Data ecosystem, with the aim of giving algorithm designers the knowledge they need to develop scalable and efficient machine learning solutions. We begin by discussing the common technical concepts, paradigms, and technologies that underpin frameworks such as Hadoop and Spark. We then analyze in depth the most popular Big Data frameworks and their main components, and discuss other novel platforms for high-speed streaming processing that are gaining importance in industry. Finally, we compare two of the most relevant large-scale processing platforms today: Spark and Flink.
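The MapReduce paradigm that underpins frameworks such as Hadoop and Spark can be illustrated with a toy, single-machine word count in plain Python. This is only a conceptual sketch of the map, shuffle, and reduce phases, not the API of any of the frameworks discussed in the chapter:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key (here, sum counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data tools", "big data frameworks"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 2, "tools": 1, "frameworks": 1}
```

In a real cluster the map and reduce phases run in parallel over data partitions and the shuffle moves intermediate pairs across the network; the logical structure, however, is the same as in this sketch.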


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Julián Luengo¹
  • Diego García-Gil¹
  • Sergio Ramírez-Gallego²
  • Salvador García¹
  • Francisco Herrera¹

  1. Department of Computer Science and AI, University of Granada, Granada, Spain
  2. DOCOMO Digital España, Madrid, Spain