Big Data Software

  • Julián Luengo
  • Diego García-Gil
  • Sergio Ramírez-Gallego
  • Salvador García
  • Francisco Herrera


The advent of Big Data has created the need for new computing tools capable of processing huge amounts of data. Apache Hadoop was the first open-source framework to implement the MapReduce paradigm. Apache Spark appeared a few years later, improving on the Hadoop ecosystem. More recently, Apache Flink emerged to tackle the Big Data streaming problem. These frameworks were created to handle huge amounts of data, but many practitioners also need machine learning algorithms to extract knowledge from that data, so the success of a Big Data framework is strongly tied to its machine learning capability. This is why these frameworks now include a Big Data machine learning library: MLlib in the case of Spark, and FlinkML in the case of Flink. In this chapter, we analyze both libraries in depth. We start with a description of Apache Spark MLlib and all of its components. We continue with a description of BigDaPSpark, a Big Data library focused on data preprocessing for Apache Spark. Next, we provide an extensive analysis of FlinkML and its included algorithms and utilities. Finally, we describe BigDaPFlink, a Big Data streaming library focused on data preprocessing for Apache Flink.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Julián Luengo (1)
  • Diego García-Gil (1)
  • Sergio Ramírez-Gallego (2)
  • Salvador García (1)
  • Francisco Herrera (1)

  1. Department of Computer Science and AI, University of Granada, Granada, Spain
  2. DOCOMO Digital España, Madrid, Spain
