  • Julián Luengo
  • Diego García-Gil
  • Sergio Ramírez-Gallego
  • Salvador García
  • Francisco Herrera


We live in a world where data is generated from a myriad of sources, and it is cheap to collect and store such data. However, the real benefit lies not in the data itself but in the algorithms that can process it in a tolerable elapsed time and extract valuable knowledge from it. The term “Big Data” has spread rapidly in the fields of data mining and business intelligence. This new scenario can be defined in terms of those problems that cannot be effectively or efficiently addressed using the standard computing resources currently available. We must emphasize that Big Data implies not just large volumes of data but also the need for scalability, i.e., ensuring a response in an acceptable elapsed time. Therefore, the use of Big Data Analytics tools provides very significant advantages to both industry and academia. In this chapter we provide an introduction to Big Data and its problems. Next we discuss a new topic, Big Data Analytics, which refers to the application of machine learning techniques to Big Data problems. We then define data preprocessing and describe the different techniques used to improve the quality of data. We finish with an introduction to the state of Big Data streaming.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Julián Luengo (1)
  • Diego García-Gil (1)
  • Sergio Ramírez-Gallego (2)
  • Salvador García (1)
  • Francisco Herrera (1)

  1. Department of Computer Science and AI, University of Granada, Granada, Spain
  2. DOCOMO Digital España, Madrid, Spain
