Deploying a Scalable Data Science Environment Using Docker

  • Sergio Martín-Santana
  • Carlos J. Pérez-González
  • Marcos Colebrook
  • José L. Roda-García
  • Pedro González-Yanes
Chapter

Abstract

Within the Data Science stack, the infrastructure layer that supports the distributed computing engine plays a key role in obtaining timely and accurate insights for a digital business. However, running such Data Science facilities on a commercial cloud infrastructure can be too expensive for some users. To address this, we develop a computing environment based on free software tools running on commodity computers. We show how to deploy an easily scalable Spark cluster using Docker, including both Jupyter and RStudio, which support the Python and R programming languages. Moreover, we present a successful case study in which this computing framework was used to obtain and analyze statistical results from data collected at meteorological stations located in the Canary Islands (Spain).
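
As an illustration of the kind of deployment the abstract describes, the following minimal sketch builds a Compose-style service definition with one Spark master, a configurable number of workers, and Jupyter and RStudio front ends. It is not the authors' configuration: the image names, the `SPARK_MASTER` environment variable, and the assumption that `spark-class` is on the image's PATH are all hypothetical placeholders. Only the port numbers (7077 and 8080 for the Spark standalone master, 8888 for Jupyter, 8787 for RStudio Server) are the tools' usual defaults.

```python
import json

# Hypothetical image names: placeholders for illustration, not taken from the chapter.
SPARK_IMAGE = "example/spark:latest"
JUPYTER_IMAGE = "example/jupyter-pyspark:latest"
RSTUDIO_IMAGE = "example/rstudio-sparklyr:latest"

MASTER_PORT = 7077      # default port of a Spark standalone master
MASTER_UI_PORT = 8080   # default port of the Spark master web UI


def build_compose(num_workers):
    """Return a Compose-style dict: one master, num_workers workers, two front ends."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    master_url = f"spark://spark-master:{MASTER_PORT}"
    services = {
        "spark-master": {
            "image": SPARK_IMAGE,
            # Assumes spark-class is on the PATH inside the (hypothetical) image.
            "command": ["spark-class", "org.apache.spark.deploy.master.Master"],
            "ports": [
                f"{MASTER_UI_PORT}:{MASTER_UI_PORT}",
                f"{MASTER_PORT}:{MASTER_PORT}",
            ],
        },
        "jupyter": {
            "image": JUPYTER_IMAGE,
            # SPARK_MASTER is a hypothetical variable name for passing the master URL.
            "environment": {"SPARK_MASTER": master_url},
            "ports": ["8888:8888"],  # Jupyter's default port
            "depends_on": ["spark-master"],
        },
        "rstudio": {
            "image": RSTUDIO_IMAGE,
            "environment": {"SPARK_MASTER": master_url},
            "ports": ["8787:8787"],  # RStudio Server's default port
            "depends_on": ["spark-master"],
        },
    }

    # Scaling the cluster amounts to adding more worker services
    # that all register with the same master URL.
    for i in range(1, num_workers + 1):
        services[f"spark-worker-{i}"] = {
            "image": SPARK_IMAGE,
            "command": [
                "spark-class",
                "org.apache.spark.deploy.worker.Worker",
                master_url,
            ],
            "depends_on": ["spark-master"],
        }

    return {"services": services}


if __name__ == "__main__":
    # JSON is also valid YAML, so the printed document is in a Compose-readable form.
    print(json.dumps(build_compose(num_workers=3), indent=2))
```

Under these assumptions, changing `num_workers` is the only step needed to grow or shrink the cluster definition. With real images, the printed output could be saved to a file and passed to `docker compose -f <file> up`.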

Notes

Acknowledgements

This work is partially supported by the Spanish Ministry of Education and Science, Research Projects MTM2016-74877-P and CGL2015-67508-R, National Plan of Scientific Research, Technological Development and Innovation. The authors wish to thank Adrián Muñoz-Barrera and Luis A. Rubio-Rodríguez for their support and assistance both in the configuration and deployment of the cluster and in the development of the solution.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Sergio Martín-Santana (1)
  • Carlos J. Pérez-González (2)
  • Marcos Colebrook (3)
  • José L. Roda-García (3)
  • Pedro González-Yanes (4)

  1. Máster en Ingeniería Informática, Universidad de La Laguna, Tenerife, Spain
  2. Departamento de Matemáticas, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife, Spain
  3. Departamento de Ingeniería Informática y de Sistemas, Universidad de La Laguna, Tenerife, Spain
  4. Centro de Cálculo de la Escuela Superior de Ingeniería y Tecnología (Secc. Ing. Informática), Universidad de La Laguna, Tenerife, Spain
