Deploying a Scalable Data Science Environment Using Docker

  • Sergio Martín-Santana
  • Carlos J. Pérez-González
  • Marcos Colebrook
  • José L. Roda-García
  • Pedro González-Yanes


Within the Data Science stack, the infrastructure layer that supports the distributed computing engine plays a key role in obtaining timely and accurate insights in a digital business. However, the expense of using such Data Science facilities in a commercial cloud infrastructure is not always affordable. To address this, we develop a computing environment based on free software tools running on commodity computers. We show how to deploy an easily scalable Spark cluster using Docker, including both Jupyter and RStudio, which support the Python and R programming languages. Moreover, we present a successful case study in which this computing framework was used to analyze statistical results using data collected from meteorological stations located in the Canary Islands (Spain).



This work is partially supported by the Spanish Ministry of Education and Science, Research Projects MTM2016-74877-P and CGL2015-67508-R, National Plan of Scientific Research, Technological Development and Innovation. The authors wish to thank Adrián Muñoz-Barrera and Luis A. Rubio-Rodríguez for their support and assistance both in the configuration and deployment of the cluster and in the development of the solution.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Sergio Martín-Santana (1)
  • Carlos J. Pérez-González (2)
  • Marcos Colebrook (3), corresponding author
  • José L. Roda-García (3)
  • Pedro González-Yanes (4)

  1. Máster en Ingeniería Informática, Universidad de La Laguna, Tenerife, Spain
  2. Departamento de Matemáticas, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife, Spain
  3. Departamento de Ingeniería Informática y de Sistemas, Universidad de La Laguna, Tenerife, Spain
  4. Centro de Cálculo de la Escuela Superior de Ingeniería y Tecnología (Secc. Ing. Informática), Universidad de La Laguna, Tenerife, Spain
