
Deploying a Scalable Data Science Environment Using Docker


Abstract

Within the Data Science stack, the infrastructure layer that supports the distributed computing engine plays a key role in obtaining timely and accurate insights in a digital business. However, the cost of running such Data Science facilities on a commercial cloud infrastructure is not affordable for everyone. To address this, we develop a computing environment based on free software tools running on commodity computers. We show how to deploy an easily scalable Spark cluster using Docker, including both Jupyter and RStudio, which support the Python and R programming languages. Moreover, we present a successful case study in which this computing framework was used to analyze statistical results using data collected by meteorological stations located in the Canary Islands (Spain).
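As a rough illustration of the kind of deployment the abstract describes, the following minimal sketch builds a Compose-style service map for one Spark master, a configurable number of workers, and Jupyter and RStudio front ends. It is not the chapter's actual configuration: the image names, the service names, and the SPARK_MASTER environment variable are hypothetical placeholders chosen for this example. Only the port numbers (7077 and 8080 for the Spark standalone master, 8888 for Jupyter, 8787 for RStudio Server) are the usual defaults of those tools.

```python
import json


def build_cluster_spec(num_workers: int) -> dict:
    """Return a Compose-style service map for a Spark standalone cluster.

    Image names and the SPARK_MASTER variable are hypothetical placeholders,
    not the images or settings used in the chapter.
    """
    if num_workers < 1:
        raise ValueError("a cluster needs at least one worker")

    # Spark standalone masters listen on port 7077 by default.
    master_url = "spark://spark-master:7077"

    services = {
        "spark-master": {
            "image": "example/spark:latest",  # hypothetical image
            "ports": ["8080:8080", "7077:7077"],  # web UI and master port
        },
        "jupyter": {
            "image": "example/jupyter-pyspark:latest",  # hypothetical image
            "ports": ["8888:8888"],  # Jupyter default port
            "environment": {"SPARK_MASTER": master_url},
        },
        "rstudio": {
            "image": "example/rstudio-sparkr:latest",  # hypothetical image
            "ports": ["8787:8787"],  # RStudio Server default port
            "environment": {"SPARK_MASTER": master_url},
        },
    }

    # Scaling the cluster amounts to adding more identical worker services.
    for i in range(1, num_workers + 1):
        services[f"spark-worker-{i}"] = {
            "image": "example/spark:latest",  # hypothetical image
            "depends_on": ["spark-master"],
            "environment": {"SPARK_MASTER": master_url},
        }

    return {"services": services}


if __name__ == "__main__":
    # JSON is a subset of YAML, so this output can be saved as a Compose file.
    print(json.dumps(build_cluster_spec(3), indent=2))
```

Changing the argument of build_cluster_spec changes only the number of worker entries, which is one simple way to picture the "easily scalable" property mentioned above.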



Acknowledgements

This work is partially supported by the Spanish Ministry of Education and Science, Research Projects MTM2016-74877-P and CGL2015-67508-R, National Plan of Scientific Research, Technological Development and Innovation. The authors wish to thank Adrián Muñoz-Barrera and Luis A. Rubio-Rodríguez for their support and assistance both in the configuration and deployment of the cluster and in the development of the solution.

Author information

Corresponding author

Correspondence to Marcos Colebrook.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Martín-Santana, S., Pérez-González, C.J., Colebrook, M., Roda-García, J.L., González-Yanes, P. (2019). Deploying a Scalable Data Science Environment Using Docker. In: García Márquez, F., Lev, B. (eds) Data Science and Digital Business. Springer, Cham. https://doi.org/10.1007/978-3-319-95651-0_7
