Deploying a Scalable Data Science Environment Using Docker

Martín-Santana, Sergio; Pérez-González, Carlos J.; Colebrook, Marcos; Roda-García, José L.; González-Yanes, Pedro

doi:10.1007/978-3-319-95651-0_7

Abstract

Within the Data Science stack, the infrastructure layer that supports the distributed computing engine plays a key role in obtaining timely and accurate insights in a digital business. However, the cost of using such Data Science facilities on a commercial cloud infrastructure is not affordable to everyone. We therefore develop a computing environment based on free software tools running on commodity computers. We show how to deploy an easily scalable Spark cluster using Docker, including both Jupyter and RStudio, which support the Python and R programming languages. Moreover, we present a successful case study in which this computing framework was used to analyze statistical results using data collected from meteorological stations located in the Canary Islands (Spain).

Acknowledgements

This work is partially supported by the Spanish Ministry of Education and Science, Research Projects MTM2016-74877-P and CGL2015-67508-R, National Plan of Scientific Research, Technological Development and Innovation. The authors wish to thank Adrián Muñoz-Barrera and Luis A. Rubio-Rodríguez for their support and assistance both in the configuration and deployment of the cluster and in the development of the solution.

Author information

Authors and Affiliations

Máster en Ingeniería Informática, Universidad de La Laguna, Tenerife, Spain
Sergio Martín-Santana
Departamento de Matemáticas, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife, Spain
Carlos J. Pérez-González
Departamento de Ingeniería Informática y de Sistemas, Universidad de La Laguna, Tenerife, Spain
Marcos Colebrook & José L. Roda-García
Centro de Cálculo de la Escuela Superior de Ingeniería y Tecnología (Secc. Ing. Informática), Universidad de La Laguna, Tenerife, Spain
Pedro González-Yanes

Corresponding author

Correspondence to Marcos Colebrook.

Editor information

Editors and Affiliations

ETSI Industriales de Ciudad Real, University of Castilla-La Mancha, Ciudad Real, Spain
Fausto Pedro García Márquez
LeBow College of Business, Drexel University, Philadelphia, PA, USA
Benjamin Lev

About this chapter

Cite this chapter

Martín-Santana, S., Pérez-González, C.J., Colebrook, M., Roda-García, J.L., González-Yanes, P. (2019). Deploying a Scalable Data Science Environment Using Docker. In: García Márquez, F., Lev, B. (eds) Data Science and Digital Business. Springer, Cham. https://doi.org/10.1007/978-3-319-95651-0_7

DOI: https://doi.org/10.1007/978-3-319-95651-0_7
Published: 05 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-95650-3
Online ISBN: 978-3-319-95651-0
eBook Packages: Business and Management, Business and Management (R0)
