Abstract
Within the Data Science stack, the infrastructure layer that supports the distributed computing engine plays a key role in obtaining timely and accurate insights in a digital business. However, the cost of running such Data Science facilities on a commercial cloud infrastructure is not affordable for everyone. We therefore develop a computing environment based on free software tools running on commodity computers. Specifically, we show how to deploy an easily scalable Spark cluster using Docker, including both Jupyter and RStudio, which support the Python and R programming languages. We also present a successful case study in which this computing framework was used to analyze statistical results using data collected from meteorological stations located in the Canary Islands (Spain).
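The kind of deployment summarized above can be illustrated with a minimal sketch. It is not the chapter's actual configuration: the function name `build_compose_spec`, the image names under `example/`, and the `SPARK_MASTER_URL` environment variable are hypothetical placeholders. Only the port numbers reflect the usual defaults of Spark standalone (7077, 8080), Jupyter (8888) and RStudio Server (8787). The sketch builds a Docker Compose-style service map in which the number of Spark workers is a single parameter, which is one plausible reading of "easily scalable".

```python
import json


def build_compose_spec(num_workers: int,
                       spark_image: str = "example/spark:latest") -> dict:
    """Build a Compose-style service map: one Spark master, N workers,
    plus Jupyter and RStudio front ends.

    All image names and the SPARK_MASTER_URL variable are hypothetical
    placeholders, not taken from the chapter.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    services = {
        # Spark standalone master: 7077 for cluster traffic, 8080 for the web UI.
        "spark-master": {
            "image": spark_image,
            "ports": ["7077:7077", "8080:8080"],
        },
        # Notebook front end for Python users (default Jupyter port).
        "jupyter": {
            "image": "example/jupyter-spark:latest",
            "ports": ["8888:8888"],
            "depends_on": ["spark-master"],
        },
        # IDE front end for R users (default RStudio Server port).
        "rstudio": {
            "image": "example/rstudio-spark:latest",
            "ports": ["8787:8787"],
            "depends_on": ["spark-master"],
        },
    }

    # Scaling the cluster means adding more worker services
    # that all point at the same master.
    for i in range(1, num_workers + 1):
        services[f"spark-worker-{i}"] = {
            "image": spark_image,
            "environment": {"SPARK_MASTER_URL": "spark://spark-master:7077"},
            "depends_on": ["spark-master"],
        }

    return {"services": services}


if __name__ == "__main__":
    # Print a three-worker cluster description for inspection.
    print(json.dumps(build_compose_spec(3), indent=2))
```

Under these assumptions, changing the cluster size means calling the function with a different `num_workers`; the master and the two front ends stay unchanged.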
Acknowledgements
This work is partially supported by the Spanish Ministry of Education and Science, Research Projects MTM2016-74877-P and CGL2015-67508-R, National Plan of Scientific Research, Technological Development and Innovation. The authors wish to thank Adrián Muñoz-Barrera and Luis A. Rubio-Rodríguez for their support and assistance both in the configuration and deployment of the cluster and in the development of the solution.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Martín-Santana, S., Pérez-González, C.J., Colebrook, M., Roda-García, J.L., González-Yanes, P. (2019). Deploying a Scalable Data Science Environment Using Docker. In: García Márquez, F., Lev, B. (eds) Data Science and Digital Business. Springer, Cham. https://doi.org/10.1007/978-3-319-95651-0_7
DOI: https://doi.org/10.1007/978-3-319-95651-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-95650-3
Online ISBN: 978-3-319-95651-0
eBook Packages: Business and Management (R0)