Abstract
This work proposes Sparkmach, a semi-automated analysis and modeling package for machine learning problems. The library's goal is to reduce the number of steps involved in a traditional data science roadmap. To do so, Sparkmach uses machine learning techniques to build base models for both classification and regression problems. Building these models covers exploratory data analysis, data preprocessing, feature engineering, and modeling.
The project is based on Pymach, a similar library that handles those steps for small and medium-sized datasets (about ten million rows and a few columns). Sparkmach's central task is to scale Pymach to big datasets by using Apache Spark, a distributed engine for large-scale data processing that tackles several data science problems in a cluster environment. Although it is designed for clusters, Sparkmach can also be used in local environments, where it still benefits from the distributed processing tools.
Acknowledgments
The project would have been impossible without the support of Ciencia Activa and Fondo para la Innovación, la Ciencia y la Tecnología - Innovation, Science and Technology Fund (FINCyT).
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Bravo-Rocca, G., Torres-Robatty, P., Fiestas-Iquira, J. (2019). Sparkmach: A Distributed Data Processing System Based on Automated Machine Learning for Big Data. In: Lossio-Ventura, J., Muñante, D., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2018. Communications in Computer and Information Science, vol 898. Springer, Cham. https://doi.org/10.1007/978-3-030-11680-4_13
DOI: https://doi.org/10.1007/978-3-030-11680-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11679-8
Online ISBN: 978-3-030-11680-4
eBook Packages: Computer Science, Computer Science (R0)