Sparkmach: A Distributed Data Processing System Based on Automated Machine Learning for Big Data

Bravo-Rocca, Gusseppe; Torres-Robatty, Piero; Fiestas-Iquira, Jose

doi:10.1007/978-3-030-11680-4_13

Part of the book series: Communications in Computer and Information Science (CCIS, volume 898)

Included in the following conference series:

Annual International Symposium on Information Management and Big Data

Abstract

This work proposes a semi-automated analysis and modeling package for Machine Learning problems. The library's goal is to reduce the number of steps involved in a traditional data science workflow. To do so, Sparkmach uses Machine Learning techniques to build base models for both classification and regression problems. Building these models covers exploratory data analysis, data preprocessing, feature engineering and modeling.
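The four stages named in the abstract can be sketched in plain Python. This is only an illustration under stated assumptions, not Sparkmach's actual API: every function name here (`summarize`, `impute`, `standardize`, `fit_baseline`, `run_pipeline`) is hypothetical, the task type is passed in explicitly, and the "base model" is a trivial stand-in (majority class or mean predictor) rather than whatever models the library really trains.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(rows):
    """EDA step: per column, count missing values and average the present ones."""
    summary = []
    for j in range(len(rows[0])):
        present = [r[j] for r in rows if r[j] is not None]
        summary.append({"missing": len(rows) - len(present), "mean": mean(present)})
    return summary


def impute(rows, summary):
    """Preprocessing step: replace each missing value with its column mean."""
    return [[s["mean"] if v is None else v for v, s in zip(r, summary)] for r in rows]


def standardize(rows):
    """Feature engineering step: scale each column to zero mean and unit variance."""
    stats = [(mean(c), pstdev(c) or 1.0) for c in zip(*rows)]  # guard constant columns
    return [[(v - m) / s for v, (m, s) in zip(r, stats)] for r in rows]


def fit_baseline(targets, task):
    """Modeling step (stand-in): majority class or mean predictor."""
    if task == "classification":
        label = Counter(targets).most_common(1)[0][0]
        return lambda features: label
    if task == "regression":
        avg = mean(targets)
        return lambda features: avg
    raise ValueError(f"unknown task: {task!r}")


def run_pipeline(rows, targets, task):
    """Chain the four stages and return their intermediate results."""
    summary = summarize(rows)
    features = standardize(impute(rows, summary))
    model = fit_baseline(targets, task)
    return summary, features, model
```

Calling `run_pipeline(rows, targets, "classification")` on a small list of numeric rows (with `None` marking missing cells) returns the column summary, the cleaned feature matrix, and a callable baseline. A real implementation would replace each stage with its distributed counterpart.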

The project is based on Pymach, a similar library that handles those steps for small and medium-sized datasets (about ten million rows and a few columns). Sparkmach's central task is to scale Pymach to big datasets by using Apache Spark, a distributed engine for large-scale data processing that tackles several data science problems in a cluster environment. Although the software is designed for distributed use, Sparkmach can also be used in local environments, where it still benefits from the distributed processing tools.
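To make the scaling idea concrete, the sketch below shows the general pattern a distributed engine relies on: compute a small aggregate per partition, then merge the aggregates. It does not use Spark and is not taken from Sparkmach. The helpers `partition`, `partial_stats`, `merge`, and `distributed_mean_var` are hypothetical names, and the partitions are processed sequentially here, whereas a cluster would process them on separate workers.

```python
from functools import reduce


def partition(values, n_parts):
    """Split values into n_parts chunks (a stand-in for data partitions)."""
    return [values[i::n_parts] for i in range(n_parts)]


def partial_stats(chunk):
    """Per-partition aggregate: (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a, b):
    """Combine two partial aggregates; associative, so merge order is free."""
    return tuple(x + y for x, y in zip(a, b))


def distributed_mean_var(values, n_parts=4):
    """Mean and population variance from merged per-partition aggregates."""
    parts = partition(values, n_parts)
    count, total, total_sq = reduce(merge, map(partial_stats, parts), (0, 0.0, 0.0))
    mu = total / count
    return mu, total_sq / count - mu * mu
```

Because only three numbers per partition travel to the merge step, the same code shape works whether the data sits in one process or across many machines. The sum-of-squares formula is used for brevity; it can lose precision on large or poorly scaled data, so a production system would likely prefer a numerically stabler merge.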

Acknowledgments

The project would have been impossible without the support of Ciencia Activa and the Fondo para la Innovación, la Ciencia y la Tecnología (FINCyT; Innovation, Science and Technology Fund).

Author information

Authors and Affiliations

Universidad Nacional de Ingeniería, Rimac Lima, 15333, Peru
Gusseppe Bravo-Rocca, Piero Torres-Robatty & Jose Fiestas-Iquira

Authors

Gusseppe Bravo-Rocca
Piero Torres-Robatty
Jose Fiestas-Iquira

Corresponding author

Correspondence to Jose Fiestas-Iquira.

Editor information

Editors and Affiliations

Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
Juan Antonio Lossio-Ventura
Fondazione Bruno Kessler, Trento, Italy
Denisse Muñante
Facultad de Ingeniería, University of the Pacific, Jesús María, Lima, Peru
Hugo Alatrista-Salas

About this paper

Cite this paper

Bravo-Rocca, G., Torres-Robatty, P., Fiestas-Iquira, J. (2019). Sparkmach: A Distributed Data Processing System Based on Automated Machine Learning for Big Data. In: Lossio-Ventura, J., Muñante, D., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2018. Communications in Computer and Information Science, vol 898. Springer, Cham. https://doi.org/10.1007/978-3-030-11680-4_13

DOI: https://doi.org/10.1007/978-3-030-11680-4_13
Published: 08 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11679-8
Online ISBN: 978-3-030-11680-4
eBook Packages: Computer Science, Computer Science (R0)
