Abstract
After project initiation, the data engineering team takes over to build the infrastructure needed to acquire (identify, retrieve, and query), munge, explore, and persist data. The goal is to enable further data analysis tasks. Data engineering requires different expertise than the later stages of a data science process. It is typically an engineering discipline oriented toward craftsmanship, and it provides the necessary input to later phases. Often, disparate technologies must be orchestrated to handle data communication protocols and formats, perform exploratory visualizations, and preprocess (clean, integrate, and package), scale, and transform data. All these tasks must be done in the context of a global project vision and mission, relying on domain knowledge. It is extremely rare that raw data from sources is immediately in perfect shape for analysis. Even when a dataset is clean, there is often a need to simplify it. Consequently, dimensionality reduction coupled with feature selection (remove, add, and combine) is also part of data engineering. This chapter illustrates data engineering through two detailed case studies, which highlight most aspects of it.
Notes
1. Very large datasets shouldn’t be kept in a Git repository. It is better to store them in the cloud (S3, Google Drive, Dropbox, etc.) and download them from there.
2. I have omitted the In[...] and Out[...] prompts for brevity and just marked the input prompt by >>. Also, keep in mind that Tab completion works for all parts of a command, including file names. Just press Tab and see what Spyder offers you.
3. Visit https://matplotlib.org/examples/color/colormaps_reference.html to browse the available colormaps. Each sample is named (find the one that we have used here).
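The advice in note 1 (keep large datasets out of Git and download them from remote storage) could be sketched roughly as follows. This is a minimal illustration, not the chapter's own code: the function name `fetch_dataset`, the `.part` temporary suffix, and the example URL and paths are hypothetical. It assumes the dataset is reachable at a plain URL; provider-specific access (for example, authenticated S3 buckets) would need the provider's own client.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_dataset(url: str, target: Path) -> Path:
    """Download `url` to `target` unless a local copy already exists.

    Hypothetical helper: the name and behavior are assumptions made for
    illustration, not taken from the chapter.
    """
    if target.exists():
        # Reuse the local copy so the dataset is downloaded only once.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, so an interrupted download
    # never leaves a truncated file under the final name.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

A call such as `fetch_dataset("https://example.com/data/raw.csv", Path("data/raw.csv"))` (placeholder URL) would then fetch the file on the first run and reuse it afterward. Listing the download directory in `.gitignore` keeps the data out of the repository.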
Copyright information
© 2019 Ervin Varga
About this chapter
Cite this chapter
Varga, E. (2019). Data Engineering. In: Practical Data Science with Python 3. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-4859-1_2
DOI: https://doi.org/10.1007/978-1-4842-4859-1_2
Published:
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-4858-4
Online ISBN: 978-1-4842-4859-1
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)