Data Engineering

Varga, Ervin

doi:10.1007/978-1-4842-4859-1_2

Chapter
First Online: 08 September 2019

Abstract

After project initiation, the data engineering team takes over to build the infrastructure needed to acquire (identify, retrieve, and query), munge, explore, and persist data. The goal is to enable further data analysis tasks. Data engineering requires different expertise than the later stages of a data science process. It is typically an engineering discipline oriented toward craftsmanship, providing the necessary input to later phases. Disparate technologies must often be orchestrated to handle data communication protocols and formats, perform exploratory visualizations, and preprocess (clean, integrate, and package), scale, and transform data. All these tasks must be done in the context of a global project vision and mission, relying on domain knowledge. Raw data from sources is extremely rarely in perfect shape for immediate analysis. Even a clean dataset often needs to be simplified. Consequently, dimensionality reduction coupled with feature selection (remove, add, and combine) is also part of data engineering. This chapter illustrates data engineering through two detailed case studies, which highlight most aspects of it.

Notes

1. Very large datasets shouldn’t be kept in a Git repository. It is better to store them in a cloud (S3, Google Drive, Dropbox, etc.) and download from there.
2. I have omitted the In[...] and Out[...] prompts for brevity and just marked the input prompt by >>. Also, keep in mind that Tab completion works for all parts of a command, including file names. Just press Tab and see what Spyder offers to you.
3. Visit https://matplotlib.org/examples/color/colormaps_reference.html to browse the available colormaps. Each sample is named (find the one that we have used here).
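
The advice in note 1 (keep large datasets outside the Git repository and download them on demand) could be sketched roughly as follows. This is a minimal illustration and not code from the chapter: the helper name fetch_dataset and the ".part" temporary suffix are assumptions made for this example, and it uses only the Python standard library.

```python
import urllib.request
from pathlib import Path


def fetch_dataset(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper for illustration; the name and behavior are
    assumptions, not taken from the chapter.
    """
    dest = Path(dest)
    if dest.exists():
        # A local copy is already present, so skip the download.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so that an interrupted transfer
    # never leaves a truncated file under the final name.
    tmp = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, str(tmp))
    tmp.replace(dest)
    return dest
```

A call would pass the dataset's public URL and a local path that is listed in .gitignore, so the downloaded file never enters version control. Because the function returns early when the file exists, deleting the local copy is the way to force a fresh download.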

Author information

Authors and Affiliations

Kikinda, Serbia
Ervin Varga

About this chapter

Cite this chapter

Varga, E. (2019). Data Engineering. In: Practical Data Science with Python 3. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-4859-1_2

DOI: https://doi.org/10.1007/978-1-4842-4859-1_2
Published: 08 September 2019
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-4858-4
Online ISBN: 978-1-4842-4859-1
eBook Packages: Professional and Applied Computing; Apress Access Books; Professional and Applied Computing (R0)
