Data Engineering

Abstract

After project initiation, the data engineering team takes over to build the infrastructure needed to acquire (identify, retrieve, and query), munge, explore, and persist data. The goal is to enable further data analysis tasks. Data engineering requires different expertise than the later stages of a data science process. It is typically a craft-oriented engineering discipline whose purpose is to supply the input that later phases need. Often, disparate technologies must be orchestrated to handle data communication protocols and formats, perform exploratory visualizations, and preprocess (clean, integrate, and package), scale, and transform data. All these tasks must be done in the context of a global project vision and mission, relying on domain knowledge. Raw data from sources is extremely rarely in perfect shape for immediate analysis. Even a clean dataset often needs to be simplified. Consequently, dimensionality reduction coupled with feature selection (remove, add, and combine) is also part of data engineering. This chapter illustrates data engineering through two detailed case studies, which highlight most of its aspects.
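As a rough illustration of the preprocessing steps named above (clean, scale, and remove features), the following minimal sketch uses only the Python standard library. It is not taken from the chapter's case studies: the function names `clean`, `drop_constant_features`, and `standardize`, and the toy `raw` dataset, are hypothetical and chosen only for this example.

```python
from statistics import mean, pstdev


def clean(rows):
    """Drop every record that contains a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r)]


def drop_constant_features(rows):
    """Remove columns whose values never vary; return rows and kept indices."""
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns) if len(set(col)) > 1]
    return [[r[i] for i in kept] for r in rows], kept


def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    scaled = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        # A constant column has sigma == 0; map it to 0.0 to avoid dividing by zero.
        scaled.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return [list(r) for r in zip(*scaled)]


# Hypothetical toy dataset: one missing value and one constant column.
raw = [
    [1.0, 10.0, 5.0],
    [2.0, None, 5.0],
    [3.0, 30.0, 5.0],
    [5.0, 50.0, 5.0],
]

cleaned = clean(raw)                               # record with None is dropped
reduced, kept = drop_constant_features(cleaned)    # third column is removed
prepared = standardize(reduced)                    # remaining columns are scaled
```

Dropping constant columns is only the simplest form of feature removal; a real project would more likely rely on a dedicated library and on domain knowledge to decide which features to remove, add, or combine.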

Notes

  1. Very large datasets shouldn’t be kept in a Git repository. It is better to store them in a cloud (S3, Google Drive, Dropbox, etc.) and download from there.

  2. I have omitted the In[...] and Out[...] prompts for brevity and just marked the input prompt by >>. Also, keep in mind that Tab completion works for all parts of a command, including file names. Just press Tab and see what Spyder offers to you.

  3. Visit https://matplotlib.org/examples/color/colormaps_reference.html to browse the available colormaps. Each sample is named (find the one that we have used here).

Copyright information

© 2019 Ervin Varga

About this chapter

Cite this chapter

Varga, E. (2019). Data Engineering. In: Practical Data Science with Python 3. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-4859-1_2
