Data Preparation

Data Science for Transport

Abstract

In the practical examples so far, we have read data from CSV files, placed it into SQL queries, and inserted it into a database. In general, this three-step process is known as ETL, for Extract, Transform, Load. Extract means getting data out of some non-database file. Transform means converting it to match our ontology and type system. Load means inserting it into the database. ETL is often performed at massive scale, with many computers working on the various steps and on multiple data sources simultaneously, for example when a transport client sends you a hard disc holding a terabyte of traffic sensor data. In the “big data” movement, the transformation step may matter less: the philosophy there is to store the data in whatever form you can manage when it arrives, and to worry about ontology only at runtime.
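As a rough illustration of the three steps, the sketch below runs a tiny ETL pass in Python. It is not the chapter's own code: the sample data, the column names (`site`, `timestamp`, `vehicle_count`), the `detection` table, and the use of an in-memory SQLite database are all assumptions chosen so that the example is self-contained.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical raw sensor export; the columns are invented for illustration.
RAW_CSV = """site,timestamp,vehicle_count
A1,2017-03-01 08:00:00,42
A1,2017-03-01 08:05:00,n/a
B7,2017-03-01 08:00:00,17
"""


def extract(text):
    """Extract: read rows out of a non-database (CSV) source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: coerce strings to the types the schema expects."""
    clean = []
    for row in rows:
        try:
            when = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
            count = int(row["vehicle_count"])
        except ValueError:
            # Unparseable row; a real pipeline would log or quarantine it.
            continue
        clean.append((row["site"], when.isoformat(sep=" "), count))
    return clean


def load(conn, records):
    """Load: insert the typed records using a parameterised SQL query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS detection "
        "(site TEXT, ts TEXT, vehicle_count INTEGER)"
    )
    conn.executemany("INSERT INTO detection VALUES (?, ?, ?)", records)
    conn.commit()


# Assumption: an in-memory SQLite database stands in for the real one.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
rows = conn.execute(
    "SELECT site, ts, vehicle_count FROM detection ORDER BY site"
).fetchall()
print(rows)
```

The row whose count is `n/a` fails the integer conversion and is dropped during the transform step, so only two records reach the table. Skipping the transform and storing the raw strings as they arrive would correspond to the “big data” approach described above.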

Notes

  1. Pandas has many powerful features for performing SQL-like operations from inside Python and for assisting with data munging. For direct code translations between SQL and pandas, see http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html. Alternatively, converters such as pandasql let you run actual SQL syntax on pandas data without using a database. Also refer to the table in Chap. 2 for some useful Pandas commands for Transport applications.

  2. Pronounced “print F”, from the printf function in C-like languages.

Author information

Correspondence to Charles Fox.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Fox, C. (2018). Data Preparation. In: Data Science for Transport. Springer Textbooks in Earth Sciences, Geography and Environment. Springer, Cham. https://doi.org/10.1007/978-3-319-72953-4_4
