Data Processing and Analysis

  • Robert Johansson


In the last several chapters, we have covered the main topics of traditional scientific computing. These topics provide a foundation for most computational work. Starting with this chapter, we move on to explore data processing and analysis, statistics, and statistical modeling. As the first step in this direction, we look at the data analysis library pandas. This library provides convenient data structures for representing series and tables of data and makes it easy to transform, split, merge, and convert data. These are important steps in the process of cleaning raw data into a tidy form that is suitable for analysis. The pandas library builds on top of NumPy and complements it with features that are particularly useful when handling data, such as labeled indexing, hierarchical indices, alignment of data for comparison and merging of datasets, and handling of missing data. As a result, pandas has become a de facto standard library for high-level data processing in Python, especially for statistics applications. The pandas library itself contains only limited support for statistical modeling (namely, linear regression). For more involved statistical analysis and modeling, there are other packages available, such as statsmodels, patsy, and scikit-learn, which we cover in later chapters. Even when these packages are used for statistical modeling, pandas remains useful for data representation and preparation. The pandas library is therefore a key component in the software stack for data analysis with Python.
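As a rough illustration of the features listed above (labeled indexing, alignment, handling of missing data, and merging), the following minimal sketch uses made-up values and placeholder labels ("a" through "d"); the variable names are hypothetical and are not taken from the chapter.

```python
import pandas as pd

# Two series with partially overlapping labels (illustrative values only).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

# Labeled indexing: select a value by its label rather than its position.
value_b = s1["b"]

# Alignment: arithmetic matches entries by label; "a" has no partner in s2,
# so the result for "a" is a missing value (NaN).
total = s1 + s2

# Missing data: count the missing entries, then fill them with a default.
n_missing = int(total.isna().sum())
filled = total.fillna(0.0)

# Merging: combine two tables on a shared key column (inner join keeps
# only the keys present in both tables).
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})
merged = left.merge(right, on="key", how="inner")

print(total)
print(merged)
```

The inner join is only one choice here; passing how="left" or how="outer" to merge would instead keep unmatched keys and mark the gaps as missing values.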

Copyright information

© Robert Johansson 2019

Authors and Affiliations

  • Robert Johansson
    • Urayasu-shi, Chiba, Japan
