Exploratory Data Analysis

Abstract

The role of an exploratory data analysis (EDA) is to equip the modeler with an understanding of the data. More specifically, an EDA helps to answer two core questions: (a) whether a trait is safety-related and (b) what function can be used to represent it in the model equation. This chapter shows how to do an EDA of the Colorado data using spreadsheet tools. As expected, Segment Length, AADT, and Terrain are safety-related traits. However, one cannot say what function links E{μ} to these variables. The numerical results of the EDA motivate important general observations.
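
The kind of pivot-table summary an EDA relies on might be sketched as follows. This is a minimal illustration, not the chapter's spreadsheet procedure: the records are made-up numbers (not the Colorado data), and the names `records`, `pivot_mean`, and `aadt_bin`, the bin width of 2000, and the tuple layout are all assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical records (made-up numbers, NOT the Colorado data):
# (segment_length_miles, aadt, terrain_code, accident_count)
records = [
    (0.5, 800, "F", 0),
    (1.2, 900, "R", 1),
    (0.8, 2500, "F", 1),
    (2.0, 2700, "M", 4),
    (1.5, 6000, "R", 3),
    (1.0, 6500, "M", 5),
]


def pivot_mean(rows, key_fn, value_fn):
    """Group rows by key_fn and average value_fn within each group,
    much as a spreadsheet pivot table would."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = key_fn(row)
        sums[key] += value_fn(row)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sorted(sums)}


def aadt_bin(row, width=2000):
    """Lower edge of the AADT bin (assumed width) that the row falls into."""
    return (row[1] // width) * width


# Average accident count per AADT bin and per terrain code.
mean_by_aadt = pivot_mean(records, aadt_bin, lambda r: r[3])
mean_by_terrain = pivot_mean(records, lambda r: r[2], lambda r: r[3])

print(mean_by_aadt)
print(mean_by_terrain)
```

If the bin averages change in an orderly way from one bin to the next, the trait looks safety-related. As the abstract cautions, a table like this does not by itself say which function should represent the trait.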

Notes

  1.

    EDA is usually thought of as a set of activities that precedes formal modeling and, in this sense, calling it “initial” may seem redundant. However, as will be stressed repeatedly, modeling is not a once-through process. Rather, it is akin to a spiral the coils of which are cyclically repeated activities. The EDA will be an integral part of every modeling cycle, every turn of the spiral. It will help to determine whether the data indicate that a new trait is to be added to the SPF and in what form and way. In this role it will be called a Variable Introduction EDA or, by acronym, a VIEDA.

  2.

    Here is what the Engineering Statistics Handbook (NIST/SEMATECH) says: “EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow with the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and how we interpret.” (Sect. 1.1).

  3.

    An orderly relationship is one where the existence of a perceived pattern makes curve-fitting a sensible choice. If the relationship is not orderly there is no reason to add this trait to the SPF. The question of when a trait should be added to the SPF is discussed at length in Chap. 9.

  4.

    The VIEDA is discussed in Sect. 9.2.

  5.

    The data can be downloaded from http://extras.springer.com/ using the ISBN (International Standard Book Number) of this book. Look in the “Data” folder for file “1 (a or b) Colorado full. (xls or xlsx).”

  6.

    To access and download the condensed data go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “data” folder for files “4 (a or b) Colorado condensed. (xls or xlsx).”

  7.

    In all examples I will refer to Excel 2007. Other versions may differ from it in toolbar arrangements, menus and similar inessential detail. Readers may have to adapt to the specifics of the version they use. The first appearance of the Pivot Table in Excel was in version 5 (1993). Up to and including “Office 2013,” there are 12 versions of Microsoft Excel.

  8.

    Some can reach general conclusions by logical reasoning unaided by numbers and graphs; others need the stimulus and support of numerical results in their quest for reasoned deductions. The EDA suits the latter group.

  9.

    This is the case whenever the μ of a unit is estimated by the Empirical Bayes method.

  10.

    Bias is the difference between the average value of the estimate and the true value of what is being estimated. If the difference is not zero the estimator is said to be biased.

  11.

    Network screening is the activity of identifying “blackspots” or “sites with promise,” units which may require attention and perhaps remediation. The SafetyAnalyst software (Harwood et al. 2010) uses SPFs and Empirical Bayes estimates for network screening.

  12.

    In regression analysis the process of model specification consists of selecting an appropriate functional form and of choosing the predictor variables. Common errors of model specification are (1) choosing an incorrect functional form, (2) omitting predictor variables which have a relationship with both the dependent variable and one or more of the predictor variables, (3) including irrelevant predictor variables, and (4) assuming that the predictor variables are measured without error. If an estimated model is misspecified, it will produce biased and inconsistent estimates.

  13.

    What determines whether a trait is safety-related is discussed in Sects. 9.1 and 9.2.

  14.

    See e.g., Chiou and Fu (2013, p. 77) and Chen and Persaud (2014, p. 135).

  15.

    See e.g., Gross et al. (2013, p. 236) who say that “Additional variables were considered based on available data and included in the models if (1) the variable significantly improved the model, and (2) the effect of the variable was intuitive.”

  16.

    The data are from that region of Fig. 3.10 where accident counts are most numerous.

  17.

    A non-linear relationship with AADT is a common empirical finding. This may reflect the fact that many things change with traffic flow: speed, spacing between vehicles, alertness, etc. In addition, many safety-related traits are associated with traffic flow: level of enforcement and maintenance, presence of illumination, road design standards, etc. It would indeed be strange if such a complex interplay of influences, when represented by the single trait AADT, ended up as a straight-line relationship. Even more generally, it would be unexpected for a complex web of causes to become manifest as a simple mathematical function.

  18.

    One might think that if a 1-mile-long segment is expected to have X accidents, then a 2-mile-long segment with similar traits will have 2X accidents. The problem is that in our data Segment Length may be correlated with various safety-related traits; segment length may have something to do with the way roads are parsed for entry into databases. Segments tend to end at intersections, jurisdiction boundaries, and various geographic features. When intersections are far apart or a region is sparsely settled, segments tend to be longer. Far-apart intersections and sparsely populated regions may have fewer driveways, more homogeneous speeds, more driver fatigue, be farther from trauma centers, etc. Because of such associations, the relationship between segment length and accident frequency may be more complex than one of simple proportionality.

  19.

    A variable that has an important effect on the dependent variable but is not amongst the predictor variables.

  20.

    The differences between F, R and M represent unaccounted for differences in grade, curvature, roadside, weather, road users, vehicles, etc.
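
The definition of bias in note 10 can be made concrete with a small worked example. The sketch below is only an illustration and has nothing to do with accident data. It swaps in a textbook case, estimating the variance of a tiny made-up population from samples of size 2, and computes the average value of each estimator exactly by enumerating every equally likely sample. The function names and the population are assumptions chosen for this example.

```python
from itertools import product
from statistics import mean


def avg_estimate(estimator, population, n):
    """Average of the estimator over every equally likely sample of size n
    drawn with replacement from the population."""
    return mean(estimator(sample) for sample in product(population, repeat=n))


def var_div_n(values):
    """Variance with divisor n."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def var_div_n_minus_1(values):
    """Variance with divisor n - 1."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


population = [1, 2, 3, 4]          # made-up population for illustration
true_var = var_div_n(population)   # the true value being estimated

# Bias = (average value of the estimate) - (true value).
bias_n = avg_estimate(var_div_n, population, 2) - true_var
bias_n_minus_1 = avg_estimate(var_div_n_minus_1, population, 2) - true_var

print(bias_n, bias_n_minus_1)
```

Here the divisor-n estimator has a non-zero bias and is therefore biased in the sense of note 10, while the divisor-(n - 1) estimator has zero bias.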

References

  • Brillinger DR (2002) John Wilder Tukey. Not Am Math Soc 49(2):193–201

  • Chen Y, Persaud B (2014) Methodology to develop crash modification functions for road safety treatments with fully specified and hierarchical models. Accid Anal Prev 70:131–139

  • Chiou Y-C, Fu C (2013) Modeling crash frequency and severity using multinomial-generalized Poisson model with error components. Accid Anal Prev 50:73–82

  • Gross F, Craig L, Persaud B, Srinivasan R (2013) Safety effectiveness of converting signalized intersections to roundabouts. Accid Anal Prev 50:234–241

  • NIST/SEMATECH e-handbook of statistical methods. http://www.itl.nist.gov/div898/handbook/

  • Harwood DW, Torbic DJ, Richard KR, Meyer MM (2010) Safety Analyst software tools for safety management of specific highway sites. FHWA-HRT-10-063, Federal Highway Administration, Office of Safety Research and Development

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Hauer, E. (2015). Exploratory Data Analysis. In: The Art of Regression Modeling in Road Safety. Springer, Cham. https://doi.org/10.1007/978-3-319-12529-9_3

  • DOI: https://doi.org/10.1007/978-3-319-12529-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12528-2

  • Online ISBN: 978-3-319-12529-9

  • eBook Packages: Engineering (R0)
