
A First Parametric SPF

Chapter in The Art of Regression Modeling in Road Safety

Abstract

The process of SPF development favors the gradual buildup of the model equation. In this chapter, Segment Length is the first variable placed into a simple model equation, and its parameters are estimated by minimizing the sum of (weighted) squared differences on a curve-fitting spreadsheet. Two general questions arise: (1) how accurate are the parameter estimates, and (2) can an SPF be used to predict the safety effect of design changes and interventions?


Notes

  1.

    This is commonly attributed to “lurking variables.” A lurking variable is one that is not in the model equation but affects one or more predictor variables and the dependent variable, thereby creating a correlation between them. But “correlation is not causation.” To illustrate, consider the relationship between the shoe sizes of elementary school students and their scores on a standard reading test. While these are correlated, thinking that larger shoe size causes higher reading scores is as absurd as thinking that high reading scores cause larger shoe sizes. Here the lurking variable is age. As the child gets older, both their shoe size and reading ability increase.

  2.

    Even this logical requirement is more of esthetic significance than of practical importance. Very short road segments with very little traffic are of no interest in applications. Therefore, should it turn out that a non-zero intercept improves the fit, the esthetic price may be worth paying.

  3.

    See, e.g., Mensah and Hauer (1998).

  4.

    The one-after-another addition of variables is reminiscent of the “Forward Selection” approach to variable selection, as opposed to “Backward Elimination,” in “Stepwise Regression.” The former starts with no variables in the model, adds the variable that improves the fit most, and repeats this process until no variable improves the fit significantly. The latter begins with all variables in the model and removes, one at a time, the variable whose deletion harms the fit least, until a parsimonious and satisfactory model remains. The hallmark of Stepwise Regression is that it is an automated process; the modeler clicks a button and the software does the rest. The modeling approach advocated here abhors automation. It rests on the belief that building a good SPF requires the exercise of human insight and intelligence.

  5.

    A comment on the difference between β0 and β1 is in order. β1 determines the essential shape of the fitted curve. The role of β0 is different: it merely pre-multiplies whatever expression follows and thereby makes sure that the sum of the predicted accidents is roughly the same as the sum of the observed accidents. In this sense β0 is just a scaling factor. To fix this idea, β0 will be referred to as the “scale parameter” and the other βs as “shape parameters.”

  6.

    To download the spreadsheet, go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chap. 6. First SPF, unweighted and weighted.xls” (or .xlsx).

  7.

    At this early stage of model development, the minimization of the sum of squared differences is the “objective function.” The reasons for this choice are several. First, as already noted, this is a convex objective function and therefore the minimum which the Solver finds is global. For this reason, it is also a good source of initial guesses for other objective functions. Second, it is a popular choice, particularly in econometric modeling. Third, the least squares objective function has a long history, going back to Carl Friedrich Gauss, who is credited with developing it in 1795 (at the age of 18), and to Adrien-Marie Legendre, who published it in 1806. In addition to being “time honored,” this objective function has the merit of yielding the BLUE (best linear unbiased estimator) of the parameters when observation errors have a mean of zero, are uncorrelated, and have equal variances. When the variances are not equal, as in our case, an appropriate weighting can be applied.

  8.

    On the “Data” tab choose “Solver,” in the “Solver Parameters” window choose E4 as “Target Cell,” set the radio button to “Min,” choose D2:E2 as “By Changing Cells” and click “Solve.”
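
The Solver steps above can also be reproduced outside Excel. The sketch below is a minimal stand-in, not the book's spreadsheet: the data are synthetic, and a fine grid over β1 replaces the Solver, exploiting the fact that for the model μ̂ = β0·L^β1 the optimal β0 at any fixed β1 has a closed form.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data standing in for the book's segment file:
# segment lengths (miles) and observed accident counts.
L = rng.uniform(0.1, 2.0, size=200)
mu_true = 1.5 * L ** 0.9                      # assumed "true" SPF
y = rng.poisson(mu_true)

def fit_spf(L, y, b1_grid=np.linspace(0.1, 2.0, 1901)):
    """Minimize the sum of squared differences for muhat = b0 * L**b1.

    For any fixed b1 the optimal b0 has a closed form
    (set d(SSE)/d(b0) = 0): b0 = sum(y * x) / sum(x * x), with x = L**b1.
    A fine grid over b1 then does the Solver's job."""
    best = None
    for b1 in b1_grid:
        x = L ** b1
        b0 = (y * x).sum() / (x * x).sum()
        sse = ((y - b0 * x) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, b0, b1)
    return best[1], best[2]

b0_hat, b1_hat = fit_spf(L, y)
print(b0_hat, b1_hat)
```

Because the counts are Poisson-noisy, the estimates only approximate the parameters used to generate the data; on noise-free data the same routine recovers them exactly.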

  9.

    It is commonly assumed that accident counts are Poisson distributed and Appendix A lists the supporting arguments. For the Poisson distribution, the Mean = Variance. It follows that the accident counts on segments with different means have different variances.

  10.

    When a curve is being fitted to data, it is as if each observed value were pulling the to-be-fitted curve towards itself; the larger the variance of the observed value, the smaller the force with which it pulls. Here is why. Call Xi the observed value for unit i and call its variance Vi. Assume that for units 1 and 2, V1 > V2. Imagine that X2 is the average of n independent observed values whose variance is V1. The question is how many observed values with variance V1 must be averaged so that the average has variance V2. Answer: Because the average is \( \frac{\sum_{i=1}^n X_i}{n} \), the variance of the average is \( \frac{n V_1}{n^2}=\frac{V_1}{n} \). If \( \frac{V_1}{n} \) is to equal V2, it is as if X2 were the average of \( n=\frac{V_1}{V_2} \) observed values with variance V1. That implies that when summing squared residuals the weight for unit 2 should be V1/V2, i.e., in proportion to the reciprocal of its variance. An early reference is Aitken (1935).
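
The weighting argument can be sketched in code. This is a minimal illustration with hypothetical data, not the book's spreadsheet: for Poisson counts the variance equals the mean, so each squared residual is weighted by the reciprocal of the (here, known) mean.

```python
import numpy as np

def fit_spf_weighted(L, y, w, b1_grid=np.linspace(0.1, 2.0, 1901)):
    """Minimize sum_i w_i * (y_i - b0 * L_i**b1)**2 for muhat = b0 * L**b1.

    For fixed b1, setting the derivative with respect to b0 to zero gives
    b0 = sum(w * y * x) / sum(w * x * x), with x = L**b1; a grid covers b1."""
    best = None
    for b1 in b1_grid:
        x = L ** b1
        b0 = (w * y * x).sum() / (w * x * x).sum()
        wsse = (w * (y - b0 * x) ** 2).sum()
        if best is None or wsse < best[0]:
            best = (wsse, b0, b1)
    return best[1], best[2]

# Hypothetical illustration: Poisson counts, so Var = mean, and the
# natural weight for each unit is the reciprocal of its variance.
rng = np.random.default_rng(1)
L = rng.uniform(0.1, 2.0, size=300)
mu = 2.0 * L                      # assumed true SPF: b0 = 2, b1 = 1
y = rng.poisson(mu)
w = 1.0 / mu                      # weights in proportion to 1/variance
b0_w, b1_w = fit_spf_weighted(L, y, w)
print(b0_w, b1_w)
```

In practice the means are unknown and must themselves be replaced by fitted values, which is what motivates the iterative reweighting discussed two notes below.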

  11.

    One could normalize the raw weights to make their sum equal 1 but, since this amounts only to multiplying the sum of weighted squared deviations by a constant, doing so would not affect the minimization.

  12.

    This procedure goes under the name of “Iteratively reweighted least squares.”
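
A minimal sketch of the idea, using synthetic data and a grid-search fitter (both hypothetical stand-ins for the spreadsheet): fit with equal weights, reset each weight to the reciprocal of the fitted value (the Poisson variance), refit, and repeat.

```python
import numpy as np

def irls(L, y, n_iter=10, b1_grid=np.linspace(0.1, 2.0, 191)):
    """Iteratively reweighted least squares for muhat = b0 * L**b1."""
    w = np.ones_like(y, dtype=float)          # start with equal weights
    b0 = b1 = None
    for _ in range(n_iter):
        best = None
        for b1c in b1_grid:                   # grid search over the shape parameter
            x = L ** b1c
            b0c = (w * y * x).sum() / (w * x * x).sum()
            wsse = (w * (y - b0c * x) ** 2).sum()
            if best is None or wsse < best[0]:
                best = (wsse, b0c, b1c)
        _, b0, b1 = best
        # Poisson: variance = mean, so reweight by 1/fitted value
        w = 1.0 / np.maximum(b0 * L ** b1, 1e-6)   # guard against tiny fitted means
    return b0, b1

# Hypothetical usage
rng = np.random.default_rng(7)
L = rng.uniform(0.1, 2.0, size=300)
y = rng.poisson(1.2 * L ** 0.8)
print(irls(L, y))
```

A fixed number of iterations stands in for a convergence test; in practice one would stop once the estimates stop changing.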

  13.

    See Appendix C.

  14.

    When estimating for one population and the difference is negative, it is best to use \( \widehat{V}\left\{\mu \right\}=0 \). However, when estimating for many populations doing so would entail a systematic bias. To avoid this bias, the negative estimates were retained. The striated bands of points come from the use of the integers 0, 1, … in computing the SD.

  15.

    The bandwidth here was 2.

  16.

    Accuracy is the degree of closeness between the measurements of a quantity and its true value (JCGM 2008).

  17.

    2^0.859 = 1.81.

  18.

    Estimates of the Annual Average Daily Traffic are usually produced by “factoring up” counts conducted over a few hours on few summer days. Such counts are conducted once every few years and AADT estimates for the gap years are based on interpolation. How to account for such “errors in variables” will be discussed in Chap. 11.

  19.

    26,600 has been subtracted from the ordinate.

  20.

    This obscure sounding phrase will be explained and applied once the likelihood concept is introduced in Chap. 8.

  21.

    The advantage of estimating parameter accuracy using Monte Carlo simulation is that it can be used even when parameter estimation is not by maximizing likelihood and when there are errors in variables (e.g., in AADT). For detail, see Straume and Johnson (2010).
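
The simulation idea can be sketched briefly. All names and numbers below are illustrative stand-ins for the book's spreadsheet: Poisson counts are regenerated from the fitted means many times, the parameters are re-estimated for each replication, and the spread of the estimates indicates their accuracy.

```python
import numpy as np

def fit(L, y, b1_grid=np.linspace(0.3, 1.7, 141)):
    # Least-squares fit of muhat = b0 * L**b1: coarse grid over b1,
    # closed-form b0 = sum(y*x)/sum(x*x) for each candidate b1.
    best = None
    for b1 in b1_grid:
        x = L ** b1
        b0 = (y * x).sum() / (x * x).sum()
        sse = ((y - b0 * x) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, b0, b1)
    return best[1], best[2]

rng = np.random.default_rng(0)
L = rng.uniform(0.1, 2.0, size=300)
b0_hat, b1_hat = 1.5, 0.9            # pretend these were estimated from data
mu_hat = b0_hat * L ** b1_hat

# Monte Carlo: new Poisson counts from the fitted means, refit, repeat.
reps = [fit(L, rng.poisson(mu_hat)) for _ in range(200)]
b0s, b1s = np.array(reps).T
print(b0s.std(), b1s.std())          # simulated standard errors of the estimates
```

The standard deviations of the replicated estimates play the role of standard errors, without appeal to likelihood theory.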

  22.

    To download this spreadsheet, go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chap. 6. Simulation to determine parameter accuracy.xls” (or .xlsm).

  23.

    The t-statistic is (usually) the estimate of the parameter divided by its standard error. The p-value is the probability of obtaining a test statistic at least as extreme as what was observed when the null hypothesis is true. Thus, e.g., if the null hypothesis in \( E\left\{\mu \right\}={\beta}_0{L}^{\beta_1} \) is that β1 = 1, then obtaining p < 0.01 creates a strong presumption that E{μ} is not proportional to L.
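
To make the definitions concrete, here is a small numeric sketch. The estimate 0.859 and its standard error 0.06 are illustrative numbers, not results from the book, and the two-sided p-value uses the normal approximation rather than the t distribution.

```python
import math

# Hypothetical estimate of the shape parameter and its standard error.
b1_hat, se = 0.859, 0.06

# t-statistic for the null hypothesis beta1 = 1.
t = (b1_hat - 1.0) / se

# Two-sided p-value, normal approximation: p = erfc(|t| / sqrt(2)).
p = math.erfc(abs(t) / math.sqrt(2.0))
print(t, p)
```

Here t is about −2.35 and p is about 0.02, so at the usual 5 % level the proportionality hypothesis β1 = 1 would be rejected.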

  24.

    See Sect. 1.5.

  25.

    Cross-sectional data describe the traits of a collection of units without there being a specific intervention or treatment applied to them at a certain time.

  26.

    Thus, e.g., in the Colorado data segments that are 0.4–0.6 miles long have an average AADT of 2,899 vpd and 31 % are in mountainous terrain while for segments that are 0.9–1.1 miles long the average AADT is 1,767 and only 5 % are in mountainous terrain.

  27.

    Recall that in Sect. 1.2 the axiom was that units identical in all safety-related traits have the same μ.

  28.

    Observational data: data obtained by “observing” the system of interest without subjecting it to interventions.

  29.

    For a review of some historical evidence see Hauer (2005).

  30.

    Woodward is a philosopher with an interest in causation and explanation – a topic that has eluded consensus since Plato. In his book (Woodward 2003) the emphasis is on causation in the context of manipulation and its results. To illustrate, what does the claim that coffee-drinking causes cancer say? According to Woodward, the claim means that you can affect your chances of getting cancer by changing your coffee consumption. Thus, if there is only correlation but no causation – if, for example, the statistical connection is due to some hidden factor, say stress, that causes both coffee-drinking and cancer – then you cannot change your chances of getting cancer by changing your coffee-drinking. What causal information adds to information about correlation is information about manipulability. This kind of causality is particularly suitable for discussion of road safety because, in road safety, the question is how what we do by design and interventions is likely to affect accidents.

  31.

    Fisher (1922, p. 311) defines the principal task of statistics to be “the reduction of data …”

  32.

    The Highway Safety Manual (AASHTO 2010) assumes that the data generating process in Colorado and Louisiana are similar except that due to local factors such as climate or procedures for accident reporting they may differ by a multiplicative “Calibration Factor” (Volume 2, p. C-4).

  33.

    Hooke's law is named after the seventeenth century British physicist Robert Hooke. He first stated this law in 1660 as a Latin anagram, whose solution he published in 1678 as Ut tensio, sic vis; literally translated as: “As the extension, so the force” or the more common meaning “The extension is proportional to the force.”

  34.

    Speaking about those coining and using the criteria for causality in epidemiology, Spirtes et al. (1993) make the following unkind comment: “Neither side understood what uncontrolled studies could and could not determine about causal relations and the effects of interventions. The statisticians pretended to an understanding of causality and correlation they did not have; the epidemiologists resorted to informal and often irrelevant criteria, appeals to plausibility, and in the worst case to ad hominem. … While the statisticians didn’t get the connections between causality and probability right, the epidemiological criteria for causality were an intellectual disgrace, and the level of argument was sometimes more worthy of literary critics than scientists.” (p. 302)

  35.

    An explanation is a set of statements constructed to describe a set of facts, clarifying the causes, context, and consequences of those facts. Explanation and cause seem to be inseparable (see, e.g., Psillos 2002). Unfortunately, statisticians speak about “explained variance” in a manner divorced from causation, forgetting their own dictum that “association is not causation.”

References

  • AASHTO (The American Association of State Highway and Transportation Officials) (2010) Highway safety manual, 1st edn. AASHTO, Washington, DC

  • Aitken AC (1935) On least squares and linear combinations of observations. Proceedings of the Royal Society of Edinburgh 55:42–48

  • Bollen KA, Pearl J (2013) Eight myths about causality and structural equation models. In: Morgan S (ed) Handbook of causal analysis for social research. Springer, New York

  • Davis GA (2014) Crash reconstruction and crash modification factors. Accid Anal Prev 62:294–302

  • Elvik R (2007) Operational criteria of causality for observational road safety evaluation studies. Transp Res Rec 2019:74–81

  • Elvik R (2011) Assessing causality in multivariate accident models. Accid Anal Prev 43:253–264

  • Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philos Transact A Math Phys Eng Sci 222:309–368

  • Freedman D (1997) From association to causation via regression. Adv Appl Math 18:59–110

  • Garber NJ, Gadirau R (1988) Speed variance and its influence on accidents. AAA Foundation for Traffic Safety, Washington, DC

  • Hauer E (2005) Cause and effect in observational cross-section studies on road safety. https://www.researchgate.net/profile/Ezra_Hauer/publications

  • Hauer E (2010) Cause, effect, and regression in road safety: a case study. Accid Anal Prev 42:1128–1135

  • Hauer E, Council FM, Mohammedshah Y (2004) Safety models for urban four-lane undivided road segments. Transp Res Rec 1897:96–105

  • JCGM (Joint Committee for Guides in Metrology) (2008) International vocabulary of metrology – basic and general concepts and associated terms. JCGM 200:2008

  • Mensah A, Hauer E (1998) Two issues of averaging in multivariate modelling. Transp Res Rec 1635:37–43

  • Psillos S (2002) Causation and explanation. McGill-Queen’s University Press, Montreal

  • Simon HA (1953) Causal order and identifiability. In: Hood W, Koopmans T (eds) Studies in econometric method, Cowles Commission Monograph 14. Yale University Press, New Haven, pp 49–74

  • Spirtes P, Glymour C, Scheines R (1993) Causation, prediction and search. Lecture notes in statistics, vol 81. Springer, New York

  • Straume M, Johnson ML (2010) Monte Carlo method for determining complete confidence probability distributions of estimated model parameters. In: Johnson M (ed) Essential numerical computer methods. Academic, New York

  • Woodward J (2003) Making things happen: a theory of causal explanation. Oxford University Press, Oxford


Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Hauer, E. (2015). A First Parametric SPF. In: The Art of Regression Modeling in Road Safety. Springer, Cham. https://doi.org/10.1007/978-3-319-12529-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12528-2

  • Online ISBN: 978-3-319-12529-9
