Adding Variables

Chapter in: The Art of Regression Modeling in Road Safety

Abstract

This chapter asks whether to add a variable to the model equation and, if so, how. The key concept introduced is bias-in-use: the bias that arises when the user has information about the value of a safety-related variable which is not in the model equation. The purpose of adding a variable to the model equation is to reduce the bias-in-use. To determine whether adding a variable would reduce the bias-in-use, a Variable Introduction Exploratory Data Analysis (VIEDA) is conducted. Adding a variable to the model equation requires a modification of the C-F spreadsheet, following which all parameters are reestimated. This is done for the variables AADT, Terrain, and Year. The use of the Negative Multinomial likelihood function with panel data is demonstrated.

Notes

  1.

    Narrowly construed, model specification in regression “…refers to the determination of which independent variables should be included in or excluded from a regression equation” (Allen 1997, p. 166). More broadly, the process of model specification consists of selecting an appropriate functional form and of choosing the predictor variables. Common errors of model specification are (1) choosing an incorrect functional form, (2) omitting predictor variables which have a relationship with the dependent variable, (3) including irrelevant predictor variables, and (4) assuming that the predictor variables are measured without error. If an estimated model is misspecified, it will produce biased and inconsistent estimates. (An estimator is consistent if, as the number of observations increases, the estimates get closer and closer to the true value.)

  2.

    See e.g., Gross et al. (2013, p. 236).

  3.

    The bias-in-use is not what in statistics is called the “omitted-variable bias” (OVB). The OVB is a bias in the parameters which occurs when a model incorrectly leaves out one or more important causal variables. The bias-in-use pertains to the estimate of E{μ} and arises when the variables in the SPF do not match the known safety-related traits of the unit or population whose E{μ} is of interest. From the “research” perspective it is the OVB that matters; from the “applications” perspective it is the bias-in-use that is important.

  4.

    The following may serve as a succinct definition of bias-in-use: By “obvious observation 1” in Sect. 3.4 if A and B are variables and B is safety-related then E{μ|A and B} ≠ E{μ|A}. The “bias-in-use” is the difference E{μ|A and B} − E{μ|A}.

  5.

    Situations in which variables absent from the model equation give cause for concern are commonly encountered and familiar to practitioners. To illustrate, the practitioner will sense that something is amiss when, say, the road segment of interest is a sag curve preceded by a long steep grade but grade and vertical curvature variables are not in the SPF.

  6.

    See e.g., Zegeer et al. (1988) and Harwood et al. (2000), Appendix D.

  7.

    See Sect. 1.2.

  8.

    Using the numerical results of the EDA as illustration it was easy to state as “Obvious Observation 3” that “The larger is the number of traits that define a real population, the fewer are the observations from which its E{μ} is estimated and the larger tends to be the standard error of the estimate of E{μ}.”

  9.

    Consider a multiplicative model equation in which expression A accounts for the contribution of variables \( X_1, X_2, \ldots, X_n \) and expression B accounts for the contribution of a new variable \( X_{n+1} \). Because of the uncertainties in the X’s and their parameters, both A and B are random variables. Let σ{A} and σ{B} be their standard deviations and ρ{A, B} their correlation coefficient. With this, \( {\left(\frac{\sigma \left\{ AB\right\}}{AB}\right)}^2={\left(\frac{\sigma \left\{A\right\}}{A}\right)}^2+{\left(\frac{\sigma \left\{B\right\}}{B}\right)}^2+2\frac{\sigma \left\{A\right\}\sigma \left\{B\right\}}{AB}\rho \left\{A,B\right\} \). If A and B are uncorrelated, then \( {\left(\sigma \left\{AB\right\}\right)}^2={B}^2{\left(\sigma \left\{A\right\}\right)}^2+{A}^2{\left(\sigma \left\{B\right\}\right)}^2 \). If expression B, which accounts for the effect of the added uncorrelated variable, has an average value of 1, the consequence of adding it to the model is to increase the variance by \( {A}^2{\left(\sigma \left\{B\right\}\right)}^2 \). While it is tempting to assume that A and B are uncorrelated, this is seldom so. The addition of B will always change the scale parameter, which is part of A. It is therefore difficult to foresee how the addition of a variable will affect \( V\left\{\widehat{E}\left\{\mu \right\}\right\} \).

  10.

    Lehmann (1990, p. 162) suggests using the average squared prediction error saying that: “The best fitting model is of course always the one that includes the largest number of variables. However, this difficulty can be overcome …, for example by selecting the dimension k which minimizes … the expected squared difference between the next observation and its best prediction from the model. This measure has two components: E(squared prediction error) = (squared bias) + (variance). As the dimension k of the model increases, the bias will decrease. At the same time the variance will tend to increase since the need to estimate a larger number of parameters will increase the variability of the estimator.”

  11.

    How to work with Pivot Tables was explained in Sect. 3.3. To download this spreadsheet go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chapter 9. VIEDA for AADT.xls or xlsm.”

  12.

    A new variable is often introduced into the model equation by a function that multiplies the previous expression. In this case the Observed/Fitted ratio is of interest. When the new variable is introduced into the model equation by an additive function, “case b” in Hauer (2004) should be used.

  13.

    See Sect. 4.3.

  14.

    The didactic reasons are two. First, the power function \( X^{\beta} \) is widely used and, as such, can be a benchmark against which to judge alternative functions to be examined later. Second, since the power function starts at the origin but the VIEDA suggests the presence of an intercept, the use of the power function might be reflected in the CURE plot and thereby illustrate its diagnostic use.

  15.

    To download this spreadsheet go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chapter 9. NB fit for L and AADT and CURE plots.xls or xlsx.”

  16.

    Keeping the factors in separate columns makes experimentation with alternative functional forms easier.

  17.

    When running Solver, the parameter estimates from the top part of Fig. 9.5 were used as initial guesses, except that 0.001 was used for \( \beta_0 \). Because the parameters now differ by several orders of magnitude, it is important to activate the “automatic scaling” option in Solver. See the discussion in Sect. 5.4.

  18.

    It was said in Sect. 6.7.1 that: “… should at a later stage in modeling the estimate of \( \beta_1 \) become indistinguishable from 1 that would be a sign that L reflects only the influence of segment length and that the influence of other safety-related variables has been satisfactorily accounted for.” Now, after the introduction of AADT, it seems that \( \beta_1 \) may indeed be fluctuating around 1. Should it remain so after more variables are added, it may then be best to set the exponent of L to 1.

  19.

    In Sect. 6.6.1 the 95% confidence interval was 0.82–0.91.

  20.

    The omitted-variable bias occurs when a model leaves out one or more safety-relevant variables. The bias pertains to estimates of parameters. When statistical packages report the standard error of parameter estimates they do so under the grossly unrealistic assumption that all causal variables are in the model equation and that the model equation is an exact representation of the relationship underlying the data. For these reasons the usually reported estimates of parameter accuracy cannot be trusted.

  21.

    The other source discussed earlier was the choice of objective function (see Table 8.1). For a discussion of “modeling uncertainty” see Sect. 6.6.2.

  22.

    In road safety, correlation between traits (variables) is a pervasive part of reality. One only has to think about the role played by traffic volume in determining the traits of roads and intersections. In our data the correlation between Segment Length and AADT is −0.17; segments with more traffic tend to be shorter. Even this modest correlation is sufficient to cause a sizeable modeling uncertainty about the parameter of L.

  23.

    In curve-fitting this phenomenon goes under the name collinearity. When variables are highly correlated the regression parameters change erratically in response to small changes in the model or the data.

  24.

    Why such use of SPFs is problematic was discussed at length in Sect. 6.7.

  25.

    See, e.g., Wood (2005), Lord (2008) and Lord et al. (2010).

  26.

    In spite of the reservations discussed in Sect. 6.7.

  27.

    Recall that \( V\left\{\mu \right\}={\left(E\left\{\mu \right\}\right)}^2/\left(bL\right) \).

  28.

    To download this spreadsheet go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chapter 9. NB fit for L and AADT and CURE plots.xls or xlsx.”

  29.

    The examination of alternative functional forms will be the subject of Chap. 10.

  30.

    For example, if these low-AADT segments are mostly in mountainous terrain where, with the same L and AADT, more accidents are predicted than in, say, flat terrain.

  31.

    In terms of computation, all that is required is to sort by the “fitted value” (see, e.g., column D in Fig. 7.3) instead of sorting by a variable value (column B in the same figure).

  32.

    See Fig. 3.1.

  33.

    To download this spreadsheet go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chapter 9. Terrain Pivot.xls or xlsx.”

  34.

    The question of how the influence of a variable should be represented in the model equation will be discussed in Chap. 10. Thus, e.g., one of the EDA findings in Sect. 3.6 was that the effect of Terrain might depend on both AADT and Segment Length. The question will be how such a dependence can be captured.

  35.

    As already noted, in the course of modeling the modeler has to make various assumptions. In the course of the continuing illustration it was assumed that the model equation is a product of power functions, that maximizing likelihood is a good objective, that accident counts are Poisson distributed, that the μ’s are Gamma distributed, etc. Now another assumption is added: that the influence on E{μ} of L and AADT is the same in every terrain. There is, of course, no prior support for such an assumption. Its validity can be examined by fitting separate models for each terrain.

  36.

    To download this spreadsheet go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chapter 9. NB fit with L, AADT and Terrain as multiplier.xls or xlsx.”

  37.

    A segment in rolling terrain is estimated to have 1.642 times the number of accidents of a segment of the same length and same AADT in flat terrain; the corresponding multiplier for mountainous terrain is 2.495. By the numbers in Table 9.1 the corresponding implied multipliers are 0.87/0.58 = 1.5 and 1.31/0.58 = 2.3.

  38.

    See Sect. 3.2.

  39.

    See Sect. 3.2 and Fig. 3.2.

  40.

    See Appendix E.

  41.

    Over the 13-year period 1986–1998, the AADT on rural two-lane roads in Colorado increased, on average, by about 55%.

  42.

    See Sects. 8.3.2, 8.3.3 and Appendices E, G.

  43.

    With a slight compromise of purity it is possible to use the NB likelihood even with panel data. This approach will be used in the simulations of Chap. 11 where saving on computations matters. To download the spreadsheet where the NM likelihood was used go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chapter 9. NM, 13 year panel, L, AADT, Terrain, and a common scale parameter.xls or xlsx.” To download the spreadsheet where the NB likelihood was used look for “Chapter 9. NB, 13-year panel.xls or xlsx.”

  44.

    See Sect. 5.4 for detail.

  45.

    The difference in \( b \) is due to the transition from L as segment length to the normalized \( X_1 \).

  46.

    Section 8.2.1 and Charnes et al. (1976).

  47.

    To download this spreadsheet go to http://extras.springer.com/ and enter the ISBN of this book. Look in the “Spreadsheets” folder for “Chapter 9. Alternative objectives, 13-year panel, L, AADT, Terrain, and common scale paramete.xls or xlsx.”

  48.

    See Sect. 6.6.1.

  49.

    See e.g., Charnes et al. (1976).

  50.

    See Sect. 4.2.

  51.

    Algebraic operations are: addition, subtraction, multiplication, division and exponentiation.
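
As a numerical illustration of the variance expressions in note 9, the following minimal sketch evaluates the variance of the product AB from the standard deviations of A and B and their correlation. The function name `product_variance` and all numbers are hypothetical, chosen only to show the arithmetic; they are not taken from the chapter's spreadsheets.

```python
import math


def product_variance(a, sd_a, b, sd_b, rho=0.0):
    """Variance of the product A*B using the expression stated in note 9.

    a, b       : values of the expressions A and B
    sd_a, sd_b : their standard deviations
    rho        : correlation coefficient between A and B
    """
    relative_variance = (
        (sd_a / a) ** 2
        + (sd_b / b) ** 2
        + 2 * sd_a * sd_b * rho / (a * b)
    )
    return (a * b) ** 2 * relative_variance


# Hypothetical values, for illustration only.
a, sd_a = 2.0, 0.2   # expression A: variables already in the model
b, sd_b = 1.0, 0.1   # expression B: added variable with average value 1

# Uncorrelated case: B^2 * sd_a^2 + A^2 * sd_b^2
var_uncorrelated = product_variance(a, sd_a, b, sd_b)

# Increase over the variance of A alone; equals A^2 * sd_b^2 when B = 1
increase = var_uncorrelated - sd_a ** 2

# A negative correlation can offset the increase
var_correlated = product_variance(a, sd_a, b, sd_b, rho=-0.5)

print(var_uncorrelated, increase, var_correlated)
```

With these assumed inputs the uncorrelated case adds A²(σ{B})² to the variance, while a correlation of −0.5 happens to cancel that increase, which is one way to see why the net effect of adding a variable is hard to foresee.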

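The computational remark in note 31 (sort by the fitted value instead of by a variable value) can be sketched as follows. This is a rough illustration assuming that the CURE ordinate is the running sum of residuals (observed minus fitted); the function name `cumulative_residuals` and the data are hypothetical and do not reproduce the columns of Fig. 7.3.

```python
def cumulative_residuals(observed, fitted, sort_key):
    """Accumulate residuals (observed - fitted) in increasing order of sort_key."""
    order = sorted(range(len(observed)), key=lambda i: sort_key[i])
    running_total = 0.0
    cure = []
    for i in order:
        running_total += observed[i] - fitted[i]
        cure.append(running_total)
    return cure


# Hypothetical segment data, for illustration only.
aadt = [500, 1200, 800, 3000, 2000]
observed = [1, 2, 0, 5, 3]
fitted = [0.8, 2.7, 0.9, 4.0, 2.6]

# Cumulative residuals ordered by a variable value (here AADT)
cure_by_aadt = cumulative_residuals(observed, fitted, aadt)

# Cumulative residuals ordered by the fitted value: only the sort key changes
cure_by_fitted = cumulative_residuals(observed, fitted, fitted)

print(cure_by_aadt)
print(cure_by_fitted)
```

Only the sort key differs between the two calls, so the same residuals are accumulated in a different order and both curves end at the same total.
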
References

  • Allen MP (1997) Understanding regression analysis. Springer, New York

  • Charnes A, Frome EL, Yu PL (1976) The equivalence of generalized least squares and maximum likelihood estimates in the exponential family. J Am Stat Assoc 71(353):169–171

  • Draper N, Smith H (1981) Applied regression analysis, 2nd edn. Wiley, New York

  • Gross F, Craig L, Persaud B, Srinivasan R (2013) Safety effectiveness of converting signalized intersections to roundabouts. Accid Anal Prev 50:234–241

  • Harwood DW, Council FM, Hauer E, Hughes WE, Vogt A (2000) Prediction of the expected safety performance of rural two-lane highways. FHWA-RD-99-207. Office of Safety Research and Development, Federal Highway Administration, Washington, DC

  • Hauer E (2004) Statistical safety modelling. Transportation Research Record 1897. National Academies Press, Washington, DC, pp 81–87

  • Lehmann EL (1990) Model specification: the views of Fisher and Neyman, and later developments. Stat Sci 5(2):160–168

  • Lord D (2008) Methodology for estimating the variance and confidence intervals for the estimate of the product of baseline models and AMFs. Accid Anal Prev 40:1013–1017

  • Lord D, Kuo P-F, Geedipally SR (2010) Comparison of application of product of baseline models and accident-modification factors and models with covariates. Transportation Research Record 2147, pp 113–122

  • Wood GR (2005) Confidence and prediction intervals for generalized linear accident models. Accid Anal Prev 37:267–273

  • Zegeer CVD, Reinfurt DW, Hummer J, Herf L, Hunter W (1988) Safety effects of cross-section design. Transportation Research Record 1195. National Academies Press, Washington, DC, pp 20–32

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Hauer, E. (2015). Adding Variables. In: The Art of Regression Modeling in Road Safety. Springer, Cham. https://doi.org/10.1007/978-3-319-12529-9_9

  • DOI: https://doi.org/10.1007/978-3-319-12529-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12528-2

  • Online ISBN: 978-3-319-12529-9

  • eBook Packages: Engineering, Engineering (R0)
