# Chapter 1: Statistical Models

## Abstract

This chapter introduces the concept of a statistical model. One particular type of statistical model, the generalized linear model, is the focus of this book, so we begin with an introduction to statistical models in general, which lets us establish the necessary language and notation and raise other important issues. We first discuss conventions for describing data mathematically (Sect. 1.2), highlight the importance of plotting data (Sect. 1.3), and explain how to numerically code non-numerical variables (Sect. 1.4) so that they can be used in mathematical models. We then introduce the two components of a statistical model used for understanding data (Sect. 1.5): the systematic and random components. Next comes the class of regression models (Sect. 1.6), which includes all models in this book. We consider model interpretation (Sect. 1.7) and compare physical and statistical models (Sect. 1.8) to highlight their similarities and differences. We state the purpose of a statistical model (Sect. 1.9) and describe the two criteria for evaluating statistical models: accuracy and parsimony (Sect. 1.10). We address the importance of understanding the limitations of statistical models (Sect. 1.11), including the differences between observational and experimental data, and discuss the generalizability of models (Sect. 1.12). Finally, we make some introductory comments about using R for statistical modelling (Sect. 1.13).
