Linear Regression Methods

Steele, Brian; Chandler, John; Reddy, Swarna

doi:10.1007/978-3-319-45797-0_6

Abstract

Linear regression is a broad and well-developed area of statistics. If there is a core to statistical methodology, then linear regression is it. The ubiquity of linear regression methods in statistics and data analytics stems from the ease with which one may fit tractable models that describe the primary features of a process or population. Not only is linear regression useful for description, it’s also very useful for prediction since the models often provide good approximations of complex relationships. In the field of statistics, hypothesis testing and confidence intervals are routinely used in linear regression analyses. The extension of these methods to data science is often unsuccessful because of the prevalence of opportunistically collected data. Most of the time, opportunistically collected data cannot support inferential methods because the quality of the inferences produced by the methods is unknown. We discuss inference herein so that the reader may understand the potential for success and for failure of these methods. However, the focus is on the essential and most useful aspects of the subject matter for data analytics—the fitted models. The topic of linear regression provides an avenue to gain experience with the statistical package R, one of the most popular software packages used by data scientists.

Notes

1.
A Python script for this purpose will be written in Chap. 11, Sect. 11.10
2.
This is to be expected since the fitted model is the equation of a line.
3.
The fitted model is an equation describing a plane.
4.
If X is not full rank, then the optimality statement needs to be modified.
5.
Exercise 3.3.7 guides the reader through the derivation.
6.
The analyst often believes H_a to be correct.
7.
Computationally, the test is easy to execute.
8.
The stipulation that the language is sophisticated eliminates Excel as a platform for statistical analysis.
9.
Body weight is also used in a calculation that incorporates differences in the density of muscle, bone, and fat.
10.
Visceral fat is located in the abdominal cavity.
11.
Boxplots were discussed briefly in Sect. 3.6.1.
12.
Netball is played only by women.
13.
The tutorial of Sect. 6.4 used a data set that was loaded into the object ais when the DAAG library was invoked with the library command.
14.
Strings are delimited by the end-of-line character at the end of each record.
15.
With our particular version of the complaint file, a variety of other attributes appear in the second position of the record.
16.
R already recognizes the variable sport as a factor, so no action is needed. A variable x can be converted to a factor using the function call x = as.factor(x).
17.
The evidence supports the statement.
18.
For example, the reference level may be a control group in an experiment.
19.
If the fitted values are computed using a different prediction function (e.g., k-nearest neighbors regression), then the sample mean of the observed values may not equal the sample mean of the fitted values. In that situation, we advocate computing the measure of fit as the squared correlation between fitted and observed values, for simplicity.
20.
The models were fit using the same number of observations: n = 10,886.
21.
The administrators of the contest used the held-out data as a test set with which to objectively evaluate the predictions made by the contestants. We think that the hold-out period (10 days) is too long. A better test of predictive accuracy would use shorter periods (3 days or less) randomly interspersed in the time series.
22.
Interaction was discussed in Sect. 6.6.1.
23.
The aphorism "all models are wrong, but some are useful" is due to G.E.P. Box.
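
Note 19 can be illustrated with a short sketch. It is not taken from the chapter: the data are made up, the function names are hypothetical, and Python is used only because Note 1 mentions a Python script. For a least squares fit with an intercept, the squared correlation between fitted and observed values coincides with the usual 1 - SSE/SST measure. For a k-nearest neighbors fit, the mean of the fitted values need not equal the mean of the observed values, and the squared correlation remains a simple measure of fit.

```python
from statistics import mean


def squared_correlation(observed, fitted):
    """Squared Pearson correlation between observed and fitted values."""
    y_bar, f_bar = mean(observed), mean(fitted)
    s_yf = sum((y - y_bar) * (f - f_bar) for y, f in zip(observed, fitted))
    s_yy = sum((y - y_bar) ** 2 for y in observed)
    s_ff = sum((f - f_bar) ** 2 for f in fitted)
    return s_yf ** 2 / (s_yy * s_ff)


def least_squares_fitted(x, y):
    """Fitted values from a simple linear regression of y on x (with intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * a for a in x]


def knn_fitted(x, y, k=2):
    """Fitted values from k-nearest neighbors regression.

    Each point counts as one of its own neighbors (an assumption made
    here to keep the sketch short).
    """
    fitted = []
    for a in x:
        nearest = sorted(range(len(x)), key=lambda i: abs(x[i] - a))[:k]
        fitted.append(mean(y[i] for i in nearest))
    return fitted


# Made-up illustrative data, not from the chapter.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 7.0]

ols = least_squares_fitted(x, y)
knn = knn_fitted(x, y, k=2)

# Usual coefficient of determination for the least squares fit.
sst = sum((v - mean(y)) ** 2 for v in y)
r2_ols = 1 - sum((v - f) ** 2 for v, f in zip(y, ols)) / sst

print("least squares: 1 - SSE/SST =", r2_ols)
print("least squares: squared correlation =", squared_correlation(y, ols))
print("k-NN: mean observed =", mean(y), "mean fitted =", mean(knn))
print("k-NN: squared correlation =", squared_correlation(y, knn))
```

The two least squares figures agree up to rounding, while the k-nearest neighbors fit shows the mismatch in means that motivates the squared-correlation measure.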

Author information

Authors and Affiliations

University of Montana, Missoula, MT, USA
Brian Steele
School of Business Administration, University of Montana, Missoula, MT, USA
John Chandler
SoftMath Consultants, LLC, Missoula, MT, USA
Swarna Reddy

About this chapter

Cite this chapter

Steele, B., Chandler, J., Reddy, S. (2016). Linear Regression Methods. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_6

DOI: https://doi.org/10.1007/978-3-319-45797-0_6
Published: 27 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45795-6
Online ISBN: 978-3-319-45797-0
eBook Packages: Computer Science, Computer Science (R0)