Linear Regression Methods

Abstract

Linear regression is a broad and well-developed area of statistics. If there is a core to statistical methodology, then linear regression is it. The ubiquity of linear regression methods in statistics and data analytics stems from the ease with which one may fit tractable models that describe the primary features of a process or population. Not only is linear regression useful for description, it’s also very useful for prediction since the models often provide good approximations of complex relationships. In the field of statistics, hypothesis testing and confidence intervals are routinely used in linear regression analyses. The extension of these methods to data science is often unsuccessful because of the prevalence of opportunistically collected data. Most of the time, opportunistically collected data cannot support inferential methods because the quality of the inferences produced by the methods is unknown. We discuss inference herein so that the reader may understand the potential for success and for failure of these methods. However, the focus is on the essential and most useful aspects of the subject matter for data analytics—the fitted models. The topic of linear regression provides an avenue to gain experience with the statistical package R, one of the most popular software packages used by data scientists.

Notes

  1. A Python script for this purpose will be written in Chap. 11, Sect. 11.10.

  2. This is to be expected since the fitted model is the equation of a line.

  3. The fitted model is an equation describing a plane.

  4. If X is not full rank, then the optimality statement needs to be modified.

  5. Exercise 3.3.7 guides the reader through the derivation.

  6. The analyst often believes H_a to be correct.

  7. Computationally, the test is easy to execute.

  8. The stipulation that the language is sophisticated eliminates Excel as a platform for statistical analysis.

  9. Body weight is also used in a calculation that incorporates differences in the density of muscle, bone, and fat.

  10. Visceral fat is located in the abdominal cavity.

  11. Boxplots were discussed briefly in Chap. 3, Sect. 3.6.1.

  12. Netball is played only by women.

  13. The tutorial of Sect. 6.4 used a data set that was loaded into the object ais when the DAAG library was invoked with the library command.

  14. Strings are delimited by the end-of-line character at the end of each record.

  15. With our particular version of the complaint file, there are a variety of other attributes in the second position of the record.

  16. R does recognize the variable sport as a factor so no action is needed. A variable x can be converted to a factor using the function call x=as.factor(x).

  17. The evidence supports the statement.

  18. For example, the reference level may be a control group in an experiment.

  19. If the fitted values are computed using a different prediction function (e.g., k-nearest neighbors regression), then the sample mean of the observed values may not equal the sample mean of the fitted values. In that situation, for simplicity, we advocate computing the measure of fit as the squared correlation between the fitted and observed values.

  20. The models were fit using the same number of observations: n = 10,886.

  21. The administrators of the contest used the held-out data as a test set with which to objectively evaluate the predictions made by the contestants. We think that the hold-out period (10 days) is too long. A better test of predictive accuracy would use shorter periods (3 days or less) randomly interspersed in the time series.

  22. Interaction was discussed in Sect. 6.6.1.

  23. The aphorism "all models are wrong, but some are useful" is due to G.E.P. Box.
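
Note 19 contrasts two ways of measuring fit. The sketch below illustrates the distinction under stated assumptions: the measure of fit is taken to be the usual coefficient of determination, 1 - SSE/SST, the model is a one-predictor least squares line with an intercept, and the data and function names are made up for illustration rather than taken from the chapter. For such a fit the fitted values have the same sample mean as the observed values, and the coefficient of determination coincides with the squared correlation between fitted and observed values. With other prediction functions the means may differ, which is the situation where the note recommends the squared correlation.

```python
def mean(values):
    """Sample mean of a sequence of numbers."""
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def squared_correlation(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab ** 2 / (saa * sbb)


def r_squared(observed, fitted):
    """Coefficient of determination computed as 1 - SSE / SST."""
    y_bar = mean(observed)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(observed, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in observed)
    return 1 - sse / sst


# Made-up illustrative data (an assumption, not from the chapter).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]

print("mean observed:", mean(y), "mean fitted:", mean(fitted))
print("1 - SSE/SST:", r_squared(y, fitted))
print("squared correlation:", squared_correlation(y, fitted))
```

The squared correlation depends only on how closely the fitted values track the observed values linearly, so it remains a well-defined measure when the fitted values come from a method that does not preserve the sample mean.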

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Steele, B., Chandler, J., Reddy, S. (2016). Linear Regression Methods. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_6

  • DOI: https://doi.org/10.1007/978-3-319-45797-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45795-6

  • Online ISBN: 978-3-319-45797-0

  • eBook Packages: Computer Science, Computer Science (R0)
