Introduction

Part of the book series: Springer Series in Statistics (SSS)

Abstract

Statistics comprises, among other areas, study design, hypothesis testing, estimation, and prediction. This text addresses the last of these by presenting methods that enable an analyst to develop models that will make accurate predictions of responses for future observations. Prediction could be considered a superset of hypothesis testing and estimation, so the methods presented here will also assist the analyst in those areas. It is worth pausing to explain how this is so.

Notes

  1.

    For example, unadjusted odds ratios from 2 × 2 tables are different from adjusted odds ratios when there is variation in subjects’ risk factors within each treatment group, even when the distribution of the risk factors is identical between the two groups.

  2.

    Simple examples to the contrary are the lesser weight given to a false-negative diagnosis of cancer in the elderly and the aversion of some subjects to surgery or chemotherapy.

  3.

    To make an optimal decision you need to know all relevant data about an individual (used to estimate the probability of an outcome) and the utility (cost, loss function) of making each decision. Sensitivity and specificity do not provide this information. For example, if one estimated that the probability of a disease given age, sex, and symptoms is 0.1, and the “cost” of a false positive equaled the “cost” of a false negative, one would act as if the person does not have the disease, because with equal costs the expected loss is minimized by acting on the more probable state. Given other utilities, one would make different decisions. If the utilities are unknown, one gives the best estimate of the probability of the outcome to the decision maker and lets her incorporate her own unspoken utilities in making an optimum decision for her.

    Besides the fact that cutoffs that are not individualized do not apply to individuals, only to groups, individual decision making does not utilize sensitivity and specificity. For an individual we can compute \(\text{Prob}(Y = 1 \mid X = x)\); we don’t care about \(\text{Prob}(Y = 1 \mid X > c)\), and an individual having \(X = x\) would be quite puzzled if she were given \(\text{Prob}(X > c \mid \text{future unknown } Y)\) when she already knows \(X = x\), so \(X\) is no longer a random variable.

    Even when group decision making is needed, sensitivity and specificity can be bypassed. For mass marketing, for example, one can rank order individuals by the estimated probability of buying the product, to create a lift curve. This is then used to target the k most likely buyers, where k is chosen to meet total program cost constraints.

  4.

    The ROC curve is a plot of sensitivity vs. one minus specificity as one varies a cutoff on a continuous predictor used to make a decision.

  5.

    An exception may be sensitive variables such as income level. Subjects may be more willing to check a box corresponding to a wide interval containing their income. It is unlikely that a reduction in the probability that a subject will inflate her income will offset the loss of precision due to categorization of income, but there will be a decrease in the number of refusals. This reduction in missing data can more than offset the lack of precision.
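
The decision rule described in Note 3 can be illustrated with a small Python sketch. It is not taken from the chapter: the function names (`decide`, `probability_threshold`, `top_k`) are hypothetical, the cost values are made up for illustration, and the sketch assumes that correct decisions carry zero cost, so only false positives and false negatives contribute to expected loss.

```python
from typing import List, Sequence


def probability_threshold(cost_false_positive: float,
                          cost_false_negative: float) -> float:
    """Probability above which acting has lower expected cost than not acting.

    Assumes correct decisions cost nothing (an assumption of this sketch).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(prob: float,
           cost_false_positive: float,
           cost_false_negative: float) -> str:
    """Choose the action with the smaller expected cost for one individual."""
    # Acting is a mistake only when the outcome is absent.
    expected_cost_act = (1.0 - prob) * cost_false_positive
    # Not acting is a mistake only when the outcome is present.
    expected_cost_wait = prob * cost_false_negative
    return "act" if expected_cost_act < expected_cost_wait else "do not act"


def top_k(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k individuals with the highest estimated probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


# Equal costs, estimated probability 0.1: act as if the disease is absent.
print(decide(0.1, cost_false_positive=1.0, cost_false_negative=1.0))

# Hypothetical utilities where a missed diagnosis is 20 times as costly.
print(decide(0.1, cost_false_positive=1.0, cost_false_negative=20.0))

# Group decision: target the 2 most likely buyers by estimated probability.
print(top_k([0.2, 0.9, 0.5, 0.7], k=2))
```

With equal costs the threshold is 0.5, so a probability of 0.1 leads to no action; changing only the utilities changes the decision while the estimated probability stays the same. Neither function uses sensitivity or specificity.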

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Harrell, F.E. (2015). Introduction. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_1
