Binary Logistic Regression

  • Chapter

Part of the book series: Springer Series in Statistics (SSS)

Abstract

Binary responses are commonly studied in many fields. Examples include the presence or absence of a particular disease, death during surgery, or a consumer purchasing a product. Often one wishes to study how a set of predictor variables X is related to a dichotomous response variable Y. The predictors may describe such quantities as treatment assignment, dosage, risk factors, and calendar time. For convenience we define the response to be Y = 0 or 1, with Y = 1 denoting the occurrence of the event of interest. Often a dichotomous outcome can be studied by calculating certain proportions, for example, the proportion of deaths among females and the proportion among males. However, in many situations, there are multiple descriptors, or one or more of the descriptors are continuous. Without a statistical model, studying patterns such as the relationship between age and occurrence of a disease, for example, would require the creation of arbitrary age groups to allow estimation of disease prevalence as a function of age.

Notes

  1. The general formula for the sample size required to achieve a margin of error of δ in estimating a true probability of θ at the 0.95 confidence level is \(n = \left(\frac{1.96}{\delta}\right)^{2} \times \theta(1-\theta)\). Set \(\theta = \frac{1}{2}\) (intercept = 0) for the worst case.

  2. The R code can easily be modified for other event frequencies, or the minimum of the number of events and non-events for a dataset at hand can be compared with \(\frac{n}{2}\) in this simulation. An average maximum absolute error of 0.05 corresponds roughly to a half-width of the 0.95 confidence interval of 0.1.

  3. In the wireframe plots that follow, predictions for cholesterol–age combinations for which fewer than 5 exterior points exist are not shown, so as to not extrapolate to regions not supported by at least five points beyond the data perimeter.

  4. Note that D and B (below) and other indexes not related to c (below) do not work well in case-control studies because of their reliance on absolute probability estimates.
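The sample size formula in note 1 can be evaluated directly. The following is a minimal Python sketch, not code from the chapter: the function names `sample_size_for_margin` and `intercept_to_probability` are hypothetical, and rounding the result up to the next whole observation is an assumption added here. The second function only illustrates why an intercept of 0 in an intercept-only logistic model corresponds to θ = 1/2.

```python
import math

Z_95 = 1.96  # normal critical value for a two-sided 0.95 confidence level


def sample_size_for_margin(delta: float, theta: float = 0.5, z: float = Z_95) -> int:
    """Evaluate n = (z / delta)^2 * theta * (1 - theta), rounded up.

    delta is the desired margin of error and theta the true probability.
    theta = 0.5 maximizes theta * (1 - theta), so it is the worst case.
    Rounding up is an assumption of this sketch, not part of the formula.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie between 0 and 1")
    n = (z / delta) ** 2 * theta * (1.0 - theta)
    return math.ceil(n)


def intercept_to_probability(intercept: float) -> float:
    """Inverse logit: probability implied by an intercept-only logistic model."""
    return 1.0 / (1.0 + math.exp(-intercept))


if __name__ == "__main__":
    # Worst case: intercept 0 gives theta = 0.5.
    theta_worst = intercept_to_probability(0.0)
    for delta in (0.1, 0.05):
        print(delta, sample_size_for_margin(delta, theta_worst))
```

With θ = 1/2 the unrounded values are 96.04 for δ = 0.1 and 384.16 for δ = 0.05, so the sketch returns 97 and 385. Any other value of θ yields a smaller requirement for the same δ.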

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Harrell, F.E. (2015). Binary Logistic Regression. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_10
