Missing Data

  • Chapter
Handbook of Epidemiology

Abstract

The problem of dealing with missing values is common throughout statistical work and is present whenever human subjects are enrolled. Respondents may refuse to participate or may be unreachable. Patients in clinical and epidemiological studies may withdraw their initial consent without further explanation. Early work on missing values was largely concerned with algorithmic and computational solutions to the induced lack of balance or deviations from the intended study design (Afifi and Elashoff 1966; Hartley and Hocking 1971). More recently, general algorithms such as Expectation-Maximization (EM) (Dempster et al. 1977) and data imputation and augmentation procedures (Rubin 1987; Tanner and Wong 1987), combined with powerful computing resources, have largely provided a solution to this aspect of the problem. There remains the very difficult and important question of assessing the impact of missing data on subsequent statistical inference. Conditions can be formulated under which an analysis that proceeds as if the data were missing by design, that is, one that ignores the missing value process, can provide valid answers to study questions. While such an approach is attractive from a pragmatic point of view, the difficulty is that such conditions can rarely be assumed to hold with full certainty. Indeed, assumptions will be required that cannot be assessed from the data under analysis. Hence, in this setting, there cannot be anything that could be termed a definitive analysis, and any preferred analysis should ideally be supplemented with a so-called sensitivity analysis.
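The point that ignoring the missing value process is valid only under certain conditions can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the chapter: the function names, the drop-out probabilities, and the simple shift-based ("delta") sensitivity calculation are all assumptions chosen for illustration. It compares the complete-case mean when values are dropped completely at random with the mean when the chance of being missing depends on the unobserved value itself.

```python
import math
import random
import statistics


def simulate(n=20000, seed=1):
    """Draw standard normal outcomes and two hypothetical missingness indicators."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Assumed mechanism 1: every value is dropped with the same probability 0.3,
    # independently of the outcome.
    miss_random = [rng.random() < 0.3 for _ in y]
    # Assumed mechanism 2: larger outcomes are more likely to be dropped
    # (logistic in the possibly unobserved value itself).
    miss_dependent = [rng.random() < 1.0 / (1.0 + math.exp(-2.0 * v)) for v in y]
    return y, miss_random, miss_dependent


def complete_case_mean(y, missing):
    """Mean of the observed values only, ignoring the missing value process."""
    return statistics.fmean(v for v, m in zip(y, missing) if not m)


def delta_adjusted_mean(y, missing, delta):
    """Overall mean if the missing values are assumed to average
    (observed mean + delta); delta is an untestable sensitivity parameter."""
    observed = [v for v, m in zip(y, missing) if not m]
    obs_mean = statistics.fmean(observed)
    n_missing = len(y) - len(observed)
    return (sum(observed) + n_missing * (obs_mean + delta)) / len(y)


if __name__ == "__main__":
    y, miss_random, miss_dependent = simulate()
    print("true mean (by construction): 0.0")
    print("complete cases, random drop-out:   ", complete_case_mean(y, miss_random))
    print("complete cases, dependent drop-out:", complete_case_mean(y, miss_dependent))
    for delta in (0.0, 0.5, 1.0, 1.5):
        print("delta =", delta, "->", delta_adjusted_mean(y, miss_dependent, delta))
```

In this sketch the first complete-case mean stays close to the true value of zero, while the second is pulled downward. The observed data alone cannot show which mechanism operated, which is why the loop reports the estimate over a range of assumed shifts rather than a single answer.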

References

  • Aerts M, Geys H, Molenberghs G, Ryan LM (2002) Topics in Modelling of Clustered Binary Data. Chapman & Hall, London

  • Afifi A, Elashoff R (1966) Missing observations in multivariate statistics I: Review of the literature. Journal of the American Statistical Association 61:595–604

  • Amemiya T (1984) Tobit models: a survey. Journal of Econometrics 24:3–61

  • Ashford JR, Sowden RR (1970) Multi-variate probit analysis. Biometrics 26:535–546

  • Baker SG (1995) Marginal regression for repeated binary data with outcome subject to non-ignorable non-response. Biometrics 51:1042–1052

  • Bahadur RR (1961) A representation of the joint distribution of responses to n dichotomous items. In: Solomon H (ed) Studies in Item Analysis and Prediction Stanford Mathematical Studies in the Social Sciences VI. Stanford University Press, Stanford CA

  • Beckman RJ, Nachtsheim CJ, Cook RD (1987) Diagnostics for mixed-model analysis of variance. Technometrics 29:413–426

  • Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88:9–25

  • Buck SF (1960) A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society Series B 22:302–306

  • Chatterjee S, Hadi AS (1988) Sensitivity Analysis in Linear Regression. John Wiley & Sons, New York

  • Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18

  • Cook RD (1979) Influential observations in linear regression. Journal of the American Statistical Association 74:169–174

  • Cook RD (1986) Assessment of local influence. Journal of the Royal Statistical Society Series B 48:133–169

  • Cook RD, Weisberg S (1982) Residuals and Influence in Regression. Chapman & Hall, London

  • Dale JR (1986) Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics 42:909–917

  • Dempster AP, Rubin DB (1983) Overview. In: Madow WG, Olkin I, Rubin DB (eds) Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography. Academic Press, New York, pp 3–10

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B 39:1–38

  • Diggle PJ, Kenward MG (1994) Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43:49–93

  • Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002) Analysis of Longitudinal Data. Oxford University Press, New York

  • Draper D (1995) Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society Series B 57:45–97

  • Ekholm A (1991) Algorithms versus models for analyzing data that contain misclassification errors. Biometrics 47:1171–1182

  • Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models. Springer-Verlag, Heidelberg

  • Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995) Regression models for longitudinal binary responses with informative dropouts. Journal of the Royal Statistical Society Series B 57:691–704

  • Fitzmaurice GM, Heath G, Clifford P (1996a) Logistic regression models for binary panel data with attrition. Journal of the Royal Statistical Society Series A 159:249–264

  • Fitzmaurice GM, Laird NM, Zahner GEP (1996b) Multivariate logistic models for incomplete binary response. Journal of the American Statistical Association 91:99–108

  • George EO, Bowman D (1995) A saturated model for analyzing exchangeable binary data: Applications to clinical and developmental toxicity studies. Journal of the American Statistical Association 90:871–879

  • Geys H, Molenberghs G, Lipsitz SR (1998) A note on the comparison of pseudolikelihood and generalized estimating equations for marginal odds ratio models. Journal of Statistical Computation and Simulation 62:45–72

  • Glonek GFV, McCullagh P (1995) Multivariate logistic models. Journal of the Royal Statistical Society Series B 81:477–482

  • Goss PE, Winer EP, Tannock IF, Schwartz LH, Kremer AB (1999) Breast cancer: randomized phase III trial comparing the new potent and selective third-generation aromatase inhibitor vorozole with megestrol acetate in postmenopausal advanced breast cancer patients. Journal of Clinical Oncology 17:52–63

  • Hartley HO, Hocking R (1971) The analysis of incomplete data. Biometrics 27:783–808

  • Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5:475–492

  • Hogan JW, Laird NM (1997) Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16:239–258

  • Kenward MG, Molenberghs G (1998) Likelihood based frequentist inference when data are missing at random. Statistical Science 12:236–247

  • Kenward MG, Molenberghs G, Thijs H (2003) Pattern-mixture models with proper time dependence. Biometrika 90:53–71

  • Laird NM (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:84

  • Lang JB, Agresti A (1994) Simultaneously modeling joint and marginal distributions of multivariate categorical responses. Journal of the American Statistical Association 89:625–632

  • le Cessie S, van Houwelingen JC (1994) Logistic regression for correlated binary data. Applied Statistics 43:95–108

  • Lesaffre E, Verbeke G (1998) Local influence in linear mixed models. Biometrics 54:570–582

  • Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22

  • Liang K-Y, Zeger SL, Qaqish B (1992) Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society Series B 54:3–40

  • Lipsitz SR, Laird NM, Harrington DP (1991) Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78:153–160

  • Little RJA (1986) A note about models for selectivity bias. Econometrica 53:1469–1474

  • Little RJA (1993) Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association 88:125–134

  • Little RJA (1994) A class of pattern-mixture models for normal incomplete data. Biometrika 81:471–483

  • Little RJA (1995) Modeling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association 90:1112–1121

  • Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, New York

  • Mallinckrodt CH, Clark WS, Stacy RD (2001a) Type I error rates from mixed-effects model repeated measures versus fixed effects analysis of variance with missing values imputed via last observation carried forward. Drug Information Journal 35:1215–1225

  • Mallinckrodt CH, Clark WS, Stacy RD (2001b) Accounting for dropout bias using mixed-effects models. Journal of Biopharmaceutical Statistics 11(1 & 2):9–21

  • Mallinckrodt CH, Clark WS, Carroll RJ, Molenberghs G (2003a) Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. Journal of Biopharmaceutical Statistics 13:179–190

  • Mallinckrodt CH, Sanger TM, Dube S, Debrota DJ, Molenberghs G, Carroll RJ, Zeigler Potter WM, Tollefson GD (2003b) Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biological Psychiatry 53:754–760

  • McCullagh P, Nelder JA (1989) Generalized Linear Models. Chapman & Hall, London

  • Michiels B, Molenberghs G, Bijnens L, Vangeneugden T, Thijs H (2002) Selection models and pattern-mixture models to analyze longitudinal quality of life data subject to dropout. Statistics in Medicine 21:1023–1041

  • Molenberghs G, Lesaffre E (1994) Marginal modelling of correlated ordinal data using a multivariate Plackett distribution. Journal of the American Statistical Association 89:633–644

  • Molenberghs G, Lesaffre E (1999) Marginal modelling of multivariate categorical data. Statistics in Medicine 18:2237–2255

  • Molenberghs G, Kenward MG, Lesaffre E (1997) The analysis of longitudinal ordinal data with non-random dropout. Biometrika 84:33–44

  • Molenberghs G, Michiels B, Kenward MG, Diggle PJ (1998) Missing data mechanisms and pattern-mixture models. Statistica Neerlandica 52:153–161

  • Murray GD, Findlay JG (1988) Correcting for the bias caused by drop-outs in hypertension trials. Statistics in Medicine 7:941–946

  • Nelder JA, Mead R (1965) A simplex method for function minimisation. The Computer Journal 7:303–313

  • Neuhaus JM (1992) Statistical methods for longitudinal and clustered designs with binary responses. Statistical Methods in Medical Research 1:249–273

  • Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59:25–35

  • Plackett RL (1965) A class of bivariate distributions. Journal of the American Statistical Association 60:516–522

  • Prentice RL (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics 44:1033–1048

  • Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90:106–121

  • Robins JM, Rotnitzky A, Scharfstein DO (1998) Semiparametric regression for repeated outcomes with non-ignorable non-response. Journal of the American Statistical Association 93:1321–1339

  • Rubin DB (1976) Inference and missing data. Biometrika 63:581–592

  • Rubin DB (1987) Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York

  • Rubin DB (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:80–82

  • Schafer JL (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, London

  • Schipper H, Clinch J, McMurray A (1984) Measuring the quality of life of cancer patients: the Functional-Living Index-Cancer: development and validation. Journal of Clinical Oncology 2:472–483

  • Sheiner LB, Beal SL, Dunne A (1997) Analysis of nonrandomly censored ordered categorical longitudinal data from analgesic trials. Journal of the American Statistical Association 92:1235–1244

  • Siddiqui O, Ali MW (1998) A comparison of the random-effects pattern-mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. Journal of Biopharmaceutical Statistics 8:545–563

  • Skellam JG (1948) A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. Journal of the Royal Statistical Society Series B 10:257–261

  • Smith DM, Robertson B, Diggle PJ (1996) Object-oriented Software for the Analysis of Longitudinal Data in S. Technical Report MA 96/192. Department of Mathematics and Statistics, University of Lancaster, LA1 4YF, United Kingdom

  • Stiratelli R, Laird N, Ware J (1984) Random effects models for serial observations with dichotomous response. Biometrics 40:961–972

  • Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82:528–550

  • Thijs H, Molenberghs G, Michiels B, Verbeke G, Curran D (2002) Strategies to fit pattern-mixture models. Biostatistics 3:245–265

  • Verbeke G, Molenberghs G (1997) Linear Mixed Models in Practice: A SAS-Oriented Approach. Lecture Notes in Statistics 126. Springer-Verlag, New York

  • Verbeke G, Molenberghs G (2000) Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York

  • Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG (2001) Sensitivity analysis for non-random dropout: a local influence approach. Biometrics 57:7–14

  • Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61:439–447

  • Wu MC, Bailey KR (1989) Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 45:939–955

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Molenberghs, G., Beunckens, C., Jansen, I., Thijs, H., Verbeke, G., Kenward, M.G. (2005). Missing Data. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-26577-1_20
