Missing Data

Molenberghs, Geert; Beunckens, Caroline; Jansen, Ivy; Thijs, Herbert; Verbeke, Geert; Kenward, Michael G.

doi:10.1007/978-3-540-26577-1_20

Abstract

The problem of dealing with missing values is common throughout statistical work and is present whenever human subjects are enrolled. Respondents may refuse participation or may be unreachable. Patients in clinical and epidemiological studies may withdraw their initial consent without further explanation. Early work on missing values was largely concerned with algorithmic and computational solutions to the induced lack of balance or deviations from the intended study design (Afifi and Elashoff 1966; Hartley and Hocking 1971). More recently, general algorithms such as Expectation-Maximization (EM) (Dempster et al. 1977) and data imputation and augmentation procedures (Rubin 1987; Tanner and Wong 1987), combined with powerful computing resources, have largely provided a solution to this aspect of the problem. There remains the very difficult and important question of assessing the impact of missing data on subsequent statistical inference. Conditions can be formulated under which an analysis that proceeds as if the missing data are missing by design, that is, ignoring the missing value process, can provide valid answers to study questions. While such an approach is attractive from a pragmatic point of view, the difficulty is that such conditions can rarely be assumed to hold with full certainty. Indeed, assumptions will be required that cannot be assessed from the data under analysis. Hence, in this setting there cannot be anything that could be termed a definitive analysis, and any preferred analysis should ideally be supplemented with a so-called sensitivity analysis.
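
The point about ignoring the missing value process can be illustrated with a small simulation. The sketch below is not taken from the chapter: the function name `observed_mean`, the labels "MCAR" (missingness unrelated to the outcome) and "MNAR" (missingness depending on the unobserved outcome itself), and all probabilities are assumptions chosen only for illustration. It compares the mean of the observed values with the known population mean of 0 under the two mechanisms.

```python
import random
import statistics


def observed_mean(mechanism, n=20000, seed=0):
    """Mean of the outcomes that remain observed under a missingness mechanism.

    Illustrative assumption: outcomes are standard normal, so the true
    population mean is 0. The mechanisms and probabilities are hypothetical.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        if mechanism == "MCAR":
            # Missingness does not depend on the outcome at all.
            p_missing = 0.3
        elif mechanism == "MNAR":
            # High outcomes are more likely to go missing.
            p_missing = 0.6 if y > 0 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() >= p_missing:
            observed.append(y)
    return statistics.fmean(observed)


if __name__ == "__main__":
    print("MCAR observed mean:", round(observed_mean("MCAR"), 3))
    print("MNAR observed mean:", round(observed_mean("MNAR"), 3))
```

Under the first mechanism the observed mean stays close to 0, so ignoring the missingness does no harm; under the second it is clearly negative. In a real study the missing values are unavailable, so the two situations cannot be told apart from the observed data alone, which is why the abstract recommends a sensitivity analysis.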

References

Aerts M, Geys H, Molenberghs G, and Ryan LM (2002) Topics in Modelling of Clustered Binary Data. Chapman & Hall, London
Afifi A, Elashoff R (1966) Missing observations in multivariate statistics I: Review of the literature. Journal of the American Statistical Association 61:595–604
Amemiya T (1984) Tobit models: a survey. Journal of Econometrics 24:3–61
Ashford JR, Sowden RR (1970) Multi-variate probit analysis. Biometrics 26:535–546
Baker SG (1995) Marginal regression for repeated binary data with outcome subject to non-ignorable non-response. Biometrics 51:1042–1052
Bahadur RR (1961) A representation of the joint distribution of responses to n dichotomous items. In: Solomon H (ed) Studies in Item Analysis and Prediction Stanford Mathematical Studies in the Social Sciences VI. Stanford University Press, Stanford CA
Beckman RJ, Nachtsheim CJ, and Cook RD (1987) Diagnostics for mixed-model analysis of variance. Technometrics 29:413–426
Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88:9–25
Buck SF (1960) A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society Series B 22:302–306
Chatterjee S, Hadi AS (1988) Sensitivity Analysis in Linear Regression. John Wiley & Sons, New York
Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18
Cook RD (1979) Influential observations in linear regression. Journal of the American Statistical Association 74:169–174
Cook RD (1986) Assessment of local influence. Journal of the Royal Statistical Society Series B 48:133–169
Cook RD, Weisberg S (1982) Residuals and Influence in Regression. Chapman & Hall, London
Dale JR (1986) Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics 42:909–917
Dempster AP, Rubin DB (1983) Overview. Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography, Madow WG, Olkin I, Rubin DB (eds). Academic Press, New York, pp 3–10
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B 39:1–38
Diggle PJ, Kenward MG (1994) Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43:49–93
Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002) Analysis of Longitudinal Data. Oxford University Press, New York
Draper D (1995) Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society Series B 57:45–97
Ekholm A (1991) Algorithms versus models for analyzing data that contain misclassification errors. Biometrics 47:1171–1182
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models. Springer-Verlag, Heidelberg
Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995) Regression models for longitudinal binary responses with informative dropouts. Journal of the Royal Statistical Society Series B 57:691–704
Fitzmaurice GM, Heath G, Clifford P (1996a) Logistic regression models for binary panel data with attrition. Journal of the Royal Statistical Society Series A 159:249–264
Fitzmaurice GM, Laird NM, Zahner GEP (1996b) Multivariate logistic models for incomplete binary response. Journal of the American Statistical Association 91:99–108
George EO, Bowman D (1995) A saturated model for analyzing exchangeable binary data: Applications to clinical and developmental toxicity studies. Journal of the American Statistical Association 90:871–879
Geys H, Molenberghs G, Lipsitz SR (1998) A note on the comparison of pseudolikelihood and generalized estimating equations for marginal odds ratio models. Journal of Statistical Computation and Simulation 62:45–72
Glonek GFV, McCullagh P (1995) Multivariate logistic models. Journal of the Royal Statistical Society Series B 81:477–482
Goss PE, Winer EP, Tannock IF, Schwartz LH, Kremer AB (1999) Breast cancer: randomized phase III trial comparing the new potent and selective third-generation aromatase inhibitor vorozole with megestrol acetate in postmenopausal advanced breast cancer patients. Journal of Clinical Oncology 17:52–63
Hartley HO, Hocking R (1971) The analysis of incomplete data. Biometrics 27:783–808
Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5:475–492
Hogan JW, Laird NM (1997) Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16:239–258
Kenward MG, Molenberghs G (1998) Likelihood based frequentist inference when data are missing at random. Statistical Science 12:236–247
Kenward MG, Molenberghs G, Thijs H (2003) Pattern-mixture models with proper time dependence. Biometrika 90:53–71
Laird NM (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:84
Lang JB, Agresti A (1994) Simultaneously modeling joint and marginal distributions of multivariate categorical responses. Journal of the American Statistical Association 89:625–632
le Cessie S, van Houwelingen JC (1994) Logistic regression for correlated binary data. Applied Statistics 43:95–108
Lesaffre E, Verbeke G (1998) Local influence in linear mixed models. Biometrics 54:570–582
Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22
Liang K-Y, Zeger SL, Qaqish B (1992) Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society Series B 54:3–40
Lipsitz SR, Laird NM, Harrington DP (1991) Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78:153–160
Little RJA (1986) A note about models for selectivity bias. Econometrica 53:1469–1474
Little RJA (1993) Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association 88:125–134
Little RJA (1994) A class of pattern-mixture models for normal incomplete data. Biometrika 81:471–483
Little RJA (1995) Modeling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association 90:1112–1121
Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, New York
Mallinckrodt CH, Clark WS, Stacy RD (2001a) Type I error rates from mixed-effects model repeated measures versus fixed effects analysis of variance with missing values imputed via last observation carried forward. Drug Information Journal 35:1215–1225
Mallinckrodt CH, Clark WS, Stacy RD (2001b) Accounting for dropout bias using mixed-effects models. Journal of Biopharmaceutical Statistics 11(1 & 2):9–21
Mallinckrodt CH, Clark WS, Carroll RJ, Molenberghs G (2003a) Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. Journal of Biopharmaceutical Statistics 13:179–190
Mallinckrodt CH, Sanger TM, Dube S, Debrota DJ, Molenberghs G, Carroll RJ, Zeigler Potter WM, Tollefson GD (2003b) Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biological Psychiatry 53:754–760
McCullagh P, Nelder JA (1989) Generalized Linear Models. Chapman & Hall, London
Michiels B, Molenberghs G, Bijnens L, Vangeneugden T, Thijs H (2002) Selection models and pattern-mixture models to analyze longitudinal quality of life data subject to dropout. Statistics in Medicine 21:1023–1041
Molenberghs G, Lesaffre E (1994) Marginal modelling of correlated ordinal data using a multivariate Plackett distribution. Journal of the American Statistical Association 89:633–644
Molenberghs G, Lesaffre E (1999) Marginal modelling of multivariate categorical data. Statistics in Medicine 18:2237–2255
Molenberghs G, Kenward MG, Lesaffre E (1997) The analysis of longitudinal ordinal data with non-random dropout. Biometrika 84:33–44
Molenberghs G, Michiels B, Kenward MG, Diggle PJ (1998) Missing data mechanisms and pattern-mixture models. Statistica Neerlandica 52:153–161
Murray GD, Findlay JG (1988) Correcting for the bias caused by drop-outs in hypertension trials. Statistics in Medicine 7:941–946
Nelder JA, Mead R (1965) A simplex method for function minimisation. The Computer Journal 7:303–313
Neuhaus JM (1992) Statistical methods for longitudinal and clustered designs with binary responses. Statistical Methods in Medical Research 1:249–273
Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59:25–35
Plackett RL (1965) A class of bivariate distributions. Journal of the American Statistical Association 60:516–522
Prentice RL (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics 44:1033–1048
Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90:106–121
Robins JM, Rotnitzky A, Scharfstein DO (1998) Semiparametric regression for repeated outcomes with non-ignorable non-response. Journal of the American Statistical Association 93:1321–1339
Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
Rubin DB (1987) Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York
Rubin DB (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:80–82
Schafer JL (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, London
Schipper H, Clinch J, McMurray A (1984) Measuring the quality of life of cancer patients: the Functional-Living Index-Cancer: development and validation. Journal of Clinical Oncology 2:472–483
Sheiner LB, Beal SL, Dunne A (1997) Analysis of nonrandomly censored ordered categorical longitudinal data from analgesic trials. Journal of the American Statistical Association 92:1235–1244
Siddiqui O, Ali MW (1998) A comparison of the random-effects pattern-mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. Journal of Biopharmaceutical Statistics 8:545–563
Skellam JG (1948) A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. Journal of the Royal Statistical Society Series B 10:257–261
Smith DM, Robertson B, Diggle PJ (1996) Object-oriented Software for the Analysis of Longitudinal Data in S. Technical Report MA 96/192. Department of Mathematics and Statistics, University of Lancaster, LA1 4YF, United Kingdom
Stiratelli R, Laird N, Ware J (1984) Random effects models for serial observations with dichotomous response. Biometrics 40:961–972
Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82:528–550
Thijs H, Molenberghs G, Michiels B, Verbeke G, Curran D (2002) Strategies to fit pattern-mixture models. Biostatistics 3:245–265
Verbeke G, Molenberghs G (1997) Linear Mixed Models in Practice: A SAS-Oriented Approach. Lecture Notes in Statistics 126. Springer-Verlag, New York
Verbeke G, Molenberghs G (2000) Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York
Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG (2001) Sensitivity analysis for non-random dropout: a local influence approach. Biometrics 57:7–14
Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61:439–447
Wu MC, Bailey KR (1989) Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 45:939–955

Author information

Authors and Affiliations

Biostatistics Centre for Statistics, Limburg University Centrum, Universitaire Campus Building D, 3590, Diepenbeek, Belgium
Geert Molenberghs, Caroline Beunckens, Ivy Jansen, Herbert Thijs & Geert Verbeke
Medical Statistics Unit, London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
Michael G. Kenward

Editor information

Editors and Affiliations

Division of Epidemiological Methods and Ethiologic Research, Bremen Institute for Prevention Research and Social Medicine (BIPS), Linzer Str. 10, 28359, Bremen, Germany
Wolfgang Ahrens & Iris Pigeot
Division of Biometrie and Data Management, Bremen Institute for Prevention Research and Social Medicine (BIPS), Linzer Str. 10, 28359, Bremen, Germany
Wolfgang Ahrens & Iris Pigeot

About this chapter

Cite this chapter

Molenberghs, G., Beunckens, C., Jansen, I., Thijs, H., Verbeke, G., Kenward, M.G. (2005). Missing Data. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-26577-1_20

DOI: https://doi.org/10.1007/978-3-540-26577-1_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00566-7
Online ISBN: 978-3-540-26577-1
eBook Packages: Mathematics and Statistics (R0)
