Abstract
The problem of dealing with missing values is common throughout statistical work and is present whenever human subjects are enrolled. Respondents may refuse participation or may be unreachable. Patients in clinical and epidemiological studies may withdraw their initial consent without further explanation. Early work on missing values was largely concerned with algorithmic and computational solutions to the induced lack of balance or deviations from the intended study design (Afifi and Elashoff 1966; Hartley and Hocking 1971). More recently, general algorithms such as Expectation-Maximization (EM) (Dempster et al. 1977) and data imputation and augmentation procedures (Rubin 1987; Tanner and Wong 1987), combined with powerful computing resources, have largely provided a solution to this aspect of the problem. There remains the very difficult and important question of assessing the impact of missing data on subsequent statistical inference. Conditions can be formulated under which an analysis that proceeds as if the missing data are missing by design, that is, ignoring the missing value process, can provide valid answers to study questions. While such an approach is attractive from a pragmatic point of view, the difficulty is that such conditions can rarely be assumed to hold with full certainty. Indeed, assumptions will be required that cannot be assessed from the data under analysis. Hence, in this setting there cannot be anything that could be termed a definitive analysis, and any preferred analysis should ideally be supplemented with a so-called sensitivity analysis.
References
Aerts M, Geys H, Molenberghs G, Ryan LM (2002) Topics in Modelling of Clustered Binary Data. Chapman & Hall, London
Afifi A, Elashoff R (1966) Missing observations in multivariate statistics I: Review of the literature. Journal of the American Statistical Association 61:595–604
Amemiya T (1984) Tobit models: a survey. Journal of Econometrics 24:3–61
Ashford JR, Sowden RR (1970) Multi-variate probit analysis. Biometrics 26:535–546
Baker SG (1995) Marginal regression for repeated binary data with outcome subject to non-ignorable non-response. Biometrics 51:1042–1052
Bahadur RR (1961) A representation of the joint distribution of responses to n dichotomous items. In: Solomon H (ed) Studies in Item Analysis and Prediction Stanford Mathematical Studies in the Social Sciences VI. Stanford University Press, Stanford CA
Beckman RJ, Nachtsheim CJ, Cook RD (1987) Diagnostics for mixed-model analysis of variance. Technometrics 29:413–426
Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88:9–25
Buck SF (1960) A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society Series B 22:302–306
Chatterjee S, Hadi AS (1988) Sensitivity Analysis in Linear Regression. John Wiley & Sons, New York
Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18
Cook RD (1979) Influential observations in linear regression. Journal of the American Statistical Association 74:169–174
Cook RD (1986) Assessment of local influence. Journal of the Royal Statistical Society Series B 48:133–169
Cook RD, Weisberg S (1982) Residuals and Influence in Regression. Chapman & Hall, London
Dale JR (1986) Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics 42:909–917
Dempster AP, Rubin DB (1983) Overview. In: Madow WG, Olkin I, Rubin DB (eds) Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography. Academic Press, New York, pp 3–10
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B 39:1–38
Diggle PJ, Kenward MG (1994) Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43:49–93
Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002) Analysis of Longitudinal Data. Oxford University Press, New York
Draper D (1995) Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society Series B 57:45–97
Ekholm A (1991) Algorithms versus models for analyzing data that contain misclassification errors. Biometrics 47:1171–1182
Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models. Springer-Verlag, Heidelberg
Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995) Regression models for longitudinal binary responses with informative dropouts. Journal of the Royal Statistical Society Series B 57:691–704
Fitzmaurice GM, Heath G, Clifford P (1996a) Logistic regression models for binary panel data with attrition. Journal of the Royal Statistical Society Series A 159:249–264
Fitzmaurice GM, Laird NM, Zahner GEP (1996b) Multivariate logistic models for incomplete binary response. Journal of the American Statistical Association 91:99–108
George EO, Bowman D (1995) A saturated model for analyzing exchangeable binary data: Applications to clinical and developmental toxicity studies. Journal of the American Statistical Association 90:871–879
Geys H, Molenberghs G, Lipsitz SR (1998) A note on the comparison of pseudolikelihood and generalized estimating equations for marginal odds ratio models. Journal of Statistical Computation and Simulation 62:45–72
Glonek GFV, McCullagh P (1995) Multivariate logistic models. Journal of the Royal Statistical Society Series B 81:477–482
Goss PE, Winer EP, Tannock IF, Schwartz LH, Kremer AB (1999) Breast cancer: randomized phase III trial comparing the new potent and selective third-generation aromatase inhibitor vorozole with megestrol acetate in postmenopausal advanced breast cancer patients. Journal of Clinical Oncology 17:52–63
Hartley HO, Hocking R (1971) The analysis of incomplete data. Biometrics 27:783–808
Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5:475–492
Hogan JW, Laird NM (1997) Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16:239–258
Kenward MG, Molenberghs G (1998) Likelihood based frequentist inference when data are missing at random. Statistical Science 12:236–247
Kenward MG, Molenberghs G, Thijs H (2003) Pattern-mixture models with proper time dependence. Biometrika 90:53–71
Laird NM (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:84
Lang JB, Agresti A (1994) Simultaneously modeling joint and marginal distributions of multivariate categorical responses. Journal of the American Statistical Association 89:625–632
le Cessie S, van Houwelingen JC (1994) Logistic regression for correlated binary data. Applied Statistics 43:95–108
Lesaffre E, Verbeke G (1998) Local influence in linear mixed models. Biometrics 54:570–582
Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22
Liang K-Y, Zeger SL, Qaqish B (1992) Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society Series B 54:3–40
Lipsitz SR, Laird NM, Harrington DP (1991) Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78:153–160
Little RJA (1986) A note about models for selectivity bias. Econometrica 53:1469–1474
Little RJA (1993) Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association 88:125–134
Little RJA (1994) A class of pattern-mixture models for normal incomplete data. Biometrika 81:471–483
Little RJA (1995) Modeling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association 90:1112–1121
Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, New York
Mallinckrodt CH, Clark WS, Stacy RD (2001a) Type I error rates from mixed-effects model repeated measures versus fixed effects analysis of variance with missing values imputed via last observation carried forward. Drug Information Journal 35:1215–1225
Mallinckrodt CH, Clark WS, Stacy RD (2001b) Accounting for dropout bias using mixed-effects models. Journal of Biopharmaceutical Statistics 11(1 & 2):9–21
Mallinckrodt CH, Clark WS, Carroll RJ, Molenberghs G (2003a) Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. Journal of Biopharmaceutical Statistics 13:179–190
Mallinckrodt CH, Sanger TM, Dube S, Debrota DJ, Molenberghs G, Carroll RJ, Zeigler Potter WM, Tollefson GD (2003b) Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biological Psychiatry 53:754–760
McCullagh P, Nelder JA (1989) Generalized Linear Models. Chapman & Hall, London
Michiels B, Molenberghs G, Bijnens L, Vangeneugden T, Thijs H (2002) Selection models and pattern-mixture models to analyze longitudinal quality of life data subject to dropout. Statistics in Medicine 21:1023–1041
Molenberghs G, Lesaffre E (1994) Marginal modelling of correlated ordinal data using a multivariate Plackett distribution. Journal of the American Statistical Association 89:633–644
Molenberghs G, Lesaffre E (1999) Marginal modelling of multivariate categorical data. Statistics in Medicine 18:2237–2255
Molenberghs G, Kenward MG, Lesaffre E (1997) The analysis of longitudinal ordinal data with non-random dropout. Biometrika 84:33–44
Molenberghs G, Michiels B, Kenward MG, Diggle PJ (1998) Missing data mechanisms and pattern-mixture models. Statistica Neerlandica 52:153–161
Murray GD, Findlay JG (1988) Correcting for the bias caused by drop-outs in hypertension trials. Statistics in Medicine 7:941–946
Nelder JA, Mead R (1965) A simplex method for function minimisation. The Computer Journal 7:303–313
Neuhaus JM (1992) Statistical methods for longitudinal and clustered designs with binary responses. Statistical Methods in Medical Research 1:249–273
Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59:25–35
Plackett RL (1965) A class of bivariate distributions. Journal of the American Statistical Association 60:516–522
Prentice RL (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics 44:1033–1048
Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90:106–121
Robins JM, Rotnitzky A, Scharfstein DO (1998) Semiparametric regression for repeated outcomes with non-ignorable non-response. Journal of the American Statistical Association 93:1321–1339
Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
Rubin DB (1987) Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York
Rubin DB (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:80–82
Schafer JL (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, London
Schipper H, Clinch J, McMurray A (1984) Measuring the quality of life of cancer patients: the Functional-Living Index-Cancer: development and validation. Journal of Clinical Oncology 2:472–483
Sheiner LB, Beal SL, Dunne A (1997) Analysis of nonrandomly censored ordered categorical longitudinal data from analgesic trials. Journal of the American Statistical Association 92:1235–1244
Siddiqui O, Ali MW (1998) A comparison of the random-effects pattern-mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. Journal of Biopharmaceutical Statistics 8:545–563
Skellam JG (1948) A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. Journal of the Royal Statistical Society Series B 10:257–261
Smith DM, Robertson B, Diggle PJ (1996) Object-oriented Software for the Analysis of Longitudinal Data in S. Technical Report MA 96/192. Department of Mathematics and Statistics, University of Lancaster, LA1 4YF, United Kingdom
Stiratelli R, Laird N, Ware J (1984) Random effects models for serial observations with dichotomous response. Biometrics 40:961–972
Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82:528–550
Thijs H, Molenberghs G, Michiels B, Verbeke G, Curran D (2002) Strategies to fit pattern-mixture models. Biostatistics 3:245–265
Verbeke G, Molenberghs G (1997) Linear Mixed Models in Practice: A SAS-Oriented Approach. Lecture Notes in Statistics 126. Springer-Verlag, New York
Verbeke G, Molenberghs G (2000) Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York
Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG (2001) Sensitivity analysis for non-random dropout: a local influence approach. Biometrics 57:7–14
Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61:439–447
Wu MC, Bailey KR (1989) Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 45:939–955
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Molenberghs, G., Beunckens, C., Jansen, I., Thijs, H., Verbeke, G., Kenward, M.G. (2005). Missing Data. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-26577-1_20
DOI: https://doi.org/10.1007/978-3-540-26577-1_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00566-7
Online ISBN: 978-3-540-26577-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)