Semiparametric model for regression analysis with nonmonotone missing data


Semiparametric likelihoods for regression models with missing at random data (Chen in J Am Stat Assoc 99:1176–1189, 2004; Zhang and Rockette in J Stat Comput Simul 77(2):163–173, 2007; Zhao et al. in Biom J 51:123–136, 2009; Zhao in Commun Stat Theory Methods 38:3736–3744, 2009) are robust because they use nonparametric models for the covariate distributions and do not require modeling the missing data probabilities. Furthermore, the EM algorithms based on these semiparametric likelihoods have closed form expressions for both the E-step and the M-step. As far as we know, however, the existing semiparametric likelihoods can handle only simple monotone missing data patterns. In this research we extend the semiparametric likelihood approach to regression models with arbitrary nonmonotone missing at random data. We propose a pseudo-likelihood model that uses an empirical distribution to model the conditional distribution of the missing covariates given the observed covariates, separately for each missing data pattern. We show that an EM algorithm with closed form updating formulas can be used to compute the maximum pseudo-likelihood estimates for regression models with nonmonotone missing data. We then propose estimating the asymptotic variance of the maximum pseudo-likelihood estimator through a profile log likelihood and the EM algorithm. We examine the finite sample performance of the new methods in simulation studies and further illustrate the methods with a real data example investigating high risk gambling behavior and its associated factors.
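
To give a rough sense of what closed form E- and M-steps look like when an empirical distribution models a missing covariate, the following Python sketch treats only the simplest setting: a single missing data pattern, one partly missing covariate x, and one fully observed discrete covariate z. It is not the pseudo-likelihood algorithm for nonmonotone patterns described above. The function name `em_empirical_covariate`, the normal linear outcome model, and the restriction to a discrete z are all assumptions made for illustration.

```python
import numpy as np


def em_empirical_covariate(y, x, z, n_iter=500, tol=1e-8):
    """EM for y ~ N(b0 + b1*x + b2*z, sigma2) when x is partly missing (np.nan).

    Illustrative assumption: z is fully observed and discrete, and the
    distribution of x given z is an empirical distribution supported on the
    x values observed among complete cases with the same z.
    """
    y, x, z = (np.asarray(a, dtype=float) for a in (y, x, z))
    obs = ~np.isnan(x)
    yo, xo, zo = y[obs], x[obs], z[obs]
    ym, zm = y[~obs], z[~obs]
    n_o, n_m = yo.size, ym.size

    # match[i, j] is True when support point j shares the z value of missing case i.
    match = zm[:, None] == zo[None, :]
    if not match.any(axis=1).all():
        raise ValueError("every incomplete case needs a complete case with the same z")
    # same[j, k] = 1 when complete cases j and k lie in the same z stratum.
    same = (zo[:, None] == zo[None, :]).astype(float)

    # Starting values: complete-case least squares and uniform stratum masses.
    Xo = np.column_stack([np.ones(n_o), xo, zo])
    beta = np.linalg.lstsq(Xo, yo, rcond=None)[0]
    sigma2 = np.mean((yo - Xo @ beta) ** 2)
    p = 1.0 / same.sum(axis=1)

    # Augmented data: row i * n_o + j pairs missing case i with support point j.
    Xa = np.column_stack([np.ones(n_m * n_o), np.tile(xo, n_m), np.repeat(zm, n_o)])
    X_all = np.vstack([Xo, Xa])
    y_all = np.concatenate([yo, np.repeat(ym, n_o)])

    for _ in range(n_iter):
        # E-step: posterior weight of support point j for missing case i,
        # proportional to f(y_i | x_j, z_i) * p_j within the matching stratum.
        mu = beta[0] + beta[1] * xo[None, :] + beta[2] * zm[:, None]
        logw = -0.5 * (ym[:, None] - mu) ** 2 / sigma2 + np.log(p)[None, :]
        logw = np.where(match, logw, -np.inf)
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)

        # M-step (masses): p_j proportional to 1 + sum_i w_ij within each stratum.
        mass = 1.0 + w.sum(axis=0)
        p = mass / (same @ mass)

        # M-step (regression): weighted least squares on the augmented data.
        w_all = np.concatenate([np.ones(n_o), w.ravel()])
        XtW = X_all.T * w_all
        beta_new = np.linalg.solve(XtW @ X_all, XtW @ y_all)
        sigma2 = np.sum(w_all * (y_all - X_all @ beta_new) ** 2) / (n_o + n_m)

        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, sigma2, p
```

Calling `em_empirical_covariate(y, x, z)` with `np.nan` marking the missing entries of `x` returns the regression coefficients, the residual variance, and the estimated point masses. Both M-step updates are explicit formulas, so no numerical optimizer is needed inside the loop.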


  1. Chatterjee N, Chen Y, Breslow NE (2003) A pseudo-score estimator for regression problems with two-phase sampling. J Am Stat Assoc 98:158–168

  2. Chen HY (2004) Nonparametric and semiparametric models for missing covariates in parametric regression. J Am Stat Assoc 99:1176–1189

  3. Chen HY, Xie H, Qian Y (2011) Multiple imputation for missing values through conditional semiparametric odds ratio models. Biometrics 67:799–809

  4. Gao X, Song PXK (2011) Composite likelihood EM algorithm with applications to multivariate hidden Markov model. Stat Sin 21:165–185

  5. Huang Y (2009) Statistical analysis of gambling behaviors. Thesis at the University of Regina

  6. Ibrahim JG (1990) Incomplete data in generalized linear models. J Am Stat Assoc 85:765–769

  7. Ibrahim JG, Weisberg S (1992) Incomplete data in generalized linear models with continuous covariates. Aust N Z J Stat 34:461–470

  8. Ibrahim JG, Chen MH, Lipsitz SR (1999) Monte Carlo EM for missing covariates in parametric regression models. Biometrics 55:591–596

  9. Ibrahim JG, Chen MH, Lipsitz SR, Herring AH (2005) Missing-data methods for generalized linear models: a comparative review. J Am Stat Assoc 100:332–346

  10. Lawless JF, Kalbfleisch JD, Wild CJ (1999) Semiparametric methods for response-selective and missing data problems in regression. J R Stat Soc B 61(2):413–438

  11. Lindsay B (1988) Composite likelihood methods. Contemp Math 80:220–239

  12. Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New York

  13. Little RJA, Schluchter MD (1985) Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika 72(3):497–512

  14. Murphy SA, van der Vaart AW (2000) On profile likelihood. J Am Stat Assoc 95:449–465

  15. Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592

  16. Sinha S, Saha KK, Wang S (2014) Semiparametric approach for non-monotone missing covariates in a parametric regression model. Biometrics 70(2):299–311

  17. Sun B, Tchetgen EJT (2018) On inverse probability weighting for nonmonotone missing at random data. J Am Stat Assoc 113:369–379

  18. Varin C, Reid N, Firth D (2011) An overview of composite likelihood methods. Stat Sin 21:5–42

  19. Zhang Z, Rockette HE (2007) An EM algorithm for regression analysis with incomplete covariate information. J Stat Comput Simul 77(2):163–173

  20. Zhao Y (2009) Regression analysis with covariates missing at random: a piece-wise nonparametric model for missing covariates. Commun Stat Theory Methods 38:3736–3744

  21. Zhao Y, Joe H (2005) Composite likelihood estimation in multivariate data analysis. Can J Stat 33:335–356

  22. Zhao LP, Lipsitz S (1992) Designs and analysis of two-stage studies. Stat Med 11:769–782

  23. Zhao Y, Lawless JF, McLeish DL (2009) Likelihood methods for regression models with expensive variables missing by design. Biom J 51:123–136

We thank the editors and the two anonymous reviewers for their helpful comments and suggestions. This research was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (YZ).

Author information



Corresponding author

Correspondence to Yang Zhao.

About this article

Cite this article

Zhao, Y. Semiparametric model for regression analysis with nonmonotone missing data. Stat Methods Appl (2020).


  • EM algorithm
  • Nonmonotone missing data patterns
  • Profile log likelihood
  • Pseudo-likelihood
  • Semiparametric likelihood