Abstract
Two- or multi-phase study designs are often used in settings involving failure times. In most studies, whether or not certain covariates are measured on an individual depends on their failure time and status. For example, when failures are rare, case–cohort or case–control designs are used to increase the number of failures relative to a random sample of the same size. Another scenario is where certain covariates are expensive to measure, so they are obtained only for selected individuals in a cohort. This paper considers such situations and focuses on cases where we wish to test hypotheses of no association between failure time and expensive covariates. Efficient score tests based on maximum likelihood are developed and shown to have a simple form for a wide class of models and sampling designs. Some numerical comparisons of study designs are presented.
Acknowledgements
This research was supported by a grant to the author from the Natural Sciences and Engineering Research Council of Canada (DG RGPIN-8597). The author thanks Ker-ai Lee for implementing the simulation studies in Section 4, and Shelley Bull and Yildiz Yilmaz for helpful comments on the study in Section 4.2.
Appendix: Variance estimate for score statistic (7)
The individual components of the score functions given by (4) and (5) can be written as
where \(i = 1, \ldots , N\). The information matrix for \((\theta ^{\prime }, \alpha ^{\prime })^{\prime }\) then has component matrices as follows, where we define \(A_{\theta }(y | x, z) = \partial ^2 \log f(y | x, z) / \partial \theta \partial \theta ^{\prime }\) and \(A_{\alpha }(x | z) = \partial ^2 \log g(x | z) / \partial \alpha \partial \alpha ^{\prime }\):
Define \(A_{\mu \mu i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \mu _i^2\), \(A_{\sigma \sigma i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \sigma \partial \sigma ^{\prime }\), \(A_{\mu \sigma i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \mu _i \partial \sigma ^{\prime }\), \(A_{\alpha \alpha i} = - \partial ^2 \log g(x_i | z_i) / \partial \alpha \partial \alpha ^{\prime }\), where \(\mu _i = \beta ^{\prime } x_i + \gamma ^{\prime } z_i\). We note that under \(H_0\): \(\beta = 0\) we have (i) \(g(x|y,z) = g(x|z)\), (ii) \(f(y | x, z; \theta ) = f(y | z; \gamma , \sigma )\), and (iii) all of \(\phi ^{\prime }_{\mu }(y | x, z)\), \(\phi ^{\prime }_{\sigma }(y | x, z)\), \(A_{\mu \mu i}\), \(A_{\sigma \sigma i}\) and \(A_{\mu \sigma i}\) depend only on y and z. The components of \(I_{\theta \theta i}\), \(I_{\alpha \alpha i}\) and \(I_{\theta \alpha i}\) then simplify as follows:
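As an illustration of the second-derivative terms just defined, and not a reproduction of the paper's displayed expressions, suppose (purely as an assumption) that \(f_0\) is the normal density with mean \(\mu _i\) and a scalar standard deviation \(\sigma \), with censoring ignored. Then

\[
\log f_0(y_i; \mu _i, \sigma ) = -\log \sigma - \frac{(y_i - \mu _i)^2}{2 \sigma ^2} + \text{const},
\]

and differentiating twice gives

\[
A_{\mu \mu i} = \frac{1}{\sigma ^2}, \qquad
A_{\mu \sigma i} = \frac{2 (y_i - \mu _i)}{\sigma ^3}, \qquad
A_{\sigma \sigma i} = \frac{3 (y_i - \mu _i)^2}{\sigma ^4} - \frac{1}{\sigma ^2}.
\]

Under \(H_0\): \(\beta = 0\) we have \(\mu _i = \gamma ^{\prime } z_i\), so in this example each term involves only \(y_i\) and \(z_i\), in line with point (iii).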
The variance estimate (8) is then obtained by noting that under \(H_0\), the asymptotic variance of \(N^{1/2} U_{\beta }(\tilde{\theta }, \tilde{\alpha })\) is the limit of
where \(I_{\beta \beta } = \sum _{i=1}^N I_{\beta \beta i}\), \(I_{\beta \gamma } = \sum _{i=1}^N I_{\beta \gamma i}\), etc., and then replacing \(\theta \) and \(\alpha \) in the entries of \(I_{\beta \beta }\), \(I_{\beta \eta }\), \(I_{\beta \alpha }\), \(I_{\eta \eta }\) and \(I_{\alpha \alpha }\) given above with \(\tilde{\theta }\) and \(\tilde{\alpha }\).
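The following is a minimal numerical sketch of how a score-test variance of this kind is typically assembled from summed information blocks. It assumes the generic partitioned-information (Schur complement) form, \(I_{\beta \beta } - I_{\beta \psi } I_{\psi \psi }^{-1} I_{\psi \beta }\) with all nuisance parameters stacked into \(\psi \); this is a standard construction and is not claimed to be the exact expression (8). The function names and the toy numbers are hypothetical.

```python
import numpy as np


def score_test_variance(I_bb, I_bpsi, I_psipsi):
    """Schur complement I_bb - I_bpsi @ inv(I_psipsi) @ I_bpsi.T.

    I_bb     : information block for the tested parameter beta (p x p)
    I_bpsi   : cross-information between beta and the nuisance block (p x q)
    I_psipsi : information block for the nuisance parameters (q x q)
    All blocks are assumed to be evaluated at the null-hypothesis estimates.
    """
    I_bb = np.atleast_2d(np.asarray(I_bb, dtype=float))
    I_bpsi = np.atleast_2d(np.asarray(I_bpsi, dtype=float))
    I_psipsi = np.atleast_2d(np.asarray(I_psipsi, dtype=float))
    # Solve a linear system instead of forming an explicit inverse.
    return I_bb - I_bpsi @ np.linalg.solve(I_psipsi, I_bpsi.T)


def score_statistic(U_b, V):
    """Quadratic form U_b' V^{-1} U_b for the score vector U_b."""
    U_b = np.atleast_1d(np.asarray(U_b, dtype=float))
    return float(U_b @ np.linalg.solve(V, U_b))


# Toy numbers (hypothetical): scalar beta, two nuisance parameters.
V = score_test_variance(
    I_bb=[[4.0]],
    I_bpsi=[[1.0, 2.0]],
    I_psipsi=[[2.0, 0.0], [0.0, 4.0]],
)
W = score_statistic(U_b=[5.0], V=V)
print(V[0, 0], W)  # 4 - (1/2 + 4/4) = 2.5 and 25 / 2.5 = 10.0
```

Under the usual regularity conditions the resulting quadratic form is compared with a chi-squared distribution whose degrees of freedom equal the dimension of \(\beta \).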
Cite this article
Lawless, J.F. Two-phase outcome-dependent studies for failure times and testing for effects of expensive covariates. Lifetime Data Anal 24, 28–44 (2018). https://doi.org/10.1007/s10985-016-9386-8
DOI: https://doi.org/10.1007/s10985-016-9386-8