Two-phase outcome-dependent studies for failure times and testing for effects of expensive covariates

Lifetime Data Analysis

Abstract

Two- or multi-phase study designs are often used in settings involving failure times. In most studies, whether or not certain covariates are measured on an individual depends on their failure time and status. For example, when failures are rare, case–cohort or case–control designs are used to increase the number of failures relative to a random sample of the same size. Another scenario is where certain covariates are expensive to measure, so they are obtained only for selected individuals in a cohort. This paper considers such situations and focuses on cases where we wish to test hypotheses of no association between failure time and expensive covariates. Efficient score tests based on maximum likelihood are developed and shown to have a simple form for a wide class of models and sampling designs. Some numerical comparisons of study designs are presented.

Acknowledgements

This research was supported by a grant to the author from the Natural Sciences and Engineering Research Council of Canada (DG RGPIN-8597). The author thanks Ker-ai Lee for implementing the simulation studies in Section 4, and Shelley Bull and Yildiz Yilmaz for helpful comments on the study in Section 4.2.

Author information

Corresponding author

Correspondence to J. F. Lawless.

Appendix: Variance estimate for score statistic (7)

The individual components of the score functions given by (4) and (5) can be written as

$$\begin{aligned} U_{\theta i}= & {} R_i \phi ^{\prime }_{\theta }(y_i | x_i, z_i) + (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) | y_i, z_i \} \\ U_{\alpha i}= & {} R_i \phi ^{\prime }_{\alpha }(x_i | z_i) + (1 - R_i) E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) | y_i, z_i \} \end{aligned}$$

where \(i = 1, \ldots , N\). The information matrix for \((\theta ^{\prime }, \alpha ^{\prime })^{\prime }\) then has component matrices as follows, where we define \(A_{\theta }(y | x, z) = -\partial ^2 \log f(y | x, z) / \partial \theta \partial \theta ^{\prime }\) and \(A_{\alpha }(x | z) = -\partial ^2 \log g(x | z) / \partial \alpha \partial \alpha ^{\prime }\):

$$\begin{aligned} I_{\theta \theta i} = -\frac{\partial U_{\theta i}}{\partial \theta ^{\prime }}= & {} R_i A_{\theta }(y_i | x_i, z_i) + (1 - R_i) E \{ A_{\theta }(y_i | X_i, z_i) | y_i, z_i \} \\&- (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) \phi ^{\prime }_{\theta }(y_i | X_i, z_i)^{\prime } | y_i, z_i \} \\&+ (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) |y_i, z_i \} E\{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i)^{\prime } | y_i, z_i \} \\ I_{\alpha \alpha i} = -\frac{\partial U_{\alpha i}}{\partial \alpha ^{\prime }}= & {} R_i A_{\alpha }(x_i | z_i) + (1 - R_i) E \{A_{\alpha }(X_i | z_i) | y_i, z_i \} \\&- (1 - R_i) E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | y_i, z_i \} \\&+ (1 - R_i) E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) |y_i, z_i \} E\{ \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | y_i, z_i \} \\ I_{\theta \alpha i} = -\frac{\partial U_{\theta i}}{\partial \alpha ^{\prime }}= & {} (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) | y_i, z_i \} E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) | y_i, z_i \}^{\prime } \\&- (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | y_i, z_i \} \; . \end{aligned}$$
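
The product terms in each block arise because the conditional density of \(X_i\) given \((y_i, z_i)\) itself depends on the parameters. A brief sketch of that step follows, assuming \(\phi ^{\prime }_{\theta } = \partial \log f / \partial \theta \) (the notation of (4) and (5) is taken on trust here) and that differentiation under the integral sign is valid. Since \(g(x | y, z) = f(y | x, z) g(x | z) / \int f(y | u, z) g(u | z) du\), with the integral read as a sum when \(X\) is discrete,

$$\begin{aligned} \frac{\partial \log g(x | y, z)}{\partial \theta }= & {} \phi ^{\prime }_{\theta }(y | x, z) - E \{ \phi ^{\prime }_{\theta }(y | X, z) | y, z \} \\ \frac{\partial }{\partial \theta ^{\prime }} E \{ \phi ^{\prime }_{\theta }(y | X, z) | y, z \}= & {} E \left\{ \frac{\partial \phi ^{\prime }_{\theta }(y | X, z)}{\partial \theta ^{\prime }} \Big | y, z \right\} + \mathrm{var} \{ \phi ^{\prime }_{\theta }(y | X, z) | y, z \} \; . \end{aligned}$$

Negating the second line gives the \((1 - R_i)\) part of \(I_{\theta \theta i}\): the conditional expectation of the negative second derivative of \(\log f\), minus the conditional variance of the score. The same argument with \(\alpha \) in place of \(\theta \) gives \(I_{\alpha \alpha i}\), and the mixed derivative gives the negative conditional covariance that appears in \(I_{\theta \alpha i}\).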

Define \(A_{\mu \mu i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \mu _i^2\), \(A_{\sigma \sigma i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \sigma \partial \sigma ^{\prime }\), \(A_{\mu \sigma i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \mu _i \partial \sigma ^{\prime }\), \(A_{\alpha \alpha i} = - \partial ^2 \log g(x_i | z_i) / \partial \alpha \partial \alpha ^{\prime }\), where \(\mu _i = \beta ^{\prime } x_i + \gamma ^{\prime } z_i\). We note that under \(H_0\): \(\beta = 0\) we have (i) \(g(x|y,z) = g(x|z)\), (ii) \(f(y | x, z; \theta ) = f(y | z; \gamma , \sigma )\), and (iii) all of \(\phi ^{\prime }_{\mu }(y | x, z)\), \(\phi ^{\prime }_{\sigma }(y | x, z)\), \(A_{\mu \mu i}\), \(A_{\sigma \sigma i}\) and \(A_{\mu \sigma i}\) depend only on y and z. The components of \(I_{\theta \theta i}\), \(I_{\alpha \alpha i}\) and \(I_{\theta \alpha i}\) then simplify as follows:

$$\begin{aligned} I_{\beta \beta i}= & {} R_i A_{\mu \mu i} x_i x^{\prime }_i + (1 - R_i) \{ A_{\mu \mu i} E(X_i X^{\prime }_i | z_i) - \phi ^{\prime }_{\mu }(y_i | z_i)^2 \mathrm{var}(X_i | z_i) \} \\ I_{\beta \gamma i}= & {} R_i A_{\mu \mu i} x_i z^{\prime }_i + (1 - R_i) A_{\mu \mu i} E(X_i | z_i) z^{\prime }_i \\ I_{\beta \sigma i}= & {} R_i x_i A_{\mu \sigma i} + (1 - R_i) E(X_i | z_i) A_{\mu \sigma i} \\ I_{\beta \alpha i}= & {} - (1 - R_i) \phi ^{\prime }_{\mu }(y_i | z_i) E \{X_i \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | z_i \} \\ I_{\gamma \gamma i}= & {} A_{\mu \mu i} z_i z^{\prime }_i \;,~~ I_{\gamma \sigma i} = z_i A_{\mu \sigma i} \;, ~~ I_{\gamma \alpha i} = 0 \;, ~~ I_{\sigma \alpha i} = 0 \;, ~~ I_{\alpha \alpha i} = R_i A_{\alpha \alpha i} \; . \end{aligned}$$
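
The \((1 - R_i)\) part of \(I_{\beta \beta i}\) can be checked numerically in a simple special case. The sketch below is illustrative only and rests on assumptions that are not in the text: a scalar binary \(X\) with \(P(X = 1 | z) = p\), a scalar \(z\), and an uncensored normal error density \(f_0\) with mean \(\mu \) and standard deviation \(\sigma \), so that \(\phi ^{\prime }_{\mu } = (y - \mu ) / \sigma ^2\) and \(A_{\mu \mu } = 1 / \sigma ^2\). All function names are hypothetical. It compares the closed form with a finite-difference second derivative of the log-likelihood contribution \(\log \sum _x f(y | x, z) g(x | z)\) at \(\beta = 0\).

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) at y (assumed error model, no censoring)."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def loglik_x_missing(beta, y, z, gamma, sigma, p):
    """log sum_x f(y | x, z) g(x | z) for binary X with P(X = 1 | z) = p."""
    parts = [
        math.log(1.0 - p) + normal_logpdf(y, gamma * z, sigma),        # x = 0
        math.log(p) + normal_logpdf(y, beta + gamma * z, sigma),       # x = 1
    ]
    top = max(parts)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in parts))


def info_beta_beta_h0(y, z, gamma, sigma, p, r, x=None):
    """Closed-form I_{beta beta i} under H0 for scalar binary X."""
    a_mumu = 1.0 / sigma**2                 # A_{mu mu i} for the normal model
    score_mu = (y - gamma * z) / sigma**2   # phi'_mu(y | z) at beta = 0
    if r == 1:
        return a_mumu * x * x
    # E(X^2 | z) = p and var(X | z) = p (1 - p) for binary X
    return a_mumu * p - score_mu**2 * p * (1.0 - p)


def info_beta_beta_numeric(y, z, gamma, sigma, p, h=1e-4):
    """Minus the central-difference second derivative in beta at beta = 0."""
    f = lambda b: loglik_x_missing(b, y, z, gamma, sigma, p)
    return -(f(h) - 2.0 * f(0.0) + f(-h)) / h**2


closed = info_beta_beta_h0(y=1.3, z=0.5, gamma=0.8, sigma=1.2, p=0.3, r=0)
numeric = info_beta_beta_numeric(y=1.3, z=0.5, gamma=0.8, sigma=1.2, p=0.3)
print(closed, numeric)
```

The two printed values should agree to within finite-difference error. A failure-time application would replace the normal density with the relevant location-scale model for log failure time and account for censoring, which this sketch does not attempt.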

The variance estimate (8) is then obtained by noting that under \(H_0\), the asymptotic variance of \(N^{-1/2} U_{\beta }(\tilde{\theta }, \tilde{\alpha })\) is the limit of

$$\begin{aligned} N^{-1} E \left\{ I_{\beta \beta } - \left( \begin{array}{cc} I_{\beta \eta } &{} I_{\beta \alpha } \\ \end{array} \right) \left( \begin{array}{cc} I_{\eta \eta } &{} 0 \\ 0 &{} I_{\alpha \alpha } \\ \end{array} \right) ^{-1} \left( \begin{array}{c} I^{\prime }_{\beta \eta } \\ I^{\prime }_{\beta \alpha } \\ \end{array} \right) \right\} \; , \end{aligned}$$

where \(I_{\beta \beta } = \sum _{i=1}^N I_{\beta \beta i}\), \(I_{\beta \gamma } = \sum _{i=1}^N I_{\beta \gamma i}\), etc. The estimate is obtained by replacing \(\theta \), \(\alpha \) in the entries of \(I_{\beta \beta }\), \(I_{\beta \eta }\), \(I_{\beta \alpha }\), \(I_{\eta \eta }\) and \(I_{\alpha \alpha }\) given above with \(\tilde{\theta }\), \(\tilde{\alpha }\).
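
Because the nuisance block is block diagonal under \(H_0\), the adjustment splits into one term for \(\eta \) and one for \(\alpha \). The following is a minimal sketch of that computation, not the author's implementation: it assumes the summed matrices have already been evaluated at \(\tilde{\theta }\), \(\tilde{\alpha }\) and are supplied as two-dimensional NumPy arrays. The function name `score_variance` and the example numbers are hypothetical.

```python
import numpy as np


def score_variance(I_bb, I_be, I_ba, I_ee, I_aa):
    """I_bb - I_be I_ee^{-1} I_be' - I_ba I_aa^{-1} I_ba'.

    Inputs are summed information blocks (2-D arrays). The result is on the
    sum scale; dividing by N gives the scale of the displayed expression.
    """
    I_bb, I_be, I_ba, I_ee, I_aa = (
        np.asarray(m, dtype=float) for m in (I_bb, I_be, I_ba, I_ee, I_aa)
    )
    # Solve linear systems instead of forming explicit inverses.
    adj_eta = I_be @ np.linalg.solve(I_ee, I_be.T)
    adj_alpha = I_ba @ np.linalg.solve(I_aa, I_ba.T)
    return I_bb - adj_eta - adj_alpha


# Made-up numbers: scalar beta, two-dimensional eta, scalar alpha.
I_bb = np.array([[4.0]])
I_be = np.array([[1.0, 0.5]])
I_ba = np.array([[0.2]])
I_ee = np.array([[2.0, 0.0], [0.0, 1.0]])
I_aa = np.array([[0.5]])

V = score_variance(I_bb, I_be, I_ba, I_ee, I_aa)
print(V)
```

Splitting the adjustment this way avoids assembling and inverting the full nuisance matrix, and it gives the same result as the block form whenever the off-diagonal nuisance blocks are zero.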

About this article

Cite this article

Lawless, J.F. Two-phase outcome-dependent studies for failure times and testing for effects of expensive covariates. Lifetime Data Anal 24, 28–44 (2018). https://doi.org/10.1007/s10985-016-9386-8

  • DOI: https://doi.org/10.1007/s10985-016-9386-8
