Two-phase outcome-dependent studies for failure times and testing for effects of expensive covariates

Lawless, J. F.

doi:10.1007/s10985-016-9386-8

Published: 29 November 2016

Lifetime Data Analysis, volume 24, pages 28–44 (2018)

Abstract

Two- or multi-phase study designs are often used in settings involving failure times. In most studies, whether or not certain covariates are measured on an individual depends on their failure time and status. For example, when failures are rare, case–cohort or case–control designs are used to increase the number of failures relative to a random sample of the same size. Another scenario is where certain covariates are expensive to measure, so they are obtained only for selected individuals in a cohort. This paper considers such situations and focuses on cases where we wish to test hypotheses of no association between failure time and expensive covariates. Efficient score tests based on maximum likelihood are developed and shown to have a simple form for a wide class of models and sampling designs. Some numerical comparisons of study designs are presented.

Acknowledgements

This research was supported by a grant to the author from the Natural Sciences and Engineering Research Council of Canada (DG RGPIN-8597). The author thanks Ker-ai Lee for implementing the simulation studies in Section 4, and Shelley Bull and Yildiz Yilmaz for helpful comments on the study in Section 4.2.

Author information

Authors and Affiliations

Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
J. F. Lawless

Corresponding author

Correspondence to J. F. Lawless.

Appendix: Variance estimate for score statistic (7)

The individual components of the score functions given by (4) and (5) can be written as

$$\begin{aligned} U_{\theta i}= & {} R_i \phi ^{\prime }_{\theta }(y_i | x_i, z_i) + (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) | y_i, z_i \} \\ U_{\alpha i}= & {} R_i \phi ^{\prime }_{\alpha }(x_i | z_i) + (1 - R_i) E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) | y_i, z_i \} \end{aligned}$$

where $i = 1, \ldots , N$. The information matrix for $(\theta ^{\prime }, \alpha ^{\prime })^{\prime }$ then has component matrices as follows, where we define $A_{\theta }(y | x, z) = -\partial ^2 \log f(y | x, z) / \partial \theta \partial \theta ^{\prime }$ and $A_{\alpha }(x | z) = -\partial ^2 \log g(x | z) / \partial \alpha \partial \alpha ^{\prime }$:

$$\begin{aligned} I_{\theta \theta i} = -\frac{\partial U_{\theta i}}{\partial \theta ^{\prime }}= & {} R_i A_{\theta }(y_i | x_i, z_i) + (1 - R_i) E \{ A_{\theta }(y_i | X_i, z_i) | y_i, z_i \} \\&- (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) \phi ^{\prime }_{\theta }(y_i | X_i, z_i)^{\prime } | y_i, z_i \} \\&+ (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) |y_i, z_i \} E\{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i)^{\prime } | y_i, z_i \} \\ I_{\alpha \alpha i} = -\frac{\partial U_{\alpha i}}{\partial \alpha ^{\prime }}= & {} R_i A_{\alpha }(x_i | z_i) + (1 - R_i) E \{A_{\alpha }(X_i | z_i) | y_i, z_i \} \\&- (1 - R_i) E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | y_i, z_i \} \\&+ (1 - R_i) E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) |y_i, z_i \} E\{ \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | y_i, z_i \} \\ I_{\theta \alpha i} = -\frac{\partial U_{\theta i}}{\partial \alpha ^{\prime }}= & {} (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) | y_i, z_i \} E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) | y_i, z_i \}^{\prime } \\&- (1 - R_i) E \{ \phi ^{\prime }_{\theta }(y_i | X_i, z_i) \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | y_i, z_i \} \; . \end{aligned}$$
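To make the two cases in the score components concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a scalar expensive covariate $X$ on a finite support, a normal density for $y$ with mean $\beta x + \gamma z$ and standard deviation $\sigma$, and a distribution $g(x | z)$ that does not depend on $z$. The function names `mu_score`, `posterior_x` and `score_beta_i` are hypothetical. For $R_i = 1$ the observed $x_i$ is used directly; for $R_i = 0$ the score is averaged over $g(x | y_i, z_i) \propto f(y_i | x, z_i) g(x | z_i)$.

```python
import math


def mu_score(y, mu, sigma):
    """Derivative with respect to mu of the log normal density of y (assumed model)."""
    return (y - mu) / sigma ** 2


def posterior_x(y, z, beta, gamma, sigma, support, prior):
    """g(x | y, z) on a finite support, by Bayes' rule: f(y | x, z) g(x | z) normalized."""
    weights = []
    for x, p in zip(support, prior):
        mu = beta * x + gamma * z
        # The normal constant 1 / (sigma * sqrt(2 pi)) cancels in the normalization.
        weights.append(p * math.exp(-0.5 * ((y - mu) / sigma) ** 2))
    total = sum(weights)
    return [w / total for w in weights]


def score_beta_i(r, y, x, z, beta, gamma, sigma, support, prior):
    """Contribution of one individual to the score for beta.

    r = 1: x was measured, so the complete-data score is used.
    r = 0: x is missing by design, so the score is averaged over g(x | y, z).
    """
    if r == 1:
        return mu_score(y, beta * x + gamma * z, sigma) * x
    post = posterior_x(y, z, beta, gamma, sigma, support, prior)
    return sum(
        q * mu_score(y, beta * u + gamma * z, sigma) * u
        for u, q in zip(support, post)
    )
```

With $\beta = 0$ the weights returned by `posterior_x` equal the prior probabilities, so the contribution for an unsampled individual is the $\mu$-score at $(y_i, z_i)$ multiplied by $E(X_i | z_i)$.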

Define $A_{\mu \mu i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \mu _i^2$, $A_{\sigma \sigma i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \sigma \partial \sigma ^{\prime }$, $A_{\mu \sigma i} = -\partial ^2 \log f_0(y_i; \mu _i, \sigma ) / \partial \mu _i \partial \sigma ^{\prime }$, $A_{\alpha \alpha i} = - \partial ^2 \log g(x_i | z_i) / \partial \alpha \partial \alpha ^{\prime }$, where $\mu _i = \beta ^{\prime } x_i + \gamma ^{\prime } z_i$. We note that under $H_0$: $\beta = 0$ we have (i) $g(x|y,z) = g(x|z)$, (ii) $f(y | x, z; \theta ) = f(y | z; \gamma , \sigma )$, and (iii) all of $\phi ^{\prime }_{\mu }(y | x, z)$, $\phi ^{\prime }_{\sigma }(y | x, z)$, $A_{\mu \mu i}$, $A_{\sigma \sigma i}$ and $A_{\mu \sigma i}$ depend only on y and z. The components of $I_{\theta \theta i}$, $I_{\alpha \alpha i}$ and $I_{\theta \alpha i}$ then simplify as follows:

$$\begin{aligned} I_{\beta \beta i}= & {} R_i A_{\mu \mu i} x_i x^{\prime }_i + (1 - R_i) \{ A_{\mu \mu i} E(X_i X^{\prime }_i | z_i) - \phi ^{\prime }_{\mu }(y_i | z_i)^2 \mathrm{var}(X_i | z_i) \} \\ I_{\beta \gamma i}= & {} R_i A_{\mu \mu i} x_i z^{\prime }_i + (1 - R_i) A_{\mu \mu i} E(X_i | z_i) z^{\prime }_i \\ I_{\beta \sigma i}= & {} R_i x_i A_{\mu \sigma i} + (1 - R_i) E(X_i | z_i) A_{\mu \sigma i} \\ I_{\beta \alpha i}= & {} - (1 - R_i) \phi ^{\prime }_{\mu }(y_i | z_i) E \{X_i \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | z_i \} \\ I_{\gamma \gamma i}= & {} A_{\mu \mu i} z_i z^{\prime }_i \;,~~ I_{\gamma \sigma i} = z_i A_{\mu \sigma i} \;, ~~ I_{\gamma \alpha i} = 0 \;, ~~ I_{\sigma \alpha i} = 0 \;, ~~ I_{\alpha \alpha i} = R_i A_{\alpha \alpha i} \; . \end{aligned}$$
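As a worked check of the last entry, the reduction of $I_{\alpha \alpha i}$ can be traced as follows. This is a sketch that assumes $\phi ^{\prime }_{\alpha }(x | z)$ denotes $\partial \log g(x | z) / \partial \alpha $ and that differentiation under the integral sign is permitted. Write $A_{\alpha \alpha }(x | z) = -\partial ^2 \log g(x | z) / \partial \alpha \partial \alpha ^{\prime }$. By (i), conditional expectations of functions of $X_i$ given $(y_i, z_i)$ reduce under $H_0$ to expectations given $z_i$ alone, and the usual score identities for $g$ give

$$\begin{aligned} E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) | z_i \} &= 0 \;, \\ E \{ \phi ^{\prime }_{\alpha }(X_i | z_i) \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime } | z_i \} &= E \{ A_{\alpha \alpha }(X_i | z_i) | z_i \} \;. \end{aligned}$$

Substituting these into the general expression for $I_{\alpha \alpha i}$, the terms multiplied by $(1 - R_i)$ cancel:

$$\begin{aligned} I_{\alpha \alpha i} &= R_i A_{\alpha \alpha i} + (1 - R_i) \left[ E \{ A_{\alpha \alpha }(X_i | z_i) | z_i \} - E \{ \phi ^{\prime }_{\alpha } \phi ^{\prime \prime }_{\alpha } | z_i \} + 0 \right] \\ &= R_i A_{\alpha \alpha i} \;, \end{aligned}$$

where $\phi ^{\prime }_{\alpha } \phi ^{\prime \prime }_{\alpha }$ is shorthand, in this display only, for the outer product $\phi ^{\prime }_{\alpha }(X_i | z_i) \phi ^{\prime }_{\alpha }(X_i | z_i)^{\prime }$. The same mean-zero identity removes the product-of-expectations term from $I_{\theta \alpha i}$, which leaves only the cross-product term shown in $I_{\beta \alpha i}$.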

The variance estimate (8) is then obtained by noting that under $H_0$, the asymptotic variance of $N^{-1/2} U_{\beta }(\tilde{\theta }, \tilde{\alpha })$ is the limit of

$$\begin{aligned} N^{-1} E \left\{ I_{\beta \beta } - \left( I_{\beta \eta } I_{\beta \alpha } \right) \left( \begin{array}{cc} I_{\eta \eta } &{} 0 \\ 0 &{} I_{\alpha \alpha } \\ \end{array} \right) ^{-1} \left( \begin{array}{c} I^{\prime }_{\beta \eta } \\ I^{\prime }_{\beta \alpha } \\ \end{array} \right) \right\} \; , \end{aligned}$$

where $I_{\beta \beta } = \sum _{i=1}^N I_{\beta \beta i}$, $I_{\beta \gamma } = \sum _{i=1}^N I_{\beta \gamma i}$, etc., and by replacing $\theta $, $\alpha $ in the entries of $I_{\beta \beta }$, $I_{\beta \eta }$, $I_{\beta \alpha }$, $I_{\eta \eta }$ and $I_{\alpha \alpha }$ given above with $\tilde{\theta }$, $\tilde{\alpha }$.
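The matrix expression above can be evaluated numerically once the summed blocks are available. The sketch below is illustrative only and uses NumPy. The function names `score_variance` and `score_statistic` are hypothetical, and $\eta $ is read here as the nuisance parameters $(\gamma ^{\prime }, \sigma ^{\prime })^{\prime }$ of the failure-time model, which is an assumption based on the $I_{\beta \gamma i}$ and $I_{\beta \sigma i}$ entries. The inputs are the unnormalized sums over $i$, so the result is the bracketed matrix without the $N^{-1}$ factor.

```python
import numpy as np


def score_variance(I_bb, I_be, I_ba, I_ee, I_aa):
    """I_bb - (I_be I_ba) blockdiag(I_ee, I_aa)^{-1} (I_be I_ba)'.

    Arguments are the summed information blocks evaluated at the null
    estimates; eta is assumed to stack (gamma, sigma).
    """
    I_bb, I_be, I_ba, I_ee, I_aa = (
        np.atleast_2d(np.asarray(m, dtype=float))
        for m in (I_bb, I_be, I_ba, I_ee, I_aa)
    )
    cross = np.hstack([I_be, I_ba])  # row block (I_be  I_ba)
    q, a = I_ee.shape[0], I_aa.shape[0]
    # Block-diagonal nuisance information, reflecting the zero eta-alpha blocks.
    nuisance = np.block([
        [I_ee, np.zeros((q, a))],
        [np.zeros((a, q)), I_aa],
    ])
    # Solve a linear system rather than forming the inverse explicitly.
    return I_bb - cross @ np.linalg.solve(nuisance, cross.T)


def score_statistic(U_b, V):
    """Quadratic form U' V^{-1} U (a standard score-test form, assumed here)."""
    U_b = np.atleast_1d(np.asarray(U_b, dtype=float))
    return float(U_b @ np.linalg.solve(V, U_b))


# Small made-up example with scalar beta, two eta components, scalar alpha.
V = score_variance(
    I_bb=[[10.0]],
    I_be=[[2.0, 1.0]],
    I_ba=[[3.0]],
    I_ee=[[4.0, 0.0], [0.0, 2.0]],
    I_aa=[[9.0]],
)
W = score_statistic([3.0], V)
```

Because the nuisance matrix is block diagonal, the same value can be obtained as $I_{\beta \beta } - I_{\beta \eta } I_{\eta \eta }^{-1} I^{\prime }_{\beta \eta } - I_{\beta \alpha } I_{\alpha \alpha }^{-1} I^{\prime }_{\beta \alpha }$. Whether `score_statistic` coincides exactly with statistic (7) is not confirmed here, since (7) is not reproduced in this section. The numbers in the example are invented for illustration.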

About this article

Cite this article

Lawless, J.F. Two-phase outcome-dependent studies for failure times and testing for effects of expensive covariates. Lifetime Data Anal 24, 28–44 (2018). https://doi.org/10.1007/s10985-016-9386-8

Received: 20 April 2016
Accepted: 23 November 2016
Published: 29 November 2016
Issue Date: January 2018
DOI: https://doi.org/10.1007/s10985-016-9386-8
