## Abstract

Case–cohort studies are useful when information on certain risk factors is difficult or costly to ascertain. In particular, a case–cohort study may be well suited to situations where several case series are of interest, e.g. in studies with competing risks, because the same sub-cohort may serve as a comparison group for all case series. Previous analyses of this kind of sampled cohort data most often involved estimation of rate ratios based on a Cox regression model. However, with competing risks this method will not provide parameters that directly describe the association between covariates and cumulative risks. In this paper, we study regression analysis of cause-specific cumulative risks in case–cohort studies using pseudo-observations. We focus mainly on the situation with competing risks. However, as a by-product, we also develop a method by which absolute mortality risks may be analyzed directly from case–cohort survival data. We adjust for the case–cohort sampling by applying inverse sampling probabilities to a generalized estimating equation. The large-sample properties of the proposed estimator are developed and small-sample properties are evaluated in a simulation study. We apply the methodology to study the effect of a specific diet component and a specific gene on the absolute risk of atrial fibrillation.

## Introduction

Epidemiological cohort studies are useful for quantifying the association between risk factors or other covariates and rates of mortality or morbidity. In such studies, regression methods are commonly applied. The Cox regression model is by far the most frequently applied model for such data with incomplete follow-up. In some situations, information on certain risk factors may be difficult or costly to ascertain; in these situations, various methods for *sampling from the cohort* have been proposed. Thus, in a study of the association between cervical carcinoma in situ (CIN) and HPV-16 viral load, Josefsson et al (2000) applied a *nested case–control study* (Thomas 1977; Borgan and Samuelsen 2013) including all 468 cases of CIN and five controls per case sampled from the screened cohort consisting of 146,889 Swedish women that generated the cases. The authors applied this design to reduce the costs associated with doing the cytological analyses needed to ascertain the viral load. Further, Petersen et al (2005) used a *case–cohort design* (Prentice 1986; Borgan and Samuelsen 2013) in a study of the association between cause-specific mortality rates among Danish adoptees and cause of death information for their biological and adoptive parents. Here, data on all of the 1403 adoptees who were observed to die were ascertained together with data on a random *sub-cohort* sampled from the Danish Adoption Register. The sub-cohort comprised 1683 adoptees (of whom 203 were also among the cases). In that study, ascertainment of data on cause-specific mortality for the biological and adoptive parents was very time-consuming as it involved scrutiny of non-computerized mortality records.

Estimation in the Cox model for the case–cohort study is most often based on a modification of the Cox partial likelihood. In this modification, at each observed event time, the covariate values of the individual experiencing an event are compared to those from a sample of the cohort, adjusting for the case–cohort sampling by a type of inverse probability of sampling weights. A considerable body of literature discusses the most efficient way of analyzing case–cohort data and of choosing the sampling weights; this literature is reviewed in Borgan and Samuelsen (2013). Likelihood-based inference has also been applied (Kalbfleisch and Lawless 1988; Scheike and Martinussen 2004). In the study by Petersen et al (2005), the authors were interested in analyzing mortality from several different causes. In such a competing risks situation the case–cohort design may be particularly useful because the same randomly chosen sub-cohort may be used as a comparison group for several different case series (Prentice 1986).

Analysis of sampled cohort data most often involves estimation of hazard ratios based on a Cox regression model. However, both of the above-mentioned designs allow for estimation of absolute risks because baseline hazards are estimable whenever the sampled data are based on a well-defined cohort for which the size is known throughout time. Thus, when a case–cohort design is used for analyzing data with competing risks, the cumulative incidence for each cause may be estimated by plugging in estimates from models for all of the cause-specific hazards. This method, however, will not provide parameters that directly describe the association between covariates and cumulative incidences. It would therefore be of interest to directly fit regression models for the cumulative incidences based on the data sampled via a case–cohort design.

In recent years, methods for analyzing direct regression models for parameters other than hazard ratios have been developed. Such parameters include contrasts between cumulative incidences, restricted means, and numbers of life years lost due to specific causes of death (Klein et al 2007; Andersen et al 2003, 2004; Scheike et al 2008; Andersen 2013). A flexible framework has been established by transforming the time-to-event data into pseudo-observations, which are then analyzed using estimating equations for a generalized linear model for the parameter in question.

In this paper, we study the analysis of cause-specific cumulative risks in a case–cohort study using pseudo-observations. We focus mainly on the situation with competing risks. However, as a by-product, we also develop a method by which absolute mortality risks may be analyzed directly from case–cohort survival data. We adjust for the case–cohort sampling by applying inverse sampling probabilities to the generalized estimating equation. The large-sample properties of the proposed estimator are developed and the small-sample properties are evaluated in a simulation study. We apply the methodology to case–cohort data based on the Danish Diet, Cancer, and Health Cohort in Sect. 5. In the case–cohort study we were interested in analyzing the effect of specific genes on the absolute risk of atrial fibrillation. Genetic information was obtained for 2192 cases, corresponding to 91% of all cases, and for a sub-sample of 4559 individuals, corresponding to a random 8% sample of the cohort. The sub-sample left 4330 non-cases for analysis, corresponding to 8% of all non-cases in the cohort.

## Set-up

Let *T* denote the time of event, \(\varDelta \in \{ 1,\ldots ,d\}\) the type of event and \(Z=(Z_{(1)}^T, Z_{(2)}^T)^T\) a vector of covariates, where \(Z_{(1)}\) are covariates available for the full cohort, and \(Z_{(2)}\) are covariates available only in the case–cohort sample. We consider regression models for the cumulative risk of event type 1 at a time point *t*. We will formulate a framework that is sufficiently general to include several association measures, such as risk ratios or risk differences. Let *V* denote the function of \((T,\varDelta )\) which is of interest, that is, \(1\{T\le t,\varDelta =1\}\) in the case where focus is on the cumulative risk of a type-1 event. The quantity of interest is then \(\mathrm {E}(V)=P(T\le t,\varDelta =1)=F_1(t)\), say, and we are interested in a regression model \(\mathrm {E}(V|Z)=\mu (\beta _0;Z)\) for the mean function \(\mu (Z)=P(T\le t,\varDelta =1|Z)\). Typically, \(\mu (\beta _0; Z) = \mu (\beta _0^T Z)\), where the right-hand side \(\mu \) is the inverse link function of a generalized linear model. Here we will specify only a model for the risk of 1-events, \(P(T\le t,\varDelta =1|Z)=\mu (\beta _0; Z)\). We may not be able to observe *V* directly because of censoring. Let *C* be a right-censoring time and assume that we observe the censored event time \(\tilde{T}=T\wedge C\) and event type \(\tilde{\varDelta }=\varDelta 1\{T\le C\}\). We assume that the censoring time *C* is independent of \((T,\varDelta ,Z)\) including the covariate \(Z_{(2)}\) that will only be observed in the case–cohort sample. The time-to-event outcome thus consists of \(X=(\tilde{T},\tilde{\varDelta })\) and we assume that the time-to-event data are available for the complete cohort.

We now turn to sampling of the sub-cohort where the covariates \(Z_{(2)}\) are available. In the classic case–cohort design, the covariates are complete for all observed events and for a sub-cohort chosen at random. We will allow the 1-events to be sampled at random with a fraction less than 100%. Such a design is often called the generalized case–cohort study design (Chen 2001; Cai and Zeng 2007). Let \(\xi \) be an indicator of an individual being sampled in the sub-cohort with probability \(p>0\) and let \(\eta \) be an indicator for an individual being sampled among 1-events outside the sub-cohort with probability *q*. The additional events are sampled among the \(\tilde{n}\), say, 1-events not in the sub-cohort. The sampling indicator can then be written \(R=(\xi +(1-\xi )\cdot 1(\tilde{\varDelta }=1)\cdot \eta )\). The sub-cohort and additional 1-event subjects are selected by independent Bernoulli sampling with the probabilities *p* and *q* (Kulich and Lin 2000). The Bernoulli sampling is slightly different from the original case–cohort sampling in Prentice (1986), where the sub-cohort was selected as a simple random sample of fixed size of the cohort. Similarly, in Cai and Zeng (2007) the additional 1-event subjects were selected as a fixed-size simple random sample among the 1-event subjects that were not selected in the sub-cohort. The Bernoulli sampling has been previously studied in case–cohort studies (Kulich and Lin 2000; Lin 2000; Chen 2001; Zhang and Goldstein 2003; Kulich and Lin 2004). As noted by Chen (2001), the simple random sampling of fixed size and the Bernoulli sampling case–cohort study will under regularity conditions have the same asymptotic properties. The Bernoulli sampling results in an i.i.d. setting, where the observations \((X_i,Z_i,R_i)\), \(i=1,\ldots ,n\), are independent replicates of \((X,Z,R)\).

The subjects experiencing 1-events are thus sampled with probability \(\alpha _1=(p+(1-p)q)\) and 1-event-free subjects are sampled with the probability \(\alpha _0=p\). For the purpose of the analysis presented here, the generalized case–cohort design is a way of sampling 1-events with probability \(\alpha _1\) and 1-event-free subjects with probability \(\alpha _0\) (Fig. 1). The analysis is also valid for a survey type of sampling which directly samples 1-event subjects with the probability \(\alpha _1\) and 1-event-free subjects with the probability \(\alpha _0\). In the example we will estimate the sampling fractions \(\alpha _1\) and \(\alpha _0\) from the number of selected 1-event and 1-event-free subjects with measured genetic data. The selection indicator distribution may also depend on the observed covariates, \(Z_{(1)}\), as is the case in stratified sampling (Borgan and Samuelsen 2013).
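As an illustration of the sampling mechanism, the following minimal sketch (with assumed values for \(n\), \(p\), \(q\) and the 1-event fraction, none of which are from the paper) simulates the selection indicator \(R=\xi +(1-\xi )\cdot 1(\tilde{\varDelta }=1)\cdot \eta \) and checks the implied sampling fractions \(\alpha _1=p+(1-p)q\) and \(\alpha _0=p\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 100_000, 0.10, 0.90        # assumed sub-cohort and extra-case sampling probabilities
is_case = rng.random(n) < 0.05       # indicator of an observed 1-event (5% chosen for illustration)

xi = rng.random(n) < p               # sampled into the sub-cohort
eta = rng.random(n) < q              # sampled among 1-events outside the sub-cohort
R = xi | (~xi & is_case & eta)       # selection indicator R = xi + (1 - xi) * 1(case) * eta

alpha1 = p + (1 - p) * q             # theoretical sampling fraction for 1-events
alpha0 = p                           # theoretical sampling fraction for 1-event-free subjects
print(R[is_case].mean(), alpha1)     # empirical vs. theoretical fraction among 1-events
print(R[~is_case].mean(), alpha0)    # empirical vs. theoretical fraction among 1-event-free
```

The empirical selection fractions should match \(\alpha _1\) and \(\alpha _0\) up to Monte Carlo error.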

We will assume that the time-to-event outcome *X* is available for the full cohort. Let \(\tilde{X}=(X,R)\) denote the combined time-to-event data and indicator for selection. We consider inference based on \((X_1,Z_1,R_1),\ldots ,(X_n,Z_n,R_n)\) that are i.i.d. replicates of (*X*, *Z*, *R*). Note that due to the Bernoulli sampling, the average of the sampling indicators, \(\frac{1}{n}\sum _{i=1}^n R_i\), converges automatically; this is therefore not an assumption, as it is for simple random sampling of fixed sample size. For evaluation of the large sample properties it is useful to work with a counting process representation of the time-to-event data. Define first the at-risk indicator \(Y_i(s)=1(s\le \tilde{T}_i)\) and the counting processes \(N_{i,j}(s)=1(\tilde{T}_i\le s,\tilde{\varDelta }_i=j)\) for censoring \(j= 0\) and each event type \(j=1,\ldots ,d\). The counting process representation is then \(\delta _{X_i}=(Y_i(\cdot ),N_{i,0}(\cdot ),N_{i,1}(\cdot ),\ldots N_{i,d}(\cdot ))\) if joint analysis of all event types is of interest and \(\delta _{X_i}=(Y_i(\cdot ),N_{i,0}(\cdot ),N_{i,1}(\cdot ))\) for the analysis of the first event type. For simplicity, we will concentrate on the first event type. Now, we define a number of quantities depending on the censored event times: let \(H_j(s)=\mathrm {P}(\tilde{T}_i\le s,\tilde{\varDelta }_i =j)\) be the probability of observing the event of type \(j=1,\ldots ,d\) or censoring \(j=0\) before time *s*, let \(H(s)=\mathrm {P}(\tilde{T}_i\ge s)\) be the probability of being at risk at time *s*, and let \(F=(H,H_0,\ldots ,H_d)^T\). These functions can be estimated by their empirical versions \(\hat{H}_n(s)=\frac{1}{n}\sum _{i=1}^n 1(\tilde{T}_i\ge s)=\frac{1}{n}\sum _{i=1}^n Y_i(s)\) and \(\hat{H}_{n,j}(s)=\frac{1}{n}\sum _{i=1}^n 1(\tilde{T}_i\le s,\tilde{\varDelta }_i=j)=\frac{1}{n}\sum _{i=1}^n N_{i,j}(s)\) for \(j=0,1,\ldots ,d\).

The cumulative event-1 risk can be estimated by observing that under the assumption that censoring \(C_i\) is independent of \((T_i,\varDelta _i)\), an estimator of the cumulative censoring hazard \(\varLambda _0(s)=\int _0^s \frac{1}{H(u)}\text {d}H_0(u)\) is the Nelson-Aalen estimate \(\hat{\varLambda }_{n,0}(s)=\int _0^s \frac{1}{\hat{H}_n(u)}\text {d}\hat{H}_{n,0}(u)\) and an estimator of the survival function for the censoring distribution is the Kaplan–Meier estimate \(\hat{G}_n(s)=\prod _{(0,s]}\{1-\text {d}\hat{\varLambda }_{n,0}(u)\}\), where \(\prod \) is the product integral. An estimator of the cumulative type-1 risk \(F_1(s)=\int _0^s \frac{1}{G(u-)} \text {d}H_1(u)\) is then \(\widehat{F}_{n,1}(s)=\int _0^s \frac{1}{\hat{G}_n(u-)}\text {d}\hat{H}_{n,1}(u)\), which is the Aalen–Johansen estimator of \(F_1(s)\) in its inverse-probability-of-censoring weighted form (Jewell et al 2007). The Aalen–Johansen estimator is thus a function, \(\phi \) say, of a sample average of the counting process representation, \(F_n=\frac{1}{n}\sum _{i=1}^n\delta _{X_i}\).
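The estimator above can be sketched in a few lines; `km_censoring` and `aalen_johansen_ipcw` are hypothetical helper names, and the implementation assumes continuous times (no ties between events and censorings):

```python
import numpy as np

def km_censoring(t_obs, delta, s):
    """Kaplan-Meier estimate of the censoring survival function G(s):
    product over observed censoring times u <= s of {1 - dN_0(u) / Y(u)}."""
    cens_times = np.unique(t_obs[(delta == 0) & (t_obs <= s)])
    G = 1.0
    for u in cens_times:
        at_risk = np.sum(t_obs >= u)               # Y(u), subjects still at risk at u
        d0 = np.sum((t_obs == u) & (delta == 0))   # censorings at u
        G *= 1.0 - d0 / at_risk
    return G

def aalen_johansen_ipcw(t_obs, delta, t):
    """Aalen-Johansen estimate of F_1(t) in its IPCW form:
    (1/n) * sum_i 1(T_i <= t, Delta_i = 1) / G(T_i -)."""
    n = len(t_obs)
    total = 0.0
    for i in range(n):
        if t_obs[i] <= t and delta[i] == 1:
            # G(u-) approximated by evaluating G just before the event time
            total += 1.0 / km_censoring(t_obs, delta, t_obs[i] - 1e-12)
    return total / n
```

With no censoring, the estimate reduces to the empirical fraction of type-1 events before \(t\).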

## Pseudo-observations

Censoring will be dealt with by the pseudo-observation method (Andersen et al 2003). The pseudo-observation method is based on a well-behaved estimator of the quantity of interest, \(\theta =\mathrm {E}(V)\). Let \(\hat{\theta }_n\) denote the estimate of \(\theta \) based on the full sample \(X_1,..., X_n\) and let \(\hat{\theta }_n^{(i)}\) denote the similar estimate based on the sample \(X_1,\ldots ,X_{i-1}, X_{i+1},\ldots ,X_{n}\), i.e., leaving out \(X_i\). The jack-knife pseudo-observation is defined as

\[
\hat{\theta }_{n,i}=n\hat{\theta }_n-(n-1)\hat{\theta }_n^{(i)}. \qquad (1)
\]
Since the time-to-event data are assumed available for the full cohort, the pseudo-observations are also available for the full cohort. Let \(A(\beta ;Z_i)\) be a vector function depending only on the regression parameters and covariates. When analyzing the complete cohort, estimates of \(\beta _0\) are then obtained based on \(\hat{\theta }_{n,1},\ldots , \hat{\theta }_{n,n}\) by solving an estimating equation of the type

\[
\sum _{i=1}^n A(\beta ;Z_i)\{\hat{\theta }_{n,i}-\mu (\beta ;Z_i)\}=0. \qquad (2)
\]
This is the estimating equation for a generalized linear model where the pseudo-observation \(\hat{\theta }_{n,i}\) replaces the potentially unobserved \(V_i\). For the cumulative risk of the event of interest we can use the Aalen–Johansen estimator for \(\theta \), and for the risk difference the model \(\mathrm {E}(V_i|Z_i)=\mathrm {P}(T_i\le t,\varDelta _i=1|Z_i)=\beta _0^TZ_i\) with \(A(\beta ;Z_i)=Z_i\) is a common choice (Klein and Andersen 2005).
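A minimal sketch of the jack-knife construction, with the plug-in estimator passed as a function argument (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def pseudo_observations(t_obs, delta, t, estimator):
    """Jack-knife pseudo-observations
    theta_i = n * theta_hat - (n - 1) * theta_hat^(-i),
    where theta_hat is a plug-in estimate of E(V) = P(T <= t, Delta = 1)
    and theta_hat^(-i) is the same estimate with subject i left out."""
    n = len(t_obs)
    full = estimator(t_obs, delta, t)
    pseudo = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # leave-one-out index set
        pseudo[i] = n * full - (n - 1) * estimator(t_obs[keep], delta[keep], t)
    return pseudo
```

With \(A(\beta ;Z_i)=Z_i\) and the linear model \(\mu (\beta ;Z_i)=\beta ^TZ_i\), solving the estimating equation then amounts to ordinary least squares of these pseudo-observations on the covariates. Note that when the plug-in estimator is a simple sample mean (no censoring), the jack-knife reproduces the individual indicators \(V_i\) exactly.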

The asymptotic distribution of the pseudo-observation estimator \(\hat{\beta }_n\) that is the solution to (2) was established in survival analysis using the Kaplan–Meier estimator in Jacobsen and Martinussen (2016), and in a more general setting including the Aalen–Johansen estimator in Overgaard et al (2017). Variance estimation in a cohort study based on the Aalen–Johansen estimator in competing risks was considered in Overgaard et al (2018). The asymptotic variance is different from the limit of the usual Huber–White robust variance estimate of (2) due to the correlation between the pseudo-observations. Overgaard et al (2017) showed that under regularity conditions, including standard assumptions for the consistency of the Aalen–Johansen estimate, pseudo-observations satisfy

\[
\hat{\theta }_{n,i}=\theta +\dot{\phi }(t;X_i)+\frac{1}{n}\sum _{j=1}^n \ddot{\phi }(t;X_i,X_j)+o_P(n^{-1/2}) \qquad (3)
\]
uniformly in \(i=1,\ldots ,n\), where \(\dot{\phi }(\cdot )\) and \(\ddot{\phi }(\cdot )\) are the first and second order influence functions defined below. One can view (3) as the approximate transformation of the original time-to-event dataset that is created by the pseudo-observations. The correlation comes from the third term in the approximation, which impacts the asymptotic distribution of the estimator \(\hat{\beta }_n\) as we will explicitly see when turning to the variance estimator in the case–cohort sample. The approximation (3) was established in Overgaard et al (2017) for the Aalen–Johansen estimate under the regularity condition that *G* and \(\varLambda _j\) are continuous functions and \(H(t)>0\).

To formally define the Aalen–Johansen functional, let \(f=(f_*, f_0, f_1)\) be suitable functions (to be defined shortly) corresponding to the three time-to-event processes. Let \(\psi (s; f )=\int _0^s \frac{1}{f_*(u)}\, \text {d}f_0(u)\) be the mapping of the basic counting processes into the integrated censoring hazard, and let the product integral \(\prod _{(0,s]}\{1-\text {d}\psi (u;f)\}\) be the mapping of the integrated hazard into the censoring survival function. Then the Aalen–Johansen functional is given by

\[
\phi (s;f)=\int _0^s \Big [\prod _{(0,u)}\{1-\text {d}\psi (\cdot ;f)\}\Big ]^{-1}\text {d}f_1(u).
\]
Evaluating at \(f=\frac{1}{n}\sum _{i=1}^n \delta _{X_i}\) we get the Aalen–Johansen estimate and evaluating at \(f=F\) we get \(F_1(s)\). A first order derivative at *f* in direction *g* is defined by \(\phi _f^\prime (s;g)=\frac{\partial }{\partial u}\phi (s;f +ug)\big |_{u=0}\) and a second order derivative at *f* in directions *g* and *h* is similarly defined by \(\phi _f^{\prime \prime }(s;g,h)=\frac{\partial ^2}{\partial u\; \partial v}\phi (s;f+ug+vh)\big |_{u=0,v=0}\). Then the first order influence function is \(\dot{\phi }(s; x)=\phi _{F}^\prime (s;\delta _x-F)\) and, similarly, the second order influence function is \(\ddot{\phi }(s; x_1,x_2)=\phi _{F}^{\prime \prime }(s;\delta _{x_1}-F,\delta _{x_2}-F)\).

## Case–cohort sampling

The case–cohort sampling will be dealt with by applying inverse sampling probabilities \(w_i=w(X_i)=P(R_i=1|X_i)^{-1}=\alpha _1^{-1}\cdot 1(\tilde{\varDelta }_i=1)+\alpha _0^{-1}\cdot 1(\tilde{\varDelta }_i\ne 1)\) to the estimating equation. Let \(n_0\) and \(n_1\) denote the numbers of 1-event-free subjects and 1-event subjects in the cohort, and \(m_0\) and \(m_1\) the corresponding numbers in the case–cohort sample. In the weights, we may replace the pre-specified sampling fractions \(\alpha _0\), \(\alpha _1\) with the observed sampling fractions \(m_0/n_0\), \(m_1/n_1\). For simplicity, here we will use the pre-specified sampling fractions \(\alpha _0\), \(\alpha _1\).

We suggest estimating \(\beta \) in the case–cohort sample using the estimating equation

\[
U_n(\beta )=\sum _{i=1}^n R_i w(X_i)A(\beta ;Z_i)\{\hat{\theta }_{n,i}-\mu (\beta ;Z_i)\}=0, \qquad (4)
\]
where \(\hat{\theta }_{n,i}\), \(i=1,\ldots ,n\) are the pseudo-observations based on the Aalen–Johansen estimator for the complete cohort. The estimating function \(U_n\) at \(\beta _0\) will asymptotically have a mean of zero. An informal argument is as follows: Define \(\tilde{A}(\beta ;\tilde{X}_i,Z_i)=R_i w(X_i)A(\beta ; Z_i)\). First notice that for any function \(\gamma (X_1,Z_1,X_2,Z_2)\),

\[
\mathrm {E}\{R_1 w(X_1)\gamma (X_1,Z_1,X_2,Z_2)\}=\mathrm {E}\{\gamma (X_1,Z_1,X_2,Z_2)\}, \qquad (5)
\]

since \(\mathrm {E}\{R_1 w(X_1)|X_1,Z_1\}=1\) by the definition of the weights.
In the estimating equation (4), we can replace the pseudo-observations by the approximation (3), in the limit apply the up-weighting property (5), and then use that the score function for the full cohort estimating equation has a mean of zero. A formal argument for the asymptotic behavior of the estimating equation (4) is given in the Appendix using a U-statistic representation.
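For the linear risk model with \(A(\beta ;Z_i)=Z_i\) and \(\mu (\beta ;Z_i)=\beta ^TZ_i\), solving the weighted estimating equation amounts to weighted least squares of the pseudo-observations on the covariates, restricted to the sampled subjects. A sketch under these assumptions (the function name is illustrative):

```python
import numpy as np

def casecohort_linear_fit(Z, pseudo, R, w):
    """Solve the weighted estimating equation
        sum_i R_i w_i Z_i (theta_i - Z_i^T beta) = 0
    for the linear risk model, i.e. weighted least squares of the
    pseudo-observations on Z restricted to the case-cohort sample (R_i = 1)."""
    W = R * w                              # zero weight outside the sample
    ZtWZ = Z.T @ (W[:, None] * Z)          # sum_i R_i w_i Z_i Z_i^T
    ZtWy = Z.T @ (W * pseudo)              # sum_i R_i w_i Z_i theta_i
    return np.linalg.solve(ZtWZ, ZtWy)
```

When the pseudo-observations follow the linear model exactly, the weighted fit recovers the coefficients regardless of which subjects happen to be sampled.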

Assume that *A* and \(\mu \) are suitably differentiable and integrable, see the Appendix for details.

### Theorem 1

An estimator \(\hat{\beta }_n\) exists such that \(U_n(\hat{\beta }_n)=0\) with a probability tending to 1 for \(n\rightarrow \infty \). Moreover, \(\hat{\beta }_n\rightarrow \beta _0\) in probability and

\[
\sqrt{n}(\hat{\beta }_n-\beta _0)\xrightarrow {\mathcal {D}} N\big (0,\, M^{-1}\varSigma (M^{-1})^T\big )
\]

as \(n\rightarrow \infty \), with

\[
M=\mathrm {E}\{A(\beta _0;Z)\,\partial _\beta \mu (\beta _0;Z)^T\}
\]

and

\[
\varSigma =\mathrm {Var}\{h_0(\tilde{X},Z)\}+\mathrm {E}\{h_0(\tilde{X},Z)h_1(X)^T\}+\mathrm {E}\{h_1(X)h_0(\tilde{X},Z)^T\}+\mathrm {Var}\{h_1(X)\}, \qquad (6)
\]

where

\[
h_0(\tilde{x},z)=r\,w(x)A(\beta _0;z)\{F_1(t)+\dot{\phi }(t;x)-\mu (\beta _0;z)\}
\quad \text {and}\quad
h_1(x)=\mathrm {E}\{A(\beta _0;Z)\,\ddot{\phi }(t;X,x)\}.
\]
The proof of the theorem to some extent follows Overgaard et al (2017, 2018), and the proof is given in the Appendix. The main consideration is that due to the case–cohort sampling the vector \(\tilde{A}(\beta ;\tilde{X}_i,Z_i)\) now depends on the sampling indicator, time-to-event data, as well as the covariate and parameter in \(A(\beta ;Z_i)\). The first part of (6), \(\mathrm {Var}\{h_0(\tilde{X},Z)\}\), is the usual Huber–White robust variance of the generalized linear estimation equation, and the second part can be considered a bias term of the Huber–White variance due to the correlation of the pseudo-observations. The Huber–White variance is thus in general a conservative variance estimator. Note that the bias term is the same as for the full cohort analysis. When the case–cohort is equal to the full cohort, i.e., when \(p=q=1\) or equivalently \(\alpha _0=\alpha _1=1\), then \(\tilde{A}(\beta ;\tilde{X},Z)=A(\beta ; Z)\) and the expression for \(\varSigma \) reduces to the variance for the full cohort (Overgaard et al 2018).

Simulations in Jacobsen and Martinussen (2016) for the Kaplan–Meier estimator and in Overgaard et al (2018) for the Aalen–Johansen estimator for the full cohort analysis show that the Huber–White variance estimator may be a reasonable approximation unless the covariates are believed to have a strong effect on the cumulative incidence and there is a considerable amount of censoring. The relative size of the bias term compared to the Huber–White variance will be even smaller in a case–cohort analysis, since the case–cohort sampling inflates only the first term of (6) whereas the second term is the same as for the cohort analysis.

For estimating *M*, we can use

\[
\hat{M}_n=\frac{1}{n}\sum _{i=1}^n R_i w(X_i)A(\hat{\beta }_n;Z_i)\,\partial _\beta \mu (\hat{\beta }_n;Z_i)^T.
\]
The functions \(h_{0}\) and \(h_{1}\) are estimated by

\[
\hat{h}_{0,i}=R_i w(X_i)A(\hat{\beta }_n;Z_i)\{\hat{\theta }_{n,i}-\mu (\hat{\beta }_n;Z_i)\}
\quad \text {and}\quad
\hat{h}_{1,i}=\frac{1}{n}\sum _{j=1}^n R_j w(X_j)A(\hat{\beta }_n;Z_j)\,\ddot{\phi }_n(t;X_j,X_i),
\]

where \(\ddot{\phi }_n\) denotes the second order influence function evaluated at the empirical functions \(F_n\),
and thus \(\varSigma \) by

\[
\hat{\varSigma }_n=\frac{1}{n}\sum _{i=1}^n (\hat{h}_{0,i}+\hat{h}_{1,i})(\hat{h}_{0,i}+\hat{h}_{1,i})^T. \qquad (7)
\]
The first and second derivatives of the Aalen–Johansen operator were presented in Overgaard et al (2018). The variance formula (7) is a modification of the variance formula for the complete cohort where we restrict to the case–cohort sample by the indicator \(R_i\) and introduce the weights \(w(X_i)\). The variance \(\mathrm {Var}\{h_0(\tilde{X},Z)\}\) can be estimated by the usual Huber–White robust variance estimate of the generalized linear estimating equation,

\[
\widehat{\mathrm {Var}}\{h_0(\tilde{X},Z)\}=\frac{1}{n}\sum _{i=1}^n R_i w(X_i)^2\{\hat{\theta }_{n,i}-\mu (\hat{\beta }_n;Z_i)\}^2 A(\hat{\beta }_n;Z_i)A(\hat{\beta }_n;Z_i)^T.
\]
Note that the Huber–White variance estimate does not require the first and second-order influence functions.
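For the linear special case considered above, the Huber–White estimate can be sketched as the usual sandwich with weighted scores (the helper name and interface are illustrative, not the paper's code):

```python
import numpy as np

def huber_white_variance(Z, pseudo, R, w, beta):
    """Sandwich (Huber-White) variance estimate for the weighted
    estimating equation with A(beta; Z) = Z and mu(beta; Z) = Z^T beta:
    M^{-1} * (sum of score outer products) * M^{-1},
    with score h_0i = R_i w_i Z_i (theta_i - Z_i^T beta)."""
    W = R * w
    resid = pseudo - Z @ beta
    M = Z.T @ (W[:, None] * Z)          # minus the derivative of the estimating function
    h0 = (W * resid)[:, None] * Z       # rows are the individual scores h_0i^T
    meat = h0.T @ h0                    # sum_i h_0i h_0i^T
    Minv = np.linalg.inv(M)
    return Minv @ meat @ Minv
```

For an intercept-only model with unit weights, this reduces to the familiar (uncorrected) variance of a sample mean.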

The proposed variance estimate based on \(\hat{M}_n\) and \(\hat{\varSigma }_n\) is a generalization of the cohort variance of Overgaard et al (2018) that takes the case–cohort sampling into account. The proposed variance estimate has a sandwich form similar to the variance of Barlow (1994) for the Cox case–cohort model. In an extensive simulation study by Petersen et al (2005), the Barlow variance estimate was shown to have the best small-sample properties for the Cox case–cohort model. The proposed variance estimate also has good small-sample properties, as shown in the simulation scenarios in Sect. 6. The asymptotic variance in Theorem 1 can be written in other forms. For the Cox case–cohort model, Self and Prentice (1988) decomposed the variance as a sum of the variance for a full cohort study plus a variance term due to the case–cohort sampling. A similar decomposition can in our setting be derived if the Huber–White variance \(\mathrm {Var}\{h_0(\tilde{X},Z)\}\) is written as

\[
\mathrm {Var}\{h_0(\tilde{X},Z)\}=\mathrm {Var}\{B(\beta _0;X,Z)\}+\mathrm {Var}[\{Rw(X)-1\}B(\beta _0;X,Z)], \qquad (8)
\]
where \(B(\beta _0; X,Z)= A(\beta _0; Z )\{F_1(t)+\dot{\phi }(t;X)-\mu (\beta _0;Z)\}\). The two terms in the variance expression are uncorrelated since \(E(Rw(X)-1|X,Z)=0\). Hence we may write the case–cohort variance as the sum of the full cohort variance plus a term due to the sampling of the case–cohort. The second term can be written as

\[
\mathrm {Var}[\{Rw(X)-1\}B(\beta _0;X,Z)]=\mathrm {E}[\{w(X)-1\}B(\beta _0;X,Z)^{\otimes 2}],
\]
where \(a^{\otimes 2}=aa^T\). When \(q=1\), corresponding to the standard case–cohort design where all 1-event subjects are sampled, the second term is

\[
\frac{1-p}{p}\,\mathrm {E}\{1(\tilde{\varDelta }\ne 1)B(\beta _0;X,Z)^{\otimes 2}\}.
\]
A similar form was derived for the additive hazard regression model in Kulich and Lin (2000) and for the proportional hazard regression model in Kulich and Lin (2004). Note that as *q* approaches 0 the sum of the two terms approaches the variance of a simple random sample of fraction *p* of the complete cohort, \(\frac{1}{p}\mathrm {Var}[B(\beta _0; X,Z)]\).

The Huber–White variance can be further decomposed into a sum of three terms,

\[
\mathrm {Var}\{h_0(\tilde{X},Z)\}=\mathrm {Var}\{B(\beta _0;X,Z)\}+\mathrm {Var}[\{R_1w_1(X)-1\}B(\beta _0;X,Z)]+\mathrm {Var}[\{Rw(X)-R_1w_1(X)\}B(\beta _0;X,Z)],
\]
where \(R_1=\xi 1(\tilde{\varDelta } \ne 1) + 1(\tilde{\varDelta } = 1)\), \(w_1(X)=P(R_1=1|X,Z)^{-1}=\frac{1}{p}1(\tilde{\varDelta }\ne 1)+1(\tilde{\varDelta }=1)\) and \(\frac{w(X)}{w_1(X)}=P(R=1|R_1=1,X,Z)^{-1}=1(\tilde{\varDelta }\ne 1)+(p+(1-p)q)^{-1}1(\tilde{\varDelta }=1)\). The three parts are again uncorrelated, e.g. the correlation between the first two parts is zero since \(E(R_1w_1(X)-1|X,Z)=0\) and the correlations with the last term are zero since \(E(R w(X)/w_1(X)-1|R_1=1,X,Z)=0\). The first term is the full cohort variance, the second term is the added variance in a case–cohort study where all 1-events are sampled, and the last term is the added variance that corresponds to not sampling all 1-events. Here the second term can be written

\[
\frac{1-p}{p}\,\mathrm {E}\{1(\tilde{\varDelta }\ne 1)B(\beta _0;X,Z)^{\otimes 2}\}.
\]
The third term can be written

\[
\frac{1-(p+(1-p)q)}{p+(1-p)q}\,\mathrm {E}\{1(\tilde{\varDelta }=1)B(\beta _0;X,Z)^{\otimes 2}\}.
\]
Note the second and third terms can also be identified from the decomposition in (8). Note also that as *q* approaches 0 the third term tends to

\[
\frac{1-p}{p}\,\mathrm {E}\{1(\tilde{\varDelta }=1)B(\beta _0;X,Z)^{\otimes 2}\},
\]
so that the sum of the three terms again approaches the variance of a simple random sample of fraction *p* of the complete cohort. A decomposition into three terms was also presented in Kang and Cai (2009), however, based on other weights. We use the inverse probability of sampling, i.e., for 1-event subjects the weight

\[
Rw(X)=\frac{\xi +(1-\xi )\eta }{p+(1-p)q},
\]
with conditional variance

\[
\mathrm {Var}\{Rw(X)\,|\,\tilde{\varDelta }=1\}=\frac{1-(p+(1-p)q)}{p+(1-p)q},
\]
whereas Kang and Cai (2009) use

\[
\xi +(1-\xi )\frac{\eta }{q}
\]
with conditional variance

\[
\mathrm {Var}\Big \{\xi +(1-\xi )\frac{\eta }{q}\,\Big |\,\tilde{\varDelta }=1\Big \}=(1-p)\frac{1-q}{q}.
\]
The two conditional variances enter as factors in the third terms of the respective asymptotic variance decompositions. The factor \(\frac{1-(p+(1-p)q)}{p+(1-p)q}\) can be substantially lower than \((1-p)\frac{1-q}{q}\) when *q* is low or moderate.
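A quick numerical comparison of the two variance-inflation factors, with values of \(p\) and \(q\) chosen purely for illustration:

```python
import numpy as np

p, q = 0.10, np.array([0.25, 0.50, 0.90])   # illustrative sampling probabilities
alpha1 = p + (1 - p) * q                    # sampling fraction for 1-events
ours = (1 - alpha1) / alpha1                # factor with inverse-probability-of-sampling weights
kang_cai = (1 - p) * (1 - q) / q            # factor with the Kang and Cai (2009) weights
print(np.round(ours, 3))
print(np.round(kang_cai, 3))
```

Since \(p+(1-p)q>q\) for \(p>0\), the inverse-probability-of-sampling factor is always the smaller of the two, with the largest gap at low \(q\).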

## Example: diet, genes, and the risk of atrial fibrillation

The Danish Diet, Cancer, and Health Cohort was established to investigate the effect of diet on the risk of various diseases, with an initial focus on cancer. The cohort contains data from 57,053 participants born in Denmark who were living in the urban areas of Copenhagen and Aarhus when they were enrolled in the study (Tjønneland et al 2007). Participants were enrolled from December 1993 to May 1997 and were 50–65 years of age upon enrollment. At baseline, the participants filled in a detailed semi-quantitative food frequency questionnaire with 192 items, including 24 questions regarding intake of fish and food products containing fish. Here, we focus on the intake of marine omega-3 fatty acids. In the present study, participants with heart disease before entry or with missing covariates were excluded, which left 54,737 participants for analysis.

Atrial fibrillation is a common cardiac arrhythmia characterized by rapid and irregular beating of the heart. The condition is associated with an increased risk of heart failure, stroke, and dementia. The severity of atrial fibrillation varies, and the condition is not always diagnosed. The threshold for clinical diagnosis of atrial fibrillation may depend on several factors including the severity of the condition and prevailing guidelines at the hospital of admission. The events of interest here were all clinical diagnoses of atrial fibrillation given at a Danish hospital. The diagnoses were obtained from the Danish Hospital Register, which contains all diagnoses given at a hospital. We used data with follow-up until 31 December 2009, with 2418 diagnoses among the 54,737 participants. Information on deaths was collected from the Danish Civil Registration System. Both the Hospital Register and the Civil Registration System collect prospective information on every Dane. We were interested in the risk of atrial fibrillation between 50 and 75 years of age, estimated to be 8.2% (95% CI: 7.8–8.6%) in the complete cohort. Choosing age as the underlying time scale, the follow-up varied substantially between participants. The confidence intervals presented here are based on the Huber–White variance estimate.

It is expected that a high intake of marine omega-3 fatty acids is associated with a low risk of atrial fibrillation. We here define a high intake of marine omega-3 fatty acids to be an intake exceeding the 75th percentile in the study cohort. The estimated risk reduction (risk difference) associated with a high intake of marine omega-3 fatty acids was 1.4% (95% CI: 0.5–2.2%) based on the complete cohort.

In addition to the direct dietary intake, marine omega-3 fatty acids are also synthesized endogenously, which might affect the risk of atrial fibrillation. The genetic marker SNP rs174546 has been shown to be associated with the endogenous conversion of omega-3 fatty acids and may affect atrial fibrillation. The genotype has three variants: CC, which has a high conversion; CT, which has a medium-level conversion; and TT, which has a low conversion, the T allele having been associated with reduced conversion compared with the C allele. Since genetic information would be costly to ascertain for the full cohort, a case–cohort study was set up to analyze the effect of the SNP rs174546 marker.

The cohort comprised 2418 1-event subjects and 52,319 1-event-free subjects. A random sub-cohort sample was drawn, and genetic information was available for 4559 (8%) of these subjects: 229 1-event subjects and 4330 1-event-free subjects. The cohort contains 2189 additional 1-events, of which 1963 have genetic information, so that genetic information is available for approximately 90% of 1-event subjects outside the sub-cohort. The case–cohort sampling leaves 2192 1-event subjects and 4330 1-event-free subjects for analysis. We use the sampling fractions \(\hat{\alpha }_1=2192/2418=91\)% and \(\hat{\alpha }_0=4330/52,319=8\)% for 1-event subjects and 1-event-free subjects, respectively, in the analysis. We will treat these sampling fractions as fixed in the analyses. Based on the data from this case–cohort sample, the estimated risk reduction associated with a high intake of marine omega-3 fatty acids was 1.5% (95% CI: 0.4–2.7%), which is close to the result based on the full cohort, albeit with a slightly wider confidence interval.

We next considered whether the genetic marker had any association with the risk of atrial fibrillation. The estimated risks of atrial fibrillation were, for CC: 8.4% (95% CI: 7.7–9.1%); CT: 8.3% (95% CI: 7.6–9.0%); and TT: 7.4% (95% CI: 6.2–8.8%), corresponding to risk differences \(\text {RD}_{\text {CT}}=-0.1\)% (95% CI: −1.1;0.8%) and \(\text {RD}_{\text {TT}}=-0.9\)% (95% CI: −2.4;0.5%) compared to the CC genotype. The risk differences remained unchanged when adjusting for the intake of marine omega-3 fatty acids and age at enrollment: adjusted risk differences \(\text {aRD}_{\text {CT}}= -0.1\)% (95% CI: −1.1;0.8%) and \(\text {aRD}_{\text {TT}}= -0.9\)% (95% CI: −2.4;0.5%). In comparison, applying the Prentice case–cohort analysis based on a proportional hazards model yields the following adjusted hazard ratios: \(\text {aHR}_{\text {CT}}=0.95\) (95% CI: 0.85;1.06) and \(\text {aHR}_{\text {TT}}=0.90\) (95% CI: 0.76;1.08). The results obtained using pseudo-observations and expressed as risk differences are in agreement with the Prentice case–cohort analysis. However, the two methods express the associations differently, and the analytic methods are based on different model assumptions.

A detailed analysis of several genetic markers, different fatty acids and their possible biological interaction will be published elsewhere.

## Simulation

We will evaluate the proposed estimator in a set-up that was used in Overgaard et al (2017, 2018) for evaluating the use of pseudo-observations in cohort analyses. Consider a dichotomous exposure covariate \(Z\in \{0,1\}\) with a 50% frequency of exposed and non-exposed subjects and a linear model for the event of interest,

\[
F_{Z,1}(s)=P(T\le s,\varDelta =1\,|\,Z)=(\beta _0+\beta _1 Z)\cdot s,
\]
and \(F_{Z,2}(s)=\eta \cdot s\) for the competing event. The parameter \(\beta _0\) is the cumulative risk of 1-events among unexposed subjects (\(Z=0\)) and \(\beta _1\) is the 1-event risk difference between exposed (\(Z=1\)) and unexposed subjects. Consider a censoring time \(C\in [0,1]\) that is completely independent of events and covariate data. The distribution of *C* is specified by setting \(p_c=P(C<1)\) and letting the censoring time *C* follow a uniform distribution on the interval [0, 1[ given \(C<1\). We choose \(p_c\) such that the observed fraction of censored data before time 1, \(p_{\text {oc}}=P(C<T\wedge 1)\), is equal to specified values that are set below. Under the assumption of uniform distributions of the censoring and event times, we can find \(p_c=P(C<1)\) using the equation

\[
p_{\text {oc}}=p_c\Big \{1-\tfrac{1}{2}\big (\beta _0+\beta _1/2+\eta \big )\Big \}.
\]
In the simulation tables, we present the censoring fraction \(p_{\text {oc}}\). The case–cohort sample consists of all subjects with an observed event of interest together with a random sample of the 1-event-free subjects drawn with sampling fraction \(\alpha _0\). We consider four scenarios for specific values of \(\beta _0\), \(\beta _1\), \(\eta \), \(p_{\text {oc}}\), \(\alpha _0\) and \(\alpha _1\) to illustrate (1) small-sample properties, (2) comparison of the proposed analysis with the conventional Cox regression analysis for case–cohort data, (3) comparison of the Huber–White robust variance estimate with the U-statistics variance estimate, and (4) comparison of the case–cohort design with a random sample of the same size.

*Scenario 1*. First, we evaluate the small-sample properties of the proposed estimator. We assume that the unexposed risk of a 1-event is \(\beta _0=0.20\), the risk difference between exposed and unexposed is \(\beta _1=0.05\), the competing-event risk is \(\eta =0.20\), and we consider the case with 20% observed censored outcomes, \(p_{\text {oc}}=0.20\). We then vary the number of observations in the complete cohort, \(n=1000\), 5000, 10,000, and the probability of being selected into the sub-cohort, \(\alpha _0=0.10\), 0.20, and 0.30. The probability of an observed 1-event in the full cohort is \(P(\tilde{\varDelta }=1)=(\beta _0+\beta _1/2)(1-p_c/2)\approx 0.196\).

So, for a cohort size of \(n=1000\) and \(\alpha _0=0.10\), there will on average be 196 1-events (109 in the exposed group and 87 in the unexposed group) and 80 1-event-free subjects in the case–cohort sample. Table 1 presents the coverage of 95% confidence intervals based on 10,000 replications using the Huber–White robust variance estimate (Cov\(_{\text {HW}}\)) and the U-statistics variance estimate (Cov\(_{U}\)) derived from (7), for both the full cohort and the case–cohort sample. Each line of the table is the result of independent simulations. Both variance estimates yield acceptable coverage of the confidence intervals for all considered sample sizes and sampling fractions.
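The estimation pipeline evaluated in this scenario, an Aalen–Johansen estimate of the cumulative incidence at time 1, jackknife pseudo-observations, and inverse-probability-weighted least squares on the case–cohort sample, can be sketched as follows. This is our own minimal illustration under the Scenario 1 parameters: all function names are ours, and `p_c` is hard-coded to the value implied by \(p_{\text {oc}}=0.20\):

```python
import numpy as np

rng = np.random.default_rng(7)

def cif1(time, cause, t):
    # Aalen-Johansen estimate of P(T <= t, cause = 1);
    # cause: 0 = censored, 1 = event of interest, 2 = competing event.
    order = np.argsort(time, kind="stable")
    time, cause = time[order], cause[order]
    at_risk, surv, F1 = len(time), 1.0, 0.0
    for ti, di in zip(time, cause):
        if ti > t:
            break
        if di == 1:
            F1 += surv / at_risk          # increment of the cause-1 CIF
        if di > 0:
            surv *= 1.0 - 1.0 / at_risk   # all-cause Kaplan-Meier factor
        at_risk -= 1
    return F1

def pseudo_obs(time, cause, t):
    # Jackknife pseudo-observations: n*theta_hat - (n-1)*theta_hat_(-i).
    # O(n^2), which is fine for this illustration.
    n = len(time)
    full = cif1(time, cause, t)
    return np.array([n * full - (n - 1) * cif1(np.delete(time, i),
                                               np.delete(cause, i), t)
                     for i in range(n)])

# Simulate one cohort (beta0 = 0.20, beta1 = 0.05, eta = 0.20, p_c ~ 0.254).
n, beta0, beta1, eta, p_c = 1000, 0.20, 0.05, 0.20, 0.254
Z = rng.integers(0, 2, n)
u = rng.random(n)
cause = np.where(u < beta0 + beta1 * Z, 1,
                 np.where(u < beta0 + beta1 * Z + eta, 2, 0))
T = np.where(cause > 0, rng.random(n), np.inf)
C = np.where(rng.random(n) < p_c, rng.random(n), 1.0)
obs_time = np.minimum(T, C)
obs_cause = np.where(T <= C, cause, 0)

# Pseudo-observations on the full cohort (event data are known for everyone).
po = pseudo_obs(obs_time, obs_cause, t=1.0)

# Case-cohort sample: all 1-events plus a fraction alpha0 of the rest,
# analyzed by inverse-probability-weighted least squares (identity-link GEE).
alpha0 = 0.30
case = obs_cause == 1
s = case | (rng.random(n) < alpha0)
w = np.where(case, 1.0, 1.0 / alpha0)[s]
X = np.column_stack([np.ones(s.sum()), Z[s]])
beta = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ po[s])
print(beta)  # [intercept, exposure effect]; near (beta0, beta1) up to sampling error
```

The weighted least-squares step solves the inverse-probability-weighted generalized estimating equation for the identity link with an independence working covariance, which is the proposed estimator in its simplest form.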

*Scenario 2*. Next, we compare the proposed estimator to the Prentice case–cohort analysis based on the Cox proportional hazards model for survival data. We apply the weights originally suggested by Prentice (1986) and the robust variance estimate suggested by Barlow (1994), Lin and Wei (1989), and Lin and Ying (1993). This combination of weights and variance estimation was recommended by Petersen et al (2003) in a large simulation study evaluating different weights and variance estimates. In Scenario 2, we assume that the unexposed subjects have a 1-event risk of \(\beta _0=0.01\), 0.02, 0.05, 0.10, 0.15, 0.20, with risk differences \(\beta _1=0.01\), 0.02, 0.05, 0.10, 0.15, 0.20, respectively, corresponding to a risk ratio of 2. When the risk of events is low, say below 10%, the simulation model corresponds approximately to a proportional hazards model on the time interval [0, 1]. For survival data with no competing risks, the hazard ratio will then be approximately equal to the risk ratio. We consider the case with 20% observed censored outcomes, \(p_{\text {oc}}=0.20\), a complete cohort of \(n=10,000\) observations, and a random sample of the cohort drawn with sampling fraction equal to the true event risk among the exposed, \(\alpha _0=\beta _0+\beta _1\). This sampling ensures that the case–cohort sample contains at least as many 1-event-free subjects as 1-events in both exposure groups. We derive the coverage of 95% confidence intervals based on 10,000 replications for the proposed estimate and the Prentice estimate in the case–cohort sample (Table 2). The proposed pseudo-observation approach performs just as well as the Prentice analysis when the risk is low, and the pseudo-observation approach remains unbiased when the risk is not low.

*Scenario 3*. We compare the Huber–White variance and the U-statistics variance for a fixed sample size of \(n=10,000\). We fix the baseline risk \(\beta _0=0.20\) and the competing-event risk \(\eta =0.20\), but vary the risk difference \(\beta _1=0.05\), 0.20, 0.50, the probability of observed censoring \(p_{\text {oc}}=0.10\), 0.30, 0.60, and the probability of selection into the sub-cohort \(\alpha _0=0.10\), 0.30, 0.50 (Table 3). The coverage of 95% confidence intervals falls outside the 94–96% range only when the risk difference is high, \(\beta _1=0.50\), and the proportion of observed censoring is large, \(p_{\text {oc}}=0.60\), and then only for the variance estimate in the cohort analysis. The Huber–White variance thus seems applicable in most case–cohort studies.
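The Huber–White (sandwich) variance for a weighted least-squares fit can be sketched as below. This is our own illustration on ordinary i.i.d. data, not the paper's implementation; in particular, the sandwich treats the pseudo-observations as independent, whereas the paper's U-statistics variance additionally accounts for the dependence the jackknife induces among them:

```python
import numpy as np

def sandwich_variance(X, y, w):
    # Huber-White (sandwich) variance for the weighted least-squares
    # estimate solving sum_i w_i x_i (y_i - x_i' beta) = 0.
    WX = X * w[:, None]
    A = X.T @ WX                  # "bread": minus the score derivative
    beta = np.linalg.solve(A, WX.T @ y)
    e = y - X @ beta              # residuals
    score = WX * e[:, None]       # per-subject estimating-function terms
    B = score.T @ score           # "meat": empirical score covariance
    Ainv = np.linalg.inv(A)
    return beta, Ainv @ B @ Ainv

# Toy check on unweighted, homoskedastic data with a binary covariate.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.integers(0, 2, 500)])
y = 0.2 + 0.05 * X[:, 1] + rng.normal(0, 0.4, 500)
beta, V = sandwich_variance(X, y, np.ones(500))
print(beta, np.sqrt(np.diag(V)))  # estimates and robust standard errors
```

With unit weights this reduces to the familiar HC0 robust covariance for ordinary least squares.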

*Scenario 4*. Finally, we consider the set-up from Scenario 2 in a complete cohort of size \(n=10,000\) with 20% observed censoring before time 1, \(p_{\text {oc}}=0.20\). Suppose that, due to costs, the exposure can be ascertained for only 25% of the cohort. We consider sampling fractions of 1-events \(\alpha _1=1\), 0.5, and 0.25. The latter setting corresponds to a random sample of the cohort, since 1-events and 1-event-free subjects are then both sampled with probability 25%. The sampling fraction among the 1-event-free subjects, \(\alpha _0\), can be found from the equation \(\alpha _1 P(\tilde{\varDelta }=1)+\alpha _0\{1-P(\tilde{\varDelta }=1)\}=0.25\).

Since the setting where \(\beta _0=0.20\) and \(\beta _1=0.40\) corresponds to an observed fraction of 1-events \(P(\tilde{\varDelta }=1)\) larger than 25%, this setting is not relevant here. The standard deviation (SD) of \(\log (RR)\) is shown in Table 4. The case–cohort design has a lower standard deviation than the random sample in all situations except one, in which the case–cohort study samples fewer non-events than the random sample.
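The budget constraint on the sampling fractions can be checked with a small helper (our own hypothetical function; `p_event` stands for \(P(\tilde{\varDelta }=1)\), the illustrative values below are not from Table 4):

```python
def alpha0_for_budget(p_event, alpha1, budget=0.25):
    """Sampling fraction among 1-event-free subjects so that the expected
    case-cohort sample is a fraction `budget` of the cohort, i.e.
    alpha1 * p_event + alpha0 * (1 - p_event) = budget."""
    a0 = (budget - alpha1 * p_event) / (1 - p_event)
    if not 0 <= a0 <= 1:
        raise ValueError("budget infeasible for this observed event fraction")
    return a0

print(alpha0_for_budget(0.10, alpha1=1.0))   # 0.1666..., i.e. (0.25 - 0.10) / 0.90
print(alpha0_for_budget(0.10, alpha1=0.25))  # 0.25, i.e. a simple random sample
```

The `ValueError` branch reflects the remark above: once the observed 1-event fraction exceeds the 25% budget, no valid \(\alpha_0\) exists for \(\alpha_1=1\).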

## Discussion

We have suggested analyzing cumulative risks for case–cohort sampled competing risks time-to-event data using pseudo-observations. The approach presented assumes that the time-to-event data, and thus the pseudo-observations, are available for the complete cohort, while the covariate data are complete only within the sub-cohort. The time-to-event data are often available in cohort studies, for example those based on register data. If the time-to-event data are available only for the cases and the sub-cohort, then one may modify the Aalen–Johansen estimator by up-weighting the at-risk sum, see e.g., Langholz and Jiao (2007); the resulting estimator admits an approximation similar to (3). The approach presented can also easily be generalized to joint inference for several competing event types \(k=1,\dots ,d\), where sampling fractions *p* and \(q_k\) (or \(\alpha _k\)) are used for the *k*th event type and inference concerns \(P(T\le t,\varDelta =k)=\mu (\beta _k;Z)\), \(k=1,\dots ,d\). For example, if interest is in the first \(d=2\) event types, we may specify a sub-cohort sampling fraction *p* and event-specific sampling fractions \(q_1\), \(q_2\), and model \((P(T\le t,\varDelta =1),P(T\le t,\varDelta =2)) =(\mu (\beta _1;Z),\mu (\beta _2;Z))\). The function \(\mu \) may depend on the event type. Marginal hazards models for several disease outcomes of interest using case–cohort data were studied by Kang and Cai (2009).

It should be acknowledged that the approach presented does not use the covariate information for individuals outside the sub-cohort who do not have an event of interest, as is done in likelihood-based methods. We assumed that the censoring is independent of covariates, but, following Binder et al (2014), this extra assumption may be avoided. Also, in simple situations where the censoring depends on one or two categorical covariates, the pseudo-observations can be computed within strata of these risk factors.

We presented the pseudo-observation approach in a case where the sampling depends only on the event status. The approach can easily be extended to situations where the sampling also depends on the fully observed covariates \(Z_{(1)}\). One example is stratified sampling, where the cohort is divided into strata and the sampling fraction may vary among strata, as discussed for the Cox model by Borgan and Samuelsen (2013). Other extensions of interest are multivariate analysis of all events and other association measures, such as differences in restricted means and in the number of life years lost due to a specific cause of death.

## Supplementary Materials

Web Appendix A with proofs of the theorems and Appendix B with example code for the Stata package.

## References

Andersen PK (2013) Decomposition of number of years lost according to causes of death. Stat Med 32:5278–5285

Andersen PK, Klein JP, Rosthoj S (2003) Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 90:15–27

Andersen PK, Hansen MG, Klein JP (2004) Regression analysis of restricted mean survival time based on pseudo-observations. Lifetime Data Anal 10:335–350

Barlow WE (1994) Robust variance estimation for case-cohort design. Biometrics 50:1064–1072

Binder N, Gerds TA, Andersen PK (2014) Pseudo-observations for competing risks with covariate dependent censoring. Lifetime Data Anal 20(2):303–315

Borgan O, Samuelsen SO (2013) Nested case-control and case-cohort studies. In: Klein JP, van Houwelingen HC, Ibrahim JG, Scheike TH (eds) Handbook of survival analysis. Chapman and Hall/CRC, Boca Raton, pp 343–367

Cai J, Zeng D (2007) Power calculation for case-cohort studies with nonrare events. Biometrics 63(4):1288–1295

Chen K (2001) Generalized case-cohort sampling. J R Stat Soc Ser B (Stat Methodol) 63(4):791–809

Jacobsen M, Martinussen T (2016) A note on the large sample properties of estimators based on generalized linear models for correlated pseudo-observations. Scand J Stat 43(3):845–862

Jewell NP, Lei X, Ghani AC, Donnelly CA, Leung GM, Ho LM, Cowling BJ, Hedley AJ (2007) Non-parametric estimation of the case fatality ratio with competing risks data: an application to severe acute respiratory syndrome (SARS). Stat Med 26(9):1982–1998

Josefsson A, Magnusson P, Ylitalo N, Sørensen P, Qwarforth-Tubbin P, Andersen P, Melbye M, Adami HO, Gyllensten U (2000) Viral load of human papilloma virus 16 as a determinant for development of cervical carcinoma in situ: a nested case-control study. Lancet 355(9222):2189–2193

Kalbfleisch J, Lawless J (1988) Likelihood analysis of multi-state models for disease incidence and mortality. Stat Med 7(1–2):149–160

Kang S, Cai J (2009) Marginal hazards model for case-cohort studies with multiple disease outcomes. Biometrika 96(4):887–901

Klein JP, Andersen PK (2005) Regression modeling of competing risks data based on pseudovalues of the cumulative incidence function. Biometrics 61(1):223–229

Klein JP, Andersen PK, Logan BL, Harhoff MG (2007) Analyzing survival curves at a fixed point in time. Stat Med 26:4505–4519

Kulich M, Lin D (2000) Additive hazards regression for case-cohort studies. Biometrika 87(1):73–87

Kulich M, Lin D (2004) Improving the efficiency of relative-risk estimation in case-cohort studies. J Am Stat Assoc 99(467):832–844

Langholz B, Jiao J (2007) Computational methods for case-cohort studies. Comput Stat Data Anal 51(8):3737–3748

Lin D (2000) On fitting Cox's proportional hazards models to survey data. Biometrika 87(1):37–47

Lin DY, Wei LJ (1989) The robust inference for the Cox proportional hazards model. J Am Stat Assoc 84(408):1074–1078

Lin DY, Ying Z (1993) Cox regression with incomplete covariate measurements. J Am Stat Assoc 88:1341–1349

Overgaard M, Parner ET, Pedersen J (2017) Asymptotic theory of generalized estimating equations based on jack-knife pseudo-observations. Ann Stat 45(5):1988–2015

Overgaard M, Parner ET, Pedersen J (2018) Estimating the variance in a pseudo-observation scheme with competing risks. Scand J Stat 45(4):923–940

Petersen L, Sørensen T, Andersen P (2003) Comparison of case-cohort estimators based on data on premature death of adult adoptees. Stat Med 22:3795–3803

Petersen L, Andersen P, Sørensen T (2005) Premature death of adult adoptees: analyses of a case-cohort sample. Biom J 47:815–824

Prentice RL (1986) A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73(1):1–11

Scheike T, Martinussen T (2004) Maximum likelihood estimation for Cox's regression model under case-cohort sampling. Scand J Stat 31:283–293

Scheike T, Zhang M, Gerds T (2008) Predicting cumulative incidence probability by direct binomial regression. Biometrika 95(1):205–220

Self SG, Prentice RL (1988) Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Stat 16(1):64–81

Thomas DC (1977) Addendum to: Methods of cohort analysis: appraisal by application to asbestos mining, by F. D. K. Liddell, J. C. McDonald and D. C. Thomas. J R Stat Soc Ser A (General) 140:469–491

Tjønneland A, Olsen A, Boll K, Stripp C, Christensen J, Overvad K (2007) Study design, exposure variables, and socioeconomic determinants of participation in diet, cancer and health: a population-based prospective cohort study of 57,053 men and women in Denmark. Scand J Public Health 35(4):432–41

Zhang H, Goldstein L (2003) Information and asymptotic efficiency of the case-cohort sampling design in Cox's regression model. J Multivar Anal 85(2):292–317

## Acknowledgements

We are grateful for constructive comments from the referees and an associate editor. The data from Danish Diet, Cancer, and Health Cohort was kindly made available by Lotte Maxild Mortensen. The work presented in this article is supported by Novo Nordisk Foundation grant NNF17OC0028276.



## About this article

### Cite this article

Parner, E.T., Andersen, P.K. & Overgaard, M. Cumulative risk regression in case–cohort studies using pseudo-observations. *Lifetime Data Anal* **26**, 639–658 (2020). https://doi.org/10.1007/s10985-020-09492-3


### Keywords

- Case–cohort study
- Competing risks
- Cumulative incidence
- Cumulative risk
- Pseudo-observations