# Second-Order Estimating Equations for Clustered Current Status Data from Family Studies Using Response-Dependent Sampling


## Abstract

Studies about the genetic basis for disease are routinely conducted through family studies under response-dependent sampling in which affected individuals called probands are sampled from a disease registry, and their respective family members (non-probands) are recruited for study. The extent to which the dependence in some feature of the disease process (e.g., presence, age of onset, severity) varies according to the kinship of individuals reflects the evidence of a genetic cause for disease. When the probands are selected from a disease registry, it is common for them to provide quite detailed information regarding their disease history, but non-probands often simply provide their disease status at the time of contact. We develop conditional second-order estimating equations for studying the nature and extent of within-family dependence; these equations recognize both the biased sampling scheme employed in family studies and the current status data provided by the non-probands. Simulation studies are carried out to evaluate the finite sample performance of different estimating functions and to quantify the empirical relative efficiency of the various methods. Sensitivity to model misspecification is also explored. An application to a motivating psoriatic arthritis family study is given for illustration.

## Keywords

Current status data, Family study, Gaussian copula, Relative efficiency, Response-dependent sampling, Robustness, Second-order estimating equations

## 1 Introduction

### 1.1 Introduction

The heritable nature of disease can be inferred from the structure of the within-family dependence in disease manifestation [19]. For rare diseases, population-based cohort studies are inefficient and impractical, so response-dependent biased family designs are routinely employed to obtain enriched samples with higher representation of diseased individuals and more variation in genetic markers than would be seen in the unselected population. Much work has been carried out for the analysis of such data when the disease status is modeled as a binary trait. Conditional likelihood based on generalized linear mixed models can be used for dealing with the dependence of binary phenotypes within families [3, 4], or estimating equations can be formed by specifying marginal mean and dependence structures for the analysis of binary phenotypes from case–control family studies [32].

The age of onset for many chronic diseases is highly variable, however, and simply using the binary trait of the disease status does not account for the variable times individuals have been at risk for disease in family studies. MacLean et al. [21] and Shih and Chatterjee [27] pointed out that, if information on the age of onset and the effect of censoring are not addressed, the estimators of the covariate effects on the disease process may be less efficient and the degree of familial aggregation may be underestimated. Models which consider the disease onset time distribution and measure dependence in terms of these times offer a preferable framework for analysis.

When interest lies in examining genetic association or gene-environment interaction, case–control or case-only family studies are commonly used. Li et al. [20], Hsu et al. [15], and Shih and Chatterjee [27] proposed likelihood methods based on disease onset time for case–control family studies, and Chatterjee et al. [6] proposed methods to estimate the relative risk, cumulative risk, and residual familial aggregation for case–control family data, along with a modified method for case-only family data. In their methods, modeling and estimation of the residual familial aggregation is key to adjustment for ascertainment bias, but this is done using an exchangeable dependence structure in which the association is the same for different pairs of relatives. Gorfine et al. [14] used frailty models to account for heterogeneity in familial risk, but pointed out that frailty-based methods may be affected by the uncertainty in the frailty parameter estimate.

In this article, we consider a simple family study, where an affected individual called a *proband* is selected from a registry of patients. Consenting family members (*non-probands*) of each proband are then recruited and examined to collect information on their disease status [5]. Probands are given a special designation because their disease status led to the selection of their family. Under such a sampling scheme, one obtains a right-truncated onset time for the proband and current status (type I interval-censored) data for non-probands [28]. While work has been done on the analysis of multivariate current status data [8, 16], little has been done to our knowledge in the context of biased sampling schemes.

Insights into the genetic basis of the disease can be gained by comparing the strength of the association in disease status between pairs of family members with different kinships [7, 18]. More elaborate dependence modeling also plays a central role when studying the “parent-of-origin” hypothesis, where the primary goal is to estimate and compare the strength of father–child and mother–child associations in phenotype to elucidate the role of the sex chromosomes in disease transmission [30]. With this in mind, we consider copula models [22] as a basis for modeling the joint risk of disease among family members. The dependence parameters can be interpreted as reflecting “residual familial aggregation” that is not explained by covariates in the marginal models. Copula models have several advantages over frailty models. First, the marginal models retain a simple interpretation under copula models, which is not the case under frailty models. Second, copula models yield dependence measures which are functionally independent of the parameters in the marginal onset time distribution, so the marginal distribution can be specified in any desirable way. Third, the dependence measure is specified directly under the copula model, so it has a clear meaning and provides a natural basis for regression on genetic effects; frailty models, by contrast, do not provide simple measures of within-family dependence, and the dependence they induce is difficult to interpret.

Analyses must address the biased sampling scheme employed in these studies. Likelihood contributions from each family which are proportional to joint probability functions for the phenotypes of non-probands *conditional* on the disease status of the proband will admit valid inference [29] under correct model specification, but enumeration of all possible sample outcomes can be computationally demanding with large families. We develop a class of conditional second-order estimating equations in the spirit of Prentice [25]. We use the term *conditional* to reflect the fact that moments in the second-order estimating equation are all conditional on the disease onset time of the proband. A supplementary estimating equation is incorporated to extract limited information about the marginal onset time distribution from the proband.

### 1.2 The University of Toronto Psoriatic Arthritis Family Study

The incidence of psoriatic arthritis (PsA) is reported to be between 0.3 and 1.0% [9] and hereditary factors are thought to be important, as some studies have suggested that close blood relatives of individuals affected by psoriatic arthritis are at higher risk of developing the disease compared to the general population. Characterizing the nature of the within-family association and identifying important genetic risk factors are important for understanding the disease etiology. Particular interest lies in assessing whether there is a higher rate of paternal, rather than maternal, transmission of the disease, which is also called a “parent-of-origin” effect [2]. A family study of psoriatic arthritis is being conducted at the Centre for Prognosis Studies in the Rheumatic Diseases at the University of Toronto. Probands were selected from the members of the University of Toronto Psoriatic Arthritis Registry, and their family members were recruited into the family study with their consent. A total of 169 two-generation families ranging in size from 2 to 7 individuals were recruited; 54 families included only one non-proband and 115 had more than one non-proband. The disease onset times were available only for probands; for other family members only the disease status at the time of examination is available, yielding current status data. In total 538 individuals are in the family study and only 194 (169 probands and 25 non-probands) were diagnosed with PsA. In addition to the demographic data, information on some HLA markers is also available for individuals in the PsA family study. We focus on identifying the significant HLA markers for psoriatic arthritis, characterizing the within-family association structure, and testing whether there is a “parent-of-origin” effect for psoriatic arthritis.

The remainder of this paper is organized as follows. In Sect. 2 we define notation and formulate the conditional second-order estimating equation for family data under response-dependent sampling, which are a combination of right-truncated onset time from probands and current status data from non-probands. We consider an illustrative example in which the dependence structure is governed by a Gaussian copula and work with this model in subsequent calculations and simulations where we examine specific estimating equations involving different derivative matrices and working independence assumptions. In Sect. 3, we explore the asymptotic relative efficiencies and finite sample properties of estimators from several variants of the estimating equations introduced in Sect. 2; these results also permit sample size calculations for planning studies aiming to detect effects of genetic markers. The impact of misspecification of the dependence structure on properties of estimators and power of genetic tests is investigated in Sect. 4. An application to the motivating psoriatic arthritis family study is given in Sect. 5 in which we assess the genetic basis of the disease. Concluding remarks are given in Sect. 6.

## 2 Conditional Estimating Equations Under Biased Sampling

### 2.1 Notation, Sampling, and Observation Scheme for Family Studies

We consider the setting in which a registry of *M* individuals is created by selecting a random sample from a population, screening each individual for disease, and recruiting those found to have the condition of interest [10]. If \(C_{i0}\) denotes the age of individual 0 in family *i* at the time of sampling and screening, and \(T_{i0}\) denotes their age of disease onset, then this individual is recruited to the registry if \(Y_{i0}=I(T_{i0} \le C_{i0})=1\); we assume that \(T_{i0}\) is verifiable by a review of medical records for individuals recruited to the registry. When a family study is carried out, we assume that probands are selected from the disease registry by simple random sampling and without loss of generality we label the families of selected probands \(i=1,\ldots , m\).

Let \(T_{ij}\) and \(X_{ij}\) denote the age of disease onset and the covariate vector, respectively, for individual *j* in family *i*, where \(j= 1, \ldots , n_i\) are the labels for the non-probands. Then if \(T_{i}=(T_{i0}, T_{i1}, \ldots , T_{in_{i}})'\) and \(X_{i}=(X'_{i0}, \ldots , X'_{in_i})'\), we write the joint cumulative distribution function (j.c.d.f.) for family *i* as \(F_i(t)= P(T_{i0} \le t_0, \ldots , T_{in_i} \le t_{n_i}|X_i)\). We assume \(T_{ij} \perp X^{(-j)}_{i}|X_{ij}\), where \(X_{i}^{(-j)} = \{X_{ij'}; j'\ne j, 0 \le j' \le n_i\}\), and write \(F_{ij}(t;\theta ) = P(T_{ij} \le t|X_{ij}; \theta )\). The marginal hazard function for the disease onset time of individual *j*, \(j=0, 1, \ldots , n_i\), in family *i* is that determined by \(F_{ij}(t;\theta )\), namely \(\{\partial F_{ij}(t;\theta )/\partial t\}/\{1-F_{ij}(t;\theta )\}\).

Classification of non-probands with respect to their disease status is made at the time of recruitment and clinical examination, yielding current status data. Let \(C_{ij}\) denote the age of non-proband *j* in family *i* at the time of assessment and let \(Y_{ij} = I(T_{ij} \le C_{ij})\); we let \(\bar{C}_i = (C_{i1},\ldots , C_{in_i})'\), \(\bar{Y}_i = (Y_{i1}, \ldots , Y_{in_i})'\) and \(\bar{X}_i = (X'_{i1}, \ldots , X'_{in_i})'\). If \(Y_i = (Y_{i0}, \bar{Y}'_i)'\), \(C_i = (C_{i0}, \bar{C}'_i)'\), and \(X_i = (X'_{i0}, \bar{X}'_i)'\), the family data therefore consist of \(\{T_{i0}, Y_i, C_i, X_i \}\) subject to \(Y_{i0}=1\).
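The observation scheme above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the class name `FamilyRecord`, the function `observe_family`, and the convention that index 0 holds the proband are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FamilyRecord:
    """Observed data {T_i0, Y_i, C_i, X_i} for one family, given Y_i0 = 1."""
    t0: float              # proband's verified onset age T_i0
    c0: float              # proband's age at sampling C_i0
    c: List[float]         # non-probands' ages at assessment C_i1..C_in
    y: List[int]           # current status Y_ij = I(T_ij <= C_ij), j >= 1
    x: List[List[float]]   # covariate vectors, proband first


def observe_family(onset, ages, covariates) -> Optional[FamilyRecord]:
    """Reduce latent onset ages to what the design reveals.

    Index 0 is the proband. Returns None when the selection
    condition T_i0 <= C_i0 fails, i.e. the family is never sampled.
    """
    if onset[0] > ages[0]:
        return None
    # Non-probands contribute only an indicator of onset by the age of contact.
    y = [int(t <= c) for t, c in zip(onset[1:], ages[1:])]
    return FamilyRecord(onset[0], ages[0], list(ages[1:]), y,
                        [list(v) for v in covariates])
```

For instance, `observe_family([40.0, 70.0, 25.0], [50.0, 62.0, 30.0], [[1.0], [0.0], [1.0]])` keeps the proband's onset age of 40 but records only the indicators for the two non-probands, whose latent onset ages are never observed.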

### 2.2 Second-Order Estimating Functions

The association parameter \(\gamma \) is of central importance here so we next formulate conditional second-order generalized estimating equations in the spirit of Prentice [25] and Zhao and Prentice [33].

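Let \(\bar{Z}_i\) denote the vector of second-order responses, with one element for each pair (*j*, *k*) in family *i*. To account for response-biased sampling, we define conditional moments and let \(\mu _{i} = E[\bar{Y}_{i}|T_{i0};\psi ]\) and \(\eta _i = E[\bar{Z}_i | T_{i0};\psi ]\) be the contributions from the non-probands and let \(\mu _{i0} = E[T_{i0}|Y_{i0} = 1;\theta ]\) for the proband, where we suppress the dependence on \(X_i\) and \(C_i\). The conditional second-order estimating equations (CGEE2), referred to as (1), are denoted by \(U(\psi ) = \sum _{i=1}^{m} U_i (\psi ) =0\), where each \(U_i(\psi )\) involves a derivative matrix \(G_i\) and a covariance matrix \(W_i\).

Simplified forms of the derivative matrix \(G_i\) (denoted G\(^{\mathrm{I}}\) and G\(^{\mathrm{II}}\)) and of the covariance matrix \(W_i\), the latter being a *working partial independence* (WPI) matrix, can be adopted. Combining these simplifications, we consider four different estimating functions based on (1):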

- A. Full \(G_i\) and full covariance matrix \(W_i\), denoted G–W,
- B. Full \(G_i\) and WPI \(W_i\), denoted G-WPI,
- C. G\(^{\mathrm{I}}\) and WPI \(W_i\), denoted G\(^{\mathrm{I}}\)-WPI, and
- D. G\(^{\mathrm{II}}\) and WPI \(W_i\), denoted G\(^{\mathrm{II}}\)-WPI.

### 2.3 An Illustrative Dependence Structure Based on a Gaussian Copula

In the second-order regression model for the pairwise association, \(v_{ijk}\) is an \(r\times 1\) vector of covariates describing individuals *j* and *k* in family *i* and their relationship, and \(\gamma \) is the corresponding \(r\times 1\) vector of coefficients. This second-order regression model can be helpful when investigating the effect of risk factors on the pairwise association, as \(v_{ijk}\) could represent family-level or individual-level features, or information on the kinship of individuals *j* and *k* in family *i*; inference on their effects can be easily carried out based on \(\gamma \). For example, in the PsA family study with two generations, when the “parent-of-origin” hypothesis is of interest, we can formulate the second-order model with separate coefficients for different types of relative pairs.
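One parameterization consistent with the Fisher-transformation link of Sect. 3.1 and with the \(\gamma _1, \gamma _2, \gamma _3\) and \(\tau _{ss}, \tau _{fc}, \tau _{mc}\) estimates reported in Table 3 is sketched below. The indicator notation \(v^{ss}_{ijk}\), \(v^{fc}_{ijk}\), \(v^{mc}_{ijk}\) (sibling–sibling, father–child, and mother–child pairs) is an assumption introduced here for illustration and may differ from the authors' own display.

\[
\log \left( \frac{1+\tau _{ijk}}{1-\tau _{ijk}} \right) = \gamma _1 v^{ss}_{ijk} + \gamma _2 v^{fc}_{ijk} + \gamma _3 v^{mc}_{ijk}
\]

Under this form, each coefficient maps to one Kendall's \(\tau \) through \(\tau = (e^{\gamma }-1)/(e^{\gamma }+1)\), and the parent-of-origin hypothesis corresponds to \(H_0: \gamma _2 - \gamma _3 = 0\), the contrast tested in Sect. 5.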

## 3 Relative Efficiency Under Particular Estimating Equations

### 3.1 A Study of Asymptotic Relative Efficiency

Consider two-generation families composed of two parents and two children, \(n_i = 3\). The proband is randomly selected from the four family members and is indexed by \(j=0\). A Weibull distribution is adopted for the onset time for all family members, with survivor function \(\mathcal{F}(t_{ij}|X_{ij}; \theta ) = \exp \big (-(\lambda t_{ij})^\kappa \exp (X_{ij}\beta )\big )\), where \(X_{ij}\) is a binary variable with \(P(X_{ij} = 1) = 0.5\), \(j=0, 1, 2, 3\), and we assume that \(X_{ij} \perp X_{ik}\), \(j \ne k\); \(\theta =(\lambda , \kappa , \beta )'\). Let \(\kappa = 1.2\), \(\beta = \log 1.2\), and choose \(\lambda \) to give a median age of 45 years for disease onset for the group with \(X_{ij}=0\). The clinic entry time for the proband \(C_{i0}\) is normally distributed with mean 50 and variance 20, and families are recruited into the study only if their probands satisfy the selection condition \(T_{i0} \le C_{i0}\). For non-proband *j* in the selected family *i*, let \(C_{ij}\) be the random age of contact, following \(N(\mu =60, \sigma ^2 = 10)\) for individuals in the first generation and \(N(\mu =30, \sigma ^2 = 10)\) for individuals in the second generation, \(j=1, 2, 3\); the ages at contact for individuals in both generations are truncated at 90 years. We consider a Gaussian copula to induce an exchangeable within-family association for simplicity here, and let Kendall’s \(\tau \) vary from 0 to 0.5 to reflect independence to strong within-family association. The second-order model with a Fisher transformation link function is simply \(\log \big ((1+\tau _{ijk})/(1-\tau _{ijk})\big ) = \gamma _0\), \(0\le j < k \le 3\). The asymptotic variances of estimators based on conditional estimating equations in (2) are approximated by Monte Carlo simulation based on 20,000 samples.
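The data-generating mechanism just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' simulation code: it uses the standard Gaussian-copula identity \(\tau = (2/\pi )\arcsin \rho \), builds the exchangeable correlation from a one-factor representation (valid for \(\rho \ge 0\)), draws only the latent onset ages and the proband's entry age, and omits the generation-specific contact ages of non-probands. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

KAPPA = 1.2
BETA = math.log(1.2)
# Solve exp(-(lam * 45)^kappa) = 0.5 so the X = 0 group has median onset 45.
LAM = math.log(2.0) ** (1.0 / KAPPA) / 45.0


def tau_to_rho(tau):
    """Gaussian copula: Kendall's tau = (2 / pi) * arcsin(rho)."""
    return math.sin(math.pi * tau / 2.0)


def weibull_onset(u, x):
    """Invert S(t | x) = exp(-(lam * t)^kappa * exp(x * beta)) at S = u."""
    return (-math.log(u) / math.exp(x * BETA)) ** (1.0 / KAPPA) / LAM


def draw_family(tau, rng):
    """Draw latent onset ages for four members; resample until the
    proband (index 0) satisfies the selection condition T_0 <= C_0."""
    rho = tau_to_rho(tau)
    std_normal = NormalDist()
    while True:
        shared = rng.gauss(0.0, 1.0)  # common factor inducing exchangeability
        x = [rng.randint(0, 1) for _ in range(4)]
        z = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(4)]
        # The Gaussian copula is radially symmetric, so feeding Phi(z) into
        # the inverse survivor function yields the same dependence structure.
        u = [std_normal.cdf(v) for v in z]
        t = [weibull_onset(ui, xi) for ui, xi in zip(u, x)]
        c0 = rng.gauss(50.0, math.sqrt(20.0))  # variance 20
        if t[0] <= c0:
            return t, x, c0
```

A call such as `draw_family(0.4, random.Random(1))` returns the onset ages, covariates, and proband entry age for one selected family; rejected draws mimic families that never enter the registry.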

### 3.2 Finite Sample Study of the Conditional Estimating Equations

Table 1 Empirical properties of estimates under conditional estimating equations for family data from response-dependent sampling, where within-family association is induced by a Gaussian copula with exchangeable structure; parametric margins and piecewise constant baseline hazards (3 pieces) are considered\(^\mathrm{a}\); \(n_i = 3\), \(nsim=1000\)

| | | Weibull margin | | | | | | | | Piecewise constant margin\(^\mathrm{a}\) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | \(\beta \) | | | | \(\gamma _0\) | | | | \(\beta \) | | | | \(\gamma _0\) | | | | |
| \(\tau \) | EE\(^\mathrm{b}\) | EBIAS | ESE | ASE | ECP% | EBIAS | ESE | ASE | ECP% | EBIAS | ESE | ASE | ECP% | EBIAS | ESE | ASE | ECP% | NC % |
| \(m=200\) | | | | | | | | | | | | | | | | | | |
| 0.0 | G–W | 0.002 | 0.113 | 0.115 | 95.5 | -0.002 | 0.067 | 0.065 | 94.1 | 0.002 | 0.113 | 0.115 | 95.7 | -0.000 | 0.070 | 0.067 | 93.4 | 0.2 |
| | G-WPI | 0.001 | 0.124 | 0.125 | 95.7 | 0.000 | 0.100 | 0.096 | 93.9 | 0.002 | 0.125 | 0.125 | 95.7 | 0.004 | 0.111 | 0.102 | 94.0 | 1.3 |
| | G\(^{\mathrm{I}}\)-WPI | 0.001 | 0.113 | 0.115 | 95.7 | -0.001 | 0.080 | 0.077 | 93.2 | 0.002 | 0.114 | 0.115 | 95.6 | 0.000 | 0.084 | 0.080 | 93.7 | 0.0 |
| | G\(^{\mathrm{II}}\)-WPI | 0.001 | 0.113 | 0.115 | 95.7 | -0.001 | 0.078 | 0.074 | 93.2 | 0.001 | 0.113 | 0.115 | 95.7 | 0.001 | 0.081 | 0.076 | 93.2 | 0.2 |
| 0.2 | G–W | -0.006 | 0.109 | 0.109 | 94.5 | 0.002 | 0.106 | 0.105 | 94.6 | -0.006 | 0.109 | 0.108 | 94.3 | 0.021 | 0.115 | 0.112 | 95.2 | 0.1 |
| | G-WPI | -0.010 | 0.120 | 0.119 | 95.1 | 0.009 | 0.136 | 0.133 | 95.0 | -0.011 | 0.120 | 0.118 | 95.1 | 0.050 | 0.163 | 0.151 | 93.8 | 2.2 |
| | G\(^{\mathrm{I}}\)-WPI | -0.006 | 0.109 | 0.109 | 94.2 | 0.005 | 0.118 | 0.117 | 94.8 | -0.007 | 0.109 | 0.108 | 94.2 | 0.032 | 0.132 | 0.126 | 95.0 | 0.0 |
| | G\(^{\mathrm{II}}\)-WPI | -0.007 | 0.109 | 0.109 | 94.3 | 0.006 | 0.119 | 0.119 | 95.4 | -0.007 | 0.108 | 0.108 | 94.1 | 0.031 | 0.133 | 0.128 | 95.8 | 0.1 |
| 0.4 | G–W | 0.001 | 0.092 | 0.094 | 96.0 | 0.008 | 0.152 | 0.154 | 95.6 | -0.003 | 0.090 | 0.092 | 96.3 | 0.065 | 0.174 | 0.164 | 94.3 | 0.1 |
| | G-WPI | 0.001 | 0.103 | 0.104 | 96.4 | 0.015 | 0.173 | 0.175 | 95.7 | -0.005 | 0.101 | 0.101 | 95.8 | 0.100 | 0.208 | 0.197 | 93.3 | 1.6 |
| | G\(^{\mathrm{I}}\)-WPI | 0.001 | 0.093 | 0.094 | 96.0 | 0.013 | 0.170 | 0.173 | 96.6 | -0.004 | 0.091 | 0.092 | 95.6 | 0.083 | 0.195 | 0.189 | 96.6 | 0.0 |
| | G\(^{\mathrm{II}}\)-WPI | 0.000 | 0.093 | 0.094 | 96.1 | 0.016 | 0.183 | 0.186 | 96.0 | -0.004 | 0.091 | 0.092 | 95.7 | 0.091 | 0.217 | 0.211 | 98.4 | 2.0 |
| \(m=1000\) | | | | | | | | | | | | | | | | | | |
| 0.0 | G–W | -0.001 | 0.051 | 0.052 | 95.6 | 0.001 | 0.030 | 0.029 | 94.7 | -0.002 | 0.049 | 0.052 | 95.9 | 0.001 | 0.030 | 0.030 | 95.3 | 0.0 |
| | G-WPI | -0.001 | 0.055 | 0.056 | 95.4 | -0.000 | 0.042 | 0.043 | 95.1 | -0.002 | 0.054 | 0.056 | 95.6 | 0.001 | 0.046 | 0.046 | 94.9 | 0.0 |
| | G\(^{\mathrm{I}}\)-WPI | -0.001 | 0.051 | 0.052 | 95.8 | -0.000 | 0.035 | 0.035 | 94.0 | -0.002 | 0.049 | 0.052 | 96.0 | 0.001 | 0.036 | 0.036 | 94.9 | 0.0 |
| | G\(^{\mathrm{II}}\)-WPI | -0.001 | 0.051 | 0.052 | 95.8 | 0.000 | 0.034 | 0.034 | 93.9 | -0.002 | 0.049 | 0.052 | 96.1 | 0.002 | 0.035 | 0.034 | 95.1 | 0.0 |
| 0.2 | G–W | -0.004 | 0.050 | 0.049 | 93.2 | 0.002 | 0.048 | 0.047 | 95.2 | -0.005 | 0.050 | 0.048 | 93.4 | 0.021 | 0.051 | 0.050 | 93.5 | 0.0 |
| | G-WPI | -0.004 | 0.054 | 0.054 | 93.6 | 0.004 | 0.059 | 0.059 | 95.6 | -0.005 | 0.054 | 0.053 | 93.5 | 0.039 | 0.068 | 0.067 | 92.7 | 0.0 |
| | G\(^{\mathrm{I}}\)-WPI | -0.004 | 0.050 | 0.049 | 93.4 | 0.003 | 0.053 | 0.052 | 95.0 | -0.005 | 0.050 | 0.048 | 93.4 | 0.031 | 0.057 | 0.056 | 93.2 | 0.0 |
| | G\(^{\mathrm{II}}\)-WPI | -0.004 | 0.050 | 0.049 | 93.4 | 0.004 | 0.054 | 0.053 | 94.7 | -0.005 | 0.050 | 0.048 | 93.4 | 0.030 | 0.058 | 0.057 | 92.8 | 0.0 |
| 0.4 | G–W | 0.000 | 0.042 | 0.042 | 95.0 | 0.003 | 0.068 | 0.068 | 95.0 | -0.004 | 0.041 | 0.041 | 94.7 | 0.058 | 0.076 | 0.073 | 88.9 | 0.0 |
| | G-WPI | -0.000 | 0.045 | 0.047 | 95.5 | 0.006 | 0.078 | 0.077 | 95.1 | -0.007 | 0.044 | 0.045 | 95.3 | 0.088 | 0.092 | 0.088 | 83.7 | 0.0 |
| | G\(^{\mathrm{I}}\)-WPI | 0.000 | 0.042 | 0.042 | 95.0 | 0.005 | 0.075 | 0.075 | 95.9 | -0.005 | 0.041 | 0.041 | 94.7 | 0.076 | 0.085 | 0.083 | 86.7 | 0.0 |
| | G\(^{\mathrm{II}}\)-WPI | -0.000 | 0.042 | 0.042 | 94.9 | 0.005 | 0.080 | 0.080 | 96.0 | -0.005 | 0.041 | 0.041 | 94.8 | 0.079 | 0.094 | 0.090 | 88.6 | 0.0 |

The results under the exchangeable Gaussian copula in Table 1 show that when the Weibull model is specified for the onset time distribution, the empirical biases are negligible for all conditional estimating equations; there is very slight finite sample bias for the association parameter when \(m=200\) and the within-family association is strong (Kendall’s \(\tau =0.4\)). The empirical standard errors (ESE) agree with the average standard errors (ASE) based on the robust variance form, and the empirical coverage probabilities (ECP) of nominal 95% confidence intervals are in general within the acceptable range. Consistent with the theoretical results of Sect. 3.1, the greatest efficiency came from the conditional estimating equations with the full derivative matrix and full covariance matrix (G–W), followed by those with G\(^\mathrm{I}\) and WPI matrix \(W_i\) (G\(^\mathrm{I}\)-WPI). The empirical performance of the conditional estimating equations with the full \(G_i\) and WPI matrix \(W_i\) is worse than the others, again in alignment with the conclusion based on Fig. 1.
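The summary columns discussed above can be computed from simulation replicates along the following lines. This sketch assumes the conventional definitions (EBIAS as the mean estimate minus the true value, ESE as the standard deviation of the estimates, ASE as the mean of the estimated standard errors, and ECP from Wald-type intervals) and that non-converged replicates have already been dropped; the function name `summarize` is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def summarize(estimates, std_errors, truth, level=0.95):
    """Summarize replicates of one parameter across simulated data sets."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95% intervals
    # A replicate "covers" when the truth lies within estimate +/- z * SE.
    covered = [abs(e - truth) <= z * s for e, s in zip(estimates, std_errors)]
    return {
        "EBIAS": mean(estimates) - truth,
        "ESE": stdev(estimates),
        "ASE": mean(std_errors),
        "ECP%": 100.0 * sum(covered) / len(covered),
    }
```

Close agreement between `ESE` and `ASE` indicates that the robust variance estimator tracks the true sampling variability, which is the comparison made in the discussion of Table 1.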

Table 2 Empirical properties of estimates under conditional estimating equations for family data from response-dependent sampling, where within-family association is induced by a Gaussian copula with structure \(\tau _{pp} = 0.1, \tau _{ss}=0.4\) and \(\tau _{pc} = 0.2\); parametric margins and piecewise constant baseline hazards (3 pieces) are considered\(^\mathrm{a}\); \(n_i = 3\), \(nsim=1000\)

| | | \(\beta \) | | | | \(\gamma _0\) | | | | \(\gamma _1\) | | | | \(\gamma _2\) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Margins | EE\(^\mathrm{b}\) | BIAS | ESE | ASE | ECP% | BIAS | ESE | ASE | ECP% | BIAS | ESE | ASE | ECP% | BIAS | ESE | ASE | ECP% | NC % |
| \(m=200\) | | | | | | | | | | | | | | | | | | |
| Weibull | G–W | 0.001 | 0.103 | 0.104 | 96.3 | -0.002 | 0.172 | 0.169 | 94.4 | 0.015 | 0.194 | 0.196 | 95.2 | 0.004 | 0.143 | 0.138 | 94.7 | 0.0 |
| | G-WPI | 0.001 | 0.115 | 0.114 | 95.3 | 0.006 | 0.239 | 0.233 | 94.2 | 0.015 | 0.267 | 0.262 | 94.3 | 0.001 | 0.196 | 0.191 | 94.2 | 0.0 |
| | G\(^{\mathrm{I}}\)-WPI | 0.000 | 0.103 | 0.104 | 96.3 | -0.001 | 0.216 | 0.210 | 93.3 | 0.018 | 0.262 | 0.256 | 94.8 | 0.004 | 0.193 | 0.187 | 94.6 | 0.0 |
| PWC-3\(^\mathrm{a}\) | G–W | -0.000 | 0.102 | 0.103 | 96.2 | 0.011 | 0.179 | 0.175 | 94.4 | 0.031 | 0.207 | 0.208 | 95.0 | 0.011 | 0.147 | 0.142 | 94.3 | 0.2 |
| | G-WPI | -0.001 | 0.114 | 0.112 | 95.1 | 0.032 | 0.255 | 0.246 | 93.7 | 0.058 | 0.317 | 0.303 | 94.0 | 0.016 | 0.211 | 0.205 | 94.3 | 3.9 |
| | G\(^{\mathrm{I}}\)-WPI | -0.001 | 0.102 | 0.103 | 96.3 | 0.013 | 0.227 | 0.219 | 93.3 | 0.051 | 0.298 | 0.286 | 94.8 | 0.016 | 0.205 | 0.197 | 94.7 | 0.4 |
| \(m=1000\) | | | | | | | | | | | | | | | | | | |
| Weibull | G–W | -0.002 | 0.045 | 0.047 | 95.9 | -0.000 | 0.073 | 0.075 | 97.2 | 0.006 | 0.084 | 0.088 | 96.2 | 0.001 | 0.060 | 0.062 | 95.2 | 0.0 |
| | G-WPI | -0.001 | 0.050 | 0.051 | 95.6 | 0.005 | 0.102 | 0.104 | 95.0 | 0.005 | 0.114 | 0.117 | 95.9 | -0.001 | 0.084 | 0.085 | 95.8 | 0.0 |
| | G\(^{\mathrm{I}}\)-WPI | -0.002 | 0.045 | 0.047 | 96.0 | 0.002 | 0.091 | 0.094 | 95.6 | 0.007 | 0.111 | 0.115 | 96.0 | -0.000 | 0.082 | 0.084 | 96.4 | 0.0 |
| PWC-3\(^\mathrm{a}\) | G–W | -0.003 | 0.045 | 0.046 | 95.6 | 0.014 | 0.075 | 0.078 | 96.5 | 0.021 | 0.092 | 0.093 | 95.6 | 0.007 | 0.063 | 0.063 | 95.0 | 0.0 |
| | G-WPI | -0.003 | 0.050 | 0.051 | 95.3 | 0.030 | 0.108 | 0.110 | 94.4 | 0.040 | 0.135 | 0.135 | 94.2 | 0.011 | 0.090 | 0.092 | 95.9 | 0.0 |
| | G\(^{\mathrm{I}}\)-WPI | -0.004 | 0.045 | 0.046 | 95.7 | 0.019 | 0.095 | 0.098 | 95.0 | 0.036 | 0.126 | 0.128 | 94.9 | 0.010 | 0.087 | 0.088 | 95.5 | 0.0 |

Under the more general association structure, results for the G\(^\mathrm{II}\)-WPI estimating equation are not summarized because of its high non-convergence percentage. The other three conditional estimating equations again performed well under the correct Weibull model, with 100% of the replicates converging for both \(m=200\) and \(m=1000\); see Table 2. Empirical biases were generally small, there was good agreement between the empirical and average robust standard errors, and the empirical coverage probability was generally within the acceptable range. Under the piecewise constant model, the convergence rate was 100% when \(m=1000\) and the empirical properties of the estimators for \(\beta \) and \(\gamma \) were good. When \(m=200\), a small percentage of replicates failed to converge (at most 3.9%) and the finite sample biases were slightly larger, but the empirical coverage probabilities remained close to the nominal level.

## 4 Impact of Misspecifying the Dependence Structure

### 4.1 Limiting Bias Under Misspecified Conditional Estimating Equations

From Fig. 2, we see that the conditional estimating equations with the full \(G_i\) and WPI matrix \(W_i\) are the most sensitive to misspecification. Although one might anticipate that the full \(G_i\) and full covariance matrix \(W_i\) (G–W) would be less robust than G\(^\mathrm{I}\)-WPI or G\(^\mathrm{II}\)-WPI, the asymptotic relative biases of estimators defined through G–W are in general no larger than those under G\(^\mathrm{I}\)-WPI and G\(^\mathrm{II}\)-WPI when Kendall’s \(\tau \) is less than 0.3. The sensitivity of estimators from G–W to misspecification becomes more apparent, compared to those based on G\(^\mathrm{I}\)-WPI and G\(^\mathrm{II}\)-WPI, when Kendall’s \(\tau \) is larger (i.e., \(>0.3\)); G\(^\mathrm{I}\)-WPI is slightly more sensitive to this form of misspecification than G\(^\mathrm{II}\)-WPI. Furthermore, the asymptotic relative biases for \(\beta \) under the conditional estimating equations are all relatively modest when Kendall’s \(\tau \) is small to modest. If one is primarily interested in the estimation of \(\beta \), then the proposed conditional estimating equations are reasonably robust to misspecification of the copula function for modest Kendall’s \(\tau \), but the asymptotic biases of the dependence parameters are appreciable under misspecification of the dependence structure. This conclusion is analogous to those made regarding misspecification of the random effect distribution with response-dependent sampling [11, 13, 23]. We also conducted supplementary simulation studies demonstrating good agreement between the finite sample and asymptotic biases in studies with 200 and 1000 families (see Supplementary Material).

### 4.2 Power Implications of Dependence Structure Misspecification

*m* satisfying

*m* and significance level \(\alpha _1\), is

## 5 Application to the Psoriatic Arthritis Family Study

Hereditary factors are thought to be important in psoriatic arthritis, as some studies have suggested that close blood relatives of affected individuals are at higher risk of developing the disease compared to the general population. Interest therefore lies in characterizing the effect of genetic markers on the risk of disease; we consider four human leukocyte antigen (HLA) markers reported in the literature as being associated with psoriasis or psoriatic arthritis, namely HLA-B8, HLA-B27, HLA-C6, and HLA-C12. Characterizing the nature of the within-family association structure can also provide useful insight into the genetic basis of the disease. Particular interest lies in assessing the “parent-of-origin” effect; preliminary evidence suggests that the risk of disease may be transmitted more strongly through the father than through the mother. We refer readers to Pollock et al. [24] for associated results based on binary analyses.

Table 3 summarizes the results with the top half obtained from the full derivative and covariance matrices (G–W) and the bottom half reporting the results from G\(^\mathrm{I}\)-WPI. A model with no HLA covariate is given in the first column followed by four univariate models, with the last column containing results from a multivariate model including all four markers. The estimates for the association parameters are given in terms of \(\gamma \) and the three Kendall’s \(\tau \) parameters.

Based on the model with no HLA covariates, we find \(\hat{\tau }_\mathrm{ss} = 0.337\) (95% CI: 0.113, 0.528; *p* value = 0.002), indicating highly significant association between siblings in the disease onset time. The father–child association is lower at \(\hat{\tau }_\mathrm{fc} = 0.225\) (95% CI: −0.030, 0.452) and not quite statistically significant (*p* value = 0.072). For the mother–child association, we find \(\hat{\tau }_\mathrm{mc} = 0.130\) (95% CI: −0.153, 0.393), which is weaker still and not significant (*p* value = 0.364). A test of the parent-of-origin hypothesis based on \(H_0 : \gamma _2 - \gamma _3 = 0\) yields a Wald statistic of 1.435 (*p* value = 0.151), so there is insufficient evidence at the 5% level to claim a “parent-of-origin” effect. The results are broadly comparable for the HLA regression analyses based on the other conditional estimating equation (G\(^\mathrm{I}\)-WPI). For the association parameters, the estimates are somewhat lower with \(\hat{\tau }_\mathrm{ss} = 0.220\) (95% CI: −0.003, 0.423; *p* value = 0.046), \(\hat{\tau }_\mathrm{fc} = 0.104\) (95% CI: −0.128, 0.324; *p* value = 0.378), and \(\hat{\tau }_\mathrm{mc}= -0.018\) (95% CI: −0.256, 0.222; *p* value = 0.886). The Wald statistic of 1.682 (*p* value = 0.092) again does not suggest a “parent-of-origin” effect.
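The reported values can be checked with a short calculation. The sketch below assumes that each Kendall's \(\tau \) is obtained by inverting the Fisher-type link of Sect. 3.1, that the confidence limits for \(\tau \) are transformed Wald limits for \(\gamma \), and that the Wald statistic is referred to a standard normal distribution. These assumptions reproduce the G–W figures quoted above up to rounding, but they are inferences from the numbers rather than statements made in the text; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def gamma_to_tau(gamma):
    """Invert log((1 + tau) / (1 - tau)) = gamma, i.e. tau = tanh(gamma / 2)."""
    return math.tanh(gamma / 2.0)


def tau_interval(gamma_hat, se, level=0.95):
    """Wald interval for gamma, mapped to the tau scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return gamma_to_tau(gamma_hat - z * se), gamma_to_tau(gamma_hat + z * se)


def two_sided_p(z_stat):
    """Two-sided p value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))


# Sibling-sibling estimate and standard error from Table 3 (G-W, no HLA).
tau_ss = gamma_to_tau(0.700)
lower, upper = tau_interval(0.700, 0.242)
# Parent-of-origin Wald statistic quoted in the text.
p_origin = two_sided_p(1.435)
print(round(tau_ss, 3), round(lower, 3), round(upper, 3), round(p_origin, 3))
```

Small discrepancies in the third decimal place are expected because the inputs are themselves rounded values from Table 3.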

The large sample theory we develop can be used to plan a future family study; in particular, it is possible to calculate how many families would be required to ensure adequate power to test the parent-of-origin hypothesis. In a new study, we may consider recruitment of families of members of the registry and presume that the distribution of family members, ages at assessment, and other factors are similar to those in the current study. We use a sample size formula similar to (9) but for \(\gamma _2 - \gamma _3\) and determine that 627 families would be required to ensure 80% power to detect a significant difference between the father–child and mother–child associations using estimating function G–W when the true effects correspond to those seen in the first column of Table 3. The current study therefore appears to be grossly under-powered to formally test the parent-of-origin hypothesis.

Table 3. Estimates of analyses of HLA markers and time to disease onset based on conditional estimating equations using response-biased psoriatic arthritis family data; associated standard errors are in parentheses

| Parameter | Univariate: No HLA | Univariate: HLA-B8 | Univariate: HLA-B27 | Univariate: HLA-C6 | Univariate: HLA-C12 | Multivariate model |
|---|---|---|---|---|---|---|
| Full G and Full W (G\(-\)W) | | | | | | |
| \(\log \lambda \) | −5.461 (0.531) | −5.353 (0.455) | −5.573 (0.609) | −5.650 (0.590) | −5.342 (0.464) | −5.544 (0.543) |
| \(\log \kappa \) | 1.347 (0.086) | 1.347 (0.086) | 1.348 (0.086) | 1.349 (0.086) | 1.346 (0.087) | 1.349 (0.086) |
| \(\beta _{B8}\) | \(-\) | −0.483 (0.709) | \(-\) | \(-\) | \(-\) | −0.195 (0.751) |
| \(\beta _{B27}\) | \(-\) | \(-\) | 1.161 (0.755) | \(-\) | \(-\) | 1.308 (0.771) |
| \(\beta _{C6}\) | \(-\) | \(-\) | \(-\) | 0.397 (0.593) | \(-\) | 0.458 (0.625) |
| \(\beta _{C12}\) | \(-\) | \(-\) | \(-\) | \(-\) | 0.788 (0.769) | 0.982 (0.830) |
| \(\gamma _1\) | 0.700 (0.242) | 0.660 (0.224) | 0.741 (0.261) | 0.777 (0.242) | 0.620 (0.234) | 0.695 (0.244) |
| \(\gamma _2\) | 0.458 (0.264) | 0.414 (0.242) | 0.470 (0.281) | 0.536 (0.259) | 0.378 (0.264) | 0.423 (0.269) |
| \(\gamma _3\) | 0.261 (0.291) | 0.220 (0.268) | 0.300 (0.310) | 0.343 (0.285) | 0.173 (0.289) | 0.252 (0.287) |
| \(\tau _{ss}\) | 0.337 (0.107) | 0.318 (0.101) | 0.354 (0.114) | 0.370 (0.104) | 0.301 (0.106) | 0.334 (0.108) |
| \(\tau _{fc}\) | 0.225 (0.125) | 0.204 (0.116) | 0.231 (0.133) | 0.262 (0.121) | 0.187 (0.127) | 0.209 (0.129) |
| \(\tau _{mc}\) | 0.130 (0.143) | 0.110 (0.132) | 0.149 (0.151) | 0.170 (0.138) | 0.086 (0.143) | 0.125 (0.141) |
| G\(^\mathrm{I}\) and WPI W (G\(^\mathrm{I}\)\(-\)WPI) | | | | | | |
| \(\log \lambda \) | −5.037 (0.326) | −5.007 (0.313) | −5.105 (0.358) | −5.034 (0.325) | −5.052 (0.317) | −5.105 (0.331) |
| \(\log \kappa \) | 1.337 (0.087) | 1.338 (0.087) | 1.339 (0.087) | 1.337 (0.087) | 1.338 (0.088) | 1.340 (0.087) |
| \(\beta _{B8}\) | \(-\) | −0.490 (0.644) | \(-\) | \(-\) | \(-\) | −0.325 (0.669) |
| \(\beta _{B27}\) | \(-\) | \(-\) | 0.849 (0.587) | \(-\) | \(-\) | 0.969 (0.580) |
| \(\beta _{C6}\) | \(-\) | \(-\) | \(-\) | −0.037 (0.492) | \(-\) | 0.060 (0.537) |
| \(\beta _{C12}\) | \(-\) | \(-\) | \(-\) | \(-\) | 0.767 (0.644) | 0.911 (0.667) |
| \(\gamma _1\) | 0.448 (0.232) | 0.447 (0.234) | 0.481 (0.237) | 0.447 (0.238) | 0.437 (0.227) | 0.465 (0.238) |
| \(\gamma _2\) | 0.208 (0.237) | 0.199 (0.234) | 0.208 (0.238) | 0.206 (0.241) | 0.195 (0.241) | 0.180 (0.244) |
| \(\gamma _3\) | −0.036 (0.249) | −0.039 (0.246) | −0.005 (0.250) | −0.037 (0.256) | −0.059 (0.251) | −0.036 (0.254) |
| \(\tau _{ss}\) | 0.220 (0.110) | 0.220 (0.111) | 0.236 (0.112) | 0.220 (0.113) | 0.215 (0.108) | 0.228 (0.113) |
| \(\tau _{fc}\) | 0.104 (0.117) | 0.099 (0.116) | 0.104 (0.118) | 0.103 (0.119) | 0.097 (0.119) | 0.090 (0.121) |
| \(\tau _{mc}\) | −0.018 (0.124) | −0.019 (0.123) | −0.002 (0.125) | −0.019 (0.128) | −0.030 (0.125) | −0.018 (0.127) |

## 6 Discussion

Estimating functions have been developed to model the nature and extent of within-family dependence in disease onset times from family studies under response-dependent sampling. A novel aspect of this work is the formulation of the dependence measures on the basis of the disease onset time and the recognition that the available data on family members are handled more naturally as current status data rather than binary data. This approach utilizes all available data from probands and their relatives in assessing the association between age of onset and covariates and in evaluating the association structure of age of onset among family members. The biased sampling scheme typically employed in family studies is addressed by the use of conditional estimating equations where the conditioning event reflects the selection criteria. Several specific estimating functions within the class proposed are assessed in terms of efficiency and robustness; these results complement the standard results for second-order estimating functions since all moments in the proposed equations are conditional. We also outline how sample size requirements for family studies can be assessed based on this framework to ensure that power objectives are met. Code for solving the conditional second-order estimating equations (1) and for obtaining the variance estimates of Sect. 2.2 is available on GitHub at https://github.com/Yujie-Zhong/CGEE2.

We have focused on the use of estimating functions for the analysis of family data in part because the likelihood can be challenging to compute when the size of the family is large. Nevertheless, some assessment of the loss of efficiency in comparison to this optimal approach would be worthwhile. The validity of the proposed conditional second-order estimating equations hinges on correct specification of the dependence structure, a requirement that is analogous to the need for correct specification of the mixing distribution in random effects models for data obtained based on a response-dependent sampling scheme [23]. Assessing model adequacy is best done by testing for the need for model expansion; this could be carried out by testing the need for more cut-points in the baseline hazard function to accommodate a more flexible hazard function, or by testing the need for a more general dependence structure. In the present setting, the dependence structure is most easily formulated by selecting a working copula model for the joint distribution of the onset times in the population. If this dependence structure is misspecified, inconsistent estimates are obtained; we examine the consequences of such misspecification in Sect. 4 to make recommendations on the use of a particular derivative and working covariance matrix. The properties of estimators under model misspecification can be explored using large sample theory [31], but these will be influenced by response-dependent sampling schemes, and so a more general study of the effect of misspecification in this framework represents an important area for further research.

We have restricted our attention to parametric models for the onset time distribution. Natural extensions would be to introduce non-parametric or semi-parametric methods for estimating the marginal distributions. In the latter case, one can look at multiplicative Cox models, accelerated failure time models, and Aalen’s additive model, among many other methods. Joint estimation based on the most general conditional estimating equation can be challenging in this setting, but two-stage estimation procedures may be feasible; this is an area of current research. Preliminary work based on the piecewise constant baseline hazard model, however, suggests that studies may need to recruit a large number of families to estimate the marginal onset time distribution if the incidence rate is low. If the disease onset times are available for all or even some of the non-probands found to have the disease, these data could help in estimation; the estimating equations we present can be modified in this case to incorporate such data. Auxiliary samples can also be useful to enhance inferences.

While there is an increasing amount of attention given to the use of disease onset time as a basis for modeling within-family dependence, there remain challenging issues that warrant further attention. The primary challenge is in quantifying dependence in the presence of the competing risk of death [12, 26]. The classical illness–death process is a natural framework for modeling the occurrence of disease in individuals who are at risk, and generalization of this set-up to model within-family dependence is an area warranting attention. This issue is not unique to analyses based on disease onset times; when current status data are treated as binary data, the requirement that individuals are alive at the time of contact is ignored.

## Notes

### Acknowledgements

The authors thank Drs. Dafna Gladman, Vinod Chandran, and Lihi Eder for stimulating collaboration and helpful discussions involving the psoriatic arthritis research program. This research was financially supported by grants from the UK Medical Research Council [Unit Programme No. MC_UP_1302/3], the Natural Sciences and Engineering Research Council of Canada (RGPIN 155849), and the Canadian Institutes for Health Research (FRN 13887). Richard Cook is a Tier I Canada Research Chair in Statistical Methods for Health Research.

## References

- 1. Balemi A, Lee A (2009) Comparison of GEE1 and GEE2 estimation applied to clustered logistic regression. J Stat Comput Simul 79(4):361–378
- 2. Burden AD, Javed S, Bailey M, Hodgins M, Connor M, Tillman D (1998) Genetics of psoriasis: paternal inheritance and a locus on chromosome 6p. J Investig Dermatol 110(6):958–960
- 3. Burton PR, Palmer LJ, Jacobs K, Keen KJ, Olson JM, Elston RC (2000) Ascertainment adjustment: where does it take us? Am J Hum Genet 67(6):1505–1514
- 4. Burton PR, Tiller KJ, Gurrin LC, Cookson WO, Musk AW, Palmer LJ (1999) Genetic variance components analysis for binary phenotypes using generalized linear mixed models (GLMMs) and Gibbs sampling. Genet Epidemiol 17(2):118–140
- 5. Cannings C, Thompson EA (1977) Ascertainment in the sequential sampling of pedigrees. Clin Genet 12(4):208–212
- 6. Chatterjee N, Kalaylioglu Z, Shih JH, Gail MH (2006) Case-control and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics 62(1):36–48
- 7. Dorman JS, Trucco M, LaPorte RE, Kuller LH (1988) Family studies: the key to understanding the genetic and environmental etiology of chronic disease? Genet Epidemiol 5(5):305–310
- 8. Dunson DB, Dinse GE (2002) Bayesian models for multivariate current status data with informative censoring. Biometrics 58(1):79–88
- 9. Gelfand JM, Gladman DD, Mease PJ, Smith N, Margolis DJ, Nijsten T, Stern RS, Feldman SR, Rolstad T (2005) Epidemiology of psoriatic arthritis in the population of the United States. J Am Acad Dermatol 53(4):573–586
- 10. Gladman DD, Schentag CT, Tom B, Chandran V, Brockbank J, Rosen C, Farewell VT (2009) Development and initial validation of a screening questionnaire for psoriatic arthritis: the Toronto Psoriatic Arthritis Screen (ToPAS). Ann Rheum Dis 68(4):497–501
- 11. Glidden DV, Vittinghoff E (2004) Modelling clustered survival data from multicentre clinical trials. Stat Med 23(3):369–388
- 12. Gorfine M, Hsu L (2011) Frailty-based competing risks model for multivariate survival data. Biometrics 67(2):415–426
- 13. Gorfine M, De-Picciotto R, Hsu L (2012) Conditional and marginal estimates in case-control family data-extensions and sensitivity analyses. J Stat Comput Simul 82(10):1449–1470
- 14. Gorfine M, Hsu L, Parmigiani G (2013) Frailty models for familial risk with application to breast cancer. J Am Stat Assoc 108(504):1205–1215
- 15. Hsu L, Prentice RL, Zhao LP (1999) On dependence estimation using correlated failure time data from case-control family studies. Biometrika 86(4):743–753
- 16. Jewell NP, Van Der Laan M, Lei X (2005) Bivariate current status data with univariate monitoring times. Biometrika 92(4):847–862
- 17. Joe H (1997) Multivariate models and multivariate dependence concepts. Chapman and Hall, London
- 18. Khoury MJ, Beaty TH, Cohen BH (1993) Fundamentals of genetic epidemiology. Oxford University Press, New York
- 19. Laird NM, Lange C (2006) Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet 7(5):385–394
- 20. Li H, Yang P, Schwartz AG (1998) Analysis of age of onset data from case-control family studies. Biometrics 54(3):1030–1039
- 21. MacLean CJ, Neale MC, Meyer JM, Kendler KS, Rao DC (1990) Estimating familial effects on age at onset and liability to schizophrenia. II. Adjustment for censored data. Genet Epidemiol 7(6):419–426
- 22. Nelsen RB (2006) An introduction to copulas. Springer, New York
- 23. Neuhaus JM, McCulloch CE (2011) The effect of misspecification of random effects distributions in clustered data settings with outcome-dependent sampling. Can J Stat 39(3):488–497
- 24. Pollock RA, Thavaneswaran A, Pellett F, Chandran V, Petronis A, Rahman P, Gladman DD (2015) Further evidence supporting a parent-of-origin effect in psoriatic disease. Arthritis Care Res 67(11):2151–4658
- 25. Prentice RL (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics 44(4):1033–1048
- 26. Shih JH, Albert PS (2010) Modeling familial association of ages at onset of disease in the presence of competing risk. Biometrics 66(4):1012–1023
- 27. Shih JH, Chatterjee N (2002) Analysis of survival data from case-control family studies. Biometrics 58(3):502–509
- 28. Sun J (2006) The statistical analysis of interval-censored failure time data. Springer, New York
- 29. Thompson EA (1993) Sampling and ascertainment in genetic epidemiology: a tutorial review. University of Washington, Seattle
- 30. Weinberg CR (1999) Methods for detection of parent-of-origin effects in genetic studies of case-parents triads. Am J Hum Genet 65(1):229–235
- 31. White H (1982) Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–26
- 32. Zhao LP, Hsu L, Holte S, Chen Y, Quiaoit F, Prentice RL (1998) Combined association and aggregation analysis of data from case-control family studies. Biometrika 85(2):299–315
- 33. Zhao LP, Prentice RL (1990) Correlated binary regression using a quadratic exponential model. Biometrika 77(3):642–648

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.