Introducing a new estimator and test for the weighted all-cause hazard ratio
Abstract
Background
The rationale for the use of composite time-to-event endpoints is to increase the number of expected events, and thereby the power, by combining several event types of clinical interest. The all-cause hazard ratio is the standard effect measure for composite endpoints, where the all-cause hazard function is given as the sum of the event-specific hazards. However, the effects of the individual components might differ, in magnitude or even in direction, which leads to interpretation difficulties. Moreover, the individual event types are often of different clinical relevance, which further complicates interpretation. Our working group recently proposed a new weighted effect measure for composite endpoints called the ‘weighted all-cause hazard ratio’. By imposing relevance weights for the components, the interpretation of the composite effect becomes more ‘natural’. Although the weighted all-cause hazard ratio seems an elegant solution to overcome interpretation problems, the originally published approach has several shortcomings: First, the proposed point estimator requires prespecification of a parametric survival model. Second, no closed formula for a corresponding test statistic was provided; instead, a permutation test was proposed. Third, no clear guidance for the choice of the relevance weights was given. In this work, we overcome these problems.
Methods
Within this work, a new non-parametric estimator and a related closed-formula test statistic are presented. The performance of the new estimator and test is compared to that of the original ones by a Monte Carlo simulation study.
Results
The original parametric estimator is sensitive to misspecifications of the survival model. The new non-parametric estimator turns out to be very robust even if the required assumptions are not met. The new test shows considerably better power properties than the permutation test and is computationally much less expensive, but might not preserve the type I error in all situations. A scheme for choosing the relevance weights in the planning stage is provided.
Conclusion
We recommend using the non-parametric estimator along with the new test to assess the weighted all-cause hazard ratio. Concrete guidance for the choice of the relevance weights is now available. Thus, applying the weighted all-cause hazard ratio in clinical applications is both feasible and recommended.
Keywords
Composite endpoint, Weighted effect measure, Weight-based log-rank test, Simulation study
Background
In many clinical trials, the aim is to compare two treatment groups with respect to a rarely occurring event like myocardial infarction or death. In this situation, a high number of patients has to be included and observed over a long period of time to demonstrate a relevant treatment effect and to reach an acceptable power. Combining several events of interest within a so-called composite endpoint can lead to a smaller required sample size and save time, as a higher number of events is meant to increase the power. The common treatment effect measure for composite endpoints is the all-cause hazard ratio. This effect measure is based on the total number of events irrespective of their type. Commonly, either the log-rank test or the Cox proportional hazards model [1, 2, 3, 4] is used for analysing the all-cause hazard ratio. However, the interpretation of the all-cause hazard ratio as a composite treatment effect can be difficult. This is due to two reasons: First, the composite effect does not necessarily reflect the effects of the individual components, which can differ in magnitude or even in direction [5, 6, 7]. Second, the distinct event types can be of different clinical relevance. For example, the fatal event ‘death’ is more relevant than a non-fatal event like ‘cardiovascular hospital admission’. Moreover, the less relevant event often contributes a higher number of events and therefore has a higher influence on the composite effect than the more relevant event.
Current guidelines on clinical trial methodology hence recommend combining only events of the same clinical relevance [3, 8]. However, this is rather unrealistic in clinical practice, as important components like ‘death’ cannot be excluded from the primary analysis, although a fatal event is clearly more relevant than any other non-fatal event. Therefore, to address the problems that arise in the analysis of a composite endpoint, other methods to ease the interpretation of results are needed. An intuitive approach is to define a weighted composite effect measure with weights that reflect the different levels of clinical relevance of the components. Weighted effect measures have been proposed and compared by several authors [9, 10, 11, 12, 13, 14]. The main disadvantages of these approaches include the high dependence on the censoring mechanism and on competing risks [13, 14]. Recently, Rauch et al. [15] proposed a new weighted effect measure called the ‘weighted all-cause hazard ratio’. This new effect measure is defined as the ratio of the weighted averages of the cause-specific hazards for two groups, where the predefined weights are assigned to the individual cause-specific hazards. With equal weights for the components, the weighted all-cause hazard ratio corresponds to the common all-cause hazard ratio and thus defines a natural extension of the standard approach. In this work, we address the following questions:

How robust is the original estimator for the weighted all-cause hazard ratio against misspecifications of the underlying parametric survival model?

How robust is the new alternative non-parametric estimator for the weighted all-cause hazard ratio?

How can we derive a closed-formula test statistic for testing the weighted all-cause hazard ratio?

How do the different estimators and tests behave in a direct performance comparison?

What are the required steps when choosing adequate weighting factors in the planning stage?
This paper is organized as follows: In the Methods Section, we start by introducing the standard unweighted approach for analysing a composite time-to-first-event endpoint. In the same section, the weighted all-cause hazard ratio is introduced, along with the original parametric estimator and the permutation test as recently proposed by Rauch et al. [15]. A new non-parametric estimator for the weighted all-cause hazard ratio and a related closed-formula test are introduced subsequently. Next, we provide step-by-step guidance on the choice of the relevance weighting factors. In the Results Section, the different estimators and tests for the weighted all-cause hazard ratio are compared by means of a Monte Carlo simulation study to evaluate their performance for various data scenarios, in particular those that meet and those that violate the underlying model assumptions. We discuss our methods and results and finish the article with concluding remarks.
Methods
The standard all-cause hazard ratio
Throughout this work, the interest lies in a two-arm clinical trial where an intervention I shall be compared to a control C with respect to a composite time-to-event endpoint. A total of n individuals are randomized in a 1:1 allocation to the two groups. The composite endpoint consists of k components EP_{j}, j=1,...,k. It is assumed that a lower number of events corresponds to a more favourable result. The observational period is given by the interval [0,τ]. The study aim is to demonstrate superiority of the new intervention, and therefore a one-sided test problem is formulated.
Definitions and test problem
where the indices I and C denote the group allocation and proportional hazards are assumed, so that θ_{CE} is constant in time. Note that the proportional hazards assumption can only hold true for both the composite and the components if equal cause-specific baseline hazards are assumed across all components.
Point estimator and test statistic
A semiparametric estimator \(\widehat {\theta }_{CE}\) of the all-cause hazard ratio can be obtained by means of the partial maximum-likelihood estimator from the well-known Cox model [1].
The test statistic LR is approximately standard normally distributed under the null hypothesis given in (2). Negative values of the test statistic favour the intervention, and therefore the null hypothesis is rejected if LR≤−z_{1−α}, where z_{1−α} is the (1−α)-quantile of the standard normal distribution and α is the one-sided significance level.
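As a small numerical illustration of this decision rule (the observed values of LR below are hypothetical numbers chosen for the sketch, not results from the paper):

```python
from statistics import NormalDist

def reject_null(lr: float, alpha: float = 0.025) -> bool:
    """One-sided decision rule: reject H0 if LR <= -z_{1-alpha}."""
    z = NormalDist().inv_cdf(1 - alpha)  # (1-alpha)-quantile of N(0,1)
    return lr <= -z

# Hypothetical observed statistics at one-sided alpha = 0.025 (z_{0.975} ~ 1.96)
print(reject_null(-2.10))  # True: -2.10 <= -1.96
print(reject_null(-1.50))  # False: -1.50 > -1.96
```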
The weighted all-cause hazard ratio
Definitions and test problem
where the non-negative weights \(w_{EP_{j}}\geq 0\), j=1,...,k, reflect the clinical relevance of the components EP_{j}, j=1,...,k. If the weights are all set equal to 1 \((w_{EP_{1}}=w_{EP_{2}}=...=w_{EP_{k}}=1)\), the weighted all-cause hazard corresponds to the standard all-cause hazard.
The hypotheses to be assessed in the confirmatory analysis are thus equivalent to the common unweighted approach.
Original point estimator and test statistic
The prespecification of a survival model to identify the cause-specific hazards must be seen as a considerable restriction, as the shape of the survival distribution is usually not known in advance. Thus, it is of interest to evaluate how sensitively the parametric estimator reacts when the survival model is misspecified. Moreover, there is a general interest in deriving a less restrictive non-parametric estimator.
A related variance estimator for (8) cannot easily be deduced, and thus an asymptotic distribution of the parametric estimator given in (8) is not available. Therefore, Rauch et al. [15] considered a permutation test to test the null hypothesis specified above. For the permutation test, the sampling distribution is built by resampling the observed data: the originally assigned treatment groups are randomly reassigned to the observations without replacement in several runs. Although this is an elegant option without the need for further restrictive assumptions, the disadvantage is that such a permutation test is not available as a standard application in statistical software but requires implementation. Moreover, depending on the trial sample size and the computing capacities, this is a very time-consuming approach.
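The general mechanics of such a permutation test can be sketched as follows; `stat` stands for any statistic of the data and the group labels (for instance an estimate of the weighted all-cause log hazard ratio), and all function and variable names here are our own illustrative choices, not the authors' implementation:

```python
import random

def permutation_pvalue(stat, data, groups, n_perm=1000, seed=1):
    """One-sided permutation p-value: the proportion of random relabellings
    whose statistic is at least as small as the observed one."""
    observed = stat(data, groups)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        relabelled = groups[:]
        rng.shuffle(relabelled)  # reassign group labels without replacement
        if stat(data, relabelled) <= observed:
            hits += 1
    return hits / n_perm

# Toy usage with a simple statistic (difference in group means)
def mean_diff(data, groups):
    i = [x for x, g in zip(data, groups) if g == "I"]
    c = [x for x, g in zip(data, groups) if g == "C"]
    return sum(i) / len(i) - sum(c) / len(c)

p = permutation_pvalue(mean_diff, [1, 2, 3, 10, 11, 12], ["I"] * 3 + ["C"] * 3)
```

Because the sampling distribution is rebuilt from scratch for every analysis, the cost grows with both the sample size and the number of permutation runs, which is the computational drawback noted above.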
New point estimator and closed-formula test statistic
This is a very restrictive assumption that is usually not met in practice. The assumption is only required to formally derive the new non-parametric estimator. We do not generally focus on data situations where this assumption is fulfilled. The estimator is only relevant for practical use if deviations from this assumption produce no relevant bias. This will be investigated in detail in the sections Simulation scenarios and Results.
In contrast to the parametric estimator \(\hat \theta ^{w}_{CE}(t)\) given in (8), the non-parametric estimator \(\widetilde {\theta }^{w}_{CE}(t)\) given in (10) does not require the prespecification of a survival model. However, the correctness of the non-parametric estimator is still based on the assumption of equal cause-specific baseline hazards. In case the baseline hazards differ, \(\widetilde {\theta }^{w}_{CE}(t)\) can still be calculated but represents a biased estimator of \(\theta ^{w}_{CE}(t)\). Therefore, it is of interest to evaluate how sensitively the non-parametric estimator reacts when the equal baseline hazards assumption is violated.
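The basic building block of such a non-parametric approach can be illustrated with weighted cause-specific Nelson-Aalen estimates: the weighted cumulative all-cause hazard of one group is the weighted sum of the cause-specific cumulative hazard estimates. The sketch below only illustrates this idea and is not a reproduction of the exact estimator (10); all names and the toy data are our own:

```python
def weighted_nelson_aalen(times, events, weights, t):
    """Weighted cumulative all-cause hazard at time t for one group:
    sum over event types j of w_j times the cause-specific Nelson-Aalen
    increments 1/n(at risk), assuming no tied observation times.
    times: observation times; events: event type or None if censored."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    cum = 0.0
    for i in order:
        if times[i] > t:
            break
        if events[i] is not None:
            cum += weights[events[i]] / at_risk  # weighted increment w_j / n
        at_risk -= 1
    return cum

# Toy group: an EP1 event at t=1, an EP2 event at t=2, one censoring at t=3;
# at t=2.5 the weighted cumulative hazard is 1/3 + 0.5/2 (about 0.583)
print(weighted_nelson_aalen([1, 2, 3], ["EP1", "EP2", None],
                            {"EP1": 1.0, "EP2": 0.5}, t=2.5))
```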
An alternative testing procedure to the discussed permutation test can be formulated as a weight-based log-rank test statistic derived from a modification of the common log-rank test statistic given in (3). We use the expression ‘weight-based log-rank test’ instead of ‘weighted log-rank test’, as in the literature the weighted log-rank test refers to weights assigned to the different observation time points, whereas we aim to weight the different event types of a composite endpoint.
assuming that no events of different types occur at the same time point.
Under the null hypothesis of equal weighted composite (cumulative) hazards, the test statistic (11) is approximately standard normally distributed. Hence, the null hypothesis is rejected if LR^{w}≤−z_{1−α}, where z_{1−α} is the (1−α)-quantile of the standard normal distribution and α is the one-sided significance level.
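One way such a statistic can be assembled is sketched below: a log-rank-type statistic in which every event contributes its relevance weight to the observed-minus-expected sum and (squared) to the variance. This is our own simplified illustration under the no-ties assumption above, not a reproduction of the exact statistic (11):

```python
import math

def weight_based_logrank(times, events, groups, weights):
    """Log-rank-type statistic where each event of type j contributes its
    relevance weight w_j to the observed-minus-expected sum (no tied times).
    times: observation times; events: event type or None if censored;
    groups: 'I' or 'C'; weights: dict mapping event type to weight."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_i = sum(1 for g in groups if g == "I")
    n_c = len(groups) - n_i
    o_minus_e, var = 0.0, 0.0
    for i in order:
        n = n_i + n_c
        if events[i] is not None and n > 1:
            w = weights[events[i]]
            p = n_i / n                            # expected share of group I
            o_minus_e += w * ((groups[i] == "I") - p)
            var += w ** 2 * p * (1 - p)
        if groups[i] == "I":                       # subject leaves the risk set
            n_i -= 1
        else:
            n_c -= 1
    return o_minus_e / math.sqrt(var)

# With unit weights and one event per time point this reduces to the
# ordinary log-rank statistic (toy data)
z = weight_based_logrank([1, 2, 3, 4], ["EP1"] * 4, ["I", "C", "I", "C"],
                         {"EP1": 1.0})
```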
Note that the common weighted log-rank test can be shown to be equivalent to the Cox score test [16] because the weights act on the coefficient β, and thus the partial likelihood and its logarithm can easily be deduced. The intention of the common weighted log-rank test is to weight the time points. In our weight-based log-rank test, however, the weights have another meaning and act on the whole hazard, not only on the coefficient. Thus, the log-likelihood translates to a form where the weights enter additively, and therefore the score test does not translate to the test statistic proposed in this work. This is also the reason why we call our test a ‘weight-based’ and not a ‘weighted’ log-rank test. Our test is valid but must be interpreted as a Wald-type test statistic.
Step-by-step guidance for the choice of weights
Thus, by introducing the component weights we implicitly modify the event time distribution, that is, the corresponding survival function. When choosing a weight unequal to 1, the survival distribution changes its shape. For a weight larger than 1, the number of events is artificially increased and, as a consequence, the survival function decreases sooner. In contrast, for a weight smaller than 1 the survival distribution becomes flatter, as the number of events is artificially decreased. Whereas the all-cause hazard ratio can be heavily masked by a large cause-specific hazard of a less relevant component, a more relevant component with a lower number of events can only have a meaningful influence on the composite effect measure when it is up-weighted (or if the less relevant component is down-weighted accordingly). Conversely, if a large cause-specific hazard is down-weighted, this can result in a power loss. Therefore, weighting can improve interpretation, but the effect on power can be positive or negative, depending on the data situation at hand.
It can be seen that the weights still act multiplicatively on the cumulative cause-specific hazards, and the event time distributions for the different event types are also connected multiplicatively. With the introduction of the weights we still assume that an individual can only experience one event, but (for weights smaller than 1) fewer individuals experience the event. This means that the expected number of events decreases with a weight smaller than 1. Therefore, the weighted survival function for the composite still corresponds to a time-to-first-event setting, but with a proportion of events which is lower compared to the unweighted approach.
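For constant cause-specific hazards, the two multiplicative representations of the weighted survival function coincide: the exponential of the negative weighted cumulative hazard equals the product of the component survival functions, each raised to its relevance weight. The hazard and weight values below are arbitrary choices for this sketch, not taken from the paper:

```python
import math

# Illustrative constant cause-specific hazards and relevance weights
hazards = {"EP1": 0.1, "EP2": 0.4}
weights = {"EP1": 1.0, "EP2": 0.2}
t = 2.0

# Representation 1: survival from the weighted cumulative all-cause hazard,
# sum_j w_j * Lambda_j(t) with Lambda_j(t) = lambda_j * t
cum_hazard = sum(weights[j] * hazards[j] * t for j in hazards)
s_from_hazard = math.exp(-cum_hazard)

# Representation 2: product of component survival functions, each raised
# to its relevance weight
s_from_product = math.prod(math.exp(-hazards[j] * t) ** weights[j]
                           for j in hazards)

print(abs(s_from_hazard - s_from_product) < 1e-12)  # True
```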
 1.
Identify the clinically most relevant event type (e.g. ‘death’) and assign a weight of 1.
 2.
Choose the order of clinical relevance for the remaining event types. For each event type EP_{j}, answer the question "How many events of type EP_{j} can be considered as equally harmful as observing one event (or any other number of reference events) of the clinically most relevant endpoint?". For example, if in the example given above 5 events of type EP_{2} are considered as equally harmful as one event of EP_{1}, then the weighting scheme proposed in Scenario B might be preferred. If instead the researcher argues that 5 events of type EP_{2} are considered as equally harmful as 3 events of EP_{1}, then the weighting scheme proposed in Scenario A should be preferred. The weights are thus meant to bring all events to the same severity scale. By assigning a weight of 1 to the most relevant event type, this event type acts as the reference event. Therefore, the weighted survival function and its summarizing measures (median survival, hazard ratio) can be interpreted as a standard survival function for the reference event. For example, if ‘death’ is the reference event, then, on a population as well as on an individual patient level, the weighted survival function expresses the probability of being neither dead nor in a condition considered as equally harmful. The median weighted survival can be interpreted as the time when half of the population is either dead or in an equally harmful condition.
 3.
If there are assumptions about the form of the underlying event time distributions, then the functional form of the cause-specific hazards is known. The weighted cause-specific hazards are obtained by simple multiplication with the weighting factors. We recommend choosing several weighting scenarios, plotting the resulting weighted and unweighted event time distributions, and investigating graphically how different weights would affect the expected survival time and median survival per group. Moreover, the weighted and unweighted hazard ratio can be analytically deduced and compared. By this, the impact of the weighting scheme becomes more explicit.
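For constant cause-specific hazards, the weighted all-cause hazard ratio reduces to the ratio of the weighted hazard sums, so the impact of a candidate weighting scheme can be computed directly at this planning step. Using the constant hazards of Scenario 7 from the simulation study below (function names are ours), the opposite component effects cancel under the weights (1, 0.1), consistent with the true effect of 0 reported in the Results:

```python
def weighted_hr(hz_i, hz_c, w):
    """Weighted all-cause hazard ratio for constant cause-specific hazards:
    (sum_j w_j * lambda_j^I) / (sum_j w_j * lambda_j^C)."""
    return sum(w[j] * hz_i[j] for j in w) / sum(w[j] * hz_c[j] for j in w)

# Scenario 7 hazards: opposite component effects (EP1 favours I, EP2 favours C)
hz_i = {"EP1": 0.05, "EP2": 1.0}
hz_c = {"EP1": 0.10, "EP2": 0.5}

print(round(weighted_hr(hz_i, hz_c, {"EP1": 1, "EP2": 1}), 2))    # 1.75 unweighted
print(round(weighted_hr(hz_i, hz_c, {"EP1": 1, "EP2": 0.1}), 2))  # 1.0: effects cancel
```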
Simulation scenarios
To provide a systematic comparison of the original parametric estimator \(\widehat \theta ^{w}_{CE}(t)\) to the new non-parametric estimator \(\widetilde \theta ^{w}_{CE}(t)\) for the weighted all-cause hazard ratio, and to analyse the performance of the weight-based log-rank test compared to the originally proposed permutation test, we performed a simulation study with the software R Version 3.3.3 [17].
Investigated simulation scenarios
Scenario | \(\lambda ^{I}_{EP_{1}}(t)\) | \(\lambda ^{C}_{EP_{1}}(t)\) | \(\lambda ^{I}_{EP_{2}}(t)\) | \(\lambda ^{C}_{EP_{2}}(t)\) | Description | Assumptions for original parametric estimator ^{∗} | Assumptions for new non-parametric estimator ^{#}
1 | 0.24 | 0.4 | 0.24t | 0.8t | Weibull distributed; PH assumption only fulfilled for components; unequal cause-specific baseline hazards | \(\checkmark \) | ✗
2 | 0.192t^{−0.2} | 0.28t^{−0.3} | 0.084t^{−0.3} | 0.32t^{−0.2} | Weibull distributed; PH assumption not fulfilled for components and composite; unequal cause-specific baseline hazards | ✗ | ✗
3 | 0.24t | 0.8t | 1.2t^{2} | 0.72t^{2} | Weibull distributed; PH assumption only fulfilled for components; unequal cause-specific baseline hazards | \(\checkmark \) | ✗
4 | 0.2t^{−0.4} | 0.3t^{−0.3} | 0.1t | 0.1t^{1.5} | Weibull distributed; PH assumption not fulfilled for components and composite; unequal cause-specific baseline hazards | ✗ | ✗
5 | 0.5t^{−0.4} | 0.9t^{−0.5} | 0.25 | 0.22t^{0.1} | Weibull distributed; PH assumption not fulfilled for components and composite; unequal cause-specific baseline hazards | ✗ | ✗
6 | 0.24 | 0.24 | 0.24t | 0.24t | Weibull distributed; PH assumption fulfilled for components and composite; unequal cause-specific baseline hazards | \(\checkmark \) | ✗
7 | 0.05 | 0.1 | 1 | 0.5 | Weibull distributed; PH assumption fulfilled for components and composite; equal cause-specific baseline hazards | \(\checkmark \) | \(\checkmark \)
8 | 0.42e^{0.7t}−0.42 | 0.7e^{0.7t}−0.7 | 0.21e^{0.7t}−0.21 | 0.7e^{0.7t}−0.7 | Gompertz distributed; PH assumption fulfilled for components and composite; equal cause-specific baseline hazards | ✗ | \(\checkmark \)
9 | 0.42e^{2t}−0.42 | 0.7e^{2t}−0.7 | 0.21e^{0.7t}−0.21 | 0.7e^{0.7t}−0.7 | Gompertz distributed; PH assumption only fulfilled for components; unequal cause-specific baseline hazards | ✗ | ✗
10 | 0.42e^{0.7t} | 0.7e^{0.8t} | 0.21e^{0.8t} | 0.7e^{0.6t} | Gompertz distributed; PH assumption not fulfilled for components and composite; unequal cause-specific baseline hazards | ✗ | ✗
Thereby, κ>0 is the scale parameter and ν>0 is the shape parameter. The investigated scenarios show to some extent the flexibility of the Weibull model. Situations with earlier occurring events for one event type (higher cause-specific hazard) and later occurring events for the other event type (lower cause-specific hazard) are captured, as well as situations where the difference in hazards is smaller. In Scenarios 1–6 at least one cause-specific hazard is time-dependent, whereas in Scenario 7 the hazards are constant.
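The proportional-hazards structure of these scenarios can be checked directly from the hazard functions. For Scenario 1 from the table above, the component-wise hazard ratios are constant over time while the composite (sum) ratio is not; the function names below are our own:

```python
# Scenario 1 cause-specific hazards from the table above
lam_i_ep1 = lambda t: 0.24        # intervention, EP1 (constant)
lam_c_ep1 = lambda t: 0.40        # control, EP1 (constant)
lam_i_ep2 = lambda t: 0.24 * t    # intervention, EP2 (linear in t)
lam_c_ep2 = lambda t: 0.80 * t    # control, EP2 (linear in t)

for t in (0.5, 1.0, 2.0):
    hr_ep1 = lam_i_ep1(t) / lam_c_ep1(t)   # constant 0.6 -> PH holds for EP1
    hr_ep2 = lam_i_ep2(t) / lam_c_ep2(t)   # constant 0.3 -> PH holds for EP2
    hr_comp = (lam_i_ep1(t) + lam_i_ep2(t)) / (lam_c_ep1(t) + lam_c_ep2(t))
    # composite ratio drifts over time: 0.45, 0.40, 0.36 -> PH violated
    print(round(hr_ep1, 2), round(hr_ep2, 2), round(hr_comp, 2))
```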
The hazard for the composite increases over time for Scenarios 1 and 3 and decreases for Scenario 2. For Scenarios 4 and 5, the hazard first decreases and then increases after a while. For Scenarios 1 and 3, the proportional hazards assumption is fulfilled for each of the event types simultaneously. Also note that in Scenario 3, and partly in Scenarios 4 and 5, the effects for the event types point in opposite directions. Scenario 6 depicts a situation where no treatment effect exists for the individual components or the composite. In Scenario 7 there are opposite effects for the individual components which cancel out in the combined composite for one weighting scheme.
which is also referred to as the Gompertz-Makeham hazard [20, 21]. Again, κ>0 is a scale parameter and ν>0 a shape parameter. In addition, a more general term ε≥−κ defining the intercept is formulated. For all scenarios with Gompertz distributed event times, the hazard for the composite increases over time. In the situation where the shape parameters are equal across all event types, the proportional hazards assumption applies to the composite. This is the case for Scenario 8 but not for Scenarios 9 and 10. The proportional hazards assumption also holds true for each event type separately in Scenarios 8 and 9. In Scenario 10 the proportional hazards assumption is violated for all event types and for the composite.
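For example, in Scenario 8 all shape parameters equal 0.7, so the composite hazard ratio is constant, which can be verified numerically (function names are our own):

```python
import math

# Scenario 8 composite hazards: sums of the cause-specific Gompertz-Makeham
# hazards from the table above, all with shape parameter 0.7
def lam_comp_i(t):  # intervention: (0.42 + 0.21) * (e^{0.7t} - 1)
    return (0.42 * math.exp(0.7 * t) - 0.42) + (0.21 * math.exp(0.7 * t) - 0.21)

def lam_comp_c(t):  # control: (0.70 + 0.70) * (e^{0.7t} - 1)
    return (0.70 * math.exp(0.7 * t) - 0.70) + (0.70 * math.exp(0.7 * t) - 0.70)

for t in (0.5, 1.0, 2.0):
    # equal shapes make the composite ratio constant: 0.63 / 1.40 = 0.45
    print(round(lam_comp_i(t) / lam_comp_c(t), 2))  # 0.45 at every t
```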
Results
Simulation results
Sc. | Assumptions for original parametric estimator^{*} | Assumptions for new non-parametric estimator^{#} | τ | \(w_{EP_{1}}\) | \(w_{EP_{2}}\) | True \(ln(\theta _{CE}^{w}(\tau))\) | Mean number of EP_{1} events (sd) | Mean number of EP_{2} events (sd) | Mean of \(ln(\hat {\theta }^{w}_{CE}(\tau))\) (sd) | Mean of \(ln(\tilde {\theta }^{w}_{CE}(\tau))\) (sd) | Power, permutation test | Power, weight-based log-rank test
1a | \(\checkmark \) | ✗ | 1 | 1 | 0.1 | 0.60 | 50.21 (6.51) | 35.16 (5.21) | 0.61 (0.27) | 0.57 (0.28) | 0.66 | 0.72
1b | | | 2 | 1 | 0.1 | 0.67 | 72.60 (6.74) | 80.13 (6.73) | 0.67 (0.20) | 0.60 (0.24) | 0.94 | 0.90
1c | | | 2 | 0.1 | 1 | 1.18 | | | 1.20 (0.22) | 1.14 (0.21) | 1.00 | 1.00
2a | ✗ | ✗ | 1 | 1 | 0.1 | 0.44 | 48.22 (6.08) | 37.15 (5.33) | 0.59 (0.29) | 0.58 (0.29) | 0.58 | 0.75
2b | | | 2 | 1 | 0.1 | 0.38 | 67.79 (6.75) | 51.93 (5.60) | 0.54 (0.23) | 0.53 (0.23) | 0.64 | 0.81
3a | \(\checkmark \) | ✗ | 1 | 1 | 0.1 | 0.88 | 39.97 (5.63) | 47.94 (5.87) | 0.90 (0.29) | 1.00 (0.31) | 0.90 | 0.96
3b | | | 1 | 0.1 | 1 | 0.43 | | | 0.42 (0.29) | 0.38 (0.28) | 0.00 | 0.00
3c | | | 2 | 1 | 0.1 | 0.68 | 69.84 (6.39) | 124.63 (6.40) | 0.67 (0.20) | 0.80 (0.21) | 0.96 | 1.00
4a | ✗ | ✗ | 1 | 1 | 0.1 | 0.39 | 62.84 (6.39) | 6.49 (2.50) | 0.22 (0.64) | 0.23 (0.26) | 0.14 | 0.29
4b | | | 1 | 0.1 | 1 | 0.08 | | | 0.12 (0.96) | 0.01 (0.44) | 0.02 | 0.05
4c | | | 2 | 1 | 0.1 | 0.46 | 86.66 (6.95) | 24.29 (4.57) | 0.27 (0.21) | 0.28 (0.21) | 0.27 | 0.44
4d | | | 2 | 0.1 | 1 | 0.36 | | | 0.12 (0.40) | 0.15 (0.32) | 0.05 | 0.17
5a | ✗ | ✗ | 1 | 1 | 0.1 | 0.56 | 133.27 (6.26) | 19.31 (4.05) | 0.82 (0.28) | 0.82 (0.17) | 1.00 | 1.00
5b | | | 1 | 0.1 | 1 | 0.03 | | | 0.04 (0.57) | 0.14 (0.30) | 0.03 | 0.34
6a | \(\checkmark \) | ✗ | 2 | 1 | 0.1 | 0.00 | 67.04 (6.74) | 56.49 (6.37) | 0.00 (0.21) | 0.00 (0.23) | 0.02 | 0.08
6b | | | 2 | 0.1 | 1 | 0.00 | | | 0.00 (0.25) | 0.00 (0.24) | 0.02 | 0.06
7a | \(\checkmark \) | \(\checkmark \) | 2 | 1 | 0.1 | 0.00 | 15.83 (3.84) | 141.89 (6.18) | 0.01 (0.30) | 0.02 (0.28) | 0.02 | 0.02
7b | | | 2 | 0.1 | 1 | 0.68 | | | 0.68 (0.16) | 0.68 (0.17) | 0.00 | 0.00
8a | ✗ | \(\checkmark \) | 2 | 1 | 0.1 | 0.56 | 108.25 (6.93) | 78.79 (6.69) | 0.69 (0.21) | 0.56 (0.19) | 0.92 | 0.96
8b | | | 2 | 0.1 | 1 | 1.12 | | | 1.29 (0.24) | 1.13 (0.21) | 1.00 | 1.00
9a | ✗ | ✗ | 1 | 0.1 | 1 | 0.88 | 132.78 (6.43) | 15.22 (3.56) | 1.10 (0.42) | 0.93 (0.31) | 0.77 | 0.97
9b | | | 2 | 1 | 0.1 | 0.51 | 178.24 (4.21) | 21.76 (4.21) | 0.66 (0.18) | 0.52 (0.15) | 0.96 | 0.97
9c | | | 2 | 0.1 | 1 | 0.71 | | | 1.31 (0.48) | 0.89 (0.30) | 0.90 | 0.99
10a | ✗ | ✗ | 1 | 1 | 0.1 | 0.64 | 84.95 (7.02) | 62.44 (6.34) | 0.58 (0.21) | 0.59 (0.21) | 0.81 | 0.94
10b | | | 1 | 0.1 | 1 | 0.95 | | | 1.04 (0.23) | 1.05 (0.23) | 1.00 | 1.00
10c | | | 2 | 1 | 0.1 | 0.72 | 113.64 (6.92) | 80.58 (6.55) | 0.55 (0.17) | 0.61 (0.18) | 0.89 | 0.98
10d | | | 2 | 0.1 | 1 | 0.79 | | | 0.95 (0.19) | 1.03 (0.21) | 1.00 | 1.00
Simulation results: Performance
Sc. | Number of simulations, \(ln(\hat {\theta }^{w}_{CE}(\tau))\) | Number of simulations, \(ln(\tilde {\theta }^{w}_{CE}(\tau))\) | Bias, \(ln(\hat {\theta }^{w}_{CE}(\tau))\) | Bias, \(ln(\tilde {\theta }^{w}_{CE}(\tau))\) | Standardized bias, \(ln(\hat {\theta }^{w}_{CE}(\tau))\) | Standardized bias, \(ln(\tilde {\theta }^{w}_{CE}(\tau))\) | \(\sqrt {\text {MSE}}\), \(ln(\hat {\theta }^{w}_{CE}(\tau))\) | \(\sqrt {\text {MSE}}\), \(ln(\tilde {\theta }^{w}_{CE}(\tau))\) | Relative efficiency \(\frac {MSE\left (ln(\hat {\theta }^{w}_{CE}(\tau))\right)}{MSE\left (ln(\tilde {\theta }^{w}_{CE}(\tau))\right)}\) | Coverage^{∗}, \(ln(\hat {\theta }^{w}_{CE}(\tau))\) | Coverage^{∗}, \(ln(\tilde {\theta }^{w}_{CE}(\tau))\)
1a | 991 | 1000 | 0.01 | 0.03 | 0.05 | 0.11 | 0.27 | 0.29 | 0.91 | 93.14 | 92.70
1b | 997 | 1000 | 0.01 | 0.07 | 0.04 | 0.30 | 0.20 | 0.23 | 0.76 | 94.68 | 93.10
1c | 997 | 1000 | 0.02 | 0.04 | 0.07 | 0.19 | 0.23 | 0.22 | 1.12 | 95.29 | 94.80
2a | 1000 | 1000 | 0.15 | 0.14 | 0.51 | 0.49 | 0.32 | 0.32 | 1.02 | 90.60 | 91.00
2b | 999 | 1000 | 0.15 | 0.15 | 0.63 | 0.63 | 0.28 | 0.28 | 1.02 | 89.09 | 88.80
3a | 1000 | 1000 | 0.02 | 0.12 | 0.07 | 0.37 | 0.30 | 0.33 | 0.79 | 93.50 | 91.40
3b | 1000 | 1000 | 0.00 | 0.04 | 0.01 | 0.15 | 0.30 | 0.29 | 1.06 | 94.20 | 95.20
3c | 1000 | 1000 | 0.00 | 0.12 | 0.02 | 0.58 | 0.20 | 0.24 | 0.68 | 93.10 | 88.70
4a | 993 | 1000 | 0.17 | 0.16 | 0.26 | 0.39 | 0.66 | 0.31 | 4.57 | 90.07 | 89.53
4b | 993 | 1000 | 0.20 | 0.09 | 0.21 | 0.21 | 0.98 | 0.46 | 4.85 | 94.44 | 95.47
4c | 996 | 1000 | 0.19 | 0.18 | 0.93 | 0.84 | 0.29 | 0.28 | 1.03 | 82.90 | 84.87
4d | 996 | 1000 | 0.24 | 0.21 | 0.60 | 0.65 | 0.46 | 0.39 | 1.43 | 90.14 | 89.48
5a | 995 | 1000 | 0.26 | 0.25 | 0.91 | 1.46 | 0.38 | 0.31 | 1.55 | 71.80 | 68.07
5b | 995 | 1000 | 0.01 | 0.11 | 0.01 | 0.35 | 0.57 | 0.32 | 3.21 | 93.56 | 93.99
6a | 998 | 998 | 0.00 | 0.00 | 0.01 | 0.01 | 0.21 | 0.23 | 0.86 | 94.79 | 94.60
6b | 998 | 998 | 0.00 | 0.00 | 0.01 | 0.01 | 0.25 | 0.24 | 1.11 | 95.69 | 95.90
7a | 984 | 1000 | 0.01 | 0.02 | 0.04 | 0.07 | 0.30 | 0.28 | 1.16 | 94.60 | 95.20
7b | 984 | 1000 | 0.01 | 0.01 | 0.04 | 0.03 | 0.16 | 0.17 | 0.99 | 94.70 | 94.99
8a | 1000 | 1000 | 0.14 | 0.00 | 0.64 | 0.01 | 0.25 | 0.19 | 1.78 | 89.20 | 93.70
8b | 1000 | 1000 | 0.17 | 0.01 | 0.70 | 0.05 | 0.30 | 0.21 | 1.97 | 85.00 | 93.30
9a | 998 | 998 | 0.22 | 0.05 | 0.53 | 0.18 | 0.47 | 0.32 | 2.22 | 90.68 | 96.09
9b | 990 | 1000 | 0.15 | 0.01 | 0.81 | 0.04 | 0.23 | 0.15 | 2.29 | 86.97 | 94.20
9c | 990 | 1000 | 0.60 | 0.18 | 1.25 | 0.59 | 0.77 | 0.35 | 4.74 | 62.93 | 87.10
10a | 1000 | 1000 | 0.06 | 0.04 | 0.28 | 0.21 | 0.21 | 0.21 | 1.00 | 93.70 | 93.80
10b | 1000 | 1000 | 0.09 | 0.10 | 0.38 | 0.44 | 0.24 | 0.25 | 0.94 | 92.70 | 92.10
10c | 1000 | 1000 | 0.17 | 0.11 | 0.99 | 0.60 | 0.24 | 0.22 | 1.27 | 83.80 | 89.30
10d | 1000 | 1000 | 0.16 | 0.23 | 0.81 | 1.10 | 0.25 | 0.32 | 0.63 | 87.20 | 75.90
Scenarios 1 and 3 reflect situations where the proportional hazards assumption is fulfilled for each component, but the Weibull distributed cause-specific hazards are unequal and thus the composite effect is time-dependent. Since in these scenarios the assumptions for the original estimator are fulfilled, it is intuitive that the (standardized) bias is small for the parametric estimator. Although the assumptions for the non-parametric estimator are violated, its bias is still rather small. This good performance is also captured in the coverage, which is mostly near the anticipated 95%. It is furthermore intuitive that the original estimator most often shows a smaller mean square error relative to the non-parametric estimator. Note that in Scenario 3 the unweighted effects point in different directions, but the direction of the weighted effect depends on the weighting scheme. In Scenarios 2, 4, and 5 the proportional hazards assumption is fulfilled neither for the components nor for the composite, but the cause-specific hazards still follow a Weibull distribution. For Scenario 2 it can be seen that the estimated weighted effects are the same for both estimators but do not approach the true effect as well as in Scenarios 1 and 3. This is because both approaches need at least the assumption of proportional hazards in the components. A similar outcome would be expected for Scenarios 4 and 5. However, in both scenarios the parametric estimator performs much worse than the non-parametric estimator. This is due to the higher variability in the estimates. For Scenario 6, where there is no effect for the unweighted composite, both approaches perform quite well. For the original estimator this was expected, since its assumptions are fulfilled. In Scenario 7, with the weights 1 for event type 1 and 0.1 for event type 2, the true combined treatment effect is 0. This is also captured quite well by both estimators.
Note that the composite effect is 0 only for this specific weighting scheme, not for the other weighting schemes. However, the performance of the estimation approaches is also satisfying for the other weighting schemes. In Scenario 8, Gompertz-Makeham distributed cause-specific hazards are assumed. Thereby, the proportional hazards assumption is fulfilled for the components and the composite. Thus, it is intuitive that the new non-parametric estimator closely coincides with the true effect. However, the parametric estimator based on the Weibull model is relevantly biased independent of the weighting scheme and shows a higher variability. Scenario 9 still depicts Gompertz-Makeham distributed cause-specific hazards, but the proportional hazards assumption is only fulfilled for the components and not for the composite. Although the cause-specific baseline hazards are thus unequal, the non-parametric estimator performs better in this scenario, whereas the parametric estimator shows substantial bias and variability, which might also be due to convergence problems. Scenario 10 represents Gompertz distributed cause-specific hazards where the proportional hazards assumption is fulfilled neither for the components nor for the composite. Compared to the two previous scenarios, the performance of the parametric estimator has improved and is not globally worse than that of the non-parametric estimator; the performance depends on the weighting scheme. Here, not all τ-weight combinations are displayed. However, the performance of the missing combinations is comparable to that of the corresponding scenarios displayed.
In conclusion, the original parametric estimator turns out to be sensitive to model misspecifications when estimating the underlying cause-specific hazards, as expressed by most values of the (standardized) bias and the coverage of the confidence intervals for Scenarios 4, 5, 8, 9, and 10. In these scenarios, the performance of the non-parametric estimator tends to be better: not only is the (standardized) bias smaller and the coverage probability better, but the relative efficiency also favours the non-parametric approach. Moreover, in Scenarios 4 and 5 the (standardized) bias of the parametric estimator is smaller, but its variation is considerably higher, which cannot be explained by the smaller number of converged simulations alone. The higher number of non-converging models for the original approach is a further disadvantage. In scenarios where the assumption for the parametric estimator is fulfilled (Scenarios 1 and 3), its performance tends to be better than that of the non-parametric approach. Although in these scenarios the assumption of equal cause-specific baseline hazards is violated, the performance of the non-parametric estimator is nevertheless not considerably worse than that of the parametric estimator.
Except for Scenario 1b, the power of the weight-based log-rank test is uniformly equal to or larger than the power of the permutation test. This power advantage occurs in particular in situations where the two point estimators coincide (Scenarios 2a and 2b or 10a) or even when the non-parametric estimator suggests a less extreme effect (Scenarios 8 or 9). For Scenario 6, where there is no effect for either the components or the composite, the permutation test performs better in the investigated scenarios in terms of preserving the type I error. In Scenario 7, where the composite effect is 0 for one weighting scheme, the type I error is preserved by the permutation test as well as by the weight-based log-rank test.
If the weights are chosen to be 1 and 0.7, the performance comparisons lead to essentially the same results (compare Additional file 1). Summarizing the results of our simulation, the new nonparametric estimator and the corresponding weight-based log-rank test outperform the original estimator and the permutation test.
Discussion
In this work, we investigated a new estimator and test for the weighted all-cause hazard ratio, which was recently proposed by Rauch et al. [15] as an alternative effect measure to the standard all-cause hazard ratio for assessing a composite time-to-event endpoint. The weighted all-cause hazard ratio as a weighted effect measure for composite endpoints is appealing because it is a natural extension of the all-cause hazard ratio. It allows the influence of event types with a greater clinical relevance to be regulated and thereby eases the interpretation of the results. It must be noted, however, that the weighted all-cause hazard ratio was introduced to ease the interpretation of the effect in terms of clinical relevance; the aim of the weighted effect measure is not to decrease the sample size or increase the power. The power of the weighted all-cause hazard ratio can be larger but may also be smaller than the power of the unweighted standard approach.
The original parametric estimator proposed by Rauch et al. [15] requires the specification of a parametric survival model to estimate the cause-specific hazards. Moreover, in the original work by Rauch et al. [15], a permutation test was proposed to test the new effect measure, which comes along with a high computational effort. In this work, we overcome these shortcomings by proposing a new nonparametric estimator for the weighted all-cause hazard ratio and a closed-form test statistic, which is given by a weight-based version of the well-known log-rank test.
The simulation study performed within this work shows that the original parametric estimator is sensitive to misspecifications of the underlying cause-specific event time distribution. If there are uncertainties about the underlying parametric model for the identification of the cause-specific hazards, we therefore recommend using the new nonparametric estimator. In fact, the new nonparametric estimator proposed in this work turns out to be more robust even if the required assumption of equal cause-specific baseline hazards is not met. The relative efficiency as well as the coverage also show that the performance of the nonparametric estimator is in most cases at least as good as that of the original parametric estimator. Additionally, in our scenarios, convergence problems arose more often when using the parametric estimator. These convergence problems arose in scenarios where the effect of one event type was either very high at the beginning of the observational period or there was nearly no effect at the end of the observational period, where the survival function approaches 0. Moreover, the simulation study shows that the new weight-based log-rank test results in considerably better power properties than the originally proposed permutation test in almost all investigated scenarios. In some scenarios, the type one error might not be preserved; it has to be further investigated in which situations exactly this is the case and how it can be addressed. In addition, the weight-based log-rank test is computationally much less expensive. However, one remaining restriction is that confidence intervals cannot be provided directly because the testing procedure is not equivalent to the Cox score test. The only possibility to provide confidence intervals for the weighted hazard ratio would be by means of bootstrapping techniques.
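The bootstrapping idea mentioned above can be sketched generically. The percentile interval below is one common variant; the function name and interface are hypothetical, and in practice `statistic` would compute the nonparametric weighted all-cause hazard ratio from a resampled data set:

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for an arbitrary estimator,
    e.g. the nonparametric weighted all-cause hazard ratio, for which no
    closed-form interval is available.  `statistic` maps a resampled data
    array to a scalar estimate."""
    rng = np.random.default_rng(rng)
    n = len(data)
    # resample rows with replacement and re-evaluate the estimator each time
    stats = np.array([statistic(data[rng.integers(0, n, n)])
                      for _ in range(n_boot)])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

For a survival data set, `data` would hold one row per patient (time, event type, group) so that resampling keeps each patient's observations together.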
Apart from investigating the performance of the point estimator and the related statistical test, we additionally provide step-by-step guidance on how to choose the relevance weights for the individual components in the planning stage. It is often criticized that the choice of relevance weights in a weighted effect measure is to a certain extent arbitrary. By applying our step-by-step guidance for the choice of weights, this criticism can be addressed. Concretely, we propose to choose a weight of 1 for the clinically most relevant component and to choose weights smaller than or equal to 1 for all other components by judging how many events of a certain type would be considered as equally harmful as an event in the most relevant component. Using this approach for defining the weights, comparability to the unweighted approach is given and the most relevant event serves as a reference. When the shape of the different event time distributions is known in the planning stage, we also recommend looking at plots of the weighted and unweighted event time distributions for different weight constellations to visually inspect the influence of the weight choice on the shape of the survival curves and on the treatment effect.
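Such a visual inspection is easy to prepare. The sketch below assumes, purely for illustration, constant (exponential) cause-specific hazards, for which the weighted all-cause hazard is simply the weighted sum of the rates and the implied survival-type curve has a closed form; all rates and weights are made-up planning values:

```python
import numpy as np

def weighted_survival(t, hazards, weights):
    """Survival-type curve implied by the weighted all-cause hazard
    sum_j w_j * lambda_j for constant cause-specific hazards lambda_j:
    S_w(t) = exp(-t * sum_j w_j * lambda_j)."""
    rate = sum(w_j * lam_j for w_j, lam_j in zip(weights, hazards))
    return np.exp(-rate * np.asarray(t))

# compare weighting schemes in the planning stage (illustrative rates)
t = np.linspace(0.0, 5.0, 101)
s_unw = weighted_survival(t, [0.3, 0.1], [1.0, 1.0])  # unweighted: w = (1, 1)
s_w   = weighted_survival(t, [0.3, 0.1], [1.0, 0.5])  # less relevant event down-weighted
```

Plotting `s_unw` and `s_w` for each treatment group side by side shows directly how strongly a chosen down-weighting flattens the weighted curve relative to the unweighted one.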
Conclusion
In conclusion, we recommend using the new nonparametric estimator along with the weight-based log-rank test to assess the weighted all-cause hazard ratio. When applying the weighting scheme proposed within our step-by-step guidance, the choice of the weights can be motivated by reasonable clinical knowledge. With the results from this work, the weighted all-cause hazard ratio therefore becomes a very attractive new effect measure for clinical trials with composite endpoints.
Notes
Funding
This work was supported by the German Research Foundation (Grant RA 2347/12). The German Research Foundation had no influence on any of the research, i.e. study design, analysis, interpretation, or writing, done in this article.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Supplementary material
References
1. Cox DR. Regression models and life-tables. J R Stat Soc Ser B (Methodol). 1972; 34(2):187–220.
2. Lubsen J, Kirwan BA. Combined endpoints: can we use them? Stat Med. 2002; 21(19):2959–7290.
3. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER), ICH. Guidance for Industry: E9 Statistical Principles for Clinical Trials. 1998. http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm073137.pdf. Accessed 23 Aug 2017.
4. Rauch G, Beyersmann J. Planning and evaluating clinical trials with composite time-to-first-event endpoints in a competing risk framework. Stat Med. 2013; 32(21):3595–608.
5. Bethel MA, Holman R, Haffner SM, Califf RM, Huntsman-Labed A, Hua TA, Murray J. Determining the most appropriate components for a composite clinical trial outcome. Am Heart J. 2008; 156(4):633–40.
6. Freemantle N, Calvert M. Composite and surrogate outcomes in randomised controlled trials. BMJ. 2007; 334:756–7.
7. Freemantle N, Calvert M, Wood J, Eastaugh J, Griffin C. Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA. 2003; 289(19):2554–9.
8. Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen. General Methods — Version 5.0. 2017. https://www.iqwig.de/download/allgemeinemethoden_version50.pdf. Accessed 23 Aug 2017.
9. Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J. 2012; 33(2):176–82.
10. Buyse M. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Stat Med. 2010; 29(30):3245–57.
11. Péron J, Buyse M, Ozenne B, Roche L, Roy P. An extension of generalized pairwise comparisons for prioritized outcomes in the presence of censoring. Stat Methods Med Res. 2016; 27(4):1230–9.
12. Lachin JM, Bebu I. Application of the Wei-Lachin multivariate one-directional test to multiple event-time outcomes. Clin Trials. 2015; 12(6):627–33.
13. Bebu I, Lachin JM. Large sample inference of a win ratio analysis of a composite outcome based on prioritized outcomes. Biostatistics. 2016; 17(1):178–87.
14. Rauch G, Jahn-Eimermacher A, Brannath W, Kieser M. Opportunities and challenges of combined effect measures based on prioritized outcomes. Stat Med. 2014; 33(7):1104–20.
15. Rauch G, Kunzmann K, Kieser M, Wegscheider K, Koenig J, Eulenburg C. A weighted combined effect measure for the analysis of a composite time-to-first-event endpoint with components of different clinical relevance. Stat Med. 2018; 37(5):749–67.
16. Lin RS, León LF. Estimation of treatment effects in weighted log-rank tests. Contemp Clin Trials Commun. 2017; 8(1):147–55.
17. R Core Team. R: A Language and Environment for Statistical Computing. Version 3.3.3. 2017. https://www.r-project.org/.
18. Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul. 1998; 8(1):3–30.
19. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005; 24(11):1713–23.
20. Kleinbaum DG, Klein M. Survival Analysis: A Self-Learning Text. 3rd ed. New York: Springer; 2012.
21. Pletcher SD. Model fitting and hypothesis testing for age-specific mortality data. J Evol Biol. 1999; 12(3):430–9.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.