INTRODUCTION

Multi-arm clinical trials are increasingly used in modern clinical research. Some examples of multi-arm trials include phase II dose–response studies (1), drug combination studies (2), multi-arm multi-stage (MAMS) designs (3,4), and master protocols to study multiple therapies, multiple diseases, or both (5). A benefit of multi-arm trials is the ability to test many new promising treatments and address multiple research objectives within a single protocol, thereby potentially speeding up research and development processes compared to a sequence of single-arm or two-arm trials (6).

When designing a multi-arm trial, an important consideration is the choice of the allocation ratio, i.e., the target allocation proportions across the treatment arms. The choice of the allocation ratio usually stems from the study objectives. Many clinical trials are designed with an intent to have equal allocation to the treatment groups, which is consistent with the principle of “clinical equipoise” and frequently leads to maximum statistical power for treatment comparisons (e.g., if the primary outcome variance is constant across the groups) (7). On the other hand, unequal allocation designs have recently gained considerable traction (8,9). For instance, unequal allocation designs may be preferred over equal allocation designs under the following circumstances: (i) in studies with nonlinear dose–response estimation objectives (10,11,12,13); (ii) when there is heterogeneity of the outcome variance across the treatment arms (14,15,16); (iii) when there is an ethical imperative to allocate a greater proportion of study patients to superior treatment arms (17,18,19); (iv) when there is unequal interest in certain treatment comparisons (20); and (v) when treatment costs differ and an investigator wants to maximize power for a given budget (21). Importantly, unequal allocation designs can involve non-integer (even irrational) target proportions. For example, in a \( \left(K>2\right) \)-arm trial comparing \( \left(K-1\right) \) experimental treatments versus control (Dunnett’s procedure), the optimal allocation ratio minimizing the sum of variances of the \( \left(K-1\right) \) pairwise comparisons is \( {\sigma}_1\sqrt{K-1}:{\sigma}_2:\dots :{\sigma}_K \), where σi is the standard deviation of the outcome in the ith treatment group (7).

Once the target allocation ratio is chosen, a question is how to implement it in practice. It is well recognized that randomization is the hallmark of any well-conducted clinical trial (22). When properly implemented, randomization can promote selected study objectives while maintaining validity and integrity of the study results (23). There is a variety of randomization designs that can be applied in multi-arm trials with equal or unequal integer-valued allocation ratios. The most common one is the permuted block design (PBD) for which treatment assignments are made at random in blocks of a given size to achieve the desired allocation ratio C1 : C2 : … : CK, where Ci’s are positive, not necessarily equal, integers with the greatest common divisor of 1. The PBD has been criticized by some authors as being too restrictive and susceptible to selection bias (24,25). Some alternatives to the PBD have been developed recently (26,27,28).

For multi-arm trials with unequal allocation involving non-integer (irrational) proportions, the choice of a randomization design is less straightforward, as highlighted in (29). The simplest approach is to use complete randomization (CR), for which treatment assignments are generated independently, according to a multinomial distribution with cell probabilities equal to the target allocation proportions. A major drawback of CR is that, with non-negligible probability, it can result in large departures from the desired allocation, especially in small trials. One useful alternative to CR is the mass weighted urn design (MWUD), which has been shown to maintain a good tradeoff between treatment balance and allocation randomness (29). Other designs for irrational target allocations can be constructed by adopting the methodology of optimal response-adaptive randomization (30). Some promising designs for this purpose are the doubly adaptive biased coin design (31), the generalized drop-the-loser urn (32), and the optimal adaptive generalized Pólya urn (33), to name a few. However, all these designs rely on asymptotic results, which may not hold for the small to moderate sample sizes that are common in practice.

The present paper is motivated by our recent work (34) which investigated the structure of the D-optimal design for dose-finding experiments with time-to-event data. In particular, we found that for a quadratic dose–response model with Weibull outcomes that are subject to right censoring, the equal allocation (1:1:1) design can be highly inefficient when the amount of censoring is high. The D-optimal design is supported at 3 points, but the location of these points in the dose interval, as well as the optimal allocation proportions at these points, depend on the true model and the amount of censored data in the experiment. As such, the D-optimal allocation proportions are found through numerical optimization and they are generally quite different from the equal allocation. A two-stage adaptive design was proposed and it was found to be nearly as efficient as the true D-optimal design. The authors of (34) also mentioned that practical implementation of the adaptive D-optimal design requires a judicious choice of a randomization procedure. Given that, in practice, dose-finding studies are relatively small (due to budgetary and ethical constraints), it is imperative that the chosen randomization procedure can closely attain the desired optimal allocation for small and moderate samples while maintaining the randomized nature of the experiment. Our main conjecture in the present paper is that the choice of randomization for the D-optimal design does matter as far as statistical properties such as quality of dose–response curve estimation are concerned.

The remainder of this paper is organized as follows. In the “MATERIALS AND METHODS” section, we give a statistical background and an overview of randomization designs that can be used to target multi-arm unequal allocation with possibly non-integer (irrational) proportions for trials with small and moderate sample sizes. In the “SIMULATION STUDY PLAN” section, we outline a strategy to investigate statistical properties of selected randomization procedures targeting the D-optimal design. The “RESULTS” section presents findings from our simulations, including a study of single-stage randomization procedures targeting the locally D-optimal design, two-stage adaptive optimal designs, and multi-stage adaptive designs with early stopping rules. We also explore the robustness of our proposed designs to experimental (chronological and selection) biases. The “DISCUSSION” section concludes with a summary of our main findings and outlines some important future work.

MATERIALS AND METHODS

D-Optimal Design

Following (34), we consider a second-order polynomial model for log-transformed event times:

$$ \log T={\beta}_0+{\beta}_1x+{\beta}_2{x}^2+ b\varepsilon, $$
(1)

where x is the dose level chosen from the interval \( \mathcal{X}=\left[0,1\right] \), ε is an error term following the standard extreme value distribution, b > 0 is a scale parameter that determines the Weibull hazard pattern, and (β0, β1, β2) are the regression coefficients. Under the model in Eq. (1), T follows a Weibull distribution with \( \mathrm{Median}\left(T|x\right)={e}^{\beta_0+{\beta}_1x+{\beta}_2{x}^2}{\left(\log 2\right)}^b \), and the hazard function of T (conditional on x) is \( h\left(t|x\right)={b}^{-1}{e}^{-\left({\beta}_0+{\beta}_1x+{\beta}_2{x}^2\right)/b}{t}^{1/b-1} \). For a given x, the hazard is monotone increasing if 0 < b < 1; it is constant (exponential distribution) if b = 1; and it is decreasing if b > 1. Furthermore, we assume that each subject in the study has a fixed follow-up time τ > 0, and T is right-censored by τ such that the observed time is t = min(T, τ). Let δ = 1{T ≤ τ} denote the event indicator. For a study of size n, the data structure is \( {\mathcal{F}}_n=\left\{\left({t}_i,{\delta}_i,{x}_i\right),i=1,\dots, n\right\} \).
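To make the data-generating mechanism concrete, the following R sketch (our illustration; the paper’s own code is available from the first author) simulates right-censored event times from the model in Eq. (1). It uses the fact that if U ~ Uniform(0, 1), then log(−log U) follows the standard extreme value (minimum) distribution; the follow-up time tau below is a hypothetical value chosen for illustration, since the paper calibrates censoring through the average event probability rather than a stated τ.

```r
## Minimal sketch: simulate right-censored data from model (1).
## tau is a placeholder follow-up time chosen for illustration only.
simulate_trial <- function(x, beta0 = 1.90, beta1 = 0.60, beta2 = 2.80,
                           b = 0.65, tau = 5) {
  n   <- length(x)
  eps <- log(-log(runif(n)))   # standard extreme value (minimum) errors
  T   <- exp(beta0 + beta1 * x + beta2 * x^2 + b * eps)  # event times, Eq. (1)
  data.frame(t     = pmin(T, tau),          # observed (possibly censored) time
             delta = as.numeric(T <= tau),  # event indicator 1{T <= tau}
             x     = x)
}

set.seed(1)
dat <- simulate_trial(x = rep(c(0, 0.5, 1), each = 10))  # 30 subjects, uniform design
```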

The study objective is to estimate the vector of model parameters θ = (β0, β1, β2, b) as precisely as possible. For this purpose, we consider designs of the form ξ = {(x1, ρ1); (x2, ρ2); (x3, ρ3)}, where the xk’s are distinct dose levels in \( \mathcal{X}=\left[0,1\right] \) and the ρk’s are allocation proportions at these doses (0 < ρk < 1 and \( {\sum}_{k=1}^3{\rho}_k=1 \)). The design’s Fisher information matrix is a weighted sum \( \mathbf{M}\left(\xi, \boldsymbol{\theta} \right)={\sum}_{k=1}^3{\rho}_k{\mathbf{M}}_{x_k}\left(\boldsymbol{\theta} \right) \), where \( {\mathbf{M}}_{x_k}\left(\boldsymbol{\theta} \right) \) is the Fisher information matrix for a single observation at dose xk (a 4 × 4 matrix whose expression is given in Eq. (6) of (34)). The locally D-optimal design ξ∗ minimizes − log  ∣ M(ξ, θ)∣, which leads to the smallest volume of the confidence ellipsoid for θ. In (34), it was found that if there are no censored data in the experiment, then ξ∗ is the uniform (equal allocation) design supported at dose levels 0, 1/2, and 1. However, in the presence of censoring, the structure of ξ∗ is more complex: both the optimal dose levels and the allocation proportions depend on the true model and the amount of censoring, i.e., ξ∗ = {(xk(θ), ρk(θ)), k = 1, 2, 3}, and ξ∗ must be found numerically, using, for example, a first-order (exchange) algorithm (35). Since in practice θ is unknown, one can construct a two-stage adaptive D-optimal design as follows. At stage 1, a cohort of n(1) subjects is allocated to doses according to the uniform design ξ(1) = {(0, 1/3); (0.5, 1/3); (1, 1/3)}. Based on the observed data \( {\mathcal{F}}_{n^{(1)}}=\left\{\left({t}_i,{\delta}_i,{x}_i\right),i=1,\dots, {n}^{(1)}\right\} \), compute \( {\widehat{\boldsymbol{\theta}}}_{MLE}^{(1)} \), the maximum likelihood estimate (MLE) of θ, and approximate ξ∗ by \( {\overset{\sim }{\xi}}^{\ast }=\left\{\left({x}_k\left({\widehat{\boldsymbol{\theta}}}_{MLE}^{(1)}\right),{\rho}_k\left({\widehat{\boldsymbol{\theta}}}_{MLE}^{(1)}\right)\right),k=1,2,3\right\} \). At stage 2, an additional n(2) subjects are allocated to doses according to \( {\overset{\sim }{\xi}}^{\ast } \). The final analysis is based on the pooled sample of n = n(1) + n(2) subjects. In (34), it was shown that such a two-stage design provides a very good approximation to, and is nearly as efficient as, the true D-optimal design, without requiring prior knowledge of the model parameters before the start of the trial.
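A sketch of the stage 1 model fit, assuming data simulated as in the previous snippet. With dist = "weibull", the survreg() function in the survival package parameterizes log T = β0 + β1x + β2x² + b·ε with ε following the standard (minimum) extreme value distribution, so its coefficients and scale map directly onto (β0, β1, β2, b); the numerical search for the optimal (xk, ρk) is a separate step not shown here.

```r
library(survival)

## Stage 1 MLE of theta = (beta0, beta1, beta2, b) from censored data
fit <- survreg(Surv(t, delta) ~ x + I(x^2), data = dat, dist = "weibull")
theta_hat <- c(coef(fit), b = fit$scale)   # plugged into the numerical search for xi*
```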

An important open question is how to allocate subjects to doses in both stage 1 and stage 2 of these adaptive designs. The cohort sizes n(1) and n(2) can be small in practice, and an experimenter must ensure that the actual allocation numbers are as close as possible to (ideally match) the targeted ones. At the same time, the allocation must involve a random element to minimize the potential for selection bias (36). Thus, balance and randomization are two competing requirements. There are many randomization procedures that can be used to implement the D-optimal allocation (22). In the next section, we describe a selection of procedures that are relevant to our study.

Randomization Procedures for Implementing D-Optimal Design

To fix ideas, we start with an “idealized” setting in which both the true model (θ) and the amount of censored data are known, and therefore the D-optimal design ξ∗ = {(xk(θ), ρk(θ)), k = 1, 2, 3} is available to the experimenter. We shall also use the notation d1, d2, and d3 for the optimal dose levels x1(θ), x2(θ), and x3(θ), respectively. For dose dk (k = 1, 2, 3), the optimal proportion is \( {\rho}_k^{\ast }={\rho}_k\left(\boldsymbol{\theta} \right) \) (possibly an irrational number), with the obvious constraint \( {\rho}_1^{\ast }+{\rho}_2^{\ast }+{\rho}_3^{\ast }=1 \).

Assume that the total sample size n is fixed and pre-determined. Let Nk(j) denote the sample size for dose dk after j subjects (1 ≤ j ≤ n) have been randomized into the study. The vector N(j) = (N1(j), N2(j), N3(j)) is random with N1(j) + N2(j) + N3(j) = j, and the distribution of N(j) for j = 1, …, n is determined by a randomization procedure used in the study. A restricted randomization procedure can be formally defined by specifying conditional randomization probabilities for allocating the jth subject to doses d1, d2, and d3 as follows:

$$ {\displaystyle \begin{array}{l}\kern1.8em {P}_k(j)=\Pr \left(\mathrm{Subject}\ j\ \mathrm{is}\ \mathrm{assigned}\ \mathrm{to}\ \mathrm{dose}\ {d}_k\right)=\Pr \left({d}_k|\boldsymbol{N}\left(j-1\right)\right),j=2,\dots, n\\ {}\mathrm{and}\kern0.5em {P}_k(1)={\rho}_k^{\ast },k=1,2,3.\end{array}} $$
(2)

In other words, for any restricted randomization procedure, the randomization probability for the next eligible subject depends on the current numbers of the dose assignments in the trial.

We shall study the following randomization procedures; a minimal R sketch of their conditional randomization rules is given after the list:

  • Completely randomized design (CRD): Every subject is randomized to the dose groups with probabilities equal to the D-optimal allocation, i.e., \( {P}_k(j)={\rho}_k^{\ast } \), j = 1, …, n, k = 1, 2, 3. The CRD is very simple to implement and it provides the highest degree of randomness, but for small samples, it can lead to deviations from the desired allocation with non-negligible probability (22).

  • Permuted block design (PBD): To implement the allocation (\( {\rho}_1^{\ast },{\rho}_2^{\ast },{\rho}_3^{\ast } \)) for a cohort of size n, the desired split of the sample size among the doses is \( n{\rho}_1^{\ast }:n{\rho}_2^{\ast }:n{\rho}_3^{\ast } \), which, after rounding to integer values, is, say, C1 : C2 : C3, where the Ck’s are positive integers with C1 + C2 + C3 = n. For the PBD, the conditional randomization probabilities are:

$$ {P}_k(j)=\frac{C_k-{N}_k\left(j-1\right)}{C_1+{C}_2+{C}_3-\left(j-1\right)},j=2,\dots, n\ \mathrm{and}\ {P}_k(1)=\frac{C_k}{C_1+{C}_2+{C}_3},k=1,2,3. $$
(3)

Note that Nk(j − 1) in Eq. (3) can take values 0, 1, …, Ck for j = 2, …, n.

  • Doubly adaptive biased coin design (DBCD): Initial dose assignments (j = 1, …, m0) are made using PBD with a block size that is a multiple of 3, e.g., m0 = 3, 6, or 9. Subsequently, the (j + 1)st subject (j = m0, …, n−1) is randomized to dose dk with probability

$$ {P}_k\left(j+1\right)=\frac{\rho_k^{\ast }{\left({\rho}_k^{\ast }/\frac{N_k(j)}{j}\right)}^{\gamma }}{\sum_{l=1}^3{\rho}_l^{\ast }{\left({\rho}_l^{\ast }/\frac{N_l(j)}{j}\right)}^{\gamma }},k=1,2,3, $$
(4)

where γ ≥ 0 is a user-defined parameter controlling the degree of randomness of the procedure (γ = 0 is most random and γ → ∞ is an almost deterministic procedure). The DBCD has established asymptotic properties (31). For practical purposes, γ = 2 is recommended (37).

  • Generalized drop-the-loser urn design (GDLUD): The GDLUD (32) utilizes an urn containing balls of four types: type 0 is the immigration ball, and types 1, 2, and 3 represent “dose” balls. Dose assignments for eligible subjects are made sequentially by drawing a ball at random from the urn. Let \( {Z}_0=\left(1,{\rho}_1^{\ast },{\rho}_2^{\ast },{\rho}_3^{\ast}\right) \) denote the initial urn composition (one immigration ball and \( {\rho}_1^{\ast }+{\rho}_2^{\ast }+{\rho}_3^{\ast } \) “dose” balls). The urn composition is changed adaptively during the course of the trial. Let Zj − 1 = (Zj − 1,0, Zj − 1,1, Zj − 1,2, Zj − 1,3) denote the urn composition after j − 1 steps (the numbers Zj − 1,i, i = 1, 2, 3, can be negative and/or irrational). Let \( {Z}_{j-1,k}^{+}=\max \left(0,{Z}_{j-1,k}\right) \), k = 0, 1, 2, 3. At the jth step, the probability of selecting a ball of type k is \( {Z}_{j-1,k}^{+}/{\sum}_{i=0}^3{Z}_{j-1,i}^{+} \), k = 0, 1, 2, 3. If the selected ball is of type 0 (immigration), no dose is assigned and the ball is replaced into the urn together with \( C{\rho}_1^{\ast }+C{\rho}_2^{\ast }+C{\rho}_3^{\ast } \) additional “dose” balls (C is some positive constant); the urn composition becomes Zj,0 = Zj − 1,0 and \( {Z}_{j,i}={Z}_{j-1,i}+C{\rho}_i^{\ast } \), i = 1, 2, 3. If the selected ball is of type ℓ (ℓ = 1, 2, 3), then it is not replaced, the eligible subject is assigned to the corresponding dose level, and the urn composition becomes Zj,ℓ = Zj − 1,ℓ − 1 and Zj,i = Zj − 1,i for i ≠ ℓ. The described procedure is repeated until a pre-specified number of subjects (n) has been randomized in the study. The GDLUD has established asymptotic properties (32): the allocation proportions are strongly consistent for the target proportions and follow an asymptotically normal distribution with known variance structure.

  • Mass weighted urn design (MWUD): The MWUD (29) uses an urn containing three “dose” balls. Initially, each ball has mass proportional to the target allocation: \( {m}_{0,i}=\alpha {\rho}_i^{\ast } \), i = 1, 2, 3 (the parameter α is a positive integer controlling the maximum tolerated imbalance). The mass of the balls changes adaptively, according to the history of dose assignments. Among the balls with positive mass, a ball is drawn with probability proportional to its mass, and the corresponding dose is assigned to the next eligible subject. One unit of mass is taken from the selected ball and redistributed among the three balls in the ratio \( {\rho}_1^{\ast }:{\rho}_2^{\ast }:{\rho}_3^{\ast } \), after which the ball is returned to the urn. Therefore, after (j − 1) assignments, the probability mass for the ith dose group is \( {m}_{j-1,i}=\alpha {\rho}_i^{\ast }-{N}_i\left(j-1\right)+\left(j-1\right){\rho}_i^{\ast } \), and the total mass of the three balls in the urn at each step is \( {\sum}_{i=1}^3{m}_{j-1,i}\equiv \alpha \). These steps are repeated until the pre-specified number of subjects is enrolled in the study. The MWUD has a simple explicit formula for the conditional randomization probability:

$$ {P}_k(j)=\frac{\max \left\{{\alpha \rho}_k^{\ast }-{N}_k\left(j-1\right)+\left(j-1\right){\rho}_k^{\ast },0\right\}}{\sum_{l=1}^3\max \left\{{\alpha \rho}_l^{\ast }-{N}_l\left(j-1\right)+\left(j-1\right){\rho}_l^{\ast },0\right\}},\kern0.5em k=1,2,3. $$
(5)

It was proved in (29) that, for the MWUD, maximum imbalance (defined as Euclidean distance from the vector of current allocation proportions to the target allocation proportions) at each allocation step is controlled by the value of α.

  • Maximum entropy constrained balance randomization (MaxEnt): The MaxEnt procedure is an extension of Efron’s biased coin design (38) to a multi-arm setting with unequal allocation (39,40). Dose assignments for eligible subjects are made sequentially. Consider a point in the trial when j − 1 subjects have been randomized into the study, with Ni(j − 1) subjects assigned to dose di, i = 1, 2, 3. The randomization rule for the jth subject is as follows: Compute B1, B2, B3, the hypothetical treatment imbalances which would result from assigning the jth subject to doses d1, d2, d3:

$$ {B}_k=\sqrt{\sum \limits_{i=1}^3{\left({N}_{ik}(j)-j{\rho}_i^{\ast}\right)}^2},\kern0.5em \mathrm{where}\kern0.5em {N}_{ik}(j)=\left\{\begin{array}{cc}{N}_i\left(j-1\right)+1,& \mathrm{if}\ i=k;\\ {}{N}_i\left(j-1\right),& \mathrm{if}\ i\ne k.\end{array}\right. $$

The vector of randomization probabilities P(j) = (P1(j), P2(j), P3(j)) is obtained by maximizing entropy (minimizing the Kullback-Leibler divergence between P(j) and the target allocation \( {\boldsymbol{\rho}}^{\ast}=\left({\rho}_1^{\ast },{\rho}_2^{\ast },{\rho}_3^{\ast}\right) \)) subject to a constraint on expected imbalance. Mathematically, it is derived as the solution to the following constrained optimization problem:

$$ {\displaystyle \begin{array}{l}\underset{\boldsymbol{P}(j)}{\operatorname{maximize}}\kern0.5em \left\{-{\sum}_{i=1}^3{P}_i(j)\log \left({P}_i(j)/{\rho}_i^{\ast}\right)\right\}\\ {}\mathrm{subject}\ \mathrm{to}\kern0.5em {\sum}_{i=1}^3{B}_i{P}_i(j)\le \eta {B}_{(1)}+\left(1-\eta \right){\sum}_{i=1}^3{B}_i{\rho}_i^{\ast}\\ {}\mathrm{and}\kern0.5em {\sum}_{i=1}^3{P}_i(j)=1,\kern0.5em 0\le {P}_i(j)\le 1,\kern0.5em i=1,2,3.\end{array}} $$
(6)

In Eq. (6), \( {B}_{(1)}=\underset{i}{\min }{B}_i \) and η is a user-defined parameter (0 ≤ η ≤ 1) that controls the amount of randomness of the procedure (η = 0 is most random and η = 1 yields an almost deterministic procedure). The explicit solution to the problem in Eq. (6) can be found in (40).
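To make these rules concrete, here is a minimal R sketch (our illustration, not the paper’s supplied code) of the conditional randomization probabilities for the four procedures with closed-form rules; the MaxEnt probabilities require solving the optimization problem in Eq. (6) (explicit solution in (40)) and are omitted here. Each function returns the probability vector for the next subject given the current counts N = (N1, N2, N3).

```r
p_crd  <- function(N, rho) rho                        # CRD: always the target

p_pbd  <- function(N, C) (C - N) / (sum(C) - sum(N))  # PBD, Eq. (3); C = block counts

p_dbcd <- function(N, rho, gamma = 2) {               # DBCD, Eq. (4); needs all N_k > 0
  j <- sum(N)
  w <- rho * (rho / (N / j))^gamma
  w / sum(w)
}

p_mwud <- function(N, rho, alpha = 10) {              # MWUD, Eq. (5)
  jm1 <- sum(N)                                       # j - 1 subjects assigned so far
  m   <- pmax(alpha * rho - N + jm1 * rho, 0)
  m / sum(m)
}

rho <- c(0.407, 0.336, 0.257)            # D-optimal target used later in the paper
p_dbcd(c(6, 5, 4), rho)                  # randomization probabilities for subject 16
```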

The six described randomization procedures can be used as building blocks for constructing adaptive randomization designs. For instance, a two-stage adaptive design with n(1) = n(2) = 30 can be implemented as follows. The first 30 patients are randomized in equal proportions among the doses 0, 1/2, and 1 using PBD, in which case Eq. (3) becomes

$$ {P}_k(j)=\frac{10-{N}_k\left(j-1\right)}{30-\left(j-1\right)},j=2,\dots, 30\ \mathrm{and}\ {P}_k(1)=\frac{1}{3},k=1,2,3. $$

Based on the observed data, the D-optimal design is estimated as \( {\widehat{\xi}}^{\ast }=\left\{\left({\widehat{d}}_k,{\widehat{\rho}}_k^{\ast}\right),k=1,2,3\right\} \), and in stage 2, an additional 30 patients are randomized using CRD; namely, the jth patient (j = 31, …, 60) is randomized among the doses \( {\widehat{d}}_1 \), \( {\widehat{d}}_2 \), and \( {\widehat{d}}_3 \) with probabilities \( {P}_k(j)={\widehat{\rho}}_k^{\ast } \), k = 1, 2, 3. We denote such a two-stage adaptive randomization design by PBD → CRD, emphasizing that PBD is used in stage 1 and CRD is used in stage 2.
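As a hypothetical illustration reusing the probability functions sketched above, the PBD → CRD strategy with n(1) = n(2) = 30 could be coded as follows (the stage 2 target here is a placeholder, standing in for the estimate obtained from the stage 1 fit):

```r
## Stage 1: PBD over doses (0, 0.5, 1) with block counts (10, 10, 10)
N <- c(0, 0, 0)
stage1 <- integer(30)
for (j in 1:30) {
  stage1[j] <- sample(1:3, 1, prob = p_pbd(N, C = c(10, 10, 10)))
  N[stage1[j]] <- N[stage1[j]] + 1
}

## ... fit the model and estimate the D-optimal design (d_hat, rho_hat) ...
rho_hat <- c(0.407, 0.336, 0.257)   # placeholder estimate for illustration

## Stage 2: CRD targeting the estimated optimal proportions
stage2 <- sample(1:3, 30, replace = TRUE, prob = rho_hat)
```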

Likewise, a multi-stage design PBD → CRD → CRD → ... means that the first cohort of patients is randomized into the study using PBD; the second cohort is randomized according to an updated D-optimal design using CRD; the third cohort is randomized according to an updated (using cumulative outcome data from first two cohorts) D-optimal design using CRD, etc.

Statistical Criteria for Comparison of Randomization Procedures

The primary objective of a dose–response study is to estimate the dose–response curve as precisely as possible. The D-optimal design fulfills this objective. For a realized design ξn = {(dk, Nk(n)), k = 1, 2, 3}, where Nk(n) subjects have been randomized to dose dk, one can compute the D-efficiency relative to the true D-optimal design ξ∗ as:

$$ \mathrm{D}-\mathrm{eff}(n)={\left\{\frac{\left|\mathbf{M}\left({\xi}_n,\boldsymbol{\theta} \right)\right|}{\left|\mathbf{M}\left({\xi}^{\ast },\boldsymbol{\theta} \right)\right|}\right\}}^{1/4} $$
(7)

For given values of n and θ, D-eff(n) is, in general, a random variable because ξn depends on N(n) = (N1(n), N2(n), N3(n)), whose distribution is determined by the randomization procedure used in the study. We take E(D-eff(n)) as a measure of estimation precision of a randomization procedure targeting the D-optimal allocation. High values of E(D-eff(n)) are desirable.
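Given the two 4 × 4 information matrices (computed, e.g., from Eq. (6) of (34)), Eq. (7) reduces to a one-line computation; E(D-eff(n)) is then approximated by averaging this quantity over simulated trials. A sketch, with M_n and M_star as assumed names for the two matrices:

```r
## D-efficiency of a realized design relative to the D-optimal design, Eq. (7)
d_eff <- function(M_n, M_star) (det(M_n) / det(M_star))^(1 / 4)
```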

Another measure of estimation accuracy is based on the mean squared error, which takes into account both bias and variance. For a trial of size n, for a given design, compute the mean squared errors of the model parameters: \( \left({MSE}_{\beta_0}, MS{E}_{\beta_1}, MS{E}_{\beta_2}, MS{E}_b\right) \), where \( {MSE}_{\beta_0}={\left\{E\left({\widehat{\beta}}_0\right)-{\beta}_0\right\}}^2+ Var\left({\widehat{\beta}}_0\right) \), and the other MSEs are defined similarly. When comparing Design 1 vs. Design 2, we take the ratio of MSE values: \( {R}_{\beta_0}=\frac{\mathrm{MS}{\mathrm{E}}_{\beta_0}\left(\mathrm{Design}\ 2\right)}{\mathrm{MS}{\mathrm{E}}_{\beta_0}\left(\mathrm{Design}\ 1\right)} \). A value of \( {R}_{\beta_0}=0.9 \) implies that Design 1 is 90% as efficient as Design 2 for estimating the parameter β0. Likewise, compute the ratios of MSEs for the three other parameters, \( {R}_{\beta_1} \), \( {R}_{\beta_2} \), and Rb, and take the average as an overall measure of relative efficiency:

$$ \mathrm{RE}(n)=\frac{1}{4}\left({R}_{\beta_0}+{R}_{\beta_1}+{R}_{\beta_2}+{R}_b\right) $$
(8)

In addition to statistical estimation, we consider several other useful metrics. A measure of allocation accuracy of a randomization procedure is the closeness of the realized allocation to the true D-optimal allocation. For a design with n subjects, the imbalance (using Euclidean distance) is \( Imb(n)=\sqrt{\sum_{k=1}^3{\left({N}_k(n)-n{\rho}_k^{\ast}\right)}^2} \). Small values of Imb(n) are desirable; ideally, Imb(n) = 0. Since the Nk(n)’s are random variables, we take the expected value, E(Imb(n)), which is referred to as the momentum of probability mass (MPM) (41).

To quantify the lack of randomness of a randomization procedure, at the jth allocation step we compute the distance between the conditional randomization probability vector P(j) and the D-optimal allocation vector ρ∗ as \( d(j)=\sqrt{\sum_{k=1}^3{\left({P}_k(j)-{\rho}_k^{\ast}\right)}^2} \), j = 1, …, n. If \( {P}_k(j)={\rho}_k^{\ast } \) for k = 1, 2, 3, then the dose assignment for the jth subject is made completely at random. A cumulative measure of lack of randomness (the forcing index, FI) (40,42) is defined as:

$$ FI(n)={n}^{-1}{\sum}_{j=1}^nd(j). $$

The smaller FI(n) is, the more random (and therefore, potentially less predictable) a randomization procedure is. FI(n) ≡ 0 corresponds to CRD, which is most random and provides no potential for selection bias in the study.

Finally, we consider the variability of randomization procedures by examining the average standard deviation of the allocation proportions: \( ASD(n)=\sqrt{n{\sum}_{i=1}^3{\left\{ SD\left({N}_i(n)/n\right)\right\}}^2} \). Randomization procedures with low values of ASD(n) are expected to be more concentrated around the target D-optimal allocation and, therefore, to lead to more efficient dose–response estimation.
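A sketch of how these three metrics can be estimated from simulation output, assuming N_mat is an nsim × 3 matrix of final counts N(n) across simulated trials and d_mat is an nsim × n matrix of the per-step distances d(j) (both names are our own):

```r
## MPM: mean Euclidean distance between realized and target allocation
mpm <- function(N_mat, n, rho) mean(sqrt(rowSums(sweep(N_mat, 2, n * rho)^2)))

## FI: forcing index, the average of the per-step distances d(j)
fi  <- function(d_mat) mean(rowMeans(d_mat))

## ASD: average standard deviation of the allocation proportions
asd <- function(N_mat, n) sqrt(n * sum(apply(N_mat / n, 2, sd)^2))
```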

SIMULATION STUDY PLAN

Throughout, we assume that responses follow the model in Eq. (1) with the following parameters: β0 = 1.90, β1 = 0.60, β2 = 2.80, and b = 0.65. The average event probability is assumed to be 0.50. Under these assumptions, the D-optimal dose levels are d1 = 0, d2 = 0.269, and d3 = 0.726; the optimal allocation proportions are \( {\rho}_1^{\ast }=0.407 \), \( {\rho}_2^{\ast }=0.336 \), and \( {\rho}_3^{\ast }=0.257 \). Figure 1 shows the true dose–response curve, the optimal dose levels, and the optimal allocation proportions at these doses.

Fig. 1. True dose–response curve (black line), D-optimal doses (x-location of red bars), and allocation proportions (height of red bars) for model (1) with β0 = 1.90, β1 = 0.60, β2 = 2.80, b = 0.65, and an average probability of event of 50%

Our simulation study consists of four major parts.

First, we evaluate various single-stage randomization procedures targeting the locally D-optimal design (assuming the true model is known) for small and moderate sample sizes. The design operating characteristics include measures of estimation precision, balance, and randomness, as described in the section “Statistical Criteria for Comparison of Randomization Procedures”.

Second, we implement a two-stage adaptive optimal design (34) using different combinations of randomization procedures in stages 1 and 2. For each of these strategies, the target allocation in stage 1 is (1/3, 1/3, 1/3), and the target allocation in stage 2 is derived from the estimated D-optimal design. In this setting, the D-efficiency of a two-stage design relative to the D-optimal design is computed as follows:

$$ \mathrm{D}-\mathrm{eff}(n)={\left\{\frac{\left|{n}^{(1)}\mathbf{M}\left({\xi}_{\mathrm{obs}}^{(1)},\boldsymbol{\theta} \right)+\left(n-{n}^{(1)}\right)\mathbf{M}\left({\xi}_{\mathrm{obs}}^{(2)},\boldsymbol{\theta} \right)\right|}{\left|n\mathbf{M}\left({\xi}^{\ast },\boldsymbol{\theta} \right)\right|}\right\}}^{1/4}, $$
(9)

where \( {\xi}_{\mathrm{obs}}^{(1)} \) is the stage 1 randomization design using n(1) subjects, \( {\xi}_{\mathrm{obs}}^{(2)} \) is the stage 2 randomization design using n − n(1) subjects, and ξ∗ is the (theoretical) D-optimal design. The relative efficiency of a two-stage design, RE(n), is computed with respect to a single-stage locally D-optimal design implemented using the MaxEnt(η = 1) procedure (which, by construction, leads to the most balanced allocation).
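A sketch of Eq. (9), assuming the per-stage information matrices M1, M2 and the optimal-design matrix M_star (hypothetical names) have been computed elsewhere:

```r
## Two-stage D-efficiency, Eq. (9): pooled information relative to n * M(xi*, theta)
d_eff_2stage <- function(M1, M2, M_star, n1, n) {
  (det(n1 * M1 + (n - n1) * M2) / det(n * M_star))^(1 / 4)
}
```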

Third, we implement a multi-stage adaptive design with early stopping criteria (34) using different combinations of randomization procedures. In this setting, all designs aim at achieving the same pre-defined level of estimation precision, and the key operating characteristic is the sample size at study termination. Our conjecture is that there are randomization procedures that require a smaller sample size, given the stopping rule.

Fourth, we evaluate the robustness of different adaptive randomization strategies to two types of experimental bias: chronological bias and selection bias. Chronological bias can arise if patient outcomes over time are affected by unobserved time trends (43,44). Selection bias can occur when an investigator knows, or is able to guess with high probability, which treatment is to be assigned to an upcoming patient (36). The advance knowledge of the treatment assignment can motivate an investigator to selectively enroll a particular type of patient thought to benefit most from the given treatment, thereby confounding the true treatment effect. The importance of assessing the robustness of randomization designs to chronological and selection biases has been recently documented by Hilgers and coauthors in the Evaluation of Randomization procedures for Design Optimization (ERDO) template (45).

Note that our simulation plan here is by no means exhaustive. However, we supply an R code (available upon request from the first author) that can be used to reproduce all results in this paper and generate additional findings under user-defined experimental scenarios and other combinations of adaptive randomization strategies.

RESULTS

Targeting Locally D-Optimal Design

We consider seven randomization designs targeting D-optimal allocation (0.407, 0.336, 0.257). These designs are as follows: (I) CRD, (II) DBCD (γ = 2), (III) GDLUD (C = 10), (IV) MWUD (α = 10), (V) MaxEnt(η = 0.5), (VI) MaxEnt(η = 1), and (VII) PBD. We also consider a uniform allocation design which randomizes study subjects among the dose levels 0, 0.5, and 1 in equal proportions by means of PBD, i.e., (VIII) Uniform PBD.

Table I summarizes the performance of the eight randomization designs, evaluated for four values of the sample size: n = 15; 30; 45; 60. In regard to balance and randomness, the most variable and least restrictive design is CRD: it has the highest values of MPM(n) and ASD(n) and the lowest possible forcing index (FI(n) = 0). By construction, the most balanced designs are MaxEnt(η = 1), PBD, and Uniform PBD: they have constant values of MPM(n) (0.50, 1.14, and 0.54, respectively), regardless of n. MaxEnt(η = 1) is the most restrictive: it has FI(n) = 0.66, the highest among all designs. In comparison, PBD (which also attains a balanced allocation for the given sample size) has FI(n) = 0.11. The difference between these two designs is that MaxEnt(η = 1) forces balance at each allocation step, whereas for PBD, the allocation ratio may deviate from the target at intermediate steps. The other designs provide a tradeoff between balance and randomness and have values of MPM(n), ASD(n), and FI(n) between the two extremes (CRD and MaxEnt(η = 1)). Clearly, balance (a low value of MPM(n)) is achieved at the cost of randomness (a high value of FI(n)).

Table I Operating Characteristics of Eight Randomization Designs for a Single-Stage Trial with Locally D-optimal Design

Figure 2 shows box plots of the simulated distributions of D-eff(n) for the eight randomization designs, for sample sizes n = 15, 30, 45, and 60. One can see that CRD has the largest spread of D-eff(n), especially for small sample sizes (n = 15 and 30). As n increases, CRD becomes more efficient: for n = 45 and n = 60, the minimum value of D-eff(n) is >0.75 and the median value of D-eff(n) is ~0.99. At the same time, the most balanced designs, MaxEnt(η = 1) and PBD, have the highest values of D-eff(n) (~1.0), regardless of n. The other designs targeting the D-optimal allocation perform reasonably well: their distributions of D-eff(n) become less spread out and closer to 1.0 as n increases. Finally, Uniform PBD has a constant D-eff(n) = 0.74 for all values of n, which is, as expected, much lower than the other designs, given the non-optimal doses of the uniform design.

Fig. 2. Distribution of D-efficiency of eight randomization designs for a single-stage trial with the locally D-optimal design

Figure 3 shows measures of estimation accuracy (Bias², Variance, and MSE) for the four model parameters (β0, β1, β2, b) for the eight randomization designs and sample sizes n = 15; 30; 45; 60. The seven designs targeting the D-optimal allocation exhibit improvement (lower Bias², Variance, and MSE) as n increases. More variable designs (e.g., CRD) have somewhat higher Bias², Variance, and MSE than less variable designs (e.g., PBD). The Uniform PBD (purple curve) stands out in the plots: its overall performance, as assessed by MSE, is substantially worse than that of the seven designs targeting the D-optimal allocation. From Table I, we observe that Uniform PBD performs well relative to the D-optimal design only for small sample sizes: when n = 15, its RE(n) = 1.03; however, for n = 30, 45, and 60, its RE(n) values drop to 0.82, 0.67, and 0.38, respectively.

Fig. 3. Estimation precision (Bias², Variance, MSE) of eight randomization designs for a single-stage trial with the locally D-optimal design

From these results, we can make an important intermediate observation. In an “idealized” setting of a known nonlinear dose–response model with censored time-to-event data, the quality of estimation depends on both the choice of the allocation design and the randomization procedure used to implement the target allocation. The D-optimal allocation implemented by a randomization procedure with low variability (e.g., MaxEnt(η = 1) or PBD) results in the most accurate estimation of the dose–response relationship, especially when the sample size is small, e.g., n = 15 (cf. Fig. 2). Using a less restrictive (more random) randomization procedure can result in some deterioration of statistical estimation in small samples; however, the quality of estimation improves with larger sample sizes. For instance, when the “most random” CRD procedure is applied to target the D-optimal allocation, the average (across 10,000 simulation runs) D-efficiency values are 0.93, 0.97, 0.98, and 0.99 for sample sizes n = 15, 30, 45, and 60, respectively (cf. Table I). Using a non-optimal allocation (e.g., the uniform design), even with the most restrictive and most balanced randomization procedure, leads to inferior performance that does not improve with increasing sample size. In our example, the average D-efficiency of Uniform PBD was 0.74 for n = 15, 30, 45, and 60 (cf. Table I).

Of course, our observations here are based on data generated from the selected model in Eq. (1), under one experimental scenario (visualized in Fig. 1) and four choices of the sample size, with n = 15 being the smallest. Additional simulations under other experimental scenarios and with smaller values of n (e.g., n = 9 or 12) could be performed to investigate the loss in efficiency due to imbalance induced by randomization in very small samples. We defer this task to future work.

Overall, our findings from the considered example are in line with the template of Hu and Rosenberger (46) which suggests that for randomized comparative trials the performance of a randomization design is determined by an interplay between optimality (power) of a fixed allocation design, speed of convergence of a randomization procedure to the desired allocation, and variability of the allocation proportions. In our setting, we deal with estimation, not hypothesis testing; yet, we arrive at a similar conclusion: both D-optimality and variability of a randomization procedure (and, of course, the study size) determine the design performance.

Two-Stage Adaptive Optimal Design

To appreciate the impact of randomization on the performance of a two-stage adaptive optimal design, we compare five adaptive design strategies using different combinations of randomization procedures (CRD, MaxEnt(η = 1), and PBD) at stages 1 and 2. We also include the non-adaptive Uniform PBD as a reference procedure. We use a fixed total sample size of n = 60 and investigate three choices of the first-stage cohort size: n(1) = 15, 30, and 45. For each design strategy, the main concern is the quality of dose–response estimation, as assessed by D-eff(n) and RE(n).

Table II summarizes the performance of the two-stage adaptive design strategies. The two-stage design that uses CRD at stage 1 results in a 1–2% loss in average D-efficiency compared to the adaptive designs that utilize MaxEnt(η = 1) or PBD in stage 1. In regard to RE(n), there is no single best strategy. The combination MaxEnt(η = 1) → PBD is a top performer in scenarios with n(1) = 15 and n(1) = 30, and MaxEnt(η = 1) → CRD seems to perform best when n(1) = 45. Note that in the scenario where the total sample size is split equally between stages 1 and 2 (i.e., n(1) = n(2) = 30), the two-stage adaptive designs have the highest values of both D-eff(n) (0.84–0.85) and RE(n) (0.67–0.70). By contrast, when n(1) = 15, D-eff(n) is 0.79–0.81 and RE(n) is 0.63–0.66; and when n(1) = 45, D-eff(n) is 0.82–0.83 and RE(n) is 0.61–0.68. This indicates that adaptive designs using the MaxEnt(η = 1) → PBD or PBD → PBD combinations, with an adaptation after 50% of the total sample size, provide the best performance in our example. The two-stage uniform design applied with the (non-optimal) equal allocation has the worst performance (D-eff(n) = 0.74 and RE(n) = 0.38), which reinforces the importance of D-optimality in trial design.

Table II Operating Characteristics of 5 Two-Stage Adaptive Optimal Design Strategies and a Fixed Uniform Allocation Design for a Total Sample Size of n = 60

Based on the results in Table II, one may argue that the improvements due to the use of a “more balanced” randomization method such as PBD over a “less balanced” procedure such as CRD are very modest (1–2% in our example). However, these results are obtained under only one experimental scenario, with a limited selection of sample sizes for stages 1 and 2 and a limited selection of stage 1/stage 2 sample size ratios (15:45, 30:30, and 45:15 in our example). A more thorough study would be needed to carefully assess the impact of these parameters on the performance of various two-stage randomization design strategies.

Table III illustrates the importance of having a sufficiently large stage 1. Displayed in Table III is the percentage of simulation runs for which the MLE of θ (and, therefore, an estimate of the D-optimal design) could not be obtained from stage 1 data (in which case stage 2 of the trial was implemented using Uniform PBD). One can see that when n(1) = 15, the percentage of “failed” stage 1 trials was 23% for CRD, 11% for MaxEnt(η = 1), and 12% for PBD. In other words, if CRD is used with n(1) = 15, there is almost a 1 in 4 chance that the D-optimal design cannot be estimated after stage 1. On the other hand, for n(1) = 30 and n(1) = 45, the probability of not being able to estimate the D-optimal design after stage 1 with CRD drops to 4% and 2%, respectively. As expected, for MaxEnt(η = 1) and PBD, the corresponding numbers are lower (~1% and <1%, respectively) because these two designs have better balancing properties than CRD.

Table III Percentage of Simulation Runs for a Two-Stage Adaptive Design for Which the MLE of θ (and Therefore, an Estimate of the D-optimal Design) Could not Be Obtained Based on Data from Stage 1

Adaptive Optimal Designs with Early Stopping

To evaluate the impact of randomization on adaptive designs with early stopping, we consider four competing adaptive design strategies. All designs randomize the first cohort of 15 subjects among the doses 0, 1/2, and 1 using the target allocation (1/3, 1/3, 1/3). Thereafter, additional cohorts of 15 subjects are randomized into the study using different randomization procedures targeting an updated D-optimal design, until either the maximum sample size nmax is reached or the study stopping criterion is met. In our simulations, we set nmax = 1000. For the stopping criterion, we use the rule based on the volume of the confidence ellipsoid described in (34): the study stops once \( \left|{\mathbf{M}}_{obs}^{-1}\left({\widehat{\boldsymbol{\theta}}}_{MLE},\xi \right)\right|\le {\left({\eta}^4\left|{\widehat{\beta}}_0\right|\left|{\widehat{\beta}}_1\right|\left|{\widehat{\beta}}_2\right|\left|\widehat{b}\right|\right)}^2 \), where 0 < η < 1 is a user-defined constant. In our simulations, we explore four choices of η = 0.15; 0.20; 0.25; 0.35. We also include Uniform PBD with the same stopping rule as a reference procedure.
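The stopping rule itself is straightforward to code; a sketch (our illustration), assuming M_obs holds the observed 4 × 4 information matrix and theta_hat the current estimates (β̂0, β̂1, β̂2, b̂):

```r
## Stop once |M_obs^{-1}| <= (eta^4 * |b0_hat| * |b1_hat| * |b2_hat| * |b_hat|)^2
stop_trial <- function(M_obs, theta_hat, eta = 0.25) {
  det(solve(M_obs)) <= (eta^4 * prod(abs(theta_hat)))^2
}
```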

Figure 4 shows distributions of the sample size at study termination for the different adaptive design strategies and the uniform allocation design. Overall, for a given value of η, there is little difference among the adaptive optimal designs. Smaller values of η set a higher bar for estimation accuracy, and therefore a larger sample size is required. When η = 0.35, the CRD → CRD → ... strategy has a median sample size of 45 and a maximum sample size of 120; the three other adaptive design strategies (MaxEnt(η = 1) → PBD → ...; MaxEnt(η = 1) → MaxEnt(η = 1) → ...; and MaxEnt(η = 1) → CRD → ...) have the same median sample size of 45 but a lower maximum sample size of 60. By contrast, the median sample size for Uniform PBD is substantially larger than for the adaptive designs: based on our simulations, 7–11 additional cohorts of size 15 are required for Uniform PBD to achieve the same level of estimation accuracy as the adaptive designs.

Fig. 4. Distribution of sample size at study termination for multi-stage adaptive designs

Robustness to Experimental Biases

A recently published ERDO template (45) emphasizes the importance of assessing the robustness of randomization procedures to chronological bias and selection bias. Chronological bias can arise, for example, in a long-term study with slow recruitment, where patients enrolled later in the study may be healthier due to an overall improved standard of care; if treatment assignments are not balanced over time, then the treatment comparison may be biased. To mitigate the impact of chronological bias, it is recommended that a randomization design balance treatment assignments over time, e.g., by means of some kind of restricted randomization (44,47). The potential negative impact of selection bias on statistical inference (test decisions) is acknowledged and well documented (48,49,50,51). Strategies to reduce the risk of selection bias exist (52,53); one recommendation is to use less restrictive randomization procedures, such as the maximal procedure (47,54).

The ERDO template provides a general framework for justifying the choice of a randomization procedure in practice. Here, we apply it in a setting of an adaptive randomized three-arm trial with censored time-to-event outcomes and the D-optimal allocation.

Chronological Bias

We assume that there is an effect due to a time trend, which means that the true model for the jth subject in the study has the form:

$$ \log {T}_j={\beta}_0+{\beta}_1{x}_j+{\beta}_2{x}_j^2+{\eta}_j+b{\varepsilon}_j, $$
(10)

where xj is the dose, εj is the error term following the standard extreme value distribution, and ηj is the time trend, which can take one of the following forms (44):

$$ {\eta}_j=\nu \left\{\begin{array}{cc}\frac{j}{n}& \mathrm{linear}\ \mathrm{time}\ \mathrm{trend},\\ {}{1}_{j\ge c}(j)& \mathrm{stepwise}\ \mathrm{trend},\\ {}\log \left(\frac{j}{n}\right)& \mathrm{logarithmic}\ \mathrm{trend},\end{array}\right. $$

where the time trend effect ν is a positive number. A sensible choice for ν is a fraction of the variation in the data, e.g., of the standard deviation or the range. Furthermore, we assume that ηj and εj are independent.
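The three trend shapes translate directly into code; a sketch (our illustration, with the change point c = 30 matching the stepwise scenario used in our simulations):

```r
## Time trend eta_j for subject j of n under model (10)
time_trend <- function(j, n, nu, type = c("linear", "stepwise", "log"), c = 30) {
  type <- match.arg(type)
  nu * switch(type,
              linear   = j / n,
              stepwise = as.numeric(j >= c),
              log      = log(j / n))
}
```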

We consider a two-stage adaptive design with n = 60 and n(1) = 30 in which data are generated according to the model in Eq. (10) with the three kinds of time trend described above. For the stepwise trend, we take c = 30, which means that the patients recruited in stage 2 (after the interim analysis) are systematically different from the patients recruited in stage 1. We consider six choices for ν: ν = 0 (no time trend) and ν = 0.5; 1; 2; 5; 10 (time trend present). We evaluate three different two-stage adaptive designs and the uniform allocation design. The key interest is the quality of estimation, as assessed by D-eff(n) and RE(n).

First, we found that chronological bias had no impact on D-efficiency for any of the considered design strategies: the average values of D-eff(n) were identical in the no-trend case (ν = 0) and in the cases where a trend was present (ν > 0) (results not displayed here). Figure 5 shows a plot of the MSE values for estimating (β0, β1, β2, b) vs. ν, for the three kinds of time trend. Overall, there is no apparent evidence that a greater amount of chronological bias (higher values of ν) leads to an increase in the MSE values. Interestingly, the presence of chronological bias (ν > 0) resulted in a substantial decrease of the MSE values for the parameters b and β0 for Uniform PBD (purple curve).

Fig. 5. MSE of two-stage adaptive strategies with and without chronological bias

Selection Bias

We adopt the approach described in (51), but accounting for a three-arm randomization setting. We assume the outcome is survival time and that longer times indicate better treatment efficacy. An investigator favors the experimental treatment, and if she anticipates that the next treatment assignment is either the low or the high dose, then she can select a terminally ill patient who meets all eligibility criteria to be allocated to this dose group.

We assume the investigator knows the target allocation (ρ1, ρ2, ρ3) and the current treatment distribution (N1(j), N2(j), N3(j)). Furthermore, assume that the investigator uses the minimum imbalance guessing strategy, as described in (55). Compute the imbalance between the current allocation and the target allocation:

$$ {Imb}_i=\frac{N_i(j)}{j}-{\rho}_i,\kern0.5em i=1,2,3. $$

The treatment (dose level) with the minimum value of imbalance is predicted, i.e.,

$$ \ell =\arg \underset{i=1,2,3}{\min }{Imb}_i $$

In other words, ℓ can take values 1 ⇒ dose level 0 (placebo), 2 ⇒ low active dose, or 3 ⇒ high active dose. The biasing strategy is as follows. If ℓ = 2 or 3, then the investigator enrolls a “sicker” patient into the study. If ℓ = 1, then the investigator enrolls a “healthier” patient. If there is a tie between 1 and 2 (i.e., Imb1 = Imb2 < Imb3), a tie between 1 and 3 (i.e., Imb1 = Imb3 < Imb2), or a tie between 1, 2, and 3 (i.e., Imb1 = Imb2 = Imb3), then the investigator enrolls a “normal” patient. Therefore, the model for the jth subject in the study has the form:

$$ \log {T}_j=\left({\beta}_0+{\beta}_1{x}_j+{\beta}_2{x}_j^2\right){\eta}_j+b{\varepsilon}_j, $$
(11)

where xj is the dose, εj is the error term following the standard extreme value distribution, and ηj is given by the following:

$$ {\eta}_j=\left\{\begin{array}{cc}\nu & \mathrm{if}\ \ell =2\ \mathrm{or}\ \ell =3,\\ {}1& \mathrm{if}\ \ell =\left\{1,2\right\}\ \mathrm{or}\ \ell =\left\{1,3\right\}\ \mathrm{or}\ \ell =\left\{1,2,3\right\},\\ {}1/\nu & \mathrm{if}\ \ell =1,\end{array}\right. $$

with ν ∈ (0, 1) being the biasing factor. At dose level x, the median survival time is \( {e}^{\nu \left({\beta}_0+{\beta}_1x+{\beta}_2{x}^2\right)}{\left(\log 2\right)}^b \) for a “sicker” patient, \( {e}^{\left({\beta}_0+{\beta}_1x+{\beta}_2{x}^2\right)/\nu }{\left(\log 2\right)}^b \) for a “healthier” patient, and \( {e}^{\beta_0+{\beta}_1x+{\beta}_2{x}^2}{\left(\log 2\right)}^b \) for a “normal” patient.
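A sketch of the biasing mechanism (our illustration), combining the minimum imbalance guess with the factor ηj of Eq. (11); N holds the current counts after j assignments:

```r
## Returns eta_j for the next patient: nu ("sicker"), 1/nu ("healthier"), or 1 ("normal")
bias_factor <- function(N, j, rho, nu = 0.5) {
  imb  <- N / j - rho                   # Imb_i, i = 1, 2, 3
  pred <- which(imb == min(imb))        # predicted arm(s); ties are possible
  if (length(pred) > 1 && 1 %in% pred) 1       # tie involving placebo: "normal"
  else if (identical(pred, 1L)) 1 / nu         # placebo predicted: "healthier"
  else nu                                      # active dose predicted: "sicker"
}
```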

We consider a two-stage adaptive design with n = 60 and n(1) = 30 in which data are generated according to the model in Eq. (11) with ν = 0.5 (selection bias present) and ν = 1 (no selection bias). We evaluate three two-stage design strategies: CRD → CRD, MaxEnt(η = 1) → MaxEnt(η = 1), and PBD → PBD. As before, we include Uniform PBD as a reference procedure.

Figure 6 shows the theoretical (red) and estimated (yellow) median time-to-event profiles, as well as the 25th and 75th quantile time-to-event curves, for the different designs. Clearly, the presence of selection bias (ν = 0.5) has a negative impact on the quality of estimation for all four design strategies considered here: the designs tend to systematically underestimate the dose–response curve at higher dose levels. The least affected design is PBD → PBD. The Uniform PBD has the worst performance. Interestingly, CRD → CRD did not provide the best protection against selection bias (as one might have conjectured based on results for equal (1:1) allocation, in which case CRD is known to be least susceptible to selection bias (22)).

Fig. 6. Estimated dose–response relationship for two-stage adaptive strategies without (bottom panels) and in the presence of (top panels) selection bias

DISCUSSION

In this paper, we evaluated the impact of randomization on statistical properties of adaptive optimal designs in a time-to-event dose–response study with a D-optimal allocation that involves possibly non-integer (irrational) allocation proportions. To our knowledge, this is the first paper to systematically investigate the choice of a randomization procedure in such a setting.

Optimal designs for dose–response studies with nonlinear models depend on the true model parameters, which are unknown in practice. A solution to this problem is to use adaptive optimal designs, which attempt to achieve the maximal incremental increase in information about the model at each step. Previous work has shown that such adaptive designs can successfully approximate true optimal designs in various dose–response settings (56,57,58,59,60). Many of these designs were developed for phase I dose escalation trials in which adaptations are applied sequentially, in a non-randomized manner. On the other hand, phase II dose–response trials use randomized parallel group designs and attempt to gain maximum information about the dose–response over a given dose range for a given sample size. A practical solution for a phase II dose–response study is a two-stage adaptive optimal design, for which data from a (pilot) first stage of the trial are used to obtain an initial estimate of the dose–response curve, and this information is then used to optimize the second stage of the trial. Two-stage adaptive designs have been shown to be highly efficient in various settings (34,61,62,63,64). With a two-stage design, an important question is how to implement it in practice. Randomization is a powerful tool that can be used to achieve a pre-determined treatment allocation ratio in each stage while protecting a study from bias and maintaining the validity of the trial results. The choice of the “best” randomization procedure for use in practice can be elusive due to the variety of available methods (22). Many studies do not go into detail on how randomization is implemented in practice (45). In the current paper, we provide an example of how different randomization options can be examined to select one for implementation in an adaptive dose–response trial.

We have shown that both the choice of an allocation design and a randomization procedure to implement the target allocation impact the quality of dose–response estimation, especially for small samples. The D-optimal allocation implemented by a randomization procedure with low variability leads to the most accurate estimation of the dose–response relationship. Our findings are consistent with the template of Hu and Rosenberger (46) which suggests that optimality of a fixed allocation design and variability of the randomization procedure are two major determinants of the performance of a randomization design in practice. From our simulation studies, we found that design optimality has a more profound impact on design performance than the randomization procedure. In other words, applying the “most balanced” randomization procedure such as PBD to target a non-optimal design is an inferior strategy to applying the “most random” CRD procedure to target the D-optimal design. We found that while CRD (applied to the D-optimal target) can incur some loss in efficiency in small samples, it becomes more efficient as the sample size increases, e.g., the median D-efficiency of CRD is ~ 99% for n = 60. Using more restrictive (and properly calibrated) randomization procedures (such as DBCD, GDLUD, etc.) can also be an attractive strategy.

For a two-stage design with a pre-determined total sample size, two important considerations are the timing of the interim analysis and the choice of randomization procedures for stages 1 and 2. If the stage 1 size is too small and a highly variable randomization procedure (such as CRD) is used to allocate patients to doses, then there is a substantial risk that the D-optimal design cannot be estimated after stage 1, thereby defeating the purpose of design adaptation. If the stage 1 size is too large, then an interim estimate of the D-optimal design can be readily obtained, yet the second stage may be too small to fully benefit from this interim knowledge. Our simulations show that an equal split of the total sample size between stages 1 and 2, together with a “well-balanced” randomization procedure to implement the target allocation in each stage (especially in stage 1), is an optimal strategy.

For a multi-stage adaptive design with early stopping, it is important that the first (pilot) cohort is randomized to doses according to a “well-balanced” procedure such as PBD or MaxEnt(η = 1). Thereafter, additional cohorts can be randomized (according to an updated D-optimal design) using different methods, including CRD. Again, design optimality has a profound effect—using a sub-optimal (uniform allocation) design requires much larger sample sizes to attain the same level of estimation accuracy as for the adaptive optimal designs.

In practice, the design performance may be affected by various experimental biases. It is increasingly common to evaluate the influence of potential selection bias and chronological bias on the test decision (type I error rate) (44,49,50,51,65). In the current paper, we investigated the potential impact of experimental bias on dose–response estimation using the recently published ERDO template (45). In particular, our simulations provide evidence that selection bias can have a detrimental impact on the quality of dose–response estimation. A striking finding is that a sub-optimal (uniform allocation) design can lead to very misleading conclusions, in scenarios both with and without selection bias. Without selection bias, the Uniform PBD can overestimate the true dose–response curve, whereas when selection bias is present, the design can grossly underestimate the curve and even yield a false impression that the dose–response is flat. In our example, a two-stage adaptive optimal design with PBD applied in both stages was more robust to selection bias (while still being affected) than the other designs.

We would also like to highlight several important problems for future research. In the current paper, we focused only on estimation of the dose–response. However, in many phase II clinical trials, the primary objective is to first test whether a dose–response is present and then estimate the dose–response curve. Testing for the presence of a dose–response in time-to-event settings may be challenging due to small sample sizes, censored data, and model uncertainty. Which test (parametric or nonparametric) should be used? The impact of randomization on the power of the test has been studied for response-adaptive randomized comparative studies (46), but not in the context of dose–response studies; this is one important open problem. Another problem is sample size justification for two-stage adaptive designs. How large should a study be? Is an equal split of the total sample size between stage 1 and stage 2 always optimal? Our simulations in the current paper suggest so, but a formal proof of this conjecture is yet to be provided.

In practice, historical data from previous studies may be available, in which case a Bayesian design may be a viable option, e.g., the first (pilot) stage may be implemented using Bayesian optimal design (which may be different from the uniform design), and subsequent adaptations can be implemented in a Bayesian manner (rather than using maximum likelihood updating). The impact of randomization on statistical properties of Bayesian adaptive dose–response designs certainly merits investigation.

The results in the current paper are based on the assumption that event times follow a quadratic Weibull regression model with four parameters. While such a model is quite flexible and can cover a broad variety of dose–response shapes (34), it may still be misspecified in a number of ways; e.g., a third- or higher-order polynomial model may be a better choice, and/or the time-to-event distribution may be other than Weibull (say, log-logistic, Gamma, etc.). Finding locally D-optimal designs under different models and constructing response-adaptive designs that converge to the “true” ones can be done using arguments similar to those in our previous work (34) and the current paper. More complex models may require a larger amount of data to estimate the underlying dose–response and implement the corresponding D-optimal designs. However, the main findings of the current work are likely to extend to such more complex models (provided that the functional form of the model and the event time distribution are chosen correctly). If the model form and/or the distribution of the event times are misspecified, then the statistical properties of response-adaptive optimal designs (constructed under different assumptions) may be affected. The impact of such misspecifications is another important open problem which we hope to pursue in future work.

In many time-to-event trials, there are important covariates (prognostic factors) that are correlated with the primary outcome. Rosenberger and Sverdlov (66) discuss strategies for handling covariates in the design of randomized comparative trials and advocate a class of covariate-adjusted response-adaptive (CARA) randomization designs. CARA randomization can be particularly attractive in trials for personalized medicine (67). An application of CARA randomization in time-to-event dose–response trials is yet another open problem.

Finally, we think that further theoretical and simulation studies are warranted to better understand the impact of chronological bias and selection bias on estimation and statistical tests following adaptive optimal designs. The ERDO template (45) is an excellent starting point to facilitate such an investigation.

CONCLUSION

The current paper provides a systematic study of adaptive randomization procedures to target D-optimal designs for dose–response trials with time-to-event outcomes. Simulation studies provide evidence that the choice of randomization to implement the D-optimal design does matter as far as quality of dose–response curve estimation is concerned. For best performance, an adaptive design with small cohort sizes should be implemented with a randomization procedure that ensures a “well-balanced” allocation according to the targeted D-optimal design at each stage. Using a sub-optimal design can lead to very misleading results, in both scenarios with and without selection bias. The results of the current work should help clinical investigators select an appropriate randomization procedure for their dose–response study.