Single-case experiments (SCEs) are designed experiments that include repeated measurements of a single entity (usually a person) for at least one dependent variable under different levels (i.e., treatments) of one or more independent variables (Barlow, Nock, & Hersen, 2009; Gast & Ledford, 2014; Kazdin, 2011; Kratochwill & Levin, 1992; Onghena, 2005).

Fields such as special education, school psychology, and clinical psychology are increasingly using SCEs to assess the efficacy of an intervention or treatment for a single subject (Alnahdi, 2015; Bowman-Perrott, Burke, de Marin, Zhang, & Davis, 2015; Hammond & Gast, 2010; Leong, Carter, & Stephenson, 2015; Moeller, Dattilo, & Rusch, 2015; Shadish & Sullivan, 2011; Smith, 2012; Swaminathan & Rogers, 2007). SCEs are also gaining in popularity in medical science (where they are often called “N-of-1 designs”) to evaluate treatments for patients with, for instance, chronic pain or attention deficit hyperactivity disorder (Gabler, Duan, Vohra, & Kravitz, 2011). The recent development of guidelines for reporting the results of SCEs confirms the growing interest in these types of designs in the educational, behavioral, and health sciences (Shamseer et al., 2016; Tate, Togher, Perdices, McDonald, & Rosenkoetter, 2012).

Despite the growing popularity of SCEs, there is no broad consensus with respect to adequate data-analysis methods for these types of designs. As a result, a wide variety of methods is currently being used (often in combination with each other; Kratochwill et al., 2010; Maggin, O’Keeffe, & Johnson, 2011; Shadish, 2014). These methods can be broadly categorized into two main approaches: visual analysis and statistical analysis (Heyvaert, Wendt, Van den Noortgate, & Onghena, 2015). Visual analysis consists of inspecting graphed SCE data for changes in level, overlap between phases, variability, trend, immediacy of the effect, and consistency of data patterns across similar phases (Horner, Swaminathan, Sugai, & Smolkowski, 2012). Statistical analysis methods for SCE data can be subdivided into three groups: effect size calculation, statistical modeling, and statistical inference. Effect size calculation refers to determining the size of the treatment effect by calculating formal effect size (ES) measures. Examples include mean difference measures (e.g., Busk & Serlin, 1992; Hedges, Pustejovsky, & Shadish, 2012), measures based on data nonoverlap between phases (e.g., Parker, Hagan-Burke, & Vannest, 2007; Parker & Vannest, 2009; Parker, Vannest, & Brown, 2009; Parker, Vannest, Davis, & Sauber, 2011), and regression-based measures (e.g., Allison & Gorman, 1993; Center, Skiba, & Casey, 1985–1986; Solanas, Manolov, & Onghena, 2010; Van den Noortgate & Onghena, 2003; White, Rusch, Kazdin, & Hartmann, 1989). In statistical modeling, the goal is to devise a statistical model that provides an adequate conceptualization of the data. Examples include multilevel modeling (Van den Noortgate & Onghena, 2003), structural equation modeling (Shadish, Rindskopf, & Hedges, 2008), and interrupted time series analysis (Borckardt & Nash, 2014; Gottman & Glass, 1978). Statistical inference refers to determining the statistical significance of ES measures through statistical hypothesis testing or to constructing confidence intervals for parameter estimates (Heyvaert, Wendt, Van den Noortgate, & Onghena, 2015; Michiels, Heyvaert, Meulders, & Onghena, 2017).

The present article deals with the inferential approach to evaluating treatment effects in single-case data. Inferential procedures can be parametric or nonparametric. However, parametric procedures such as statistical tests and confidence intervals based on t and F distributions are often inappropriate for analyzing SCE data, because the assumptions underlying these procedures (e.g., random sampling and specific distributional assumptions) are frequently violated in many areas of behavioral research, and particularly in single-case research (e.g., Adams & Anthony, 1996; Dugard, 2014; Edgington & Onghena, 2007; Ferron & Levin, 2014; Levin, Ferron, & Gafurov, 2014; Micceri, 1989). In contrast, nonparametric procedures do not make specific distributional assumptions about the data.

One of these nonparametric procedures, the randomization test (RT), has been proposed by some researchers as an appropriate statistical test to evaluate treatment effects in randomized SCEs (i.e., SCEs that include random assignment of measurement occasions to treatment conditions; e.g., Bulté & Onghena, 2008; Edgington, 1967; Heyvaert & Onghena, 2014; Levin, Ferron, & Kratochwill, 2012; Onghena, 1992; Onghena & Edgington, 1994, 2005). The RT is based on the random assignment model, which assumes that each experimental unit has been randomly assigned to one of the levels of the independent variable (similar to the way individual subjects are randomly assigned to treatment conditions in a between-subjects design; Kempthorne, 1955). Furthermore, by randomly assigning measurement occasions to treatment conditions, all known and unknown confounding variables can be controlled in a statistical way. Consequently, a statistically significant treatment effect can be attributed to the experimental manipulation. An alternative model, which is adopted by most parametric statistical tests, is the random sampling model. In this model, the data are assumed to have been randomly sampled from a specific population of interest. Because the random assignment model makes no assumption of random sampling, any statistical inference made under this model is conditional on the data that are analyzed (Keller, 2012).

A common practical problem in designing experiments is determining the number of observations that is required for the statistical tests to have sufficient power. The power of a statistical test is defined as the probability of rejecting a false null hypothesis. A power of 80% is generally accepted as the minimal requirement for a statistical test (Cohen, 1988, 1992). Power analysis can provide guidelines for the minimum number of observations that is required in order to detect an effect of a certain size with a certain probability. In an SCE the minimum number of observations refers to the minimum number of measurement occasions for the single case.

Apart from selecting the number of measurement occasions, the single-case researcher must also make other choices when designing a randomized SCE. More specifically, one must select a specific design, which determines the type of random assignment that is used in the SCE. In addition, the choice of an adequate ES measure is obviously important. All the aforementioned choices that are made when designing a randomized SCE have an effect on the power of the RT (Keller, 2012). It is thus extremely important for scientific practice to systematically investigate the effect of these factors on the power of the RT.

Several simulation studies concerning the power of the RT for different types of single-case designs and data patterns have already been performed (e.g., Ferron & Onghena, 1996; Ferron & Sentovich, 2002; Ferron & Ware, 1995; Heyvaert et al., 2017; Levin, Ferron, & Gafurov, 2014; Levin, Ferron, & Kratochwill, 2012; Manolov, Solanas, Bulté, & Onghena, 2010; Onghena, 1994). Although these simulation studies provide valuable information regarding the power of the RT in the context of analyzing SCEs, previous research has not yet systematically investigated one important determinant of the RT’s power: the ES measure that is used as the test statistic. Furthermore, all previous simulation studies that examined the power of RTs for single-case designs have used a random sampling conceptualization of statistical power, the so-called “unconditional power,” although a random assignment conceptualization, the so-called “conditional power,” is more consistent with the RT framework (Keller, 2012). With this article, we aim to fill both gaps.

With respect to the effect of the employed ES measure on the RT’s power, we focused on nonoverlap effect size (NES) measures, which are currently receiving considerable attention from the single-case community as measures for quantifying treatment effects in SCEs (e.g., Heyvaert, Saenen, Campbell, Maes, & Onghena, 2014; Lenz, 2012; Wolery, Busick, Reichow, & Barton, 2010). NES measures are rooted either in the tradition of visually analyzing single-case data or in the tradition of nonparametric rank statistics, and they quantify the degree to which the data points of the two conditions do not overlap. Following the approach proposed by Heyvaert and Onghena (2014), we used these NES measures as test statistics in an RT. More specifically, we included the percentage of nonoverlapping data (PND; Scruggs, Mastropieri, & Casto, 1987) and the nonoverlap of all pairs (NAP; Parker & Vannest, 2009) in our study.

The PND is the earliest published and most widely used NES measure (Maggin et al., 2011; Schlosser, Lee, & Wendt, 2008). The PND is calculated as the percentage of data points from the treatment condition that exceed the single highest data point from the control condition (assuming that the treatment is intended to increase the dependent variable). To calculate the PND, one first identifies the highest data point from the control condition. Next, one records for each treatment condition data point whether or not it exceeds that highest control-condition data point. The PND can take values from 0% to 100%, with 0% indicating complete data overlap and 100% indicating complete data nonoverlap. Note that if the treatment is expected to decrease the scores on the dependent variable, the PND is calculated by comparing the treatment condition data points to the lowest control condition data point. Figure 1 illustrates the calculation of the PND for a hypothetical data set.

Fig. 1 Example of calculating the percentage of nonoverlapping data (PND) for a hypothetical data set. The dotted line represents the data from the control condition, and the full line represents the data from the treatment condition
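As an illustration of this definition, the following minimal Python sketch (ours, not code from the original study; the data are hypothetical) computes the PND:

```python
import numpy as np

def pnd(control, treatment, increase=True):
    """Percentage of nonoverlapping data (PND; Scruggs et al., 1987).

    Percentage of treatment data points that exceed the single highest
    control data point (or, when a decrease is expected, fall below the
    single lowest one). Ranges from 0 to 100.
    """
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    if increase:
        nonoverlap = treatment > control.max()
    else:
        nonoverlap = treatment < control.min()
    return 100.0 * nonoverlap.mean()

# Hypothetical data: 3 of the 4 treatment points exceed max(control) = 5
print(pnd([2, 4, 5, 3], [6, 7, 4, 8]))  # 75.0
```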

The NAP was introduced to address the statistical limitations of the PND. It considers all pairwise comparisons between control and treatment data points and is calculated as the percentage of pairs in which the treatment data point exceeds the control data point, with ties counting as half a point (Parker & Vannest, 2009). The NAP is equivalent to the Mann–Whitney U statistic and ranges from 0 to 1, with .50 indicating a null effect of the treatment (Mann & Whitney, 1947; Parker & Vannest, 2009). Figure 2 illustrates the calculation of the NAP for a hypothetical data set.

Fig. 2 Example of calculating the nonoverlap of all pairs (NAP) for a hypothetical data set. The dotted line represents the data from the control condition, and the full line represents the data from the treatment condition. The lines from every control condition data point to each of the treatment condition data points have been drawn only for the first control condition data point, so as not to clutter the graph
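A corresponding Python sketch for the NAP (again ours, with the same hypothetical data) makes the pairwise logic explicit:

```python
import numpy as np

def nap(control, treatment):
    """Nonoverlap of all pairs (NAP; Parker & Vannest, 2009).

    Proportion of all (control, treatment) pairs in which the treatment
    data point exceeds the control data point, with ties counted as half
    a point. Ranges from 0 to 1, with .50 indicating a null effect.
    """
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    # Pairwise differences: rows index control points, columns treatment points
    diff = treatment[np.newaxis, :] - control[:, np.newaxis]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / diff.size

# Same hypothetical data as above: 14.5 of the 16 pairs favor the treatment
print(nap([2, 4, 5, 3], [6, 7, 4, 8]))  # 0.90625
```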

The second goal of this simulation study was to evaluate the power of the RT in a conditional power framework. In the previously cited simulation studies, a random sampling model was used to generate the data for calculating the statistical power of the RT. Because the RT does not make an assumption of random sampling, evaluating its statistical power under a random sampling model does not do justice to the RT. As was demonstrated by Keller (2012), it is conceptually more appropriate to evaluate the statistical power of the RT by generating data under a compatible random assignment model. The resulting statistical power estimates are called “conditional power” estimates, because they are conditional on a specific data set (see also Corcoran & Mehta, 2002; Gabriel & Hsu, 1983; Kempthorne, 1955; Kempthorne & Doerfler, 1969; Pesarin & De Martini, 2002; Pratt & Gibbons, 1981). We elaborate this conditional power analysis approach in the Method section and explain how it is combined with the three data-generating processes that we used.

An additional goal of the present study was to investigate the effect of specific characteristics of the data on the power of the RT. Research has shown that data from single-case designs can contain autocorrelation (e.g., Shadish & Sullivan, 2011; Solomon, 2014). To account for this possibility, we generated data that were not autocorrelated (independent standard normally distributed data) as well as data that contained strong positive autocorrelation (generated from a first-order autoregressive Gaussian process). In addition, we generated data from a uniform distribution (with a population standard deviation of 1) to evaluate the power of the RT in a situation in which classic distributional assumptions are severely violated.

The type of single-case design that is chosen to perform an SCE has important implications for the types of research questions that can be answered and for the statistical power of the RT. For this reason, we will now provide some information about the types of single-case designs we included in this simulation study and the types of research situations for which they are appropriate. The single-case designs that were used in this simulation study were all single-case alternation designs. Alternation designs are single-case designs in which rapid and frequent alternation of treatment conditions is possible. RTs for alternation designs are based on the random assignment of treatment conditions to measurement occasions (Onghena & Edgington, 2005). Although phase designs are used more often than alternation designs in practice (Shadish & Sullivan, 2011), we focused on alternation designs in this simulation study because they are more powerful than phase designs for SCEs (Onghena, 1994; Onghena & Edgington, 2005). The alternation designs we included were the completely randomized design (CRD), the randomized block design (RBD), and a restricted randomized alternation design (RRAD; Onghena, 1994, 2005). The CRD is the simplest alternation design (Edgington, 1967). In this design, treatment conditions are randomized solely with respect to the numbers of measurement occasions for each level of the independent variable. For example, the number of possible assignments for a hypothetical SCE with two conditions (A and B) and three measurement occasions per condition is given by \( \binom{6}{3} \), which equals 20 (Onghena, 2005):

AAABBB

BBBAAA

AABABB

BBABAA

AABBAB

BBAABA

AABBBA

BBAAAB

ABAABB

BABBAA

ABABAB

BABABA

ABABBA

BABAAB

ABBAAB

BAABBA

ABBABA

BAABAB

ABBBAA

BAAABB

This method of randomization is analogous to the random assignment of subjects to a balanced between-subjects design with two conditions. When certain assignments resulting from complete randomization are deemed undesirable for an SCE (e.g., AAAAABBBBB), other alternation designs can be derived from the CRD randomization scheme by imposing additional constraints on the method of randomization. For example, an RBD is obtained by grouping measurement occasions in pairs and randomizing the treatment order within each pair. An RBD for the same hypothetical SCE yields \( 2^3 = 8 \) possible assignments (which are a subset of the set of CRD assignments):

ABABAB

BABABA

ABABBA

BABAAB

ABBAAB

BAABBA

ABBABA

BAABAB

RBDs can be used to counter the effect of time-related confounding variables on the dependent variable. For example, suppose a researcher wishes to conduct an SCE that evaluates the effect of a behavioral treatment on a depressed patient’s feelings of negative affect. If the researcher knows that the level of negative affect of the patient can fluctuate from day to day, irrespective of the treatment, an RBD can be used to control for this confounding factor. Suppose the experiment consists of 10 days (i.e., ten blocks) where on each day the researcher administers both treatments (i.e., the control condition and the treatment condition) and records the patient’s level of negative affect after each treatment (i.e., two negative affect scores per day). Because the sequence of conditions within a day (i.e., block) is determined randomly for every day, a potential significant treatment effect cannot be attributed to the day-to-day fluctuations in negative affect but only to the behavioral treatment.

When one wants to prevent the temporal clustering of treatments by ensuring that the randomization only allows a maximum number of successive measurement occasions to have the same treatment, one can use an RRAD (Onghena & Edgington, 1994). The RRAD yields a larger subset of the set of CRD assignments for a given SCE than the RBD. More specifically, an RRAD with a maximum number of two consecutive administrations of the same condition yields the following assignments for our hypothetical SCE:

AABABB

BBABAA

AABBAB

BBAABA

ABAABB

BABBAA

ABABAB

BABABA

ABABBA

BABAAB

ABBAAB

BAABBA

ABBABA

BAABAB

Note that the entire set of RBD assignments is present in the set of RRAD assignments.
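To make the three randomization schemes concrete, the following Python sketch (ours; the definitions follow the descriptions above) enumerates the permissible assignments of each design for the hypothetical six-occasion SCE:

```python
from itertools import combinations, product

def crd(n):
    """Completely randomized design: all balanced assignments of n
    measurement occasions (n even) to conditions A and B."""
    return ["".join("A" if i in a_pos else "B" for i in range(n))
            for a_pos in combinations(range(n), n // 2)]

def rbd(n):
    """Randomized block design: occasions grouped into pairs, with the
    order of A and B randomized within each pair."""
    return ["".join(blocks) for blocks in product(("AB", "BA"), repeat=n // 2)]

def rrad(n, max_run=2):
    """Restricted randomized alternation design: the subset of CRD
    assignments with at most max_run consecutive identical conditions."""
    def longest_run(s):
        best = run = 1
        for prev, cur in zip(s, s[1:]):
            run = run + 1 if prev == cur else 1
            best = max(best, run)
        return best
    return [s for s in crd(n) if longest_run(s) <= max_run]

print(len(crd(6)), len(rbd(6)), len(rrad(6)))   # 20 8 14
assert set(rbd(6)) <= set(rrad(6)) <= set(crd(6))  # nested assignment sets
```

The final assertion verifies the nesting noted above: every RBD assignment is also a permissible RRAD assignment, and every RRAD assignment is a permissible CRD assignment.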

Method

The Method section contains three parts. The first part introduces the RT as a method of evaluating treatment effects in SCEs. The second part discusses how the conditional power of the RT is calculated. Finally, the third part details the design matrix of the simulation study.

Evaluating treatment effects in single-case experiments

Before explaining the way in which the conditional power of the RT is calculated, we will provide a worked example of the several steps that need to be taken to analyze a randomized SCE with an RT. Suppose we want to perform a randomized SCE consisting of four measurement occasions. Assume that the employed single-case design is balanced and completely randomized. The first step is taken before executing the experiment and consists of listing all permissible assignments for the chosen experimental design. A permissible assignment is an assignment that adheres to the restrictions imposed by the chosen single-case design. In this example, the only restriction is that the design is balanced. When there are only two experimental conditions, this results in the following set of permissible assignments:

AABB

ABAB

ABBA

BAAB

BABA

BBAA

Second, one of the permissible assignments is randomly selected as the assignment to execute the actual experiment. Suppose the selected assignment is ABBA.

Third, one chooses an ES measure that is deemed adequate to answer the research question. This ES measure will be the test statistic of the RT. Suppose we choose the mean difference (MD) between the A and B conditions as the test statistic for this RT. Note that in order to test a two-sided alternative hypothesis, the test statistic must be sensitive to both directions of a possible effect. In this case we will use the absolute mean difference between the two conditions, |mean(B) − mean(A)|.

Suppose that the observed data are 2, 5, 4, and 3. For the selected assignment ABBA, this yields an observed test statistic of |4.5 − 2.5| = 2. As a fourth step, we calculate the test statistic for all permissible assignments:

$$ \begin{aligned}
\mathrm{AABB} &\Rightarrow \left|0\right| = 0\\
\mathrm{ABAB} &\Rightarrow \left|1\right| = 1\\
\mathrm{ABBA} &\Rightarrow \left|2\right| = 2\\
\mathrm{BAAB} &\Rightarrow \left|-2\right| = 2\\
\mathrm{BABA} &\Rightarrow \left|-1\right| = 1\\
\mathrm{BBAA} &\Rightarrow \left|0\right| = 0
\end{aligned} $$

These values make up the randomization distribution. This collection of values is used as a reference distribution to calculate the statistical significance of the observed test statistic.

As a fifth step, the two-sided p value of the RT is calculated as the proportion of test statistics in the randomization distribution that are at least as extreme as the observed test statistic. Looking at the randomization distribution, we see that two of the six permissible assignments lead to a test statistic value at least as extreme as the observed one, which results in a two-sided p value of 2/6 = 1/3. This p value is a probabilistic statement about the data under the null hypothesis that the conditions are unrelated to the data, and the validity of this statement is guaranteed by the randomization of the conditions: if the null hypothesis were true, the probability of obtaining a test statistic value as extreme as the one observed would be 1/3 (Edgington & Onghena, 2007).
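For readers who want to verify these numbers, here is a minimal Python sketch (ours, not the authors' code) that reproduces the randomization distribution and the two-sided p value for this example:

```python
from itertools import combinations
import numpy as np

data = np.array([2, 5, 4, 3], dtype=float)
observed_assignment = "ABBA"  # the randomly selected assignment

# Step 1: all permissible assignments of a balanced CRD with 4 occasions
assignments = ["".join("A" if i in a_pos else "B" for i in range(4))
               for a_pos in combinations(range(4), 2)]

def abs_mean_diff(assignment, scores):
    """Absolute difference between the B and A condition means."""
    mask_b = np.array([c == "B" for c in assignment])
    return abs(scores[mask_b].mean() - scores[~mask_b].mean())

# Steps 3-5: observed statistic, randomization distribution, two-sided p value
obs = abs_mean_diff(observed_assignment, data)                # 2.0
distribution = [abs_mean_diff(a, data) for a in assignments]  # [0, 1, 2, 2, 1, 0]
p_value = np.mean([t >= obs for t in distribution])           # 2/6 = 1/3
print(obs, p_value)
```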

Note that this example was chosen only for didactic purposes, as it is evident that an SCE with only four measurement occasions can never yield a p value smaller than any conventional significance level. Without performing any simulations, we can already infer that an SCE with only four measurement occasions has zero statistical power for all practical purposes.

The main advantages of the RT are that it makes no distributional assumptions and no assumption of random sampling. These advantages are important because it has been shown that the assumptions underlying parametric tests (e.g., random sampling or specific distributional assumptions) are doubtful in many domains of behavioral research and particularly for single-case research (e.g., Adams & Anthony, 1996; Dugard, 2014; Edgington & Onghena, 2007; Ferron & Levin, 2014; Levin et al., 2014; Micceri, 1989). Other advantages of the RT as compared to parametric tests are its flexibility with regard to the choice of test statistic and the choice of experimental design (Ferron & Sentovich, 2002; Onghena, 1992; Onghena & Edgington, 2005).

Power analysis in the random assignment model

The RT produces so-called “conditional inferences”—that is, inferences that are conditional on the observed data, just as Fisher’s exact test is conditional on the marginal totals (Agresti, 1992; Krauth, 1988). Consequently, when investigating the power of the RT, it makes the most sense to use this conditional framework too, and to calculate the so-called “conditional power” (i.e., the power of the RT for a specific data set). The advantage of this conceptualization is that the conditional power calculations are consistent with the random assignment model, which also underlies the validity of the RT, and that no assumption of random sampling is required. For the calculation of conditional power, only an additional assumption about the treatment effect is necessary, just as one needs an assumption about the effect size parameter for the calculation of unconditional power.

To calculate the unconditional power of the RT, one would generate a large number of data sets (with fixed condition labels), sampled from a known distribution, and calculate the proportion of data sets that yield a p value smaller than or equal to a predefined significance level α. In contrast, to calculate the conditional power of the RT, one starts with a fixed set of scores that would be observed if the null hypothesis of no treatment effect were true (the “null scores”). Next, one generates all possible randomizations of the condition labels and constructs all possible data sets by pairing the null scores with the condition labels; null scores that are assigned to the treatment condition are transformed into observed scores containing the treatment effect. The conditional power is calculated as the proportion of those data sets that yield a p value smaller than or equal to α. Importantly, the unconditional power of the RT is defined in relation to the repeated random sampling of data sets from a known distribution, whereas the conditional power of the RT is defined in relation to the repeated random assignment of condition labels to a specific set of null scores. Consequently, for the calculation of conditional power one does not need to make an assumption of random sampling. Note that this also implies that the resulting conditional power pertains only to that specific set of null scores.

To calculate either the conditional or the unconditional power of the RT, one needs to make an assumption about the nature of the treatment effect. For conditional power, this means specifying a functional relation between the null scores and the observed scores if a specific alternative hypothesis is true. The best-known and most straightforward model in this respect is the unit-treatment additivity model (e.g., Cox & Reid, 2000; Hinkelmann & Kempthorne, 2008; Lehmann, 1959; Welch & Gutierrez, 1988). This model describes the relation between the null scores and the observed scores as

$$ X_i^B = X_i^A + \Delta $$

In this equation, \( X_i^B \) is the observed score of experimental unit i if i is assigned to the experimental condition B, \( X_i^A \) is the hypothetical score of i if i is assigned instead to the control condition A (i.e., the null score), and \( \Delta \) is the constant additive effect of the treatment. If we assume that the null hypothesis is true, the equation above reduces to

$$ X_i^B = X_i^A $$

which implies that the observed score for experimental unit i is independent of the condition to which it is assigned. Note that the null scores are assumed to be known in order to calculate conditional power, just as the distributional form has to be known in order to calculate unconditional power.

The conditional power of the RT is determined by the significance level of the test, the number of observations, the size of the treatment effect, the employed test statistic, and the effect function (Keller, 2012). Whereas unconditional power is defined as the percentage of rejections of the null hypothesis across a given number of samples drawn from a certain population distribution, the conditional power is defined as the percentage of random assignments of treatment conditions to experimental units that result in a rejection of the null hypothesis, given an assumed treatment effect (Kempthorne & Doerfler, 1969).

To calculate the exact conditional power of the RT for a specific data set, a few steps must be carried out. To begin, we must choose a single-case design, a number of observations, and a test statistic to use in the RT. Next, we generate one set of null scores for the chosen number of observations. We then obtain all permissible assignments of the employed randomization scheme for the chosen number of observations. If there are k permissible assignments, we then construct k different data sets from the null scores by adding the treatment effect to the null scores of the measurement occasions in the treatment condition. Next, we perform the RT for each of the k data sets from the previous step and record whether or not the null hypothesis is rejected at a pre-specified significance level. The exact conditional power is then defined as the overall proportion of rejected null hypotheses across the k RTs.

Notice that the RT is a computer-intensive method and that the calculation of the exact conditional power is “computer-intensive squared.” If the number of observations rises, then k for each RT increases exponentially. For the exact conditional power, the RT is repeated k times, resulting in a total of \( k^2 \) calculations.
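To make these steps concrete, here is a minimal Python sketch (ours, not the study's actual code) of the exact conditional power computation for a balanced CRD with the absolute mean difference as the test statistic, under the unit-treatment additivity model introduced above:

```python
from itertools import combinations
import numpy as np

def balanced_assignments(n):
    """All permissible assignments of a balanced CRD with n occasions."""
    return ["".join("A" if i in a_pos else "B" for i in range(n))
            for a_pos in combinations(range(n), n // 2)]

def abs_mean_diff(assignment, scores):
    mask_b = np.array([c == "B" for c in assignment])
    return abs(scores[mask_b].mean() - scores[~mask_b].mean())

def exact_conditional_power(null_scores, delta, alpha=1/3):
    """Exact conditional power of the two-sided RT for one set of null
    scores, assuming a constant additive treatment effect delta."""
    null_scores = np.asarray(null_scores, dtype=float)
    assignments = balanced_assignments(len(null_scores))
    rejections = 0
    for chosen in assignments:                  # each possible "true" assignment
        mask_b = np.array([c == "B" for c in chosen])
        scores = null_scores + delta * mask_b   # add the effect to the B occasions
        obs = abs_mean_diff(chosen, scores)
        dist = [abs_mean_diff(a, scores) for a in assignments]
        p = np.mean([t >= obs for t in dist])   # two-sided RT p value
        rejections += p <= alpha
    return rejections / len(assignments)        # proportion of rejections

print(exact_conditional_power([2, 3, 5, 3], delta=5))  # 1.0 for this large effect
```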

Table 1 illustrates the steps that are involved in calculating the exact conditional power of the RT.

Table 1 Calculation of the RT’s exact conditional power

In a random assignment framework with unit-treatment additivity, the conditional power of the RT is a function of the constant additive effect Δ. This implies that we can construct a conditional power curve for the null scores from Table 1 by varying the value of Δ. Figure 3 displays the conditional power function for the set of null scores from Table 1 and for α = 1/3.

Fig. 3 Conditional power curve of the two-sided randomization test (RT) for α = 1/3, in a completely randomized design using (2, 3, 5, 3) as the set of null scores

For very small data sets such as in this example, it becomes apparent that the conditional power curve of the RT is actually a stepwise function. The function is stepwise because the conditional power is determined by the proportion of the k RTs that yield a p value smaller than or equal to the significance level α. If k is a small number, then only multiples of 1/k are possible conditional power values. For larger data sets, the number of possible steps is quite large so that the power curve becomes indistinguishable from a continuous function.
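Assuming the exact_conditional_power sketch given earlier, the step shape can be reproduced by evaluating the power over a grid of Δ values (our illustration):

```python
import numpy as np

# Conditional power curve for the null scores (2, 3, 5, 3) at alpha = 1/3,
# reusing the exact_conditional_power sketch defined above.
deltas = np.linspace(0.0, 5.0, 51)
curve = [exact_conditional_power([2, 3, 5, 3], d) for d in deltas]
# With k = 6 permissible assignments, only multiples of 1/6 can occur,
# so the curve rises in discrete steps rather than continuously.
```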

Design matrix of the simulation study

We manipulated five experimental factors in the present simulation study:

1. Characteristics of the data. To investigate the effect of different types of data on the conditional power of the RT, data were generated from a standard normal distribution and a uniform distribution (with a population standard deviation of 1). We selected the normal and uniform distributions because of their simplicity and ubiquity. Because both distributions were also used in the simulation study of Keller (2012), we could use his results as a benchmark. We added a first-order autoregressive model with Gaussian errors (AR1). The autoregressive parameter (AR) quantifies the autocorrelation in the data. We hypothesized that the presence of positive autocorrelation would have little influence in the selected single-case designs because of their fast alternation of the experimental conditions. Some pilot testing with small and medium levels of autocorrelation supported this hypothesis. For this reason (and in order to keep the number of experimental conditions manageable), we included only one, rather large, level of positive autocorrelation: AR = 0.6. Because the variance of an AR1 model is

$$ \frac{\sigma_e^2}{1 - AR^2} $$

and we sampled e from a standard normal distribution (\( \sigma_e^2 = 1 \)) with AR = 0.6, the variance of the AR1 model is 1.5625. We used only three types of data (standard normal, uniform, and AR1) to keep the simulation study feasible in terms of design and duration, and because we did not expect the distributional shape to have a large impact on the power. (A Python sketch of these three data-generating processes is given after this list.)

2. Test statistics used in the RT. Three different ES measures were used as the test statistic in the RT: the PND, the NAP, and the MD. The main reason for including the PND in this simulation study is that it is the most widely used NES (Maggin et al., 2011; Schlosser et al., 2008). As such, we believe it is of great importance to investigate the PND's usability for statistical inference. The NAP was included because it was introduced to meet the statistical limitations of the PND (Parker & Vannest, 2009). To compare the performance of the selected NESs with a more generally accepted test statistic, we also included the mean difference (MD) in our simulation study. All test statistics were formulated in a nondirectional way, so we consider only two-sided p values in this simulation study.

3. Designs. Three different single-case alternation designs were investigated: the CRD, the RBD, and the RRAD (see above for details). In this study, we limited our investigation to designs with two conditions (a control condition and a treatment condition). The CRD entails full random assignment of the condition labels, with the only restriction that each condition must contain the same number of measurement occasions for each assignment. The RBD uses a form of randomization that groups measurement occasions into blocks of a certain size (we used a block size of two observations) and then randomizes the measurement occasions within blocks. The RRAD uses full random assignment with the restriction that the maximum number of adjacent measurement occasions from the same condition is limited to a pre-specified value (Onghena & Edgington, 1994). In alignment with the What Works Clearinghouse (WWC) standards' recommendation to have at least three measurement occasions in a "phase" (Kratochwill et al., 2010), there could never be more than two adjacent measurement occasions from the same condition in the RRAD randomization scheme. Note that all the designs were balanced designs (i.e., they contain the same number of measurement occasions in each condition).

4. Size of the treatment effect. Our choice of treatment ESs was guided by reported ESs in various domains of single-case research. ESs in single-case research are generally larger than in between-subjects research and are sometimes extremely high (Busk & Serlin, 1992). For example, Fabiano et al. (2009) performed meta-analyses of behavioral treatments for attention-deficit/hyperactivity disorder for various study designs and found average ESs of 0.83 and 3.78 for between-subjects studies and single-case studies, respectively. In a similar vein, two single-case meta-analyses concerning interventions for reducing challenging behavior in persons with intellectual disabilities resulted in average ESs of approximately 3 (Heyvaert, Maes, Van den Noortgate, Kuppens, & Onghena, 2012; Heyvaert, Saenen, Maes, & Onghena, 2014). With these results in mind, we included six levels of the treatment effect: 0, 0.5, 1, 1.5, 2, and 2.5. On the basis of pilot simulation testing, we set 2.5 as the maximum ES in our simulation study, because the conditional power for this ES was already 100% in almost all conditions. The size of the treatment effect in empirical single-case research can vary greatly depending on the specific domain, and with the current selection of ESs we are able to cover the entire range of frequently found empirical ESs.

5. Number of measurement occasions. The selected numbers of measurement occasions to generate a complete data set were 12, 20, 30, and 40, chosen to cover the range of common series lengths in empirical research. For example, Ferron, Farmer, and Owens (2010) performed a survey that found average series lengths ranging from 7 to 58, with a median of 24. A survey by Shadish and Sullivan (2011) found an average of 20 measurement occasions per individual time series. Note that the smallest number of measurement occasions was set at 12 rather than 10 because an RT using an RBD is unable to reach a 5% significance level for a data set with only 10 measurement occasions (\( 2^5 = 32 \) possible assignments, and because we are considering two-sided tests, 1/16 is always larger than 1/20).
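As announced in the first item of this list, the following Python sketch (ours, with an arbitrary seed) illustrates the three data-generating processes:

```python
import numpy as np

rng = np.random.default_rng(648)  # arbitrary seed for reproducibility

def generate_null_scores(n, kind, ar=0.6):
    """One set of null scores: standard normal, uniform with population
    SD 1, or a first-order autoregressive (AR1) Gaussian process."""
    if kind == "normal":
        return rng.standard_normal(n)
    if kind == "uniform":
        half_width = np.sqrt(3.0)   # SD of U(-a, a) is a / sqrt(3)
        return rng.uniform(-half_width, half_width, n)
    if kind == "ar1":
        e = rng.standard_normal(n)  # standard normal errors
        x = np.empty(n)
        # Start from the stationary distribution, variance 1 / (1 - AR^2)
        x[0] = rng.normal(scale=np.sqrt(1.0 / (1.0 - ar ** 2)))
        for t in range(1, n):
            x[t] = ar * x[t - 1] + e[t]
        return x
    raise ValueError(f"unknown kind: {kind}")

null_scores = generate_null_scores(20, "ar1")  # e.g., one AR1 replication
```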

Crossing the levels of these five factors resulted in a total of 3 × 3 × 3 × 4 × 6 = 648 conditions. For each condition, 100 null data sets were generated and the conditional power was averaged across these 100 replications. For each replication, the conditional power was calculated using Kempthorne and Doerfler’s (1969) method. The significance level was set at 5%.

Despite rapid advancements in computing speed and exact algorithms, it is usually not feasible to calculate the exact conditional power of the RT, due to the exponential increase of the computational demand when the number of observations grows larger (Keller, 2012). An alternative to exact computation is analytical approximation. For example, Gabriel and Hsu (1983) showed that their analytical approximation of the RT’s exact conditional power only slightly overestimates the true power. However, their approximation shows larger biases for small numbers of observations or skewed treatment effects. In situations in which the number of observations is too large for exact computation but too small for precise analytic approximation, Monte Carlo approximation is a good alternative (Senchaudhuri, Mehta, & Patel, 1995). In this approach, only a random sample of all permissible assignments is used to calculate the conditional power of the RT. For a single RT, it has been shown that the Monte Carlo RT produces valid p values (Edgington & Onghena, 2007). Furthermore, the accuracy of the random sampling can be increased to the desired level simply by increasing the number of random assignments that are drawn from the set of all permissible assignments (Senchaudhuri et al., 1995). Edgington (1969) pointed out that an efficient Monte Carlo RT can already be carried out with 1,000 random assignments. To keep the simulation study computationally manageable without having to resort to analytical approximations, we used Monte Carlo sampling for the average conditional power calculation. More specifically, we selected 1,000 random assignments for each null sample, which resulted in 1,000 data sets for the conditional power calculation of each null sample. The conditional power for each null sample was calculated by performing the RT on each of these 1,000 data sets, using 1,000 random assignments of the condition labels, and by determining the proportion of p values that were equal to or smaller than .05. The average conditional power for each condition in the simulation study was obtained by averaging the conditional powers of the 100 null samples.

Results

To evaluate the effect of each experimental factor on the RT’s conditional power, we plotted the main effect of each individual experimental factor while averaging the power across all other experimental factors. Figures 4 to 7 represent the main effects of ES measure, design, characteristics of the data, and number of measurement occasions on the average conditional power (averaged over 100 replications), with the size of the treatment effect plotted on the x-axis. Complete numerical results of the simulation study are displayed in Tables 2 to 4 in the Appendix.

Fig. 4 Main effects of different effect size (ES) measures on the conditional power of the RT

Figure 4 shows that the MD and the NAP perform very similarly (an average difference of 1.74% in favor of the MD), whereas use of the PND as the test statistic in the RT yields substantially lower power. More specifically, the average power difference between the MD and the PND is 16.36% across the range of treatment ESs.

Figure 5 shows that the RRAD is the single-case design that on average yields the highest conditional power in the RT. More specifically, the average power advantage is 3.36% as compared to the CRD, and 6.39% as compared to the RBD.

Fig. 5 Main effects of different designs on the conditional power of the RT

Figure 6 reveals that there is only a very minimal, yet consistent, effect of the characteristics of the data on the conditional power of the RT. More specifically, the average difference between the conditions using the uniform distribution and the conditions using the standard normal distribution is 1.59%. Similarly, the average difference between the conditions using the standard normal distribution and the conditions using the AR1 model is 0.74%. These results show that the power of the RT is only slightly affected by extreme variations in the distributional characteristics of the data (cf. the data from a standard normal distribution vs. the data from a uniform distribution). Furthermore, the very minimal power difference between the conditions using the AR1 model and the conditions using the standard normal distribution indicates that the RT is hardly affected by strong positive autocorrelation in the data when single-case alternation designs are used.

Fig. 6 Main effects of different characteristics of the data on the conditional power of the RT

Finally, Fig. 7 shows the effect of the number of observations on the conditional power of the RT. Note the sharply decelerating increases in conditional power when the number of observations is increased from 12 to 20 (an average increase of 15.65%), from 20 to 30 (an average increase of 8.69%), and from 30 to 40 (an average increase of 2.45%).

Fig. 7 Main effects of different numbers of observations on the conditional power of the RT

Apart from visually analyzing the results of the simulation study, we also evaluated the results by looking at the variation between conditions using a multiway analysis of variance (ANOVA). More specifically, we looked at the main effects of all experimental factors and the two-way interaction effects that were deemed theoretically meaningful (the interactions between design and ES measure, between characteristics of the data and ES measure, and between characteristics of the data and design). We did not include higher-level interactions because they are difficult to interpret. For each evaluated effect, we calculated the proportion of explained variance in order to distinguish the most important patterns in the results. All included effects of the ANOVA were significant at a significance level of .001. However, there were large discrepancies in the proportions of explained variance of the various effects. The size of the treatment effect has by far the largest proportion of explained variance (69.02%). This comes as no surprise, because we used a wide range of values for this simulation factor (employing treatment ESs from 0 to 2.5). The number of measurement occasions has the second largest proportion of explained variance (6.9%). Furthermore, the choice of ES measure explains 3.34% of the variance, which is still considerable given the wide range of levels of the previously discussed factors. The other effects included in the ANOVA each explained less than 1% of the variance, although all were significant at the .001 level, indicating that they do affect the conditional power of the RT. Furthermore, whereas the treatment effect is in practice not controlled by the researcher, and the number of measurement occasions is sometimes constrained by logistic or financial considerations, the ES measure and the design can be chosen by the researcher. Hence, it is reassuring that power can still be optimized by deliberate, smart choices in the design phase of the study, given the constraints of the research context.

The main results from the simulation study can be summarized as follows:

  • The MD and the NAP ES measures perform very similarly in terms of conditional power whereas the PND performs substantially worse.

  • The RRAD is the single-case design that on average yields the highest conditional power.

  • The conditional power of the RT is only minimally influenced by the characteristics of the null scores.

Discussion

In this article we investigated the conditional power of the RT using three different single-case ES measures (the MD, the NAP, and the PND) and three different randomized single-case designs with rapid treatment alternation (the CRD, the RRAD, and the RBD) for three types of simulated data (data from a standard normal distribution, data from a uniform distribution, and data from an autoregressive process with Gaussian errors) using a significance level of 5%.

The results were evaluated by visual analysis of power graphs and by decomposing the variance in the simulation results using ANOVA. The most important patterns in the results were identified by looking at the proportion of explained variance of each of the simulation factors.

With respect to the effect of the ES measure on the power of the RT, the results showed that the MD and the NAP perform very similarly whereas the PND performs substantially worse. This large discrepancy with the other two ES measures indicates that the PND has undesirable characteristics when it comes to evaluating intervention effects in SCEs. In fact, many authors have pointed out that the PND has two important limitations (e.g., Parker & Vannest, 2009; Shadish et al., 2008; Wolery et al., 2010). First, because only one control condition data point is used as a reference point to compare to the treatment data points, the PND is highly influenced by outliers. Second, the PND has a poor ability to discriminate between different magnitudes of a treatment effect. For example, when all treatment data points exceed the single highest control condition data point, it does not matter how large the nonoverlap is; the PND will always be 100%. Because of these limitations, authors have called for the abandonment of the PND in favor of other NES measures (e.g., Kratochwill et al., 2010; Parker & Vannest, 2009). In a similar vein, Campbell (2013, p. 24) stated, “I believe the PND methodology constitutes a first wave of SCD meta-analysis that is being followed by efforts to improve on inherent statistical limitations of PND.” The results of our simulation study add to these criticisms, in the sense that the PND yields substantially less power in the RT than the NAP and the MD despite the absence of outliers in the simulated data and regardless of the size of the treatment effect. As such, we agree with the critics of PND that this measure should be abandoned in favor of more recent NES measures (such as the NAP).

With respect to the effect of the design on the conditional power of the RT, the results showed that the RRAD was the most powerful design, followed by the CRD and then the RBD. We argue that the RRAD yields the highest power because it prevents the temporal clustering of measurement occasions while still allowing a large number of assignments. In contrast, the CRD allows more temporal clustering than the RRAD, whereas the RBD is more restrictive than the RRAD in terms of the number of possible assignments. Although the choice of design also depends on feasibility and on the concrete phenomenon that is studied, single-case researchers can take these findings regarding the influence of the single-case experimental design on statistical power into consideration whenever they are designing an SCE that is to be evaluated with an RT.

The results of the simulation study also showed that the power of the RT is only minimally affected by the distributional characteristics of the null scores or the presence of strong positive autocorrelation. This is an important finding because it has been shown that single-case data often contain autocorrelation and violate classic distributional assumptions (e.g., Adams & Anthony, 1996; Dugard, 2014; Edgington & Onghena, 2007; Ferron & Levin, 2014; Levin et al., 2014; Micceri, 1989). However, we should note that the effect of autocorrelation on the power of the RT depends on the type of single-case design that is being used. For example, Ferron and Onghena (1996) evaluated the effect of autocorrelation on the power of the RT using a design that randomly assigns treatments to more extended phases. In this type of design, the researcher specifies a number of equally long phases and then randomly assigns a treatment condition to each of the phases. The results showed that positive autocorrelation increased the power of the RT for this type of single-case design. In contrast, Ferron and Ware (1995) showed that positive autocorrelation decreases the power of the RT for single-case phase designs that use random assignment of intervention points.

Apart from the size of the treatment effect, the number of measurement occasions is the second most important determinant of the conditional power of the RT. However, our results showed a sharply decelerating increase in conditional power as the number of measurement occasions was increased. In other words, the larger the number of measurement occasions becomes, the smaller the subsequent increase in power. This information can be useful for SCEs in which the experiment should preferably be as short as possible (e.g., if the experiment has negative side effects for the participant): it can help researchers determine an optimal SCE length that balances having sufficient statistical power against minimizing discomfort for the participant.

We will now address some limitations of this simulation study. First of all, an obvious limitation is that the generalizability of our results is limited to the simulation conditions that were included. In particular, we considered only null scores generated from continuous distributions. For some applications, other continuous or discrete distributions might be more relevant. We would expect highly discrete distributions to compromise power, because they give rise to many ties both in the data themselves and in the reference distribution. The power values in the present simulation study can therefore be considered upper bounds. On the bright side, however, it should be noted that RTs always treat data in a discrete way and remain valid even if highly discrete data (e.g., data from skewed dichotomous distributions) are used.

A second limitation is that the unit-treatment additivity model that we used to model the treatment effect in this simulation study conceptualizes the treatment effect as a difference in level and assumes no other effects. As a consequence, we only evaluated ES measures that are sensitive to differences in level. Nevertheless, we should mention that the unit-treatment additivity model is a generally accepted model in nonparametric statistics and a standard for classical evaluations of nonparametric methods (e.g., Cox & Reid, 2000; Hinkelmann & Kempthorne, 2008; Lehmann, 1959; Welch & Gutierrez, 1988).

A third limitation of the present simulation study is that we used a Monte Carlo approach to approximate the exact conditional power. Although the Monte Carlo approach is computationally far more efficient, it introduces a random sampling (of assignments) error in the exact conditional power estimates. However, the magnitude of this error is a function of the number of random assignments that is used and can be determined analytically (see Edgington, 1969). More specifically, one can determine a confidence interval for the exact conditional power when it is approximated via Monte Carlo sampling. The bounds of a 99% confidence interval can be constructed using the following formula:

$$ \begin{aligned}
\text{lower bound} &= \frac{(k-1)p - 2.58\sqrt{(k-1)pq} + 1}{k}\\
\text{upper bound} &= \frac{(k-1)p + 2.58\sqrt{(k-1)pq} + 1}{k}
\end{aligned} $$

with k being the number of random assignments, p being the exact conditional power, q being 1 − p, and 2.58 being the 99.5th percentile of the standard normal distribution. For example, if the exact conditional power of a specific simulation condition is 80%, the 99% confidence interval of the Monte Carlo approximation with k = 1,000 is [77%, 83%]. Note that the Monte Carlo approach also causes the Type I error rates (i.e., the power when the treatment effect is zero) to deviate slightly from the specified significance level. When an exact RT is used (which uses all possible randomizations of the condition labels), the Type I error rate is always exactly equal to the specified significance level (Keller, 2012).
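A quick numerical check of this formula (our sketch) reproduces the interval reported above:

```python
import numpy as np

def mc_power_ci(p, k, z=2.58):
    """99% confidence interval for a Monte Carlo approximation of the
    exact conditional power p, based on k random assignments."""
    half_width = z * np.sqrt((k - 1) * p * (1 - p))
    lower = ((k - 1) * p - half_width + 1) / k
    upper = ((k - 1) * p + half_width + 1) / k
    return lower, upper

print(mc_power_ci(0.80, 1000))  # approximately (0.77, 0.83)
```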

A generally accepted standard for sufficient statistical power is 80% (Cohen, 1988). When the power of a test is 80%, the probability of a Type II error (= 1 − power) is 20%. If the conventional significance level of 5% is used, the ratio between the probability of a Type II error and that of a Type I error is then four to one (20% to 5%). A power smaller than 80% would result in too high a probability of a Type II error, whereas a power materially larger than 80% is likely to require data sets containing numbers of measurement occasions that would be infeasible to collect (Cohen, 1992). To make a practical recommendation regarding the number of measurement occasions that yields a power of at least 80% in the RT at a significance level of 5%, we must make an assumption about plausible ESs in single-case research. We previously mentioned that very high ESs of 3 or more are not uncommon in SCEs. In this case, the results of the simulation study with normally or uniformly distributed data show that an SCE with at least 20 measurement occasions is needed to obtain sufficient statistical power in the RT when used with randomized alternation designs at a significance level of 5%. With other designs and with highly discrete and skewed expected data, an even larger number of measurement occasions might be needed.

Finally, it is important to note that the conditional power framework that we used in this simulation study contains an apparent paradox. On the one hand, the conditional power of a specific null data set holds only for that specific data set (i.e., it is conditional on the null scores). On the other hand, an a priori power analysis always requires knowledge, or an assumption, about the expected data in order to guide recommendations for the number of observations to be included in the experiment. Invoking distributional assumptions for the power analysis would go against the conceptual framework of conditional power, as the latter makes no assumption of random sampling.

Similar to the approach taken by Keller (2012), we have tried to reconcile these two seemingly opposing requirements by calculating the conditional powers for a large number of null data sets (sampled from the probability distributions described in the Method section) and then averaging these conditional powers. As such, the individual conditional powers are free of distributional assumptions, although the average conditional powers for the conditions reported in Tables 2, 3, and 4 of this article do depend on the specific distributions from which the null data sets were generated.

In practice, a guesstimate of conditional power can be obtained by performing a small pilot study and using the pilot data to plan a subsequent larger data collection. The null scores in the unit-treatment additivity model can be reconstructed by calculating \( \widehat{\Delta} \) as the difference between the condition means and subtracting this difference from all the scores observed in the treatment condition:

$$ X_i^A = X_i^B - \widehat{\Delta} $$

Once we have null scores, we can calculate the exact conditional power for varying levels of Δ. Alternatively, different effect models can be explored, each leading to other null scores.

Collings and Hamilton (1988) and Hamilton and Collings (1991) also used this idea of a pilot study to determine the appropriate sample size, on the basis of the distribution-free power of nonparametric tests for location shift. More specifically, these authors proposed to bootstrap the pilot data (i.e., draw random samples with replacement from the pilot data) to form a large number of bootstrap data sets. The proportion of data sets that yield a p value smaller than or equal to the significance level α is defined as the power estimate for the test. By means of a simulation study, the authors showed empirically that their method yields reliable results for estimating the power of the Wilcoxon two-sample test. Their method is appealing because the bootstrap technique allows for bootstrap data sets of any number of observations (smaller or larger than the number of observations in the pilot data set) and because no distributional assumptions are needed. In a similar vein, we could use this bootstrap technique to generate a number of data sets from the pilot data. Next, we can calculate the conditional power for each of these data sets separately and obtain the average conditional power for all data sets together, without additional distributional assumptions. The only additional assumption is that the distributional shape of the pilot data is indicative of the distributional shape of the data that will finally be collected.
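The following Python sketch (our illustration; the pilot scores and planned series length are hypothetical) shows how this bootstrap idea could be combined with conditional power calculations:

```python
import numpy as np

rng = np.random.default_rng(1988)  # arbitrary seed

# Hypothetical pilot data from a small SCE
pilot_a = np.array([3.0, 4.0, 2.0])  # control condition scores
pilot_b = np.array([6.0, 5.0, 7.0])  # treatment condition scores

# Reconstruct the null scores under unit-treatment additivity
delta_hat = pilot_b.mean() - pilot_a.mean()
null_scores = np.concatenate([pilot_a, pilot_b - delta_hat])

# Bootstrap the null scores to the planned series length (here, 20),
# in the spirit of Collings and Hamilton (1988)
boot_sets = rng.choice(null_scores, size=(1000, 20), replace=True)

# Each bootstrap row can then be fed to a conditional power routine (such
# as the exact_conditional_power sketch above, or a Monte Carlo version
# for longer series) and the resulting powers averaged.
```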

Suggestions for further research

To keep the computational burden of the simulation study manageable, we had to limit the present study to the factors that were most relevant for answering our research questions. However, other interesting factors remain to be investigated in future research. These include the use of various other ES measures as test statistics in the RT, the inclusion of other statistical distributions for generating data (such as skewed, exponential, or bimodal distributions), the inclusion of additional AR values for data generated with the AR1 model, and the inclusion of other time-series structures (e.g., moving average models). Because count data are used regularly as single-case outcome measures, future simulation studies could focus on comparing the power of the RT for data generated from discrete probability distributions with its power for data generated from continuous probability distributions. In addition, future research could investigate the power of the RT in unbalanced alternation designs, as the single-case designs in this simulation study were all balanced.

In this study, we only modeled treatment effects that were defined as differences in level between experimental conditions. However, several single-case ES measures that look at trends exist (e.g., Tau-U, Parker et al., 2011; regression-based ES measures, Van den Noortgate & Onghena, 2003). Consequently, future research could focus on power analysis of the RT for ES measures that are sensitive to trend using simulated data that contain different types of trend effects. We previously mentioned that single-case phase designs are more frequently used in practice than single-case alternation designs. For this reason, future research could also investigate the conditional power of the RT for a variety of single-case phase designs.

Finally, another avenue for future research is to examine the theoretical relation between unconditional power and average conditional power as well as the possibility of obtaining distribution-free average conditional power using the bootstrap technique proposed by Collings and Hamilton (1988) and Hamilton and Collings (1991). For the latter it would be interesting to develop user-friendly statistical simulation software that assists in exploring the conditional power for a variety of distributional shapes and effect functions.

Conclusion

On the basis of the results of this simulation study, we would not recommend the use of the PND as an ES measure for the purpose of statistical inference using an RT in single-case alternation designs, because of its low statistical power. In contrast, the NAP yields power levels that are very similar to those of the MD, and as such provides a good alternative for researchers who want to use a nonoverlap measure to quantify the treatment effect in SCEs. With regard to the number of measurement occasions that are needed to ensure adequate statistical power in the RT, we recommend including at least 20 measurement occasions when working with alternation designs in the domain of single-case research and using a 5% significance level.