The conditional power of randomization tests for single-case effect sizes in designs with randomized treatment order: A Monte Carlo simulation study
Abstract
The conditional power (CP) of the randomization test (RT) was investigated in a simulation study in which three different single-case effect size (ES) measures were used as the test statistics: the mean difference (MD), the percentage of nonoverlapping data (PND), and the nonoverlap of all pairs (NAP). Furthermore, we studied the effect of the experimental design on the RT's CP for three different single-case designs with rapid treatment alternation: the completely randomized design (CRD), the randomized block design (RBD), and the restricted randomized alternation design (RRAD). As a third goal, we evaluated the CP of the RT for three types of simulated data: data generated from a standard normal distribution, data generated from a uniform distribution, and data generated from a first-order autoregressive Gaussian process. The results showed that the MD and NAP perform very similarly in terms of CP, whereas the PND performs substantially worse. Furthermore, the RRAD yielded marginally higher power in the RT, followed by the CRD and then the RBD. Finally, the power of the RT was almost unaffected by the type of the simulated data. On the basis of the results of the simulation study, we recommend at least 20 measurement occasions for single-case designs with a randomized treatment order that are to be evaluated with an RT using a 5% significance level. Furthermore, we do not recommend use of the PND, because of its low power in the RT.
Keywords
Single-case design · Randomization test · Statistical power · Nonoverlap effect sizes · Autocorrelation · Monte Carlo simulation study

Single-case experiments (SCEs) are designed experiments that include repeated measurements of a single entity (usually a person) for at least one dependent variable under different levels (i.e., treatments) of one or more independent variables (Barlow, Nock, & Hersen, 2009; Gast & Ledford, 2014; Kazdin, 2011; Kratochwill & Levin, 1992; Onghena, 2005).
Fields such as special education, school psychology, and clinical psychology are increasingly using SCEs to assess the efficacy of an intervention or treatment for a single subject (Alnahdi, 2015; Bowman-Perrott, Burke, de Marin, Zhang, & Davis, 2015; Hammond & Gast, 2010; Leong, Carter, & Stephenson, 2015; Moeller, Dattilo, & Rusch, 2015; Shadish & Sullivan, 2011; Smith, 2012; Swaminathan & Rogers, 2007). SCEs are also gaining in popularity in medical science (where they are often called "N-of-1 designs") to evaluate treatments for patients with, for instance, chronic pain or attention deficit hyperactivity disorder (Gabler, Duan, Vohra, & Kravitz, 2011). The recent development of guidelines for reporting the results of SCEs confirms the growing interest in these types of designs in the educational, behavioral, and health sciences (Shamseer et al., 2016; Tate, Togher, Perdices, McDonald, & Rosenkoetter, 2012).
Despite the growing popularity of SCEs, there is no broad consensus with respect to adequate data-analysis methods for these types of designs. As a result, a wide variety of methods is currently being used (often in combination with each other; Kratochwill et al., 2010; Maggin, O'Keeffe, & Johnson, 2011; Shadish, 2014). These methods can be broadly categorized into two main approaches: visual analysis and statistical analysis (Heyvaert, Wendt, Van den Noortgate, & Onghena, 2015). Visual analysis consists of inspecting graphed SCE data for changes in level, overlap between phases, variability, trend, immediacy of the effect, and consistency of data patterns across similar phases (Horner, Swaminathan, Sugai, & Smolkowski, 2012). Statistical analysis methods for SCE data can be subdivided into three groups: effect size calculation, statistical modeling, and statistical inference. Effect size calculation refers to determining the size of the treatment effect by calculating formal effect size (ES) measures. Examples include mean difference measures (e.g., Busk & Serlin, 1992; Hedges, Pustejovsky, & Shadish, 2012), measures based on data nonoverlap between phases (e.g., Parker, Hagan-Burke, & Vannest, 2007; Parker & Vannest, 2009; Parker, Vannest, & Brown, 2009; Parker, Vannest, Davis, & Sauber, 2011), and regression-based measures (e.g., Allison & Gorman, 1993; Center, Skiba, & Casey, 1985–1986; Solanas, Manolov, & Onghena, 2010; Van den Noortgate & Onghena, 2003; White, Rusch, Kazdin, & Hartmann, 1989). In statistical modeling, the goal is to devise a statistical model that provides an adequate conceptualization of the data. Examples include multilevel modeling (Van den Noortgate & Onghena, 2003), structural equation modeling (Shadish, Rindskopf, & Hedges, 2008), and interrupted time series analysis (Borckardt & Nash, 2014; Gottman & Glass, 1978).
Statistical inference refers to determining the statistical significance of ES measures through statistical hypothesis testing or to constructing confidence intervals for parameter estimates (Heyvaert, Wendt, Van den Noortgate, & Onghena, 2015; Michiels, Heyvaert, Meulders, & Onghena, 2017).
The present article deals with the inferential approach to evaluating treatment effects in single-case data. Inferential procedures can be parametric or nonparametric. However, parametric procedures such as statistical tests and confidence intervals based on t and F distributions are often not appropriate for analyzing SCE data, because the assumptions underlying these procedures (e.g., random sampling and more specific distributional assumptions) are often violated in many areas of behavioral research, and particularly in single-case research (e.g., Adams & Anthony, 1996; Dugard, 2014; Edgington & Onghena, 2007; Ferron & Levin, 2014; Levin, Ferron, & Gafurov, 2014; Micceri, 1989). In contrast, nonparametric procedures do not make specific distributional assumptions about the data.
One of these nonparametric procedures, the randomization test (RT), has been proposed by some researchers as an appropriate statistical test to evaluate treatment effects in randomized SCEs (i.e., SCEs that include random assignment of measurement occasions to treatment conditions; e.g., Bulté & Onghena, 2008; Edgington, 1967; Heyvaert & Onghena, 2014; Levin, Ferron, & Kratochwill, 2012; Onghena, 1992; Onghena & Edgington, 1994, 2005). The RT is based on the random assignment model, which assumes that each experimental unit has been randomly assigned to one of the levels of the independent variable (similar to the way individual subjects are randomly assigned to treatment conditions in a between-subjects design; Kempthorne, 1955).^{1} Furthermore, by randomly assigning measurement occasions to treatment conditions, all known and unknown confounding variables can be controlled in a statistical way. Consequently, a potential statistically significant treatment effect can be attributed to the experimental manipulation. An alternative model, which is adopted by most parametric statistical tests, is the random sampling model. In this model, data are assumed to have been randomly sampled from a specific population of interest. Because the random assignment model does not make an assumption of random sampling, any statistical inference made under this model is conditional on the data that are analyzed (Keller, 2012).
A common practical problem in designing experiments is determining the number of observations that is required for the statistical tests to have sufficient power. The power of a statistical test is defined as the probability of rejecting a false null hypothesis. A power of 80% is generally accepted as the minimal requirement for a statistical test (Cohen, 1988, 1992). Power analysis can provide guidelines for the minimum number of observations that is required in order to detect an effect of a certain size with a certain probability. In an SCE the minimum number of observations refers to the minimum number of measurement occasions for the single case.
Apart from selecting the number of measurement occasions, the single-case researcher must also make other choices when designing a randomized SCE. More specifically, one must select a specific design, which determines the type of random assignment that is used in the SCE. In addition, the choice of an adequate ES measure is obviously important. All the aforementioned choices that are made when designing a randomized SCE have an effect on the power of the RT (Keller, 2012). It is thus important for scientific practice to systematically investigate the effect of these factors on the power of the RT.
Several simulation studies concerning the power of the RT for different types of single-case designs and data patterns have already been performed (e.g., Ferron & Onghena, 1996; Ferron & Sentovich, 2002; Ferron & Ware, 1995; Heyvaert et al., 2017; Levin, Ferron, & Gafurov, 2014; Levin, Ferron, & Kratochwill, 2012; Manolov, Solanas, Bulté, & Onghena, 2010; Onghena, 1994). Although these simulation studies provide valuable information regarding the power of the RT in the context of analyzing SCEs, previous research has not yet systematically investigated one important determinant of the RT's power: the ES measure that is used as the test statistic. Furthermore, all previous simulation studies that examined the power of RTs for single-case designs have used a random sampling conceptualization of statistical power, the so-called "unconditional power," although a random assignment conceptualization, the so-called "conditional power," is more consistent with the RT framework (Keller, 2012). With this article, we aim to fill both gaps.
With respect to the effect of the employed ES measure on the RT's power, we focused on nonoverlap effect size (NES) measures, which are currently receiving considerable attention from the single-case community as measures for quantifying treatment effects in SCEs (e.g., Heyvaert, Saenen, Campbell, Maes, & Onghena, 2014; Lenz, 2012; Wolery, Busick, Reichow, & Barton, 2010). NES measures are rooted either in the tradition of visually analyzing single-case data or in the tradition of nonparametric rank statistics, and they quantify the extent to which the data points of the different conditions do not overlap. Following the approach proposed by Heyvaert and Onghena (2014), we will use these NES measures as test statistics in an RT. More specifically, we included the percentage of nonoverlapping data (PND; Scruggs, Mastropieri, & Casto, 1987) and the nonoverlap of all pairs (NAP; Parker & Vannest, 2009) in our study.
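As a concrete reference for these two measures, here is a minimal sketch (our illustration, based on the published definitions; both functions are written for a treatment that is expected to increase the scores in the B condition):

```python
def pnd(a_scores, b_scores):
    """Percentage of nonoverlapping data (Scruggs et al., 1987): the
    percentage of B-phase points that exceed the single highest A-phase
    point (for an expected increase)."""
    ceiling = max(a_scores)
    return 100.0 * sum(b > ceiling for b in b_scores) / len(b_scores)

def nap(a_scores, b_scores):
    """Nonoverlap of all pairs (Parker & Vannest, 2009): the proportion
    of all (A, B) pairs in which the B score is the higher one, with
    ties counted as half an overlap."""
    total = sum(1.0 if b > a else 0.5 if b == a else 0.0
                for a in a_scores for b in b_scores)
    return total / (len(a_scores) * len(b_scores))
```

For instance, with hypothetical A scores (2, 3, 3) and B scores (4, 5, 3), `pnd` returns 66.7 (the tied B point does not exceed the A ceiling), whereas `nap` returns 8/9 ≈ .89, illustrating that the NAP discriminates more finely among partial-overlap patterns.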
The second goal of this simulation study was to evaluate the power of the RT in a conditional power framework. In the previously cited simulation studies, a random sampling model was used to generate the data for calculating the statistical power of the RT. Because the RT does not make an assumption of random sampling, evaluating its statistical power under a random sampling model does not do justice to the RT. As was demonstrated by Keller (2012), it is conceptually more appropriate to evaluate the statistical power of the RT by generating data using a compatible random assignment model. The resulting statistical power estimates are called "conditional power" estimates, because the estimates are conditional on a specific data set (see also Corcoran & Mehta, 2002; Gabriel & Hsu, 1983; Kempthorne, 1955; Kempthorne & Doerfler, 1969; Pesarin & De Martini, 2002; Pratt & Gibbons, 1981).^{2} We elaborate this conditional power analysis approach in the Method section and explain how it is combined with the three data-generating processes that we used.
An additional goal of the present study was to investigate the effect of specific characteristics of the data on the power of the RT. Research has shown that data from single-case designs can contain autocorrelation (e.g., Shadish & Sullivan, 2011; Solomon, 2014). To account for this possibility, we generated data that were not autocorrelated (independent standard normally distributed data) as well as data that contained strong positive autocorrelation (generated from a first-order autoregressive Gaussian process). In addition, we generated data from a uniform distribution (with a population standard deviation of 1) to evaluate the power of the RT in a situation in which classic distributional assumptions are severely violated.
Permissible assignments for a balanced CRD with six measurement occasions (k = 20):
AAABBB  BBBAAA
AABABB  BBABAA
AABBAB  BBAABA
AABBBA  BBAAAB
ABAABB  BABBAA
ABABAB  BABABA
ABABBA  BABAAB
ABBAAB  BAABBA
ABBABA  BAABAB
ABBBAA  BAAABB
Permissible assignments for an RBD with six measurement occasions and a block size of two (k = 8):
ABABAB  BABABA
ABABBA  BABAAB
ABBAAB  BAABBA
ABBABA  BAABAB
RBDs can be used to counter the effect of time-related confounding variables on the dependent variable. For example, suppose a researcher wishes to conduct an SCE that evaluates the effect of a behavioral treatment on a depressed patient's feelings of negative affect. If the researcher knows that the level of negative affect of the patient can fluctuate from day to day, irrespective of the treatment, an RBD can be used to control for this confounding factor. Suppose the experiment consists of ten days (i.e., ten blocks), and on each day the researcher administers both treatments (i.e., the control condition and the treatment condition) and records the patient's level of negative affect after each treatment (i.e., two negative affect scores per day). Because the sequence of conditions within a day (i.e., block) is determined randomly for every day, a potential significant treatment effect cannot be attributed to the day-to-day fluctuations in negative affect, but only to the behavioral treatment.
Permissible assignments for an RRAD with six measurement occasions and at most two adjacent occasions from the same condition (k = 14):
AABABB  BBABAA
AABBAB  BBAABA
ABAABB  BABBAA
ABABAB  BABABA
ABABBA  BABAAB
ABBAAB  BAABBA
ABBABA  BAABAB
Note that the entire set of RBD assignments is present in the set of RRAD assignments.
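These three randomization schemes can be enumerated directly. The following sketch (our illustration, not code from the study) reproduces the assignment sets above for six measurement occasions (20 for the CRD, 8 for the RBD with block size two, and 14 for the RRAD with at most two adjacent occasions from the same condition) and confirms that the RBD set is contained in the RRAD set:

```python
from itertools import permutations, product

def crd(n):
    """Balanced CRD: all orderings with n/2 occasions per condition."""
    return set(permutations("A" * (n // 2) + "B" * (n // 2)))

def rbd(n):
    """RBD with block size two: each block of two occasions is AB or BA."""
    return {tuple("".join(blocks))
            for blocks in product(["AB", "BA"], repeat=n // 2)}

def rrad(n, max_run=2):
    """RRAD: balanced assignments with at most max_run adjacent
    occasions from the same condition."""
    def longest_run(seq):
        best = run = 1
        for prev, cur in zip(seq, seq[1:]):
            run = run + 1 if cur == prev else 1
            best = max(best, run)
        return best
    return {a for a in crd(n) if longest_run(a) <= max_run}

counts = (len(crd(6)), len(rbd(6)), len(rrad(6)))  # (20, 8, 14)
rbd_subset_of_rrad = rbd(6) <= rrad(6)             # True: RBD is nested in RRAD
```

The set-inclusion check at the end verifies the note above: every RBD assignment alternates within blocks of two, so it can never contain a run longer than two and therefore also satisfies the RRAD restriction.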
Method
The Method section contains three parts. The first part introduces the RT as a method of evaluating treatment effects in SCEs. The second part discusses how the conditional power of the RT is calculated. Finally, the third part details the design matrix of the simulation study.
Evaluating treatment effects in single-case experiments
First, one determines the set of permissible assignments given the chosen design. For a balanced CRD with four measurement occasions, there are six permissible assignments:
AABB
ABAB 
ABBA 
BAAB 
BABA 
BBAA 
Second, one of the permissible assignments is randomly selected for conducting the actual experiment. Suppose the selected assignment is ABBA.
Third, one chooses an ES measure that is deemed adequate to answer the research question. This ES measure will be the test statistic of the RT. Suppose we choose the MD between the A and the B conditions as the test statistic for this RT. Note that in order to test a two-sided alternative hypothesis, the test statistic must be sensitive to both directions of a possible effect. In this case, we will use the absolute mean difference between the A and the B conditions.
Fourth, the value of the test statistic is calculated for each of the permissible assignments. These values make up the randomization distribution. This collection of values is used as a reference distribution to calculate the statistical significance of the observed test statistic.
As a fifth step, the two-sided p value of the RT is calculated as the proportion of test statistics in the randomization distribution that are at least as extreme as the observed test statistic. When looking at the randomization distribution, we can see that two of the six permissible assignments lead to a test statistic value at least as extreme as the observed test statistic, which results in a two-sided p value of 1/3. This p value provides a probabilistic statement about observing the data under the null hypothesis that the conditions are unrelated to the data, and the validity of this statement is guaranteed by the randomization of the conditions. If that null hypothesis were true, then there would be a probability of 1/3 of obtaining a test statistic value as extreme as the one observed (Edgington & Onghena, 2007).
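The five steps above can be sketched in a few lines of code. This is a minimal illustration (not the authors' implementation), using hypothetical scores observed under the assignment ABBA and the absolute mean difference as the test statistic:

```python
from itertools import permutations

def abs_mean_diff(data, labels):
    """Absolute mean difference between the B and the A conditions."""
    a = [x for x, l in zip(data, labels) if l == "A"]
    b = [x for x, l in zip(data, labels) if l == "B"]
    return abs(sum(b) / len(b) - sum(a) / len(a))

def randomization_test(data, labels, stat=abs_mean_diff):
    """Two-sided RT p value: the proportion of permissible assignments
    whose test statistic is at least as extreme as the observed one."""
    observed = stat(data, labels)
    # Permissible assignments of a balanced CRD: all distinct orderings
    # of the condition labels.
    assignments = set(permutations(labels))
    extreme = sum(stat(data, a) >= observed for a in assignments)
    return extreme / len(assignments)

# Hypothetical scores, observed under the assignment ABBA:
p = randomization_test([2.0, 4.5, 6.5, 3.0], ("A", "B", "B", "A"))  # p = 1/3
```

With these hypothetical scores, two of the six assignments (ABBA itself and its mirror BAAB) yield an absolute mean difference of 3.0, so the two-sided p value is 2/6 = 1/3, mirroring the example in the text.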
Note that this example was chosen only for didactic purposes, as it is evident that an SCE with only four measurement occasions can never yield a p value that is smaller than any conventional significance level. Without performing any simulations, we can already infer that an SCE with only four measurement occasions has zero statistical power for all practical purposes.
The main advantages of the RT are that it makes no distributional assumptions and no assumption of random sampling. These advantages are important because it has been shown that the assumptions underlying parametric tests (e.g., random sampling or specific distributional assumptions) are doubtful in many domains of behavioral research, and particularly for single-case research (e.g., Adams & Anthony, 1996; Dugard, 2014; Edgington & Onghena, 2007; Ferron & Levin, 2014; Levin et al., 2014; Micceri, 1989). Other advantages of the RT as compared to parametric tests are its flexibility with regard to the choice of test statistic and the choice of experimental design (Ferron & Sentovich, 2002; Onghena, 1992; Onghena & Edgington, 2005).
Power analysis in the random assignment model
The RT produces so-called "conditional inferences"—that is, inferences that are conditional on the observed data, just as Fisher's exact test is conditional on the marginal totals (Agresti, 1992; Krauth, 1988). Consequently, when investigating the power of the RT, it makes the most sense to use this conditional framework too, and to calculate the so-called "conditional power" (i.e., the power of the RT for a specific data set). The advantage of this conceptualization is that the conditional power calculations are consistent with the random assignment model, which also underlies the validity of the RT, and that no assumption of random sampling is required. For the calculation of conditional power, only an additional assumption about the treatment effect is necessary, just as one needs an assumption about the effect size parameter for the calculation of unconditional power.
To calculate the unconditional power of the RT, one would generate a large number of data sets (with fixed condition labels) sampled from a known distribution, and calculate the proportion of data sets that yield a p value smaller than or equal to a predefined significance level α. In contrast, to calculate the conditional power of the RT, one starts with a fixed set of scores that would be observed if the null hypothesis of no treatment effect were true (the "null scores"). Next, one generates all possible randomizations of the condition labels and constructs all possible data sets by pairing the null scores with the condition labels; null scores that are assigned to the treatment condition are transformed into observed scores containing the treatment effect. The conditional power is calculated as the proportion of those data sets that yield a p value smaller than or equal to α. Importantly, the unconditional power of the RT is defined in relation to the repeated random sampling of data sets from a known distribution, whereas the conditional power of the RT is defined in relation to the repeated random assignment of condition labels to a specific set of null scores. Consequently, for the calculation of conditional power, one does not need to make an assumption of random sampling. Note that this also implies that the resulting conditional power pertains only to that specific set of null scores.
Under the null hypothesis, the conditions have no effect on the scores, which implies that the observed score for experimental unit i is independent of the condition to which it is assigned. Note that the null scores are assumed to be known in order to calculate conditional power, just as the distributional form has to be known in order to calculate unconditional power.
The conditional power of the RT is determined by the significance level of the test, the number of observations, the size of the treatment effect, the employed test statistic, and the effect function (Keller, 2012). Whereas unconditional power is defined as the percentage of rejections of the null hypothesis across a given number of samples drawn from a certain population distribution, the conditional power is defined as the percentage of random assignments of treatment conditions to experimental units that result in a rejection of the null hypothesis, given an assumed treatment effect (Kempthorne & Doerfler, 1969).
To calculate the exact conditional power of the RT for a specific data set, a few steps must be carried out. To begin, we must choose a singlecase design, a number of observations, and a test statistic to use in the RT. Next, we generate one set of null scores for the chosen number of observations. We then obtain all permissible assignments of the employed randomization scheme for the chosen number of observations. If there are k permissible assignments, we then construct k different data sets from the null scores by adding the treatment effect to the null scores of the measurement occasions in the treatment condition. Next, we perform the RT for each of the k data sets from the previous step and record whether or not the null hypothesis is rejected at a prespecified significance level. The exact conditional power is then defined as the overall proportion of rejected null hypotheses across the k RTs.
Notice that the RT is a computer-intensive method and that the calculation of the exact conditional power is "computer-intensive squared." As the number of observations rises, k increases exponentially, and for the exact conditional power the RT is repeated k times, resulting in a total of k^{2} calculations.
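To make this growth concrete (our illustration, not from the article): for a balanced CRD, k equals the binomial coefficient C(n, n/2), so both k and the k² cost of exact conditional power explode as n grows:

```python
from math import comb

# k = number of permissible assignments of a balanced CRD with n occasions.
for n in (4, 12, 20, 30, 40):
    k = comb(n, n // 2)
    print(f"n = {n:2d}: k = {k:,}  k^2 = {k * k:,}")
```

For n = 40, k = 137,846,528,820, so an exact conditional power calculation would require on the order of 1.9 × 10^22 test statistic evaluations.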
Calculation of the RT's exact conditional power

1) Obtain a set of n (e.g., 4) null scores.
   Example: 2, 3, 5, 3
2) Generate all permissible assignments according to the employed single-case design (e.g., a balanced CRD).
   Example: AABB, ABAB, ABBA, BAAB, BABA, BBAA
3) Apply the treatment effect (e.g., Δ = 1.5) to the B scores for each of the k permissible assignments, resulting in k data sets.
   k = 1 (AABB): 2,       3,       5 + 1.5, 3 + 1.5
   k = 2 (ABAB): 2,       3 + 1.5, 5,       3 + 1.5
   k = 3 (ABBA): 2,       3 + 1.5, 5 + 1.5, 3
   k = 4 (BAAB): 2 + 1.5, 3,       5,       3 + 1.5
   k = 5 (BABA): 2 + 1.5, 3,       5 + 1.5, 3
   k = 6 (BBAA): 2 + 1.5, 3 + 1.5, 5,       3
4) Execute the RT for each of the k data sets, resulting in k p values.
   k = 1: p = .33;  k = 2: p = 1;  k = 3: p = .33;  k = 4: p = 1;  k = 5: p = .33;  k = 6: p = 1
5) The exact conditional power of the RT is the proportion of the k p values that are smaller than or equal to the chosen α level (e.g., α = 1/3).
   Half of the k p values are smaller than or equal to 1/3, so the exact conditional power is 50%.
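The worked example above can be reproduced programmatically. The following sketch (our code, under the same assumptions: null scores 2, 3, 5, 3; Δ = 1.5; a balanced CRD; the absolute mean difference; α = 1/3) returns the 50% figure from step 5:

```python
from itertools import permutations

def abs_mean_diff(data, labels):
    a = [x for x, l in zip(data, labels) if l == "A"]
    b = [x for x, l in zip(data, labels) if l == "B"]
    return abs(sum(b) / len(b) - sum(a) / len(a))

def rt_pvalue(data, labels, assignments):
    """Two-sided RT p value against the full set of permissible assignments."""
    observed = abs_mean_diff(data, labels)
    return sum(abs_mean_diff(data, a) >= observed
               for a in assignments) / len(assignments)

def exact_conditional_power(null_scores, delta, alpha):
    """Steps 2-5: build one data set per permissible assignment by adding
    delta to its B scores, run the RT on each data set, and return the
    proportion of p values at or below alpha."""
    n = len(null_scores)
    labels = "A" * (n // 2) + "B" * (n // 2)
    assignments = sorted(set(permutations(labels)))  # balanced CRD
    p_values = []
    for a in assignments:
        data = [x + delta if l == "B" else x
                for x, l in zip(null_scores, a)]
        p_values.append(rt_pvalue(data, a, assignments))
    return sum(p <= alpha for p in p_values) / len(p_values)

power = exact_conditional_power([2, 3, 5, 3], delta=1.5, alpha=1/3)  # 0.5
```

The six p values come out as 1/3, 1, 1/3, 1, 1/3, 1, matching step 4, so the function returns 0.5, the 50% value from step 5.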
For very small data sets such as in this example, it becomes apparent that the conditional power curve of the RT is actually a stepwise function. The function is stepwise because the conditional power is determined by the proportion of the k RTs that yield a p value smaller than or equal to the significance level α. If k is a small number, then only multiples of 1/k are possible conditional power values. For larger data sets, the number of possible steps is quite large so that the power curve becomes indistinguishable from a continuous function.
Design matrix of the simulation study
1. Characteristics of the data. To investigate the effect of different types of data on the conditional power of the RT, data were generated from a standard normal distribution and a uniform distribution (with a population standard deviation of 1). We selected the normal and uniform distributions because of their simplicity and ubiquity. Because both distributions were also used in the simulation study of Keller (2012), we could use his results as a benchmark. We added a first-order autoregressive model with Gaussian errors (AR1). The autoregressive parameter (AR) quantifies the autocorrelation in the data. We hypothesized that the presence of positive autocorrelation would have little influence in the selected single-case designs because of their fast alternation of the experimental conditions. Some pilot testing with small and medium levels of autocorrelation supported this hypothesis. For this reason (and in order to keep the number of experimental conditions manageable), we included only one, rather large, level of positive autocorrelation: AR = 0.6. Because the variance of an AR1 model is
$$ \frac{\sigma_e^2}{1 - AR^2} $$
and we sampled e from a standard normal distribution (σ_e^2 = 1) with AR = 0.6, the variance of the AR1 model is 1.5625. We used only three types of data (standard normal, uniform, and AR1) to keep the simulation study feasible in terms of design and duration, and because we did not expect the distributional shape to have a large impact on the power.
2. Test statistics used in the RT. Three different ES measures were used as the test statistic in the RT: the PND, the NAP, and the MD. The main reason for including the PND in this simulation study is that it is the most widely used NES (Maggin et al., 2011; Schlosser et al., 2008). As such, we believe it is of great importance to investigate the PND's usability for statistical inference. The NAP was included because it was introduced to address the statistical limitations of the PND (Parker & Vannest, 2009). To compare the performance of the selected NESs with a more generally accepted test statistic, we also included the mean difference (MD) in our simulation study. All test statistics were formulated in a nondirectional way, so we consider only two-sided p values in this simulation study.
3. Designs. Three different single-case alternation designs were investigated: the CRD, the RBD, and the RRAD (see above for details). In this study, we limited our investigation to designs with two conditions (a control condition and a treatment condition). The CRD entails full random assignment of the condition labels, with the only restriction that each condition must contain the same number of measurement occasions in each assignment. The RBD uses a form of randomization that groups measurement occasions into blocks of a certain size (we used a block size of two observations) and then randomizes the measurement occasions within blocks. The RRAD uses full random assignment with the restriction that the maximum number of adjacent measurement occasions from the same condition is limited to a prespecified value (Onghena & Edgington, 1994). In alignment with the What Works Clearinghouse (WWC) standards' recommendation to have at least three measurement occasions in a "phase" (Kratochwill et al., 2010), there could never be more than two adjacent measurement occasions from the same condition in the RRAD randomization scheme. Note that all the designs were balanced (i.e., they contain the same number of measurement occasions in each condition).
4. Size of the treatment effect. Our choice of treatment ESs was guided by reported ESs in various domains of single-case research. ESs in single-case research are generally larger than in between-subjects research and are sometimes extremely high (Busk & Serlin, 1992). For example, Fabiano et al. (2009) performed meta-analyses of behavioral treatments for attention-deficit/hyperactivity disorder for various study designs and found average ESs of 0.83 and 3.78 for between-subjects studies and single-case studies, respectively. In a similar vein, two single-case meta-analyses concerning interventions for reducing challenging behavior in persons with intellectual disabilities resulted in average ESs of approximately 3 (Heyvaert, Maes, Van den Noortgate, Kuppens, & Onghena, 2012; Heyvaert, Saenen, Maes, & Onghena, 2014). With these results in mind, we included six levels of the treatment effect: 0, 0.5, 1, 1.5, 2, and 2.5. On the basis of pilot simulation testing, we set 2.5 as the maximum ES in our simulation study, because the conditional power for this ES was already 100% in almost all conditions. Although the size of the treatment effect in empirical single-case research can vary greatly depending on the specific domain, the current selection of ESs covers the entire range of frequently reported empirical ESs.
5. Number of measurement occasions. The selected numbers of measurement occasions were 12, 20, 30, and 40, chosen to cover the range of common series lengths in empirical research. For example, Ferron, Farmer, and Owens (2010) performed a survey that found average series lengths ranging from 7 to 58, with a median of 24. A survey by Shadish and Sullivan (2011) found an average of 20 measurement occasions per individual time series. Note that the smallest number of measurement occasions was set at 12 rather than 10, because an RT using an RBD cannot reach a 5% significance level with only 10 measurement occasions (2^{5} = 32 possible assignments, and because we consider two-sided tests, the smallest possible p value is 2/32 = 1/16, which is larger than 1/20).
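As an illustration of the AR1 data-generating process described under factor 1 (a sketch, not the authors' simulation code), the series can be generated by drawing the first observation from the stationary distribution, so that every observation has variance σ_e²/(1 − AR²) = 1.5625 for AR = 0.6:

```python
import random

def ar1_series(n, ar=0.6, seed=None):
    """Stationary AR(1) process with standard normal errors:
    y[t] = ar * y[t-1] + e[t], with e[t] ~ N(0, 1)."""
    rng = random.Random(seed)
    stationary_sd = (1.0 / (1.0 - ar ** 2)) ** 0.5  # sqrt(1.5625) = 1.25
    y = rng.gauss(0.0, stationary_sd)  # start in the stationary distribution
    series = [y]
    for _ in range(n - 1):
        y = ar * y + rng.gauss(0.0, 1.0)
        series.append(y)
    return series
```

Starting from the stationary distribution avoids a burn-in period: the variance of every observation, not just the later ones, equals the theoretical value of 1.5625.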
Crossing the levels of these five factors resulted in a total of 3 × 3 × 3 × 4 × 6 = 648 conditions. For each condition, 100 null data sets were generated and the conditional power was averaged across these 100 replications. For each replication, the conditional power was calculated using Kempthorne and Doerfler’s (1969) method. The significance level was set at 5%.
Despite rapid advancements in computing speed and exact algorithms, it is usually not feasible to calculate the exact conditional power of the RT due to the exponential increase of the computational demand when the number of observations grows larger (Keller, 2012). An alternative to exact computation is analytical approximation. For example, Gabriel and Hsu (1983) showed that their analytical approximation of the RT’s exact conditional power only slightly overestimates the true power. However, their approximation shows larger biases for small numbers of observations or skewed treatment effects. In situations in which the number of observations is too large for exact computation but too small for precise analytic approximation, Monte Carlo approximation is a good alternative (Senchaudhuri, Mehta, & Patel, 1995). In this approach, only a random sample of all permissible assignments is used to calculate the conditional power of the RT. For a single RT it has been shown that the Monte Carlo RT produces valid p values (Edgington & Onghena, 2007). Furthermore, the accuracy of the random sampling can be increased to the desired level simply by increasing the number of random assignments that are drawn from the set of all permissible assignments (Senchaudhuri et al., 1995). Edgington (1969) pointed out that an efficient Monte Carlo RT can already be carried out with 1,000 random assignments. To keep the simulation study computationally manageable and without having to resort to analytical approximations, we used Monte Carlo sampling for the average conditional power calculation. More specifically, we selected 1,000 random assignments for each null sample, which resulted in 1,000 data sets for the conditional power calculation of each null sample. 
The conditional power for each null sample was calculated by performing the RT on each of these 1,000 data sets using 1,000 random assignments of the condition labels and by determining the proportion of p values that were equal to or smaller than .05. The average conditional power for each condition in the simulation study was obtained by averaging the conditional powers of the 100 null samples.
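This doubly Monte Carlo procedure can be sketched as follows (a simplified illustration for a CRD with the absolute mean difference as the test statistic; this is not the authors' actual code, and in the study both loops used 1,000 assignments):

```python
import random

def abs_mean_diff(data, labels):
    a = [x for x, l in zip(data, labels) if l == "A"]
    b = [x for x, l in zip(data, labels) if l == "B"]
    return abs(sum(b) / len(b) - sum(a) / len(a))

def random_assignment(n, rng):
    """Draw one random balanced CRD assignment of n condition labels."""
    labels = list("A" * (n // 2) + "B" * (n // 2))
    rng.shuffle(labels)
    return labels

def mc_rt_pvalue(data, labels, n_resamples, rng):
    """Monte Carlo RT: the observed assignment is counted together with
    n_resamples - 1 randomly drawn assignments, which keeps the p value valid."""
    observed = abs_mean_diff(data, labels)
    hits = 1  # the observed assignment itself
    for _ in range(n_resamples - 1):
        hits += abs_mean_diff(data, random_assignment(len(data), rng)) >= observed
    return hits / n_resamples

def mc_conditional_power(null_scores, delta, alpha=0.05,
                         n_assignments=1000, seed=None):
    """Sample assignments, build the data set each assignment implies by
    adding delta to its B scores, run a Monte Carlo RT on every data set,
    and return the proportion of p values at or below alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_assignments):
        labels = random_assignment(len(null_scores), rng)
        data = [x + delta if l == "B" else x
                for x, l in zip(null_scores, labels)]
        rejections += mc_rt_pvalue(data, labels, n_assignments, rng) <= alpha
    return rejections / n_assignments
```

With a zero treatment effect, the estimated conditional power stays near the significance level, and it rises toward 100% for the large treatment effects included in the study.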
Results
Figure 4 shows that the MD and the NAP perform very similarly (an average difference of 1.74% in favor of the MD), whereas use of the PND as the test statistic in the RT yields substantially lower power. More specifically, the average power difference between the MD and the PND is 16.36% across the range of treatment ESs.
Apart from visually analyzing the results of the simulation study, we also evaluated the results by looking at the variation between conditions using a multiway analysis of variance (ANOVA). More specifically we looked at the main effects of all experimental factors and twoway interactions effects that were deemed theoretically meaningful (an interaction effect between design and ES measure, between characteristics of the data and ES measure, and between characteristics of the data and design). We did not include higherlevel interactions because they are difficult to interpret. For each evaluated effect, we calculated the proportion of explained variance in order to distinguish between the most important patterns in the results. All included effects of the ANOVA were significant at a significance level of .001. However, there are large discrepancies in the proportions of explained variance of the various effects. The size of the treatment effect by far has the largest proportion of explained variance (69.02%). This comes as no surprise because we used a wide range of values for this simulation factor (employing treatment ESs from 0 to 2.5). The number of measurement occasions has the second largest proportion of explained variance (6.9%). Furthermore, the choice of ES measure explains 3.34% of the variance, which is still considerable given the wide range of levels of the previously discussed factors. The other effects included in the ANOVA all explained less than 1% of the variance on their own. Nevertheless, they were all significant at the .001 level indicating that they have a significant effect on the conditional power of the RT. Furthermore, whereas the treatment effect is in practice not controlled by the researcher, and the number of measurement occasions is sometimes constrained by logistic or financial considerations, the ES measure and the design can be chosen by the researcher. 
Hence, it is reassuring to see that, given the constraints of the research context, power can still be optimized through deliberate choices in the design phase of the study.
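The "proportion of explained variance" used to rank the simulation factors corresponds to the classic eta-squared ratio, SS_effect / SS_total. As a minimal one-factor sketch (not the authors' actual analysis code; the data values are hypothetical):

```python
def eta_squared(values, groups):
    # Proportion of total variance explained by one factor: SS_between / SS_total
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = 0.0
    for g in set(groups):
        vs = [v for v, gr in zip(values, groups) if gr == g]
        mean_g = sum(vs) / len(vs)
        ss_between += len(vs) * (mean_g - grand) ** 2
    return ss_between / ss_total

# Hypothetical power values for two levels of an "ES measure" factor
powers = [0.62, 0.64, 0.45, 0.47]
measures = ["MD", "MD", "PND", "PND"]
print(round(eta_squared(powers, measures), 3))
```

In a full multiway ANOVA each effect's sum of squares is divided by the same total, so the proportions for all effects can be compared directly.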

The MD and the NAP ES measures perform very similarly in terms of conditional power, whereas the PND performs substantially worse.

The RRAD is the singlecase design that on average yields the highest conditional power.

The conditional power of the RT is only minimally influenced by the characteristics of the null scores.
Discussion
In this article we investigated the conditional power of the RT using three different singlecase ES measures (the MD, the NAP, and the PND) and three different randomized singlecase designs with rapid treatment alternation (the CRD, the RRAD, and the RBD) for three types of simulated data (data from a standard normal distribution, data from a uniform distribution, and data from an autoregressive process with Gaussian errors) using a significance level of 5%.
The results were evaluated by visual analysis of power graphs and by decomposing the variance in the simulation results using ANOVA. The most important patterns in the results were identified by looking at the proportion of explained variance of each of the simulation factors.
With respect to the effect of the ES measure on the power of the RT, the results showed that the MD and the NAP perform very similarly whereas the PND performs substantially worse. This large discrepancy with the other two ES measures indicates that the PND has undesirable characteristics when it comes to evaluating intervention effects in SCEs. In fact, many authors have pointed out that the PND has two important limitations (e.g., Parker & Vannest, 2009; Shadish et al., 2008; Wolery et al., 2010). First, because only one control condition data point is used as a reference point to compare to the treatment data points, the PND is highly influenced by outliers. Second, the PND has a poor ability to discriminate between different magnitudes of a treatment effect. For example, when all treatment data points exceed the single highest control condition data point, it does not matter how large the nonoverlap is; the PND will always be 100%. Because of these limitations, authors have called for the abandonment of the PND in favor of other NES measures (e.g., Kratochwill et al., 2010; Parker & Vannest, 2009). In a similar vein, Campbell (2013, p. 24) stated, “I believe the PND methodology constitutes a first wave of SCD metaanalysis that is being followed by efforts to improve on inherent statistical limitations of PND.” The results of our simulation study add to these criticisms, in the sense that the PND yields substantially less power in the RT than the NAP and the MD despite the absence of outliers in the simulated data and regardless of the size of the treatment effect. As such, we agree with the critics of PND that this measure should be abandoned in favor of more recent NES measures (such as the NAP).
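For concreteness, the two nonoverlap measures can be sketched as follows (a minimal illustration assuming higher scores indicate improvement; this is not the simulation code used in the study). The example data show the outlier sensitivity discussed above: a single high control point drives the PND to zero while the NAP still registers substantial nonoverlap.

```python
def pnd(baseline, treatment):
    # Percentage of treatment points exceeding the single highest baseline point
    ceiling = max(baseline)
    return 100.0 * sum(t > ceiling for t in treatment) / len(treatment)

def nap(baseline, treatment):
    # Nonoverlap of all pairs: share of (baseline, treatment) pairs in which the
    # treatment point is higher; ties count as half an overlap
    score = sum(1.0 if t > b else 0.5 if t == b else 0.0
                for b in baseline for t in treatment)
    return score / (len(baseline) * len(treatment))

print(pnd([2, 3, 4], [5, 6, 7]))            # 100.0: complete nonoverlap
print(pnd([2, 3, 10], [5, 6, 7]))           # 0.0: one baseline outlier erases the effect
print(round(nap([2, 3, 10], [5, 6, 7]), 2)) # 0.67: NAP still detects partial nonoverlap
```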
With respect to the effect of the design on the conditional power of the RT, the results showed that the RRAD was the most powerful design, followed by the CRD and then the RBD. We argue that the RRAD yields the highest power because it restricts the temporal clustering of measurement occasions while still allowing a large number of assignments. In contrast, the CRD allows more temporal clustering than the RRAD, whereas the RBD is more restrictive than the RRAD in terms of the number of possible assignments. Although the choice of design also depends on feasibility and on the concrete phenomenon under study, singlecase researchers can take these findings on the influence of the singlecase experimental design on statistical power into consideration whenever they design an SCE that is to be evaluated using an RT.
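The three randomization schemes differ only in which balanced sequences they admit, which can be made concrete by enumerating the assignments for a short series. This is a sketch under one assumption: that the RRAD restriction is a maximum of two consecutive administrations of the same condition, a common choice for restricted alternating designs.

```python
from itertools import permutations

def balanced_sequences(n):
    # All distinct balanced sequences of n/2 A's and n/2 B's (the CRD assignments)
    return sorted(set(permutations("A" * (n // 2) + "B" * (n // 2))))

def is_rbd(seq):
    # RBD: both conditions appear within every block of two measurement occasions
    return all(set(seq[i:i + 2]) == {"A", "B"} for i in range(0, len(seq), 2))

def max_run(seq):
    # Length of the longest run of identical consecutive conditions
    run = best = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

n = 6
crd = balanced_sequences(n)
rbd = [s for s in crd if is_rbd(s)]
rrad = [s for s in crd if max_run(s) <= 2]  # assumed restriction: at most two in a row
print(len(crd), len(rbd), len(rrad))        # 20 8 14
```

The counts illustrate the trade-off in the text: the RRAD admits more assignments than the RBD (every RBD sequence also satisfies the run restriction) while excluding the most temporally clustered CRD sequences.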
The results of the simulation study also showed that the power of the RT is only minimally affected by the distributional characteristics of the null scores or the presence of strong positive autocorrelation. This is an important finding because it has been shown that singlecase data often contain autocorrelation and violate classic distributional assumptions (e.g., Adams & Anthony, 1996; Dugard, 2014; Edgington & Onghena, 2007; Ferron & Levin, 2014; Levin et al., 2014; Micceri, 1989). However, we should note that the effect of autocorrelation on the power of the RT depends on the type of singlecase design that is being used. For example, Ferron and Onghena (1996) evaluated the effect of autocorrelation on the power of the RT using a design that randomly assigns treatments to more extended phases. In this type of design, the researcher specifies a number of equally long phases and then randomly assigns a treatment condition to each of the phases. The results showed that positive autocorrelation increased the power of the RT for this type of singlecase design. In contrast, Ferron and Ware (1995) showed that positive autocorrelation decreases the power of the RT for singlecase phase designs that use random assignment of intervention points.
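For illustration, null scores with positive autocorrelation can be generated from a first-order autoregressive (AR1) Gaussian process as follows. This is a sketch: the φ value of 0.6 is an arbitrary choice for demonstration, not necessarily a value used in the study, and the first observation is drawn from the stationary distribution so the whole series is covariance-stationary.

```python
import random

def ar1_series(n, phi, seed=None):
    # AR(1) process y[t] = phi * y[t-1] + e[t] with standard-normal innovations;
    # the initial value is scaled to the stationary standard deviation
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) / (1.0 - phi ** 2) ** 0.5]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

series = ar1_series(20, phi=0.6, seed=1)  # phi = 0.6 chosen for illustration
print(len(series))
```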
Apart from the size of the treatment effect, the number of measurement occasions is the second most important determinant of the conditional power of the RT. However, our results showed a sharply decelerating increase in conditional power as the number of measurement occasions increased: the larger the number of measurement occasions, the smaller the subsequent increase in power. This information can be useful to researchers who prefer an SCE to be as short as possible (e.g., when the experiment has negative side effects for the participant), because it helps determine an optimal SCE length that balances sufficient statistical power against minimizing discomfort for the participant.
We will now address some limitations of this simulation study. First of all, an obvious limitation is that the generalizability of our results is restricted to the simulation conditions that were included. In particular, we only considered null scores generated from continuous distributions. For some applications, other continuous or discrete distributions might be more relevant. We would expect highly discrete distributions to compromise power, because they give rise to many ties both in the data themselves and in the reference distribution. The power values in the present simulation study can therefore be considered upper bounds. On the bright side, however, it should be noted that RTs always treat data in a discrete way and remain valid even if highly discrete data (e.g., data from skewed dichotomous distributions) are used.
A second limitation is that the unittreatment additivity model that we used to model the treatment effect in this simulation study conceptualizes the treatment effect as a difference in level and assumes no other effects. As a consequence, we only evaluated ES measures that are sensitive to differences in level. Nevertheless, we should mention that the unittreatment additivity model is a generally accepted model in nonparametric statistics and a standard for classical evaluations of nonparametric methods (e.g., Cox & Reid, 2000; Hinkelmann & Kempthorne, 2008; Lehmann, 1959; Welch & Gutierrez, 1988).
The 99% confidence interval of the Monte Carlo approximation of the conditional power is given by p ± 2.58 × √(pq/k), with k being the number of random assignments, p being the exact conditional power, q being 1 – p, and 2.58 being the 99.5th percentile of a standard normal distribution. For example, if the exact conditional power of a specific simulation condition is 80%, the 99% confidence interval of the Monte Carlo approximation with k = 1,000 is [77%; 83%]. Note that the Monte Carlo approach also causes the Type I error rates (i.e., the power when the treatment effect is zero) to deviate slightly from the specified significance level. When an exact RT is used (which uses all possible randomizations of the condition labels), the Type I error rate is always exactly equal to the specified significance level (Keller, 2012).
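Assuming the interval has the standard binomial form p ± 2.58 × √(pq/k), the worked example can be checked directly:

```python
from math import sqrt

def mc_ci(p, k, z=2.58):
    # 99% Monte Carlo confidence interval for an exact conditional power p
    # approximated with k random assignments: p +/- z * sqrt(p * (1 - p) / k)
    half = z * sqrt(p * (1 - p) / k)
    return p - half, p + half

lo, hi = mc_ci(0.80, 1000)
print(round(lo, 2), round(hi, 2))  # 0.77 0.83, matching the [77%; 83%] interval
```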
A generally accepted standard for sufficient statistical power is 80% (Cohen, 1988). When the power of a test is 80%, the probability of a Type II error (= 1 – power) is 20%. If the conventional significance level of 5% is used, the ratio between the probabilities of a Type II error and a Type I error will then be four to one (20% to 5%). A power smaller than 80% would result in too high a probability of a Type II error, whereas a power materially larger than 80% is likely to require data sets containing numbers of measurement occasions that could be unfeasible to collect (Cohen, 1992). To make a practical recommendation regarding the number of measurement occasions that yields a power of at least 80% in the RT at a 5% significance level, we must make an assumption about plausible ESs in singlecase research. We previously mentioned that very high ESs of 3 or more are not uncommon in SCEs. In that case, the results of the simulation study with normally or uniformly distributed data show that an SCE with at least 20 measurement occasions is needed to obtain sufficient statistical power in the RT when used with randomized alternation designs and a 5% significance level. With other designs, or with highly discrete and skewed expected data sets, an even larger number of measurements might be needed.
Finally, it is important to note that the conditional power framework we used in this simulation study contains an apparent paradox. On the one hand, the conditional power of a specific null data set holds only for that data set (i.e., it is conditional on the null scores). On the other hand, an a priori power analysis always requires knowledge of, or an assumption about, the expected data in order to guide recommendations for the number of observations to be included in the experiment. Invoking distributional assumptions for the power analysis would go against the conceptual framework of conditional power, as the latter makes no assumption of random sampling.
Similar to the approach taken by Keller (2012), we have tried to reconcile these two seemingly opposing requirements by calculating the conditional power for a large number of null data sets (sampled from the probability distributions described in the Method section) and then averaging these conditional powers. As such, the individual conditional powers are free of distributional assumptions, although the average conditional powers for the conditions reported in Tables 2, 3, and 4 of this article do depend on the specific distributions from which the null data sets were generated.
Once we have null scores, we can calculate the exact conditional power for varying levels of Δ. Alternatively, different effect models can be explored, each leading to other null scores.
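The computation described here can be sketched for a small completely randomized design (a toy illustration using the mean difference as the test statistic; the function names, the one-sided orientation, and the null scores below are our assumptions, not the authors' code):

```python
from itertools import permutations

def assignments(n):
    # All balanced divisions of n measurement occasions over conditions A and B
    return sorted(set(permutations("A" * (n // 2) + "B" * (n // 2))))

def mean_diff(data, labels):
    a = [d for d, l in zip(data, labels) if l == "A"]
    b = [d for d, l in zip(data, labels) if l == "B"]
    return sum(b) / len(b) - sum(a) / len(a)

def rt_p_value(data, true_labels, all_labels):
    # One-sided exact randomization test with the mean difference as statistic
    observed = mean_diff(data, true_labels)
    ge = sum(mean_diff(data, labels) >= observed for labels in all_labels)
    return ge / len(all_labels)

def conditional_power(null_scores, delta, alpha=0.05):
    # Exact conditional power: the proportion of possible true assignments for
    # which the RT rejects when a constant shift delta is added in condition B
    labs = assignments(len(null_scores))
    rejections = 0
    for true in labs:
        data = [y + (delta if l == "B" else 0.0)
                for y, l in zip(null_scores, true)]
        rejections += rt_p_value(data, true, labs) <= alpha
    return rejections / len(labs)

null = [0.3, -0.8, 1.2, 0.5, -1.1, 0.9, -0.2, 0.1]  # hypothetical null scores
print(conditional_power(null, delta=2.5))
```

Rerunning `conditional_power` with other values of `delta`, or with null scores derived from a different effect model, traces out the power curve for this particular null data set.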
Collings and Hamilton (1988) and Hamilton and Collings (1991) also used this idea of a pilot study to determine the appropriate sample size on the basis of the distributionfree power of nonparametric tests for location shift. More specifically, these authors proposed to bootstrap the pilot data (i.e., draw random samples with replacement from the pilot data) to form a large number of bootstrap data sets. The proportion of data sets that yield a p value smaller than or equal to the significance level α then serves as the power estimate for the test. The authors showed empirically, by means of a simulation study, that their method yields reliable estimates of the power of the Wilcoxon twosample test. Their method is appealing because the bootstrap technique allows for bootstrap data sets with any number of observations (smaller or larger than the number of observations in the pilot data set) and because no distributional assumptions are needed. In a similar vein, we could use this bootstrap technique to generate a number of data sets from the pilot data, calculate the conditional power for each of these data sets separately, and obtain the average conditional power across data sets without additional distributional assumptions. The only additional assumption is that the distributional shape of the pilot data is indicative of the distributional shape of the finally collected data.
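The Collings and Hamilton procedure can be sketched as follows (a simplified illustration: for brevity it uses the large-sample normal approximation to the one-sided Mann–Whitney U test rather than exact critical values, and the pilot data are hypothetical):

```python
import random
from math import sqrt

def mann_whitney_u(x, y):
    # U statistic: number of (x, y) pairs with the y value larger, ties counted half
    return sum(1.0 if b > a else 0.5 if b == a else 0.0 for a in x for b in y)

def bootstrap_power(pilot, n_per_group, shift, alpha_z=1.645, n_boot=2000, seed=0):
    # Resample the pilot data with replacement, add the hypothesized location
    # shift to one group, and count how often the one-sided test rejects at the
    # 5% level (normal approximation: z = (U - mu) / sigma >= 1.645)
    rng = random.Random(seed)
    n = m = n_per_group
    mu = n * m / 2.0
    sigma = sqrt(n * m * (n + m + 1) / 12.0)
    hits = 0
    for _ in range(n_boot):
        x = [rng.choice(pilot) for _ in range(n)]
        y = [rng.choice(pilot) + shift for _ in range(m)]
        hits += (mann_whitney_u(x, y) - mu) / sigma >= alpha_z
    return hits / n_boot

pilot = [0.5, -1.2, 0.3, 1.1, -0.4, 0.8, -0.9, 1.6, 0.1, -0.6]  # hypothetical pilot data
print(bootstrap_power(pilot, n_per_group=10, shift=1.5))
```

Because the bootstrap groups can be of any size, the same routine can be rerun with increasing `n_per_group` until the estimated power reaches the desired level.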
Suggestions for further research
To keep the computational burden of the simulation study manageable, we had to limit the present study to the factors most relevant to our research questions. However, other interesting factors remain to be investigated in future research. These include the use of other ES measures as test statistics in the RT, the inclusion of other statistical distributions for generating data (e.g., skewed, exponential, or bimodal distributions), the inclusion of additional AR values for data generated with the AR1 model, and the inclusion of other timeseries structures (e.g., moving average models). Because count data are regularly used as singlecase outcome measures, future simulation studies could compare the power of the RT for data generated from discrete probability distributions with its power for data generated from continuous probability distributions. In addition, future research could investigate the power of the RT in unbalanced alternation designs, as the singlecase designs in this simulation study were all balanced.
In this study, we only modeled treatment effects defined as differences in level between experimental conditions. However, several singlecase ES measures exist that are sensitive to trends (e.g., TauU, Parker et al., 2011; regressionbased ES measures, Van den Noortgate & Onghena, 2003). Consequently, future research could focus on power analysis of the RT for trendsensitive ES measures, using simulated data that contain different types of trend effects. We previously mentioned that singlecase phase designs are used more frequently in practice than singlecase alternation designs. For this reason, future research could also investigate the conditional power of the RT for a variety of singlecase phase designs.
Finally, another avenue for future research is to examine the theoretical relation between unconditional power and average conditional power as well as the possibility of obtaining distributionfree average conditional power using the bootstrap technique proposed by Collings and Hamilton (1988) and Hamilton and Collings (1991). For the latter it would be interesting to develop userfriendly statistical simulation software that assists in exploring the conditional power for a variety of distributional shapes and effect functions.
Conclusion
On the basis of the results of this simulation study, we would not recommend the use of the PND as an ES measure for the purpose of statistical inference using an RT in singlecase alternation designs, because of its low statistical power. In contrast, the NAP yields power levels that are very similar to those of the MD, and as such provides a good alternative for researchers who want to use a nonoverlap measure to quantify the treatment effect in SCEs. With regard to the number of measurement occasions that are needed to ensure adequate statistical power in the RT, we recommend including at least 20 measurement occasions when working with alternation designs in the domain of singlecase research and using a 5% significance level.
Footnotes
 1.
We will use the term “assignment” to refer to a specific randomization of the condition labels in an SCE.
 2.
This use of the term “conditional power” is standard in nonparametric statistics, but it should not be confused with the use of this term in the context of sequential clinical trials (see Lachin, 2005, for an overview of this alternative use of the term).
Notes
Author note
We thank two anonymous reviewers for their valuable comments on a previous version of this article.
Compliance with ethical standards
Funding
This research was funded by the Research Foundation–Flanders (FWO), Belgium (Grant ID G.0593.14 for B.M. and Grant ID 1242413N for M.H.).
Supplementary material
References
Adams, D. C., & Anthony, C. D. (1996). Using randomization techniques to analyse behavioural data. Animal Behaviour, 51, 733–738.
Agresti, A. (1992). A survey of exact inference for contingency tables. Statistical Science, 7, 131–153.
Allison, D. B., & Gorman, B. S. (1993). Calculating effect sizes for metaanalysis: The case of the single case. Behaviour Research and Therapy, 31, 621–631.
Alnahdi, G. H. (2015). Singlesubject design in special education: Advantages and limitations. Journal of Research in Special Educational Needs, 15, 257–265.
Barlow, D. H., Nock, M. K., & Hersen, M. (2009). Single case experimental designs: Strategies for studying behavior change (3rd ed.). Boston: Pearson.
Borckardt, J. J., & Nash, M. R. (2014). Simulation modelling analysis for small sets of singlesubject data collected over time. Neuropsychological Rehabilitation, 24, 492–506.
BowmanPerrott, L., Burke, M. D., de Marin, S., Zhang, N., & Davis, H. (2015). A metaanalysis of singlecase research on behavior contracts: Effects on behavioral and academic outcomes among children and youth. Behavior Modification, 39, 247–269. doi: 10.1177/0145445514551383
Bulté, I., & Onghena, P. (2008). An R package for singlecase randomization tests. Behavior Research Methods, 40, 467–478. doi: 10.3758/BRM.40.2.467
Busk, P. L., & Serlin, R. C. (1992). Metaanalysis for singlecase research. In T. R. Kratochwill & J. R. Levin (Eds.), Singlecase research design and analysis: New directions for psychology and education (pp. 187–212). Hillsdale: Erlbaum.
Campbell, J. M. (2013). Commentary on PND at 25. Remedial and Special Education, 34, 20–25.
Center, B. A., Skiba, R. J., & Casey, A. (1985–1986). A methodology for the quantitative synthesis of intrasubject design research. Journal of Special Education, 19, 387–400.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. doi: 10.1037/00332909.112.1.155
Collings, B. J., & Hamilton, M. A. (1988). Estimating the power of the twosample Wilcoxon test for location shift. Biometrics, 44, 847–860.
Corcoran, C. D., & Mehta, C. R. (2002). Exact level and power of permutation, bootstrap, and asymptotic tests of trend. Journal of Modern Applied Statistical Methods, 1(1). Retrieved from digitalcommons.wayne.edu/jmasm/vol1/iss1/7
Cox, D. R., & Reid, N. (2000). The theory of the design of experiments. Boca Raton: Chapman & Hall/CRC.
Dugard, P. (2014). Randomization tests: A new gold standard? Journal of Contextual Behavioral Science, 3, 65–68.
Edgington, E. S. (1967). Statistical inference from N = 1 experiments. Journal of Psychology, 65, 195–199.
Edgington, E. S. (1969). Approximate randomization tests. Journal of Psychology, 72, 143–149.
Edgington, E. S., & Onghena, P. (2007). Randomization tests (4th ed.). Boca Raton: Chapman & Hall/CRC.
Fabiano, G. A., Pelham, W. E., Coles, E. K., Gnagy, E. M., ChronisTuscano, A., & O’Connor, B. C. (2009). A metaanalysis of behavioral treatments for attentiondeficit/hyperactivity disorder. Clinical Psychology Review, 29, 129–140.
Ferron, J. M., Farmer, J. L., & Owens, C. M. (2010). Estimating individual treatment effects from multiplebaseline data: A Monte Carlo study of multilevelmodeling approaches. Behavior Research Methods, 42, 930–943. doi: 10.3758/BRM.42.3.930
Ferron, J. M., & Levin, J. R. (2014). Singlecase permutation and randomization statistical tests: Present status, promising new developments. In T. R. Kratochwill & J. R. Levin (Eds.), Singlecase intervention research: Methodological and statistical advances (pp. 153–183). Washington, DC: American Psychological Association.
Ferron, J., & Onghena, P. (1996). The power of randomization tests for singlecase phase designs. Journal of Experimental Education, 64, 231–239.
Ferron, J., & Sentovich, C. (2002). Statistical power of randomization tests used with multiplebaseline designs. Journal of Experimental Education, 70, 165–178.
Ferron, J., & Ware, W. (1995). Analyzing singlecase data: The power of randomization tests. Journal of Experimental Education, 63, 167–178.
Gabler, N. B., Duan, N., Vohra, S., & Kravitz, R. L. (2011). Nof1 trials in the medical literature: A systematic review. Medical Care, 49, 761–768. doi: 10.1097/MLR.0b013e318215d90d
Gabriel, K. R., & Hsu, C.F. (1983). Evaluation of the power of rerandomization tests, with application to weather modification experiments. Journal of the American Statistical Association, 78, 766–775.
Gast, D. L., & Ledford, J. R. (2014). Single case research methodology: Applications in special education and behavioral sciences (2nd ed.). New York: Routledge.
Gottman, J. M., & Glass, G. V. (1978). Analysis of interrupted timeseries experiments. In T. R. Kratochwill (Ed.), Singlesubject research: Strategies for evaluating change (pp. 197–237). New York: Academic Press.
Hamilton, M. A., & Collings, B. J. (1991). Determining the appropriate sample size for nonparametric tests for location shift. Technometrics, 33, 327–337.
Hammond, D., & Gast, D. L. (2010). Descriptive analysis of singlesubject research designs: 1983–2007. Education and Training in Autism and Developmental Disabilities, 45, 187–202.
Hedges, L. V., Pustejovsky, J. E., & Shadish, W. R. (2012). A standardized mean difference effect size for single case designs. Research Synthesis Methods, 3, 224–239.
Heyvaert, M., Maes, B., Van den Noortgate, W., Kuppens, S., & Onghena, P. (2012). A multilevel metaanalysis of singlecase and smalln research on interventions for reducing challenging behavior in persons with intellectual disabilities. Research in Developmental Disabilities, 33, 766–780.
Heyvaert, M., Moeyaert, M., Verkempynck, P., Van den Noortgate, W., Vervloet, M., Ugille, M., & Onghena, P. (2017). Testing the intervention effect in singlecase experiments: A Monte Carlo simulation study. Journal of Experimental Education, 85, 175–196. doi: 10.1080/00220973.2015.1123667
Heyvaert, M., & Onghena, P. (2014). Analysis of singlecase data: Randomisation tests for measures of effect size. Neuropsychological Rehabilitation, 24, 507–527.
Heyvaert, M., Saenen, L., Campbell, J. M., Maes, B., & Onghena, P. (2014). Efficacy of behavioral interventions for reducing problem behavior in persons with autism: An updated quantitative synthesis of singlesubject research. Research in Developmental Disabilities, 35, 2463–2476.
Heyvaert, M., Saenen, L., Maes, B., & Onghena, P. (2014). Systematic review of restraint interventions for challenging behaviour among persons with intellectual disabilities: Focus on effectiveness in singlecase experiments. Journal of Applied Research in Intellectual Disabilities, 27, 493–510.
Heyvaert, M., Wendt, O., Van den Noortgate, W., & Onghena, P. (2015). Randomization and dataanalysis items in quality standards for singlecase experimental studies. Journal of Special Education, 49, 146–156.
Hinkelmann, K., & Kempthorne, O. (2008). Design and analysis of experiments: Vols. I and II (2nd ed.). Hoboken: Wiley.
Horner, R. H., Swaminathan, H., Sugai, G., & Smolkowski, K. (2012). Considerations for the systematic analysis and use of singlecase research. Education and Treatment of Children, 35, 269–290.
Kazdin, A. E. (2011). Singlecase research designs: Methods for clinical and applied settings (2nd ed.). New York: Oxford University Press.
Keller, B. (2012). Detecting treatment effects with small samples: The power of some tests under the randomization model. Psychometrika, 77, 324–338.
Kempthorne, O. (1955). The randomization theory of experimental inference. Journal of the American Statistical Association, 50, 946–967.
Kempthorne, O., & Doerfler, T. E. (1969). The behavior of some significance tests under experimental randomization. Biometrika, 56, 231–248.
Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Singlecase designs technical documentation. Retrieved from the What Works Clearinghouse website: ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf
Kratochwill, T. R., & Levin, J. R. (1992). Singlecase research design and analysis: New directions for psychology and education. Hillsdale: Erlbaum.
Krauth, J. (1988). Distributionfree statistics: An applicationoriented approach. New York: Elsevier.
Lachin, J. M. (2005). A review of methods for futility stopping based on conditional power. Statistics in Medicine, 24, 2747–2764.
Lehmann, E. L. (1959). Testing statistical hypotheses. Hoboken: Wiley.
Lenz, A. (2012). Calculating effect size in singlecase research: A comparison of nonoverlap methods. Measurement and Evaluation in Counseling and Development, 46, 64–73.
Leong, H. M., Carter, M., & Stephenson, J. (2015). Systematic review of sensory integration therapy for individuals with disabilities: Single case design studies. Research in Developmental Disabilities, 47, 334–351.
Levin, J. R., Ferron, J. M., & Gafurov, B. S. (2014). Improved randomization tests for a class of singlecase intervention designs. Journal of Modern Applied Statistical Methods, 13, 2–52.
Levin, J. R., Ferron, J. M., & Kratochwill, T. R. (2012). Nonparametric statistical tests for singlecase systematic and randomized ABAB … AB and alternating treatment intervention designs: New developments, new directions. Journal of School Psychology, 50, 599–624.
Maggin, D. M., O’Keeffe, B. V., & Johnson, A. H. (2011). A quantitative synthesis of methodology in the metaanalysis of singlesubject research for students with disabilities: 1985–2009. Exceptionality, 19, 109–135.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 50–60.
Manolov, R., Solanas, A., Bulté, I., & Onghena, P. (2010). Datadivisionspecific robustness and power of randomization tests for ABAB designs. Journal of Experimental Education, 78, 191–214.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166. doi: 10.1037/00332909.105.1.156
Michiels, B., Heyvaert, M., Meulders, A., & Onghena, P. (2017). Confidence intervals for singlecase effect size measures based on randomization test inversion. Behavior Research Methods, 49, 363–381. doi: 10.3758/s1342801607144
Moeller, J. D., Dattilo, J., & Rusch, F. (2015). Applying quality indicators to singlecase research designs used in special education: A systematic review. Psychology in the Schools, 52, 139–153.
Onghena, P. (1992). Randomization tests for extensions and variations of ABAB singlecase experimental designs: A rejoinder. Behavioral Assessment, 14, 153–171.
Onghena, P. (1994). The power of randomization tests for singlecase designs (Unpublished doctoral dissertation). Leuven: Catholic University of Leuven.
Onghena, P. (2005). Singlecase designs. In B. Everitt & D. Howell (Eds.), Encyclopedia of statistics in behavioral science: Vol. 4 (pp. 1850–1854). Chichester: Wiley.
Onghena, P., & Edgington, E. S. (1994). Randomization tests for restricted alternating treatments designs. Behaviour Research and Therapy, 32, 783–786.
Onghena, P., & Edgington, E. S. (2005). Customization of pain treatments: Singlecase design and analysis. Clinical Journal of Pain, 21, 56–68.
Parker, R. I., HaganBurke, S., & Vannest, K. (2007). Percent of all nonoverlapping data (PAND): An alternative to PND. Journal of Special Education, 40, 194–204.
Parker, R. I., & Vannest, K. J. (2009). An improved effect size for singlecase research: Nonoverlap of all pairs. Behavior Therapy, 40, 357–367.
Parker, R. I., Vannest, K. J., Davis, J. L., & Sauber, S. B. (2011). Combining nonoverlap and trend for singlecase research: TauU. Behavior Therapy, 42, 284–299.
Parker, R. I., Vannest, K. J., & Brown, L. (2009). The improvement rate difference for single case research. Exceptional Children, 75, 135–150.
Pesarin, F., & De Martini, D. (2002). On unbiasedness and power of permutation tests. Metron, 60, 3–19.
Pratt, J. W., & Gibbons, J. D. (1981). Concepts of nonparametric theory. New York: Springer.
Schlosser, R. W., Lee, D. L., & Wendt, O. (2008). Application of the percentage of nonoverlapping data (PND) in systematic reviews and metaanalyses: A systematic review of reporting characteristics. EvidenceBased Communication Assessment and Intervention, 2, 163–187.
Scruggs, T. E., Mastropieri, M. A., & Casto, G. (1987). The quantitative synthesis of single subject research: Methodology and validation. Remedial and Special Education, 8, 24–33.
Senchaudhuri, P., Mehta, C. R., & Patel, N. R. (1995). Estimating exact pvalues by the method of control variates, or Monte Carlo rescue. Journal of the American Statistical Association, 90, 640–648.
Shadish, W. R. (2014). Analysis and metaanalysis of singlecase designs: An introduction. Journal of School Psychology, 52, 109–122.
Shadish, W. R., Rindskopf, D. M., & Hedges, L. V. (2008). The state of the science in the metaanalysis of singlecase experimental designs. EvidenceBased Communication Assessment and Intervention, 2, 188–196.
Shadish, W. R., & Sullivan, K. J. (2011). Characteristics of singlecase designs used to assess intervention effects in 2008. Behavior Research Methods, 43, 971–980. doi: 10.3758/s134280110111y
Shamseer, L., Sampson, M., Bukutu, C., Schmid, C. H., Nikles, J., Tate, R., & CENT Group. (2016). CONSORT extension for reporting Nof1 trials (CENT) 2015: Explanation and elaboration. Journal of Clinical Epidemiology, 76, 18–46. doi: 10.1016/j.jclinepi.2015.05.018
Smith, J. D. (2012). Singlecase experimental designs: A systematic review of published research and current standards. Psychological Methods, 17, 510–550. doi: 10.1037/a0029312
Solanas, A., Manolov, R., & Onghena, P. (2010). Estimating slope and level change in N = 1 designs. Behavior Modification, 34, 195–218.
Solomon, B. G. (2014). Violations of assumptions in schoolbased singlecase data: Implications for the selection and interpretation of effect sizes. Behavior Modification, 38, 477–496. doi: 10.1177/0145445513510931
Swaminathan, H., & Rogers, H. J. (2007). Statistical reform in school psychology research: A synthesis. Psychology in the Schools, 44, 543–549.
Tate, R., Togher, L., Perdices, M., McDonald, S., & Rosenkoetter, U. (2012). Developing reporting guidelines for singlecase experimental designs: The SCRIBE project. Paper presented at the 9th Conference of the Neuropsychological Rehabilitation Special Interest Group of the World Federation for Neurorehabilitation, Bergen, Norway.
Van den Noortgate, W., & Onghena, P. (2003). Hierarchical linear models for the quantitative integration of effect sizes in singlecase research. Behavior Research Methods, Instruments, & Computers, 35, 1–10. doi: 10.3758/BF03195492
Welch, W., & Gutierrez, L. G. (1988). Robust permutation tests for matchedpairs designs. Journal of the American Statistical Association, 83, 450–455.
White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of metaanalysis in individualsubject research. Behavioral Assessment, 11, 281–296.
Wolery, M., Busick, M., Reichow, R., & Barton, E. E. (2010). Comparison of overlap methods for quantitatively synthesizing singlesubject data. Journal of Special Education, 44, 18–28.