Introduction

A single-case experimental design (SCED) is an experimental design in which one subject, participant, or case is observed repeatedly over time, resulting in a time series. During this time series, one or more dependent variables are measured under different levels of an independent variable in order to assess the effect of a particular treatment or intervention (Onghena & Edgington, 2005). The time series often includes at least one baseline phase and one treatment phase. SCED studies frequently report results for a small number of cases. When generalizing the results of several SCED studies in a meta-analysis, the data of interest are hierarchical in nature: measurements are nested within cases, which in turn are nested within studies. This hierarchical nesting of the data can be taken into account elegantly by using hierarchical or multilevel modeling for statistical analysis (Van den Noortgate & Onghena, 2003a, 2003b, 2008).

In the basic multilevel model for meta-analysis of SCED data as proposed in previous research (Raudenbush & Bryk, 2002; Moeyaert et al., 2014; Shadish et al., 2008, 2013; Van den Noortgate & Onghena, 2007), the observed scores for each case are assumed to be normally distributed around their expected value. However, Shadish and Sullivan (2011) have reported that the outcome variables measured in SCED studies are very often of a discrete rather than continuous nature, and for these discrete outcomes the assumption of conditional normality does not hold. To account for both the hierarchical and the count nature of SCED data, two frameworks can be combined: linear mixed modeling (LMM) (Hox, 2010; Gelman & Hill, 2009; Snijders & Bosker, 2012) and generalized linear modeling (GLM) (Gill, 2001; McCullagh & Nelder, 1999). Both frameworks have proven to be very flexible tools: from their most basic forms, they expand into more specialized models in a clear and simple manner. Combining the two results in a generalized linear mixed model (GLMM) (Hox, 2010; Gelman & Hill, 2009; Snijders & Bosker, 2012; Jiang, 2007), which is specified by (1) a distribution for the random effects, (2) a linear combination of predictor variables, (3) a link function connecting this linear predictor to the expected value of the response variable conditional on the random effects, and (4) a distribution for the response variable around this expected value. GLMMs can thus be tailored to the particular type of data at hand, in this case count data in SCED meta-analyses (Shadish et al., 2013).

One downside of the GLMM framework is its relative complexity. Customizing a generalized linear mixed model requires a general mathematical understanding of both the GLM and the LMM frameworks. Even though efficient estimation methods are available in many popular software packages (Zhou et al., 1999; Bates et al., 2015; Molenberghs et al., 2002) and even though these models have proven their robustness and power (Abad et al., 2010; Capanu et al., 2013; Yau & Kuk, 2002), they may be somewhat intimidating for social scientists to apply. Another difficulty is that the more sophisticated the model, the more information is needed to ensure that the GLMM estimation converges (Li & Redden, 2015; Abad et al., 2010). However, in SCED contexts typically only a relatively small number of data points is available (Shadish & Sullivan, 2011), which may result in less reliable GLMM estimates (Nelson & Leroux, 2008).

For an assessment of the current use of GLMMs in SCED contexts, we have access to data collected for a recent review conducted by the authors of this simulation study (Jamshidi et al., 2017). This systematic review covers 178 systematic reviews and meta-analyses of SCED studies from the last three decades and describes their study characteristics. Of the included studies, only 22 (12%) used hierarchical or mixed modeling, and 19 of those were published after 2010. Only about half of these studies reported the measurement scale of the dependent variable, but those that did reported almost exclusively rates, percentages, or counts. Yet all 22 studies used an LMM rather than a GLMM. Together with the aforementioned complexity of the GLMM, this observation encourages us to look deeper into the consequences of misspecifying a model for SCED count data as an LMM (which assumes normally distributed outcomes).

To this end, a simulation study is conducted in which count data with a hierarchical structure are generated according to a two-level GLMM, assuming a Poisson distribution of scores within the phases. The simulated datasets are analyzed by fitting the GLMM used for data generation, as well as by fitting a two-level LMM that assumes normality of the scores within phases. The main aim of this study is to investigate whether the GLMM, as the theoretically correctly specified model, outperforms the LMM across all conditions, and, if not, in which conditions the LMM performs well enough (or better).

As to the conditions in which the LMM leads to acceptable performance, we have two hypotheses. First, if the expected count responses in the baseline and/or treatment phase are relatively high, the LMM might perform relatively better than when the expected counts are small, because Poisson distributions with larger expected values are better approximated by normal distributions (Stroup, 2013). Second, if the sample size is small, the LMM might perform relatively better than the GLMM, because the GLMM is too complex a model to estimate reliably when information is sparse (Hembry et al., 2015).

Various simulation conditions are taken into account. These conditions differ in the number of cases, the number of measurements within cases, the average baseline response, the average effect size, and the true variance component values. To analyze the performance of the model fits, we look at common goodness-of-fit criteria, fixed effect parameter recovery, the type I error rate, and the power. The goal is to provide applied researchers with recommendations on the conditions (e.g., the required sample size or the required average count in the baseline and/or treatment phase) under which count data can be reliably analyzed with simpler LMMs.

Methodology

For simplicity, the simulation in this study will only take into account two levels (measurements nested within cases). The model used to simulate the SCED count data is a GLMM with an underlying Poisson distribution and a log link function:

$$\begin{array}{@{}rcl@{}} &&Y_{ij} \sim \text{Poisson}\left(\lambda_{ij}\right) \\ &&\log\left(\lambda_{ij}\right) = \beta_{0j} + \beta_{1j}D_{ij} \\ &&\left\{\begin{array}{l} \beta_{0j} = \gamma_{00} + u_{0j} \\ \beta_{1j} = \gamma_{10} + u_{1j} \end{array}\right. \\ &&\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \text{MVN}\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^{2}_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^{2}_{u1} \end{pmatrix}\right], \end{array} $$
(1)

where i = 1,…,I indicates the measurement occasion and j = 1,…,J the case. The variable Dij is a dummy variable indicating the phase of the experiment: Dij equals 0 if the measurement was taken during the baseline phase and 1 if it was taken during the treatment phase. The random effects u0j and u1j have respective variances \(\sigma_{u0}^{2}\) and \(\sigma_{u1}^{2}\), and their covariance is σu01. In this GLMM, γ00 represents the average of the logarithm of the baseline level, and γ10 represents the logarithm of the treatment effect across the J cases.
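
To make the data-generating process concrete, the following R sketch simulates one dataset from this GLMM. All values (the parameters, the grouping name `case`, and the vector of treatment starting points) are illustrative assumptions, not the exact values used in this study:

```r
# Illustrative sketch of generating one dataset from the GLMM in Eq. 1;
# parameter values and starting points are hypothetical, not the study's.
library(MASS)  # mvrnorm() for the multivariate normal random effects

set.seed(1)
I <- 12; J <- 8                            # measurements per case, cases
gamma00 <- log(10); gamma10 <- log(2)      # hypothetical fixed effects
Sigma <- matrix(c(log(1.35)^2, log(1.05),
                  log(1.05),   log(1.35)^2), nrow = 2)
start <- round(seq(4, I - 2, length.out = J))  # treatment start per case

u <- mvrnorm(J, mu = c(0, 0), Sigma = Sigma)   # (u_0j, u_1j) per case
dat <- do.call(rbind, lapply(seq_len(J), function(j) {
  D <- as.integer(seq_len(I) >= start[j])      # phase dummy D_ij
  lambda <- exp(gamma00 + u[j, 1] + (gamma10 + u[j, 2]) * D)
  data.frame(case = factor(j), i = seq_len(I), D = D,
             y = rpois(I, lambda))             # Poisson counts
}))
```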

Two models are used to analyze the simulated data: one GLMM identical to the one used to generate the data in Eq. 1, and one two-level LMM as defined below:

$$\begin{array}{@{}rcl@{}} &&Y_{ij} \sim \text{N}\left(\mu_{ij}, \sigma_{e}^{2}\right) \\ &&\mu_{ij} = \beta_{0j}^{*} + \beta_{1j}^{*}D_{ij} \\ &&\left\{\begin{array}{l} \beta_{0j}^{*} = \gamma_{00}^{*} + u_{0j}^{*} \\ \beta_{1j}^{*} = \gamma_{10}^{*} + u_{1j}^{*} \end{array}\right. \\ &&\begin{pmatrix} u_{0j}^{*} \\ u_{1j}^{*} \end{pmatrix} \sim \text{MVN}\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \left(\sigma_{u0}^{*}\right)^{2} & \sigma_{u01}^{*} \\ \sigma_{u01}^{*} & \left(\sigma_{u1}^{*}\right)^{2} \end{pmatrix}\right]. \end{array} $$
(2)

Simulation conditions

Design parameters

We refer to I, the number of measurements per case, and to J, the number of cases, as ‘design parameters’ because they determine the single-case experimental design implemented in the simulation. A common practice in SCED research is to vary the length of the baseline phase (Shadish & Sullivan, 2011), so that the time point at which the treatment or intervention is introduced differs across cases. This is a so-called multiple baseline design, and it is the design implemented in this simulation study. In an SCED context, I will typically be quite small and often J will be even smaller (Shadish & Sullivan, 2011). This might have a substantial influence on the fit of the LMM and especially of the GLMM, since the latter, more complex model can be more difficult to estimate when the number of data points is small (Nelson & Leroux, 2008). For many measurements and cases, the fit to the simulated data is expected to be good. The number of measurements I is set to 8, 12, or 20. These values were deliberately chosen to be somewhat smaller than common numbers of measurement occasions, as reported by Moeyaert et al. (2013) based on Ferron et al. (2010), Shadish and Sullivan (2011), and Swanson and Sachse-Lee (2000), in order to test the hypothesis of better relative performance of the LMM with small sample sizes. The number of cases J is set to 4, 8, or 10. These values are also close to the values for J chosen in Moeyaert et al. (2013), which were based on recommendations by Barlow and Hersen (1984) and Kazdin and Kopel (1975) and on the review by Shadish and Sullivan (2011), but the values in this study were chosen to be more spread apart, to obtain a slightly larger range of levels when considering J as a factor in the analysis of the simulation results. For each combination of I and J, a list of starting point values (i.e., the first measurement that is part of the treatment phase) is defined. This list has length J and contains the starting point i ∈ [1,I] for every case j. These starting points were chosen so that they are evenly distributed among the cases and so that both the baseline and the treatment phase contain a substantial number of measurements. Table 1 provides a summary of the design parameter combinations and their corresponding lists of starting point values.

Table 1 Timing of intervention for simulated cases

Model parameters

In the GLMM (1) used for generating data, the raw data points Yij are generated by random sampling from a Poisson(λij) distribution. For sufficiently large values of λij, however, the normal distribution with mean λij and variance λij is a good approximation to the Poisson distribution (Johnson et al., 2005). This leads to the hypothesis that, for GLMM-generated data with sufficiently large λij values, the LMM (2) might result in a relatively better fit. To verify this hypothesis, the simulation conditions need to distinguish between generated data that are ‘highly discrete’ in nature (smaller λij values) and generated data that have a more ‘continuous’ nature (larger λij values) due to good approximation by the normal distribution. With two phases (baseline and treatment) and two characterizations (highly discrete or approximately continuous in nature), we obtain four conditions based on the responses (Table 2). Without loss of generality, this study only includes one combination of a phase with a highly discrete average response and a phase with an approximately continuous average response, i.e., the second combination listed in Table 2.

Table 2 Categorization of the average baseline response and treatment response

The aim of this section is to define values for the nominal fixed effect parameters (γ00 and γ10) and variance components (\(\sigma_{u0}^{2}\), \(\sigma_{u1}^{2}\) and σu01) in such a way that they cover the three combinations of interest listed in Table 2. Thus, the question is the following: how do the values of the model parameters γ00, γ10, \(\sigma_{u0}^{2}\), \(\sigma_{u1}^{2}\) and σu01 affect the model’s average response, i.e., E(λij)? From the linear expression for λij in the GLMM (1) it follows that

$$\begin{array}{@{}rcl@{}} \lambda_{ij} &=& \exp\left( \beta_{0j} + \beta_{1j}D_{ij}\right) \\ &=& \exp\left( \beta_{0j}\right)\exp\left( \beta_{1j}D_{ij}\right). \end{array} $$

Thus, in the baseline phase E(λij) equals E[exp(β0j)], and in the treatment phase E(λij) equals E[exp(β0j)exp(β1j)]. Expressions for these expected values of exponentials of β0j and β1j can be obtained from properties of the multivariate lognormal distribution. Since \(\left(u_{0j}, u_{1j}\right)^{\intercal}\) is sampled from a multivariate normal distribution, \(\left(\beta_{0j}, \beta_{1j}\right)^{\intercal}\) also follows a multivariate normal distribution. Therefore, \(\exp\left(\boldsymbol{\beta}\right) = \left[\exp\left(\beta_{0j}\right), \exp\left(\beta_{1j}\right)\right]^{\intercal}\) follows a multivariate lognormal distribution, whose mean vector has elements k equal to

$$ \text{E}\left[\exp\left( \boldsymbol{\beta}\right)\right]_{k} = \exp\left( \mu_{k} + \frac{1}{2}{\Sigma}_{kk}\right) $$
(3)

and the elements kl of the covariance matrix of exp (β) equal

$$\begin{array}{@{}rcl@{}} \text{Var}\left[\exp\left( \boldsymbol{\beta}\right)\right]_{kl} &=& \exp\left[\mu_{k} + \mu_{l} + \frac{1}{2}\left( {\Sigma}_{kk} + {\Sigma}_{ll}\right)\right]\\ &&\times\left[\exp\left( {\Sigma}_{kl}\right) - 1\right]. \end{array} $$
(4)

So, if \(\boldsymbol {\beta } = \left (\beta _{0j}, \beta _{1j}\right )^{\intercal } \sim \text {MVN}\left (\boldsymbol {\mu }, \boldsymbol {\Sigma }\right )\) with \(\boldsymbol {\mu } = \left (\gamma _{00},\gamma _{10}\right )^{\intercal }\) and

$$\boldsymbol{\Sigma} = \left( \begin{array}{ccc} \sigma^{2}_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^{2}_{u1} \end{array}\right), $$

we have that

$$\begin{array}{@{}rcl@{}} \text{E}\left[\exp\left( \beta_{0j}\right)\right] &=& \exp\left( \gamma_{00} + \frac{\sigma_{u0}^{2}}{2}\right) \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} \text{E}\left[\exp\left( \beta_{1j}\right)\right] &=& \exp\left( \gamma_{10} + \frac{\sigma_{u1}^{2}}{2}\right) \end{array} $$
(6)
$$\begin{array}{@{}rcl@{}} \text{Var}\left[\exp\left( \beta_{0j}\right)\right] &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]^{2}\\ &&\times\left[\exp\left( \sigma_{u0}^{2}\right)-1\right] \end{array} $$
(7)
$$\begin{array}{@{}rcl@{}} \text{Var}\left[\exp\left( \beta_{1j}\right)\right] &=& \text{E}\left[\exp\left( \beta_{1j}\right)\right]^{2}\\ &&\times\left[\exp\left( \sigma_{u1}^{2}\right)-1\right] \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} \text{Cov}\left[\exp\left( \beta_{0j}\right),\exp\left( \beta_{1j}\right)\right] &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]\text{E}\left[\exp\left( \beta_{1j}\right)\right]\\ &&\times \left[\exp\left( \sigma_{u01}\right)-1\right] \end{array} $$
(9)

Equation 5 describes the average baseline response. By combining Eq. 5, Eq. 6 and the formula for the expected value of the product of two dependent variables, an expression for the average treatment response can be derived:

$$\begin{array}{@{}rcl@{}} \text{E}\left[\exp\left( \beta_{0j}\right)\exp\left( \beta_{1j}\right)\right] &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]\text{E}\left[\exp\left( \beta_{1j}\right)\right]\\ &&+ \text{Cov}\left[\exp\left( \beta_{0j}\right),\exp\left( \beta_{1j}\right)\right] \\ &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]\text{E}\left[\exp\left( \beta_{1j}\right)\right]\\ &&\times\left[1 + \exp\left( \sigma_{u01}\right) - 1\right] \\ &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]\\ &&\times\text{E}\left[\exp\left( \beta_{1j}\right)\right]\exp\left( \sigma_{u01}\right) \end{array} $$
(10)

An important point to notice here is that the expected treatment response E[exp(β0j)exp(β1j)] is not simply equal to the product of E[exp(β0j)] and E[exp(β1j)]: Equation 10 shows the additional influence of the σu01 parameter.

These derivations illustrate how the average baseline and treatment responses depend on the model parameters in a way that is not straightforward. The average baseline response depends in a non-linear way not only on γ00 but also on \(\sigma_{u0}^{2}\). The average treatment response depends in a non-linear way on all five model parameters γ00, γ10, \(\sigma_{u0}^{2}\), \(\sigma_{u1}^{2}\) and σu01 together. Therefore, in this simulation study, nominal values are chosen for the expected responses in Eqs. 5 and 6 rather than for the model parameters γ00, γ10, \(\sigma_{u0}^{2}\) and \(\sigma_{u1}^{2}\) directly. This makes it easier to place the conditions in the categories of Table 2. The resulting choices of values for E[exp(β0j)], E[exp(β1j)] and σu01, together with their categorization as ‘highly discrete’ or ‘approximately continuous’, are listed in Table 3.
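
As a quick sanity check on these derivations, Eqs. 5 and 10 can be verified by Monte Carlo simulation. The parameter values in the sketch below are arbitrary illustrations, not the study's conditions:

```r
# Monte Carlo check of Eqs. 5 and 10 (illustrative parameter values)
library(MASS)
set.seed(2)
gamma00 <- log(5); gamma10 <- log(2)
s2_u0 <- log(1.35)^2; s2_u1 <- log(1.35)^2; s_u01 <- log(1.05)
beta <- mvrnorm(1e6, mu = c(gamma00, gamma10),
                Sigma = matrix(c(s2_u0, s_u01, s_u01, s2_u1), nrow = 2))

mean(exp(beta[, 1]))                  # simulated average baseline response
exp(gamma00 + s2_u0 / 2)              # Eq. 5
mean(exp(beta[, 1] + beta[, 2]))      # simulated average treatment response
exp(gamma00 + s2_u0 / 2) * exp(gamma10 + s2_u1 / 2) * exp(s_u01)  # Eq. 10
```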

Table 3 Simulation conditions based on categorization of the average baseline response, treatment response, and effect size

After having defined values for E[exp(β0j)] and E[exp(β0j)exp(β1j)] in Table 3, the choice of \(\sigma_{u0}^{2}\) and \(\sigma_{u1}^{2}\) values uniquely determines the corresponding values of γ00 and γ10, as shown in Eqs. 5 and 6. Since there are no particular restrictions on the values of γ00 and γ10, the focus will now be on carefully defining values for the variance components \(\sigma_{u0}^{2}\) and \(\sigma_{u1}^{2}\). These variance components influence the variances of exp(β0j) and exp(β1j), as shown in Eqs. 7 and 8. Note that in Table 3, the values for the expected responses were chosen deliberately so that they cover all categories. If the variances of exp(β0j) and exp(β0j)exp(β1j) are large, however, a relatively large proportion of the generated β ’s will yield values of exp(β0j) and exp(β0j)exp(β1j) that do not fall into the intended categories of Table 3. This is due to the positive skewness of the lognormal distribution, which increases with the underlying normal variance (i.e., \(\sigma_{u0}^{2}\) and \(\sigma_{u1}^{2}\)). Therefore, two values are defined for both \(\sigma_{u0}^{2}\) and \(\sigma_{u1}^{2}\): [log(1.35)]2 and [log(1.50)]2. According to Eqs. 7 and 8, the corresponding variances for \(\sigma_{u0}^{2} = \sigma_{u1}^{2} = \left[\log\left(1.50\right)\right]^{2}\) equal:

$$\begin{array}{@{}rcl@{}} \text{Var}\left[\exp\left( \beta_{0j}\right)\right] &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]^{2}\left[\exp\left( \sigma_{u0}^{2}\right)-1\right] \\ &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]^{2}\left[\exp\left( \left[\log\left( 1.50\right)\right]^{2}\right)-1\right] \\ &\approx& 0.18\cdot\text{E}\left[\exp\left( \beta_{0j}\right)\right]^{2} \\ \text{Var}\left[\exp\left( \beta_{1j}\right)\right] &=& \text{E}\left[\exp\left( \beta_{1j}\right)\right]^{2}\left[\exp\left( \sigma_{u1}^{2}\right)-1\right] \\ &\approx& 0.18\cdot\text{E}\left[\exp\left( \beta_{1j}\right)\right]^{2}. \end{array} $$

So the variances equal roughly 9% (\(\sigma_{u0}^{2} = \sigma_{u1}^{2} = \left[\log\left(1.35\right)\right]^{2}\)) or 18% (\(\sigma_{u0}^{2} = \sigma_{u1}^{2} = \left[\log\left(1.50\right)\right]^{2}\)) of the squared expected values; equivalently, the standard deviations equal roughly 31% or 42% of the expected values.

A final condition to check is whether the chosen values for the variance components yield a positive semi-definite covariance matrix. This is equivalent to making sure that the correlation between β0j and β1j is between −1 and 1, or that |σu01| ≤ σu0σu1. Checking this restriction for the largest value of σu01 (i.e., σu01 = log(1.05)) and the smallest values of σu0 and σu1 (i.e., σu0 = σu1 = log(1.35)), one can verify that this condition is indeed met:

$$\left|\sigma_{u01}\right| \leq \left|\sigma_{u0}\sigma_{u1}\right| \Leftrightarrow \log(1.05) \leq \left[\log\left( 1.35\right)\right]^{2}. $$
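In R, this check reduces to a single comparison:

```r
# |sigma_u01| <= sigma_u0 * sigma_u1 for the extreme parameter values
log(1.05) <= log(1.35)^2   # 0.0488 <= 0.0901, so TRUE
```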

Analysis

Goodness of fit

To assess the goodness of fit of the GLMM and the LMM, the Akaike information criterion (AIC; Akaike, 1998) and the Bayesian information criterion (BIC; Schwarz, 1978; Claeskens & Jansen, 2015) are used. In every iteration of the simulation, the AIC and the BIC of the GLMM and the LMM fits are computed. Next, relative AIC and BIC scores are calculated by taking the relative difference between the LMM and GLMM goodness-of-fit criteria (denoted AICL and BICL for the LMM and AICG and BICG for the GLMM):

$$\begin{array}{@{}rcl@{}} S_{\text{AIC}} &=& \frac{\text{AIC}_{\text{L}}-\text{AIC}_{\text{G}}}{\text{AIC}_{\text{G}}} \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} S_{\text{BIC}} &=& \frac{\text{BIC}_{\text{L}}-\text{BIC}_{\text{G}}}{\text{BIC}_{\text{G}}} \end{array} $$
(12)

The motivation behind these scores is that they provide a comparison between the LMM and GLMM in one score, and that these scores in turn are comparable across conditions. This facilitates representation of the goodness of fit results in a clear and compact figure later in the analysis. When SAIC < 0 or SBIC < 0, the LMM fit results in a lower AIC or BIC and this would lead to the conclusion that the LMM provides a better fit than the GLMM. The reverse finding, i.e., SAIC > 0 or SBIC > 0, would lead to the conclusion that the GLMM provides a better fit than the LMM. Per condition, the mean \(\overline {S}_{\text {AIC}}\) and \(\overline {S}_{\text {BIC}}\) are each calculated over all iterations.
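
Given two fitted models, each score is a one-line computation. The sketch below assumes `fit_lmm` and `fit_glmm` are lme4 model objects fitted as described in the Software section below:

```r
# Relative goodness-of-fit scores of Eqs. 11 and 12 for one iteration;
# fit_lmm and fit_glmm are assumed lme4 fits of models (2) and (1)
S_AIC <- (AIC(fit_lmm) - AIC(fit_glmm)) / AIC(fit_glmm)
S_BIC <- (BIC(fit_lmm) - BIC(fit_glmm)) / BIC(fit_glmm)
```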

Fixed effect parameter recovery

In SCED research, the main interest is usually in the treatment effect and its size (Van den Noortgate & Onghena, 2008). When analyzing SCED data with the classical continuous linear mixed model as expressed in Eq. 2, the corresponding parameter of interest is \(\gamma _{10}^{*}\). This parameter expresses the average increase or decrease in baseline response across cases after the treatment or intervention. Note that this is an additive change: the average baseline response changes from

$$\text{E}\left( \mu_{ij}|D_{ij} = 0\right) = \gamma_{00}^{*} $$

to

$$\text{E}\left( \mu_{ij}|D_{ij} = 1\right) = \gamma_{00}^{*}+ \gamma_{10}^{*} $$

in the treatment phase. Thus, the fixed effect \(\gamma_{10}^{*}\) expresses the average difference between the expected baseline response and the expected treatment response. The GLMM fixed effect parameter γ10 cannot be interpreted in the same way; indeed, its interpretation is not as straightforward. Equations 5, 6 and 10 show that the expected treatment response does not simply equal the expected baseline response times exp(γ10), because of the influence of the variance components.

This observation leads to the following complication in this simulation study. Data are generated from the GLMM as defined in Eq. 1, with a nominal value for γ10. Afterwards, the LMM as defined in Eq. 2 is fit, which yields an estimate \(\hat{\gamma}_{10}^{*}\). However, this \(\hat{\gamma}_{10}^{*}\) is not comparable with the nominal γ10, since γ10 and \(\gamma_{10}^{*}\) are two different parameters that do not express the same concept.

To address this complication, two approaches are proposed. Both approaches provide a transformation of the parameters of one of the models into a new parameter. This new parameter is comparable to the fixed effect parameter of the other model and therefore a fixed parameter recovery assessment can be conducted based on the new parameter estimate from the first model and the fixed effect parameter estimate from the second model. Note that a general investigation of transformations of effect sizes based on the LMM to effect sizes based on the GLMM and vice versa is not within the scope of this paper, though this might be interesting for future research.

The first approach consists of a transformation of the GLMM parameters into a new parameter ΔG, which expresses an effect size comparable to the fixed effect \(\gamma_{10}^{\ast}\) of the LMM. By comparing the estimate of ΔG from the GLMM with the estimate of \(\gamma_{10}^{\ast}\) from the LMM, we can assess the fixed effect parameter recovery. The second approach is analogous, but uses the LMM as the starting point instead: based on a transformation of the LMM parameters, it introduces a new fixed effect parameter ΓL, which is subsequently compared to γ10 to assess fixed effect parameter recovery.

The first metric, Δ, is defined as the additive effect of the treatment. This additive effect expresses the difference between the average treatment response and the average baseline response:

$$ {\Delta} = \text{E}\left( \text{Tx}\right) - \text{E}\left( \text{B}\right) $$
(13)

For the GLMM, Eqs. 5 and 10 can be used to define a ΔG parameter:

$$\begin{array}{@{}rcl@{}} {\Delta}_{G} &=& \text{E}\left[\exp\left( \beta_{0j}\right)\right]\text{E}\left[\exp\left( \beta_{1j}\right)\right]\exp\left( \sigma_{u01}\right) - \text{E}\left[\exp\left( \beta_{0j}\right)\right] \\ &=& \exp\left( \gamma_{00} + \frac{\sigma_{u0}^{2}}{2}\right)\left[\exp\left( \gamma_{10} + \frac{\sigma_{u1}^{2}}{2} + \sigma_{u01}\right) - 1\right] \\ \end{array} $$
(14)

The nominal ΔG can be computed for every condition by substituting the parameters used in data generation into Eq. 14. The estimate \(\hat{\Delta}_{G}\) is computed for each simulated dataset by substituting the estimated parameters into Eq. 14:

$$ \hat{\Delta}_{G} = \exp\left( \hat{\gamma}_{00} + \frac{\hat{\sigma}_{u0}^{2}}{2}\right)\left[\exp\left( \hat{\gamma}_{10} + \frac{\hat{\sigma}_{u1}^{2}}{2} + \hat{\sigma}_{u01}\right) - 1\right] $$
(15)
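
In R, \(\hat{\Delta}_{G}\) can be assembled from an lme4 GLMM fit as sketched below, assuming the grouping factor is named `case` as in the earlier data-generation sketch:

```r
# Sketch of Eq. 15: Delta_G estimate from a glmer() fit
# (grouping factor assumed to be named `case`)
fe <- lme4::fixef(fit_glmm)              # gamma00_hat, gamma10_hat
vc <- lme4::VarCorr(fit_glmm)$case       # 2 x 2 random-effect covariance
delta_G_hat <- exp(fe[[1]] + vc[1, 1] / 2) *
  (exp(fe[[2]] + vc[2, 2] / 2 + vc[1, 2]) - 1)
```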

For the LMM, a ΔL parameter is defined analogously:

$$\begin{array}{@{}rcl@{}} {\Delta}_{L} &=& \text{E}\left( \beta_{0j}^{*} + \beta_{1j}^{*}\right) - \text{E}\left( \beta_{0j}^{*}\right) \\ &=& \text{E}\left( \beta_{0j}^{*}\right) + \text{E}\left( \beta_{1j}^{*}\right) - \text{E}\left( \beta_{0j}^{*}\right) \\ &=& \text{E}\left( \beta_{1j}^{*}\right) \\ &=& \gamma_{10}^{*} \end{array} $$
(16)

The nominal ΔL can be computed for every condition by substituting the parameters used in data generation into Eq. 14, since ΔL targets the same additive effect. The estimate \(\hat{\Delta}_{L}\) for each simulated dataset is simply \(\hat{\gamma}_{10}^{*}\):

$$ \hat{\Delta}_{L} = \hat{\gamma}^{*}_{10} $$
(17)

The second metric Γ is defined by the following expression based on the expected baseline and treatment responses and on the variance in the baseline and in the treatment:

$$ {\Gamma} = \log\left[\left( \frac{\text{E}\left( \text{Tx}\right)}{\text{E}\left( \text{B}\right)}\right)^{2}\sqrt{\frac{\text{E}\left( \text{B}\right)^{2}+\text{Var}\left( \text{B}\right)}{\text{E}\left( \text{Tx}\right)^{2}+\text{Var}\left( \text{Tx}\right)}}\right] $$
(18)

For the GLMM, it can be shown that the above expression equals γ10. These calculations are provided in Appendix A. Thus a ΓG parameter is defined as:

$$ {\Gamma}_{G} = \gamma_{10} $$
(19)

The nominal ΓG can be computed for every condition by substituting the parameters used in data generation into Eq. 19. The estimate \(\hat{\Gamma}_{G}\) for each simulated dataset is simply \(\hat{\gamma}_{10}\):

$$ \hat{\Gamma}_{G} = \hat{\gamma}_{10} $$
(20)

For the LMM, according to Eq. 2 we have that

$$\begin{array}{@{}rcl@{}} \text{E}\left( \text{B}\right) &=& \gamma_{00}^{*} \\ \text{Var}\left( \text{B}\right) &=& \left( \sigma_{u0}^{*}\right)^{2} \\ \text{E}\left( \text{Tx}\right) &=& \gamma_{00}^{*} + \gamma_{10}^{*} \\ \text{Var}\left( \text{Tx}\right) &=& \left( \sigma_{u0}^{*}\right)^{2} + \left( \sigma_{u1}^{*}\right)^{2} + 2\sigma_{u01}^{*}. \end{array} $$

Thus a ΓL parameter is defined as follows:

$$ {\Gamma}_{L} = \log\left[\left(\frac{\gamma_{00}^{*} + \gamma_{10}^{*}}{\gamma_{00}^{*}}\right)^{2}\sqrt{\frac{\left(\gamma_{00}^{*}\right)^{2} + \left(\sigma_{u0}^{*}\right)^{2}}{\left(\gamma_{00}^{*} + \gamma_{10}^{*}\right)^{2} + \left(\sigma_{u0}^{*}\right)^{2} + \left(\sigma_{u1}^{*}\right)^{2} + 2\sigma_{u01}^{*}}}\right] $$
(21)

The nominal ΓL can be computed for every condition by substituting the parameters used in data generation into Eq. 19, since ΓL targets the same effect as ΓG. The estimate \(\hat{\Gamma}_{L}\) is computed for each simulated dataset by substituting the estimated parameters into Eq. 21:

$$ \hat{\Gamma}_{L} = \log\left[\left(\frac{\hat{\gamma}_{00}^{*} + \hat{\gamma}_{10}^{*}}{\hat{\gamma}_{00}^{*}}\right)^{2}\sqrt{\frac{\left(\hat{\gamma}_{00}^{*}\right)^{2} + \left(\hat{\sigma}_{u0}^{*}\right)^{2}}{\left(\hat{\gamma}_{00}^{*} + \hat{\gamma}_{10}^{*}\right)^{2} + \left(\hat{\sigma}_{u0}^{*}\right)^{2} + \left(\hat{\sigma}_{u1}^{*}\right)^{2} + 2\hat{\sigma}_{u01}^{*}}}\right] $$
(22)
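
Analogously, \(\hat{\Gamma}_{L}\) can be computed from an lme4 LMM fit; the sketch below again assumes a grouping factor named `case`:

```r
# Sketch of Eq. 22: Gamma_L estimate from an lmer() fit
fe <- lme4::fixef(fit_lmm)               # gamma00*_hat, gamma10*_hat
vc <- lme4::VarCorr(fit_lmm)$case        # 2 x 2 random-effect covariance
gamma_L_hat <- log(((fe[[1]] + fe[[2]]) / fe[[1]])^2 *
  sqrt((fe[[1]]^2 + vc[1, 1]) /
       ((fe[[1]] + fe[[2]])^2 + vc[1, 1] + vc[2, 2] + 2 * vc[1, 2])))
```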

For each of these parameters Δ and Γ, the relative bias (RB) and the mean squared error (MSE) are calculated.

Inference

In SCED meta-analysis, researchers are interested in finding out whether there is an effect of a treatment or intervention. This is expressed in an effect size: a metric indicating the direction and the size of the effect. In multilevel modeling of SCED meta-analytical data, typically the fixed effects are chosen as effect sizes (i.e., γ10 in the GLMM (1) and \(\gamma_{10}^{*}\) in the LMM (2)). Because the data in this simulation study are simulated according to the GLMM in Eq. 1, the parameter of interest here is γ10. The binary hypotheses on which inference in the GLMM setting is based are:

$$\begin{array}{@{}rcl@{}} H_{0} &:& \gamma_{10} = {\Gamma}_{G} = 0\\ H_{\alpha} &:& \gamma_{10} = {\Gamma}_{G} \neq 0 \end{array} $$

We calculate the proportion of rejections of the null hypothesis per condition, i.e., the proportion of estimated GLMMs in which the p value for the \(\hat{\gamma}_{10}\) estimate is smaller than the significance level α. In conditions where the nominal γ10 equals 0, this proportion equals the type I error rate. In conditions where the nominal γ10 does not equal 0, this proportion equals the power.

For the LMM however, p values are calculated based on a different set of hypotheses:

$$\begin{array}{@{}rcl@{}} H_{0} : \gamma_{10}^{*} &=& {\Delta}_{L} = 0\\ H_{\alpha} : \gamma_{10}^{*} &=& {\Delta}_{L} \neq 0 \end{array} $$

Again, we calculate the proportion of null hypothesis rejections per condition. We have to interpret this proportion based on the nominal ΔG value (Eq. 14), since ΔL should estimate the same additive treatment effect. In conditions where the nominal ΔG equals 0, the proportion of rejections equals the type I error rate. In conditions where the nominal ΔG does not equal 0, this proportion equals the power. Note that according to Eq. 14 we have that

$${\Delta}_{G} = 0 \Leftrightarrow \gamma_{10} = -\frac{\sigma_{u1}^{2}}{2} - \sigma_{u01},$$

which holds in particular when \(\gamma_{10} = -\sigma_{u1}^{2}/2\) and σu01 = 0.

This expression will be the motivation for the choice of values for γ10 and σu01.

The p values for the LMM are computed with an approximate Wald F test with Satterthwaite denominator degrees of freedom (Gumedze & Dunne, 2011; Satterthwaite, 1946). The p values for the GLMM are computed with an approximate Wald Z test. The choice of the Z test for GLMM inference was due to practical constraints in lme4 (Bates et al., 2015), the package we used for the simulation in R (R Core Team, 2017); we elaborate on this in Appendix B. The significance level α is set to .05.
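
The sketch below shows one way to extract these p values in R. The lmerTest package (an assumed route here, not necessarily the exact one documented in Appendix B) supplies the Satterthwaite degrees of freedom for the LMM:

```r
# Wald Z p value for gamma10_hat from the GLMM fit
p_glmm <- coef(summary(fit_glmm))["D", "Pr(>|z|)"]

# Wald F p value with Satterthwaite df for the LMM; refitting with
# lmerTest::lmer() (assumed here) adds the degrees-of-freedom machinery
library(lmerTest)
fit_lmm_t <- lmerTest::lmer(y ~ D + (D | case), data = dat)
p_lmm <- anova(fit_lmm_t, ddf = "Satterthwaite")["D", "Pr(>F)"]
```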

Simulation conditions

All design and model parameters and their choice of values have been summarized in Table 4, together with a motivation based on the calculations and analyses described in the previous paragraphs. The total number of conditions equals 3 × 3 × 2 × 2 × 2 × 3 × 3 = 648. For ΔG (Eq. 14), which depends on all five model parameters (γ00, γ10, \(\sigma _{u0}^{2}\), \(\sigma _{u1}^{2}\) and σu01), the particular choice of values for these parameters (see Table 4) resulted in 22 unique nominal parameter values. For ΓG (Eq. 19), which depends on γ10 and on \(\sigma _{u1}^{2}\), this resulted in (2 × 2) + 1 = 5 unique nominal parameter values. The 22 ΔG nominal parameter values range from 0 to 53.5. The (rounded) ΓG nominal parameter values equal − .0822, − .0450, 0, 1.1706, and 1.2077. To keep a balance between feasibility and precision with as many as 648 conditions, we generate N = 2000 datasets per condition. With this number of simulated datasets, a condition with a true type I error rate of .05 would have an estimated type I error rate with a standard error of \(\sqrt {\frac {.05\times .95}{2000}} = .0049\), and because we will analyze the results across multiple conditions rather than within individual conditions, the analyses will be based on multiples of 2000 datasets.

Table 4 Simulation condition factors summary

Recall that for fixed effect parameter recovery, the relative bias and the MSE will be analyzed. Two careful considerations have to be made in order to obtain meaningful results for the relative bias and the MSE. First, since ΓG can take on negative nominal parameter values, dividing by such a nominal value would flip the sign of the relative bias. Therefore we opt to calculate a modified relative bias by dividing the bias by the absolute value of the nominal parameter value:

$$\frac{\overline{\hat{\theta}_{i}} - \theta}{\left|\theta\right|} $$

Second, the magnitude of the MSE scales with the nominal parameter value, which makes MSEs difficult to compare when the range of nominal parameter values is large (as it is for ΔG). Therefore we opt to calculate a relative MSE by dividing the MSE by the squared nominal parameter value:

$$\frac{\text{MSE}\left( \hat{\theta}_{i}\right)}{\theta^{2}} $$
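Both measures are simple functions of the vector of estimates within a condition; a minimal sketch:

```r
# Modified relative bias and relative MSE for a vector of estimates `est`
# across the iterations of one condition, with nominal value `theta`
rel_bias <- function(est, theta) (mean(est) - theta) / abs(theta)
rel_mse  <- function(est, theta) mean((est - theta)^2) / theta^2
```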

Summary of results

To address the previously stated research objectives, we compare the LMM and GLMM results and discuss their performance in terms of goodness of fit (SAIC, Eq. 11, and SBIC, Eq. 12), fixed effect parameter recovery (quantified by the relative MSE and relative bias of the Δ and Γ estimators), type I error rate, and power. We studied the effect of the following design factors: baseline-treatment category (as defined in Table 3), effect size category (as defined in Table 3), number of measurements I, and number of cases J. We choose to study the effect of the baseline-treatment and effect size categories rather than the effect of the individual model parameters γ00, γ10, \(\sigma_{u0}^{2}\), \(\sigma_{u1}^{2}\) and σu01, because (1) the categories are more easily interpretable, (2) they relate directly to the research questions stated in this study, and (3) the simulation conditions were generated using these categories rather than the individual parameters. To assess the impact of these factors on the performance outcomes, we conduct an ANOVA and calculate η2 values. The results are shown in Table 5. To avoid discussing trivial effects, we only discuss factors that explain at least 14% of the variance in the outcome variables (shown in bold in Table 5); this cutoff is based on the rule of thumb for a large effect suggested by Cohen (1988). However, we include the factor Model in all of our results because of our explicit interest in assessing the performance of the LMM with the GLMM’s performance as a benchmark.

Table 5 Eta-squared values (η2) for association of design factors with outcomes

For graphical purposes, the baseline-treatment categories from Table 3 are denoted as follows in the graphical results: a highly discrete average baseline response combined with a highly discrete average treatment response is denoted as category ‘HD-HD’ (from ‘highly discrete - highly discrete’), a highly discrete average baseline response combined with an approximately continuous average treatment response is denoted as category ‘HD-AC’ (from ‘highly discrete - approximately continuous’), and an approximately continuous average baseline response combined with an approximately continuous average treatment response is denoted as category ‘AC-AC’ (from ‘approximately continuous - approximately continuous’).

Software

We use the open-source R software (R Core Team, 2017) to generate and analyze the SCED count data. The LMM and the GLMM are estimated with the lmer() and glmer() functions, respectively, both available in the lme4 package (Bates et al., 2015). With the default argument settings, the lmer() function provides restricted maximum likelihood (REML) estimates of the LMM parameters, and the glmer() function provides estimates based on an adaptive Gauss–Hermite quadrature approximation of the log-likelihood (with the default of one quadrature point, i.e., the Laplace approximation). In Appendix B, we provide some R code samples and explain how we obtained and analyzed the LMM and GLMM estimates.
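
For reference, the two estimation calls look as follows; the formula and data names are illustrative, and the exact calls are given in Appendix B:

```r
# Illustrative model fits; `dat` as in the data-generation sketch
library(lme4)
fit_lmm  <- lmer(y ~ D + (D | case), data = dat)   # REML by default
fit_glmm <- glmer(y ~ D + (D | case), data = dat,
                  family = poisson)                 # default nAGQ = 1 (Laplace)
```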

Results

Goodness of fit criteria

Previously, a relative AIC score SAIC and a relative BIC score SBIC were defined (see Eqs. 11 and 12). The analysis results for both scores are very similar, so only results for the SAIC scores are reported in this paper. From Table 5, we see that most of the variability in SAIC is associated with the baseline-treatment category. Figure 1 shows the distribution of all SAIC scores for each of the baseline-treatment categories. By definition, a negative value of SAIC indicates that the LMM fit results in a lower AIC than the GLMM fit, and thus that the LMM performs relatively better. Figure 1 shows that this is almost never the case, except for some observations within the AC-AC category. A closer look at the conditions that yielded negative mean SAIC scores shows that those scores only occur when J = 4, or when I = 8 and J = 8, indicating that the GLMM no longer outperforms the LMM in terms of goodness of fit only when the data are approximately normal and information is sparse. This is due both to the GLMM being more complex to estimate and to the fact that count data with higher expected values are better approximated by a normal distribution, which makes the LMM’s assumption of normally distributed residuals, and therefore a normally distributed dependent variable, more plausible.

Fig. 1 Mean AIC scores SAIC. Baseline-treatment categories are based on Table 3

Fixed effect parameter recovery

For the statistics ΔG (Eq. 14), ΔL (Eq. 16), ΓG (Eq. 19) and ΓL (Eq. 21), a simple linear regression analysis is conducted to study the relation between the LMM and the GLMM estimators of Δ and Γ. The fitted model predicts the LMM estimates from the GLMM estimates. A significant regression equation was found for both the Δ and the Γ estimates, with R2 values of .9986 (ΔL = 0.0225 + 1.0042 ⋅ΔG) and .9963 (ΓL = − 0.0066 + 1.0088 ⋅ΓG), respectively. This is an important result because it allows for comparison between the GLMM and the LMM based on their parameter estimates. Given that the fixed effect estimates of the GLMM and the LMM can be compared in this way, the next step is to assess which model provides the best fixed effect estimator. To assess the quality of ΔG, ΔL, ΓG and ΓL as estimators, the relative bias and the relative MSE of all four are analyzed. Note that conditions where Δ = 0 or Γ = 0 were left out in order to be able to calculate a finite relative bias and relative MSE.

For the relative bias of the Δ estimates, we see from Table 5 that none of the design factors is associated with an η2 value higher than our cutoff value of 14%. The total η2 for the relative bias of Δ equals 0.1892. Because the ANOVA corrects for all factors on which we defined our simulation conditions, this low total η2 indicates that most of the variation in the relative bias of Δ must be due to sampling error. Across all conditions, the relative bias of Δ ranges from − 0.035 to 0.38 with a median of 0.00097, indicating that for many conditions, Δ is estimated with little or no bias. The factors with higher η2 values in Table 5 indicate which factors affect the bias in the Δ estimates. In Fig. 2, the relative bias is shown across different levels of the two factors with relatively high η2 values, i.e., model and effect size. The LMM is the model that estimates Δ directly, and its associated estimator ΔL is less biased than the GLMM’s ΔG estimator. This is especially true when the effect size is small, although even then the relative bias of ΔG is still reasonably small.

Fig. 2 Relative bias of the ΔG and ΔL estimators. Effect size categories are based on Table 3

The relative bias of Γ is mostly associated with the size of the effect (η2 = .2464 in Table 5) and is shown in Fig. 3. Again, the model that cannot directly estimate the statistic, i.e., the LMM, appears to be the most biased. The relative bias of ΓL goes up to 40% when the effect size is small. Remarkably, the relative bias of Γ is highest for small effect sizes, and slightly lower when the effect size is zero. For large effect sizes, however, both the LMM and the GLMM estimators have very little bias. Looking deeper into the high relative bias observed in conditions where the effect size is small to zero, we see in Fig. 4 that the higher relative biases are associated with conditions where J is small and, to a lesser extent, with conditions where the underlying data are highly discrete. These observations hold for both ΓL and ΓG.

Fig. 3 Relative bias of the ΓG and ΓL estimators. Effect size categories are based on Table 3

Fig. 4 Relative bias of the ΓG and ΓL estimators for conditions where the effect size is small to zero. Baseline-treatment categories are based on Table 3

From Table 5 we see that the relative MSE values hardly depend on the underlying model. The effect size category has the largest association, and a closer look at the MSE values reveals that the relative MSE values for both Γ and Δ are highest when the effect size is small to zero. These conditions are investigated further in Figs. 5 and 6. We compare across different levels of baseline-treatment category and number of cases J because those factors yield the second and third highest η2 values in Table 5. When J is small and/or when the underlying data are more discrete in nature (as in the HD-HD category), the relative MSEs are higher. Since the GLMM and the LMM have very similar relative MSEs no matter the baseline-treatment category, we cannot conclude that the LMM’s relative MSEs improve (relative to those of the GLMM) when the underlying data become more continuous.

Fig. 5 Relative MSE of the ΔG and ΔL estimators for conditions where the effect size is small to zero. Baseline-treatment categories are based on Table 3

Fig. 6 Relative MSE of the ΓG and ΓL estimators for conditions where the effect size is small to zero. Baseline-treatment categories are based on Table 3

To summarize the results of the relative bias and the relative MSEs of the Δ and Γ parameter estimates, Table 6 shows the overall means of both measures. This table confirms that the relative MSE is very similar for both models, but it also shows a slight disadvantage for the model that cannot directly estimate the parameter. The latter observation is clearer for the relative bias, as was also apparent from Figs. 2 and 3.

Table 6 Overall mean relative bias and relative mean squared error for the Δ and Γ parameter estimates

Inference

Because the type I error rates are calculated as the proportion of rejections in conditions where the nominal effect is zero, and because in that case the effect size factor only has one level, effect size was left out of the ANOVA. Based on the η2 values in Table 5, we look further into how the baseline-treatment category affects the type I error rate (η2 = .2487). From Table 7, it is clear that the type I error rate of the GLMM is higher than that of the LMM. In conditions where the underlying data are highly discrete, the type I error rate of the GLMM improves, but a closer look at the data revealed that the type I error rate of the LMM is consistently closer to the nominal α = .05 than that of the GLMM.

Table 7 Type I error rates

The power naturally depends mostly on the effect size (η2 = .8835, Table 5), because power generally increases for larger, more noticeable effects. The impact of all other factors falls below our 14% cutoff for η2. However, to study what sample sizes are needed to reach an acceptable power level, we also consider the number of cases J as a factor when analyzing the power of Γ in Table 8 and of Δ in Table 9. Our choice of J rather than I (the number of measurements) is based on the higher η2 value in Table 5 (η2 = .0137 for J versus η2 = .0003 for I). From Tables 8 and 9, we can indeed see that the power increases as the sample size increases. For large (absolute) values of Δ and Γ, the power approaches 1. Comparing the two models, we see that the power of the GLMM is consistently higher than that of the LMM for both Γ and Δ. However, the power of the LMM reaches the commonly accepted 80% threshold (Cohen, 1988) when J ≥ 8 in conditions where the effect is large (Δ ≥ 5 or |Γ| ≥ 1.1706). When J = 4 the power of the LMM stays below .63 (Δ) or .56 (Γ) even for the largest effects.

Table 8 Proportion of rejections as a function of Γ
Table 9 Proportion of rejections as a function of Δ

Discussion

With this simulation study, we wanted to see whether the GLMM consistently outperforms the LMM, and, if not, in which cases the LMM has an acceptable performance. Three aspects of both models have been considered to assess their performance: goodness of fit, fixed effect parameter recovery, and inference.

In terms of goodness of fit, the LMM generally does not perform as well as the GLMM. In Fig. 1, the vast majority of the SAIC scores lies above 0, indicating that the AIC of the GLMM is generally lower than the AIC of the LMM according to Eq. 11. Only when the baseline and treatment average responses are relatively high and the number of cases is very small (J = 4) does the LMM achieve a goodness of fit comparable to that of the GLMM. In conditions with very sparse information, the more complex GLMM is at a disadvantage compared to the LMM. In addition, the LMM benefits when the baseline and treatment phase averages of the underlying count data are high, because the normal approximation of the data is then good.

To assess the performance of both models in terms of fixed effect parameter recovery, we compared their parameter estimators \(\hat{\Delta}_{G}\) vs. \(\hat{\Delta}_{L}\) and \(\hat{\Gamma}_{G}\) vs. \(\hat{\Gamma}_{L}\). The most important measure of the quality of an estimator is the MSE, because it encompasses both the bias and the variance. A good estimator should have an MSE as small as possible, i.e., a bias of zero and a small variance. From Table 6 and from Figs. 5 and 6 it is clear that the MSEs of the estimators of both models are on average very similar, with a slight advantage for the model that can directly estimate the parameter (i.e., the LMM for Δ and the GLMM for Γ).

In terms of inference, the first step in comparing the performance of the LMM with that of the GLMM is to look at the type I error rate. As seen in Table 7, the type I error rate of the LMM is better under control than that of the GLMM. Although this might seem surprising, similarly good behavior of less complex albeit misspecified (generalized) linear mixed models on small-sample data has been observed before (Bell et al., 2014). The more complex models, even though theoretically better suited to the data, might function poorly when making too many estimates from too few pieces of information (Muth et al., 2016). Since the type I error rate of the LMM is under control, the next step is to look at its power. From Tables 8 and 9 it is clear that the LMM does not attain the same power as the GLMM, not even for large effects. Only when the effect size and the number of cases J are large (Δ ≥ 5 or Γ ≥ 1.1706, and J ≥ 8) does the power of the LMM reach a level of 80%. This was true for all values of I (the number of measurements) considered in our simulation.

For applied research, a crucial next question is: when is it acceptable to use an LMM to analyze single-case count data? In terms of goodness of fit, the LMM only yields acceptable AICs (i.e., AICs as low as or lower than those of the GLMM) if the count data are well approximated by a normal distribution in both the baseline and the treatment phase and if the sample size (and especially the number of cases J) is very small. However, even in the conditions where the GLMM fits better, the goodness of fit of the LMM is only about 10% worse than that of the GLMM (Fig. 1). If this is considered acceptable, we recommend using the LMM in situations where the estimated effect size and the number of cases are reasonably large (J ≥ 8), to ensure acceptable power and unbiased fixed effect estimates.

When it comes to selecting an effect size to express the fixed effect, applied researchers need to determine whether they are specifically interested in the additive effect expressed by Δ or in the effect expressed by Γ. It makes sense to opt for the additive effect expressed by Δ because it is more easily interpretable. Moreover, its estimate \(\hat{\Delta}\) is readily available from the fitted LMM, as it does not require any transformation. Since this simulation study has provided quantitative evidence of the good performance of the ΔL estimator in terms of relative bias and relative MSE, we do not discourage the use of the LMM to model single-case count data when an estimate of Δ is the goal, even though it is an overly simplified model. Inference based on ΔL is valid, because the type I error rate of ΔL is under control and behaves well in all conditions. Again, caution is advised when drawing inferences based on ΔL if the effect size or the number of cases is small, since the power might then not be acceptable.

When practitioners want to estimate the effect size Γ, it is preferable to use the GLMM, to avoid the manipulations required to obtain the ΓL estimate from the LMM (as illustrated in Appendix B). The GLMM will result in a slightly higher bias for Γ, but a lower MSE, compared to modeling an LMM and estimating Δ. Even when using the GLMM to estimate Γ, however, there might be up to 10% relative bias in the estimates in some conditions, particularly when the effect size is not large and the amount of data available for estimation is limited. When the sample sizes increase (i.e., the measurement series gets longer and the number of participants increases), this bias disappears.

If practitioners decide to model SCED count data using the GLMM with a Poisson distribution (Eq. 1), they need to be aware of the assumptions associated with the Poisson distribution (Winkelmann, 2008). First of all, the length of the time intervals or sessions during which the counts are measured has to be the same across the entire time series. Applied researchers might already do this intuitively to make counts comparable over sessions, or based on good practices recommended by single-case handbooks (Ayres & Gast, 2010). If the time series includes sessions of different lengths, the GLMM (1) can be adjusted to account for this by including an offset (Casals et al., 2015). The outcome modeled is then a rate rather than a count.

Another assumption to take into account when modeling a Poisson distribution is that the rate of occurrence has to be constant within each time interval or session; that is, the probability of occurrence of the measured event should be constant throughout each time interval. This assumption might be violated when an observed participant is disturbed by an external event or factor during a measuring session and this disturbance has a temporary impact on the measured outcome. For example, when measuring problem behavior in a classroom environment, an observed participant might show temporarily increased problem behavior when a classmate initiates a fight with the participant during a measuring session. To lessen the likelihood of external factors impacting the rate of occurrence, practitioners can try to keep the measuring sessions short.

A final assumption of the Poisson distribution is that the events occurring in different time intervals should be independent. This assumption is violated when autocorrelation is present in the data. Practitioners can try to prevent this by making sure their measuring sessions are spaced far enough apart in time.

The results presented in this study have limitations inherent to all simulation studies, i.e., they are conditional on the simulation design and parameter values used. Because this study is the first of its kind, we have used the most basic GLMM design to simulate data and all simulation conditions were exclusively based on sample size and nominal parameter values. Naturally, there is much more to a GLMM design than these two aspects, and the many GLMM design extensions could all provide starting points for further exploration of the impact of model misspecifications for count data. These extensions include: (1) using alternative probability distributions to sample the dependent variable from, such as the binomial or other discrete distributions (to model discrete proportions or percentages), zero-inflated distributions and distributions fit for over-dispersed data; (2) specifying a specific covariance structure and as such modeling autocorrelation, rather than using an unstructured covariance matrix like in this study; (3) simulating data with variable exposure (i.e., the frequency of the behavior of interest is not tallied across the same period of time at each measurement occasion); (4) including linear or non-linear trends in the simulated data and in the fitted models; (5) using different single-case design types, e.g., alternating treatments or phase changes with reversal, rather than the multiple baseline AB design used in this study; and (6) simulating unbalanced data.

We focused mainly on the average treatment effect when comparing the results of the LMM and the GLMM estimations. This is in line with common practice, where applied researchers who are combining SCED count data are usually primarily interested in the average treatment effects (as expressed by Γ and Δ in this study), rather than in the individual treatment effects or the variance components. Moreover, just like average treatment effect estimations, individual effect and variance component estimations are not comparable between the LMM and the GLMM. Attempting to compare them would involve a similar and arguably even more complex method of transformation as illustrated for Γ and Δ. This is beyond the scope of this study.

Finally, we want to point out that the inference results of the GLMM are based on an approximate Wald Z test, which is likely to misspecify the sampling distribution of the Wald statistic as normal, especially in small samples. As explained in Appendix B, this was due to a lack of available procedures in the lme4 package in R. In SAS, the PROC GLIMMIX procedure does include options to set different degrees-of-freedom approximations that adjust for small sample sizes. It would be very useful to reanalyze our simulated datasets in SAS to see whether the inference results lead to substantially different conclusions from those we drew based on the R p values.

Conclusions

This simulation study showed that the GLMM in general does not substantially outperform the LMM, except in terms of the goodness-of-fit criteria. For the small sample sizes that we considered, which are common for SCED count datasets, we found that the LMM does as well as the GLMM in terms of fixed effect parameter recovery. In terms of inference, the type I error rates of the LMM are better under control than those of the GLMM. The power of the LMM is generally lower than the power of the GLMM, but the LMM might provide acceptable power for SCED samples with a sufficient number of cases. This simulation provided some evidence that the GLMM might not necessarily be the better choice for very sparse SCED count data, because the model is then too complex to estimate. Evidence for relatively better performance of the LMM when the expected count responses in the baseline and/or treatment phases are relatively high was less clear. Based on our results, we have provided some guidelines for applied researchers. Reviewers or meta-analysts using mixed modeling to combine SCED studies should be well aware of the effects of misspecifying their mixed model for discrete data. Their model choice should be carefully considered based on the type of raw data included and on the sample sizes.