Introduction

Universal, school-based social and emotional learning (SEL) interventions foster the social-emotional skills (e.g., self-management, social awareness, relationship skills) of children and young people through explicit instruction in the context of learning environments that are safe, caring, well-managed, and participatory (Weissberg et al. 2015). Three recent meta-analyses have rigorously demonstrated that SEL interventions can lead to meaningful improvements in a range of student outcomes, including their social-emotional skills, mental health, and academic attainment (Durlak et al. 2011; Sklad et al. 2012; Wigelsworth et al. 2016). However, the implementation of SEL interventions is variable, and this variability is hypothesized to be a key moderator of intervention outcomes (Durlak 2016). Our primary objective in this study, therefore, was to assess the relationship between levels of implementation and intervention outcomes in the Promoting Alternative Thinking Strategies (PATHS) curriculum. In doing so, we also sought to contribute to knowledge regarding the distinction between fidelity and quality in implementation and prevention science.

Implementation of School-Based Interventions

The implementation of school-based interventions is typically conceptualized in terms of constructs such as fidelity (what is delivered and how closely does this adhere to intervention guidance materials?), dosage (how much of the intervention is delivered?), quality (how well is the intervention delivered?), reach (was the intervention delivered to all intended recipients?), and participant responsiveness (did recipients engage with the intervention?) (Lendrum et al. 2016). A current source of contention is whether fidelity and quality are distinct. To some, fidelity is a superordinate construct, used to describe the overall pattern of activity, with other implementation dimensions viewed as subordinate indicators (Carroll et al. 2007). This carries an implicit assumption that for intervention outcomes to be replicated, the exact delivery regime under which an intervention was validated must also be replicated. This is the “zero-sum-game” view of implementation: higher fidelity results in better outcomes, and any deviation from the intended intervention model must therefore be detrimental (e.g., Elliott and Mihalic 2004). Fidelity thus becomes synonymous with quality, to the extent that the terms are used interchangeably (Lendrum et al. 2016).

To others, implementation is the superordinate construct, with fidelity included as a subordinate indicator alongside the aforementioned other dimensions (Berkel et al. 2011; Durlak and DuPre 2008) and typically understood and operationalized in procedural terms (e.g., how closely the structure and sequence of activities outlined in the intervention guidance are followed). Implementation quality is viewed as distinct from fidelity, referring to how effectively the intervention has been delivered for the achievement of intended outcomes (O’Donnell 2008; Odom et al. 2010), including facets such as implementer competence and skills, enthusiasm and engagement, and preparedness for implementation (Lendrum et al. 2016). This view is adopted in the current study and is reflected in our approach to assessment (see “Method” section). We see it as more consistent with emergent theorization of implementation that acknowledges the important distinction between what is implemented and how well (Berkel et al. 2011).

Whatever model one subscribes to, there is broad agreement that measurement of implementation is still in its relative infancy: “Even if the concept of implementation is not new, the idea of developing ways of measuring it certainly is” (Ogden and Fixsen 2014, p. 8). Thus, the field has yet to reach a clear consensus regarding the optimal frequency and timing of measurement within a period of implementation (Durlak 2015; Humphrey 2016b). However, one area where there is general agreement is modality, with independent structured observations considered greatly preferable to teacher self-report methods, the latter being subject to substantial positive bias (Hansen et al. 2014).

Although it is generally accepted that “implementation matters” (Durlak and DuPre 2008), the evidence base pertaining to SEL and school-based interventions more generally is currently limited in a number of respects. First, despite a significant rise in the proportion of studies reporting on implementation in the last two decades (currently up to 69% of SEL studies, Wigelsworth et al. 2016; but still less than half of school-based intervention studies more generally, Bruhn et al. 2015), most offer only descriptive data, which are used to provide evidence that a given intervention was actually delivered and thus strengthen the internal validity of trial outcome analyses. By contrast, analyses in which researchers model levels of implementation dimensions as moderators of intervention effects are relatively infrequent, despite their obvious significance in terms of both internal and external validity in program evaluation. For example, a recent systematic review found that only 10% of intervention studies reported implementation-outcome relationships (Schoenwald and Garland 2014).

Second, research published to date has been characterized by a narrow focus on particular aspects of implementation at the expense of others. Thus, while 63% of studies included in Durlak and DuPre’s (2008) seminal review assessed fidelity, only 10% assessed quality. This narrow approach reflects the zero-sum-game model noted earlier and greatly increases the risk of a Type III error (the inaccurate attribution of cause) (Lendrum et al. 2016). Less frequently studied implementation dimensions such as quality may be equally or even more important than fidelity and dosage in driving intervention outcomes (Durlak 2015). A teacher’s preparedness, ability to actively engage and enthuse students, and clarity of delivery are crucial for effective learning; without these, the internalization of lesson content and subsequent skill development that underpin intervention outcomes are unlikely to occur (Berkel et al. 2011; Lendrum et al. 2016).

Third, the conceptualization and assessment of different aspects of implementation currently lag behind other aspects of intervention research. As noted above, terms such as fidelity and quality have been used interchangeably in some studies (e.g., Social and Character Development Research Consortium 2010). Furthermore, the means by which they are measured generally lack the level of rigor afforded to the assessment of intervention outcomes (Ogden and Fixsen 2014). A recent systematic review found that only around one third of papers provided any data on the psychometric properties of the instruments used to generate implementation data (Schoenwald and Garland 2014). Studies reporting factor analytic work to establish the distinctiveness of implementation dimensions as the foundation for implementation-outcomes analyses are extremely rare (Cross et al. 2015; Pettigrew et al. 2015). To a certain degree, this is understandable, given that each intervention typically generates its own implementation measures (which may be used infrequently) and that interventions often evolve over time (meaning that implementation measures would also need to be revised frequently). One possible solution is the development and application of standardized implementation measures, though existing attempts have met with mixed success to date (Humphrey et al. 2016b).

The PATHS Curriculum

PATHS is a universal SEL intervention that aims to help all children to manage their behavior, understand their emotions, and work well with others. It is designed to be delivered by class teachers and includes a series of lessons on topics such as identifying and labeling feelings, generalization activities and techniques that support the application of new skills during the school day, and parent materials that aim to extend learning to the home environment. Further information about PATHS can be found at www.pathseducation.co.uk/. The PATHS materials used in the current study were subjected to a process of cultural adaptation by Barnardo’s (the children’s charity that owns the UK license to distribute PATHS) in order to ‘Anglicize’ them. These primarily surface-level changes (e.g., modified vocabulary, photographs, and names; changes to cultural references) did not substantively change the structure or delivery model of PATHS.

Several randomized trials have found small-to-moderate but practically meaningful effects of PATHS on a range of outcomes, including children’s social and emotional skills (Domitrovich et al. 2007), their mental health (Crean and Johnson 2013), and academic attainment (Schonfeld et al. 2015). However, reflecting the trends noted above, some existing studies of PATHS only provide descriptive implementation data (e.g., Domitrovich et al. 2007). Those PATHS studies where implementation-outcomes analyses have been conducted have, in many ways, led the SEL field in terms of our understanding of implementation. However, findings across such studies have been somewhat inconsistent. Some have found little or no connection between levels of PATHS implementation and outcomes (e.g., Berry et al. 2016; Social and Character Development Research Consortium 2010), while others have found significant associations (e.g., Faria et al. 2013; Schonfeld et al. 2015). These studies all maintained a relatively narrow focus, measuring only one or two implementation dimensions, and none assessed reach or participant responsiveness. Finally, with a few exceptions (Conduct Problems Prevention Research Group 1999; Kam et al. 2003; Social and Character Development Research Consortium 2010), existing studies of PATHS implementation have relied exclusively on teachers’ self-reports to generate implementation data. While convenient, this method is, as noted above, limited by the substantial positive bias teachers show in their self-ratings and by the generally weak relation of such ratings with more rigorous independent observer ratings (Hansen et al. 2014).

The Current Study

In 2012, the authors were commissioned to conduct a 2-year cluster-randomized trial of the PATHS curriculum in primary schools in Greater Manchester, England (ISRCTN85087674). Having already published reports on the outcomes of this trial (which were somewhat varied, with evidence of intervention effects on social-emotional skills, mixed findings in relation to mental health, and null results in relation to academic progress—see Humphrey et al. 2015, 2016a), we turn our attention here to assessing the role of variability in PATHS implementation as a moderator of students’ social and emotional skills and mental health outcomes in the first year of the study. In doing so, we sought to advance the current literature in terms of adopting a more wide-ranging approach to the assessment of implementation, increased objectivity and rigor afforded by the use of independent observational data, and the application of a theoretical framework for implementation that posits quality and fidelity as distinct dimensions, enabling us to concurrently assess the relative importance of what is delivered and how well in determining intervention outcomes (Berkel et al. 2011). These contributions are a direct response to calls for research of this kind in implementation science (e.g., Durlak 2010; Hansen 2014).

Method

Design

A longitudinal, natural variation design was utilized. A multi-method approach was taken, with structured observations and surveys as the primary vehicles for data generation in relation to implementation and intervention outcomes, respectively. Outcomes were assessed at baseline (summer term 2012) and 12-month follow-up (summer term 2013), with structured observations of implementation taking place in between (specifically, autumn term 2012 and winter term 2013). See Fig. 1 for the flow of participants through the study.

Fig. 1 Flow of participants in the study

Participants

Schools and Teachers

Data were drawn from 23 primary schools implementing PATHS across the Greater Manchester region in the northwest of England. Participating schools were representative of norms in England in respect of size, attendance, attainment, ethnicity, and the proportion of children identified as having special educational needs, but had moderately higher proportions of children eligible for free school meals (FSM) and speaking English as an additional language (EAL) than national averages (Department for Education 2012).

Implementation data were collected from 69 Year 3 and 4 teachers/classrooms in the first year of the aforementioned trial. Classes contained an average of 25 students. Teachers of these classes averaged 8 years of classroom experience, were predominantly female (82.5%), were educated to postgraduate level (61.5%), and reported having 2–5 years of experience implementing other SEL programs prior to becoming involved in the current study (40.7%).

Students

Outcome data were generated for N = 1721 (839 male, 882 female) students, whose average age was 7 years, 7 months (range 6 years, 7 months to 8 years, 10 months) at baseline. Their demographic characteristics were consistent with national norms, albeit with the same exceptions noted above regarding the school characteristics (Department for Education 2012). The proportion of children scoring in the borderline/abnormal ranges for mental health difficulties broadly mirrored national norms for children aged 5–10 (www.sdqinfo.com).

Ethics

Participation in the study required consent from schools’ head teachers. Child assent and parental opt-out consent were also sought. Ethical approval from the University of Manchester Research Ethics Committee was sought and received (Ref 11470). In total, 88 parents (5.1% of the sample utilized in the current study) exercised their right to opt their children out of the research, and no children declined assent or exercised their right to withdraw.

Measures

Implementation

PATHS lessons were observed by three research assistants, each of whom was a qualified and experienced teacher trained to Masters level in psychology or education. A structured observation schedule was developed by the authors for the study, drawing on the aforementioned theoretical framework for implementation (Berkel et al. 2011), existing rubrics utilized in previous studies of PATHS (e.g., Kam et al. 2003), advice from the program developer and colleagues at Pennsylvania State University, and the extant literature on the assessment of implementation (e.g., Domitrovich et al. 2010). Two factual indicators—one each for dosage and reach—were generated and supplemented by ten observer-rated indicators designed to assess fidelity, quality, and participant responsiveness (see Table 1). Overall, the data indicated that PATHS was typically delivered between once and twice a week, with most children in a given class present, teachers adhering to most procedural elements outlined in the lesson materials and delivering them well, and children responding appropriately.

Table 1 Descriptive statistics and exploratory factor analysis of PATHS implementation indicators

The schedule and an accompanying explanatory rubric (Footnote 1) were explained in detail to the research assistants ahead of piloting and refinement using video footage of PATHS lessons being implemented in English schools in a previous trial (Berry et al. 2016). In this initial formative stage, which lasted several days, the emphasis was on developing a shared understanding of the various implementation indicators and their application in the context of a PATHS lesson. Additional video footage of PATHS lessons was then used to generate interrater reliability data for each indicator. Given the multiple raters and the ordinal response format of the coding schedule, the intraclass correlation coefficient (ICC) was used. ICC values can range between −1 and 1, with higher values indicating greater consistency between raters. The overall ICC was 0.91, which is considered “excellent” (Hallgren 2012).
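
The paper does not specify which ICC form was used; purely as an illustration, the sketch below computes a two-way random-effects, single-rater ICC(2,1) in Python from a hypothetical matrix of observer ratings (all values invented for demonstration).

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """Two-way random-effects, single-rater ICC(2,1).

    ratings: (n_targets, k_raters) matrix of ordinal/interval scores.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-target (lesson segment) means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA decomposition of the total sum of squares
    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical example: 6 lesson segments rated by 3 observers on a 1-5 scale
scores = np.array([
    [4, 4, 5],
    [3, 3, 3],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 4],
    [5, 5, 5],
])
print(round(icc2_1(scores), 2))
```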

During the live trial observations, each teacher was observed implementing a single PATHS lesson at a mutually agreeable date and time. The third author moderated a randomly selected 10% of these observations in order to guard against drift over time. In order to streamline analyses and thus reduce the likelihood of “model overfitting” (Myung 2000), avoid collinearity, and establish clear differentiation between implementation constructs, the observer-rated implementation data were subjected to exploratory factor analysis (EFA) in SPSS using the Principal Axis Factoring extraction method (common factor analysis) with Varimax (orthogonal) rotation (Footnote 2). The EFA identified two distinct factors, accounting for 69.4% of the explained common variance in the data, corresponding to procedural fidelity (α = 0.93) and quality and responsiveness (α = 0.93), respectively (see Table 1). Bivariate correlation analyses demonstrated that the two identified factors were clearly distinct from one another (r = .02, p = .85) and from the dosage and reach indicators (quality-dosage, r = −.02, p = .79; fidelity-dosage, r = −.04, p = .64; quality-reach, r = .08, p = .38; fidelity-reach, r = .16, p = .07), which in turn shared a weak, albeit statistically significant, association with each other (r = .20, p = .02).
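
As a rough illustration of this analytic step, the sketch below runs a two-factor EFA with Varimax rotation in Python using the factor_analyzer package on simulated indicator data. The study itself used SPSS with principal axis factoring; here the package's default extraction (MINRES) is used as a stand-in, and the item names, data, and loadings are hypothetical.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Simulate 69 classrooms x 10 observer-rated indicators driven by two latent
# dimensions (5 "fidelity-like" and 5 "quality/responsiveness-like" items).
rng = np.random.default_rng(0)
fidelity_latent = rng.normal(size=(69, 1))
quality_latent = rng.normal(size=(69, 1))
items = np.hstack([
    fidelity_latent * 0.8 + rng.normal(scale=0.5, size=(69, 5)),
    quality_latent * 0.8 + rng.normal(scale=0.5, size=(69, 5)),
])
cols = [f"fid_{i}" for i in range(1, 6)] + [f"qual_resp_{i}" for i in range(1, 6)]
obs_df = pd.DataFrame(items, columns=cols)

# Two-factor EFA with an orthogonal (Varimax) rotation
fa = FactorAnalyzer(n_factors=2, rotation="varimax")
fa.fit(obs_df)

loadings = pd.DataFrame(fa.loadings_, index=cols, columns=["factor_1", "factor_2"])
print(loadings.round(2))
print("Cumulative variance explained:", fa.get_factor_variance()[2].round(2))
```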

To preserve the loading values of each item on a given factor, factor scores were generated using the least squares regression approach (DiStefano et al. 2009). These factor scores, which are standardized to a mean of zero, were subsequently used as explanatory variables for procedural fidelity and quality and responsiveness in the main analysis. To facilitate interpretation within and across models, dosage and reach data were also standardized (i.e., converted to z-scores) (Low et al. 2016).
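
To make the scoring step concrete, the following is a minimal sketch of regression-method factor scores (as described by DiStefano et al. 2009) and of the z-scoring of dosage; the indicator matrix, loadings, and dosage codes are invented stand-ins, not the study data.

```python
import numpy as np

# Regression-method factor scores: F = Z @ inv(R) @ L, where Z holds the
# standardized indicators, R is their correlation matrix, and L is the rotated
# loading matrix. All values below are hypothetical.
rng = np.random.default_rng(1)
X = rng.normal(size=(69, 10))                       # 69 classrooms x 10 indicators
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize each indicator
R = np.corrcoef(Z, rowvar=False)                    # indicator correlation matrix
L = rng.normal(scale=0.4, size=(10, 2))             # stand-in rotated loadings

factor_scores = Z @ np.linalg.inv(R) @ L            # mean zero by construction
print(factor_scores.mean(axis=0).round(8))

# Dosage (and reach) were recorded directly; z-scoring puts them on the same
# standardized scale as the factor scores.
dosage = rng.integers(1, 4, size=69).astype(float)  # hypothetical frequency codes
dosage_z = (dosage - dosage.mean()) / dosage.std(ddof=1)
```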

Child Self-Report Version of the SSIS

The 46-item Social Skills Improvement System (SSIS) provides a broadband index of children’s social-emotional skills (Gresham and Elliott 2008). The respondent reads a statement (e.g., “I make friends easily”) and indicates their level of agreement on a four-point scale (never, sometimes, often, always). The instrument is psychometrically sound, with good reliability (internal α up to 0.95; test-retest r up to 0.92) and strong validity (factorial: established through confirmatory factor analysis, CFA; convergent: correlates with a range of similar instruments; discriminative: discriminates between clinical and non-clinical samples) (Humphrey et al. 2011). Internal consistency of the SSIS total social skills scale used in the current study was α = 0.92.

Teacher Informant-Report Version of the SECCI

The five-item Social and Emotional Competence Change Index (SECCI) was derived from the PATHS program evaluation tools (EPISCenter 2014). Respondents indicate the degree of change they have observed in a child (e.g., “The student’s ability to stop and calm down e.g., when angry, excited or upset”) over a specified period of time using a five-point scale (much worse, a little worse, no change, a little improved, much improved). Internal consistency of this instrument in the current study was α = 0.92.

Teacher Informant-Report Version of the SDQ

The 25-item Strengths and Difficulties Questionnaire (SDQ) provides a measure of children’s internalizing symptoms, externalizing problems, and pro-social behavior (Footnote 3). Respondents read a statement (e.g., “[This child] often lies or cheats”) and indicate their level of agreement on a three-point scale (not true, somewhat true, certainly true). The SDQ has robust psychometric properties, with evidence of both reliability (internal α up to 0.87; test-retest r up to 0.80) and validity (factorial: established through CFA; convergent: correlates with a range of similar instruments; predictive: strongly predictive of independently diagnosed psychiatric disorders) (Goodman et al. 2010; Goodman 2001). Internal consistency in the current study was α = 0.87 for internalizing symptoms, α = 0.90 for externalizing problems, and α = 0.86 for pro-social behavior.
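
The internal consistency coefficients reported for the SSIS, SECCI, and SDQ are Cronbach's alphas; as a brief illustration, the sketch below shows how alpha is computed from an item-response matrix (the response data are invented for demonstration).

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) matrix of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: 8 respondents answering a 4-item scale scored 0-2
responses = np.array([
    [2, 2, 1, 2],
    [0, 1, 0, 0],
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [0, 0, 1, 0],
    [1, 2, 1, 1],
    [2, 1, 2, 2],
    [0, 0, 0, 1],
])
print(round(cronbach_alpha(responses), 2))
```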

Statistical Analysis

Outcome data were standardized (i.e., converted to z-scores) prior to analysis. In addition to mean-centering the data, this procedure facilitates interpretation and produces standardized regression coefficients that are comparable to an effect size that accounts for the other variables in the model, thereby increasing precision and rigor (Bierman et al. 2014). In view of the hierarchical and clustered nature of the dataset, we used hierarchical linear modeling in MLwiN 2.32. Each model was fitted with two levels (classroom, child), with score at follow-up as the response variable. At the class level, procedural fidelity, quality and responsiveness, dosage, and reach were entered as explanatory variables. Given that there are no universally agreed thresholds of implementation ratings for PATHS (or indeed any school-based intervention; any that have been imposed in studies to date are arguably arbitrary; Berry et al. 2016), we used the observational data to classify each class/teacher as “low,” “moderate,” or “high” for each aspect of implementation using a distributional cut-point method (low, <−1 SD; moderate, −1 to +1 SD; high, >+1 SD; in subsequent dummy coding, low implementation was the designated reference group). Importantly, these designations were statistical rather than qualitative (that is, they were based on relative position in the distribution as opposed to arbitrarily imposed thresholds of what “good” implementation might look like; Durlak and DuPre 2008). An exception to this was reach, which was coded as high (100%), moderate (90–99%), or low (89% or less) according to the proportion of students present during the observed PATHS lesson. Descriptive statistics pertaining to these implementation subgroups are available in Table 2.

Table 2 Descriptive statistics (n, means, and SDs) for implementation subgroups
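
As an illustrative sketch of the distributional cut-point classification and dummy coding described above (simulated scores; variable and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Classify standardized implementation scores into low/moderate/high bands and
# dummy code them with "low" as the reference group. Data are simulated.
rng = np.random.default_rng(2)
df = pd.DataFrame({"quality_score": rng.normal(size=69)})  # standardized factor scores

def classify(z: float) -> str:
    if z < -1:
        return "low"
    if z > 1:
        return "high"
    return "moderate"

df["quality_group"] = df["quality_score"].apply(classify)

dummies = (
    pd.get_dummies(df["quality_group"])
    .reindex(columns=["low", "moderate", "high"], fill_value=0)
    .astype(int)
)
# "low" is the reference category, so only the moderate and high dummies enter the model
df[["quality_moderate", "quality_high"]] = dummies[["moderate", "high"]]
print(df["quality_group"].value_counts())
```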

Given their established associations with social-emotional skills and mental health outcomes (e.g., Green et al. 2005), gender and FSM eligibility were entered as covariates alongside baseline outcome scores at the child level. Guidance on power and sample size for hierarchical linear modeling suggested that the level-two (classroom) sample should be the principal focus given that the primary aim of our analysis was to test the effects of variables at this level (Snijders 2005); here, the level-two sample was deemed sufficiently large to support the explanatory variables noted above (Green 1991; Snijders and Bosker 2012).
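
The trial models were fitted in MLwiN; purely as an analogous sketch, a two-level (children nested in classrooms) random-intercept model with child-level covariates can be specified in Python's statsmodels as below. The data are simulated and the variable names hypothetical; this is not the authors' analysis code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate children nested in classrooms, with a classroom-level implementation dummy
rng = np.random.default_rng(3)
n_classes, n_per_class = 69, 25
classroom = np.repeat(np.arange(n_classes), n_per_class)
class_effect = rng.normal(scale=0.3, size=n_classes)[classroom]

df = pd.DataFrame({
    "classroom": classroom,
    "baseline_z": rng.normal(size=n_classes * n_per_class),
    "gender": rng.integers(0, 2, size=n_classes * n_per_class),
    "fsm": rng.integers(0, 2, size=n_classes * n_per_class),
    "quality_high": np.repeat(rng.integers(0, 2, size=n_classes), n_per_class),
})
df["followup_z"] = (0.5 * df["baseline_z"] - 0.2 * df["quality_high"]
                    + class_effect + rng.normal(scale=0.8, size=len(df)))

# Two-level random-intercept model: follow-up score on covariates plus the
# classroom-level implementation dummy, with a random intercept per classroom
model = smf.mixedlm(
    "followup_z ~ baseline_z + gender + fsm + quality_high",
    data=df, groups=df["classroom"],
)
result = model.fit()
print(result.summary())
```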

Implementation data were missing at the classroom level in 6% of cases, where teachers left the school during the observation window. At the child level, outcome data were missing at either baseline or follow-up for between 13% (SECCI) and 30% (SSIS) of the sample due to student absence or students having left a given school. Missing value analysis showed that the data were not missing completely at random (MCAR) but were instead conditional on other variables (e.g., pupil outcome data were more likely to be missing at follow-up). The data were therefore considered missing at random (MAR) (Heitjan and Basu 1996). Accordingly, multiple imputation procedures were carried out in REALCOM-Impute, using the MAR assumption (Carpenter et al. 2011). This enabled us to include both partially and completely observed cases of all 69 teachers/classes and 1721 students in the analysis, thereby reducing the bias associated with attrition. Gender and the constant were entered as auxiliary variables. REALCOM-Impute default settings of 1000 iterations, a burn-in of 100, and a refresh rate of 10 were used, following guidance for multilevel imputation with mixed response types (Carpenter et al. 2011).
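
The study used multilevel MCMC imputation in REALCOM-Impute, which the following sketch does not reproduce; it shows only a simplified, single-level chained-equations imputation under the MAR assumption using scikit-learn, as a rough analogue on simulated data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Simulate a small data matrix and inject roughly 15% missingness
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.15] = np.nan

# Chained-equations (iterative) imputation; a single-level simplification of the
# multilevel approach actually used in the trial
imputer = IterativeImputer(max_iter=20, random_state=0)
X_completed = imputer.fit_transform(X)
print(np.isnan(X_completed).sum())  # 0: all missing cells filled in
```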

Results

Descriptive statistics are presented in Tables 1 (overall implementation data and factor loadings), 2 (implementation data by subgroup), and 3 (outcome data). Inferential statistics are presented in Table 4. In the interests of brevity, only the multiply imputed analyses are shown; complete case analyses were also performed, but there were no substantive differences in findings. In all cases, inclusion of the explanatory implementation variables significantly improved model fit when compared to “unconditional” models (chi-squared tests of the change in −2*log-likelihood values were all significant at p < .001). The intra-cluster correlation coefficients for the models presented in Table 4 ranged from 0.02 (SSIS) to 0.35 (SECCI).

Table 3 Descriptive statistics (means and SDs) for teacher (SECCI) and child ratings of social and emotional skills (SSIS) and teacher ratings of pro-social behavior, internalizing symptoms, and externalizing problems (SDQ)
Table 4 Hierarchical linear models of the associations between levels of implementation and intervention outcomes in the PATHS curriculum
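
As a brief sketch of the deviance comparison reported above, the change in −2*log-likelihood between an unconditional and a conditional model can be referred to a chi-squared distribution with degrees of freedom equal to the number of added parameters; all values below are hypothetical.

```python
from scipy.stats import chi2

# Likelihood-ratio (deviance) comparison with illustrative, invented values
deviance_unconditional = 4210.7   # -2*log-likelihood, model without implementation terms
deviance_conditional = 4168.3     # -2*log-likelihood, model with implementation terms
n_added_params = 8                # e.g., two dummies for each of four implementation dimensions

lr_stat = deviance_unconditional - deviance_conditional
p_value = chi2.sf(lr_stat, df=n_added_params)
print(f"chi2({n_added_params}) = {lr_stat:.1f}, p = {p_value:.4f}")
```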

Quality and Responsiveness

Compared to low levels, high levels of implementation quality and participant responsiveness were associated with significantly lower ratings of students’ externalizing problems at 12-month follow-up (β = −0.26, p = .04). This effect was mirrored in a marginal, non-significant trend relating to moderate implementation quality and participant responsiveness for the same outcome variable (β = −0.14, p = .11). Levels of implementation quality and responsiveness were not significantly associated with any other intervention outcome (all p > .05).

Procedural Fidelity

Levels of procedural fidelity were not significantly associated with any intervention outcome (all p > .05).

Dosage

Contrary to expectations, high levels of dosage (compared to low) were associated with significantly lower ratings of students’ pro-social behavior at 12-month follow-up (β = −0.52, p = .02). Similarly, both moderate (β = −0.25, p = .01) and high (β = −0.28, p = .03), compared to low levels of dosage, were associated with significantly lower ratings of students’ social-emotional skills. Levels of dosage were not significantly associated with any other intervention outcome (all p > .05).

Reach

Levels of reach were not significantly associated with any intervention outcome (all p > .05).

Discussion

The principal aim of the current study was to assess the relationship between implementation and intervention outcomes in the PATHS curriculum. In doing so, we sought to offer distinct contributions to the field by adopting a more wide-ranging approach to the assessment of implementation than has previously been evident, through the increased objectivity and rigor afforded by the use of independent observational data, and via the application of a theoretical framework for implementation that posits quality and fidelity as distinct dimensions, enabling us to concurrently assess the relative importance of what is delivered and how well in determining intervention outcomes (Berkel et al. 2011). Our analysis of observational implementation data revealed distinct dimensions of implementation quality and responsiveness and procedural fidelity. Implementation-outcomes analyses demonstrated that high (and, marginally, moderate) levels of implementation quality and responsiveness were associated with significantly lower ratings of students’ externalizing problems. Contrary to expectations, high (and, for students’ social-emotional skills, moderate) levels of dosage were associated with significantly lower ratings of students’ pro-social behavior and social-emotional skills. No associations were found between variability in either procedural fidelity or reach and intervention outcomes.

The factor analytic model of our observational data offers clear empirical support for the integrated model of implementation that posits fidelity and quality as distinct dimensions (Berkel et al. 2011). The fact that the observational indicators relating to quality and those relating to participant responsiveness loaded strongly together in our EFA (Footnote 4) is also consistent with the integrated model, in which the former is seen as a foundation for the latter. Parallels may be drawn here with the literature on therapeutic alliances; in this body of work, the relational and interactional bond between therapist and client is articulated in terms of the competence and skills of the former and the engagement and active participation of the latter. Interestingly, the quality of this alliance has been shown to be a reliable predictor of positive intervention outcomes regardless of the therapeutic model used (Ardito and Rabellino 2011). Thus, as in the current study, quality matters.

Our implementation-outcomes analyses challenge the predominance of fidelity and dosage in the study of school-based interventions. Elsewhere, we have argued against the “zero sum game” view of implementation (e.g., fidelity is all that matters, and therefore attention to matters beyond fidelity is not worthwhile; Elliott and Mihalic 2004) on conceptual and theoretical bases (Lendrum et al. 2016). Here, we extend our position by demonstrating empirically that variability in procedural fidelity appears to be unrelated to intervention outcomes in the PATHS curriculum. Our findings align with those of Berry et al. (2016) and the Social and Character Development Research Consortium (2010). These authors also found no association between fidelity and outcomes in their recent PATHS trials. However, it is important to note that their analyses did not take account of the critical distinction between fidelity and quality made in the current study.

Our findings contrast with those of Faria et al. (2013) and Schonfeld et al. (2015), both of whom found a significant, positive association between PATHS dosage and outcomes. The dosage levels reported in the current study (see Table 1) are comparable with those of Faria et al. (2013) (Footnote 5), so the apparent negative effect seen here is presumably not because of a failure to achieve a “minimum effective dose” (Liu 2010). Instead, we speculate that methodological and other differences between our studies may account for the apparent incongruence. For example, Schonfeld et al. (2015) used different methods to assess implementation (teacher self-report), covered a longer period of implementation (up to 4 years), and assessed different outcomes (academic attainment) than the current study.

Such differences aside, the question remains as to why higher levels of dosage were found to be associated with significantly worse intervention outcomes. One possible reason is that high dosage came at the expense of quality. Put another way, some teachers may have engaged in a “race to the finish line,” implementing PATHS quickly rather than implementing it well. An alternative explanation is that the teachers who implemented PATHS more frequently did so because they had a lower-functioning class. In support of this hypothesis, exploration of the study data indicated that children in the moderate and high dosage classrooms demonstrated marginally higher internalizing symptoms and externalizing problems, and lower pro-social behavior, at baseline. Finally, it may simply be the case that more frequent delivery of PATHS lessons displaced other efficacious activities (including, potentially, targeted interventions).

Conversely, we found that higher implementation quality and participant responsiveness was associated with lower ratings of students’ externalizing problems at 12-month follow-up. These analyses support Durlak’s (2015) claim that “in some circumstances, quality of delivery… may be more strongly related than other implementation components to some program benefits” (p.1126) and add to a small but growing evidence base on the importance of this dimension of implementation as a moderator of intervention outcomes. In particular, our findings support those of Pettigrew et al. (2015), whose implementation-outcomes analyses of the keepin’ it REAL program revealed that implementation quality and participant responsiveness were more reliable predictors of intervention outcomes than fidelity. This emergent pattern of findings suggests that a broadening of focus to incorporate quality and responsiveness is perhaps warranted in implementation support processes (e.g., initial training, on-going technical support and assistance). This may, however, prove to be challenging for manualized interventions that perhaps lend themselves to a more procedural emphasis.

The current study is not without limitations. Chief among these is the fact that we were only able to observe each teacher/classroom once, thereby providing only a “snapshot” of implementation. The general recommendation is to capture implementation on multiple occasions, both to improve reliability and to allow temporal patterns to be identified and taken into account in analyses (Humphrey et al. 2016b). As a counterpoint, however, we note that some major observational studies of temporal patterns in implementation have actually evidenced high levels of stability in key dimensions (e.g., Hansen et al. 2013). Of particular relevance is Domitrovich et al.’s (2010) study of PATHS, which found no significant changes in fidelity, dosage, or participant responsiveness when growth models were applied to implementation data collected monthly over the course of a school year. Similar temporal stability (in implementation quality) was found in the Fast Track trial of PATHS (CPPRG 1999). Moreover, multiple observations in the current study were simply not possible due to resource and data burden considerations. We do note, though, that because observations were scheduled with teachers in advance, they may have prepared differentially for the observed lessons. This is, however, almost impossible to avoid, given the ethical and practical considerations inherent in observational studies of implementation in schools.

A second limitation is that, despite adopting a more wide-ranging approach to the assessment of implementation than had previously been evident, the current study was not completely comprehensive. It is difficult, if not impossible, to study all implementation components simultaneously (Durlak 2016). Specifically, we were not able to include data on program differentiation or adaptations in the analyses reported here. In terms of the former, establishing the distinctiveness of a given intervention from existing classroom practice is crucial in terms of determining its “achieved relative strength” (Nelson et al. 2012). In relation to the latter, assessment needs to take into account the reasons for adaptation (e.g., logistical, philosophical), their timing (e.g., pro-active, reactive), and valence (e.g., positive, negative, neutral) (Moore et al. 2013). These two dimensions have proven particularly elusive in the broader field (Humphrey et al. 2016b). However, recent work by Hansen et al. (2013) suggests that reliable and valid assessment is possible, albeit time consuming and costly.

Finally, we should also note alternative explanations for the lack of positive associations between procedural fidelity, dosage, and the outcomes modeled herein. It is possible, for example, that once minimum effective levels of these dimensions of implementation are reached, little or no “added value” is gained from higher levels. While somewhat plausible, this explanation does not align well with findings of other studies (e.g., the aforementioned study by Schonfeld et al. (2015) found that the probability of achieving academic proficiency status in reading increased 1.37 times for each additional lesson taught), and is also discordant with the developers’ recommendations, particularly in relation to dosage. Another explanation is that there was not enough variability in our dataset, particularly for procedural fidelity, to detect its effects. However, scrutiny of the descriptive data for the implementation indicators (Table 1) does not support this interpretation, as the two procedural fidelity indicators actually yielded higher standard deviations (indicative of greater variability) than the quality and responsiveness indicators.

Conclusion

The current study adds to the growing body of literature exploring the relationship between implementation and intervention outcomes in school-based SEL. We provide distinct contributions in terms of the adoption of a more wide-ranging approach to the assessment of implementation than has previously been evident, increased objectivity and rigor afforded by the use of independent observational methods, and the application of a theoretical framework for implementation that posits quality and fidelity as distinct dimensions. Our analyses provide support for the integrated model of implementation and suggest that quality and responsiveness are at least as critical as fidelity and dosage, if not more so, in determining the achievement of expected outcomes. Put another way, the current study reinforces Durlak’s (2010) clarion call for the advancement of implementation science, in which he noted “the importance of doing well in whatever you do” (p.348).