Adolescence is a time of intensive changes to youth’s brain structure and function, cognitive capacities, and socioemotional processing (Crone & Dahl, 2012; Gogtay et al., 2004; Nelson et al., 2005). Neural maturation in adolescence is evidenced by regional reductions in gray matter and global increases in white matter (Blakemore & Choudhury, 2006; Paus et al., 2011), enhanced integrity of white matter tracts (Mohammad & Nashaat, 2017; Schmithorst et al., 2002), and heightened functional network formation (Cohen Kadosh et al., 2010; Fair et al., 2009). Alongside this pattern of neural development, marked changes in social behaviour also emerge: adolescents are increasingly oriented towards their peers (Larson et al., 1996; Nelson et al., 2005) and begin to develop complex and nuanced social relationships (Furman & Buhrmester, 1992; Güroğlu, van den Bos, & Crone, 2014). Importantly, both the neural and behavioural maturation that occur during adolescence are thought to be shaped by individuals’ unique experiences with their social environment (Nelson, 2017; Tottenham, 2014). Therefore, adolescence is often considered a sensitive period for the development of social cognitive functions (Blakemore & Mills, 2014; Crone & Dahl, 2012).

Maturational changes within the “social brain network” have been linked to the simultaneous increase in adolescents’ social behaviour (Nelson, Jarcho, & Guyer, 2016) and social cognition skills (Blakemore, 2008, 2012; Burnett et al., 2011; Kilford, Garrett, & Blakemore, 2016), including emotion recognition (ER). Emotion recognition, or the ability to recognize others’ emotions based on nonverbal cues (e.g., facial expressions, gestures and postures, tone of voice), is essential to social competence (Halberstadt, Denham, & Dunsmore, 2001). This skill matures throughout adolescence, presumably bolstered by increasingly sophisticated cognitive abilities and the experience-driven growth of neural networks (Nelson, 2017; Nelson et al., 2016). For instance, age-related increases in white matter and neural activity in face processing areas of the brain have been associated with greater accuracy in recognizing facial expressions of emotion (facial ER) in 7- to 37-year-olds (Cohen Kadosh et al., 2012).

To date, behavioural and brain-based research has primarily assessed ER development using facial expressions of emotion (Blakemore, 2008; Kilford et al., 2016), which youth can reliably interpret by early adolescence (Herba & Phillips, 2004; Kolb, Wilson, & Taylor, 1992). However, less is known about the mechanisms supporting the development of emotion recognition in other modalities, such as a speaker’s tone of voice (vocal ER). Beyond the content of speech, the acoustic characteristics of vocal prosody, including the pitch, intensity levels, and temporal aspects of the voice, combine to convey important information about a speaker’s emotional state or social attitudes (Banse & Scherer, 1996; Mitchell & Ross, 2013). The emotional content of prosody also necessarily unfolds over time (Liebenthal et al., 2016), which requires extensive executive functioning and processing skills to track and decode adequately (Schirmer, 2017). Perhaps relatedly, vocal ER skills have been found to follow a protracted developmental trajectory (Morningstar et al., 2019; Morningstar, Ly, Feldman, & Dirks, 2018a). Indeed, although children start to correctly label basic emotions in a speaker’s voice between 6 and 9 years old (Matsumoto & Kishimoto, 1983), there is evidence for continued maturation of vocal ER through childhood (Allgood & Heaton, 2015; Doherty et al., 1999; Sauter, Panattoni, & Happé, 2013; Tonks et al., 2007). Furthermore, a handful of studies have found that adult listeners outperform 11- and 12-year-olds (Brosgole & Weisman, 1995; Chronaki et al., 2015), or even 13- to 15-year-olds (Chronaki et al., 2018; Morningstar et al., 2018a), in vocal ER tasks, indicating that this skill continues to develop at least through mid-adolescence (see Morningstar et al., 2018b for a review). However, the neural mechanisms supporting this ongoing maturation remain unknown.

Prior work with adult listeners has established a network of temporal and frontal areas involved in the perception and interpretation of vocal affective prosody (Schirmer & Kotz, 2006; Wildgruber et al., 2006). The brain model for the extraction of emotional content from prosodic cues involves the integration within the superior temporal sulcus and gyrus (STS, STG) of information from the primary auditory cortex in Heschl’s gyrus (A1) and the “temporal voice area” (TVA; Belin, Zatorre, & Ahad, 2002; Belin et al., 2000; Ethofer et al., 2006b; Ethofer et al., 2012; Wiethoff et al., 2008) with subcortical structures such as the amygdala and striatum (Bach et al., 2008; Ethofer et al., 2009b). The temporal areas then project to the dorso-lateral prefrontal cortex (dlPFC; Ethofer et al., 2006a), which is implicated in the explicit interpretation of emotional intent in vocal prosody (Adolphs, Damasio, & Tranel, 2002; Alba-Ferrara et al., 2011; Wildgruber et al., 2006). However, despite extensive characterization of the neural networks involved in the processing of vocal prosody in adults, little is known about how these brain systems develop in youth.

The current study examined developmental influences on neural activation during the processing of vocal emotional prosody. We recruited youth aged 8 to 19 years old to complete a vocal ER task while undergoing functional magnetic resonance imaging (fMRI). Our goals were to a) describe the neural correlates of processing affective prosody in youth, and b) investigate age-related changes in brain activation and neural networks during the ER task. Based on existing work with adults (Ethofer et al., 2012; Wildgruber et al., 2006), we expected that youth would recruit primary auditory processing areas in the temporal lobe (A1 and the TVA), as well as frontal regions, such as the dlPFC and inferior frontal gyrus (IFG), when tasked with attributing emotional intent to heard affective prosody. Although the TVA shows a more specialized and focalized response to nonemotional voices in adults compared with children and adolescents (Bonte et al., 2013; Bonte et al., 2016), we hypothesized that age-related changes in activation during the ER task would be primarily evident in the frontal regions responsible for the cognitive labelling of emotion, rather than in primary or secondary sensory areas (Casey et al., 2005). Furthermore, we posited that frontal regions implicated in the vocal ER task would be increasingly connected with temporal regions with age, both in terms of their functional connectivity and of the increased efficiency of the superior longitudinal fasciculus (SLF), a white matter tract linking these regions.

Method

Participants

Forty-one youth (26 females) aged 8 to 19 years old (M = 14.00, SD = 3.38) participated in the study. Participants were recruited via responses to an email advertisement distributed to employees of a large children’s hospital. Exclusion criteria included the presence of devices or conditions contraindicated for MRI (e.g., braces or a pacemaker; assessed using a metal screening form), gross cognitive impairments, or developmental disorders (e.g., Turner’s syndrome, autism). One participant did not complete the scanner portion of the study. Participants’ scaled scores on the matrix reasoning and vocabulary subtests of the Wechsler Intelligence Scale for Children (WISC)/Wechsler Adult Intelligence Scale (WAIS) were, on average, in the average range (individual scores ranged from low average to very superior). Self-reported ethnicity indicated that 68% of the sample was Caucasian, 17% Black or African American, and 15% multiracial or other ethnicities. All participants spoke English fluently as their dominant language. The Edinburgh Handedness Inventory (Oldfield, 1971) indicated that 88% of participants were right-handed, 2% were left-handed, and 10% reported no preference. The distribution of age across the sample was symmetric (skewness = −0.17), with an approximately equal number of participants at each year of age (kurtosis = −1.37). Written parental consent and written participant assent or consent were obtained before the study.

Stimuli and task

Audio stimuli were selected from a set of recordings produced by three 13-year-old community-based actors (2 females), generated with the aid of emotional vignettes (Morningstar, Dirks, & Huang, 2017). Actors spoke the same five sentences (e.g., “Why did you do that?”; “I didn’t know about it”) in five emotional tones of voice: anger, fear, happiness, sadness, and neutral. Recordings retained for the current study were chosen based on adult listeners’ ratings of their recognisability and authenticity (Morningstar et al., 2018a). Each actor contributed 25 recordings to the pool of stimuli (5 sentences x 5 emotional categories), for a total of 75 recordings. Recordings varied in duration from 0.89 to 2.03 s (M = 1.34 s).

Before MRI acquisition, youth completed a practice task in a mock scanner, using audio clips of exaggerated vocalizations. Participants then completed a forced-choice emotion recognition task in the MRI scanner. Each trial consisted of stimulus presentation followed by a 5-second response period, during which participants selected the speaker’s intended emotion from five labels (anger, fear, happiness, sadness, neutral). Although pairing a response requirement with stimulus delivery complicates the interpretation of activation patterns, we opted for this design (rather than passive listening) because engaging participants in emotion categorization during stimulus delivery best captured the processes we aimed to model. Stimuli were delivered via pneumatic earbuds, and responses were recorded with Lumina handheld response devices inside the scanner. Stimuli were presented in an event-related design with a jittered inter-trial interval of 1 to 8 seconds (mean 4.5 s). A monitor at the head of the magnet bore was visible to participants via a mirror mounted on the head coil: a fixation cross was shown on the screen during the inter-trial interval and the auditory stimulus, and a pictogram of response labels was shown during the rating period. The ER task was completed in three runs of approximately 6 minutes each (25 recordings per run), presented in random order. Each run contained a pseudorandomized order of recordings from all three speakers, with a balanced number of recordings for each emotion and sentence.
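To make the trial structure concrete, the sketch below (Python) generates one run’s pseudorandomized, balanced trial order with a uniform 1- to 8-s jitter (mean 4.5 s). The exact randomization constraints used in the study (e.g., how speakers were balanced across runs) are not specified here, so this is a minimal illustrative sketch rather than the actual task code.

```python
import random

EMOTIONS = ["anger", "fear", "happiness", "sadness", "neutral"]
N_SENTENCES = 5  # five sentence frames spoken in each emotional tone

def build_run(seed: int) -> list[dict]:
    """Sketch of one 25-trial run: balanced across the 5 emotions and
    5 sentences, shuffled, with a jittered inter-trial interval of
    1-8 s (uniform jitter; mean = 4.5 s, matching the reported design)."""
    rng = random.Random(seed)
    trials = [{"emotion": e, "sentence": s}
              for e in EMOTIONS for s in range(N_SENTENCES)]
    rng.shuffle(trials)  # pseudorandomized presentation order
    for trial in trials:
        trial["iti_s"] = rng.uniform(1.0, 8.0)
    return trials
```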

Image acquisition

Due to an equipment upgrade, MRI data were acquired on two Siemens 3 Tesla scanners running identical software, using standard 32-channel and 64-channel head coil arrays. The imaging protocol included three-plane localizer scout images and an isotropic 3D T1-weighted anatomical scan covering the whole brain (MPRAGE). Typical imaging parameters for the MPRAGE were: 1-mm pixel dimensions, 176 sagittal slices, repetition time (TR) = 2200-2300 ms, echo time (TE) = 2.45-2.98 ms, field of view (FOV) = 248-256 mm. Subsequently, functional MRI and 64-direction diffusion tensor imaging (DTI) data were acquired with echo planar imaging (EPI) acquisitions, with a voxel size of 2.5 x 2.5 x 3.5-4 mm. For fMRI scans, dummy data were collected for 9.2 s while a blank screen was presented to participants. Imaging parameters were: TR = 1,500 ms, TE = 30-43 ms, FOV = 240 mm. For DTI scans, b = 0 data were acquired with the phase-encoding axis oriented in both anterior-posterior and posterior-anterior directions, allowing subsequent post-processing steps to correct for eddy currents and geometric distortion effects. DTI parameters were: TR = 1,900-2,280 ms, TE = 62-84.2 ms, FOV = 240 mm.

Image processing

EPI images were preprocessed and analyzed in AFNI, version 18.0.11 (Cox, 1996). Functional images were motion-corrected to the first volume, realigned to the AC/PC line, and coregistered to the T1 anatomical image. The resulting image was then normalized nonlinearly to the Talairach template. After normalization, the data were spatially smoothed with a Gaussian filter (6-mm FWHM kernel). Within each functional run, voxel-wise signal was scaled to a mean value of 100, and signal values above 200 were censored. Volumes in which 10% or more of voxels were flagged as signal outliers, or which showed more than 1 mm of movement relative to the preceding volume, were censored before analyses.
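AFNI’s standard tools implement the censoring described above; the Python sketch below only illustrates the logic, under the simplifying assumption that motion is summarized as the Euclidean norm of the three translation parameters (the study’s exact motion metric is not specified).

```python
import numpy as np

def volume_keep_mask(motion_params: np.ndarray, outlier_fraction: np.ndarray,
                     mm_thresh: float = 1.0, frac_thresh: float = 0.10) -> np.ndarray:
    """Return a boolean mask of volumes to retain.

    motion_params: (T, 6) rigid-body estimates per volume; only the first
        three (translations, in mm) are used in this simplified sketch.
    outlier_fraction: (T,) fraction of voxels flagged as signal outliers
        per volume (e.g., as produced by a tool like AFNI's 3dToutcount).
    """
    translations = motion_params[:, :3]
    step = np.zeros(len(motion_params))
    step[1:] = np.linalg.norm(np.diff(translations, axis=0), axis=1)
    # Censor volumes moving > 1 mm from the previous volume, or with
    # >= 10% of voxels deemed signal outliers.
    return (step <= mm_thresh) & (outlier_fraction < frac_thresh)
```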

Analysis

Behavioural accuracy

Based on signal detection statistics (Pollak, Cicchetti, Hornung, & Reed, 2000), participants’ hit rates (HR; correct responses) and false alarms (FA; incorrect responses) on the ER task were combined into an estimate of sensitivity (Pr = HR − FA) for each emotion category. Pr is similar to d′ (i.e., z(HR) − z(FA)) but is more appropriate when subjects’ recognition accuracy is low (Snodgrass & Corwin, 1988), as is often the case in vocal ER tasks (e.g., an average of 60% for standardized voice samples produced by adult actors, Johnstone & Scherer, 2000, and 50% for youth listeners interpreting youth-produced vocal affect, Morningstar et al., 2018a). Pr ranges from −1 to 1: a positive value represents more correct responses than errors (HR > FA), and a negative value represents more errors than correct responses (FA > HR). Responses made within 150 ms of the onset of the rating period were censored from analyses as physiologically implausible. One participant’s behavioural data were unavailable due to equipment error. A generalized linear model examined the effects of Emotion (within-subject, 5 levels: anger, fear, happiness, sadness, neutral) and Age in years (between-subject, continuous variable of interest) on Pr, with Gender (between-subject, 2 levels) as a control variable.
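For concreteness, the sketch below shows the Pr computation from trial-level responses; the column names and data layout are hypothetical, and the study’s analyses were not necessarily implemented this way.

```python
import pandas as pd

EMOTIONS = ["anger", "fear", "happiness", "sadness", "neutral"]

def pr_scores(trials: pd.DataFrame) -> pd.DataFrame:
    """Compute Pr = HR - FA per subject and emotion (Snodgrass & Corwin, 1988).

    Expects one row per trial with hypothetical columns:
    "subject", "emotion" (intended emotion), "response" (chosen label).
    """
    rows = []
    for (subj, emo), grp in trials.groupby(["subject", "emotion"]):
        hr = (grp["response"] == emo).mean()  # hits on this emotion's trials
        others = trials[(trials["subject"] == subj) & (trials["emotion"] != emo)]
        fa = (others["response"] == emo).mean()  # this label chosen in error
        rows.append({"subject": subj, "emotion": emo, "Pr": hr - fa})
    return pd.DataFrame(rows)
```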

fMRI analysis

Event-related response amplitudes were first estimated at the subject level. We included regressors for the presentation of the auditory stimulus (amplitude-modulated by stimulus duration), convolved with the hemodynamic response function, and contrasted stimuli against an implicit baseline (all nonstimulus periods). A regressor for stimulus emotion category (5 levels) and nuisance regressors for motion (6 affine directions) and scanner drift (3rd-order polynomial) were also included at the subject level. For group-level analyses, the contrast images produced for each participant were fit to a multivariate model (3dMVM in AFNI; Chen, Adleman, Saad, Leibenluft, & Cox, 2014) of the effects of Emotion category (within-subject, 5 levels) and mean-centered Age in years (between-subject, continuous) on whole-brain activation, with Gender (between-subject, 2 levels) as a control variable. Within this model, we computed a t-statistic from a general linear test of the effect of task on activation, as well as F-statistics for the main effects of Emotion and Age and for the Emotion x Age interaction. To correct for multiple comparisons in all fMRI analyses, we combined a conservative voxel-wise threshold (p < 0.001) with a cluster-size threshold generated using the spatial autocorrelation function of 3dClustSim, based on Monte Carlo simulations with study-specific smoothness estimates (Cox, Reynolds, & Taylor, 2016), two-sided thresholding, and first-nearest-neighbor clustering, at α = 0.05 and p < 0.001. This procedure yielded a minimum cluster size of 26 voxels, which was applied to all model results; this is a well-validated approach that has been widely used in the neuroimaging literature (Cox et al., 2017; Kessler, Angstadt, & Sripada, 2017). As such, results presented below are clusters of activation larger than 26 contiguous voxels at p < 0.001. Regions were identified at their center of mass using the Talairach-Tournoux atlas. Mean activation in the voxels within defined clusters was extracted for follow-up analyses relating neural activation to age or performance.
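For readers less familiar with cluster-extent thresholding, the Python sketch below illustrates the two-step rule (voxel-wise p < 0.001 plus a 26-voxel extent with first-nearest-neighbor connectivity). It is a conceptual illustration, not the AFNI pipeline itself, and for brevity it handles one sign of effect; a two-sided analysis would threshold positive and negative effects separately.

```python
import numpy as np
from scipy import ndimage

def apply_cluster_threshold(stat_map: np.ndarray, p_map: np.ndarray,
                            p_thresh: float = 0.001, k_min: int = 26) -> np.ndarray:
    """Zero out statistic values outside clusters of >= k_min contiguous
    suprathreshold voxels (faces-touching connectivity, i.e., AFNI's NN1)."""
    supra = p_map < p_thresh
    nn1 = ndimage.generate_binary_structure(3, 1)  # 6-connectivity
    labels, n_clusters = ndimage.label(supra, structure=nn1)
    sizes = ndimage.sum(supra, labels, index=range(1, n_clusters + 1))
    surviving = [i + 1 for i, size in enumerate(sizes) if size >= k_min]
    return np.where(np.isin(labels, surviving), stat_map, 0.0)
```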

DTI analysis

All DTI data were analyzed offline using the Diffusion Toolbox within FSL 6.0 (http://www.fmrib.ox.ac.uk/fsl; Smith et al., 2004). The standard TOPUP, EDDY, and DTIFIT routines in the toolbox were used to correct image distortions and reconstruct diffusion tensor and fractional anisotropy (FA) maps. Voxelwise statistical analysis of the FA data was performed using tract-based spatial statistics (TBSS; Smith et al., 2006). FA images were first registered to Montreal Neurological Institute (MNI) template space using linear registration and the FLIRT routine (Jenkinson, Bannister, Brady, & Smith, 2002; Jenkinson & Smith, 2001); the registered FA images were then averaged to create a mean FA image, from which a tract skeleton was generated. After the TBSS-based FA skeleton was created, masks were generated for the left and right superior longitudinal fasciculi using the Jülich histological atlas (Eickhoff, Heim, Zilles, & Amunts, 2006; Eickhoff et al., 2005). FA values for both tracts were calculated using these masks. Data for two participants were missing due to scanner acquisition error.
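As an illustration of the final step, the sketch below computes mean FA over skeleton voxels within a tract mask using nibabel; file names are hypothetical, and FSL’s own utilities (e.g., fslstats) accomplish the same thing in practice.

```python
import numpy as np
import nibabel as nib

def mean_fa_in_tract(fa_skeleton_file: str, tract_mask_file: str) -> float:
    """Mean FA over nonzero skeleton voxels falling within a binary tract mask
    (e.g., a left or right SLF mask derived from the Juelich atlas)."""
    fa = nib.load(fa_skeleton_file).get_fdata()
    mask = nib.load(tract_mask_file).get_fdata() > 0
    return float(fa[mask & (fa > 0)].mean())

# Hypothetical usage:
# mean_fa_in_tract("mean_FA_skeletonised.nii.gz", "juelich_slf_left_mask.nii.gz")
```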

Results

Behavioural accuracy

The average Pr was 0.26 (Table 1), with average HR = 0.40 and average FA = 0.14. There was a significant effect of Emotion on Pr, F(4, 144) = 2.91, p = 0.02, η2 = 0.08. Post-hoc pairwise comparisons with Šidák corrections revealed that anger was the best recognized emotion, followed by sadness and neutral (which did not differ from one another, p > 0.05), and happiness and fear (which did not differ from one another, p > 0.05; unless otherwise specified, all emotions differed significantly from one another, all ps < 0.05). There also was a main effect of the continuous variable of Age, F(1, 36) = 4.34, p = 0.04, η2 = 0.11, such that older age was associated with greater accuracy (Figure 1). Lastly, there was a main effect of Gender, F(1, 36) = 4.57, p = 0.04, η2 = 0.11, whereby females were more accurate than males. There were no interactions between Emotion and Age or Gender (ps > 0.05).

Table 1 Mean sensitivity (Pr) for all participants
Fig. 1

Association between age (in years) and vocal ER accuracy (Pr). Female participants are identified with blue circles; male participants are identified with green triangles. The black line represents the linear relationship between the two variables (R2 = 0.128).

Neuroimaging data

Activation related to task

We examined the general linear test for task activation compared to baseline (Table 2; Figure 2). Activation was noted along the length of the right and left STG, extending into the IFG. Other clusters were found in the frontal lobe, including at midline in the medial frontal gyrus and bilaterally in the precentral gyrus (which may reflect motor activity associated with task response), as well as in the occipital lobe (cuneus and lingual gyrus) and subcortical structures (thalamus and caudate). Deactivation during the task also was noted bilaterally in the inferior parietal lobule, parahippocampal gyrus, middle temporal gyrus, and postcentral gyrus (Table 2).

Table 2 Activation during stimulus presentation compared with baseline
Fig. 2

Activation associated with presentation of auditory stimuli. Red areas denote increased activation during the task compared to baseline; blue denotes deactivation during the task compared to baseline. Clusters were formed using 3dClustSim at p < 0.001 (corrected, with a cluster size threshold of 26 voxels). Refer to Table 2 for description of regions of activation.

Effect of emotion type and age on activation during task

A main effect of Emotion was noted in the postcentral gyrus. There was a main effect of Age (Table 3; Figure 3, first column) in the bilateral superior frontal gyrus at midline (B-SFG), the right middle frontal gyrus (R-MFG), the left middle frontal gyrus (L-MFG), the left precentral gyrus (L-PCG), and the left inferior frontal gyrus (L-IFG). Regression analyses revealed that age linearly predicted increased mean activation in each of the five clusters (Figure 3, second column; B-SFG: t(37) = 5.71, β = 0.68, p < 0.001; R-MFG: t(37) = 5.95, β = 0.70, p < 0.001; L-MFG: t(37) = 5.80, β = 0.69, p < 0.001; L-PCG: t(37) = 5.20, β = 0.64, p < 0.001; L-IFG: t(37) = 5.63, β = 0.67, p < 0.001).

Table 3 Effect of age on activation during stimulus presentation compared with baseline
Fig. 3

Age-related changes in activation. Clusters were formed using 3dClustSim at p < 0.001 (corrected, with a cluster size threshold of 26 voxels). The first column illustrates five clusters of age-related activation (see Table 3 for description of each cluster). The second column contains scatterplots of the association between activation in each cluster and age. The third column contains scatterplots of the association between activation in each cluster and task performance (Pr, or sensitivity). On each scatterplot, R2 indicates the amount of variance in activation explained by either age or Pr; the significance of the association between both variables is noted as * p < 0.05, *** p < 0.001. Brain images are rendered in the Talairach-Tournoux template space. L = left, R = right. SFG = superior frontal gyrus, MFG = middle frontal gyrus, PCG = precentral gyrus, IFG = inferior frontal gyrus.

We conducted additional regression analyses to determine whether mean activation in these age-related clusters was associated with task performance (Figure 3, third column). Greater activation in the B-SFG and L-PCG significantly predicted greater accuracy (Pr), t(37) = 2.12, β = 0.33, p = 0.04 and t(37) = 2.12, β = 0.33, p = 0.04, respectively. Activation in the other clusters was also positively related to Pr, although these associations did not reach significance (R-MFG: t(37) = 1.74, β = 0.27, p = 0.09; L-MFG: t(37) = 1.54, β = 0.25, p = 0.13; L-IFG: t(37) = 1.61, β = 0.26, p = 0.12).
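The follow-up regressions reported above are simple linear models relating each cluster’s mean activation to age or to task performance; a minimal statsmodels sketch is shown below (one row per participant; the column names are hypothetical, and covariates such as gender are omitted for brevity).

```python
import pandas as pd
import statsmodels.formula.api as smf

def cluster_followups(df: pd.DataFrame):
    """df: one row per participant with hypothetical columns
    "activation" (mean beta in a cluster), "age" (years), and "pr"."""
    age_model = smf.ols("activation ~ age", data=df).fit()  # age -> activation
    perf_model = smf.ols("pr ~ activation", data=df).fit()  # activation -> Pr
    return age_model.summary(), perf_model.summary()
```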

Functional connectivity related to age

We conducted follow-up generalized psychophysiological interaction (gPPI) analyses (McLaren, Ries, Xu, & Johnson, 2012) to probe the functional connectivity of each age-related cluster detailed above. We first fit the same subject-level model to activation within those five regions of interest (B-SFG, R-MFG, L-MFG, L-PCG, L-IFG). We then performed a group-level model examining the effect of Age (in years, continuous variable of interest) on functional connectivity with each of those seeds. Identical cluster-size threshold corrections were applied as above (i.e., p < 0.001, corrected).
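The essence of a gPPI model is one interaction regressor per task condition, formed from the seed time series and that condition’s regressor; the sketch below shows this simplified logic. It omits steps included in McLaren et al.’s (2012) approach and in standard implementations, notably deconvolving the seed signal to the neural level and re-convolving the products with the hemodynamic response function.

```python
import numpy as np

def gppi_design(seed_ts: np.ndarray, condition_regressors: dict) -> dict:
    """Build simplified gPPI regressors: the physiological term (seed time
    series), each psychological term (condition regressor), and one
    psychophysiological interaction term per condition."""
    seed = seed_ts - seed_ts.mean()  # mean-centre the seed time series
    design = {"seed": seed}
    for name, regressor in condition_regressors.items():
        design[name] = regressor                  # psychological term
        design[f"ppi_{name}"] = seed * regressor  # interaction term
    return design
```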

Age was positively associated with greater functional connectivity between the seed in the L-PCG and areas in the right insula (R-I), left insula (L-I), and left inferior parietal lobule/supramarginal gyrus, nearing the temporal-parietal junction (L-TPJ). Age was also associated with increased connectivity between the seed in the L-IFG and an area in the right inferior parietal lobule/supramarginal gyrus (R-TPJ; see Table 4 and Figure 4, columns 1-3). Follow-up regression analyses (Figure 4, column 4) indicated that greater connectivity between the seed and most target regions was itself related to increased accuracy (Pr) on the ER task (R-I: t(37) = 2.42, β = 0.37, p = 0.02; L-I: t(37) = 2.52, β = 0.38, p = 0.02; R-TPJ: t(37) = 2.53, β = 0.38, p = 0.02; L-TPJ: p = 0.11). Thus, in addition to age-related increases in frontal activation during the vocal ER task, the strength of connections between these frontal regions and both the insula and temporal-parietal junction was related to age and task performance.

Table 4 Generalized psychophysiological interaction analyses on functional connectivity with clusters of age-related activation
Fig. 4

Age-related changes in functional connectivity. Generalized psychophysiological interactions were computed by placing a seed in each of five clusters that showed age-related increases in activation (first column; see Figure 3 and Table 3). The second column represents clusters for which there was an effect of Age on functional connectivity, for each seed region. Clusters were formed using 3dClustSim at p < 0.001 (corrected, with a cluster size threshold of 26 voxels). The third column contains scatterplots of the association between seed-target connectivity and age. The fourth column contains scatterplots of the association between seed-target connectivity and task performance (Pr, or sensitivity). On each scatterplot, R2 indicates the amount of variance in connectivity explained by either age or Pr; the significance of the association between both variables is noted as * p < 0.05, ** p < 0.01. Brain images are rendered in the Talairach-Tournoux template space. L = left, R = right. I = insula, TPJ = temporal-parietal junction.

DTI

Age was associated with greater FA in both the left and right SLF (Figure 5), t(37) = 2.55, β = 0.39, p = 0.02, and t(37) = 2.67, β = 0.40, p = 0.01, respectively. FA in these tracts was positively, but not significantly, associated with task performance (ps > 0.29).

Fig. 5

Association between age and fractional anisotropy (FA) in the superior longitudinal fasciculi (SLF). Image represents the mean FA skeleton for the left (green) and right (blue) SLF across all subjects, overlaid on the 1- x 1- x 1-mm Montreal Neurological Institute template. The second column contains scatterplots representing the association between age and FA in the left (top) and right (bottom) SLF. The third column contains scatterplots representing the association between task performance (Pr) and FA. R2 indicates the amount of variance in FA that is explained by age; the significance of the association between both variables is noted as * p < 0.05.

Discussion

The current study’s goals were to describe the neural correlates of vocal ER in youth and to examine age-related changes in neural activation during this social cognitive task. Better performance on the ER task was associated with older age. The task of attributing emotional intent to vocal stimuli activated temporal and frontal areas similar to those noted in work with adult listeners. We also found age-related increases in activation in several focal regions, primarily in the frontal lobe. Direct comparison of performance and brain data suggests that improvement in ER may be related to age-related increases in a) activation of frontal regions, and b) functional and tract-based connectivity between frontal areas, the insula, and the temporal-parietal junction.

Emotion recognition performance

Youth’s vocal ER ability was positively associated with age: specifically, older youth were more sensitive to distinctions among emotions (Pr) than younger youth across all emotions. Accuracy was generally poorer in this sample than is typically observed in similar developmental studies outside the scanner (i.e., equivalent to 40%, compared with the 50% observed when youth listeners identified the same emotions portrayed by youth speakers; Morningstar et al., 2018a), although it remained above chance level (20%). The combination of the scanner environment, the logistics of button-press responding, and the inclusion of a “neutral” category (Frank & Stennett, 2001) may have contributed to reduced accuracy. However, the pattern of Pr across emotion categories is nearly identical to that noted in prior work (e.g., happiness is poorly recognized, whereas anger and sadness are well recognized; Johnstone & Scherer, 2000; Morningstar et al., 2018a), and the positive association between performance and age is consistent with previous research noting maturation of vocal ER skills across adolescence (Chronaki et al., 2015; Morningstar et al., 2019).

Neural correlates of vocal ER in youth

Across all participants, the ER task generated strong and widespread activation in brain networks previously implicated in auditory and affective processing, particularly in frontal and temporal areas. Bilateral activation was noted along the length of the STG, in the IFG near the pars opercularis, and the medial frontal gyrus. These findings are consistent with previous work on adults’ neural representation of vocal prosody (Schirmer, Kotz, & Friederici, 2002; Wildgruber et al., 2006), suggesting that the current model of prosody processing is broadly applicable to children and adolescents as well. It also is noteworthy that the pattern of activation observed during the ER task occurs at least partially within areas of the “social brain” implicated in social cognition and emotional processing, including the dlPFC and IFG (Prochnow et al., 2013; Wilson-Mendenhall, Barrett, & Barsalou, 2013), and superior temporal regions (Redcay, 2008). Although activation in some areas of the STG (the auditory cortex and TVA) may be specific to vocal processing, our findings suggest that the processes involved in vocal ER may overlap with those recruited to interpret other types of social and emotional signals, such as facial expressions of emotion (Yovel & Belin, 2013).

Age-related changes in neural activation during vocal ER

There was a positive association between age and activation in the superior frontal gyrus (SFG) at midline, the bilateral MFG, the left precentral gyrus, and the left IFG. As hypothesized, these age-related changes in activation were primarily evident in the prefrontal cortex, which is thought to mature in structure and function later in development than lower-order sensory cortices (Gogtay et al., 2004), such as the auditory cortex or temporal voice areas in the STG. Activation in the highlighted frontal areas may be related to functions that also continue to develop during adolescence, such as the top-down processing of emotion, language, and social cues. For instance, the dorsal SFG has been implicated in the regulation and reappraisal of emotion in adults (Buhle et al., 2014; Li et al., 2018). Furthermore, the MFG (BA9) has been shown to play an important role in understanding affective and linguistic prosody (meta-analysis by Belyk & Brown, 2013). The left IFG has similarly been implicated in phonological, semantic, and syntactic processing (Heim, Opitz, Müller, & Friederici, 2003; Vigneau et al., 2006) and in the explicit evaluation of vocal affect (Alba-Ferrara et al., 2011; Bestelmeyer et al., 2014; Fruhholz, Ceravolo, & Grandjean, 2012), and is suspected to play a role in the interpretation of dynamic, temporal information (Frühholz & Grandjean, 2013; Schirmer, 2017). Outside of vocal processing, the right MFG and left precentral gyrus have been implicated in the decoding of nonverbal cues in facial expressions (Cohen Kadosh et al., 2012). Given that activation in the SFG and left precentral gyrus, and to some extent in the right MFG, predicted increased performance on the ER task, it is possible that increased activation in these prefrontal regions supports the development of vocal emotion recognition skills with age.

The age-related changes in activation within these areas may also reflect increased efficiency and sensitivity in older adolescents compared with younger participants. Age was more robustly related to average activation across these regions than to activation at the peak of each cluster, suggesting that younger participants may have shown a more widespread but lower-magnitude response in these areas than older participants. This interpretation would be consistent with theories of neurodevelopment and social cognition, such as the Interactive Specialization model (Johnson, Grossmann, & Cohen Kadosh, 2009), which posit specialization of function with age. Such developmental processes may be occurring in the identified prefrontal regions showing age-related differences in response to the vocal emotion recognition task, though longitudinal studies are needed to robustly assess such narrowing of function (Brown, Petersen, & Schlaggar, 2006; Durston et al., 2006).

Furthermore, the functional connectivity between the left precentral and inferior frontal gyri and the bilateral insula and temporal-parietal junction (TPJ) was positively associated with age. Diffusion tensor imaging results corroborated these findings: age was related to greater fractional anisotropy in the superior longitudinal fasciculi (SLF), white matter tracts connecting frontal and opercular regions to the parietal and temporal lobes (Taki et al., 2013). A network connecting these functionally similar areas has been implicated in the processing of vocal emotional prosody, whereby the TVA in the STG (Wildgruber et al., 2006) projects bilaterally to the IFG near the frontal operculum, and connects ipsilaterally to the inferior parietal lobule at the level of the TPJ (Ethofer et al., 2006a; Ethofer et al., 2012). The TPJ itself is considered a core area for the decoding of complex social cues (Blakemore & Mills, 2014; Redcay, 2008) and for theory of mind or “mentalizing” skills (Mahy, Moses, & Pfeifer, 2014; Saxe & Wexler, 2005), and the insula is implicated in the evaluation of emotional salience (Phillips, Drevets, Rauch, & Lane, 2003). Thus, our findings suggest that, with age, there is an emergent network between frontal and temporal-parietal regions involved in processing linguistic and affective cues (Grecucci, Giorgetta, Bonini, & Sanfey, 2013).

Stronger functional connectivity between these areas may facilitate age-related improvements in vocal ER, a skill that requires the processing of both linguistic and emotional information. Indeed, greater connectivity amongst these regions was related to greater accuracy in the vocal ER task. The specialization of pathways and increasing connectivity in networks across development is thought to permit improvement in behavioural performance during a variety of social cognition tasks (Johnson, Grossmann, & Cohen Kadosh, 2009), such as the perception of faces, the detection of biological motion, and mentalizing (Klapwijk et al., 2013). An important contribution of our study is the addition of vocal prosody to this list of social cognitive functions that mature across adolescence and rely on fronto-posterior parietal networks.

Strengths and limitations

The current results describe a neural network that may be involved in the maturation of vocal ER skills in youth. However, since we opted to contrast activation during the ER task to a baseline containing no vocal information, we cannot conclude that our findings pertain specifically to vocal emotion rather than simply auditory processing. We chose not to contrast activation in response to emotional voices to that elicited by neutral voices, given that emotionally “neutral” stimuli also contain social information that must be decoded in the same way as other “basic” emotions and could even be perceived as negative rather than truly neutral (Lee et al., 2008). Although subregions of the right STG and IFG activate more to emotional than neutral prosody during the implicit processing of vocal emotional prosody (Fruhholz et al., 2012), we were interested in understanding the processes at play in the explicit decoding of vocal affect, rather than detecting inter-emotion differences in neural activation patterns. Despite this limitation, our findings are highly consistent with prior work and patterns of activation related specifically to listeners’ task performance, suggesting that the activated regions are indeed likely to play an important functional role in vocal ER. Future work should aim to develop an adequate control for vocal prosody that retains the cognitive requirements associated with the explicit recognition of emotional intent to confirm the specificity of the current results to the detection of vocal affect.

Of note, we did not find evidence of emotion-specific patterns of activation in this study, beyond activity in the postcentral gyrus that likely reflected motor preparation for response. Existing work with adults suggests that emotion-specific patterns of activation should be present in the temporal voice area and/or inferior frontal gyrus (Buchanan et al., 2000; Ethofer et al., 2012; Ethofer et al., 2009b; Fruhholz et al., 2012; Grandjean et al., 2005; Johnstone, van Reekum, Oakes, & Davidson, 2006; Kotz et al., 2013). However, whether or not discrete emotions generate unique neural signatures is a topic of considerable debate (Hamann, 2012; Lindquist & Feldman Barrett, 2012). Studies that find such patterns of activation typically compare a limited number of emotion categories (e.g., happy vs. angry, sad vs. happy, or angry vs. neutral; Buchanan et al., 2000; Ethofer et al., 2009a; Fruhholz et al., 2012; Grandjean et al., 2005; Johnstone et al., 2006; Mitchell et al., 2003). Moreover, prior work that does include a greater number of emotion categories suggests that emotion-specific activations may be difficult to test empirically with functional MRI analytical approaches (Ethofer et al., 2009b; Kotz et al., 2013). Alternatively, it is possible that youth show less functional specialization in the processing of different vocal emotion patterns than do adults; though this hypothesis is strictly speculative, it would be in line with theoretical predictions of the Interactive Specialization model (Johnson et al., 2009) and previous findings regarding the development of nonemotional voice processing in the superior temporal cortex (Bonte et al., 2013; Bonte et al., 2016). Yet another possibility is that youth-produced vocal emotions, which have been found to be less distinct from one another in pitch (Morningstar et al., 2017), may not elicit emotion-specific responses in the brain as strongly as adult-produced stimuli would. Future work should investigate whether emotion-specific responses to affective prosody vary depending on factors such as the age of the listener or speaker.

Moreover, the current study utilized exemplars of vocal affect embedded in speech rather than nonspeech vocalizations. Although vocalizations (such as laughs, cries, or sounds of disgust) can convey emotionality in specific circumstances, important emotional information communicated during social interactions is also contained in the prosodic variations of others’ voices. Being able to parse another person’s emotions or attitudes based on “the way they said something” is a crucial skill that continues to develop throughout late adolescence. Indeed, although the interpretation of vocal bursts reaches adult-like maturity around the age of 14 to 15 years (Grosbras, Ross, & Belin, 2018), the current study and previous work have noted continued improvement beyond age 15 for the recognition of vocal affect in speech (Chronaki et al., 2018; Morningstar et al., 2018a). Thus, it is possible that age-related changes in the neural representation of these stimuli also may differ from those noted in response to speech-based vocal emotion. Additionally, the recordings used in the present study were selected based on adult ratings of their quality rather than those of adolescents. Youth may deem these exemplars as less representative of the intended emotions, which may have impacted their behavioural performance and related neural processing of these stimuli. Future work would benefit from investigating whether similar neural and behavioural responses are noted to nonspeech vocal emotion or to stimuli selected based on youth’s ratings of their representativeness.

Lastly, data in the current study were cross-sectional and did not model change within individuals. Defining how brain activation and connectivity map onto behavioural development of ER by assessing change in both neural activation and task performance longitudinally within-subject will be an important future direction. The addition of an adult comparison group also would provide a more complete picture of developmental change in activation during vocal ER; future research is needed to determine the age at which youth attain adult-like maturity in the capacity to decode affective prosody. Results should be replicated in a larger sample before broader conclusions about development can be drawn.

Conclusions

The current study used a multimodal approach integrating task-based fMRI, functional connectivity, and DTI to investigate the neural mechanisms supporting the maturation of vocal ER in adolescence. This ability has been linked to social competence in youth (Nowicki & Duke, 1992, 1994) and relational satisfaction in adults (Carton, Kessler, & Pape, 1999), making it important to understand the mechanisms that facilitate its development. Our findings suggest that ongoing maturation of vocal ER in adolescence may be supported by increased connectivity between frontal and temporal-parietal areas involved in the processing of socioemotional and linguistic cues.