Music and speech comprise sounds that unfold over time. The two domains may draw on separate (Peretz & Coltheart, 2003) or overlapping (e.g., Patel, 2011) mental resources. Here, we examined whether music skills predict phonological perception in a foreign language, asking whether (1) speech perception is associated with musical competence, (2) observed associations are better attributed to music training or perceptual abilities, and (3) such associations are independent of general cognitive abilities.

Music perception and speech perception

Theories of overlap in temporal processing for music and speech (Goswami, 2012; Tallal & Gaab, 2006) imply that rhythm abilities are especially likely to correlate with speech processing. In line with this view, rhythm abilities predict phonological processing in typically developing children (Carr, White-Schwoch, Tierney, Strait, & Kraus, 2014; Moritz, Yampolsky, Papadelis, Thomson, & Wolf, 2013), adolescents (Tierney & Kraus, 2013), and adults (Grube, Cooper, & Griffiths, 2013). Such associations also extend to syntax and reading abilities (Gordon et al., 2015; Grube et al., 2013; Tierney & Kraus, 2013). For children with reading impairments, rhythm abilities are below normal (Overy, Nicolson, Fawcett, & Clarke, 2003) and correlated with their phonological and reading abilities (Huss, Verney, Fosker, Mead, & Goswami, 2011). Moreover, interventions that focus on rhythm and temporal processing improve their phonological skills (Flaugnacco et al., 2015; Thomson, Leong, & Goswami, 2013).

Among adults, however, the picture is more complicated. For example, melody and rhythm perception are correlated (.5 < r < .7; Bhatara, Yeung, & Nazzi, 2015; Wallentin et al., 2010), and in studies of non-native language (L2) abilities, researchers have reported that L2 experience predicts rhythm but not melody perception (Bhatara et al., 2015), that melody perception correlates positively with L2 pronunciation (Posedel, Emery, Souza, & Fountain, 2012), and that better melody and rhythm abilities predict better L2 phonological abilities (Kempe, Bublitz, & Brooks, 2015; Slevc & Miyake, 2006). Moreover, for typically developing children, melody perception predicts phonological processing (or reading ability) as well as or better than rhythm perception (Anvari, Trainor, Woodside, & Levy, 2002; Grube, Kumar, Cooper, Turton, & Griffiths, 2012), and associations between rhythm perception and phonological processing can disappear when IQ is held constant (Gordon et al., 2015). It remains an open question, then, whether associations with speech perception are stronger for rhythm than for melody perception.

Music training and speech perception

Music training is associated with speech perception, higher-level language abilities (e.g., reading), and general cognitive abilities (Schellenberg & Weiss, 2013). Indeed, music training is associated with native-language (L1) phonological perception (Zuk et al., 2013) and reading abilities (Corrigall & Trainor, 2011), and with L2 fluency (Swaminathan & Gopinath, 2013; Yang, Ma, Gong, Hu, & Yao, 2014). Longitudinal interventions with random assignment indicate that music training may cause improvements in children’s speech perception (Degé & Schwarzer, 2011; Flaugnacco et al., 2015; François, Chobert, Besson, & Schön, 2013; Moreno et al., 2009; Thomson et al., 2013).

Nevertheless, associations between music training and speech perception are not always replicable (Boebinger et al., 2015; Ruggles, Freyman, & Oxenham, 2014). Moreover, intervention studies have used intensive (daily) training that focused primarily on listening skills rather than on playing an instrument (Degé & Schwarzer, 2011), programs that included training in speech rhythm as well as music rhythm (Thomson et al., 2013), or outcome tasks that were biased in favor of the music group (François et al., 2013). Correlational studies typically involve more conventional music lessons, but preexisting musical, cognitive, and motivational factors mean that the direction of causation is unclear (Corrigall & Schellenberg, 2015; Corrigall, Schellenberg, & Misura, 2013).

The present study

We sought to determine whether non-native speech perception is associated with music training, and whether it is more closely associated with melody or rhythm perception. Examination of music-perception abilities and music training allowed us to ask whether observed associations were better explained by music-perception abilities with training held constant, or, conversely, by music training with music-perception abilities held constant.

One view holds that music lessons enhance speech skills by training the ability to decode meaning from sound (Kraus & Chandrasekaran, 2010). Because listeners perceive speech sounds from an unfamiliar language using the phonological framework of their native language (e.g., Best, McRoberts, & Goodell, 2001; Werker & Tees, 1984), our stimulus set included tokens that varied in resemblance to Canadian-English phonology (Best et al., 2001; Best, McRoberts, & Sithole, 1988). At the extreme, we tested participants’ ability to discriminate clicks in Zulu. In other conditions, Zulu contrasts were foreign sounding but more easily assimilated to English categories. If there is an association between music and speech, musical competence may be important only when stimuli sound like speech.

Although musical competence predicts speech and language abilities, it is also correlated with visuospatial skills and general cognitive abilities (Schellenberg & Weiss, 2013). Indeed, intelligence and memory are predicted by music training and by basic music-perception skills. Thus, we also tested whether associations between music and speech are a by-product of individual differences in general cognitive functioning.

Method

Participants

The participants were 151 undergraduates who were native speakers of English (87 female, 13 left-handed, mean age 18.4 years, SD = 1.0) and received all of their formal education in English. None reported a history of hearing problems or exposure to an African language. They had an average of 4.9 years of private or school music lessons. For those who reported learning more than one instrument (or voice), duration of training was summed across instruments. Because the distribution was skewed positively (SD = 6.8 years, median = 2), duration of training was square-root transformed for statistical analyses.
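As an illustration of this preprocessing step, the sketch below sums years of training across instruments and applies a square-root transform; the column names and the pandas/NumPy workflow are illustrative assumptions, not the scripts used in the study.

```python
import numpy as np
import pandas as pd

# Hypothetical per-participant records of years of training on each instrument (or voice).
records = pd.DataFrame({
    "participant": [1, 1, 2, 3, 4],
    "instrument": ["piano", "voice", "guitar", "violin", "none"],
    "years": [3.0, 2.0, 0.5, 10.0, 0.0],
})

# Sum duration of training across instruments for each participant.
training = records.groupby("participant")["years"].sum()

# The distribution is positively skewed, so square-root transform it
# before it enters the correlational analyses.
training_sqrt = np.sqrt(training)
print(training_sqrt)
```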

Measures

Socioeconomic status

Participants provided information about their family income and mother’s and father’s education, as in previous research (e.g., Schellenberg, 2006). Because the three SES variables were intercorrelated, ps < .001, the first principal component was extracted for use in statistical analyses. This latent variable correlated highly with each original variable, rs > .7, and accounted for 61.8% of the variance.
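As a rough sketch of this data-reduction step, the code below standardizes three hypothetical SES variables and extracts the first principal component with scikit-learn; the column names and values are placeholders, so the printed loadings and variance will not match the figures reported above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical SES data: family income and parents' education (ordinal codes).
ses = pd.DataFrame({
    "family_income": [3, 5, 2, 4, 5, 1, 3, 4],
    "mother_edu":    [4, 6, 3, 5, 6, 2, 4, 5],
    "father_edu":    [3, 6, 2, 5, 5, 2, 3, 5],
})

# Standardize the variables, then extract the first principal component
# as a single latent SES score.
z = StandardScaler().fit_transform(ses)
pca = PCA(n_components=1)
ses_score = pd.Series(pca.fit_transform(z).ravel(), index=ses.index)

print("variance accounted for:", pca.explained_variance_ratio_[0])
print("correlations with original variables:")
print(ses.corrwith(ses_score).round(2))
```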

General cognitive abilities

The forward and backward portions of the Digit Span test were used to measure short-term and working memory, respectively. Nonverbal intelligence was measured with the 12-item version of Raven’s Advanced Progressive Matrices (APM; Bors & Stokes, 1998).

Music-perception skills

The Musical Ear Test (MET; Wallentin, Nielsen, Friis-Olivarius, Vuust, & Vuust, 2010) provided two measures of musical competence: melody perception and rhythm perception. On each trial, participants heard two short auditory sequences that were identical on half of the trials and differed on the other half, and judged whether the two sequences were identical.

Speech perception

A Zulu, minimal-pairs, consonant-matching task had four conditions that varied in difficulty depending on similarity to English consonants (Best et al., 1988, 2001). Condition 1 (the easiest) contrasted a voiceless and voiced lateral fricative (/ɬ/-/ɮ/), which native-English listeners assimilate to different English phonemes (/θ s ∫/ vs /ð z ʒ/). Condition 2 had voiceless aspirated and ejective (glottalized) velar stops (/kʰ/-/k′/), which are typically assimilated to a single English consonant (/k/), although one phoneme sounds like a better approximation. Condition 3 comprised plosive and implosive voiced bilabial stops (/b/-/ɓ/), both of which are assimilated to a single English consonant (/b/) but sound different. Condition 4 (the most difficult) had voiceless unaspirated apical clicks and lateral clicks, which cannot be assimilated to any English consonants. All tokens were consonant-vowel syllables with contrasting consonants but the same vowel within a condition (/ɛ/, /a/, /u/, and /a/ in Conditions 1–4, respectively). All contrasts differed in both temporal and pitch cues, such that associations with either melody or rhythm perception were plausible. The two click syllables varied less in overall duration (286 vs. 293 ms on average), however, compared to pairs in other conditions (310 vs. 345 ms, 285 vs. 264 ms, and 261 vs. 294 ms, for Conditions 1–3, respectively). More detailed acoustic information is provided in the Supplementary Materials.

In an AXB discrimination task, A (presented first) and B (presented last) were contrasting speech tokens that had different consonants. X (presented between A and B) was always a non-identical token from the same category as A (half of the trials) or B (the other half). Participants decided whether A or B sounded more like X, such that the task required phoneme discrimination and matching. Assignment of the two phonemic categories to A or B was counterbalanced. Within each trial, the onset-to-onset interval was fixed at 1 s. The test was presented in eight blocks of 40 trials each, with two blocks per condition, such that there were 80 trials per condition, with blocks and trials randomized separately for each participant.
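To make the AXB structure concrete, here is a minimal sketch of how a trial list for one condition might be assembled, assuming hypothetical token labels; the counterbalancing and randomization follow the description above only in outline.

```python
import random

# Hypothetical token labels for one condition: several recordings per phoneme category.
category_a = ["A1", "A2", "A3", "A4"]
category_b = ["B1", "B2", "B3", "B4"]

def make_block(n_trials=40, seed=0):
    """Build one block of AXB trials: A and B contrast in consonant,
    and X is a non-identical token from the same category as A or B."""
    rng = random.Random(seed)
    trials = []
    for i in range(n_trials):
        a, b = rng.choice(category_a), rng.choice(category_b)
        # Counterbalance which category is presented first ...
        first, last = (a, b) if i % 2 == 0 else (b, a)
        # ... and whether X matches the first (A) or last (B) token's category.
        x_matches_first = (i // 2) % 2 == 0
        if x_matches_first:
            source = category_a if first in category_a else category_b
        else:
            source = category_a if last in category_a else category_b
        x = rng.choice([t for t in source if t not in (first, last)])
        trials.append({"A": first, "X": x, "B": last,
                       "correct": "A" if x_matches_first else "B"})
    rng.shuffle(trials)
    return trials

print(make_block()[0])
```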

Procedure

Participants were tested individually in a sound-attenuating booth. They completed the digit-span test, the speech-perception test, and a questionnaire that asked for background information (history of music training, demographics). After a short break, they completed the APM and the MET. The speech perception test and the MET were administered on an iMac with stimuli presented over headphones. The testing session took up to 90 min.

Results

Performance was above chance levels in each of the four speech conditions, ps < .001. A one-way repeated-measures analysis of variance (ANOVA) confirmed that performance differed across conditions, F(3, 450) = 767.24, p < .001, partial η² = .836, with better performance in Condition 1 than in Condition 2, in Condition 2 than in Condition 3, and in Condition 3 than in Condition 4, ps < .001.
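The sketch below illustrates these two checks with SciPy and statsmodels, assuming a long-format table of simulated proportion-correct scores with placeholder column names; it demonstrates the form of the analyses rather than the reported values.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
n = 151  # participants

# Simulated long-format scores: proportion correct per participant and condition.
df = pd.concat([
    pd.DataFrame({"participant": np.arange(n), "condition": cond,
                  "p_correct": np.clip(rng.normal(mu, 0.08, n), 0, 1)})
    for cond, mu in zip([1, 2, 3, 4], [0.95, 0.85, 0.75, 0.60])
], ignore_index=True)

# One-sample t tests against chance (.5 in a two-alternative AXB task).
for cond, sub in df.groupby("condition"):
    t, p = stats.ttest_1samp(sub["p_correct"], 0.5)
    print(f"Condition {cond}: t = {t:.2f}, p = {p:.2g}")

# One-way repeated-measures ANOVA across the four conditions.
print(AnovaRM(df, depvar="p_correct", subject="participant",
              within=["condition"]).fit())
```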

Although performance declined as predicted as the stimuli decreased in similarity to English phonemes, a marked violation of sphericity, p < .001 (Mauchly’s test), indicated that pairwise correlations between conditions varied substantially. To reduce redundancy in the results that follow, we conducted a principal components analysis (varimax rotation). A two-component solution accounted for two-thirds (66.45%) of the original variance. Conditions 1, 2, and 3 loaded onto the first component (rs ≥ .68), whereas Condition 4 was almost perfectly correlated with the second component (r = .93). The first (speech-like) component therefore reflected perception of speech-like aspects of the Zulu tokens, whereas the second (non-speech-like) component reflected perception of non-speech-like aspects. Factor scores were used in subsequent analyses. Results from the original conditions are provided in the Supplementary Materials.
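For readers who want to reproduce this kind of data reduction, the following sketch extracts two components from a simulated 151 × 4 matrix of condition scores and applies a small hand-rolled varimax rotation, because scikit-learn's PCA does not rotate loadings; the data are random placeholders, so the loading pattern will not mirror the one reported here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a p x k loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p)
        )
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):
            break
        criterion = s.sum()
    return loadings @ rotation

# Simulated scores: 151 participants x 4 Zulu conditions (placeholder data).
rng = np.random.default_rng(2)
scores = StandardScaler().fit_transform(rng.normal(size=(151, 4)))

pca = PCA(n_components=2).fit(scores)
print("variance accounted for:", pca.explained_variance_ratio_.sum())

# Component loadings (eigenvectors scaled by the square root of the eigenvalues),
# then varimax-rotated; factor scores would feed the subsequent analyses.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(varimax(loadings).round(2))
```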

Preliminary analyses revealed that SES had no associations with any other variables, so SES was not considered further. Correlations among the cognitive variables (short-term memory, working memory, nonverbal intelligence) revealed that working memory was correlated positively with short-term memory, r = .47, and with nonverbal intelligence, r = .28, ps < .001. To identify possible confounding variables for the main analyses, we examined pairwise associations between the cognitive variables and each of the music (music training, melody perception, rhythm perception) and phoneme-perception (speech-like, non-speech-like) variables (see Table 1). Duration of music training was associated positively with working memory. Melody perception was correlated positively with short-term memory, whereas rhythm perception was correlated positively with both short-term and working memory. The phoneme-perception variables had no associations with the cognitive variables.

Table 1 Associations between cognitive and target variables

The main analyses examined associations between the speech and music variables (see Table 2). As expected, melody perception and rhythm perception were associated positively, and both variables were correlated positively with music training. There were no simple or partial associations between music training and non-native phoneme discrimination. When we examined associations between music perception and phoneme discrimination, rhythm perception was the only significant predictor, and only for the speech-like component. Rhythm perception continued to be associated with the speech-like component after we held short-term memory and working memory constant, pr (partial correlation) = .25, p = .002. When we included music training and nonverbal intelligence as additional control variables, the partial association between rhythm perception and the speech-like component was similar in magnitude, pr = .26, p = .001.
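As an illustration of the control analyses, the sketch below computes a partial correlation by residualizing both variables on the covariates with ordinary least squares; the data are simulated and the variable names are placeholders, so the printed values are not those reported above.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, covariates):
    """Correlation between x and y after regressing the covariates out of both."""
    design = np.column_stack([np.ones(len(x))] + list(covariates))
    residualize = lambda v: v - design @ np.linalg.lstsq(design, v, rcond=None)[0]
    return stats.pearsonr(residualize(np.asarray(x, float)),
                          residualize(np.asarray(y, float)))

# Simulated variables standing in for the study's measures.
rng = np.random.default_rng(3)
n = 151
rhythm = rng.normal(size=n)
speech_like = 0.3 * rhythm + rng.normal(size=n)       # hypothetical association
stm, wm = rng.normal(size=n), rng.normal(size=n)      # short-term and working memory

r, p = partial_corr(speech_like, rhythm, [stm, wm])
print(f"pr = {r:.2f}, p = {p:.3g}")
```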

Table 2 Associations among target variables

To ensure that the null results with music training were not an artifact of the way it was coded, we coded training in three additional ways. The results did not change (see the Supplementary Materials).

Discussion

In a sample of adult native speakers of English, we examined whether musical expertise predicted speech perception in a foreign language. Rhythm perception predicted phoneme-discrimination performance in Zulu, and this association remained significant even after controlling for general cognition and music training. The association was limited, however, to phonemes that resembled tokens from English phonology. Rhythm perception did not predict the ability to discriminate Zulu clicks, and there were no associations between non-native speech perception and melody perception or music training.

Although rhythm perception and melody perception were correlated in the present study, our findings are consistent with proposals of a special relation between music and language domains (Kraus & Chandrasekaran, 2010; Patel, 2011) that stems primarily from shared temporal processing (Goswami, 2012; Tallal & Gaab, 2006). Nevertheless, a correlation with rhythm but not with melody could also stem from the fact that temporal distinctions (e.g., overall duration) were greater for the speech-like than for the non-speech-like contrasts. The association could also be a consequence of native-language background. Because pitch does not determine lexical meaning in English, native speakers may attend preferentially to temporal cues. For tone languages, however, native speakers may be more inclined to attend to pitch cues in non-native speech, possibly giving rise to an association of melody perception with non-native phoneme perception. A similar argument applies to the finding that rhythm perception did not predict the perception of Zulu clicks. If musical competence is especially relevant for acoustic cues that are perceived to be communicative or meaningful (Kraus & Chandrasekaran, 2010), the null result may reflect the fact that Zulu clicks do not sound meaningful to English-speaking listeners. For native speakers of click languages (Zulu or otherwise), however, musical competence could be associated with non-native click perception. More generally, native-language background may moderate the association between music and non-native speech perception by influencing which speech sounds have communicative relevance.

There was no association between music training and speech perception, a finding consistent with some previous reports (Boebinger et al., 2015; Ruggles et al., 2014), but contrary to results reported by researchers who administered tests of phonology (e.g., Zuk et al., 2013), speech segmentation (François et al., 2013), the perception of pitch and intonation in speech (Besson, Schön, Moreno, Santos, & Magne, 2007), and speech perception in suboptimal conditions (e.g., Strait & Kraus, 2011; Swaminathan et al., 2015). Although researchers frequently interpret significant correlations between music training and speech perception as evidence for training effects or plasticity (e.g., Kraus & Chandrasekaran, 2010; Skoe & Kraus, 2012; Strait & Kraus, 2011), such associations could be the consequence of preexisting individual differences (Schellenberg, 2015). Findings from twin studies confirm that genetic factors play a role in music perception, the propensity to engage in musical activities, and musical accomplishment (for review see Hambrick, Ullén, & Mosing, 2016). In short, nature influences musical competence and the likelihood of taking music lessons, which would, in turn, further influence musical competence—a gene-environment interaction (Hambrick et al., 2016; Schellenberg, 2015). In the present study, rhythm perception, but not music training, was associated with speech perception, and rhythm perception predicted speech perception after holding training constant. In other words, natural abilities were a better predictor of speech perception than music training. If music training influences speech perception, the effect appears to be small (or non-existent), and potentially a consequence of pre-existing differences in music perception (i.e., aptitude).

Unequivocal evidence for training effects comes only from experiments with random assignment, which eliminate processes that promote musical participation in the first place, and make it impossible to examine interactions between genes and the environment. Correlational and quasi-experimental studies, by contrast, provide ecologically valid snapshots but leave training effects undifferentiated from preexisting differences. Future research could diminish this problem by including a more comprehensive suite of hypothesized preexisting factors. In the present investigation, we measured music perception and music training. With training held constant, performance on the music-perception test was a better estimate of natural musical abilities. With music perception held constant, training was a purer measure of learning music, which could in principle promote speech and language skills, although personality, cognitive, and demographic variables are also implicated in the choice to take music lessons (Corrigall et al., 2013). In any event, the present findings indicate that associations between language and music processes are likely to be over-simplified when researchers assume that music training causes improvements in speech perception.