Introduction

Talker identification – the process of identifying a speaker by the sound of their voice – is an important social and perceptual skill. Research has consistently demonstrated that the ability to identify talkers is functionally integrated with the processes involved in perceiving speech. A prominent phenomenon demonstrating this integration is the language-familiarity effect in talker identification, in which listeners are more accurate at identifying talkers by the sound of their voice when listening to speech in their native language compared to unfamiliar or foreign languages (Goggin et al., 1991; Perrachione & Wong, 2007; Thompson, 1987). This effect of language on processing talker identity underscores a bi-directional relationship between linguistic and social-perceptual faculties (Kuhl, 2011): Listeners are able to both resolve talker variability in order to arrive at an underlying linguistic message (e.g., Choi, Hu, & Perrachione, 2018; Mullennix & Pisoni, 1990) and employ an underlying linguistic representation in order to more accurately identify a speaker by the sound of their voice (e.g., Perrachione, Del Tufo, & Gabrieli, 2011).

Although the relationship between language familiarity and talker identification ability is reliably observed in a large body of scientific work (reviewed in Perrachione, 2018), there remains no agreed-upon cognitive model to explain either what information is integrated between these two faculties or how such integration occurs. Some authors have asserted a role for higher-level linguistic processing in talker identification, in which listeners gain access to talker identity-relevant information by processing and representing speech at the level of familiar linguistic units such as words (e.g., McLaughlin et al., 2015; Perrachione, Del Tufo, & Gabrieli, 2011). Other authors have described how the language-familiarity effect can arise from acoustic-phonetic processing, in which listeners gain access to talker identity-related information by processing speech with respect to the familiar phonetic patterns of their native language (Fleming et al., 2014; Zarate et al., 2015). Although both sources of information – acoustic-phonetic and lexical – have been found to simultaneously facilitate native-language talker identification (Perrachione et al., 2015), it is currently unknown whether these sources of information contribute independently to this ability, or whether there is a bi-directional or hierarchical dependence between these representations. Determining when and how various levels of linguistic knowledge affect talker identification is necessary to better understand the integration between linguistic, perceptual, and mnemonic processes in the human mind.

The role of familiar sounds and sound patterns in talker identification

Several lines of evidence support the idea that speech with familiar phonetics and phonological structure facilitates listeners’ perception of talker identity, even when that speech lacks familiar words. Listeners identify talkers more accurately from meaningless pseudo-words that follow the sound structure of their native language than they do from foreign-language speech (Perrachione et al., 2015; Xie & Myers, 2015b; Zarate et al., 2015), suggesting familiar sound structure gives listeners access to additional talker-specific information even in the absence of comprehensibility. The benefits of sound-structure familiarity may not even depend on linguistic structure derived from word knowledge: When learning to identify talkers speaking in French, self-reported monolingual English listeners from Canada outperformed monolingual English listeners from the USA, suggesting that incidental, passive exposure to the sound structure of an unfamiliar language may also facilitate talker identification, even in the putative absence of any familiar word forms (Orena, Theodore, & Polka, 2015). Likewise, for infants as young as 7 months, familiarity with the sound structure of their native language, even though they recognize few if any words, is sufficient to elicit a form of the language-familiarity effect (Fecher & Johnson, 2018b; Johnson et al., 2011).

The idea that accurate talker identification is driven in part by phonological familiarity is supported by some reports that show a larger language-familiarity effect between languages that are more dissimilar phonologically (Zarate et al., 2015), although such effects of phonological dissimilarity are not commonly reported (Johnson et al., 2011; Köster & Schiller, 1997; Xie & Myers, 2015a). Finally, when listening to time-reversed speech (which putatively retains certain acoustic-phonetic features while rendering speech incomprehensible) listeners rate talkers of their native language as more dissimilar sounding than talkers of a foreign language (Fleming et al., 2014; cf. Furbeck et al., 2018), suggesting that listeners are more sensitive to inter-talker differences in the presence of familiar, language-specific acoustic structure. Collectively, there is an abundance of evidence that listening to speech with familiar acoustic-phonetic properties contributes to more accurate processing of talker-identity related information.

The role of familiar words and higher-level linguistic units in talker identification

There is also evidence demonstrating that, beyond familiarity with sound structure, talker identification is facilitated by higher-level linguistic processing, particularly representations at the level of words. Several studies have shown that talker identification abilities improve as a function of the amount of linguistic information available from talkers. Listeners identify talkers more accurately as their speech increases in complexity from vowels to words to sentences (Bricker & Pruzansky, 1966; Goggin et al., 1991; Pollack, Pickett, & Sumby, 1954) – an effect that appears to hold for foreign-accented speech as well (Goldstein, Knight, Bailis, & Conover, 1981). When listening to two-word sequences, listeners detect a change in talker across words more accurately when the two words are unrelated than when the words form a meaningful sequence, demonstrating integrated processing of lexical-semantic and phonetic-indexical information (Narayan, Mak, & Bialystok, 2017). Talker identification is also improved as the quantity of known, as opposed to novel, words increases: Listeners perform better when identifying talkers from speech comprised of real words compared to nonsense speech matched in native-language phonological structure (Goggin et al., 1991; Perrachione et al., 2015; Xie & Myers, 2015b). Listeners also learn to identify talkers more accurately in their native language, but not a foreign language, when the lexical content of the speech is repeated, revealing that consistent (but unknown) speech content confers no talker identification benefit in a foreign language, whereas listeners' ability to identify native-language voices improves with their ability to remember and compare the content of their speech (McLaughlin et al., 2015).

Different task demands can also highlight the comparative importance of different levels of linguistic representation during voice and talker perception. Whereas perceptual dissimilarity ratings of time-reversed speech appear to be affected by correspondence in the language spoken by talkers and listeners (Fleming et al., 2014), listeners do not appear to gain an advantage in the identification of talkers in their native language when recordings have been time-reversed (Perrachione et al., 2015). Similarly, whereas familiarity with the language spoken by talkers imparts a large and reliable advantage in talker identification (Perrachione, 2018), linguistic familiarity does not appear to give listeners as much of an advantage in discriminating whether two speech samples come from the same or different talkers (Fecher & Johnson, 2018a; Wester, 2012; Winters, Levi, & Pisoni, 2008). That more complex tasks, such as talker identification, increasingly draw upon higher-level representations compared to simpler tasks, such as talker discrimination, raises the possibility that there are additive contributions of various levels of linguistic knowledge in representing talker-specific information. What remains unknown is whether these levels of representation can contribute independently, or whether there is a hierarchical dependence between lower and higher levels of representation in encoding talker identity-related information.

The present study: Do familiar words always benefit talker identification?

Across three experiments, we explored whether talker identification abilities benefitted from processing familiar lexical information independently of familiar acoustic-phonetic information. Specifically, we examined whether being able to parse a speech stream into familiar words, particularly when the sound structure was unfamiliar, would nonetheless facilitate talker identification accuracy. In the first two experiments, listeners heard sentences spoken in Mandarin that could be convincingly coerced to sound like English when presented with subtitles that primed lexical expectations during speech processing. These sentences were carefully designed to be semantically and syntactically plausible in both languages, with the presence of subtitles priming plausible English glosses of the Mandarin speech.

The coercion of speech produced in one language to sound convincingly like speech from another language has been widely demonstrated in the pop-culture phenomenon of “mondegreens,” in which speech (frequently song lyrics) in a foreign language is heard as native-language speech in the presence of simultaneous native-language subtitles (Liberman, 2007). Speech perception research in the laboratory has likewise demonstrated numerous circumstances where top-down expectations about words influence listeners’ speech processing. The classical example of biasing perception based on lexical expectations comes from the Ganong effect in categorical perception, in which listeners are biased to disregard competing acoustic information in favor of perceiving real words (Ganong, 1980). Biasing perception of speech based on listeners’ expectations also extends to richer phonetic contexts such as sentences. Perception of vocoded sentences, where detailed spectral information is removed from the speech signal, is more accurate when listeners are primed to expect key content words from the sentence (Davis et al., 2005). Using subtitles to prime lexical expectations also helps listeners perceive the words in vocoded speech more accurately (Sohoglu & Davis, 2016). In perhaps the most compelling example of the power of expectations to bias perception in favor of real words, listeners report actually “hearing” target words in speech when they have been primed to expect those words, even when all distinguishing spectral and temporal information from the acoustic signal has been completely effaced (which renders speech otherwise totally incomprehensible; Holdgraf et al., 2016).

From the Ganong effect to the identification of vocoded speech, the power of top-down expectations to alter the correspondence between sensory input and linguistic representations is well established in speech processing. But can these top-down linguistic biases also affect the correspondence between sensory inputs and talker representations? Voice processing may take advantage of a perceptual space wherein talkers’ voices are encoded as deviations from a prototype voice (Latinus & Belin, 2011), the specification of which likely depends on language-specific representations of voices (Goggin et al., 1991) constructed from language-specific acoustic, phonetic, phonological, and lexical features (e.g., Fleming et al., 2014). However, the phonetic-phonological correspondences differ across languages (e.g., Lisker & Abramson, 1964), and thus the informative variability in talker-specific phonetic idiosyncrasies may be more opaque to listeners when they are identifying foreign-language voices. Higher-level linguistic structure, such as words, guides both the perception and interpretation of ambiguous phonetic information (Getz & Toscano, 2019; Samuel, 1997, 2001) and can facilitate phonetic processing even in an unfamiliar language (Samuel & Frost, 2015). Correspondingly, by providing listeners with higher-level linguistic representations through which they can interpret the ambiguous phonetics of foreign language speech, known lexical content may give listeners a scaffold upon which they can extract more information about talker-specific phonetic variation and thus facilitate foreign-language talker identification.

In the present study, we first tested the hypothesis that priming listeners to parse a foreign-language speech stream comprised of unfamiliar sounds into real words via native-language subtitles would improve talker identification accuracy compared to a condition in which no primes were presented. If this manipulation improved talker identification from foreign-language speech, it would favor a model of talker identification in which facilitatory representations of voices are made available via lexical processing in parallel with talker-specific information provided by familiar sound structure. However, if allowing listeners to parse a speech stream comprised of unfamiliar sounds into one made up of familiar words has no effect on talker identification, it would suggest that the talker identification benefits conferred by processing the lexical content of speech (e.g., Goggin et al., 1991; McLaughlin et al., 2015; Perrachione et al., 2015) are only available when the acoustic-phonetic features of speech are also familiar. This latter result would, instead, favor a model of talker identification in which the facilitatory contribution of familiar words has a hierarchical dependence on the availability of familiar sound structure.

In two versions of this experiment involving different amounts of training, we found that, contrary to our expectations, lexical priming does not appear to improve talker identification in the absence of familiar phonological information. The laboratory manipulation of coercing foreign-language speech with an unfamiliar phonology to sound like listeners' native language is somewhat analogous to the common, real-world situation of listening to speech with a heavy foreign accent. Thus, we ran a third, follow-up experiment in which we investigated a related hypothesis: that the degree of phonetic dissimilarity (operationalized here as the degree of perceived foreign accent) negatively affects talker identification abilities for speech produced in listeners' native language. In this experiment, we observed a graded effect of unfamiliar phonetics on English-speaking listeners’ talker identification abilities, with most accurate talker identification for native English-accented talkers, followed by Mandarin-English bilinguals with a slight Mandarin accent (low-accentedness), Mandarin-English bilinguals with a stronger Mandarin accent (high-accentedness), and with Mandarin-speaking talkers identified least accurately. Taken together, the results from these three experiments strongly suggest that familiarity with the sound structure of speech has precedence over processing higher-level linguistic structure when conferring a benefit in talker identification, and thus that linguistic information contributes to talker identification in a hierarchical fashion, with higher levels of representation conferring a benefit only when lower levels are also familiar.

Experiment 1: Priming lexical representations during foreign-language talker identification

In this experiment, we investigated whether allowing listeners to parse a speech stream composed of unfamiliar sounds into familiar words via lexical priming with subtitles could confer a benefit in learning to identify talkers compared to listening to speech from the same foreign language without lexical priming. In a within-subjects, 2 × 2 factorial design, native English-speaking listeners learned to identify talkers speaking in either English or Mandarin, with or without accompanying subtitles to prime listeners to hear English words from the speech. Listeners completed each of these four talker identification conditions (English/Mandarin-speaking talkers presented with/without subtitles) separately in a counterbalanced order.

Methods

Participants

Native speakers of American-English completed this study (N = 32, 26 female, six male; age 18–35 years, M = 21.8). Inclusion criteria required participants to have a self-reported history free from speech, language, or hearing problems and no prior experience with Mandarin. This study was approved and overseen by the Institutional Review Board at Boston University. Participants provided written informed consent and were paid for their participation.

The sample size was determined by the number of permutations of experimental conditions necessary to counterbalance the stimuli, and is larger than the samples in most prior studies of the role of language in talker identification (Perrachione, 2018). Previous research found that manipulations involving lexical content in talker identification have effect sizes on the order of Cohen’s d = 0.5–1.2 (McLaughlin et al., 2015; Perrachione et al., 2015). Correspondingly, with N = 32 we have 87–100% power to detect effect sizes in the published range, and 80% power to detect effect sizes of d ≥ 0.45.
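For concreteness, the sketch below illustrates one way such a sensitivity analysis could be run in R with the pwr package. The one-sided paired comparison is our assumption (the exact procedure is not stated here), though it does reproduce the figures reported above.

```r
# Minimal sketch of the sensitivity analysis, assuming a one-sided paired
# comparison via the pwr package (an assumption; these settings happen to
# reproduce the figures reported in the text).
library(pwr)

# Power at N = 32 for the smallest published effect, d = 0.5 (~87%):
pwr.t.test(n = 32, d = 0.5, sig.level = 0.05,
           type = "paired", alternative = "greater")

# Smallest effect detectable with 80% power at N = 32 (d ~ 0.45):
pwr.t.test(n = 32, power = 0.80, sig.level = 0.05,
           type = "paired", alternative = "greater")
```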

Stimuli

Twenty “English-Mandarin hybrid sentences” were designed for this experiment (Table 1 and Appendix). Each hybrid sentence was syntactically correct and semantically plausible in both languages, but the English and Mandarin forms of the sentence were not translations between the two languages. Instead, the sentences – originally constructed in Mandarin – were designed to have an intended English “gloss” that could convincingly be heard from the phonetics of natural Mandarin speech. The Mandarin sentence and its English gloss were designed based on correspondences between the phonotactic properties of English, Mandarin, Mandarin-accented English, and the patterns of (mis)perception of Mandarin phonemes by English speakers (e.g., Tsao, Liu, & Kuhl, 2006). For example, in the hybrid sentence “陪你晚到了” (/pheɪ ni wɛn tau lə/), a listener expecting to hear Mandarin-accented English can convincingly hear, “Pay me one dollar” (/pheɪ mi wʌn dɑlɚ/), a mapping to English words that capitalizes on, among other features, reliable perception of the Mandarin voiceless but unaspirated [t] by English listeners as an English /d/ and the typical reduction in r-coloring of rhotic vowels in Mandarin-accented English. Hybrid stimuli were extensively piloted prior to use in this talker identification study to ensure they could elicit the intended English speech percept, particularly when presented with concomitant subtitles. We also confirmed that orienting listeners’ perceptual expectations towards an English interpretation of the Mandarin speech was effective at eliciting the intended English glosses during the actual talker identification experiment through a supplemental sentence transcription task, undertaken by a subset of participants after completing the Mandarin conditions of the talker identification task. This stimulus validation is described in detail below. (Example audio recordings of the English-Mandarin hybrid stimuli used in Experiment 1 are available as Supplementary Materials.)

Table 1 Example Mandarin-English hybrid sentences used in Experiments 1 and 2

The English-Mandarin hybrid sentences were recorded (in Mandarin) by ten female native speakers of Mandarin (age 19–27 years, M = 23 years). Corresponding recordings (in English) of the hybrid sentences’ intended English glosses were made by ten female native speakers of American English (age 19–29 years, M = 22.3 years). Both groups of talkers were without distinctive regional accents. Recordings were made in quiet in a sound-attenuated booth using a Shure MX153 earset microphone, a Behringer Ultragain Pro MIC2200 2-channel tube microphone preamplifier, and a Roland Quad Capture USB audio interface, with a sampling rate of 44.1 kHz and 16-bit digitization. Each sentence was RMS-amplitude normalized to 65 dB SPL using Praat (v5.3.63).
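The normalization itself was done with Praat’s built-in intensity scaling; purely for illustration, the hypothetical R function below performs the equivalent arithmetic on a raw waveform.

```r
# Illustrative sketch of RMS-amplitude normalization. The study used Praat's
# intensity scaling; this hypothetical helper shows the underlying operation.
normalize_rms <- function(x, target_db = 65, ref = 2e-5) {
  # x: waveform samples in pascals; ref: 20 uPa, the 0 dB SPL reference
  current_rms <- sqrt(mean(x^2))
  target_rms  <- ref * 10^(target_db / 20)  # RMS corresponding to target_db SPL
  x * (target_rms / current_rms)            # rescale so the RMS hits the target
}
```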

In the talker identification experiment, listeners learned to identify two sets of talkers in each language: once with subtitles accompanying their recordings and once with no subtitles. Because some voices are inherently more distinctive than others, we arranged our talkers in each language into two five-voice sets that would be equally identifiable on average. Additional pilot listeners learned to identify various groupings of these voices, allowing us to calibrate listeners' within-language accuracy to be equal between the two sets of talkers. This piloting ensured that, absent the lexical priming manipulation in the actual experiment, listeners’ mean accuracy would not differ between repetitions of the talker identification task with different speakers of each language. Furthermore, the two sets of talkers in each language were also counterbalanced so they appeared equally often with or without accompanying subtitles.

Procedure

In a within-subjects, 2 × 2 factorial design experiment, participants learned to identify talkers across manipulations of the language being spoken (English or Mandarin) and the presence of top-down lexical priming (with or without subtitles), resulting in four conditions: (1) English with subtitles, (2) English without subtitles, (3) Mandarin with subtitles, and (4) Mandarin without subtitles. In order to preserve the illusion that the Mandarin-with-subtitles condition was actually English, before this condition participants were told that they were hearing English speech with a heavy Mandarin accent, and that subtitles were being provided to help them recognize the speech. Prior to the Mandarin-without-subtitles condition, participants were told they would be hearing speech in a foreign language they would not be able to understand. In all conditions, participants were also told that their ability to understand the speech was not important, and that we were interested only in their ability to learn to identify the talkers.

Participants completed all conditions of the experiment in a single session, and the order of conditions was counterbalanced across participants. Participants learned a unique group of five voices in each condition, and the speech content (i.e., which hybrid sentences were presented) was unique in each condition. The sentences used in each condition were permuted across participants, and the talkers used in the subtitle versus no-subtitle conditions were also permuted (within language) across participants, to control for voice and item effects on the experimental manipulations.

Talker identification training and testing

The procedure for talker identification training and testing was the same in each condition (Fig. 1A), excepting the manipulations of the language being spoken and the presence of subtitles. In each condition, listeners learned to associate five talkers with five unique, numbered avatars. First, listeners were familiarized with, and practiced identifying, the five talkers in a series of interleaved passive listening and active identification blocks. Following familiarization, listeners were tested on their ability to correctly identify the talkers.

Fig. 1 Talker identification training and testing paradigm. (A) Across Experiments 1–3, listeners learned talkers' voices in training phases that alternated (B) blocks of passive listening with (C) blocks of active practice identifying talkers with feedback. (D) Listeners were then tested on their ability to identify the talkers without feedback. (E–G) In the lexical priming conditions of Experiments 1 and 2, listeners also saw subtitles before and during the talkers' speech that primed them to expect to hear certain words. When listening to Mandarin-speaking talkers, these subtitles led listeners to perceive the intended English gloss of each sentence (albeit with a strong Mandarin accent) as they learned to identify the Mandarin voices from these recordings

In the training phase of each condition, participants learned to identify the talkers by the sound of their voice across five interleaved blocks of passive familiarization and active identification practice. This procedure has been used extensively in talker identification studies (Perrachione & Wong, 2007; Xie & Myers, 2015a; Zarate et al., 2015; inter alia). During familiarization (Fig. 1B), participants heard each of the five talkers say the same sentence in turn while the corresponding avatar and talker number appeared on the screen. Listeners heard each talker say the sentence twice (ten familiarization trials). Next, participants completed a ten-trial block of talker identification practice (Fig. 1C). With all five of the talkers' avatars on the screen, listeners heard each of the talkers saying the same sentence from the preceding familiarization block twice in a random order, and they indicated on each trial which talker they believed was speaking by pressing the corresponding number on a keypad. Participants received corrective feedback indicating whether they had chosen correctly, or who the correct talker was. After ten active practice trials, listeners underwent the next block of passive familiarization with a new sentence, and so on until they had been trained on five sentences. Thus, participants completed a total of 100 trials of training: 50 trials of familiarization with each talker (5 sentences × 5 talkers × 2 repetitions) and 50 trials of active practice identifying the target talkers with feedback.
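The trial counts fall directly out of the factorial structure of each condition; the snippet below is hypothetical bookkeeping code (not the experiment script) that makes the arithmetic explicit.

```r
# Hypothetical sketch of the training-trial structure described above.
familiarization <- expand.grid(sentence = 1:5, talker = 1:5, repetition = 1:2)
practice        <- familiarization      # active practice mirrors familiarization
nrow(familiarization)                   # 50 familiarization trials
nrow(familiarization) + nrow(practice)  # 100 training trials in total
```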

After training was completed, listeners were tested on their ability to identify the talkers. They again saw all five talkers' avatars on the screen and indicated which of the five speakers they believed said a sentence (Fig. 1D); however, in the test phase, participants did not receive feedback. Participants heard the same sentences during test that they had heard during training. While it is often desirable to test on novel materials to ascertain generalization of talker identity to new speech materials, doing so frequently results in a performance decrement (e.g., McLaughlin et al., 2015; Perrachione & Wong, 2007). To accommodate the possibility that the beneficial effects of the subtitles were small, we chose to use the same sentences during the test phase to maximize listeners’ familiarity with the stimuli and thus their potential opportunity to use lexically derived cues for talker identification. The order of sentences and talkers was randomized, and participants’ talker identification abilities were tested in 50 test trials (5 talkers × 5 sentences × 2 repetitions).

For the two conditions where subtitles were used to prime lexical expectations, each subtitle was displayed on the screen two seconds before the presentation of the recording, so that listeners would have enough time to read it and form an expectation about the speech content of the upcoming sentence. Subtitles accompanied the presentation of all speech stimuli in these conditions, including during familiarization, practice with feedback, and at test (Fig. 1E–G). In the conditions without subtitles, a blank screen appeared for two seconds at the beginning of each trial, such that trial timing was identical across conditions.

Transcription of speech in the foreign-language conditions

To ascertain whether English-language subtitles accompanying the Mandarin speech were effective at eliciting the intended English lexical representations, half of the participants (N = 16) undertook an additional sentence transcription task after completing the talker identification test in each Mandarin condition. In this self-paced transcription task, participants heard, in a random order, each of the five talkers saying each of the five sentences from that condition (25 trials). Participants were instructed to “type the sentence exactly as you heard it,” and were told they were free to do so however they thought best reflected what they heard. Participants could see their responses during each trial while typing them; otherwise, no other information (particularly, no subtitles) appeared on the screen during the transcription task.

Transcriptions of the hybrid sentences from the two Mandarin conditions were scored on a number of dimensions, including (1) whether the sentence exactly matched the intended English gloss, (2) the proportion of words from the intended English gloss that were transcribed as intended, (3) whether the sentence was transcribed using only real English words, and (4) whether the sentence transcription contained any real English words. Sentence transcriptions were assessed conservatively; for instance, if a participant submitted the transcription, “my friends need your jelly,” for the target gloss, “my friend needs some jelly,” this was assessed to be (1) an incorrect transcription of the target sentence, (2) a correct transcription of 2/5 words, and, for measures (3) and (4), a transcription consisting entirely of real English words (and therefore containing at least one).
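A minimal sketch of this scoring scheme in R follows; the function and the toy lexicon are our reconstruction for illustration, not the scoring script actually used.

```r
# Reconstruction, for illustration only, of the four transcription scores.
# `lexicon` stands in for a hypothetical list of real English word forms.
score_transcription <- function(response, gloss, lexicon) {
  resp_w  <- tolower(strsplit(response, "\\s+")[[1]])
  gloss_w <- tolower(strsplit(gloss, "\\s+")[[1]])
  list(
    exact_match  = identical(resp_w, gloss_w),  # (1) exactly the intended gloss
    prop_correct = mean(gloss_w %in% resp_w),   # (2) proportion of gloss words
    only_english = all(resp_w %in% lexicon),    # (3) only real English words
    any_english  = any(resp_w %in% lexicon)     # (4) at least one English word
  )
}

# The worked example from the text:
score_transcription("my friends need your jelly",
                    "my friend needs some jelly",
                    lexicon = c("my", "friends", "friend", "needs", "need",
                                "your", "some", "jelly"))
# exact_match FALSE; prop_correct 0.4 (2/5); only_english TRUE; any_english TRUE
```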

Data analysis

In this and the subsequent experiments, data were analyzed using (generalized) linear mixed-effects models implemented with the libraries lme4 (v1.1-21), lmerTest (v3.1-0), and car (v3.0-2) in R (v3.5.3). Significance was based on the criterion α = 0.05, with degrees of freedom estimated using the Satterthwaite approximation.

Results

Talker identification

Talker identification was operationalized as participants’ accuracy on each trial of the test phase of each condition. These scores were submitted to a generalized linear mixed model for binomial data. Fixed factors in the model included language (English, Mandarin), subtitles (no subtitles, with subtitles), and their interaction. The model’s random effects structure included by-participant slopes for both fixed-effects terms and their interaction and correlated by-participant intercepts, as well as by-item intercepts for the nested random factors of talker and sentence. The contrast structure specified for the model included deviation coding for both fixed factors. Significance of fitted model terms was assessed using a Type-III ANOVA with Wald chi-square tests. Significant effects were followed by testing the relevant contrast of model terms to ascertain direction and effect size. Participants’ talker identification accuracy in each condition is summarized in Table 2 and illustrated in Fig. 2.
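Using the packages listed under Data analysis, the model just described might be specified as in the sketch below. The data frame `d` (one row per test trial) and its column names are illustrative assumptions, and the item intercepts are shown crossed for simplicity rather than as the authors’ exact nesting.

```r
# Sketch of the Experiment 1 model (illustrative names, not the authors' code).
library(lme4)
library(car)

contrasts(d$language)  <- contr.sum(2) / 2  # deviation coding: English vs. Mandarin
contrasts(d$subtitles) <- contr.sum(2) / 2  # deviation coding: with vs. without

m1 <- glmer(
  correct ~ language * subtitles +
    (1 + language * subtitles | participant) +  # by-participant slopes + intercepts
    (1 | talker) + (1 | sentence),              # by-item intercepts
  data = d, family = binomial
)

Anova(m1, type = "III")  # Type-III ANOVA with Wald chi-square tests
```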

Table 2 Talker identification accuracy by condition in Experiment 1
Fig. 2 Talker identification accuracy across conditions in Experiment 1. Talker identification accuracy was significantly and consistently better in listeners’ native language (English) than the foreign language (Mandarin) in both subtitles conditions, with no difference in the magnitude of the language-familiarity effect resulting from the subtitles manipulation. Legend: Points show accuracy for each participant in each condition; lines connect pairs of points obtained from the same participant. Boxplots show the median (dark line), middle 50% (shaded region) and range (whiskers) of the distribution in each condition. The dashed horizontal line indicates chance (20%)

The ANOVA on the linear mixed effects model revealed a significant main effect of language (χ2(1) = 36.16, p ≪ 0.0001), with the corresponding contrast on the linear model revealing significantly better performance in English than in Mandarin (β = 0.99, SE = 0.16, z = 6.01, p ≪ 0.0001). The main effect of subtitles was not significant (χ2(1) = 0.06, p = 0.81), nor was the language × subtitles interaction (χ2(1) = 0.57, p = 0.45). These results indicate that listeners exhibited the classic language-familiarity effect both with and without subtitles, but that the presence of subtitles did not affect listeners’ talker identification accuracy in either language.

Sentence transcriptions

Because the subtitle manipulation had no effect on listeners’ accuracy in either English or Mandarin, nor any effect on the magnitude of the language-familiarity effect, it was critical to also examine whether the subtitles manipulation was effective at eliciting the intended English interpretation of the Mandarin speech. To demonstrate whether listeners actually heard English speech when listening to the English-Mandarin hybrid sentences, and whether listeners’ propensity to hear the speech as English differed depending on whether it had been paired with subtitles, we measured how many English words listeners used during transcription of those sentences in each Mandarin condition.

The four dependent measures of transcription accuracy described in Table 3 were analyzed in separate linear mixed models. (These were generalized linear mixed-effects models for binomial data for the measures of (1) whether the target sentence was transcribed exactly as intended, (2) whether it was transcribed with any English words, and (3) whether it was transcribed with only English words.) The fixed factor in all models was condition (no subtitles, with subtitles). The models’ random effects structures included by-participant slopes for the fixed-effect term and correlated by-participant intercepts, as well as intercepts for the random factors of talker and sentence. The contrast structure specified for the model included deviation coding for the fixed factor. The effect of condition was assessed by testing the contrast of that model term to ascertain significance, direction, and effect size.
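As a sketch, the binomial and continuous measures might be modeled as follows; the data frame `tr` and its column names are illustrative assumptions.

```r
# Sketch of the transcription models. Binomial measures use glmer; the
# continuous proportion-of-words measure uses lmerTest::lmer, which yields
# the t statistics reported below. Names are illustrative.
library(lme4)
library(lmerTest)

contrasts(tr$condition) <- contr.sum(2) / 2  # deviation coding

m_only_english <- glmer(
  only_english ~ condition +
    (1 + condition | participant) +  # by-participant slope and intercept
    (1 | talker) + (1 | sentence),
  data = tr, family = binomial
)

m_prop <- lmer(
  prop_correct ~ condition +
    (1 + condition | participant) +
    (1 | talker) + (1 | sentence),
  data = tr
)
```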

Table 3 Use of English in transcription of English-Mandarin hybrid sentences (M ± s)

When listeners had learned Mandarin talkers with accompanying subtitles, they were significantly more likely to provide transcriptions of that speech that were comprised of, at least in part, English words (β = 5.67, SE = 2.18, z = 2.60, p < 0.01), and were significantly more likely to provide transcriptions that were comprised of only English words (β = 6.99, SE = 1.47, z = 4.76, p ≪ 0.0001) than the condition without subtitles. Furthermore, listeners who had heard the speech with subtitles were also significantly more likely to hear words from the intended English gloss (β = 0.29, SE = 0.033, t = 8.73, p ≪ 0.0001) and more likely to give exactly the English gloss intended for each sentence (β = 6.63, SE = 3.41, z = 1.95, p = 0.051).

Representative examples of listeners’ transcriptions of Mandarin sentences that had been presented with/without subtitles are provided in Table 4. Qualitative reports by participants after the experiment also indicated that they reliably believed they were listening to heavily Mandarin-accented English during the Mandarin-with-subtitles condition, and to actual Mandarin speech during the Mandarin-without-subtitles condition.

Table 4 Example transcriptions of English-Mandarin hybrid sentences

Discussion

Several prior studies have shown that the presence of familiar words in speech facilitates talker identification (e.g., Bricker & Pruzansky, 1966; Goggin et al., 1991; McLaughlin et al., 2015; Perrachione et al., 2015; Pollack, Pickett, & Sumby, 1954; Xie & Myers, 2015b). Numerous other studies have shown that lexical expectations, including those imparted via priming, are effective at inducing lexical percepts, even from highly distorted speech (e.g., Davis et al., 2005; Ganong, 1980; Holdgraf et al., 2016; Sohoglu & Davis, 2016). Correspondingly, we had hypothesized that, by providing English subtitles during talker identification training in Mandarin, listeners’ expectations about the speech would allow them to parse the speech stream into native-language lexical representations, tap into the processes that facilitate talker identification from familiar words, and thereby improve their ability to learn to identify Mandarin-speaking voices compared to when no subtitles were present.

As in previous studies, listeners demonstrated the language-familiarity effect, in that they were better able to learn to identify talkers in their native language than in a foreign language. Additionally, the subtitle manipulation appeared to be effective at inducing listeners to perceive English words from the Mandarin speech. Listeners reported hearing Mandarin-accented English in the Mandarin-with-subtitles condition. They also demonstrated a significant proclivity to use English to transcribe the English-Mandarin hybrid sentences from the subtitles condition, but not when they believed they were hearing Mandarin.

However, the subtitles manipulation had no effect on listeners’ ability to learn to identify voices. Listeners did not perform any better in the foreign-language condition when they had the perceptual experience, based on top-down expectations, that the speech they were hearing contained familiar words. This result suggests that, contrary to our hypothesis, familiar words do not afford listeners additional information about, or the ability to form richer memories of, talker identity in the absence of familiar sound patterns.

However, before committing to the theoretical conclusion that familiar words only facilitate talker identification in the presence of familiar sounds (i.e., when listening to native speech), some methodological considerations warrant further exploration. First, results from previous studies have indicated that more extensive training may be necessary for listeners to take advantage of language-specific representations during talker identification in a less-familiar language. When bilingual listeners learn to identify talkers in their native and second languages, they exhibit the language-familiarity effect in favor of their native language on the first day of training, but the magnitude of this difference attenuates and eventually disappears after additional days of training (Perrachione & Wong, 2007). In this way, providing additional days of training with subtitles to prime lexical expectations during foreign-language talker identification may allow listeners to overcome their unfamiliarity with the sound structure and take advantage of the additional linguistic information source. Second, in Experiment 1, participants were tested only on trained sentences. Many other studies of talker identification have also tested untrained speech stimuli to assess generalization of talker identity knowledge. It may be the case that the information sources made available during lexical priming are differentially beneficial for recognizing talker identity from trained stimuli versus generalizing talker identification to novel stimuli – a condition where accuracy typically decreases (McLaughlin et al., 2015; Orena et al., 2015; Perrachione & Wong, 2007). We assessed these questions in Experiment 2.

Experiment 2: Multi-day training of foreign-language talker identification with lexical priming

To test whether accessing familiar lexical representations can confer a benefit during talker identification in a foreign language after additional training, we repeated Experiment 1 in a new group of participants, implementing two key changes. First, participants in Experiment 2 underwent 3 days of talker identity training, as opposed to a single session, to give them additional opportunity to learn to use word-level representations for foreign-language talker identification. Additional training has been shown to attenuate the language-familiarity effect in bilinguals, but not in monolinguals, suggesting that additional experience may be necessary to gain advantage from linguistic representations when speech is less familiar (Perrachione & Wong, 2007). In this study, we hypothesized that monolingual English speakers may require additional exposure to lexically primed hybrid speech in order to make use of the lexical representations, analogous to bilingual listeners’ attenuation of the language-familiarity effect with further training. Second, we included untrained generalization sentences during the test phase, to assess whether lexical access during talker identity learning would confer any differential benefit to familiar versus unfamiliar speech content in the foreign-language condition. The repetition of speech content at test has been shown to have a more beneficial effect in a native language than in an unknown foreign language (McLaughlin et al., 2015).

Methods

Participants

A new sample of native speakers of American-English completed this study (N = 18, age 18–27 years, M = 20.5, 14 female). Inclusion and exclusion criteria were the same as Experiment 1, with the additional requirement that participants in Experiment 2 perform with greater than chance accuracy in all conditions on all days. This additional inclusionary criterion was added in order to limit the analysis to participants who were able to successfully learn the voices. Four additional participants completed the study but were excluded due to failure to meet the accuracy criterion. (Ultimately, exclusion or inclusion of these participants did not affect the outcomes of this experiment.) This study was approved and overseen by the Institutional Review Board at Boston University. Participants provided written informed consent and received monetary compensation for their participation. Participants in Experiment 2 did not participate in Experiment 1.

The sample size was determined by the counterbalance of experimental conditions, and it is in line with prior studies of the role of language in talker identification. Data from the one prior study of cross-language talker identification training across multiple days (Perrachione & Wong, 2007) suggest a condition-by-session interaction effect on the order of ηp2 = 0.336. Correspondingly, with N = 18 we have 85% power to detect a similar effect, and 80% power to detect effect sizes of ηp2 ≥ 0.305.
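One rough way to approach such a calculation is to convert partial eta-squared to Cohen’s f²; the sketch below makes degrees-of-freedom assumptions of its own, ignores the repeated-measures structure, and therefore need not reproduce the exact figures above.

```r
# Rough sketch only: converting partial eta-squared to Cohen's f^2 for a
# power calculation with pwr.f2.test. The degrees of freedom are assumptions,
# and this approximation ignores the repeated-measures structure.
library(pwr)

eta2p <- 0.336
f2 <- eta2p / (1 - eta2p)   # Cohen's f^2, approximately 0.506

pwr.f2.test(u = 2, v = 2 * (18 - 1), f2 = f2, sig.level = 0.05)
```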

Stimuli

Participants learned to identify talkers from recordings of sentences in three conditions: English, Mandarin with subtitles to prime a target English gloss, and Mandarin without subtitles. In the English condition, listeners heard phonetically balanced sentences in English, drawn from a previous talker identification study (McLaughlin et al., 2015). In the Mandarin-with-subtitles condition, listeners heard the English-Mandarin hybrid sentences from Experiment 1. In the Mandarin condition, listeners heard sentences drawn from a set of phonetically balanced Mandarin sentences (Fu, Zhu, & Wang, 2011). Sentences from these corpora were selected because they were of similar length (six–eight syllables) and duration as the English-Mandarin hybrid sentences. Five native speakers of American English (age 20–29 years, M = 23.4) produced the recordings in the English condition. The same ten native speakers of Mandarin (age 19–27 years, M = 23) from Experiment 1 produced both the English-Mandarin hybrid sentences (in Mandarin) and the Mandarin sentences from Fu et al. (2011).

Procedure

Participants learned to identify talkers' voices across three sessions of training and testing on consecutive days. Participants learned a different group of voices in each of the three conditions: English (without subtitles), Mandarin-with-subtitles, and Mandarin (without subtitles). The structure of the training paradigm was identical to that in Experiment 1 (Fig. 1). The five sentences used during the familiarization and practice blocks were the same across all 3 days (within condition). During the test phase of each session, participants were asked to identify talkers from both the sentences that they had been hearing during training, as well as five new sentences each day that they had not heard either during training or during a prior testing session. The new sentences were included to assess how well the participants' knowledge of the talkers' voices generalized to untrained sentences, and whether this differed across conditions.

Participants completed all conditions of the experiment in every session. The order of conditions was counterbalanced across participants, but kept the same for each participant across days. The talkers learned in each of the two Mandarin conditions were the same five-talker groupings as in Experiment 1, and were counterbalanced across participants to control for potential item-specific learning differences.

Results

Learning in each language condition

As in Experiment 1, talker identification was operationalized as participants’ accuracy on each trial of the test phases of each condition and day. These scores were submitted to a generalized linear mixed model for binomial data. Fixed factors in the model included language condition (English, Mandarin-with-subtitles, Mandarin-without-subtitles), sentence exposure (trained, novel), and training day (1, 2, 3; as a categorical factor), as well as all two- and three-way interactions. The model’s random effects structure included by-participant slopes for all fixed-effects terms and correlated by-participant intercepts, as well as by-item intercepts for the nested random factors of talker and sentence. The contrast structure specified for the model included pairwise differences between levels of the condition factor (Mandarin vs. Mandarin-with-subtitles; Mandarin-with-subtitles vs. English) and of the day factor (1 vs. 2; 2 vs. 3), and deviation coding for the sentence exposure factor. Participants’ talker identification accuracy in each condition is summarized in Table 5 and illustrated in Fig. 3.
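The pairwise (successive-difference) contrasts described here are available via MASS::contr.sdif; the sketch below shows one plausible specification, assuming factor levels ordered Mandarin < Mandarin-with-subtitles < English and day 1 < 2 < 3 (data frame `d2` and its columns are illustrative).

```r
# Sketch of the Experiment 2 model (illustrative names, not the authors' code).
library(lme4)
library(car)
library(MASS)

contrasts(d2$condition) <- contr.sdif(3)     # Man. vs. Man.+sub.; Man.+sub. vs. Eng.
contrasts(d2$day)       <- contr.sdif(3)     # day 2 vs. 1; day 3 vs. 2
contrasts(d2$exposure)  <- contr.sum(2) / 2  # trained vs. novel, deviation coded

m2 <- glmer(
  correct ~ condition * exposure * day +
    (1 + condition + exposure + day | participant) +  # by-participant slopes
    (1 | talker) + (1 | sentence),                    # by-item intercepts
  data = d2, family = binomial
)

Anova(m2, type = "III")
```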

Table 5 Talker identification accuracy by condition in Experiment 2
Fig. 3 Talker identification accuracy across conditions and training days in Experiment 2. (A) The overall pattern of talker identification accuracy across conditions held constant across all 3 days of training: better accuracy in English, and lower, but on average equal, accuracy in the two Mandarin conditions. On Day 2, the amount of improvement in English exceeded that of Mandarin with subtitles, but the two Mandarin conditions did not differ, and the condition differences remained the same on Day 3. Boxplots show the median (dark line), middle 50% (shaded area) and range (whiskers); lines show change in mean accuracy across days by condition. (B) Individual patterns of learning. Points indicate accuracy for each participant, with lines connecting points from the same participant. (C) Individual patterns of the language-familiarity effect (talker identification accuracy in English minus accuracy in either Mandarin condition). There was no difference in the Mandarin conditions across days (conventions as in Panel B)

The ANOVA on the linear mixed effects model revealed a significant main effect of language condition (χ2(2) = 67.96, p ≪ 0.0001). The corresponding contrasts on the linear model revealed no significant difference between performance on Mandarin-with-subtitles and Mandarin alone (β = 0.17, SE = 0.17, z = 0.98, p = 0.33), but significantly better performance on English than Mandarin-with-subtitles (β = 2.05, SE = 0.26, z = 7.79, p ≪ 0.0001). The main effect of sentence exposure was significant (χ2(1) = 17.26, p ≪ 0.0001), with better performance on trained than novel sentences (β = 0.18, SE = 0.044, z = 4.15, p ≪ 0.0001). The main effect of day was also significant (χ2(2) = 47.22, p ≪ 0.0001), with overall performance on day 2 significantly better than day 1 (β = 0.33, SE = 0.08, z = 4.43, p ≪ 0.0001), and with performance on day 3 significantly better than day 2 (β = 0.22, SE = 0.087, z = 2.53, p < 0.02).

There was a significant language condition × day interaction (χ2(4) = 18.77, p < 0.0009), such that the magnitude of the difference between English and Mandarin-with-subtitles was larger on day 2 than on day 1 (β = 0.46, SE = 0.19, z = 2.45, p < 0.015), but did not differ between days 2 and 3 (β = 0.036, SE = 0.20, z = 0.18, p = 0.86); however, the magnitude of the difference between Mandarin-with-subtitles and Mandarin did not differ between either days 1 and 2 (β = 0.23, SE = 0.16, z = 1.45, p = 0.15) or days 2 and 3 (β = -0.016, SE = 0.16, z = -0.10, p = 0.92).

The language condition × sentence exposure (χ2(2) = 1.10, p = 0.58), sentence exposure × day (χ2(2) = 0.83, p = 0.66), and three-way (χ2(4) = 2.17, p = 0.70) interactions were all not significant.

Discussion

Effects of language familiarity and lexical representations

In Experiment 1, the subtitles manipulation failed to improve talker identification accuracy in a foreign language. In Experiment 2, we investigated whether this failure could be overcome with additional training – that is, whether participants required additional exposure to an unfamiliar class of voices to be able to take advantage of linguistic representations to enhance their ability to distinguish and remember those voices (cf. Perrachione & Wong, 2007). We also investigated whether the effect of lexical priming via subtitles might emerge in a more subtle manipulation of trained versus untrained sentence content (cf. McLaughlin et al., 2015).

As in Experiment 1, participants did not perform better in a foreign language condition when given the opportunity to map a foreign-language speech stream onto known words, even when provided with additional opportunity to learn to do so. Participants reliably demonstrated the expected language-familiarity effect across conditions and days (Perrachione, 2018), and the magnitude of the language-familiarity effect did not attenuate with additional training on the foreign language voices, consistent with the one prior multi-day training study in this literature (Perrachione & Wong, 2007).

Correspondence to real-world challenges in talker identification

Although the pop culture phenomenon of “mondegreens” is widely known (Liberman, 2007), and although we observed that our subtitles manipulation was largely successful in imposing English lexical structure onto foreign language speech (Table 3), this manipulation represents perhaps a rather extreme degree of mismatch between speech phonetics and intended words, particularly given the phonological dissimilarity between Mandarin and English. That listeners were unable to use lexical access to guide talker identification from foreign-language speech raises the question of whether less extreme differences in phonological encoding of known words might also pose a barrier to successfully learning talker identity.

Less extreme than experimentally imposing the sound structure of one language onto the words of another is the similar, and more ecological, case of lexical-phonetic mismatch that occurs when native speakers of one language learn to speak another. Vestiges of speakers' native language persist when speaking in a second language, and these foreign accents come in varying degrees depending on how well language learners acquire the phonology of their second language (e.g., Porretta, Kyröläinen, & Tucker, 2016). Moreover, psycholinguistic research has shown that lexical activation varies as a function of speaker accentedness (Porretta, Tucker, & Järvikivi, 2015).

Although listeners were unable to gain access to additional talker-related information by parsing a speech stream comprised of wholly foreign sound structure into known native-language words in Experiments 1 and 2, we wondered how the degree of divergence between known words and familiar sound patterns would affect talker identification. Correspondingly, we conducted a third experiment, capitalizing on natural variation in second-language speech proficiency to investigate whether the degree of lexical-phonological mismatch (“accentedness”) of talkers’ speech affects listeners’ talker identification accuracy.

Experiment 3: Identifying talkers with foreign accents

In this experiment, we investigated whether natural variation in the mismatch between word forms and expected phonetic structure affected listeners’ ability to learn to identify talkers by voice. Specifically, we trained native English-speaking listeners to learn to identify voices in four conditions that parametrically and ecologically varied the extent to which known words were produced with familiar phonological structure: (1) an English condition, with native English-speaking talkers with familiar American accents, (2) a low-accentedness condition, with native Mandarin speakers producing English with a slight Mandarin accent, (3) a high-accentedness condition, with native Mandarin speakers producing English with a stronger Mandarin accent, and (4) a Mandarin condition, with native Mandarin speakers producing Mandarin speech.

The pattern of results in listeners’ ability to learn and identify talkers speaking with a foreign accent relative to either native-accented speech or wholly foreign speech will provide further insight into the role of lexical vs. phonetic familiarity in talker identification. If word-level representations play a role in talker identification independent from familiar phonetics, then understanding talkers’ speech should play a facilitatory role in talker identification, even if those words are encoded via less-familiar (i.e., foreign-accented) phonetics. Furthermore, so long as the speech is comprehensible, talker identification accuracy should remain high, even as the degree of accent increases. However, if familiar phonetics serves as a “gatekeeper” to the facilitatory role of familiar words in talker identification, then foreign accents should be detrimental to talker identification accuracy, even if speakers are highly intelligible.

Methods

Participants

Native speakers of American-English completed this study (N = 24, 19 female, five male; age 19–28 years, M = 21.8). The inclusion and exclusion criteria were the same as in Experiment 1. This study was approved and overseen by the Institutional Review Board at Boston University. Participants provided written informed consent and received monetary compensation for their participation. Participants in Experiment 3 did not participate in Experiment 1 or 2.

The sample size was determined by the counterbalance of experimental conditions, and is in line with prior studies of the role of language and accent in talker identification. Based on our previous study of the effects of familiar dialects on talker identification (Perrachione, Chiao, & Wong, 2010), we can expect accent to affect talker identification abilities on the order of d = 0.56. Correspondingly, with N = 24, we have 84% power to detect an effect size of this magnitude, and 80% power to detect effect sizes of d ≥ 0.52.

Stimuli

Participants learned to identify talkers in four different conditions: (1) English spoken by native speakers with American accents, (2) low-accent English speech spoken by native Mandarin speakers judged to have the weakest Mandarin accents, (3) high-accent English speech spoken by native Mandarin speakers judged to have the strongest Mandarin accents, and (4) Mandarin speech spoken by native Mandarin speakers. In the native-English and accented-English conditions, listeners heard sentences selected from Lists 2, 13, and 22 of the Harvard sentences (IEEE, 1969) spoken by native English-speaking talkers with an American accent (N = 5, age 20–29 years, M = 23.4), native Mandarin speakers whose English had a light Mandarin accent (N = 5, age 20–26 years, M = 22.8), and native Mandarin speakers whose English had a heavier Mandarin accent (N = 5, age 21–27 years, M = 23.6). In the Mandarin condition, listeners heard phonetically balanced sentences in Mandarin (Fu, Zhu, & Wang, 2011) produced by native Mandarin speakers (N = 5, age 19–24 years, M = 21.6). As in Experiments 1 and 2, all talkers were female.

Accentedness ratings of stimuli

Stimuli for the Mandarin and two Mandarin-accented English conditions were selected from recordings made by 21 Mandarin-English bilingual speakers. We recruited a separate sample of native American-English listeners (N = 12), drawn from the same population as the subsequent talker identification experiment, to rate the degree of accentedness of each talker using principles of comparative judgment (Thurstone, 1927). On each trial, listeners heard recordings of the same English sentence spoken by two native Mandarin speakers. Listeners indicated via button press which of the two talkers they believed had a stronger accent. All possible combinations of talker pairs in both orders were presented, for a total of 210 trials per listener. The proportion of “more-accented” ratings was calculated for each talker in a pair, converted to a z-score, and averaged across listeners. This procedure produced a talker-accentedness rating that reflected not only the rank order of accentedness across talkers, but also its degree (Meltzner & Hillman, 2005; Perrachione et al., 2014). Recordings from the five speakers with the lowest z-scores (i.e., those least likely to be selected as the “more accented” talker) were used in the “low-accent” talker identification condition; recordings from the five speakers with the highest z-scores were used in the “high-accent” condition (Fig. 4A). From among the remaining eleven talkers who were not rated as the most or least accented, five were selected at random to be speakers in the Mandarin condition. (Example audio recordings from the low- and high-accented talkers are available as Supplementary Materials.)
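A sketch of this scoring procedure in R is given below. The `judgments` data frame is simulated stand-in data (one row per trial), the inverse-normal (probit) transform is our reading of “converted to a z-score” in Thurstone’s framework, and the clamping of extreme proportions is our own addition to avoid infinite scores.

```r
# Sketch of the comparative-judgment scoring with simulated stand-in data.
library(dplyr)

set.seed(1)
judgments <- expand.grid(listener = 1:12, talker = 1:21, trial = 1:20) %>%
  mutate(chosen = runif(n()) < talker / 22)  # toy: higher index, more accented

accent_scores <- judgments %>%
  group_by(listener, talker) %>%
  summarise(p_more = mean(chosen), .groups = "drop") %>%  # per-listener proportion
  mutate(z = qnorm(pmin(pmax(p_more, 0.01), 0.99))) %>%   # probit transform, clamped
  group_by(talker) %>%
  summarise(accentedness = mean(z)) %>%                   # average z across listeners
  arrange(accentedness)

head(accent_scores, 5)  # five lowest scores -> the "low-accent" condition
```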

Fig. 4

Results of Experiment 3. (A) Listeners’ judgments of accentedness for Mandarin-English bilinguals speaking in English. Talkers are ranked by mean accentedness judgment from least to most accented. The talkers selected for use in the low-accent English, high-accent English, and Mandarin conditions are indicated. (B) Talker identification accuracy across accent conditions. Plotting conventions are as in Fig. 2. On average, talker identification accuracy decreased monotonically with increasing divergence from the sound structure familiar to listeners. (C) Talker identification accuracy for each talker in each accent condition. Large points show mean identification accuracy for that talker (± SEM) across participants. Smaller points, adjusted along the abscissa to avoid overlap, show accuracy for individual participants.

Procedure

Talker identification training and testing

Participants completed all four accent conditions of this experiment in a single session, with condition order counterbalanced across participants. Sentences and talkers were not repeated within or between experimental conditions. Training and testing in each condition followed the same procedure as Experiment 1 (Fig. 1A–D). Participants were told they would be identifying talkers who spoke in English, in English with a Chinese accent, or in Chinese. No subtitles were provided in any condition of Experiment 3. The sentences used in the three conditions featuring English speech (native, low-accent, and high-accent) were counterbalanced across participants, and three different sets of Mandarin sentences were used to match the degree of item-level variance in the Mandarin condition.
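Although the counterbalancing scheme is not spelled out, the sample size of 24 is consistent with full permutation counterbalancing of the four conditions (4! = 24 unique orders, one per participant). A sketch with hypothetical condition labels:

```python
# Full-permutation counterbalancing: four conditions yield 4! = 24
# unique presentation orders, one per participant (labels hypothetical).
from itertools import permutations

conditions = ("english", "low_accent", "high_accent", "mandarin")
orders = list(permutations(conditions))
assert len(orders) == 24
for participant, order in enumerate(orders, start=1):
    print(f"P{participant:02d}: " + " -> ".join(order))
```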

Transcription of speech in English conditions

To ascertain how the intelligibility of English speech was affected by the accentedness of the talkers, a separate sample of participants (N = 6) completed a sentence transcription task after the talker identification test in each English condition. The structure of this task was identical to that of the transcription task in Experiment 1.

Transcriptions of the sentences spoken in English were scored on two dimensions: (1) whether the transcription exactly matched the target English sentence, and (2) the proportion of words from the target English sentence that were transcribed as intended. Sentence transcriptions were assessed conservatively, as in Experiment 1.
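A minimal sketch of this two-dimensional scoring, under the assumptions that case and punctuation are ignored and that words are matched without regard to position (the text does not specify these details):

```python
# Score one transcription on the two reported dimensions: exact sentence
# match and proportion of target words transcribed as intended. The
# normalization and order-insensitive word matching are simplifications.
import re

def score_transcription(response, target):
    normalize = lambda s: re.sub(r"[^a-z' ]", "", s.lower()).split()
    resp_words, targ_words = normalize(response), normalize(target)
    exact = resp_words == targ_words
    n_matched = sum(1 for w in targ_words if w in resp_words)
    return exact, n_matched / len(targ_words)

# Example from the Results: two lexical-neighbor replacements in seven words
print(score_transcription("the latch house had hot water caps",
                          "The large house had hot water taps"))
# -> (False, 0.714...)
```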

Results

Sentence intelligibility

Transcription accuracy for the accented sentences was overall extremely high, particularly at the word level (Table 6), but did decrease as a function of talkers’ accentedness. The vast majority of deviations from the canonical transcription consisted of lexical-neighbor replacements (e.g., “The large house had hot water taps” transcribed as “the latch house had hot water caps”), the addition of a word not present in the canonical version, or, very rarely, attempts to represent the accent phonetically (e.g., “da large house had haut water taps”). There were no attempts at wholly (or mainly) phonetic transcription; nearly all sentences were transcribed consistently with the speech content intended by the talker (i.e., the canonical Harvard sentence).

Table 6 Transcription of sentences in Experiment 3 (x̄ ± s)

The two dependent measures of transcription accuracy described above were analyzed with linear mixed-effects models. For the measure of whether the target sentence was transcribed exactly as intended, a generalized linear mixed-effects model for binomial data was used. The fixed factor in both models was condition (English, low-accent, high-accent). The models’ random-effects structures included by-participant slopes for the fixed-effect term with correlated by-participant intercepts, as well as by-item intercepts for the random factors of talker and sentence. The contrast structure specified for the models comprised pairwise differences between levels of the fixed factor (English vs. low-accent; low-accent vs. high-accent). The effect of condition was assessed by testing these contrasts to ascertain significance, direction, and effect size.
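The pairwise-difference contrast described here corresponds to what patsy (and R) call backward difference coding; a short sketch with hypothetical level labels:

```python
# Backward-difference coding for a three-level factor: each fitted
# coefficient estimates the difference between adjacent levels
# (english vs. low-accent; low-accent vs. high-accent).
from patsy.contrasts import Diff

levels = ["english", "low_accent", "high_accent"]
print(Diff().code_without_intercept(levels).matrix)
# [[-0.6667 -0.3333]
#  [ 0.3333 -0.3333]
#  [ 0.3333  0.6667]]
```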

Participants provided more accurate transcriptions of native English speakers’ recordings than those of low-accented talkers (whole sentence: β = 2.75, SE = 0.66, z = 4.17, p ≪ 0.0001; proportion of words: β = 0.037, SE = 0.015, t = 2.44, p < 0.032), and provided more accurate transcriptions of low-accented talkers than high-accented ones (whole sentence: β = 1.11, SE = 0.21, z = 5.24, p ≪ 0.0001; proportion of words: β = 0.047, SE = 0.014, t = 3.26, p < 0.009).

Talker identification

The dependent measure was participants’ accuracy on each trial during the test phases of each condition. These scores were submitted to a generalized linear mixed model for binomial data, as in Experiments 1 and 2. Fixed factors in the model included condition (English, low Mandarin-accented English, high Mandarin-accented English, and Mandarin speech), sentence exposure (trained and novel), and their interaction. The model’s random-effects structure included by-participant slopes for the fixed-effects terms and their interaction with correlated by-participant intercepts, as well as intercepts for the nested random factors of talker and sentence. The contrast structure specified for the model included pairwise differences between levels of the condition factor (Mandarin vs. high-accent; high-accent vs. low-accent; low-accent vs. English) and deviation coding for the sentence exposure factor. Significance of fitted model terms was assessed using a Type-III ANOVA with Wald chi-square tests. Significant effects were followed by testing the relevant contrasts of model terms. Participants’ mean talker identification accuracy in each of the conditions is shown in Fig. 4B and Table 7.
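The deviation coding mentioned here corresponds to sum-to-zero contrasts, in which the coefficient for the two-level sentence-exposure factor estimates each level’s departure from the grand mean; a sketch using patsy with hypothetical level labels:

```python
# Deviation (sum-to-zero) coding for the two-level sentence-exposure
# factor: trained = +1, novel = -1, so the fitted coefficient reflects
# the trained/novel difference relative to the grand mean.
from patsy.contrasts import Sum

print(Sum().code_without_intercept(["trained", "novel"]).matrix)
# [[ 1.]
#  [-1.]]
```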

Table 7 Talker identification accuracy by condition in Experiment 3

The ANOVA revealed a significant main effect of condition (χ2(3) = 24.20, p ≪ 0.0001); the corresponding contrasts on the linear model revealed significantly better performance in the native English than the low-accent condition (β = 1.15, SE = 0.38, z = 3.01, p < 0.003); however, performance did not differ between the low- and high-accent conditions (β = 0.35, SE = 0.37, z = 0.96, p = 0.34) or between the high-accent and Mandarin conditions (β = 0.29, SE = 0.36, z = 0.81, p = 0.42). The ANOVA also revealed a significant main effect of sentence exposure (χ2(1) = 38.31, p ≪ 0.0001), with the corresponding contrast on the linear model indicating that performance was overall higher for trained than novel sentences (β = 0.24, SE = 0.04, z = 6.19, p ≪ 0.0001). The condition × sentence exposure interaction was not significant (χ2(3) = 1.69, p = 0.64).

Visual inspection of the data in Fig. 4B and Table 7, however, suggests a pattern of monotonically decreasing talker identification accuracy as a function of increasing deviation from native English-accented speech. A more granular investigation of the data, at the level of participants’ identification accuracy for each individual talker, reveals that talker-level performance is a source of substantial variance within each level of the fixed factor condition (Fig. 4C). While modeling talker as a random effect is intended to account for this sort of variance, Experiment 3 differs from Experiments 1 and 2 in that recordings from only five talkers were available per condition (cf. n = 10 in the earlier experiments). Because fitting a random effect assumes that the levels of that factor are sufficiently well sampled to obtain estimates of population-level variance, models (and data) with under-sampled random factors may overestimate the variance related to that factor. When random factors are nested within fixed factors, as with talkers of varying accentedness here, this may result in Type II error in estimating the fixed effect.
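A toy simulation illustrates the estimation problem under assumed values (a true between-talker SD of 0.5 on the model’s latent scale): with only five talkers per condition, the estimate of between-talker variance is itself highly variable.

```python
# How well can between-talker variance be estimated from n talkers?
# Toy simulation under an assumed true between-talker SD of 0.5; not a
# re-analysis of the present data.
import numpy as np

rng = np.random.default_rng(0)
TRUE_SD = 0.5

for n_talkers in (5, 10, 20):
    estimates = np.array([
        np.var(rng.normal(0.0, TRUE_SD, size=n_talkers), ddof=1)
        for _ in range(5000)
    ])
    print(f"{n_talkers:2d} talkers: estimated variance = "
          f"{estimates.mean():.3f} +/- {estimates.std():.3f} "
          f"(true = {TRUE_SD**2:.3f})")
```

The spread of the variance estimate shrinks only as the number of sampled talkers grows, which is why five talkers per condition leaves the talker-level variance component poorly constrained.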

Thus, with the caveat that our dataset provided too few talkers within each level of the fixed factor, we re-ran the model without by-talker random intercepts. The ANOVA on this model again revealed a significant effect of condition (χ2(3) = 113.64, p ≪ 0.0001). The corresponding contrasts on the linear model revealed significantly better performance in the native English than the low-accent condition (β = 1.07, SE = 0.15, z = 7.15, p ≪ 0.0001), in the low- than the high-accent condition (β = 0.32, SE = 0.12, z = 2.69, p < 0.008), and in the high-accent than the Mandarin condition (β = 0.27, SE = 0.11, z = 2.42, p < 0.02). (Note that the effect sizes are similar to those of the previous model, but the error terms are reduced.) The ANOVA again revealed the significant main effect of sentence exposure (χ2(1) = 32.79, p ≪ 0.0001), with the linear model contrast showing higher performance for trained versus novel sentences (β = 0.22, SE = 0.04, z = 5.73, p ≪ 0.0001). The condition × sentence exposure interaction was again not significant (χ2(3) = 2.00, p = 0.57).

Discussion

Participants again reliably demonstrated the language-familiarity effect, with better talker identification in their native language (English) than in the foreign one (Mandarin). However, listeners’ ability to accurately identify talkers speaking English diminished as a function of the degree of foreign accent expressed by the talkers, from none to slight to strong. Depending on how these data are modeled, increasing accentedness either resulted in a monotonic decrease in talker identification accuracy, or in a decrement in accuracy not significantly different from identifying talkers from foreign-language speech. This result parallels other studies showing reduced talker identification accuracy for speakers of different regional or social dialects (Kerstholt, Jansen, van Amelsvoort, & Broeders, 2006; Perrachione, Chiao, & Wong, 2010; Stevenage, Clarke, & MacNeill, 2012; also cf. Johnson, Bruggeman, & Cutler, 2018), as well as those showing an overall detrimental effect of foreign accent on talker identification accuracy (Doty, 1998; Goggin et al., 1991; Thompson, 1987).

Experiment 3 revealed a number of new observations about the roles that lexical vs. phonetic familiarity play in talker identification. While previous studies have shown that listeners identify talkers less accurately when those talkers speak with an unfamiliar accent, none had investigated whether the degree of accentedness has a corresponding, monotonic effect on the extent to which accuracy is reduced. In this experiment, listeners tended to identify more-accented voices less accurately than less-accented voices, paralleling the observation that stronger accents incur greater interference during linguistic processing of speech (e.g., Porretta, Tucker, & Järvikivi, 2015). This also parallels our observation that intelligibility (measured by sentence transcription accuracy) decreased with each level of accentedness. There is, however, an interesting difference in how these decrements pattern together. Whereas the proportion of words identified correctly decreased from 99% for the American English accent to 96% for the low Mandarin-accented speech to 91% for the high Mandarin-accented speech, the decrement in talker identification accuracy was much more extreme: from 75% to 54% to 47%, respectively. The decrement in talker identification accuracy for the low-accented voices was particularly remarkable (a reduction of 21 percentage points) given the near-ceiling performance in recognizing these talkers’ speech (only about one in 20 words was heard differently than intended). This suggests that speech intelligibility alone is not the primary driver of listeners’ ability to learn to identify talkers by voice. Instead, the pattern of accuracy decrements appears to correspond more closely to the pattern of exact transcriptions (Table 6), further bolstering the observation that, as speech becomes increasingly distorted from listeners’ phonetic expectations, talker identification accuracy falls.

It was additionally surprising that listeners’ ability to identify talkers from even highly intelligible speech with a stronger Mandarin accent was only modestly better than their ability to identify talkers speaking an entirely foreign language (47.5% vs. 41%, respectively). Moreover, the decrement in accuracy between the native accent and the low foreign accent was much greater than has been observed for voices expressing unfamiliar social and regional dialects (e.g., Perrachione, Chiao, & Wong, 2010), despite the only slight accent expressed by these talkers.

A number of possible mechanisms may explain why the degree of accent should impose an increasing cost on talker identification, even when talkers are speaking in listeners' native language and are highly intelligible. First, as accent increases, so too does the divergence between expected and encountered sound patterns, and listeners may rely primarily on familiar sound patterns during talker identification (e.g., Fleming et al., 2014). Second, as foreign accent increases, the depth of linguistic processing may be reduced (e.g., Porretta, Tucker, & Järvikivi, 2015), which may decrease the extent to which that additional source of information is available to listeners for talker identification (e.g., Perrachione et al., 2015); however, the high intelligibility scores suggest this account is unlikely. Third, speech perception from an unfamiliar foreign accent is more effortful than from a native accent (Bradlow & Bent, 2008); even though listeners' task was to learn to identify talkers by the sound of their voice, processing the linguistic content of speech is automatic, and the speech perception system may prioritize allocation of cognitive resources to speech comprehension, leaving fewer resources available for learning talker identity (e.g., Antoniou & Wong, 2015; Bunge, Klingberg, Jacobsen, & Gabrieli, 2000; Heald & Nusbaum, 2014; Kleinschmidt & Jaeger, 2015). Adjudicating between these possible sources of interference will require nuanced experimental designs in future work, but at a summary level the empirical results from Experiment 3 converge unequivocally with those from Experiments 1 and 2: Unfamiliar speech sounds impose a cost on learning and identifying talkers by the sound of their voice that supersedes any perceptual or mnemonic benefit gained from hearing talkers say familiar words.

Finally, these data also provide a tangible example of the critical importance of item-level power in psycholinguistic studies. Classical analysis methods applied to these data (e.g., repeated-measures ANOVAs and paired t-tests that aggregate data by participant within conditions) reveal seemingly convincing significant differences between each level of accentedness. However, contemporary mixed models reveal that a large amount of this variation is actually due to differences in the identifiability of individual talkers within conditions. Although the number of talkers per condition in many studies of talker identification has been on the order of four or five (e.g., Bregman & Creel, 2014; Kadam et al., 2016; Orena et al., 2015; Perrachione et al., 2007, 2011; Perrachione, Pierrehumbert, & Wong, 2009; Zarate et al., 2015), the ambiguity of the present results, especially with respect to the theoretically important distinction between categorical and continuous effects of accentedness, suggests that future work on talker identification must abandon designs with low item-level power in favor of larger numbers of talkers per condition.

General discussion

In the first two of three experiments, we found that the magnitude of the language-familiarity effect was not reduced even when listeners could effectively parse a foreign-language speech stream into native-language lexical representations via priming with subtitles. In Experiment 1, listeners’ performance did not improve as a result of primed lexical representations in either the native- or foreign-language conditions during a single training session. Likewise, in Experiment 2, even though listeners were given multiple training sessions in which to learn to draw upon lexical representations as a way to improve their talker identification performance, we observed essentially the same pattern of results as in Experiment 1. Finally, in Experiment 3, we found that highly intelligible, ecologically accented voices were identified with decreasing accuracy as the degree of accent diverged from that of native speakers; that the decrement in talker identification accuracy was much greater than the corresponding decrement in intelligibility; and that the decrement from native-accented to foreign-accented talkers was much larger than that between foreign-accented talkers and actual foreign speech. Taken together, these results suggest that the facilitatory contribution of familiar words to talker identification depends on the availability of familiar sounds. Said another way, hearing familiar words in the absence of familiar sound patterns is not sufficient to improve talker identification in a manner consistent with the language-familiarity effect.

These results help refine our models of the cognitive and perceptual processes that underlie talker identification. There is considerable evidence that speech processing and talker identification are functionally integrated, but it has been unknown at what level of linguistic processing this interaction occurs. Some research has indicated the importance of acoustic-phonetic processing as a basis for improved native-language talker identification (Fleming et al., 2014; Johnson, Westrek, Nazzi, & Cutler, 2011; Orena, Theodore, & Polka, 2015; Zarate et al., 2015). Other research has provided similar evidence in support of a facilitatory role of lexical processing in talker identification (Bricker & Pruzansky, 1966; McLaughlin et al., 2015; Perrachione, Del Tufo, & Gabrieli, 2011; Perrachione et al., 2015; Perrachione & Wong, 2007; Pollack, Pickett, & Sumby, 1954). The present results add nuance to the role of lexical processing in a more complete model of talker identification: lexical processing appears to play a facilitatory role only in the presence of familiar acoustic-phonetic information. When familiar phonetic features are unavailable, listeners do not appear able to make use of lexical access to facilitate talker identification.

Ultimately, these results suggest that the cognitive processes involved in talker identification are supported by a hierarchy of perceptual cues, each of which is likely to depend on successful processing of the previous level. At the lowest level, listeners extract prelinguistic and relatively invariant information about a talker's voice such as fundamental frequency (f0) and f0 range, formant dispersion and vocal tract length, and voice quality (e.g., Latinus et al., 2013). Beyond global acoustic properties, listeners gain additional information from acoustic-phonetic features when such features are familiar due to long-term linguistic experience. Naturally, access to phonetic information depends on successful low-level processing and encoding of the auditory signal. Finally, listeners gain additional information about a talker's identity from processing higher-level linguistic information such as through lexical access and memories for words. However, the present experiments suggest that access to this level of information depends on successfully parsing and representing the prior (acoustic-phonetic) level. In all the previous talker identification experiments that have demonstrated beneficial effects of lexical access, lexical information was manipulated in the presence of familiar acoustic-phonetic and phonological structures (Bricker & Pruzansky, 1966; McLaughlin et al., 2015; Perrachione et al., 2015; Pollack, Pickett, & Sumby, 1954; Xie & Myers, 2015b; Zarate et al., 2015). Although these experiments showed, in various ways, that access to word-level representations can improve listeners' abilities to identify voices, they did not explore whether such facilitation depended on successful processing of a lower-level of information, namely the presence of familiar phonology.

An interesting question related to both these results and prior work showing linguistic facilitation of talker identification is whether linguistic representations – at any level – play a facilitatory role during learning of talker identity versus recognition of known talkers. That is, are listeners better able to learn talkers’ identities when they can encode them through the lens of familiar linguistic representations? Or is it that, when talkers’ speech contains familiar linguistic structure, listeners have greater access to the indexical features that underlie talker identification? The present results do not directly adjudicate between these two possibilities, but we may turn to other paradigms for some insight. We trained our listeners with a fixed number of trials in all conditions, and the resulting differences in learning outcomes may suggest that linguistic representations are important for learning talker identity. However, in studies that train listeners to criterion (e.g., Bregman & Creel, 2014; Orena et al., 2015), listeners still often exhibit language-based differences in subsequent talker identification tests. This suggests that linguistic representations play a facilitatory role in accessing individuating talker information from speech. Additionally, training on native-language talkers and subsequently testing on those talkers speaking a foreign language does not generalize as well as training on foreign-language talkers and then testing on them speaking listeners’ native language (Winters et al., 2008), further suggesting that the benefit of the language-familiarity effect comes more from the ability to recognize a talker from their speech than from the ability to learn their identity during training.

The present results also provide new insight into the literature on the influence of unfamiliar regional and social accents on talker identification, particularly identification of accented talkers in listeners' native language. Listeners have consistently been shown to perform worse at identifying talkers who speak with an unfamiliar social or regional accent of their native language (Doty, 1998; Goggin et al., 1991; Goldstein et al., 1981; Kerstholt, Jansen, van Amelsvoort, & Broeders, 2006; Perrachione, Chiao, & Wong, 2010; Stevenage, Clarke, & MacNeill, 2012; Thompson, 1987). In all these cases, listeners putatively had access to lexical information to some extent, since the linguistic content was familiar. However, while talker identification tends to be poorer in an unfamiliar accent than in a familiar one – likely due to less experience with the characteristic distributions of phonetic features in the unfamiliar accent – performance in an unfamiliar accent of a native language has consistently been much better across studies than in a foreign language, where both the linguistic and phonetic features are unfamiliar. It was thus unclear whether priming access to familiar words (in the absence of familiar phonology) would nonetheless improve talker identification over a fully foreign-language condition, even if listeners' performance did not reach the level of the native-language condition (since voices speaking accented L1 speech are still identified much better than L2 voices). A principal contribution of the present experiments, therefore, is to show that there is indeed a dependency relationship between familiar words and familiar sounds – the former are beneficial only in the presence of the latter, particularly when the latter are very unfamiliar.

A remaining limitation of these experiments is that they cannot distinguish between the deleterious effects of cognitive-resource allocation to accented-speech perception and those of speech-sound unfamiliarity on talker identification accuracy. It is possible that less accurate performance for accented speech or a foreign language results from limitations in the deployment of cognitive resources to the task of talker identification while listeners attempt to process the linguistic content of speech. Because processing the content of speech is habitual and automatic, even talker identification in a foreign language may impose a processing cost as listeners attempt to understand the speech, to the detriment of having resources available for learning talker identity. There is some evidence that allocation of cognitive resources toward demanding speech perception can incur a cost on a primary learning task (Antoniou & Wong, 2015; Heald & Nusbaum, 2014; Kleinschmidt & Jaeger, 2015). However, there is no extant research indicating how cognitive-resource allocation affects learning of talker identity, and future, hypothesis-driven studies are needed to address this question directly. For instance, is talker identification poorer from sentences that are harder to understand, such as those with object-extracted (vs. subject-extracted) relative clauses (Gibson, 1998)? Differences in cognitive load notwithstanding, the present results provide important new insight into how various information sources contribute to talker identification, because listeners consistently failed to benefit from familiar words in the absence of familiar sounds, even when provided multiple exposures to lexical primes across several days of training.

Conclusions

These three experiments suggest that a more complete model of the cognitive processes involved in talker identification must include both acoustic-phonetic and higher-level linguistic processing. Furthermore, they suggest that there is a hierarchical relationship among these linguistic levels, in which the facilitatory effect of lexical access on talker identification depends specifically on the availability of familiar acoustic-phonetic forms.