Vocal Communication of Emotion
Keywords: Emotion Recognition, Recognition Accuracy, Autism Spectrum Condition, Vocal Communication, Vocal Expression
Vocal communication of emotion refers to the process in which speakers express emotions by modulating nonverbal aspects of their speech, and listeners use those nonverbal aspects to make inferences about the emotional experience of the speaker.
The human voice is a prominent channel for nonverbal communication of emotion, both alone and in combination with facial and bodily expressions. Whenever we speak, we do not only convey the linguistic meaning that is contained in the words we use, but also convey emotional information through nonverbal vocal expressions. Vocal emotion expressions can be embedded in the prosodic features of speech (e.g., intonation and rhythm of speech), but can also consist of vocalizations with no linguistic content, often called affect bursts (e.g., cries, sobs, and laughter). Vocal expressions thus reflect the joint effect of linguistics-related demands on speech production and emotion-related effects on the physiology of the vocal apparatus (Scherer 1986).
Expression of Emotion in the Voice
The critical issue for research on emotion expression is how to find exemplars to study, and three main strategies to obtain recordings of vocal expressions can be discerned from the literature. First, researchers have analyzed recordings of spontaneous speech from real-life emotional situations. Such studies have investigated recordings from several types of sources, including interviews, online amateur videos, televised talk shows, and call-center conversations (e.g., Anikin and Persson 2016; Devillers and Vidrascu 2006; Grimm et al. 2008). This method has the advantage of producing authentic expressions, but it can often be difficult to know which emotions are expressed in any given situation. Other limitations involve the difficulty of getting both emotional and neutral speech material from the same speaker (which is needed to control for baseline differences between different speakers), and the fact that many sources (e.g., talk shows) may not be completely spontaneous. Second, researchers have tried to experimentally induce emotional responses in controlled laboratory settings, and then study the resulting vocal expressions (e.g., Gnjatovic and Rösner 2010; Johnstone et al. 2007). This method is limited by ethical considerations regarding the induction of negative emotions and the general difficulty of inducing strong emotional responses.
The third, and by far the most commonly used, method is the use of actor portrayals (for a review, see Juslin and Laukka 2003). This method has the advantage that it enables systematic sampling of a wide range of emotions from the same speakers while controlling for the semantic content of speech. The use of actor portrayals has, however, also been criticized, mainly on the grounds that acting may yield stereotypical portrayals that differ from spontaneous expressions. All of the above methods thus have advantages and disadvantages, and the choice of method should be determined by the specific research goals. For example, if the aim is to study affect in everyday interactions, then spontaneous expressions should likely be used, but if the goal instead is to investigate the limits of what emotion categories can be expressed through vocalizations, then analyzing actor portrayals is a viable option. The distinction between spontaneous and posed expressions may also not be as clear-cut as one might think, because it is assumed that individuals constantly up- and downregulate their spontaneous expressions for strategic reasons (Scherer 1986).
The main goal of expression studies is to understand how emotions are expressed. In the case of vocal expression, the usual approach is to measure emotion-relevant characteristics of the voice using computerized acoustic parameter extraction methods. As an example, Eyben et al. (2016) describe a standard set of objective voice cues containing frequency, energy, spectral balance, and temporal features. Frequency cues contain information about the fundamental frequency (F0) of the voice, which represents the rate at which the vocal folds vibrate and is correlated with our perception of pitch. Energy cues measure various aspects of voice intensity, which reflects the effort that was invested in the production of the vocalization and is perceived as loudness. Spectral balance cues are influenced by various laryngeal and supralaryngeal features and are related to perceived voice quality. Finally, temporal features contain information about the rate and duration of voiced and unvoiced speech segments (e.g., speech rate and pauses).
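To make the frequency and energy cues concrete, the following minimal Python sketch estimates F0 and intensity level for a single analysis frame. It is an illustration only and is not the extraction procedure of Eyben et al. (2016): the function names `estimate_f0` and `rms_db`, the 75–500 Hz search range, and the synthetic 200 Hz test tone are all assumptions made for the example. F0 is taken as the lag that maximizes a simple (biased) autocorrelation, which works for clean periodic signals but would need voicing detection and octave-error handling for real speech.

```python
import math

def rms_db(frame):
    """Intensity cue: root-mean-square level in dB relative to amplitude 1.0."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12))  # floor avoids log10(0) on silence

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Frequency cue: F0 in Hz from the lag with the largest autocorrelation.

    The search range (f0_min, f0_max) is an assumed default, not a standard.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized (biased) autocorrelation: sums over fewer terms at
        # longer lags, which favors the shortest matching period.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic stand-in for a voiced frame: 0.1 s of a 200 Hz sine, amplitude 0.5.
sample_rate = 16000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]

f0 = estimate_f0(frame, sample_rate)   # close to 200 Hz for this tone
level_db = rms_db(frame)               # about -9 dB re. full scale
print(f"F0 = {f0:.1f} Hz, level = {level_db:.2f} dB")
```

Summary statistics such as mean F0, F0 variability, and mean intensity would then be computed across the voiced frames of an utterance.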
Anger – High F0 level, much F0 variability, high voice intensity level, much high-frequency energy, precise articulation, and fast speech rate
Fear – High F0 level, little F0 variability (except in panic fear), low voice intensity level (except in panic fear), and fast speech rate
Happiness – High mean F0, much F0 variability, moderate mean voice intensity, moderate high-frequency energy, and fast speech rate
Sadness – Low mean F0, little F0 variability, falling F0 contour, low mean voice intensity, low voice intensity variability, little high-frequency energy, slow speech rate, and many pauses
As can be seen above, there are several overlaps in the acoustic profiles of different emotions. This suggests that each acoustic cue is partly redundant and only probabilistically related to any particular expression, which leaves room for individual and group differences in expressive style. Studies that systematically manipulate various acoustic cues are needed to understand which cues are the most important ones for listeners’ judgments of different emotions (e.g., using speech synthesis, see Birkholz et al. 2015). It is also the case that cue patterns for emotions and affective states other than the ones listed above are less well understood and require more research. Further, only the most frequently investigated acoustic cues are reviewed above, and many other voice parameters may also be important for the acoustic differentiation of emotions. More research is especially needed on parameters that capture the dynamic nature of vocalizations and cues that measure various aspects of voice quality.
Other key research questions for future studies include the universality of vocal expressions and the systematic acoustic comparison of naturalistic and posed expressions, where initial studies have reported largely similar results for induced and posed expressions (Scherer 2013). Recent cross-cultural studies suggest that vocal expressions share important characteristics across cultures, but have also reported evidence for subtle yet systematic cultural differences in speakers’ expressive style (Laukka et al. 2016).
Perception of Emotion from the Voice
The most commonly used method to assess listeners’ perception of emotion from the nonverbal aspects of the human voice has been the forced-choice task, in which participants are presented with recordings of vocal expressions and are instructed to choose one emotion category from a fixed list of response options (Juslin and Laukka 2003). Forced-choice tasks are popular because they provide an index of recognition accuracy, usually measured as the percentage of responses for which the judged expression matches the intended expression. It is also recommended that one inspect the confusion patterns and correct for response biases in order to get a more complete picture of the listeners’ perceptions. Another common method is the use of continuous rating scales, which can give more nuanced information than forced-choice tasks, especially about perception of mixed speaker states. Rating scales are also suitable for studying perception of continuous affective dimensions from vocal expressions. It can be argued that providing participants with fixed response alternatives may artificially increase recognition accuracy and agreement between raters, and that the more naturalistic method of collecting open-ended judgments may give different insights. However, few studies have investigated open-ended judgments of vocal expressions, perhaps because such data are time-consuming to analyze (Gendron et al. 2014).
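As an illustration of why confusion patterns and response biases matter, the sketch below tabulates forced-choice responses and computes, for each emotion, the raw hit rate alongside one common bias correction, the unbiased hit rate of Wagner (1993), defined as the squared number of hits divided by the product of the number of times the emotion was presented and the number of times its label was chosen. The text above does not say which correction is intended, so this choice is an assumption, and the trial data and function names are invented for the example.

```python
from collections import Counter

def hit_rates(intended, judged):
    """Return {emotion: (raw_hit_rate, unbiased_hit_rate)} for forced-choice data."""
    confusions = Counter(zip(intended, judged))  # (intended, judged) -> count
    presented = Counter(intended)                # times each emotion was presented
    chosen = Counter(judged)                     # times each label was chosen
    rates = {}
    for emotion, n_presented in presented.items():
        hits = confusions[(emotion, emotion)]
        raw = hits / n_presented
        # Unbiased hit rate: hits^2 / (stimulus total * response total).
        # If the label was never chosen, there are no hits and the rate is 0.
        n_chosen = chosen[emotion]
        unbiased = hits * hits / (n_presented * n_chosen) if n_chosen else 0.0
        rates[emotion] = (raw, unbiased)
    return rates

# Hypothetical responses from one listener (12 trials, 3 response options).
intended = ["anger"] * 4 + ["fear"] * 4 + ["sadness"] * 4
judged = (["anger", "anger", "anger", "fear"]
          + ["fear", "fear", "sadness", "sadness"]
          + ["sadness"] * 4)

rates = hit_rates(intended, judged)
for emotion, (raw, unbiased) in rates.items():
    print(f"{emotion:8s} raw = {raw:.2f}  unbiased = {unbiased:.2f}")
```

In this made-up data set, sadness has a perfect raw hit rate, but the listener also chose "sadness" for two fear stimuli, so the unbiased rate for sadness falls to about .67.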
Recognition accuracy rates depend on the number of response options that are presented to participants and are thus not readily comparable from one study to another. One way to compare results across studies is to use Rosenthal and Rubin’s (1989) proportion index (pi), which transforms accuracy scores to a standard scale of dichotomous choice where .50 is the null value and 1.00 corresponds to 100% accurate responses. Based on a meta-analysis, Juslin and Laukka (2003) reported that overall within-cultural recognition accuracy for broad emotion categories (e.g., anger, fear, happiness, and sadness) was well above chance at pi = .90. Anger and sadness were the best-recognized emotions from vocal expressions, in contrast to facial expression studies where happiness is usually the most recognizable expression. Perceivers can also differentiate between different variants of an emotion within a broader emotion category (e.g., variants of anger such as irritation and rage; Banse and Scherer 1996) from vocal expressions. Emotion recognition is more accurate for nonlinguistic vocalizations than for speech prosody, and some emotions (e.g., disgust) that are difficult to recognize from prosody are well recognized from nonlinguistic vocalizations (Hawk et al. 2009). Studies have also shown that a variety of positive emotions other than happiness (e.g., relief, lust, interest, and serenity) can be accurately recognized from nonlinguistic vocalizations (e.g., Laukka et al. 2013; Simon-Thomas et al. 2009).
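The proportion index can be computed directly from the raw proportion correct P and the number of response options k, as pi = P(k − 1) / (1 + P(k − 2)). The short Python sketch below implements this formula; the function name `proportion_index` and the example accuracy values are chosen here for illustration and do not come from any particular study.

```python
def proportion_index(p_correct, n_options):
    """Rosenthal and Rubin's proportion index (pi).

    Rescales a raw proportion correct from a task with n_options response
    alternatives to an equivalent two-alternative scale, where .50 is chance
    and 1.00 is perfect accuracy.
    """
    if n_options < 2:
        raise ValueError("a forced-choice task needs at least two options")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a proportion between 0 and 1")
    return p_correct * (n_options - 1) / (1 + p_correct * (n_options - 2))

# Chance performance maps to .50 regardless of the number of options.
print(proportion_index(0.25, 4))

# The same raw accuracy (60%) is more impressive with more options.
print(round(proportion_index(0.60, 5), 3))
print(round(proportion_index(0.60, 10), 3))
```

With two response options the index equals the raw proportion correct, which is why it serves as a common scale across studies with different numbers of alternatives.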
In cross-cultural studies, vocal expressions from one or more national or ethnic groups are judged by members of the expressers’ own group and of at least one other national or ethnic group (e.g., Scherer et al. 2001; Van Bezooijen et al. 1983). These studies provide evidence for at least minimal universality, in that participants generally recognize vocal stimuli from foreign groups more accurately than one would expect from chance guessing alone. Notably, recent studies have shown that this is the case also for physically isolated cultural groups with little exposure to mass media (e.g., Bryant and Barrett 2008; Sauter et al. 2010), which suggests possible innate biological influences on the emotion perception process. However, most studies also show evidence for in-group advantage, whereby individuals more accurately judge vocal expressions from their own culture compared with expressions from foreign cultures. For example, in the meta-analysis by Juslin and Laukka (2003) discussed above, overall accuracy for cross-cultural studies (pi = .84) was lower than that observed for within-cultural studies. In-group advantage is not static, but varies as a function of cultural distance between expresser and perceiver cultures, with increased distance leading to greater in-group advantage (Scherer et al. 2001). In line with this, Laukka et al. (2016) proposed that in-group advantage is caused by a greater match between expression and perception styles in conditions where speakers and listeners come from the same cultural background. It is recommended that evidence for in-group advantage be evaluated using balanced designs, where expressions from two or more cultures are judged by perceivers from the same cultural groups, because this design controls for possible extraneous group differences along dimensions other than cultural group.
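The logic of the balanced design can be sketched in a few lines of Python. In a fully crossed table of accuracy scores (expresser group by perceiver group), one simple way to quantify in-group advantage is the mean of the matched cells minus the mean of the mismatched cells; under an additive model, overall differences between expresser groups or between perceiver groups cancel out of this contrast. This is an illustrative summary rather than the analysis used in the cited studies, and the function name and accuracy values below are hypothetical.

```python
def in_group_advantage(accuracy):
    """Mean in-group accuracy minus mean out-group accuracy.

    accuracy[expresser][perceiver] holds the recognition accuracy obtained
    when perceivers from one group judge expressions from another group.
    """
    groups = list(accuracy)
    for g in groups:
        if set(accuracy[g]) != set(groups):
            raise ValueError("design is not balanced: every group must "
                             "both express and perceive")
    in_group = [accuracy[g][g] for g in groups]
    out_group = [accuracy[e][p] for e in groups for p in groups if e != p]
    return sum(in_group) / len(in_group) - sum(out_group) / len(out_group)

# Hypothetical accuracies for two cultures, A and B.
# Expressions from A are easier overall, yet each perceiver group still
# does relatively better on its own culture's expressions.
accuracy = {
    "A": {"A": 0.80, "B": 0.70},   # expressions from A, judged by A and B
    "B": {"A": 0.60, "B": 0.66},   # expressions from B, judged by A and B
}

advantage = in_group_advantage(accuracy)
print(f"in-group advantage = {advantage:.2f}")
```

Note that perceivers from B are more accurate on A's expressions (.70) than on their own (.66) in this made-up table, so comparing raw accuracies within a single perceiver group would hide the in-group advantage that the balanced contrast reveals.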
The ability to judge vocal emotion expressions also shows inter- and intraindividual variability. Studies indicate that women on average perform slightly better than men on vocal emotion recognition tests (see Thompson and Voyer 2014, for a meta-analysis). Few associations between normal personality traits and vocal emotion recognition ability show consistency across studies, but recognition accuracy deficits have been reported for several psychiatric disorders. For example, schizophrenia patients show a deficit in vocal emotion recognition that has been linked to both impaired pitch perception and higher-order cognitive dysfunction (e.g., Leitman et al. 2011), and emotion recognition deficits have also been reported for autism spectrum conditions (e.g., Globerson et al. 2015).
Developmental studies suggest that infants can discriminate among angry, happy, and sad vocal expressions (e.g., Walker-Andrews and Lennon 1991). The developmental trajectory of emotion recognition from the voice shows improvement throughout childhood (Sauter et al. 2013; Chronaki et al. 2015) and a decline in older adulthood, which may start as early as the thirties (Mill et al. 2009). Older adults show significant impairments in recognition accuracy across studies (see Ruffman et al. 2008, for a meta-analysis). It should be noted that developmental studies are mainly based on cross-sectional data, which need to be complemented by future longitudinal studies. In general, the underlying causes of individual differences in vocal emotion recognition ability are not well understood. Future studies should therefore focus on understanding the separate and joint contributions of genetic/biological and environmental/social factors. For this, tests that measure recognition of a wide range of emotions, and include acoustically well-characterized stimuli with different levels of emotion intensity, should be used to give a more nuanced picture of the sources of variability in emotion perception.
Historically, the study of vocal expression has received less attention than the study of facial expression, but this situation is beginning to change as a result of increased interest in applications of speech technology, such as acoustic-based classification of speaker states from naturalistic data (e.g., Schuller et al. 2011). The study of vocal communication is a truly interdisciplinary enterprise which involves researchers from not only psychology but also acoustics, computer science, engineering, linguistics, and the medical sciences. The development of more detailed and standardized measures of vocal acoustics (e.g., Eyben et al. 2016), together with more nuanced tests of emotion perception, will help consolidate results across studies and disciplines, and facilitate understanding of individual differences in emotional abilities.
References
- Anikin, A., & Persson, T. (2016). Nonlinguistic vocalizations from online amateur videos for emotion research: A validated corpus. Behavior Research Methods. doi:10.3758/s13428-016-0736-y
- Devillers, L., & Vidrascu, L. (2006). Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs. In Proceedings of the 9th international conference on spoken language processing, Interspeech 2006 (pp. 801–804). Pittsburgh: International Speech Communication Association.
- Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., Devillers, L. Y., Epps, J., Laukka, P., Narayanan, S. S., & Truong, K. P. (2016). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7, 190–202.
- Gendron, M., Roberson, D., van der Vyver, J. M., & Barrett, L. F. (2014). Cultural relativity in perceiving emotion from vocalizations. Psychological Science, 25, 911–920.
- Grimm, M., Kroschel, K., & Narayanan, S. (2008). The Vera am Mittag German audio-visual emotional speech database. In Proceedings of the 2008 IEEE international conference on multimedia and expo, ICME 2008 (pp. 865–868). Piscataway: Institute of Electrical and Electronics Engineers.
- Laukka, P., Elfenbein, H. A., Thingujam, N. S., Rockstuhl, T., Iraki, F. K., Chui, W., & Althoff, J. (2016). The expression and recognition of emotions in the voice across five nations: A lens model analysis based on acoustic features. Journal of Personality and Social Psychology, 111, 686–705.
- Van Bezooijen, R., Otto, S. A., & Heenan, T. A. (1983). Recognition of vocal expressions of emotion: A three-nation study to identify universal characteristics. Journal of Cross-Cultural Psychology, 14, 387–406.