1 Introduction

Many advocate for artificial agents and systems to be more empathic in their interactions with humans. Machines that can recognize emotions stand to play a significant role in the development of next-generation human–computer interaction systems [10, 14, 34, 63]. Furthermore, with the emergence of social media, users are uploading millions of pictures every day to share their emotions and thoughts with others. For example, WhatsApp users upload 700 million new photos per day, Facebook users share 350 million new photos each day, and Snapchat leads with 8796 photos shared per second [48, 49]. This clearly illustrates the need to consider the quality of metadata generation. Whether the intention is to develop algorithms that infer image properties, or to use human-in-the-loop approaches (i.e., relying on the perceptual abilities of online crowdworkers in real time [31]), ensuring the quality of metadata for images depicting people is crucial. In the current work, we consider a metadata generation task that is more challenging than the labeling of image content (i.e., whether the image depicts a person, animal, or object). Specifically, we are interested in emotion recognition from images of people in pain and distress.

It must be noted that automated emotion recognition, which contributes to many high-stakes applications involving behavioral analysis in both the commercial and medical domains, is not without controversy, given the potential risks. Imagine a robot designed to offer communication support to an individual with depression, which embeds emotion recognition technology. In such a context, the costs of misrecognizing the user’s distress are very high, with many potential consequences. For example, if the robot were to misrecognize a sentiment, such as mistaking disgust for sadness or hostility, this could lead to inappropriate responses on the part of the agent, such as offending a patient who may already be in a sensitive state.

In most settings, computer-based emotion recognition is achieved in one of two ways. Emotions can be detected indirectly, through observing the other party’s facial expressions, gestures, and voice (including verbal and written communication) [57], or directly, from physiological signals such as heart rate [51]. In the current study, the task is to recognize the negative emotions of an individual depicted in an image, where the annotator (or the “crowd”) must rely on visual cues only, performing indirect recognition of emotional distress.

We consider the possibility of using crowdsourcing for emotional distress recognition on a mass scale, as facilitated by a popular crowdsourcing platform (Amazon Mechanical Turk), to label images of people with respect to what they are feeling. Several studies have been conducted on emotion recognition and have found that people tend to be relatively accurate at judging facial expressions [15, 57]. However, to the best of our knowledge, only a few studies have explored the detection of emotions based on online pictures/videos, with no additional cues. Achieving a better understanding of the task and the characteristics of individuals who can perform it reliably and accurately stands to benefit human–computer interaction systems and their ability to become more empathic.

1.1 Crowdsourcing emotion judgments

Crowdsourcing has emerged as an effective means to complete small, well-defined tasks, which, while simple for humans, cannot yet be performed reliably by machines. Essentially an open call for labor with flexible contractual arrangements [47], crowdsourcing provides a convenient, low-cost solution for obtaining specific feedback that is arguably objective and valid [38, 69]. Crowdsourcing is already being used to generate annotations for images depicting human behavior with reported success [27, 53, 54].

Crowdsourcing also has the potential to enable us to gather information that is universal and relevant across cultures [43], given the ease of reaching diverse groups of crowdworkers via popular platforms. Nonetheless, the diversity of “the crowd” is precisely the characteristic that may challenge the task of accurately identifying a person in distress and/or pain. As will be explained, psychological studies clearly show that who we are (i.e., our demographic characteristics and personalities) relates to our ability to understand others’ emotions. In addition, we typically have more affinity towards, and an enhanced ability to understand, those more like ourselves. Given the diversity of the crowd, how can we ensure that high-quality metadata are generated on emotional distress labeling tasks?

1.2 Goals of the current work

This study gauges the feasibility of crowdsourcing a visual pain recognition task (i.e., labeling images) with an ethnically diverse workforce. Previous research has explored emotion recognition from speech and textual analysis [25, 33]. Since visual recognition has not been extensively explored in the context of crowdsourcing, our goal is to see how factors such as demographics, personality traits (i.e., level of empathy), and the degree to which one identifies with his or her ethnic group impact one’s approach to, and performance on, the visual pain recognition task. In short, we address four novel research questions:

  • Can “crowdworkers” recognize pain and distress in depicted people? How do the demographics and ethnicity of the worker and of the target individual depicted in the images impact performance?

  • How does the empathy level of the worker impact performance on task?

  • How does the strength of one’s identification with his or her ethnic group impact performance?

The study aims to develop the understanding needed of the relationships between the image emotion recognition task, the social cues surrounding it (i.e., how in-group or out-group status affects the empathic response of the annotator), and the quality of image metadata that we might expect to derive via crowdsourcing.

2 Literature review and hypotheses

The psychological construct of empathy refers to the ability to understand and share positive or negative emotions. It is a developmental emotion that first appears in infancy and promotes pro-social behaviors [70]. Levenson and Ruef [36] outline three key components of empathy: “(a) knowing what another person is feeling, (b) feeling what another person is feeling, and (c) responding compassionately to another person’s distress”.

The success of the emotion recognition task primarily hinges on the first of the above three components; it is clear that recognizing the emotions of another requires empathy. However, most of what we know about empathic responses to others concerns face-to-face communication, where around 90% of emotional expressions are communicated non-verbally [20]. In this setting, empathic responses are the result of careful observation of both the verbal and non-verbal cues during communications with the other (i.e., target individual).

In our task, annotators (participants) are asked to infer a stranger’s negative emotion, based only upon visual cues available in a still image. Given the close connection between empathy and pain emotion recognition, we expect that individuals who have greater levels of trait empathy (i.e., a more empathic personality) will be better annotators as compared to those with lower levels of empathy. We also expect better annotation results when annotators and depicted individuals are of the same ethnic group. Obviously, what constitutes a “quality annotation” depends on the intended use of the image metadata. Therefore, we consider multiple response variables.

As explained in the methodology section, we consider one measure of participants’ emotional reaction to a painful image (their reported level of pain arousal), as well as a measure of their beliefs about their ability to describe the subject’s emotion accurately (confidence). Next, we consider the affective content of the tags participants use to describe subjects’ emotions. Following Warriner et al. [67], we consider the valence (i.e., pleasantness) of word tags, their degree of arousal (i.e., the intensity of emotion expressed in a chosen tag), as well as word dominance (i.e., the degree of control suggested by a chosen word). In other words, the current work explores these five response variables (pain arousal, task confidence, and the affective meanings consisting of valence, arousal, and dominance). In the remainder of this section, we describe the background and motivation of our work as well as the hypotheses to be tested.

2.1 Empathy and reaction to others’ suffering

Empathy manifests itself as a reaction to others’ emotions. We examine empathic responses to others’ suffering and, more specifically, responses to the primary emotion of sadness conveyed through images of pain and distress. Sadness is a fundamental emotion experienced by all human beings [12, 15], and is most strongly associated with the understanding of a permanent loss. A particularly good example is death, which has the ability to transform the individual’s interpretation of life and the world [60]. The importance of the phenomenon is highlighted by the several emotions experienced simultaneously as the individual is shattered by loss: anxiety, irritation-anger, emptiness, worthlessness, meaninglessness, hopelessness, weakness, brokenness, and/or guilt [5, 21, 44, 46].

However, while experienced by everyone, recognizing sadness or pain in another is not necessarily easy. The perception of pain is based on “representations” that one has formed. Each individual creates and stores mental representations of personal pain experiences. These representations are later called upon to identify the pain expressed by others [3, 28]. For these reasons, we expect to find that who someone is correlates to his or her ability to infer the target person’s emotion.

H1a

Demographic characteristics such as gender and age correlate to participants’ level of pain arousal.

H1b

Demographic characteristics such as gender and age correlate to participants’ confidence on task.

H1c

Demographic characteristics of gender and age are correlated to the affective content of words used to describe painful images.

Women generally self-report as being more emotional than men [61] and are more empathic than men [35]. Of particular note, the experience of negative emotions, such as sadness and pain, is most often reported by women [7, 18, 26], and the duration of the feeling seems to be longer in women than in men [53]. Finally, women express and interpret emotions more accurately [22, 23, 42], and girls express their sadness more intensely than do boys [11].

In contrast to gender, only a few studies have investigated the correlation between emotions, empathic reactions, and age. Some findings support the notion that empathy is a pro-social characteristic that appears and develops throughout life. Specifically, psychologists consider the mechanism of crying in infants to be an empathic reaction, with female infants empathizing more than males [40]. In addition, empathy may decrease as one approaches adulthood [55] and then increase again in old age, with older people exhibiting higher scores on standardized measures of empathy [39, 56]. We expect that individuals with higher levels of trait empathy will be better at our visual pain recognition task as compared to those with lower empathy. We also expect to find that such individuals will describe image subjects’ emotions more intensely.

H2a

Empathic individuals will experience greater arousal as compared to less empathic individuals when viewing images of others in pain.

H2b

Empathic individuals will report greater confidence on task as compared to less empathic individuals.

H2c

Empathic individuals will describe images of others in pain using more intense words as compared to less empathic individuals.

2.2 Empathic reaction to out-group members’ feelings

Facial expressions of emotions fall into two basic categories: universal and culturally specific. As a consequence, the origin and ethnicity of an individual can affect his or her non-verbal communication behavior. We should not expect a diverse group of individuals to express a given emotion in the same manner. Indeed, ethnic or cultural differences, understood as an individual’s membership in a given in-group and exclusion from one or more out-groups, have been shown to correlate to emotional behavior [64].

Likewise, research suggests that individuals’ abilities to accurately recognize another’s emotions relate to group membership and similarity. Increased similarity, as well as identification with others, can lead to increased sharing of the experience and hence heightened empathy [52]. In particular, belonging to a social group serves as a bond that enhances empathy among group members [6, 24, 65]. By including the other group members as part of one’s self-concept [2], people are generally able to empathize more strongly with in-group members. This phenomenon is called the in-group advantage [16, 17].

The in-group advantage can lead to increased accuracy in visual emotion detection for members of the same ethnicity [66, 68]. Similar results were obtained with regard to emotion detection from speech. Specifically, individuals within the same country but of different cultures or ethnicities (e.g., White Canadians and Canadian Aboriginals) could not detect one another’s emotions as accurately as those within the same group [1].

Finally, studies of people’s reactions to others depicted in images show that empathic responses toward those suffering are stronger for in-group members and weaker for out-group members [8, 19]. As a result, individuals’ responses when viewing images of in-group members were more empathic than when they viewed images of those from an out-group [4].

H3a

Participants report greater pain arousal when viewing images of in-group (versus out-group) members.

H3b

Participants report greater confidence when describing emotions of in-group (versus out-group) members.

H3c

Participants will describe the pain of in-group members using more intense word labels, as compared to members of their out-group.

3 Data and method

Our visual pain recognition task consisted of three parts. After viewing an image depicting a stranger in pain or distress, participants (1) rated their own level of pain arousal, (2) described the emotional content of the image via open-ended tagging, and (3) assessed their performance on the tagging task. Participants also completed a questionnaire concerning their demographic background, as well as two standard psychological questionnaires: Davis’ Interpersonal Reactivity Index (IRI) and the Multigroup Ethnic Identity Measure (MEIM). Relevant details are provided in the following sub-sections.

3.1 Image data set

We used a data set of 42 images developed by a team of neuroscientists, who used them to investigate the neural basis of reactions to depicted subjects [41]. The images depicted East Asian (EA), African (AA), or Caucasian (CA) American subjects. Thirty-six images depicted painful situations (e.g., a woman crying during a flood) and six depicted neutral ones (e.g., a man enjoying an outdoor picnic). We included the neutral images to permit participants’ arousal levels to “settle down” (i.e., to avoid a habituation effect). In a previous experiment using this data set, participants were asked to indicate “how badly” they felt for the main subject(s) in the image on a 4-point scale. The results showed that the images elicit both reliable and valid responses. Example images are shown in Fig. 1.

Fig. 1 Example images of EA, AA, and CA individuals in painful settings

3.2 Participants

We used Amazon Mechanical Turk (MTurk) to recruit a crowdsourced workforce. We selected MTurk, since it has been successfully used to crowdsource annotations of text, scenes, and pictures [9, 27, 62], as well as emotions [46]. We targeted three groups of participants by their self-reported ethnic background (EA, AA, or CA).

Participants had to be native or near-native English speakers residing in the United States. They were rewarded with $5 for their time, and all but one took less than 60 min. A total of 120 participants, ranging in age from 18 to 57 years (M = 29.88, SD = 7.55), completed the study. Specifically, 30 East Asian-American (EA; 13 males, 17 females), 39 African-American (AA; 22 males, 17 females), and 51 Caucasian-American (CA; 31 males, 20 females) workers participated.

3.3 Experimental design, tasks, and validation

The 42 images were shown to each participant in the same random order. After viewing an image, participants completed the pain arousal item (“How badly do you feel for the person(s) in the image?”). Next, they were asked to provide three emotion tags (“How would you describe the emotions of the main subject(s) of the image?”). Finally, participants were asked to rate their confidence level (“How confident are you that you accurately described the emotion(s) of the main subject in the image?”). The first and third tasks used a 4-point Likert item (1 = not at all to 4 = very much).
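For concreteness, each participant’s responses to a single image can be thought of as a simple record. The sketch below is purely illustrative; the field names are our own and are not part of the study materials.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ImageResponse:
    """One participant's responses to one image (hypothetical field names)."""
    image_id: str
    pain_arousal: int                    # 1 = not at all ... 4 = very much
    emotion_tags: Tuple[str, str, str]   # three open-ended emotion tags
    confidence: int                      # 1 = not at all ... 4 = very much
```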

To confirm the validity of our approach, factor analysis was used to examine the structure of the pain arousal responses elicited by the images, which serve as stimuli for participants. The analysis revealed a solution that explained 59.426% of the variance, with structural coefficients (loadings) > .50 for all factors. Varimax rotation yielded three factors corresponding to the ethnicity of the depicted subjects (EA, AA, and CA pain arousal pictures), each consisting of 12 items. This analysis also revealed a high degree of reliability and validity. Internal consistency, as measured by Cronbach’s alpha, was high: the EA Pain Arousal α was .932 (eigenvalue = 10.726), the AA Pain Arousal α was .915 (eigenvalue = 9.695), and the CA Pain Arousal α was .926 (eigenvalue = 4.538).

In addition, the reliability and validity of the image set were re-tested with the self-reported confidence scores. This yielded a solution that explained 48.736% of the variance, with structural coefficients > .50. Varimax rotation again yielded three factors, based on the ethnicity of the depicted subjects (EA, AA, and CA), each consisting of 14 items. This analysis also revealed a high degree of both reliability and validity. In particular, internal consistency, as measured by Cronbach’s alpha, was high: the EA Task Confidence α was .894 (eigenvalue = 7.783), the AA Task Confidence α was .896 (eigenvalue = 6.701), and the CA Task Confidence α was .866 (eigenvalue = 5.986).
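For readers who wish to reproduce these internal-consistency statistics, the following minimal Python sketch shows how Cronbach’s alpha can be computed from a participants-by-items score matrix, with the varimax-rotated factor solution obtainable via the third-party factor_analyzer package; this is an illustration under our assumptions, not the exact pipeline used in the original analysis.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_participants, k_items) matrix of item scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the scale total
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Varimax-rotated factor solution (requires the factor_analyzer package):
# from factor_analyzer import FactorAnalyzer
# fa = FactorAnalyzer(n_factors=3, rotation="varimax")
# fa.fit(scores)            # scores: participants x items (e.g., 120 x 36)
# loadings = fa.loadings_   # structural coefficients per item and factor
```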

3.4 Psychological tests

3.4.1 Davis’ Interpersonal Reactivity Index (IRI)

Davis’ IRI [13] is a measure of dispositional (or trait) empathy that considers a set of four distinct, though related, constructs. Each of its four subscales (empathic concern, fantasy, perspective taking, and personal distress) was assessed with 7 items on a 5-point Likert scale (0 = does not describe me well to 4 = describes me very well). The subscales pertaining to the cognitive dimensions of empathy, the fantasy subscale (FS) and the perspective taking (PT) subscale, measure the tendency to get caught up in fictional stories and imagine oneself in the situations of fictional characters, and the tendency to take the psychological point of view of others, respectively. The empathic concern and personal distress subscales measure the affective dimensions of empathy. Specifically, the empathic concern (EC) scale measures sympathy and concern for others, and is typically considered an other-oriented emotional response in which attention is directed to the person in distress [59]. In contrast, the personal distress (PD) scale is considered a self-oriented emotional response in which attention is directed at one’s own negative emotions of distress and the reduction of these negative emotions.

Others have found the IRI instrument to have a high degree of reliability and validity, which was also supported by our findings. We used exploratory factor analysis to examine its structure. This yielded a solution that explained 62.724% of the variance, with structural coefficients > .50 for all factors. Varimax rotation yielded four factors (EC, PD, PT, and FS), each consisting of seven items. Furthermore, the analysis revealed a high degree of both reliability and validity. Internal consistency, as measured by Cronbach’s alpha, was high: the Empathic Concern (EC) α was .917 (eigenvalue = 5.030), the Personal Distress (PD) α was .888 (eigenvalue = 4.426), the Perspective Taking (PT) α was .882 (eigenvalue = 4.157), and the Fantasy Scale (FS) α was .853 (eigenvalue = 3.949).

3.4.2 Multigroup Ethnic Identity Measure (MEIM)

The Multigroup Ethnic Identity Measure (MEIM) [50] is an instrument with a high degree of reliability and validity for measuring the feelings and reactions of an individual in relation to his or her reported ethnic group. The instrument contains questions designed to assess two related constructs. Participants answered 12 closed-response items on a 4-point Likert scale (1 = strongly disagree to 4 = strongly agree). The first construct, Affirmation, Belonging, and Commitment, gauges knowledge of and feelings toward one’s ethnic group and consists of seven items (e.g., “I feel good about my cultural or ethnic background”). The second construct, Ethnic Identity Search, consists of five questions (e.g., “I think a lot about how my life will be affected by my ethnic group membership”). There are also two categorical items, in which one is asked to select his or her ethnic group and that of his or her parents. Finally, in one open question, the participant is asked to state his or her ethnic group (“I consider myself to be…”).

The high degree of reliability and validity of the instrument was supported by our data. Factor analysis was used to examine the structure of the MEIM questionnaire. This yielded a solution that explained 66.471% of the variance, with structural coefficients > .60 for all factors. Varimax rotation yielded two factors (Ethnic Identity Search and Affirmation–Belonging–Commitment), consisting of five and seven items, respectively. Furthermore, the analysis revealed a high degree of reliability and validity. Internal consistency, as measured by Cronbach’s alpha, was high: the Ethnic Identity Search α was .850 (eigenvalue = 3.734), and the Affirmation–Belonging–Commitment α was .914 (eigenvalue = 4.243).

3.5 Affective content of emotion tags

We collected a total of 12,960 word tags (i.e., three tags for each of 36 painful images from each of 120 participants), which described the emotions of the main subject(s) of each image. Given the size of the corpus, we needed a means to automatically analyze the affective content expressed through the tags. Sentiment analysis, or the detection of affect in textual communication, has been a very active area of research in recent years, particularly among information retrieval [e.g., 37, 49] and natural language processing [e.g., 45] scholars. To this end, many resources, including sentiment lexicons, have been developed to enable the exploitation of the rich sources of textual data shared via social media. However, as we aimed to examine the affective content of individual word tags, and their correlation to participants’ demographics and personal characteristics, we selected a lexicon developed by a team of psycholinguists to capture the affective norms of individual words [67], a resource close in spirit to our task.

This resource is a collection of ratings on three affective dimensions for nearly 14,000 English words. As mentioned, the dimensions are valence, arousal, and dominance. In Warriner et al. [67], participants rated a given word on a scale of 1–9, reflecting their feelings when reading the word, as follows:

  • Valence: How happy/pleased/satisfied/contented/hopeful do you feel?

  • Arousal: How excited/stimulated/frenzied/jittery/wide-awake/aroused do you feel?

  • Dominance: How controlled/influenced/cared-for/awed/submissive/guided do you feel?

Table 1 provides examples of words that score relatively high and low on each of the three dimensions. Each entry shows the mean score assigned by all participants in the study of Warriner et al. [67] who rated the given word.

Table 1 Example words and their mean scores on three affective dimensions

For each word that our participants used as an emotion tag, we obtained the three affective scores, to explore how personality and background might influence the words that someone uses to describe another in pain. In total, 83% of our tags were found in the lexicon and had valid scores, leaving us with 10,704 word tags to analyze.
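A minimal sketch of this lookup step is shown below, assuming the published CSV supplement of the Warriner et al. norms; the filename and column names follow that supplement but should be verified against the copy in hand.

```python
import pandas as pd

# Assumed filename/columns, per the published supplement of Warriner et al. [67].
norms = pd.read_csv("Ratings_Warriner_et_al.csv")
lexicon = norms.set_index("Word")[["V.Mean.Sum", "A.Mean.Sum", "D.Mean.Sum"]]

def score_tag(tag: str):
    """Return (valence, arousal, dominance) for a word tag, or None if the
    tag is not covered by the lexicon (as for ~17% of our tags)."""
    word = tag.strip().lower()
    if word in lexicon.index:
        v, a, d = lexicon.loc[word]
        return float(v), float(a), float(d)
    return None
```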

3.6 Statistical analysis

We used parametric analyses [including correlation (Pearson’s r), t tests, one-way ANOVA, and linear regression] to explore the relationships between participants’ demographic characteristics, their levels of empathy, and their ethnic identities and each of the five response variables: (1) pain arousal ratings, (2) self-assessed task confidence, and (3) the valence, (4) arousal, and (5) dominance of tags used to describe the emotions of strangers depicted in painful settings.

As the first two response variables were left-skewed, we applied the following transformation before performing our analyses: (x + 1)². In contrast, the scores on the three affective dimensions of word tags were right-skewed and thus were transformed as follows: log(x + 1).
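In NumPy terms, the two transformations amount to the following (a sketch on toy arrays, not the study data):

```python
import numpy as np

pain_arousal = np.array([4, 3, 4, 2, 4])  # toy left-skewed Likert ratings (1-4)
valence = np.array([1.6, 2.1, 1.9, 5.2])  # toy right-skewed lexicon scores (1-9)

pain_arousal_t = (pain_arousal + 1) ** 2  # square transform reduces left skew
valence_t = np.log(valence + 1)           # log transform reduces right skew
```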

4 Results

4.1 Demographic characteristics of image annotators

Our first set of hypotheses (H1) proposed that annotators’ demographic characteristics, in particular age and gender, are correlated to their performance on the visual pain recognition task. The literature suggests that empathy levels vary with both gender and age; thus, one’s ability to understand others’ emotional pain should, in theory, correlate to his or her ability to recognize strangers’ pain and distress. An independent two-group t test reveals significant gender differences in IRI scores, with respect to empathic concern (t = −2.265, p < .05) and personal distress (t = −4.787, p < .001). Women’s scores reveal them to be more in touch with others’ feelings (EC), yet also more focused on their own negative feelings of distress (PD), as compared to men. However, correlation analysis showed no significant correlation between any of the IRI scores and age.
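The two tests named above map directly onto standard SciPy routines; the sketch below uses simulated stand-in data, since the study data are not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pd_scores = rng.normal(2.0, 0.8, size=120)  # stand-in IRI personal distress scores
gender = rng.choice(["F", "M"], size=120)
age = rng.integers(18, 58, size=120).astype(float)

# Independent two-group t test: gender differences in an IRI subscale.
t_stat, p_val = stats.ttest_ind(pd_scores[gender == "F"], pd_scores[gender == "M"])

# Pearson's r: participant age versus an IRI subscale score.
r, p_r = stats.pearsonr(age, pd_scores)
```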

Given these findings, we again used the t test to compare each response variable across gender. As shown in Table 2, which details the mean/median scores by gender, we find no significant gender differences in the pain arousal reported by participants or in their self-reported confidence on task. We do, however, find differences on two of the affective dimensions of the word tags assigned to images. Women tend to use words suggesting more arousal or excitement (e.g., panicky, dangerous, tragedy, rage) as compared to men, whose chosen tags tend to rate higher on dominance (e.g., strength, courageous, and understanding).

Table 2 T tests comparing 5 response variables by gender (***p < .001; **p < .01)

There were no significant correlations between participant age and pain arousal scores, self-reported confidence, tag valence, or tag dominance. There was a statistically significant, albeit weak, correlation between participant age and tag arousal score (r = 0.0192, p < .05).

The analysis leads us to reject hypotheses H1a and H1b concerning gender and age. In contrast, we do observe that gender is correlated to the types of words (i.e., tags) that participants choose to describe the emotions of the depicted subjects in painful images. Thus, we support H1c.

4.2 Empathy levels of image annotators

We hypothesized that more empathic individuals would be better annotators in the emotion detection task, because they have an easier time knowing and feeling what another feels (H2). Table 3 details a linear regression analysis in which each of the five response variables was regressed on the four IRI scores as well as participant gender. It is clear that empathic concern (EC), the dimension of empathy that reflects one’s ability to understand another’s feelings, is positively related to our first two response variables, the pain arousal score and the self-reported task confidence. It is notable that EC plays a key role in explaining the variance of both pain arousal and task confidence, even when we control for gender, which we found to be highly correlated to EC and PD.

Table 3 Linear regression model: response variables regressed on IRI scores (***p < .001; **p < .01; *p < .05)

We also observe significant, albeit very weak, effects of EC and PD on the affective properties of word tags. However, the explanatory power of these models is almost nil. Our results support hypotheses H2a and H2b; participants who are other-oriented experience greater pain arousal when viewing images of strangers in pain, and report higher confidence in describing the strangers’ emotions. We reject hypothesis H2c, since IRI scores explain almost none of the variance in the valence, arousal, and dominance scores of the word tags used to describe images.
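The regression reported in Table 3 corresponds to an ordinary least squares model of each response variable on the four IRI subscales plus gender; a hedged sketch using the statsmodels formula API and simulated stand-in data follows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({                            # stand-in per-participant measures
    "pain_arousal": rng.normal(9.0, 2.0, 120), # transformed response variable
    "EC": rng.normal(3.0, 0.6, 120),           # empathic concern
    "PD": rng.normal(2.0, 0.7, 120),           # personal distress
    "PT": rng.normal(2.8, 0.6, 120),           # perspective taking
    "FS": rng.normal(2.5, 0.7, 120),           # fantasy
    "gender": rng.choice(["F", "M"], 120),
})

# One response variable regressed on the four IRI scores, controlling for gender.
model = smf.ols("pain_arousal ~ EC + PD + PT + FS + C(gender)", data=df).fit()
print(model.summary())
```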

4.3 Reacting to emotions of in- versus out-group members

Having examined the correlations between annotator demographics and levels of trait empathy and our five response variables, we now consider the possible role of ethnic group and the in-group advantage. First, we can ask whether, in general, there are differences in our five response variables with respect to participant ethnic background. Table 4 details the mean/median responses on each response variable, broken out by participant and image subject ethnicity. The last row of the table shows the average responses by participant ethnicity only (i.e., collapsing the three categories of image subjects).

Table 4 Mean/median responses by subject and participant ethnicity group

Considering only participant ethnicity, one-way ANOVA reveals no significant differences with respect to pain arousal scores; however, the self-reported confidence scores differ (F = 5.817, p < .05). Specifically, Tukey HSD reveals that both EA and AA participants report higher confidence as compared to CA (p < .05 for both). For the three affective dimensions of word tags, there are significant differences only with respect to dominance (F = 5.99, p < .05). Here, we find that EA participants use words with lower dominance scores, as compared to either AA or CA participants (p < .05 for both).
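The ANOVA-then-Tukey procedure used throughout this subsection can be sketched as follows (simulated stand-in data; group sizes match Sect. 3.2):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
ethnicity = np.repeat(["EA", "AA", "CA"], [30, 39, 51])  # group sizes from Sect. 3.2
confidence = rng.normal(3.0, 0.5, size=120)              # stand-in confidence scores

# One-way ANOVA across participant ethnic groups ...
F, p = stats.f_oneway(*(confidence[ethnicity == g] for g in ("EA", "AA", "CA")))

# ... followed by Tukey HSD post hoc comparisons when the ANOVA is significant.
if p < .05:
    print(pairwise_tukeyhsd(confidence, ethnicity))
```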

Next, we consider the possible effect of the ethnicity of the subject depicted in pain. We divided the images into three groups according to subject ethnicity, as shown in Table 5. We then performed one-way ANOVAs separately on each ethnic group (EA, AA, and CA) for each of the response variables, with participant ethnicity as the grouping variable. In the case of a significant ANOVA, Tukey HSD was used to determine which participant ethnic groups reacted differently to the set of images.

Table 5 Significant group differences per post hoc Tukey HSD (**p < .01; *p < .05)

As shown, the ethnicity of both the image subject and the participant plays a role in performance on the visual pain recognition task. For example, with respect to images of African Americans (second column in Table 5) in painful settings, AA and CA participants experienced differing levels of pain arousal, as well as of self-reported confidence on task. Table 4 confirms that Caucasian participants experienced less pain arousal and reported lower confidence, as compared to African Americans.

Having observed that image subject and participant ethnicity are important factors in the visual pain recognition task, we move on to consider the in-group advantage directly. Specifically, we use regression analysis, applied to the three sets of images (broken out by image subject ethnicity), as described above. We created three indicator variables (AA, EA, and CA participants) to model cases in which out-group participants view a set of images depicting subjects from a different ethnic group. In addition, we examine whether the strength of the participant’s ethnic identity (i.e., MEIM scores) mitigates the in-group advantage.
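A sketch of this dummy-coded regression, under our assumptions about variable names and again on simulated stand-in data, might look as follows; the in-group serves as the omitted baseline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ethnicity": np.repeat(["EA", "AA", "CA"], [30, 39, 51]),
    "confidence": rng.normal(3.0, 0.5, 120),   # stand-in response for one image set
    "meim_search": rng.normal(2.6, 0.5, 120),  # MEIM ethnic identity search
    "meim_affirm": rng.normal(3.1, 0.4, 120),  # MEIM affirmation/belonging/commitment
})
# Indicator (dummy) variables marking each participant ethnic group.
df = df.join(pd.get_dummies(df["ethnicity"]).astype(float))

# For the EA image set: out-group indicators (AA, CA) plus MEIM scores;
# EA participants serve as the omitted in-group baseline.
model = smf.ols("confidence ~ AA + CA + meim_search + meim_affirm", data=df).fit()
print(model.params)
```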

Tables 6, 7, and 8 detail the regression models for images of EA, AA, and CA subjects, respectively. With respect to participants’ levels of pain arousal and their confidence on task, it is clear that the strength of their ethnic identity (MEIM-ID-search) is highly correlated to the response variables, more so than in-group/out-group status with respect to the subject of the image. Note that for AA and CA images, none of the models predicting the affective dimensions of word tags were significant; they are, therefore, not detailed.

Table 6 EA pictures: response variables regressed on out-group member dummies and MEIM scores (***p < .001; **p < .01; *p < .05)
Table 7 AA pictures: response variables regressed on out-group member dummies and MEIM scores (***p < .001; **p < .01)
Table 8 CA pictures: response variables regressed on out-group member dummies and MEIM scores (***p < .001; **p < .01)

We removed the MEIM variables from the regressions to see if the out-group member indicator variables would play a more significant role in explaining the variance in the response variables. These results are shown in Tables 9, 10, and 11. Here, we can see that CA participants tend to be less confident on task when viewing images of out-group members (i.e., in Tables 9 and 10, we observe negative, highly significant coefficients on the CA indicator variable) as compared to the respective in-group participants. Finally, both EA and AA participants report more confidence when describing the pain of CA subjects, as compared to the in-group (CA) participants. None of the models concerning the affective content of word tags were significant and are, therefore, not shown.

Table 9 EA images: response variables regressed on out-group indicator variables (***p < .001; **p < .01)
Table 10 AA images: response variables regressed on out-group indicator variables (**p < .01; *p < .05)
Table 11 CA images: response variables regressed on out-group indicator variables (**p < .01; *p < .05)

Given these results, we find support for H3a, H3b, and H3c, with some interesting caveats. It is clear that the ethnic backgrounds of both the participant and the subject depicted in a painful image are correlated to the response variables, and in particular, to self-reported confidence on task. However, rather than providing support for a clear-cut in-group advantage across all participants, our results highlight differences between our minority participants (EA and AA) and the Caucasian participants. Caucasians report less confidence in inferring the emotions of both EA and AA subjects. However, unexpectedly, they also report less confidence than others when describing the emotions of other Caucasians in painful settings.

Interestingly, these relationships are mitigated by the degree to which one is in touch with his or her own ethnic identity and background. In particular, we observed that MEIM-ID-search is positively correlated to pain arousal as well as to self-reported task confidence. Individuals who score high on these items of the MEIM have put forth effort to understand their ethnic background and its impact on their life experiences. This characteristic explains more variance in responses to painful images than does the in-/out-group relation to the image subject.

The trends concerning the affective content of the emotion tags are less clear. However, it does seem to be the case that ethnic background is relevant, as we observe out-group participants describing EA images using word tags with differing levels of valence, arousal, and dominance, as compared to the EA in-group participants (Table 6).

5 Discussion

As mentioned in the introduction, many believe that increasing the diversity of those involved in all of the processes and tasks that go into building new social technologies, such as automated image tagging, will help to ensure that these technologies are beneficial for all users. In the current study, crowdsourcing allowed us to gather image metadata on a visual pain emotion recognition task from a diverse workforce, consisting of men and women of several ethnic backgrounds. Our findings support the claim that diversity can be of benefit, but also underscore the need for access to verified information concerning the personalities (e.g., empathy levels) and identities of crowdworkers. This is particularly important for tasks that hinge on one’s ability to perceive and interpret the negative feelings of others.

5.1 Interpreting others’ pain

Two of our response variables quantified our participants’ experience on task. The pain arousal rating gauged the extent to which workers were able to feel a depicted subject’s pain, while self-reported confidence measured their self-assurance in their ability to describe, using word tags, the depicted subjects’ emotion(s).

We found little evidence that worker demographics alone could be used to predict the extent to which one will feel pain for image subjects, or one’s perceived confidence on task. The one exception here is the correlation between self-reported confidence and ethnicity; Caucasians reported having less confidence than the other ethnic groups, regardless of the ethnicity of the subject depicted.

As compared to demographic characteristics (age, gender, and ethnicity), trait empathy and strength of ethnic identity are more indicative of a participant’s ability to perform the task. In particular, those who have high levels of other-oriented empathy (i.e., empathic concern), and who are in touch with their own ethnic background and identity, are likely to be reliable performers on this task. These two variables appear to serve as indicators of one’s ability to understand and describe another’s feelings of pain and distress.

It is of great importance to better understand the nature of crowdworkers, since the characteristics of MTurkers may be unique and different from those of the general population. Our findings reveal some differences between the visual pain emotion recognition of our workers and that reported for the general population. The broader literature indicates significant gender differences in emotion recognition [7, 11, 18, 22, 23, 26, 35, 42, 58, 61]; however, our workers did not exhibit such gender differences in reactivity in the current study. This warrants further investigation of how gender identity manifests while performing crowdsourced tasks online, in some cases for several hours per day.

5.2 Describing pain through emotion labels

The remaining three response variables, the arousal, valence, and dominance of word tags, quantified three characteristics of the affective content of the words participants used to describe the emotions of the individuals depicted in painful images. Interestingly, while demographic variables proved not to be correlated to participants’ level of pain arousal or perceived task confidence, contrary to what the literature on gender differences and emotion led us to expect, they do tell us something about the types of word tags participants might use to describe the emotions of others. For example, women are more likely than men to use labels with higher arousal scores (e.g., describing a woman pictured carrying a child through a flooded area as “panicked” rather than simply “scared”). On the other hand, women are less likely than men to use word tags suggesting dominance or control (e.g., describing the woman as “defeated” rather than “courageous”). There is a vast literature on gender differences and language, with many suggesting that “women’s language” demonstrates a tendency to be more emotional than men’s and, of course, less powerful, e.g., [30].

There was also evidence suggesting that ethnic background plays a role in the word tags chosen to describe painful emotions. Interestingly, differences occurred with respect to the tags chosen by EA participants in general (Table 4), as well as in the words chosen by AA and CA participants to describe images of EA subjects in pain (Table 9). In summary, EA participants used word tags expressing less dominance or control, in comparison to others. One possible explanation is the difference in the emphasis placed on self-expression across cultural groups [29], which might lead members of one group to use more neutral (or more forceful) language than another. What is clear here is that recruiting a more diverse workforce for the generation of image metadata should, in turn, result in a richer set of image descriptions.

6 Summary and implications

Our results demonstrate that crowdworkers are not a homogeneous group of people, even when they are recruited from within the same country, as in our current study. Their diverse characteristics, and the quality of the tasks they perform, should be taken into account when assigning crowdworkers to specific tasks. For instance, we found that the gender and age of crowdworkers are correlated to the affective content of words used in the tagging task, but not to workers’ pain arousal during the task. To increase the quality of crowdsourcing work, we believe that the nature of the task should be understood clearly, and we suggest that a matching algorithm could be used to pair tasks with the most relevant workers based on their profiles, as sketched below.
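As a purely hypothetical illustration of such a matching step (the profile fields, weights, and function names below are our own assumptions, not a validated model):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WorkerProfile:            # hypothetical profile fields
    worker_id: str
    empathic_concern: float     # e.g., IRI EC score on a 0-4 scale
    ethnic_identity: float      # e.g., MEIM identity-search score on a 1-4 scale

def suitability(profile: WorkerProfile) -> float:
    """Toy suitability score for an emotion labeling task, weighting the two
    traits our results found most indicative of reliable performance."""
    return 0.6 * (profile.empathic_concern / 4) + 0.4 * (profile.ethnic_identity / 4)

def assign(workers: List[WorkerProfile], k: int) -> List[WorkerProfile]:
    """Route the task to the k most suitable workers."""
    return sorted(workers, key=suitability, reverse=True)[:k]
```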

In addition, our findings concerning the correlations between worker demographics (in particular, ethnicity) and the affective content of the words they chose to use in their descriptions have implications for other types of tasks that are commonly crowdsourced. For instance, there is growing interest in using crowdsourcing platforms like MTurk to build resources for natural language processing, including word-level emotion association lexicons [46]. Given the known correlations between demographic characteristics and language use, researchers should carefully consider the nature of the human computation tasks that they assign to workers, as well as the characteristics of their workforce. As Law and von Ahn note [32], for many tasks, such as emotion detection and/or association, it may be more reasonable to aim for capturing “cultural truth” rather than “ground truth” (p. 26) in the resulting data sets.

It would, therefore, be helpful if crowd platforms considered including verified demographic information about workers without compromising their anonymity. Currently, apart from information such as how many tasks a worker has completed, and to what degree of accuracy (as determined by the task “requester”), we know very little about workers’ backgrounds. It would be useful if crowd platforms provided the researcher with basic demographic characteristics, such as gender, age, and ethnicity. At the same time, we found that for some tasks, demographic characteristics have no significant correlation to perceived performance; in some cases, workers’ personalities matter more than demographic profiles in producing good-quality crowd content. Given these findings, a main challenge lies in providing enough worker profile information while still maintaining the individual’s anonymity.

7 Conclusion

Our paper sheds light on how crowdworkers interpret emotions through computers, and questions the level of empathy that the crowd can feel from behind the screen. We also examined how crowd diversity is linked to task outcomes. From the results, it is clear that not all crowdworkers are the same, and that for certain tasks, we should consider the demographic and personality profiles behind the massive crowd task force, to avoid embarrassing and harmful consequences such as the misrecognition scenario we highlighted in the introduction of this paper. Inclusive design in UI/UX has become an established research/practice area in HCI. We believe that this notion of inclusivity should be extended to crowdsourcing, to design systems that genuinely “do good”.

Therefore, it would be interesting to expand the study to other countries, so that we can examine whether contextual characteristics beyond ethnicity might affect the empathic process. One of the limitations of Mechanical Turk is that it provides us primarily with workers who are currently living in the United States. In addition, it would be useful to study other types of personality characteristics. In our study, we focused on the trait of empathy in the visual pain emotion judgment process. It would be interesting to expand the current study to examine other personality traits and types (e.g., narcissism, psychopathy, and the Big Five). It is very likely that these will have an impact on the way crowdworkers assess painful emotions.

Furthermore, in this study, we focused only on still images depicting humans in pain and distress; as a result, much verbal and dynamic non-verbal information was not available to facilitate the pain emotion judgment process. The inclusion of such information may give the individual more confidence in emotion judgment. Therefore, in future studies, it would be interesting to study painful emotion judgment through video. Of course, to use crowdworkers to assess painful emotion through visual content, we need to consider the privacy implications of sending images or videos of individuals to the crowd. Future work can focus on how to obscure one’s identity while retaining the key facial and non-verbal characteristics that still allow crowdworkers to accurately classify negative and painful emotions.