1 Introduction

The relationship between the gestures (see, e.g., [2]) of embodied conversational agents (ECAs) [1] and rapport (see, e.g., [3, 4]) is an active research question. While some studies have reported an effect of gesture amplitude on users’ perception of extraversion [5], other studies reported that gesture amplitude may not affect users’ perception of rapport [6]. This disparity suggests that the naturalness of ECAs’ gestures may be a significant factor in shaping users’ perceptions. Indeed, not all ECAs have the same level of naturalness of behavior.

The building of human-ECA rapport is increasingly important as ECAs take on more meaningful roles, including serving as a means of diagnosing PTSD [7]. SimSensei, the PTSD-diagnosis agent, while having excellent animation of facial features, appeared to have a static torso, which might give the impression that she is holding her breath or not breathing between her utterances. Therefore, we sought to answer the question of whether adding naturalness, in this case for breathing, would lead to higher perceptions of rapport.

To this end, we built an application that would enable us to learn whether adding breathing behaviors to a similar agent would lead humans to perceive the agent as more natural and to develop a higher level of rapport with the agent. The application we developed, called Paola Chat, featured an ECA named Paola that resembled SimSensei but was able to display naturalistic breathing animations. The animations were based on a simple model of respiration and an empirical study of the perceived naturalism of breathing amplitude. The application enabled study of whether users’ perceptions of the ECA’s naturalness would increase based on the varying frequency and amplitude of the ECA’s breathing during the conversation.

2 Implementation

This study comprised two phases: a preliminary phase in which we sought to avoid the possible effects non-standard amplitude could have on perceptions of naturalness (cf., [6]) and a second phase in which we assessed the effect of breathing on users’ perceptions of rapport.

2.1 Phase 1: Naturalistic Amplitude

In the first phase, we sought to find the gesture amplitude that users would perceive as most natural (cf. [6]). To provide the Paola Chat agent with a breathing amplitude that would be as natural as possible, we conducted an empirical evaluation of human perceptions of the agent with different amplitudes for the breathing animation. We prepared five brief animations of the ECA’s breathing, ranging from static (not breathing) to exaggerated breathing. We sought to have the agent show breathing with amplitude large enough to be salient but not so large as to appear unnatural or distracting. Each participant viewed the five representative animations as an introduction to the task. Each participant then saw and rated 30 animations for naturalness, presented in random order, using a five-point Likert scale.

Unsurprisingly, we found that the animation rated the most natural was the medium-amplitude animation, whose amplitude was 50% of the extreme. Figure 1 compares Paola’s breathing at the last point of inhale (just before the transition to exhale) with Paola’s breathing at the last point of exhale (just before the transition to inhale) at this medium amplitude; the differences apparent in this static figure are subtle.

Fig. 1. Paola’s inhale before transition (left) to exhale (right).

2.2 Phase 2: Effect on Perception of Rapport

With the amplitude determined, in the study’s second phase we assessed the effect of breathing on the users’ perceptions of rapport. To this end, we implemented Paola Chat, which is a fully automated conversational agent rather than a Wizard-of-Oz system. We designed Paola to resemble SimSensei as much as possible. Paola, displayed as a life-sized person projected on the wall in UTEP’s immersion laboratory, was seated in a large chair, with her legs crossed and her hands resting on her lap, resting on the chair’s armrests, or making gestures while speaking (see Fig. 2). Users’ interactions with Paola consisted of two back-to-back conversations on the topics of vacations and movies. For example, in the movie conversation, Paola asked “Have you seen any movies lately?” Paola would then interpret the user’s response using keyword recognition (i.e., if the user answered that he or she had, Paola would ask for details about the movie, but if the user had responded in the negative, Paola would segue to another question about the user’s favorite movie).

Fig. 2. A person interacts with the Paola Chat agent.

Developing this application required accepting a wide range of dialogue input and generating relevant responses. Paola Chat was developed with the UTEP AGENT system [8], which is capable of accepting a wide range of utterances, called wildcards, irrespective of topic or word choice. A major difficulty was that Paola’s utterances needed to elicit extended responses from the users so that the users could observe Paola both while she was talking and while she was listening. Consequently, the system had to generate responses generic enough to keep the conversation from seeming one-sided or disconnected. Some questions were posed as “yes” or “no” questions, in which case the dialogue tree would converge back to a certain point to keep the flow of Paola’s utterances natural.

Breathing Model.

The agent needed a function to control the breathing with respect to three constraints. First, the agent should not be speaking while displaying an inhaling animation. Second, the agent should appear to take breaths of natural length between utterances. Third, the agent’s frequency (in addition to amplitude) of breathing should be perceived as natural.

Breathing includes the states of inhaling and exhaling, and the transitions between them [8]. Our system required a model that could represent these states smoothly in terms of amplitude, oscillation, and frequency. In our model, the breathing state depends on the amplitude and oscillation. The amplitude represents the y-value on a graph and visually represents how much the ECA’s torso would expand. The oscillation represents the x-value on a graph in radians. The oscillating state (i.e., the wave) moves depending on the frame rate of the animation and the frequency per frame. The frequency was set to a fixed value per frame, adjusted for frame drops because Unity sometimes skips frames. During the interaction, the breathing state varied between 0 and 100, as an effect of the sine wave function:

$$ \text{Breathing State} = \text{Amplitude} + \bigl(\text{Amplitude} \times \sin(\pi \times \text{Frequency})\bigr) $$
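As a minimal sketch of this function (written in Python rather than in the Unity code used in the study, which is not shown here), the breathing state can be computed from the amplitude and the oscillation value. The names `breathing_state`, `advance_oscillation`, `step_per_frame`, and `nominal_fps` are assumptions for illustration, as is the particular way of compensating for skipped frames by scaling the per-frame step with the elapsed time.

```python
import math


def breathing_state(amplitude: float, x: float) -> float:
    """Breathing State = Amplitude + Amplitude * sin(pi * x).

    `x` plays the role of the Frequency term in the equation.
    """
    return amplitude + amplitude * math.sin(math.pi * x)


def advance_oscillation(x: float, step_per_frame: float,
                        elapsed_seconds: float,
                        nominal_fps: float = 60.0) -> float:
    """Advance the oscillation by a fixed step per nominal frame.

    Assumption: a dropped frame is compensated by counting how many
    nominal frames elapsed since the last update.
    """
    frames_elapsed = elapsed_seconds * nominal_fps
    return x + step_per_frame * frames_elapsed


# With an amplitude of 50, the state stays between 0 and 100 over a full cycle.
states = [breathing_state(50.0, i / 10) for i in range(21)]
```

With this compensation, an update that spans two nominal frames advances the oscillation by two steps, so the breathing rate stays constant in wall-clock time even when frames are skipped.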

The breathing oscillation function ran in a cycle in which the agent would either inhale (starting from breathing state = 0) or exhale (starting from breathing state = 100). The cycle was interrupted only when the agent was about to speak, to portray an in-breath before speech. Figure 3 shows the breathing oscillation function: the x-axis is the oscillation of sin(πx), where x is the frequency, and the y-axis is the change in the wave function for the values of the updating breathing state, with 0 as the lowest point of exhale and 100 as the highest point of inhale. Figure 4 depicts the transitions between the states in the breathing oscillation function.

Fig. 3. Breathing oscillation function. The x-axis is the oscillation of sin(πx), where x is the frequency, and the y-axis is the change in the wave function for the values of the updating breathing state, with 0 as the lowest point of exhale and 100 as the highest point of inhale.

Fig. 4. The breathing cycles represented as a state diagram.
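The cycle and its interruption before speech can be sketched as a small state machine. This is an illustration under stated assumptions, not the study's implementation: the transitions are inferred from the description in the text, the class and method names (`BreathingCycle`, `request_speech`, `update`) are hypothetical, and a linear ramp stands in for the sine wave so that the state logic stays easy to follow.

```python
from enum import Enum


class Phase(Enum):
    INHALE = "inhale"
    EXHALE = "exhale"


class BreathingCycle:
    """Idle inhale/exhale cycle with an in-breath before speech."""

    def __init__(self, step: int = 10):
        self.level = 0              # 0 = lowest point of exhale, 100 = highest point of inhale
        self.phase = Phase.INHALE
        self.step = step            # hypothetical change in level per update
        self.speech_pending = False

    def request_speech(self) -> None:
        """Interrupt the idle cycle so the agent inhales before speaking."""
        self.speech_pending = True
        self.phase = Phase.INHALE

    def update(self) -> bool:
        """Advance one tick; return True when the agent may start speaking."""
        if self.phase is Phase.INHALE:
            self.level = min(100, self.level + self.step)
            if self.level == 100:
                # The inhale is complete, so speech can begin on the exhale.
                self.phase = Phase.EXHALE
                if self.speech_pending:
                    self.speech_pending = False
                    return True
        else:
            self.level = max(0, self.level - self.step)
            if self.level == 0:
                self.phase = Phase.INHALE
        return False
```

Because `update` only signals that speech may start at the moment the inhale completes, this sketch respects the first constraint above: the agent never speaks while an inhaling animation is displayed.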

Application Dialog.

We deployed Paola Chat’s breathing model in a pair of conversations about vacations and movies. The length of the conversations ranged from five to seven minutes, depending on user responses.

The UTEP AGENT system [9], in which Paola Chat was implemented, interfaces with the Unity game engine to automate features during the interaction, i.e., generating dialogue, handling breathing and other gesture animations, cycling through the states of breathing, and recognizing user input during conversation.

In the first conversation, Paola greeted the participant and began conversing on the topic of either vacations or movies; the order of the topics alternated as a part of the experimental design. Paola would occasionally ask questions where the participant’s utterance would be either treated as a wildcard (where the content did not matter) or, based on keyword recognition, would trigger an appropriate response.

Table 1 shows examples of responses to questions asked by Paola and follow-up questions Paola asked during the interactions. For example, Paola would ask “Have you seen any movies lately?” If the participant responded that he or she had, then Paola would ask for more details about that movie. If, however, the participant responded no, then Paola would instead say “It’s okay! Tell me about your favorite movie, then. What is your favorite?”

Table 1. Responses and follow-up questions.
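The branching between keyword-recognized and wildcard responses can be sketched as follows. This is a hedged illustration, not the UTEP AGENT system's actual interface: the keyword sets, the function names `classify` and `follow_up`, and the generic wildcard reply are hypothetical, and pairing “What was that movie about?” with the affirmative branch is an assumption. The negative-branch reply is the one quoted above.

```python
import re

# Hypothetical keyword sets; the study's actual key phrases are not listed.
YES_KEYWORDS = {"yes", "yeah", "yep"}
NO_KEYWORDS = {"no", "nope", "not"}

FOLLOW_UPS = {
    # Assumed follow-up asking for details about the movie.
    "yes": "What was that movie about?",
    "no": "It's okay! Tell me about your favorite movie, then. What is your favorite?",
    # Hypothetical generic reply used when the content does not matter.
    "wildcard": "Tell me more about that.",
}


def classify(utterance: str) -> str:
    """Label an utterance as 'yes', 'no', or 'wildcard' by whole-word keywords."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    if words & NO_KEYWORDS:
        return "no"
    if words & YES_KEYWORDS:
        return "yes"
    return "wildcard"


def follow_up(utterance: str) -> str:
    """Choose Paola's next utterance for a reply to 'Have you seen any movies lately?'"""
    return FOLLOW_UPS[classify(utterance)]
```

Checking the negative keywords first means that a reply such as “Not really” is routed to the favorite-movie question, while any utterance without a keyword falls through to the wildcard branch regardless of its length.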

Empirical Evaluation.

We used the Paola Chat application to evaluate users’ perceptions of naturalness of the agent’s breathing. The study was a within-subjects design in which one of the conversations had the ECA with the breathing animations and the other conversation had the ECA that did not use the breathing animations. The design was balanced for order of breathing/non-breathing and for order of conversation topic. After each conversation, participants were asked to complete a seven-point Likert-scale survey of naturalness, rapport, and social presence.
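A balanced design of this kind has four orderings (two condition orders crossed with two topic orders). The sketch below shows one way to distribute participants across them. The round-robin assignment and the name `assign_orders` are assumptions, since the text does not describe how participants were allocated.

```python
from collections import Counter
from itertools import product

CONDITION_ORDERS = [("breathing", "non-breathing"), ("non-breathing", "breathing")]
TOPIC_ORDERS = [("vacations", "movies"), ("movies", "vacations")]


def assign_orders(n_participants: int):
    """Cycle through the four (condition order, topic order) cells in turn."""
    cells = list(product(CONDITION_ORDERS, TOPIC_ORDERS))
    return [cells[i % len(cells)] for i in range(n_participants)]


# Example: distribute 62 participants over the four cells.
schedule = assign_orders(62)
cell_counts = Counter(schedule)
```

Since 62 is not a multiple of four, a round-robin assignment leaves two cells with 16 participants and two with 15, which is as balanced as that number of participants allows.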

A total of 62 participants interacted with Paola. The participants were college students, mostly aged 18–25 (about 85%; the remaining 15% were under age 30); 73% were male, and 68% were native speakers of English. Of the participants, 21 identified as first-year college students, 11 as second-year college students, 13 as third-year college students, and 7 as fourth-year college students. The remaining 10 participants were in their fifth year of study or above.

Before the interaction, each participant was asked to complete a demographic survey. Each also signed a permission to be video-recorded during the interaction. Participants were seated in front of a wall where Paola was projected (see Fig. 2). The two conversations each lasted about five to seven minutes, with the exact length of each interaction depending on the user’s responses to Paola’s questions.

After the first conversation, users were asked to complete a survey on the interaction. The interaction continued with a conversation on the other topic. The session would conclude with a final survey. Table 2 displays the 18-question survey participants completed; responses were entered on a 7-point Likert scale measuring users’ perceptions of naturalness, rapport, and social presence, as used, for example, in [4, 10].

Table 2. Pre-interaction and post-interaction survey questions.

3 Results

The study’s results suggested that breathing did not affect the agent’s perceived naturalness. Table 3 displays the average scores for rapport, naturalness, and social presence. The average rapport scores in the experimental and control conditions were normally distributed (Anderson-Darling test, p = 0.33 and p = 0.26, respectively), but the absolute difference in average scores was small (4.08 − 3.78 = 0.30, on a scale from 1 to 7), and a t-test was not significant (p = 0.76). Similarly, the t-tests for naturalness and social presence were not significant (both p = 0.69).

Table 3. Naturalness, rapport, and social presence on both experimental and control conditions.
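For readers who want to see the arithmetic behind such a comparison, the sketch below computes a paired t statistic from per-participant scores in two conditions. It rests on two assumptions: that the t-test was paired (the text does not say, although the within-subjects design would allow it), and that scores are available per participant. The scores in the example are placeholders, not the study's data, and the function name `paired_t_statistic` is hypothetical.

```python
import math
import statistics


def paired_t_statistic(condition_a, condition_b):
    """Return (t, degrees of freedom) for paired per-participant scores."""
    if len(condition_a) != len(condition_b):
        raise ValueError("each participant needs a score in both conditions")
    differences = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(differences)
    mean_difference = statistics.mean(differences)
    # Standard error of the mean difference (sample standard deviation / sqrt(n)).
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return mean_difference / standard_error, n - 1


# Placeholder scores for six imaginary participants, NOT the study's data.
breathing = [4, 5, 3, 4, 4, 5]
control = [4, 4, 3, 5, 3, 4]
t_value, dof = paired_t_statistic(breathing, control)
```

Converting the statistic to a p-value requires the t distribution with `dof` degrees of freedom, which a statistics package can supply. A small t relative to its degrees of freedom corresponds to a non-significant difference of the kind reported above.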

Because the design of the experiment allowed for the participants to watch Paola as she spoke or listened, the design of Paola’s utterances required an emphasis on the eliciting questions. It was important to elicit longer, more thoughtful responses than simple affirmations or negations. Table 4 shows questions asked by Paola that elicited longer responses from the human participants.

Table 4. Responses to questions.

In responding to questions about specifics, some users chose to respond briefly, with short responses such as “Oh, it was just winter break,” while others gave utterances of over ten words, as shown in Table 4. One of Paola’s questions, about things to do in dream destinations, generated longer utterances, but these utterances tended to be more general in tone, with fewer specifics. For example, users gave answers such as “relaxing” at or “exploring” their dream destinations.

As expected, some users responded to questions requiring specific knowledge (e.g., “Can you think of any other artists that go back and forth with movies and music?”) with expanded statements, while other questions generated one-word responses. Responses to general questions (e.g., “What is the one thing you would want to do for fun in your dream destination?”) generated more thoughtful responses from the users, and therefore utterances with a higher word count.

Inviting users to speak about their experiences or preferences produced longer utterances, too. In both topics, users responded to questions about their favorite movie scenes or dream vacations with utterances longer than six words. This was also reflected in user responses to clarifying statements by Paola, for instance, when she asked “Tell me about the vacation” or “What was that movie about?”

4 Conclusion

When we first saw SimSensei, the PTSD-diagnosis agent, we noticed that she appeared not to be breathing. This led us to develop the Paola Chat application, which we then used for a perception study of naturalness of breathing in the agent. Our results suggested that breathing did not actually affect perceived naturalness, rapport, or presence.

Despite our expectation that increased naturalness from breathing would lead users to report greater rapport in the breathing condition than in the not-breathing condition, the study’s results suggest that animation of breathing appears to neither increase nor decrease these perceptions. Of course, while breathing by an ECA does not increase naturalness, neither does it detract from naturalness. This suggests that ECAs using a breathing model similar to that of the Paola Chat agent can be at least as natural as a non-breathing agent. When combined with other features implemented, such as generating responses not only relevant to the topic but also relevant to the utterance to which Paola responds (see Table 1), the perceived social presence of ECAs could be increased.

So why did we perceive that SimSensei was not breathing, when the participants in our study did not notice the breathing in the Paola Chat agent? One of the differences between the two agents was the amount of dialog produced by the agent. SimSensei was mainly listening to the person with whom it interacted (except for some occasional questions, nods, and hand movements), while Paola was more conversational and contributed substantively to the conversation.

A second factor may be that the Paola Chat agent’s breathing was displayed only visually and not auditorily. If someone is about to speak, you can sometimes hear the inhalation.

A limitation of this study is that the Paola Chat application was fully automated rather than Wizard-of-Oz. That is, the application generated utterances as output to users and accepted responses as input independent of a human acting behind the scenes. The dialog models were developed beforehand, where responses could be either short responses recognized by key phrases (converging to a previously designed point of the conversation) or wildcard responses (which followed the flow of the conversation more generically while being able to handle longer utterances regardless of keywords). This technology, though, constrains the agent’s conversational responsiveness.

Though Paola displayed animated gestures while talking, nodding, and now breathing, the application did not include more than nominal models of gaze and head movement. Adding these kinds of animations to an otherwise static embodied conversational agent might provide an even more humanlike appearance. These features could provide an improvement in perceived naturalness because breathing affects not only the torso and neck but also shoulder movement and even the timing of dialog generation.