1 Introduction

Nonverbal behavior in human communication has important functions of transmitting emotions and intentions in addition to verbal behavior [2]. This means that an embodied dialogue system should be able to express nonverbal behavior according to the utterance to communicate smoothly with the user [10, 28, 35]. Against such a background, researchers have focused on constructing automatic generation models of nonverbal behavior from speech and linguistic information. Among nonverbal behaviors, nodding of the head is very important for emphasizing speech, giving and receiving speech authority, giving feedback, expressing conversational engagement, and intention of starting to speak [12, 14, 31, 33, 34]. It has been shown that nodding improves the naturalness of avatars and dialog systems and promotes conversation.

Nodding accompanying an utterance has the effect of strengthening the persuasive power of speech and making it easier for the conversational partner to understand the content of the utterance [27]. Researchers have tried to generate nods during speaking from speech and natural language. In particular, they used several acoustic features, such as sound pressure and prosody, for generating nods [1, 4, 6, 10, 24, 25, 37]. However, it has been difficult to accurately generate nods at an appropriate time according to an utterance from only speech.

A few studies have tackled the problem of generating nods from natural language. These studies focused on the final word of each phrase in an utterance and analyzed its co-occurrence with nods. They found that morphemes related to interjections, feedback, questions, and conjunctions appearing in turn-keeping [8, 9] tend to co-occur with nods. On the basis of this finding, a simple automatic nod-generation model was proposed [10, 32]. Humanoid robots and avatars that generated nods with this model were found to give a more natural impression. A model that generates nods more accurately should therefore lead to smoother communication between a dialog system and its user, and building one requires clarifying how more detailed language information relates to nodding. It is also known that the strength of the relationship between speech features and nodding depends on the language; for instance, it is weaker in Japanese [8]. A detailed examination of a nod-generation model that uses language information is thus considered important.

In this research, we constructed a highly accurate head-nod-generation model for natural Japanese language by focusing on several types of linguistic information obtained through text analysis, namely dialog act, part of speech, a large-scale Japanese thesaurus, and word position in a sentence, whose relationship to nodding has not been investigated. A dialogue act is information indicating the intention of a speaker in a whole utterance, and it is believed that the occurrence of nods changes in accordance with this intention. We hypothesized that words in phrases other than the final phrase, and the lexical content of utterances, have strong relationships with head nodding.

We collected a corpus consisting of 24 Japanese dialogues including utterances and head-nod information. We then used the corpus to create a model that generates a nod during a phrase by using dialog act, part of speech, a large-scale Japanese thesaurus, and word position in a sentence in addition to bag of words. The results indicate that our model using dialog act, part of speech, the large-scale Japanese thesaurus, and word position outperformed both a model using only bag of words and the chance level, which suggests that these types of information are useful for generating nods. Moreover, the model using all types of language information had the highest performance. This result indicates that several types of linguistic information have the potential to be strong predictors with which to generate nods automatically.

Fig. 1. Photograph of two participants having dialogue

2 Corpus

To collect a Japanese conversation corpus including verbal and nonverbal behaviors for generating nods in dialogue, we recorded 24 face-to-face two-person conversations (12 groups of two different people). The participants were Japanese males and females in their 20s to 50s who had never met before. They sat facing each other (Fig. 1). To gather more data on nodding accompanying utterances, we chose the explanation of an animation that participants had not seen before as the conversational content. Before the dialogue, they watched a popular cartoon, “Tom & Jerry,” in which the characters do not speak. In each dialogue, one participant explained the content of the cartoon to the conversational partner within ten minutes. At any time during this period, the partner could freely ask questions about the content.

We recorded the participants’ voices with a pin microphone attached to the chest and videoed the entire discussion. We also took bust (chest, shoulders, and head) shots of each participant (recorded at 30 Hz). In each dialogue, the data on the utterances and nodding behaviors of the person explaining the cartoon were collected in the first half of the ten-minute period (120 min in total) as follows.

  • Utterances: We built utterance units using the inter-pausal unit (IPU) [26]. The utterance intervals were manually extracted from the speech wave. A portion of an utterance followed by 200 ms of silence was used as the unit of one utterance. We collected 2965 IPUs. Moreover, we used J-tag [5], a general morphological analysis tool for Japanese, to divide each IPU into phrases. We collected 11877 phrases in total.

  • Head nod: A head nod is a gesture in which the head is tilted in alternating up and down arcs along the sagittal plane. A skilled annotator annotated the nods by using bust/head and overhead views in each frame of the videos. We regarded nodding continuously in time as one nod event.

  • Gaze: The participants wore a glasses-type eye tracker (Tobii Glass2). The gaze target of the participants and the pupil diameter were measured at 30 Hz.

  • Hand gesture and body posture: The participants’ body movements, such as hand gestures, upper body, and leg movements, were measured with a motion capture device (Xsens MVN) at 240 Hz.
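The IPU rule in the list above (a stretch of speech closed by 200 ms of silence) can be sketched as follows. This is a minimal illustration, not the procedure used to build the corpus: the interval format, the function names `merge_into_units` and `unit_has_nod`, and the rule that a unit is labeled as containing a nod when a nod event overlaps it in time are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms); hypothetical format


def merge_into_units(intervals: List[Interval], max_gap_ms: int) -> List[Interval]:
    """Merge speech intervals separated by less than max_gap_ms of silence."""
    units: List[Interval] = []
    for start, end in sorted(intervals):
        if units and start - units[-1][1] < max_gap_ms:
            # Silence is shorter than the threshold: extend the current unit.
            prev_start, prev_end = units[-1]
            units[-1] = (prev_start, max(prev_end, end))
        else:
            # Silence of at least max_gap_ms: start a new unit.
            units.append((start, end))
    return units


def unit_has_nod(unit: Interval, nods: List[Interval]) -> bool:
    """Assumed labeling rule: a unit has a nod if any nod event overlaps it."""
    u_start, u_end = unit
    return any(n_start < u_end and n_end > u_start for n_start, n_end in nods)


# Toy speech intervals and nod events (invented values, in milliseconds).
speech = [(0, 800), (900, 1500), (1800, 2400), (3000, 3500)]
nods = [(1400, 1900), (3600, 3900)]

ipus = merge_into_units(speech, max_gap_ms=200)
labels = [unit_has_nod(u, nods) for u in ipus]
```

Working in integer milliseconds avoids floating-point comparisons at the 200 ms boundary.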

All verbal and nonverbal behavior data were integrated at 30 Hz for display using the ELAN viewer [36]. This viewer enabled us to annotate the multimodal data frame-by-frame and observe the data intuitively. In this research, we only handled utterance and head-nod data in the corpus we constructed. Nods occurred in 1601 out of the 2965 IPUs.

3 Head-Nod-Generation Model

The goal of our research was to demonstrate that bag of words, dialog acts, parts of speech, a large-scale Japanese thesaurus, and word position in a sentence are useful for generating nods. We compared our proposed models, which generate nods from several types of linguistic information, with the previously constructed estimation model using only the final word at the end of an utterance [8, 9]. We also constructed an estimation model using all types of linguistic information to evaluate the effectiveness of this fusion (All model). The feature values of linguistic information for each phrase were as follows.

  • Length of phrase (LP): Number of characters in a phrase.

  • Word position (WP): Word position in a sentence.

  • Bag of words (BW): Interjections related to feedback (e.g., “en”, “ee”, “aa”, “hi”) and particles related to questioning and turn-keeping (e.g., “de”, “kara”, “kedo”, “kana”, “janai”) that co-occur with nods were used for estimating nodding in the previous studies [8, 9]. To handle more generic word information in addition to these, we used the number of occurrences of all words rather than of selected morphemes only. We used J-tag [5], a general morphological analysis tool for Japanese.

  • Dialogue act (DA): A dialogue act was extracted using an estimation technique for Japanese [7, 29]. The technique can estimate a dialogue act using the word N-grams, semantic categories (obtained from a Japanese thesaurus Goi-Taikei), and character N-grams. The dialog acts and number of IPUs are listed in Table 1.

  • Part of speech (PS): Number of occurrences of parts of speech of words in a phrase. We used J-tag [5] to extract part-of-speech information.

  • Large-scale Japanese thesaurus (LT): The large-scale Japanese thesaurus is a large lexical database of Japanese. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
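As a rough illustration of how some of the per-phrase features listed above could be assembled, the sketch below computes LP, WP, BW, and PS from a phrase that has already been split into words. The `(surface, part-of-speech)` token format, the function name `phrase_features`, the feature-key naming, and the use of the phrase index within the utterance as WP are assumptions for this example; they do not reproduce the output of J-tag. DA and LT are omitted because they depend on external resources.

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); hypothetical format


def phrase_features(phrase: List[Token], position: int) -> Dict[str, int]:
    """Build a sparse feature dictionary for one phrase."""
    surface = "".join(word for word, _ in phrase)
    features: Dict[str, int] = {
        "LP": len(surface),  # length of phrase in characters
        "WP": position,      # assumed: index of the phrase in the utterance
    }
    # BW: number of occurrences of each word in the phrase.
    for word, count in Counter(word for word, _ in phrase).items():
        features[f"BW={word}"] = count
    # PS: number of occurrences of each part of speech in the phrase.
    for pos, count in Counter(pos for _, pos in phrase).items():
        features[f"PS={pos}"] = count
    return features


# Toy utterance of two phrases, romanized for readability (invented example).
utterance = [
    [("neko", "noun"), ("ga", "particle")],
    [("nigeta", "verb"), ("kara", "particle")],
]
rows = [phrase_features(phrase, i) for i, phrase in enumerate(utterance)]
```

Because the example is romanized, the LP values here count Latin characters; on Japanese script the character counts would differ.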

We constructed the nod-generation models by using J48, which implements a decision tree in Weka [3], and evaluated the accuracy of the models and the effectiveness of each type of linguistic information. The class was a binary value indicating whether a nod occurred.
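J48 is Weka's implementation of the C4.5 decision-tree algorithm, which chooses splits using the gain ratio. The following sketch shows that criterion for a single discrete feature and a binary nod label. It is a simplified illustration under stated assumptions: the function names and toy data are invented, and numeric thresholds, pruning, and the other details of the actual learner are left out.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(values: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Information gain of splitting on `feature`, normalized by split information."""
    total = len(labels)
    groups: Dict[int, List[int]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Entropy of the feature itself penalizes splits into many small groups.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: 1 = nod occurred in the phrase, 0 = no nod (invented values).
nod = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # separates the classes perfectly
uninformative = [1, 0, 1, 0]  # carries no information about the class

best = gain_ratio(informative, nod)
worst = gain_ratio(uninformative, nod)
```

A feature that separates nod from non-nod phrases perfectly receives the maximum score, while one that is independent of the label scores zero.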

Table 1. Dialogue act labels
Table 2. Evaluation result of generation models.

We used 24-fold cross validation with a leave-one-person-out technique on the data for the 24 participants. That is, we evaluated how well a participant’s nods could be estimated with an estimator trained only on data from the other people. As shown in Table 2, the performance of each model using only LP, WP, DA, PS, or LT was higher than the chance level. However, the model using only BW, with an F-score of 0.423, performed below the chance level. This suggests that BW, which was used in previous studies [8, 9], was not useful for generating nods in our experiment. The model using LT had the highest performance among the models using only LP, WP, BW, DA, PS, or LT, with an F-score of 0.588. This suggests that LT is the most useful single feature for generating nods. In addition, the performance of the model using all information was higher than that of the model using LT. This suggests that using several types of language information is useful for generating nods.
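The leave-one-person-out protocol described above can be sketched as follows. This is an illustration only: the sample format, the function names, and the toy data are assumptions, a trivial majority-vote rule stands in for the decision tree, and the F-score is computed for the nod class, which may differ from the exact metric definition behind Table 2.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, int, int]  # (speaker id, feature value, nod label 0/1)


def leave_one_person_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held-out speaker, training samples, test samples) once per speaker."""
    for held_out in sorted({speaker for speaker, _, _ in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def fit_majority(train: List[Sample]) -> Dict[int, int]:
    """Toy stand-in for the classifier: majority nod label per feature value."""
    by_value = defaultdict(list)
    for _, value, label in train:
        by_value[value].append(label)
    return {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}


def f_score(gold: List[int], pred: List[int]) -> float:
    """F-score of the positive (nod) class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for three speakers (invented values).
samples = [
    ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 0), ("B", 1, 0),
    ("C", 0, 0), ("C", 1, 1),
]

scores: Dict[str, float] = {}
for speaker, train, test in leave_one_person_out(samples):
    model = fit_majority(train)  # trained without the held-out speaker
    gold = [label for _, _, label in test]
    pred = [model.get(value, 0) for _, value, _ in test]
    scores[speaker] = f_score(gold, pred)
```

Splitting by speaker rather than by phrase keeps every phrase of the held-out person out of training, so each score reflects generalization to an unseen speaker.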

4 Discussion

The experimental results indicate that BW is not useful for generating nods. A likely reason is that the amount of data is not large: the frequency of each word in the training data is too small for the model to learn well. Because it is costly and difficult to collect a massive amount of multimodal data, BW is unlikely to be effective in practice. On the other hand, LT was the most effective, presumably because it classifies words into higher-level categories rather than treating each word separately, so it may be learned well even with a relatively small amount of data. The model combining all linguistic information performed best, which suggests that using several types of language information has the potential to generate nonverbal behaviors.

In this research, we used language information extracted from a unit of phrase and tried to determine whether nodding occurs in the phrase. We did not consider the time-sequential information as a feature. We plan to focus on time-sequential linguistic information to generate nods. Furthermore, we would like to work on constructing a model that can generate the detailed parameters of nods such as number and depth.

5 Conclusion

We constructed a highly accurate head-nod-generation model using natural Japanese language. In this research, we focused on several types of linguistic information obtained through text analysis, such as dialog acts, parts of speech, a large-scale Japanese thesaurus, and word positions in sentences. In an experiment, we found that our estimation model using these types of information outperformed a model using bag-of-words information alone. We also found that the model using all types of linguistic information was the most useful for generating nods. These results indicate that several types of linguistic information have the potential to be strong predictors for generating nods automatically.

In the future, we will focus on time-sequential linguistic information to generate nods. We would like to work on constructing a model that can generate the detailed parameters of nods such as number and depth. Furthermore, we plan to construct a model for generating the occurrence timing of nods within an utterance and a model for generating other nonverbal behaviors such as gaze, which is important for turn management [11,12,13, 19,20,21,22] and for expressing conversational engagement [15,16,17,18, 30], and body posture.