How Much Should a Robot Trust the User Feedback? Analyzing the Impact of Verbal Answers in Active Learning
This paper assesses how the accuracy of the user's answers influences the learning of a social robot trained to recognize poses using Active Learning. We compare the performance of a robot trained to recognize the same poses actively and passively, and we show that users sometimes give simplistic answers that have a negative impact on the robot's learning. To reduce this effect, we propose a method based on lowering the trust placed in the user's responses. Experiments with 24 users indicate that our method maintains the benefits of AL even when the user's answers are not accurate. With this method, the robot incorporates domain knowledge from the users while mitigating the impact of low-quality answers.
Keywords: Active Learning · Automatic Speech Recognition · Social Robot · Learning Space · Passive Learning
1 Introduction

Recent studies in robotics have started to incorporate ideas from Active Learning (AL). With this kind of learning, robots are able to mimic how humans learn: first by observing their teacher, and then by asking questions when they have doubts about the concept to be learnt or the examples they have seen. AL comes from the Machine Learning field and was introduced by Angluin. Whilst in Passive Learning (PL) the teacher provides the examples to the learner and labels them, in Active Learning it is the learner who takes the initiative by posing queries to the teacher or oracle. The use of AL in robotics has three main motivations when compared with PL. Firstly, active learners can potentially achieve better accuracy on the learned concepts. Secondly, AL may reduce the number of training examples needed to acquire a concept. This is especially relevant in robotics, since in interactive learning acquiring a training example might be time consuming. Finally, people seem to prefer training robots that learn actively over passive ones.
Two research trends can be distinguished in the field. In the first one, robots learn by self-exploring the environment, while in the second one, robots leverage Human-Robot Interaction (HRI) to learn from humans. This paper focuses on the second approach and is inspired mainly by the works of Rosenthal and Cakmak [5, 6]. Nevertheless, we attempt to go further by understanding how answers to different types of questions affect the robot's learning. Rosenthal explores how different questions affect the accuracy and correctness of the user's responses. Cakmak et al. studied how the robot was perceived when it showed three different degrees of interactivity when asking questions. In a related work, Cakmak also studied how humans ask questions when learning.
There is literature that assessed different types of queries for AL and that evaluated how to ask these questions to maximise the accuracy of the user's answers. However, we have not found evidence on how different types of queries affect the robot's learning performance. This paper explores the learning impact of different types of queries that seek information about the learning parameters in an interactive pose-learning task. To do so, we propose a method that consists of reducing the robot's confidence in the user's answers by including more parameters than the user indicated.
This paper is organised as follows. First, Sect. 2 describes our learning approach, where we apply AL to pose learning. After that, we present our experiment in Sect. 3, and the obtained results are shown in Sect. 4. The results are discussed along with the conclusions in Sect. 5.
2 Learning Scheme
When the user starts training the robot, she has to carry out two tasks. The first one is to adopt the pose she wants to teach the robot. The second is to tell the robot which pose she is standing in. Gathering the data from vision and from verbal interaction, the robot builds a dataset which feeds a learning algorithm. In this section we describe the components that participate in this learning process.
As visual input we use the depth data supplied by a Kinect RGB-D camera. We employ the OpenNI2 API to build a skeleton model composed of the positions and orientations of 15 of the user's joints (head, neck, torso, shoulders, elbows, hands, hips, knees and feet). This skeleton model is the input for our learning algorithm (\(x_i\)). The labels (\(y_i\)) of the dataset are gathered interactively during the training process using an Automatic Speech Recognition (ASR) system described in .
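As an illustration of this input representation, the sketch below flattens an OpenNI-style skeleton into a single feature vector. The joint names, their ordering, and the use of quaternions for orientation are our assumptions for the example, not the paper's exact implementation:

```python
# Illustrative sketch (not the paper's code): flatten a 15-joint skeleton
# into a feature vector x_i. Joint names/ordering are assumptions.
JOINTS = [
    "head", "neck", "torso",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_hand", "right_hand",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_foot", "right_foot",
]

def skeleton_to_features(skeleton):
    """skeleton: dict joint -> (position (x, y, z), orientation quaternion (x, y, z, w))."""
    features = []
    for joint in JOINTS:
        position, orientation = skeleton[joint]
        features.extend(position)     # 3 values per joint
        features.extend(orientation)  # 4 values per joint
    return features                   # 15 joints * 7 values = 105 features
```

Keeping each feature's originating limb known (e.g. via a parallel list of joint names) is what later allows whole limbs to be filtered out of \(x_i\).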
Once the training ends, the robot applies a learning algorithm, Random Forests , to the set of training instances. The algorithm is freely available through the Weka Framework . We use the default parameters provided by the framework, as these give good performance on a wide range of datasets. Further details about the training process can be found in .
2.1 Proposed Approach for Active Learning
We use AL to ask questions related to the poses to be learnt. Active robot learners can ask different types of questions. This paper focuses on Feature Queries, which try to find the features of the learning space that are most relevant for learning. Feature Queries are perceived by users as the smartest questions . These questions consist in asking the user whether a certain feature of the learning space is relevant or not (inspired by [7, 12]). In our approach, we ask the questions once the training session is over. At that moment, the robot asks the user which parts of her body have been the most important for each pose. For instance, if the user has taught the robot a pointing pose, it is expected that, when the robot asks for the most relevant features for this pose, the user's answer will indicate some part of her arm.
With the user's answers, the robot filters out all the features which have been reported as less relevant. Notice that in this paper we focus on the learning effects of filtering parameters due to AL, rather than on how users perceive the questions or which kinds of queries are most helpful. Regarding the questions themselves, we have taken into account that the user's responses can be affected by the way questions are asked . For that reason, we considered three types of questions: (i) Free Speech Queries (FSQ) are open questions which allow the user to answer freely (e.g. “Which is the most important limb in this pose?”); (ii) Yes/No Queries (YNQ) force the user to answer with a yes or no statement (e.g. “Is the hand important?”); and (iii) Rank Queries (RQ), in which the user must quantify the importance of a limb from not important to very important (e.g. “How important is your hand?”). Typically, answers to FSQ provide a single limb or a short list of limbs which the user considered important (e.g. “I suppose my arm”). Therefore, their use might be interesting when the robot does not know which limbs might be important. Conversely, YNQ and RQ force the user to answer only about the limb the robot asked for. Hence, these questions are better suited to retrieve information about a specific limb, that is, about a specific parameter or set of parameters in the learning space. The major drawback of FSQs is that their answers need to be parsed in order to map which limbs the user has talked about.
By thresholding the answers to these queries, we can build parameter filters that enable the robot to discard the data which is not relevant for learning. We built three different filters that can be used to pre-process the data before feeding it to the classifier, namely FSQF, YNQF, and RQF (the final F stands for Filter).
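A minimal sketch of this thresholding step for Rank Queries is shown below. The 4-point scale labels and the threshold value are assumptions chosen for illustration; the paper does not report its exact threshold:

```python
# Hypothetical sketch: build a limb filter from Rank Query answers and
# apply it to a feature vector. Scale labels and threshold are assumptions.
RQ_SCALE = {
    "not important at all": 0,
    "slightly important": 1,
    "important": 2,
    "very important": 3,
}

def build_rq_filter(answers, threshold=2):
    """answers: dict limb -> rating string. Keep limbs rated at or above threshold."""
    return {limb for limb, rating in answers.items()
            if RQ_SCALE[rating] >= threshold}

def apply_filter(features, feature_limbs, kept_limbs):
    """Drop features whose originating limb was filtered out.

    features: flat feature vector; feature_limbs: parallel list naming
    the limb that produced each feature.
    """
    return [v for v, limb in zip(features, feature_limbs) if limb in kept_limbs]
```

A YNQ filter is the special case where the rating is binary, which is consistent with the observation below that YNQ and RQ produced nearly identical filters under a strict threshold.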
Additionally, we created a fourth filter, the Extended Filter (EF), which is an extension of RQF. The EF is an RQF that includes all the limbs adjacent to those in a normal RQ Filter. For instance, if we have an RQF formed by \(\lbrace \)head, neck\(\rbrace \), its associated EF includes: \(\lbrace \)head, neck, left shoulder, right shoulder, torso\(\rbrace \). As shown in Sect. 4.2, the EF improves on RQF in the situations where, due to a low-quality answer from the user, the RQF (and other AL filters) are outperformed by Passive Learning. In such cases, the EF behaves as a Passive Learner, while, when the users provide good answers, the EF offers a learning performance comparable to the other AL filters.
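The EF can be sketched as growing an RQ filter with each kept limb's skeletal neighbours. The adjacency map below is our assumption, reconstructed from the head/neck example above and a standard 15-joint skeleton; the paper does not list it explicitly:

```python
# Sketch of the Extended Filter: grow an RQ filter with the limbs adjacent
# to each kept limb. The adjacency map is an assumption based on the
# paper's {head, neck} -> {head, neck, shoulders, torso} example.
ADJACENT = {
    "head": {"neck"},
    "neck": {"head", "left_shoulder", "right_shoulder", "torso"},
    "torso": {"neck", "left_shoulder", "right_shoulder", "left_hip", "right_hip"},
    "left_shoulder": {"neck", "torso", "left_elbow"},
    "right_shoulder": {"neck", "torso", "right_elbow"},
    "left_elbow": {"left_shoulder", "left_hand"},
    "right_elbow": {"right_shoulder", "right_hand"},
    "left_hand": {"left_elbow"},
    "right_hand": {"right_elbow"},
    "left_hip": {"torso", "left_knee"},
    "right_hip": {"torso", "right_knee"},
    "left_knee": {"left_hip", "left_foot"},
    "right_knee": {"right_hip", "right_foot"},
    "left_foot": {"left_knee"},
    "right_foot": {"right_knee"},
}

def extend_filter(rq_filter):
    """Return the RQ filter enlarged with every adjacent limb."""
    extended = set(rq_filter)
    for limb in rq_filter:
        extended |= ADJACENT.get(limb, set())
    return extended
```

Because each extension only adds features back, the EF can never discard more data than the RQF it extends, which is why it degrades towards Passive Learning rather than below it.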
3 Experiment

The aim of our experiment is to understand how the answers to FSQ, YNQ, and RQ affect robot learning. Accordingly, we prepared an experiment in which several users taught the robot different poses. After the pose-acquisition phase, users were asked several questions aimed at improving the robot's learning performance.
Although the learning session involved natural interaction between the user and the robot, we evaluated the effectiveness of the user's answers through offline questionnaires that the users filled in just after the training session. This was motivated by the aim of our experiment, which was to explore the uses of the different queries and which ones are better to ask, requiring us to pose many questions to the users. We had observed that users became fatigued when the robot asked that many questions within the same session.
The users trained the robot Maggie , which is equipped with a Kinect, an ASR  and a Text to Speech (TTS) system, coupled in a Natural Dialogue Management System , which enables the robot to carry out natural interactions. These components are integrated using the ROS  framework. Although most of these components are self-built, any robot equipped with a Kinect, a TTS and an ASR can replicate our experiments easily.
3.1 Experimental Setup and Method
The users were told to stand in front of Maggie. Then, the experimenter explained the experimental procedure and answered any doubts that arose. During training, users had the freedom to start, pause, and finish the training session whenever they wanted. The procedure was as follows: first, the user waited in front of the robot until it announced that it was ready to start. Then, she could start the training whenever she wanted. To record each pose, the user was told first to stand still in that particular pose and then to tell the robot the label for that pose. Before changing to another pose, the user had to tell the robot to stop recording the current one. The user interacted with the robot verbally. Once the robot told the user that it had finished recording the pose, she was free to move to the next pose and start the process again. The user finished the teaching session by telling the robot so. The session dynamics are similar to a previous work .
One dataset per experiment was recorded to feed the learning process. After having trained the robot, the users filled in a questionnaire with questions about the poses they had taught the robot. These questions were, first, FSQ, then YNQ, and finally RQ. All users had to respond to the three types of questions, although their answers were treated separately. The following examples of the questions are provided. FSQ: “What parts of your body do you think the robot has used to learn whether you were looking/pointing left, forward or to the right?”. Here, users were asked to write in an open-text box which limbs they considered most important in each experiment. YNQ: “What parts of your body do you believe the robot has used to learn in the first experiment?”. For this set of questions, the user had to fill in a multiple-choice list of the limbs described in Sect. 2. RQ: “Mark the importance of each of the parts of your body so that the robot can learn.”. RQs consisted of fifteen questions per experiment, each one asking for the importance of a single limb. The answers consisted of a 4-point scale rating the importance of a limb, ranging from Not important at all to Very important.
4 Results

4.1 Filters for Active Learning
When analyzing the users' answers to FSQ, we realized that their answers were sometimes too broad. This was the case for the answers to experiment 2, in which most users answered with my arm. There is no direct mapping between arm and any of the joints. Hence, we decided to include all the joints related to the limb, in this case hand, elbow, and shoulder. Moreover, no user indicated which arm they were referring to when they answered my arm. Thus, we decided that all uncertain answers covering several possible limbs would be mapped to include all of these limbs. For instance, in the case of the my arm answer, we included both of the user's arms.
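This mapping rule can be sketched as follows. The vocabulary of broad terms and the joint groups are our assumptions for illustration, following the rule above that an ambiguous answer such as "my arm" maps to every joint of both arms:

```python
# Hypothetical mapping from broad free-speech answers to skeleton joints.
# Term lists and joint groups are assumptions, not the paper's parser.
LIMB_GROUPS = {
    "arm": {"left_shoulder", "left_elbow", "left_hand",
            "right_shoulder", "right_elbow", "right_hand"},
    "leg": {"left_hip", "left_knee", "left_foot",
            "right_hip", "right_knee", "right_foot"},
    "head": {"head", "neck"},
}

def parse_fsq_answer(answer):
    """Return the set of joints covered by the broad terms in an FSQ answer."""
    joints = set()
    text = answer.lower()
    for term, group in LIMB_GROUPS.items():
        if term in text:
            joints |= group
    return joints
```

Under this rule, "I suppose my arm" yields all six joints of both arms, so an ambiguous answer deliberately errs on the side of keeping data.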
YNQ and RQ produced nearly the same filters, which might have occurred because the users had a list of limbs to choose from when answering. Nevertheless, although both graphs have similar shapes, there is a slight difference in the limbs that were not selected. In RQ, these limbs tended to be closer to the threshold than in YNQ. This effect is clearly shown, for instance, in Fig. 2, first row. These differences might have been produced because, although the user had the list of limbs in both cases, in RQ they were forced to give an answer for each limb. Two conclusions might be drawn from these effects: (i) if users are left to decide which limbs are important, they provide stricter filters than when asked about particular limbs; and (ii) RQ and YNQ produced the same filters because we were too strict in selecting our threshold. Had the threshold been lower, RQ would have included more limbs than YNQ. In that case, RQ could have been considered an extended version of YNQ.
This idea is what led us to create the Extended Filter presented in Sect. 2.1. Since the users' answers were too simplistic, we decided that the robot should not blindly trust them. However, since they were not completely wrong either, we opted to extend the answers by including more information than the users actually gave.
4.2 Learning Results
As described in Sect. 4.1, YNQ and RQ produced the same filters; therefore, from now on, we will only describe the RQ results. Figure 3 shows the results for the looking (left) and pointing (right) experiments. The figure shows how the filters behaved differently in the two experiments. In the Looking experiment (Fig. 3, left), RQ and FSQ performed worse than PL, except in the cases in which only one user trained the robot. However, the EF achieved an F-Score comparable to PL. A different situation occurred in the Pointing experiment (Fig. 3, right), where FSQ and RQ achieved better results than PL. Here the EF behaved as well as the other AL queries.
The lower performance of AL in the looking experiment might be caused by the users providing simplistic answers in this experiment, leading the robot to build filters that omitted relevant data. When the robot had few training examples, the users compensated for the lack of data by introducing relevant domain information. However, when the robot had gathered enough training examples, their answers prevented the robot from using information that could have been helpful for learning. We observed this when we checked the learning dataset and found that the data from the users' shoulders might have been relevant. Yet, the users omitted the shoulders when answering the queries, perhaps because the variations in their shoulders went unnoticed.
Because of that, we came up with the idea of lowering the trust in the user's responses and including more information than the user gives. This is exemplified by the EF, which included limbs adjacent to those in the user's answers. Note that although the EF seems tailored to our scenario, we believe that a similar approach can be followed in other scenarios where a strong correlation between learning parameters exists. In that sense, if a user produces a filter including some learning parameters, it might be interesting to include other correlated parameters as well, even if the user did not mention them. Some potential fields that can benefit from this approach are gesture recognition and object recognition, among others.
5 Conclusions

The main contribution of this paper is twofold. First, we evaluated how different types of Feature Queries affect the learning performance of a robot that learns actively. We found that, in some cases, the users' answers might be too simplistic, potentially reducing the robot's learning accuracy. In such cases, if the robot trusts the users' responses too much, Active Learning approaches might be outperformed by Passive Learning. The second contribution of this paper is a method in which the robot reduces its confidence in the users' answers by extending them to other related parameters of the learning space. This method has proven to keep the learning performance high even in cases where users did not provide accurate answers.
We tested our approach in an experiment in which 24 users trained a social robot to recognise poses. Users were asked three types of queries: Free Speech Queries, Yes/No Queries and Rank Queries. The answers to these queries were used to build feature filters that pre-processed the training data before it was fed to the learning algorithm. We found that, since RQs are more verbose, their filters tend to be more inclusive. However, it has been shown that users prefer not to be asked many questions ; therefore, it remains as future work to explore the optimal balance between a verbose robot and the quality of the filters. Along this line, we have found that FSQs have the advantage of being more natural, and they can be much more efficient than the other types of queries.
From our experiments we concluded that inaccurate verbal responses from users in AL may lead to the loss of relevant data in the parameter space. When this happens, AL can produce worse results than PL. To solve this problem we developed the notion of the Extended Filter. As this filter includes limbs omitted by the users, it achieves a performance comparable to PL in the situations where AL does not work as well as expected. More interestingly, when AL performs better than PL, the EF behaves as a regular AL approach. Therefore, our EF gets the best of both worlds, being a good choice when AL is beneficial but it is not possible to control the accuracy of the users' answers.
Even though our method is applied to pose learning, we believe that it could be applied to other AL-based learning approaches. This is because our approach is based on feature selection; hence, other learning approaches might apply it as long as the robot knows the relationships between the features it is asking about.
This research has received funding from the projects Development of social robots to help seniors with cognitive impairment - ROBSEN funded by the Ministerio de Economía y Competitividad (DPI2014-57684-R) from the Spanish Government; and RoboCity2030-III-CM (S2013/ MIT-2748), funded by Programas de Actividades I+D of the Madrid Regional Authority and cofunded by Structural Funds of the EU.
- 1. Alonso, F., Gorostiza, J., Salichs, M.: Preliminary experiments on HRI for improvement the Robotic Dialog System (RDS). In: Robocity2030 11th Workshop on Social Robots (2013)
- 6. Cakmak, M., Thomaz, A.L.: Designing robot learners that ask good questions. In: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2012, p. 17. ACM, New York (2012)
- 7. Druck, G., Settles, B., McCallum, A.: Active learning by labeling features. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 81–90. Association for Computational Linguistics (2009)
- 11. Quigley, M., Gerkey, B., Conley, K., Faust, J., Foote, T., Leibs, J., Berger, E., Wheeler, R., Ng, A.: ROS: an open-source Robot Operating System. In: Open-Source SW Workshop of the International Conference on Robotics and Automation (ICRA) (2009)
- 13. Rosenthal, S., Dey, A.K., Veloso, M.: How robots' questions affect the accuracy of the human responses. In: The 18th IEEE International Symposium on Robot and Human Interactive Communication, RO-MAN 2009, pp. 1137–1142. IEEE (2009)
- 14. Salichs, M., Barber, R., Khamis, A., Malfaz, M., Gorostiza, J., Pacheco, R., Rivas, R., Corrales, A., Delgado, E., Garcia, D.: Maggie: a robotic platform for human-robot social interaction. In: 2006 IEEE Conference on Robotics, Automation and Mechatronics, pp. 1–7. IEEE, Bangkok, December 2006
- 15. Settles, B.: Active learning literature survey. Computer Sciences Technical report 1648, University of Wisconsin-Madison (2010)