1 Introduction

In this year’s RoboCup we participated in the RoboCup@Home Open Platform League. The team consisted of one supervisor and six students; two more students supported the preparation. We participated successfully at the RoboCup World Cup, where we achieved first place. After the RoboCup World Cup in Hefei, China (2015), this is the second time that we have won this title. Besides RoboCup competitions, we also attended the European Robotics League and participated in the ICRA Mobile Manipulation Challenge.

For this year’s participation we focused on cooperation between robots of different teams. We demonstrated this twice: once at the RoboCup German Open and once in the RoboCup@Home Open Platform League, where robots of two different teams handed over objects without an established network connection. This was achieved by taking human-robot interaction interfaces and adapting them to robot-robot interaction. The robots talked to each other using speech synthesis, recognized the announced objects via speech recognition, and handed over objects using the same approach that is used when a person hands an object to the robot. Furthermore, we created an approach for semantically mapping rooms without prior knowledge. We demonstrated this using an attention-guided process in which the robot extracts the face pose of a person, directs its view in the perceived direction, and starts detecting furniture. The location was transmitted to an auxiliary robot, which placed an object on top of the recognized table.

This year we also improved our team’s infrastructure by introducing continuous software integration and building our packages for a variety of processor architectures.

Section 2 gives a short overview of the RoboCup@Home competition. In Sect. 3 we present the robots that we used for this year’s participation, with a special focus on the hardware changes that the robots have undergone. Section 4 describes the architecture and software approaches we are using. An overview of the current research topics is given in Sect. 5. Finally, Sect. 6 summarizes and concludes this paper.

2 RoboCup@Home

The RoboCup@Home league [1] aims at bringing robots into domestic environments and supporting people in daily scenarios.

The league has been separated into three sub-leagues, namely the Domestic Standard Platform League (DSPL), the Social Standard Platform League (SSPL) and the Open Platform League (OPL). The first two are based on standard platforms provided by commercial companies and aim at benchmarking the teams’ integrated software on a common hardware platform. The extension possibilities of these platforms are limited. The Open Platform League targets teams that build their own robots or use robots that do not belong to one of the two standard platform leagues. The robots can be customized individually as long as they follow the definitions of the rulebook. The goal of the league is to follow one general rulebook for all sub-leagues. In contrast to the soccer leagues, where a global goal is defined and the rules are heavily influenced by the FIFA rules, in RoboCup@Home the rules are defined on a yearly basis. The competition is divided into three stages. The first stage aims at benchmarking basic functionalities such as manipulation, speech understanding, finding, tracking and following persons, object recognition and navigation.

In the second stage, the basic functionalities from the first stage are integrated to form more complex tasks: the robots have to, e.g., set a table, serve in a restaurant at a random location without prior knowledge, and help to carry and store groceries. In the finals, the top two teams of each sub-league give a demonstration that should be interesting from a research and application perspective.

Fig. 1. The robots Lisa (left) and Marge (right). Lisa is our main robot and was built as a successor inspired by Marge. Both robots run the same software with minor exceptions, namely the model descriptions and hardware interfaces.

3 Hardware

In 2017 we updated our two domestic robot platforms slightly (Fig. 1). We extended the sensor setup on our main robot’s head. The sensor head’s basis is 3D-printed and allows for modular extensions. In addition to the Microsoft Kinect 2, which has a slightly wider field of view than the Microsoft Kinect or the Asus Xtion, we mounted two Asus Xtion RGB-D cameras. One is facing forward, closely located to the Kinect 2, and is used on demand where the Kinect 2 fails to provide depth information (e.g., on black furniture). The second one is facing backwards in order to track persons during guiding. This has the benefit that the pan-tilt unit does not need to turn by 180 degrees while following. We also changed the directional microphone setup on our main robot. The Rode VideoMic Pro was replaced by a Rode VideoMic Go, as the new microphone does not need an external battery and we found no drawbacks in the speech recognition results. For sound source localization we now interface the microphone array of the Kinect 2; its four microphones allow us to localize sound sources in front of the camera. The power supply was changed to TB47S lithium-ion batteries by DJI (22.2 V, 99.9 Wh), usually used in drones. We use a pair of these in each robot. The batteries can be changed independently and therefore allow hot swapping. Further, the batteries can be transported on airplanes, which makes them an ideal choice for traveling with the robots. We use the Kinova Mico arm for mobile manipulation, but designed adapters that allow interchangeable custom 3D-printed end-effector attachments. The original mechanics for controlling the end-effectors are still used. For gripping tasks we mount Festo Finrays on the attachments, but remain flexible to use and design additional tools [2]. An Odroid C2 mini computer is used to interface the base control on our main robot and to display the robotic face even when the main computer is not running. The auxiliary robot is now also equipped with a 10 in. screen to support human-robot interaction with the robotic face.

4 Approaches

We now briefly present our architecture and then describe our algorithmic approaches (see Fig. 2 for a visual overview).

Architecture. ROS [3] Kinetic and Ubuntu 16.04 form the basis of our architecture. On top of ROS, our software focuses on a higher-level representation that simplifies the development of demonstrations. Behaviors are modeled using the ROS actionlib and are interfaced by a robot API which is common to both robots. The behaviors can be executed on both robots by only adjusting the robot model representations.
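To illustrate the idea, the following minimal Python sketch shows how a behavior exposed via the ROS actionlib can be wrapped by a common robot API. It uses the standard move_base action as a stand-in; the class and method names are illustrative, not our actual interfaces.

import rospy
import actionlib
from geometry_msgs.msg import Quaternion
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal


class RobotApi(object):
    """Thin wrapper exposing high-level behaviors as blocking calls (illustrative)."""

    def __init__(self):
        self._move_base = actionlib.SimpleActionClient("move_base", MoveBaseAction)
        self._move_base.wait_for_server()

    def drive_to(self, x, y, orientation=None, frame="map", timeout=60.0):
        goal = MoveBaseGoal()
        goal.target_pose.header.frame_id = frame
        goal.target_pose.header.stamp = rospy.Time.now()
        goal.target_pose.pose.position.x = x
        goal.target_pose.pose.position.y = y
        goal.target_pose.pose.orientation = orientation or Quaternion(w=1.0)
        self._move_base.send_goal(goal)
        return self._move_base.wait_for_result(rospy.Duration(timeout))


if __name__ == "__main__":
    rospy.init_node("behavior_demo")
    robot = RobotApi()
    robot.drive_to(1.0, 2.0)  # the same behavior code runs on both robots

Behaviors written against such an interface only depend on the robot model loaded at startup, which is what allows the code sharing between Lisa and Marge described above.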

Fig. 2. Our robot architecture. The robot interface is a central interface for accessing the high-level robot actions (also called behaviors) and is common to all our robots. The high-level robot actions make use of the basic functionalities, which are reused in different actions. Similar colors encode the class of the functionality/action. Notably, actions like pick and place rely on a variety of functionalities, which is reflected in a high failure rate in the competition. (Color figure online)

Object Detection. For object detection we estimate planar surfaces and cluster the points above them. The robot tilts the RGB-D camera towards a possible object location. The gathered point cloud is then transformed into the robot’s coordinate frame, where an upward-facing normal vector defines the search space for the planar surfaces. Points above the plane hypotheses are clustered individually, each cluster resulting in an object hypothesis. In an additional step these hypotheses are filtered to eliminate clusters that are too small as well as objects that cannot be manipulated.
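A simplified sketch of this detection step is given below, here written with Open3D rather than our actual implementation; the thresholds, the up-axis test and the size filter are illustrative assumptions.

import numpy as np
import open3d as o3d


def detect_object_clusters(pcd, min_points=50, max_extent=0.35):
    """Find a horizontal support plane in a cloud (robot frame, z up) and cluster points above it."""
    plane, inliers = pcd.segment_plane(distance_threshold=0.02, ransac_n=3, num_iterations=500)
    a, b, c, d = plane
    if c < 0.0:                               # make the plane normal point upwards
        a, b, c, d = -a, -b, -c, -d
    if c < 0.9:                               # normal not facing up -> not a horizontal surface
        return []
    rest = pcd.select_by_index(inliers, invert=True)
    pts = np.asarray(rest.points)
    above = rest.select_by_index(np.where(a * pts[:, 0] + b * pts[:, 1] + c * pts[:, 2] + d > 0.01)[0])
    labels = np.asarray(above.cluster_dbscan(eps=0.05, min_points=min_points))
    hypotheses = []
    for label in range(labels.max() + 1 if labels.size else 0):
        cluster = np.asarray(above.points)[labels == label]
        extent = cluster.max(axis=0) - cluster.min(axis=0)
        if np.all(extent < max_extent):       # reject clusters too large to be manipulated
            hypotheses.append(cluster)
    return hypotheses

The DBSCAN minimum cluster size removes hypotheses that are too small, mirroring the filtering step described above.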

For obstacle avoidance we estimate planar surfaces along all axes in order to extract table tops as well as the levels and side walls of bookcases. These planar points are used to build a 3D grid map for obstacle avoidance, which has proven sufficient for the majority of manipulation tasks. In comparison to using the full point cloud as input for the grid map estimation, this approach is more robust against effects like speckles.
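A minimal sketch of such a grid map, here as a plain voxel set built from the planar points rather than the octree-based representation used in the manipulation pipeline below, could look as follows.

import numpy as np


def build_voxel_grid(planar_points, resolution=0.05):
    """planar_points: (N, 3) array of points on the extracted planar surfaces."""
    return {tuple(v) for v in np.floor(planar_points / resolution).astype(int)}


def is_occupied(grid, point, resolution=0.05):
    """Collision query for a single 3D point against the occupied voxels."""
    return tuple(np.floor(np.asarray(point) / resolution).astype(int)) in grid


grid = build_voxel_grid(np.array([[0.60, 0.00, 0.75]]))
print(is_occupied(grid, (0.62, 0.01, 0.76)))   # True: same 5 cm voxel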

Object Recognition. From the filtered object hypotheses we extract an image patch that is passed to an object classification network. In contrast to previous attendances [4] we now use a pre-trained Inception [5] network in which we fine-tune the final layer. For training we created a dataset of 4000 images of all official competition objects, taken from different perspectives around the objects and at different camera heights. We skipped further data augmentation as we found no major recognition improvements. The resulting training images were shared with all participating teams before the competition days.
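The following hedged sketch shows how fine-tuning only the final layer of a pre-trained Inception network can be done with tf.keras; the dataset path, class count and training schedule are placeholders and do not reflect our actual setup.

import tensorflow as tf

NUM_CLASSES = 25          # assumption: number of official competition objects

base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                         pooling="avg", input_shape=(299, 299, 3))
base.trainable = False    # keep the pre-trained feature extractor fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # only this layer is trained
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(299, 299), batch_size=32)   # placeholder path
train_ds = train_ds.map(
    lambda x, y: (tf.keras.applications.inception_v3.preprocess_input(x), y))
model.fit(train_ds, epochs=5)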

Fig. 3. The robot during the grasping process. The planar surfaces are extracted and a 3D grid map is generated for obstacle avoidance.

Manipulation. MoveIt! [6] is used as the motion planning framework for manipulation tasks, with RRT-connect [7] for trajectory planning. Planar surfaces along the main axes are extracted and used for obstacle avoidance via an octree-based 3D grid map representation [8] (see Fig. 3). We evaluated and integrated different approaches for the manipulation task. First, we adjust the position of the robot in order to obtain a certain distance to the object, which increases the probability of a successful trajectory execution. Many trajectories cannot be executed with the required accuracy of the end-effector target pose. Therefore, we execute trajectories with target poses in front of the object where the end-effector is aligned upwards; the distance is then again adjusted by moving the robot base. This approach has some disadvantages, e.g., the robot needs enough space for these movements. A second approach samples possible grasp poses around the object and checks whether the estimated pose is executable. If the execution fails, the next possible grasp pose is generated until a valid one is found. This second approach overcomes the need to adjust the robot’s position.
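The following Python sketch outlines the second strategy with moveit_commander: candidate end-effector poses are sampled around the object and the first pose for which RRT-connect finds a plan is executed. The planning group name, frames and sampling parameters are assumptions, not our configuration.

import sys
import math
import rospy
import moveit_commander
from geometry_msgs.msg import Pose
from tf.transformations import quaternion_from_euler

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("grasp_sampling_demo")

arm = moveit_commander.MoveGroupCommander("arm")      # planning group name is an assumption
arm.set_planner_id("RRTConnectkConfigDefault")        # RRT-connect [7]


def sample_grasp_poses(obj_x, obj_y, obj_z, radius=0.12, steps=8):
    """Yield pre-grasp poses on a circle around the object, approach axis towards it."""
    for i in range(steps):
        yaw = 2.0 * math.pi * i / steps
        pose = Pose()
        pose.position.x = obj_x - radius * math.cos(yaw)
        pose.position.y = obj_y - radius * math.sin(yaw)
        pose.position.z = obj_z
        qx, qy, qz, qw = quaternion_from_euler(0.0, 0.0, yaw)
        pose.orientation.x, pose.orientation.y = qx, qy
        pose.orientation.z, pose.orientation.w = qz, qw
        yield pose


for candidate in sample_grasp_poses(0.6, 0.0, 0.8):   # object position in the planning frame
    arm.set_pose_target(candidate)
    plan = arm.plan()   # on ROS Kinetic this returns a RobotTrajectory; newer MoveIt returns a tuple
    if plan.joint_trajectory.points:                  # a non-empty plan was found
        arm.execute(plan, wait=True)
        break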

Speech Recognition. For speech recognition we use a grammar-based solution supported by an academic license for the VoCon speech recognition software by Nuance¹. We combine continuous listening with begin- and end-of-speech detection to obtain good results even for complex commands. Recognition results below a certain threshold are rejected. The grammar generation is supported by the content of a semantic knowledge base that is also used for our general purpose architecture.
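The grammar generation can be illustrated by the following sketch, which assembles a simple BNF-style fragment from knowledge base content; the actual VoCon grammar format and our rule set are not reproduced here.

def build_grammar(objects, locations):
    """Build a simple command grammar from the knowledge base content (illustrative syntax)."""
    rules = {
        "<object>": " | ".join(objects),
        "<location>": " | ".join(locations),
        "<command>": "bring me the <object> | go to the <location> "
                     "| grasp the <object> from the <location>",
    }
    return "\n".join("{} ::= {};".format(lhs, rhs) for lhs, rhs in rules.items())


print(build_grammar(["coke", "pringles", "sponge"], ["kitchen table", "shelf"]))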

Speech Understanding. We propose a novel approach to teach a robot an unknown command by natural language, assuming that the given instruction set is representable by a BNF grammar. To achieve this, we first define a broad BNF grammar containing a large set of commands that the robot is able to execute. We use a grammar compiler [9] to generate a parser from this BNF grammar, connect the parser to the actual robot actions and extract the parameters. In case the robot receives a command that is not connected to a robot action (what we call an unknown command), the robot asks for further instructions. With a very broad set of robot actions, a high variety of new commands can be created. Since we can also extract the parameters of unknown commands and replace them, a taught command is applicable in a broad variety of situations.
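The sketch below illustrates the teach-in idea with a toy command set: known commands are executed directly, while an unknown command triggers a clarification dialogue and is stored as a sequence of known actions with replaceable parameters. The command names and the dialogue are simplified assumptions.

KNOWN_ACTIONS = {
    "bring": lambda obj, loc: print("grasp the {} at the {} and deliver it".format(obj, loc)),
    "goto": lambda obj, loc: print("navigate to the {}".format(loc)),
}
TAUGHT_COMMANDS = {}  # unknown verb -> list of known verbs taught by the user


def execute(verb, obj=None, loc=None):
    if verb in KNOWN_ACTIONS:
        KNOWN_ACTIONS[verb](obj, loc)
    elif verb in TAUGHT_COMMANDS:
        # Replay the taught sequence, substituting the current parameters.
        for step in TAUGHT_COMMANDS[verb]:
            if step in KNOWN_ACTIONS:
                KNOWN_ACTIONS[step](obj, loc)
    else:
        # Unknown command: ask for an explanation in terms of known actions.
        print("I do not know how to '{}'. Please explain it to me.".format(verb))
        steps = input("known actions, comma separated: ").split(",")
        TAUGHT_COMMANDS[verb] = [s.strip() for s in steps]
        execute(verb, obj, loc)


execute("fetch", obj="coke", loc="kitchen table")  # unknown -> asked, taught, replayed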

Fig. 4. Tracking overview. An RFS Bernoulli single-target tracker is combined with a deep appearance descriptor to re-identify and online-classify the appearance of the tracked identity. Measurements, consisting of positional information and an additional image patch, serve as input. The Bernoulli tracker estimates the existence probability and the likelihood of a measurement being the operator. Positive and negative appearances are continuously trained. The online classifier returns scores of the patch being the operator.

Operator Following. We developed an integrated system to detect and track a single operator, switching the track off and on when the operator leaves and (re-)enters the scene. Our method is based on a set-valued Bayes-optimal state estimator that integrates RGB-D detections and image-based classification to improve tracking results in severe clutter and under long-term occlusion. The classifier is trained in two stages: first, we train a deep convolutional neural network to obtain a feature representation for person re-identification; then, we bootstrap an online classifier that discriminates the operator from the remaining people on the output of the state estimator. See Fig. 4 for a visual overview. The approach is applicable to following and guiding tasks.
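The existence-probability part of the tracker can be sketched with the standard Bernoulli filter recursion, where the appearance score of the online classifier enters into the measurement likelihood; the parameter values below are illustrative, not our tuned ones.

def predict_existence(r, p_birth=0.05, p_survival=0.99):
    """Bernoulli prediction of the operator existence probability r."""
    return p_birth * (1.0 - r) + p_survival * r


def update_existence(r_pred, likelihoods, p_detect=0.9, clutter_intensity=1e-2):
    """likelihoods: g(z) per measurement, e.g. detector score times appearance score."""
    ratio = sum(g / clutter_intensity for g in likelihoods)
    delta = p_detect * (1.0 - ratio)
    return r_pred * (1.0 - delta) / (1.0 - r_pred * delta)


r = 0.0
for frame_likelihoods in [[], [0.04], [0.09], []]:   # empty list = missed detection / occlusion
    r = update_existence(predict_existence(r), frame_likelihoods)
    print("operator existence probability: %.2f" % r)

With no measurements the existence probability decays, while operator-like measurements raise it again, which is what allows the track to be suspended and resumed.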

Person Detection. For person detection we integrated multiple approaches for different sensors that can optionally be fused and used to verify the measurements of other sensors. A leg detector [10] is applied to the laser data, which yields high-frequency but error-prone measurements. For finding persons in point clouds we follow an approach by Munaro et al. [11]. The most reliable detections come from a face detection approach [12], assuming that the persons are facing the camera. For gender estimation we then apply an approach by Levi et al. [13].
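A minimal sketch of the fusion step is given below, verifying the error-prone leg detections against more reliable detections by spatial proximity; it assumes both detection types are already projected into a common frame.

import numpy as np


def verify_detections(leg_detections, reference_detections, max_dist=0.5):
    """Keep leg detections (x, y) that lie close to a reference detection, e.g. a projected face."""
    verified = []
    for leg in leg_detections:
        if any(np.hypot(leg[0] - ref[0], leg[1] - ref[1]) < max_dist
               for ref in reference_detections):
            verified.append(leg)
    return verified


print(verify_detections([(1.0, 0.2), (3.0, 2.0)], [(1.1, 0.1)]))   # -> [(1.0, 0.2)]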

Fig. 5. The gesture estimates visualized. The estimated joints are denoted by red circles. The blue font states the classified gesture: p stands for pointing, w for waving, l for left, r for right and a for arm. Note that joints have also been extracted for the reflection in the door. (Color figure online)

Gesture Recognition. Joints of persons are estimated using a convolutional pose machine [14, 15]. The angle relations between the arm joints are then used to classify gestures like waving, pointing and a stop sign for each arm individually. The convolutional pose machine extracts the joint positions of all persons in one pass. The center of each person is projected into the robot frame by depth projection. Figure 5 gives an example visualization of the joint estimation and the corresponding estimated gesture class labels per person.
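The rule-based classification on top of the estimated joints can be sketched as follows; the angle thresholds and the 2D image-coordinate convention are illustrative assumptions.

import numpy as np


def angle(a, b, c):
    """Angle at joint b (degrees) between the segments b->a and b->c."""
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))


def classify_arm(shoulder, elbow, wrist):
    """Classify one arm into waving / pointing / none from its joint positions (image coords)."""
    elbow_angle = angle(shoulder, elbow, wrist)
    raised = wrist[1] < shoulder[1]                 # smaller y = higher in the image
    if raised and elbow_angle < 120:                # bent, raised arm
        return "waving"
    if elbow_angle > 150 and wrist[1] < elbow[1]:   # extended arm, wrist above elbow
        return "pointing"
    return "none"


print(classify_arm(shoulder=(200, 150), elbow=(240, 110), wrist=(280, 70)))  # -> "pointing"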

Fig. 6. Handing over objects as a team collaboration with team SocRob [16] at the RoboCup German Open 2017 (A) and during the preparation with team Happy Mini [17] at RoboCup 2017 in Nagoya (B). There was no network connection between the robots.

Fig. 7. Handover between a person and a robot. The same approach is used for detecting a handover from a person to the robot as for the robot-to-robot handover.

Team Collaboration. A focus of our RoboCup@Home participation was team-to-team collaboration. At the RoboCup@Home German Open we collaborated with the SocRob team and at this year’s World Cup with team Happy Mini. We built on top of our approach from 2015, where our two robots transferred tasks to each other by natural language: the speech synthesis of one robot was recognized by the speech recognition of the other robot, without any network connection between the robots. This year we transferred this approach to an inter-team collaboration and successfully extended it by adapting another human-robot interface. In addition to the natural language communication, we used the same approach for handing over objects between the two robots as for handing over objects from a human to a robot.
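The network-free protocol can be sketched as a simple spoken dialogue, where speak() and listen() stand in for the actual speech synthesis and recognition modules; the phrasing and confirmation keywords are illustrative assumptions.

def handover_giver(object_name, speak, listen):
    # Announce the object and wait until the other robot confirms its grasp.
    speak("I have the {} for you. Please grasp it.".format(object_name))
    if "i have grasped" in listen().lower():
        speak("Releasing the object now.")   # the gripper would be opened here
        return True
    return False


def handover_receiver(speak, listen):
    # React to the spoken offer exactly like in the human-to-robot handover.
    if "please grasp it" in listen().lower():
        # the gripper would be closed around the offered object here
        speak("I have grasped the object.")
        return True
    return False


if __name__ == "__main__":
    # Interactive stand-in: speech synthesis replaced by print, recognition by input.
    handover_giver("coke", speak=print, listen=input)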

Figure 6 shows this approach in action between two robots, while Fig. 7 shows it between a person and a robot.

5 Current Research

Imitation Learning. With the recent developments in pose estimation algorithms it is now possible to develop approaches for imitation learning by visual observation that passively observe human instructors and allow the reasoning to adapt the behavior. Whereas in the past the human instructors needed to be wired or marked with visual markers to extract the required joints, we can now extract them passively, allowing the instructors to perform a task in a natural way without being affected by the observation [18].

The goal is to develop approaches that let the instructor work as free as possible from any markers or unnatural interactions with the robot or observation system. We further reduce the required components, so that a mobile robot equipped with a vision system and a manipulator is sufficient to imitate the performed tasks as they would be done by another person.

The first step is extracting the human poses and setting them into relation with further observations like detected objects and the general semantic knowledge of the robot.

These observations should then be classified, where the classes are represented by the robot behaviors. For instance, we can classify a movement behavior by observing the human pose and setting it into relation with certain points of interest in the map. This yields a set of problems we want to address:

  • action classification based on the robot observations and semantic representation

  • developing an approach for the robot to estimate a next best view location for extracting as much meaningful information for the classification as possible

  • a probabilistic representation that allows reasoning based on the input observations

  • for more precise imitation learning tasks, analyzing the human arm and hand movements in relation to recognized objects and adapting them to a robot arm that may have different degrees of freedom.

The first step is the creation of a dataset that extends existing datasets on action classification. These datasets currently focus on simple actions like running, walking and kicking. We aim to create a dataset with a focus on domestic indoor activities which includes, next to the RGB-D camera stream, also the robot’s knowledge about its environment, such as the map and the points/regions of interest.

3D Shape Classification. Most of the robots in RoboCup@Home are equipped with 3D sensors (e.g. RGB-D cameras). One of the most widely used libraries for 3D data processing is the Point Cloud Library (PCL) [19]. The PCL offers many 3D descriptors, algorithms and useful data structures. Further, a complete object recognition pipeline [20] and an adaptation of Implicit Shape Models (ISM) to 3D data [21] are available. However, these two object classification methods in the PCL are either not flexible enough or do not provide sufficiently good results.

We are currently working on a 3D shape classification approach to overcome these limitations. For our own approach we take inspiration from Naive-Bayes Nearest Neighbor (NBNN) and combine it with the localized Hough-voting scheme of ISM. We obtain a competitive 3D classification algorithm that combines the best of ISM and NBNN. The direct utilization of the extracted features as the codebook during training makes our method learning-free (like NBNN and contrary to ISM, which uses vector quantization). On the other hand, the localized Hough-voting of ISM allows us to filter out erroneous feature correspondences and irrelevant side maxima. Our algorithm is competitive with standard approaches for 3D object classification on commonly used datasets.
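The NBNN part of the approach can be sketched as follows (without the localized Hough-voting stage): per class, the distances of all query descriptors to their nearest training descriptors are accumulated and the class with the lowest total cost wins. The random descriptors below stand in for real local features such as SHOT.

import numpy as np
from scipy.spatial import cKDTree


def train(descriptors_per_class):
    """descriptors_per_class: {class_name: (N_c, D) array of training descriptors}."""
    return {c: cKDTree(d) for c, d in descriptors_per_class.items()}


def classify(trees, query_descriptors):
    """Sum, per class, the distance of each query descriptor to its nearest neighbor."""
    costs = {c: tree.query(query_descriptors)[0].sum() for c, tree in trees.items()}
    return min(costs, key=costs.get)


rng = np.random.default_rng(0)
trees = train({"mug": rng.normal(0.0, 1.0, (200, 32)),
               "box": rng.normal(3.0, 1.0, (200, 32))})
print(classify(trees, rng.normal(3.0, 1.0, (40, 32))))   # -> "box"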

Currently, we focus on several methods to estimate a descriptor’s relevance during training in order to obtain a smaller, more descriptive codebook and to omit redundancies. We achieve this codebook cleaning without significant loss, or even with an increase, of classification performance. Further, we experiment with means of selecting relevant dimensions of a feature descriptor: we inspect the structure of the applied SHOT descriptor and identify less significant descriptor dimensions that are left out after training. All of this is combined with a global verification approach to boost the classification performance. We plan to make the source code of our approach publicly available and thus contribute to the PCL and boost the performance of open-source 3D classification software.

Affordances. To adequately fulfill its tasks a mobile robot needs to understand its environment. Object classification is a crucial capability when it comes to environment understanding. However, object classification by itself may not be sufficient as it only addresses the task of deciding what a specific object is. In many cases we also want to know how that specific object can be used by a robot or a human. The action possibilities offered by an object to a specific agent (e.g. human, robot) are called affordances [22].

Special requirements on service robots arise from the fact that they share a common environment with humans. Domestic service robots even reside in environments specifically designed for humans. It is thus crucial for service robots to understand humans and their actions. In other words, in a human-centered environment with humans as agents, a robot needs to be aware of affordances for humans, but also of affordances for itself, as the robot also forms an agent in that domestic environment. We have successfully shown how an agent model representing a human is used to visually infer affordances for sitting and for lying down [23]. A robot equipped with such a notion of affordances is enabled to assist and guide people in a domestic environment.

We are additionally exploring another abstraction of the agent model concept. Instead of a human or robot we model a hand agent (which again can be a human hand or a robotic gripper). This agent model allows us to detect grasping affordances and to estimate suitable grasping poses for given objects. Together with the above-mentioned object classification approach, this grasping affordance estimation can be combined into a complete processing pipeline for mobile manipulation. Further, imitation learning will be helpful to achieve more accurate results in grasping affordance estimation.

6 Summary and Outlook

In this paper we described how the team is composed, gave a brief overview of RoboCup@Home and described the hardware changes in comparison to previous attendances. A major focus was set on the description of our software approaches. Finally, we gave an insight into our current research topics.

We proposed a novel approach for speech understanding which we showed to be adaptable to unknown commands. In addition, we have shown a team-to-team collaboration that relies not on a network connection but on interfaces that are also commonly used for human-robot interaction. Together with the teams SocRob and Happy Mini we have shown the first robot-robot handovers between robots from different teams in the league. In the final demonstration we showed that the robot is able to localize moved furniture, which could be used for later manipulation tasks. The localization was attention-guided by analyzing the view direction of a person. In the future we aim for more applications of our current research topics.