Active Scene Recognition

  • Pascal Meißner
Part of the Springer Tracts in Advanced Robotics book series (STAR, volume 135)


This chapter gives a detailed technical presentation of our contributions related to Active Scene Recognition, including our approaches to Object Pose Prediction and Next-Best-View estimation.

4.1 Concept Overview

An important question that has been left open in Passive Scene Recognition (PSR) is how to acquire the sensor data from which 6-DoF object location estimates are to be inferred. In short, the recording sensor setup has to adopt a configuration from which it can perceive the objects that are to be localized. For the robot head mentioned in Sect. 1.2, a suitable configuration is equivalent to a camera view whose viewing area contains these objects. At the beginning of Chap. 1, we explain that in indoor scenes, we should not expect a single view from which all searched objects can be perceived at once. Instead, we should derive a sequence of views, each allowing a mobile robot to find a different portion of the searched objects with maximal accuracy and minimal effort. In our fourth thesis statement in Sect. 1.2, we declared that Active Scene Recognition (ASR) is the appropriate means for successfully deducing both view sequences and scene models in cluttered and distributed environments. We realize ASR, a decision-making system that connects PSR with three-dimensional object search, as a hierarchical state machine [2]. This state machine is depicted in simplified form as a state diagram in Fig. 4.1. It is executed on the mobile robot MILD that is shown in Fig. 1.8 (2). While this section presents the logic of our state machine, Sect. 4.2 gives an overview of the robot architecture in which this decision-making system is embedded.
Fig. 4.1

Simplified state diagram of Active Scene Recognition as developed for the mobile robot MILD
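The top-level control flow of this state machine can be sketched as a dispatch loop. The state names follow Fig. 4.1, while the handler logic and the Python structure are purely illustrative stand-ins, not the actual ROS implementation:

```python
# Illustrative dispatch loop for the top-level ASR state machine.
# State names follow Fig. 4.1; the handlers below are toy stand-ins.

def run_asr(handlers, start="OBJECT_SEARCH_INIT"):
    """Execute states until a handler returns None (ASR has finished)."""
    trace = []
    state = start
    while state is not None:
        trace.append(state)
        state = handlers[state]()  # each handler returns the next state
    return trace

# Toy handlers for a single pass that ends after scene recognition.
handlers = {
    "OBJECT_SEARCH_INIT":        lambda: "GLOBAL_ROBOT_LOCALIZATION",
    "GLOBAL_ROBOT_LOCALIZATION": lambda: "DIRECT_SEARCH",
    "DIRECT_SEARCH":             lambda: "SCENE_RECOGNITION",
    "SCENE_RECOGNITION":         lambda: None,  # e.g. all objects found
}
```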

Let’s assume that MILD is located at an arbitrary, collision-free position in the environment, such as in Fig. 1.2. Given a 2-D floor plan, MILD is expected to decide upon the presence of n different scene categories such as “Setting—Ready for Breakfast” with the help of ASR. When ASR is started, MILD and in particular its localization and navigation system are initialized during the execution of the states OBJECT_SEARCH_INIT and GLOBAL_ROBOT_LOCALIZATION. The latter delivers a pose estimate for the mobile base of MILD in relation to a fixed global frame. This base pose is updated throughout ASR. Since no object models, i.e. object localization results, are available when ASR starts, the next task modeled in the state machine consists of acquiring a first set of object models. For this purpose, we developed two strategies for estimating and traveling to interesting camera views. Both are subsumed in the subsystem DIRECT_SEARCH of the state diagram in Fig. 4.1. The first strategy is an informed search that is based on knowledge about potential object poses that comes from past localization results, i.e. experience. Such experience can, for example, consist of object poses observed during demonstrations of scene categories. If informed search fails, a less efficient, uninformed search is started that is based on [26] and evenly explores the entire environment. Both strategies are explained in more detail in [12]. Once any object model has been acquired—for example because two forks of the place setting in Fig. 1.2 have been localized—direct search stops.

Instead, indirect search, the second search mode of ASR apart from direct search, starts in the state machine from Fig. 4.1. In fact, both modes alternate until ASR stops. In Fig. 4.1, for example, this is the case when either a scene model with sufficient confidence has been acquired for each scene category, or every object from every scene category has been found, or every view that might contain searched objects has been analyzed. In the context of searching a so-called target object with the help of a previously found intermediate object, the authors of [25] defined the difference between direct and indirect search as follows:

An indirect search [9] is a search that operates in a loop, repeatedly locating a new instance of the intermediate object and looking “near” it for the target object until either the target object is found or no more instances of the intermediate object can be found. A direct search, on the other hand, is one that does not attempt to use spatial relationships to predict likely locations and simply searches the entire search space systematically. [25]

According to the authors, indirect search outperforms direct search of a target object. This is achieved by limiting the search space of the target through the intermediate object. In the decision-making system in Fig. 4.1, the subsystem INDIRECT_SEARCH implements the actual core of our ASR approach, following the indirect search paradigm. With techniques for predicting the poses of searched objects and for estimating optimal camera views that cover those predictions, it contains two major contributions of this thesis. Besides these techniques, Passive Scene Recognition is integrated as well. During indirect search, these three main functionalities run in a loop. We remain in indirect search as long as views promising for searching objects can be deduced from the already available object models. Otherwise, we return to direct search. The first state that is reached during indirect search is SCENE_RECOGNITION. There, PSR with Implicit Shape Model Trees is performed on the object models that have been acquired up to this point. The outcome is a set of scene models with presumably differing confidences. The confidence of each scene model in particular depends on how many objects of the underlying scene category have already been found. In case every object of each scene category has been found, ASR ends at this point. Otherwise, we assume that each searched object fits at least one of the available scene models from which it is missing. Put simply, we aim at completing available, partial scene models with searched objects.1 In analogy to [25], we reduce the space in which we search for missing objects to those locations that would allow such completions, thereby speeding up the search. If the robot MILD had just perceived the pair of forks on the right in Fig. 1.1, it would, for example, have deduced a partial scene model for “Setting—Ready for Breakfast”.
In order to complete that scene model, the remaining silverware has to be located further on the left.
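The stop conditions of ASR mentioned above can be sketched as a simple predicate. Its inputs and the confidence threshold are hypothetical placeholders for the actual data structures of the state machine:

```python
def asr_should_stop(scene_confidences, threshold, found, all_objects,
                    views_remaining):
    """Illustrative check of the three ASR stop conditions:
    (1) every scene category has a sufficiently confident scene model,
    (2) every object of every scene category has been found, or
    (3) no view remains that might contain searched objects."""
    all_confident = all(c >= threshold for c in scene_confidences.values())
    all_found = set(all_objects) <= set(found)
    return all_confident or all_found or not views_remaining
```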

In order to estimate such locations, we developed an algorithm that predicts the potential poses of missing objects with the help of spatial relations. This object pose prediction technique employs the same relations from Implicit Shape Model Trees that we introduced for performing PSR. The advantages of taking spatial relations as a basis for object search have been discussed in Sect. 1.1.3. While these abstract observations are concretized in this section, the actual prediction algorithm is defined in Sect. 4.5.1. To begin, we present our basic idea for predicting poses with a single ISM as scene category model. In order to calculate a cloud of hypothetical poses for any object that is missing in a given scene model, we combine the pose of the reference of this scene model—phenomenologically, its emplacement—with an appropriate spatial relation. We select the relation within the star topology of the ISM that connects the reference with the missing object. We say that we invert relations during prediction, as we estimate potential poses of objects departing from the reference. This is contrary to scene recognition, during which we estimate hypothetical poses of the reference departing from localized objects. When using ISM trees instead of a single ISM as a scene category model, a single relation does not necessarily connect the searched object directly to the reference of this hierarchical model. In this case, a path of spatial relations that transitively connects both has to be identified within the tree-shaped scene representation before the actual pose prediction can take place. Estimating possible poses for the missing object by means of such a sequence of relations is prone to combinatorial explosion, as the relative poses within each relation have to be combined along the path. This demands an efficient algorithmic solution.
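The idea of chaining inverted relations along a path, including the combinatorial growth it causes, can be illustrated with a minimal planar sketch. The thesis operates on full 6-DoF poses; here the pose format is simplified to (x, y, θ):

```python
import math

def compose(p, q):
    """Append relative pose q = (x, y, theta) to pose p in p's frame."""
    x, y, t = p
    qx, qy, qt = q
    return (x + qx * math.cos(t) - qy * math.sin(t),
            y + qx * math.sin(t) + qy * math.cos(t),
            t + qt)

def predict_poses(reference_pose, relation_path):
    """Chain the relative poses of a relation path from the scene
    reference down to the searched object. Each relation may contain
    several relative poses, hence the combinatorial growth."""
    hypotheses = [reference_pose]
    for relation in relation_path:  # relation: list of relative poses
        hypotheses = [compose(h, r) for h in hypotheses for r in relation]
    return hypotheses
```

Note how a path of two relations with two and three relative poses already yields six hypotheses, which is the combinatorial effect the efficient algorithm of Sect. 4.5.1 has to contain.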

The next problem arises from the question of which of the partial scene models—made available by the latest execution of Passive Scene Recognition—to use for predicting the poses of missing objects. A simple approach would be to select only the partial scene model with the highest confidence. However, such a procedure would limit the subsequent object search as follows: In every camera view the robot moves to, we would ignore all present objects that complete scene models other than the currently selected one. If competing scene models predicted different locations for the same searched objects, all but one model would be ignored, too. The other models would only be considered once the best-rated partial scene model changes. Let’s assume that the scene category “Setting—Ready for Breakfast” expects two knives on the right of the plate, while “Setting—Clear the Table” expects one knife on top of the plate. In case “Setting—Ready for Breakfast” is rated better than “Setting—Clear the Table” and the knife actually lies on the plate, several possible camera views—all pointing to the area on the right of the plate—would be driven to until ASR switches scene models. Processing all available partial scene models at once is no viable alternative, as their number can be considerable, especially of those with a low rating. As an efficient middle ground, we developed a method that re-samples a representative subset from all partial scene models, which is significantly smaller and statistically biased towards models with high confidences. This scene model sampling is based on the “Select” step of the Condensation Algorithm [11], an approach familiar from tracking objects with particle filters [23, p. 77], and is introduced in Sect. 4.5.2. Object Pose Prediction is then performed for each of the scene models within the subset. Together with the pose prediction algorithm, scene model sampling makes up the state OBJECT_POSE_PREDICTION in Fig. 4.1.
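A minimal sketch of this “Select” step, assuming a hypothetical scene-model format with a confidence field; `random.choices` performs the weighted drawing with replacement that biases the subset towards high confidences:

```python
import random

def sample_scene_models(models, k, rng=None):
    """'Select' step of the Condensation algorithm, used here to draw k
    scene models with probability proportional to confidence. The model
    format (dict with a 'confidence' entry) is an assumption."""
    rng = rng or random.Random()
    weights = [m["confidence"] for m in models]
    # Weighted sampling with replacement = Condensation's Select step.
    return rng.choices(models, weights=weights, k=k)
```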

Once a cloud of hypothetical object poses has been predicted, it is passed to RELATION_BASED_SEARCH—a subsystem that performs informed search on the basis of this cloud. Even though this subsystem is made up of a sequence of states, visualized as a state machine of its own in Fig. 4.2, it is subsumed into a single state in Fig. 4.1 for the sake of simplicity. RELATION_BASED_SEARCH consists of the following steps: In the state NEXT_BEST_VIEW_CALC, a camera view that is promising for finding missing objects is estimated on the basis of the hypothetical poses which RELATION_BASED_SEARCH just received. After such a camera view has been estimated, a goal configuration for the robot MILD that corresponds to this view as closely as possible is computed with the help of inverse kinematics. Then, MILD drives to that goal configuration in the state MOVE_ROBOT_TO_VIEW in Fig. 4.2 by controlling both its Pan-Tilt Unit (PTU) and its mobile base. As soon as the goal configuration is reached, an attempt to localize missing objects in 6-DoF is made in OBJECT_DETECTION. We assume that MILD is capable of reaching estimated camera views with a reasonable error. The discrepancies that may occur between the view from NEXT_BEST_VIEW_CALC and the view MILD actually takes over are compensated for by readjusting the PTU. The orientation of the PTU is altered so that the real and the expected viewing area of the robot head are in maximal accordance with each other.
Fig. 4.2

State diagram of the subsystem RELATION_BASED_SEARCH

In order to assess how suitable a camera view is for searching given objects, we formulated an optimization problem for selecting camera views in three-dimensional object search that emphasizes realism. The optimization algorithm that operates on this problem receives the cloud of object pose predictions and the current configuration of the robot. It seeks a reachable view that maximizes an objective function modeling the confidence of success in object localization as well as the costs for reaching the view and performing localization. Optimization problem and algorithm are discussed in detail in Sects. 4.6.1 and 4.6.2. Optimization only considers the next view for searching objects instead of looking several steps ahead to find an optimal sequence of views. Empirical results in [7] support such a greedy approach, as “in object search, frequent measurements affect available search policies so much that constant re-planning is necessary” [16]. Only optimizing the next view to be reached is a common approach in view planning, known as the Next-Best-View (NBV) problem. Our NBV estimation algorithm not only optimizes the pose of an NBV but simultaneously decides which objects are worth being searched inside the view. Assume a possible view for MILD is analyzed by NBV estimation: If there is a high probability of finding two knives in it but a low probability of also finding a spoon, NBV estimation limits time consumption in object localization by deciding that only knives are to be searched in that view. It is this optimized set of objects for which 6-DoF object localization is performed in OBJECT_DETECTION.
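A greedy toy version of this view selection might look as follows. The objective terms, weights, and thresholds are illustrative stand-ins for the actual formulation of Sects. 4.6.1 and 4.6.2, but they reproduce both aspects: rating candidate views and picking the objects worth searching per view:

```python
def next_best_view(candidates, travel_cost, p_detect,
                   p_min=0.3, w_travel=0.5, w_localize=0.1):
    """Greedy toy NBV selection. For each candidate view, keep only the
    objects likely enough to be found there, then rate the view by
    expected detection gain minus travel and localization costs.
    All weights and thresholds are illustrative assumptions."""
    best_view, best_objects, best_utility = None, [], float("-inf")
    for view in candidates:
        objects = [o for o, p in p_detect[view].items() if p >= p_min]
        gain = sum(p_detect[view][o] for o in objects)
        cost = w_travel * travel_cost[view] + w_localize * len(objects)
        if gain - cost > best_utility:
            best_view, best_objects, best_utility = view, objects, gain - cost
    return best_view, best_objects
```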

How informed search proceeds depends on whether the state OBJECT_DETECTION detects any of the objects that are to be searched in the most recent NBV. In case object localization returns a model of a previously missing object, the subsystem RELATION_BASED_SEARCH is left and the state SCENE_RECOGNITION in Fig. 4.1 is re-entered. In that state, all object models that have been generated since ASR was started are passed to PSR so that an updated set of scene models can be estimated. The new set replaces all scene models from the previously estimated set. Only partial models from the new set are going to be processed during the following run of scene model sampling. However, if no new object model could be estimated once the goal configuration had been reached—e.g. because clutter occluded the searched objects from that viewpoint—ASR remains in the RELATION_BASED_SEARCH loop. The reason for this is that in practice, an individual camera view is only capable of covering a small subset of the entire cloud of object pose predictions produced by the last run of Object Pose Prediction. The rest of that cloud, which also contains valid pose hypotheses, has to be successively covered by additional, optimized camera views until the entire cloud is used up. In order to make sure that the next execution of NEXT_BEST_VIEW_CALC does not return a view identical to that from the current execution, the cloud of pose predictions has to reflect that 6-DoF object localization has just failed. This means that any information that supported Next-Best-View estimation in choosing the most recent estimate has to be removed from the cloud. In the state UPDATE_POSE_PREDICTIONS in Fig. 4.2, this is done, simply put, by invalidating any pose prediction that is visible from the viewpoint MILD took over when it reached its goal configuration.
This means especially that poses are only invalidated if they lie within the viewing area of the camera at that goal configuration. A more accurate description of this process is given in Sect. 4.6.3.
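A simplified, planar sketch of this invalidation step; the actual procedure of Sect. 4.6.3 works with the full 3-D viewing frustum, and opening angle and range here are assumed parameters:

```python
import math

def update_predictions(cloud, cam_pos, cam_dir,
                       fov=math.radians(60), max_range=3.0):
    """Drop every predicted position lying inside the camera's viewing
    area, simplified here to a planar frustum (opening angle + range)."""
    def visible(p):
        dx, dy = p[0] - cam_pos[0], p[1] - cam_pos[1]
        if math.hypot(dx, dy) > max_range:
            return False
        angle = abs(math.atan2(dy, dx) - cam_dir)
        angle = min(angle, 2 * math.pi - angle)  # wrap into [0, pi]
        return angle <= fov / 2
    return [p for p in cloud if not visible(p)]
```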

Since this update has to be performed once OBJECT_DETECTION has failed, it must be based on the camera view that MILD has actually reached and not on the view that Next-Best-View estimation had returned beforehand. Consequently, UPDATE_POSE_PREDICTIONS might fail like OBJECT_DETECTION if there is a sufficient difference between the estimated view and the one actually reached. For example, this is the case if a sufficient number of predicted poses drops out of the viewing area of the robot because of deviations between the reached and the estimated view. If this occurs, the state machine restarts the update procedure, this time on the basis of the estimated view instead. No matter whether one or two updates are applied to the cloud, RELATION_BASED_SEARCH starts over in the state NEXT_BEST_VIEW_CALC. It is not guaranteed that the missing objects are detected by informed search, even after several loop runs in RELATION_BASED_SEARCH. Given the place setting in Fig. 1.2, such a scenario may occur when parts of it are distributed across several tables even though Object Pose Prediction expected them all to be located on one table. Thus, informed search should be aborted as soon as it becomes obvious that the currently available pose predictions do not cover the actual poses of the searched objects. To this end, several termination conditions have been integrated into NEXT_BEST_VIEW_CALC. One of them, for instance, is fulfilled once the confidence that one or more objects can be found in any estimated camera view falls below a given threshold. If any of those conditions is met, RELATION_BASED_SEARCH is left and the currently employed cloud of predictions is deleted.

At this point, the state machine has to figure out whether scene model sampling has already used up all partial scene models for predicting object poses. If so, indirect search has to be left, since no hypotheses remain that have not yet been considered for searching missing objects. Instead, direct search is performed again until any of the searched objects can be found. However, if any unused partial scene model remains from the most recent execution of PSR, the state machine re-enters OBJECT_POSE_PREDICTION in Fig. 4.1 and another iteration of indirect search starts.

4.2 Robot Software Architecture for Active Scene Recognition

After presenting the logic behind ASR in the preceding section, the next open issue is how to organize all the functionalities that ASR relies on. Given the complexity of the decision-making system we just discussed, a modular approach is clearly called for. As stated in Sect. 1.2, ASR is a contribution to the robot operating system ROS [20]. Bearing this in mind, we designed ASR as part of a distributed and deliberative robot architecture [6, p. 245], which Sect. 1.3 lists as one of the contributions of this thesis. Such an architecture in particular matches the notion of robots as rational agents [21, p. 4], as presumed by our ASR concept. This is illustrated, for example, by the fact that we model ASR as a hierarchical state machine2 that performs view planning. We implemented this kind of architecture by distributing both the logic of ASR and the functionalities it requires across dedicated, self-contained software components of equal standing, organized in a network.

The state machine is a component in itself, in which the communication within the robot architecture is centralized, similar to a Mediator Pattern [8, pp. 273 ff.]. One of its main tasks is to coordinate, by message-passing, the other components that are responsible for the functionalities. In Fig. 4.3, the component of the state machine is shown as a box in the middle, communicating with the remaining components through double arrows. Not only do the components in Fig. 4.3 mostly correspond to functionalities, but so do many states of the ASR decision-making system in Figs. 4.1 and 4.2. In consequence, many of these states can be assigned to a corresponding component. The state SCENE_RECOGNITION, for instance, mainly deals with interfacing with the ‘Passive Scene Recognition’ component in Fig. 4.3. All components that correspond to functionalities are therefore loosely arranged in Fig. 4.3 according to the order in which the different ASR steps are traversed. In the following, each of these components is discussed one after another in counterclockwise direction, with a focus on its interaction with other components.
Fig. 4.3

Schematic overview of the modular robot architecture into which ASR is integrated
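The mediator-style centralization can be sketched as follows; component names and the message format are illustrative, not the actual ROS interfaces:

```python
class StateMachineMediator:
    """Mediator-style hub: components register under a name, and every
    request between components passes through the state machine, so
    components never address each other directly. Names and message
    format are illustrative assumptions."""

    def __init__(self):
        self._components = {}

    def register(self, name, handler):
        self._components[name] = handler

    def request(self, name, message):
        return self._components[name](message)

# Two toy components wired up through the mediator.
mediator = StateMachineMediator()
mediator.register("OKB", lambda msg: {"ism_tree": msg["scene"]})
mediator.register("PSR", lambda msg: {"scenes": sorted(msg["objects"])})
```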

We begin at the upper right component,3 “Offline Knowledge Base” (OKB), the content of which never changes while ASR is executed. The opposite holds for the component “Online World Model” (OWM) at the bottom right of Fig. 4.3, which stores information ASR acquires until it is flushed, e.g. when ASR is restarted. In consequence, the OKB is already complete when ASR begins at the state OBJECT_SEARCH_INIT in Fig. 4.1, while the OWM is initially empty. Initial input for the OWM is generated in the subsequent states GLOBAL_ROBOT_LOCALIZATION and DIRECT_SEARCH: an initial estimate of the pose of the mobile base of the robot MILD from GLOBAL_ROBOT_LOCALIZATION and an initial set of object models from DIRECT_SEARCH. The most crucial among all the data present in the OKB is related to the n scene categories whose presence ASR is supposed to evaluate. In the OKB, scene categories are represented in two different manners: in terms of the absolute poses of all objects in each category—all recorded during demonstrations and required for direct search based on prior knowledge—as well as in terms of scene category models, i.e. ISM trees, that are learnt from demonstrations. Besides knowledge about scene categories that originates from sensory recordings and subsequent learning, the OKB also makes data available that is predefined by the operator of the robot.4 Predefined data, for example, includes a 2-D floor plan, which is required by the navigation framework of the mobile base, and 6-DoF poses of the furniture that is present in our experimental setup. Another location where knowledge from sensory recordings is stored is the aforementioned OWM. For instance, it contains object models whose parameters are estimated with the help of 6-DoF object localization while ASR is executed. As stated in Sect. 1.1.2, each object model includes a measured object pose.
Taken together, the poses of furniture and localized objects make it possible to visualize the environment knowledge of the robot by means of geometry models at any point during the course of ASR. The three-dimensional geometry models required for this purpose are stored in the OKB, along with classifiers for every object of each scene category for which ASR shall deliver scene models. Given that a localization algorithm is assigned to each object, a classifier corresponds to the internal representation of an object that its localization algorithm needs in order to recognize it.

The next component is “Passive Scene Recognition” (PSR), which is visible on the left of the OKB in Fig. 4.3 and mainly consists of a wrapper for our approach to recognizing scenes based on ISM trees—a major contribution of this thesis. Its interaction with the decision-making system during the course of ASR can be summarized in the following steps: At first, PSR asks the state machine to provide it with all object models that are currently stored in the OWM. When indirect search runs through its loop for the first time, such models can only originate from direct search. Considering all objects that have been recognized up to this point, PSR asks the state machine for the ISM tree of every scene category that contains at least one of those objects. In order to get access to these scene category models, the state machine has to contact the OKB. For its part, the OKB contains a dictionary pointing from any object to those scene categories that contain it. Once PSR has received both object and scene category models, it runs the scene recognition algorithm it wraps. This algorithm proceeds as shown in Fig. 3.1. As acceptance threshold, it uses \(\epsilon = 0\) to make sure that every scene model that can be calculated from the provided object models is returned, and not, e.g., only the best-rated one per scene category. Among all scene models that are estimated in this step, the incomplete ones5 are stored in a buffer. This buffer is shared by PSR and Object Pose Prediction, which is another component.
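The object-to-category dictionary kept by the OKB can be sketched as an inverted index; the data layout is hypothetical:

```python
def build_scene_index(scene_categories):
    """Inverted index from object name to the scene categories that
    contain it, as kept by the OKB (data layout is an assumption)."""
    index = {}
    for category, objects in scene_categories.items():
        for obj in objects:
            index.setdefault(obj, set()).add(category)
    return index

def categories_for(detected_objects, index):
    """All scene categories containing at least one detected object."""
    result = set()
    for obj in detected_objects:
        result |= index.get(obj, set())
    return result
```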

Besides, the scene model with the highest confidence per scene category is transmitted to the “Online World Model”. With the object models from object localization and a pose estimate for the mobile robot MILD,6 the OWM publishes all time-variant information ASR is supposed to make available to other systems, e.g. Programming by Demonstration of manipulation skills. Note that the scene models in the OWM are flushed each time before new scene models are transmitted to it. Every time scene recognition with ISM trees is performed, it re-estimates scene models for all categories, including those for which no previously missing objects have been newly detected during the last execution of object localization. For that reason, scene models in the OWM are replaced instead of being accumulated, as is the case for object models.

The component “Object Pose Prediction” (OPP) is visualized in a joint box with PSR in Fig. 4.3, even though it is a component of its own. PSR and OPP are separated by a dashed line that represents the buffer for partial scene models into which PSR writes and from which OPP reads. This buffer contains a potentially large number of incomplete scene models, each corresponding to a considerable amount of data that would otherwise have to be transferred through the state machine. The component OPP is a wrapper for two techniques: Scene Model Sampling, which is introduced in Sect. 4.5.2, and the actual pose prediction algorithm, which is defined in Sect. 4.5.1. First, Scene Model Sampling reduces the content of the buffer to a fixed-size set of scene models whose elements predominantly have high confidences. Each element of this set is then individually transmitted to the pose prediction algorithm. Given a scene model, the prediction algorithm in turn has to ask the state machine for the very ISM tree that led to this recognition result. Since scene recognition and pose prediction rely on spatial relations from the same ISM tree, the state machine has to provide it to both. Once poses have been predicted for all sampled scene models, they are merged into a joint cloud of poses that is sent to the state machine.

The next component that we discuss, “Next-Best-View (NBV) Estimation”, is located on the center left of Fig. 4.3. Like OPP, it is a wrapper for two techniques, which are presented in detail in Sects. 4.6.1 to 4.6.3. The state machine provides it with both the cloud of object poses that OPP just calculated and the current configuration of the robot MILD. From that data, the first technique estimates a camera view for successfully and efficiently detecting missing objects. Being the view that MILD is supposed to go to next, it is returned to the state machine. The second technique in this component updates clouds of object poses like those delivered by OPP. This step has already been discussed in the previous section in the context of the state UPDATE_POSE_PREDICTIONS of the decision-making system of ASR. It is used if the just estimated view does not lead to any new estimate of the whereabouts of any missing object.

By converting the camera view that is to be reached into a goal configuration, the state machine can send travel commands to the “Actuators” of MILD. Such a configuration is made up of the pose of the base of MILD as well as of the orientation of its PTU. It is estimated by the inverse kinematics for MILD as mentioned in Sect.  1.2. Inverse kinematics is part of a component “Robot Model” that is omitted in Fig. 4.3 for the sake of clarity. In contrast, the actuators of MILD are depicted on the lower left of Fig. 4.3. As soon as a goal configuration is available, travel commands are issued to the affected actuators. In particular, the mobile base employs the collision-free navigation framework, as named in Sect. 4.3, in order to drive to the goal configuration.

Once MILD reaches the goal configuration—as best it can—the decision-making system triggers the component “6-DoF Object Localization”, which is in charge of detecting missing objects together with the component “Sensors”, both at the bottom of Fig. 4.3. The detection task starts with the state machine asking the OKB for classifiers for every object that “NBV estimation” suggested searching in its latest result.7 These classifiers are then passed to the Object Localization component. This component has four localization algorithms at its disposal, all of which are mentioned in Sect. 1.2. For any provided classifier, the component launches the localization algorithms that go with it. Through the Sensor component, each algorithm can access the visual sensors whose data it needs. Each selected algorithm is executed for a predefined time span. Once a span has expired, the localization component generates an object model from each result the corresponding algorithm has returned. As soon as all algorithms are done and all results have been converted into object models, the latter are transmitted to the OWM through the state machine. There, the newly generated object models are added to the already present ones.
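The localization round can be sketched as follows; the classifier layout, the algorithm registry, and the time-span parameter are hypothetical stand-ins for the actual component interfaces:

```python
def localize_objects(classifiers, algorithms, time_span=2.0):
    """Illustrative localization round: for each classifier, run the
    matching algorithm for a fixed time span and turn every returned
    detection into an object model. All names are assumptions."""
    object_models = []
    for classifier in classifiers:
        detect = algorithms[classifier["algorithm"]]
        for pose in detect(classifier, time_span):
            object_models.append({"object": classifier["object"],
                                  "pose": pose})
    return object_models
```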

4.3 Data Acquisition from Demonstrations of Scene Variations

Acquisition of scene category models through Programming by Demonstration (PbD) takes place in two steps, as is usual for this principle. At first, demonstrated variations of a scene are recorded by sensors. Then, a scene classifier is learnt from that data. The learning of scene category models has been introduced for a single ISM as classifier in Sects.  3.4.1 and  3.4.2, while a learning algorithm for the more general ISM tree representation has been presented in Sect.  3.6. While demonstration recording has already been dealt with in relation to its outcomes in Sect.  3.3, this section shall provide an overview of the recording procedure in practice.

The following entities always participate in this process:
  • A robot head that is equipped with visual sensors to record the demonstration. It is either mounted in our sensor dome or on the mobile robot MILD, both visible in Fig.  1.8.

  • A human as the demonstrator, positioning objects and guiding the robot (head) throughout the demonstration.

  • A set of objects o that makes up the configurations that are to be recorded as n variations \(\mathbf {V}(\mathbf {S},1) \dots \mathbf {V}(\mathbf {S},n)\) of the scene category \(\mathbf {S}\) that is to be demonstrated.

  • A recording system that interacts both with the demonstrator and the sensors on the robot head. It performs 6-DoF object localization in real time—according to Sect.  1.2—and saves all estimates \(\mathbf {E}(o)\) about the demonstrated states of the objects as their trajectories \(\mathbf {J}(o)\).

We designed the learning of scene category models as a supervised learning approach. Thus, the demonstrator has to provide the following information to the recording system:
  • The name of the demonstrated scene category.

  • The names of the objects that are elements of this scene category.

  • A set of m different 6-DoF poses \(\mathbf {T}_{H}(1) \dots \mathbf {T}_{H}(m)\) for the robot head, from which the head is supposed to record different portions of the demonstrated scene variations.8

In the following, the demonstration procedure is defined for a robot head that is mounted on the mobile robot MILD. In Fig. 4.4, this process is visualized as a state diagram in which the demonstrator triggers state transitions. Technically, demonstrations in the sensor dome are just a simplified special case of demonstrations on the robot. When using the robot MILD, the demonstrator has to guide its Pan-Tilt Unit as well as its base. In the sensor dome, just the PTU has to be controlled.

  1.

    In DEMONSTRATION_INIT and GLOBAL_ROBOT_LOCALIZATION, the demonstrator communicates the name of the scene category to the robot. Then, he triggers an initial estimation of the pose of the robot in relation to a fixed world frame. This pose is tracked throughout the demonstration so as to be able to jointly represent all state estimates about objects in a common frame.

    For all variations \(\mathbf {V}(\mathbf {S},i)\) with \(i = 1 \dots n\):

    For all robot head poses \(\mathbf {T}_{H}(j)\) with \(j = 1 \dots m\):

  2.

    The demonstrator sets up the object configuration according to variation \(\mathbf {V}(\mathbf {S},i)\) of scene category \(\mathbf {S}\).

  3.

    The demonstrator guides the robot head to the goal pose \(\mathbf {T}_{H}(j)\) in MOVE_ROBOT_TO_VIEW.

  4.

    Once the goal pose is reached, the demonstrator communicates the names of the objects to the recording system for which object localization is to be started. This happens in OBJECT_DETECTION_START.

  5.

    Object state estimates \(\mathbf {E}(o)\) are recorded in OBJECT_LOCATION_RECORDING for the given object names until localization is stopped in OBJECT_DETECTION_STOP.

  6.

    Once all desired object state estimates have been recorded and saved, trajectories \(\mathbf {J}(o)\) for all objects o in the demonstrated scene category are transmitted to scene classifier learning.

To make demonstrations more efficient, the nesting of both loops can be permuted. This makes it possible to demonstrate multiple scene variations without changing the robot head pose in between. It is also not mandatory to visit each robot head pose while recording a variation \(\mathbf {V}(\mathbf {S},i)\), as long as every pose has been visited at least once. Every object for which no new estimate is recorded during the demonstration of variation \(\mathbf {V}(\mathbf {S},i)\) keeps the estimates that were available when variation \(\mathbf {V}(\mathbf {S},i-1)\) was demonstrated.
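The nested recording loops, including the permutation just mentioned, can be sketched as follows; `record_demonstration` and its parameters are illustrative assumptions, not the actual recording system:

```python
# Minimal sketch (assumed structure) of the demonstration procedure:
# nested loops over scene variations V(S, i) and head poses T_H(j).
# swap_loops=True permutes the nesting, so several variations can be
# recorded without moving the robot head in between.
def record_demonstration(variations, head_poses, record_at, swap_loops=False):
    """record_at(i, j) moves the head to pose j and records variation i."""
    outer, inner = (head_poses, variations) if swap_loops else (variations, head_poses)
    for a in range(len(outer)):
        for b in range(len(inner)):
            i, j = (b, a) if swap_loops else (a, b)
            record_at(i, j)

# Usage: collect the visiting order for 2 variations and 2 head poses.
order = []
record_demonstration([0, 1], [0, 1], lambda i, j: order.append((i, j)),
                     swap_loops=True)
# With swapped loops, each head pose is held while all variations
# are demonstrated at it, one after another.
```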
Fig. 4.4

State diagram that models the process of learning scene category models from human demonstrations

4.4 Object-Search-Related Definitions

Up to this point, the functionalities ASR relies on have mainly been introduced as black boxes with a focus on their interfaces within a robot architecture. The components in which most of the contributions in this thesis are located are PSR, OPP and “NBV estimation”. While our contributions in relation to PSR have already been explained in depth in Chap.  3, a detailed description of the algorithms that ensure the functioning of the other two components follows in this chapter. Beforehand, some preliminary definitions have to be established. Their purpose is to derive an abstract robot from the concrete mobile robot MILD, whose actuators and sensors we informally introduced in the preceding section. This abstract robot, visualized in Fig. 4.5, corresponds to the minimum requirements ASR imposes on a robot upon which it is to run. Our ASR concept can thus be seamlessly migrated to any concrete robot that complies with these requirements.

The formal quantities we define in this section characterize selected aspects of the state and the properties of the abstract robot. Moreover, we rename previously used colloquial names like “viewing area” to more precise alternatives like “viewing frustum”. Most of them are visually assigned to portions of the abstract robot in Fig. 4.5. A significant simplification in comparison to the real robot concerns the sensors with which the abstract robot is supposed to be equipped. While a multitude of cameras is mounted on the head of MILD according to Sect.  1.2, the abstract robot only “has a laser rangefinder for navigation with its mobile base and a stereo camera on a motorized PTU for object localization” [17].

We define a robot configuration \(\mathbf {C} = (x, y, \theta , \rho , \tau )\) based on the values \(\rho \) (pan) and \(\tau \) (tilt) of both degrees-of-freedom of the PTU and on the robot pose \((x, y, \theta )\). The robot pose is estimated relative to a fixed world frame and against a 2-D map. The workspace of the PTU is limited to \([\rho _{\text {min}} ,\rho _{\text {max}}] \times [\tau _{\text {min}} ,\tau _{\text {max}}]\). The poses of left \(\mathbf {T}_L\) and right \(\mathbf {T}_R\) camera in the stereo rig are given in a world frame in 6-DoF. The viewing area of a camera is defined by a frustum \(\mathbf {F} = (\text {fov}_x, \text {fov}_y, \text {ncp}, \text {fcp})\). A frustum is a four-sided truncated pyramid that is described by the horizontal and vertical field of view \(\text {fov}_x\) and \(\text {fov}_y\) of the camera as well as the near and far clipping planes \(\text {ncp}\) and \(\text {fcp}\), which set the minimal and maximal distance from which an object can be localized. A view \(V(\mathbf {T})\) is the viewing area of a camera, transformed into a world frame according to the pose \(\mathbf {T}\) of the camera. Each camera view has a direction \(\mathbf {x}(V)\), given by the optical axis. [17]
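A containment test implied by this frustum definition can be sketched as follows; `in_frustum` and the camera-frame convention (z-axis along the view direction, angles in radians) are assumptions made here for illustration:

```python
# Hedged sketch of a frustum test under the definitions above: the
# frustum F = (fov_x, fov_y, ncp, fcp) is checked in the camera frame,
# whose z-axis is assumed to point along the view direction x(V).
import math

def in_frustum(p_cam, fov_x, fov_y, ncp, fcp):
    """True if point p_cam = (x, y, z), given in camera coordinates,
    lies inside the four-sided truncated pyramid F."""
    x, y, z = p_cam
    if not (ncp <= z <= fcp):          # between near and far clipping plane
        return False
    return (abs(math.atan2(x, z)) <= fov_x / 2.0 and
            abs(math.atan2(y, z)) <= fov_y / 2.0)

# Usage: a point 1 m straight ahead lies inside a 90 x 60 degree frustum
# whose clipping planes sit at 0.4 m and 1.5 m.
print(in_frustum((0.0, 0.0, 1.0),
                 math.radians(90), math.radians(60), 0.4, 1.5))  # → True
```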

Fig. 4.5

Illustrative definition of the quantities that describe the abstract robot as required by ASR [F. Marek and P. Meißner]

In order to transform between camera views and robot configurations, a kinematic model of the used robot is required. Since the two parameters \(\rho , \theta \) of our robot configurations are redundant by definition, the inverse kinematics has multiple solutions. The inverse kinematics developed for the mobile robot MILD in [1] selects the solution, i.e. the robot configuration, that MILD can reach fastest from its current configuration.
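A redundancy resolution of this kind can be sketched as follows; the assumption that base orientation \(\theta \) and pan angle \(\rho \) add up to a desired horizontal view direction \(\psi \), as well as the angular speeds and the sampling of candidates, are illustrative simplifications and not the inverse kinematics from [1]:

```python
# Sketch: psi = theta + rho has infinitely many (theta, rho) splits, so
# we sample the pan range and keep the split that is reachable fastest
# from the current configuration, assuming base and PTU move in parallel
# with known angular speeds (all values are illustrative assumptions).
def fastest_configuration(psi, theta_cur, rho_cur, rho_lim=(-2.0, 2.0),
                          w_base=0.5, w_ptu=1.5, samples=200):
    rho_min, rho_max = rho_lim
    best, best_t = None, float("inf")
    for k in range(samples + 1):
        rho = rho_min + (rho_max - rho_min) * k / samples
        theta = psi - rho
        # travel time = max of the two independent, parallel movements
        t = max(abs(theta - theta_cur) / w_base, abs(rho - rho_cur) / w_ptu)
        if t < best_t:
            best, best_t = (theta, rho), t
    return best, best_t

# Usage: if the current configuration already realizes psi, the selected
# solution requires (almost) no movement at all.
(best_theta, best_rho), t = fastest_configuration(1.0, 0.6, 0.4)
```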

4.5 Prediction of Object Poses with Trees of Implicit Shape Models

4.5.1 Object Pose Prediction Algorithm

The functionalities the component “Object Pose Prediction” (OPP) shall provide in the ASR robot architecture are introduced in terms of formal algorithms in this and the following section. First, we present a pose prediction algorithm, which Sect.  1.3 has named as one of the contributions of this thesis. Its task is to deduce hypotheses about potential emplacements of objects \(o_{P}\) that are missing in a given scene category instance \(\mathbf {I}_{\mathbf {S}}\). Such a partial or incomplete instance, for example, is shown in 1 in Fig. 4.6 for scene category “Setting—Ready for Breakfast”. In the context of ASR, the pose prediction algorithm is supposed to provide the subsequent component “NBV estimation” with a set of hypothetical 6-DoF object poses \(\{\mathbf {T}_{P}(o)\}\) in order to complete partial instances of the n scene categories ASR is looking for. Due to “NBV estimation”, pose estimates in full 6-DoF are required instead of simpler 3-DoF positions. Camera views that “NBV estimation” returns shall, among other criteria, maximize the confidence in finding the searched objects. In this thesis, we devise an approach that estimates such views by calculating how well a hypothetical object at a predicted pose would be detected by real object localization systems. The performance of such localizers considerably depends on both the position and the orientation of the object that is to be detected. For example, a bottle is far easier to detect in a side view than in a top view.
Fig. 4.6

1–4: Identically composed configurations of objects whose relative poses also correspond. The absolute poses of the objects in 1, 2 are rotated \({90}^{\circ }\) in comparison to those in 3, 4. 1, 3: Partial instances of category “Setting—Ready for Breakfast”. 2, 4: Predicted poses for the objects of that category missing in all configurations and depicted here

The main feature of the predicted poses \(\mathbf {T}_{P}(o)\) is that each of them would further substantiate a partial scene category instance if the object o was detected at exactly that location. More concretely, they would increase the confidence \(b(\mathbf {I}_{\mathbf {S}})\) of the instance \(\mathbf {I}_{\mathbf {S}}\) from which they are deduced. A set (or cloud) of pose predictions is, for instance, visualized in 2 in Fig. 4.6. This cloud, derived from the instance that is depicted in 1, is visualized by coordinate frames, each standing for a pose \(\mathbf {T}_{P}(o)\) predicted for an object o. Each frame is connected by a straight line to the pose \(\mathbf {T}_{F}\) of the instance from which it is deduced. The color of each line stands for the object o whose pose is being predicted. These lines symbolize how our pose prediction algorithm estimates hypotheses about absolute object poses \(\mathbf {T}_{P}\) by concatenating the given reference pose \(\mathbf {T}_{F}\) with relative poses \(\mathbf {T}_{Fo}\). Such relative poses \(\mathbf {T}_{Fo}\) are generated by combining some of the spatial relations \(\mathbf {R}\) which an ISM tree \(\{m\}\) models. Since the same tree is used for recognizing instances of a scene category as well as for predicting the poses of its missing objects, both processes share the same spatial relations. Relying on spatial relations instead of absolute poses when predicting the emplacement of missing objects yields a higher adaptability to changing environmental conditions. The degrees of freedom in this technique are the scene recognition results, which are the starting point of pose prediction and express a robot’s knowledge about the current state of its environment. For instance, 3 in Fig. 4.6 displays an input configuration that is identical to the one in 1 in Fig. 4.6. The only difference between them is their overall pose in space.
1 and 3 mainly visualize scene category instances resulting from scene recognition on the visible object configurations. 2 and 4 show results from pose prediction derived from the instances in 1 and 3. When comparing the clouds of predicted poses in both figures, we see that the difference between both is nearly identical to that between the input configurations from which they originate.
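This adaptability of relation-based prediction can be illustrated with a small sketch of the underlying concatenation \(\mathbf {T}_{P} = \mathbf {T}_{F} \cdot \mathbf {T}_{Fo}\); the function `compose` and the SE(2) simplification are assumptions made here for brevity (the thesis works with full 6-DoF poses):

```python
# Sketch of why the predictions in Fig. 4.6 rotate with the input
# configuration: predicted poses depend only on the instance pose T_F
# and relative poses T_Fo from the spatial relations, never on absolute
# demonstrated poses. Poses are SE(2) triples (angle, x, y) for brevity.
import math

def compose(t1, t2):
    """Concatenate two SE(2) poses: returns t1 * t2."""
    a1, x1, y1 = t1
    a2, x2, y2 = t2
    c, s = math.cos(a1), math.sin(a1)
    return (a1 + a2, x1 + c * x2 - s * y2, y1 + s * x2 + c * y2)

# Relative pose T_Fo of a missing fork: 30 cm left of the reference.
t_fo = (0.0, -0.3, 0.0)

# The same prediction rule, applied to two instance poses that differ
# by a 90-degree rotation, yields predictions rotated in the same way.
t_f = (0.0, 1.0, 2.0)
t_f_rot = (math.pi / 2, 1.0, 2.0)
print(compose(t_f, t_fo))      # prediction left of the unrotated reference
print(compose(t_f_rot, t_fo))  # same prediction, rotated along with T_F
```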
Fig. 4.7

Data flow through the ISM tree that is depicted in Fig.  3.12, when executing the object pose prediction algorithm

The set of missing objects \(\{o_{P}\}\) for each of which we want to estimate pose predictions can be formally defined as the relative complement of the input objects \(\{i\}_{\mathbf {S}}\) of partial instance \(\mathbf {I}_{\mathbf {S}}\) in the set \(\{o\}\) of objects a scene category contains. With regard to the ISM tree, based on which predictions are generated, missing objects can be present at their leaves \(\{o\}_{L}\) but never at internal vertices. It has to be noted that leaves may occur at multiple levels in a tree, e.g. at levels 0 and 1 in the tree in Fig. 4.7 that models scene category “Setting—Ready for Breakfast”. While this figure illustrates how pose prediction operates on an ISM tree, Fig.  3.12 showed how scene recognition operates on the same tree. Contrary to scene recognition, pose prediction does not start at the highest level of an ISM tree but at root ISM \(m_{R}\). Since the connections between pairs \(m, m'\) of ISMs at neighboring tree levels continue to regulate the flow of data through the tree, they determine an order in which the ISMs can be evaluated, no matter whether we do scene recognition or pose prediction. However, during pose prediction, data flows in a direction through the tree that is opposite to that during scene recognition. More precisely, a sequence of ISMs has to be successively evaluated for every object \(o_{P}\) the poses of which shall be predicted. Such sequences, which we designate as paths, start at root ISM \(m_{R}\) and end at one of those ISMs in the tree in which the missing objects participate in spatial relations.

Paths \(\mathbf {P}_{\{m\}}(m_{s}, m_{e})\) entirely consist of connected pairs \((m_{k},m_{k+1})\) of ISMs \(m_{k},m_{k+1}\) from a tree and are formally defined in Eq. 4.3 for arbitrary starting points \(m_{s}\) and end points \(m_{e}\) in terms of ISMs. The length of a path is equivalent to the number l of pairs it consists of. Both ISMs \(m_{s}\) and \(m_{e}\) are defined in Eqs. 4.1 and 4.2 on the basis of vertices \(o_{s}\) and \(o_{e}\) of the tree between which prediction shall be done. In order to make sure that ISMs on the path can actually be evaluated during pose prediction, vertex \(o_{s}\) has to be the reference \(o_{F}(m_{s})\) of ISM \(m_{s}\), while vertex \(o_{e}\) has to be a non-reference object \(\{o\}(m_{e}){\setminus } o_{F}(m_{e})\) in ISM \(m_{e}\). A quick look at the vertices of the tree in Fig. 4.7, all depicted as circles, reveals that multiple leaves \(o_{L}\) in a tree may correspond to the same object o of the modeled scene category. This is the case because an object can participate in several spatial relations at once, depending on the connected relation topology the given ISM tree represents. The fact that in particular such an object can be missing is visible in Fig. 4.7. In this picture, vertices that correspond to missing objects are drawn as circles keyed in colors other than black. The three missing objects are “ForkRight”, “ForkLeft” and “SpoonSmall”. “ForkRight” is present at the same time on two levels (tree levels 0 and 1) of the tree in Fig. 4.7.
$$\begin{aligned} m_{s}&= \left\{ m \left| \, m \in \{m\} \wedge o_{s} = o_{F}(m) \right. \right\} \end{aligned}$$
(4.1)
$$\begin{aligned} m_{e}&= \left\{ m \left| \, m \in \{m\} \wedge o_{e} \in \{o\}(m){\setminus } o_{F}(m) \right. \right\} \end{aligned}$$
(4.2)
$$\begin{aligned} \mathbf {P}_{\{m\}}(m_{s}, m_{e})&= \{ (m_{k},m_{k+1}) \mid m_{1} = m_{s} \wedge \nonumber \\&\qquad \forall k: o_{F}(m_{k+1}) \in \{o\}(m_{k}) \wedge m_{l+1} = m_{e} \} \end{aligned}$$
(4.3)
$$\begin{aligned} \{\mathbf {P}_{\{m\}}\}(o)&= \left\{ \mathbf {P}_{\{m\}}(m_{R}, m_{e}) \left| \, o_{e} \text { equal to } o \text { in } \{m\} \right. \right\} \end{aligned}$$
(4.4)
$$\begin{aligned} \mathbf {P}_{\{m\}}^{*}(o)&= \mathop {{{\,\mathrm{\arg \!\min }\,}}}_{\mathbf {P}_{\{m\}}(o) \in \{\mathbf {P}_{\{m\}}\}(o)}\, \left| \mathbf {P}_{\{m\}}(o)\right| \end{aligned}$$
(4.5)
This issue is relevant when predicting poses of missing objects. Both the time consumption of pose prediction and the uncertainty in its estimates depend on which leaf \(o_{L}\) is used for predicting poses, i.e. as end point of the employed path. In simple terms, time consumption depends on the number of ISMs that have to be evaluated in order to generate at least a single pose \(\mathbf {T}_{P}\). This number is equivalent to the length of the used path \(\mathbf {P}_{\{m\}}(m_{R}, m_{e})\) between the root \(m_{R}\) and the ISM \(m_{e}\), containing the leaf \(o_{L}\). Since paths from the root towards leaves for the same object may have different lengths, we subsume all of them under \(\{\mathbf {P}_{\{m\}}\}(o)\) in Eq. 4.4. In order to minimize time consumption of pose prediction, we then define a function \(\mathbf {P}_{\{m\}}^{*}(o)\) in Eq. 4.5 that returns the shortest path through an ISM tree \(\{m\}\) for any object o of the represented scene category. This function is precomputed with the help of breadth-first search before the actual pose prediction algorithm starts.
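The breadth-first precomputation of \(\mathbf {P}_{\{m\}}^{*}(o)\) can be sketched as follows; the tree encoding via the `children` and `leaves` dictionaries is an assumed, simplified data structure, not the thesis' implementation:

```python
# Sketch: for every scene object o, find the shortest path of connected
# ISMs from the root m_R to an ISM in which a leaf for o participates as
# non-reference object. BFS visits ISMs level by level, so the first
# path found per object is the shortest one.
from collections import deque

def shortest_paths(children, leaves, root):
    """children[m]: child ISMs of m; leaves[m]: objects with a leaf in m.
    Returns, per object, the shortest ISM path starting at the root."""
    best = {}
    queue = deque([(root, [root])])
    while queue:
        m, path = queue.popleft()
        for o in leaves.get(m, []):
            best.setdefault(o, path)         # first (= shortest) path wins
        for child in children.get(m, []):
            queue.append((child, path + [child]))
    return best

# Usage, mirroring the situation in Fig. 4.7: "ForkRight" has leaves on
# two levels, and BFS selects the path ending at the level-0 leaf.
children = {"setting": ["setting_sub0"], "setting_sub0": []}
leaves = {"setting": ["Cup", "ForkRight"],
          "setting_sub0": ["ForkRight", "ForkLeft"]}
paths = shortest_paths(children, leaves, "setting")
print(paths["ForkRight"])  # → ['setting']
```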
As already mentioned, the length l of a path \(\mathbf {P}_{\{m\}}(o)\) for an object o affects the quality of the resulting estimates \(\mathbf {T}_{P}\), too. For example, in 2 and 4 in Fig. 4.6, it should be noted that the predictions on the left of the green plate exhibit a significantly greater variance than those on its right. This difference is caused, among others, by the fact that poses for objects on the left, such as “ForkLeft”, are predicted with the help of paths of length \(l = 2\), while an object such as “Cup” on the right has a leaf on tree level 0, i.e. directly in root ISM \(m_{R}\). The extent to which the length of the employed path influences estimate quality can be best inferred from Algorithm 18. This algorithm implements geometric operations that compute an individual pose estimate \(\mathbf {T}_{P}\) from the shortest path \(\mathbf {P}_{\{m\}}^{*}(o_{P})\) towards missing object \(o_{P}\) as well as from the pose \(\mathbf {T}_{F}\) of the given scene category instance \(\mathbf {I}_{\mathbf {S}}\). The algorithm starts by passing pose \(\mathbf {T}_{F}\) to root ISM \(m_{R}\). Then, every ISM m along the shortest path is evaluated one after another, each casting a single vote. Whilst scene recognition and pose prediction with ISM trees have in common that they successively evaluate ISMs by casting votes, pose prediction differs from scene recognition in that it casts votes for the poses of non-reference objects within ISMs instead of voting for ISM reference poses.

Given the reference pose \(\mathbf {T}_{F}\) of a scene category instance, the ISMs in a tree employ the inverses \(\mathbf {T}_{Fo}\) of the transforms, used for voting on reference poses, to calculate object pose predictions \(\mathbf {T} \leftarrow \mathbf {T}_{F} \cdot \mathbf {T}_{Fo}\). [17]

By using poses \(\mathbf {T}_{Fo}\) of object o relative to reference \( o_{F}\), we reverse the principle used in scene recognition. [16]

The reversion of spatial relations within an ISM is visualized in Fig. 4.7 by replacing the upwards arrows from Fig.  3.12 with downwards ones. In both variations, the arrows express connections between objects in an ISM. No matter whether each ISM on a path contributes a single vote or multiple votes in Algorithm 18, each ISM simply returns all cast votes without any intermediate accumulation in a voxel grid, as was the case during scene recognition. This is because pose prediction reverses not only spatial relations but, more generally, the overall approach of recognition with an ISM, which consists of integrating information from different sources, in our case objects, into one sink, i.e. the reference pose of an ISM. Instead, pose prediction distributes information from a single source across multiple sinks within each ISM. Since there is only one possibility to connect a non-reference object in an ISM \(m_{k}\) at level k with the reference of an ISM \(m_{k+1}\) at the next higher level, there is no need for transmitting state estimates \(\mathbf {E}(o_{F})\) of scene reference objects \(o_{F}\) with class and identifier labels as we did in Sect.  3.5.2. Instead, just poses \(\mathbf {T}_{P}\) are transferred, as visualized by green arrows in Fig. 4.7. As long as they are not returned by the end point of a path, these poses represent hypotheses about the emplacement of internal vertices instead of real objects. In summary, predicting a pose with Algorithm 18 amounts to applying a sequence of transforms to the pose \(\mathbf {T}_{F}\). Since every transform is subject to noise and systematic errors, both accumulate with the length of the used path. Ultimately, this leads to the difference in variances we observed in 2 and 4 in Fig. 4.6. In consequence, minimizing path lengths is not only in the interest of pose prediction runtime but also in that of pose prediction accuracy.
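In the spirit of Algorithm 18, whose listing is not reproduced here, a single prediction can be sketched as a chain of transform concatenations with one randomly chosen table entry per ISM; `predict_single_pose`, `compose` and the SE(2) simplification are illustrative assumptions (the thesis uses 6-DoF poses):

```python
# Minimal sketch of single-pose prediction: walk the shortest path from
# the root ISM to the missing object, and in each ISM concatenate the
# current pose with one randomly drawn inverse table entry T_Fo. A path
# of length l thus costs exactly l transform concatenations.
import math, random

def compose(t1, t2):
    """Concatenate two SE(2) poses t = (angle, x, y): returns t1 * t2."""
    a1, x1, y1 = t1
    a2, x2, y2 = t2
    c, s = math.cos(a1), math.sin(a1)
    return (a1 + a2, x1 + c * x2 - s * y2, y1 + s * x2 + c * y2)

def predict_single_pose(t_f, path_entries, rng=random):
    """path_entries: per ISM on the path, the relative poses T_Fo leading
    towards the missing object. One entry is drawn per ISM."""
    t = t_f                                  # start at the instance pose T_F
    for entries in path_entries:
        t = compose(t, rng.choice(entries))  # a single vote per ISM
    return t

# Deterministic usage: one table entry per ISM on a path of length 2.
pose = predict_single_pose((0.0, 0.0, 0.0),
                           [[(0.0, 0.5, 0.0)], [(0.0, 0.0, 0.2)]])
print(pose)  # → (0.0, 0.5, 0.2)
```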
However, Algorithm 18 is not the whole pose prediction algorithm as it just calculates a single prediction, while the component “NBV estimation”, following OPP, expects an entire cloud of predictions for a multitude of objects. In order to meet this requirement, we introduce Algorithm 19 which returns such a cloud \(\{\mathbf {T}_{P}(o)\}\) and is also our actual pose prediction algorithm.
The input to Algorithm 19 is the already mentioned partial scene category instance \(\mathbf {I}_{\mathbf {S}}\), from which it does not only pass on the pose \(\mathbf {T}_{F}\) to Algorithm 18 but also infers the missing objects \(o_{P}\). Objects are considered as missing when they do not belong to the input objects \(\{i\}_{\mathbf {S}}\) of the given instance but to the scene category, for which instance \(\mathbf {I}_{\mathbf {S}}\) is an example. Besides, the precomputed shortest path function \(\mathbf {P}_{\{m\}}^{*}(o)\) is passed to the pose prediction algorithm, too. Predicted poses for three missing objects are indicated as green circles at the bottom edge of Fig. 4.7. Per missing object, it is a set of poses that is output by a reddish leaf, which is located at the end of a path that is colored red, too. That each of these paths is the shortest one for a specific object becomes apparent, for instance, when looking at the path for “ForkRight”. Even though two vertices that correspond to “ForkRight” are present on tree level 1, pose prediction employs another path, which ends in a vertex on level 0. Conversely, this also means that not every ISM in the tree is necessarily evaluated in the course of Object Pose Prediction. This in particular implies that ISMs like “setting_sub2” or “setting_sub3” that include missing objects in Fig. 4.7 may not participate. The fact that neither of the two ISMs is connected to the root ISM “setting” through pose estimates shall illustrate this exclusion. Per object \(o_{P}\) that Algorithm 19 regards as missing, it generates a number \(n_{P}\) of predictions. This number is passed on to the algorithm as well. An unequal number of poses per object would potentially influence the component “NBV estimation” to privilege those objects for which more poses have been estimated.
However, this component should select camera viewpoints solely on the basis of their good accessibility with regard to the current configuration of the involved robot and of the confidence the views offer in providing numerous and accurate detection results.

Given that the computational effort in executing Algorithm 18 primarily consists of calculating coordinate transforms and that Algorithm 19 can be considered to be repeatedly running Algorithm 18, the overall effort of the pose prediction algorithm can be equated to the number of coordinate transforms it has to carry out. Since by definition, Algorithm 18 performs l transforms and is repeated \(n_{P}\) times in Algorithm 19, the overall number of transforms for predicting the poses of an object whose path has length l is \(l \cdot n_{P}\). Thus, the time complexity \(\mathcal {O}(l \cdot n_{P})\) of predicting hypothetical poses of an object is linear in relation to the length l of the employed path. We reach this low complexity because we only cast a single vote per ISM during any traversal of a path. On the contrary, in scene recognition, every ISM has cast all votes that could be generated for each state estimate provided to it. No matter whether an ISM is used for scene recognition or Object Pose Prediction, the number of votes an object casts is equal to the number a of table entries the ISM can provide for that object.

Let’s assume that pose prediction would cast all possible votes in every ISM along a path instead of one: In ISM \(m_{k}\) at tree level k, we would generate a number \(d(m_{k})\) of hypotheses about possible poses of the reference of ISM \(m_{k+1}\) on the next higher level of the tree. Since ISM \(m_{k+1}\) is directly connected to ISM \(m_{k}\), each of the pose hypotheses would trigger a separate voting process in ISM \(m_{k+1}\). Each of the voting processes would in turn cause \(a\) predictions about the emplacement of the reference of the next ISM \(m_{k+2}\) on the path, if any exists. Thus, the total number of votes, cast in ISM \(m_{k+1}\), could be recursively defined as \(d(m_{k+1}) = a \cdot d(m_{k})\). The further away from the root an ISM is located in its tree, the more votes would be cast. The ISM \(m_{l+1}\) at the end of the path would be the one in which the most votes are cast. This number of votes can be explicitly defined as \(d(m_{l+1}) = a^{l+1}\). More generally, the number of votes that can be cast within an ISM increases exponentially with the distance of the ISM from the root. The overall number of votes along a path would add up to \(\sum _{k = 1}^{l+1} a^{k}\) so that the time complexity of such an alternative approach would be \(\mathcal {O}(a^{l})\). In contrast to the initially introduced technique, this approach is intractable. It should nevertheless be noted that all \(a^{l+1}\) predictions of ISM \(m_{l+1}\) would be valid hypotheses about the poses of a missing object. Moreover, all of them can be non-redundant combinations of the table entries from the different ISMs along a path. The entirety of these combinations can either be created systematically, as described above, or by random sampling [23, p. 67]. Random sampling, for instance, could randomly select a table entry in each ISM along a path and then aggregate these entries with pose \(\mathbf {T}_{F}\) so as to obtain a predicted object pose.
This randomized approach could be repeated until all \(a^{l+1}\) combinations of votes have also been created. However, it would be more efficient to stop once a sufficient number \(n_{P}\) of poses has been estimated. Thereby, we would obtain a subset of all possible predictions that is still representative thanks to random sampling. This is exactly what we do in the presented pose prediction algorithm, since we randomly select a table entry in each ISM we traverse on a path. The corresponding command can be found in Algorithm 18. In summary, the algorithm we introduce efficiently generates a representative set of pose predictions.
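The contrast between systematic enumeration and the random sampling used by the presented algorithm can be sketched numerically; the variable names and the stand-in table entries below are illustrative assumptions:

```python
# Sketch contrasting the two ways of generating predictions: enumerating
# all a**(l+1) vote combinations along a path versus randomly sampling
# only n_P of them. Integer table entries are stand-ins; the aggregation
# with the instance pose T_F is omitted for clarity.
import itertools, random

a, l = 3, 2                          # 3 table entries per ISM, path length 2
path_tables = [list(range(a)) for _ in range(l + 1)]

# Systematic: every non-redundant combination of one entry per ISM.
systematic = list(itertools.product(*path_tables))
print(len(systematic))               # → 27, i.e. a ** (l + 1)

# Randomized (as in the presented algorithm): stop after n_P samples,
# yielding a representative subset at cost O(l * n_P).
n_p = 5
sampled = [tuple(random.choice(t) for t in path_tables) for _ in range(n_p)]
print(len(sampled))                  # → 5
```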

4.5.2 Sampling of Scene Models

For the component OPP to provide its functionality as agreed upon when introducing the robot architecture for ASR in Sect. 4.2, a technique for sampling scene models has to be provided in addition to a pose prediction algorithm. Both scene model sampling and the overall functionality of the OPP component are discussed in detail in this section. First, we should re-envision that Object Pose Prediction as a whole is intended to be the link between the states SCENE_RECOGNITION and RELATION_BASED_SEARCH in the decision-making system of ASR or, in other terms, between the components PSR and “NBV estimation”. From PSR, OPP obtains a set \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) of instances as input which are all incomplete \(\forall \mathbf {I}_{\{\mathbf {S}\}}: b(\mathbf {I}_{\{\mathbf {S}\}}) < 1\) and may belong to different scene categories \(\{\mathbf {S}\}\). Thus, the assumption from the previous section, defining that a specific instance is given in advance, cannot be maintained. Instead, the state OBJECT_POSE_PREDICTION has to decide which instance from which scene category to provide to the prediction algorithm. As an output, OPP is supposed to provide a cloud \(\{\mathbf {T}_{P}(o)\}\) of predicted poses for the searched objects to “NBV estimation”. Such clouds, for instance, are visualized in 1 and 2 in Fig. 4.8. All pictures in this figure originate from an experiment in which an ASR-performing robot seeks instances of the four scene categories “Setting—Ready for Breakfast”, “Sandwich—Setting”, “Drinks—Setting” and “Cereals—Setting”. These pictures are taken at a point in the course of ASR at which all objects of “Setting—Ready for Breakfast”, the place setting used throughout this thesis, and of “Sandwich—Setting” have already been detected. Objects of the latter stand in a line from back to front on the right-hand side of the place setting.
Fig. 4.8

1, 2: Visualization of pose predictions at a point during an execution of ASR. Object models are shown for already acquired detection results. Predictions for missing objects of categories “Cereals—Setting” and “Drinks—Setting” are visible as frames in 1 and 2, respectively. 3: Frustum of the view returned by an immediately following “NBV estimation”, containing predictions from 1 and 2. 4: After performing detection on the view in 3, models for all objects in its frustum are available in addition to those in 1, 2

In 1 in Fig. 4.8, the two missing objects of “Cereals—Setting” are a milk and a cereals box, while a water bottle and a carton of vitamin juice, both participating in “Drinks—Setting”, are missing in 2 in this figure. While both pairs of objects are moved from left to right and vice versa in a similar manner during the demonstrations for their respective scene categories, the spatial distribution of the depicted coordinate frames differs considerably. This can be explained by the relation topology each scene category model represents. While both missing objects in 1 are connected by relations to an immobile plate, they are not directly related to each other. In 2, they are. Here, the coordinate frames primarily express that the transforms between both missing objects remained almost the same during the entire demonstration for “Drinks—Setting”. While this fact is also true for the objects in 1, it becomes especially visible in 2 because of another difference: In contrast to 1, in which the ISM reference has already been found, it is still missing in 2. Due to the star topology modeled by ISMs, every table entry in an ISM is defined relative to this reference. Put differently, all votes that are cast during scene recognition point towards the reference, while all votes from pose prediction have the reference as their starting point. If we visualized the votes in 2 that are cast from the cup to the water bottle, we would notice a line, produced by the coordinate frames, very similar to that in 1. Thus, information about the movements of the objects in 2 is available to us as well, but in terms of a set of scene category instances that are arranged along a line. In 2, we only picked one of those instances to predict object poses. If we had predicted poses with all available instances, we would have obtained a line of coordinate frames similar to that in 1. In “Cereals—Setting”, the reference is the immobile plate.
Thus, all votes from scene recognition for this category fall into the same bucket, generating a single instance. In summary, information about hypothetical object poses is distributed across several instances in 2, while it is concentrated on a single instance in 1. This example shows that in ASR, relevant information about the poses of missing objects is not only distributed across the different scene categories but also within them.

Which instances we use for predicting poses determines the efficiency and the effectiveness of object search. Besides the current configuration of the employed robot, the pose estimates for missing objects are the major source of information for “NBV estimation” when deducing camera views for object search. This issue can be illustrated with the help of the laboratory setup that is shown at the beginning of the introduction to this thesis. In this setup, a table and a shelf are located at opposite ends of a room. We further assume that the instance \(\mathbf {I}_{\mathbf {S}}\) induces predictions on the table, while another instance \(\mathbf {I}_{\mathbf {S}'}\) produces predictions on the shelf. If the robot stood close to the table under these circumstances, using instance \(\mathbf {I}_{\mathbf {S}'}\) for predicting poses would provoke the robot to needlessly travel to the shelf. This would in turn increase time consumption during object search. In this example, deciding upon which instance to use requires knowledge about the emplacement of the pose predictions each considered instance provides. Thus, the predictions from both instances would have to be calculated in advance. The question then is: Why not just pass the clouds from both instances at once to “NBV estimation” so that it decides whose predictions undergo object detection first? From a technical viewpoint, there is no obstacle to such a proceeding, as clouds \(\{\mathbf {T}_{P}(o)\}, \{\mathbf {T}'_{P}(o)\}\) from different instances \(\mathbf {I}_{\mathbf {S}}, \mathbf {I}_{\mathbf {S}'}\) can be trivially fused. Providing clouds from multiple instances to “NBV estimation” at once (instead of processing them one after another) also makes it possible to simultaneously search objects from different scene categories in the same view. For example, this option is useful when predicted poses from different instances are close to each other like in 1 and 2 in Fig. 4.8.
If only a single instance was processed at a time, the objects from each category would have to be searched in separate loop runs of the indirect search, no matter whether their predictions were included in the same camera view. Since it is not even guaranteed that the objects of both categories would be searched in consecutive loop runs, it cannot be ruled out that the robot would be advised to drive to other locations before eventually returning to searching those objects it could have detected in the first place. How processing several instances at once improves efficiency can be seen, for example, in 3 and 4 in Fig. 4.8. 3 shows pose predictions, this time not as coordinate frames but as bluish, three-dimensional models of the objects whose possible poses they represent. Among them, there are predictions for every object that is missing in 1 and 2. All predictions are located within the turquoise frustum of a camera view which “NBV estimation” deduced from them. Object localization, executed right after this view has been estimated, returns models for all objects that are missing in 1 and 2. In 4, the four resulting object models are depicted in addition to the models of those objects that had previously been recognized. In summary, scene model sampling is expected to tell not only which instance(s) to pass to the prediction algorithm but also how many.

A straightforward answer would be to just pass all partial instances \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) from PSR to the pose prediction algorithm in order to obtain the most comprehensive fused cloud of poses. In the following, we designate the set \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) as a basic set. Proceeding in that manner would have at least two flaws. The first one is that the predictions of an exemplary pair of instances \(\mathbf {I}_{\mathbf {S}}, \mathbf {I}_{\mathbf {S}'}\) would have an equal influence on “NBV estimation”, no matter how much their respective confidences \(b(\mathbf {I}_{\mathbf {S}}) \ll b(\mathbf {I}_{\mathbf {S}'})\) differ. Since instances with low confidence tend to be more frequent in the basic set returned by PSR, “NBV estimation” would often provide camera views that cover poses from such instances rather than poses from instances with high ratings. Since low confidences are equivalent to low plausibility that estimated scene category instances correspond to actually existing scenes, the same applies to hypothetical object poses deduced from such instances. One approach to prevent ASR from being influenced too much by hardly plausible pose predictions is to make sure that instances with high confidences make up the majority of the instances that are passed to the pose prediction algorithm. The second flaw is related to the size of the basic set. This size corresponds to the number of scene categories PSR is looking for, multiplied by the number of instances the ISM tree returns for each scene category. This second flaw can be considerable in particular because we set the acceptance threshold \(\epsilon _{R} = 0\) to the lowest possible value in order to minimize false negatives. The result is that scene recognition produces the maximum number of scene category instances. Given any ISM tree, scene recognition can return up to one instance per bucket contained in the accumulator of its root ISM.
The risk of false negatives is particularly high when ASR begins. In the first loop runs of indirect search, even promising scene category instances still have low confidences, since usually only a limited number of objects has been found up to this point. Failure to predict object poses from such instances because of too restrictive acceptance thresholds in PSR would make it impossible for ASR to ever improve the confidences of these instances by detecting the objects that are still missing in them. Larger sizes of the basic set \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) translate into greater time consumption in the component OPP and in “NBV estimation”, as the input of both increases proportionally. In consequence, limiting the number of instances in the basic set is decisive for maintaining the performance of indirect search in ASR.

Instead of passing every instance in basic set \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) to the pose prediction algorithm when traversing the state OBJECT_POSE_PREDICTION, it would be more appropriate to just do so for a subset \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\) of it, the size \(n_{r}\) of which is sufficiently large to guarantee that all plausible pose predictions can be calculated. This set is later designated as a target set. However, the instances in the relative complement of this subset with respect to the basic set would not be discarded. Instead, if no object could be detected on the basis of this subset \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\), another equally-sized subset would be derived from the complement and used for predicting object poses when again traversing the state OBJECT_POSE_PREDICTION during the following loop run of indirect search. From Fig. 4.8, we have learnt that the number of scene category instances containing all relevant information about potential locations of missing objects varies strongly depending on the number of searched scene categories and on the relation topologies that are used for representing each of them. Since assessing this influence would exceed the scope of this thesis, we chose to manually set \(n_{r} = 20\) throughout our experiments, based on our experience with ASR. Even after \(n_{r}\) has been set to a specific value, the question remains which instances \(\mathbf {I}_{\{\mathbf {S}\}}\) from basic set \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) are actually to be moved to target set \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\). Though it is obvious that instances with high confidences should be preferentially added to \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\), those with low confidences should not be entirely ignored. Until the end of ASR, large differences \(b(\mathbf {I}_{\mathbf {S}}) \ll b(\mathbf {I}_{\mathbf {S}'})\) among the confidences of instances of different scene categories are usual, e.g. depending on the initial pose of the employed robot. Strictly inserting instances into the target set according to their confidences would therefore implicitly favor instances of certain scene categories over others. When using this strategy, it could e.g. occur that OPP insists on predicting poses of objects from scene categories for which no other missing objects can be found, instead of trying out other scene categories whose instances could still be improved.
Our approach for composing the target set from a given basic set is a two-step procedure that is presented in Algorithm 20. On the one hand, this resampling procedure strictly selects those instances within each considered scene category that have the best ratings; on the other hand, it allows considerable differences between the confidences of instances of different categories. The effects of this algorithm on an exemplary basic set are shown in Fig. 4.9. In the upper left corner of this figure, the instances of the basic set are visualized as abstract clouds, with the name of the corresponding scene category written into each cloud. The color of each cloud represents the confidence of the respective instance. The instances in the basic set cover all scene categories introduced in Fig.  1.2 at the beginning of this thesis. The first step consists of preselecting the largest possible number \(n_{n}\) of instances \(\mathbf {I}_{\{\mathbf {S}\}}^{*}\) per scene category \(\mathbf {S}\), strictly according to their confidences. By adding the same number \(n_{n}\) of instances for all scene categories to the preselection \(\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}\), we later prevent “NBV estimation” from preferentially searching missing objects of those categories that contribute more instances to the basic set than others. Unequal numbers of instances per scene category would ultimately have an impact similar to that of solely processing instances with high confidences. As already mentioned, the number of instances scene recognition returns with an ISM tree depends on the size of the accumulator of its root ISM. More precisely, it is equal to the number of accumulator buckets that have been filled with votes during scene recognition. Thus, the degree to which votes are distributed across the accumulators in a tree, and in consequence the spatial characteristics of the modeled relations, also influence this number.
Fig. 4.9

Schematic illustration of our two-step resampling procedure for selecting the set of scene category instances whose elements undergo pose prediction during the current traversal of the OBJECT_POSE_PREDICTION state

From the upper half of Fig. 4.9, which displays this preselection, we see that the largest number of instances \(n_{n}\) which we can equally select from every scene category is equivalent to the smallest number \(|\{\mathbf {I}_{\mathbf {S}}\}|\) of instances \(\{\mathbf {I}_{\mathbf {S}}\}\) any category \(\mathbf {S}\) contributes to the basic set. In Fig. 4.9, this minimum is determined by the scene category “Dishwasher—Basket Filled”. Since this category contributes just a single instance, only the single best-rated instance per scene category is transferred to the preselection \(\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}\). This preselection is visible in the upper right corner. Each transfer of an instance is visualized by a light green arrow. When comparing both sets of instances in the upper half of Fig. 4.9, we see that preselection reduces by half the share of instances whose confidences are rated low or average. Thus, the desired effect of increasing the share of instances with high ratings is reached. In our experiments, the actual numbers of instances of all categories largely exceeded those in this figure. Therefore, the total size of the preselection is considerably greater than the size \(n_{r}\) we request from target set \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\), meaning that further instances have to be left out.
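The preselection step just described can be written down in a few lines. The following is a minimal sketch, not Algorithm 20 itself, under the assumption that instances are represented as (category, confidence) pairs; the function name `preselect` is hypothetical:

```python
from collections import defaultdict

def preselect(basic_set):
    """Step 1 of the resampling procedure (sketch): keep the same number
    n_n of best-rated instances per scene category, where n_n is the
    smallest number of instances any category contributes."""
    by_category = defaultdict(list)
    for instance in basic_set:
        by_category[instance[0]].append(instance)
    # n_n equals the size of the smallest per-category group
    n_n = min(len(group) for group in by_category.values())
    preselection = []
    for group in by_category.values():
        # select strictly by confidence, in descending order
        group.sort(key=lambda inst: inst[1], reverse=True)
        preselection.extend(group[:n_n])
    return preselection
```

In the example of Fig. 4.9, the category contributing only one instance forces \(n_{n} = 1\), so only the best-rated instance of every category survives the preselection.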

Reducing the number \(|\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}| \ge |[\mathbf {I}_{\{\mathbf {S}\}}^{r}]|\) of instances beyond the first step of Algorithm 20 implies, at some point, selecting among instances of different scene categories. The starting point for developing the second step of our resampling procedure was to reinterpret scene recognition with ISM trees from the perspective of state estimation [23, p. 9]. Since scene recognition, as we define it, deduces not directly observable scene estimates from measured object detection results, the underlying problem fulfills the authors’ criteria for a state estimation problem. However, our approach to scene recognition does not return parametrical distributions, which many of the techniques from state estimation require. Instead, the returned sets of instances rather correspond to point estimates of scenes that possibly exist in the real world. In this sense, a basic set of instances is more similar to a set of random samples from a probability distribution [23, p. 3] over the space of possible instances. A prominent technique that is specifically designed for processing such sets of samples is the particle filter [23, p. 80]. In this context, the eponymous particles are synonymous with random samples. However, [14] rightly notes that results from ISMs cannot be considered as values of a sound random variable. Though probabilistic modeling of our approach should therefore be avoided, there is an undeniable similarity between our basic sets of instances with their confidences and the sets of particles with their importance factors [23, p. 79]. If the basic and the target set were made up of particles ruled by probability distributions f and g, the resampling step [23, p. 79] could be used for the transformation between both sets. This transformation would be based on drawing samples with replacement from the basic set.
The probability of drawing an instance from the basic set would not be uniform but proportional to its confidence. This would ensure that instances with higher confidences are more frequent in the target set than those with low confidences. Nevertheless, the latter would not be completely excluded.

Independent of the particular guarantees a particle filter fulfills due to probabilistic modeling, the effect of its aforementioned resampling step matches our requirements for the second step of our resampling procedure. Thus, we adopted importance resampling [23, p. 80], the technique behind the resampling step, in Algorithm 20. We normalize the confidences among all instances in the preselection in order to be able to interpret them as probabilities when sampling \(n_{r}\) instances \(\mathbf {I}_{\{\mathbf {S}\}}^{r}\) from preselection \(\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}\), and we implement importance resampling according to the “Select” step of the Condensation Algorithm [11]. The effect of this additional step on our example set of instances in Fig. 4.9 is shown in its lower half. Since this step operates on our preselection, the set \(\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}\) is depicted both as output of step 1 on the upper right and as input of step 2 on the lower left. The instances transferred by the second step from preselection \(\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}\) to target set \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\) are visualized by dark green arrows. Now, instances with high confidences make up the majority of the target set, though some scene categories are no longer represented in it.
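As a minimal sketch of this second step, the following code draws \(n_{r}\) instances with replacement, with probabilities proportional to the normalized confidences, in the spirit of the Condensation “Select” step. The data layout (category, confidence) and the function name are assumptions, not taken from Algorithm 20:

```python
import random

def importance_resample(preselection, n_r, rng=random):
    """Second resampling step (sketch): sample n_r instances with
    replacement; drawing probability is proportional to confidence."""
    total = sum(conf for _, conf in preselection)
    # build the cumulative distribution over normalized confidences
    cumulative, running = [], 0.0
    for _, conf in preselection:
        running += conf / total
        cumulative.append(running)
    target_set = []
    for _ in range(n_r):
        u = rng.random()
        # first index whose cumulative weight reaches u; the fallback
        # guards against floating-point round-off at the upper end
        index = next((i for i, c in enumerate(cumulative) if u <= c),
                     len(cumulative) - 1)
        target_set.append(preselection[index])
    return target_set
```

As intended, instances with higher confidences appear more frequently in the target set, while instances with low confidences are not completely excluded.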

It should be kept in mind that all instances which have not been taken over by our resampling procedure can still be used for predicting object poses later. This can be best explained by means of Algorithm 21, which represents the overall functionality that component OPP provides. The first operation of this algorithm is to acquire exclusive access to the buffer into which PSR writes the partial scene category instances it has recognized. The copy of this buffer’s content that Algorithm 21 generates next corresponds to the aforementioned basic set. Subsequently, this copy is reduced to the target set by our resampling procedure. Only the instances within the target set are then removed from the buffer, with duplicates being ignored. However, all duplicates are processed when the instances in the target set are successively passed to the pose prediction algorithm in the next step. \(|\{\mathbf {V}(\mathbf {S},t)\}|\) predictions are calculated per scene category that is present in the target set. This number corresponds to the length of the demonstration for scene category \(\mathbf {S}\). All individual clouds of possible object poses that are created in this manner are fused into the final result of Object Pose Prediction.
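The data flow of Algorithm 21 could be sketched as follows. All interfaces here are hypothetical: instances are assumed to be hashable (category, confidence) pairs, and the resampling procedure and pose prediction algorithm are passed in as callables; in a real system, the lock would be shared with the PSR process that writes the buffer:

```python
import threading

def object_pose_prediction(buffer, resample, predict_poses, n_r=20,
                           buffer_lock=None):
    """Sketch of the OPP data flow: lock the buffer PSR writes into,
    copy its content (basic set), reduce the copy to the target set,
    remove the target-set instances from the buffer, then predict and
    fuse pose clouds for every target-set instance."""
    lock = buffer_lock or threading.Lock()
    with lock:
        basic_set = list(buffer)             # copy of the buffer content
        target_set = resample(basic_set, n_r)
        for instance in set(target_set):     # duplicates removed only once
            if instance in buffer:
                buffer.remove(instance)
    fused_cloud = []
    for instance in target_set:              # all duplicates are processed
        fused_cloud.extend(predict_poses(instance))
    return fused_cloud
```

Instances left in the buffer remain available for pose prediction in later loop runs of the indirect search, mirroring the behavior described above.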

4.5.3 Discussion

Research into scene recognition and into object search is typically conducted independently of one another. As a side effect, little work is dedicated to algorithms and data structures that could link both. Object Pose Prediction by means of spatial relations is precisely such a link. Existing approaches to pose prediction can be classified according to the type of model they choose for representing spatial relations. For two exemplary types of models, i.e. metric and symbolic representations, we present one scientific contribution to pose prediction each. In this presentation, we focus on the impact relation models have on prediction performance. Both approaches we discuss have in common that they use just a single spatial relation in order to predict possible poses of an object. Predicting poses by evaluating entire chains of spatial relations, as we do in our randomized pose prediction algorithm, is outside the scope of these contributions.

In [15], five different sources of prior knowledge, i.e. knowledge from the web, about scenes and about the physical world in general, are fused in a Probabilistic Graphical Model. This model is then used to infer locations of a target object in terms of symbolic relations to other objects. While our approach is limited to spatial relations as its single source of information, it provides metric knowledge about the poses of missing objects rather than symbols. Symbolic information, like that provided by [15], first has to be converted into metric knowledge before it can be used to search objects by view planning. Thus, symbolic approaches have to take a detour from the subsymbolic to the symbolic level and back, which, for example, entails a loss of information. By sticking to subsymbolic representations throughout our ASR approach, we simply bypass these additional calculations.

In contrast, [13] employs a metric model for spatial relations that is based on parametrical distributions. It combines predefined knowledge about locations of objects with Qualitative Spatial Relations of two different types in order to estimate metric hypotheses about object positions. This contribution does not estimate full 6-DoF poses because modeling poses with parametrical distributions is a problematic issue. As already mentioned, knowledge about the possible orientations of searched objects is nevertheless essential for setting up realistic planning models. Another problem with this contribution is that it does not model to what extent the applicability of spatial relations to an already detected object depends on the environment in which this object is located, or more precisely, on the present scenes. In Sect. 4.1, we introduced a pair of scene categories “Setting—Ready for Breakfast” and “Setting—Clear the Table”, whose respective spatial relations required the same knife to lie either on top of a plate or right next to it. This small example already illustrates that the location of a missing object may vary considerably, depending on the scene category it belongs to.

In the preceding section, we argued that a particle filter is not applicable in ASR in its entirety because probabilistic modeling of our approach to scene recognition is troublesome. Apart from this, there are at least two other reasons why we only adopted its resampling step. The first is that the particle filter is a recursive state estimator [23, p. 9]. As such, it successively updates existing state estimates through new measurements. In contrast, ISM trees estimate scene instances from scratch any time scene recognition is requested. Therefore, data fusion for updating would rather have to take place on the level of the measurements and not on that of the estimates. Put more concretely, we could rather define a filter for object models than for scene category instances. On the level of object models, for example, we could fuse all 6-DoF poses object localization returns for an object in the course of ASR. As much as the measurement update step [23, p. 39] of particle filters seems inappropriate, its prediction step [23, p. 39] is no better suited. Since the prediction step is mainly supposed to model the dynamics of how a state changes, it could only be trivial if scene category instances were states. This is because we required all searched objects in Sect.  1.2 to be immobile during ASR. This second issue, which is related to the prediction step, confirms once more that the analogy between ASR and particle filtering does not extend beyond the resampling step.

4.6 Estimation of Next-Best-Views from Predicted Object Poses

4.6.1 Objective Function for the Rating of Camera Views

The last component of our robot architecture for ASR whose underlying techniques we discuss in depth in this chapter is “NBV estimation”. This component plays an important role in the state machine that describes the logic behind ASR. The two states NEXT_BEST_VIEW_CALC and UPDATE_POSE_PREDICTIONS, which can be considered wrappers for functionalities from that component, attest to the importance of “NBV estimation”.

With the help of such functionality, the state NEXT_BEST_VIEW_CALC estimates a Next-Best-View \(V_N\) together with the optimal set of objects \(\{o\}_N\) to be searched in view \(V_N\). Combinations of object set and NBV \((\{o\}_N,V_N)\) are returned by Algorithm 24 in Sect. 4.6.2, which receives the current view \(V_C\) of the mobile robot, the current robot configuration \(\mathbf {C}_C\) and predicted poses with lines \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) as input. [17]

Predicted poses \(\mathbf {T}_P\) alone are insufficient to reliably estimate views for object detection. Objects o are not equally well detectable from every perspective. Therefore, we additionally use lines of sight \(\{\mathbf {n}\}\), formally three-dimensional vectors, which we empirically estimate per object in the searched scene categories. Each line, defined in relation to the object frame, represents a perspective suitable for detection. In world coordinates, each line depends on the pose of its object. Lines are assigned to each pose \(\mathbf {T}_P\) in Algorithm 22. Example lines are visible in Fig. 4.10. We designate \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) as poses with lines. Before NBV estimation starts, the poses with lines are pre-filtered by the method invalidateUsedLinesOfSight() from Algorithm 25. It is called for all views \(V_{U}\) that have already been explored. Instead of invalidating entire predicted poses in the frustum of a view \(V_{U}\), it only deletes the lines of sight that point into the direction \(\mathbf {x}(V_{U})\) of the view so that the poses can still be searched from other directions. [17]

Algorithm 22 is a method that is only called after direct transitions between the states OBJECT_POSE_PREDICTION and NEXT_BEST_VIEW_CALC in the ASR state machine. Both the necessity to transform a cloud \(\{\mathbf {T}_P\}\) of pure pose predictions into poses with lines \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) and the necessity to pre-filter with the used views \(\{V_{U}\}\) from the current ASR execution can be attributed to the fact that Object Pose Prediction does not extend the already available cloud \(\{\mathbf {T}_P\}\) of predictions with every additional traversal of OBJECT_POSE_PREDICTION. Instead, all existing predictions from previously considered scene category instances \(\{\mathbf {I}_{\mathbf {S}}\}\) are replaced by new predictions that are inferred from other instances \(\{\mathbf {I}'_{\mathbf {S}}\}\). Proceeding in that manner makes sense as OBJECT_POSE_PREDICTION is only entered when the existing cloud of predictions is obsolete. This is the case either when one or several additional objects have been localized in the state OBJECT_DETECTION or when no further promising camera views to be driven to can be deduced from the cloud in NEXT_BEST_VIEW_CALC. However, it is then possible that a new prediction of hypothetical object poses from the instances \(\{\mathbf {I}'_{\mathbf {S}}\}\) generates predictions that would have been covered by one of the already used views \(V_{U}\), if these predictions had been estimated earlier. It is therefore necessary to pre-filter any new cloud of pose predictions in order to prevent NEXT_BEST_VIEW_CALC from re-estimating camera views as Next-Best-Views for searching a set \(\{o\}\) of objects that have already been looked for in a nearly identical view beforehand.
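A minimal sketch of such pre-filtering, in the spirit of invalidateUsedLinesOfSight() but not reproducing Algorithm 25 itself, might look as follows. The data layout (poses paired with world-frame unit lines of sight) and the angular threshold are assumptions:

```python
import math

def invalidate_used_lines_of_sight(poses_with_lines, used_view_direction,
                                   alpha=math.radians(45.0)):
    """Pre-filtering sketch: for each predicted pose, drop only those
    lines of sight that point into the direction of an already used
    view, so the pose can still be searched from other directions.
    `alpha` is a hand-set angular threshold (assumed here, mirroring
    the threshold used in the utility measure u_n)."""
    def angle(a, b):
        # angle between two unit 3-D vectors, clamped for safety
        dot = sum(x * y for x, y in zip(a, b))
        return math.acos(max(-1.0, min(1.0, dot)))

    filtered = []
    for pose, lines in poses_with_lines:
        kept = [n for n in lines
                if angle(n, used_view_direction) > alpha]
        filtered.append((pose, kept))
    return filtered
```

A pose whose lines of sight have all been invalidated remains in the cloud but no longer contributes to the utility of any candidate view that looks at it from the already explored direction.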

At the beginning of this section, we introduced estimating Next-Best-Views as calculating an optimal combination \((\{o\}_N,V_N)\) of a view \(V_N\) and the objects \(\{o\}_N\) that shall be searched in it. Thus, estimating Next-Best-Views can be defined as an optimization problem. Algorithm 24, which performs this estimation, is the corresponding optimization algorithm. It operates on a problem we designed with the aim of representing the underlying real-world problem as realistically as possible. However, efficiency in estimating Next-Best-Views is also a predominant issue, since the presented technique has to be executed in the loop during the RELATION_BASED_SEARCH of missing objects. While we focus our efforts on speeding up the estimation with the help of the employed optimization algorithm, the definition of the optimization problem must not be neglected in this context either.

This is in particular the case since the central element of any optimization problem, the objective function, has to be evaluated for every candidate view V whose appropriateness as Next-Best-View is to be rated in Algorithm 24. The space of parameters \(2^{\{o\}_{P}} \times \{V\}\), upon which our objective function is defined, consists of all possible combinations of subsets of the missing objects \(\{o\}_{P}\) with all candidate views \(\{V\}\) the mobile robot can reach. Since realistic modeling presumes that camera views are represented as three-dimensional viewing frustums with 6-DoF poses, the parameter space has a high dimensionality. Moreover, it is not entirely continuous because the power set \(2^{\{o\}_{P}}\) of the missing objects is discrete. In consequence, we approach the estimation of Next-Best-Views as a combinatorial optimization problem, including the associated algorithms.

A reward \(r(\{o\}, V)\) that consists of a weighted sum of a utility \(u'(\{o\}, V)\) and inverse cost functions \(i'_{p}(\mathbf {C}, \mathbf {C}_{C}), i'_{n}(\mathbf {C}, \mathbf {C}_{C}), i'_{d}(\mathbf {C}, \mathbf {C}_{C}), i'_{o}(\{o\})\) is used to rate combinations of object sets and views \((\{o\}, V)\) to find the best. [17]

The corresponding equation can be found in line 26 of Algorithm 23. We use the term “reward” for our objective function, as estimating Next-Best-Views is usually considered to be a problem from the field of automated planning. In the reward, we include inverses of cost functions instead of the cost functions themselves. Since finding optimal combinations \((\{o\}, V)\) makes it necessary to simultaneously minimize costs and maximize utility, an objective function consisting of both would be contradictory. Unifying inverse costs with a utility in a common objective function instead yields a consistent maximization problem.
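The structure of this weighted sum can be sketched as follows. The concrete weight values and the function name are assumptions for illustration; the source only specifies that the reward combines one utility term with the four inverse cost terms:

```python
def reward(utility, inverse_costs, weights):
    """Sketch of the objective function r({o}, V): a weighted sum of
    the utility u'({o}, V) and the inverse cost terms i'_p, i'_n,
    i'_d, i'_o. Using inverse costs keeps the whole objective a
    consistent maximization problem."""
    terms = [utility] + list(inverse_costs)
    assert len(terms) == len(weights)
    return sum(w * t for w, t in zip(weights, terms))
```

For example, with a utility of 0.8, inverse costs (0.5, 0.9, 0.7, 1.0) and assumed weights (0.5, 0.125, 0.125, 0.125, 0.125), the reward evaluates to 0.7875.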
Our concept of utility is closely related to that of the viewing frustums \(\mathbf {F}\) which we introduce as elements of camera views in Sect. 4.4. Like other contributions in three-dimensional object search, we see a correlation between the utility of a candidate view and the number of pose hypotheses for missing objects inside the frustum that belongs to the view. However, we do not equate both aspects, as many do. In our approach, only pose predictions \(\mathbf {T}_\mathbf {F}\) inside a frustum contribute to the utility of the corresponding candidate view V. In utility function \(u'(\{o\}, V)\), we only need to take those portions of a cloud \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) into account that are inside frustums. We designate them as \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}\). The only purpose of the utility function parameter \(\{o\}\) is to filter out those predictions \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}_{\{o\}}\) that belong to the objects \(\{o\}\) which would be searched in candidate view V. In consequence, it is cloud \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}_{\{o\}}\) which is the actual basis for rating a candidate view. It is obvious that parameter \(\{o\}\) could equally be replaced by the cloud \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}_{\{o\}}\). While cloud \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}_{\{o\}}\) is actually used in Algorithm 24, we define utility as \(u(V, \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\})\) in Eq. 4.6 for the sake of simplicity. The sum across all poses with lines \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}\) in this equation corresponds to the correlation between the number of predictions and the utility we just postulated.
$$\begin{aligned} u(V, \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}) = \sum \limits _{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\}) \in \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}} u_a(V, \mathbf {T}_\mathbf {F}) \cdot u_d(V, \mathbf {T}_\mathbf {F}) \cdot u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\}) \end{aligned}$$

Utility function \(u(V, \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\})\) rates each pose prediction \((\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) in a view V with respect to the confidence that it is detectable by object localizers, given its 6-DoF pose \(\mathbf {T}_\mathbf {F}\). Confidence of detection is defined as optimal for a pose when the measures \(u_a(V, \mathbf {T}_\mathbf {F})\), \(u_d(V, \mathbf {T}_\mathbf {F})\), \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) are maximized at once: \(u_a(V, \mathbf {T}_\mathbf {F})\), which evaluates the angle \(\angle (\mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}})\) between camera view direction \(\mathbf {x}(V)\) and the ray \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\) between camera position \(\mathbf {p}\) and position \(\mathbf {p}_\mathbf {F}\) (from predicted pose \(\mathbf {T}_\mathbf {F}\)), favors that the predicted pose lies at the center of the camera field of view. \(u_d(V, \mathbf {T}_\mathbf {F})\), which evaluates the projection \(\langle \mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\rangle \) of ray \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\) on the view direction, favors that the predicted pose lies halfway between ncp and fcp along the view direction. \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\), which evaluates the angle between ray \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\) and that line of sight \(\mathbf {T}_\mathbf {F} \cdot \mathbf {n}\) (in world coordinates) of the predicted pose which is most similar to \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\), ensures that the predicted object is observed from a perspective favorable for its detection. [17]

$$\begin{aligned}&u_a(V, \mathbf {T}_\mathbf {F}) =\nonumber \\&{\small {\left\{ \begin{array}{ll} \frac{1}{2} + \frac{1}{2}\cos \left( \frac{\pi }{\frac{1}{2}\min (\text {fov}_x, \text {fov}_y)} \angle (\mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}})\right) &{} \text {if } \angle (\mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}) \le \frac{1}{2}\min (\text {fov}_x, \text {fov}_y)\\ 0 &{} \text {otherwise} \end{array}\right. }}\end{aligned}$$
$$\begin{aligned}&u_d(V, \mathbf {T}_\mathbf {F}) =\nonumber \\&{\small {\left\{ \begin{array}{ll} \frac{1}{2} + \frac{1}{2}\cos \left( \frac{\pi }{\frac{\left| \text {fcp}-\text {ncp}\right| }{2}} \left( \langle \mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\rangle - \frac{\text {fcp}+\text {ncp}}{2}\right) \right) &{} \text {if } \left| \langle \mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\rangle - \frac{\text {fcp}+\text {ncp}}{2}\right| \le \frac{\left| \text {fcp}-\text {ncp}\right| }{2}\\ 0 &{} \text {otherwise} \end{array}\right. }}\end{aligned}$$
$$\begin{aligned}&u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\}) =\nonumber \\&{\small {\left\{ \begin{array}{ll} \frac{1}{2} + \frac{1}{2}\cos \left( \frac{\pi }{\alpha } \min \limits _{\mathbf {n} \in \{\mathbf {n}\}}\left( \angle (-\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}, \mathbf {T}_\mathbf {F} \cdot \mathbf {n})\right) \right) &{} \text {if } \min \limits _{\mathbf {n} \in \{\mathbf {n}\}}\left( \angle (-\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}, \mathbf {T}_\mathbf {F} \cdot \mathbf {n})\right) \le \alpha \\ 0 &{} \text {otherwise} \end{array}\right. }} \end{aligned}$$
All three measures are formally defined in Eqs. 4.7 to 4.9. The reason why Eq. 4.9 rates the angles between lines of sight \(\mathbf {T}_\mathbf {F} \cdot \mathbf {n}\) and rays \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\) instead of comparing those lines with view directions \(\mathbf {x}(V)\) is that the latter do not express from which direction a candidate view V looks at the predicted pose \(\mathbf {T}_\mathbf {F}\). This becomes obvious in Fig. 4.10, in which the view direction points into the void instead of at either sphere. Instead, it is the two dashed lines in this picture that visualize how the depicted camera view with position \(\mathbf {p}\) perceives either sphere at pose \(\mathbf {p}_\mathbf {F}\). The angles between the lines of sight and the dashed lines, visualized as black circles in this picture, correspond to the angles that measure \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) rates. In order for this measure to decide up to which angle values it considers a view favorable for visually covering a predicted pose \(\mathbf {T}_\mathbf {F}\), we introduce a threshold \(\alpha \) that is set by hand. Any predicted pose whose angles exceed this threshold does not contribute to the utility of a candidate view.
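The three measures and the sum from Eq. 4.6 can be sketched directly from Eqs. 4.7 to 4.9. In this sketch, angles are in radians, and the scalar arguments (angle to the view axis, projection on the view direction, smallest line-of-sight angle) are assumed to have been computed beforehand from V and \(\mathbf {T}_\mathbf {F}\):

```python
import math

def f(x, maximum):
    """Normalization curve f(x, max) = 1/2 + 1/2*cos(pi*x/max)."""
    return 0.5 + 0.5 * math.cos(math.pi * x / maximum)

def u_a(angle_to_axis, fov_x, fov_y):
    """Eq. 4.7: rate the angle between view direction x(V) and the ray
    p->p_F; zero beyond half the smaller field-of-view angle."""
    limit = 0.5 * min(fov_x, fov_y)
    return f(angle_to_axis, limit) if angle_to_axis <= limit else 0.0

def u_d(projection, ncp, fcp):
    """Eq. 4.8: rate the projection of the ray on the view direction;
    best halfway between near (ncp) and far (fcp) clipping plane."""
    half_depth = abs(fcp - ncp) / 2.0
    offset = abs(projection - (fcp + ncp) / 2.0)
    return f(offset, half_depth) if offset <= half_depth else 0.0

def u_n(best_line_angle, alpha):
    """Eq. 4.9: rate the smallest angle between -p->p_F and any line
    of sight of the predicted pose; alpha is the hand-set threshold."""
    return f(best_line_angle, alpha) if best_line_angle <= alpha else 0.0

def utility(contributions):
    """Eq. 4.6: sum, over all poses with lines inside the frustum, of
    the product of the three measures (u_a, u_d, u_n) per pose."""
    return sum(a * d * n for a, d, n in contributions)
```

Multiplying the three measures per pose, as in Eq. 4.6, means that a single low rating already marks the pose as hard to detect from the candidate view.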
Fig. 4.10

Geometric quantities by means of which utility measure \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) is calculated for pose predictions inside a frustum [F. Marek and P. Meißner]

While Fig. 4.10 is supposed to illustrate Eq. 4.9, it is up to Fig. 4.11 to do so for Eqs. 4.7 and 4.8. The frustum that is displayed in Fig. 4.11 contains two predicted poses, too. The first prediction is located at position \(\mathbf {p}_{1}\) and belongs to a box-shaped object. The second one has position \(\mathbf {p}_{2}\) and belongs to a cylindrical object. Both the angles \(\angle (\mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}})\), from which measure \(u_a(V, \mathbf {T}_\mathbf {F})\) is calculated, and the line segments \(\langle \mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\rangle - \frac{\text {fcp}+\text {ncp}}{2}\), on which measure \(u_d(V, \mathbf {T}_\mathbf {F})\) is based, are colored red.
Fig. 4.11

Geometric quantities by means of which utility measures \(u_a(V, \mathbf {T}_\mathbf {F})\), \(u_d(V, \mathbf {T}_\mathbf {F})\) are calculated for pose predictions inside a frustum [F. Marek and P. Meißner]

In the utility function \(u(V, \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\})\) from Eq. 4.6, the three aforementioned measures are combined by multiplication. This ensures that an object is considered hard to detect at pose \(\mathbf {T}_\mathbf {F}\) as soon as any of the three measures returns a low rating for that pose. Let's assume, for instance, that the measures \(u_a(V, \mathbf {T}_\mathbf {F})\), \(u_d(V, \mathbf {T}_\mathbf {F})\) indicate that pose \(\mathbf {T}_\mathbf {F}\) is located at the center of the frustum of a candidate view V. The missing object this pose belongs to could still not be detected if measure \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) indicated that the view covers no side of the object from which it is well detectable. As in Sect.  3.4.4, we combine measures that operate on incommensurable spaces, so each of them has to be normalized before they are multiplied by each other. We normalize each measure with a trigonometric function \(f(x,\text {max}) = \frac{1}{2} + \frac{1}{2}\cos \left( \frac{\pi }{\text {max}} x\right) \), where x stands for one of the geometric quantities from which the measures are calculated. This function imposes a curve on the interval \([0, \text {max}]\) that resembles one half of a normal distribution but lacks the variance \(\sigma \) that would otherwise have to be manually fixed. Moreover, \(f(x,\text {max})\) is a function “that is monotonically decreasing and for which applies: \(f(0,\text {max}) = 1\) and \(f(\text {max},\text {max}) = 0\)” [17].
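The normalization \(f(x,\text {max})\) and the multiplicative combination of the three measures can be sketched as follows (the function names and arguments are illustrative; the thesis defines the measures on geometric quantities of candidate views):

```python
import math

def f(x, maximum):
    """Monotonically decreasing on [0, max] with f(0) = 1 and f(max) = 0."""
    return 0.5 + 0.5 * math.cos(math.pi / maximum * x)

def utility(angle, angle_max, depth_offset, depth_max, sight_angle, alpha):
    """Combine the three normalized measures by multiplication.

    Each factor lies in [0, 1]; a single low factor pulls the product
    down, so a pose rated poorly by any measure contributes little utility.
    """
    u_a = f(angle, angle_max)         # angle to the view direction
    u_d = f(depth_offset, depth_max)  # offset from the mid-frustum depth
    u_n = f(sight_angle, alpha)       # best line-of-sight angle
    return u_a * u_d * u_n
```

A pose that is ideal in all three respects yields the maximal utility of 1, while exceeding any single bound drives the product to 0.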

Being the basis for the reward \(r(\{o\}, V)\), values \(u(V, \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\})\) of the utility function now have to be combined with those of the inverse cost functions. However, since Eq. 4.6 sums over an entire set of poses with lines, the utility function is not normalized in general. Thus, the utility function has to be normalized as well. Reward \(r(\{o\}, V)\) actually already processes normalized utilities, because it does not include the function \(u(\{o\}, V)\) but its normalized variant \(u'(\{o\}, V)\). The respective normalization takes place in line 24 of Algorithm 23 with the help of the expression \(u(\{o\}^{*},V^{*})\). This expression corresponds to the highest utility that the algorithm can derive from the portion \(\{(\{o\}, V)\}\) of the parameter space it currently considers. Normalizing by the number of predicted poses inside the frustum of the view in question is no alternative, as this would destroy the desired correlation between the number of predicted poses and the value of the normalized utility function.
$$\begin{aligned}&i'_p(\mathbf {C}, \mathbf {C}_C) = \min \left( 1 - \frac{|\rho - \rho _C|}{|\rho _{\text {max}} - \rho _{\text {min}} |}, 1 - \frac{|\tau - \tau _C|}{|\tau _{\text {max}} - \tau _{\text {min}}|}\right) \end{aligned}$$
$$\begin{aligned}&i'_n(\mathbf {C}, \mathbf {C}_C) =\nonumber \\&{\small {\left\{ \begin{array}{ll} 1 - \frac{\min \left( \left| \theta - \theta _C\right| , 2\pi - \left| \theta - \theta _C\right| \right) }{2\pi } &{} \text {if } (x,y) \approx (x_{C},y_{C})\\ 1 - \frac{\min \left( \left| \gamma - \theta _C\right| , 2\pi - \left| \gamma - \theta _C\right| \right) }{2\pi } - \frac{\min \left( \left| \theta - \gamma \right| , 2\pi - \left| \theta - \gamma \right| \right) }{2\pi } &{} \text {otherwise} \end{array}\right. }} \end{aligned}$$
$$\begin{aligned}&i'_o(\{o\}) = 1 - \frac{\sum \limits _{o \in \{o\}}t(o)}{\sum \limits _{o \in \{o\}_P}t(o)} \end{aligned}$$
Most of the inverse cost functions we employ in our reward take the goal configuration \(\mathbf {C}\) as a parameter instead of the candidate view V on which our utility function relies. Thus, for every candidate view that is to be rated, a corresponding robot configuration that allows the left camera of the robot head to reach the view has to be estimated with inverse kinematics. This is particularly necessary since the kinematics of the abstract robot from Sect. 4.4 on which ASR is based is redundant: the same view may match multiple robot configurations, each being assigned different costs.

Inverse costs from current configuration \(\mathbf {C}_C\) to goal configuration \(\mathbf {C}\) are a weighted sum of normalized inverted travel costs for PTU \(i'_p(\mathbf {C}, \mathbf C _C)\), robot orientation \(i'_n(\mathbf {C}, \mathbf C _C)\) and position \(i'_d(\mathbf {C}, \mathbf C _C)\) (distance normalized by a Gaussian) as well as \(i'_o(\{o\})\) which contains the runtimes t(o) of running detection for all objects in combination \((\{o\}, V)\) in relation to the sum of detection runtimes for all searched objects \(\{o\}_P\). [17]

All four of these cost expressions are formally defined in Eqs. 4.10 to 4.13. The fact that these equations define inverse costs can be deduced from the shape of most of them: they subtract the actual normalized costs from the respective maximum costs. Concerning the inverse costs \(i'_p(\mathbf {C}, \mathbf C _C)\) for the PTU in Eq. 4.10, it should be noted that the distances both axes \(\rho , \tau \) have to travel in order to reach the goal configuration of the PTU are not added up but compared by a minimum operator. The reason for this is that both axes are driven in parallel and not sequentially: the slower axis determines the overall time consumption, and the minimum of the inverse terms corresponds to the maximum of the actual costs. The minimum operator is also used in Eq. 4.11, which defines the inverse costs \(i'_n(\mathbf {C}, \mathbf {C}_C)\) with respect to the orientation \(\theta \) of the mobile base. There, the minimum operator expresses that the base always turns in the direction in which the smaller angle has to be covered for reaching the goal orientation \(\theta \). This equation differentiates between two cases, depending on whether the current position \((x_{C}, y_{C})\) and the goal position \((x, y)\) of the mobile base are approximately identical. If so, the robot just needs to turn around its own axis so as to reach the goal configuration for its base. Otherwise, it has to turn twice. At the position \((x_{C}, y_{C})\) of its current configuration \(\mathbf {C}_C\), it first turns towards an interim orientation \(\gamma \) so that it can drive as linearly as possible to the position \((x, y)\) of the goal configuration \(\mathbf {C}\). Once it has reached this position, it turns towards the goal orientation. Linear, translational movements of the mobile base are favored since the inverse costs \(i'_d(\mathbf {C}, \mathbf {C}_C)\) for relocating the base are approximated in Eq. 4.12 through the Euclidean distance \(\sqrt{\left( {x - x_C} \right) ^2 + \left( {y - y_C} \right) ^2}\) between its goal and its current position.
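As a small sketch of the parallel-axes argument behind Eq. 4.10 (the axis limits below are illustrative placeholders, not values from the thesis), the inverse PTU cost can be written as the minimum of the two per-axis inverse terms:

```python
import math

def i_p(rho, tau, rho_c, tau_c,
        rho_lim=(-math.pi, math.pi),          # assumed pan range
        tau_lim=(-math.pi / 2, math.pi / 2)): # assumed tilt range
    """Inverse PTU travel cost: pan and tilt move in parallel, so the
    more expensive axis dominates; min over the inverse (1 - cost) terms
    therefore corresponds to max over the actual costs."""
    inv_pan = 1.0 - abs(rho - rho_c) / abs(rho_lim[1] - rho_lim[0])
    inv_tilt = 1.0 - abs(tau - tau_c) / abs(tau_lim[1] - tau_lim[0])
    return min(inv_pan, inv_tilt)
```

Staying at the current PTU configuration yields the maximal inverse cost of 1, while a half-range pan turn alone already halves it, regardless of the tilt axis being idle.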
Fig. 4.12

Geometric quantities from which we derive the costs of navigating the mobile base towards its goal configuration [F. Marek and P. Meißner]

The movement model underlying the cost expressions \(i'_n(\mathbf {C}, \mathbf {C}_C), i'_d(\mathbf {C}, \mathbf {C}_C)\) that refer to the mobile base is visualized in Fig. 4.12. The portions of the model both expressions cover are colored red. Of course, collision-free navigation makes the robot follow non-linear paths in general. In Fig. 4.12, for example, a robot in the dark grey configuration would have to drive around the table in order to reach the hatched robot configuration. If it followed the dashed line segment instead, it would collide with the table. The precise path that navigation estimates for the robot is not used for rating every candidate view, as such an approach would cause a considerable number of repeated executions of motion planning [22, p. 109] and therefore prohibitive computational costs at the current point in time. In summary, both aforementioned cost expressions make up a lower bound for the actual travel costs.
Fig. 4.13

Geometric quantities from which we derive the costs of orienting the PTU towards its goal configuration [F. Marek and P. Meißner]

The model with which we approximate the costs for moving the PTU is shown in Fig. 4.13. On the left, we see an exemplary configuration the PTU is supposed to have adopted, together with a pair of coordinate frames. Each frame includes one of the axes around which the PTU turns. The distances each of the two axes has to cover in order to transfer the PTU from its current configuration \((\rho _{C},\tau _{C})\) to the goal configuration \((\rho ,\tau )\) on the right of Fig. 4.13 are visualized in the middle of this picture. Apart from the inverse translational travel costs \(i'_d(\mathbf {C}, \mathbf {C}_C)\), all cost expressions employed in our reward are normalized in a straightforward manner. The translational expression, however, is normalized by the density function of a normal distribution that lacks its preceding factor. We normalize in this way because no upper bound exists for the values of the Euclidean distances that are to be rated. The manually fixed variance \(\sigma \) of this distribution nevertheless enables us to roughly distinguish between cheap and expensive translational travel costs. Indirectly, we thereby set the range of coverage of the mobile robot in the course of ASR.
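The Gaussian normalization of the translational cost can be sketched as follows (the default value of \(\sigma \) is an illustrative assumption; in the thesis it is set by hand):

```python
import math

def i_d(x, y, x_c, y_c, sigma=1.0):
    """Inverse translational travel cost: the Euclidean distance between
    goal (x, y) and current (x_c, y_c) base position, normalized by an
    unscaled Gaussian density so that unbounded distances map to (0, 1]."""
    d = math.hypot(x - x_c, y - y_c)
    return math.exp(-d * d / (2.0 * sigma * sigma))
```

Staying in place yields the maximal value of 1, and the value decays smoothly and monotonically with distance, which is exactly the soft "cheap versus expensive" distinction described above.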

When adding up the diverse cost expressions in our reward, the question arises what they actually represent as a whole. A straightforward and universal approach to quantifying the costs for a robot traveling to a view V and executing object localization for all objects \(\{o\}\) potentially present in that view is time consumption. Since an exact estimation of time consumption would require a costly simulation, we favored subdividing the overall costs among the four aforementioned expressions instead. For each cost aspect covered by one of the terms, we identified a quantity that can be efficiently estimated and correlates with the time consumption of the aspect it represents. Of course, proceeding in this way entails simplifications. For example, all of our inverse travel costs implicitly presume constant velocity. What the individual cost expressions do not cover is the percentage with which every single aspect contributes to the overall time consumption. The share of each expression is therefore modeled by the weights \(\omega _{\rho }, \omega _{\theta }, \omega _{d}, \omega _{o}\) in line 26 of Algorithm 23, which we set by hand.

Altogether, the four inverse cost expressions form an objective function with a single peak, its global maximum lying at the current configuration of the robot, where it would look for a single missing object. The utility function as an objective function, on the other hand, produces local maxima in proximity to clusters of predicted object poses. It can hardly be expected that local maxima of inverse costs and utility coincide at the same locations in the environment of the robot. Under these circumstances, utility and inverse costs would mutually damp each other if we multiplied both in the reward \(r(\{o\}, V)\). Multiplication would just make high rewards easy to identify, while the frontier between poor and average rewards would be blurred. In order to keep this frontier discernible, we decided to aggregate utility and inverse costs by summation instead.
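A minimal sketch of this aggregation by weighted summation (weight names follow the text; their values here are illustrative, not the hand-set values from the thesis):

```python
def reward(u_norm, inv_p, inv_n, inv_d, inv_o,
           w_rho=0.25, w_theta=0.25, w_d=0.25, w_o=0.25):
    """Aggregate normalized utility and the four inverse cost terms by a
    weighted sum, as opposed to multiplying utility and inverse costs."""
    return u_norm + w_rho * inv_p + w_theta * inv_n + w_d * inv_d + w_o * inv_o
```

With a sum, a drop in utility shifts the reward by the same amount everywhere, so the ordering among poor and average candidates is preserved rather than compressed towards zero as it would be under multiplication.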

4.6.2 Optimization Algorithm for Next-Best-View Estimation

Modeling view planning in a realistic way does not only yield increased complexity and therefore computational costs but also, in our case, a non-linear objective function \(r(\{o\}, V)\). Thus, we treat estimating Next-Best-Views as a non-linear optimization problem [4, p. 861]. We have already dealt with such problems when verifying ISM accumulators in Sect.  3.4.4 and when selecting relation topologies in Sect.  3.6.2. Before explaining which optimization algorithm we designed for estimating camera views under these circumstances, we first analyze the parameter space \(2^{\{o\}_{P}} \times \{V\}\) of our reward in detail.

Apart from being high-dimensional, this parameter space is infinite in size. This is due to the space of 6-DoF camera poses it comprises. Furthermore, the space is exponential in size with respect to the number \(|\{o\}_{P}|\) of searched objects it includes. While the dimensionality of the object-related portion of the parameter space cannot be reduced without resorting to heuristics, the time consumed in evaluating the objective function across the power set \(2^{\{o\}_{P}}\) can be reduced on an algorithmic level, e.g. by dynamic programming.

The situation is different for the camera pose space, on which the optimization-relevant parameters of the views \(\{V\}\) are defined. The kinematic chain of the abstract robot we introduced in Sect. 4.4 makes it possible to reduce the dimensionality of this camera-related portion of the parameter space in two ways. On the one hand, the height of the sensor head above the ground is constant, so that only the coordinates \((x, y)\) of the camera in the two-dimensional plane of the floor have to be optimized. We also designate these coordinates as two-dimensional robot positions. On the other hand, the sensor head can only be pivoted in two \((\rho ,\tau )\) of the three degrees-of-freedom with which a rigid body can be rotated in space. This is because the PTU, which moves the sensor head, only has two rotational axes. Despite both reductions, the resulting 4-DoF space is still infinite. However, only the portion referring to robot positions is defined on an unbounded domain [4, p. 118]. In comparison, the orientations \((\rho ,\tau )\), which we also call camera orientations, are defined on the bounded interval \([{0}^{\circ }, {360}^{\circ }]\).

The subject of this section is to estimate Next-Best-Views. According to Sect.  1.3, Next-Best-View estimation is one of the contributions of this thesis. In the preceding section, we introduced this task as an optimization problem with a partially continuous parameter space. Since it should nevertheless be handled by algorithms from combinatorial optimization, we discretize the combined robot position and camera orientation space. Of course, this operation may exclude the globally optimal camera view from optimization. Combining this discretized space with the power set of the searched objects \(\{o\}_{P}\) yields the search space for which we introduce a Next-Best-View estimation algorithm in this section. By simultaneously operating on all degrees-of-freedom that are relevant for adjusting the camera during object search, optimization can reach parts of the search space that would be out of reach otherwise—e.g. if we separately optimized robot position, camera orientation and the objects to be detected.

In the search space \(\{(x,y)\} \times \{\mathbf {q}\} \times 2^{\{o\}_{P}}\) we just defined, the only portion that remains infinite is that of the robot positions \(\{(x,y)\}\), since only a finite number of camera orientations \(\{\mathbf {q}\}\) can be sampled from the domain \([{0}^{\circ }, {360}^{\circ }]\). Estimating Next-Best-Views is thus a combinatorial optimization problem with a non-linear objective function. This formalization is analogous to that of [26], who proved that Next-Best-View estimation is NP-complete. Thus, globally optimal solutions cannot be found efficiently with state-of-the-art techniques. However, if the search space were sufficiently small, a global optimum could still be found, e.g. by exhaustive search. Handling the infiniteness of the search space so that only a finite but relevant subset is considered during optimization is the basic precondition for being able to use exhaustive search. We resolve this issue by defining a rectangular area around the current position \((x_{C},y_{C})\) of our mobile robot. Robot positions are then only sampled from the inside of that area. The surface of this rectangular area should be consistent with the range of coverage of the mobile robot during ASR. This range is set by means of the variance \(\sigma \) from the inverse cost function \(i'_d(\mathbf {C}, \mathbf {C}_C)\). While position sets \(\{(x,y)\}\) extracted from such areas are finite, their size can still be considerable, depending on the sampling resolution and the surface from which they are sampled. Just as the range of coverage of a mobile robot can theoretically be increased without limit, no upper bound for the potential size of position sets can be specified. Nor can we specify an upper bound for the costs of exhaustive search on the search space in general.

In order to remain as close as possible to exhaustive search while still attaining efficiency, we define an optimization algorithm in Algorithm 24 that mixes exhaustive search with an approach based on the divide-and-conquer paradigm. Put more concretely, we exhaustively analyze all subsets \(2^{\{o\}_{P}}\) of the searched objects and all camera orientations \(\{\mathbf {q}\}\), whereas we traverse sets of robot positions with the help of divide-and-conquer. Robot position space “is iteratively discretized with increasing resolutions \(\epsilon \)” [17] by repeatedly executing Algorithm 23 within Algorithm 24. We combine all robot positions that are sampled in the course of this approach with the set of sampled camera orientations and the elements of the power set of the searched objects. The iterative approach to sampling robot positions includes a heuristic that greedily selects the best-rated robot position at resolution \(\epsilon _{n}\) as the starting point for searching better positions in the next iteration at resolution \(\epsilon _{n+1}\).
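The refinement over robot positions can be sketched schematically as follows. This is not the thesis' Algorithm 24 (which also iterates over orientations and object sets, uses a hex grid and two termination thresholds); it only illustrates, under simplified assumptions and with illustrative names, the greedy divide-and-conquer idea of halving the sampled area around the previous best position while doubling the resolution:

```python
def refine(rate, start, eps0=1.0, eps_max=0.0625, t_v=1e-3):
    """Greedy coarse-to-fine search over 2-D positions.

    rate:  objective function on (x, y) tuples (stands in for the reward)
    start: current robot position, center of the first sampled area
    """
    best_pos, best_rating = start, rate(start)
    eps = eps0
    while eps >= eps_max:                       # resolution bound (cf. eps_inf)
        # sample a fixed-size grid of candidates around the current best;
        # shrinking eps halves the covered area and doubles the resolution
        candidates = [(best_pos[0] + dx * eps, best_pos[1] + dy * eps)
                      for dx in range(-2, 3) for dy in range(-2, 3)]
        new_pos = max(candidates, key=rate)
        new_rating = rate(new_pos)
        if new_rating - best_rating < t_v:      # ratings converged (cf. t_V)
            break
        best_pos, best_rating = new_pos, new_rating
        eps /= 2.0
    return best_pos, best_rating
```

Because each step keeps the grid size constant while shrinking its extent, the number of rated positions grows linearly in the number of iterations rather than with the area at full resolution.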

In the first execution of iterationStep(), i.e. Algorithm 23, we tessellate the entire rectangular area in robot position space being of interest to object search, at a coarse resolution \(\epsilon _0\). We do so with a hex grid [18] that is aligned to the current position of the robot \((x_{c},y_{c})\). [17]

Alignment is attained by placing the current position of the robot at the center of one of the generated hexagons. Tessellating the two-dimensional floor with a hex grid instead of a square grid is advantageous as soon as the rooms in which we perform ASR are not rectangular. In our experience, samples from hex grids lie closer to obstacles in such rooms. Thus, the robot can get closer to searched objects that are located on top of obstacles.
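A minimal sketch of generating such a grid, assuming a plain hexagonal packing of cell centers aligned to the robot position (this stands in for getHexGrid(); names and the exact grid convention are illustrative):

```python
import math

def hex_grid(center, half_w, half_h, spacing):
    """Positions on a hexagonal lattice covering a rectangle around center.

    spacing: distance between horizontally neighboring cell centers.
    Every second row is shifted by half a spacing, which produces the
    hexagonal packing; the row height is spacing * sqrt(3) / 2.
    """
    positions = []
    row_height = spacing * math.sqrt(3.0) / 2.0
    rows = int(half_h / row_height)
    cols = int(half_w / spacing)
    for r in range(-rows, rows + 1):
        offset = (spacing / 2.0) if r % 2 else 0.0
        for c in range(-cols, cols + 1):
            positions.append((center[0] + c * spacing + offset,
                              center[1] + r * row_height))
    return positions
```

The robot's current position is itself one of the generated centers, which realizes the alignment described above.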

Tessellation is performed in line 2 of Algorithm 23, in which getHexGrid() returns the corners of the grid as a position set \(\{(x,y)\}_{\epsilon }\). From this position set, each position whose distance to every pose prediction \(\mathbf {T}_P\) of a searched object is larger than the fcp, or that lies too close to an obstacle on the 2-D environment map, is discarded. [17]

Pruning robot positions on the basis of their overlap with obstacles is done by means of an occupancyGrid[] with cell size s, whose values at positions \((x, y)\) are checked against a threshold \(t_{o}\). This threshold corresponds to the “probability that a grid cell is occupied” [23, p. 224]. The other heuristic, which prunes robot positions because of their distances to predicted object poses \(\mathbf {T}_P\), presumes that robot positions \((x, y)\) and the positions \(\mathbf {p}_P\) inside the predicted poses \(\mathbf {T}_P\) can be compared. Due to their differing dimensionalities, however, this is not directly possible. We therefore make the two comparable by appending the fixed height h of the sensor head to any robot position in question.
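Both pruning heuristics together can be sketched as follows (names are illustrative; the occupancy grid is represented here as a dictionary mapping cell indices to occupancy probabilities, which merely stands in for the occupancyGrid[] of the text):

```python
import math

def prune(positions, predictions, fcp, h, occupancy, cell_size, t_o):
    """Keep a position only if (a) some predicted 3-D object position lies
    within fcp of it, comparing in 3-D by appending the sensor height h,
    and (b) its map cell is likely free (occupancy below t_o)."""
    kept = []
    for (x, y) in positions:
        near = any(math.dist((x, y, h), p) <= fcp for p in predictions)
        cell = occupancy.get((int(x / cell_size), int(y / cell_size)), 0.0)
        if near and cell < t_o:
            kept.append((x, y))
    return kept
```

Positions too far from every prediction, or on likely occupied cells, are discarded before any view is ever rated, which is what shrinks \(\{(x,y)\}_{\epsilon }\) to \(\{(x,y)\}_{r}\).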
Exemplary rectangular areas from which hex grids are extracted are shown in 2 to 5 in Fig. 4.14. Each of these pictures presents results from a different execution of iterationStep() in Algorithm 24; 2 corresponds to the first iteration in an example execution of Algorithm 24. The areas stand out from the rest of the floor by their transparent, violet color. Generally, the floor is colored white where it is navigable and dark grey where it is occupied. The robot positions that result from hex grid tessellation are visualized as small cylinders on the floor, which change their color with each additional execution of iterationStep(). All position sets from any area have in common that they follow the black lines on the floor at a nearly constant distance. These lines represent the borders of the obstacles on the map. The observed gap between samples and obstacles is the result of the aforementioned pruning heuristics. In particular, no samples are extracted below the bottom edges of the pictures, because their distances to the predicted object poses would be too large. From an optimization viewpoint, both heuristics can be considered constraints [4, p. 846] which reduce the position set \(\{(x,y)\}_{\epsilon }\) to its pruned variant \(\{(x,y)\}_{r}\).
Fig. 4.14

Exemplary estimation of a Next-Best-View in order to search missing objects. The predicted poses from which the view in 6 is deduced have already been shown in 3 in Fig. 4.8. Different iterations of our estimation algorithm are visible in temporal order from 2 to 5, while 1 displays the input and 6 the output of our estimation algorithm. Results from the first iteration are shown in 2, whereas 5 presents results from the fourth iteration

A set of camera orientations \(\{\mathbf {q}\}_{S}\), from evenly sampling a unit sphere with a golden angle spiral [5] in getSpiralApprox() at line 2 in Algorithm 24, is passed to iterationStep(). These orientations are combined with each robot position in the set \(\{(x,y)\}_{r}\) to generate view candidates V with the frustum F in getView() in line 13 in Algorithm 23. In 2 to 5 in Fig. 4.14, a camera orientation set, limited to the workspace \([\rho _{\text {min}} ,\rho _{\text {max}}] \times [\tau _{\text {min}} ,\tau _{\text {max}}]\) of the PTU, is visualized on a sphere for a given robot position. [17]
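A common parameterization of such golden angle spiral sampling is sketched below (this stands in for getSpiralApprox(); the exact parameterization used in the thesis may differ):

```python
import math

def golden_spiral(n):
    """Approximately even sampling of n directions on the unit sphere.

    Heights z are spaced evenly, and the azimuth advances by the golden
    angle per sample, so the points wind around the sphere in a spiral
    without clustering.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n          # evenly spaced heights
        r = math.sqrt(max(0.0, 1.0 - z * z))   # radius of the z-slice
        phi = k * golden_angle                 # spiral around the axis
        directions.append((r * math.cos(phi), r * math.sin(phi), z))
    return directions
```

Each returned direction is a unit vector; restricting the result to the PTU workspace, as described above, then amounts to discarding directions outside \([\rho _{\text {min}} ,\rho _{\text {max}}] \times [\tau _{\text {min}} ,\tau _{\text {max}}]\).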

Every candidate view is then combined with all elements of the power set of the searched objects. Next, all combinations \((\{o\}, V)\) of candidate views and object sets \(\{o\} \subset \{o\}_{P}\) that are created in a single iteration are rated in two steps in Algorithm 24. As explained in the previous section, rewards \(r(\{o\}, V)\) should be calculated only from those predicted poses with lines \(\{(\mathbf {T}_\mathbf F , \{\mathbf {n}\})\}\) that lie inside the frustum of view V. In line 14 in Algorithm 23, these poses are extracted from the overall set of poses with lines \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) by means of frustum culling [3]. That the rewards of all combinations cannot be calculated in a single step is due to the normalized utility \(u'(\{o\}, V)\) being used in our reward function. Since its computation involves estimating the highest utility \(u(\{o\}^{*},V^{*})\) among all combinations that result from an iteration step, we first have to estimate non-normalized utilities \(u(\{o\}, V)\) for all of the generated combinations. Non-normalized utilities are derived from those predicted poses \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}_{\{o\}}\) inside the frustum that belong to objects from the set \(\{o\}\) of each combination.

Each execution of iterationStep() estimates the best-rated view \(V_{I}\) at a resolution \(\epsilon \) according to a reward \(r(\{o\}, V)\) once the combination \((\{o\}^{*},V^{*})\) with the highest utility in an iteration step has been identified. It is passed as best-rated view \(V_{A}\) from the last iteration to the next call of iterationStep() in Algorithm 24. In the next iterationStep() execution, a new hex grid whose origin lies at configuration \(\mathbf {C}_A\) is created with doubled resolution \(\epsilon \) and halved area to be tessellated. \(\mathbf {C}_A\) is inferred by inverse kinematics from the best-rated candidate view \(V_{A}\) in the preceding iteration. [17]

The increase of resolutions with each additional iteration step is exemplarily shown in 2 to 5 in Fig. 4.14. In these pictures, increasing resolutions translate into decreasing distances between the small cylinders on the floor that represent sampled robot positions. In each iteration step, the rating of the best combination \((\{o\}_{I}, V_{I})\) is guaranteed to be at least as high as that of the best combination \((\{o\}_{A}, V_{A})\) from the preceding iteration step: \(r(\{o\}_{I}, V_{I}) \ge r(\{o\}_{A}, V_{A})\). Through greedily selecting the best-rated combination in each of its iteration steps, Algorithm 24 reaches local optima in the discrete search space. With respect to global optima, we did not notice throughout our experiments that any unreasonable Next-Best-Views were estimated. All of those experiments were performed with the same fixed set of parameters that we set, inter alia, in the objective function. The parameters were estimated from tests representing pathological cases for the estimation of Next-Best-Views. Iteration is aborted in Algorithm 24 once the ratings \(r(\{o\}_{I}, V_{I})\), \(r_{A} = r(\{o\}_{A}, V_{A})\) of the best combinations from two consecutive iteration steps differ by less than a threshold \(t_{V}\). The more iteration steps are performed, the smaller the differences between the poses of the best views \(V_{I}, V_{A}\) of consecutive iteration steps get. At steep slopes of the objective function, the ratings of such very similar views may still differ by enough that threshold \(t_{V}\) is exceeded. For cases of this sort, we introduced another termination condition that stops iterating once a maximum resolution \(\epsilon _{\infty }\) with which robot positions may be tessellated is reached. The result \((\{o\}_{I},V_{I})\) of the final iteration step is returned by Algorithm 24 as the overall result \((\{o\}_{N},V_{N})\) of our Next-Best-View estimation algorithm.

By decreasing the tessellated area proportionally to increasing the resolution in each step, the size of robot position sets is constant throughout iterating. Uniformly tessellating the whole area of interest for object search with the resolution used in the last iteration would return a lot more positions. [17]

Doubling the resolution with each additional step roughly translates into quadrupling the number of robot positions that are sampled. For the following argument, we presume without loss of generality that Algorithm 24 samples n positions in its first iteration step and that it terminates after i iterations at resolution \(\epsilon _{\infty }\). In the course of an exemplary Next-Best-View estimation under these assumptions, our iterative algorithm samples \(n (i-1)\) robot positions in total. If we instead tried to estimate a Next-Best-View by exhaustive search, i.e. in one iteration at the maximal resolution, we would sample \(m = n 4^{i-1}\) positions. In this case, we would have to sample from the largest rectangular area used in our iterative approach at the maximal resolution \(\epsilon _{\infty }\) instead of the minimal resolution \(\epsilon _{0}\). Because of the inequalities \(n (i-1) \le n (i-1) \log 4 = n \log 4^{i-1} \le n (\log 4^{i-1} + \log n) = n \log (n 4^{i-1}) = n \log m\), Algorithm 24 has time complexity \(\mathcal {O}(\log (m)|\{\mathbf {q}\}_{S}||2^{\{o\}_{P}}|)\), under the assumption that evaluating the reward function comes with constant time consumption. Logarithmic complexity in terms of robot positions can be found in the literature for similar greedy divide-and-conquer approaches.
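The sample-count argument can be checked numerically for concrete values (the numbers below are arbitrary examples; the logarithm is the natural one, for which \(\log 4 \ge 1\) holds as the inequality chain requires):

```python
import math

def sample_counts(n, i):
    """Positions visited by the iterative scheme versus exhaustive search:
    the iterative scheme rates n positions per refinement, i.e. n*(i-1)
    in total, while exhaustive search at the final resolution over the
    initial area would rate m = n * 4**(i-1) positions."""
    return n * (i - 1), n * 4 ** (i - 1)

n, i = 64, 6
iterative, m = sample_counts(n, i)
# The bound from the text: n*(i-1) <= n*log(m)
assert iterative <= n * math.log(m)
```

For n = 64 and i = 6, the iterative scheme rates 320 positions, whereas exhaustive search at the final resolution would rate 65536, illustrating the logarithmic relation.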

Another example for estimating Next-Best-Views with our approach is visualized in Fig. 4.14. 1 in this figure shows the configuration the mobile robot adopts before Next-Best-View estimation starts. In particular, the views both cameras on the sensor head of the robot have adopted are made visible by two transparent red frustums. The directions of both views are depicted as arrows of the same color that stick out of the frustums. Besides the robot configuration, 1 also displays the cloud of object poses our estimation algorithm receives from Object Pose Prediction. Every predicted pose is made visible by a scaled-down three-dimensional model of the object to which the pose belongs. A set of lines of sight, each visualized by a yellow arrow, is attached to every prediction. 2 to 5 in Fig. 4.14, representing executions of Algorithm 23, each show the view \(V_{I}\) that belongs to the best combination \((\{o\}_{I},V_{I})\) identified during the respective iteration step. A view \(V_{I}\) is made visible by an arrow indicating its direction. This arrow sticks out of a sphere of the same color. On the surface of this sphere, we see the camera orientations \(\{\mathbf {q}\}_{S}\) from spiral sampling, visualized by small squares from which small arrows stick out. A perpendicular blue line connects the sphere—and therefore all camera orientations that it comprises—with a sampled robot position. This is to express how Algorithm 23 combines any robot position with all available camera orientations. Besides, it shows which position the view \(V_{I}\) corresponds to. The length of the blue line segment increases across the pictures so that spheres from different iteration steps do not overlap. In contrast, the arrows stay at the same height. 6 in Fig. 4.14 shows the pair of views the stereo camera on the sensor head would adopt if the mobile robot moved to the configuration \(\mathbf {C}_{N}\) that corresponds to the Next-Best-View \(V_{N}\).
It should be noted that the orientations of the cameras differ from the displayed directions of the visualized views. The frustums of these target views are colored turquoise so that they cannot be mixed up with the current views of the sensor head. It should also be noted that the predicted object poses are located at the center of the union of both viewing frustums and that the directions of the depicted views are similar to a number of the lines of sight inside the frustums.

4.6.3 Invalidation of Lines of Sight in Clouds of Predicted Poses

Besides estimating such views, the second functionality offered by the ASR robot architecture component “NBV estimation” is removing lines of sight \(\mathbf {n}\) from a given cloud of predicted poses with lines \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\). This functionality is needed during two stages of ASR. First, removing superfluous lines of sight is necessary in Algorithm 22 every time a new cloud of hypothetical object poses is generated in the component OPP. Second, it is also required each time a Next-Best-View that has been driven to did not provide any new object model to be returned by the state OBJECT_DETECTION. In this case, removal of lines of sight is used in the state UPDATE_POSE_PREDICTIONS. Without loss of generality, we deal with the second case in the following. A view \(V_{U}\) shall be given. As explained in Sect. 4.1, this is either the view in which the robot has just failed to detect objects or the view returned by the state NEXT_BEST_VIEW_CALC. The set of poses with lines \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) that have led to the Next-Best-View \(V_{N}\) (on which OBJECT_DETECTION failed) is also available. We further assume without loss of generality that the view \(V_{U}\) is identical to the Next-Best-View \(V_{N}\) and that the set of objects \(\{o\}_{N}\) has been searched in it. It is important to note that failure to find these objects from the viewpoint \(V_{N}\) does not suffice to prove that the objects are not present at any predicted pose \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\} \subset \{(\mathbf {T}_{P}, \{\mathbf {n}\})\}\) inside the frustum \(\mathbf {F}\) of this view. What is certain, however, is that the objects are not visible from the perspective specified by view \(V_{N}\). Thus, all lines of sight \(\{\mathbf {n}\}\) of poses \(\mathbf {T}_\mathbf {F}\) inside the frustum that point towards a camera adopting view \(V_{U}\) no longer correspond to perspectives favorable for detecting objects.
The cloud \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) must therefore be corrected in Algorithm 25 by removing these lines so that the next traversal of NEXT_BEST_VIEW_CALC is guaranteed to return a different view as its estimate. This is done in two steps.
First, we identify those poses with lines \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}\) at which the searched objects \(\{o\}_{N}\) should have been detectable according to the utility measures \(u_a(V_U, \mathbf {T}_\mathbf {F})\) and \(u_d(V_U, \mathbf {T}_\mathbf {F})\). This is again done by frustum culling. For each pose \(\mathbf {T}_\mathbf {F}\) inside the frustum, we then go through all lines of sight \(\{\mathbf {n}\}\) that are assigned to this pose and have not yet been removed. Any line \(\mathbf {n}\) is removed that, according to the utility measure \(u_n(V_U, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\), would have supported object detection from view \(V_U\). Support is heuristically identified by comparing the measure \(u_n(V_U, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) to a threshold \(\alpha \). Only those lines \(\{\mathbf {n}\}_r\) remain in the thinned-out cloud, output by Algorithm 25, that do not point towards the camera to such an extent that the detectability of a missing object at pose \(\mathbf {T}_\mathbf {F}\) could have been deduced from them.
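The two steps above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the frustum test is simplified to a view cone with the near and far clipping planes from footnote 18, the utility \(u_n\) is approximated by the cosine between a line of sight and the direction from the predicted pose to the camera, and all names are illustrative.

```python
import numpy as np

def invalidate_lines_of_sight(view_pos, view_dir, poses, alpha=0.5,
                              fov_cos=np.cos(np.radians(30.0)),
                              ncp=0.4, fcp=1.5):
    """Sketch of the line-of-sight invalidation: drop every line of sight
    that pointed towards the camera of a view in which detection failed.

    view_pos: camera position (3,); view_dir: unit viewing direction (3,).
    poses: list of (position (3,), list of unit sight vectors) tuples.
    Returns the thinned-out cloud in the same format.
    """
    thinned = []
    for p, sights in poses:
        to_pose = p - view_pos
        dist = np.linalg.norm(to_pose)
        # Step 1: simplified frustum culling (view cone between ncp and fcp).
        inside = (ncp <= dist <= fcp
                  and np.dot(to_pose / dist, view_dir) >= fov_cos)
        if not inside:
            thinned.append((p, sights))  # outside the frustum: keep all lines
            continue
        # Step 2: remove each line n whose approximated utility u_n, here the
        # cosine between n and the pose-to-camera direction, exceeds alpha.
        to_cam = -to_pose / dist
        remaining = [n for n in sights if np.dot(n, to_cam) <= alpha]
        thinned.append((p, remaining))
    return thinned
```

In this simplification, a line of sight survives only if it is misaligned enough with the pose-to-camera direction that the failed detection attempt says nothing about it.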
Fig. 4.15

Invalidation of lines of sight located at predicted object poses inside the frustum of a view. This invalidation takes place in the state UPDATE_POSE_PREDICTIONS. 1 shows exemplary input to Algorithm 25. 2 displays the lines of sight that remain after this algorithm has been applied to 1

The effect of Algorithm 25 can be studied in Fig. 4.15, for example. 1 in this figure contains lines at its lower left corner that are removed while executing this algorithm. In 2, which displays the results of Algorithm 25, these lines are no longer visible. The removed lines were both located inside the turquoise frustum of view \(V_U\) and oriented towards both cameras of the depicted mobile robot.

4.6.4 Discussion

In this chapter, we defined finding an optimal combination \((\{o\}_{N},V_{N})\) of a view \(V_{N}\) and the objects \(\{o\}_{N}\) to be searched in it as a combinatorial optimization problem. As already discussed, exhaustive search across the entire search space is not a viable option in view of its size. As an alternative, we presented a heuristic approximation algorithm in Sect. 4.6.2 that stays as close as possible to the exhaustive approach. In Sect. 4.6.2, we presumed that our technique converges towards a value within the partly continuous parameter space \(2^{\{o\}_{P}} \times \{V\}\) of the reward function. This shall be shown in the following. For the first iteration step of Algorithm 24, we assume that a maximum distance \(d_{1}\) is given between any current robot position \((x_{C},y_{C})\) and the position \((x_{1},y_{1})\) of the best-rated view's configuration. Due to the continuity of the floor map, there exists an iteration \(n_0\) of that algorithm from which on the distances between the "best-rated" positions \((x_{n},y_{n})\), \((x_{n-1},y_{n-1})\) of consecutive iteration steps decrease according to \(d_{n} = ||(x_{n},y_{n}) - (x_{n-1},y_{n-1}) ||\le \frac{1}{2^{n-1}} d_{1}\). Hence, \(d_{n} \rightarrow 0\) for \(n \rightarrow \infty \). While the positions of the configurations to which hex grid generation is aligned change less and less, the same camera orientations and object sets are considered by default in every iteration step. In total, the portion of the search space considered per iteration step changes less and less with every additional iteration. The same holds true for the best-rated combinations \((\{o\}_{I}, V_{I})\) within these search space portions. The fact that the combination \((\{o\}_{N},V_{N})\) towards which Algorithm 24 converges is a local optimum inside the discretized search space has already been discussed in Sect. 4.6.2.
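The geometric bound \(d_{n} \le \frac{1}{2^{n-1}} d_{1}\) does more than drive the step sizes to zero; it also guarantees that the best-rated positions themselves converge, since the total remaining displacement is dominated by a geometric series:

\[\sum _{n=n_0}^{\infty } d_{n} \;\le \; d_{1} \sum _{n=n_0}^{\infty } \frac{1}{2^{n-1}} \;=\; \frac{d_{1}}{2^{\,n_0-2}} \;<\; \infty .\]

Hence, \((x_{n},y_{n})\) is a Cauchy sequence on the bounded floor map and possesses a limit position.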
Graphically, the Next-Best-View estimation algorithm loses the ability to converge to the globally optimal combination as soon as that combination no longer lies within the diminishing rectangular area from which the algorithm extracts hex grids.
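The contracting behavior described above can be illustrated by a heavily simplified sketch. A square grid stands in for the hex grid (as in footnote 21, we ignore hex specifics), a generic `rate` callable stands in for the reward function, and camera orientations and object sets are abstracted away; all names are illustrative, not from the thesis implementation.

```python
import itertools
import numpy as np

def contracting_grid_search(rate, start, width, steps=20):
    """Schematic sketch of a contracting grid search: in every iteration,
    a grid of candidate positions is aligned to the best-rated position
    so far, the sampled area is halved, and the best-rated candidate
    becomes the next alignment point.

    rate:  callable mapping a 2-D position to a scalar rating.
    start: initial alignment position (x, y).
    width: initial edge length of the sampled area.
    """
    best = np.asarray(start, dtype=float)
    for _ in range(steps):
        # Sample a 5 x 5 grid of positions centered on the current best.
        offsets = np.linspace(-width / 2, width / 2, 5)
        candidates = [best + np.array([dx, dy])
                      for dx, dy in itertools.product(offsets, offsets)]
        best = max(candidates, key=rate)
        # Halving the sampled area bounds consecutive displacements by
        # d_n <= d_1 / 2^(n-1), which forces convergence -- but also means
        # a global optimum outside the shrinking area can no longer be reached.
        width /= 2
    return best
```

Running this with a rating function peaked at some position shows the alignment points settling onto a limit, mirroring the convergence argument; a peak outside the initial area illustrates the local-optimum limitation.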

The starting point for developing this algorithm was the existing body of contributions to the field of Next-Best-View estimation. When we look at exemplary contributions from recent years for prominent application scenarios, such as three-dimensional Object Reconstruction [24] or Robot Exploration [19], we notice that they focus on developing suitable objective functions while putting less emphasis on how to efficiently select the candidate views that shall be rated. In these contributions, exhaustive search is performed on straightforwardly discretized spaces. Indeed, the aforementioned application scenarios place low demands on the time consumption of Next-Best-View estimation. In contrast, seminal work [26] in three-dimensional Object Search, which is time-sensitive, has put significantly more emphasis on efficiently selecting candidate views. However, that approach optimizes different dimensions of its parameter space separately, e.g. robot positions and camera orientations, with the limitations in comparison to our approach that have already been mentioned. Our approach is meant to stress the importance of search space design and optimization algorithm development in the context of Next-Best-View estimation. Increased efficiency makes it possible to take advantage of more realistic objective functions while still accessing as large a portion of their parameter space as possible. This enables us to estimate valuable Next-Best-Views at low time consumption, e.g. appropriate for repeatedly estimating Next-Best-Views in an object search loop.


  1. During indirect search, we only look for objects that are missing from the available scene models. Objects from scene categories for which no partial scene model has been estimated yet are looked for during direct search.

  2. We used the library SMACH [2] for implementing ASR.

  3. We look at Fig. 4.3 in landscape format.

  4. Since, in relation to learning, the contributions in this thesis are restricted to scene category models, such a simplification makes sense.

  5. We designate a scene model of a scene category as incomplete when the model does not contain all objects that belong to the category.

  6. The pose of the robot is composed of the pose of its mobile base and the orientation of its PTU.

  7. Please keep in mind that "NBV estimation" selects both the objects that shall be searched from a viewpoint and the viewpoint itself.

  8. Without loss of generality, all variations are supposed to be visible from the same set of robot head poses.

  9. We present all operations as performed on the view of the left camera.

  10. The confidence in detecting objects from a given viewpoint depends on the absolute number of predictions within the viewing frustum that this viewpoint induces.

  11. Please bear in mind that scene model instance and scene category instance are synonyms.

  12. This kind of visualization is further detailed in Sect. 4.6.

  13. The subset \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\) is a multiset [10, p. 29].

  14. The color of a cloud changes from red to green with increasing confidence of the represented instance.

  15. Instances with low, average and high ratings are displayed as red, yellow and green clouds.

  16. This picture displays a dark green frustum of a view at position \(\mathbf {p}\) that contains two predicted poses for a spherical grey object. The same lines of sight, visualized as dark red arrows, are assigned to both predictions.

  17. For the sake of simplicity, we only consider the left camera on the sensor head of the abstract robot from Sect. 4.4 in this and the following two sections.

  18. Throughout the experiments we conducted for this thesis, we set frustum parameters as follows: \(\text {fov}_x = {30}^{\circ }\), \(\text {fov}_y = {20}^{\circ }\), ncp \(= 0.4\) m, fcp \(= 1.5\) m.

  19. The dimensionality of the parameter space of our objective function is \(|\{o\}_{P}| + 6\).

  20. As earlier in this thesis, we limit the description of our contributions to design decisions that reduce problem complexity, leaving out canonical algorithm optimization strategies such as parallelization.

  21. Here, we ignore the specificities of hex grids.


  1. Aumann-Cleres, F.: Markerbasiertes Kalibrieren der kinematischen Kette und Aufstellen der Rückwärtstransformation zwischen der Basis und dem Sensorkopf eines mobilen Roboters. Bachelor's thesis, Advisor: P. Meißner, Reviewer: R. Dillmann, Karlsruhe Institute of Technology (2016)
  2. Bohren, J., Cousins, S.: The SMACH high-level executive. IEEE Robot. Autom. Mag. (2013)
  3. Bourke, P.: Frustum culling (2000). Accessed 01 Dec 2017
  4. Bronshtein, I., Semendyayev, K., Musiol, G., Muehlig, H.: Handbook of Mathematics, 5th edn. Springer, Berlin (2007)
  5. Devert, A.: Spreading points on a disc and on a sphere. Marmakoide's Blog (2012). Accessed 14 Nov 2017
  6. Dillmann, R., Huck, M.: Informationsverarbeitung in der Robotik. Springer, Berlin (1991)
  7. Eidenberger, R., Grundmann, T., Schneider, M., Feiten, W., Fiegert, M., Wichert, G.V., Lawitzky, G.: Scene analysis for service robots. In: Towards Service Robots for Everyday Environments, pp. 181–213. Springer, Berlin (2012)
  8. Gamma, E., Johnson, R., Vlissides, J., Helm, R.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Boston (1995)
  9. Garvey, T.D.: Perceptual strategies for purposive vision. Technical Note 117, SRI International (1976)
  10. Hein, J.L.: Discrete Mathematics, 2nd edn. Jones and Bartlett Publishers, Inc., Burlington (2002)
  11. Isard, M., Blake, A.: Condensation–conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
  12. Karrenbauer, O.: Realisierung und komparative Analyse von alternativen Methoden zum uninformierten Generieren optimaler Folgen von Ansichten für die 3D-Objektsuche. Bachelor's thesis, Advisor: P. Meißner, Reviewer: R. Dillmann, Karlsruhe Institute of Technology (2017)
  13. Kunze, L., Doreswamy, K.K., Hawes, N.: Using qualitative spatial relations for indirect object search. In: Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 163–168. IEEE (2014)
  14. Lehmann, A., Leibe, B., Van Gool, L.: Fast PRISM: branch and bound Hough transform for object class detection. Int. J. Comput. Vis. 94(2), 175–197 (2011)
  15. Lorbach, M., Hofer, S., Brock, O.: Prior-assisted propagation of spatial information for object search. In: Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), pp. 2904–2909. IEEE (2014)
  16. Meißner, P., Reckling, R., Wittenbeck, V., Schmidt-Rohr, S., Dillmann, R.: Active scene recognition for programming by demonstration using next-best-view estimates from hierarchical implicit shape models. In: Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 5585–5591. IEEE (2014)
  17. Meißner, P., Schleicher, R., Hutmacher, R., Schmidt-Rohr, S., Dillmann, R.: Scene recognition for mobile robots by relational object search using next-best-view estimates from hierarchical implicit shape models. In: Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 137–144. IEEE (2016)
  18. Patel, A.: Hexagonal grids (2013 & 2015). Accessed 11 Nov 2017
  19. Potthast, C., Sukhatme, G.S.: A probabilistic framework for next best view estimation in a cluttered environment. J. Vis. Commun. Image Represent. 25(1), 148–164 (2014)
  20. Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R., Ng, A.Y.: ROS: an open-source Robot Operating System. In: ICRA Workshop on Open Source Software, Kobe, p. 5 (2009)
  21. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd international edn. Prentice Hall Press, Upper Saddle River (2010)
  22. Siciliano, B., Khatib, O.: Springer Handbook of Robotics. Springer Science + Business Media, Berlin (2008)
  23. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, Cambridge (2005)
  24. Vasquez-Gomez, J.I., Sucar, L.E., Murrieta-Cid, R.: View planning for 3D object reconstruction with a mobile manipulator robot. In: Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), pp. 4227–4233. IEEE (2014)
  25. Wixson, L.E., Ballard, D.H.: Using intermediate objects to improve the efficiency of visual search. Int. J. Comput. Vis. 12(2–3), 209–230 (1994)
  26. Ye, Y., Tsotsos, J.K.: Sensor planning for 3d object search. Comput. Vis. Image Underst. 73(2), 145–168 (1999)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. IAR-IPR, Karlsruhe Institute of Technology, Karlsruhe, Germany
