Indoor Scene Recognition by 3D Object Search, pp. 125-176
Active Scene Recognition
Abstract
This chapter provides a detailed technical presentation of our contributions related to Active Scene Recognition, including our approaches to Object Pose Prediction and Next-Best-View estimation.
4.1 Concept Overview
Let’s assume that MILD is located at an arbitrary, collision-free position in the environment, such as in Fig. 1.2. Given a 2D floor plan, MILD is expected to decide upon the presence of n different scene categories such as “Setting—Ready for Breakfast” with the help of ASR. When ASR is started, MILD and in particular its localization and navigation system are initialized during the execution of the states OBJECT_SEARCH_INIT and GLOBAL_ROBOT_LOCALIZATION. The latter delivers a pose estimate for the mobile base of MILD in relation to a fixed global frame. This base pose is updated throughout ASR. Since no object models, i.e. object localization results, are available when ASR starts, the next task modeled in the state machine consists of acquiring a first set of object models. For this purpose, we developed two strategies for estimating and traveling to interesting camera views. Both are subsumed in the subsystem DIRECT_SEARCH of the state diagram in Fig. 4.1. The first strategy is an informed search based on knowledge about potential object poses that comes from past localization results, i.e. experience. Such experience can, for example, consist of object poses observed during demonstrations of scene categories. In case the informed search fails, a less efficient, uninformed search is started that is based on [26] and evenly explores the entire environment. Both strategies are explained in more detail in [12]. Once any object model has been acquired—for example because two forks of the place setting in Fig. 1.2 have been localized—direct search stops.
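The fallback logic of DIRECT_SEARCH described above can be sketched in a few lines of Python. The callback names `informed_search` and `uninformed_search` are hypothetical placeholders for the two strategies, not identifiers from the actual implementation:

```python
def direct_search(informed_search, uninformed_search):
    """Sketch of the DIRECT_SEARCH subsystem: try the informed,
    experience-based strategy first; only if it yields no object model,
    fall back to the uninformed strategy that evenly explores the
    environment. Both callbacks are assumed to return a (possibly
    empty) list of object models."""
    models = informed_search()
    if models:
        return models
    return uninformed_search()
```

Direct search stops as soon as either strategy delivers at least one object model.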
According to the authors, indirect search outperforms direct search of a target object. This is achieved by limiting the search space of the target through the intermediate object. In the decision-making system in Fig. 4.1, it is the subsystem INDIRECT_SEARCH that implements the actual core of our ASR approach, following the indirect search paradigm. With techniques for predicting the poses of searched objects and for estimating optimal camera views that cover those predictions, it contains two major contributions of this thesis. Besides those techniques, Passive Scene Recognition is integrated, too. During indirect search, these three main functionalities run in a loop. We remain in indirect search as long as views promising for searching objects can be deduced from the already available object models. Otherwise, we return to direct search. The first state reached during indirect search is SCENE_RECOGNITION. There, PSR with Implicit Shape Model Trees is performed on the object models that have been acquired up to this point. The outcome is a set of scene models with presumably differing confidences. The confidence of each scene model in particular depends on how many objects of the underlying scene category have already been found. In case every object of each scene category has been found, ASR ends at this point. Otherwise, we assume for each searched object that it is suitable for at least one of the available scene models in which it is missing. Put simply, we aim at completing available, partial scene models with searched objects.^{1} In analogy to [25], we reduce the space in which we search missing objects to those locations that would allow such completions, in order to speed up searching. If the robot MILD had just perceived the pair of forks on the right in Fig. 1.1, it would, for example, have deduced a partial scene model for “Setting—Ready for Breakfast”.
In order to complete that scene model, the remaining silverware has to be located further to the left.

An indirect search [9] is a search that operates in a loop, repeatedly locating a new instance of the intermediate object and looking “near” it for the target object until either the target object is found or no more instances of the intermediate object can be found. A direct search, on the other hand, is one that does not attempt to use spatial relationships to predict likely locations and simply searches the entire search space systematically. [25]
In order to estimate such locations, we developed an algorithm that predicts the potential poses of missing objects with the help of spatial relations. This object pose prediction technique employs the same relations from Implicit Shape Model Trees that we introduced for performing PSR. The advantages of taking spatial relations as a basis for object search have been discussed in Sect. 1.1.3. While these abstract observations are concretized in this section, the actual prediction algorithm is defined in Sect. 4.5.1. To begin, we present our basic idea for predicting poses with a single ISM as scene category model. In order to calculate a cloud of hypothetical poses for any object that is missing in a given scene model, we aggregate the pose of the reference of this scene model—phenomenologically its emplacement—with an appropriate spatial relation. We select the relation within the star topology of the ISM that connects the reference with the missing object. We say that we invert relations during prediction, as we estimate potential poses of objects departing from the reference. This is contrary to scene recognition, during which we estimate hypothetical poses of the reference departing from localized objects. When using ISM trees instead of a single ISM as a scene category model, a single relation does not necessarily connect the searched object directly to the reference of this hierarchical model. In this case, a path of spatial relations has to be identified within the tree-shaped scene representation that transitively connects both before the actual pose prediction can take place. Estimating possible poses for the missing object by means of such a sequence of relations is prone to combinatorial explosion, as the relative poses within each relation have to be combined along the path. This demands an efficient algorithmic solution.
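The combinatorial explosion along a path of relations can be illustrated with a minimal sketch in which poses are simplified to 3D translations (the actual algorithm operates on full 6DoF transforms); every combination of relative offsets along the path yields one prediction, so the cloud grows multiplicatively with path length:

```python
from itertools import product

def predict_poses(reference_pose, relation_path):
    """Sketch of pose prediction along a path of spatial relations.
    reference_pose is an (x, y, z) translation; each relation on the
    path is a list of relative offsets learnt from demonstrations.
    Composing every combination of offsets along the path produces
    the cloud of hypothetical poses."""
    predictions = []
    for combo in product(*relation_path):
        pose = reference_pose
        for offset in combo:
            pose = tuple(p + o for p, o in zip(pose, offset))
        predictions.append(pose)
    return predictions
```

A path of two relations with two and one learnt offsets respectively already yields 2 x 1 predictions; with k offsets per relation and a path of length l, the cloud has k^l elements.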
The next problem arises from the question of which of the partial scene models—made available by the latest execution of Passive Scene Recognition—to use for predicting the poses of missing objects. An exemplary approach could consist of solely selecting the partial scene model with the highest confidence. However, proceeding in this way would limit the subsequent object search as follows: In every camera view traveled to, we would ignore all present objects that complete scene models other than the currently selected one. If competing scene models predicted different locations for the same searched objects, all but one model would be ignored, too. The other models would only be considered once the best-rated partial scene model changes. Let’s assume that the scene category “Setting—Ready for Breakfast” expects two knives on the right of the plate, while “Setting—Clear the Table” expects one knife on top of the plate. In case “Setting—Ready for Breakfast” is rated better than “Setting—Clear the Table” and the knife actually lies on the plate, several possible camera views—all pointing to the area on the right of the plate—would be driven to until ASR switches scene models. Processing all available partial scene models at once is no viable alternative, as their number can be considerable, especially of those with a low rating. As an efficient middle ground, we developed a method that resamples a representative subset from all partial scene models which is significantly smaller and statistically biased towards models with high confidences. This scene model sampling is based on the “Select” step of the Condensation Algorithm [11], an approach familiar from tracking objects with particle filters [23, p. 77], and is introduced in Sect. 4.5.2. Object Pose Prediction is then performed for each of the scene models within the subset. Together with the pose prediction algorithm, scene model sampling yields the state OBJECT_POSE_PREDICTION in Fig. 4.1.
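The “Select” step underlying scene model sampling amounts to weighted resampling with replacement. A minimal Python sketch, assuming the recognition confidences act directly as resampling weights:

```python
import random

def sample_scene_models(models, confidences, subset_size, rng=None):
    """Resample a fixed-size subset of partial scene models with
    replacement, statistically biased towards models with high
    confidence, in the spirit of the "Select" step of the Condensation
    algorithm."""
    rng = rng or random.Random()
    return rng.choices(models, weights=confidences, k=subset_size)
```

With confidences acting as weights, highly rated models tend to appear multiple times in the subset, while low-rated ones are likely to be dropped, keeping the number of subsequent pose prediction runs bounded.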
In order to assess how suitable a camera view is for searching some objects, we developed an optimization problem for selecting camera views in three-dimensional object search that emphasizes realism. The optimization algorithm that operates on this problem receives the cloud of object pose predictions and the current configuration of the robot. It seeks a reachable view that maximizes an objective function modeling the confidence of success in object localization as well as the costs for reaching the view and performing localization. Optimization problem and algorithm are discussed in detail in Sects. 4.6.1 and 4.6.2. Optimization only considers the next view for searching objects instead of looking several steps ahead in order to find an optimal sequence of views. Empirical results in [7] support such a greedy approach, as “in object search, frequent measurements affect available search policies so much that constant replanning is necessary” [16]. Only optimizing the next view to be reached is a common approach in view planning and is known as the Next-Best-View (NBV) problem. Our NBV estimation algorithm not only optimizes the pose of an NBV but simultaneously decides which objects are worth being searched inside the view. Assume a possible view for MILD is analyzed by NBV estimation: If there is a high probability of finding two knives in it but a low probability of also finding a spoon, NBV estimation limits time consumption in object localization by deciding that only knives are to be searched in that view. It is this optimized set of objects for which 6DoF object localization is performed in OBJECT_DETECTION.
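A hedged sketch of such a greedy NBV selection follows; the objective with the trade-off parameter `alpha`, the dictionary keys, and the confidence threshold are illustrative assumptions, not the thesis’ actual formulation:

```python
def rate_view(confidences, travel_cost, localization_cost, alpha=0.7):
    """Rate a candidate view: utility from the detection confidences of
    the predicted poses it covers, minus the weighted cost of reaching
    the view and running localization there."""
    return alpha * sum(confidences) - (1 - alpha) * (travel_cost + localization_cost)

def next_best_view(candidates, alpha=0.7):
    """Greedy NBV selection over candidate views, each given as a dict
    with the hypothetical keys 'confidences', 'travel' and 'loc'."""
    return max(candidates,
               key=lambda v: rate_view(v["confidences"], v["travel"], v["loc"], alpha))

def objects_to_search(confidence_per_object, threshold=0.5):
    """Decide which objects are worth localizing in the chosen view:
    only those whose estimated detection confidence meets a threshold."""
    return [o for o, c in confidence_per_object.items() if c >= threshold]
```

Splitting the decision into view rating and object selection mirrors the text: the optimizer not only picks a pose but also prunes the set of objects to be localized there.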
How informed search proceeds depends on whether the state OBJECT_DETECTION manages to detect any of the objects that are to be searched in the most recent NBV. In case object localization returns a model of a previously missing object, the subsystem RELATION_BASED_SEARCH is left and the state SCENE_RECOGNITION in Fig. 4.1 is re-entered. In that state, all object models that have been generated since ASR was started are passed to PSR so that an updated set of scene models can be estimated. The new set replaces all scene models remaining from the previously estimated set. Only partial models from the new set are going to be processed during the following run of scene model sampling. However, if no new object model could be estimated once the goal configuration had been reached—e.g. because clutter occluded the searched objects from that viewpoint—ASR remains in the RELATION_BASED_SEARCH loop. The reason for this is that in practice, an individual camera view is only capable of covering a small subset of the entire cloud of object pose predictions produced by the last run of Object Pose Prediction. The rest of that cloud, which also contains valid pose hypotheses, has to be successively covered by additional, optimized camera views until the entire cloud is used up. In order to make sure that the next execution of NEXT_BEST_VIEW_CALC does not return a view identical to that from the current execution, the cloud of pose predictions has to reflect that 6DoF object localization has just failed. This means that any information that supported Next-Best-View estimation in choosing the most recent estimate has to be removed from the cloud. In the state UPDATE_POSE_PREDICTIONS in Fig. 4.2, this is done, simply put, by invalidating any pose prediction that is visible from the viewpoint MILD took when it reached its goal configuration.
In particular, this means that poses are only invalidated if they lie within the viewing area of the camera at that goal configuration. A more accurate description of this process is given in Sect. 4.6.3.
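The invalidation step can be sketched as a simple filter over the prediction cloud; `in_view` stands for an arbitrary visibility test such as a frustum check, and the names are hypothetical:

```python
def invalidate_covered(predictions, camera_view, in_view):
    """UPDATE_POSE_PREDICTIONS sketch: drop every predicted pose that
    lies inside the viewing area of the camera at the reached goal
    configuration, so the next NEXT_BEST_VIEW_CALC run cannot select
    the same view again."""
    return [p for p in predictions if not in_view(camera_view, p)]
```

Only the surviving predictions are handed to the next NBV estimation, which guarantees progress through the cloud even when localization keeps failing.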
Since this update has to be performed once OBJECT_DETECTION has failed, it must be based on the camera view that MILD has actually reached and not on the view that Next-Best-View estimation had returned beforehand. In consequence, UPDATE_POSE_PREDICTIONS might fail just like OBJECT_DETECTION if there is a sufficient difference between the estimated pose and the one actually reached. For example, this is the case if a sufficient number of predicted poses drops out of the viewing area of the robot because of deviations between the reached and the estimated view. If this occurs, the state machine restarts the update procedure, but this time on the basis of the estimated view instead. No matter whether one or two updates are applied to the cloud, RELATION_BASED_SEARCH starts over in the state NEXT_BEST_VIEW_CALC. It is not guaranteed that the missing objects are detected by informed search, even after several loop runs in RELATION_BASED_SEARCH. Given the place setting in Fig. 1.2, such a scenario may occur when parts of it are distributed across several tables even though Object Pose Prediction expected them all to be located on one table. Thus, informed search should be aborted as soon as it becomes obvious that the currently available pose predictions do not cover the actual poses of the searched objects. To this end, several termination conditions have been integrated into NEXT_BEST_VIEW_CALC. One of them, for instance, is fulfilled once the confidence that one or more objects can be found in any estimated camera view falls below a given threshold. If any of those conditions is met, RELATION_BASED_SEARCH is left and the currently employed cloud of predictions is deleted.
At this point, the state machine has to figure out whether scene model sampling has already used up all partial scene models for predicting object poses. If so, indirect search has to be left, since no hypotheses remain that have not yet been considered for searching missing objects. Instead, direct search is performed again until any of the searched objects can be found. However, if any unused partial scene model remains from the most recent execution of PSR, the state machine re-enters OBJECT_POSE_PREDICTION in Fig. 4.1 and another iteration of indirect search starts.
4.2 Robot Software Architecture for Active Scene Recognition
After presenting the logic behind ASR in the preceding section, the next unsolved issue is how to organize all the functionalities that ASR relies on. It is obvious that a modular approach is to be followed due to the complexity of the decision-making system that we just discussed. As stated in Sect. 1.2, ASR is a contribution to the robot operating system ROS [20]. Bearing this in mind, we designed ASR as part of a distributed and deliberative robot architecture [6, p. 245], which Sect. 1.3 refers to as one of the contributions of this thesis. Such an architecture in particular matches the notion of robots as rational agents [21, p. 4], as presumed by our ASR concept. For example, the latter is illustrated by the fact that we model ASR as a hierarchical state machine^{2} that performs view planning. We implemented that kind of architecture by distributing both the logic of ASR and the functionalities it requires across dedicated, equal-ranking and self-contained software components organized in a network.
We begin at the upper right component,^{3} the “Offline Knowledge Base” (OKB), the content of which never changes while ASR is executed. The contrary holds true for the component “Online World Model” (OWM) at the bottom right of Fig. 4.3, which stores information ASR acquires until it is flushed, e.g. because ASR is restarted. In consequence, the OKB is already complete when ASR begins at the state OBJECT_SEARCH_INIT in Fig. 4.1, while the OWM is initially empty. Initial input for the OWM is generated in the following states GLOBAL_ROBOT_LOCALIZATION and DIRECT_SEARCH: for example, an initial estimate of the pose of the mobile base of the robot MILD from GLOBAL_ROBOT_LOCALIZATION and an initial set of object models from DIRECT_SEARCH. The most crucial among all the data present in the OKB is related to the n scene categories whose presence ASR is supposed to evaluate. In the OKB, scene categories are represented in two different manners: in terms of the absolute poses of all objects in each category—all recorded during demonstrations and required for direct search based on prior knowledge—as well as in terms of scene category models, i.e. ISM trees, that are learnt from demonstrations. Besides knowledge about scene categories that originates from sensory recordings and subsequent learning, the OKB also makes data available that is, without loss of generality, predefined by the operator of the robot.^{4} Predefined data, for example, includes a 2D floor plan, which is required by the navigation framework of the mobile base, and 6DoF poses of the furniture that is present in our experimental setup. Another location where knowledge from sensory recordings is stored is the aforementioned OWM. For instance, it contains object models whose parameters are estimated with the help of 6DoF object localization while ASR is executed. As stated in Sect. 1.1.2, each object model includes a measured object pose.
Taken together, the poses of furniture and localized objects make it possible to visualize the environment knowledge of the robot by means of geometry models at any point during the course of ASR. The three-dimensional geometry models required for this purpose are stored in the OKB, along with classifiers for every object of each scene category for which ASR shall deliver scene models. Assuming that a localization algorithm is assigned to each object, a classifier corresponds to the internal representation of an object that the localization algorithm needs in order to recognize it.
The next component is “Passive Scene Recognition” (PSR), which is visible to the left of the OKB in Fig. 4.3 and mainly consists of a wrapper for our approach to recognizing scenes based on ISM trees—a major contribution of this thesis. Its interaction with the decision-making system during the course of ASR can be summarized in the following steps: At first, PSR asks the state machine to provide it with all object models that are currently stored in the OWM. When indirect search runs through its loop for the first time, such models can only originate from direct search. Considering all objects that have been recognized up to this point, PSR asks the state machine for the ISM tree of every scene category that contains at least one of those objects. In order to get access to these scene category models, the state machine has to contact the OKB. For its part, the OKB contains a dictionary pointing from any object to those scene categories that contain it. Once PSR has received both object and scene category models, it runs the scene recognition algorithm it wraps. This algorithm proceeds as shown in Fig. 3.1. As acceptance threshold, it uses \(\epsilon = 0\) to make sure that every scene model is returned that can be calculated from the provided object models, and not, e.g., only the best-rated one per scene category. Among all scene models that are estimated in this step, the incomplete ones^{5} are stored in a buffer. This buffer is shared by PSR and Object Pose Prediction, which is another component.
Besides, the scene model with the highest confidence per scene category is transmitted to the “Online World Model”. Together with the object models from object localization and a pose estimate for the mobile robot MILD,^{6} the OWM publishes all time-variant information ASR is supposed to make available to other systems, e.g. Programming by Demonstration of manipulation skills. It is to be noted that, in terms of scene models, the content of the OWM is flushed every time before new scene models are transmitted to the OWM. Each time scene recognition with ISM trees is performed, recognition re-estimates scene models for all categories, including those for which no previously missing objects have been newly detected during the last execution of object localization. For that reason, scene models are replaced in the OWM instead of being accumulated, as is the case for object models.
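The two update policies of the OWM described above can be sketched as follows; the class and method names are hypothetical, not taken from the actual implementation:

```python
class OnlineWorldModel:
    """Minimal sketch of the OWM update policy: object models
    accumulate over the course of ASR, while scene models are flushed
    and replaced on every new transmission."""

    def __init__(self):
        self.object_models = []
        self.scene_models = []

    def add_object_models(self, models):
        # Object models accumulate: new localization results are appended.
        self.object_models.extend(models)

    def set_scene_models(self, models):
        # Scene models are replaced: the previous set is flushed first.
        self.scene_models = list(models)
```

The asymmetry reflects that PSR re-estimates all scene models on every run, whereas object localization only ever adds new evidence.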
The component “Object Pose Prediction” (OPP) is visualized in a joint box with PSR in Fig. 4.3, even though it is a component of its own. PSR and OPP are separated by a dashed line that refers to the buffer for partial scene models into which PSR writes and from which OPP reads. This buffer contains a potentially large number of incomplete scene models, each corresponding to a considerable amount of data that would otherwise have to be transferred through the state machine. The component OPP is a wrapper for two techniques: Scene Model Sampling, which is introduced in Sect. 4.5.2, and the actual pose prediction algorithm, which is defined in Sect. 4.5.1. First, Scene Model Sampling reduces the content of the buffer to a fixed-size set of scene models whose elements are mainly assigned high confidences. Each element of this set is then individually transmitted to the pose prediction algorithm. Given a scene model, the prediction algorithm in turn has to ask the state machine for the very ISM tree that led to this recognition result. Since scene recognition and pose prediction rely on spatial relations from the same ISM tree, the state machine has to provide it to both. Once poses have been predicted for all sampled scene models, they are merged into a joint cloud of poses that is sent to the state machine.
The next component that we discuss, “Next-Best-View (NBV) Estimation”, is located on the center left of Fig. 4.3. Like OPP, it is a wrapper for two techniques, which are presented in detail in Sects. 4.6.1 to 4.6.3. The state machine provides it with both the cloud of object poses that OPP just calculated and the current configuration of the robot MILD. From that data, the first technique estimates a camera view for successfully and efficiently detecting missing objects. Being the view that MILD is supposed to go to next, it is returned to the state machine. The second technique in this component updates clouds of object poses like those delivered by OPP. This step has already been discussed in the previous section in the context of the state UPDATE_POSE_PREDICTIONS of the decision-making system of ASR. It is used if the just-estimated view does not lead to any new estimate of the whereabouts of any missing object.
By converting the camera view that is to be reached into a goal configuration, the state machine can send travel commands to the “Actuators” of MILD. Such a configuration is made up of the pose of the base of MILD as well as of the orientation of its PTU. It is estimated by the inverse kinematics for MILD as mentioned in Sect. 1.2. Inverse kinematics is part of a component “Robot Model” that is omitted in Fig. 4.3 for the sake of clarity. In contrast, the actuators of MILD are depicted on the lower left of Fig. 4.3. As soon as a goal configuration is available, travel commands are issued to the affected actuators. In particular, the mobile base employs the collisionfree navigation framework, as named in Sect. 4.3, in order to drive to the goal configuration.
Once MILD reaches the goal configuration—as best it can—the decision-making system triggers the component “6DoF Object Localization”, which is in charge of detecting missing objects together with the component “Sensors”, both at the bottom of Fig. 4.3. The detection task starts with the state machine asking the OKB for classifiers for every object that “NBV estimation” suggested to search in its latest result.^{7} These classifiers are then passed to the Object Localization component. This component has four localization algorithms at its disposal, all of which are mentioned in Sect. 1.2. For any provided classifier, the component launches the localization algorithms that go with it. Through the Sensors component, each algorithm can access the visual sensors whose data it needs. Each selected algorithm is executed for a predefined time span. Once a span has expired, the localization component generates an object model from each result the respective algorithm has returned. As soon as all algorithms are done and all results have been converted into object models, the latter are transmitted to the OWM through the state machine. There, the newly generated object models are added to the already present ones.
4.3 Data Acquisition from Demonstrations of Scene Variations
Acquisition of scene category models through Programming by Demonstration (PbD) takes place in two steps, as is usual for this principle. At first, demonstrated variations of a scene are recorded by sensors. Then, a scene classifier is learnt from that data. The learning of scene category models has been introduced for a single ISM as classifier in Sects. 3.4.1 and 3.4.2, while a learning algorithm for the more general ISM tree representation has been presented in Sect. 3.6. While demonstration recording has already been dealt with in relation to its outcomes in Sect. 3.3, this section shall provide an overview of the recording procedure in practice. A demonstration involves the following elements:

- A robot head that is equipped with visual sensors to record the demonstration. It is either mounted in our sensor dome or on the mobile robot MILD, both visible in Fig. 1.8.
- A human as the demonstrator, positioning objects and guiding the robot (head) throughout the demonstration.
- A set of objects o that makes up the configurations that are to be recorded as n variations \(\mathbf {V}(\mathbf {S},1) \dots \mathbf {V}(\mathbf {S},n)\) of the scene category \(\mathbf {S}\) that is to be demonstrated.
- A recording system that interacts both with the demonstrator and the sensors on the robot head. It performs 6DoF object localization in real time—according to Sect. 1.2—and saves all estimates \(\mathbf {E}(o)\) about the demonstrated states of the objects as their trajectories \(\mathbf {J}(o)\).
- The name of the demonstrated scene category.
- The names of the objects that are elements of this scene category.
- A set of m different 6DoF poses \(\mathbf {T}_{H}(1) \dots \mathbf {T}_{H}(m)\) for the robot head, from which the head is supposed to record different portions of the demonstrated scene variations.^{8}
In the following, the demonstration procedure is defined for a robot head that is mounted on the mobile robot MILD. In Fig. 4.4, this process is visualized as a state diagram in which the demonstrator triggers state transitions. Technically, demonstrations in the sensor dome are just a simplified special case of demonstrations on the robot. When using the robot MILD, the demonstrator has to guide its Pan-Tilt Unit (PTU) as well as its base. In the sensor dome, only the PTU has to be controlled.
 1.
In DEMONSTRATION_INIT and GLOBAL_ROBOT_LOCALIZATION, the demonstrator communicates the name of the scene category to the robot and then triggers an initial estimation of the pose of the robot in relation to a fixed world frame. This pose is tracked throughout the demonstration so that all state estimates about objects can be jointly represented in a common frame.
For all variations \(\mathbf {V}(\mathbf {S},i)\) with \(i = 1 \dots n\):
For all robot head poses \(\mathbf {T}_{H}(j)\) with \(j = 1 \dots m\):
 2.
The demonstrator sets up the object configuration according to variation \(\mathbf {V}(\mathbf {S},i)\) of scene category \(\mathbf {S}\).
 3.
The demonstrator guides the robot head to the goal pose \(\mathbf {T}_{H}(j)\) in MOVE_ROBOT_TO_VIEW.
 4.
Once the goal pose is reached, the demonstrator communicates the names of the objects to the recording system for which object localization is to be started. This happens in OBJECT_DETECTION_START.
 5.
Object state estimates \(\mathbf {E}(o)\) are recorded in OBJECT_LOCATION_RECORDING for the given object names until localization is stopped in OBJECT_DETECTION_STOP.
 6.
Once all desired object state estimates have been recorded and saved, trajectories \(\mathbf {J}(o)\) for all objects o in the demonstrated scene category are transmitted to scene classifier learning.
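The nested recording procedure above can be sketched as follows; `move_to` and `localize` are hypothetical callbacks standing in for guiding the robot head and for real-time 6DoF object localization:

```python
def record_demonstration(variations, head_poses, move_to, localize):
    """Sketch of the recording loop: for every demonstrated variation
    V(S, i) and every robot head pose T_H(j), object state estimates
    E(o) are recorded and appended to per-object trajectories J(o)."""
    trajectories = {}
    for variation in variations:        # V(S, 1) ... V(S, n)
        for pose in head_poses:         # T_H(1) ... T_H(m)
            move_to(pose)
            for obj, estimate in localize(variation, pose):
                trajectories.setdefault(obj, []).append(estimate)
    return trajectories                 # handed to scene classifier learning
```

With n variations and m head poses, each object trajectory can accumulate up to n x m state estimates.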
4.4 Object-Search-Related Definitions
Up to this point, the functionalities ASR relies on have mainly been introduced as black boxes, with a focus on their interfaces within a robot architecture. The components in which most of the contributions of this thesis are located are PSR, OPP and “NBV estimation”. While our contributions in relation to PSR have already been explained in depth in Chap. 3, a detailed description of the algorithms that ensure the functioning of the other two components follows in this chapter. Beforehand, some preliminary definitions have to be established. Their purpose is to derive an abstract robot from the concrete mobile robot MILD, whose actuators and sensors we informally introduced in the preceding section. This abstract robot, visualized in Fig. 4.5, corresponds to the minimum requirements ASR imposes on a robot upon which it is to run. Our ASR concept can thus be seamlessly migrated to any concrete robot that is compliant with these requirements.
We define a robot configuration \(\mathbf {C} = (x, y, \theta , \rho , \tau )\) based on the values \(\rho \) (pan) and \(\tau \) (tilt) of the two degrees of freedom of the PTU and on the robot pose \((x, y, \theta )\). The robot pose is estimated relative to a fixed world frame and against a 2D map. The workspace of the PTU is limited to \([\rho _{\text {min}} ,\rho _{\text {max}}] \times [\tau _{\text {min}} ,\tau _{\text {max}}]\). The poses of the left (\(\mathbf {T}_L\)) and right (\(\mathbf {T}_R\)) cameras in the stereo rig are given in a world frame in 6DoF.^{9} The viewing area of a camera is defined by a frustum \(\mathbf {F} = (\text {fov}_x, \text {fov}_y, \text {ncp}, \text {fcp})\). A frustum is a four-sided truncated pyramid that is described by the horizontal and vertical fields of view \(\text {fov}_x\) and \(\text {fov}_y\) of the camera as well as the near and far clipping planes \(\text {ncp}\) and \(\text {fcp}\), which set the minimal and maximal distances from which an object can be localized. A view \(V(\mathbf {T})\) is the viewing area of a camera, transformed into a world frame according to the pose \(\mathbf {T}\) of the camera. Each camera view has a direction \(\mathbf {x}(V)\), given by the optical axis. [17]
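The containment test implied by this frustum definition can be sketched as follows, for a point already expressed in the camera frame with z along the optical axis; treating fov_x and fov_y as full opening angles in radians is an assumption of this sketch:

```python
import math

def in_frustum(point_cam, fov_x, fov_y, ncp, fcp):
    """Test whether a point lies inside the frustum
    F = (fov_x, fov_y, ncp, fcp). point_cam = (x, y, z) is given in
    the camera frame; z runs along the optical axis."""
    x, y, z = point_cam
    # Depth must lie between the near and far clipping planes.
    if not (ncp <= z <= fcp):
        return False
    # Lateral angles must stay within half the opening angles.
    if abs(math.atan2(x, z)) > fov_x / 2:
        return False
    if abs(math.atan2(y, z)) > fov_y / 2:
        return False
    return True
```

Such a test is the basic building block both for rating candidate views and for invalidating pose predictions covered by an already visited view.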
In order to transform between camera views and robot configurations, a kinematic model of the robot in use is required. Since the two parameters \(\rho , \theta \) of our robot configurations are redundant by definition, multiple solutions to the inverse kinematics exist. The inverse kinematics developed for the mobile robot MILD in [1] selects the solution, or in other terms the robot configuration, that MILD can reach fastest from its current configuration.
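A minimal sketch of resolving this redundancy: among the candidate configurations returned by the inverse kinematics, the one with the lowest weighted joint-space distance to the current configuration is chosen. The weights are illustrative, not those of the actual inverse kinematics in [1]:

```python
def select_configuration(current, solutions, weights=(1.0, 1.0, 0.5, 0.2, 0.2)):
    """Pick the inverse kinematics solution that is cheapest to reach
    from the current configuration C = (x, y, theta, rho, tau). The
    per-component weights model that turning the PTU is cheaper than
    rotating or moving the whole base."""
    def cost(config):
        return sum(w * abs(a - b) for w, a, b in zip(weights, current, config))
    return min(solutions, key=cost)
```

Under these weights, a solution that only pans the camera wins over one that rotates the entire base to reach the same view.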
4.5 Prediction of Object Poses with Trees of Implicit Shape Models
4.5.1 Object Pose Prediction Algorithm
The set of missing objects \(\{o_{P}\}\), for each of which we want to estimate pose predictions, can be formally defined as the relative complement of the input objects \(\{i\}_{\mathbf {S}}\) of the partial instance \(\mathbf {I}_{\mathbf {S}}\) in the set \(\{o\}\) of objects a scene category contains. With regard to the ISM tree based on which predictions are generated, missing objects can be present at its leaves \(\{o\}_{L}\) but never at internal vertices. It has to be noted that leaves may occur at multiple levels in a tree, e.g. at levels 0 and 1 in the tree in Fig. 4.7 that models the scene category “Setting—Ready for Breakfast”. While this figure illustrates how pose prediction operates on an ISM tree, Fig. 3.12 showed how scene recognition operates on the same tree. Contrary to scene recognition, pose prediction does not start at the highest level of an ISM tree but at the root ISM \(m_{R}\). Since the connections between pairs \(m, m'\) of ISMs at neighboring tree levels continue to regulate the flow of data through the tree, they determine an order in which the ISMs can be evaluated, no matter whether we do scene recognition or pose prediction. However, during pose prediction, data flows through the tree in the direction opposite to that during scene recognition. More precisely, a sequence of ISMs has to be successively evaluated for every object \(o_{P}\) whose poses shall be predicted. Such sequences, which we designate as paths, start at the root ISM \(m_{R}\) and end at one of those ISMs in the tree in which the missing objects participate in spatial relations.
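Identifying such a path can be sketched as a depth-first search from the root ISM down to the leaf holding the missing object; the dictionary representation of the tree is a hypothetical simplification of the ISM tree structure:

```python
def relation_path(tree, root, target):
    """Find the path of vertices that transitively connects the root
    ISM with the vertex at which a missing object participates in a
    spatial relation. tree maps each vertex to its children; returns
    None if the target is not reachable."""
    if root == target:
        return [root]
    for child in tree.get(root, []):
        sub = relation_path(tree, child, target)
        if sub:
            return [root] + sub
    return None
```

Each edge of the returned path corresponds to one spatial relation whose relative poses have to be combined during prediction.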
Given the reference pose \(\mathbf {T}_{F}\) of a scene category instance, the ISMs in a tree employ the inverses \(\mathbf {T}_{Fo}\) of the transforms used for voting on reference poses to calculate object pose predictions \(\mathbf {T} \leftarrow \mathbf {T}_{F} \cdot \mathbf {T}_{Fo}\). [17]
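This transform composition can be illustrated with a minimal sketch. For brevity it uses 2D homogeneous transforms instead of full 6-DoF poses, and all numeric values are illustrative:

```python
import math

def mat_mul(a, b):
    """Multiply two 3x3 homogeneous 2D transforms (row-major lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transform(angle, tx, ty):
    """Build a homogeneous 2D transform from a rotation angle and a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

# Reference pose T_F of the scene instance and the inverted relation
# transform T_Fo (reference -> missing object), both illustrative values.
T_F = transform(math.pi / 2, 1.0, 0.0)
T_Fo = transform(0.0, 0.5, 0.0)

# Predicted object pose: T = T_F * T_Fo
T = mat_mul(T_F, T_Fo)
```

Because the reference is rotated by 90 degrees, the relative offset (0.5, 0) of the missing object is mapped onto the y-axis of the world frame.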
The reversal of spatial relations within an ISM is visualized in Fig. 4.7 by replacing the upward arrows from Fig. 3.12 with downward ones. In both variants, the arrows express connections between objects in an ISM. No matter whether each ISM on a path contributed a single vote or multiple votes in Algorithm 18, each ISM simply returns all cast votes without any intermediary accumulation in a voxel grid, as was the case during scene recognition. This is because pose prediction not only reverses spatial relations but, more generally, the overall approach of recognition with an ISM, which consists of integrating information from different sources, in our case objects, into one sink, i.e. the reference pose of the ISM. Pose prediction instead distributes information from a single source across multiple sinks within each ISM. Since there is only one possibility to connect a non-reference object in an ISM \(m_{k}\) at level k with the reference of an ISM \(m_{k+1}\) at the next higher level, there is no need to transmit state estimates \(\mathbf {E}(o_{F})\) of scene reference objects \(o_{F}\) with class and identifier labels as we did in Sect. 3.5.2. Instead, just poses \(\mathbf {T}_{P}\) are transferred, as visualized by green arrows in Fig. 4.7. As long as they are not returned by the end point of a path, these poses represent hypotheses about the emplacement of internal vertices instead of real objects. In summary, predicting a pose with Algorithm 18 amounts to applying a sequence of transforms to the pose \(\mathbf {T}_{F}\). Since every transform is subject to noise and systematic errors, both accumulate with the length of the path used. Ultimately, this leads to the difference in variances we observed in 2 and 4 in Fig. 4.6. In consequence, minimizing path lengths is not only in the interest of pose prediction runtime but also of pose prediction accuracy.
However, Algorithm 18 is not the whole pose prediction algorithm, as it just calculates a single prediction, while the component “NBV estimation”, following OPP, expects an entire cloud of predictions for a multitude of objects. In order to meet this requirement, we introduce Algorithm 19, which returns such a cloud \(\{\mathbf {T}_{P}(o)\}\) and is our actual pose prediction algorithm. By using poses \(\mathbf {T}_{Fo}\) of an object o relative to the reference \(o_{F}\), we reverse the principle used in scene recognition. [16]
Given that the computational effort of executing Algorithm 18 primarily consists of calculating coordinate transforms, and that Algorithm 19 can be considered as repeatedly running Algorithm 18, the overall effort of the pose prediction algorithm can be equated with the number of coordinate transforms it has to carry out. Since, by definition, Algorithm 18 performs l transforms and is repeated \(n_{P}\) times in Algorithm 19, the overall number of transforms for predicting the poses of an object whose path has length l is \(l \cdot n_{P}\). Thus, the time complexity \(\mathcal {O}(l \cdot n_{P})\) of predicting hypothetical poses of an object is linear in the length l of the employed path. We reach this low complexity because we only cast a single vote per ISM during any traversal of a path. By contrast, during scene recognition, every ISM casts all votes that can be generated for each state estimate provided to it. No matter whether an ISM is used for scene recognition or Object Pose Prediction, the number of votes an object casts is equal to the number a of table entries the ISM can provide for that object.
Let’s assume that pose prediction cast all possible votes in every ISM along a path instead of one: In ISM \(m_{k}\) at tree level k, we would generate a number \(d(m_{k})\) of hypotheses about possible poses of the reference of ISM \(m_{k+1}\) on the next higher level of the tree. Since ISM \(m_{k+1}\) is directly connected to ISM \(m_{k}\), each of these pose hypotheses would trigger a separate voting process in ISM \(m_{k+1}\). Each of these voting processes would in turn cause \(a\) predictions about the emplacement of the reference of the next ISM \(m_{k+2}\) on the path, if any exists. Thus, the total number of votes cast in ISM \(m_{k+1}\) can be recursively defined as \(d(m_{k+1}) = a \cdot d(m_{k})\). The further away from the root an ISM is located in its tree, the more votes would be cast. The ISM \(m_{l+1}\) at the end of the path would be the one in which the most votes are cast. This number of votes can be explicitly defined as \(d(m_{l+1}) = a^{l+1}\). More generally, the number of votes that can be cast within an ISM increases exponentially with the distance of the ISM from the root. The overall number of votes along a path would add up to \(\sum _{k = 1}^{l+1} a^{k}\), so that the time complexity of such an alternative approach would be \(\mathcal {O}(a^{l})\). In contrast to the initially introduced technique, this approach is intractable. It should nevertheless be noted that all \(a^{l+1}\) predictions of ISM \(m_{l+1}\) would be valid hypotheses about the poses of a missing object. Moreover, all of them can be non-redundant combinations of the table entries from the different ISMs along a path. The entirety of these combinations can either be created systematically, as described above, or by random sampling [23, p. 67]. Random sampling, for instance, could randomly select a table entry in each ISM along a path and then aggregate these entries with pose \(\mathbf {T}_{F}\) so as to obtain a predicted object pose.
This randomized approach could be repeated until all \(a^{l+1}\) combinations of votes have also been created. However, it would be more efficient to stop once a sufficient number \(n_{P}\) of poses has been estimated. Thereby, we would obtain a subset of all possible predictions that is still representative thanks to random sampling. This is exactly what we do in the presented pose prediction algorithm, since we randomly select a table entry in each ISM we traverse on a path. The corresponding command can be found in Algorithm 18. In summary, the algorithm we introduce efficiently generates a representative set of pose predictions.
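The randomized traversal described above can be sketched as follows. Transforms are simplified to 2D translations and the table layout is an assumption, so this only mirrors the structure of Algorithms 18 and 19, not their actual implementation:

```python
import random

def predict_pose(t_f, path):
    """One traversal in the spirit of Algorithm 18: pick one random table
    entry per ISM on the path and chain the transforms, starting from the
    reference pose t_f. Transforms are simplified to (dx, dy) translations."""
    x, y = t_f
    for table in path:  # one vote table per ISM on the path
        dx, dy = random.choice(table)
        x, y = x + dx, y + dy
    return (x, y)

def predict_cloud(t_f, path, n_p):
    """Algorithm-19-style repetition: n_p independent traversals yield a
    representative subset of all possible vote combinations."""
    return [predict_pose(t_f, path) for _ in range(n_p)]
```

Each call to `predict_pose` performs one transform per ISM, so generating the cloud costs \(l \cdot n_{P}\) transforms, matching the stated complexity.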
4.5.2 Sampling of Scene Models
In 1 in Fig. 4.8, the two missing objects of “Cereals—Setting” are a milk and a cereals box, while a water bottle and a carton of vitamin juice, both participating in “Drinks—Setting”, are missing in 2 in this figure. While both pairs of objects are moved from left to right and vice versa in a similar manner during the demonstrations of their respective scene categories, the spatial distributions of the depicted coordinate frames differ considerably. This can be explained by the relation topology that each scene category model represents. While both missing objects in 1 are connected to an immobile plate by relations, they are not directly related to each other. In 2, by contrast, they are. Here, the coordinate frames primarily express that the transforms between both missing objects remained almost the same during the entire demonstration of “Drinks—Setting”. While this fact also holds for the objects in 1, it becomes especially visible in 2 because of another difference: In contrast to 1, in which the ISM reference has already been found, it is still missing in 2. Due to the star topology modeled by ISMs, every table entry in an ISM is defined in relation to this reference. Put differently, all votes that are cast during scene recognition point towards the reference, while all votes from pose prediction have the reference as their starting point. If we visualized the votes in 2 that are cast from the cup to the water bottle, we would notice a line, produced by the coordinate frames, very similar to that in 1. Thus, information about the movements of the objects in 2 is available to us as well, but in terms of a set of scene category instances that are arranged along a line. In 2, we only picked one of those instances to predict object poses. If we had predicted poses with all available instances, we would have obtained a line of coordinate frames similar to that in 1. In “Cereals—Setting”, the reference is the immobile plate.
Thus, all votes from scene recognition for this category fall into the same bucket, generating a single instance. In summary, information about hypothetical object poses is distributed across several instances in 2, while it is concentrated on a single instance in 1. This example shows that in ASR, relevant information about the poses of missing objects is not only distributed across the different scene categories but also within them.
Which instances we use for predicting poses determines the efficiency and the effectiveness of object search. Besides the current configuration of the employed robot, the pose estimates for missing objects are the major source of information for “NBV estimation” when deducing camera views for object search. This issue can be illustrated with the help of the laboratory setup that is shown at the beginning of the introduction to this thesis. In this setup, a table and a shelf are located at opposite ends of a room. We further assume that the instance \(\mathbf {I}_{\mathbf {S}}\) induces predictions on the table, while another instance \(\mathbf {I}_{\mathbf {S}'}\) produces such predictions on the shelf. If the robot stood close to the table under these circumstances, using instance \(\mathbf {I}_{\mathbf {S}'}\) for predicting poses would provoke the robot to needlessly travel to the shelf. This would in turn increase the time consumed during object search. In this example, deciding which instance to use requires knowledge about the emplacement of the pose predictions each considered instance provides. Thus, the predictions from both instances would have to be calculated in advance. The question then is: Why not just pass the clouds from both instances at once to “NBV estimation” so that it decides whose predictions undergo object detection first? From a technical viewpoint, there is no obstacle to such an approach, as clouds \(\{\mathbf {T}_{P}(o)\}, \{\mathbf {T}'_{P}(o)\}\) from different instances \(\mathbf {I}_{\mathbf {S}}, \mathbf {I}_{\mathbf {S}'}\) can be trivially fused. Providing clouds from multiple instances to “NBV estimation” at once (instead of processing them one after another) also makes it possible to simultaneously search objects from different scene categories in the same view. For example, this option is useful when predicted poses from different instances are close to each other, as in 1 and 2 in Fig. 4.8.
If only a single instance were processed at a time, the objects of each category would have to be searched in separate loop runs of the indirect search, no matter whether their predictions were included in the same camera view. Since it is not even guaranteed that the objects of both categories would be searched in consecutive loop runs, it cannot be ruled out that the robot would be advised to drive to other locations before eventually returning to search those objects it could have detected in the first place. How processing several instances at once improves efficiency can be seen in 3 and 4 in Fig. 4.8, for example. 3 shows pose predictions, this time not as coordinate frames but as bluish, three-dimensional models of the objects whose possible poses they represent. Among them are predictions for every object that is missing in 1 and 2. All predictions are located within the turquoise frustum^{12} of a camera view which “NBV estimation” deduced from them. Object localization, executed right after this view has been estimated, returns models for all objects that are missing in 1 and 2. In 4, the four resulting object models are depicted in addition to the models of those objects that had previously been recognized. In summary, scene model sampling is expected to tell not only which instance(s) to pass to the prediction algorithm but also how many.
A straightforward answer would be to just pass all partial instances \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) from PSR to the pose prediction algorithm in order to obtain the most comprehensive fused cloud of poses. In the following, we designate the set \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) as the basic set. Proceeding in that manner would have at least two flaws. The first is that the predictions of an exemplary pair of instances \(\mathbf {I}_{\mathbf {S}}, \mathbf {I}_{\mathbf {S}'}\) would have an equal influence on “NBV estimation”, no matter how much their respective confidences \(b(\mathbf {I}_{\mathbf {S}}) \ll b(\mathbf {I}_{\mathbf {S}'})\) differ. Since instances with low confidence tend to be more frequent in the basic set returned by PSR, “NBV estimation” would often provide camera views that cover poses from such instances rather than poses from instances with high ratings. Since low confidences are equivalent to low plausibility that estimated scene category instances correspond to actually existing scenes, the same applies to hypothetical object poses deduced from such instances. One approach to prevent ASR from being influenced too much by hardly plausible pose predictions is to ensure that instances with high confidences make up the majority of the instances passed to the pose prediction algorithm. The second flaw is related to the size of the basic set. This size corresponds to the number of scene categories PSR is looking for, multiplied by the number of instances the ISM tree returns for each scene category. This second flaw in particular can be considerable, as we set the acceptance threshold \(\epsilon _{R} = 0\) to the lowest possible value in order to minimize false negatives. The result is that scene recognition produces the maximum number of scene category instances. Given any ISM tree, scene recognition can return up to one instance per bucket contained in the accumulator of its root ISM.
The risk of false negatives is particularly high when ASR begins. In the first loop runs of indirect search, even promising scene category instances still have low confidences, since only a limited number of objects is usually available up to this point. Failing to predict object poses from such instances because of too restrictive acceptance thresholds in PSR would make it impossible for ASR to ever improve the confidences of these instances by detecting the objects that are still missing in them. However, larger sizes of the basic set \(\{\mathbf {I}_{\{\mathbf {S}\}}\}\) translate into greater time consumption in the components OPP and “NBV estimation”, as the input of both increases proportionally. In consequence, limiting the number of instances in the basic set is decisive for maintaining the performance of indirect search in ASR.
From the upper half of Fig. 4.9, which displays this preselection, we see that the largest number of instances \(n_{n}\) which we can select equally from every scene category is the smallest number \(|\{\mathbf {I}_{\mathbf {S}}\}|\) of instances \(\{\mathbf {I}_{\mathbf {S}}\}\) any category \(\mathbf {S}\) contributes to the basic set. In Fig. 4.9, this minimum is determined by the scene category “Dishwasher—Basket Filled”. Since this category contributes just a single instance, only the best-rated instance per scene category is transferred to the preselection \(\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}\). This preselection is visible in the upper right corner. Each transfer of an instance is visualized by a light green arrow. When comparing both sets of instances in the upper half of Fig. 4.9, we see that preselection halves the share of instances whose confidences are rated low or average.^{15} Thus, the desired effect of increasing the share of instances with high ratings is achieved. In our experiments, the actual numbers of instances of all categories largely exceeded those in this figure. Therefore, the total size of the preselection is considerably greater than the size \(n_{r}\) we request for the target set \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\), meaning that further instances have to be left out.
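The preselection step can be sketched as follows. The data layout (a mapping from category names to confidence-rated instances) is an assumption for illustration:

```python
def preselect(basic_set):
    """First resampling step (sketch): keep the n_n best-rated instances
    per scene category, where n_n is the smallest number of instances any
    category contributes. basic_set maps a category name to a list of
    (confidence, instance) pairs."""
    n_n = min(len(instances) for instances in basic_set.values())
    preselection = {}
    for category, instances in basic_set.items():
        # sort by confidence, descending, and keep the n_n best-rated ones
        ranked = sorted(instances, key=lambda ci: ci[0], reverse=True)
        preselection[category] = ranked[:n_n]
    return preselection
```

If one category contributes only a single instance, as in the example from Fig. 4.9, only the best-rated instance of every other category survives.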
Reducing the number of instances beyond the first step of Algorithm 20, from \(|\{\mathbf {I}_{\{\mathbf {S}\}}^{*}\}| \ge |[\mathbf {I}_{\{\mathbf {S}\}}^{r}]|\) down to the target size, at a certain point implies selecting between instances of different scene categories. The starting point for developing the second step of our resampling procedure was to reinterpret scene recognition with ISM trees from the perspective of state estimation [23, p. 9]. Since scene recognition, as we define it, deduces not directly observable scene estimates from measured object detection results, the underlying problem fulfills the authors’ criteria for a state estimation problem. However, our approach to scene recognition does not return parametrical distributions, which many techniques from state estimation require. Instead, the returned sets of instances rather correspond to punctual estimates of scenes that possibly exist in the real world. In this sense, a basic set of instances is more similar to a set of random samples from a probability distribution [23, p. 3] over the space of possible instances. A prominent technique that is specifically designed for processing such sets of samples is the particle filter [23, p. 80]. In this context, the eponymous particles are synonymous with random samples. However, [14] rightly notes that results from ISMs cannot be considered as values of a sound random variable. Though probabilistic modeling of our approach should thus be avoided, there is an undeniable similarity between our basic sets of instances with their confidences and sets of particles with their importance factors [23, p. 79]. If the basic and the target set were made up of particles ruled by probability distributions f and g, the resampling step [23, p. 79] could be used to transform between both sets. This transformation would be based on drawing samples with replacement from the basic set.
The probability of drawing an instance from the basic set would not be uniform but proportional to its confidence. This would ensure that instances with higher confidences are more frequent in the target set than those with low confidences. Nevertheless, the latter would not be completely excluded.
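This confidence-proportional resampling step can be sketched with Python's `random.choices`, which draws with replacement according to the given weights; the data layout is again an assumption:

```python
import random

def resample(instances, n_r, rng=random):
    """Second step (sketch): draw n_r instances with replacement, with
    probability proportional to their confidence, analogous to
    particle-filter resampling. instances is a list of
    (confidence, instance) pairs; the result is the target set."""
    weights = [confidence for confidence, _ in instances]
    return rng.choices(instances, weights=weights, k=n_r)
```

Instances with high confidence dominate the target set, while low-confidence instances are rarely, but not never, drawn.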
It should be kept in mind that all instances which have not been taken over by our resampling procedure can still be used for predicting object poses later. This can best be explained by means of Algorithm 21, which represents the overall functionality the component OPP provides. The first operation of this algorithm is to acquire exclusive access to the buffer into which PSR writes the partial scene category instances it has recognized. The copy of this buffer’s content generated next by Algorithm 21 corresponds to the aforementioned basic set. In the following, this copy is reduced to the target set by our resampling procedure. Only the instances within the target set are then removed from the buffer, with duplicates being ignored. However, all duplicates are processed when all instances in the target set are successively passed to the pose prediction algorithm in the next step. \(|\{\mathbf {V}(\mathbf {S},t)\}|\) predictions are calculated per scene category that is present in the target set. This number corresponds to the length of the demonstration for scene category \(\mathbf {S}\). All individual clouds of possible object poses that are created in this manner are fused into the final result of Object Pose Prediction.
4.5.3 Discussion
Research into scene recognition and into object search is typically conducted independently of one another. As a side effect, little work is dedicated to algorithms and data structures that could link both. Object Pose Prediction by means of spatial relations is precisely such a link. Existing approaches to pose prediction can be classified according to the type of model they choose for representing spatial relations. For two exemplary types of models, i.e. metric and symbolic representations, we present one scientific contribution to pose prediction each. In this presentation, we focus on the impact relation models have on prediction performance. Both approaches we discuss have in common that they use just a single spatial relation in order to predict possible poses of an object. Predicting poses by evaluating entire chains of spatial relations—as we do in our randomized pose prediction algorithm—is outside the scope of these contributions.
In [15], five different sources of prior knowledge—i.e. knowledge from the web, about scenes and about the physical world in general—are fused in a Probabilistic Graphical Model. This model is then used to infer locations of a target object in terms of symbolic relations to other objects. While our approach is limited to spatial relations as its single source of information, it provides metric knowledge about the poses of missing objects rather than symbols. Symbolic information, like that provided by [15], first has to be converted into metric knowledge before it can be used for searching objects by view planning. Thus, symbolic approaches have to take a detour from the subsymbolic to the symbolic level and back, which, among other things, entails a loss of information. By sticking to subsymbolic representations throughout our ASR approach, we simply bypass these additional calculations.
In contrast, [13] employs a metric model of spatial relations that is based on parametrical distributions. It combines predefined knowledge about locations of objects with Qualitative Spatial Relations of two different types in order to estimate metric hypotheses about object positions. This contribution does not estimate full 6-DoF poses because modeling poses with parametrical distributions is problematic. As already mentioned, knowledge about possible orientations of searched objects is nevertheless essential for setting up realistic planning models. Another problem with this contribution is that it does not model to what extent the applicability of spatial relations to an already detected object depends on the environment in which this object is located—or, more precisely, on the scenes that are present. In Sect. 4.1, we introduced a pair of scene categories “Setting—Ready for Breakfast” and “Setting—Clear the Table”, whose respective spatial relations required the same knife to lie either on top of a plate or right next to it. This small example already illustrates that the location of a missing object may vary considerably, depending on the scene category it belongs to.
In the preceding section, we argued that a particle filter is not applicable in ASR in its entirety because probabilistic modeling of our approach to scene recognition is troublesome. Apart from this, there are at least two other reasons why we only adopted its resampling step. The first is that the particle filter is a recursive state estimator [23, p. 9]. As such, it successively updates existing state estimates through new measurements. In contrast, ISM trees estimate scene instances from scratch any time scene recognition is requested. Therefore, data fusion for updating would rather have to take place on the level of the measurements and not on that of the estimates. Put more concretely, we could rather define a filter for object models than for scene category instances. On the level of object models, we could, for example, fuse all 6-DoF poses object localization returns for an object in the course of ASR. As inappropriate as the measurement update step [23, p. 39] of particle filters seems, its prediction step [23, p. 39] is no better suited. Since the prediction step is mainly supposed to model the dynamics of how a state changes, it would be trivial if scene category instances were the states, because we required all searched objects in Sect. 1.2 to remain immobile during ASR. This second issue, related to prediction, confirms once more that the analogy between ASR and particle filtering does not extend beyond the resampling step.
4.6 Estimation of Next-Best-Views from Predicted Object Poses
4.6.1 Objective Function for the Rating of Camera Views
With the help of such functionality, the state NEXT_BEST_VIEW_CALC estimates a Next-Best-View \(V_N\) together with the optimal set of objects \(\{o\}_N\) to be searched in view \(V_N\). Combinations \((\{o\}_N,V_N)\) of object set and NBV are returned by Algorithm 24 in Sect. 4.6.2, which receives the current view \(V_C\) of the mobile robot, the current robot configuration \(\mathbf {C}_C\) and predicted poses with lines \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) as input. [17]
Algorithm 22 is a method that is only called after direct transitions between the states OBJECT_POSE_PREDICTION and NEXT_BEST_VIEW_CALC in the ASR state machine. Both the necessity to transform a cloud \(\{\mathbf {T}_P\}\) of pure pose predictions into poses with lines \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) and to prefilter with the used views \(\{V_{U}\}\) from the current ASR execution can be attributed to the fact that Object Pose Prediction does not extend the already available cloud \(\{\mathbf {T}_P\}\) of predictions with every additional traversal of OBJECT_POSE_PREDICTION. Instead, all existing predictions from previously considered scene category instances \(\{\mathbf {I}_{\mathbf {S}}\}\) are replaced by new predictions that are inferred from other instances \(\{\mathbf {I}'_{\mathbf {S}}\}\). Proceeding in that manner makes sense as OBJECT_POSE_PREDICTION is only entered when the existing cloud of predictions is obsolete. This is the case either when one or several additional objects have been localized in the state OBJECT_DETECTION or when no further promising camera views to be driven to can be deduced from the cloud in NEXT_BEST_VIEW_CALC. However, it is then possible that a new prediction of hypothetical object poses from the instances \(\{\mathbf {I}'_{\mathbf {S}}\}\) generates predictions that would have been covered by one of the already used views \(V_{U}\) if these predictions had been estimated earlier. It is therefore necessary to prefilter any new cloud of pose predictions in order to prevent NEXT_BEST_VIEW_CALC from re-estimating camera views as Next-Best-Views for searching a set \(\{o\}\) of objects that have already been looked for in a nearly identical view beforehand. Predicted poses \(\mathbf {T}_P\) alone are insufficient to reliably estimate views for object detection, since objects o are not equally well detectable from every perspective.
Instead, we use lines of sight \(\{\mathbf {n}\}\), formally threedimensional vectors, which we empirically estimate per object in the searched scene categories. Each line, defined in relation to the object frame, represents a perspective suitable for detection. In world coordinates, each line depends on the pose of its object. Lines are assigned to each pose \(\mathbf {T}_P\) in Algorithm 22. Example lines are visible in Fig. 4.10.^{16} We designate \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) as poses with lines. Before NBV estimation starts, the poses with lines are prefiltered by the method invalidateUsedLinesOfSight() from Algorithm 25. It is called for all views \(V_{U}\) that have already been explored. Instead of invalidating entire predicted poses in the frustum of a view \(V_{U}\), it only deletes the lines of sight that point into the direction \(\mathbf {x}(V_{U})\) of the view so that the poses can still be searched from other directions. [17]
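The prefiltering of lines of sight could be sketched as follows. The angle threshold and the representation of lines as unit vectors are assumptions; this is not the actual `invalidateUsedLinesOfSight()` from Algorithm 25:

```python
import math

def invalidate_used_lines_of_sight(poses_with_lines, view_dir,
                                   max_angle=math.pi / 4):
    """Sketch of the prefilter: delete every line of sight that points
    into the direction x(V_U) of an already explored view (i.e. is nearly
    anti-parallel to it), so that the pose can still be searched from
    other directions. Lines and view_dir are unit 3D vectors."""
    filtered = []
    for pose, lines in poses_with_lines:
        kept = []
        for n in lines:
            # angle between the line of sight and the reversed view direction
            dot = n[0] * view_dir[0] + n[1] * view_dir[1] + n[2] * view_dir[2]
            angle = math.acos(max(-1.0, min(1.0, -dot)))
            if angle > max_angle:  # not covered by the used view: keep it
                kept.append(n)
        filtered.append((pose, kept))
    return filtered
```

A line anti-parallel to the view direction is removed, while a line orthogonal to it survives, so the pose remains searchable from the side.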
At the beginning of this section, we introduced the estimation of Next-Best-Views as calculating an optimal combination \((\{o\}_N,V_N)\) of a view \(V_N\) and the objects \(\{o\}_N\) that shall be searched in it. Thus, estimating Next-Best-Views can be defined as an optimization problem. Algorithm 24, which performs this estimation, is the corresponding optimization algorithm. It operates on a problem we designed with the aim of representing the underlying real-world problem as realistically as possible. However, efficiency in estimating Next-Best-Views is also a predominant issue, since the presented technique has to be executed in a loop during the RELATION_BASED_SEARCH of missing objects. While we focus our efforts on speeding up the estimation with the help of the employed optimization algorithm, the definition of the optimization problem should not be neglected in this context either.
The corresponding equation can be found in line 26 in Algorithm 23. We use the term “reward” for our objective function as estimating Next-Best-Views is usually considered a problem from the field of automated planning. In the reward, we include inverses of cost functions instead of the cost functions themselves. Since finding optimal combinations \((\{o\}, V)\) makes it necessary to simultaneously minimize costs and maximize utility, an objective function consisting of both would be contradictory. Unifying inverse costs with a utility in a common objective function instead yields a consistent maximization problem. A reward \(r(\{o\}, V)\) that consists of a weighted sum of a utility \(u'(\{o\}, V)\) and inverse cost functions \(i'_{p}(\mathbf {C}, \mathbf {C}_{C}), i'_{n}(\mathbf {C}, \mathbf {C}_{C}), i'_{d}(\mathbf {C}, \mathbf {C}_{C}), i'_{o}(\{o\})\) is used to rate combinations \((\{o\}, V)\) of object sets and views in order to find the best one. [17]
The utility function \(u(V, \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\})\) rates each pose prediction \((\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) in a view V with regard to the confidence that an object at 6-DoF pose \(\mathbf {T}_\mathbf {F}\) is detectable for object localizers. Confidence of detection is defined as optimal for a pose when the measures \(u_a(V, \mathbf {T}_\mathbf {F})\), \(u_d(V, \mathbf {T}_\mathbf {F})\) and \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) are maximized at once: \(u_a(V, \mathbf {T}_\mathbf {F})\), which evaluates the angle \(\angle (\mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}})\) between the camera view direction \(\mathbf {x}(V)\) and the ray \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\) from camera position \(\mathbf {p}\) to position \(\mathbf {p}_\mathbf {F}\) (from predicted pose \(\mathbf {T}_\mathbf {F}\)), favors predicted poses that lie at the center of the camera field of view. \(u_d(V, \mathbf {T}_\mathbf {F})\), which evaluates the projection \(\langle \mathbf {x}(V), \overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\rangle \) of ray \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\) onto the view direction, favors predicted poses that lie halfway between ncp and fcp along the view direction. \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\), which evaluates the angle between ray \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\) and that line of sight \(\mathbf {T}_\mathbf {F} \cdot \mathbf {n}\) (in world coordinates) of the predicted pose which is most similar to \(\overrightarrow{\mathbf {p}\mathbf {p}_{\mathbf {F}}}\), ensures that the predicted object is observed from a perspective favorable for its detection. [17]
In the utility function \(u(V, \{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\})\) from Eq. 4.6, the three aforementioned measures are combined by multiplication. This ensures that an object is considered as hard to detect at pose \(\mathbf {T}_\mathbf {F}\) as soon as any of the three measures returns a low rating for that pose. Let’s assume, for instance, that the measures \(u_a(V, \mathbf {T}_\mathbf {F})\), \(u_d(V, \mathbf {T}_\mathbf {F})\) tell that pose \(\mathbf {T}_\mathbf {F}\) is located at the center of the frustum of a candidate view V. The missing object this pose belongs to could still not be detected if measure \(u_n(V, \mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\) told us that the view covers no side of the object from which it is well detectable. As in Sect. 3.4.4, we combine measures that operate on incommensurable spaces so that a normalization of each of them is required before they are multiplied by each other. We normalize each measure with a trigonometric function \(f(x,\text {max}) = \frac{1}{2} + \frac{1}{2}\cos \left( \frac{\pi }{\text {max}} x\right) \) where x stands for one of the geometric quantities from which the measures are calculated. This function imposes a curve on the interval \([0, \text {max}]\) that resembles one half of a normal distribution but lacks the variance \(\sigma \) that would otherwise have to be manually fixed. Moreover, \(f(x,\text {max})\) is a function “that is monotonically decreasing and for which applies: \(f(0,\text {max}) = 1\) and \(f(\text {max},\text {max}) = 0\)” [17].
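The normalization function and the multiplicative combination of the three measures can be sketched as follows. The argument names and the interpretation of `dist_d` as the deviation from the frustum midpoint are assumptions:

```python
import math

def f(x, maximum):
    """Normalization from the text: monotonically decreasing on
    [0, maximum], with f(0, max) = 1 and f(max, max) = 0."""
    return 0.5 + 0.5 * math.cos(math.pi * x / maximum)

def utility(angle_a, dist_d, angle_n, fov_max, range_max, tilt_max):
    """Multiplicative combination (sketch): a single low measure makes
    the whole rating of the pose low. The arguments stand for the
    geometric quantities behind u_a, u_d and u_n; dist_d is taken as the
    deviation from the midpoint between ncp and fcp."""
    return f(angle_a, fov_max) * f(dist_d, range_max) * f(angle_n, tilt_max)
```

Because the measures are multiplied after normalization to [0, 1], the product stays in [0, 1] and drops to zero as soon as any single measure does.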
All four cost expressions are formally defined in Eqs. 4.10 to 4.13. That these equations define inverse costs can be deduced from the shape of most of them: the actual normalized costs are subtracted from the respective maximum costs. Concerning the inverse costs \(i'_p(\mathbf {C}, \mathbf {C}_C)\) for the PTU in Eq. 4.10, it should be noted that the distances both axes \(\rho , \tau \) have to travel in order to reach the goal configuration of the PTU are not added up but compared by a minimum operator. The reason is that both axes are driven in parallel rather than sequentially. The minimum operator is also used in Eq. 4.11, which defines the inverse costs \(i'_n(\mathbf {C}, \mathbf {C}_C)\) with respect to the orientation \(\theta \) of the mobile base. There, the minimum operator expresses that the base always turns in the direction in which the smaller angle has to be covered to reach the goal orientation \(\theta \). This equation differentiates between two cases, depending on whether the current position \((x_{C}, y_{C})\) and the goal position (x, y) of the mobile base are approximately identical. If so, the robot just needs to turn around its own axis to reach the goal configuration for its base. Otherwise, it has to turn twice: at the position \((x_{C}, y_{C})\) of its current configuration \(\mathbf {C}_C\), it first orients itself towards an interim orientation \(\gamma \) so that it can drive as linearly as possible to the position (x, y) of the goal configuration \(\mathbf {C}\). Once it has reached this position, it orients itself towards the goal orientation. Linear, translational movements of the mobile base are favored since the inverse costs \(i'_d(\mathbf {C}, \mathbf {C}_C)\) for relocating the base are approximated in Eq. 
4.12 through the Euclidean distance \(\sqrt{\left( x - x_C \right) ^2 + \left( y - y_C \right) ^2}\) between its goal and its current position.

Inverse costs from the current configuration \(\mathbf {C}_C\) to the goal configuration \(\mathbf {C}\) are a weighted sum of normalized inverse travel costs for the PTU \(i'_p(\mathbf {C}, \mathbf {C}_C)\), the robot orientation \(i'_n(\mathbf {C}, \mathbf {C}_C)\) and the robot position \(i'_d(\mathbf {C}, \mathbf {C}_C)\) (distance normalized by a Gaussian), as well as of \(i'_o(\{o\})\), which relates the runtimes t(o) of running detection for all objects in combination \((\{o\}, V)\) to the sum of detection runtimes for all searched objects \(\{o\}_P\). [17]
The model with which we approximate the costs for moving the PTU is shown in Fig. 4.13. On the left, we see an exemplary configuration the PTU has adopted, together with a pair of coordinate frames. Each frame includes one of the axes around which the PTU turns. The distances each of the two axes has to cover in order to transfer the PTU from its current configuration \((\rho _{C},\tau _{C})\) to the goal configuration \((\rho ,\tau )\) on the right of Fig. 4.13 are visualized in the middle of this figure. Apart from the inverse translational travel costs \(i'_d(\mathbf {C}, \mathbf {C}_C)\), all cost expressions employed in our reward are normalized in a straightforward manner. The translational expression, however, is normalized by the density function of a normal distribution, stripped of its normalizing prefactor. We normalize in this way since no upper bound exists for the values of the Euclidean distances that are to be rated. The manually fixed variance \(\sigma \) of this distribution nevertheless enables us to roughly distinguish between cheap and expensive translational travel costs. Indirectly, we thereby set the range of coverage of the mobile robot in the course of ASR.
When adding up the diverse cost expressions in our reward, the question arises what they actually represent as a whole. A straightforward and universal approach to quantifying the costs for a robot travelling to a view V and executing object localization for all objects \(\{o\}\) potentially present in that view is time consumption. Since an exact estimation of time consumption would require a costly simulation, we favored subdividing the overall costs into the four aforementioned expressions instead. For each cost aspect covered by one of the terms, we identified a quantity that can be estimated efficiently and correlates with the time consumption of the aspect it represents. Of course, proceeding in this fashion entails simplifications. For example, all of our inverse travel costs implicitly presume constant velocities. What the individual cost expressions do not cover is the percentage with which every single aspect contributes to the overall time consumption. The share of each expression is therefore modeled by the weights \(\omega _{\rho }, \omega _{\theta }, \omega _{d}, \omega _{o}\) in line 26 of Algorithm 23, which we set by hand.
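To make the structure of these cost terms concrete, here is a loose Python sketch of how suchlike inverse costs could be computed and combined. It is not Eqs. 4.10 to 4.13 themselves: the normalization constants, the angle-wrapping helper `turn_angle` and the position-equality threshold `eps` are assumptions made for self-containedness, and the detection term `i_o` is taken as precomputed.

```python
import math

def turn_angle(a: float, b: float) -> float:
    """Smallest angle (radians) to rotate from orientation a to b."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)   # always turn the shorter way

def inv_cost_ptu(rho, tau, rho_c, tau_c, rho_max, tau_max):
    # Both PTU axes move in parallel; taking the minimum of the two inverse
    # (one-minus-normalized) costs lets the slower axis dominate.
    return min(1.0 - abs(rho - rho_c) / rho_max,
               1.0 - abs(tau - tau_c) / tau_max)

def inv_cost_orientation(theta, theta_c, pos, pos_c, eps=1e-3):
    (x, y), (x_c, y_c) = pos, pos_c
    if math.hypot(x - x_c, y - y_c) < eps:
        # Positions coincide: one turn on the spot suffices.
        cost = turn_angle(theta_c, theta)
    else:
        # Turn towards an interim orientation gamma, drive, then turn again.
        gamma = math.atan2(y - y_c, x - x_c)
        cost = turn_angle(theta_c, gamma) + turn_angle(gamma, theta)
    return 1.0 - cost / (2 * math.pi)   # at most two half-turns

def inv_cost_translation(pos, pos_c, sigma):
    # Unbounded Euclidean distance, normalized by a Gaussian density
    # without its prefactor; sigma sets the robot's range of coverage.
    d = math.hypot(pos[0] - pos_c[0], pos[1] - pos_c[1])
    return math.exp(-d * d / (2.0 * sigma ** 2))

def inverse_costs(weights, i_p, i_n, i_d, i_o):
    # Weighted sum over the four normalized inverse cost expressions.
    w_rho, w_theta, w_d, w_o = weights
    return w_rho * i_p + w_theta * i_n + w_d * i_d + w_o * i_o
```

With the hand-set weights summing to one, the combined inverse costs stay in [0, 1], which matches the normalization of the utility they are added to.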
Altogether, the four inverse cost expressions form an objective function of roughly convex shape, with the global maximum located at the current configuration of the robot, where it would look for a single missing object. The utility function as an objective function, on the other hand, produces local maxima in proximity to clusters of predicted object poses. It can hardly be expected that local maxima of inverse costs and utility coincide at the same locations in the environment of the robot. Under these circumstances, utility and inverse costs would mutually damp each other if we multiplied both in the reward \(r(\{o\}, V)\). Multiplication would just make high rewards easy to identify, while the frontier between poor and average rewards would be blurred. In order to preserve this frontier, we decided to aggregate utility and inverse costs by summation instead.
4.6.2 Optimization Algorithm for Next-Best-View Estimation
Modeling view planning in a realistic way not only yields increased complexity and therefore computational costs but also, in our case, a non-linear objective function \(r(\{o\}, V)\). Thus, we treat estimating Next-Best-Views as a non-linear optimization problem [4, p. 861]. We have already dealt with such problems when verifying ISM accumulators in Sect. 3.4.4 and when selecting relation topologies in Sect. 3.6.2. Before explaining which optimization algorithm we designed for estimating camera views under these circumstances, we first analyze the parameter space \(2^{\{o\}_{P}} \times \{V\}\) of our reward in detail.
Apart from being high-dimensional,^{19} this parameter space is infinite in size. This is due to the space of 6-DoF camera poses it comprises. Furthermore, the space is also exponential in size with respect to the number of searched objects \(\{o\}_{P}\) it includes. While the size of the object-related portion of the parameter space cannot be influenced without resorting to heuristics, the time consumed in evaluating the objective function across the power set \(2^{\{o\}_{P}}\) can be reduced on an algorithmic level, e.g. by dynamic programming.^{20} Likewise, the dimensionality of the object-related portion of the parameter space cannot be changed without using heuristics.
The situation is different in camera pose space, on which the optimization-relevant parameters of the views \(\{V\}\) are defined. The kinematic chain of the abstract robot we introduced in Sect. 4.4 makes it possible to reduce the dimensionality of this camera-related portion of the parameter space in two ways. On the one hand, the height of the sensor head above the ground is constant, so that only the coordinates (x, y) of the camera in the two-dimensional plane of the floor have to be optimized. We also designate these coordinates as two-dimensional robot positions. On the other hand, the sensor head can only be pivoted in two \((\rho ,\tau )\) of the three degrees of freedom with which a rigid body can be rotated in space. This is because the PTU, which moves the sensor head, only has two axes of rotation. Despite both reductions, the resulting 4-DoF space is still infinite. However, only the portion referring to robot positions is defined on an unbounded domain [4, p. 118]. In comparison, the orientations \((\rho ,\tau )\), which we also call camera orientations, are defined on the bounded interval \([{0}^{\circ }, {360}^{\circ }]\).
The subject of this section is to estimate Next-Best-Views. According to Sect. 1.3, Next-Best-View estimation is one of the contributions of this thesis. In the preceding section, we introduced this task as a combinatorial optimization problem. Since its parameter space is partially continuous but it should still be handled by algorithms from combinatorial optimization, we discretize the combined robot position and camera orientation space. Of course, this operation may exclude the globally optimal camera view from optimization. Combining this discretized space with the power set of the searched objects \(\{o\}_{P}\) yields the search space for which we introduce a Next-Best-View estimation algorithm in this section. By simultaneously operating on all degrees of freedom that are relevant for adjusting the camera during object search, optimization can move to parts of the search space which would be out of reach otherwise, e.g. if we separately optimized robot position, camera orientation and objects to be detected.
In the search space \(\{(x,y)\} \times \{\mathbf {q}\} \times 2^{\{o\}_{P}}\) we just defined, the only portion that remains infinite is that of the sampled robot positions \(\{(x,y)\}\), because only a finite number of camera orientations \(\{\mathbf {q}\}\) is sampled from the domain \([{0}^{\circ }, {360}^{\circ }]\). We thus estimate Next-Best-Views by solving a combinatorial optimization problem with a non-linear objective function. This formalization is analogous to that of [26], who proved that Next-Best-View estimation is NP-complete. Thus, globally optimal solutions cannot be found efficiently with state-of-the-art techniques. However, if the search space were sufficiently small, a global optimum could be found, e.g. by exhaustive search. Handling the infiniteness of the search space so that only a finite but relevant subset is considered during optimization is the basic precondition for being able to use exhaustive search. We resolve this issue by defining a rectangular area around the current position \((x_{C},y_{C})\) of our mobile robot. Robot positions are then only sampled from the inside of that area. The surface of this rectangular area should be consistent with the range of coverage of the mobile robot during ASR. This range is set by means of the variance \(\sigma \) from the inverse cost function \(i'_d(\mathbf {C}, \mathbf {C}_C)\). While position sets \(\{(x,y)\}\) extracted from such areas are finite, their size can still be considerable, depending on the sampling resolution and the surface from which they are sampled. Just as the range of coverage of a mobile robot can in theory be increased without limit, no upper bound for the potential size of position sets can be specified. Nor can we define an upper bound for the costs of exhaustive search on the search space in general.
In the first execution of iterationStep(), i.e. Algorithm 23, we tessellate the entire rectangular area in robot position space being of interest to object search at a coarse resolution \(\epsilon _0\). We do so with a hex grid [18] that is aligned to the current position of the robot \((x_{C},y_{C})\) [17]. Alignment is attained by placing the current position of the robot at the center of one of the generated hexagons. Tessellating the two-dimensional floor with a hex grid instead of a square grid is advantageous as soon as the rooms in which we perform ASR are not rectangular. In our experience, samples from hex grids lie closer to obstacles in such rooms. Thus, the robot can get closer to searched objects that lie on top of obstacles.
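The alignment just described can be illustrated with a small sketch. The function `hex_grid` below is not the getHexGrid() of Algorithm 23 but shows, under simplified assumptions (axis-aligned rectangle, row-offset layout), how a hex-grid point set covering a rectangle can be anchored so that the robot position is itself one of the samples:

```python
import math

def hex_grid(cx, cy, half_w, half_h, eps):
    """Hex-grid positions covering the rectangle centred on (cx, cy).

    eps is the grid spacing (resolution). Row 0 passes through (cx, cy),
    so the grid is aligned to the current robot position.
    """
    dy = eps * math.sqrt(3) / 2        # vertical step between rows
    n_rows = int(half_h // dy)
    n_cols = int(half_w // eps) + 1
    pts = []
    for r in range(-n_rows, n_rows + 1):
        x_off = (eps / 2) if r % 2 else 0.0   # odd rows shifted by half a step
        for c in range(-n_cols, n_cols + 1):
            x = cx + c * eps + x_off
            if abs(x - cx) <= half_w:         # clip to the rectangle
                pts.append((x, cy + r * dy))
    return pts
```

At r = 0, c = 0 the robot position (cx, cy) itself appears in the returned set, which realizes the alignment described above.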
Tessellation is performed in line 2 of Algorithm 23, in which getHexGrid() returns the corners of the hexagons as a position set \(\{(x,y)\}_{\epsilon }\). From this position set, each position whose distance to every pose prediction \(\mathbf {T}_P\) of a searched object is larger than the fcp, or that lies too close to an obstacle on the 2D environment map, is discarded [17]. Pruning robot positions on the basis of their overlap with obstacles is done by means of an occupancyGrid[] with cell size s whose values at positions (x, y) are checked against a threshold \(t_{o}\). This threshold corresponds to the “probability that a grid cell is occupied” [23, p. 224]. The other heuristic, which prunes robot positions because of their distances to predicted object poses \(\mathbf {T}_P\), presumes that robot positions (x, y) and the positions \(\mathbf {p}_P\) inside the predicted poses \(\mathbf {T}_P\) can be compared. Due to their differing dimensionalities, however, this is not directly possible. We therefore make the two comparable by appending the fixed height h of the sensor head to any robot position in question.
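Both pruning heuristics amount to a filter over the sampled positions. The sketch below assumes a row-major occupancy grid and predicted positions given as 3D points; names and indexing conventions are illustrative, not the original implementation:

```python
import math

def prune_positions(positions, occupancy, cell_size, t_o, pred_positions, fcp, h):
    """Keep only robot positions that are obstacle-free and close enough
    to at least one predicted object position."""
    kept = []
    for (x, y) in positions:
        row, col = int(y // cell_size), int(x // cell_size)
        if occupancy[row][col] > t_o:       # cell too likely occupied
            continue
        # Append the fixed sensor-head height h so the 2D robot position
        # becomes comparable with the 3D positions p_P of predicted poses.
        cam = (x, y, h)
        if all(math.dist(cam, p) > fcp for p in pred_positions):
            continue                        # every prediction out of range
        kept.append((x, y))
    return kept
```

Positions passing this filter form the set from which candidate views are generated in the remainder of the iteration step.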
A set of camera orientations \(\{\mathbf {q}\}_{S}\), obtained by evenly sampling a unit sphere with a golden angle spiral [5] in getSpiralApprox() in line 2 of Algorithm 24, is passed to iterationStep(). These orientations are combined with each robot position in the set \(\{(x,y)\}_{r}\) to generate candidate views V with frustum F in getView() in line 13 of Algorithm 23 [17]. In 2 to 5 in Fig. 4.14, such a camera orientation set, limited to the workspace \([\rho _{\text {min}} ,\rho _{\text {max}}] \times [\tau _{\text {min}} ,\tau _{\text {max}}]\) of the PTU, is visualized on a sphere for a given robot position. Every candidate view is then combined with all elements of the power set of the searched objects. Next, all combinations \((\{o\}, V)\) of candidate views and object sets \(\{o\} \subset \{o\}_{P}\) that are created in a single iteration are rated in two steps in Algorithm 24. As explained in the previous section, rewards \(r(\{o\}, V)\) should be calculated only from those predicted poses with lines of sight \(\{(\mathbf {T}_\mathbf F , \{\mathbf {n}\})\}\) that lie inside the frustum of view V. In line 14 of Algorithm 23, these poses are extracted from the overall set of poses with lines of sight \(\{(\mathbf {T}_P, \{\mathbf {n}\})\}\) by means of frustum culling [3]. The fact that the rewards of all combinations cannot be calculated in a single step is due to the normalized utility \(u'(\{o\}, V)\) being used in our reward function. Since its computation involves estimating the highest utility \(u(\{o\}^{*},V^{*})\) among all combinations that result from an iteration step, we first have to estimate non-normalized utilities \(u(\{o\}, V)\) for all generated combinations. Non-normalized utilities are derived from those predicted poses \(\{(\mathbf {T}_\mathbf {F}, \{\mathbf {n}\})\}_{\{o\}}\) inside the frustum that belong to objects from the set \(\{o\}\) of the respective combination.
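A common way to realize suchlike even sphere sampling is the golden-angle spiral of [5]. The sketch below returns unit view directions rather than the quaternions \(\{\mathbf {q}\}_{S}\) of the text, and ignores the PTU workspace limits; both are simplifying assumptions:

```python
import math

def golden_spiral_directions(n):
    """Sample n approximately evenly spread unit view directions on a sphere
    via the golden-angle spiral."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))  # about 2.39996 rad
    dirs = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n        # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)           # circle radius at height z
        phi = golden_angle * i               # azimuth advances by the golden angle
        dirs.append((r * math.cos(phi), r * math.sin(phi), z))
    return dirs
```

Restricting the returned directions to the PTU workspace would then correspond to the clipping visible on the spheres in 2 to 5 in Fig. 4.14.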
The increase of resolutions with each additional iteration step is shown, by way of example, in 2 to 5 in Fig. 4.14. In these pictures, increasing resolutions translate into decreasing distances between the small cylinders on the floor that represent sampled robot positions. In each iteration step, the rating of the best combination \((\{o\}_{I}, V_{I})\) is guaranteed to be at least as high as that of the best combination \((\{o\}_{A}, V_{A})\) from the preceding iteration step, i.e. \(r(\{o\}_{I}, V_{I}) \ge r(\{o\}_{A}, V_{A})\). Through greedily selecting the best-rated combination in each of its iteration steps, Algorithm 24 reaches local optima in the discrete search space. With respect to global optima, we did not notice any unreasonable Next-Best-Views being estimated throughout our experiments. All of those experiments were performed with the same fixed set of parameters, which we set inter alia in the objective function. These parameters were estimated from tests representing pathological cases for the estimation of Next-Best-Views. Iteration is aborted in Algorithm 24 once the ratings \(r(\{o\}_{I}, V_{I})\) and \(r(\{o\}_{A}, V_{A})\) of the best combinations from two consecutive iteration steps differ by less than a threshold \(t_{V}\). The more iteration steps are performed, the smaller the differences between the poses of the best views \(V_{I}, V_{A}\) of consecutive iteration steps get. At steep slopes of the objective function, the ratings of such very similar views may still differ sufficiently for threshold \(t_{V}\) to be exceeded. For cases of this sort, we introduced another termination condition that stops iterating once a maximum resolution \(\epsilon _{\infty }\) is reached with which robot positions may be tessellated. 
Each execution of iterationStep() estimates the best-rated view \(V_{I}\) at resolution \(\epsilon \) according to the reward \(r(\{o\}, V)\) once the combination \((\{o\}^{*},V^{*})\) with the highest utility in this iteration step has been identified. It is passed as best-rated view \(V_{A}\) from the last iteration to the next call of iterationStep() in Algorithm 24. In this next execution of iterationStep(), a new hex grid, whose origin lies at configuration \(\mathbf {C}_A\), is created with doubled resolution \(\epsilon \) and a halved area to be tessellated. \(\mathbf {C}_A\) is inferred by inverse kinematics from the best-rated candidate view \(V_{A}\) of the preceding iteration [17]. The result \((\{o\}_{I},V_{I})\) of the final iteration step is returned by Algorithm 24 as the overall result \((\{o\}_{N},V_{N})\) of our Next-Best-View estimation algorithm.
Doubling the resolution with each additional step roughly^{21} translates into quadrupling the number of robot positions that are sampled. For the following arguments, we presume without loss of generality that Algorithm 24 samples n positions in its first iteration step and that it terminates after i iterations at resolution \(\epsilon _{\infty }\). In the course of an exemplary Next-Best-View estimation under these assumptions, our iterative algorithm would sample \(n(i-1)\) robot positions in total. If we instead tried to estimate a Next-Best-View by exhaustive search, i.e. in one iteration at the maximal resolution, we would sample \(m = n \cdot 4^{i-1}\) positions. In this case, we would have to sample from the largest rectangular area used in our iterative approach at the maximal resolution \(\epsilon _{\infty }\) instead of the minimal resolution \(\epsilon _{0}\). Because of the inequalities \(n(i-1) \le n(i-1) \log 4 = n \log 4^{i-1} \le n (\log 4^{i-1} + \log n) = n \log (n \cdot 4^{i-1}) = n \log m\), Algorithm 24 has time complexity \(\mathcal {O}(\log (m) \cdot |\{\mathbf {q}\}_{S}| \cdot 2^{|\{o\}_{P}|})\) under the assumption that evaluating the reward function takes constant time. Logarithmic complexity in terms of robot positions can be found in the literature for similar greedy divide-and-conquer approaches. By decreasing the tessellated area in proportion to the increase in resolution in each step, the size of the robot position sets remains constant throughout iterating. Uniformly tessellating the whole area of interest for object search with the resolution used in the last iteration would return far more positions. [17]
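The control flow of this coarse-to-fine scheme can be condensed into a few lines. The sketch below is not Algorithm 24 itself: a square grid stands in for the hex grid, and the reward is reduced to a function of the robot position alone, so camera orientations and object sets are abstracted away:

```python
def refine_search(rate, center, half0, eps0, eps_min, t_v):
    """Coarse-to-fine sketch of the iterative view optimization loop.

    rate:   reward of a robot position (a stand-in for r({o}, V)).
    half0:  half-width of the initial square search area.
    eps0:   initial (coarsest) grid spacing; eps_min the finest allowed.
    t_v:    threshold on the reward gain between consecutive iterations.
    """
    (cx, cy), half, eps = center, half0, eps0
    best, best_r = center, float("-inf")
    while True:
        n = int(half / eps)            # constant ratio: grid size stays fixed
        pts = [(cx + i * eps, cy + j * eps)
               for i in range(-n, n + 1) for j in range(-n, n + 1)]
        cand = max(pts, key=rate)      # greedy choice within this grid
        r = rate(cand)
        # Terminate on a small reward gain or once the finest resolution is hit.
        if r - best_r < t_v or eps <= eps_min:
            return cand if r > best_r else best
        best, best_r = cand, r
        (cx, cy) = cand                # re-center the next grid on the best view
        eps, half = eps / 2, half / 2  # double resolution, halve the area
```

Because each grid is re-centered on the previous best position, that position reappears as a candidate in every step, which yields the monotonicity of the best rating across iterations; both termination conditions (reward gain below the threshold, resolution limit) appear in the loop.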
Another example for estimating Next-Best-Views with our approach is visualized in Fig. 4.14. 1 in this figure shows the configuration the mobile robot adopts before Next-Best-View estimation starts. In particular, the views of both cameras on the sensor head of the robot are made visible by two transparent red frustums. The directions of both views are depicted as arrows of the same color that stick out of the frustums. Besides the robot configuration, 1 also displays the cloud of object poses our estimation algorithm receives from Object Pose Prediction. Every predicted pose is made visible by a scaled-down three-dimensional model of the object to which the pose belongs. A set of lines of sight, each visualized by a yellow arrow, is attached to every prediction. 2 to 5 in Fig. 4.14, representing executions of Algorithm 23, each show the view \(V_{I}\) that belongs to the best combination \((\{o\}_{I},V_{I})\) identified during the respective iteration step. A view \(V_{I}\) is made visible by an arrow indicating its direction. This arrow sticks out of a sphere of the same color. On the surface of this sphere, we see the camera orientations \(\{\mathbf {q}\}_{S}\) from spiral sampling, visualized by small squares from which small arrows stick out. A perpendicular blue line connects the sphere, and therefore all camera orientations that it comprises, with a sampled robot position. This expresses how Algorithm 23 combines any robot position with all available camera orientations. Besides, it shows which position the view \(V_{I}\) corresponds to. The length of the blue line segment increases across the pictures so that spheres from different iteration steps do not overlap. In contrast, the arrows stay at the same height. 6 in Fig. 4.14 shows the pair of views the stereo camera on the sensor head would adopt if the mobile robot took on the configuration \(\mathbf {C}_{N}\) that corresponds to the Next-Best-View \(V_{N}\). 
It should be noted that the orientations of the cameras differ from the displayed directions of the visualized views. The frustums of these target views are colored turquoise so that they cannot be mixed up with the current views of the sensor head. It should also be noted that the predicted object poses are located at the center of the union of both viewing frustums and that the directions of the depicted views are similar to a number of the lines of sight inside the frustums.
4.6.3 Invalidation of Lines of Sight in Clouds of Predicted Poses
The effect of Algorithm 25 can, for example, be studied in Fig. 4.15. 1 in this figure contains lines of sight at its lower left corner that are removed while executing this algorithm. In 2, which displays the results of Algorithm 25, these lines are no longer visible. The removed lines were both located inside the turquoise frustum of view \(V_U\) and oriented towards the cameras of the depicted mobile robot.
4.6.4 Discussion
In this chapter, we defined finding an optimal combination \((\{o\}_{N},V_{N})\) of a view \(V_{N}\) and the objects \(\{o\}_{N}\) to be searched in it as a combinatorial optimization problem. As already discussed, exhaustive search across the entire search space is not a viable option in view of its size. As an alternative, we presented a heuristic approximation algorithm in Sect. 4.6.2 that stays as close as possible to the exhaustive approach. In Sect. 4.6.2, we presumed that our technique converges towards a value within the partly continuous parameter space \(2^{\{o\}_{P}} \times \{V\}\) of the reward function. We show this in the following. For the first iteration step of Algorithm 24, we assume that the maximum distance \(d_{1}\) between any current robot position \((x_{C},y_{C})\) and the position \((x_{1},y_{1})\) of the best-rated view’s configuration is given. Due to the continuity of the floor map, an \(n_0\)-th iteration of that algorithm exists from which on the distances \(d_{n} = \Vert (x_{n},y_{n}) - (x_{n-1},y_{n-1})\Vert \) between such “best-rated” positions \((x_{n},y_{n})\), \((x_{n-1},y_{n-1})\) from consecutive iteration steps decrease according to \(d_{n} \le \frac{1}{2^{n-1}} d_{1}\). Hence, \(d_{n} \rightarrow 0\) for \(n \rightarrow \infty \). While the positions of the configurations to which hex grid generation is aligned change less and less, the same camera orientations and object sets are considered by default in every iteration step. In total, the portion of the search space considered in an iteration step changes less and less with every additional iteration. The same holds true for the best-rated combinations \((\{o\}_{I}, V_{I})\) within these search space portions. That the combination \((\{o\}_{N},V_{N})\) towards which Algorithm 24 converges is a local optimum inside the discretized search space has already been discussed in Sect. 4.6.2. 
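Under the stated bound \(d_{n} \le \frac{1}{2^{n-1}} d_{1}\), convergence of the positions follows from a geometric tail estimate. Spelled out for iteration indices \(m > n \ge n_0\) (a standard Cauchy argument, added here for completeness):

```latex
\left\Vert (x_{m}, y_{m}) - (x_{n}, y_{n}) \right\Vert
  \;\le\; \sum_{k=n+1}^{m} d_{k}
  \;\le\; d_{1} \sum_{k=n+1}^{\infty} 2^{-(k-1)}
  \;=\; \frac{d_{1}}{2^{n-1}}
  \;\xrightarrow{\; n \rightarrow \infty \;}\; 0 .
```

The sequence of best-rated positions is therefore Cauchy and converges on the (complete) plane of robot positions.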
Graphically speaking, the Next-Best-View estimation algorithm stops being capable of converging to the globally optimal combination as soon as this combination no longer lies within the diminishing rectangular area from which the algorithm extracts hex grids.
The starting point for developing this algorithm were existing contributions to the field of Next-Best-View estimation. When we look at exemplary contributions for prominent application scenarios pursued in recent years, such as three-dimensional Object Reconstruction [24] or Robot Exploration [19], we notice that they focus on developing suitable objective functions while putting less emphasis on how to efficiently select the candidate views that are to be rated. In these contributions, exhaustive search is performed on straightforwardly discretized spaces. As a matter of fact, the aforementioned application scenarios place low demands on the time consumption of Next-Best-View estimation. In contrast, seminal work [26] in three-dimensional Object Search, which is time-sensitive, has put significantly more emphasis on efficiently selecting candidate views. However, their approach optimizes the different dimensions of its parameter space, e.g. robot positions and camera orientations, separately, with the limitations already mentioned in comparison to our approach. Our approach stresses the importance of search space design and optimization algorithm development in the context of Next-Best-View estimation. While retaining access to as large a portion of the parameter space of the objective function as possible, increased efficiency makes it possible to take advantage of more realistic objective functions. This enables us to estimate valuable Next-Best-Views while maintaining a time consumption low enough, e.g., for repeatedly estimating Next-Best-Views in an object search loop.
Footnotes
 1.
During indirect search, we only look for objects that are missing in the available scene models. Objects from scene categories for which no partial scene models have been estimated yet are looked for during direct search.
 2.
We used the library SMACH [2] for implementing ASR.
 3.
We look at Fig. 4.3 in landscape format.
 4.
Since in relation to learning, the contributions in this thesis are restricted to that of scene category models, such a simplification makes sense.
 5.
We designate a scene model from a scene category as incomplete when the model does not contain all objects that belong to the category.
 6.
The pose of the robot is composed of the pose of its mobile base and the orientation of its PTU.
 7.
Please keep in mind that “NBV estimation” selects both the objects that shall be searched from a viewpoint and the viewpoint itself.
 8.
Without loss of generality, all variations are supposed to be visible from the same set of robot head poses.
 9.
We present all operations as performed on the view of the left camera.
 10.
The confidence in detecting objects from a given viewpoint depends on the absolute number of predictions within the viewing frustum the given viewpoint induces.
 11.
Please bear in mind that scene model instance and scene category instance are synonyms.
 12.
This kind of visualization is detailed further in Sect. 4.6.
 13.
The subset \([\mathbf {I}_{\{\mathbf {S}\}}^{r}]\) is a multiset [10, p. 29].
 14.
The color of a cloud changes from red to green with increasing confidence of the represented instance.
 15.
Instances with low, average and high ratings are displayed as red, yellow and green clouds.
 16.
This picture displays a dark green frustum of a view at position \(\mathbf {p}\) that contains two predicted poses for a spherical grey object. The same lines of sight, visualized as dark red arrows, are assigned to each of both predictions.
 17.
For the sake of simplicity, we only consider the left camera on the sensor head of the abstract robot from Sect. 4.4 in this and the following two sections.
 18.
Throughout the experiments we conducted for this thesis, we set frustum parameters as follows: \(\text {fov}_x = {30}^{\circ }\), \(\text {fov}_y = {20}^{\circ }\), ncp \(= 0.4\) m, fcp \(= 1.5\) m.
 19.
The dimensionality of the parameter space of our objective function is \(|\{o\}_{P}| + 6\).
 20.
As earlier in this thesis, we limit the description of our contributions to design decisions that reduce problem complexity, leaving out canonical algorithm optimization strategies such as parallelization.
 21.
Here, we ignore the specificities of hex grids.
References
 1. Aumann-Cleres, F.: Markerbasiertes Kalibrieren der kinematischen Kette und Aufstellen der Rückwärtstransformation zwischen der Basis und dem Sensorkopf eines mobilen Roboters. Bachelor’s thesis, Advisor: P. Meißner, Reviewer: R. Dillmann, Karlsruhe Institute of Technology (2016)
 2. Bohren, J., Cousins, S.: The SMACH high-level executive. IEEE Robot. Autom. Mag. (2013)
 3. Bourke, P.: Frustum culling. http://paulbourke.net/miscellaneous/frustum (2000). Accessed 01 Dec 2017
 4. Bronshtein, I., Semendyayev, K., Musiol, G., Muehlig, H.: Handbook of Mathematics, 5th edn. Springer, Berlin (2007)
 5. Devert, A.: Spreading points on a disc and on a sphere. Marmakoide’s Blog. http://blog.marmakoide.org/?p=1 (2012). Accessed 14 Nov 2017
 6. Dillmann, R., Huck, M.: Informationsverarbeitung in der Robotik. Springer, Berlin (1991)
 7. Eidenberger, R., Grundmann, T., Schneider, M., Feiten, W., Fiegert, M., Wichert, G.V., Lawitzky, G.: Scene analysis for service robots. In: Towards Service Robots for Everyday Environments, pp. 181–213. Springer, Berlin (2012)
 8. Gamma, E., Johnson, R., Vlissides, J., Helm, R.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Boston (1995)
 9. Garvey, T.D.: Perceptual strategies for purposive vision. Technical Note 117, SRI International (1976)
 10. Hein, J.L.: Discrete Mathematics, 2nd edn. Jones and Bartlett Publishers, Burlington (2002)
 11. Isard, M., Blake, A.: Condensation: conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
 12. Karrenbauer, O.: Realisierung und komparative Analyse von alternativen Methoden zum uninformierten Generieren optimaler Folgen von Ansichten für die 3D-Objektsuche. Bachelor’s thesis, Advisor: P. Meißner, Reviewer: R. Dillmann, Karlsruhe Institute of Technology (2017)
 13. Kunze, L., Doreswamy, K.K., Hawes, N.: Using qualitative spatial relations for indirect object search. In: Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 163–168. IEEE (2014)
 14. Lehmann, A., Leibe, B., Van Gool, L.: Fast PRISM: branch and bound Hough transform for object class detection. Int. J. Comput. Vis. 94(2), 175–197 (2011)
 15. Lorbach, M., Hofer, S., Brock, O.: Prior-assisted propagation of spatial information for object search. In: Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), pp. 2904–2909. IEEE (2014)
 16. Meißner, P., Reckling, R., Wittenbeck, V., Schmidt-Rohr, S., Dillmann, R.: Active scene recognition for programming by demonstration using next-best-view estimates from hierarchical implicit shape models. In: Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 5585–5591. IEEE (2014)
 17. Meißner, P., Schleicher, R., Hutmacher, R., Schmidt-Rohr, S., Dillmann, R.: Scene recognition for mobile robots by relational object search using next-best-view estimates from hierarchical implicit shape models. In: Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 137–144. IEEE (2016)
 18. Patel, A.: Hexagonal grids. https://www.redblobgames.com/grids/hexagons (2013 & 2015). Accessed 11 Nov 2017
 19. Potthast, C., Sukhatme, G.S.: A probabilistic framework for next best view estimation in a cluttered environment. J. Vis. Commun. Image Represent. 25(1), 148–164 (2014)
 20. Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R., Ng, A.Y.: ROS: an open-source robot operating system. In: ICRA Workshop on Open Source Software, Kobe, p. 5 (2009)
 21. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd international edn. Prentice Hall Press, Upper Saddle River (2010)
 22. Siciliano, B., Khatib, O.: Springer Handbook of Robotics. Springer Science + Business Media, Berlin (2008)
 23. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, Cambridge (2005)
 24. Vasquez-Gomez, J.I., Sucar, L.E., Murrieta-Cid, R.: View planning for 3D object reconstruction with a mobile manipulator robot. In: Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014), pp. 4227–4233. IEEE (2014)
 25. Wixson, L.E., Ballard, D.H.: Using intermediate objects to improve the efficiency of visual search. Int. J. Comput. Vis. 12(2–3), 209–230 (1994)
 26. Ye, Y., Tsotsos, J.K.: Sensor planning for 3D object search. Comput. Vis. Image Underst. 73(2), 145–168 (1999)