1 Introduction

The advent of low-cost RGBD sensors such as the Microsoft Kinect or the Asus Xtion Pro is changing the computer vision world, as they are being successfully used in several applications and research areas. Many of these applications, such as gaming or human-computer interaction systems, rely on the efficiency of learning a scene background model for detecting and tracking moving objects, to be further processed and analyzed. Depth data are particularly attractive and suitable for applications based on moving object detection, since they are not affected by several problems typical of color-based imagery. However, depth data suffer from other problems, such as depth camouflage or noisy depth measurements, which limit the efficiency of depth-only background modeling approaches. The complementary nature of the synchronized color and depth information acquired by RGBD sensors poses new challenges and design opportunities: new strategies are required that explore the effectiveness of combining depth- and color-based features, or their joint incorporation into well-known moving object detection and tracking frameworks.

In order to evaluate and compare scene background modelling methods for moving object detection on RGBD videos, we assembled and made available the SBM-RGBD dataset. It provides all the facilities (data, ground truths, and evaluation scripts) needed for the SBM-RGBD Challenge, organized in conjunction with the Workshop on Background Learning for Detection and Tracking from RGBD Videos, 2017. The dataset and the results of the SBM-RGBD Challenge, described in the following sections, will remain available after the competition as a reference for future methods.

2 Video Categories

The SBM-RGBD dataset provides a wide set of synchronized color and depth sequences acquired by the Microsoft Kinect. The dataset consists of 33 videos (about 15000 frames) representative of typical indoor visual data captured in video surveillance and smart environment scenarios, selected to cover a wide range of scene background modeling challenges for moving object detection. The videos come from our personal collections as well as from existing public datasets, including the GSM dataset, described in Moyá-Alcover et al. [13], MULTIVISION, described in Fernandez-Sanchez et al. [5], the Princeton Tracking Benchmark, described by Song and Xiao [14], the RGB-D object detection dataset, described by Camplani and Salgado [3], and the UR Fall Detection Dataset, described by Kwolek and Kepski [7].

The videos have 640 × 480 spatial resolution and their length varies from 70 to 1400 frames. Depth images are recorded at either 16 or 8 bits. They are already synchronized and registered with the corresponding color images by projecting the depth map onto the color image, providing a color-depth pixel correspondence. For each sequence, pixels that have no color-depth correspondence (due to the difference between the color and depth camera centers) are marked in black in a binary Region-of-Interest (ROI) image (see Fig. 2-(c)) and are excluded from the evaluation (see Sect. 4).
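To make the use of the ROI concrete, the following minimal Python sketch shows how a valid-pixel mask can be built and applied; the file names and the use of OpenCV are our own illustrative assumptions, and only the black-means-excluded convention comes from the dataset:

```python
import cv2
import numpy as np

# Hypothetical file names: the actual naming convention is defined by the dataset release.
color = cv2.imread("frame_000123_color.png")                         # 8-bit, 3-channel color image
depth = cv2.imread("frame_000123_depth.png", cv2.IMREAD_UNCHANGED)   # 8- or 16-bit depth image
roi   = cv2.imread("ROI.png", cv2.IMREAD_GRAYSCALE)                  # binary Region-of-Interest image

# Black ROI pixels have no color-depth correspondence and must be ignored,
# both when modeling the background and when evaluating the detection masks.
valid = roi > 0
print("pixels used for evaluation:", int(valid.sum()), "out of", valid.size)
```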

The videos span seven categories, selected to include diverse scene background modelling challenges for moving object detection. These well-known challenges can relate only to the RGB channels (RGB), only to the depth channel (D), or to all channels (RGB+D):

  1. Bootstrapping (RGB+D): Videos including foreground objects in all their frames. The challenge is to learn a model of the scene background (to be adopted for background subtraction) even when the usual assumption of having a set of training frames free of foreground objects fails.

    This category includes five videos; in most of them, some scene regions are always occupied by foreground people, so the background is never revealed there.

  2. Color Camouflage (RGB): Videos including foreground objects whose color is very close to that of the background, making a correct segmentation based only on color difficult. This category consists of four videos where foreground objects are moved in front of a similarly colored background (e.g., a white box in front of other white boxes, or a piece of furniture on wheels moving in front of other furniture of the same color).

  3. Depth Camouflage (D): Videos including foreground objects very close in depth to the background. In these cases the sensor reports the same depth values for foreground and background, making a correct segmentation based only on depth difficult. The category consists of four videos where people move their hands or other objects very close to the background.

  4. Illumination Changes (RGB): Videos containing strong and mild illumination changes. The challenge here is to adapt the color background model to illumination changes in order to achieve an accurate foreground detection. Four videos are included in this category, where the illumination varies because the light source is covered or the lighting is unstable during acquisition.

  5. Intermittent Motion (RGB+D): Videos with scenarios known for causing ghosting artifacts in the detected motion, i.e., abandoned or removed foreground objects. The challenge here is to detect foreground objects even if they stop moving (abandoned object) or if they were initially stationary and then start moving (removed object). This category consists of six videos including abandoned and removed objects. Two videos are obtained by reversing the original temporal order of the frames (so that an object that is abandoned in the original sequence appears as removed in the reversed sequence).

  6. Out of Sensor Range (D): Videos including foreground or background objects that are too close to or too far from the sensor. In these cases the sensor is unable to measure depth, due to its minimum and maximum depth specifications, resulting in invalid depth values. Five videos are included in this category, where many invalid depth values are due to foreground objects whose distance from the sensor is outside the admissible sensor range.

  7. Shadows (RGB+D): Videos showing shadows caused by foreground objects, which block the active light emitted by the sensor from reaching the background. This casts shadows onto the background that can be mistaken for moving objects. RGBD sensors exhibit two different types of shadows: visible-light shadows in the RGB channels and IR shadows in the depth channel. The category consists of five videos with shadows of varying strength.

Examples of videos from all the categories are reported in Fig. 1.

Fig. 1. Examples of videos from all the categories: (a) Bootstrapping, (b) ColorCamouflage, (c) DepthCamouflage, (d) IlluminationChanges, (e) IntermittentMotion, (f) OutOfRange, (g) Shadows.

3 Ground Truths

To enable a precise quantitative comparison of algorithms for moving object detection from RGBD videos, all the videos come with pixel-wise ground-truth foreground segmentations. A foreground region is intended as anything that does not belong to the background, including abandoned objects and still persons, but excluding light reflections, shadows, etc. The ground-truth images, some of which were created using the GroundTruther software kindly made available by the organizers of changedetection.net, contain four labels (see Fig. 2-(d)), namely:

  • 0: Background

  • 85: Outside ROI

  • 170: Unknown motion

  • 255: Foreground

Areas around moving objects are labeled as unknown motion, due to semi-transparency and motion blur that do not allow a precise foreground/background classification. These areas, like those outside the ROI, are therefore excluded from the evaluation.
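As an illustration, a possible way to turn a ground-truth frame into the sets of pixels actually used for evaluation is sketched below; the file name is hypothetical, while the label values are those listed above:

```python
import cv2
import numpy as np

gt = cv2.imread("gt_000123.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

evaluated = ~np.isin(gt, (85, 170))      # drop "outside ROI" and "unknown motion" pixels
positives = (gt == 255) & evaluated      # ground-truth foreground
negatives = (gt == 0) & evaluated        # ground-truth background
```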

Fig. 2. Sequence ChairBox: (a) color and (b) depth images; (c) ROI; (d) ground truth.

While our evaluation is carried out on all the ground truths for all the videos, only a subset of them is made publicly available for testing, in order to reduce the possibility of over-tuning method parameters.

4 Metrics

The SBM-RGBD dataset also comes with tools to compute performance metrics for moving object detection from RGBD videos, and thus to identify algorithms that are robust across the various challenges. Let TP, FP, FN, and TN indicate, for each video, the total number of True Positive, False Positive, False Negative, and True Negative pixels, respectively. The seven metrics, widely adopted in the literature for evaluating the results of moving object detection (e.g., [6]), are

  1. Recall

    $$\begin{aligned} Rec=\frac{TP}{TP + FN} \end{aligned}$$
  2. Specificity

    $$\begin{aligned} Sp=\frac{TN}{TN + FP} \end{aligned}$$
  3. False Positive Rate

    $$\begin{aligned} FPR=\frac{FP}{FP + TN} \end{aligned}$$
  4. False Negative Rate

    $$\begin{aligned} FNR=\frac{FN}{TP + FN} \end{aligned}$$
  5. Percentage of Wrong Classifications

    $$\begin{aligned} PWC=100 * \frac{FN + FP}{TP + FN + FP + TN} \end{aligned}$$
  6. Precision

    $$\begin{aligned} Prec=\frac{TP}{TP + FP} \end{aligned}$$
  7. F-Measure

    $$\begin{aligned} F_1=\frac{2 * Prec * Rec}{Prec + Rec} \end{aligned}$$

The Matlab scripts to compute all performance metrics have been adapted from the scripts available at changedetection.net.
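For readers who prefer Python, an informal numpy sketch of the same computation is given below; it is not the official evaluation code and, consistently with the experiments reported in Sect. 5, it sets undefined ratios to zero:

```python
import numpy as np

def compute_metrics(pred, gt):
    """pred: binary foreground mask; gt: ground-truth labels (0, 85, 170, 255)."""
    pred = pred.astype(bool)
    evaluated = ~np.isin(gt, (85, 170))            # ignore non-ROI and unknown-motion pixels
    pos = (gt == 255) & evaluated
    neg = (gt == 0) & evaluated
    tp, fn = np.sum(pred & pos), np.sum(~pred & pos)
    fp, tn = np.sum(pred & neg), np.sum(~pred & neg)
    div = lambda a, b: a / b if b > 0 else 0.0     # undefined values are set to zero
    rec, prec = div(tp, tp + fn), div(tp, tp + fp)
    return {
        "Recall": rec,
        "Specificity": div(tn, tn + fp),
        "FPR": div(fp, fp + tn),
        "FNR": div(fn, tp + fn),
        "PWC": 100.0 * div(fn + fp, tp + fn + fp + tn),
        "Precision": prec,
        "F-Measure": div(2 * prec * rec, prec + rec),
    }
```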

5 Experimental Results

Several authors submitted their results to the SBM-RGBD Challenge, and some of them provided a description of their method: RGBD-SOBS and RGB-SOBS [11], SCAD [12], and cwisardH+ [4]. Therefore, our experimental analysis is mainly devoted to assessing to what extent the different background modelling challenges introduced in Sect. 2 pose difficulties for these background subtraction methods.

In Table 1, we report the average results on the whole dataset achieved by all submitted methods (as of July 4th, 2017), while in Tables 2 and 3 we report their average results for each challenge category.

Table 1. Average results on the whole SBM-RGBD dataset.
Table 2. Average results for each category of the SBM-RGBD dataset (Part 1).
Table 3. Average results for each category of the SBM-RGBD dataset (Part 2).

Bootstrapping can be a problem, especially for selective background subtraction methods (e.g., [9]), i.e., those that update the background model using only background information. Indeed, once a foreground object is erroneously included into the background model (e.g., due to inappropriate background initialization or to inaccurate segmentation of foreground objects), it will hardly be removed from the model, continuing to produce false negatives. The problem is even harder if some parts of the background are never shown during the sequence, as happens in most of the videos of the Bootstrapping category. In these cases, even the best performing background initialization methods [1] fail, as illustrated in Fig. 3, and only alternative techniques (e.g., inpainting) can be adopted to recover the missing data [10]. Nonetheless, depth information seems to be beneficial for addressing this challenge, as reported in Table 2, where accurate results are achieved by most of the methods that exploit depth information.
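For reference, the per-pixel temporal median used as a baseline in Fig. 3-(a) can be sketched in a few lines of Python (a rough initialization baseline, not one of the evaluated methods):

```python
import numpy as np

def median_background(frames):
    """frames: sequence of images of identical shape (color or depth)."""
    stack = np.stack(frames, axis=0)
    # The per-pixel temporal median recovers the background only if each pixel
    # shows the true background in at least half of the frames, which is exactly
    # what fails in most Bootstrapping videos, where some regions are always occupied.
    return np.median(stack, axis=0).astype(stack.dtype)
```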

Fig. 3. Background image for sequence adl24cam0 (where the center area of the room is always covered by the man) computed using: (a) temporal median filter and (b) LabGen [8].

As expected, all the methods that exploit depth information achieve high accuracy in case of color camouflage. An evident example of the benefit of depth information for this category is given by the F-measure achieved by the RGBD-SOBS method, which doubles the value achieved by the same method without depth (RGB-SOBS). A similar reasoning applies to the illumination changes challenge. However, we point out that, in this case, the analysis should be based on Specificity, FPR, FNR, and PWC, rather than on the other three metrics. Indeed, two of the four videos of this category contain no foreground objects throughout their whole duration, the rationale being to verify that no false positives are produced under varying illumination conditions. This leads to ground truths with no positive cases and, consequently, to undefined values of Precision, Recall, and F-measure (in the experiments, these undefined values are set to zero).

Depth can also be beneficial for detecting and properly handling cases of intermittent motion. Indeed, foreground objects can easily be identified based on their depth, which is lower than that of the background, even when they remain stationary for long time periods. Methods that explicitly exploit this characteristic (e.g., RGBD-SOBS and SCAD) succeed in handling cases of removed and abandoned objects, achieving high accuracy.
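The underlying idea can be sketched as follows (an illustrative rule of ours, not the actual update logic of RGBD-SOBS or SCAD): a pixel whose measured depth is clearly smaller than the modelled background depth is kept as foreground, no matter how long it has been static.

```python
import numpy as np

def depth_foreground(depth, bg_depth, margin=50, invalid=0):
    """Flag pixels that are closer to the sensor than the background model.

    depth, bg_depth: depth maps in the same units (e.g., millimetres);
    margin: tolerance against sensor noise (purely illustrative value);
    invalid: value reported by the sensor for missing measurements.
    """
    valid = (depth != invalid) & (bg_depth != invalid)
    return valid & (depth < bg_depth - margin)
```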

Overall, shadows do not seem to pose a strong challenge to most of the methods. Depth shadows due to moving objects cause some undefined depth values, generally close to the object contours, but these can be handled based on motion. Color shadows can be handled either by exploiting depth information, which is insensitive to this challenge, or, when only color information is taken into account, through color shadow detection techniques (e.g., as in RGB-SOBS and SCAD). They remain a challenge, however, when only grey-level intensity is considered (e.g., as in SRPCA).
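As an example of a color-only shadow test, a classical brightness/chromaticity heuristic is sketched below (a generic technique given for illustration, not the specific detectors used by RGB-SOBS or SCAD): a pixel is treated as shadow rather than foreground when it is darker than the background but keeps a similar chromaticity.

```python
import numpy as np

def is_shadow(frame, background, low=0.4, high=0.9, tol=0.05):
    """frame, background: float RGB images scaled to [0, 1]; all thresholds are illustrative."""
    eps = 1e-6
    b_f = frame.sum(axis=2) + eps          # per-pixel brightness of the current frame
    b_b = background.sum(axis=2) + eps     # per-pixel brightness of the background model
    ratio = b_f / b_b                      # shadows attenuate brightness (low < ratio < high)
    chroma_diff = np.abs(frame / b_f[..., None] - background / b_b[..., None]).sum(axis=2)
    return (ratio > low) & (ratio < high) & (chroma_diff < tol)
```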

Out of range and Depth camouflage are among the most challenging issues, at least when color information is disregarded or not properly combined with depth. Indeed, even though the accuracy of most of the methods is moderately high, several false negatives are produced, as shown in Fig. 4 for depth camouflage.

Fig. 4. Sequence DCamSeq2 (DepthCamouflage): (a) image no. 534, corresponding (b) depth image, and (c) ground truth; segmentation masks achieved by: (d) RGBD-SOBS, (e) RGB-SOBS, (f) SRPCA, (g) AvgM-D, (h) Kim, (i) SCAD, (j) CwisardH+.

6 Conclusions and Perspectives

The paper describes a novel benchmarking framework that we set up and made publicly available in order to evaluate and compare scene background modeling methods for moving object detection on RGBD videos. The SBM-RGBD dataset is the largest RGBD video collection ever made available for this specific purpose. Its 33 videos span seven categories, selected to include diverse scene background modeling challenges for moving object detection. Seven evaluation metrics, chosen among the most widely used, are adopted to evaluate the results against a wide set of pixel-wise ground truths. A preliminary analysis of the results achieved by several methods investigates to what extent the various background modeling challenges pose difficulties for background subtraction methods that exploit color and depth information. The proposed framework will serve as a reference for future methods aiming at overcoming these challenges.