1 Introduction

Recognition of urban structures (lines, planes, etc.) is a useful task for computer vision systems since it provides rich information for scene understanding. This is because the floors of human-made scenes have a consistent appearance, which can be exploited to improve the performance of several computer vision applications [1,2,3,4]. For instance, floor recognition can provide preliminary cues to infer the road for autonomous vehicle navigation [5]. Currently, there are several trends in floor recognition; for example, some works propose the use of external devices such as Radio-Frequency Identification or pressure sensing to solve the floor recognition problem [6, 7]. However, floor information acquisition can be affected by several environmental factors, such as the type of material or the shape of the road (i.e., whether it is uneven, bumpy, etc.) [8].

Other approaches analyze two or more images of the same scene captured from different camera views; some of these works extend their methodology with Simultaneous Localization And Mapping (SLAM) algorithms, from which point clouds and camera poses can be estimated and used in the floor segmentation problem [9]. More generally, multi-view approaches also rely on fitting algorithms, typically RANSAC together with some optimization technique, to fit the floor plane within the 3D point cloud (the map) provided by the SLAM algorithm or any other multi-view point cloud generation algorithm. Nevertheless, several thresholds and specialized tuning are required to guarantee high performance for a specific scene. This is an important limitation because in many cases it is difficult to set appropriate threshold values. Also, this approach requires sufficient parallax, i.e., enough difference between camera views, to reach accurate results.

Another approach, the one we pursue in this work, is to carry out floor recognition from a single image [10]. Unlike the previous approaches (using two views or external devices), this approach performs floor recognition without thresholds and without parallax constraints. Besides, it is highly stable in outdoor scenarios and, in most cases, it uses RGB cameras (portable, small, and inexpensive) as the only sensing mechanism. Moreover, there exist many single views available from historical images, internet images, personal pictures, holiday photos, etc., which lack an additional view; hence, floor recognition cannot be performed on them with the conventional methods described above. In contrast, these cases are ideal test benches for an algorithm designed to work with single-view images.

Motivated by the broad scope of floor recognition from a single image, in this work we tackle the problem of floor recognition from a single image. In particular, we are interested in improving performance under image degradation (i.e., blur, lighting changes, noise, etc.), especially because most previous work, also in the domain of single RGB images, fails under high image degradation. Unlike previous work, we do not classify every pixel in the input image via semantic segmentation. Instead, we propose a novel method that relates image regions to floor patterns. The proposed floor recognition method uses a new binary descriptor that is robust to image degradation and noise, and that considers a larger number of pixels than previous LBP-based solutions [11,12,13,14].

The remainder of this paper is organized as follows: Sect. 2 discusses the related work; Sect. 3 describes the methodology behind our approach; Sect. 4 presents and discusses our results; conclusions are outlined in Sect. 5.

2 Related Work

Recent work has made important progress on floor recognition from a single image. In particular, promising results have been achieved with learning algorithms that learn the relationship between visual appearance and the floor [15]. In this context, one popular trend uses the learning algorithm inside a semantic segmentation core whose aim is to assign an object label to each pixel of an image. Another approach is direct learning, which is more efficient in computational size and cost: floor recognition is carried out without a segmentation core, so the input parameters are directly correlated with the image content to recover floor patterns.

Among the direct approaches, [16] proposes a methodology to estimate the ground plane structure. This methodology uses a supervised learning algorithm with an MRF to find the relation between image descriptors (texture and gradient) and depth information. To locate the ground plane boundaries on the depth map, the method divides the input image into similar regions using superpixel information within the depth map. In our case, in recent years we have developed previous work within the direct learning-based approach. In a first manuscript [17], we presented a dominant-plane recognition method from a single image that provides five 3D orientations for the dominant planar structures to be detected (floor, wall, and ceiling) in indoor scenes. For that, we trained a learning algorithm with texture descriptors to predict the 3D orientation of a planar structure. In a second manuscript [18], we presented a floor recognition method to integrate virtual information into indoor scenes. To detect floor light variations, we proposed a rule system that integrates three variables: texture descriptors, blurring, and superpixel-based segmentation. To remove noise, we proposed a noise-removal technique that analyzes the behavior of consecutive pixels.

Although our previous works used descriptors to obtain floor recognition information in indoor scenarios, those descriptors have low robustness under the image degradations (blur, lighting changes, noise, etc.) found in outdoor scenarios. To solve these problems, in this work we propose a new floor recognition method that aims for high robustness under image degradations and high-density floor recognition in both indoor and outdoor scenarios. To that end, we introduce a new texture descriptor based on binary patterns that is robust to noise, robust to lighting changes, and invariant to rotation. A texture descriptor with these properties is also useful for learning algorithms, since it makes it possible to reduce the number of elements used in training and the number of elements to detect.

3 The Proposed Method

In this section, we present the proposed floor recognition method, which comprises a learning algorithm, the proposed binary-pattern texture descriptor, a color descriptor based on a Gaussian filter, and two floor recognition analyses.

3.1 Input Image

In this article, the input image is denoted as I. The image I is used to obtain the texture and RGB color descriptors. To speed up processing, we divide the image I into a grid \(\varTheta \) consisting of sections \(\varTheta _{w}\). A section \(\varTheta _{w}\) is a finite set of pixels \(\varTheta _{w}=\{ x_{1}, ..., x_{q} \}\), \(\varTheta _{w} \in \varTheta \), where q, the number of pixels in section \(\varTheta _{w}\), is odd. Each section \(\varTheta _{w}\) has a patch \(\vartheta _{\varphi ,\omega }\), which is a finite set of pixels \(\vartheta _{\varphi ,\omega }=\{ x_{1}, ..., x_{u} \}\), \(\vartheta _{\varphi ,\omega }\in \varTheta \), where u, the number of pixels in patch \(\vartheta _{\varphi ,\omega }\), is odd. Pixel \(\rho _{\varphi ,\omega }\) is the central pixel of patch \(\vartheta _{\varphi ,\omega }\), and pixel \(\varrho _{\varphi ,\omega }\) is any pixel within patch \(\vartheta _{\varphi ,\omega }\). Here, w denotes the w-th section in \(\varTheta \), \(\varphi \) is the abscissa in grid \(\varTheta \), and \(\omega \) is the ordinate in grid \(\varTheta \). Fig. 1(a) shows an example grid \(\varTheta \) of 3 \(\times \) 2, where the orange squares are 7 \(\times \) 7 patches \(\vartheta _{\varphi ,\omega }\), the green square is one 5 \(\times \) 5 section \(\varTheta _{w}\), and the gray lines mark the section boundaries.
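As an illustration of this grid construction, the following minimal sketch (in Python with NumPy) extracts the patch centered on each grid section; the names `section_size` and `patch_size` are hypothetical labels for the side lengths of \(\varTheta _{w}\) and \(\vartheta _{\varphi ,\omega }\):

```python
import numpy as np

def divide_into_grid(image, section_size=5, patch_size=7):
    """Divide image I into a grid of sections Theta_w and extract the
    patch centered on each section; patches overlap their neighbors
    when patch_size > section_size, as in Fig. 1(a)."""
    h, w = image.shape[:2]
    half = patch_size // 2
    patches = {}
    for row in range(h // section_size):
        for col in range(w // section_size):
            # center pixel rho of section Theta_w
            cy = row * section_size + section_size // 2
            cx = col * section_size + section_size // 2
            if cy - half < 0 or cx - half < 0 or cy + half >= h or cx + half >= w:
                continue  # skip border sections without a full patch
            # key (phi, omega): abscissa and ordinate in the grid
            patches[(col, row)] = image[cy - half:cy + half + 1,
                                        cx - half:cx + half + 1]
    return patches
```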

3.2 Detection of Floor Patches

In this work, to obtain the floor recognition, we use a training system that detects floor patches. The training system recognizes the floor under its different light intensities using texture and color descriptors.

Fig. 1. (a) Grid of 3 \(\times \) 2 for the BIRRN; (b) example of n BIRRN circles \(\varDelta _{j}\) (Color figure online)

Training Set. From the patches \(\vartheta _{\varphi ,\omega }\) we extract the training matrix of descriptors, where the training labels \(\upsilon _{\varphi ,\omega }\) are the floor light-intensity variations and the training matrix descriptors \(\psi _{\varphi ,\omega }\) are extracted from the pixels of the patches \(\vartheta _{\varphi ,\omega }\). Our training matrix descriptors \(\psi _{\varphi ,\omega }\) comprise the texture descriptor \(\varPsi _{\varphi ,\omega }\) and the color descriptor \(\chi ^{k}_{\varphi ,\omega }\). The number of descriptors in the training matrix was chosen using the Pareto principle (the 80/20 rule) [19].

Training Labels. The training labels \(\upsilon _{\varphi ,\omega }\) are the different light intensities of the floor, where the light intensities form a finite set of classes \(i = \{1, 2, ..., m\}\) and m is the number of light intensities.

Texture Descriptor. We propose a new binary-pattern texture descriptor: BIRRN (Binary descriptor Invariant to Rotation and Robust to Noise). The BIRRN descriptor considers a set of neighbor pixels arranged in circular distributions with binary values, where the binary values are summed within each circular distribution. We define \(\varDelta _{j}\) as the set of neighboring pixels in one circular distribution, or BIRRN circle. The BIRRN provides the texture information of a patch \(\vartheta _{\varphi ,\omega }\) by applying Eqs. 1–5.

$$\begin{aligned} pc_{\varphi ,\omega }= (\sum _{\tau =1}^{n}\sum _{k=0}^{\varsigma _{\tau }-1}~p_{(\tau \sin \frac{2\pi k}{\varsigma _{\tau }},\tau \cos \frac{2\pi k}{\varsigma _{\tau }})})/(\sum _{\tau =1}^{n}\varsigma _{\tau }) \end{aligned}$$
(1)
$$\begin{aligned} \varPsi _{\varphi ,\omega } = \sum _{\tau =1}^{n} (\sum _{k=0}^{\varsigma _{\tau }-1}~S(pv_{(\tau \sin \frac{2\pi k}{\varsigma _{\tau }},\tau \cos \frac{2\pi k}{\varsigma _{\tau }})}-pc_{\varphi ,\omega })){f}^{\tau } \end{aligned}$$
(2)

Where n is the number of radii; \(\tau \) is the radius of a BIRRN circle, \(\tau =\{1,2,...,n\}\) with \(\tau > 0\); \(\varsigma _{j}\) (written \(\varsigma _{\tau }\) in Eqs. 1–4 when indexed by radius) is the number of neighboring pixels of BIRRN circle \(\varDelta _{j}\); \(\varsigma \) is the set of neighboring-pixel counts of the different BIRRN circles, \(\varsigma =\{ \varsigma _{1}, ..., \varsigma _{n} \}\), \(\varsigma \in \vartheta _{\varphi ,\omega }\), with \(\varsigma _{1}, ..., \varsigma _{n} > 1\); \(pc_{\varphi ,\omega }\) is the average gray value within the BIRRN circles; \(pv_{x,y}\) is the gray value of each neighboring pixel; \(p_{\alpha , \beta }\) is the gray value of a pixel within the BIRRN circles; \(\alpha \) and \(\beta \) are the abscissa and ordinate of a pixel within the BIRRN circles; j denotes the j-th BIRRN circle \(\varDelta _{j}\) of patch \(\vartheta _{\varphi ,\omega }\); S is a thresholding function; and \(\varPsi _{\varphi ,\omega }\) is the BIRRN value. In addition, f is a factor that discriminates the contributions of the different BIRRN circles. The pixel positions (x, y) used in the BIRRN circles are defined as:

$$\begin{aligned} x=\tau \sin \frac{2\pi k}{\varsigma _{\tau }} \end{aligned}$$
(3)
$$\begin{aligned} y=\tau \cos \frac{2\pi k}{\varsigma _{\tau }} \end{aligned}$$
(4)

The threshold function S, which is used to determine the types of local pattern transition, is defined as a characteristic function:

$$\begin{aligned} S(z)={\left\{ \begin{array}{ll} 1 &{} \text {if } z \ge 0\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)

Figure 1(b) shows a generalized illustration of the BIRRN descriptor. The image has n BIRRN circles \(\varDelta _{j}\) (green rings); each BIRRN circle has a different number of neighbor pixels \(pv_{x,y}\) (red circles), and the blue squares are the pixels of the BIRRN circles \(\varDelta _{j}\).
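To make the descriptor concrete, the following sketch computes a BIRRN value for one gray-level patch following Eqs. 1–5; the radii, neighbor counts, and factor f below are illustrative choices, not the values tuned in our experiments:

```python
import numpy as np

def birrn(patch, radii=(1, 2, 3), neighbors=(8, 12, 16), f=10):
    """Sketch of the BIRRN descriptor of a patch (Eqs. 1-5)."""
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    # Sample the gray values on each BIRRN circle (Eqs. 3-4)
    samples = []
    for tau, s in zip(radii, neighbors):
        for k in range(s):
            x = int(round(tau * np.sin(2 * np.pi * k / s)))
            y = int(round(tau * np.cos(2 * np.pi * k / s)))
            samples.append((tau, float(patch[cy + y, cx + x])))
    # Eq. 1: pc is the average gray value over all circle pixels
    pc = np.mean([v for _, v in samples])
    # Eq. 2: per-circle sums of thresholded differences S(pv - pc),
    # each circle weighted by f**tau (rotation-invariant per circle)
    psi = 0.0
    for tau, _ in zip(radii, neighbors):
        circle = [v for t, v in samples if t == tau]
        psi += sum(1 if v - pc >= 0 else 0 for v in circle) * f ** tau
    return psi
```

Because each circle contributes only the sum of its binary values, rotating the patch permutes the samples within a circle without changing the sum, which is what makes the descriptor rotation invariant; comparing against the average pc rather than the central pixel is what provides the robustness to noise.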

Color Descriptor. The color descriptor \(\chi ^{k}_{\varphi ,\omega }\) is obtained by applying a Gaussian filter to the patches \(\vartheta _{\varphi ,\omega }\). In this work, the Gaussian filter information is used to obtain uniform RGB values for the training set. To obtain the Gaussian information (Eq. 6), the image I is divided into patches \(\vartheta _{\varphi ,\omega }\) as in Fig. 1(a). Here I is the input image; \(\varphi \) is the abscissa in grid \(\varTheta \); \(\omega \) is the ordinate in grid \(\varTheta \); \(\delta \) is the number of pixels in the rows or columns of patch \(\vartheta _{\varphi ,\omega }\); \(\sigma \) is the standard deviation of the Gaussian distribution; \(i'\) denotes the position of the \(i'\)-th pixel and \(j'\) the position of the \(j'\)-th pixel; a and b are the row and column extents of the Gaussian filtering; the Gaussian function is expressed as \( g(i',j') =\frac{1 }{2\pi \sigma ^{2}} e ^{ -\frac{i'^{2}+j'^{2} }{2\sigma ^{2}} } \); \(\chi ^{k}_{\varphi ,\omega }\) are the RGB Gaussian values of a patch \(\vartheta _{\varphi ,\omega }\); and k indexes the RGB channels, \(k = \{R,G,B\}\).

$$\begin{aligned} \chi ^{k}_{\varphi ,\omega }= \sum _{i'=1}^{a}\sum _{j'=1}^{b}~( \frac{1 }{2\pi \sigma ^{2}} e ^{ -\frac{i'^{2}+j'^{2} }{2\sigma ^{2}} }) I((\delta * \varphi )+ i', (\delta * \omega ) + j') \end{aligned}$$
(6)
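A minimal sketch of the color descriptor of Eq. 6 follows; the values of `delta` and `sigma` are illustrative assumptions:

```python
import numpy as np

def color_descriptor(image, phi, omega, delta=7, sigma=2.0):
    """Sketch of Eq. 6: Gaussian-weighted RGB values of the patch
    at grid position (phi, omega) of an H x W x 3 image."""
    # Gaussian kernel g(i', j') over the patch coordinates 1..delta
    coords = np.arange(1, delta + 1)
    ii, jj = np.meshgrid(coords, coords, indexing='ij')
    g = np.exp(-(ii ** 2 + jj ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    # Weighted sum over the patch, one value per RGB channel k
    patch = image[delta * phi:delta * phi + delta,
                  delta * omega:delta * omega + delta].astype(float)
    return (g[..., None] * patch).sum(axis=(0, 1))  # chi^k, k in {R, G, B}
```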

Learning Algorithm. We use gradient descent to adjust the parameters \(\theta _{0}, \theta _{1},\) \(\theta _{2},\) \(\theta _{3}, ...,\theta _{n}\) of the logistic regression hypothesis [20]. The cost function used in this methodology is shown in Eq. 7, where n is the number of features, m is the number of examples in the training set, (\(x^{i}\), \(y^{i}\)) is the i-th training example, and \(\lambda \) denotes the regularization parameter.

$$\begin{aligned} J(\theta )=-[\frac{1}{m}\sum _{i=1}^{m}y^{i}\log h_{\theta }(x^{i})+(1-y^{i})\log (1-h_{\theta }(x^{i}))]+\frac{\lambda }{2m}\sum _{j=1}^{n}\theta ^{2}_{j} \end{aligned}$$
(7)
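The following sketch implements the regularized cost of Eq. 7 and a plain gradient descent update; following the usual convention (an assumption here, since Eq. 7 leaves it implicit), the bias parameter \(\theta _{0}\) is not regularized:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    """Regularized logistic regression cost (Eq. 7) and its gradient.
    X is m x n (rows = training examples), y in {0, 1}; theta[0] is
    the bias and is not regularized by convention."""
    m = len(y)
    h = sigmoid(X @ theta)
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m + reg
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return J, grad

def gradient_descent(X, y, lam=1.0, alpha=0.1, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        _, grad = cost_and_gradient(theta, X, y, lam)
        theta -= alpha * grad  # simultaneous update of all parameters
    return theta
```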

Trained System. The proposed method recognizes m different floor light intensities. We use regularized logistic regression with the one-against-all technique to predict the different light intensities [20]. To estimate the light intensities, the image I is divided into patches \(\vartheta _{\varphi ,\omega }\) as in Fig. 1(a). The logistic regression hypothesis used to recognize light intensities is presented in Eq. 8, where the logistic regression classifier \(h_{\theta }^{i}(x)\) estimates the probability that y equals class i, i.e., \(h_{\theta }^{i}(x)=P(y=i|x;\theta )\), with i in the finite set of classes \(i=\{1,2,...,m\}\). Element \(\theta _{j}\) is an adjusted parameter of the logistic regression, and element \(x_{j}\) comprises the texture \(\varPsi _{\varphi ,\omega }\) and color \(\chi ^{k}_{\varphi ,\omega }\) descriptors of a patch \(\vartheta _{\varphi ,\omega }\) in the image I.

$$\begin{aligned} h_{\theta }^{i}(x)=g(\theta _{j}^{T}x_{j}) \end{aligned}$$
(8)
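Building on the previous sketch, a one-against-all training and prediction pass for Eq. 8 could look as follows; `train_one_vs_all` and `predict_class` are hypothetical helper names, and classes are indexed 0..m-1 for convenience:

```python
import numpy as np

def train_one_vs_all(X, labels, num_classes, lam=1.0):
    """One-against-all: fit one regularized logistic classifier per
    light-intensity class, reusing gradient_descent from above."""
    return np.stack([gradient_descent(X, (labels == i).astype(float), lam)
                     for i in range(num_classes)])  # num_classes x n

def predict_class(thetas, x):
    """Eq. 8: h_theta^i(x) = g(theta^T x); return argmax_i P(y=i | x)."""
    probs = 1.0 / (1.0 + np.exp(-(thetas @ x)))
    return int(np.argmax(probs))
```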

3.3 Floor Recognition Analyses

The first analysis densifies the floor recognition and removes recognitions with low connectivity: it connects patches recognized as floor and removes the patches with few connections. The second analysis groups the connected recognitions into sets of recognized surfaces. Finally, the set with the most connections and the largest surface is taken as the floor.
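One possible reading of these two analyses, sketched on the boolean grid of per-patch floor detections (the neighbor threshold `min_neighbors` is an illustrative assumption; the text does not state its value):

```python
import numpy as np
from scipy import ndimage

def floor_analyses(floor_mask, min_neighbors=2):
    """Sketch of the two analyses over the grid of per-patch floor
    detections (True = patch classified as floor)."""
    # Analysis 1: drop detections with few connected floor neighbors
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbor_count = ndimage.convolve(floor_mask.astype(int), kernel,
                                      mode='constant')
    connected = floor_mask & (neighbor_count >= min_neighbors)
    # Analysis 2: group into surface sets; keep the largest set
    labels, num = ndimage.label(connected)
    if num == 0:
        return np.zeros_like(floor_mask, dtype=bool)
    sizes = ndimage.sum(connected, labels, index=range(1, num + 1))
    return labels == (np.argmax(sizes) + 1)
```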

Fig. 2. Floor and grass recognition using our method and the proposed texture descriptor.

4 Discussion and Results

We built a dataset to validate floor recognition in urban environments. The dataset consists of 1,500 urban-scene images (\(720\times 1280\) pixels) covering five classes (grass, road, smooth carpet, tile, and square carpet), all with labeled floor. The images were divided into training and test sets. We use the proposed dataset to compare our floor recognition method against different binary descriptors [11,12,13,14]. To provide quantitative results, we use three measures: recall, precision, and \(F-score\). Comparing the measures obtained with regularized logistic regression alone (Table 1) against the proposed floor recognition method (Table 2), all texture descriptors improve under the proposed method (Table 2): recall increases by 21\(\%\) to 33\(\%\), precision by 7\(\%\) to 11\(\%\), and \(F-score\) by 17\(\%\) to 28\(\%\). In Table 2, our method with the proposed texture descriptor (BIRRN) obtains the best average recall and the best average \(F-score\), while its precision is similar to that of the other descriptors (with an average variation of 1\(\%\)). In addition, the experimental results demonstrate that our floor recognition method with the proposed texture descriptor delivers high stability across different scenes, reaching lower misrecognition and higher recall and \(F-score\) than previous descriptors in the floor recognition domain.

We evaluate our approach on the proposed dataset, on a multi-class segmentation dataset (MSRC-21) [21], and on a dataset of different urbanized scenes (Make3D) [22]. Quantitative evaluation is performed by pixel-wise comparison of the floor recognition against ground truth. Figure 2 shows floor and grass recognition using our method and the proposed texture descriptor; the blue regions show our results on the proposed, Make3D, and MSRC-21 datasets.
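For reference, the pixel-wise measures can be computed as in the following sketch, where `pred` and `gt` are boolean floor masks assumed to be non-empty:

```python
import numpy as np

def pixel_metrics(pred, gt):
    """Pixel-wise recall, precision, and F-score of a predicted floor
    mask against the ground-truth mask (both boolean arrays)."""
    tp = np.logical_and(pred, gt).sum()  # true-positive pixels
    recall = tp / gt.sum()
    precision = tp / pred.sum()
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score
```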

Table 1. Texture descriptors comparison only using regularized logistic regression.
Table 2. Texture descriptors comparison using our floor recognition method.

5 Conclusions

In this work, we have introduced a new floor recognition algorithm that is robust enough to provide accurate floor recognition in different urbanized environments. To address image degradation and improve floor recognition performance, two algorithmic improvements were proposed. The first is a new binary texture descriptor (BIRRN) that is robust to noise, illumination changes, and rotation, and that uses a larger number of pixels than previous LBP-based descriptors. The second consists of two analyses that consider the connectivity of the floor recognition and its segmentation into floor surface sets. The experimental results demonstrated that our binary texture descriptor and the proposed analyses improve floor recognition performance. The proposed binary texture descriptor reaches high performance in several real-world scenarios, higher recall and \(F-score\) than previous texture descriptors, and higher robustness under image degradation.