1 Introduction

In this paper, we propose an extension of the popular fuzzy c-means clustering method that introduces an additional disparity cue. The purpose of this additional cue is to improve the segmentation. The approach requires a stereo image pair of the segmented scene. Besides the segmentation with additional depth constraints, our method also produces the disparity map of the input image pair and hence can also be considered a form of stereo matching algorithm.

The following text describes the adaptation of the fuzzy c-means algorithm to perform the clustering in a space extended by the disparity dimension. The formation of the clusters is driven by the degree of stereo match (this measure is described later). An attractive aspect of this strategy is that we can take advantage of a known number of depth levels or objects (if this information is available).

The motivation for our work was to provide an algorithm that can separate objects based on their different colour and spatial depth. In specific cases (described below), we regard this method as more suitable than segmenting the final disparity maps produced by stereo matching algorithms. A distance based on both dissimilarities (spatial and colour) provides a more sensitive segmentation, especially at segment borders, than segmentation of filtered disparity maps, which contain only the best matches and do not take segment properties into account. The algorithm was originally developed for a very specific purpose: the segmentation of moss clusters, as part of biological research on these species. Therefore, we have tested and evaluated the algorithm mainly on the “Map” dataset, introduced in [26], as it strongly resembles the stone structures that are frequently covered by moss layers. However, as we show later, the algorithm can also be used in more general cases. The main application domain of our algorithm is defined by the following constraints:

  • The images should contain a relatively small number of segments.

  • The segments should preferably be planes or linear gradients.

  • Occlusions should be absent or minimal.

The clustering technique is usually described as a process of forming partitions from a data set on the basis of a performance function, also known as an objective function. The underlying idea of our algorithm is to consider the disparity space (e.g., in disparity maps) as a specific type of data set, consisting of clusters that represent the three-dimensional objects of the scene. The fuzzy c-means algorithm has already been used to create segmentations based on depth information or disparity maps, e.g., [1, 22], and has also been adapted to incorporate spatial neighbourhood information, e.g., [7, 17, 19]; in these approaches, however, the algorithms were run on input data that already contained the depth information. In contrast, the proposed algorithm does not need depth information in advance, since it computes it by means of stereo matching.

The stereo matching problem itself is a multicriterion decision problem. The most common classification of stereo matching algorithms is based on the size of the processed area, which distinguishes local and global methods. In local matching methods, the correspondence of a pixel is based on the similarity of its neighbourhood. The similarity can be computed using measures such as the sum of absolute differences (SAD), the sum of squared differences (SSD), normalized SSD, normalized cross-correlation, etc. A comparison of different similarity measures can be found in [9]. Global methods usually minimize an energy function, e.g., by using dynamic programming [8, 30], graph cuts [6, 16], Markov random fields [5] or belief propagation with segmentation [15].
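For illustration, the local similarity measures mentioned above can be written in a few lines of Python. This is a minimal sketch on one-dimensional windows taken from a single rectified scanline, not part of the proposed method; the winner-takes-all search and all function names are our own illustrative assumptions, the normalized cross-correlation shown is the zero-mean variant, and the corresponding right-image pixel is assumed at x + d, the same shift convention as Eq. (7) in Sect. 3.

```python
import math

def sad(a, b):
    """Sum of absolute differences of two equally sized windows."""
    return sum(abs(p - q) for p, q in zip(a, b))

def ssd(a, b):
    """Sum of squared differences of two equally sized windows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def ncc(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1] (higher means more similar)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den if den > 0 else 0.0

def best_disparity(left_row, right_row, x, half, max_disp, cost=sad):
    """Winner-takes-all local matching of pixel x on one rectified scanline.

    The candidate match for disparity d is assumed at x + d in the right row;
    `cost` must be a dissimilarity (lower is better), such as sad or ssd.
    Assumes x - half >= 0.
    """
    win_l = left_row[x - half:x + half + 1]
    best_d, best_c = 0, float("inf")
    for d in range(max_disp + 1):
        if x + d + half >= len(right_row):
            break  # the window would leave the image
        win_r = right_row[x + d - half:x + d + half + 1]
        c = cost(win_l, win_r)
        if c < best_c:
            best_d, best_c = d, c
    return best_d
```

Because `ncc` is a similarity rather than a cost, it would have to be maximized (or negated) before being passed to `best_disparity`.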

The problem has also been addressed by fuzzy aggregation operators [31] and by a fuzzy relaxation technique [23]; the latter improves the matching of partially occluded objects. In [3], a fuzzy integral was introduced to improve the results obtained with classical fuzzy averaging operators. The basic idea of combining clustering with the stereo matching process was introduced in [4, 28] and further developed in [18, 27, 29, 32]. Our approach differs from these in several aspects. The clustering is performed not on the individual input images but on both stereo images simultaneously, and it takes the matching properties into account. In each step, the clusters are adjusted to minimize the matching cost.

The paper is organized as follows. Section 2 briefly introduces the classical fuzzy c-means algorithm. Section 3 describes the extension of fuzzy c-means that provides depth segmentation based on the differences between the two stereo images. The experimental validation and benchmark results are provided in Sect. 4. Finally, conclusions are presented in Sect. 5.

2 Fuzzy C-Means

Let us briefly introduce the original method. Fuzzy c-means is a widely used clustering technique, developed by [10] and improved by [2]. It is based on a standard least-squares error model that generalizes the earlier and popular non-fuzzy c-means model [20]. Fuzzy c-means can be generalized in many ways, e.g., to include Minkowski, Hamming, Canberra or hybrid distances.

The fuzzy c-means algorithm attempts to partition a collection of n data points \(\{x_k\}_{k=1}^n\) into a collection of c fuzzy clusters (represented by their cluster centres) on the basis of a distance d between each cluster centre and data point. The algorithm minimizes the objective function J(U, V), where \(V = (v_1, \ldots , v_c)\) is the set of cluster centres and \(U = [u_{ki}]\) is the \(n \times c\) membership matrix. The space of all admissible values of U is denoted \(U_f\). The elements of U are organized as follows: column i gives the memberships of all n input data points (rows) in cluster i, for \(i = 1, \dots , c\), so \(u_{ki}\) is the membership of the k-th point in the i-th cluster. The closer a data point is to a cluster centre, the larger its membership in that cluster. In addition, the memberships of each data point across all clusters sum to one. Formally, the fuzzy memberships are subject to the following constraint

$$\begin{aligned} \nonumber U_f = \{ U = (u_{ki}) : \sum _{j=1}^c u_{kj} = 1, 1 \le k \le n; \\ u_{ki} \in [0,1], 1 \le k \le n, 1 \le i \le c \} . \end{aligned}$$
(1)

The minimized objective function J(U, V) is defined as [2]

$$\begin{aligned} J(U,V) = \sum _{i=1}^{c} \sum _{k=1}^{n} (u_{ki})^m d^2(x_k, v_i), \quad (1 < m < \infty ), \end{aligned}$$
(2)

where \(u_{ki}\) is the degree of membership of \(x_k\) in cluster i, and \(v_i\) is the centre of that cluster. The parameter m is called the weighting exponent of the model. As m approaches 1, the memberships converge to 0 or 1, producing a crisp partitioning. The best choice for m probably lies in the interval [1.5, 2.5], with \(m=2\) being the most common choice [24]. The distance \(d(x_k, v_i)\) is usually the Euclidean distance between the k-th data point and the i-th cluster centre.
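To make the objective concrete, the following minimal Python sketch checks constraint (1) and evaluates J(U, V). It uses the squared Euclidean distance \(\Vert x_k - v_i \Vert ^2\) as the distance term, which is the form that agrees with the update rules (3) and (4) below; the function names and the list-of-tuples data layout are hypothetical.

```python
def in_fuzzy_partition_space(U, tol=1e-9):
    """Constraint (1): each u_ki lies in [0, 1] and every row of U sums to 1."""
    return all(
        all(0.0 <= u <= 1.0 for u in row) and abs(sum(row) - 1.0) <= tol
        for row in U
    )

def sq_euclidean(x, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def objective(X, U, V, m=2.0):
    """J(U, V): sum over points k and clusters i of (u_ki)^m * ||x_k - v_i||^2."""
    return sum(
        (U[k][i] ** m) * sq_euclidean(X[k], V[i])
        for k in range(len(X))
        for i in range(len(V))
    )
```

For example, with X = [(0.0,), (1.0,), (10.0,)], V = [(0.5,), (10.0,)] and the crisp U = [[1, 0], [1, 0], [0, 1]], `objective(X, U, V)` returns 0.5.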

Note that J(U, V) is not minimized exactly but by an iterative procedure known as alternating minimization. In essence, the algorithm searches for a locally optimal solution, which we denote with a bar (e.g., \(\bar{U}\)). The overall iterative process may be summarised as follows.

Algorithm Steps

  1.

    Initialize the matrix U with randomly generated membership coefficients \(u_{ki}\) (satisfying constraint (1)) for all cluster centres \(\bar{V} =(\bar{v}_{1},\ldots , \bar{v}_{c})\).

  2.

    Find the optimal U by calculating \( \bar{U} = \arg \displaystyle \min _{U \in U_f} J(U, \bar{V})\). The following solution can be derived using the Lagrange multiplier method [20]

    $$\begin{aligned} \bar{u}_{ki} = \left[ \displaystyle \sum \limits _{j=1}^{c} \left( \frac{d(x_k, \bar{v}_i)}{d(x_k, \bar{v}_j)} \right) ^{\frac{2}{m-1} } \right] ^{-1}, (x_k \ne v_i). \end{aligned}$$
    (3)

    If \(x_k\) coincides with the centre \(\bar{v}_i\), the solution is \(\bar{u}_{ki} = 1\) (and \(\bar{u}_{kj} = 0\) for \(j \ne i\)).

  3.

    Find the optimal V by calculating \( \bar{V} = \arg \min _{V} J(\bar{U}, V)\). The solution is obtained by setting the derivative of J with respect to V to zero [20]:

    $$\begin{aligned} \bar{v}_{i} = \frac{\displaystyle \sum \limits _{k=1}^{n} (\bar{u}_{ki})^m x_k}{\displaystyle \sum \limits _{k=1}^{n} (\bar{u}_{ki})^m}. \end{aligned}$$
    (4)
  4.

    Repeat from step 2 until \( \bar{U}\) and \( \bar{V}\) converge.

Convergence is reached when \(\displaystyle \max _{k,i} \vert \bar{u}_{ki} - {u}_{ki} \vert < \epsilon \), where \( \bar{u}\) is the new solution, u is the value from the previous iteration and \(\epsilon \) is a small positive threshold. Alternatively, \(\displaystyle \max _{1 \le i \le c} \Vert \bar{v}_{i} - {v}_{i} \Vert < \epsilon \) can be used as the convergence condition.
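The steps above can be sketched in Python as follows. This is a minimal illustration of the classical algorithm under our own assumptions (plain lists of tuples for the data, a fixed random seed, Euclidean d, the membership-based stopping rule, and the hypothetical function name `fcm`), not the authors' implementation. It starts from a random U as in step 1, obtains the centres from Eq. (4), updates the memberships with Eq. (3), and stops once the largest membership change falls below \(\epsilon \).

```python
import math
import random

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Classical fuzzy c-means by alternating minimization (Eqs. 3 and 4).

    X: list of equal-length tuples (data points); c: number of clusters.
    Returns (U, V): the n x c membership matrix and the list of c centres.
    """
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    # Step 1: random memberships, normalized so that each row sums to 1.
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    for _ in range(max_iter):
        # Centre update, Eq. (4): membership-weighted mean of the data points.
        V = []
        for i in range(c):
            w = [U[k][i] ** m for k in range(n)]
            sw = sum(w)
            V.append(tuple(sum(w[k] * X[k][j] for k in range(n)) / sw
                           for j in range(dim)))
        # Membership update, Eq. (3), with the Euclidean distance d.
        new_U = []
        for k in range(n):
            d = [math.dist(X[k], V[i]) for i in range(c)]
            if min(d) == 0.0:
                # x_k coincides with a centre: crisp membership in that cluster.
                hit = d.index(0.0)
                new_U.append([1.0 if i == hit else 0.0 for i in range(c)])
            else:
                new_U.append([
                    1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
                    for i in range(c)
                ])
        # Stopping rule: max_{k,i} |u_new - u_old| < eps.
        delta = max(abs(new_U[k][i] - U[k][i]) for k in range(n) for i in range(c))
        U = new_U
        if delta < eps:
            break
    return U, V
```

Computing the centres directly from the random U is one common way of obtaining the centres \(\bar{V}\) that step 2 requires; for two well-separated groups such as 0.0, 0.1, 0.2 and 10.0, 10.1, 10.2, the centres should settle close to 0.1 and 10.1.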

3 Introducing the Matching Constraint to Fuzzy C-Means

Put simply, when used in image processing, the original fuzzy c-means algorithm usually operates only on pixel positions and intensities (colours). We have extended this algorithm to include matching constraints in two ways: first, by adding the disparity (depth) to the data vector, and second, by evaluating the dissimilarity of the stereo pair (explained below).

As stated in Sect. 2, the algorithm partitions the elements with respect to a given criterion, a degree of belonging that is inversely related to the distance. For depth segmentation, however, we need additional components that measure the intensity (colour) difference between a point and its supposed projection, and the distance between the disparity of the point and that of its supposed cluster. The depth information comes from the small differences between the stereo images. In this way, we associate the clusters with the disparity space. Therefore, we define the vector of the cluster centre as

$$\begin{aligned} v = (v_X, v_Y, v_I, v_D), \end{aligned}$$
(5)

where \(v_X\), \(v_Y\) stand for the spatial position, \(v_I\) for the brightness and \(v_D\) for the disparity value. For clarity, the capitalized subscripts X, Y, I and D are used to indicate vector elements (e.g., \(v_X\)). Lowercase subscripts are used later to specify a particular vector from a set (e.g., \(x_k\)).

Fig. 1.

Illustration of the rationale behind the algorithm. The left figure shows the coloured disparity levels of the dataset [25], while the right one depicts the original pixel colours. The algorithm is based on the observation that the pixels of an object share similar disparity as well as similar colour. This can be clearly seen on the red lamp in the foreground or the white statue on the left (Color figure online).

Our new membership function takes into account the dissimilarity of the left image pixel (\(\phi _L(x_X, x_Y)\)) and the right image pixel shifted by the average cluster disparity (\(\phi _R(x_X + {{v}}_D, x_Y)\)). Essentially, we use the disparity in a similar fashion to the intensity, grouping pixels that share the same or almost the same disparity value (see Fig. 1). To do so, we adapt the membership function to penalize pixels with an incorrect match (those not similar to their projections in the other image) and to measure the distance between the cluster centres and the pixels with their associated disparity values.

We propose an extended vector space model with additional dimensions reflecting the disparity and the pixel dissimilarity in the stereo image pair. For clarity, the distance in the proposed vector space is separated into two components, d and \(d_s\), described below.

The proposed fuzzy stereo partitioning is carried out using the following membership function (k, i and j are indices):

$$\begin{aligned} \bar{u}_{ki} = \left[ \displaystyle \sum \limits _{j=1}^{c} \left( \frac{d^2(x_k, \bar{v}_i) + d_{\text {s}}^2(x_k,\bar{v}_i) }{d^2(x_k, \bar{v}_j) + d_{\text {s}}^2(x_k,\bar{v}_j)} \right) ^{\frac{1}{m-1} } \right] ^{-1}, (x_k \ne v_i). \end{aligned}$$
(6)
Fig. 2.

Visualisation of the data points and their clusters taken from our experiments. The points in the left figure are coloured according to their associated disparity levels. The right figure shows their real colour. Both figures show the depth levels computed by the proposed modification of the fuzzy c-means algorithm (Color figure online).

The membership \(\bar{u}_{ki}\) is inversely related to the distance between the processed point and the cluster centre (as calculated in the previous iteration). The new term \(d_{\text {s}}\) reflects the correctness of the stereo match between a pixel of the left image (\(\phi _L\)) and its projection in the right image (\(\phi _R\)) (the subscripts X, Y and D denote vector elements):

$$\begin{aligned} d^2_{\text {s}}(x,{{v}}) = \lambda _m (\phi _L(x_X, x_Y) - \phi _R(x_X + {{v}}_D, x_Y))^2, \end{aligned}$$
(7)

where x is the data point (vector) and v is the cluster centroid; as before, an uppercase subscript denotes a vector component. The constant \(\lambda _m\) is the weight of the matching term. We assume that \(\phi _L\) and \(\phi _R\) are rectified images. The difference \(\phi _L(x_X, x_Y) - \phi _R(x_X + {{v}}_D, x_Y)\) could be replaced by the difference of aggregating windows (SAD, SSD, etc.), but because fuzzy c-means inherently aggregates the pixels, this brings no further advantage and even worsens the results by blurring the edges. The distance d(x, v) is calculated (as in the original method) using a weighted Euclidean distance:

$$\begin{aligned} d^2(x, {v}) = \lambda _i( {x_I} - {{{v}}_I})^2 + \lambda _d( {x_D} - {{{v}}_D})^2 + \nonumber \\ \lambda _s ( {x_X} - {{{v}}_X})^2 + \lambda _s ( {x_Y} - {{{v}}_Y})^2, \end{aligned}$$
(8)

where \(x_X\), \(x_Y\) are the pixel coordinates, \(x_I\) is the colour intensity and \(x_D\) is the disparity value. Compared to the original method, we have added a term measuring the disparity distance between the processed point and the cluster (see Fig. 2). For the pixel disparity \(x_D\), an initial guess can be used since, as we show later, the algorithm is quite insensitive to this value; essentially, it only helps to form the initial clusters. The values \(\lambda _i\), \(\lambda _d\) and \(\lambda _s\) denote the intensity, disparity and spatial weights. Their effects are discussed together with the results in Sect. 4.
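A minimal Python sketch of the two distance components and the resulting membership of Eq. (6) might look as follows. It uses the weight values reported in Sect. 4 as defaults (\(\lambda _i = 0.05\), \(\lambda _d = 1.0\), \(\lambda _s = 0.1\), \(\lambda _m = 0.1\)). Sampling the right image by linear interpolation for non-integer \(v_D\), clamping at the image border, grey-level images stored as nested lists indexed image[y][x], and all function names are our own assumptions; the text does not specify them.

```python
import math

def sample_row(row, xf):
    """Linear interpolation of a scanline at a real-valued column (clamped to the border)."""
    xf = min(max(xf, 0.0), len(row) - 1.0)
    x0 = int(math.floor(xf))
    x1 = min(x0 + 1, len(row) - 1)
    t = xf - x0
    return (1.0 - t) * row[x0] + t * row[x1]

def matching_term(left, right, xX, xY, vD, lam_m=0.1):
    """d_s^2 of Eq. (7): left pixel vs. right pixel shifted by the cluster disparity vD."""
    diff = left[xY][xX] - sample_row(right[xY], xX + vD)
    return lam_m * diff * diff

def feature_distance(x, v, lam_i=0.05, lam_d=1.0, lam_s=0.1):
    """d^2 of Eq. (8) for x = (X, Y, I, D) and v = (v_X, v_Y, v_I, v_D)."""
    return (lam_i * (x[2] - v[2]) ** 2 + lam_d * (x[3] - v[3]) ** 2
            + lam_s * (x[0] - v[0]) ** 2 + lam_s * (x[1] - v[1]) ** 2)

def stereo_memberships(left, right, x, centres, m=2.0):
    """Memberships of one data point x in all clusters according to Eq. (6)."""
    total = [feature_distance(x, v) + matching_term(left, right, x[0], x[1], v[3])
             for v in centres]
    if min(total) == 0.0:
        # Zero combined distance: crisp membership in that cluster.
        hit = total.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(centres))]
    p = 1.0 / (m - 1.0)
    return [1.0 / sum((ti / tj) ** p for tj in total) for ti in total]
```

Since the squared distances d² + d_s² enter Eq. (6) with the exponent 1/(m − 1), the sketch applies that exponent directly to the summed terms, which matches Eq. (3) with the exponent 2/(m − 1) applied to unsquared distances.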

The iteration steps remain the same as in Sect. 2. The algorithm can be outlined as follows: (i) choose suitable parameters, especially the number of clusters (discussed in Sect. 3.1); (ii) assign random cluster membership coefficients to each point; (iii) in each iteration, compute the centroid of each cluster (Eq. 4) followed by the membership coefficients of all points (Eq. 6), and repeat this step until the algorithm has converged; (iv) finally, create the output disparity map from the cluster disparities.
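Step (iv) is not specified in detail in the text. One plausible reading, sketched below in Python under that assumption, assigns each pixel the disparity \(v_D\) of the cluster in which it has the highest membership. The row-major pixel ordering, the (v_X, v_Y, v_I, v_D) tuple layout of the centres, and the function name are likewise hypothetical.

```python
def disparity_map(U, centres, width, height):
    """Assumed step (iv): each pixel takes the v_D of its highest-membership cluster.

    U: list of membership rows, pixel k = y * width + x (row-major order);
    centres: list of (v_X, v_Y, v_I, v_D) tuples.
    """
    out = [[0.0] * width for _ in range(height)]
    for k, row in enumerate(U):
        best = max(range(len(row)), key=row.__getitem__)  # argmax membership
        y, x = divmod(k, width)
        out[y][x] = centres[best][3]
    return out
```

The same argmax also yields a label map of segments, which is how segment images such as those in Fig. 4 could be produced alongside the disparity maps.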

The algorithm was tested on several types of real images (depicting the processed botanical samples) and also on the standard dataset used for evaluating stereo matching algorithms [25]. Although our approach is not intended as a general-purpose stereo matching algorithm, we would like to give the reader an opportunity to examine its results in the standard stereo matching benchmark tests (see Sect. 4).

Fig. 3.

The reference images (a, b), the ground truth disparity map (d), and its segments (c) used for the demonstration of the cluster count problem.

3.1 Cluster Count Problem

A disadvantage of fuzzy c-means (as well as k-means) is that the result depends on the initial choice of weights. This is also true for our method. Although the algorithm minimizes the intra-cluster variance, the minimum it finds is only a local one. A more serious problem of fuzzy c-means, however, is that it requires the number of clusters to be known in advance.

The correct choice of the cluster count is ambiguous; it depends on the shape and scale of the data point distribution in the input data set and on the desired resolution. This may seem a disadvantage in general settings, but it may be an advantage in special cases where the number of segments or disparity planes is already known. For example, in Fig. 3 the box is the only object in the foreground and can easily be represented by a small number of segments. As Fig. 4 shows, only a few clusters suffice to obtain a very precise disparity map, and by increasing the number of segments we can capture even smaller changes in the disparity gradient (the box in the example is slightly tilted). In other words, by choosing the number of clusters we can decide whether we are more interested in large segments covering whole objects or in small fine-grained parts.

For scenes similar to the sample images, our approach outperforms the majority of the standard state-of-the-art algorithms (see Table 1, “Map” column). Owing to its specialization, however, it is less suitable for other types of scenes, although the additional cue still improves the segmentation results.

4 Tests and Results

This section describes the experiments and shows the results confirming the anticipated segmentation features and proper depth discrimination.

First, we tested the algorithm on images fulfilling the assumptions made at the beginning: scenes with only a few objects, each with an almost flat depth profile. The “Map” dataset (Fig. 4) complies with these requirements. The results for this specific dataset are very satisfactory (Table 1, column “Map”); however, the results for the other types of image pairs from the dataset are not very encouraging. We do not consider this a disadvantage, since the aims of this algorithm differ from those of general-purpose stereo matching algorithms. The results on the other samples can be explained by the fact that these pairs violate the initial assumptions of our algorithm: the scenes contain many objects with fine-grained disparity. The limits of our algorithm (the number of clusters and the planar disparities) leave little room for improvement in such general cases.

Fig. 4.

The influence of the cluster count on the output disparity map. For readability, the segments are coloured, numbered (number in brackets), and marked with their disparity values (the value below the number in brackets). The reference images are depicted in Fig. 3. The subfigures show the results of the modified fuzzy c-means algorithm set to 9, 15 and 40 segments. The top subfigures show the segments, while the bottom ones show the disparity maps obtained from the segment disparity values (Color figure online).

Table 1. The performance of the modified fuzzy c-means algorithm according to the Middlebury stereo test bed [25]. The overall performance is measured by the percentage of bad pixels in the non-occluded areas (nocc). The performance measured on the whole image (all) is provided as well. Our algorithm is denoted as FZ. The total cluster count was set to 200. To give a better idea of the performance of our method compared to the state-of-the-art algorithms, we have included the results of selected algorithms from the Middlebury evaluation.
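The bad-pixel measure used in Table 1 can be sketched in Python as follows. This minimal version counts pixels whose absolute disparity error exceeds a threshold; the error threshold is not stated in the text, so the 1-pixel default is an assumption, as are the optional boolean mask used to restrict the count to non-occluded pixels (nocc) and the function name.

```python
def bad_pixel_percentage(disp, gt, mask=None, threshold=1.0):
    """Percentage of evaluated pixels whose |disp - gt| exceeds `threshold`.

    disp, gt: 2-D lists of equal shape; mask: optional 2-D list of booleans
    selecting the evaluated pixels (e.g. the non-occluded ones for 'nocc').
    Without a mask, every pixel is evaluated ('all').
    """
    bad = total = 0
    for y, (drow, grow) in enumerate(zip(disp, gt)):
        for x, (d, g) in enumerate(zip(drow, grow)):
            if mask is not None and not mask[y][x]:
                continue  # pixel excluded from this evaluation region
            total += 1
            if abs(d - g) > threshold:
                bad += 1
    return 100.0 * bad / total if total else 0.0
```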
Fig. 5.

The algorithm results achieved with different disparity and spatial weights (\(\lambda _d\) and \(\lambda _s\)). The algorithm was set to generate 100 segments. The different disparity weights (\(\lambda _d\)) are represented by the different line colours. The significant effects of the disparity weight (\(\lambda _d\)) can be seen only on the images containing the planar objects (e.g., the “Venus” pair) (Color figure online).

Fig. 6.

Segmentation results of two samples from the Adobe Open Source Data Sets. These samples illustrate the ideal configurations for the proposed algorithm – raised flat foreground objects.

To illustrate the algorithm performance on images with optimal object configurations, we have chosen several samples from the Adobe Open Source Data Sets (see footnote 1). The data set contains stereo images and the ground truth segmentation of the foreground object. The results for the selected images are shown in Fig. 6. We point out that these images illustrate the optimal cases. Nevertheless, we have also performed tests on images that are not very suitable for our approach. The absolute results, together with a comparison to other algorithms, are shown in Table 1. The evaluation was performed on the Middlebury dataset [25]; the full list of algorithms is available on the Middlebury stereo vision website. Although our algorithm is not a typical stereo matching algorithm, we decided to perform the tests on these images because a more suitable, generally accepted dataset for segmenting stereo images is lacking. The parameters were kept the same for all images: cluster count \(c = 200\), \(\lambda _d = 1.0 \), \(\lambda _s = 0.1\), \(\lambda _i = 0.05 \), and \(\lambda _m = 0.1\).

The proposed algorithm converges after approximately 15 iterations on all images of the given set. The outputs with 100 segments are displayed in Fig. 7 (the evaluated outputs with 200 segments are not shown, because the small clusters would be hard to distinguish). The images in the upper row show the segments; the disparity maps obtained from the segment properties are displayed below. As can be seen, the proposed algorithm is capable of obtaining disparity maps of more complex scenes, but not at the level of detail of commonly used stereo matching approaches.

To increase the overall performance, the number of clusters can be increased, which results in a finer-grained segmentation in which each segment can have a different disparity. The drawback of a large number of clusters is the increased computational time; beyond a certain level, further increasing the cluster count becomes inefficient. We used no more than 200 segments.

Fig. 7.

The images show the disparity and cluster maps obtained for the default Middlebury dataset using our extended fuzzy c-means algorithm (fz). The ground truth data are provided in the last row. As can be seen, the output disparity maps are not as good as the results for the “Map” dataset (Fig. 4). The reason is that the Middlebury dataset contains images with many details and a variety of objects, which contradicts the initial algorithm assumptions. Improving the performance requires a significant increase in the number of clusters, which leads to a much longer processing time. Unfortunately, even this does not guarantee results comparable to the best algorithms for all inputs.

Fig. 8.

The segmentation results of the moss sample using the modified fuzzy c-means algorithm (Sect. 3). The figure shows (from left to right): the left image of the input pair depicting the moss layers on the stone base, the segments coloured according to their average colour, the segments coloured according to their disparity, and a visualization of the clusters themselves. As can be seen, our modification of fuzzy c-means retains the properties of the original algorithm and in addition provides the disparity values (Color figure online).

During development, we also performed several experiments to investigate the effects of the algorithm parameters on the segmentation performance. The parameter settings may vary from scenario to scenario, but generally only two parameters appear to be particularly influential: the spatial and disparity weights (\(\lambda _s\) and \(\lambda _d\)). Figure 5 shows the influence of these weights on the output segmentation consisting of 100 segments. The experiment showed a significant effect of the disparity weight (\(\lambda _d\)) mainly on images containing planar objects (e.g., the “Venus” pair). This behaviour is expected, as our algorithm favours planar disparities. For the “Venus” pair, increasing the disparity weight forces the algorithm to create segments with smaller disparity deviations from the cluster centroid, leading to better results. However, for images not containing such objects (e.g., “Teddy” or “Tsukuba”), changing these parameters has only a small impact on the results. We have not evaluated all possible parameter configurations for all dataset images, but empirically the best results were achieved with \(\lambda _s = 0.1\). Increasing this value forced the algorithm to create overly compact clusters and, conversely, decreasing \(\lambda _s\) caused overly distant pixels to be merged into one cluster.

In the application for which the algorithm was originally developed, it was important to separate the layer of the base (usually stone) from the layer above it, formed by the moss structures. An example is illustrated in Fig. 8. As can be seen, the resulting segmentation strongly benefits from the inherent features of the algorithm, whose design was strongly driven by the expected appearance of the captured samples.

5 Conclusions

In this paper, we have presented a modification of the fuzzy c-means algorithm, one of the most popular clustering techniques in image processing, which has been modified in many ways in the past to take different constraints into account. In our case, we have added a disparity constraint and examined its impact on the segmentation performance and depth discrimination. In the context of image segmentation, we see an advantage in the proposed joint analysis using brightness and depth constraints. We believe that such a combination improves the segmentation by creating edges not only where the brightness changes abruptly but also at depth discontinuities. Objects of similar colour at different depths may be merged by the classical algorithm, but with the additional depth constraint they are separated correctly.

The motivation was to develop a segmentation technique for cases where stereo images can be obtained, so that the segmentation can be improved by the additional depth information. In the biological application (the segmentation of moss layers), the method provided better results than the standard fuzzy c-means algorithm. As the algorithm was intended for this specific application, we have mainly tested and evaluated it on datasets that resemble stone structures (e.g., the standard “Map” dataset). For such cases, the algorithm provides very good results.

To sum up, this paper has proposed a method that improves the segmentation in cases where pixel intensities are not sufficient for correct segmentation and stereo images are available. This area of research, however, still offers room for improvement. The results could be further improved by tuning the distance weights; the goal is an algorithm that automatically adapts the weights to the input dataset. Similar approaches have already been published for the closely related k-means clustering, e.g., [13, 21], and should be applicable to fuzzy c-means as well.