1 Introduction

Unsupervised segmentation is an important problem in computer vision, since perceptual grouping plays a powerful role in human visual perception [25]. In this context, the method must decide which image regions are relevant without user guidance, based on color and texture similarity or local contrast.

The unsupervised over-segmentation of an image into compact regions of similar and connected pixels produces what are commonly called superpixels [1, 22]. Superpixels can greatly reduce the computational time of computer vision algorithms by replacing the rigid structure of the pixel grid [1]. In graph-based methods, they allow the fast creation of a Region Adjacency Graph (RAG), drastically reducing the number of graph elements compared to the graph at the pixel level (Figs. 1a-b).

Several graph-based methods have been proposed for unsupervised segmentation, including watersheds [3], mean cut [24], ratio cut [23], normalized cuts [4, 19], and minimum spanning tree (MST) based methods [7, 9, 10, 11, 12, 26]. For instance, Felzenszwalb and Huttenlocher proposed an efficient segmentation algorithm that evaluates a predicate for measuring the evidence for a boundary between two regions, which produces segmentations satisfying global properties, although based on greedy decisions [9]. Other methods include the use of component trees [20, 21], which can also be combined with watersheds, allowing the selection of catchment basins according to their extinction values.

Seed-based methods for region-based image segmentation are known to provide satisfactory results for several applications and are usually easy to extend to multi-dimensional images. In this work, we extend a seed-based method, named Oriented Image Foresting Transform (OIFT) [15, 17], to perform unsupervised image segmentation, leading to a new method based on optimum cuts in graphs, named UOIFT, that can be tailored to different objects according to their boundary polarity. OIFT has been demonstrated to be an effective and efficient solution for the segmentation of a given target object based on user-provided seeds, allowing the incorporation of several high-level constraints, including shape constraints [16, 18] and connectivity priors [14].

The proposed method is based on the Image Foresting Transform (IFT) [8] algorithm, which has linearithmic implementations, making it much faster than other methods based on cuts in graphs [4, 19, 23, 24]. Unlike [13], our method exploits non-monotonic-incremental cost functions in directed graphs.

The proposed method encompasses the single-linkage algorithm by MST as a particular case, establishing important theoretical contributions, and requires fewer image partitions to isolate the desired regions of interest than other approaches commonly used in the literature.

Figures 1c–h present the central idea of this work, which is to explore the boundary polarity in the unsupervised segmentation of images in directed graphs. Figure 1a shows a synthetic image containing dark and bright regions to be segmented into five regions. Regular unsupervised methods based on undirected graphs, such as watersheds, cannot distinguish the different types of boundary polarity, giving as output a mixture of bright and dark regions, as shown in Figs. 1c-d. Our proposed method can favor a particular polarity, giving the results shown in Figs. 1e-f or Figs. 1g-h.

Fig. 1. (a) Input image with \(320\times 200\) pixels. (b) Image divided into 640 superpixels by IFT-SLIC [2]. (c) The segmentation into five regions by a single-linkage algorithm using the MST of the RAG. (d) Candidate seeds ranked by their energies by UOIFT without boundary polarity, leading to the same result depicted in (c). (e-h) UOIFT segmentations into five regions and the corresponding seeds ranked by their energies, with polarity favoring transitions (e-f) from bright to dark pixels and (g-h) from dark to bright pixels.

2 Graph Concepts

We consider a weighted digraph G as a triple \(\langle \varvec{\mathcal{V}}, \varvec{\mathcal{A}}, \omega \rangle \), where \(\varvec{\mathcal{V}}\) is a nonempty set of vertices or nodes, \(\varvec{\mathcal{A}}\) is a set of ordered pairs of distinct vertices called arcs or directed edges, and \(\omega : \varvec{\mathcal{A}} \rightarrow \mathbb {R}\) represents the weights associated to the arcs.

An image can be interpreted as a weighted digraph \(G=\langle \varvec{\mathcal{V}},\varvec{\mathcal{A}},\omega \rangle \), whose nodes \(\varvec{\mathcal{V}}\) are the image pixels (or superpixels) in its image domain and whose arcs are the ordered pairs \(\langle s,t \rangle \in {\varvec{\mathcal{A}}}\) of neighboring pixels (or superpixels), e.g., the 4-neighborhood in the case of 2D images. The digraph G is symmetric if for any of its arcs \( \langle s,t \rangle \in \varvec{\mathcal{A}}\), the pair \( \langle t,s \rangle \) is also an arc of G, but we can have \(\omega (\langle s,t \rangle ) \ne \omega (\langle t,s \rangle )\). The transpose \(G^T\) of G is the unique weighted digraph on the same set of vertices \(\varvec{\mathcal{V}}\) with all arcs reversed compared to the corresponding arcs in G.
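As an illustration of these definitions, the following minimal Python sketch builds a 4-neighborhood weighted digraph from a small grayscale image stored as nested lists, then derives its transpose and checks symmetry. The function names (`image_to_digraph`, `transpose`, `is_symmetric`), the dictionary representation of arcs, and the toy weight function are assumptions made for this example only.

```python
def image_to_digraph(img, weight):
    """Return {(s, t): w} for all 4-neighbor arcs; nodes are (row, col) tuples.

    `weight(I_s, I_t)` is any user-supplied arc-weight function (hypothetical).
    """
    rows, cols = len(img), len(img[0])
    arcs = {}
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    arcs[((r, c), (rr, cc))] = weight(img[r][c], img[rr][cc])
    return arcs


def transpose(arcs):
    """Reverse every arc while keeping its weight."""
    return {(t, s): w for (s, t), w in arcs.items()}


def is_symmetric(arcs):
    """A digraph is symmetric if every arc has its reverse arc (weights may differ)."""
    return all((t, s) in arcs for (s, t) in arcs)
```

With an asymmetric weight such as `lambda a, b: max(b - a, 0)`, the graph is still symmetric in the sense above, even though the two arcs between a pair of neighbors carry different weights.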

For a given image graph \(G=\langle \varvec{\mathcal{V}},\varvec{\mathcal{A}}, \omega \rangle \), a path \({\pi =\langle t_1,t_2,\ldots ,t_n \rangle }\) is a sequence of adjacent nodes (i.e., \(\langle t_i,t_{i+1} \rangle \in \varvec{\mathcal{A}}\), \(i=1,2,\ldots ,n-1\)) with no repeated vertices (\(t_i \ne t_j\) for \(i \ne j\)). A path \({\pi _t=\langle t_1,t_2,\ldots ,t_n = t \rangle }\) is a path with terminus at a node t. When we want to explicitly indicate the origin of the path, the notation \(\pi _{s \leadsto t}=\langle t_1=s,t_2,\ldots ,t_n = t \rangle \) may also be used, where s stands for the origin and t for the destination node. A path is trivial when \(\pi _t=\langle t \rangle \). A path \(\pi _t=\pi _s\cdot \langle s,t\rangle \) indicates the extension of a path \(\pi _s\) by an arc \( \langle s,t \rangle \). The notation \(\varPi (G)\) is used to indicate the set of all possible paths in a graph G.

A predecessor map is a function P that assigns to each node t in \({\varvec{\mathcal{V}}}\) either some other adjacent node in \({\varvec{\mathcal{V}}}\), or a distinctive marker nil not in \({\varvec{\mathcal{V}}}\) — in which case t is said to be a root of the map. A spanning forest is a predecessor map which contains no cycles — i.e., one which takes every node to nil in a finite number of iterations. For any node \(t\in {\varvec{\mathcal{V}}}\), a spanning forest P defines a path \(\pi ^{P}_t\) recursively as \(\langle t \rangle \) if \(P(t) = nil\), and \(\pi ^{P}_s\cdot \langle s,t\rangle \) if \(P(t)=s\ne nil\).
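The recursive definition of \(\pi ^{P}_t\) can be unrolled iteratively. The sketch below assumes the predecessor map is stored as a Python dictionary with `None` playing the role of nil; the function name `path_from_forest` is hypothetical. It also raises an error if the map contains a cycle, in which case it is not a spanning forest.

```python
def path_from_forest(P, t):
    """Return the path from the root of t down to t, following predecessors."""
    path = [t]
    seen = {t}
    while P[t] is not None:          # None stands for the marker nil
        t = P[t]
        if t in seen:                # a cycle means P is not a spanning forest
            raise ValueError("predecessor map contains a cycle")
        seen.add(t)
        path.append(t)
    path.reverse()                   # collected from terminus to root
    return path
```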

A connectivity function \(f: \varPi (G)\rightarrow \mathbb {R}\) computes a value \(f(\pi _t)\) for any path \(\pi _t\), usually based on arc weights. A path \(\pi _t\) is optimum if \(f(\pi _t) \le f(\tau _t)\) for any other path \(\tau _t\) in G. The optimum-path value \(V_{opt}(t)\) is uniquely defined by \(V_{opt}(t) = \min _{\pi _t \in \varPi (G)} \{ f(\pi _t) \}\). An optimum-path forest P is a spanning forest where all paths \(\pi ^{P}_t\) for \(t \in {\varvec{\mathcal{V}}}\) are optimum.

The cost of a trivial path \(\pi _t=\langle t \rangle \) is usually given by a handicap value H(t). For example, \(H(t) = 0\) for all \(t \in \varvec{\mathcal{S}}\) and \(H(t) = \infty \) otherwise, where \(\varvec{\mathcal{S}}\) is a seed set. The costs for non-trivial paths follow a path-extension rule. For example:

$$\begin{aligned} f_{\max }(\pi _s\cdot \langle s,t\rangle )= & {} \max \{f_{\max }(\pi _s),\omega (\langle s,t \rangle )\} \end{aligned}$$
(1)
$$\begin{aligned} f_{\varSigma }(\pi _s\cdot \langle s,t\rangle )= & {} f_{\varSigma }(\pi _s) + \omega (\langle s,t \rangle ) \end{aligned}$$
(2)
$$\begin{aligned} f_{\omega }(\pi _s\cdot \langle s,t\rangle )= & {} \omega (\langle s,t \rangle ) \end{aligned}$$
(3)

The max-arc path-cost function \(f_{\max }\) and the additive path-cost function \(f_{\varSigma }\) with \(\omega (\langle s,t \rangle ) \geqslant 0\) are monotonic-incremental (MI) cost functions, while \(f_{\omega }\) is a non-monotonic-incremental cost function.
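To make the three path-extension rules concrete, the sketch below evaluates Eqs. 1 to 3 along an explicit path. The function name `path_cost`, the rule labels, and the default handicap of 0 for the trivial path are assumptions of this example.

```python
def path_cost(path, w, rule, handicap=0.0):
    """Cost of `path` (a list of nodes) under one of the rules of Eqs. 1-3.

    `w` maps arcs (s, t) to weights; `handicap` is the cost of the trivial path.
    """
    cost = handicap
    for s, t in zip(path, path[1:]):
        if rule == "max":            # Eq. 1: f_max
            cost = max(cost, w[(s, t)])
        elif rule == "sum":          # Eq. 2: f_Sigma
            cost = cost + w[(s, t)]
        elif rule == "omega":        # Eq. 3: f_omega keeps only the last arc
            cost = w[(s, t)]
        else:
            raise ValueError(f"unknown rule: {rule}")
    return cost
```

For weights `{('a','b'): 5, ('b','c'): 2}`, the path a, b, c costs 5 under \(f_{\max }\), 7 under \(f_{\varSigma }\), and 2 under \(f_{\omega }\). The last value is lower than the cost 5 of the prefix a, b, which is why \(f_{\omega }\) is not monotonic-incremental.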

The image foresting transform (IFT) [8] (Algorithm 1) computes the path-cost map V, which is precisely \(V_{opt}\) in the case of MI functions [6]. It is also optimized in handling infinite costs, by storing in \(\varvec{\mathcal{Q}}\) only the nodes with finite-cost path, assuming without loss of generality that \(V_{opt}(t) < +\infty \) for all \(t \in {\varvec{\mathcal{V}}}\).

Algorithm 1. Image Foresting Transform (IFT).
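As a rough illustration of the procedure behind Algorithm 1, the following is a minimal priority-queue sketch of an IFT, not the authors' listing. It assumes seeds receive handicap 0 and all other nodes \(+\infty \), stores only finite-cost nodes in the queue, and never revisits a node once it leaves the queue (which matters for non-MI functions such as \(f_{\omega }\)). The names `ift`, `adj`, and `extend` are hypothetical.

```python
import heapq
import itertools
import math


def ift(nodes, adj, seeds, extend):
    """Return the cost map V and predecessor map P.

    adj: dict s -> list of (t, weight) for arcs <s, t>.
    extend(cost_s, weight): cost of a path ending at s extended by arc <s, t>.
    """
    V = {t: math.inf for t in nodes}
    P = {t: None for t in nodes}
    tie = itertools.count()          # tie-breaker so nodes are never compared
    heap = []
    for s in seeds:
        V[s] = 0.0                   # assumed handicap for seeds
        heapq.heappush(heap, (V[s], next(tie), s))
    done = set()
    while heap:
        _, _, s = heapq.heappop(heap)
        if s in done:                # stale queue entry
            continue
        done.add(s)
        for t, w in adj.get(s, ()):
            if t in done:
                continue
            new_cost = extend(V[s], w)
            if new_cost < V[t]:
                V[t], P[t] = new_cost, s
                heapq.heappush(heap, (new_cost, next(tie), t))
    return V, P
```

The extension rules of Eqs. 1 to 3 can be passed as `lambda c, w: max(c, w)`, `lambda c, w: c + w`, and `lambda c, w: w`, respectively. With the binary heap used here, the running time is linearithmic in the number of arcs.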

3 Efficient Optimum Cuts in Graphs

For a given partition of the graph nodes in two sets \(\varvec{X}\) and \(\varvec{\mathcal{V}} \setminus \varvec{X}\), let \(\mathcal{C}(\varvec{X}) = \{ \langle s,t \rangle \in \varvec{\mathcal{A}} \mid s \in \varvec{X} ~\text{ and }~ t \notin \varvec{X} \}\) denote the set of arcs in its cut from \(\varvec{X}\) to \(\varvec{\mathcal{V}} \setminus \varvec{X}\). Consider the following energy formulation:

$$\begin{aligned} E(\varvec{X}) = \min _{\langle s,t \rangle \in \mathcal{C}(\varvec{X})} \omega (\langle s,t \rangle ) \end{aligned}$$
(4)
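The energy of Eq. 4 only looks at arcs leaving \(\varvec{X}\), so it is sensitive to arc direction. A minimal sketch, assuming arcs are stored as a dictionary from ordered pairs to weights (the name `cut_energy` is hypothetical, and raising an error for an empty cut is a choice made here, since Eq. 4 is undefined in that case):

```python
def cut_energy(arcs, X):
    """E(X) of Eq. 4: minimum weight among arcs <s, t> with s in X and t not in X."""
    cut = [w for (s, t), w in arcs.items() if s in X and t not in X]
    if not cut:
        raise ValueError("empty cut: E(X) is undefined")
    return min(cut)
```

In a symmetric digraph with asymmetric weights, `cut_energy(arcs, X)` and the energy of the complement of `X` generally differ, which is what allows the boundary polarity to be exploited.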

Let \(\mathcal{U}(x, y) = \{ \varvec{X} \subset \varvec{\mathcal{V}} \mid x \in \varvec{X} ~\text{ and }~ y \in \varvec{\mathcal{V}} \setminus \varvec{X} \}\) denote the universe of all possible partitions separating the nodes x and y, where y represents the background. By using x and y as internal and external seeds, respectively, the OIFT algorithm [17] computes an optimum partition \(\varvec{X}_{opt} \in \mathcal{U}(x, y)\) by maximizing the above energy (Eq. 4) in a symmetric directed graph, that is, \(E(\varvec{X}_{opt}) = \max _{\varvec{X} \in \mathcal{U}(x, y)} E(\varvec{X})\). OIFT is built upon the IFT framework by considering the following path function in a symmetric digraph:

(5)

where, in this work, we use \(\varvec{\mathcal{S}_1} = \{x\}\) and \(\varvec{\mathcal{S}_0} = \{y\}\). The set \(\varvec{X}_{opt} \in \mathcal{U}(x, y)\) by OIFT is defined from the forest P computed by Algorithm 1 with the path function of Eq. 5, by taking the pixels that were conquered by paths rooted in \(\varvec{\mathcal{S}_1} = \{x\}\) [15].

For the purpose of unsupervised segmentation, for a given reference point r in the background, we would like to find a node \(t^{\prime } \in \varvec{\mathcal{V}} \setminus \{r\}\), resulting in a partition of maximum energy among all results in \(\bigcup _{t \in \varvec{\mathcal{V}} \setminus \{r\}} \mathcal{U}(t, r)\). Fortunately, \(t^{\prime }\) can be efficiently obtained by taking \(t^{\prime } = \arg \max _{t \in \varvec{\mathcal{V}} \setminus \{r\}} V(t)\), where V is the cost map by IFT using \(f_{\max }\) with \(\varvec{\mathcal{S}}=\{r\}\) in the transpose graph, according to Lemma 1 from [5]. This result can equally be obtained by taking as V the cost map by IFT using \(f_{\omega }\) with \(\varvec{\mathcal{S}}=\{r\}\) in the transpose graph, but this latter approach has the advantage that it allows us to rank the nodes in non-increasing order of their values, such that the next cut with maximum energy can be easily selected (Figs. 1d, f, h). In this way we can create a hierarchy of partitions according to the following proposed algorithm:

Algorithm 2. Unsupervised OIFT (UOIFT).
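The seed selection that precedes Algorithm 2 can be checked on a toy graph. The sketch below is not the authors' implementation: it runs a Dijkstra-style IFT with \(f_{\max }\) from r in the transpose graph, picks \(t^{\prime }\) as the node of largest cost, and compares that value with a brute-force maximization of Eq. 4 over all partitions that exclude r. The function names, the three-node graph, and its weights are illustrative assumptions; arc weights are assumed non-negative and node labels mutually comparable.

```python
import heapq
import math
from itertools import combinations


def bottleneck_costs(arcs, r):
    """Cost map of an IFT with f_max and seed r; arcs maps (s, t) -> weight."""
    adj, nodes = {}, set()
    for (s, t), w in arcs.items():
        adj.setdefault(s, []).append((t, w))
        nodes.update((s, t))
    V = {v: math.inf for v in nodes}
    V[r] = 0.0
    heap, done = [(0.0, r)], set()
    while heap:
        cost, s = heapq.heappop(heap)
        if s in done:
            continue
        done.add(s)
        for t, w in adj.get(s, ()):
            new_cost = max(cost, w)
            if t not in done and new_cost < V[t]:
                V[t] = new_cost
                heapq.heappush(heap, (new_cost, t))
    return V


def best_seed(arcs, r):
    """Return (t', V) with t' = argmax of V over nodes other than r, V from G^T."""
    transposed = {(t, s): w for (s, t), w in arcs.items()}
    V = bottleneck_costs(transposed, r)
    return max((t for t in V if t != r), key=V.get), V


def brute_force_max_energy(arcs, r):
    """Maximum of Eq. 4 over every nonempty node set X that excludes r."""
    others = sorted({v for arc in arcs for v in arc} - {r})
    best = -math.inf
    for size in range(1, len(others) + 1):
        for X in map(set, combinations(others, size)):
            cut = [w for (s, t), w in arcs.items() if s in X and t not in X]
            if cut:
                best = max(best, min(cut))
    return best


arcs = {("r", "a"): 1, ("a", "r"): 6, ("a", "b"): 7,
        ("b", "a"): 3, ("r", "b"): 4, ("b", "r"): 5}
t_best, V = best_seed(arcs, "r")
```

On this toy graph, `t_best` is `'a'` with `V['a'] == 6`, which matches the brute-force maximum energy, attained by the partition \(\varvec{X} = \{a\}\).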

Algorithm 2 generates a hierarchical segmentation by successive binary divisions, leading at the end to a segmentation with k partitions. Each IFT execution has linearithmic complexity in the number of involved nodes. Since UOIFT is based on multiple OIFT executions (at each iteration being applied to smaller graphs), we considered a Region Adjacency Graph (RAG), where the regions are the superpixels computed by IFT-SLIC [2, 22] of size \(10\times 10\) pixels, rather than using the pixels directly (Fig. 1b). The initial reference node for the background was taken to be the top-left superpixel in the image. In order to exploit the boundary polarity, we consider the following arc weight assignment:

$$\begin{aligned} \omega (\langle s,t \rangle )= & {} \left\{ \begin{array}{ll} |I(t)-I(s)|\times (1+\alpha ) &{} \text{ if } I(s) > I(t) \\ |I(t)-I(s)|\times (1-\alpha ) &{} \text{ if } I(s) < I(t)\\ |I(t)-I(s)| &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(6)

where the weight \(\omega (\langle s,t \rangle )\) is the undirected dissimilarity measure \(|I(t)-I(s)|\) between neighboring superpixels s and t multiplied by an orientation factor, with \(\alpha \in [-1,1]\), and I(t) is the mean intensity inside superpixel t. A value \(\alpha < 0\) favors the segmentation of dark objects in a brighter background (Fig. 1g), while \(\alpha > 0\) favors the opposite orientation (Fig. 1e).
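Eq. 6 translates directly into a small function. In the sketch below, the name `arc_weight` and the range check on \(\alpha \) are assumptions of this example; the arguments are the mean intensities I(s) and I(t) of the two superpixels.

```python
def arc_weight(I_s, I_t, alpha):
    """Arc weight of Eq. 6 for the arc <s, t>."""
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [-1, 1]")
    diff = abs(I_t - I_s)
    if I_s > I_t:                    # arc from brighter to darker
        return diff * (1 + alpha)
    if I_s < I_t:                    # arc from darker to brighter
        return diff * (1 - alpha)
    return diff                      # equal intensities: diff is 0
```

For example, with \(\alpha = 0.5\) the arc from a superpixel of intensity 200 to one of intensity 100 gets weight 150, while the reverse arc gets weight 50. With \(\alpha = 0\) both arcs get weight 100 and the graph becomes effectively undirected.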

We conducted experiments comparing the proposed unsupervised segmentation by OIFT with other graph-based methods. In the following, MST denotes the clustering of the previously described RAG nodes, obtained by successive removals of edges of maximum weight from the minimum spanning tree, where \(\omega (\langle s,t \rangle ) = |I(t)-I(s)|\), which is related to the nearest-neighbor (single-linkage) algorithm. FH denotes the unsupervised approach by Felzenszwalb and Huttenlocher [9], which computes a predicate for measuring the evidence for a boundary between two regions based on the minimum spanning tree computed in the RAG. EF+WS indicates the IFT-based watershed transform [3], after a volume extinction filter [20] set to preserve k leaves of the Min-tree, in order to consider only the most relevant catchment basins of a morphological gradient by a disk of radius 1. We used the code for the extinction filter available in the iamxt toolbox [21]. Note that Algorithm 2 encompasses the single-linkage algorithm (MST) as a particular case for \(\alpha = 0.0\), since its first step then corresponds to an MST computation and each \(V(t_i)\) in its second step corresponds to an edge of maximum weight in the MST.
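The MST baseline can be sketched with Kruskal's algorithm: merging edges in non-decreasing order of weight and stopping when k components remain produces the same clusters as removing the k-1 heaviest edges from the MST when the edge weights are distinct. This is a generic single-linkage sketch under that assumption, not the code used in the experiments; the function name and the edge-list format are hypothetical.

```python
def single_linkage(nodes, edges, k):
    """Cluster `nodes` into k groups; edges is a list of (weight, s, t) tuples.

    Returns a dict mapping each node to a representative of its cluster.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    components = len(parent)
    for w, s, t in sorted(edges, key=lambda e: e[0]):
        if components <= k:
            break
        root_s, root_t = find(s), find(t)
        if root_s != root_t:                # edge belongs to the MST
            parent[root_s] = root_t
            components -= 1
    return {v: find(v) for v in nodes}
```

For a chain of five regions with mean intensities 10, 12, 50, 52, and 90 and weights \(|I(t)-I(s)|\), asking for k = 3 groups the first two regions, the next two, and leaves the last one alone.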

Fig. 2. Segmentation results for a real MR image of the foot. In order to properly segment the talus bone, MST required \(k=73\), FH \(k=46\) and EF+WS \(k=44\), while UOIFT could get it using \(k=10\) only.

Fig. 3. The mean curves of Dice accuracy of the best union of produced regions for different values of k and methods, to segment: (a) talus bone and (b) spinal-vertebra.

We performed experiments using 40 slice images from real MR images of the foot to segment the talus bone (Fig. 2) and 40 slice images from CT cervical spine studies of 10 subjects to segment the spinal-vertebra. We computed the mean accuracy curve of all the methods for different values of k (Fig. 3). For each value of k, we computed the Dice similarity coefficient between the ground truth and the best union of segmented regions leading to the object. Since the method by Felzenszwalb and Huttenlocher only provides indirect control over the number of generated regions, our plot shows for FH the mean number of regions obtained for each value of its input parameter. The results indicate that UOIFT requires a lower value of k than the other approaches to generate the talus bone and the spinal-vertebra for different values of \(\alpha \), due to its boundary polarity information, demonstrating the robustness of UOIFT.
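The evaluation measure can be sketched as follows. The Dice coefficient is the standard \(2|A \cap B| / (|A| + |B|)\). How the best union of regions was selected is not detailed in the text, so `best_union_dice` below is only one possible way to compute it, assuming disjoint, nonempty regions given as sets of pixel indices: regions are ranked by the fraction of their pixels inside the ground truth, and the best-scoring prefix of that ranking is kept.

```python
def dice(a, b):
    """Dice similarity coefficient between two sets of pixel indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def best_union_dice(regions, truth):
    """Highest Dice over unions of regions (assumed disjoint and nonempty)."""
    truth = set(truth)
    ranked = sorted((set(r) for r in regions),
                    key=lambda r: len(r & truth) / len(r), reverse=True)
    best, inter, size = 0.0, 0, 0
    for r in ranked:                 # grow the union one region at a time
        inter += len(r & truth)
        size += len(r)
        best = max(best, 2 * inter / (size + len(truth)))
    return best
```

For a ground truth {1, 2, 3, 4} and regions {1, 2}, {3, 4, 5}, and {6, 7, 8}, the best union is the first two regions, with Dice 8/9.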

Regarding the computational time, for an image of \(256\times 256\) pixels, computing 625 superpixels by IFT-SLIC takes 203.4 ms and the final clustering into 300 regions by UOIFT in the RAG takes only 13.15 ms on an Intel Core i3-5005U CPU @ 2.00 GHz\(\times 4\). As future work, we intend to extend UOIFT to consider more sophisticated predicates based on the works in [7, 10, 11, 12, 26].