1 Introduction

Over the past decades, edge detection has played a significant role in computer vision. Early edge detection methods often formulate the task as a low-level or mid-level grouping problem where Gestalt laws and perceptual grouping play considerable roles in algorithm design [7, 16, 23, 44]. Later works start to consider learning edges in a data-driven way, by looking into the statistics of features near boundaries [1, 2, 12, 13, 25, 31, 34, 39]. More recently, advances in deep representation learning [18, 26, 43] have further led to significant improvements on edge detection, pushing the boundaries of state of the art performance [3, 20, 24, 49, 50] to new levels. The associated tasks have also expanded from the conventional binary edge detection problems to the more recent and more challenging category-aware edge detection problems [4, 17, 22, 38, 52]. As a result of such advancement, a wide variety of other vision problems have enjoyed the benefits of reliable edge detectors. Examples of these applications include, but are not limited to, (semantic) segmentation [1, 4, 5, 9, 51], object proposal generation [4, 50, 53], object detection [29], depth estimation [19, 32], and 3D vision [21, 33, 42].

Fig. 1.

Examples of edges predicted by different methods on SBD (a–d) and Cityscapes (e–h). “CASENet” indicates the original CASENet from [52]. “SEAL” indicates the proposed framework trained with CASENet backbone. Best viewed in color. (Color figure online)

With the strong representation abilities of deep networks and the dense labeling nature of edge detection, many state of the art edge detectors are based on FCNs. Despite the underlying resemblance to other dense labeling tasks, edge learning problems face some typical challenges and issues. First, in light of the highly imbalanced amounts of positive samples (edge pixels) and negative samples (non-edge pixels), using reweighted losses where positive samples are weighted higher has become a predominant choice in recent deep edge learning frameworks [22, 24, 30, 49, 52]. While such a strategy to some extent renders better learning behaviors, it also induces thicker detected edges as well as more false positives. An example of this issue is illustrated in Figs. 1(c) and (g), where the edge maps predicted by CASENet [52] contain thick object boundaries. A direct consequence is that many local details are missing, which is not favored for other potential applications using edge detectors.

Another challenging issue for edge learning is the training label noise caused by inevitable misalignment during annotation. Unlike segmentation, edge learning is generally more vulnerable to such noise because edge structures are by nature much more delicate than regions. Even slight misalignment can lead to a significant proportion of mismatches between ground truth and prediction. In order to predict sharp edges, a model should learn to distinguish the few true edge pixels while suppressing edge responses near them. This already presents a considerable challenge to the model, as non-edge pixels near edges are likely to be hard negatives with similar features, while the presence of misalignment further causes significant confusion by continuously sending false positives during training. The problem is further aggravated under reweighted losses, where predicting more false positives near the edge is an effective way to decrease the loss due to the significantly higher weights of positive samples.
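To make the last point concrete, the sketch below compares per-pixel penalties under an unweighted and a reweighted sigmoid cross-entropy loss. The weighting shown (positives weighted by the fraction of negative pixels, negatives by the fraction of positive pixels) is one common class-balancing form and is an assumption here, not necessarily the exact loss used by any cited method; the function name and pixel counts are hypothetical.

```python
import math

def pixel_losses(num_pos, num_neg, p_wrong=0.9):
    """Per-pixel penalties for a confident mistake under two losses.

    Returns ((fp_plain, fn_plain), (fp_weighted, fn_weighted)), where fp is the
    penalty for predicting p_wrong on a non-edge pixel and fn is the penalty
    for predicting 1 - p_wrong on an edge pixel.
    """
    # Assumed class-balancing weight on positives: fraction of negative pixels.
    beta = num_neg / (num_pos + num_neg)
    penalty = -math.log(1.0 - p_wrong)
    plain = (penalty, penalty)
    weighted = ((1.0 - beta) * penalty, beta * penalty)
    return plain, weighted

# Hypothetical image with 2 edge pixels and 98 non-edge pixels.
plain, weighted = pixel_losses(num_pos=2, num_neg=98)
# How much more a missed edge pixel costs than a false alarm under reweighting.
ratio = weighted[1] / weighted[0]
```

Under these assumed counts, a missed edge pixel is penalized roughly 49 times more than a false alarm, so firing on several pixels around a possibly misaligned label is a cheap way to avoid one miss.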

Fig. 2.

Evolution of edge alignment during training (progression from left to right). Blue color indicates the aligned edge labels learned by SEAL, while red color indicates the original human annotation. Overlapping pixels between the aligned edge labels and the original annotation are color-coded to be blue. Note how the aligned edge labels gradually tighten around the airplane as training progresses. Best viewed in color. (Color figure online)

Unfortunately, completely eliminating misalignment during annotation is almost impossible given the limits of human precision and the diminishing gains in annotation quality from additional effort. For datasets such as Cityscapes [11] where high quality labels are generated by professional annotators, misalignment can still be frequently observed. For datasets with crowdsourcing annotations where quality control presents another challenge, the issue can become even more severe. Our proposed solution is an end-to-end framework towards Simultaneous Edge Alignment and Learning (SEAL). In particular, we formulate the problem with a probabilistic model, treating edge labels as latent variables to be jointly learned during training. We show that the optimization of latent edge labels can be transformed into a bipartite graph min-cost assignment problem, and present an end-to-end learning framework for model training. Figure 2 shows some examples where the model gradually learns how to align noisy edge labels to more accurate positions along with edge learning.

Contrary to the widely believed intuition that reweighted loss benefits edge learning problems, an interesting and counter-intuitive observation made in this paper is that (regular) sigmoid cross-entropy loss works surprisingly well under the proposed framework despite the extremely imbalanced distribution. The underlying reason is that edge alignment significantly reduces the training confusion by increasing the purity of positive edge samples. Without edge alignment, on the other hand, the presence of label noise together with the imbalanced distribution makes it more difficult for the model to correctly learn positive classes. As a result of the increased label quality and the benefit of better negative suppression using unweighted loss, our proposed framework produces state of the art detection performance with high quality sharp edges (see Figs. 1(d) and (h)).

2 Related Work

2.1 Boundary Map Correspondence

Our work is partly motivated by the early work of boundary evaluation using precision-recall and F-measure [34]. To address misalignment between prediction and human ground truth, [34] proposed to compute a one-to-one correspondence for the subset of matchable edge pixels from both domains by solving a min-cost assignment problem. However, [34] only considers the alignment between fixed boundary maps, while our work addresses a more complicated learning problem where edge alignment becomes part of the optimization with learnable inputs.

2.2 Mask Refinement via Energy Minimization

Yang et al. [50] proposed to use dense-CRF to refine object mask and contour. Despite the similar goal, our method differs from [50] in that: 1. The refinement framework in [50] is a separate preprocessing step, while our work jointly learns refinement with the model in an end-to-end fashion. 2. The CRF model in [50] only utilizes low-level features, while our model considers both low-level and high-level information via a deep network. 3. The refinement framework in [50] is segmentation-based, while our framework directly targets edge refinement.

2.3 Object Contour and Mask Learning

A series of works [8, 37, 40] seek to learn object contours/masks in a supervised fashion. Deep active contour [40] uses learned CNN features to steer contour evolution given the input of an initialized contour. Polygon-RNN [8] introduced a semi-automatic approach for object mask annotation, by learning to extract polygons given input bounding boxes. DeepMask [37] proposed an object proposal generation method to output class-agnostic segmentation masks. These methods require accurate ground truth for contour/mask learning, while this work only assumes noisy ground truths and seeks to refine them automatically.

2.4 Noisy Label Learning

Our work can be broadly viewed as a structured noisy label learning framework where we leverage abundant structural priors to correct label noise. The existing noisy label learning literature has proposed directed graphical models [48], conditional random fields (CRF) [45], neural networks [46, 47], robust losses [35] and knowledge graphs [27] to model and correct image-level noisy labels. In contrast, our work considers pixel-level labels instead of image-level ones.

2.5 Virtual Evidence in Bayesian Networks

Our work also shares similarity with virtual evidence [6, 28, 36], where the uncertainty of an observation is modeled by a distribution rather than a single value. In our problem, noisy labels can be regarded as uncertain observations which give conditional prior distributions over different configurations of aligned labels.

3 A Probabilistic View Towards Edge Learning

In many classification problems, training of the models can be formulated as maximizing the following likelihood function with respect to the parameters:

$$\begin{aligned} \max _{\mathbf {W}}\mathcal {L}(\mathbf {W}) = P(\mathbf {y}|\mathbf {x}; \mathbf {W}), \end{aligned}$$
(1)

where \(\mathbf {y}\), \(\mathbf {x}\) and \(\mathbf {W}\) indicate respectively training labels, observed inputs and model parameters. Depending on how the conditional probability is parameterized, the above likelihood function may correspond to different types of models. For example, a generalized linear model function leads to the well known logistic regression. If the parameterization is formed as a layered representation, the model may turn into CNNs or multilayer perceptrons. One may observe that many traditional supervised edge learning models can also be regarded as special cases under the above probabilistic framework. Here, we are mostly concerned with edge detection using fully convolutional neural networks. In this case, the variable \(\mathbf {y}\) indicates the set of edge prediction configurations at every pixel, while \(\mathbf {x}\) and \(\mathbf {W}\) denote the input image and the network parameters, respectively.

4 Simultaneous Edge Alignment and Learning

To introduce the ability of correcting edge labels during training, we consider the following model. Instead of treating the observed annotation \(\mathbf {y}\) as the fitting target, we assume there is an underlying ground truth \(\hat{\mathbf {y}}\) that is more accurate than \(\mathbf {y}\). Our goal is to treat \(\hat{\mathbf {y}}\) as a latent variable to be jointly estimated during learning, which leads to the following likelihood maximization problem:

$$\begin{aligned} \begin{aligned} \max _{\hat{\mathbf {y}}, \mathbf {W}}\mathcal {L}(\hat{\mathbf {y}},\mathbf {W}) = P(\mathbf {y}, \hat{\mathbf {y}}|\mathbf {x}; \mathbf {W}) = P(\mathbf {y}|\hat{\mathbf {y}})P(\hat{\mathbf {y}}|\mathbf {x}; \mathbf {W}),\\ \end{aligned} \end{aligned}$$
(2)

where \(\hat{\mathbf {y}}\) indicates the underlying true ground truth. The former part \(P(\mathbf {y}|\hat{\mathbf {y}})\) can be regarded as an edge prior probabilistic model of an annotator generating labels given the underlying ground truths, while the latter part \(P(\hat{\mathbf {y}}|\mathbf {x}; \mathbf {W})\) is the standard likelihood of the prediction model.

4.1 Multilabel Edge Learning

Consider the multilabel edge learning setting where one assumes that the labels in \(\mathbf {y}\) do not need to be mutually exclusive at each pixel. In other words, any pixel may correspond to the edges of multiple classes. Assuming inter-class independence, the likelihood can be decomposed into a set of class-wise joint probabilities:

$$\begin{aligned} \begin{aligned} \mathcal {L}(\hat{\mathbf {y}},\mathbf {W}) = \prod _{k}P(\mathbf {y}^{k}|\hat{\mathbf {y}}^{k})P(\hat{\mathbf {y}}^{k}|\mathbf {x}; \mathbf {W}),\\ \end{aligned} \end{aligned}$$
(3)

where \(\mathbf {y}^{k}\in \{0,1\}^N\) indicates the set of binary labels corresponding to the k-th class. A typical multilabel edge learning example which also assumes inter-class independence is CASENet [52]. In addition, binary edge detection methods such as HED [49] can be viewed as special cases of multilabel edge learning.

4.2 Edge Prior Model

Solving Eq. (2) is not easy given the additional huge search space of \(\hat{\mathbf {y}}\). Fortunately, there is some prior knowledge one could leverage to effectively regularize \(\hat{\mathbf {y}}\). One of the most important priors is that \(\hat{\mathbf {y}}^{k}\) should not be too different from \(\mathbf {y}^{k}\). In addition, we assume that edge pixels in \(\mathbf {y}^{k}\) are generated from those in \(\hat{\mathbf {y}}^{k}\) through a one-to-one assignment process, which indicates \(|\mathbf {y}^{k}|=|\hat{\mathbf {y}}^{k}|\). In other words, let \(y_{\mathbf {q}}^{k}\) denote the label of class k at pixel \(\mathbf {q}\), and similarly for \(\hat{y}_{\mathbf {p}}^{k}\). Then there exists a set of one-to-one correspondences between edge pixels in \(\hat{\mathbf {y}}^{k}\) and \(\mathbf {y}^{k}\):

$$\begin{aligned} \begin{aligned} \mathcal {M}(\mathbf {y}^{k}, \hat{\mathbf {y}}^{k}) =&\ \{m(\cdot )|\forall \mathbf {u},\mathbf {v}\in \{\mathbf {q}|y_{\mathbf {q}}^{k}=1\}: \hat{y}_{m({\mathbf {u}})}^{k}=1,\\&\hat{y}_{m({\mathbf {v}})}^{k}=1,\mathbf {u}\ne \mathbf {v} \Rightarrow m(\mathbf {u}) \ne m(\mathbf {v}) \}, \\ \end{aligned} \end{aligned}$$
(4)

where each \(m(\cdot )\) is associated with a finite set of pairs:

$$\begin{aligned} m(\cdot )\sim E_{m} = \{(\mathbf {p},\mathbf {q})|\hat{y}_{\mathbf {p}},y_{{\mathbf {q}}}=1, m(\mathbf {q})=\mathbf {p}\}. \end{aligned}$$
(5)

The edge prior therefore can be modeled as a product of Gaussian similarities maximized over all possible correspondences:

$$\begin{aligned} \begin{aligned} P(\mathbf {y}^{k}|\hat{\mathbf {y}}^{k})&\propto \sup _{m\in \mathcal {M}(\mathbf {y}^{k},\hat{\mathbf {y}}^{k})}\prod _{(\mathbf {p},\mathbf {q})\in E_{m}}\exp \Big (-\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2}\Big )\\&=\exp \Big (-\inf _{m\in \mathcal {M}(\mathbf {y}^{k},\hat{\mathbf {y}}^{k})}\sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2}\Big ), \\ \end{aligned} \end{aligned}$$
(6)

where \(\sigma \) is the bandwidth that controls the sensitivity to misalignment. The misalignment is quantified by measuring the lowest possible sum of squared distances between pairwise pixels, which is determined by the tightest correspondence.
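As a minimal illustration of Eq. (6), the sketch below evaluates the unnormalized prior by brute force, enumerating every one-to-one correspondence between two tiny sets of edge pixels. The function name and the example coordinates are hypothetical, and exhaustive enumeration is only feasible for a handful of pixels; it is meant to mirror the definition, not to serve as a practical solver.

```python
import itertools
import math

def edge_prior_unnormalized(y_pixels, y_hat_pixels, sigma=1.0):
    """Unnormalized P(y | y_hat) from Eq. (6), by brute force.

    y_pixels / y_hat_pixels: lists of (row, col) edge-pixel coordinates with
    equal length (the constraint |y| = |y_hat|). Every one-to-one
    correspondence is a permutation, so this only scales to a few pixels.
    """
    if len(y_pixels) != len(y_hat_pixels):
        raise ValueError("Eq. (6) assumes |y| == |y_hat|")
    best_cost = math.inf
    for perm in itertools.permutations(y_hat_pixels):
        # perm[i] plays the role of p = m(q) for q = y_pixels[i].
        cost = sum(
            ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / (2.0 * sigma ** 2)
            for p, q in zip(perm, y_pixels)
        )
        best_cost = min(best_cost, cost)
    return math.exp(-best_cost), best_cost

y = [(0, 0), (1, 0), (2, 0)]       # annotated edge pixels
y_hat = [(2, 1), (0, 1), (1, 1)]   # candidate aligned pixels, one column over
prior, cost = edge_prior_unnormalized(y, y_hat, sigma=1.0)
```

Here the tightest correspondence pairs each annotated pixel with the candidate in the same row, giving a summed cost of 1.5 and an unnormalized prior of exp(-1.5).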

4.3 Network Likelihood Model

We now consider the likelihood of the prediction model, where we assume that the class-wise joint probability can be decomposed into a set of pixel-wise probabilities modeled by Bernoulli distributions with binary configurations:

$$\begin{aligned} \begin{aligned} P(\hat{\mathbf {y}}^{k}|\mathbf {x}; \mathbf {W}) = \prod _{\mathbf {p}}P(\hat{y}_{\mathbf {p}}^{k}|\mathbf {x};\mathbf {W})=\prod _{\mathbf {p}}h_{k}(\mathbf {p}|\mathbf {x};\mathbf {W})^{\hat{y}_{\mathbf {p}}^{k}}(1-h_{k}(\mathbf {p}|\mathbf {x};\mathbf {W}))^{(1-\hat{y}_{\mathbf {p}}^{k})},\\ \end{aligned} \end{aligned}$$
(7)

where \(\mathbf {p}\) is the pixel location index, and \(h_k\) is the hypothesis function indicating the probability of the k-th class. We consider the prediction model as FCNs with k sigmoid outputs. As a result, the hypothesis function in Eq. (7) becomes the sigmoid function, which will be denoted as \(\sigma (\cdot )\) in the rest of this section (not to be confused with the scalar bandwidth \(\sigma \) in Eq. (6)).
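The pixel-wise factorization in Eq. (7) can be sketched for a single class as follows. The logits stand in for the network activations before the sigmoid and are hypothetical values; a practical implementation would typically use a numerically stable formulation rather than calling the logarithm on the sigmoid output directly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(logits, y_hat):
    """log P(y_hat^k | x; W) from Eq. (7) for one class k.

    logits: flat list of network activations, one per pixel (hypothetical).
    y_hat:  flat list of 0/1 aligned labels with the same length.
    """
    total = 0.0
    for z, label in zip(logits, y_hat):
        h = sigmoid(z)  # h_k(p | x; W)
        total += label * math.log(h) + (1 - label) * math.log(1.0 - h)
    return total

logits = [2.0, -1.0, 0.0]
y_hat = [1, 0, 1]
ll = log_likelihood(logits, y_hat)
```

The negative of this quantity is the unweighted sigmoid cross-entropy summed over pixels.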

4.4 Learning

Substituting Eqs. (6) and (7) into Eq. (3) and taking the log of the likelihood, we have (up to an additive constant):

$$\begin{aligned} \begin{aligned} \log \mathcal {L}(\hat{\mathbf {y}},\mathbf {W})=&\sum _{k} \Big \{-\inf _{m\in \mathcal {M}(\mathbf {y}^{k},\hat{\mathbf {y}}^{k})}\sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2}\\&+\sum _{\mathbf {p}}\Big [\hat{y}_{\mathbf {p}}^{k}\log \sigma _{k}(\mathbf {p}|\mathbf {x};\mathbf {W})+(1-\hat{y}_{\mathbf {p}}^{k})\log (1-\sigma _{k}(\mathbf {p}|\mathbf {x};\mathbf {W}))\Big ]\Big \}, \\ \end{aligned} \end{aligned}$$
(8)

where the second part is the negative of the widely used sigmoid cross-entropy loss. Accordingly, learning the model requires solving the constrained optimization:

$$\begin{aligned} \begin{aligned} \min _{\hat{\mathbf {y}},\mathbf {W}}&~-\log \mathcal {L}(\hat{\mathbf {y}},\mathbf {W})\\ \mathrm {s.t.}&~~|\hat{\mathbf {y}}^{k}| = |\mathbf {y}^{k}|, \forall k \end{aligned} \end{aligned}$$
(9)

Given a training set, we take an alternating optimization strategy where \(\mathbf {W}\) is updated with \(\hat{\mathbf {y}}\) fixed, and vice versa. When \(\hat{\mathbf {y}}\) is fixed, the optimization becomes:

$$\begin{aligned} \begin{aligned} \min _{\mathbf {W}}~\sum _{k}\sum _{\mathbf {p}}-\Big [\hat{y}_{\mathbf {p}}^{k}\log \sigma _{k}(\mathbf {p}|\mathbf {x};\mathbf {W})+(1-\hat{y}_{\mathbf {p}}^{k})\log (1-\sigma _{k}(\mathbf {p}|\mathbf {x};\mathbf {W}))\Big ],\\ \end{aligned} \end{aligned}$$
(10)

which is the typical network training with the aligned edge labels and can be solved with standard gradient descent. When \(\mathbf {W}\) is fixed, the optimization can be modeled as a constrained discrete optimization problem for each class:

$$\begin{aligned} \begin{aligned}&\min _{\hat{\mathbf {y}}^{k}}~\inf _{m\in \mathcal {M}(\mathbf {y}^{k},\hat{\mathbf {y}}^{k})}\sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2} \\&~~~~~~~~~- \sum _{\mathbf {p}}\Big [\hat{y}_{\mathbf {p}}^{k}\log \sigma _{k}(\mathbf {p})+(1-\hat{y}_{\mathbf {p}}^{k})\log (1-\sigma _{k}(\mathbf {p}))\Big ]\\&~~\mathrm {s.t.}~~|\hat{\mathbf {y}}^{k}| = |\mathbf {y}^{k}|\\ \end{aligned} \end{aligned}$$
(11)

where \(\sigma (\mathbf {p})\) denotes \(\sigma (\mathbf {p}|\mathbf {x};\mathbf {W})\) for short. Solving the above optimization is seemingly difficult, since one would need to enumerate all possible configurations of \(\hat{\mathbf {y}}^k\) satisfying \(|\hat{\mathbf {y}}^{k}| = |\mathbf {y}^{k}|\) and evaluate the associated cost. It turns out, however, that the above optimization can be elegantly transformed to a bipartite graph assignment problem with available solvers. We first have the following definition:

Definition 1

Let \(\hat{\mathbf {Y}}=\{\hat{\mathbf {y}}||\hat{\mathbf {y}}|=|\mathbf {y}|\}\). The mapping space \(\mathbf {M}\) is the space consisting of all possible one-to-one mappings:

$$\begin{aligned} \mathbf {M} = \{m|m\in \mathcal {M}(\mathbf {y},\hat{\mathbf {y}}), \hat{\mathbf {y}}\in \hat{\mathbf {Y}}\} \end{aligned}$$

Definition 2

A label realization is a function that maps a correspondence to its corresponding label configuration:

$$\begin{aligned} \begin{aligned} f_{L}:&\mathbf {Y}\times \mathbf {M}\mapsto \hat{\mathbf {Y}}\\&f_{L}(\mathbf {y}, m) = \hat{\mathbf {y}} \end{aligned} \end{aligned}$$

Lemma 1

The mapping \(f_{L}(\cdot )\) is surjective.

Remark

Lemma 1 shows that every label configuration \(\hat{\mathbf {y}}\) is realized by at least one mapping, and a given configuration may in fact correspond to multiple underlying mappings. This is clearly true since there could be multiple ways in which pixels in \(\mathbf {y}\) are assigned to those in \(\hat{\mathbf {y}}\).

Lemma 2

Under the constraint \(|\hat{\mathbf {y}}|=|\mathbf {y}|\), if:

$$\begin{aligned} \begin{aligned}&\hat{\mathbf {y}}^{*}=\mathop {{{\mathrm{arg\,min}}}}\limits _{\hat{\mathbf {y}}}-\sum _{\mathbf {p}}\Big [\hat{y}_{\mathbf {p}}\log \sigma (\mathbf {p})+(1-\hat{y}_{\mathbf {p}})\log (1-\sigma (\mathbf {p}))\Big ]\\&m^{*}=\mathop {{{\mathrm{arg\,min}}}}\limits _{m\in \mathbf {M}}\sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\Big [\log (1-\sigma (\mathbf {p}))-\log \sigma (\mathbf {p})\Big ]\\ \end{aligned} \end{aligned}$$

then \(f_{L}(\mathbf {y},m^{*})=\hat{\mathbf {y}}^{*}\).

Proof

Suppose in the beginning all pixels in \(\hat{\mathbf {y}}\) are 0. The corresponding loss therefore is:

$$\begin{aligned} \mathcal {C}_{N}(\mathbf {0})=-\sum _{\mathbf {p}}\log (1-\sigma (\mathbf {p})) \end{aligned}$$

Flipping \(\hat{y}_{\mathbf {p}}\) to 1 will accordingly introduce a cost \(\log (1-\sigma (\mathbf {p}))-\log \sigma (\mathbf {p})\) at pixel \(\mathbf {p}\). As a result, we have:

$$\begin{aligned} \mathcal {C}_{N}(\hat{\mathbf {y}})= \mathcal {C}_{N}(\mathbf {0}) + \sum _{\mathbf {p}\in \{\mathbf {p}|\hat{y}_{\mathbf {p}}=1\}}\Big [\log (1-\sigma (\mathbf {p}))-\log \sigma (\mathbf {p})\Big ] \end{aligned}$$

In addition, Lemma 1 states that the mapping \(f_{L}(\cdot )\) is surjective, which implies that the mapping search space \(\mathbf {M}\) exactly covers \(\hat{\mathbf {Y}}\). Thus the top optimization problem in Lemma 2 can be transformed into the bottom problem.

Lemma 2 motivates us to reformulate the optimization in Eq. (11) by instead considering the following problem:

$$\begin{aligned} \min _{m\in \mathbf {M}} \sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\Big [\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2}+\log (1-\sigma (\mathbf {p}))-\log \sigma (\mathbf {p})\Big ] \end{aligned}$$
(12)

Equation (12) is a typical minimum cost bipartite assignment problem which can be solved by standard solvers, where the cost of each assignment pair \((\mathbf {p},\mathbf {q})\) is associated with the weight of a bipartite graph's edge. Following [34], we formulate a sparse assignment problem and use Goldberg's CSA package, which is among the best known algorithms for min-cost sparse assignment [10, 15]. Upon obtaining the mapping, one can recover \(\hat{\mathbf {y}}\) through label realization.
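A minimal dense sketch of Eq. (12) for one class on a small image is given below. It swaps in SciPy's general-purpose `linear_sum_assignment` solver in place of the sparse CSA package, considers every pixel as a candidate rather than a sparse neighborhood, and clips probabilities for numerical safety; the function name, `eps`, and the toy inputs are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(y, prob, sigma=1.0, eps=1e-6):
    """Solve Eq. (12) for one class on a small image (dense illustration).

    y:    (H, W) binary array of annotated edge pixels.
    prob: (H, W) array of sigmoid outputs sigma(p) with W fixed.
    Returns the aligned binary label map y_hat with |y_hat| == |y|.
    """
    h, w = y.shape
    q = np.argwhere(y == 1)                                  # (n, 2) annotated pixels
    p = np.indices((h, w)).reshape(2, -1).T                  # (H*W, 2) all candidates
    prob = np.clip(prob, eps, 1.0 - eps).reshape(-1)
    dist2 = ((q[:, None, :] - p[None, :, :]) ** 2).sum(-1)   # (n, H*W)
    # Cost of assigning q to p: distance term plus the label flip cost at p.
    cost = dist2 / (2.0 * sigma ** 2) + np.log(1.0 - prob) - np.log(prob)
    _, cols = linear_sum_assignment(cost)                    # one distinct p per q
    y_hat = np.zeros(h * w, dtype=y.dtype)
    y_hat[cols] = 1                                          # label realization f_L
    return y_hat.reshape(h, w)

# Toy case: labels drawn on column 1, network confident on column 2.
y = np.zeros((4, 4), dtype=int)
y[:, 1] = 1
prob = np.full((4, 4), 0.05)
prob[:, 2] = 0.95
y_hat = align_labels(y, prob)
```

In this toy case the aligned labels move one column to the right, since the gain in network likelihood outweighs the distance penalty for a one-pixel shift.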

However, solving Eq. (12) assumes an underlying relaxation where the search space contains mappings m that may not satisfy the infimum requirement in Eq. (11). In other words, it may be possible that the minimization problem in Eq. (12) is only an approximation to Eq. (11). The following theorem, however, proves the optimality of Eq. (12):

Theorem 1

Given a solver that minimizes Eq. (12), the solution is also a minimizer of the problem in Eq. (11).

Proof

We use contradiction to prove Theorem 1. Suppose there exists a solution of (12) where:

$$\begin{aligned} f_{L}(\mathbf {y},m^{*})=\hat{\mathbf {y}}, ~m^{*}\ne \mathop {{{\mathrm{arg\,min}}}}\limits _{m\in \mathcal {M}(\mathbf {y}^{k},\hat{\mathbf {y}}^{k})}\sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2} \end{aligned}$$

There must exist another mapping \(m'\) which satisfies:

$$\begin{aligned} f_{L}(\mathbf {y},m')=\hat{\mathbf {y}}, \sum _{(\mathbf {p},\mathbf {q})\in E_{m'}}\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2} < \sum _{(\mathbf {p},\mathbf {q})\in E_{m^{*}}}\frac{\Vert \mathbf {p}-\mathbf {q}\Vert ^2}{2\sigma ^2} \end{aligned}$$

Since \(f_{L}(\mathbf {y},m')=f_{L}(\mathbf {y},m^{*})=\hat{\mathbf {y}}\), substituting \(m'\) into (12) leads to an even lower cost, which contradicts the assumption that \(m^{*}\) is the minimizer of (12).
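Theorem 1 can be checked numerically on a toy problem by exhaustive search. The sketch below uses a hypothetical one-dimensional strip of five pixels with two annotated edge positions, minimizes Eq. (11) over label configurations and Eq. (12) over one-to-one mappings, and compares the results. It is a sanity check under these assumed values, not a proof.

```python
import itertools
import math

probs = [0.10, 0.70, 0.20, 0.85, 0.30]   # hypothetical sigma(p) on a 1-D strip
y_pos = [0, 2]                           # annotated edge positions (y_q = 1)
sigma = 1.0

def dist_cost(assign):
    # assign[i] is the aligned position p = m(q) for q = y_pos[i].
    return sum((p - q) ** 2 / (2.0 * sigma ** 2) for p, q in zip(assign, y_pos))

def cross_entropy(ones):
    return -sum(
        math.log(s) if i in ones else math.log(1.0 - s)
        for i, s in enumerate(probs)
    )

def flip_cost(p):
    return math.log(1.0 - probs[p]) - math.log(probs[p])

# Eq. (11): search over label configurations, with the inner infimum over mappings.
def eq11_cost(ones):
    inner = min(dist_cost(perm) for perm in itertools.permutations(ones))
    return inner + cross_entropy(set(ones))

best_y_hat = min(itertools.combinations(range(len(probs)), len(y_pos)), key=eq11_cost)

# Eq. (12): search directly over one-to-one mappings.
def eq12_cost(assign):
    return dist_cost(assign) + sum(flip_cost(p) for p in assign)

best_m = min(itertools.permutations(range(len(probs)), len(y_pos)), key=eq12_cost)
realized = tuple(sorted(best_m))         # label realization f_L(y, m)
```

Both searches select positions 1 and 3, and the two optimal values differ only by the constant \(\mathcal {C}_{N}(\mathbf {0})\) from the proof of Lemma 2.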

In practice, we follow the mini-batch SGD optimization, where \(\hat{\mathbf {y}}\) of each image and \(\mathbf {W}\) are both updated once in every batch. To begin with, \(\hat{\mathbf {y}}\) is initialized as \(\mathbf {y}\) for every image in the first batch. Basically, the optimization can be written as a loss layer in a network, and is fully compatible with end-to-end training.

4.5 Inference

We now consider the inference problem given a trained model. Ideally, the inference problem of the model trained by Eq. (2) would be the following:

$$\begin{aligned} \hat{\mathbf {y}}^{*} = \mathop {{{\mathrm{arg\,max}}}}\limits _{\hat{\mathbf {y}}}P(\mathbf {y}|\hat{\mathbf {y}})P(\hat{\mathbf {y}}|\mathbf {x}; \mathbf {W}) \end{aligned}$$
(13)

However, in cases where \(\mathbf {y}\) is not available during testing, we can alternatively look into the second part of (2), which is the model learned under \(\hat{\mathbf {y}}\):

$$\begin{aligned} \hat{\mathbf {y}}^{*} = \mathop {{{\mathrm{arg\,max}}}}\limits _{\hat{\mathbf {y}}}P(\hat{\mathbf {y}}|\mathbf {x}; \mathbf {W}) \end{aligned}$$
(14)

Both cases can find real applications. In particular, (14) corresponds to general edge prediction, whereas (13) corresponds to refining noisy edge labels in a dataset. In the latter case, \(\mathbf {y}\) is available and the inferred \(\hat{\mathbf {y}}\) is used to output the refined label. In the experiments, we will show examples of both applications.

5 Biased Gaussian Kernel and Markov Prior

The task of SEAL turns out not to be easy, as it tends to generate artifacts in the presence of cluttered background. A major cause of this failure is fragmented aligned labels, as shown in Fig. 3(a). This is not surprising since we assume an isotropic Gaussian kernel, where labels tend to break and shift along the edges towards easy locations. In light of this issue, we assume that the edge prior follows a biased Gaussian (B.G.), with the long axis of the kernel perpendicular to the local boundary tangent. Accordingly, such a model encourages alignment perpendicular to edge tangents while suppressing shifts along them.

Fig. 3.

Examples of edge alignment using different priors and graphical illustration.

Another direction is to consider the Markov properties of edges. Good edge labels should be relatively continuous, and nearby alignment vectors should be similar. Taking these into consideration, we can model the edge prior as:

$$\begin{aligned} \begin{aligned} P(\mathbf {y}|\hat{\mathbf {y}}) \propto&\sup _{m\in \mathcal {M}(\mathbf {y},\hat{\mathbf {y}})}\prod _{(\mathbf {p},\mathbf {q})\in E_{m}}\exp (-\mathbf {m}_{\mathbf {q}}^{\top }\mathbf {\Sigma _{\mathbf {q}}}\mathbf {m}_{\mathbf {q}})\prod _{\begin{array}{c} (\mathbf {u},\mathbf {v})\in E_{m},\\ \mathbf {v}\in \mathcal {N}(\mathbf {q}) \end{array}}\exp (-\lambda \Vert \mathbf {m}_{\mathbf {q}}-\mathbf {m}_{\mathbf {v}}\Vert ^{2})\\ \end{aligned} \end{aligned}$$
(15)

where \(\lambda \) controls the strength of the smoothness. \(\mathcal {N}(\mathbf {q})\) is the neighborhood of \(\mathbf {q}\) defined by the geodesic distance along the edge. \(\mathbf {m}_{\mathbf {q}} = \mathbf {p}-\mathbf {q}\), and \(\mathbf {m}_{\mathbf {v}} = \mathbf {u}-\mathbf {v}\). An example of the improved alignment and a graphical illustration are shown in Figs. 3(b) and (c). In addition, the precision matrix \(\mathbf {\Sigma }_{\mathbf {q}}\) is defined as:

$$\begin{aligned} \mathbf {\Sigma }_{\mathbf {q}} = \begin{bmatrix} \frac{\cos (\theta _{\mathbf {q}})^2}{2\sigma _{x}^2} + \frac{\sin (\theta _{\mathbf {q}})^2}{2\sigma _{y}^2}&\frac{\sin (2\theta _{\mathbf {q}})}{4\sigma _{y}^2} - \frac{\sin (2\theta _{\mathbf {q}})}{4\sigma _{x}^2}\\ \frac{\sin (2\theta _{\mathbf {q}})}{4\sigma _{y}^2} - \frac{\sin (2\theta _{\mathbf {q}})}{4\sigma _{x}^2}&\frac{\sin (\theta _{\mathbf {q}})^2}{2\sigma _{x}^2} + \frac{\cos (\theta _{\mathbf {q}})^2}{2\sigma _{y}^2} \\ \end{bmatrix} \end{aligned}$$

where \(\theta _{\mathbf {q}}\) is the angle between edge tangent and the positive x-axis, and \(\sigma _y\) corresponds to the kernel bandwidth perpendicular to the edge tangent. With the new prior, the alignment optimization becomes the following problem:

$$\begin{aligned} \begin{aligned} \min _{m\in \mathbf {M}} ~\mathcal {C}(m) =&~\mathcal {C}_{Unary}(m) + \mathcal {C}_{Pair}(m)\\ =&\sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\Big [\mathbf {m}_{\mathbf {q}}^{\top }\mathbf {\Sigma _{\mathbf {q}}}\mathbf {m}_{\mathbf {q}}+\log ((1-\sigma (\mathbf {p}))/\sigma (\mathbf {p}))\Big ]\\&~~~~~+\lambda \sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\sum _{\begin{array}{c} (\mathbf {u},\mathbf {v})\in E_{m},\\ \mathbf {v}\in \mathcal {N}(\mathbf {q}) \end{array}}\Vert \mathbf {m}_{\mathbf {q}}-\mathbf {m}_{\mathbf {v}}\Vert ^{2}\\ \end{aligned} \end{aligned}$$
(16)
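The precision matrix \(\mathbf {\Sigma }_{\mathbf {q}}\) and a single summand of \(\mathcal {C}_{Unary}\) in Eq. (16) can be sketched as follows. The sketch assumes that pixel coordinates are given in (x, y) order and that \(\theta _{\mathbf {q}}\) is in radians; the default \(\sigma _y=3\) is a hypothetical value chosen only to satisfy \(\sigma _y>\sigma _x\), and the function names are illustrative.

```python
import numpy as np

def precision_matrix(theta, sigma_x, sigma_y):
    """Sigma_q as written above; theta is the tangent angle in radians."""
    c2, s2, s2t = np.cos(theta) ** 2, np.sin(theta) ** 2, np.sin(2.0 * theta)
    a = c2 / (2 * sigma_x ** 2) + s2 / (2 * sigma_y ** 2)
    b = s2t / (4 * sigma_y ** 2) - s2t / (4 * sigma_x ** 2)
    d = s2 / (2 * sigma_x ** 2) + c2 / (2 * sigma_y ** 2)
    return np.array([[a, b], [b, d]])

def unary_cost(p, q, theta_q, prob_p, sigma_x=1.0, sigma_y=3.0):
    """One summand of C_Unary in Eq. (16) for the pair (p, q), in (x, y) order."""
    m_q = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    sigma_q = precision_matrix(theta_q, sigma_x, sigma_y)
    return m_q @ sigma_q @ m_q + np.log((1.0 - prob_p) / prob_p)

# Horizontal tangent (theta = 0), neutral network output (sigma(p) = 0.5).
along = unary_cost((1, 0), (0, 0), theta_q=0.0, prob_p=0.5)    # shift along the edge
across = unary_cost((0, 1), (0, 0), theta_q=0.0, prob_p=0.5)   # shift across the edge
```

With a horizontal tangent, a one-pixel shift along the edge costs 0.5 while a one-pixel shift across it costs 1/18 under these assumed bandwidths, which is the intended bias towards perpendicular alignment.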

Note that Theorem 1 still holds for (16). However, solving (16) becomes more difficult as pairwise dependencies are included. As a result, standard assignment solvers cannot be directly applied, and we alternatively decouple \(\mathcal {C}_{Pair}\) as:

$$\begin{aligned} \mathcal {C}_{Pair}(m, m') = \sum _{(\mathbf {p},\mathbf {q})\in E_{m}}\sum _{\begin{array}{c} (\mathbf {u},\mathbf {v})\in E_{m'},\\ \mathbf {v}\in \mathcal {N}(\mathbf {q}) \end{array}}\Vert \mathbf {m}_{\mathbf {q}}-\mathbf {m}_{\mathbf {v}}\Vert ^{2} \end{aligned}$$
(17)

and adopt an iterative approximation, similar to iterated conditional modes (ICM), where the alignments of neighboring pixels are taken from the alignment in the previous round:

$$\begin{aligned} \begin{aligned} \mathbf{Initialize\!: }\&m^{(0)} = \mathop {{{\mathrm{arg\,min}}}}\limits _{m\in \mathbf {M}}~\mathcal {C}_{Unary}(m)\\ \mathbf{Assign\!: }\&m^{(t+1)} = \mathop {{{\mathrm{arg\,min}}}}\limits _{m\in \mathbf {M}}~\mathcal {C}_{Unary}(m)+\mathcal {C}_{Pair}(m, m^{(t)})\\ \mathbf{Update\!: }\&\mathcal {C}_{Pair}(m, m^{(t)})\rightarrow \mathcal {C}_{Pair}(m, m^{(t+1)})\\ \end{aligned} \end{aligned}$$

where the Assign and Update steps are repeated multiple times. The algorithm converges very fast in practice. Usually one or two Assign steps are sufficient.
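The Initialize/Assign/Update scheme can be sketched as below. Because the neighbors' alignment vectors are frozen at \(m^{(t)}\), the pairwise term reduces to an extra per-pair cost and each round is again a standard assignment problem. The sketch uses SciPy's dense `linear_sum_assignment` as a stand-in solver; the precomputed `neighbors` lists, the value of `lam`, and the toy costs are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def icm_align(q, p, unary, neighbors, lam=0.1, rounds=2):
    """ICM-like approximation of Eq. (16).

    q:         (n, 2) annotated edge pixels.
    p:         (c, 2) candidate aligned pixels, c >= n.
    unary:     (n, c) matrix of C_Unary terms for every pair (p, q).
    neighbors: neighbors[i] lists the indices of pixels in N(q_i).
    Returns, for every q_i, the index of its assigned candidate in p.
    """
    # Initialize: m^(0) minimizes the unary cost only.
    _, cols = linear_sum_assignment(unary)
    shifts = p[None, :, :] - q[:, None, :]       # m_q for every pair, (n, c, 2)
    for _ in range(rounds):
        prev = p[cols] - q                       # alignment vectors of m^(t)
        pair = np.zeros_like(unary, dtype=float)
        for i, nbrs in enumerate(neighbors):
            for j in nbrs:
                # Neighbor vectors are frozen at m^(t), so this term is unary in m_q.
                pair[i] += ((shifts[i] - prev[j]) ** 2).sum(-1)
        # Assign: m^(t+1) minimizes C_Unary + lambda * C_Pair(m, m^(t)).
        _, cols = linear_sum_assignment(unary + lam * pair)
    return cols

# Toy case: three stacked pixels; the middle one weakly prefers to stay put.
q = np.array([[0, 0], [1, 0], [2, 0]])
p = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
unary = np.full((3, 6), 10.0)
unary[0, 0], unary[0, 1] = 0.0, -2.0
unary[1, 2], unary[1, 3] = -0.1, 0.0
unary[2, 4], unary[2, 5] = 0.0, -2.0
neighbors = [[1], [0, 2], [1]]
smooth = icm_align(q, p, unary, neighbors, lam=0.5)
unsmoothed = icm_align(q, p, unary, neighbors, lam=0.0)
```

In this toy case the smoothness term pulls the middle pixel along with its two neighbors, whereas with `lam=0.0` it stays at its original location and breaks the edge.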

6 Experimental Results

In this section, we comprehensively test the performance of SEAL on category-aware semantic edge detection, where the detector not only needs to localize object edges, but also classify them into a predefined set of semantic classes.

6.1 Backbone Network

In order to guarantee fair comparison across different methods, a fixed backbone network is needed for controlled evaluation. We choose CASENet [52] since it is the current state of the art on our task. For additional implementation details such as choice of hyperparameters, please refer to the supplementary material.

6.2 Evaluation Benchmarks

We follow [17] to evaluate edges with class-wise precision-recall curves. However, the benchmarks of our work differ from [17] by imposing considerably stricter rules. In particular: 1. We consider non-suppressed edges inside an object as false positives, while [17] ignores these pixels. 2. We accumulate false positives on any image, while the benchmark code from [17] only accumulates false positives of a certain class on images containing that class. Our benchmark can also be regarded as a multiclass extension of the BSDS benchmark [34].

Both [17] and [34] by default thin the prediction before matching. We propose to match the raw predictions with unthinned ground truths whose width is kept the same as training labels. The benchmark therefore also considers the local quality of predictions. We refer to this mode as “Raw” and the previous conventional mode as “Thin”. Similar to [34], both settings use maximum F-Measure (MF) at optimal dataset scale (ODS) to evaluate the performance.
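The MF (ODS) score can be sketched as follows, assuming the matching step has already produced dataset-level counts at each threshold. The input layout (a dictionary from threshold to matched, predicted, and ground-truth pixel counts) and the example numbers are hypothetical; the actual benchmark code may organize this differently.

```python
def ods_max_f_measure(counts_per_threshold):
    """Maximum F-measure at optimal dataset scale (ODS).

    counts_per_threshold: dict mapping a threshold to a tuple
    (num_matched, num_predicted, num_ground_truth) accumulated over the whole
    dataset at that threshold. A single threshold is chosen for all images.
    """
    best_f, best_t = 0.0, None
    for t, (matched, n_pred, n_gt) in sorted(counts_per_threshold.items()):
        precision = matched / n_pred if n_pred else 0.0
        recall = matched / n_gt if n_gt else 0.0
        denom = precision + recall
        f = 2.0 * precision * recall / denom if denom else 0.0
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t

# Hypothetical dataset-level counts for one class at three thresholds.
counts = {0.3: (80, 160, 100), 0.5: (70, 100, 100), 0.7: (40, 50, 100)}
mf, threshold = ods_max_f_measure(counts)
```

With these made-up counts, the best F-measure is attained at the middle threshold, where precision and recall are balanced.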

Another difference between the problem settings of our work and [17] is that we consider edges between any two instances as positive, even though the instances may belong to the same class. This differs from [17] where such edges are ignored. Our motivation for making such changes is twofold: 1. We believe instance-sensitive edges are important and it makes better sense to distinguish these locations. 2. The instance-sensitive setting may better benefit other potential applications where instances need to be distinguished.

6.3 Experiment on the SBD Dataset

The Semantic Boundary Dataset (SBD) [17] contains 11355 images from the trainval set of PASCAL VOC2011 [14], with 8498 images divided as training set and 2857 images as test set. The dataset contains both category-level and instance-level semantic segmentation annotations, with semantic classes defined following the 20 class definitions in PASCAL VOC.

Table 1. Results on the SBD test set. MF scores are measured by \(\%\).
Table 2. Results on the SBD test set (re-annotated). MF scores are measured by \(\%\).

Parameter Analysis.

We set \(\sigma _x=1\) and \(\sigma _y > \sigma _x\) to favor alignment perpendicular to edge tangents. Details on the validation of \(\sigma _y\) and \(\lambda \) are in the supplementary material.

Results on SBD Test Set.

We compare SEAL with CASENet, CASENet trained with regular sigmoid cross-entropy loss (CASENet-S), and CASENet-S trained on labels refined by dense-CRF following [50] (CASENet-C), with the results visualized in Fig. 5 and quantified in Table 1. Results show that SEAL is on par with CASENet-S under the “Thin” setting, while significantly outperforming all other baselines when edge sharpness is taken into account.

Results on Re-annotated SBD Test Set.

A closer analysis shows that SEAL actually outperforms CASENet-S considerably under the “Thin” setting. The original SBD labels turn out to be noisy, which can influence the validity of evaluation. We re-annotated more than 1000 images on the SBD test set using LabelMe [41], and report evaluation using these high-quality labels in Table 2. Results indicate that SEAL outperforms CASENet-S in both settings.

Fig. 4.

MF vs. tolerance.

Results of SBD GT Refinement.

We output the SEAL aligned labels and compare against both dense-CRF and original annotation. We match the aligned labels with re-annotated labels by varying the tolerance threshold and generating F-Measure scores. Figure 4 shows that SEAL indeed can improve the label quality, while dense-CRF performs even worse than original labels. In fact, the result of CASENet-C also indicates the decreased model performance.

Table 3. Non-IS results.

Non-Instance-Sensitive (non-IS) Mode.

We also train/evaluate under non-IS mode, with the evaluation using re-annotated SBD labels. Table 3 shows that the scores have high correlation with IS mode.

Table 4. Results on SBD test following the same benchmark and ground truths as [52].
Table 5. Results on the Cityscapes dataset. MF scores are measured by \(\%\).

Comparison with State of the Art.

Although proposing different evaluation criteria, we still follow [52] by training SEAL with instance-insensitive labels and evaluating with the same benchmark and ground truths. Results in Table 4 show that this work outperforms previous state of the art by a significant margin.

Fig. 5.

Qualitative comparison among ground truth, CASENet, CASENet-S, CASENet-C, and SEAL (ordering from left to right). Best viewed in color. (Color figure online)

6.4 Experiment on the Cityscapes Dataset

Results on Validation Set.

The Cityscapes dataset contains 2975 training images and 500 images as validation set. Following [52], we train SEAL on the training set and test on the validation set, with the results visualized in Fig. 6 and quantified in Table 5. Again, SEAL overall outperforms all compared baselines.

Alignment Visualization.

We show that misalignment can still be found on Cityscapes. Figure 7 shows misaligned labels and the corrections made by SEAL.

Fig. 6.

Qualitative comparison among ground truth, CASENet, CASENet-S, and SEAL (ordering from left to right in the figure). Best viewed in color. (Color figure online)

Fig. 7.

Edge alignment on Cityscapes. Color coding follows Fig. 2. Best viewed in color. (Color figure online)

7 Concluding Remarks

In this paper, we proposed SEAL: an end-to-end learning framework for joint edge alignment and learning. Our work considers a novel pixel-level noisy label learning problem, leveraging structured priors to address an open issue in edge learning. Extensive experiments demonstrate that the proposed framework is able to correct noisy labels and generate sharp edges with better quality.