# Learning Diverse Models: The Coulomb Structured Support Vector Machine

- 1 Citations
- 10k Downloads

## Abstract

In structured prediction, it is standard procedure to discriminatively train a single model that is then used to make a single prediction for each input. This practice is simple but risky in many ways. For instance, models are often designed with tractability rather than faithfulness in mind. To hedge against such model misspecification, it may be useful to train multiple models that all are a reasonable fit to the training data, but at least one of which may hopefully make more valid predictions than the single model in standard procedure. We propose the Coulomb Structured SVM (CSSVM) as a means to obtain at training time a full ensemble of different models. At test time, these models can run in parallel and independently to make diverse predictions. We demonstrate on challenging tasks from computer vision that some of these diverse predictions have significantly lower task loss than that of a single model, and improve over state-of-the-art diversity encouraging approaches.

## Keywords

Structured output learning Diverse predictions Multiple output learning Structured support vector machine## 1 Introduction

The success of large margin methods for structured output learning, such as the structured support vector machine (SSVM) [1], is partly due to their good generalization performances achieved on test data, compared to, *e.g.* maximum likelihood learning on structured models [2]. Despite such regularization strategies, however, it is not guaranteed that the model which optimizes the learning objective function really generalizes well to unseen data. Reasons include wrong model assumptions, noisy data, ambiguities in the data, missing features, insufficient training data, or a task loss which is too complex to model directly.

*(i)*generate multiple likely solutions from the model [3, 4, 5] or,

*(ii)*learn

*multiple*models which generate diverse predictions [6, 7, 8]. The different predictions for a given structured input may then be analyzed to compute robustness/uncertainty measures, or may be the input for a more complex model exploiting higher-order dependencies, as is done in re-ranking models,

*e.g.*Yadollahpour

*et al.*[9] augment their features with global ones for automatic re-ranking. Other successful applications include prediction of diverse hypotheses for machine translation [10], on-demand feature computation [11], or active learning methods [12, 13]. Furthermore, an oracle may choose amongst all predictions that one which is closest to the ground truth. This becomes handy for proof-reading tasks in order to keep manual interactions at a minimum. It is particularly beneficial in structured output spaces to present to the user not only similarly likely, but also

*diverse*proposal solutions. The set of diverse predictions may still contain a low-loss solution, even if the most likely prediction of the single model has a large loss. As a consequence, instead of minimizing the expected generalization error of a

*single*model in structured learning, (cf. Fig. 1(a)), it is favorable to minimize the expected generalization error

*amongst multiple*models, see Fig. 1(b, c).

Our main contribution is an algorithm termed the *Coulomb structured support vector machine* (CSSVM) which learns an ensemble of *M* models with different parameters, thanks to a corresponding diversity-encouraging prior. This is qualitatively different from previous work which requires that the *outputs* of the *M* models are diverse. In particular, we allow the *M* models in the ensemble to make identical predictions (and hence perfectly fit the data) at training time. Another benefit is that CSSVM can learn diverse models even if only a single structured training example is available. In Sect. 3.4, we generalize our algorithm to allow for structured clustering.

## 2 Related Work

One major research avenue is to generate at prediction time multiple (possibly diverse) solutions from a single previously trained structured model [3, 4, 5]. In order to find *M* similarly likely solutions, Yanover *et al.* [3] propose a message passing scheme to iteratively add constraints forbidding the previous solutions. Batra *et al.* [5] build on the same idea but incorporate these constraints directly into the objective function. This yields a deterministic framework which tries to find *diverse* solutions by requiring a minimum distance to the previous solutions. Their idea is extended in [14] to jointly infer diverse predictions at test time. Papandreou and Yuille [4], instead, perturb model parameters repeatedly with noise from a Gumbel distribution, and subsequently solve for the maximum-a-posteriori (MAP) solution to sample *M* plausible solutions. Their idea of perturbing the data term is natural when data is assumed to be noisy.

Sampling *M* solutions could of course also be achieved using Gibbs sampling or other MCMC techniques, however with very slow mixing time on general graphs; more efficient sampling strategies have been proposed recently [15]. Recent work aims at finding the *M* best modes of the probability distribution (local maxima) directly [16, 17]. While promising, their algorithms are yet not applicable to general graphs. Another recently discussed approach to sample diverse predictions at test time are determinantal point processes [18].

Rather than learning one model and then sampling successively (possibly diverse) solutions from the model, recent developments [6, 7, 8] allow to *train* multiple diverse models, *i.e.* diversity is already considered at training time. Typically, only one ground truth solution is provided per training sample rather than a diverse set, and thus diversity amongst the models can not be directly measured by means of training data. There are multiple works which tackle this challenge successfully: Gane *et al.* [8] learn (multi-modal) distributions over the perturbations in Perturb-and-MAP models using latent variable models which include inverse convex programs to determine relations between the model parameters and the MAP solution. Most similar to our work is [6, 7], where a set of *M* SSVMs is optimized while trading diversity versus data fit. In the former, diversity is encouraged through clustering: Each structured training example is assigned to the learner which achieves the lowest task loss for this sample in the current iteration. Their idea builds on the assumption that there are *M* clusters present in the training samples, thus requiring at least *M* (implicitly) diverse training samples. This requirement may be a crucial problem on small training sets. Our approach, in contrast, can learn *M* diverse models even if only one training example is present, as is often the case in CRF learning, *e.g.* co-segmentation (Sect. 4i), [19, 20]. In their more recent work, Guzman-Rivera *et al.* [7] extend their idea by augmenting the learning objective directly with a convex term which explicitly rewards diversity in the *outputs* of different learners, as also done in [21]. In our approach, in contrast, the diversity prior is posed on the *parameters* of the *M* models, and thus, all learners might achieve the same loss on the training samples while still providing diverse predictions on test data, *cf.* Fig. 1(b, c).

## 3 Coulomb Structured Support Vector Machine

The goal of this work is to learn *M* mappings from one structured input to *M* possibly *diverse* structured outputs from a training set \(\mathcal D = \{ (\mathbf x_i, \mathbf {y}_i) \}_{i=1,\ldots ,N}\).

### 3.1 Problem Description and Diversity Prior

*M*concurrent structured SVMs, which amounts to the following optimization problem:

*i*-th training example evaluated by the

*m*-th learner. \(\varOmega (W)\) is the regularization term on the parameters \(W=[\mathbf {w}_1,\ldots ,\mathbf {w}_M]\) (in SSVMs typically an

*L*2 regularizer is used on each single \(\mathbf {w}_i\)), and a bias term is omitted since it does not have an influence on the optimization problem [22]. Diversity amongst the

*M*learners is encouraged by the diversity prior \(\varGamma (W)\) on the parameters

*W*, where \(\alpha \) regulates the degree of diversity. In this way, \(\alpha =0\) reveals the standard SSVM formulation, since all

*M*weights converge to the same optimum.

For the ease of argument, let us now assume the training set is linearly separable^{1} as in Fig. 1. Moreover, assume that feature selection yielded independent features. Our illustration of the structured learning problem in Fig. 1 is analogous to representations of flat classification problems where we regard the ground truth labeling of the structured training samples as the single positive examples and all other (exponentially many) labelings as corresponding negative examples. The objective in Fig. 1(a) is to find a weight vector \(\mathbf {w}\) which separates the positive from the negative examples and maximizes the margin [1].

*version space*\(V(\mathcal D)\) analogously as in flat classification [24, 25], as

*i.e.*\(\mathbf {w} = \sum _{i=1}^N c_i \mathbf {x}_i\) for coefficients \(c_i\), and the version space may be restricted appropriately. Note that the error of a structured model induced from a weight vector in version space may still be large for randomly chosen query points (

*i.e.*high generalization error) in spite of achieving zero loss on the training set.

Typically, version space is only summarized by a single point such as the center of the largest inscribed sphere (the hard-margin SVM) or the center of mass of the version space (the Bayes point machine [26]). To learn an ensemble of classifiers, our goal is to distribute *M* weight vectors \(\mathbf {w}_m \in \mathcal W\), \(m=1,\ldots ,M\), in version space such that the most diverse predictions on unseen points are obtained. To this end, it is sufficient for structured models with energy functions linear in \(\mathbf {w}\) – similar to flat linear classification [27] – to only investigate weight vectors on the unit sphere (*i.e.* \(\Vert \mathbf {w} \Vert _2 = 1\)): At prediction time, labelings are scored by the energy function of the structured model \(E(\mathbf {x}, \mathbf {y}) = \mathbf {w}^\top f(\mathbf {x}, \mathbf {y})\), where \(f(\mathbf {x}, \mathbf {y})\) is the joint feature function. Replacing \(\mathbf {w}\) by \(\lambda \mathbf {w}\), \(\lambda > 0\), still yields the same ordering of the labelings.

We hence have to solve an experimental design problem on parts of the unit sphere to get an ensemble of *diverse* structured models, in other words – disregarding training data – we want to evenly distribute *M* points on the unit sphere. The goal of experimental design [28, 29] is to select from a set of possible experiments/configurations/parameter settings the subset with greatest expected merit. In our case, the set of experiments to choose from is the sphere \(||\mathbf {w} ||_2 = 1\). In other words, rather than sample the sphere uniformly, we need to bias our experimental design towards parameters that produce low empirical loss. Hence, we next introduce the repulsive diversity energy term \(\varGamma (W)\) which makes Eq. (1) a non-convex optimization problem, which we optimize approximately.

### 3.2 Diversity Through Coulomb Potential

*M*points evenly on the unit sphere is much studied in information theory and is known as a spherical code [30]: Different variants include

*sphere packing*(maximize the minimal angle between any two parameter vectors) and

*covering problems*(minimize the distance between any point on the sphere and the closest parameter vector). In three dimensions, the problem is known as the

*Thomson problem*

^{2}: The goal is to minimize the energy configuration of

*M*charges on a unit sphere while the charges repel each other with forces determined by Coulomb’s law. While yet unsolved exactly, approximate solutions have been proposed in the literature, including spiral approximations [31], subdivisions of polyhedrons [32], or gradient descent methods [33, 34, 35] which correspond to electrostatic repulsion simulations exploiting Coulomb’s law: Particles of equal charge repel each other with a force proportional to the square of their pairwise distance, the Coulomb force. More generally, in the equilibrium state of the

*M*particles \(\mathbf p_1,\ldots ,\mathbf p_M\) on the unit sphere, the Riesz energy,

### 3.3 Optimization by an Electrostatic Repulsion Model

*i.e.*\(\bar{\mathbf {w}}_m = \frac{\mathbf {w}_m}{\Vert \mathbf w_m\Vert }\), and use the Coulomb energy \(E_C\) as the diversity term

^{3}in Eq. (1),

*not*constrained to the unit sphere.

*cf.*Fig. 2),

*m*amounts to the force \(F_m^{RR}\),

*i.e.*the direction to go in the next optimization step when only considering the regularized risk. The

*L*2 regularized risk of

*one*learner is given by \(R_1(\mathbf {w}_m, \mathcal D) = \frac{1}{2} \Vert \mathbf {w}_m \Vert _2^2 + \frac{C}{N} \sum _{k=1}^N L(\mathbf {x}_k, \mathbf {y}_k; \mathbf {w}_m)\). When choosing the structured hinge loss \(L(\mathbf {x}_k,\mathbf {y}_k; \mathbf {w}_m) = \max _{\mathbf {y} \in \mathcal Y} \left( \varDelta (\mathbf {y}_k,\mathbf y) - \mathbf {w}_m^\top f(\mathbf {x}_k,\mathbf {y}) \right) + \mathbf {w}_m^\top f(\mathbf {x}_k,\mathbf {y}_k)\), where \(\varDelta (\mathbf {y}_k, \mathbf {y})\) is the task loss,

*f*the feature function, and \((\mathbf {x}_k,\mathbf {y}_k)\) are the training examples; then the subgradient \(\mathbf {g}^m_k\) for training example

*k*is given by

*i.e.*the regularized risk force on particle \(\mathbf {w}_m\) is \(F_m^{RR} = - \frac{1}{N} \sum _{k=1}^N \mathbf {g}^m_k\).

*t*and \(\mathbf {g}_l^m\) as in Eq. (7), then the update of \(\mathbf {w}_m\) is given by

*stochastic*subgradient algorithm with a random \(l \in \{1,\ldots ,N\}\). Note that element \([\mathbf {w}_m']_i\) may be projected to zero to guarantee submodular energies during training as proved in [39]. For initialization of the CSSVM, we train one SSVM to get the optimum \(\mathbf {w}_*\). Then

*M*random perturbations of \(\mathbf {w}_*\) give starting points for \(\mathbf {w}_1,\ldots \mathbf {w}_M\).

### 3.4 Extension: Structured Clustering

Our model suggests a straightforward extension to structured clustering: In the stochastic subgradient update given in Eq. (8), a random training sample is chosen for each learner to update the weight vector. Instead of random selection, a steered selection of training samples for each individual learner would increase diversity. Similarly to the structured K-means block-coordinate descent algorithm proposed in [6], we assign training examples to individual learners: After each subgradient iteration in Sect. 3.3, the task losses \(\varDelta (\mathbf {y}^m,\mathbf y_i; \mathbf {w}_m)\) between prediction \(\mathbf {y}^m\) and ground truth \(\mathbf {y}_i\) are computed for each learner *m*, \(m \in \{1,\ldots ,M\}\), and normalized over all learners, *i.e.* \(\pi _i^m = \frac{\varDelta (y^m,y_i;\mathbf {w}_m)}{\sum _{k=1}^M \varDelta (y^k,y_i; \mathbf {w}_k)}, \ \sum _m \pi _i^m = 1\).

*i*is then assigned to any of the

*M*learners according to some indicator vector \(\sigma (\varvec{\pi }_i)\), where \([\sigma (\varvec{\pi }_i)]_m = 1\) if training sample

*i*is assigned to learner

*m*, 0 otherwise. In Table 1, we propose different alternatives for the mapping \(\sigma (\cdot )\). The subgradient update step in Eq. (8) is then modified accordingly:

## 4 Experiments and Results

To evaluate the performance of our approach, we run experiments on three challenging tasks from computer vision: *(i)* co-segmentation, *(ii)* foreground/background segmentation, and *(iii)* semantic segmentation. We use the iCoseg [40] database for *(i)* and *(ii)* and PASCAL VOC 2010 [41] for *(iii)*. Note that for clearer comparison with previous work, we focus on the evaluation of our first stage model usually used in a two stage pipeline. The proposed method can be combined with any second stage model [10, 11, 12, 13, 42, 43, 44].

We implemented our algorithm in Python using the PyStruct [45] framework. The code is made available on https://github.com/martinsch/coulomb_ssvm. On all three tasks, we are comparing our results with the state-of-the-art diversity inducing methods Multiple Choice Learning [6] (MCL) and Diverse Multiple Choice Learning [7] (DivMCL), the Matlab implementations of which as well as their features/splitting criterions for the iCoseg dataset in task *(ii)* were kindly provided by the authors. The energies for tasks *(i)* and *(ii)* are submodular, and we thus use graph-cut as inference method; for the multi-label problem in *(iii)*, we utilize TRWS [46].

Generating *M* diverse outputs is particularly useful in early stages of cascaded approaches, where at a later stage, *e.g.* a human or a second complex model may choose the best of *M* predictions according to a higher-order loss function. The goal of our approach is, hence, to generate *M* diverse predictions some of which ought to achieve better task loss than the prediction of the single max-margin model. We therefore stick to the evaluation criterion as applied in prior works, where an oracle chooses the best out of *M* predictions. In this way, we can evaluate the usefulness of such an approach for cascade models. We relate to this loss as *pick best* error, *i.e.* the lowest task loss among the *M* predictions.

*(i) Co-Segmentation.* The design of the proposed CSSVM allows to learn an ensemble of diverse models on very small training sets, in fact, even on training sets which consist of one structured training example only. To demonstrate the usefulness of our approach on such tasks, we run experiments on a co-segmentation dataset. The goal in co-segmentation in general is the simultaneous segmentation of two images each containing similar objects [47]. In our experiments, we assume that a model can be learned on the annotations of one image to predict the segmentation of similar images. We choose six categories from the iCoseg database and use the superpixels and features from [7], their 12-dim. color features for the nodes and a contrast-sensitive and -insensitive Potts term for the edges.

The results for MCL, DivMCL, and our model are depicted in Fig. 3. For each category, we vary the number of models in the ensemble *M* from 1 to 10, where \(M=1\) may be viewed as the baseline and corresponds to the training of a standard SSVM. We perform a full \(N_c\)-fold crossvalidation on each category, where \(N_c\) is the number of images in category *c*, and report the test losses of all *M* models. We choose the regularization and diversity trade-off parameters of each method on a hold-out validation set consisting of three images per category. Note that these losses are computed on superpixel level rather than pixel level which makes for a fair comparison since all three models are using the same superpixels and features. In these datasets, \(N_c\)’s are in the range of 10 to 33, dependent on the dataset. Obviously, we use strategy “all” from Table 1 for these experiments.

It should be noted that, if we took the same implementations, exactly the same losses for all three competing models for \(M=1\) would be obtained (since all three models are direct generalizations of SSVM). The deviations here are probably due to different optimization strategies, *e.g.* different minima on a plateau or not enough iterations for the subgradient method (Div-/MCL use cutting-plane optimization instead).

On all six datasets, our method clearly improves over the baseline of only one SSVM (\(M=1\)) and achieves better pick-best errors for large *M* than MCL and DivMCL do, with the exception of the *speed-skating* category. We show for this category exemplarily, however, that our algorithm learns *M* models which are all performing similarly well while in DivMCL only few models are strong, and in MCL, there exists only one strong model since diversity is only encouraged by assigning the training samples (here: 1) to specific models (shown for \(M=10\) in bottom left of Fig. 3). The phenomenon that our method yields significantly better average errors across all predictors in the ensemble is also reflected in the histogram of all losses from the full cross validation, as provided in Fig. 3 bottom right. The fact that most of the predictions achieve low loss in the proposed CSSVM is a strong advantage when the model is used in a cascade model since *all* predictions are good candidates to be selected as the best solution.

*not*rewarded but models are distributed diversely in version space as argued in Sect. 3.1 in order to achieve a low generalization error on unseen data when the predictions of all

*M*models are considered jointly. This is in contrast to the competing methods, where diversity among the models is also enforced on the training set.

Possible mappings for the assignment of training samples to individual learners

\([\sigma (\varvec{\pi }_i)]_m =\) | Description | Abbrev. |
---|---|---|

1 | Assign the sample | |

Assign the sample | | |

Sample a learner index \(\hat{m}\) from the distribution defined by \(q_i^1,\ldots q_i^M\) and assign the sample | | |

Sample one training example index \(\hat{j}_m \in \{1,\ldots ,N\}\) for each learner \(m\in \{1,\ldots ,M\}\), | |

*(ii) Foreground/Background Segmentation.* In this experiment, we use all these categories together (166 images in total) and use the same split criterion for the 5-fold cross validation as in [7]. We train the models on one fold, select regularization and diversity trade off parameters on two validation folds and report the test error on the remaining two folds. Figure 4 presents the results for MCL, DivMCL, and our model with different sample assignment strategies as in Table 1. Since this dataset consists of different categories, it seems natural that the models which cluster the training data by assigning training instances to distinct models (as in Div-/MCL, Ours-sampled, and Ours-best) perform better than the models which try to fit all *M* models to the *entire* dataset (Ours-all, Ours-stochastic). Our model achieves similar accuracies as the state-of-the-art method DivMCL in this experiment.

*(iii) Semantic Segmentation.* We also evaluate our algorithm on the PASCAL VOC 2010 benchmark dataset for object class segmentation (challenge 5). The dataset consists of an official training set and validation set comprising 964 images each, which contain 21 object classes. We use the SLIC superpixels and Textonboost potentials [48] publicly available from [45]. Due to the lack of a publicly available test set, we are selecting the parameters of all three models on the official validation set and report these validation errors in Table 2 using the PASCAL VOC evaluation criterion, the Jaccard index. For structured learning, all models use a loss weighted by the inverse class frequency present in the training data. The baselines for this experiment are given by an \({{\mathrm{arg\,max}}}\) operation on our features (“unaries only”), a linear SVM on the unary features, and a structured SVM (\(M=1\)). With these publicly available features, these baselines achieve average accuracies of \(21.6\,\%\), \(27.4\,\%\), and \(29.1\,\%\) which is much lower than the current best results reported on this challenge. In this experiment, however, we want to focus on how much a baseline algorithm can be improved thanks to a diverse ensemble, and not indulge in feature and pipeline tuning.

*cf.*Table 1). This massive relative improvement underlines the usefulness of a diverse ensemble approach. MCL (\(35.0\,\%\)) and DivMCL (\(34.5\,\%\)) yield inferior performance.

Pascal VOC 2010 Validation Accuracy (higher is better). We tune a popular conditional random field [45] as baseline structured models (top rows). We here focus on the relative improvement that different diversity strategies can achieve (bottom rows), rather than tweaking the baseline model itself.

## 5 Conclusion

We propose an algorithm termed the *Coulomb structured support vector machine* which learns an ensemble of multiple models in order to yield *diverse* predictions on *test data*. The diversity prior is imposed on the set of model weights rather than on the outputs of training samples as in previous approaches. This allows for the training of diverse models even on a single structured training example. The CSSVM trades off diversity, large margins, and a data term during training in order to optimize the minimum expected generalization error of the *entire* ensemble. The coupling between the *M* models is effective only at training but not at test time. As a consequence, predictions can be made in parallel without communication overhead in contrast to [5]. Our algorithm learns multiple *strong* predictors in an ensemble on the entire dataset, other than [6, 7] where predictors ’focus’ on the different clusters in the data, if present. We demonstrate on numerous real world datasets that the *M* diverse outputs of the proposed ensemble method include predictions with significantly lower task loss compared to only one model. Moreover, our approach of inducing diversity significantly improves over state-of-the-art methods on very small training sets while staying on par with the state-of-the-art methods on bigger training sets. The usefulness for machine learning tasks beyond computer vision is evident.

## Footnotes

- 1.
Note that this is almost always true once we have a sufficient number of independent features, see the function counting theorem [23].

- 2.
Note that we want to approximate this problem in a high dimensional space instead of only 3 dimensions.

- 3.
Note that we assume electrostatic charges on the

*parameters*, and not the*training samples*as done in [37].

## Notes

### Acknowledgements

We would like to thank Abner Guzman-Rivera for making the (Div)MCL source code available.

## References

- 1.Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. JMLR
**6**, 1453–1484 (2005)MathSciNetzbMATHGoogle Scholar - 2.Nowozin, S., Lampert, C.H.: Structured learning and prediction in computer vision. Found. Trends Comput. Graph. Vis.
**6**(3–4), 185–365 (2011)zbMATHGoogle Scholar - 3.Yanover, C., Weiss, Y.: Finding the M most probable configurations in arbitrary graphical models. In: NIPS 2003, pp. 289–296 (2003)Google Scholar
- 4.Papandreou, G., Yuille, A.L.: Perturb-and-map random fields: using discrete optimization to learn and sample from energy models. In: ICCV (2011)Google Scholar
- 5.Batra, D., Yadollahpour, P., Guzman-Rivera, A., Shakhnarovich, G.: Diverse M-best solutions in Markov random fields. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 1–16. Springer, Heidelberg (2012)Google Scholar
- 6.Guzman-Rivera, A., Batra, D., Kohli, P.: Multiple choice learning: learning to produce multiple structured outputs. In: NIPS, pp. 1808–1816 (2012)Google Scholar
- 7.Guzman-Rivera, A., Kohli, P., Batra, D., Rutenbar, R.A.: Efficiently enforcing diversity in multi-output structured prediction. In: AISTATS (2014)Google Scholar
- 8.Gane, A., Hazan, T., Jaakkola, T.: Learning with maximum a-posteriori perturbation models. In: AISTATS, pp. 247–256 (2014)Google Scholar
- 9.Yadollahpour, P., Batra, D., Shakhnarovich, G.: Discriminative re-ranking of diverse segmentations. In: CVPR (2013)Google Scholar
- 10.Gimpel, K., Batra, D., Dyer, C., Shakhnarovich, G.: A systematic exploration of diversity in machine translation. In: EMNLP (2013)Google Scholar
- 11.Roig, G., Boix, X., de Nijs, R., Ramos, S., Kühnlenz, K., Van Gool, L.: Active MAP inference in CRFs for efficient semantic segmentation. In: ICCV (2013)Google Scholar
- 12.Maji, S., Hazan, T., Jaakkola, T.: Active boundary annotation using random map perturbations. In: AISTATS (2014)Google Scholar
- 13.Premachandran, V., Tarlow, D., Batra, D.: Empirical minimum bayes risk prediction: how to extract an extra few % performance from vision models with just three more parameters. In: CVPR (2014)Google Scholar
- 14.Kirillov, A., Savchynskyy, B., Schlesinger, D., Vetrov, D., Rother, C.: Inferring m-best diverse labelings in a single one. In: ICCV, pp. 1814–1822 (2015)Google Scholar
- 15.Hazan, T., Maji, S., Jaakkola, T.: On sampling from the Gibbs distribution with random maximum a-posteriori perturbations. In: NIPS, pp. 1268–1276 (2013)Google Scholar
- 16.Chen, C., Kolmogorov, V., Zhu, Y., Metaxas, D., Lampert, C.: Computing the M most probable modes of a graphical model. In: AISTATS, pp. 161–169 (2013)Google Scholar
- 17.Chen, C., Liu, H., Metaxas, D., Zhao, T.: Mode estimation for high dimensional discrete tree graphical models. In: NIPS, pp. 1323–1331 (2014)Google Scholar
- 18.Kulesza, A., Taskar, B.: Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083 (2012)Google Scholar
- 19.Lucchi, A., Li, Y., Smith, K., Fua, P.: Structured image segmentation using kernelized features. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 400–413. Springer, Heidelberg (2012)Google Scholar
- 20.Lou, X., Hamprecht, F.A.: Structured learning for cell tracking. In: NIPS (2011)Google Scholar
- 21.Li, Y.F., Zhou, Z.H.: Towards making unlabeled data never hurt. IEEE Trans. PAMI
**37**(1), 175–188 (2015)CrossRefGoogle Scholar - 22.Lampert, C.H.: Maximum margin multi-label structured prediction. In: NIPS, pp. 289–297 (2011)Google Scholar
- 23.Cover, T.M.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput.
**EC–14**(3), 326–334 (1965)CrossRefzbMATHGoogle Scholar - 24.Mitchell, T.M.: Machine Learning, 1st edn. McGraw-Hill Inc., New York (1997)zbMATHGoogle Scholar
- 25.Herbrich, R., Graepel, T., Williamson, R.C.: The structure of version space. Technical report MSR-TR-2004-63, Microsoft Research, July 2004Google Scholar
- 26.Herbrich, R., Graepel, T., Campbell, C.: Bayes point machines. JMLR
**1**, 245–279 (2001)MathSciNetzbMATHGoogle Scholar - 27.Graepel, T., Herbrich, R.: The kernel Gibbs sampler. In: NIPS, pp. 514–520 (2001)Google Scholar
- 28.Sacks, J., Welch, W.J., Mitchell, T.J., Wynn, H.P.: Design and analysis of computer experiments. Stat. Sci.
**4**, 409–423 (1989)MathSciNetCrossRefzbMATHGoogle Scholar - 29.Hardin, R., Sloane, N.: A new approach to the construction of optimal designs. J. Stat. Plann. Infer.
**37**(3), 339–369 (1993)MathSciNetCrossRefzbMATHGoogle Scholar - 30.Conway, J.H., Sloane, N.J.A.: Sphere-packings, Lattices, and Groups. Springer, New York (1987)zbMATHGoogle Scholar
- 31.Saff, E.B., Kuijlaars, A.B.: Distributing many points on a sphere. Math. Intell.
**19**(1), 5–11 (1997)MathSciNetCrossRefzbMATHGoogle Scholar - 32.Katanforoush, A., Shahshahani, M.: Distributing points on the sphere, I. Exp. Math.
**12**(2), 199–209 (2003)MathSciNetCrossRefzbMATHGoogle Scholar - 33.Claxton, T., Benson, G.: Stereochemistry and seven coordination. Can. J. Chem.
**44**(2), 157–163 (1966)CrossRefGoogle Scholar - 34.Erber, T., Hockney, G.: Equilibrium configurations of n equal charges on a sphere. J. Phys. A: Math. Gen.
**24**(23), L1369 (1991)CrossRefGoogle Scholar - 35.Lakhbab, H., EL Bernoussi, S., EL Harif, A.: Energy minimization of point charges on a sphere with a spectral projected gradient method. Int. J. Sci. Eng. Res.
**3**(5) (2012)Google Scholar - 36.Neubauer, S., Watkins, Z.: An algorithm for finding potential minimizing configurations of points on a sphere (1998). http://www.csun.edu/~hcmth007/electrons/algorithm.html. Accessed 30 Aug 2016
- 37.Hochreiter, S., Mozer, M.C., Obermayer, K.: Coulomb classifiers: generalizing support vector machines via an analogy to electrostatic systems. In: NIPS, pp. 561–568 (2003)Google Scholar
- 38.Ratliff, N.D., Bagnell, J.A., Zinkevich, M.A.: (Online) Subgradient methods for structured prediction. In: AISTATS (2007)Google Scholar
- 39.Prasad, A., Jegelka, S., Batra, D.: Submodular meets structured: finding diverse subsets in exponentially-large structured item sets. In: NIPS, pp. 2645–2653 (2014)Google Scholar
- 40.Batra, D., Kowdle, A., Parikh, D., Luo, J., Chen, T.: iCoseg: interactive co-segmentation with intelligent scribble guidance. In: CVPR (2010)Google Scholar
- 41.Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge. Int. J. Comput. Vis.
**88**, 303–338 (2010)CrossRefGoogle Scholar - 42.Tsai, Y.H., Yang, J., Yang, M.H.: Decomposed learning for joint object segmentation and categorization. In: BMVC (2013)Google Scholar
- 43.Lee, T., Fidler, S., Dickinson, S.: Learning to combine mid-level cues for object proposal generation. In: CVPR, pp. 1680–1688 (2015)Google Scholar
- 44.Wang, S., Fidler, S., Urtasun, R.: Lost shopping! monocular localization in large indoor spaces. In: ICCV (2015)Google Scholar
- 45.Müller, A.C., Behnke, S.: PyStruct - learning structured prediction in python. JMLR
**15**, 2055–2060 (2014)MathSciNetzbMATHGoogle Scholar - 46.Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. IEEE Trans. PAMI
**28**(10), 1568–1583 (2006)CrossRefGoogle Scholar - 47.Rother, C., Minka, T., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs. In: CVPR, pp. 993–1000 (2006)Google Scholar
- 48.Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2011)Google Scholar