1 Introduction

Humans have a foveated visual system: only a small central part of the retina has a receptor density high enough to perceive the fine details of a scene. Humans therefore make eye movements to place the high-resolution fovea on things they want to see. Understanding where they choose to look is thus an important component of understanding behaviour.

A long-standing account of bottom-up attentional guidance posits the existence of a “saliency map” (or maps) in the human brain [26, 48]. Here, a saliency map represents spatial importance, usually defined as local contrast in low-level features such as luminance, color or orientation. Since Itti and Koch turned this concept into their seminal image-based model [17], a large number of models have been proposed for predicting fixations from image features, e.g. [1, 6, 15, 24, 25, 55, 56], and more recently many models based on deep learning, e.g. [16, 28, 30, 31, 36, 49]; see [4, 19] for extensive reviews of the literature. New models are published on a regular basis, with contributions coming mainly from the computer vision and psychology communities. It has been extensively discussed which effects are important for fixation prediction, from low- and high-level influences [3, 7, 12, 14, 18, 31, 50] to biases [8, 44–46], tasks [27, 41, 43] and semantic effects [11]. Over time, the concept of a saliency map has moved away from its origins in low-level feature integration, and can now refer more generally to “a map that predicts fixations”. In practice, saliency maps are now synonymous with saliency models.

The large number of models created the need for quantitative metrics to assess progress in the field and compare models. Many different metrics have been proposed. The AUC-type metrics [45] used to be the most common, but recent years have seen a shift towards metrics such as CC [22], NSS [37] and SIM [23]; more recently, the information gain metric has been proposed [32]. For an overview of the different metrics in use see e.g. [4, 23]. The community uses these metrics in benchmarks to keep track of progress: the MIT Saliency Benchmark [9, 23] and the LSUN Challenge [21, 52–54].

The most widely accepted benchmark, the MIT Saliency Benchmark, evaluates submissions under eight different metrics. Depending on which metric one chooses, the model rankings and performances change dramatically. This fact has led to substantial research analyzing the differences between metrics and giving recommendations on which metric to use in which situation [10, 33, 38–40, 51]. Other authors have instead proposed new approaches to modeling and evaluation: modeling fixations as point processes [2, 42], alternative loss functions [20] and GLMMs [35].

The general conclusion in the field is that the metrics measure qualitatively different things [10, 40, 51], and that it is even conceptually impossible to determine a best model independent of the different metrics. Recently, Kümmerer et al. [32] argued for a unique ranking of models by showing that much of the disagreement between metrics can be removed by postprocessing the saliency maps, optimizing the saliency scale and smoothing kernel for information gain (IG, essentially log-likelihood).

However, this does not seem to be a satisfactory solution: for one, this approach requires access to all models one wants to compare to and needs tedious postprocessing for each of them. In addition to this practical barrier, the approach suffers from the major conceptual shortcoming that optimizing for IG cannot be optimal for all metrics. In fact, we show below that the log densities proposed in [32] perform suboptimally on most metrics and can still produce inconsistent rankings. Ideally, one would like a model to be able to compete in all metrics on the metric’s original scale with other models, even with models that are directly optimized for that metric and for which only the metric performances are known. This is not possible when evaluating on log densities as proposed in [32].

In fact, we show in this paper that even with knowledge of the true fixation distribution, no single saliency map can perform well in all metrics. In practice, however, researchers must still decide on a particular saliency map to submit to the benchmark. Therefore, their model cannot compete with state-of-the-art models in all metrics – not because the model is intrinsically bad on those metrics, but because different metrics require the saliency maps to look different, independent of the encoded information about fixation placement (see Fig. 1). As long as one evaluates all saliency metrics on the same saliency maps, it is impossible to solve the benchmarking problem.

Here, we argue that the fundamental problem is that saliency models and saliency maps are considered to be the same. A major insight from Bayesian decision theory is that the derivation of optimal decisions can be decomposed into a task-independent probability distribution over possible outcomes of an experiment and a task-dependent error metric. In the saliency setting, one decides on a saliency map to submit to a certain metric. Correspondingly, saliency models should be defined as metric-independent probability densities over possible fixations and subsequently many different metric-dependent saliency maps can be derived from the same density for different error metrics.

We show that saliency maps for the most influential metrics AUC, sAUC, NSS, CC, SIM, and KL-Div can be derived from fixation densities in a principled way. We demonstrate the validity of our approach on real models and real data. By decoupling the notions of saliency models and saliency maps, saliency models can be meaningfully compared on all metrics in their original scale, and the MIT saliency benchmark will implement our suggested approach.

Fig. 1.

No single saliency map can perform best in all metrics even when the true fixation distribution is known. This problem can be solved by separating saliency models from saliency maps. (a) Fixations are distributed according to a ground truth fixation density \(p(x, y \mid I)\) for some stimulus I (see supplementary material for details on the visualization). (b) This ground truth density predicts different saliency maps depending on the intended metric. The saliency maps differ dramatically due to the different properties of the metrics but always reflect the same underlying model. Note that the maps for the NSS and IG metrics are the same, as are those for CC and KL-Div. (c) Performances of the saliency maps from (b) under seven saliency metrics on a large number of fixations sampled from the model distribution in (a). Colors of the bars correspond to the frame colors in (b). The predicted saliency map for the specific metric (framed bar) yields best performance in all cases.

2 Theory

Motivated by the line of thought presented above, we propose the following definitions:

  1. A saliency model predicts a fixation probability density \(p(x, y \mid I)\) given an image I.

  2. A saliency metric is a performance measure for a saliency map on ground truth data.

  3. A saliency map \(s_{p, \text {metric}}(x, y, I)\) is a metric-specific prediction derived from the model density.

It has been argued before that formulating saliency models as probabilistic models is advantageous (e.g. [2, 32]). In this definition, a saliency model predicts a fixation probability density, that is, the probability \(p(x, y \mid I)\) of observing a fixation at a given pixel in a given image (see Footnote 1). The three definitions we propose above follow the rationale of Bayesian decision theory: the saliency model is a posterior density over all possible events and the saliency metric is a utility function. Based on the posterior density and the utility function, a saliency map is then chosen to maximize the expected utility.

2.1 Predicting Saliency Maps from Saliency Models

From the predicted fixation density of a model, one can use expected utility maximization to derive the saliency map which the model expects to yield highest performance in some metric (see Footnote 2).

Evaluating a saliency metric involves a saliency map \(s(x, y \mid I)\) for a stimulus I and ground truth fixation data \((x_i, y_i)\). Therefore, we can phrase a metric as a function \(M[s(x, y \mid I); (x_1, y_1), \dots , (x_n, y_n)]\). Note that some metrics, such as CC or SIM, use an empirical saliency map instead of ground truth fixations (so-called distribution-based metrics). However, the empirical saliency map is always constructed from ground truth fixations, usually by convolving them with a Gaussian. This can be taken to be part of the metric evaluation, as we will demonstrate below. Simplifying notation with \(D = (x_1, y_1), \dots , (x_n, y_n)\), the metric evaluation can be written as

$$ M[s(x, y \mid I); D]. $$

Assuming that the fixations are distributed according to some distribution \((x_i, y_i) \sim p(x, y \mid I)\) and therefore \(D \sim \prod _{i=1}^n p(x, y \mid I)\), the expected performance of the metric on a saliency map is \(\mathbb {E}_D M[s(x, y \mid I); D]\). One should choose the saliency map which is expected to yield highest performance for the metric M: that is, the solution of

$$ \max _{s(x,y\mid I)} \mathbb {E}_D M[s(x, y \mid I); D] $$

Solving this optimization problem for a fixation distribution p given by a model of interest essentially answers the following question: if we assume that the unknown fixations, on which the saliency map will later be evaluated, come from the model density p (and therefore \(D \sim \prod _{i=1}^n p\)), what would be the best saliency map to use for metric M? For a metric M, the solutions to the optimization problem give rise to a transformation \(p(x, y \mid I) \mapsto s_M(x, y \mid I)\) from fixation densities to derived metric-specific saliency maps. While the optimization problem might be hard in general, for most commonly-used saliency metrics it can be solved exactly or approximately, as we show below. Importantly, the methods we outline here are deterministic transformations depending only on the model’s density prediction. No optimization using ground truth data is necessary.
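To make this expected-utility view concrete, the following sketch (Python with numpy; all function names are ours, not part of any existing toolbox) estimates \(\mathbb {E}_D M[s(x, y \mid I); D]\) for a candidate saliency map by sampling fixation sets from a model density and averaging the metric scores. When no closed-form solution is available, candidate maps can be compared this way.

```python
import numpy as np

def sample_fixations(density, n_fix, rng):
    """Sample n_fix fixation positions (rows, cols) from a 2D probability density."""
    flat = density.ravel() / density.sum()
    idx = rng.choice(flat.size, size=n_fix, p=flat)
    return np.unravel_index(idx, density.shape)

def expected_metric(saliency_map, density, metric, n_fix=100, n_sets=1000, seed=0):
    """Monte Carlo estimate of E_D M[s; D] with D ~ prod_i p(x, y | I)."""
    rng = np.random.default_rng(seed)
    scores = [metric(saliency_map, *sample_fixations(density, n_fix, rng))
              for _ in range(n_sets)]
    return float(np.mean(scores))

# example metric: NSS, the mean z-scored saliency value at the fixated pixels
def nss(saliency_map, rows, cols):
    z = (saliency_map - saliency_map.mean()) / saliency_map.std()
    return z[rows, cols].mean()
```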

In the following we give exact or approximate solutions for six of the most widely used metrics, including three metrics which operate directly on ground truth fixations (AUC, sAUC, and NSS) and three distribution-based metrics which first convert the ground truth fixations into an empirical saliency map (CC, SIM, KL-Div). Additionally we include the IG metric introduced in [32], since we use this metric for converting existing saliency map models to probabilistic models.

AUC, sAUC. The AUC-type metrics (“Area Under the Curve”, [45]) measure the model performance in a 2AFC (two-alternative forced choice) task where the model has to decide which one of two locations has been fixated: in a 2AFC task, a system is presented with one signal and one noise stimulus and chooses which stimulus is the “signal”. In the case of the AUC in saliency, signal and noise correspond to fixated and non-fixated image locations respectively (see supplementary material for a proof of the equivalence between the ROC curve and the 2AFC task). Denote the model’s fixation distribution by \(p_\text {fix}(x, y)\), the nonfixation distribution by \(p_\text {nonfix}(x, y)\) (uniform for AUC, the image-independent center bias for sAUC), and the two locations by \((x_1, y_1)\) and \((x_2, y_2)\). The 2AFC task reduces to deciding whether these points are sampled from \(p_\text {fix} \times p_\text {nonfix}\) or from \(p_\text {nonfix} \times p_\text {fix}\). The likelihoods of the two points given these two distributions are \(p_\text {fix}(x_1, y_1) p_\text {nonfix}(x_2, y_2)\) and \(p_\text {nonfix}(x_1, y_1) p_\text {fix}(x_2, y_2)\) respectively. The model expects optimal performance by choosing the distribution which has higher likelihood, or equivalently, the point for which \(p_\text {fix}(x, y)/p_\text {nonfix}(x, y)\) has the higher value. Therefore the model should expect the saliency map \(p_\text {fix}(x, y)/p_\text {nonfix}(x, y)\) to yield highest performance. In the special case of the standard AUC metric, \(p_\text {nonfix}\) is constant and the saliency map boils down to \(p_\text {fix}\). An additional practical consideration is that the MIT benchmark currently only accepts submissions as JPEG images. To compensate for this limited precision and possible JPEG artefacts, one should additionally histogram-equalize the saliency map (see Supplementary Material).
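As an illustration, a minimal numpy sketch of this derivation (function names are ours): the sAUC map is the ratio of the model density to the center bias density, the AUC map is the ratio to a uniform density (i.e. the density itself), and both are histogram-equalized to be robust against the 8-bit JPEG quantization.

```python
import numpy as np

def histogram_equalize(saliency_map):
    """Replace each saliency value by its rank quantile, yielding a uniform histogram over pixels."""
    flat = saliency_map.ravel()
    ranks = np.argsort(np.argsort(flat))              # rank of each pixel value
    return (ranks / (flat.size - 1)).reshape(saliency_map.shape)

def auc_saliency_map(density):
    # nonfixation distribution is uniform, so the likelihood ratio is the density itself
    return histogram_equalize(density)

def sauc_saliency_map(density, centerbias_density):
    # nonfixation distribution is the image-independent center bias
    return histogram_equalize(density / centerbias_density)
```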

NSS. The Normalized Scanpath Saliency (NSS, [37]) performance of a saliency map model is defined to be the average saliency value of fixated pixels in the normalized (zero mean, unit variance) saliency maps (i.e., the average z-score of the fixated saliency values).

We can show analytically that one should expect the highest NSS score from the predicted fixation density itself: given an image with N pixels let the probability for a single fixation falling onto pixel i be \(p_i\). Then the expected NSS of a saliency map \(q=(q_1, \dots , q_N)\) with \(\tfrac{1}{N} \sum _i q_i = \bar{q} = 0\), \(\Vert q \Vert _2^2 = 1\) is \(\sum _i^N p_i \cdot q_i = \langle p, q\rangle \). Finding the saliency map with the best possible NSS is equivalent to finding the solution of the problem

$$ \max \langle p, q\rangle \quad s.t.\ \bar{q}= 0, \Vert q\Vert ^2=1 $$

Since \(q \mapsto q' = \bar{p} + \alpha q\) with \(\alpha = \sqrt{\Vert p\Vert ^2 - 1/N}\) induces a maximum-preserving bijection between \(\{q \mid \bar{q}=0, \Vert q\Vert ^2 = 1\}\) and \(\{q' \mid \bar{q}' = \bar{p} = 1/N, \Vert q'\Vert ^2 = \Vert p\Vert ^2\}\), we can look for the maximum of \(\langle p, q'\rangle \quad s.t.\ \bar{q}'=\bar{p}, \Vert q'\Vert ^2=\Vert p\Vert ^2\) instead (and normalize q afterwards to get the normalized saliency map). Because of \(\langle x, y \rangle = \tfrac{1}{2} (\Vert x\Vert ^2 + \Vert y\Vert ^2 - \Vert x-y\Vert ^2)\), the maximum under these conditions is identical to the minimum of \(\Vert p - q'\Vert ^2\), which is attained at \(q' = p\).

Therefore, the best possible saliency map with respect to NSS is the density of the fixation distribution.
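This claim can also be checked numerically. In the following toy sketch (numpy/scipy, synthetic density), the expected NSS \(\sum _i p_i z(q)_i\) is highest when the candidate map q is the density itself, and lower for, e.g., a blurred or otherwise distorted version of it.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def expected_nss(density, candidate):
    """Expected NSS of a candidate map under fixation density p: sum_i p_i * z(q)_i."""
    z = (candidate - candidate.mean()) / candidate.std()
    return float((density * z).sum())

rng = np.random.default_rng(0)
p = gaussian_filter(rng.random((64, 64)), 3)   # smooth toy fixation density
p /= p.sum()

print(expected_nss(p, p))                       # the density itself: highest value
print(expected_nss(p, gaussian_filter(p, 5)))   # blurred density: lower
print(expected_nss(p, np.sqrt(p)))              # nonlinearly distorted density: lower
```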

IG. The information gain (IG, [32]) metric requires the saliency map to be a probability distribution and compares the average log-probability of fixated pixels to that given by a baseline model (usually the center bias or a uniform model). The optimal saliency map for IG depends on how the metric interprets saliency maps as probability densities. We normalize the saliency maps to be probability vectors (nonnegative, unit sum) and in this case the predicted density itself yields the highest expected performance: Let \(p = (p_1, \dots , p_N)\) with \(p \ge 0\), \(\sum _i p_i = 1\) denote the predicted probabilities for each pixel and q with \(q \ge 0\), \(\sum _i q_i = 1\) a saliency map. Let \(p_{bl} = (p_{bl,1}, \dots , p_{bl,N})\) be the pixel probabilities of the baseline model. Then the expected IG of q is \(\mathbb {E}_p IG(q) = \sum _i p_i(\log q_i - \log p_{bl,i})\) and its maximizer is

$$ \mathop {\mathrm {argmax}}_q \mathbb {E}_p IG(q) = \mathop {\mathrm {argmax}}_q \sum _i p_i (\log q_i - \log p_{bl, i}) = \mathop {\mathrm {argmax}}_q \sum _i p_i \log q_i = \mathop {\mathrm {argmin}}_q \sum _i p_i(\log p_i - \log q_i) = \mathop {\mathrm {argmin}}_q KL[p, q] = p. $$
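The same kind of numerical check works for IG (toy sketch with numpy): the expected IG \(\sum _i p_i (\log q_i - \log p_{bl,i})\) is maximized by \(q = p\), as the derivation above shows.

```python
import numpy as np

def expected_ig(p, q, baseline):
    """Expected information gain (bits per fixation) of saliency map q under fixation density p."""
    return float((p * (np.log2(q) - np.log2(baseline))).sum())

rng = np.random.default_rng(0)
p = rng.random(1000); p /= p.sum()              # toy fixation density over 1000 "pixels"
baseline = np.full(1000, 1 / 1000)              # uniform baseline model
q = rng.random(1000); q /= q.sum()              # some other density

print(expected_ig(p, p, baseline))              # optimal: the predicted density itself
print(expected_ig(p, q, baseline))              # any other density scores lower
```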

CC. The correlation coefficient (CC, [22]) measures the correlation between the model saliency map and the empirical saliency map after normalizing both saliency maps to have zero mean and unit variance. This is equivalent to measuring the squared Euclidean distance between the normalized predicted saliency map and the normalized empirical saliency map. The expected squared Euclidean distance to a random variable is minimized by its expectation. Therefore the optimal saliency map with respect to CC is the expected normalized empirical saliency map.

This shows that predicting the optimal saliency map for CC crucially depends on how the empirical saliency maps are computed. Empirical saliency maps are typically computed by blurring observed fixation positions from eye movement data with a Gaussian kernel of a certain size. In this case the expected empirical saliency map would be \( \mathbb {E}_{(x_i, y_i) \sim p} \left[ \frac{1}{N} \sum _{i=1}^N G_\sigma (x - x_i, y - y_i) \right] = \frac{1}{N}\sum _{i=1}^N \mathbb {E}_{(x_i, y_i) \sim p} \left[ G_\sigma (x - x_i, y - y_i) \right] = \frac{1}{N} \sum _{i=1}^N (G_\sigma * p)(x, y) = (G_\sigma * p)(x, y), \) that is, the density blurred with a Gaussian kernel of size \(\sigma \).

Unfortunately, the expected empirical saliency map is not the expected normalized empirical saliency map, which was shown above to be optimal for CC. Normalization involves subtracting the mean and dividing by the standard deviation, and the latter is nonlinear. Effectively, normalizing the variance just changes the weight by which the different empirical saliency maps are averaged in the expectation value. As long as the variances of the different empirical saliency maps do not differ too much, this will not have much of an effect, and our simulations suggest that this is the case (Supplementary Material). Therefore, as an approximation to the expected normalized empirical saliency map, we use the expected empirical saliency map in this paper, which is computed by convolving the predicted density with a Gaussian.
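In code, this approximation is simply a Gaussian blur of the predicted density (scipy sketch; the kernel size must match the one used to construct the empirical saliency maps):

```python
from scipy.ndimage import gaussian_filter

def cc_saliency_map(density, sigma=35):
    """Approximately optimal CC (and KL-Div) saliency map: the density convolved with G_sigma."""
    return gaussian_filter(density, sigma)
```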

Obviously, if more involved techniques are used to compute the empirical saliency maps (e.g. cross-validation of the kernel size as in [32]), then the expected empirical saliency map is harder or even impossible to calculate analytically. However, one can still approximate it numerically by sampling fixation sets from the predicted fixation distribution, computing their normalized empirical saliency maps, and averaging them.
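A sketch of such a numerical approximation (numpy/scipy; helper logic is ours): sample fixation sets from the model density, build each normalized empirical saliency map exactly as the metric does, and average the results.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def expected_normalized_empirical_map(density, sigma=35, n_fix=100, n_sets=200, seed=0):
    """Monte Carlo estimate of the expected normalized empirical saliency map under the model density."""
    rng = np.random.default_rng(seed)
    flat = density.ravel() / density.sum()
    acc = np.zeros_like(density, dtype=np.float64)
    for _ in range(n_sets):
        idx = rng.choice(flat.size, size=n_fix, p=flat)
        rows, cols = np.unravel_index(idx, density.shape)
        emp = np.zeros_like(acc)
        np.add.at(emp, (rows, cols), 1.0)          # histogram of sampled fixations
        emp = gaussian_filter(emp, sigma)          # empirical saliency map
        acc += (emp - emp.mean()) / emp.std()      # normalize exactly as the CC metric does
    return acc / n_sets
```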

KL-Div. The KL-Div metric computes the Kullback-Leibler divergence between the empirical saliency maps and the model saliency maps after converting both of them into probability distributions (by making them nonnegative and normalizing them to have unit sum). Therefore, unlike for most other metrics, lower KL-Div values are better.

We can show that for the KL-Div metric, the expected empirical saliency map yields the best expected performance: let \(e = (e_1, \dots , e_N)\) with \(e \ge 0\), \(\sum _i e_i = 1\) denote the random variable representing the empirical saliency map and q with \(q \ge 0\), \(\sum _i q_i = 1\) the model saliency map. Then we are looking for the q which minimizes \(\mathbb {E}_p KL[e, q]\). Since \(\mathbb {E}_p[KL[e, q]] = \mathbb {E}_p \left[ \sum _i e_i \log \frac{e_i}{q_i}\right] = \mathbb {E}_p\left[ \sum _i e_i \log e_i \right] - \sum _i \mathbb {E}_p[e_i] \log q_i, \) this is equivalent to finding the maximum of \(\sum _i \mathbb {E}_p[e_i] \log q_i\), which is in turn equivalent to finding the minimum of \(\sum _i \mathbb {E}_p[e_i] \log \mathbb {E}_p[e_i] -\sum _i \mathbb {E}_p[e_i] \log q_i = KL[\mathbb {E}_p[e], q]\). This is minimized by \(q=\mathbb {E}_p[e]\), the expected empirical saliency map. As for CC, this is the density blurred with the same kernel size as used for the empirical saliency map.

SIM. The Similarity (SIM, [23]) metric normalizes the model saliency map and the empirical saliency map to be probability vectors (in the same way as KL-Div) and sums the pixelwise minimum of the two saliency maps. As opposed to the CC metric, which can be interpreted as measuring the \(l_2\)-distance between normalized saliency maps, this effectively measures the \(l_1\)-distance between saliency maps, since \( \sum _i \min (p_i, q_i) = \sum _i \frac{1}{2} \left( p_i + q_i - |p_i - q_i| \right) = 1 - \frac{1}{2} \Vert p - q \Vert _1. \) This optimization problem cannot be solved analytically in general. Instead we solve it numerically: we perform a constrained stochastic gradient descent on sets of fixations sampled from the probability density (see Sect. 3 for details and the sketch below). Note that the optimal saliency map for SIM, unlike all other saliency maps presented here, depends on the number of fixations per image (see the Supplement for details on this effect).
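A minimal sketch of one way to implement this numerically (numpy/scipy; the simplex projection, initialization, step size and iteration count are our own choices, not the exact implementation used for the experiments below): projected stochastic subgradient ascent on the expected SIM score, using empirical saliency maps built from fixations sampled from the model density.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def project_to_simplex(v):
    """Euclidean projection onto the probability simplex (nonnegative, unit sum)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(u)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def sim_saliency_map(density, sigma=35, n_fix=100, n_steps=2000, lr=1e-3, seed=0):
    """Maximize the expected SIM score by projected stochastic subgradient ascent (a sketch)."""
    rng = np.random.default_rng(seed)
    flat = density.ravel() / density.sum()
    q = gaussian_filter(density, sigma).ravel()        # initialize with the CC saliency map
    q = project_to_simplex(q / q.sum())
    for _ in range(n_steps):
        idx = rng.choice(flat.size, size=n_fix, p=flat)
        emp = np.zeros_like(q)
        np.add.at(emp, idx, 1.0)
        emp = gaussian_filter(emp.reshape(density.shape), sigma).ravel()
        emp /= emp.sum()                               # normalized empirical saliency map
        grad = (q < emp).astype(np.float64)            # subgradient of sum_i min(q_i, e_i)
        q = project_to_simplex(q + lr * grad)          # step size needs tuning per image size
    return q.reshape(density.shape)
```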

3 Experiments and Results

We use the pysaliency toolbox [29] to compute saliency metrics (see Supplement for details). From a probability density over an image we compute five types of saliency maps: AUC saliency maps are created by equalizing the probability density to yield a uniform histogram over all pixels. sAUC saliency maps are created by dividing the probability density by the center bias density and again equalizing the saliency map to yield a uniform histogram over all pixels. The center bias density was estimated using a Gaussian kernel density estimate over all fixations from the MIT1003 dataset and cross-validated across images (see the sketch below). NSS/IG saliency maps are simply the probability density. CC/KL-Div saliency maps are calculated by convolving the probability density with a Gaussian kernel with \(\sigma =35\,\mathrm{px}\) (corresponding to 1 degree of visual angle, as commonly used on the MIT1003 dataset). SIM saliency maps: we divide the CC saliency map by its sum to normalize it. Starting from there, we perform constrained (nonnegative, unit sum) stochastic gradient descent on fixations sampled from the predicted density to maximize the expected SIM performance (see Supplementary Material for implementation details).
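For the center bias density used in the sAUC maps, the following is a minimal sketch of a Gaussian kernel density estimate over fixation positions (scipy; the bandwidth value is arbitrary here and the cross-validation across images is omitted):

```python
import numpy as np
from scipy.stats import gaussian_kde

def centerbias_density(fix_x, fix_y, height, width, bandwidth=0.2):
    """Gaussian KDE over all (x, y) fixations, evaluated on the pixel grid and normalized to unit sum."""
    kde = gaussian_kde(np.vstack([fix_x, fix_y]), bw_method=bandwidth)
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(height, width)
    # note: evaluating the KDE on every pixel is slow for large fixation sets; subsample if needed
    return density / density.sum()
```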

3.1 No Saliency Map to Rule Them All

Here we illustrate using simulated data that even if the true fixation density is known, no single saliency map can win in all saliency metrics. From a fictional fixation density (Fig. 1a) we compute the saliency maps that we predict to be optimal for the seven saliency metrics AUC, sAUC, NSS/IG, CC/KL-Div and SIM (Fig. 1b). We sample 1000 sets of 100 fixations from the fixation density and evaluate all five saliency maps using the seven different saliency metrics on this dataset (Fig. 1c, raw data in the Supplement).

Although the saliency maps in Fig. 1b are all predicted by the same model, they appear visually different: while the AUC saliency map is essentially just the histogram-equalized density, the sAUC saliency map removes the center bias contribution (see above). The NSS/IG saliency map is exactly the density and shows large areas with very low values. The CC/KL-Div saliency map, being a blurred version of the density, is much smoother than the NSS saliency map. The SIM saliency map looks mostly like the CC/KL-Div saliency map but is slightly more sparse.

The ranking of the five saliency maps is highly inconsistent across metrics (Fig. 1c): even with knowledge of the real fixation distribution, no saliency map can be optimal for all saliency metrics. However, each saliency map is optimal for exactly those metrics for which it has been predicted to be optimal (framed bars). This illustrates our main result: by deriving metric-specific saliency maps in a principled way from fixation densities, one model can perform optimally in all metrics. Notice that in current practice, an individual research team must instead pick one of the maps in Fig. 1b and is penalized accordingly on the other metrics in Fig. 1c.

Fig. 2.

The predicted saliency maps for various metrics according to different models, for the same stimulus. For six models (rows) we show their original saliency map (first column), the probability distribution after converting the model into a probabilistic model (second column) and the saliency maps predicted for seven different metrics (columns three through seven). The predictions of different models for the same metric (column) appear more similar than the predictions of the same model for different metrics (row). In particular, note the inconsistency of the original models (which are what is typically compared on the benchmark) relative to the per-metric saliency maps. It is therefore difficult to visually compare original model predictions, which have been formulated for different metrics.

3.2 MIT1003

In our main experiment, we use our approach to evaluate six saliency models on the popular benchmarking dataset MIT1003 (free-viewing fixations of 15 subjects on 1003 images [24]). For all evaluated models, the original source code and default parameters were used. The included models are AIM [6], Boolean Map-based Saliency (BMS) [55], the Ensemble of Deep Networks (eDN) [49], OpenSALICON [47], SalGAN [36] and DeepGaze II [31].

Converting existing models that produce arbitrary saliency maps into probabilistic models is not straightforward [32]. We used the method described in [32] and implemented in the pysaliency toolbox as optimize_for_information_gain: we fitted a pixelwise monotone nonlinearity and a center bias for each model to yield maximum information gain for the MIT1003 dataset (see supplementary material for details). Unlike [32] we did not optimize an additional Gaussian convolution to smooth the predictions. Since DeepGaze II is already formulated as a probabilistic model, there was no need to convert this model. For showing the “original saliency map” we use the log density in this case.
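For readers who want to experiment with this conversion, the following is a rough, simplified sketch of the idea (it is not pysaliency's optimize_for_information_gain): a shared monotone piecewise-linear nonlinearity is applied to each (rescaled) saliency map, multiplied with a fixed center bias density, renormalized, and fitted by maximizing the log-likelihood of the fixations. The parameterization, knot count and optimizer are our simplifications, and, as in our experiments, no additional blur is fitted.

```python
import numpy as np
from scipy.optimize import minimize

def fit_probabilistic_model(saliency_maps, fixations, centerbias, n_knots=10):
    """Fit a monotone pixelwise nonlinearity (plus fixed center bias) by maximum likelihood.

    saliency_maps: list of 2D arrays (one per image)
    fixations: list of (rows, cols) integer index arrays, one pair per image
    centerbias: 2D density with unit sum and the same shape as the saliency maps
    """
    # rescale each map to [0, 1] so a single nonlinearity applies to all images
    smaps = [(s - s.min()) / (s.max() - s.min() + 1e-12) for s in saliency_maps]
    knots = np.linspace(0.0, 1.0, n_knots)

    def density(smap, log_increments):
        values = np.cumsum(np.exp(log_increments))     # increasing knot values -> monotone transform
        transformed = np.interp(smap, knots, values)   # pixelwise monotone nonlinearity
        unnorm = transformed * centerbias              # multiply with the center bias prior
        return unnorm / unnorm.sum()

    def neg_log_likelihood(log_increments):
        nll = 0.0
        for smap, (rows, cols) in zip(smaps, fixations):
            p = density(smap, log_increments)
            nll -= np.log(p[rows, cols]).sum()
        return nll

    # slow but simple; for real use, fit on a subset of images or use gradient-based optimization
    result = minimize(neg_log_likelihood, np.zeros(n_knots), method="Powell")
    return lambda smap: density((smap - smap.min()) / (smap.max() - smap.min() + 1e-12), result.x)
```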

Example saliency maps. In Fig. 2, we show the probability distribution and the predicted saliency maps (columns) for the saliency models (rows) for one example stimulus. Comparing the saliency maps within and between columns, i.e. metrics, one notices that the process of predicting saliency maps for certain metrics has a strong effect on the shape of the saliency maps that is consistent across models. It influences the visual appearance of the saliency map to a larger degree than the actual model does: the AUC and sAUC maps are very high contrast, while the NSS and CC saliency maps have large areas of very little saliency. The CC and SIM saliency maps are much smoother than all other saliency maps. It is quite common in the field to compare the saliency maps of different models visually (e.g., see [13], Figure 6; [5], Figure 6; [4], Figure 9). Figure 2 shows that this practice can be very misleading unless the saliency maps are of the same type (i.e. intended for the same saliency metric).

Fig. 3.

We reformulated several saliency models in terms of fixation densities and evaluated AUC, sAUC, NSS, IG, CC, KL-Div and SIM on the original saliency maps (dashed line) and the saliency maps derived from the probabilistic model for the different saliency metrics (solid lines) on the MIT1003 dataset. Saliency maps derived for a given metric always yield the highest performance for that metric (thick line), and for each metric the model ranking is consistent when using the correct saliency maps – unlike for the original saliency maps and some other derived saliency maps. Note that AUC metrics yield identical results on AUC saliency maps, NSS saliency maps and log-density saliency maps, therefore the blue and purple lines are hidden by the red line in the AUC and sAUC plots. Also, the CC metric yields only slightly worse results on the SIM saliency map than on the CC saliency map, therefore the orange line is hidden by the green line in the CC plot. OpnS=OpenSALICON, DGII=DeepGaze II.

Comparing model performance. In Fig. 3 we evaluate the saliency maps of the saliency models (AIM, BMS, eDN, OpenSALICON, SalGAN, DeepGaze II; x-axis) on the seven saliency metrics (subplots, raw data in the Supplement). Each line indicates the models’ performances in the evaluated metric when using a specific type of saliency map. The dashed lines indicate performance using the models’ original saliency maps (i.e. not transformed into true probability densities). The performances are very inconsistent between the different metrics on the original saliency maps. The solid lines indicate the metric performances on the five types of derived saliency maps (red: AUC, pink: sAUC, blue: NSS and IG, green: CC and KL-Div, orange: SIM). Additionally, we included log-density saliency maps as proposed in [32] (purple dotted lines).

For each metric, the saliency map predicted for that metric (thick line in each sub plot) yields highest performance for all models. Conversely, saliency maps derived for other metrics often incur severe penalties (except for very few borderline cases, see below). While the model rankings given by the different metrics on each saliency map type are much more consistent than on the original saliency maps, there is still disagreement between metrics left when evaluating all metrics on the same saliency map type.

Interestingly, the AIM model reaches better NSS performance with the CC saliency map than with the NSS saliency map. This is easy to explain: the AIM model’s predicted density improves after blurring. For the better models this effect vanishes. For example, DeepGaze II reaches significantly higher NSS scores with the NSS saliency map than with the CC saliency map and vice versa for the CC metric. The SIM metric seems to show only slightly better performance on the SIM saliency map than on the CC saliency map, with the average difference being just 0.006. However, the best five models with respect to SIM in the MIT Saliency Benchmark perform within a range of less than 0.02. A difference of 0.006 could easily change a model’s ranking by multiple places.

Figure 3 also serves to illustrate a key difference between the metric unification proposed in [32] and our method of predicting saliency maps from fixation densities: the metric results presented in [32] correspond to the purple dotted log-density lines for AUC, sAUC, NSS and to the blue density lines for IG and KL-Div (in our implementation taking the logarithm of the density is part of the metric itself). As reported in [32], the model rankings are more consistent for those lines than for the original saliency maps. However, except for AUC and IG, in all other metrics the models are penalized when evaluated like this and additionally for the best models even the agreement between metric rankings is lost (SalGAN vs DeepGaze II, AUC/sAUC/IG vs NSS/CC/KL-Div). This shows that the method proposed in [32], while managing to remove a significant amount of the disagreement between metrics, is not perfect.

To summarize, Fig. 3 illustrates the main result of this paper: no matter which saliency map type one chooses, even state-of-the-art models will perform suboptimally in some metrics and rankings will still be inconsistent. Only by using the right saliency map for each metric, derived from the model density, does every model perform as well as it theoretically can, and all model rankings agree. Consequently, our evaluation yields a unique winner of the benchmark: of all included models, DeepGaze II performs best in all considered metrics.

4 Discussion

Despite much progress in fixation prediction in recent years, comparing saliency models to each other can be confusing due to the large number of benchmarking metrics, giving inconsistent model rankings. Here we argue that benchmarking can be simplified by considering saliency models to be probability density predictors, saliency metrics to be performance measures that assess saliency maps against ground truth fixations, and subsequently saliency maps to be metric-specific predictions derived from the model’s density. We have shown that probabilistic models can predict good saliency maps for the most common saliency metrics: “good models” perform well in many metrics.

Importantly, this metric-specific prediction reflects the same underlying model. It is not the case that the model is being re-trained for each metric. Rather, the saliency maps we show are derived deterministically from the fixation density predicted by a model. In this way it is possible to obtain optimal predictions from a given saliency density for arbitrary metrics without retraining. The saliency model density captures all necessary information in the training data and represents it in a way that it can readily be used in combination with arbitrary error metrics. Information gain (equivalently, log-likelihood) is an ideal optimization metric because it reflects all information in the structure of the fixation density, independent of any particular metric. Therefore, it should lead to good results in all metrics.

The fact that metrics impose strong constraints on saliency maps means that it is misleading to visually compare saliency maps intended for different metrics (see Fig. 2), yet this is commonly done in the field [4, 5, 13]. For example, the optimal saliency maps for distribution-based metrics like CC, SIM and KL-Div require blurring, unlike those for NSS and IG.

Another consequence of the present work is that the eight metrics available on the MIT benchmark can now be seen as a benefit rather than a possible source of confusion. Since each metric assesses different aspects of the fixation prediction, the benchmark now allows a fair comparison over a number of tasks of interest, which may be more or less relevant for certain applications. For example, sAUC is most relevant when one is interested in a model’s predictive performance once the center bias is excluded (e.g., when applying the model to a setting with a different center bias than the MIT1003 training data).

While the saliency maps we have derived are optimal for a given fixation density, it is nevertheless still possible that a given model does better on a metric with a saliency map not intended for that metric than with the metric-specific saliency map itself. If the model’s density is not the correct one (i.e. does not reflect the data-generating density), then the derived saliency maps can be suboptimal. If the model’s density is especially bad, some metrics might even perform better on saliency maps not predicted for this metric than on the one predicted for this metric. For example: if a model’s density prediction is too sparse, the AUC metric will perform better on the smoothed CC saliency map than on the actual AUC saliency map. Therefore, actually optimizing model predictions for each specific metric may yield insights into the differences between the metrics (by comparing the underlying densities). Indeed, this could in practice produce better performance on the training metric than an information-gain-optimized density. The fact that we do not observe this effect on the original saliency maps (which were trained in the case of eDN, OpenSALICON, SalGAN and DeepGaze II; Fig. 3, dashed lines) suggests any improvement is likely small, and can come at the price of performing substantially worse in other metrics.

Finally, we would like to note that the distinction between saliency models and saliency maps we draw here does not contradict ideas that a “saliency map” or maps may be instantiated in the human brain, as a corollary of bottom-up attentional guidance or an importance map for (e.g.) choosing the next place to fixate in a scene [26, 34, 48]. Our nomenclature is rather independent and intended for saliency model benchmarking.

The code for evaluating saliency models as demonstrated in this work has been released as part of the pysaliency python library (available at https://github.com/matthias-k/pysaliency).

Conclusion. Our work solves the problem that one saliency model cannot reach state-of-the-art performance in all relevant saliency metrics. Our key theoretical contribution is to decouple the notions of saliency models and saliency maps. For benchmarking practice, this means that saliency models can be meaningfully compared on all metrics in their original scale. Therefore, our method allows comparing to traditional models that do not use this method; it works even if only metric scores of other models are known (as for example in cases where metric scores are published in a paper). Practically, this means that there is no need to revise an existing benchmark: researchers who submit model densities can have their performance fairly evaluated, but existing models can remain in the table. The MIT saliency benchmark will implement this option.