1 Introduction

In many computer vision tasks, visual elements are represented by vectors in high-dimensional spaces. This is the case for image retrieval [3, 14], object recognition [17, 23], object detection [9], action recognition [20], semantic segmentation [16] and many more. Visual entities can be whole images or videos, or regions of images corresponding to potential object parts. The high-dimensional vectors are used to train a classifier [19] or to directly perform a similarity search in high-dimensional spaces [14].

Vector representations are often post-processed by mapping to a different representation space, which can be higher or lower dimensional. Such mappings or embeddings can be either non-linear [2, 5] or linear [4, 6]. In the non-linear case, methods that directly evaluate [2] or efficiently approximate [5] non-linear kernels are known to perform better. Typical applications range from image classification [5] and retrieval [4] to semantic segmentation [8]. Embeddings of the linear kind are typically used for dimensionality reduction, in which the dimensions carrying the most meaningful information are kept. Dimensionality reduction with Principal Component Analysis (PCA) is very popular in numerous tasks [4, 6, 15]. In the same vein as PCA is data whitening, which is the focus of this work.

A whitening transformation is a linear transformation that performs correlation removal or suppression by mapping the data to a different space such that the covariance matrix of the data in the transformed space is the identity. It is commonly learned in an unsupervised way from a small sample of training vectors and has been shown to be quite effective in retrieval tasks with global image representations, for example, when an image is represented by a vector constructed through the aggregation of local descriptors [13] or by a vector of Convolutional Neural Network (CNN) activations [11, 22]. In particular, PCA whitening significantly boosts the performance of compact CNN image vectors, i.e. 256 to 512 dimensions, due to the handling of inherent co-occurrence phenomena [4]. The principal components found are ordered by decreasing variance, allowing for dimensionality reduction at the same time [12]. Dimensionality reduction may also be performed in a discriminative, supervised fashion. This is the case in the work by Cai et al. [6], where the covariance matrices are constructed using information from pairs of similar and non-similar elements. In this fashion, the injected supervision achieves a better separation between matching and non-matching vectors and is more likely to avoid outliers in the estimation. It has been shown [10] that an unsupervised approach based on least squares minimization is likely to be affected by outliers: even a single outlier of high magnitude can significantly bias the solution.

In this work, we propose an unsupervised way to learn the whitening transformation such that the estimation is robust to outliers. Inspired by the Iteratively Re-weighted Least Squares of Aftab and Hartley [1], we employ robust M-estimators and minimize robust cost functions such as \(\ell _1\) or Cauchy. Our approach iteratively alternates between two minimizations, one performing the centering of the data and one performing the whitening. In each step a weighted least squares problem is solved, which is shown to minimize the sum of the \(\ell _2\) norms of the training vectors. We demonstrate the effectiveness of this approach on synthetic 2D data and on real data of CNN-based representations for image search. The method is additionally extended to handle supervised cases, as in the work of Cai et al. [6], where we show further improvements. Finally, our methodology is not limited to data whitening; we provide a discussion on applying it to robust patch rectification of MSER features [18].

The rest of the paper is organized as follows: In Sect. 2 we briefly review conventional data whitening and give our motivation, while in Sect. 3 we describe the proposed iterative whitening approach. Finally, in Sects. 4 and 5 we compare our method to the conventional approach on synthetic and real data, respectively.

2 Data Whitening

In this section, we first briefly review the background of data whitening and then give a geometric interpretation, which forms our motivation for the proposed approach.

2.1 Background on Whitening

A whitening transformation is a linear transformation that transforms a vector of random variables with a known covariance matrix into a new vector of variables whose covariance matrix is the identity. The transformation is called “whitening” because it changes the input vector into a white noise vector.

We consider the case where this transformation is applied to a set of zero-centered vectors \(\mathcal {X} = \lbrace \mathbf {x} _1, \ldots , \mathbf {x} _i, \ldots , \mathbf {x} _N \rbrace \), with \(\mathbf {x} _i \in \mathbb {R} ^d\), and denote the covariance matrix by \(\varSigma = \sum _i \mathbf {x} _i \mathbf {x} _i^{{\!\top }}\). The whitening transformation P is given by

$$\begin{aligned} P^{\!\top }P = \varSigma ^{-1}. \end{aligned}$$
(1)

In Fig. 1 we show a toy example of 2D points and their whitened counterpart.

Fig. 1. Left: Points in 2D and their covariance shown with an ellipse. Right: The corresponding whitened 2D point set.
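For concreteness, a minimal NumPy sketch of the conventional whitening defined by (1), under the stated assumptions (zero-centered points, full-rank covariance); the function and variable names are our own illustration, not code from the paper:

```python
import numpy as np

def whitening_transform(X):
    """Whitening of zero-centered points X (rows are the x_i).

    Returns P with P.T @ P = inv(Sigma), where Sigma = sum_i x_i x_i^T,
    so that the whitened points y_i = P x_i have identity covariance.
    """
    Sigma = X.T @ X                      # (unnormalized) covariance, d x d
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T
    P = np.linalg.inv(L)                 # P^T P = (L L^T)^{-1} = Sigma^{-1}
    return P

# toy example: 2D Gaussian points, as in Fig. 1
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500)
X -= X.mean(axis=0)                      # zero-center
P = whitening_transform(X)
Y = X @ P.T                              # whitened points
print(Y.T @ Y)                           # numerically the identity matrix
```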

Assumption. In the following text, we assume that the points of \(\mathcal {X} \) do not lie in a linear subspace of dimensionality \(d' < d\). If this is the case, a solution is to first identify the \(d'\)-dimensional subspace and perform the proposed algorithms on this subspace. The direct consequence of the assumption is that the sample covariance matrix \(\varSigma \) is full rank, in particular \(\det {(\varSigma )} > 0\).

It is clear from (1) that the whitening transformation is given up to an arbitrary rotation \(R \in \mathbb {R} ^{d \times d}\), with \(R^{{\!\top }}R = I\). The transformation matrix P of the whitening is thus given by

$$\begin{aligned} P = R\, \varSigma ^{-\frac{1}{2}}. \end{aligned}$$
(2)

2.2 Geometric Interpretation

We provide a geometric interpretation of data whitening, which also serves as our motivation for the proposed method in this work.

Observation. Assuming zero-mean points, the whitening transform P in (2) minimizes the sum of squared \(\ell _2\) norms among all linear transforms T with \(\det (T) = \det (P)\).

Proof.

$$\begin{aligned} \begin{aligned} C_{\ell _2}(P)&= \sum _i ||P\mathbf {x} _i||^2 \\&= \sum _i tr\left( \mathbf {x} _i^{\!\top }P^{\!\top }P \mathbf {x} _i \right) \\&= \sum _i tr\left( \left( \mathbf {x} _i \mathbf {x} _i^{\!\top }\right) P^{\!\top }P \right) \\&= tr\left( \left( \sum _i \mathbf {x} _i \mathbf {x} _i^{\!\top }\right) P^{\!\top }P \right) \\&= tr\left( \varSigma P^{\!\top }P \right) \\&= \sum _{j=1}^d \lambda _j, \end{aligned} \end{aligned}$$
(3)

where \(\lambda _i\) are the eigenvalues of \( \varSigma P^{\!\top }P\) and \(||\cdot ||\) denotes the \(\ell _2\) norm. Upon imposing the condition \(\det (P) = \det (\varSigma ^{-\frac{1}{2}})\), we get that \(\det ( \varSigma P^{\!\top }P ) = \prod _{j=1}^d \lambda _j\) is constant with respect to P. It follows from the arithmetic and geometric mean inequality that the sum in (3) is minimized when \(\lambda _i=\lambda _j, \forall \, i, j\). Equality of all eigenvalues allows us to show that

$$\begin{aligned} \varSigma P^{\!\top }P = I \quad \Leftrightarrow \quad P^{\!\top }P = \varSigma ^{-1}, \end{aligned}$$
(4)

which is exactly the solution in (2) that also minimizes (3). The need for the existence of \(\varSigma ^{-1}\) justifies the stated full rank assumption.

We have just shown that learning a whitening transformation reduces to a least squares problem.
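As a quick numerical sanity check of the observation (an illustrative snippet of ours, not from the paper): random transforms rescaled to the determinant of P never attain a lower cost (3), and the minimum equals d.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(3), np.diag([5.0, 2.0, 0.5]), size=1000)
X -= X.mean(axis=0)
Sigma = X.T @ X
P = np.linalg.inv(np.linalg.cholesky(Sigma))     # whitening transform of (2)
d = X.shape[1]

def cost(T):
    # sum_i ||T x_i||^2 = tr(Sigma T^T T), as in (3)
    return np.trace(Sigma @ T.T @ T)

for _ in range(1000):
    A = rng.normal(size=(d, d))
    if np.linalg.det(A) < 0:                     # keep a positive determinant
        A[0] *= -1
    T = A * (np.linalg.det(P) / np.linalg.det(A)) ** (1.0 / d)   # det(T) = det(P)
    assert cost(T) >= cost(P) - 1e-9
print(cost(P), "== d ==", d)                     # tr(Sigma Sigma^{-1}) = d
```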

3 Robust Whitening

In this section we initially review the necessary background on the iteratively re-weighted least squares (IRLS) method recently proposed by Aftab and Hartley [1], which is the starting point for our method. Then, we present the robust whitening and centering procedures, which are posed as weighted least squares problems and performed iteratively. Finally, the extension to the supervised case is described.

3.1 Background on IRLS

In the context of distance minimization the IRLS method minimizes the cost function

$$\begin{aligned} C_h(\mathbf {\theta }) = \sum _{i=1}^N h \circ f(\mathbf {\theta }, \mathbf {x} _i), \end{aligned}$$
(5)

where f is a distance function that is defined on some domain, h is a function that makes the cost less sensitive to outliers, and \(\mathbf {x} _i \in \mathcal {X} \). Some examples of robust h functions are \(\ell _1\), Huber, pseudo-Huber, etc. as described in [1]. For instance, assume the case of the geometric median of the points in \(\mathcal {X} \). Setting \(f(\varvec{\mu }, \mathbf {x} _i) = ||\varvec{\mu }-\mathbf {x} _i||\) and \(h(z)=z\), we get the cost (5) as the sum of \(\ell _2\) norms. The minimum of this cost is attained when \(\varvec{\mu } \) is equal to the geometric median.

It is shown [1] that a solution for \({{\mathrm{argmin}}}_{\mathbf {\theta }} C_h(\mathbf {\theta })\) may be found by solving a sequence of weighted least squares problems. Given some initial estimate \(\mathbf {\theta } ^0\), the parameters \(\mathbf {\theta } \) are iteratively estimated

$$\begin{aligned} \mathbf {\theta } ^{t+1} = \mathop {\text {argmin}}\limits _{\mathbf {\theta }} \sum _{i=1}^N w(\mathbf {\theta } ^t,\mathbf {x} _i) f(\mathbf {\theta },\mathbf {x} _i)^2, \end{aligned}$$
(6)

where for brevity \(w(\mathbf {\theta } ^t,\mathbf {x} _i)\) is denoted \(w_i^t\) in the following. Provided \(h(\sqrt{z})\) is differentiable at all points and concave, this solution minimizes \(C_h(\mathbf {\theta })\) for certain values of \(w_i^t\) and suitable conditions on f. In some cases, it may even be possible to find a simple and analytic solution.

Given that the iterative procedure indeed converges to a minimum of the cost (5), we get the following condition on the weights:

$$\begin{aligned} \begin{aligned} \nabla _{\mathbf {\theta }} (h \circ f({\mathbf {\theta },\mathbf {x} _i}))&= 0, \\ \nabla _{\mathbf {\theta }} (w_i^t f({\mathbf {\theta },\mathbf {x} _i})^2)&= 0. \\ \end{aligned} \end{aligned}$$
(7)

This results in the following weights

$$\begin{aligned} w_i^t = \frac{h'(f({\mathbf {\theta } ^t,\mathbf {x} _i}))}{2f({\mathbf {\theta } ^t,\mathbf {x} _i})}. \end{aligned}$$
(8)

Geometric median. The geometric median \(\varvec{\mu } \) of a set of points \(\lbrace \mathbf {x} _i \rbrace \) is the point that minimizes the sum of \(\ell _2\) distances to the points. As shown in one of the cases in the work by Aftab and Hartley [1], the problem of finding the geometric median can be cast in an IRLS setting for a certain choice of weights. Setting \(f(\varvec{\mu }, \mathbf {x} _i) = ||\varvec{\mu }-\mathbf {x} _i||\) and \(h(z)=z\), the IRLS algorithm minimizes the sum of distances at each iteration, thus converging to the geometric median.
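As a concrete instance of the IRLS recipe, a short sketch computing the geometric median with the weights of (8) for \(h(z)=z\); the small constant guarding against division by zero and all names are our additions:

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-12):
    """IRLS (Weiszfeld-style) geometric median of the rows of X.

    With f(mu, x_i) = ||mu - x_i|| and h(z) = z, the weights of (8) are
    w_i = 1 / (2 ||mu - x_i||); the constant factor cancels in the update.
    """
    mu = X.mean(axis=0)                           # initial estimate
    for _ in range(n_iter):
        dist = np.linalg.norm(X - mu, axis=1)
        w = 1.0 / (2.0 * np.maximum(dist, eps))   # weights of (8)
        # each step solves argmin_mu sum_i w_i ||mu - x_i||^2 (weighted mean)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
    return mu

X = np.vstack([np.random.default_rng(2).normal(size=(100, 2)),
               [[50.0, 50.0]]])                   # 100 inliers and one far outlier
print("mean:            ", X.mean(axis=0))
print("geometric median:", geometric_median(X))   # barely affected by the outlier
```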

3.2 Method

From the observation in Sect. 2.2, we know that there is a closed-form solution to the problem of finding a linear transformation P so that \(\sum _i ||P\mathbf {x} _i||^2 \) is minimized subject to a fixed determinant \(\det (P)\). The idea of the robust whitening is to use this least squares minimizer in a framework similar to iteratively re-weighted least squares in order to minimize a robust cost.

Robust transformation estimation. In contrast to the conventional whitening and the minimization of (3), we now propose the estimation of a whitening transform (transformation matrix P) in a way that is robust to outliers. We assume zero-mean points and seek the whitening transformation that minimizes the robust cost function of (5). We set \(f(P, \mathbf {x} _i) = ||P\mathbf {x} _i||\) and use the \(\ell _1\) cost function \(h(z) = z\). Other robust cost functions can be used, too.

We seek to minimize the sum of \(\ell _2\) norms in the whitened space

$$\begin{aligned} C_{\ell _1}(P) = \sum _{i=1}^N f(P, \mathbf {x} _i) = \sum _{i=1}^N ||P\mathbf {x} _i||. \end{aligned}$$
(9)

The corresponding iteratively re-weighted least squares solution is given by

$$\begin{aligned} P^{t+1} = \mathop {\text {argmin}}\limits _{P} \sum _{i=1}^N w_i^t ||P \mathbf {y} _i^t||^2, \end{aligned}$$
(10)

where \(\mathbf {y} _i^t = P^t \mathbf {y} _i^{t-1}\) and \(\mathbf {y} _i^0=\mathbf {x} _i\). This means that at each iteration the transformation \(P^{t}\) is estimated and applied to whiten the data points. In the following iteration, the estimation is performed on the data points in the whitened space. The effective transformation at iteration t with respect to the initial points \(\mathbf {x} _i\) is given by

$$\begin{aligned} \hat{P}^t = \prod _{i=1}^{t} P^i. \end{aligned}$$
(11)

Along the lines of the proof in (3), we find a closed-form solution that minimizes (9) as

$$\begin{aligned} \begin{aligned}&\sum _i w_i^t ||P\mathbf {y} _i^t||^2 \\&= tr\left( \left( \sum _i w_i^t \mathbf {y} _i^t {\mathbf {y} _i^t}^{\!\top }\right) P^{\!\top }P \right) \\&= tr\left( \tilde{\varSigma } P^{\!\top }P \right) \\ \end{aligned} \end{aligned}$$
(12)

where \(\tilde{\varSigma } = \sum _i w_i^t \mathbf {y} _i^t {\mathbf {y} _i^t}^{\!\top }\) is a weighted covariance. Therefore, P is given, up to a rotation, as

$$\begin{aligned} P = R\, \tilde{\varSigma }^{-\frac{1}{2}}. \end{aligned}$$
(13)

Joint centering and transformation matrix estimation. In this section we describe the proposed approach for data whitening. We propose to jointly estimate a robust mean \(\varvec{\mu } \) and a robust transformation matrix P by alternating between the two previously described procedures: estimating the geometric median and estimating the robust transformation. In other words, in each iteration, we first find \(\varvec{\mu } \) keeping P fixed and then find P keeping \(\varvec{\mu } \) fixed. In this way the assumption of centered points when estimating P is satisfied. Given that each iteration of the method outlined above reduces the cost, and that the cost is non-negative, we are assured convergence to a local minimum.

We propose to minimize the cost

$$\begin{aligned} C_{\ell _1}(P,\varvec{\mu }) = \sum _{i=1}^N ||P(\mathbf {x} _i-\varvec{\mu })||. \end{aligned}$$
(14)

In order to reformulate this as an IRLS problem, we use \(h(z) = z\) and \(f(P,\varvec{\mu },\mathbf {x} _i)= ||P(\mathbf {x} _i-\varvec{\mu })||\). Now, at iteration t the minimization is performed on the points \(\mathbf {y} _i^t = \hat{P}^t(\mathbf {x} _i-\hat{\varvec{\mu }}^t)\) and the conditions for convergence with respect to \(\varvec{\mu } \) (dropping t and the notation for effective parameters for brevity) are

$$\begin{aligned} \begin{aligned} \nabla _{\varvec{\mu }} (h \circ f)&= \nabla _{\varvec{\mu }} ||P(\mathbf {y} _i - \varvec{\mu })|| \\&= \nabla _{\varvec{\mu }} \sqrt{(\mathbf {y} _i - \varvec{\mu })^{{\!\top }}P^{{\!\top }}P(\mathbf {y} _i - \varvec{\mu })} \\&= \frac{1}{2||P(\mathbf {y} _i - \varvec{\mu })||} \cdot \nabla _{\varvec{\mu }} M, \\ \nabla _{\varvec{\mu }} (w_i \cdot f^2)&= w_i \cdot \nabla _{\varvec{\mu }} M, \end{aligned} \end{aligned}$$
(15)

where we have \(M= (\mathbf {y} _i - \varvec{\mu })^{{\!\top }}P^{{\!\top }}P(\mathbf {y} _i - \varvec{\mu })\). This gives the expression for the weight

$$\begin{aligned} w_i^t = \frac{1}{2||\hat{P}^t(\mathbf {x} _i - \hat{\varvec{\mu }}^t)||}. \end{aligned}$$
(16)

A similar derivation gives us the weights for the iteration step of P. Therefore in each iteration, we find the solutions to the following weighted least squares problems,

$$\begin{aligned} \varvec{\mu } ^{t+1} = \mathop {\text {argmin}}\limits _{\varvec{\mu }} \sum _{i=1}^N w_i(P^t,\varvec{\mu } ^t) ||P^t(\mathbf {y} _i^t - \varvec{\mu })||^2, \end{aligned}$$
(17)
$$\begin{aligned} P^{t+1} = \mathop {\text {argmin}}\limits _{P} \sum _{i=1}^N w_i(P^t,\varvec{\mu } ^{t+1}) ||P(\mathbf {y} _i^t - \varvec{\mu } ^{t+1})||^2. \end{aligned}$$
(18)

The effective centering and transformation matrix at iteration t are given by

$$\begin{aligned} \hat{\varvec{\mu }}^t = \sum _{i=1}^t \left( \prod _{j=1}^{i-1} (P^j)^{-1}\right) \varvec{\mu } ^i \quad , \quad \hat{P}^t = \prod _{i=1}^{t} P^i. \end{aligned}$$
(19)

The whole procedure is summarized in Algorithm 1, where chol is used to denote the Cholesky decomposition.

Algorithm 1.
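The pseudo-code figure is not reproduced here; the following NumPy sketch is our own illustrative reading of the joint iteration (16)–(19), assuming the \(\ell _1\) cost \(h(z)=z\) and using the Cholesky factor for the per-iteration update (13). The function names, the fixed iteration count and the scale normalization of the weighted covariance are our choices, not the paper's.

```python
import numpy as np

def robust_whitening(X, n_iter=50, eps=1e-12):
    """Joint robust centering and whitening (ell_1 cost, IRLS iterations).

    Returns (mu_hat, P_hat) so that P_hat @ (x - mu_hat) is the robustly
    whitened representation of a descriptor x.
    """
    N, d = X.shape
    Y = X.copy()                                    # points in the current space
    mu_hat = np.zeros(d)                            # effective centering, eq. (19)
    P_hat = np.eye(d)                               # effective transformation, eq. (19)

    for _ in range(n_iter):
        # weights of (16): small weight for points far away in the current space
        w = 1.0 / (2.0 * np.maximum(np.linalg.norm(Y, axis=1), eps))

        # centering step (17): weighted mean in the current space
        mu = (w[:, None] * Y).sum(axis=0) / w.sum()

        # recompute weights with the new center, then transformation step (18)
        Yc = Y - mu
        w = 1.0 / (2.0 * np.maximum(np.linalg.norm(Yc, axis=1), eps))
        Sigma_w = (w[:, None] * Yc).T @ Yc / w.sum()    # weighted covariance
        P = np.linalg.inv(np.linalg.cholesky(Sigma_w))  # eq. (13), via chol

        # accumulate the effective parameters, eq. (19), and move to the new space
        mu_hat = mu_hat + np.linalg.solve(P_hat, mu)    # P_hat^{-1} mu
        P_hat = P @ P_hat
        Y = Yc @ P.T
    return mu_hat, P_hat

# usage: learn on training descriptors X (N x d), then transform a query q
# mu_hat, P_hat = robust_whitening(X)
# q_white = P_hat @ (q - mu_hat)
```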

3.3 Extension with Supervision

We first review the work of Cai et al. [6], who perform supervised descriptor whitening, and then present our extension for robust supervised whitening.

Background on linear discriminant projections [6]. The linear discriminant projections (LDP) are learned via supervision of pairs of similar and dissimilar descriptors. A pair (i, j) is similar if \((i,j) \in \mathcal {S} \), and dissimilar if \((i,j) \in \mathcal {D} \). The projections are learned in two parts. Firstly, the whitening part is obtained as the inverse square-root \(C_{\mathcal {S}}^{-\frac{1}{2}}\) of the intra-class covariance matrix, where

$$\begin{aligned} C_{\mathcal {S}} = \sum _{(i,j\in \mathcal {S})} (x_i - x_j)(x_i - x_j)^\top . \end{aligned}$$
(20)

Then, the rotation part is given by the PCA of the inter-class covariance matrix, which is computed in the space of the whitened descriptors. It is computed as \(\mathrm {eig}\!\left( C_{\mathcal {S}}^{-\frac{1}{2}} C_{\mathcal {D}}\, C_{\mathcal {S}}^{-\frac{1}{2}} \right) \), where

$$\begin{aligned} C_{\mathcal {D}} = \sum _{(i,j\in \mathcal {D})} (x_i - x_j)(x_i - x_j)^\top . \end{aligned}$$
(21)

The final whitening is performed by \(P_{\mathcal {S} \mathcal {D}}^\top (x-m)\), where m is the mean descriptor and \(P_{\mathcal {S} \mathcal {D}} = C_{\mathcal {S}}^{-\frac{1}{2}}\, \mathrm {eig}\!\left( C_{\mathcal {S}}^{-\frac{1}{2}} C_{\mathcal {D}}\, C_{\mathcal {S}}^{-\frac{1}{2}} \right) \). It is noted [6] that, if the number of descriptors is large compared to the number of classes (two in this case), then \( C_{\mathcal {D}} \approx C_{\mathcal {S} \cup \mathcal {D}}\) since \(|\mathcal {S} | \ll |\mathcal {D} |\). This is the approach we follow.

Robust linear discriminant projections. The proposed method uses the provided supervision in a robust manner by employing the method introduced in Sect. 3.2. The whitening is estimated robustly by Algorithm 1 on the intra-class covariance. In this manner, small weights are assigned to pairs of descriptors that are found to be outliers. Then, the mean and covariance are estimated in a robust manner in the whitened space. The whole procedure is summarized in Algorithm 2. Mean \(\mu _1\) is zero due to the inclusion of the pairs in a symmetric manner.

Algorithm 2.
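In place of the pseudo-code of Algorithm 2, a hedged sketch of how the supervision could enter, reusing robust_whitening from the previous sketch: symmetric intra-class difference vectors give the whitening part, and a robust estimate in the whitened space gives the rotation. This is our reading of the procedure, not a verbatim transcription; all names are illustrative.

```python
import numpy as np

def robust_ldp(X, similar_pairs):
    """Sketch of robust supervised whitening (our reading of Algorithm 2).

    X: N x d descriptors; similar_pairs: list of (i, j) index pairs in S.
    Relies on robust_whitening() from the previous sketch.
    """
    # symmetric intra-class difference vectors; their robust mean mu1 is ~zero
    diffs = np.vstack([X[i] - X[j] for i, j in similar_pairs] +
                      [X[j] - X[i] for i, j in similar_pairs])
    mu1, P_S = robust_whitening(diffs)               # whitening part
    # robust mean / covariance of all descriptors in the whitened space
    # (all descriptors approximate D, since |S| << |D|)
    mu2, P_D = robust_whitening(X @ P_S.T)
    Sigma_D = np.linalg.inv(P_D.T @ P_D)             # covariance implied by P_D
    eigval, R = np.linalg.eigh(Sigma_D)
    R = R[:, ::-1]                                   # decreasing eigenvalue order
    m = np.linalg.solve(P_S, mu2)                    # robust center in input space
    P_SD = R.T @ P_S                                 # final projection
    return m, P_SD                                   # use as P_SD @ (x - m)
```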
Fig. 2. (a) Set of 2D points drawn from a Gaussian distribution with zero mean. (b) Same set as (a) with an additional point (outlier) placed at a distance equal to 2 times the maximum distance from the center of the initial set. (c) Visualization of the weights assigned to the set of (b) by the robust whitening with the \(\ell _1\) cost function. Note that the size of the circles is inversely proportional to the weight. (d) Same as (c), but using the Cauchy cost.

Fig. 3. Visualization of the covariance (ellipse) and center (cross) of the estimated whitening transformation at iteration t and of the conventional estimate. The example uses the set of 2D points of Fig. 2. The ground-truth distribution that generated the data points is shown in black, the conventional estimate in cyan. We show the effective estimate of the \(t^{th}\) iteration. The two approaches are compared without an outlier in (a), and with an outlier using the \(\ell _1\) (b) or the Cauchy (c) cost function. The outlier is placed at a distance equal to 10 times the maximum inlier distance and is not plotted to keep the scale of the figure reasonable. The \(\ell _1\) (or Cauchy) cost is shown in the legend. (Color figure online)

Fig. 4. Visualization of the covariance (ellipse) and center (cross) of the estimated whitening transformation using the conventional approach and ours. The example uses the set of 2D points of Fig. 2. The two approaches are compared for an outlier placed at a distance equal to 3 (a), 5 (b) and 10 (c) times the maximum inlier distance. The outlier is not shown to keep the resolution high.

4 Examples on Synthetic Data

We compare the proposed and the conventional whitening approaches on synthetic 2D data in order to demonstrate the robustness of our method to outliers. We sample a set of 2D points from a normal distribution, shown in Fig. 2(a), and then add an outlier, shown in Fig. 2(b). In the absence of outliers, both methods provide a similar estimation, as shown in Fig. 3, which also shows how the iterative approach reduces the cost at each iteration. In the presence of an outlier, the estimation of the conventional approach is largely affected, while the robust method gives a much better estimation, as shown in Fig. 3. Using the Cauchy cost function, the estimated covariance is very close to that of the ground truth. The weights assigned to each point by the robust approach are visualized in Fig. 2 and show how the outlier is discarded in the final estimation. Finally, in Fig. 4, we compare the conventional approach with ours for an outlier at increasing distance.

5 Experiments

In this section, the robust whitening is applied to real-application data. In particular, we test on SPoC [4] descriptors, which are CNN-based image descriptors constructed via sum pooling of network activations in the internal convolutional layers. We evaluate on 3 popular retrieval benchmarks, namely Oxford5k, Paris6k and Holidays (the upright version), and use around 25k training images to learn the whitening. We use the VGG network [21] to extract the descriptors and, in contrast to the work of Babenko and Lempitsky [4], we do not \(\ell _2\)-normalize the input vectors. The final ranking is obtained using the Euclidean distance between the query and the database vectors. Evaluation is performed by measuring mean Average Precision (mAP). As in the case of conventional whitening, dimensionality reduction is performed by preserving the dimensions that have the highest variance. This is done by computing an eigenvalue decomposition of the estimated covariance and ordering the eigenvectors by decreasing eigenvalue.
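A short sketch of this dimensionality reduction step (our illustration, reusing mu_hat and P_hat from the earlier robust_whitening sketch; names are ours):

```python
import numpy as np

def reduced_whitening(P_hat, D):
    """PCA-whitening-style reduction to D dimensions from a learned transform.

    The covariance estimate implied by P_hat is (P_hat^T P_hat)^{-1}; keep the
    D eigenvectors of largest eigenvalue (variance) and scale by 1/sqrt(eigval).
    """
    Sigma = np.linalg.inv(P_hat.T @ P_hat)
    eigval, V = np.linalg.eigh(Sigma)                # ascending order
    eigval, V = eigval[::-1][:D], V[:, ::-1][:, :D]  # top-D by variance
    return (V / np.sqrt(eigval)).T                   # D x d projection

# usage with the earlier sketch:
# mu_hat, P_hat = robust_whitening(X_train)
# P_D = reduced_whitening(P_hat, D=256)
# x_reduced = P_D @ (x - mu_hat)
```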

There are many approaches performing robust PCA [7, 24, 25] by assuming that the data matrix can be decomposed into the sum of a low-rank matrix and a sparse matrix corresponding to the outliers. We employ the robust PCA (RPCA) method of Candès et al. [7] for comparison: the low-rank matrix is recovered and PCA whitening is learned on it.

Fig. 5. Retrieval performance comparison using mAP on 3 common benchmarks. Comparison of the conventional PCA whitening, RPCA whitening and our approach for descriptors of varying dimensionality. The training set contains a small subset of 512 randomly selected vectors. The experiment is repeated 10 times; mean performance is reported and the standard deviation is shown on the curves. Descriptors are extracted using VGG.

We present results in Table 1, where the robust approach offers a consistent improvement over the conventional PCA whitening [4]. The improvement is larger when the whitening is learned on few training vectors, since outliers then heavily influence the conventional estimate, as shown in Fig. 5. Our approach is also better than RPCA whitening for large dimensionalities. It appears that RPCA underestimates the rank of the matrix and does not offer any further improvement for large dimensions.

Table 1. Retrieval performance comparison using mAP on 3 common benchmarks. Comparison of retrieval using the initial sum-pooled CNN activations, post-processing using the baselines and our methods for unsupervised and supervised whitening. Results for descriptors of varying dimensionality. The full training set is used. Descriptors extracted using VGG. S: indicates the use of supervision.

6 Discussion

The applicability of the proposed method goes beyond robust whitening. Consider, for example, the task of computing affine-invariant descriptors of local features, such as MSERs [18]. A common approach is to transform the detected feature into a canonical frame prior to computing a robust descriptor based on the gradient map of the normalized patch (SIFT [17]). To remove the effect of an affine transformation, the centre of gravity and the centered second-order moment (covariance matrix) are used. It can be shown that both the centre of gravity and the covariance matrix are affine-covariant, i.e. if the input point set is transformed by an affine transformation A, they transform with the same transformation A.

The proposed method searches for \(\mu \) and P by minimization over all affine transformations with a fixed determinant. In turn, \(\mu \) is fully affine-covariant and P is affine-covariant up to an unknown scale (and rotation; \(P^{\!\top }P\) cancels the rotation). To the best of our knowledge, this type of robust-to-outliers covariant has not been used before.
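To make the connection concrete, a small sketch (ours, not the paper's implementation) of mapping a 2D point set to a canonical frame using the centre of gravity and second-order moments; substituting the robust \(\mu \) and P of Sect. 3 for the mean and covariance below would give the robust-to-outliers variant discussed above.

```python
import numpy as np

def normalize_region(points):
    """Map a 2D point set (e.g. MSER boundary points) to a canonical frame.

    The centroid and second-moment matrix are affine-covariant, so the output
    is invariant to affine transformations of the input, up to rotation.
    """
    mu = points.mean(axis=0)                         # centre of gravity
    C = np.cov(points.T)                             # second-order moments
    A = np.linalg.inv(np.linalg.cholesky(C))         # A^T A = C^{-1}
    return (points - mu) @ A.T

pts = np.random.default_rng(3).multivariate_normal([2, 1], [[3.0, 1.2], [1.2, 1.0]], 400)
M = np.array([[2.0, 0.5], [0.1, 1.5]])               # linear part of some affine map
n1 = normalize_region(pts)
n2 = normalize_region(pts @ M.T + np.array([5.0, -3.0]))
print(np.cov(n1.T), np.cov(n2.T))                    # both close to the identity
```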

7 Conclusions

We cast the problem of data whitening as the minimization of robust cost functions. In this fashion we iteratively estimate a whitening transformation that is robust to the presence of outliers. Using synthetic data, we show that our estimation is almost unaffected even in extreme cases of outliers, while it also offers improvements when whitening CNN descriptors for image retrieval.