
Image Classification with the Fisher Vector: Theory and Practice


Abstract

A standard approach to describing an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high-dimensional vector and pool them into an image-level signature. The most common patch encoding strategy consists of quantizing the local descriptors into a finite set of prototypical elements. This leads to the popular Bag-of-Visual-Words representation. In this work, we propose to use the Fisher Kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from a “universal” generative Gaussian mixture model. This representation, which we call the Fisher vector (FV), has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets—PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K—with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.


Notes

  1. http://www.image-net.org

  2. http://www.flickr.com/groups

  3. Normalizing by any \(\ell _p\)-norm would cancel out the effect of \(\omega \). Perronnin et al. (2010c) chose the \(\ell _2\)-norm because it is the natural norm associated with the dot-product. In Sect. 3.2 we experiment with different \(\ell _p\)-norms.

  4. See Appendix A.2 in the extended version of Jaakkola and Haussler (1998) which is available at: http://people.csail.mit.edu/tommi/papers/gendisc.ps

  5. Xiao et al. (2010) also report results with one training sample per class. However, a single sample does not provide any way to perform cross-validation, which is why we do not report results in this setting.

  6. See http://people.csail.mit.edu/jxiao/SUN/

  7. Actually, any continuous distribution can be approximated with arbitrary precision by a GMM with isotropic covariance matrices.

  8. Note that since \(q\) takes values in a finite set, we could replace \(\int \nolimits _q\) by \(\sum _q\) in the following equations, but we keep the integral notation for simplicity.

  9. While it is standard practice to report per-class accuracy on this dataset (see Deng et al. 2010; Sánchez and Perronnin 2011), Krizhevsky et al. (2012) and Le et al. (2012) report a per-image accuracy. This results in a more optimistic number, since classes which are over-represented in the test data also have more training samples and therefore have (on average) a higher accuracy than classes which are under-represented. This was clarified through personal correspondence with the first authors of Krizhevsky et al. (2012) and Le et al. (2012).

  10. Available at: http://htk.eng.cam.ac.uk/.

References

  • Amari, S., & Nagaoka, H. (2000). Methods of information geometry, translations of mathematical monographs (Vol. 191). Oxford: Oxford University Press.

  • Berg, A., Deng, J., & Fei-Fei, L. (2010). ILSVRC 2010. Retrieved from http://www.image-net.org/challenges/LSVRC/2010/index.

  • Bergamo, A., & Torresani, L. (2012). Meta-class features for large-scale object categorization on a budget. In CVPR.

  • Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.

  • Bo, L., & Sminchisescu, C. (2009). Efficient match kernels between sets of features for visual recognition. In NIPS.

  • Bo, L., Ren, X., & Fox, D. (2012). Multipath sparse coding using hierarchical matching pursuit. In NIPS workshop on deep learning.

  • Boiman, O., Shechtman, E., & Irani, M. (2008). In defense of nearest-neighbor based image classification. In CVPR.

  • Bottou, L. (2011). Stochastic gradient descent. Retrieved from http://leon.bottou.org/projects/sgd.

  • Bottou, L., & Bousquet, O. (2007). The tradeoffs of large scale learning. In NIPS.

  • Boureau, Y. L., Bach, F., LeCun, Y., & Ponce, J. (2010). Learning mid-level features for recognition. In CVPR.

  • Boureau, Y. L., LeRoux, N., Bach, F., Ponce, J., & LeCun, Y. (2011). Ask the locals: Multi-way local pooling for image recognition. In ICCV.

  • Burrascano, P. (1991). A norm selection criterion for the generalized delta rule. IEEE Transactions on Neural Networks, 2(1), 125–130.


  • Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of recent feature encoding methods. In BMVC.

  • Cinbis, G., Verbeek, J., & Schmid, C. (2012). Image categorization using Fisher kernels of non-iid image models. In CVPR.

  • Clinchant, S., Csurka, G., Perronnin, F., & Renders, J. M. (2007). XRCE's participation to ImagEval. In ImagEval workshop at CIVR.

  • Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV SLCV workshop.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR.

  • Deng, J., Berg, A., Li, K., & Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us? In ECCV.

  • Everingham, M., Gool, L.V., Williams, C., Winn, J. & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results.

  • Everingham, M., Gool, L.V., Williams, C., Winn, J., Zisserman, A. (2008). The PASCAL visual object classes challenge 2008 (VOC2008) results.

  • Everingham, M., van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.


  • Farquhar, J., Szedmak, S., Meng, H., & Shawe-Taylor, J. (2005). Improving “bag-of-keypoints” image categorisation. Technical report. Southampton: University of Southampton.

  • Feng, J., Ni, B., Tian, Q., & Yan, S. (2011). Geometric \(\ell _p\)-norm feature pooling for image classification. In CVPR.

  • Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In ICCV.

  • Gray, R., & Neuhoff, D. (1998). Quantization. IEEE Transactions on Information Theory, 44(6), 2724–2742.


  • Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. California Institute of Technology. Retrieved from http://authors.library.caltech.edu/7694.

  • Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In CVPR.

  • Harzallah, H., Jurie, F., & Schmid, C. (2009). Combining efficient object localization and image classification. In ICCV.

  • Haussler, D. (1999). Convolution kernels on discrete structures. Technical report. Santa Cruz: UCSC.

  • Jaakkola, T., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In NIPS.

  • Jégou, H., Douze, M., & Schmid, C. (2009). On the burstiness of visual elements. In CVPR.

  • Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In CVPR.

  • Jégou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117–128.

  • Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.


  • Krapac, J., Verbeek, J., & Jurie, F. (2011). Modeling spatial layout with fisher vectors for image categorization. In ICCV.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.

  • Kulkarni, N., & Li, B. (2011). Discriminative affine sparse codes for image classification. In CVPR.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

  • Le, Q., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G., et al. (2012). Building high-level features using large scale unsupervised learning. In ICML.

  • Lin, Y., Lv, F., Zhu, S., Yu, K., Yang, M., & Cour, T. (2011). Large-scale image classification: Fast feature extraction and SVM training. In CVPR.

  • Liu, Y., & Perronnin, F. (2008). A similarity measure between unordered vector sets with application to image categorization. In CVPR.

  • Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.


  • Lyu, S. (2005). Mercer kernels for object recognition with local features. In CVPR.

  • Maji, S., & Berg, A. (2009). Max-margin additive classifiers for detection. In ICCV.

  • Maji, S., Berg, A., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In CVPR.

  • Mensink, T., Verbeek, J., Csurka, G., & Perronnin, F. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

  • Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.

  • Perronnin, F., Dance, C., Csurka, G., & Bressan, M. (2006). Adapted vocabularies for generic visual categorization. In ECCV.

  • Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010a). Large-scale image retrieval with compressed Fisher vectors. In CVPR.

  • Perronnin, F., Sánchez, J., & Liu, Y. (2010b). Large-scale image categorization with explicit data embedding. In CVPR.

  • Perronnin, F., Sánchez, J., & Mensink, T. (2010c). Improving the Fisher kernel for large-scale image classification. In ECCV.

  • Perronnin, F., Akata, Z., Harchaoui, Z., & Schmid, C. (2012). Towards good practice in large-scale learning for image classification. In CVPR.

  • Sabin, M., & Gray, R. (1984). Product code vector quantizers for waveform and voice coding. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(3), 474–488.


  • Sánchez, J., & Perronnin, F. (2011). High-dimensional signature compression for large-scale image classification. In CVPR.

  • Sánchez, J., Perronnin, F., & de Campos, T. (2012). Modeling the spatial layout of images beyond spatial pyramids. Pattern Recognition Letters, 33(16), 2216–2223.


  • Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In ICML.

  • Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In ICCV.

  • Smith, N., & Gales, M. (2001). Speech recognition using SVMs. In NIPS.

  • Song, D., & Gupta, A. K. (1997). Lp-norm uniform distribution. Proceedings of the American Mathematical Society, 125, 595–601.


  • Spruill, M. (2007). Asymptotic distribution of coordinates on high dimensional spheres. Electronic Communications in Probability, 12.

  • Sreekanth, V., Vedaldi, A., Jawahar, C., & Zisserman, A. (2010). Generalized RBF feature maps for efficient detection. In BMVC.

  • Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: John Wiley.


  • Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In CVPR.

  • Uijlings, J., Smeulders, A., & Scha, R. (2009). What is the spatial extent of an object? In CVPR.

  • van de Sande, K., Gevers, T., & Snoek, C. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1582–1596.


  • van Gemert, J., Veenman, C., Smeulders, A., & Geusebroek, J. (2010). Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Vedaldi, A., & Zisserman, A. (2010). Efficient additive kernels via explicit feature maps. In CVPR.

  • Vedaldi, A., & Zisserman, A. (2012). Sparse kernel approximations for efficient classification and detection. In CVPR.

  • Wallraven, C., Caputo, B., & Graf, A. (2003). Recognition with local features: the kernel recipe. In ICCV.

  • Wang, G., Hoiem, D., & Forsyth, D. (2009). Learning image similarity from flickr groups using stochastic intersection kernel machines. In ICCV.

  • Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In CVPR.

  • Winn, J., Criminisi, A., & Minka, T. (2005). Object categorization by learned visual dictionary. In ICCV.

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR.

  • Yan, S., Zhou, X., Liu, M., Hasegawa-Johnson, M., & Huang, T. (2008). Regression from patch-kernel. In CVPR.

  • Yang, J., Li, Y., Tian, Y., Duan, L., & Gao, W. (2009a). Group sensitive multiple kernel learning for object categorization. In ICCV.

  • Yang, J., Yu, K., Gong, Y., & Huang, T. (2009b). Linear spatial pyramid matching using sparse coding for image classification. In CVPR.

  • Young, S., Evermann, G., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., & Woodland, P. (2002). The HTK book (version 3.2.1). Cambridge: Cambridge University Engineering Department.

  • Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 123–138.


  • Zhou, X., Yu, K., Zhang, T., & Huang, T. (2010). Image classification using super-vector coding of local image descriptors. In ECCV.


Appendices

Appendix 1: An Approximation of the Fisher Information Matrix

In this appendix we show that, under the assumption that the posterior distribution \(\gamma _x(k) = w_k u_k(x) / u_\lambda (x)\) is sharply peaked, the normalization with the FIM takes a diagonal form. Throughout this appendix we assume the data \(x\) to be one-dimensional. The extension to the multidimensional case is immediate for the mixtures of Gaussians with diagonal covariance matrices that we are interested in.

Under some mild regularity conditions on \(u_\lambda (x)\), the entries of the FIM can be expressed as:

$$\begin{aligned} {\left[ {F_\lambda }\right] }_{i,j} = \mathbb{E }\left[ - \frac{\partial ^2 \log u_\lambda (x)}{\partial \lambda _i \partial \lambda _j}\right] . \end{aligned}$$
(57)

First, let us consider the partial derivatives of the posteriors w.r.t. the mean and variance parameters. If we use \(\theta _s\) to denote one such parameter associated with \(u_s(x)\), i.e. mixture component number \(s\), then:

$$\begin{aligned}&\frac{\partial \gamma _x(k)}{\partial \theta _s} = \gamma _x(k) \frac{\partial \log \gamma _x(k)}{\partial \theta _s} \end{aligned}$$
(58)
$$\begin{aligned}&\quad = \gamma _x(k) \frac{\partial }{\partial \theta _s}\big [ \log w_k + \log u_k(x) - \log u_\lambda (x)\big ]\end{aligned}$$
(59)
$$\begin{aligned}&\quad = \gamma _x(k) \left[ [\![k=s]\!] \frac{\partial \log u_s(x)}{\partial \theta _s} - \frac{\partial \log u_\lambda (x)}{\partial \theta _s} \right] \end{aligned}$$
(60)
$$\begin{aligned}&\quad = \gamma _x(k) \left[ [\![k=s]\!] \frac{\partial \log u_s(x)}{\partial \theta _s} - \gamma _x(s) \frac{\partial \log u_s(x)}{\partial \theta _s} \right] \end{aligned}$$
(61)
$$\begin{aligned}&\quad = \gamma _x(k) \Big ( [\![k=s]\!] - \gamma _x(s)\Big ) \frac{\partial \log u_s(x)}{\partial \theta _s} \approx 0, \end{aligned}$$
(62)

where \([\![\cdot ]\!]\) denotes the Iverson bracket, which equals one if its argument is true and zero otherwise. The assumption that the posterior is sharply peaked implies that this partial derivative is approximately zero, since it implies that (i) \(\gamma _x(k)\gamma _x(s)\approx 0\) if \(k\ne s\) and (ii) \(\gamma _x(k) \approx \gamma _x(k)\gamma _x(s)\) if \(k=s\).

From this result and Eqs. (12), (13), and (14), it is then easy to see that second order derivatives are zero if (i) they involve mean or variance parameters corresponding to different mixture components (\(k\ne s\)), or if (ii) they involve a mixing weight parameter and a mean or variance parameter (possibly from the same component).

To see that the cross terms for mean and variance of the same mixture component are zero, we again rely on the observation that \(\partial \gamma _x(k) / \partial \theta _s\approx 0\) to obtain:

$$\begin{aligned}&\frac{\partial ^2 \log u_\lambda (x)}{\partial \sigma _k \partial \mu _k} \approx \gamma _x(k)(x-\mu _k) \frac{\partial \sigma _k^{-2}}{\partial \sigma _k}\nonumber \\&\quad = -2\sigma _k^{-3}\gamma _x(k) (x-\mu _k) \end{aligned}$$
(63)

Then by integration we obtain:

$$\begin{aligned} {\left[ {F_\lambda }\right] }_{\sigma _k,\mu _k}&= -\int \limits _x u_\lambda (x) \frac{\partial ^2 \log u_\lambda (x)}{\partial \sigma _k \partial \mu _k} \,\mathrm{d}x\end{aligned}$$
(64)
$$\begin{aligned}&\approx 2\sigma _k^{-3}\int \limits _x u_\lambda (x)\gamma _x(k) (x-\mu _k) \,\mathrm{d}x\end{aligned}$$
(65)
$$\begin{aligned}&= 2\sigma _k^{-3}w_k\int \limits _x u_k(x) (x-\mu _k) \,\mathrm{d}x = 0 \end{aligned}$$
(66)

We now compute the second order derivatives w.r.t. the means:

$$\begin{aligned} \frac{\partial ^2 \log u_\lambda (x)}{(\partial \mu _k)^2} \approx \sigma _k^{-2}\gamma _x(k)\frac{\partial (x-\mu _k)}{\partial \mu _k} = -\sigma _k^{-2}\gamma _x(k) \end{aligned}$$
(67)

Integration then gives:

$$\begin{aligned} {\left[ {F_\lambda }\right] }_{\mu _k,\mu _k}&= -\int \limits _x u_\lambda (x) \frac{\partial ^2 \log u_\lambda (x)}{(\partial \mu _k)^2} \,\mathrm{d}x \approx \sigma _k^{-2}\int \limits _x u_\lambda (x) \gamma _x(k) \,\mathrm{d}x\end{aligned}$$
(68)
$$\begin{aligned}&= \sigma _k^{-2}w_k\int \limits _x u_k(x) \,\mathrm{d}x = \sigma _k^{-2}w_k, \end{aligned}$$
(69)

and the corresponding entry in \(L_\lambda \) equals \(\sigma _k / \sqrt{w_k}\). This leads to the normalized gradients as presented in (17).

Similarly, for the variance parameters we obtain:

$$\begin{aligned} \frac{\partial ^2 \log u_\lambda (x)}{(\partial \sigma _k)^2} \approx \sigma _k^{-2}\gamma _x(k)\Big (1 - 3(x-\mu _k)^2 / \sigma _k^2\Big ) \end{aligned}$$
(70)

Integration then gives:

$$\begin{aligned} {\left[ {F_\lambda }\right] }_{\sigma _k,\sigma _k}&\approx \sigma _k^{-2}\int \limits _x u_\lambda (x) \gamma _x(k) \Big (3(x-\mu _k)^2 / \sigma _k^2 -1\Big ) \,\mathrm{d}x\end{aligned}$$
(71)
$$\begin{aligned}&= \sigma _k^{-2}w_k\int \limits _x u_k(x) \Big (3(x-\mu _k)^2 / \sigma _k^2 -1\Big ) \,\mathrm{d}x = 2\sigma _k^{-2}w_k, \end{aligned}$$
(72)

which leads to a corresponding entry in \(L_\lambda \) of \(\sigma _k / \sqrt{2w_k}\). This leads to the normalized gradients as presented in (18).
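
The two closed forms above can be checked numerically. The following NumPy sketch (illustrative only) estimates the diagonal FIM entries for the means and standard deviations of a well-separated one-dimensional GMM via the equivalent outer-product form \(F_\lambda = \mathbb{E }\big [\nabla \log u_\lambda (x)\,\nabla \log u_\lambda (x)^{\prime }\big ]\), and compares them with \(w_k/\sigma _k^2\) and \(2w_k/\sigma _k^2\); the separation between components makes the sharply-peaked-posterior assumption hold approximately:

```python
import numpy as np

rng = np.random.default_rng(0)

# A well-separated 1-D GMM, so the posteriors gamma_x(k) are sharply peaked.
w   = np.array([0.3, 0.7])
mu  = np.array([-5.0, 5.0])
sig = np.array([1.0, 0.5])

# Sample from the mixture u_lambda.
n = 200_000
comp = rng.choice(2, size=n, p=w)
x = rng.normal(mu[comp], sig[comp])

# Posteriors gamma_x(k) = w_k u_k(x) / u_lambda(x).
pdf = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
gamma = pdf / pdf.sum(axis=1, keepdims=True)

# Per-sample scores w.r.t. mu_k and sigma_k (1-D case).
score_mu  = gamma * (x[:, None] - mu) / sig**2
score_sig = gamma * ((x[:, None] - mu) ** 2 / sig**3 - 1.0 / sig)

# Monte Carlo FIM diagonal vs. the closed forms w_k/sigma_k^2 and 2 w_k/sigma_k^2.
print(np.mean(score_mu**2, axis=0),  w / sig**2)
print(np.mean(score_sig**2, axis=0), 2 * w / sig**2)
```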

Finally, the computation of the normalization coefficients for the mixing weights is somewhat more involved. To compute the second order derivatives involving mixing weight parameters only, we will make use of the partial derivative of the posterior probabilities \(\gamma _x(k)\):

$$\begin{aligned} \frac{\partial \gamma _x(k)}{\partial \alpha _s} = \gamma _x(k)\Big ([\![k=s]\!] - \gamma _x(s)\Big ) \approx 0, \end{aligned}$$
(73)

where the approximation follows from the same observations as used in (62). Using this approximation, the second order derivatives w.r.t. mixing weights are:

$$\begin{aligned} \frac{\partial ^2 \log u_\lambda (x)}{\partial \alpha _s \partial \alpha _k}&= \frac{\partial \gamma _x(k)}{\partial \alpha _s} - \frac{\partial w_k}{\partial \alpha _s}\approx - \frac{\partial w_k}{\partial \alpha _s}\nonumber \\&= -w_k\big ( [\![k=s]\!] - w_s\big ) \end{aligned}$$
(74)

Since this result is independent of \(x\), the corresponding block of the FIM is simply obtained by collecting the negative second order gradients in matrix form:

$$\begin{aligned} {\left[ {F_\lambda }\right] }_{\alpha ,\alpha } = \mathrm{diag}(w) - w w^{\prime }, \end{aligned}$$
(75)

where \(w\) and \(\alpha \) denote the vectors of all mixing weights and all mixing weight parameters, respectively.

Since the mixing weights sum to one, it is easy to show that this matrix is non-invertible by verifying that the constant vector is an eigenvector of this matrix with associated eigenvalue zero. In fact, since there are only \(K-1\) degrees of freedom in the mixing weights, we can fix \(\alpha _K=0\) without loss of generality and work with a reduced set of \(K-1\) mixing weight parameters. Now, let us make the following definitions: let \(\tilde{\alpha } = (\alpha _1, \ldots , \alpha _{K-1})^T\) denote the vector of the first \(K-1\) mixing weight parameters, let \(G_{\tilde{\alpha }}^x\) denote the gradient vector with respect to these, and \(F_{\tilde{\alpha }}\) the corresponding matrix of second order derivatives. With this definition \(F_{\tilde{\alpha }}\) is invertible, and using Woodbury's matrix inversion lemma we can show that

$$\begin{aligned} G_{\tilde{\alpha }}^x F_{\tilde{\alpha }}^{-1} G_{\tilde{\alpha }}^y = \sum \limits _{k=1}^K (\gamma _x(k)-w_k)(\gamma _y(k)-w_k) / w_k. \end{aligned}$$
(76)

The last form shows that the inner product, normalized by the inverse of the non-diagonal \(K-1\) dimensional square matrix \(F_{\tilde{\alpha }}\), can in fact be obtained as a simple inner product between the normalized versions of the \(K\)-dimensional gradient vectors as defined in (12), i.e. with entries \(\big (\gamma _x(k)-w_k\big )/\sqrt{w_k}\). This leads to the normalized gradients as presented in (16).
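
As a quick numerical check of (76), the following NumPy sketch builds the reduced matrix \(F_{\tilde{\alpha }} = \mathrm{diag}(\tilde{w}) - \tilde{w}\tilde{w}^{\prime }\) implied by (75), plugs in arbitrary probability vectors as stand-ins for the posteriors, and compares the two sides of the identity (the names are illustrative; this exercises only the algebra, not the GMM itself):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5

# Mixing weights and two arbitrary posterior vectors (stand-ins for gamma_x, gamma_y).
w       = rng.dirichlet(np.ones(K))
gamma_x = rng.dirichlet(np.ones(K))
gamma_y = rng.dirichlet(np.ones(K))

# Reduced parameterization: alpha_K = 0, so only the first K-1 weights are free.
w_r = w[:-1]
G_x = gamma_x[:-1] - w_r                    # gradient w.r.t. the K-1 free parameters
G_y = gamma_y[:-1] - w_r
F_r = np.diag(w_r) - np.outer(w_r, w_r)     # reduced (K-1)x(K-1) FIM block

lhs = G_x @ np.linalg.solve(F_r, G_y)       # left-hand side of Eq. (76)
rhs = np.sum((gamma_x - w) * (gamma_y - w) / w)   # right-hand side: full K-dim sum
print(lhs, rhs)                             # the two values agree up to round-off
```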

Note also that, if we consider the complete data likelihood:

$$\begin{aligned} p(x,z|\lambda ) = u_\lambda (x) p(z|x,\lambda ) \end{aligned}$$
(77)

the Fisher information decomposes as:

$$\begin{aligned} F_{c} = F_\lambda + F_{r}, \end{aligned}$$
(78)

where \(F_{c}\), \(F_\lambda \) and \(F_{r}\) denote the FIMs of the complete, marginal (observed) and conditional terms, respectively. Using the \(1\text{-of-}K\) formulation for \(z\), it can be shown that \(F_c\) has a diagonal form with entries given by (68), (71) and (76), respectively. Therefore, \(F_r\) can be seen as the amount of “information” lost by not knowing the true mixture component generating each of the \(x\)'s (Titterington et al. 1985). By requiring the distribution of the \(\gamma _x(k)\) to be “sharply peaked” we are making the approximation \(z_k\approx \gamma _x(k)\).

From this derivation we conclude that the assumption of sharply peaked posteriors leads to a diagonal approximation of the FIM, which can therefore be taken into account by a coordinate-wise normalization of the gradient vectors.
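
In code, this coordinate-wise normalization amounts to the following minimal NumPy sketch for a diagonal-covariance GMM in \(D\) dimensions. The scalings \(1/\sqrt{w_k}\), \(\sigma _k/\sqrt{w_k}\) and \(\sigma _k/\sqrt{2w_k}\) are the \(L_\lambda \) entries derived above; the function and variable names are illustrative, and the further scaling, power- and \(\ell _2\)-normalizations discussed in the main text are not included:

```python
import numpy as np

def fisher_vector_blocks(X, w, mu, sigma2, gamma):
    """Diagonal-FIM-normalized gradient statistics for one image (sketch only).

    X: (T, D) local descriptors; w: (K,) mixing weights;
    mu, sigma2: (K, D) means and diagonal variances; gamma: (T, K) posteriors.
    """
    sigma = np.sqrt(sigma2)                                   # (K, D)
    diff = (X[:, None, :] - mu[None, :, :]) / sigma           # (T, K, D)

    # Weight block: (gamma(k) - w_k) / sqrt(w_k), summed over descriptors.
    g_alpha = (gamma - w).sum(axis=0) / np.sqrt(w)                        # (K,)
    # Mean block: gamma(k) * (x - mu_k) / sigma_k / sqrt(w_k).
    g_mu = np.einsum('tk,tkd->kd', gamma, diff) / np.sqrt(w)[:, None]     # (K, D)
    # Variance block: gamma(k) * ((x - mu_k)^2 / sigma_k^2 - 1) / sqrt(2 w_k).
    g_sigma = np.einsum('tk,tkd->kd', gamma, diff**2 - 1.0) / np.sqrt(2 * w)[:, None]

    return np.concatenate([g_alpha, g_mu.ravel(), g_sigma.ravel()])
```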

Appendix 2: Good Practices for Gaussian Mixture Modeling

We now provide some good practices for GMM training. For a public GMM implementation and for more details on how to train and test GMMs, we refer the reader to the excellent HMM ToolKit (HTK) (Young et al. 2002; see footnote 10).

Computation in the Log Domain

We first describe how to compute in practice the likelihood (8) and the soft assignment (15). Since the low-level descriptors are quite high-dimensional (typically \(D=64\) in our experiments), the likelihood values \(u_k(x)\) of each Gaussian can be extremely small (and can even fall below machine precision when using floating-point values) because of the \(\exp \) in Eq. (9). Hence, for a stable implementation, it is of utmost importance to perform all computations in the log domain. In practice, for a descriptor \(x\), one never computes \(u_k(x)\) but

$$\begin{aligned}&\log u_k(x) \nonumber \\&\quad = -\frac{1}{2} \sum \limits _{d=1}^D \left[ \log (2\pi ) + \log (\sigma _{kd}^2) + \frac{(x_d - \mu _{kd})^2}{\sigma _{kd}^2} \right] \end{aligned}$$
(79)

where the subscript \(d\) denotes the \(d\)-th dimension of a vector. To compute the log-likelihood \(\log u_\lambda (x) = \log \sum \nolimits _{k=1}^K w_k u_k(x)\), one does so incrementally by writing \(\log u_\lambda (x) = \log \big ( w_1 u_1(x) + \sum \nolimits _{k=2}^K w_k u_k(x) \big )\) and by using the fact that \(\log (a+b) = \log (a) + \log \left( 1+ \exp (\log (b) - \log (a))\right) \) to remain in the log domain.

Similarly, to compute the posterior probability (15), one writes \(\gamma _k = \exp \left[ \log (w_k u_k(x)) - \log (u_\lambda (x)) \right] \) to operate in the log domain.
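
As an illustration, these log-domain computations can be written as the following minimal NumPy/SciPy sketch; the names are illustrative, and the pairwise \(\log (a+b)\) recursion above is replaced by the equivalent log-sum-exp operation:

```python
import numpy as np
from scipy.special import logsumexp

def log_gaussians(X, mu, sigma2):
    """log u_k(x) for diagonal-covariance Gaussians, as in Eq. (79).
    X: (T, D) descriptors; mu, sigma2: (K, D) means and variances."""
    diff2 = (X[:, None, :] - mu[None, :, :]) ** 2              # (T, K, D)
    return -0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + diff2 / sigma2, axis=2)

def log_likelihood_and_posteriors(X, w, mu, sigma2):
    log_wu = np.log(w) + log_gaussians(X, mu, sigma2)          # log(w_k u_k(x)), (T, K)
    log_ux = logsumexp(log_wu, axis=1, keepdims=True)          # log u_lambda(x), (T, 1)
    gamma = np.exp(log_wu - log_ux)                            # posteriors, Eq. (15)
    return log_ux.ravel(), gamma
```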

Variance Flooring

Because the variance \(\sigma _k^2\) appears inside a \(\log \) and as a denominator in Eq. (79), too small variance values can lead to instabilities in the Gaussian computations. In our case, this is even more likely to happen since we extract patches densely and do not discard uniform patches. In our experience, such patches tend to cluster in a Gaussian mixture component with a tiny variance. To avoid this issue, we use variance flooring: we compute the global covariance matrix over the whole training set and enforce the variance of each Gaussian to be no smaller than a constant \(\alpha \) times the global variance. HTK suggests the value \(\alpha = 0.01\).
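
A minimal sketch of this flooring step, assuming diagonal covariances stored as a \(K\times D\) array (names are illustrative):

```python
import numpy as np

def floor_variances(sigma2, global_var, alpha=0.01):
    """Enforce sigma2[k, d] >= alpha * global_var[d] (variance flooring).
    sigma2: (K, D) per-Gaussian diagonal variances;
    global_var: (D,) variance of the whole training set; alpha = 0.01 as suggested by HTK."""
    return np.maximum(sigma2, alpha * global_var)

# Typically applied after every M-step, e.g.:
#   global_var = X_train.var(axis=0)
#   sigma2 = floor_variances(sigma2, global_var)
```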

Posterior Thresholding

To reduce the cost of training GMMs as well as the cost of computing FVs, we assume that all posteriors \(\gamma (k)\) below a given threshold \(\theta \) are exactly zero. In practice, we use a rather conservative threshold \(\theta =10^{-4}\); for a GMM with 256 Gaussians, at most 5–10 Gaussians exceed this threshold. After discarding some of the \(\gamma (k)\) values, we renormalize the remaining \(\gamma (k)\)'s to ensure that we still have \(\sum \nolimits _{k=1}^K \gamma (k) = 1\).

Note that this operation not only reduces the computational cost, it also sparsifies the FV (see Sect. 4.2). Without such posterior thresholding, the FV (or the soft-BOV) would be completely dense.
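
A minimal sketch of this thresholding step (names are illustrative):

```python
import numpy as np

def threshold_posteriors(gamma, theta=1e-4):
    """Zero out posteriors below theta and renormalize each row to sum to one.
    gamma: (T, K) soft assignments; theta = 1e-4 as used in the text.
    With K = 256, each row keeps at least its maximum (>= 1/K > theta),
    so the renormalization is always well defined."""
    gamma = np.where(gamma >= theta, gamma, 0.0)
    return gamma / gamma.sum(axis=1, keepdims=True)
```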

Incremental Training

It is well known that the ML estimation of a GMM is a non-convex optimization problem for more than one Gaussian. Hence, different initializations might lead to different solutions. While in our experience we have never observed a drastic influence of the initialization on the end result, we strongly advise the use of an iterative process as suggested, for instance, in Young et al. (2002). This iterative procedure consists of starting with a single Gaussian (for which a closed-form solution exists), splitting all Gaussians by slightly perturbing their means, and then re-estimating the GMM parameters with EM. This iterative splitting-and-retraining strategy makes it possible to cross-validate and monitor the influence of the number of Gaussians \(K\) in a consistent way.
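
A minimal sketch of this splitting strategy, assuming the target number of Gaussians is a power of two and using scikit-learn's GaussianMixture for the EM re-estimation (an assumption of this sketch; HTK implements its own recipe, and the split perturbation eps is an illustrative choice):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_by_splitting(X, K_final, eps=0.1, em_iters=25):
    """Grow a diagonal-covariance GMM from 1 to K_final Gaussians by
    repeatedly splitting every component and re-running EM."""
    # Closed-form single Gaussian.
    w   = np.array([1.0])
    mu  = X.mean(axis=0, keepdims=True)          # (1, D)
    var = X.var(axis=0, keepdims=True)           # (1, D)

    while len(w) < K_final:
        # Split: perturb each mean by +/- eps * std, duplicate variances, halve weights.
        delta = eps * np.sqrt(var)
        mu  = np.concatenate([mu - delta, mu + delta])
        var = np.concatenate([var, var])
        w   = np.concatenate([w, w]) / 2.0

        # Re-estimate with EM, warm-started from the split parameters.
        gmm = GaussianMixture(n_components=len(w), covariance_type="diag",
                              weights_init=w, means_init=mu,
                              precisions_init=1.0 / var, max_iter=em_iters)
        gmm.fit(X)
        w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    return w, mu, var
```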

Cite this article

Sánchez, J., Perronnin, F., Mensink, T. et al. Image Classification with the Fisher Vector: Theory and Practice. Int J Comput Vis 105, 222–245 (2013). https://doi.org/10.1007/s11263-013-0636-x
