Computer Vision Metrics, pp 217–282
Interest Point Detector and Feature Descriptor Survey
Keywords
Entropy Covariance Retina Smoke Auto Correlation
“Who makes all these?”
—Jack Sparrow, Pirates of the Caribbean
Many algorithms for computer vision rely on locating interest points, or keypoints in each image, and calculating a feature description from the pixel region surrounding the interest point. This is in contrast to methods such as correlation, where a larger rectangular pattern is stepped over the image at pixel intervals and the correlation is measured at each location. The interest point is the anchor point, and often provides the scale, rotational, and illumination invariance attributes for the descriptor; the descriptor adds more detail and more invariance attributes. Groups of interest points and descriptors together describe the actual objects.
However, there are many methods and variations in feature description. Some methods use features that are not anchored at interest points, such as polygon shape descriptors, computed over larger segmented polygon-shaped structures or regions in an image. Other methods use interest points only, without using feature descriptors at all. And some methods use feature descriptors only, computed across a regular grid on the image, with no interest points at all.
Interest Point Tuning
What is a good keypoint for a given application? Which ones are most useful? Which ones should be ignored? Tuning the detectors is not simple. Each detector has different parameters to tune for best results on a given image, and each image presents different challenges regarding lighting, contrast, and image preprocessing. Additionally, each detector is designed to be useful for a different class of interest points, and must be tuned accordingly to filter the results down to a useful set of good candidates for a specific feature descriptor. Each feature detector will work best with certain descriptors; see Appendix A.

DynamicAdaptedFeatureDetector. This class will tune supported detectors using an adjusterAdapter() to only keep a limited number of features, and iterate the detector parameters several times and redetect features in an attempt to find the best parameters, keeping only the requested number of best features. Several OpenCV detectors have an adjusterAdapter() provided, some do not; the API allows for adjusters to be created.

AdjusterAdapter. This class implements the criteria for culling and keeping interest points. Criteria may include KNN nearest neighbor matching, detector response or strength, radius distance to nearest other detected points, number of keypoints within a local region, and other measures that can be included for culling keypoints for which a good descriptor cannot be computed.

PyramidAdaptedFeatureDetector. This class can be used to adapt detectors that do not use a scale-space pyramid, and the adapter will create a Gaussian pyramid and detect features over the pyramid.

GridAdaptedFeatureDetector. This class divides an image into grids and adapts the detector to find the best features within each grid cell.
Interest Point Concepts
An interest point may be composed of various types of corner, edge, and maxima shapes, as shown in Figure 6-1. In general, a good interest point must be easy to find and ideally fast to compute; it is hoped that the interest point is at a good location to compute a feature descriptor. The interest point is thus the qualifier or keypoint around which a feature may be described.
There are various concepts behind the interest point methods currently in use, as this is an active area of research. One of the best analyses of interest point detectors is found in Mikolajczyk et al.[153], with a comparison framework and taxonomy for affine covariant interest point detectors, where covariant refers to the elliptical shape of the interest region, which is an affine deformable representation. Scale invariant detectors are represented well in a circular region. Maxima region and blob detectors can take irregular shapes. See the response of several detectors against synthetic interest point and corner alphabets in Appendix A.
Corners are often preferred over edges or isolated maxima points, since the corner is a structure and can be used to compute an angular orientation for the feature. Interest points are computed over color components as well as gray scale luminance. Many of the interest point methods will first apply some sort of Gaussian filter across the image and then perform a gradient operator. The idea of using the Gaussian filter first is to reduce noise in the image, which is otherwise amplified by gradient operators.
Each detector locates features with different degrees of invariance to attributes such as rotation, scale, perspective, occlusion, and illumination. For evaluations of the quality and performance of interest point detection methods measured against various robustness and invariance criteria on standardized datasets, see Mikolajczyk and Schmid [144] and Gauglitz et al. [145]. One of the key challenges for interest point detection is scale invariance, since interest points change dramatically in some cases over scale. Lindeberg [212] has extensively studied the area of scale-independent interest point methods.
Affine invariant interest points have been studied in detail by Mikolajczyk and Schmid [107,141,144,153,306,311]. In addition, Mikolajczyk and Schmid [519] developed an affine-invariant version of the Harris detector. As shown in [541], it is often useful to combine several interest point detection methods to form a hybrid, for example, using the Harris or Hessian to locate suitable maxima regions, and then using the Laplacian to select the best scale attributes. Variations are common: Harris-based and Hessian-based detectors may use scale-space methods, while local binary detector methods do not use scale space.

Gradient Magnitude. This is the first derivative of the pixels in the local interest region, and assumes a direction. This is an unsigned positive number.

Gradient Direction. This is the angle or direction of the largest gradient angle from pixels in the local region in the range −π to +π.

Laplacian. This is the second derivative and can be computed directionally using any of three terms: the second partial derivative in x, the second partial derivative in y, or the mixed partial derivative in x and y.

Hessian Matrix or Hessian. A square matrix containing second-order partial derivatives describing surface curvature. The Hessian has several interesting properties useful for interest point detection methods discussed in this section.

Largest Hessian. This is based on the second derivative, as is the Laplacian, but the Hessian uses all three terms of the second derivative to compute the direction along which the second derivative is maximum as a signed value.

Smallest Hessian. This is based on the second derivative, is computed as a signed number, and may be a useful metric as a ratio between largest and smallest Hessian.

Hessian Orientation, largest and smallest values. This is the orientation of the largest second derivative in the range −π to +π, which is a signed value, and it corresponds to an orientation without direction. The smallest orientation can be computed by adding or subtracting π/2 from the largest value.

Determinant of Hessian, Trace of Hessian, Laplacian of Gaussian. All three names are used to describe the trace characteristic of a matrix, which can reveal geometric scale information by the absolute value, and orientation by the sign of the value. The eigenvalues of a matrix can be found using determinants.

Eigenvalues, Eigenvectors, Eigenspaces. Eigen properties are important to understanding vector direction in local pixel region matrices. When a matrix acts on a vector and the vector orientation is preserved, or the sign or direction is simply reversed, the vector is considered to be an eigenvector, and the scaling factor is considered to be the eigenvalue. An eigenspace is therefore all eigenvectors within the space with the same eigenvalue. Eigen properties are valuable for interest point detection, orientation, and feature detection. For example, Turk and Pentland [158] use eigenvectors reduced into a smaller set of vectors via PCA for face recognition, in a method they call Eigenfaces.
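The quantities in the list above can be sketched with simple finite differences. This is only an illustration, not the formulation used by any particular detector: the function names `gradient` and `hessian_measures` are hypothetical, the image is assumed to be a list of rows of numbers, and the central-difference stencils are an assumption made for brevity.

```python
import math

def gradient(img, x, y):
    """Gradient magnitude and direction at interior pixel (x, y).
    Assumption: simple central differences, no pre-smoothing."""
    gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
    gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    magnitude = math.hypot(gx, gy)   # unsigned, always >= 0
    direction = math.atan2(gy, gx)   # signed angle in [-pi, +pi]
    return magnitude, direction

def hessian_measures(img, x, y):
    """Largest/smallest eigenvalue, trace, and determinant of the
    2x2 Hessian [[lxx, lxy], [lxy, lyy]] at interior pixel (x, y)."""
    lxx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
    lyy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
    lxy = (img[y + 1][x + 1] - img[y + 1][x - 1]
           - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
    trace = lxx + lyy                # equals the Laplacian
    det = lxx * lyy - lxy * lxy      # determinant of Hessian
    root = math.sqrt((lxx - lyy) ** 2 + 4.0 * lxy * lxy)
    largest = (trace + root) / 2.0   # closed-form eigenvalues of a
    smallest = (trace - root) / 2.0  # symmetric 2x2 matrix
    return largest, smallest, trace, det
```

On a horizontal ramp the gradient direction is 0 and the Hessian vanishes; on a surface that curves only along x, one eigenvalue is nonzero and the other is zero, which is the edge-like case that the ratio of largest to smallest Hessian is meant to expose.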
Interest Point Method Survey

Laplacian of Gaussian (LOG)

Moravec corner detector

Harris and Stephens corner detection

Shi and Tomasi corner detector (improvement on Harris method)

Difference of Gaussians (DoG; an approximation of LOG)

Harris methods, Harris–/Hessian–Laplace, Harris–/Hessian–Affine

Determinant of Hessian (DoH)

Salient regions

SUSAN

FAST, FASTER, AGAST

Local curvature

Morphological interest points

MSER (discussed in the section on polygon shape descriptors)

*NOTE: many feature descriptors, such as SIFT, SURF, BRISK and others, provide their own detector method along with the descriptor method; see Appendix A.
Laplacian and Laplacian of Gaussian
The Laplacian operator, as used in image processing, is a method of finding the derivative or maximum rate of change in a pixel area. Commonly, the Laplacian is approximated using standard convolution kernels that add up to zero, such as:
The Laplacian of Gaussian (LOG) is simply the Laplacian performed over a region that has been processed using a Gaussian smoothing kernel to focus edge energy; see Gun [155].
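As an illustration of a zero-sum kernel and of the smooth-then-differentiate order used by LOG, the following sketch applies a 3x3 filter to a list-of-rows image. The specific 4-neighbor Laplacian kernel, the 3x3 binomial approximation of a Gaussian, and the function names are assumptions chosen for brevity; they are not necessarily the kernels shown in the original text.

```python
# Assumption: a common 4-neighbor Laplacian kernel; coefficients sum to zero.
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

# Assumption: 3x3 binomial kernel used as a small Gaussian approximation.
GAUSSIAN = [[1 / 16, 2 / 16, 1 / 16],
            [2 / 16, 4 / 16, 2 / 16],
            [1 / 16, 2 / 16, 1 / 16]]

def filter3x3(img, kernel):
    """Apply a 3x3 kernel to interior pixels; border pixels are left at 0.
    Both kernels above are symmetric, so correlation equals convolution."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(kernel[j][i] * img[y + j - 1][x + i - 1]
                            for j in range(3) for i in range(3))
    return out

def laplacian_of_gaussian(img):
    """Smooth first, then take the Laplacian. Because borders are zeroed,
    only pixels at least two away from the border are meaningful."""
    return filter3x3(filter3x3(img, GAUSSIAN), LAPLACIAN)
```

Because the Laplacian coefficients sum to zero, a flat region produces no response, while an isolated bright pixel produces a strong negative response at its center.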
Moravec Corner Detector
The Moravec corner detection algorithm is an early method of corner detection whereby each pixel in the image is tested by correlating overlapping patches surrounding each neighboring pixel. The strength of the correlation in any direction reveals information about the point: a corner is found when there is change in all directions, and an edge is found when there is no change along the edge direction. A flat region yields no change in any direction. The correlation difference is calculated using the sum of squared differences (SSD) between the two overlapping patches. Similarity is indicated by a near-zero SSD. This method is compute intensive; see Moravec [330].
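The patch comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the patch radius, the eight one-pixel shifts, and the names `ssd` and `moravec_score` are choices made here, and taking the minimum SSD over all shifts as the corner score is the usual reading of "change in all directions".

```python
# Assumption: compare against the eight one-pixel shifts.
SHIFTS = [(1, 0), (-1, 0), (0, 1), (0, -1),
          (1, 1), (-1, -1), (1, -1), (-1, 1)]

def ssd(img, x, y, dx, dy, r=1):
    """Sum of squared differences between the (2r+1)x(2r+1) patch centered
    at (x, y) and the same patch shifted by (dx, dy)."""
    total = 0
    for j in range(-r, r + 1):
        for i in range(-r, r + 1):
            d = img[y + j + dy][x + i + dx] - img[y + j][x + i]
            total += d * d
    return total

def moravec_score(img, x, y, r=1):
    """Corner score: the smallest SSD over all shifts. A corner changes in
    every direction, so even its smallest SSD is large; an edge has a zero
    SSD along the edge direction; a flat region is zero everywhere."""
    return min(ssd(img, x, y, dx, dy, r) for dx, dy in SHIFTS)
```

The nested loops over shifts and patch pixels at every image location make the cost of the method visible, which is the compute burden the text refers to.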
Harris Methods, Harris-Stephens, Shi-Tomasi, and Hessian-Type Detectors
The Harris or Harris-Stephens corner detector family [156,365] provides improvements over the Moravec method. The goal of the Harris method is to find the direction of fastest and lowest change for feature orientation, using a covariance matrix of local directional derivatives. The directional derivative values are compared with a scoring factor to identify which features are corners, which are edges, and which are likely noise. Depending on the formulation of the algorithm, the Harris method can provide high rotational invariance and limited intensity invariance, and some formulations, such as the Harris-Laplace method using scale space [519] [212], also provide scale invariance. Many Harris family algorithms can be implemented in a compute-efficient manner.
Note that corners have an ill-defined gradient, since two edges converge at the corner, but near the corner the gradient can be detected with two different values with respect to x and y; this is a basic idea behind the Harris corner detector.

The Shi, Tomasi and Kanade corner detector [157] is an optimization on the Harris method, using only the minimum eigenvalues for discrimination, thus streamlining the computation considerably.

The Hessian (Hessian-Affine) corner detector [153] is designed to be affine invariant, and it uses the basic Harris corner detection method but combines interest points from several scales in a pyramid, with some iterative selection criteria and a Hessian matrix.

Many other variations on the basic Harris operator exist, such as the Harris–Hessian–Laplace [331], which provides improved scale invariance using a scale selection method, and the Harris–/Hessian–Affine method [306,153].
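The difference between the Harris score and the Shi-Tomasi minimum-eigenvalue score can be sketched on the same 2x2 matrix of summed derivative products. This is an illustrative sketch only: the function names, the unweighted 3x3 summation window, the central-difference gradients, and the scoring factor k = 0.04 are all assumptions made here, not values taken from the cited papers.

```python
import math

def structure_tensor(img, x, y, r=1):
    """Sum of derivative products over a (2r+1)x(2r+1) window:
    a = sum(Ix*Ix), b = sum(Ix*Iy), c = sum(Iy*Iy).
    Assumption: unweighted window and central differences."""
    a = b = c = 0.0
    for j in range(y - r, y + r + 1):
        for i in range(x - r, x + r + 1):
            ix = (img[j][i + 1] - img[j][i - 1]) / 2.0
            iy = (img[j + 1][i] - img[j - 1][i]) / 2.0
            a += ix * ix
            b += ix * iy
            c += iy * iy
    return a, b, c

def harris_response(a, b, c, k=0.04):
    """det - k * trace^2: positive at corners, negative at edges,
    near zero in flat regions. k is an assumed scoring factor."""
    return (a * c - b * b) - k * (a + c) ** 2

def shi_tomasi_response(a, b, c):
    """Smaller eigenvalue of [[a, b], [b, c]]; large only at corners."""
    return (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)
```

On a straight step edge only one eigenvalue is nonzero, so the minimum eigenvalue is zero and the Harris score is negative; at the corner of a bright quadrant both scores are positive.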
Hessian Matrix Detector and Hessian-Laplace
The Hessian Matrix method, also referred to as the Determinant of Hessian (DoH) method, is used in the popular SURF algorithm [160]. It detects interest objects from a multi-scale image set where the determinant of the Hessian matrix is at a maximum, and the Hessian matrix operator is calculated using the convolution of the second-order partial derivative of the Gaussian to yield gradient maxima.
The DoH method uses integral images to calculate the Gaussian partial derivatives very quickly. Performance for calculating the Hessian Matrix is therefore very good, and accuracy is better than many methods. The related Hessian-Laplace method [331,306] also operates on local extrema, using the determinant of the Hessian at multiple scales for spatial localization, and the Laplacian at multiple scales for scale localization.
Difference of Gaussians
The Difference of Gaussians (DoG) is an approximation of the Laplacian of Gaussians, but computed in a simpler and faster manner using the difference of two smoothed or Gaussian filtered images to detect local extrema features. The idea with Gaussian smoothing is to remove noise artifacts that are not relevant at the given scale, which would otherwise be amplified and result in false DoG features. The DoG features are used in the popular SIFT method [161], and as shown later in Figure 6-15, the simple difference of Gaussian filtered images is taken to identify maxima regions.
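The subtraction of two Gaussian-filtered signals can be sketched in one dimension to keep the example short; the two-dimensional case applies the same idea along rows and columns. The function names, the kernel truncation at three sigma, the border clamping, and the scale ratio k = 1.6 are assumptions made for illustration.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights, truncated at about 3 sigma."""
    r = int(3 * sigma + 0.5)
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-r, r + 1)]
    s = sum(w)
    return [v / s for v in w]

def blur(signal, sigma):
    """Gaussian smoothing with indices clamped at the borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    n = len(signal)
    out = []
    for x in range(n):
        acc = 0.0
        for i, kv in enumerate(k):
            xi = min(max(x + i - r, 0), n - 1)
            acc += kv * signal[xi]
        out.append(acc)
    return out

def difference_of_gaussians(signal, sigma, k=1.6):
    """Coarser blur minus finer blur. Assumption: scale ratio k = 1.6."""
    fine = blur(signal, sigma)
    coarse = blur(signal, k * sigma)
    return [c - f for c, f in zip(coarse, fine)]
```

A constant signal yields a zero response everywhere, and an isolated bright sample yields a local extremum (a minimum under this sign convention) at its own position, which is the kind of extremum a DoG detector searches for.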
Salient Regions
1. The Shannon entropy E of pixel attributes such as intensity or color is computed over a scale space, where Shannon entropy is used as the measure of unpredictability.
2. The entropy values are located over the scale space with maxima or peak values M. At this stage, the optimal scales are determined as well.
3. The probability density function (PDF) is computed for magnitude deltas at each peak within each scale, where the PDF is computed using a histogram of pixel values taken from a circular window of desired radius from the peak.
4. Saliency is the product of E and M at each peak, and is also related to scale. So the final detector is salient and robust to scale.
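The entropy measure used in step 1 can be sketched as follows. Only the entropy computation is shown; the scale-space search and the peak weighting of the later steps are omitted, and the function name and the use of a flat list of window pixels are assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy, in bits, of the value histogram of a pixel window.
    `pixels` is a flat list of the values inside the window."""
    counts = Counter(pixels)
    n = len(pixels)
    return abs(-sum((c / n) * math.log2(c / n) for c in counts.values()))
```

A perfectly uniform window is fully predictable and has zero entropy, while a window whose values are spread evenly over four levels has an entropy of two bits; higher values indicate less predictable, and therefore potentially more salient, regions.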
SUSAN, and Trajkovic and Hedley
Each USAN contains structural information about the image in the local region, and the size, centroid, and secondorder moments of each USAN can be computed. The SUSAN method can be used for both edge and corner detection. Corners are determined by the ratio of pixels similar to the center pixel in the circular region: a low ratio around 25 percent indicates a corner, and a higher ratio around 50 percent indicates an edge. SUSAN is very robust to noise.
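The ratio test described above can be sketched by counting, inside a circular window, the pixels whose brightness is close to that of the center pixel. The function name, the window radius of 3, and the brightness threshold t are assumptions made for illustration; the exact ratios obtained depend on the window size and on whether the center pixel is counted.

```python
def usan_ratio(img, x, y, radius=3, t=10):
    """Fraction of pixels in a circular window whose value is within t of
    the center pixel. Assumptions: radius 3, threshold t = 10, and the
    center pixel itself is included in the count."""
    center = img[y][x]
    similar = total = 0
    for j in range(-radius, radius + 1):
        for i in range(-radius, radius + 1):
            if i * i + j * j <= radius * radius:
                total += 1
                if abs(img[y + j][x + i] - center) <= t:
                    similar += 1
    return similar / total
```

In a flat region the ratio is 1.0; on a straight edge roughly half of the window matches the center; at a right-angle corner the matching area shrinks further, so ordering points by this ratio separates corners from edges.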
The Trajkovic and Hedley method [214] is similar to SUSAN, and discriminates among points in USAN regions, edge points, and corner points.
SUSAN is also useful for noise suppression, and the bilateral filter [302], discussed in Chapter 2, is closely related to SUSAN. SUSAN uses fairly large circular windows; several implementations use 37 pixel radius windows. The FAST [138] detector is also similar to SUSAN, but uses a smaller 7x7 or 9x9 window and only some of the pixels in the region instead of all of them; FAST yields a local binary descriptor.
FAST, FASTER, AGAST
The FAST methods [138] are derived from SUSAN with respect to a bimodal segmentation goal. However, FAST relies on a connected set of pixels in a circular pattern to determine a corner. The connected region size is commonly 9 or 10 out of a possible 16; either number may be chosen, referred to as FAST9 and FAST10. FAST is known to be efficient to compute and fast to match; accuracy is also quite good. FAST can be considered a relative of the local binary pattern (LBP).
FAST is not a scale-space detector, and therefore it may produce many more edge detections at the given scale than a scale-space method such as used in SIFT.
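The connected-set test can be sketched as a search for a long enough run of ring pixels that are all brighter or all darker than the center. This is a plain illustration without the decision-tree speedups that real implementations use; the function name, the threshold t, and the use of a 16-pixel circle of radius 3 are assumptions stated here.

```python
# Assumption: the 16 offsets (dx, dy) of a radius-3 circle, in ring order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t=20, n=9):
    """True if at least n contiguous ring pixels are all brighter than
    center + t or all darker than center - t (n = 9 for FAST9)."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):                 # +1: brighter test, -1: darker test
        flags = [sign * (p - center) > t for p in ring]
        run = 0
        for f in flags + flags:          # doubled so a run may wrap around
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False
```

With n = 9, a pixel lying on a perfectly straight step edge sees at most 7 contrasting ring pixels in this layout and is rejected, while a pixel at the tip of a right-angle corner sees 11 and is accepted.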
Local Curvature Methods
Local curvature methods [208–212] are among the early means of detecting corners, and some local curvature methods are the first known to be reliable and accurate in tracking corners over scale variations [210]. Local curvature detects points where the gradient magnitude and the local surface curvature are both high. One approach taken is a differential method, computing the product of the gradient magnitude and the level curve curvature together over scale space, and then selecting the maxima and minima absolute values in scale and space. One formulation of the method is shown here.
Various formulations of the basic algorithm can be taken depending on the curvature equation used. To improve scale invariance and noise sensitivity, the method can be modified using a normalized formulation of the equation over scale space, as follows:
At larger scales, corners can be detected with less sharp and more rounded features, while at lower scales or at unity scale sharper corners over smaller areas are detected. The Wang and Brady method [213] also computes interest points using local curvature on the 2D surface, looking for inflexion points where the surface curvature changes rapidly.
Morphological Interest Regions
Interest points can be determined from a pipeline of morphological operations, such as thresholding followed by combinations of erosion and dilation to smooth, thin, grow, and shrink pixel groups. If done correctly for a given application, such morphological features can be scale and rotation invariant. Note that the simple morphological operations alone are not enough; for example, erode left unconstrained will shrink regions until they disappear. So intelligence must be added to the morphology pipeline to control the final region size and shape. For polygon shape descriptors, morphological interest points define the feature, and various image moments are computed over the feature, as described in Chapter 3 and also in the section on polygon shape descriptors later in this chapter.
Morphological operations can be used to create interest regions on binary, gray scale, or color channel images. To prepare gray scale or color channel images for morphology, typically some sort of preprocessing is used, such as pixel remapping, LUT transforms, or histogram equalization. (These methods were discussed in Chapter 2.) For binary images and binary morphology approaches, binary thresholding is a key preprocessing step. Many binary thresholding methods have been devised, ranging from simple global thresholds to statistical and structural kernel-based local methods.
Note that the morphological interest region approach is similar to the maximally stable extremal region (MSER) feature descriptor method discussed later in the section on polygon shape descriptors, since both methods look for connected groups of pixels at maxima or minima. However, MSER does not use morphology operators.
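A minimal sketch of binary erosion and dilation shows why an unconstrained erode eventually removes a region entirely. The function names, the 3x3 square structuring element, and the treatment of pixels outside the image as background are assumptions made for illustration.

```python
def erode(mask):
    """Binary erosion with a 3x3 square: a pixel survives only if all nine
    pixels in its neighborhood are set. Border pixels are cleared."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(mask[y + j][x + i]
                                for j in (-1, 0, 1) for i in (-1, 0, 1)))
    return out

def dilate(mask):
    """Binary dilation with a 3x3 square: a pixel is set if any pixel in
    its neighborhood (clipped to the image) is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(mask[j][i]
                                for j in range(max(y - 1, 0), min(y + 2, h))
                                for i in range(max(x - 1, 0), min(x + 2, w))))
    return out
```

Eroding a 3x3 block once leaves a single pixel and eroding again leaves nothing, whereas erosion followed by dilation (an opening) restores the block; a practical pipeline therefore needs a stopping rule on region size or shape.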
Feature Descriptor Survey

Local binary descriptors

Spectra descriptors

Basis space descriptors

Polygon shape descriptors

3D, 4D, and volumetric descriptors

General Vision Taxonomy and FME: covering feature attributes including spectra, shape, and pattern, single or multivariate, compute complexity criteria, data types, memory criteria, matching method, robustness attributes, and accuracy.

General Robustness Attributes: covering invariance attributes such as illumination, scale, perspective, and many others.
No direct comparisons are made between feature descriptors here, but ample references are provided to the literature for detailed comparisons and performance information on each method.
Local Binary Descriptors
This family of descriptors represents features as binary bit vectors. To compute the features, image pixel point-pairs are compared and the results are stored as binary values in a vector. Local binary descriptors are efficient to compute, efficient to store, and efficient to match using Hamming distance. In general, local binary pattern methods achieve very good accuracy and robustness compared to other methods.
A variety of local sampling patterns are used with local binary descriptors to set the pairwise point comparisons; see the section in Chapter 4 on local binary descriptor point-pair patterns for a discussion on local binary sampling patterns. We start this section on local binary descriptors by analyzing the local binary pattern (LBP) and some LBP variants, since the LBP is a powerful metric all by itself and is well known.
Local Binary Patterns
Local binary patterns (LBP) were developed in 1994 by Ojala et al. [173] as a novel method of encoding both pattern and contrast to define texture [169,170–173]. LBPs can be used as an image processing operator. The LBP creates a descriptor or texture model using a set of histograms of the local texture neighborhood surrounding each pixel. In this case, local texture is the feature descriptor.
In its simplest embodiment, LBP has the goal of creating a binary coded neighborhood descriptor for a pixel. It does this by comparing each pixel against its neighbors using the > operator and encoding the compare results (1,0) into a binary number, as shown later in Figure 6-8. LBP histograms from larger image regions can even be used as signals and passed into a 1D FFT to create a feature descriptor. The Fourier spectrum of the LBP histogram is rotationally invariant; see Figure 6-6. The FFT spectrum can then be concatenated onto the LBP histogram to form a multivariate descriptor.
As shown in Figure 6-6, the LBP is used as an image processing operator, region segmentation method, and histogram feature descriptor. The LBP has many applications. An LBP may be calculated over various sizes and shapes using various sizes of forming kernels. A simple 3x3 neighborhood provides basic coverage for local features, while wider areas and kernel shapes are used as well.
Assuming a 3x3 LBP kernel pattern is chosen, there will be 8 pixel compares and up to 2^{8} combinations of results, for a 256-bin histogram. However, it has been shown [18] that reducing the 8-bit, 256-bin histogram to use only 56 LBP bins based on uniform patterns is the optimal number. The 56 bins, or uniform patterns, are chosen to represent only patterns that consist of two connected contiguous segments around the circle, rather than all 256 possible pattern combinations [173,15]. The same uniform pattern logic applies to LBPs of dimension larger than 8 bits. So, uniform patterns provide both histogram space savings and feature compare-space optimization, since fewer features need be matched (56 instead of all 256).
Neighborhood Comparison
Each pixel is compared to its neighbors according to a forming kernel that allows selection of neighbors for the comparison. In Figure 6-10, all pixels are used in the forming kernel (all 1s). If the neighbor is greater than the center pixel, the binary pattern is 1; otherwise it is 0.
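The neighbor comparison can be sketched for a 3x3 kernel with all eight neighbors enabled. The function name, the clockwise starting point at the top-left neighbor, and the assignment of the first neighbor to the least significant bit are assumptions made here; other bit orderings are equally valid as long as they are applied consistently.

```python
# Assumption: neighbors visited clockwise, starting at the top-left pixel.
NEIGHBORS = [(-1, -1), (0, -1), (1, -1), (1, 0),
             (1, 1), (0, 1), (-1, 1), (-1, 0)]

def lbp_code(img, x, y):
    """8-bit LBP code at interior pixel (x, y): bit k is 1 when the k-th
    neighbor is greater than the center pixel, otherwise 0."""
    center = img[y][x]
    code = 0
    for bit, (dx, dy) in enumerate(NEIGHBORS):
        if img[y + dy][x + dx] > center:
            code |= 1 << bit
    return code
```

A flat neighborhood produces code 0, a center pixel darker than all of its neighbors produces 255, and any uniform brightness offset added to the whole patch leaves the code unchanged, which is the source of the brightness robustness listed for LBP.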
Histogram Composition
Each LBP descriptor over an image region is recorded in a histogram to describe the cumulative texture feature. Uniform LBP histograms would have 56 bins, since only single-connected regions are histogrammed.
Optional Normalization
The final histogram can be reduced to a smaller number of bins using binary decimation for powers of two or some similar algorithm, such as 256 → 32. In addition, the histograms can be reduced in size by thresholding the range of contiguous bins used for the histogram, for example, by ignoring bins 1 to 64 if little or no information is binned in them.
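One plausible reading of the 256 → 32 reduction is to merge each group of adjacent bins into one. The sketch below takes that reading as an assumption; the function name is hypothetical, and other decimation schemes would also fit the description.

```python
def decimate_histogram(hist, bins):
    """Reduce a histogram to `bins` bins by summing groups of adjacent
    bins. Assumption: len(hist) is an exact multiple of `bins`."""
    factor = len(hist) // bins
    return [sum(hist[i * factor:(i + 1) * factor]) for i in range(bins)]
```

Merging adjacent bins preserves the total count, so the reduced histogram can still be normalized and compared in the same way as the original.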
Descriptor Concatenation
Multiple LBPs taken over overlapping regions may be concatenated together into a larger histogram feature descriptor to provide better discrimination.

Spectra: Local binary

Feature shape: Square

Feature pattern: Pixel region compares with center pixel

Feature density: Local 3x3 at each pixel

Search method: Sliding window

Distance function: Hamming distance

Robustness: 3 (brightness, contrast, *rotation for RILBP)
Rotation Invariant LBP (RILBP)
To achieve rotational invariance, the rotation invariant LBP (RILBP) [173] is calculated by circular bitwise rotation of the local LBP to find the minimum binary value. The minimum value LBP is used as a rotation invariant signature and is recorded in the histogram bins. The RILBP is computationally very efficient.
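The circular bitwise rotation can be sketched directly on the integer code. The function name and the default of 8 bits are assumptions made for illustration.

```python
def rotation_invariant_lbp(code, bits=8):
    """Return the minimum value over all circular rotations of an LBP
    code that is `bits` wide."""
    mask = (1 << bits) - 1
    best = code
    for _ in range(bits - 1):
        # Rotate right by one position within the `bits`-wide word.
        code = ((code >> 1) | (code << (bits - 1))) & mask
        best = min(best, code)
    return best
```

All rotations of the same pattern map to one signature, so for example 0b00010000 and 0b01000000 both become 1; this merges the 256 possible codes into a much smaller set of histogram bins.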
Note that many researchers [171, 172] are extending the methods used for LBP calculation to use refinements such as local derivatives, local median or mean values, trinary or quinary compare functions, and many other methods, rather than the simple binary compare function, as originally proposed.
Dynamic Texture Metric Using 3D LBPs
Dynamic textures are visual features that morph and change as they move from frame to frame; examples include waves, clouds, wind, smoke, foliage, and ripples. Two extensions of the basic LBP used for tracking such dynamic textures are discussed here: VLBP and LBP-TOP.
Volume LBP (VLBP)
LBP-TOP
The LBP-TOP [176] is created like the VLBP, except that instead of calculating the three individual LBPs from parallel planes, they are calculated from orthogonal planes in the volume (x,y,z) intersecting the interest point, as shown in Figure 6-12. The 3D composite descriptor is the same size as the VLBP and contains three planes’ worth of data. The histograms for each LBP plane are also concatenated for the LBP-TOP like the VLBP.
Other LBP Variants
LBP Variants (from reference [173])
ULBP (Uniform LBP) Uses only 56 uniform bins instead of the full 256 bins possible with 8-bit pixels to create the histogram. The uniform patterns consist of contiguous segments of connected TRUE values.
RLBP (Robust LBP) Adds a scale factor to eliminate transitions due to noise (p1 − p2 + SCALE)
CS-LBP Center-symmetric LBP; half as many vectors as an LBP; compares opposite pixel pairs rather than each pixel with the center pixel; useful to reduce LBP bin counts
LBP-HF Fourier spectrum descriptor + LBP
MLBP Median LBP; uses the area median value instead of the center pixel value for comparison
MLBP Multiscale LBP; combines LBPs of multiple radii, concatenated
MB-LBP Multiscale Block LBP; compares average pixel values in small blocks
SEMB-LBP Statistically Effective MB-LBP; uses the percentage in distributions, instead of the number of 0-1 and 1-0 transitions in the LBP, and redefines the uniform patterns in the standard LBP. Used effectively in face recognition using Gentle AdaBoost [549]
VLBP Volume LBP over adjacent video frames or within a volume; histograms are concatenated together to form a longer vector
LGBP (Local Gabor Binary Pattern) 40 or so Gabor filters are computed over a feature; LBPs are extracted and concatenated to form a long feature vector that is invariant over more scales and orientations
LEP Local Edge Patterns; edge enhancement (Sobel) prior to standard LBP
EBP Elliptic Binary Pattern; standard LBP but over an elliptical area instead of circular
EQP Elliptical Quinary Patterns; LBP extended from binary (2) level resolution to quinary (5) level resolution (−2, −1, 0, 1, 2)
LTP LBP extended over a ternary range to deal with near-constant areas (−1, 0, 1)
LLBP Local Line Binary Pattern; calculates LBP over line patterns (cross shape) and then calculates a magnitude metric using the square root of the sum of squares of each X/Y dimension
TPLBP [x5] Three LBPs are calculated together: the basic LBP for the center pixel, plus two others around adjacent pixels, so the total descriptor is a set of overlapping LBPs
FPLBP [x5] Four LBPs are calculated together: the basic LBP for the center pixel, plus two others around adjacent pixels, so the total descriptor is a set of overlapping LBPs
XPLBP
*NOTE: The TPLBP and FPLBP methods can be extended to 3, 4, ... n dimensions in feature space, yielding large vectors.
TBP Ternary (3) Binary Pattern; like LBP, but uses three levels of encoding (−1, 0, 1) to deal effectively with areas of equal or near-equal intensity; uses two binary patterns (one for + and one for −) concatenated together
ETLP Elongated Ternary Local Patterns (elliptical + ternary [5] levels)
FLBP Fuzzy LBP, where each pixel contributes to more than one bin
PLBP Probabilistic LBP; computes the magnitude of the difference between each pixel and the center pixel (more compute, more storage)
SILTP Scale-invariant LBP using a three-part piecewise comparison function to support intensity scale invariance and to deal with image noise
tLBP Transition Coded LBP, where the encoding is clockwise between adjacent pixels in the LBP
dLBP Direction Coded LBP; similar to CS-LBP, but stores both maxima and comparison info (is this pixel greater than, less than, or a maximum)
CBP Centralized Binary Pattern; center pixel compared to the average of all nine kernel neighbors
SLBP Semantic LBP; done in a colorimetrically accurate space (such as CIE LAB) over uniform connected LBP circular patterns to find the principal direction and arc length, which are used to form a 2D histogram as the descriptor
FLBP Fourier spectrum of color distance from center pixel to adjacent pixels
LDP Local Derivative Patterns (higher-order derivatives); the basic LBP is the first-order directional derivative, which is combined with additional nth-order directional derivatives concatenated into a histogram; more sensitive to noise
BLBP Bayesian LBP; combination of LBP and LTP together using Bayesian methods to optimize toward a more robust pattern
FLS Filtering, Labeling, and Statistical framework for LBP comparison; translates LBPs or any type of histogram descriptor into vector space, allowing efficient comparison “A Bayesian Local Binary Pattern Texture Descriptor”
MB-LBP Multiscale Block LBP; compares average pixel values in small blocks instead of individual pixels, thus a 3x3 pixel LBP will become a 9x9 block LBP where each block is a 3x3 region. The histogram is calculated by scaling the image, creating a rendering at each scale, creating a histogram of each scaled image, and concatenating the histograms together.
PMLBP Pyramid-Based Multi-Structured LBP; uses 5 templates to extract different structural info at varying levels: 1 Gaussian filter and 4 anisotropic filters to detect gradient directions
MSLBF Multiscale Selected Local Binary Features
RILBP Rotation Invariant LBP; rotates the bins (binary LBP value) until the minimum value is achieved; the minimum value is considered rotationally invariant. This is the most widely used method for LBP rotational invariance.
ALBP Adaptive LBP for rotational invariance; instead of shifting to a minimal value as in the standard LBP method, finds the dominant vector orientation and shifts the vector to the dominant vector orientation
LBPV Local Binary Pattern Variance; uses local area variance to weight pixel contribution to the LBP, aligns features to principal orientations, determines nondominant patterns, and reduces their contribution
OCLBP Opponent Color LBP; describes color and texture together; each color channel LBP is converted, then opposing color channel LBPs are converted by using one color as the center pixel and another color as the neighborhood, so 9 total histograms are computed but only six are used: R, G, B, RG, RB, GB
SDMCLBP SDM (co-occurrence) LBP; LBP images for each color are used as the basis for generating co-occurrence matrices, and then Haralick features are extracted from the images to form a multi-dimensional feature space
MSCLBP Multi-Scale Color Local Binary Patterns (concatenates 6 histograms together); uses color space components
HUELBP OPPONENTLBP (all 3 channels), nOPPONENTLBP (computed over 2 channels); addresses light intensity change, intensity shift, intensity change + shift, color change, and color shift; defines six new operators: transformed color LBP (RGB) [subtract mean, divide by standard deviation], opponent LBP, nOpponent LBP, Hue LBP, RGB-LBP, nRGB-LBP [x8] “Multi-scale Color Local Binary Patterns for Visual Object Classes Recognition”, Chao Zhu, Charles-Edmond Bichot, Liming Chen
3D histograms 3DRGBLBP [best performance, high memory footprint]; 3D histogram computed over RGB-LBP color image space using uniform pattern minimization to yield 10 levels or patterns per color, yielding a large descriptor: 10 x 10 x 10 = 1000 descriptors.
Census
The Census transform [177] is basically an LBP, and like a population census, it uses simple greater-than and less-than queries to count and compare results. Census records pixel comparison results made between the center pixel in the kernel and the other pixels in the kernel region. It employs comparisons and possibly a threshold, and stores the results in a binary vector. The Census transform also uses a feature called the rank value scalar, which is the number of pixel values less than the center pixel. The Census descriptor thus uses both a bit vector and a rank scalar.

Spectra: Local binary + scalar ranking

Feature shape: Square

Feature pattern: Pixel region compares with center pixel

Feature density: Local 3x3 at each pixel

Search method: Sliding window

Distance function: Hamming distance

Robustness: 2 (brightness, contrast)
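The bit vector and rank scalar described above can be illustrated with a minimal sketch for a 3x3 kernel. The function names census_3x3 and hamming, the neighbor ordering, and the use of NumPy are assumptions made for illustration; the original method [177] may order the bits differently.

```python
import numpy as np

def census_3x3(image):
    """Return (bits, rank) for every interior pixel of a grayscale image.

    bits: 8-bit vector, one bit per neighbor, set when neighbor < center.
    rank: number of neighbors whose value is less than the center pixel.
    """
    img = np.asarray(image, dtype=np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    bits = np.zeros(center.shape, dtype=np.uint16)
    rank = np.zeros(center.shape, dtype=np.uint8)
    # Neighbor order is an arbitrary choice; it only has to be consistent.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for dy, dx in offsets:
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        less = neighbor < center
        bits = (bits << 1) | less.astype(np.uint16)
        rank += less.astype(np.uint8)
    return bits, rank

def hamming(a, b):
    """Hamming distance between two census bit vectors."""
    return bin(int(a) ^ int(b)).count("1")
```

Two census values are then compared with hamming(), which matches the distance function listed in the summary above.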
Modified Census Transform
As shown in Figure 6-13, the Modified Census Transform (MCT) relies on the full set of possible 3x3 binary patterns (2^{9} − 1 or 511 variations) and uses these as a kernel index into the binary patterns as the MCT output, since each binary pattern is a unique signature by itself and highly discriminative. The end result of the MCT is analogous to a nonlinear filter that assigns the output to any of the 2^{9} − 1 patterns in the kernel index. Results show that the MCT performs better than the basic CT for some types of object recognition [205].
BRIEF
As described in Chapter 4, in the section on local binary descriptor point-pair patterns, and illustrated in Figure 4-11, the BRIEF [132,133] descriptor uses a random distribution pattern of 256 point-pairs in a local 31x31 region for the binary comparison to create the descriptor. One key idea with BRIEF is to select random pairs of points within the local region for comparison.
BRIEF is a local binary descriptor and has achieved very good accuracy and performance in robotics applications [203]. BRIEF and ORB are closely related; ORB is an oriented version of BRIEF, and the ORB descriptor point-pair pattern is also built differently than BRIEF. BRIEF is known to have little tolerance for rotation.

Spectra: Local binary

Feature shape: Square centered at interest point

Feature pattern: Random local pixel point-pair compares

Feature density: Local 31x31 at interest points

Search method: Sliding window

Distance function: Hamming distance

Robustness: 2 (brightness, contrast)
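As a rough illustration of the random point-pair idea, the sketch below builds a 256-bit descriptor from a 31x31 patch and compares two descriptors by Hamming distance. The uniform sampling of point positions, the fixed seed, and the names make_pattern, brief_descriptor, and hamming_distance are assumptions for illustration; the smoothing step and the specific sampling distributions studied by the BRIEF authors are omitted.

```python
import numpy as np

PATCH = 31      # patch side length, as given in the text
N_PAIRS = 256   # number of point-pairs, as given in the text

def make_pattern(seed=0):
    """Random point-pair pattern; each row is (y1, x1, y2, x2)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, PATCH, size=(N_PAIRS, 4))

def brief_descriptor(patch, pattern):
    """One bit per pair: 1 when the first point is darker than the second."""
    patch = np.asarray(patch)
    p1 = patch[pattern[:, 0], pattern[:, 1]]
    p2 = patch[pattern[:, 2], pattern[:, 3]]
    return np.packbits(p1 < p2)          # 256 bits packed into 32 bytes

def hamming_distance(d1, d2):
    """Number of differing bits between two packed descriptors."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())
```

The same pattern must be reused for every patch, since the descriptor bits are only comparable when they come from the same point-pairs.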
ORB
ORB [134] is an acronym for Oriented FAST and Rotated BRIEF, and as the name suggests, ORB is based on BRIEF and adds rotational invariance to BRIEF. Corners are detected using FAST-9, followed by a Harris corner metric to sort the keypoints; the corner orientation is then determined from intensity centroids using Rosin’s method [61]. The FAST, Harris, and Rosin processing are done at each level of an image pyramid scaled with a factor of 1.4, rather than the common octave pyramid scale methods. ORB is discussed in some detail in Chapter 4, in the section on local binary descriptor point-pair patterns, and is illustrated in Figure 4-11.
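The intensity centroid orientation can be sketched from first-order image moments: the angle of the vector from the patch center to the intensity centroid is taken as the corner orientation. This is a simplified sketch that assumes a square patch (ORB restricts the moments to a circular region), and the name intensity_centroid_angle is hypothetical.

```python
import numpy as np

def intensity_centroid_angle(patch):
    """Angle (radians) from the patch center to its intensity centroid."""
    patch = np.asarray(patch, dtype=np.float64)
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0      # x coordinates relative to the center
    ys = ys - (h - 1) / 2.0      # y coordinates relative to the center
    m10 = (xs * patch).sum()     # first-order moment in x
    m01 = (ys * patch).sum()     # first-order moment in y
    return float(np.arctan2(m01, m10))
```

A patch that is brighter on its right side yields an angle near zero, and the point-pair pattern would then be rotated by this angle before the binary comparisons are made.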
It should be noted that ORB is a highly optimized and very well engineered descriptor, since the ORB authors were keenly interested in compute speed, memory footprint, and accuracy. Many of the descriptors surveyed in this section are primarily research projects, with less priority given to practical issues, but ORB focuses on optimization and practical issues.
Compared to BRIEF, ORB provides an improved training method for creating the local binary patterns for pairwise pixel point sampling. While BRIEF uses random point-pairs in a 31x31 window, ORB goes through a training step to find uncorrelated point-pairs in the window with high variance and means near 0.5, which is demonstrated to work better. For details on visualizing the ORB patterns, see Figure 4-11.
For correspondence search, ORB uses multi-probe locality-sensitive hashing (MPLSH), which searches for matches in neighboring buckets when a match fails, rather than renavigating the hash tree. The authors report that MPLSH requires fewer hash tables, resulting in a lower memory footprint. ORB descriptors also produce more uniform hash bucket sizes than BRIEF descriptors. Since ORB is a binary descriptor based on point-pair comparisons, Hamming distance is used for correspondence.
ORB*  SURF  SIFT 
15.3 ms  217.3 ms  5228.7 ms 

Spectra: Local binary + orientation vector

Feature shape: Square

Feature pattern: Trained local pixel point-pair compares

Feature density: Local 31x31 at interest points

Search method: Sliding window

Distance function: Hamming distance

Robustness: 3 (brightness, contrast, rotation, *limited scale)
BRISK
BRISK [131,143] is a local binary method using a circular-symmetric pattern region shape and a total of 60 point-pairs as line segments arranged in four concentric rings, as shown in Figure 4-10 and described in detail in Chapter 4. The method uses point-pairs of both short segments and long segments, and this provides a measure of scale invariance, since short segments may map better for fine resolution and long segments may map better at coarse resolution.
The main steps of the BRISK algorithm are as follows:

Detects keypoints using FAST- or AGAST-based selection in scale space.

Performs Gaussian smoothing at each pixel sample point to get the point value.

Makes three sets of pairs: long pairs, short pairs, and unused pairs (the unused pairs are not in the long pair or the short pair set; see Figure 4-12).

Computes gradient between long pairs, sums gradients to determine orientation.

Uses gradient orientation to adjust and rotate short pairs.

Creates binary descriptor from short pair pointwise comparisons.

Spectra: Local binary + orientation vector

Feature shape: Square

Feature pattern: Trained local pixel point-pair compares

Feature density: Local 31x31 at FAST interest points

Search method: Sliding window

Distance function: Hamming distance

Robustness: 4 (brightness, contrast, rotation, scale)
FREAK
FREAK [130] uses a novel foveal-inspired multiresolution pixel-pair sampling shape with trained pixel pairs to mimic the design of the human eye as a coarse-to-fine descriptor, with resolution highest in the center and decreasing further into the periphery, as shown in Figure 4-9. In the opinion of this author, FREAK demonstrates many of the better design approaches to feature description; it combines performance, accuracy, and robustness. Note that FREAK is fast to compute, has good discrimination compared to other local binary descriptors such as LBP, Census, BRISK, BRIEF, and ORB, and compares favorably with SIFT.
The FREAK feature training process involves determining the point-pairs for the binary comparisons based on the training data, as shown in Figure 4-9. The training method allows for a range of descriptor sampling patterns and shapes to be built by weighting and choosing sample points with high variance and low correlation. Each sampling point is first smoothed from the local region using variable-sized radius approximations to create Gaussian kernels over circular regions. The circular regions are designed with some overlap to adjacent regions, which improves accuracy.
The feature descriptor is thus designed as a coarse-to-fine cascade of four groups of 16-byte descriptors containing pixel-pair binary comparisons stored in a vector. The first 16 bytes, the coarsest set in the cascade, are normally sufficient to find 90 percent of the matching features and to discard nonmatching features. FREAK uses 45 point-pairs to estimate orientation, sampled from a 31x31 pixel patch region.
By storing the point-pair comparisons in four cascaded pattern vectors ordered from coarse to fine, the matching process proceeds from coarse to fine, mimicking the human visual system’s saccadic search mechanism, allowing for accelerated matching performance when there is early success or rejection in the matching phase. In summary, the FREAK approach works very well.
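The early-rejection idea behind the cascade can be sketched as follows, assuming 64-byte descriptors stored as four 16-byte groups with the coarsest group first. The thresholds first_stage_max and total_max and the name cascade_match are hypothetical values chosen for illustration, not taken from the FREAK paper.

```python
import numpy as np

def popcount(byte_array):
    """Number of set bits in a uint8 array."""
    return int(np.unpackbits(byte_array).sum())

def cascade_match(query, candidates, first_stage_max=20, total_max=80):
    """Return (index, distance) of the best candidate, or (-1, total_max + 1).

    query: (64,) uint8; candidates: (n, 64) uint8.
    Only candidates that pass the coarse 16-byte test are compared in full.
    """
    best_idx, best_dist = -1, total_max + 1
    for i, cand in enumerate(candidates):
        # Stage 1: compare only the first (coarsest) 16 bytes.
        coarse = popcount(np.bitwise_xor(query[:16], cand[:16]))
        if coarse > first_stage_max:
            continue                      # early rejection
        # Later stages: finish the Hamming distance on the remaining bytes.
        dist = coarse + popcount(np.bitwise_xor(query[16:], cand[16:]))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist
```

Most nonmatching candidates are discarded after touching only a quarter of the descriptor, which is where the accelerated matching comes from.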

Spectra: Local binary coarse-to-fine + orientation vector

Feature shape: Square

Feature pattern: 31x31 region pixel point-pair compares

Feature density: Sparse local at AGAST interest points

Search method: Sliding window over scale space

Distance function: Hamming distance

Robustness: 6 (brightness, contrast, rotation, scale, viewpoint, blur)
Spectra Descriptors
Compared to the local binary descriptor group, the spectra group of descriptors typically involves more intense computations and algorithms, often requiring floating point calculations, and may consume considerable memory. In this taxonomy and discussion, spectra refers simply to any quantity that can be measured or computed, such as light intensity, color, local area gradients, local area statistical features and moments, surface normals, and sorted data such as 2D or 3D histograms of any spectral type, for example histograms of local gradient direction. Many of the methods discussed in this section use local gradient information.
Local binary descriptors, as discussed in the previous section, are an attempt to move away from more costly spectral methods to reduce power and increase performance. Local binary descriptors in many cases offer similar accuracy and robustness to the more compute-intensive spectra methods.
SIFT
The Scale Invariant Feature Transform (SIFT) developed by Lowe [161,178] is the best-known method for finding interest points and feature descriptors, providing invariance to scale, rotation, illumination, affine distortion, perspective and similarity transforms, and noise. Lowe demonstrates that by using several SIFT descriptors together to describe an object, there is additional invariance to occlusion and clutter, since if a few descriptors are occluded, others will be found [161]. We provide some detail here on SIFT since it is well designed and well known.
SIFT is commonly used as a benchmark against which other vision methods are compared. The original SIFT research paper by author David Lowe was initially rejected several times for publication by the major computer vision journals, and as a result Lowe filed for a patent and took a different direction. According to Lowe, “By then I had decided the computer vision community was not interested, so I applied for a patent and intended to promote it just for industrial applications.”^{1} Eventually, the SIFT paper was published and went on to become the most widely cited article in computer vision history!
The descriptors are fed into a matching pipeline that computes the nearest-neighbor distance ratio between the closest match and the second-closest match. This metric considers a primary match and a secondary match together and rejects the match if the two distances are too similar, assuming that one or the other may be a false match. The local gradient magnitudes are weighted by a strength value proportional to the pyramid scale level, and then binned into the local histograms. In summary, SIFT is a very well thought out and carefully designed multiscale localized feature descriptor.
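The distance ratio test can be sketched with a brute-force search as follows. The name ratio_test_matches is hypothetical, and the default max_ratio of 0.8 is the value commonly attributed to Lowe's paper; it should be treated as a tunable assumption. A practical pipeline would replace the exhaustive search with an approximate nearest neighbor method.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, max_ratio=0.8):
    """Match rows of desc_a to rows of desc_b using the distance ratio test.

    Returns a list of (index_in_a, index_in_b) for matches whose closest
    distance is clearly smaller than the second-closest distance.
    """
    desc_a = np.asarray(desc_a, dtype=np.float64)
    desc_b = np.asarray(desc_b, dtype=np.float64)
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distances
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Ambiguous matches (best nearly as far as second best) are rejected.
        if dists[best] < max_ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```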
A variation of SIFT for color images is known as CSIFT [179].
Here is the basic SIFT descriptor processing flow (note: the matching stage is omitted since this chapter is concerned with feature descriptors and related metrics):
Create a Scale Space Pyramid
Identify Scale-Invariant Interest Points
As shown in Figure 6-16, the candidate interest points are chosen from local maxima or minima as compared between the 26 adjacent pixels in the DoG images from the three adjacent scale levels in the pyramid. In other words, the interest points are scale invariant.
The selected interest points are further qualified to achieve invariance by analyzing local contrast, local noise, and local edge presence within the local 26-pixel neighborhood. Various methods may be used beyond those in the original method, and several techniques are used together to select the best interest points, including local curvature interpolation over small regions, and balancing edge responses to include primary and secondary edges. The keypoints are localized to subpixel precision over scale and space. The complete interest points are thus invariant to scale.
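The 26-neighbor extrema test can be sketched over a stack of DoG images as shown below. The array layout (scale, row, column), the names is_extremum and find_extrema, and the contrast_threshold value are assumptions for illustration; subpixel localization and edge-response filtering are omitted.

```python
import numpy as np

def is_extremum(dog, s, y, x):
    """True if dog[s, y, x] is above or below all 26 neighbors."""
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    center = dog[s, y, x]
    neighbors = np.delete(cube.ravel(), 13)   # index 13 is the center of 27
    return bool(center > neighbors.max() or center < neighbors.min())

def find_extrema(dog, contrast_threshold=0.03):
    """Scan a (scales, H, W) DoG stack for candidate interest points."""
    points = []
    n, h, w = dog.shape
    for s in range(1, n - 1):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                # Skip low-contrast responses before the neighbor test.
                if abs(dog[s, y, x]) <= contrast_threshold:
                    continue
                if is_extremum(dog, s, y, x):
                    points.append((s, y, x))
    return points
```

The neighborhood consists of 8 pixels at the same scale plus 9 pixels in each of the scale levels above and below, which gives the 26 comparisons mentioned in the text.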
Create Feature Descriptors
A local region or patch of size 16x16 pixels surrounding each chosen interest point is the basis of the feature vector. The magnitudes and orientations of the local gradients in the 16x16 patch are calculated and stored in a HOG (Histogram of Gradients) feature vector, which is weighted with a circularly symmetric Gaussian function to downweight points farther away from the center interest point around which the HOG is calculated.
As shown in Figure 6-17, the 4x4 gradient binning method allows for gradients to move around in the descriptor and be combined together, thus contributing invariance to various geometric distortions that may change the position of local gradients, similar to the human visual system treatment of the 3D position of gradients across the retina [248]. The SIFT HOG is reasonably invariant to scale, contrast, and rotation. The histogram bins are populated with gradient information using trilinear interpolation, and normalized to provide illumination and contrast invariance.
SIFT can also be performed using a variant of the HOG descriptor called the Gradient Location and Orientation Histogram (GLOH), which uses a log polar histogram format instead of the Cartesian HOG format; see Figure 6-17. The calculations for the GLOH log polar histogram are straightforward conversions from the Cartesian coordinates used for the Cartesian HOG histogram, where the vector magnitude is the hypotenuse and the angle is the arctangent.
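The Cartesian-to-polar conversion referred to above can be written as follows, assuming (x, y) denotes the Cartesian offset of a sample from the interest point; the symbols r and θ are introduced here for illustration and may differ from the notation of the original figure.

```latex
r = \sqrt{x^{2} + y^{2}}, \qquad \theta = \arctan\!\left(\frac{y}{x}\right)
```

For a log polar layout, the radial bin would then be selected from \log r rather than from r directly.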
As shown in Figure 6-17, SIFT HOG and GLOH are essentially 3D histograms, and in this case the histogram bin values are gradient magnitude and direction. The descriptor vector size is thus 4x4x8 = 128 bytes. The 4x4 descriptor (center image) is a set of histograms of the combined eight-way gradient direction and magnitude of each 4x4 group in the left image, in Cartesian coordinates, while the GLOH gradient magnitude and direction are binned in polar coordinates into 17 bins over a greater binning region. SIFT-HOG (left image) also uses a weighting factor to smoothly reduce the contribution of gradient information in a circularly symmetric fashion with increasing distance from the center.
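To make the 4x4x8 layout concrete, here is a simplified sketch that bins the gradients of a 16x16 patch into 4x4 cells of eight orientation bins each. It is not Lowe's implementation: trilinear interpolation, rotation to the dominant orientation, and value clipping are omitted, the Gaussian sigma of 8 pixels is an assumption, and the name sift_like_descriptor is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Simplified 128-element gradient histogram for a 16x16 patch."""
    patch = np.asarray(patch, dtype=np.float64)
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch)                  # gradients along rows, columns
    mag = np.hypot(dx, dy)                       # gradient magnitude
    ang = np.arctan2(dy, dx) % (2 * np.pi)       # orientation in [0, 2*pi)
    # Circularly symmetric Gaussian weighting centered on the patch.
    ys, xs = np.mgrid[0:16, 0:16] - 7.5
    mag = mag * np.exp(-(xs ** 2 + ys ** 2) / (2 * 8.0 ** 2))
    bins = np.floor(ang / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))
    for y in range(16):
        for x in range(16):
            desc[y // 4, x // 4, bins[y, x]] += mag[y, x]
    desc = desc.ravel()                          # 4 x 4 x 8 = 128 elements
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc     # normalize for contrast
```

The sketch returns floating point values; storing the vector in 128 bytes, as described in the text, would require quantizing each element to 8 bits.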
SIFT Compute Complexity (from Vinukonda [180])
SIFT Pipeline Step  Complexity  Number of Operations 
Gaussian blurring pyramid  Θ(N^2 U^2 s)  4N^2 W^2 s 
Difference of Gaussian pyramid  Θ(sN^2)  4N^2 s 
Scale-space extrema detection  Θ(sN^2)  104sN^2 
Keypoint detection  Θ(αsN^2)  100sαN^2 
Orientation assignment  Θ(sN^2 (1 − αβ))  48sN^2 
Descriptor generation  Θ(x^2 N^2 (αβ + γ))  1520x^2 (αβ + γ)N^2 
The resulting feature vector for SIFT is 128 bytes. However, methods exist to reduce the dimensionality and vary the descriptor, which are discussed next.

Spectra: Local gradient magnitude + orientation

Feature shape: Square, with circular weighting

Feature pattern: Square with circularsymmetric weighting

Feature density: Sparse at local 16x16 DoG interest points

Search method: Sliding window over scale space

Distance function: Euclidean distance (*or Hellinger distance with RootSIFT retrofit)

Robustness: 6 (brightness, contrast, rotation, scale, affine transforms, noise)
SIFT-PCA
The SIFT-PCA method developed by Ke and Sukthankar [183] uses an alternative feature vector derived using principal component analysis (PCA), based on the normalized gradient patches rather than the weighted and smoothed histograms of gradients, as used in SIFT. In addition, SIFT-PCA reduces the dimensionality of the SIFT descriptor to a smaller set of elements. SIFT originally was reported using a 128-element vector, but using SIFT-PCA the vector is reduced to a smaller number of elements, such as 20 or 36.
1. Construct an eigenspace based on the gradients from the local 41x41 image patches, resulting in a 3042-element vector; this vector is the result of the normal SIFT pipeline.
2. Compute local image gradients for the patches.
3. Create the reduced-size feature vector from the eigenspace using PCA on the covariance matrix of each feature vector.
SIFT-PCA is shown to provide some improvements over SIFT in the area of robustness to image warping, and the smaller size of the feature vector results in faster matching speed. The authors note that while PCA in general is not optimal as applied to image patch features, the method works well for the SIFT-style gradient patches that are oriented and localized in scale space [183].
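The dimensionality reduction step can be sketched with a generic PCA projection, assuming the 3042-element gradient vectors have already been computed for a set of training patches. The names fit_pca and project are hypothetical, and the singular value decomposition is used here as a standard way to obtain the principal directions; the original authors' implementation may differ.

```python
import numpy as np

def fit_pca(training_vectors, n_components=36):
    """Learn an eigenspace from training gradient vectors (one per row)."""
    X = np.asarray(training_vectors, dtype=np.float64)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(vector, mean, basis):
    """Reduce one gradient vector to n_components elements."""
    return basis @ (np.asarray(vector, dtype=np.float64) - mean)
```

The eigenspace is built once offline, and each new patch then costs only one matrix-vector product to reduce from 3042 elements to 36.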
SIFT-GLOH
The Gradient Location and Orientation Histogram (GLOH) [144] method uses polar coordinates and radially distributed bins rather than the Cartesian coordinate style histogram binning method used by SIFT. It is reported to provide greater accuracy and robustness than SIFT and other descriptors for some ground truth datasets [144]. As shown in Figure 6-17, GLOH uses a set of 17 radially distributed bins to sum the gradient information in polar coordinates, yielding a 272-bin histogram. The center bin is not direction oriented. The size of the descriptor is reduced using PCA. GLOH has been used to retrofit SIFT.
SIFT-SIFER Retrofit
The Scale Invariant Feature Detector with Error Resilience (SIFER) [224] method provides alternatives to the standard SIFT pipeline, yielding measurable accuracy improvements reported to be as high as 20 percent for some criteria. However, the accuracy comes at a cost, since SIFER runs at about half the speed of SIFT. The major contributions of SIFER include improved scale-space treatment using a higher granularity image pyramid representation, and better scale-tuned filtering using a cosine modulated Gaussian filter.
Comparison of SIFT, SURF, and SIFER Pipelines (adapted from [224])
Pipeline attribute  SIFT  SURF  SIFER 
Scale Space Filtering  Gaussian 2nd derivative  Gaussian 2nd derivative  Cosine Modulated Gaussian 
Detector  LoG  Hessian  Wavelet Modulus Maxima 
Filter approximation level  OK accuracy  OK accuracy  Good accuracy 
Optimizations  DoG for gradient  Integral images, constant time  Convolution, constant time 
Image upsampling  2x  2x  Not used 
Subsampling  Yes  Yes  Not used 
Since the cosine modulated Gaussian (CMG) filter is slow to compute, SIFER provides a fast approximation method that provides reasonable accuracy. Special care is given to the image scale and the filter scale to increase accuracy of detection; thus the cosine is used as a bandpass filter for the Gaussian filter to match the scale as well as possible, tuning the filter in a filter bank over scale space with well-matched filters for each of the six scales per octave. The CMG provides more error resilience than the SIFT Gaussian second derivative method.
SIFT CS-LBP Retrofit
The SIFT-CS-LBP retrofit method [202,173] combines the best attributes of SIFT and the center symmetric LBP (CS-LBP) by replacing the SIFT gradient calculations with much more compute-efficient LBP operators, and by creating similar histogram-binned orientation feature vectors. LBP is computationally simpler both to create and to match than the SIFT descriptor.
SIFT and CS-LBP Retrofit Performance (as per reference [202])
Descriptor  Feature extraction  Descriptor construction  Descriptor normalization  Total time (ms) 
CS-LBP 256  0.1609  0.0961  0.007  0.264 
CS-LBP 128  0.1148  0.0749  0.0022  0.1919 
SIFT 128  0.4387  0.1654  0.0025  0.6066 
RootSIFT Retrofit

Hellinger distance: RootSIFT uses a simple optimization of the SIFT object retrieval pipeline, using Hellinger distance instead of Euclidean distance for correspondence. All other portions of the SIFT pipeline remain the same; k-means is still employed to build the feature vector set, and other approximate nearest neighbor methods may still be used as well for larger feature vector sets. The authors claim that modifying SIFT code to use the Hellinger distance instead of Euclidean distance takes only a simple set of one-line changes. Other enhancements in RootSIFT are optional, discussed next.

Feature augmentation: This method increases total recall. Developed by Turcot and Lowe [332], it is applied to the features. Feature vectors or visual words from similar views of the same object in the database are associated into a graph used for finding correspondence among similar features, instead of just relying on a single feature.

Discriminative query expansion (DQE): This method refines query expansion through training. Feature vectors within a region of proximity are associated by averaging into a new feature vector useful for re-queries into the database, using both positive and negative training data in a linear SVM; better correspondence is reported in reference [174].
By combining the three innovations described above into the SIFT pipeline, performance, accuracy, and robustness are shown to be significantly improved.
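The Hellinger distance step is commonly implemented by transforming each SIFT vector once, after which ordinary Euclidean distance on the transformed vectors corresponds to the Hellinger kernel on the originals. The sketch below assumes nonnegative SIFT descriptors stored one per row; the name root_sift and the eps guard are illustrative choices.

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """L1-normalize each descriptor, then take the element-wise square root."""
    d = np.asarray(descriptors, dtype=np.float64)
    d = d / (np.abs(d).sum(axis=1, keepdims=True) + eps)   # L1 normalization
    return np.sqrt(d)                                      # result has unit L2 norm
```

Because the transformed vectors have unit length, the squared Euclidean distance between two of them equals 2 minus twice the sum of the square roots of the element-wise products of the L1-normalized originals, so existing Euclidean matching code and k-means clustering can be reused unchanged.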
CenSurE and STAR
The Center Surround Extrema or CenSurE [185,184,145] method provides a true multiscale descriptor, creating a feature vector using full spatial resolution at all scales in the pyramid, in contrast to SIFT and SURF, which find extrema at subsampled pixels, compromising accuracy at larger scales. CenSurE is similar to SIFT and SURF, but some key differences are summarized in Table 6-5. Modifications have been made to the original CenSurE algorithm in OpenCV, where it goes by the name of the STAR descriptor.
Major Differences between CenSurE and SIFT and SURF (adapted from reference [185])
Attribute  CenSurE  SIFT  SURF 
Resolution  Every pixel  Pyramid subsampled  Pyramid subsampled 
Edge filter method  Harris  Hessian  Hessian 
Scale space extrema method  Laplace, Center Surround  Laplace, DoG  Hessian, DoB 
Rotational invariance  Approximate  Yes  No 
Spatial resolution in scale  Full  Subsampled  Subsampled 
1. Use of bilevel center-surround filters, as shown in Figure 6-19, including Difference of Boxes (DoB), Difference of Octagons (DoO), and Difference of Hexagons (DoH) filters; octagons and hexagons are more rotationally invariant than boxes. DoB is computationally simple and may be computed with integral images, in contrast to the Gaussian scale space method of SIFT. The DoO and DoH filters are also computed quickly using a modified integral image method. A circle is the desired shape, but is more computationally expensive.
2. To find the extrema, the DoB filter is computed using a seven-level scale space of filters at each pixel, using a 3x3x3 neighborhood. The scale space search is composed using center-surround Haar-like features on non-octave boundaries with filter block sizes [1,2,3,4,5,6,7] covering 2.5 octaves between [1 and 7] yielding five filters. This scale arrangement provides more discrimination than an octave scale. A threshold is applied to eliminate weak filter responses at each level, since the weak responses are likely not to be repeated at other scales.
3. Nonrectangular filter shapes, such as octagons and hexagons, are computed quickly using combinations of overlapping integral image regions; note that octagons and hexagons avoid artifacts caused by rectangular regions and increase rotational invariance; see Figure 6-19.
4. CenSurE filters are applied using a fast, modified version of the SURF method called Modified Upright SURF (MU-SURF) [188,189], discussed later with other SURF variants, which pays special attention to boundary effects of boxes in the descriptor by using an expanded set of overlapping subregions for the HAAR responses.

Spectra: Center-surround shaped bilevel filters

Feature shape: Octagons, circles, boxes, hexagons

Feature pattern: Filter shape masks, 24x24 largest region

Feature density: Sparse at local interest points

Search method: Dense sliding window over scale space

Distance function: Euclidean distance

Robustness: 5 (brightness, contrast, rotation, scale, affine transforms)
Correlation Templates
One of the most well-known and obvious methods for feature description and detection is simply to take an image of the complete feature and search for it by direct pixel comparison; this is known as correlation. Correlation involves stepping a sliding window containing a first pixel region template across a second image region template and performing a simple pixel-by-pixel region comparison using a method such as sum of absolute differences (SAD); the resulting score is the correlation.
Since image illumination may vary, the correlation template and the target image are typically first intensity normalized by subtracting the mean and dividing by the standard deviation; however, contrast leveling and LUT transforms may also be used. Correlation is commonly implemented in the spatial domain on rectangular windows, but can be used with frequency domain methods as well [4,9].
Correlation is used in video-based target tracking applications where translation as orthogonal motion from frame to frame over small adjacent regions predominates. For example, video motion encoders find the displacement of regions or blocks within the image using correlation, since usually small block motion in video is orthogonal to the Cartesian axes and maps well to simple displacements found using correlation. Correlation can provide subpixel accuracy between 1/4 and 1/20 of a pixel, depending on the images and methods used; see reference [151]. For video encoding applications, correlation allows for the motion vector displacements of corresponding blocks to be efficiently encoded and accurately computed. Correlation is amenable to fixed function hardware acceleration.
Variations on correlation include cross-correlation (sliding dot product), normalized cross-correlation (NCC), zero-mean normalized cross-correlation (ZNCC), and texture auto correlation (TAC).
In general, correlation is a good detector for orthogonal motion of a constant-sized, monospaced pattern region. It provides subpixel accuracy and has limited robustness and accuracy over illumination, but little to no robustness over rotation or scale. However, to overcome these robustness problems, it is possible to accelerate correlation over a scale space, as well as over various geometric transformations, using multiple texture samplers in a graphics processor in parallel to rapidly scale and rotate the correlation templates. Then, the correlation matching can be done either via SIMD SAD instructions or else using the fast fixed function correlators in the video encoding engines.
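A brute-force sliding window search can be sketched as follows, with ZNCC as the default score and SAD shown for comparison. The names zncc, sad, and best_match are hypothetical, and the exhaustive double loop is only meant to show the principle; production code would use frequency domain methods or hardware acceleration as noted above.

```python
import numpy as np

def zncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equal-sized regions."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def sad(a, b):
    """Sum of absolute differences; lower means more similar."""
    return float(np.abs(np.asarray(a, float) - np.asarray(b, float)).sum())

def best_match(image, template, score=zncc):
    """Slide the template over the image and return the best (row, col), score.

    The score function must return larger values for better matches;
    to use SAD, pass score=lambda a, b: -sad(a, b).
    """
    image = np.asarray(image, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    th, tw = template.shape
    best_pos, best_score = None, -np.inf
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            s = score(image[y:y + th, x:x + tw], template)
            if s > best_score:
                best_pos, best_score = (y, x), s
    return best_pos, best_score
```

Because ZNCC subtracts the mean and divides by the signal energy, a template that differs from the image region only by a gain and offset in intensity still scores close to 1, which SAD alone would not achieve.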

Spectra: Correlation

Feature shape: Square, rectangle

Feature pattern: Dense

Feature density: Variable sized kernels

Search method: Dense sliding window

Distance function: SSD typical, others possible

Robustness: 1 (illumination, subpixel accuracy)
HAAR Features
By using the average pixel value in the rectangular feature, the intent is to find a set of small patterns in adjacent areas where brighter or darker region adjacency may reveal a feature—for example, a bright cheek next to a darker eye socket. However, HAAR features have drawbacks, since rectangles by nature are not rotation invariant much beyond 15 degrees. Also, the integration of pixel values within the rectangle destroys fine detail.
Depending on the type of feature to be detected, such as eyes, a specific set of HAAR features is chosen to reveal eye/cheek details and eye/nose details. For example, HAAR patterns with two rectangles are useful for detecting edges, while patterns with three rectangles can be used for lines, and patterns with an inset rectangle or four rectangles can be used for single-object features. Note that HAAR features may be a rotated set.
Of course, the scale of the HAAR patterns is an issue, since a given HAAR feature only works with an image of appropriate scale. Image pyramids are used for HAAR feature detection, along with other techniques for stepping the search window across the image in optimal grid sizes for a given application. Another method to address feature scale is to use a wider set of scaled HAAR features to perform the pyramiding in the feature space rather than the image space. One method to address HAAR feature granularity and rectangular shape is to use overlapping HAAR features to approximate octagons and hexagons; see the CenSurE and STAR methods in Figure 6-19.
HAAR features are closely related to wavelets [227,334]. Wavelets can be considered as an extension of the earlier concept of Gabor functions [333,187]. We provide only a short discussion of wavelets and Gabor functions here; more discussion was provided in Chapter 2. Wavelets are an orthonormal set of small duration functions. Each set of wavelets is designed to meet various goals to locate short-term signal phenomena. There is no single wavelet function; rather, when designing wavelets, a mother wavelet is first designed as the basis of the wavelet family, and then daughter wavelets are derived using translation and compression of the mother wavelet into a basis set. Wavelets are used as a set of nonlinear basis functions, where each basis function can be designed as needed to optimally match a desired feature in the input function. So, unlike transforms that use a uniform set of basis functions, such as the Fourier transform, which is composed of SIN and COS functions, wavelets use a dynamic set of basis functions that are complex and nonuniform in nature. Wavelets can be used to describe very complex short-term features, and this may be an advantage in some feature detection applications.
However, compared to integral images and HAAR features, wavelets are computationally expensive, since they represent complex functions in a complex domain. HAAR 2D basis functions are commonly used owing to the simple rectangular shape and computational simplicity, especially when HAAR features are derived from integral images.

Spectra: Integral box filter

Feature shape: Square, rectangle

Feature pattern: Dense

Feature density: Variable-sized kernels

Search method: Grid search typical

Distance function: Simple difference

Robustness: 1 (illumination)
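The computational appeal of HAAR features comes from the integral image, which reduces any rectangle sum to four array lookups. The sketch below shows this for a two-rectangle (edge) feature; the names integral_image, box_sum, and haar_two_rect_horizontal are hypothetical, and the zero-padded layout of the integral image is one common convention among several.

```python
import numpy as np

def integral_image(image):
    """Integral image with an extra zero row and column at the top-left."""
    image = np.asarray(image)
    ii = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(image, axis=0), axis=1)
    return ii

def box_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle whose top-left pixel is (y, x)."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

def haar_two_rect_horizontal(ii, y, x, h, w):
    """Left half minus right half of an h x w window (w should be even)."""
    half = w // 2
    return box_sum(ii, y, x, h, half) - box_sum(ii, y, x + half, h, half)
```

The cost of box_sum does not depend on the rectangle size, which is why HAAR features of any scale can be evaluated in constant time once the integral image exists.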
Viola Jones with HAAR-Like Features
1. Integral images used to rapidly compute HAAR-like features.
2. The ADABOOST learning algorithm, used to create a strong pattern matching and classifier network by combining weak classifiers that have been “boosted” by adjusting weighting factors during the training process.
3. Combining classifiers into a detector cascade or funnel to quickly discard unwanted features at early stages in the cascade.
Since thousands of HAAR pattern matches may be found in a single image, the feature calculations must be done quickly. To perform the HAAR pattern match calculation rapidly, the entire image is first processed into an integral image. Each region of the image is searched for known HAAR features using a sliding window method stepped at some chosen interval, such as every n pixels, and the detected features are fed into a classification funnel known as a HAAR Cascade Classifier. The top of the funnel consists of feature sets which yield low false positives and false negatives, so the first-order results of the cascade contain high-probability regions of the image for further analysis. The HAAR features become more complex progressing deeper into the funnel of the cascade. With this arrangement, image regions are rejected as soon as possible if the desired HAAR features are not found, minimizing processing overhead.
A complete HAAR feature detector may combine hundreds or thousands of HAAR features together into a final classifier, where not only the feature itself may be important but also the spatial arrangements of features—for example, the distance and angular relationships between features could be used in the classifier.
SURF
The Speededup Robust Features Method (SURF) [160] operates in a scale space and uses a fast Hessian detector based on the determinant maxima points of the Hessian matrix. SURF uses a scale space over a 3x3x3 neighborhood to localize bloblike interest point features. To find feature orientation, a set of HAARlike feature responses are computed in the local region surrounding each interest point within a circular radius, computed at the matching pyramid scale for the interest point.
To create the SURF descriptor vector, a rectangular grid of 4x4 regions is established surrounding the interest point, similar to SIFT, and each region of this grid is split into 4x4 subregions. Within each subregion, the HAAR wavelet response is computed over 5x5 sample points. Each HAAR response is weighted using a circularly symmetric Gaussian weighting factor, where the weighting factor decreases with distance from the center interest point, which is similar to SIFT. Each feature vector contains four parts:
$v = \left( \sum d_x, \; \sum d_y, \; \sum |d_x|, \; \sum |d_y| \right)$
The wavelet responses $d_x$ and $d_y$ for each subregion are summed, and the absolute values of the responses, $|d_x|$ and $|d_y|$, provide the polarity of the change in intensity. The final descriptor vector is 4x4x4: 4x4 regions with four parts per region, for a total vector length of 64. Of course, other vector lengths can be devised by modifying the basic method.
As shown in Figure 6-22, the SURF gradient grid is rotated according to the dominant orientation, computed during the sliding sector window process, and then the wavelet response is computed in each square region relative to orientation for binning into the feature vector. Each of the wavelet directional sums $\sum d_x$, $\sum d_y$, $\sum |d_x|$, $\sum |d_y|$ is recorded in the feature vector.
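The assembly of the 64-element vector can be sketched as follows. This is a simplified illustration, assuming the HAAR responses have already been computed, rotated, and Gaussian weighted on a 20x20 sample grid (4x4 regions of 5x5 samples each). The function name and the final unit-length normalization are assumptions of this sketch.

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Build a 64-vector from 20x20 arrays of HAAR responses dx and dy.

    Each of the 4x4 regions covers 5x5 samples and contributes
    (sum dx, sum dy, sum |dx|, sum |dy|).
    """
    parts = []
    for by in range(4):
        for bx in range(4):
            sx = dx[5 * by:5 * by + 5, 5 * bx:5 * bx + 5]
            sy = dy[5 * by:5 * by + 5, 5 * bx:5 * bx + 5]
            parts += [sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()]
    v = np.array(parts, dtype=float)
    norm = np.linalg.norm(v)
    # Assumed step: scale to unit length to reduce sensitivity to contrast.
    return v / norm if norm > 0 else v
```

Keeping both the signed sums and the sums of absolute values lets the descriptor separate a region with a consistent intensity ramp from one with alternating texture, even when the signed sums are similar.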
The SURF and SIFT pipeline methods are generally comparable in implementation steps and final accuracy, but SURF is one order of magnitude faster to compute than SIFT, as compared in an ORB benchmarking test [134]. However, the local binary descriptors, such as ORB, are another order of magnitude faster than SURF, with comparable accuracy for many applications [134]. For more information, see the section earlier in this chapter on local binary descriptors.

Spectra: Integral box filter + orientation vector

Feature shape: HAAR rectangles

Feature pattern: Dense

Feature density: Sparse at Hessian interest points

Search method: Dense sliding window over scale space

Distance function: Mahalanobis or Euclidean

Robustness: 4 (scale, rotation, illumination, noise)
Variations on SURF
SURF Variants (as discussed in Alcantarilla et al. [188])
SURF  Circular symmetric Gaussian weighting scheme, 20x20 grid 
U-SURF [189]  Faster version of SURF; only upright features are used, with no orientation. Like M-SURF except calculated upright ("U") with no rotation of the grid; uses a 20x20 grid, no overlapping HAAR features, modified Gaussian weighting scheme, bilinear interpolation between histogram bins. 
M-SURF, MU-SURF [189]  Circular symmetric Gaussian weighting scheme computed in two steps instead of one as for normal SURF, 24x24 grid using overlapping HAAR features; rotation orientation is left out in the MU-SURF version. 
G-SURF, GU-SURF [188]  Instead of HAAR features, substitutes 2nd-order gauge derivatives in gauge coordinate space, no Gaussian weighting, 20x20 grid. Gauge derivatives are rotation and translation invariant, while the HAAR features are simple rectangles, and rectangles have poor rotational invariance, perhaps +/-15 degrees at best. 
MG-SURF [188]  Same as M-SURF, but uses gauge derivatives. 
NG-SURF [188]  N = no Gaussian weighting; same as SURF but with no Gaussian weighting applied, which allows for comparison between gauge derivative features and HAAR features. 
Histogram of Gradients (HOG) and Variants
The Histogram of Gradients (HOG) method [106] is intended for image classification, and relies on computing local region gradients over a dense grid of overlapping blocks, rather than at interest points. HOG is appropriate for some applications, such as person detection, where the feature in the image is quite large.
HOG operates on raw data; while many methods rely on Gaussian smoothing and other filtering methods to prepare the data, HOG is designed specifically to use all the raw data without introducing filtering artifacts that remove fine details. The authors show clear benefits using this approach. It is a tradeoff: filtering artifacts such as smoothing vs. image artifacts such as fine details. The HOG method yields better results on the raw data. See Figure 4-12, showing a visualization of a HOG descriptor.

Raw RGB image is used with no color correction or noise filtering; using other color spaces and color gamma adjustment provided little advantage for the added cost.

Prefers a 64x128 sliding detector window; 56x120 and 48x112 sized windows were also tested. Within this detector window, an 8x16 grid of 8x8-pixel block regions is defined for computation of gradients. Block sizes are tunable.

For each 8x8 pixel block, a total of 64 local gradient magnitudes are computed. The preferred method is simple line and column derivatives [-1,0,1] in x/y; other gradient filter methods were tried, but larger filters with or without Gaussian filtering degrade accuracy and performance. Separate gradients are calculated for each color channel.

Local gradient magnitudes are binned into a 9-bin histogram of edge orientations, reducing dimensionality from 64 to 9, using bilinear interpolation; fewer than 9 bins produces poorer accuracy, while more than 9 bins does not seem to matter. Note that either rectangular R-HOG or circular log polar C-HOG binning regions can be used.

Normalization of gradient magnitude histogram values to unit length to provide illumination invariance. Normalization is performed in groups, rather than on single histograms. Overlapping 2x2 blocks of histograms are used within the detector window; the block overlapping method reduces sharp artifacts, and the 2x2 region size seems to work best.

For the 64x128 pixel detector window method, a total of 128 8x8 pixel blocks are defined. Each 8x8 block has four cells for computing separate 9-bin histograms. The total descriptor size is then 8x16x4x9=4608.

The abrupt edges at fine scales in the raw data are required for accuracy in the gradient calculations, and postprocessing and normalizing the gradient bins later works well.

L2 style block normalization of local contrast is preferred and provides better accuracy over global normalization; note that the local region blocks are overlapped to assist in the normalization.

Dropping the L2 block normalization stage during histogram binning reduces accuracy by 27 percent.

HOG features perform much better than HAAR-style detectors, and this makes sense when we consider that a HAAR wavelet is an integrated directionless value, while gradient magnitude and direction over the local HOG region provide a richer spectrum.
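The gradient and binning steps listed above can be sketched for a single 8x8 region. This is a minimal illustration, assuming NumPy, a single channel, unsigned orientations over 0 to 180 degrees, and linear interpolation between the two nearest orientation bins; the function names are hypothetical and the code is not the reference HOG implementation.

```python
import numpy as np

def cell_histogram(cell, bins=9):
    """9-bin orientation histogram of one pixel region, weighted by magnitude."""
    cell = np.asarray(cell, dtype=float)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]   # [-1, 0, 1] along each row
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]   # [-1, 0, 1] along each column
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    width = 180.0 / bins
    pos = ang / width - 0.5                    # bin centers at (k + 0.5) * width
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    hist = np.zeros(bins)
    # Split each magnitude between the two nearest bins (wrapping at 180).
    np.add.at(hist, lo % bins, mag * (1.0 - frac))
    np.add.at(hist, (lo + 1) % bins, mag * frac)
    return hist

def l2_normalize(block_hists, eps=1e-6):
    """L2 normalization over a group of histograms, such as a 2x2 block."""
    v = np.concatenate(block_hists)
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
```

Normalizing a group of neighboring histograms together, rather than each one alone, is what gives the local contrast normalization described in the list above.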

Spectra: Local region gradient histograms

Feature shape: Rectangle or circle

Feature pattern: Dense 64x128 typical rectangle

Feature density: Dense overlapping blocks

Search method: Grid over scale space

Distance function: Euclidean

Robustness: 4 (illumination, viewpoint, scale, noise)
PHOG and Related Methods
PHOG is similar to related work using a coarse-to-fine grid of region histograms called Spatial Pyramid Matching by Lazebnik, Schmid, and Ponce [534], using histograms of oriented edges and SIFT features to provide multi-class classification. It is also similar to earlier work on pyramids of concatenated histogram features taken over a progressively finer grid, called Pyramid Match Kernel and developed by Grauman and Darrell [535], which computes correspondence using weighted, multi-resolution histogram intersection. Other related earlier work using multi-resolution histograms for texture classification is described in reference [55].

Shape features, derived from local distribution of edges based on gradient features inspired by the HOG method [106].

Spatial relationships, across the entire image by computing histogram features over a set of octave grid cells with blocks of increasingly finer size over the image.

Appearance features, using a dense set of SIFT descriptors calculated across a regularly spaced dense grid. PHOG is demonstrated to compute SIFT vectors for color images; results are provided in [191] for the HSV color space.
A set of training images is used to generate a set of PHOG descriptor variables for a class of images, such as cars or people. This training set of PHOG features is reduced using K-means clustering to a set of several hundred visual words to use for feature matching and image classification.
Some key concepts of the PHOG are illustrated in Figure 6-23. For the feature shape, the edges are computed using the Canny edge detector, and the gradient orientation is computed using the Sobel operator. The gradient orientation binning is linearly interpolated across adjacent histogram bins by gradient orientation (HOG); each bin represents the angle of the edge. A HOG vector is computed for each size of grid cell across the entire image. The final PHOG descriptor is composed of a weighted concatenation of all the individual HOG histograms from each grid level. There is no scale-space smoothing between the octave grid cell regions to reduce fine detail.
As shown in Figure 6-23, the final PHOG contains all the HOGs concatenated. Note that for the center left image, the full grid size cell produces 1 HOG; for the center right image, the half octave grid produces 4 HOGs; and for the right image, the fine grid produces 16 HOG vectors. The final PHOG is normalized to unity to reduce biasing due to concentration of edges or texture.
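The pyramid concatenation (1, 4, then 16 histograms) can be sketched as below. The sketch assumes the edge pixels have already been assigned an orientation bin index and a weight (for example, gradient magnitude at Canny edge pixels and zero elsewhere). The function name, the choice of 8 bins, and the omission of per-level weights are assumptions made for brevity.

```python
import numpy as np

def phog(bin_idx, weight, levels=3, bins=8):
    """Concatenate orientation histograms over a 1x1, 2x2, 4x4, ... grid.

    bin_idx: integer array of orientation bin indices in [0, bins).
    weight:  array of the same shape giving each pixel's contribution.
    """
    h, w = bin_idx.shape
    parts = []
    for level in range(levels):
        n = 2 ** level                      # n x n cells at this level
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                b = bin_idx[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                m = weight[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                parts.append(np.bincount(b, weights=m, minlength=bins))
    v = np.concatenate(parts)
    total = v.sum()
    return v / total if total > 0 else v    # normalize to unity
```

With three levels and 8 bins the vector has (1 + 4 + 16) x 8 = 168 elements; the coarse level captures global shape while the finer levels add spatial layout.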

Spectra: Global and regional gradient orientation histograms

Feature shape: Rectangle

Feature pattern: Dense grid of tiles

Feature density: Dense tiles

Search method: Grid regions, no searching

Distance function: L2 norm

Robustness: 3 (image classification under some invariance to illumination, viewpoint, noise)
Daisy and O-Daisy
Daisy does not need local interest points, and instead computes a descriptor densely at each pixel, since the intended application is stereo mapping and tracking. Rather than using gradient magnitude and direction calculations like SIFT and GLOH, Daisy computes a set of convolved orientation maps based on a set of oriented derivatives of Gaussian filters to create eight orientation maps spaced at equal angles.
As shown in Figure 6-24, the size of each filter region and the amount of blur in each Gaussian filter increase with distance away from the center, mimicking the human visual system by maintaining a sharpness and focus in the center of the field of view and decreasing focus and resolution farther away from the center. Like SIFT, Daisy also uses histogram binning of the local orientation to form the descriptor.
Daisy is designed with optimizations in mind. The convolution orientation map approach consumes fewer compute cycles than the gradient magnitude and direction approach of SIFT and GLOH, yet yields similar results. The Daisy method also includes optimizations for computing larger Gaussian kernels by using a sequential set of smaller kernels, and also by computing certain convolution kernels recursively. Another optimization is gained using a circular grid pattern instead of the rectangular grid used in SIFT, which allows Daisy to vary the rotation by rotating the sampling grid rather than recomputing the convolution maps.
As shown in Figure 6-24 (right image), Daisy also uses binary occlusion masks to identify portions of the descriptor pattern to use or ignore in the feature matching distance functions. This is a novel feature and provides for invariance to occlusion.
An FPGA-optimized version of Daisy, called O-Daisy [217], provides enhancements for increased rotational invariance.

Spectra: Gaussian convolution values

Feature shape: Circular

Feature pattern: Overlapping concentric circular

Feature density: Dense at each pixel

Search method: Dense sliding window

Distance function: Euclidean

Robustness: 3 (illumination, occlusion, noise)
CARD
The Compact and Real-time Descriptor (CARD) method [218] is designed with performance optimizations in mind, using learning-based sparse hashing to convert descriptors into binary codes supporting fast Hamming distance matching. A novel concept from CARD is the lookup-table descriptor extraction of histograms of oriented gradients from local pixel patches, as well as the lookup-table binning into Cartesian or log polar bins. CARD is reported to achieve significantly better rotation and scale robustness compared to SIFT and SURF, with performance at least ten times better than SIFT and slightly better than SURF.
CARD follows the method of RIFF [222][219] for feature detection, using FAST features located over octave levels in the image pyramid. The complete CARD pyramid includes intermediate levels between octaves for increased resolution. The pyramid levels are computed at fixed scale intervals, with level 0 being the full image. Keypoints are found using a Shi-Tomasi [157] optimized Harris corner detector.
As shown in Figure 6-25, to speed up binning, instead of rotating the patch based on the estimated gradient direction to extract and bin a rotationally invariant descriptor, as done in SIFT and other methods, CARD rotates the binning pattern over the patch based on the gradient direction and then performs binning, which is much faster. Figure 6-25 shows the binning pattern unrotated on the right, and rotated on the left. All binned values are concatenated and normalized to form the descriptor, which is 128 bits long in the most accurate form reported [218].

Spectra: Gradient magnitude and direction

Feature shape: Circular, variable sized based on pyramid scale and principal orientation

Feature pattern: Dense

Feature density: Sparse at FAST interest points over image pyramid

Search method: Sliding window

Distance function: Hamming

Robustness: 3 (illumination, scale, rotation)
Robust Fast Feature Matching
Robust Feature Matching in 2.3µs, developed by Taylor, Rosten, and Drummond [220] and referred to here as RFM2.3 (an acronym coined by the author), is a novel, fast method of feature description and matching, optimized for both compute speed and memory footprint. RFM2.3 stands alone among the feature descriptors surveyed here with regard to the combination of methods and optimizations employed, including sparse region histograms and binary feature codes. One of the key ideas developed in RFM2.3 is to compute a descriptor for multiple views of the same patch by creating a set of scaled, rotated, and affine warped views of the original feature, which provides invariance under affine transforms such as rotation and scaling, as well as perspective.
In addition to warping, some noise and blurring is added to the warped patch set to provide robustness to the descriptor. RFM2.3 is one of few methods in the class of deformable descriptors [344–346]. FAST keypoints in a scale space pyramid are used to locate candidate features, and the warped patch set is computed for each keypoint. After the warped patch set has been computed, FAST corners are again generated over each new patch in the set to determine which patches are most distinct and detectable, and the best patches are selected and quantized into binary feature descriptors and saved in the pattern database.
The descriptor is modeled during training as a 64-value normalized intensity distribution function, which is reduced in size to compute the final descriptor vector in two passes: first, the 64 values are reduced to a five-bin histogram of pixel intensity distribution; second, when training is complete, each histogram bin is binary encoded with a 1 bit if the bin is used, and a 0 bit if the bin is rarely used. The resulting descriptor is a compressed, binary encoded bit vector suitable for Hamming distance.

Spectra: Normalized histogram patch intensity encoded into binary patch index code

Feature shape: Rectangular, multiple viewpoints

Feature pattern: Sparse patterns in 15x15 pixel patch

Feature density: Sparse at FAST-9 interest points

Search method: Sliding window over image pyramid

Distance function: Hamming

Robustness: 4 (illumination, scale, rotation, viewpoint)
RIFF, CHOG
The Rotation Invariant Fast Features (RIFF) [222][219] method is motivated by tracking and mapping applications in mobile augmented reality. The basis of the RIFF method includes the development of a radial gradient transform (RGT), which expresses gradient orientation and magnitude in a compute-efficient and rotationally invariant fashion. Another contribution of RIFF is a tracking method, which is reported to be more accurate than KLT with 26x better performance. RIFF is reported to be 15x faster than SURF.
As shown in Figure 6-27 (right image), the basis vectors can be optimized by using gradient direction approximations in the approximated radial gradient transform (ARGT), which is easily computed using simple differences between adjacent, normalized pixels along the same gradient line, and simple 45-degree quantization. Also note in Figure 6-27 (center left image) that the histogramming is optimized by sampling every other pixel within the annuli regions, and four annuli regions are used for practical reasons as a tradeoff between discrimination and performance. To meet real-time system performance goals for quantizing the gradient histogram bins, RIFF uses a 5x5 scalar quantizer rather than a vector quantizer.
In Figure 6-27 (left image), the gradient projection of g at point c onto a radial coordinate system (r,t) is used for a rotationally invariant gradient expression, and the descriptor patch is centered at c. The center left image (Annuli) illustrates the method of binning, using four annuli rings, which reduces dimensionality, and sampling only the gray pixels provides a 2x speedup. The center and center right images illustrate the bin centering mechanism for histogram quantization: (1) the more flexible scalar quantizer SQ25 and (2) the faster vector quantizer VQ17. The right image illustrates the radial coordinate system basis vectors for gradient orientation radiating from the center outwards, showing the more compute-efficient approximated radial gradient transform (ARGT), which does not use floating point math (RGT not shown, see [222]).
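The rotational invariance of the radial gradient projection can be checked with a small sketch. This illustrates the exact (floating point) RGT idea only, not the integer ARGT approximation; the function name and argument layout are assumptions.

```python
import numpy as np

def radial_gradient(g, p, c):
    """Project gradient g at pixel p onto the radial/tangential basis about c.

    r points from the patch center c toward p; t is r rotated by 90 degrees.
    Returns the pair (g . r, g . t).
    """
    d = np.asarray(p, dtype=float) - np.asarray(c, dtype=float)
    r = d / np.linalg.norm(d)
    t = np.array([-r[1], r[0]])
    g = np.asarray(g, dtype=float)
    return np.array([g @ r, g @ t])

# Rotating the patch about its center rotates both the pixel position and
# the gradient, so the projected components stay the same.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
g, p, c = np.array([1.0, 2.0]), np.array([3.0, 1.0]), np.array([0.0, 0.0])
same = np.allclose(radial_gradient(R @ g, R @ p, c), radial_gradient(g, p, c))
```

Because the basis rotates with the patch, the histogram of projected gradients can be binned directly without first estimating and undoing a dominant orientation.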

Spectra: Local region histogram of approximated radial gradients

Feature shape: Circular

Feature pattern: Sparse every other pixel

Feature density: Sparse at FAST interest points over image pyramid

Search method: Sliding window

Distance function: Symmetric KL-divergence

Robustness: 4 (illumination, scale, rotation, viewpoint)
Chain Code Histograms
Chain code histograms are covered by U.S. Patent US4783828. CCH was invented in 1961 [206] and is also known as the Freeman chain code. A variant of the CCH is the Vertex chain code [207], which allows for descriptor size reduction and is reported to have better accuracy.
DNETS

Clique DNETS: A fully connected network of strips linking all the interest points. While the type of interest point used may vary within the method, the initial work reports results using SIFT keypoints.

Iterative DNETS: Dynamically creates the network using a subset of the interest points, increasing the connectivity using a stopping criterion to optimize the connection density for obtaining desired matching performance and accuracy.

Densely sampled DNETS: This variant does not use interest points, and instead densely samples the nets over a regularly spaced grid, a 10-pixel grid being empirically chosen and preferred, with some hysteresis or noise added to the grid positions to reduce pathological sampling artifacts. The dense method is suitable for highly parallel implementations for increased performance.
For an illustration of the three DNETS patterns and some discussion, see Figure 4-9.

Strip vector sampling, where each pixel strip vector is sampled at equally spaced locations between 10 and 80 percent of the length of the pixel strip vector; this sampling arrangement was determined empirically to ignore pixels near the endpoints.

Quantize the pixel strip vector by integrating the values into a set of uniform chunks, s , to reduce noise.

Normalize the strip vector for scaling and translation.

Discretize the vector values into a limited bit range, b .

Concatenate all uniform chunks into the d-token, which is a bit string of length s*b.
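The steps above can be sketched in a few lines. This is a simplified illustration, assuming a grayscale NumPy image, nearest-pixel sampling, and min-max normalization; the function name d_token and the parameter samples_per_chunk are hypothetical and not taken from the original method.

```python
import numpy as np

def d_token(image, p0, p1, s=8, b=2, samples_per_chunk=4):
    """Encode the pixel strip between points p0 and p1 (x, y) as a bit string.

    s: number of uniform chunks; b: bits per chunk; result length is s * b.
    """
    n = s * samples_per_chunk
    t = np.linspace(0.1, 0.8, n)            # sample 10% to 80% of the strip
    xs = np.rint(p0[0] + t * (p1[0] - p0[0])).astype(int)
    ys = np.rint(p0[1] + t * (p1[1] - p0[1])).astype(int)
    vals = image[ys, xs].astype(float)
    # Integrate samples into s chunks (mean of each chunk) to reduce noise.
    chunks = vals.reshape(s, samples_per_chunk).mean(axis=1)
    # Assumed normalization: min-max scaling to [0, 1].
    lo, hi = chunks.min(), chunks.max()
    norm = (chunks - lo) / (hi - lo) if hi > lo else np.zeros(s)
    # Discretize each chunk to b bits.
    levels = np.minimum((norm * 2 ** b).astype(int), 2 ** b - 1)
    return "".join(format(int(v), "0{}b".format(b)) for v in levels)
```

For example, with s=8 and b=2 every strip yields a 16-bit token, so tokens can be used directly as hash keys during matching.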
Descriptor matching makes use of an efficient and novel hashing and hypothesis correspondence voting method. DNETS results are reported to be higher in precision and recall than ORB or SIFT.

Spectra: Normalized, averaged linear pixel intensity chunks

Feature shape: Line segment connected networks

Feature pattern: Sparse line segments between chosen points

Feature density: Sparse along lines

Search method: Sliding window

Distance function: Hashing and voting

Robustness: 5 (illumination, scale, rotation, viewpoint, occlusion)
Local Gradient Pattern
A variation of the LBP approach, the local gradient pattern (LGP) [204] uses local region gradients instead of local image intensity pair comparison to form the binary descriptor. The 3x3 gradient of each pixel in the local region is computed, then each gradient magnitude is compared to the mean value of all the local region gradients, and the binary bit value of 1 is assigned if the value is greater, and 0 otherwise. The authors claim accuracy and discrimination improvements over the basic LBP in face-recognition algorithms, including a reduction in false positives. However, the compute requirements are greatly increased due to the local region gradient computations.
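A minimal sketch of the comparison step for one 3x3 neighborhood follows. It assumes that the gradient of each neighbor is the absolute difference between that neighbor and the center pixel; the neighbor ordering and the function name are choices made for this illustration.

```python
import numpy as np

# Neighbor offsets (dy, dx) in clockwise order starting at the top-left pixel.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lgp_code(patch):
    """8-bit local gradient pattern code for a 3x3 patch."""
    patch = np.asarray(patch, dtype=float)
    center = patch[1, 1]
    # Assumed gradient definition: absolute difference to the center pixel.
    grads = np.array([abs(patch[1 + dy, 1 + dx] - center) for dy, dx in NEIGHBORS])
    mean = grads.mean()
    # Bit i is 1 when the i-th gradient exceeds the mean gradient.
    return sum(int(gr > mean) << i for i, gr in enumerate(grads))
```

Because the threshold is the mean of the local gradients rather than the center intensity, the code adapts to local contrast, at the cost of the extra gradient computations noted above.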

Spectra: Local region gradient comparisons between center pixel and local region gradients

Feature shape: Square

Feature pattern: Every pixel 3x3 kernel region

Feature density: Dense in 3x3 region

Search method: Sliding window

Distance function: Hamming

Robustness: 3 (illumination, scale, rotation)
Local Phase Quantization
The local phase quantization (LPQ) descriptor [166–168] was designed to be robust to image blur, and it leverages the blur insensitive property of Fourier phase information. Since the Fourier transform is required to compute phase, there is some compute overhead; however, integer DFT methods can be used for acceleration. LPQ is reported to provide robustness for uniform blur, as well as uniform illumination changes. LPQ is reported to provide equal or slightly better accuracy on nonblurred images than LBP and Gabor filter bank methods. While mainly used for texture description, LPQ can also be used for local feature description to add blur invariance by combining LPQ with another descriptor method such as SIFT.
To compute the descriptor, first a DFT is computed at each pixel over small regions of the image, such as 8x8 blocks. The four lowest frequency components from the phase spectrum are used in the descriptor. The authors note that the kernel size affects the blur invariance, so a larger kernel block may provide more invariance at the price of increased compute overhead.
Before quantization, the coefficients are decorrelated using a whitening transform, resulting in a uniform phase shift and 8-degree rotation, which preserves blur invariance. Decorrelating the coefficients helps to create samples that are statistically independent for better quantization.
For each pixel, the resulting vectors are quantized into an 8-dimensional space, using an 8-bit binary encoded bit vector like the LBP and a simple scalar quantizer to yield 1 and 0 values. Binning into the feature vector is performed using 256 hypercubes derived from the 8-dimensional space. The resulting feature vector is a 256-dimensional histogram of 8-bit codes.
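The quantization and binning can be sketched as follows. This simplified illustration omits the whitening (decorrelation) step, uses an odd window size so that each window is centered on a pixel, and picks four low frequencies as an assumption; the function name is hypothetical.

```python
import numpy as np

def lpq_histogram(img, m=7):
    """256-bin histogram of 8-bit codes from the signs of four local DFT terms."""
    img = np.asarray(img, dtype=float)
    r = m // 2
    coords = np.arange(-r, r + 1)
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    a = 1.0 / m
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]   # assumed low frequencies
    kernels = [np.exp(-2j * np.pi * (u * xx + v * yy)) for u, v in freqs]
    hist = np.zeros(256, dtype=int)
    rows, cols = img.shape
    for y in range(r, rows - r):
        for x in range(r, cols - r):
            win = img[y - r:y + r + 1, x - r:x + r + 1]
            code = 0
            for k, ker in enumerate(kernels):
                coeff = np.sum(win * ker)
                # Scalar quantizer: one bit each for the real and imaginary sign.
                code |= int(coeff.real >= 0) << (2 * k)
                code |= int(coeff.imag >= 0) << (2 * k + 1)
            hist[code] += 1
    return hist
```

Since only the signs of the coefficients are kept, multiplying the image by a positive constant leaves the histogram unchanged, which matches the reported robustness to uniform illumination changes.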

Spectra: Local region whitened phase using DFT -> an 8-bit binary code

Feature shape: Square

Feature pattern: 8x8 kernel region

Feature density: Dense every pixel

Search method: Sliding window

Distance function: Hamming

Robustness: 3 (contrast, brightness, blur)
Basis Space Descriptors
This section covers the use of basis spaces to describe image features for computer vision applications. A basis space is composed of a set of functions, the basis functions, which are composed together as a set, such as a series like the Fourier series (discussed in Chapter 3). A complex signal can be decomposed into a chosen basis space as a descriptor.
Basis functions can be designed and used to describe, reconstruct, or synthesize a signal. They require a forward transform to project values into the basis set, and an inverse transform to move data back to the original values. A simple example is transforming numbers between the base 2 number system and the base 10 number system; each basis has its advantages.
Sometimes it is useful to transform a dataset from one basis space to another to gain insight into the data, or to process and filter the data. For example, images captured in the time domain as sets of pixels in a Cartesian coordinate system can be transformed into other basis spaces, such as the Fourier basis space in the frequency domain, for processing and statistical analysis. A good basis space for computer vision applications will provide forward and inverse transforms. Again, the Fourier transform meets these criteria, as well as several other basis spaces.
Basis spaces are similar to coordinate systems, since both have invertible transforms to related spaces. In some cases, simply transforming a feature spectrum into another coordinate system makes analysis and representation simpler and more efficient. (Chapter 4 discusses coordinate systems used for feature representation.) Several of the descriptors surveyed in this chapter use non-Cartesian coordinate systems, including GLOH, which uses polar coordinate binning, and RIFF, which uses radial coordinate descriptors.
Fourier Descriptors
Fourier descriptors [227] represent feature data as sine and cosine terms, which can be observed in a Fourier Power Spectrum. The Fourier series, Fourier transform, and Fast Fourier transform are used for a wide range of signal analysis, including 1D, 2D, and 3D problems. No discussion of image processing or computer vision is complete without Fourier methods, so we will explore Fourier methods here with applications to feature description.
Instead of developing the mathematics and theory behind the Fourier series and Fourier transform, which has been done very well in the standard text by Bracewell [227], we discuss applications of the Fourier Power Spectrum to feature description and provide minimal treatment of the fundamentals here to frame the discussion; see also Chapter 3. The basic idea behind the Fourier series is to define a series of sine and cosine basis functions in terms of magnitude and phase, which can be summed to approximate any complex periodic signal. Conversely, the Fourier transform is used to decompose a complex periodic signal into the Fourier series set of sine and cosine basis terms. The Fourier series components of a signal, such as a line or 2D image area, are used as a Fourier descriptor of the region.

Fourier Spectrum of LBP Histograms. As shown in Figure 3-10, an LBP histogram set can be represented as a Fourier Spectrum magnitude, which makes the histogram descriptor invariant to rotation.

Fourier Descriptor of Shape Perimeter. As shown in Figure 6-29, the shape of a polygon object can be described by Fourier methods using an array of perimeter to centroid line segments taken at intervals, such as 10 degrees. The array is fed into an FFT to produce a shape descriptor, which is scale and rotation invariant.

Fourier Descriptor of Gradient Histograms. Many descriptors use gradients to represent features, and use gradient magnitude or direction histograms to bin the results. Fourier Spectrum magnitudes may be used to create a descriptor from gradient information to add invariance.

Fourier Spectrum of Radial Line Samples. As used in the RFAN descriptor [136], radial line samples of pixel values from local regions can be represented as a Fourier descriptor of Fourier magnitudes.

Fourier Spectrum Phase. The LPQ descriptor, described in this chapter, makes use of the Fourier Spectrum phase information in the descriptor, and the LPQ is reported to be insensitive to blur owing to the phase information.
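As an illustration of the shape perimeter case, the sketch below turns an array of centroid-to-perimeter distances, sampled every 10 degrees, into a descriptor. The function name and the choice to divide by the DC term for scale invariance are assumptions of this sketch. Rotation invariance is exact here only for rotations that are a multiple of the sampling interval, since those circularly shift the array.

```python
import numpy as np

def fourier_shape_descriptor(signature, n_coeffs=8):
    """Magnitudes of the first FFT harmonics, divided by the DC term.

    signature: centroid-to-perimeter distances at equal angular steps.
    Dropping phase removes the effect of a circular shift (rotation);
    dividing by the DC magnitude removes the effect of uniform scaling.
    """
    mag = np.abs(np.fft.fft(np.asarray(signature, dtype=float)))
    return mag[1:n_coeffs + 1] / mag[0]

# Synthetic two-lobed shape sampled every 10 degrees.
theta = np.deg2rad(np.arange(0, 360, 10))
sig = 5.0 + np.cos(2 * theta)
desc = fourier_shape_descriptor(sig)
rotated = fourier_shape_descriptor(np.roll(sig, 7))   # rotated by 70 degrees
scaled = fourier_shape_descriptor(3.0 * sig)          # three times larger
```

In this example the second harmonic dominates the descriptor, reflecting the two lobes of the shape, and the rotated and scaled versions give the same values.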
Other Basis Functions for Descriptor Building
Besides the Fourier basis series, other function series and basis sets are used for descriptor building, pattern recognition, and image coding. However, such methods are usually applied over a global or regional area. See Chapter 3 for details on several other methods.
Sparse Coding Methods
In this discussion on basis space descriptors, we briefly discuss sparse coding methods, since they are analogous to a basis space. Many approaches are taken to sparse coding [530–533], using subtle differences in terminology, including visual vocabularies and bag of words methods [537]. However, sparse coding methods use a reduced set of learned feature descriptors or codes instead of basis functions. The key idea is to build a sparse codebook of basis features from the training images, and match against the sparse codebook. The sparse codes may be simple image patches or other descriptors.
A range of machine learning methods (outside the scope of this book; see [546] by Prince for more on machine learning) are used for finding the optimal sparse feature set. In addition, each sparse coding method may prefer a particular style of classification and matching. Sparse codes are associated as subsets or signatures to identify objects. Any of the local feature descriptor methods discussed in this chapter may be used as the basis for a sparse codebook. Sparse coding and related methods are discussed in more detail in Chapter 4. See the work by Aharon, Elad, and Bruckstein [536] for more details on sparse coding, as well as Fei-Fei, Fergus, and Torralba [537].
Examples of Sparse Coding Methods
Polygon Shape Descriptors
Polygon shape methods are commonly used in medical and industrial applications, such as automated microscopy for cell biology, and also for industrial inspection; see Figure 6-31. Commercial software libraries are available for polygon shape description, commonly referred to as particle analysis or blob analysis. See Appendix C.
MSER Method
The Maximally Stable Extremal Regions (MSER) method [194] is usually discussed in the literature as an interest region detector, and in fact it is. However, we include MSER in the shape descriptor section because MSER regions can be much larger than the features found by other interest point methods, such as HARRIS or FAST.
The MSER detector was developed for solving disparity correspondence in a wide baseline stereo system. Stereo systems create a warped and complex geometric depth field, and depending on the baseline between cameras and the distance of the subject to the camera, various geometric effects must be compensated for. In a wide baseline stereo system, features nearer the camera are more distorted under affine transforms, making it harder to find exact matches between the left/right image pair. The MSER approach attempts to overcome this problem by matching on blob-like features. MSER regions are similar to morphological blobs and are fairly robust to skewing and lighting. MSER is essentially an efficient variant of the watershed algorithm, except that the goal of MSER is to find a range of thresholds that leave the watershed basin unchanged in size.
The MSER method involves sorting pixels into a set of regions based on binary intensity thresholding; regions with similar pixel value over a range of threshold values in a connected component pattern are considered maximally stable. To compute an MSER, pixels are sorted in a binary intensity thresholding loop, which sweeps the intensity value from min to max. First, the binary threshold is set to a low value such as zero on a single image channel (luminance, for example). Pixels below the threshold value are black, and pixels at or above it are white. At each threshold level, a list of connected components or pixels is kept. The intensity threshold value is incremented from 0 to the max pixel value. Regions that do not grow, shrink, or change as the intensity varies are considered maximally stable, and the MSER descriptor records the position of the maximal regions and the corresponding thresholds.
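The threshold sweep can be illustrated with a deliberately simplified sketch that follows a single dark region grown from a known seed pixel and reports the longest run of thresholds over which its area does not change. A full MSER implementation tracks all connected components at once and uses a relative area-change criterion; the seed-based formulation and the function names here are assumptions made to keep the example short.

```python
from collections import deque
import numpy as np

def region_area(img, seed, t):
    """Area of the 4-connected region of pixels < t that contains seed."""
    img = np.asarray(img, dtype=np.int64)
    rows, cols = img.shape
    if img[seed] >= t:
        return 0
    seen = np.zeros((rows, cols), dtype=bool)
    seen[seed] = True
    queue = deque([seed])
    area = 0
    while queue:
        y, x = queue.popleft()
        area += 1
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols and not seen[ny, nx] and img[ny, nx] < t:
                seen[ny, nx] = True
                queue.append((ny, nx))
    return area

def most_stable_range(img, seed, max_val=255):
    """Return (first_threshold, last_threshold, area) of the longest stable run."""
    areas = [region_area(img, seed, t) for t in range(max_val + 2)]
    best = (0, 0, 0)                       # (run length, start threshold, area)
    start = 0
    for t in range(1, len(areas) + 1):
        if t == len(areas) or areas[t] != areas[start]:
            if areas[start] > 0 and t - start > best[0]:
                best = (t - start, start, areas[start])
            start = t
    length, first, area = best
    return first, first + length - 1, area
```

For a dark 3x3 square of value 20 on a background of value 200, the region seeded inside the square keeps an area of 9 for every threshold from 21 through 200, which is the kind of plateau that marks a maximally stable region.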

Multi-scale features and multi-scale detection. Since the MSER features do not require any image smoothing or scale space, both coarse features and fine-edge features can be detected.

Variable-size features computed globally across an entire region, not limited to patch size or search window size.

Affine transform invariance, which is a specific goal.

General invariance to shape change, and stability of detection, since the extremal regions tend to be detected across a wide range of image transformations.
The MSER can also be considered as the basis for a shape descriptor, and as an alternative to morphological methods of segmentation. Each MSER region can be analyzed and described using shape metrics, as discussed later in this chapter.
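The threshold sweep described above for MSER can be illustrated with a minimal sketch. It is not the efficient MSER algorithm: it simply re-floods one region at every threshold and looks for the threshold at which the region area changes least. The function names, the 4-connectivity, the "pixel <= t" region convention, the seed pixel, and the stability measure (area change over a window of plus or minus delta, divided by the area) are all assumptions made for this example.

```python
from collections import deque

def region_area(img, seed, t):
    """Area of the 4-connected region of pixels <= t containing seed (0 if seed > t)."""
    h, w = len(img), len(img[0])
    sy, sx = seed
    if img[sy][sx] > t:
        return 0
    seen = {(sy, sx)}
    queue = deque([(sy, sx)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and (ny, nx) not in seen and img[ny][nx] <= t):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return len(seen)

def most_stable_threshold(img, seed, delta=2, max_val=255):
    """Sweep the threshold and return (threshold, stability, areas).

    Stability is (area[t + delta] - area[t - delta]) / area[t];
    smaller values mean the region changes less around threshold t.
    """
    areas = [region_area(img, seed, t) for t in range(max_val + 1)]
    best_t, best_q = None, float("inf")
    for t in range(delta, max_val + 1 - delta):
        if areas[t] == 0:
            continue  # region does not exist yet at this threshold
        q = (areas[t + delta] - areas[t - delta]) / areas[t]
        if q < best_q:
            best_t, best_q = t, q
    return best_t, best_q, areas
```

For a dark blob on a bright background, the region area stays constant over the whole range of thresholds between the blob intensity and the background intensity, which is exactly the behavior the detector looks for.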
Object Shape Metrics for Blobs and Polygons
Object shape metrics are powerful and yield many degrees of freedom with respect to invariance and robustness. Object shape metrics are not like local feature metrics, since object shape metrics can describe much larger features. This is advantageous for tracking from frame to frame. For example, a large object described by just a few simple object shape metrics such as area, perimeter, and centroid can be tracked from frame to frame under a wide range of conditions and invariance. For more information, see references [128,129] for a survey of 2D shape description methods.

Object shape moments and metrics: the focus of this section.

Image moments: see Chapter 3 under “Image Moments.”

Fourier descriptors: discussed in this chapter and Chapter 3.

Shape Context feature descriptor: discussed in this section.

Chain code descriptor for perimeter description: discussed in this section.
Object shape is closely related to the field of morphology, and computer methods for morphological processing are discussed in detail in Chapter 2. Also see the discussion about morphological interest points earlier in this chapter.
In many areas of computer vision research, local features seem to be favored over object shape-based features. The lack of popularity of shape analysis methods may be a reaction to the effort involved in creating preprocessing pipelines of filtering, morphology, and segmentation to prepare the image for shape analysis. If the image is not preprocessed and prepared correctly, shape analysis is not possible. (See Chapter 8 for a discussion of a hypothetical shape analysis preprocessing pipeline.)
Various Common Object Shape and Blob Object Metrics
Object Binary Shape Metrics  Description 

Perimeter  Length of all points around the edge of the object, summing diagonal steps (length ≈ 1.4) and adjacent steps (length = 1)
Area  Total area of object in pixels 
Convex hull  Polygon shape or set of line segments enclosing all perimeter points 
Centroid  Center of object mass, average value of all pixel coordinates or average value of all perimeter coordinates 
Fourier descriptor  Fourier spectrum of a 1D signal built from the lengths of radial line segments running from the centroid to the perimeter at regular angles; the 1D signal is fed into a 1D FFT, and the FFT magnitudes at a chosen set of octave frequencies are used as the metric
Major/minor axis  Longest and shortest line segments passing through centroid contained within and touching the perimeter 
Feret  Largest caliper diameter of object 
Breadth  Shortest caliper diameter 
Aspect ratio  Feret / Breadth 
Circularity  (4 X Pi X Area) / Perimeter^2
Roundness  (4 X Area) / (Pi X Feret^2) (can also be calculated from the Fourier descriptors)
Area equivalent diameter  sqrt((4 / Pi) X Area)
Perimeter equivalent diameter  Perimeter / Pi
Equivalent ellipse  (Pi X Feret X Breadth) / 4
Compactness  sqrt((4 / Pi) X Area) / Feret
Solidity  Area / Convex_Area
Concavity  Convex_Area - Area
Convexity  Convex_Hull / Perimeter
Shape  Perimeter^2 / Area
Modification ratio  (2 X MinR) / Feret 
Shape matrix  A 2D matrix representation or plot of a polygon shape (may use Cartesian or polar coordinates; see Figure 6-32)
Grayscale Object Shape Metrics  
SDM plots  *See Chapter 3, “Texture Metrics” section. 
Scatter plots  *See Chapter 3, “Texture Metrics” section. 
Statistical moments of gray scale pixel values  Minimum, maximum, median, average, average deviation, standard deviation, variance, skewness, kurtosis, entropy
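Several of the binary shape formulas in the table translate directly into code. The sketch below assumes the basic measurements (area, perimeter, Feret and breadth caliper diameters, and convex hull area) have already been obtained elsewhere; the function name and the dictionary keys are hypothetical and chosen only for this example.

```python
import math

def shape_ratios(area, perimeter, feret, breadth, convex_area):
    """Derived shape metrics computed from basic measurements (all in pixel units)."""
    area_equiv_diameter = math.sqrt((4.0 / math.pi) * area)
    return {
        "aspect_ratio": feret / breadth,
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
        "roundness": 4.0 * area / (math.pi * feret ** 2),
        "area_equivalent_diameter": area_equiv_diameter,
        "compactness": area_equiv_diameter / feret,
        "solidity": area / convex_area,
        "shape": perimeter ** 2 / area,
    }
```

For an ideal circle, circularity, roundness, compactness, and solidity all evaluate to 1, which makes these ratios convenient as scale-independent measures of how far a blob departs from a disc.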
Shape is considered to be binary; however, shape can be computed around intensity channel objects as well, using gray scale morphology. Perimeter is considered as a set of connected components. The shape is defined by a single pixel wide perimeter at a binary threshold or within an intensity band, and pixels are either on, inside, or outside of the perimeter. The perimeter edge may be computed by scanning the image, pixel by pixel, and examining the adjacent touching pixel neighbors for connectivity. Or, the perimeter may be computed from the shape matrix [335] or chain code discussed earlier in this chapter. Perimeter length is computed for each segment (pixel), where segment length = 1 for horizontal and vertical neighbors, and sqrt(2) (approximately 1.4) for diagonal neighbors.
The perimeter may be used as a mask, and gray scale or color channel statistical metrics may be computed within the region. The object area is the count of all the pixels inside the perimeter. The centroid may be computed either from the average of all (x,y) coordinates of all points contained within the perimeter area, or from the average of all perimeter (x,y) coordinates.
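The perimeter length, area, and centroid computations described above can be sketched as follows. The sketch assumes the perimeter has already been traced into an ordered, closed, 8-connected list of (x, y) pixel coordinates and that the filled object is available as a binary mask (a list of rows); both representations and all function names are assumptions for illustration.

```python
import math

def perimeter_length(contour):
    """Length of a closed 8-connected contour: 1 per axial step, sqrt(2) per diagonal step."""
    total = 0.0
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the loop
        dx, dy = abs(x1 - x0), abs(y1 - y0)
        if max(dx, dy) != 1:
            raise ValueError("contour points must be 8-connected neighbors")
        total += math.sqrt(2) if (dx == 1 and dy == 1) else 1.0
    return total

def area_and_centroid(mask):
    """Pixel count and mean (x, y) of all nonzero pixels in a binary mask."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    area = len(pts)
    cx = sum(x for x, _ in pts) / area
    cy = sum(y for _, y in pts) / area
    return area, (cx, cy)

def perimeter_centroid(contour):
    """Alternative centroid: mean of the perimeter coordinates only."""
    n = len(contour)
    return (sum(x for x, _ in contour) / n, sum(y for _, y in contour) / n)
```

The two centroid definitions agree for symmetric shapes but can differ for objects with a ragged or unevenly sampled edge, so a given application should use one of them consistently.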
Shape metrics are powerful. For example, they may be used to remove or exclude objects from a scene prior to measurement: objects can be removed when the area is smaller than a given size, or when the centroid coordinates are outside a given range.
As shown in Figure 6-29 and Figure 2-18, the Fourier descriptor provides a rotation and scale invariant shape metric, with some occlusion invariance also. The method for determining the Fourier descriptor is to take a set of radius measurements at equally spaced angles, such as every 10 degrees, from the centroid out to points on the perimeter, and then to assemble the radius measurements into a 1D array that is run through a 1D FFT to yield the Fourier moments of the object. Alternatively, the radial pixel spokes themselves can be used as a descriptor.
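A minimal sketch of the Fourier descriptor computation follows. It assumes the radial distances from the centroid to the perimeter have already been sampled at equal angular steps (36 values for 10-degree steps), uses a plain DFT in place of an FFT for brevity, and normalizes by the DC term to obtain scale invariance; the function name, the number of retained coefficients, and the normalization choice are assumptions of this example.

```python
import cmath
import math

def fourier_descriptor(radii, n_coeffs=8):
    """Magnitudes of DFT coefficients 1..n_coeffs of the radial signature, divided by DC.

    Using magnitudes discards phase, so a rotation of the object (a circular
    shift of the radii) leaves the result unchanged; dividing by the DC
    magnitude removes overall scale.
    """
    n = len(radii)
    mags = []
    for k in range(n_coeffs + 1):
        coeff = sum(r * cmath.exp(-2j * math.pi * k * i / n)
                    for i, r in enumerate(radii))
        mags.append(abs(coeff))
    dc = mags[0]
    return [m / dc for m in mags[1:]]
```

A perfect circle yields a descriptor of all zeros, while a shape with three evenly spaced lobes concentrates its energy in the third coefficient.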
Other examples of useful shape metrics, shown in Figure 6-29, include the bounding box with major and minor axis, which has longest and shortest diameter segments passing through the centroid to the perimeter; this can be used to determine rotational orientation of an object.
The SNAKES method [540] uses a spline model to fit a collection of interest points, such as selected perimeter points, into a region contour. The interest points are the spline points. The SNAKE can be used to track contoured features from frame to frame, deforming around the interest point locations.
In general, the 2D object shape methods can be extended to 3D data; however, we do not explore 3D object shape metrics here. See references [200,201] for a survey of 3D shape descriptors.
Shape Context
The shape context method, developed by Belongie, Malik, and Puzicha [239–241], describes local feature shape using a reference point on the perimeter as the Cartesian axis origin, and binning selected perimeter point coordinates relative to the reference point origin. The relative coordinates of each point are binned into a log-polar histogram. Shape context is related to the earlier shape matrix descriptor [335] developed in 1985, shown in Figure 6-32, which also describes the perimeter of an object using log-polar coordinates. The shape context method provides for variations, described in several papers by the authors [239–241]. Here, we look at a few key concepts.
To begin, the perimeter edge of the object is sparsely sampled at uniform intervals, typically keeping about 100 edge sample points for coarse binning. Sparse perimeter edge points are typically distinct from interest points, and found using perimeter tracing. Next, a reference point is chosen on the perimeter of the object as the origin of a Cartesian space, and the vector angle and magnitude (r, θ) from the origin point to each other perimeter point are computed. The magnitude or distance is normalized to fit the histogram. Each sparse perimeter edge point is used to compute a tangent with the origin. Finally, each normalized vector is binned using (log r, θ) into a log-polar histogram, which is called the shape context.
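The binning step can be sketched as below for a single reference point. The sketch assumes the sparse perimeter points are already available as (x, y) tuples, normalizes distances by the mean distance to the reference point, and clamps them into a fixed radial range before taking the logarithm. The bin counts (5 radial by 12 angular) and the radial limits are illustrative assumptions, and tangent computation is omitted.

```python
import math

def shape_context(points, ref, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of all points relative to points[ref].

    Returns hist[r_bin][theta_bin], where r_bin indexes log-spaced distance
    rings and theta_bin indexes equal angular sectors.
    """
    ox, oy = points[ref]
    offsets = [(x - ox, y - oy) for i, (x, y) in enumerate(points) if i != ref]
    dists = [math.hypot(dx, dy) for dx, dy in offsets]
    mean_d = sum(dists) / len(dists)  # normalization for scale invariance
    log_min, log_max = math.log(r_min), math.log(r_max)
    hist = [[0] * n_theta for _ in range(n_r)]
    for (dx, dy), d in zip(offsets, dists):
        r = min(max(d / mean_d, r_min), r_max)  # clamp into the binned range
        r_bin = min(int((math.log(r) - log_min) / (log_max - log_min) * n_r),
                    n_r - 1)
        theta = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
        hist[r_bin][t_bin] += 1
    return hist
```

Repeating the computation with each sampled perimeter point as the reference yields one histogram per point, and the set of histograms is what is compared during matching.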
An alignment transform is generated between descriptor pairs during matching, which yields the difference between targets and chosen patterns, and could be used for reconstruction. The alignment transform can be chosen as desired from affine, Euclidean, spline-based, and other methods. Correspondence uses the Hungarian method, which includes histogram similarity, and is weighted by the alignment transform strength using the tangent angle dissimilarity. Matching may also employ a local appearance similarity measure, such as normalized correlation between patches or color histograms.
3D, 4D, Volumetric, and Multimodal Descriptors
With the advent of more and more 3D sensors, such as stereo cameras and other depth-sensing methods, as well as the ubiquitous accelerometers and other sensors built into inexpensive mobile devices, the realm of 3D feature description and multimodal feature description is beginning to blossom.
Many 3D descriptors are associated with robotics research and 3D localization. Since the field of 3D feature description is early in the development cycle, it is not yet clear which methods will be widely adopted, so we present only a small sampling of 3D descriptor methods here. These include 3D HOG [196], 3D SIFT [195], and HON 4D [198], which are based on familiar 2D methods. We refer the interested reader to references [200,201,216] for a survey of 3D shape descriptors. Several interesting 3D descriptor metrics are available as open source in the Point Cloud Library,^{2} including Radius-Based Surface Descriptors (RSD) [539], Principal Curvature Descriptors (PCD), Signature of Histograms of Orientations (SHOT) [541], Viewpoint Feature Histogram (VFH) [398], and Spin Images [538].
Key applications driving the research into 3D descriptors include robotics and activity recognition, where features are tracked frame to frame as they morph and deform. The goals are to localize position and recognize human actions, such as walking, waving a hand, turning around, or jumping. See also the LBP variants for 3D, VLBP and LBP-TOP, which are surveyed earlier in this chapter and illustrated in Figure 6-12, and which are also used for activity recognition. Since the 2D features are moving during activity recognition, time is the third dimension incorporated into the descriptors. We survey some notable 3D activity-recognition research here.
One of the key concepts in the action-recognition work is to extend familiar 2D features into a 3D space that is spatiotemporal, where 2D x,y video image sequences over time t are stacked into a volumetric representation of the form v(x,y,t). In addition, the 3D surface normal, 3D gradient magnitude, and 3D gradient direction are used in many of the action-recognition descriptor methods.
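As a concrete illustration of the volumetric form v(x,y,t), the sketch below stores a video as nested lists indexed v[t][y][x] and estimates the 3D gradient at an interior voxel. The storage layout, the use of central differences, and the function names are assumptions made for this example rather than details of any specific method surveyed here.

```python
import math

def gradient_3d(v, x, y, t):
    """Central-difference gradient (gx, gy, gt) of the volume v[t][y][x] at an interior voxel."""
    gx = (v[t][y][x + 1] - v[t][y][x - 1]) / 2.0
    gy = (v[t][y + 1][x] - v[t][y - 1][x]) / 2.0
    gt = (v[t + 1][y][x] - v[t - 1][y][x]) / 2.0
    return gx, gy, gt

def magnitude_and_direction(g):
    """Gradient magnitude and unit direction vector (zero vector if the gradient vanishes)."""
    m = math.sqrt(sum(c * c for c in g))
    if m == 0:
        return 0.0, (0.0, 0.0, 0.0)
    return m, tuple(c / m for c in g)
```

The temporal component gt responds to intensity change between frames, which is what lets a spatiotemporal descriptor encode motion as well as appearance.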
3D HOG
The 3D HOG [196] is partially based on some earlier work in volumetric features [199]. The general idea is to employ the familiar HOG descriptor [106] in a 3D HOG descriptor formulation, using a stack of sequential 2D video frames or slices as a 3D volume, and to compute spatiotemporal gradient orientation on adjacent frames within the volume. For efficiency, a novel integral video approach is developed as an alternative to image pyramids, based on the same line of thinking as the integral image approach used in the Viola-Jones method.
A similar approach using the integral video concept was also developed in [199] using a subsampled space of 64x64 over 4 to 40 video frames in the volume, using pixel intensity instead of the gradient direction. The integral video method, which can also be considered an integral volume method, allows for arbitrary cuboid regions from stacked sequential video frames to be integrated together to compute the local gradient orientation over arbitrary scales. This is space efficient and time efficient compared to using precomputed image pyramids. In fact, this integral video integration method is a novel contribution of the work, and may be applied to other spectra such as intensity, color, and gradient magnitude in either 2D or 3D to eliminate the need for image pyramids—providing more choices in terms of image scale besides just octaves.
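The integral video idea can be sketched as a 3D extension of the integral image: a cumulative sum over x, y, and t is built once, after which the sum over any cuboid needs only eight lookups. The sketch below assumes the volume is stored as nested lists v[t][y][x] holding whatever quantity is being integrated (intensity or a gradient component), and uses half-open cuboid bounds; the function names are hypothetical.

```python
def integral_volume(v):
    """Build ii with a zero border so that ii[t][y][x] = sum of v over [0,t) x [0,y) x [0,x)."""
    T, H, W = len(v), len(v[0]), len(v[0][0])
    ii = [[[0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                # 3D inclusion-exclusion over the already-computed neighbors
                ii[t + 1][y + 1][x + 1] = (
                    v[t][y][x]
                    + ii[t][y + 1][x + 1] + ii[t + 1][y][x + 1] + ii[t + 1][y + 1][x]
                    - ii[t][y][x + 1] - ii[t][y + 1][x] - ii[t + 1][y][x]
                    + ii[t][y][x]
                )
    return ii

def cuboid_sum(ii, t0, y0, x0, t1, y1, x1):
    """Sum of the original volume over the half-open cuboid [t0,t1) x [y0,y1) x [x0,x1)."""
    return (
        ii[t1][y1][x1]
        - ii[t0][y1][x1] - ii[t1][y0][x1] - ii[t1][y1][x0]
        + ii[t0][y0][x1] + ii[t0][y1][x0] + ii[t1][y0][x0]
        - ii[t0][y0][x0]
    )
```

Because the cost of cuboid_sum does not depend on the cuboid size, mean values over regions of any spatial or temporal extent can be obtained in constant time, which is what makes arbitrary scales practical without a precomputed pyramid.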
HON 4D
A similar approach to the 3D HOG is called HON 4D [198], which computes descriptors as Histogram of Oriented 4D Normals, where the 3D surface normal + time add up to four dimensions (4D). HON 4D uses sequences of depth images or 3D depth maps as the basis for computing the descriptor, rather than 2D image frames, as in the 3D HOG method. So a depth camera is needed. In this respect, HON 4D is similar to some volume rendering methods which compute 3D surface normals, and may be accelerated using similar methods [452,453,454].
In the HON 4D method, the surface normals capture the surface shape cues of each object, and changes in normal orientation over time can be used to determine motion and pose. Only the orientation of the surface normal is significant in this method, so the normal lengths are all normalized to unity length. As a result, the binning into histograms acts differently from the HOG style binning, so that the fourth dimension of time encodes differences in the gradient from frame to frame. The HON 4D descriptor is binned and quantized using 4D projector functions, which quantize local surface normal orientation into a 600-cell polychoron, which is a geometric extension of a 2D polygon into 4-space.
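A rough sketch of the unit-normal computation and projection step is shown below. It treats the depth sequence as a surface z = depth[t][y][x], takes the 4D normal as (-dz/dx, -dz/dy, -dz/dt, 1) from central differences, and normalizes it to unit length. The sign convention, the difference scheme, and the function names are assumptions. The eight axis-aligned unit vectors are used here only as a small stand-in for the full projector set derived from the 600-cell, which is not constructed in this example.

```python
import math

def normal_4d(depth, x, y, t):
    """Unit-length 4D normal of the surface z = depth[t][y][x] at an interior sample."""
    dzdx = (depth[t][y][x + 1] - depth[t][y][x - 1]) / 2.0
    dzdy = (depth[t][y + 1][x] - depth[t][y - 1][x]) / 2.0
    dzdt = (depth[t + 1][y][x] - depth[t - 1][y][x]) / 2.0
    n = (-dzdx, -dzdy, -dzdt, 1.0)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

# Stand-in projector set (assumption): the 8 axis-aligned unit vectors in 4-space.
AXES = [tuple(s if j == i else 0.0 for j in range(4))
        for i in range(4) for s in (1.0, -1.0)]

def project(normal, projectors):
    """Distribute a unit normal over projectors by non-negative inner product, normalized to sum 1."""
    weights = [max(0.0, sum(a * b for a, b in zip(normal, p))) for p in projectors]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights
```

Accumulating the projected weights over all samples in a spatiotemporal cell gives a histogram that depends only on normal orientation, since every normal has unit length before projection.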
Compare the discrimination of the HON 4D method, which uses gradient orientation, with that of the HOG method, which uses gradient magnitude. If two surfaces are the same or similar with respect to gradient magnitude, the HOG style descriptor cannot differentiate them; however, the HON 4D style descriptor can, owing to the orientation of the surface normal used in the descriptor. Of course, computing 3D normals is compute-intensive without special optimizations, considering the noncontiguous memory access patterns required to access each component of the volume.
3D SIFT
The 3D gradient orientation, expressed as a pair of angles, is computed as follows:
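One way to write this computation, following our reading of the 3D SIFT formulation in [195], is given here. The symbols L_x, L_y, and L_t, denoting finite-difference derivatives of the volume v(x,y,t) along x, y, and t, and the name m_3D for the gradient magnitude, are notation assumed for this sketch.

```latex
\begin{aligned}
m_{3D}(x,y,t) &= \sqrt{L_x^2 + L_y^2 + L_t^2} \\
\theta(x,y,t) &= \tan^{-1}\!\left( \frac{L_y}{L_x} \right) \\
\phi(x,y,t)   &= \tan^{-1}\!\left( \frac{L_t}{\sqrt{L_x^2 + L_y^2}} \right)
\end{aligned}
```

Here θ is the in-plane (azimuth) angle and φ is the angle away from the image plane toward the time axis, so φ stays within plus or minus 90 degrees.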
This method provides a unique two-valued (φ, θ) representation for each angle of the gradient orientation in 3-space at each keypoint. The binning stage is handled differently from SIFT, and instead uses orthogonal bins defined by meridians and parallels in a spherical coordinate space. This is simpler to compute, but requires normalization of each value to account for the spherical difference in the apparent size ranging from the poles to the equator.
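The meridian-and-parallel binning and its normalization can be sketched as follows. The sketch takes a 3D gradient (gx, gy, gt), converts it to a magnitude and two angles, picks a bin on an equal-angle grid, and divides the magnitude by the solid angle of that bin so that the small bins near the poles are not under-weighted. The bin counts, the function names, and the choice of solid angle as the normalizer are assumptions for illustration.

```python
import math

def orientation_3d(gx, gy, gt):
    """Gradient magnitude, azimuth theta in (-pi, pi], and elevation phi in [-pi/2, pi/2]."""
    m = math.sqrt(gx * gx + gy * gy + gt * gt)
    theta = math.atan2(gy, gx)
    phi = math.atan2(gt, math.hypot(gx, gy))
    return m, theta, phi

def spherical_bin(theta, phi, n_theta=8, n_phi=4):
    """Bin indices on a grid of meridians (theta) and parallels (phi)."""
    t_bin = min(int((theta + math.pi) / (2 * math.pi) * n_theta), n_theta - 1)
    p_bin = min(int((phi + math.pi / 2) / math.pi * n_phi), n_phi - 1)
    return t_bin, p_bin

def bin_solid_angle(p_bin, n_theta=8, n_phi=4):
    """Solid angle of one bin in elevation band p_bin; bands near the poles are smaller."""
    lo = -math.pi / 2 + p_bin * math.pi / n_phi
    hi = lo + math.pi / n_phi
    return (2 * math.pi / n_theta) * (math.sin(hi) - math.sin(lo))

def accumulate(hist, gx, gy, gt, n_theta=8, n_phi=4):
    """Add one gradient sample to hist[p_bin][t_bin], weighted by magnitude / solid angle."""
    m, theta, phi = orientation_3d(gx, gy, gt)
    t_bin, p_bin = spherical_bin(theta, phi, n_theta, n_phi)
    hist[p_bin][t_bin] += m / bin_solid_angle(p_bin, n_theta, n_phi)
```

The solid angles of all bins sum to 4π, the area of the unit sphere, which is a quick sanity check on the normalization.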
To compute the 3D SIFT descriptor, the 3D gradient orientation of each subhistogram is used to guide rotation of the 3D region at the descriptor keypoint to point to 0, which provides a measure of rotational invariance to the descriptor. Each point will be represented as a single gradient magnitude and two orientation angles (φ, θ) instead of one, as in 2D SIFT. The descriptor binning is computed over three dimensions into adjacent cubes, instead of over two dimensions as in the 2D SIFT descriptor.
Once the feature vectors are binned, the feature vector set is clustered into groups of like features, or words, using hierarchical K-means clustering into a spatiotemporal word vocabulary. Another step beyond the clustering could be to reduce the feature set using sparse coding methods [115–117], but the sparse coding step is not attempted.
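The vocabulary-building step can be illustrated with a plain (flat) K-means, used here as a simplified stand-in for the hierarchical K-means mentioned above. The sketch initializes the centers from the first k vectors so that it is deterministic, assigns each descriptor to its nearest center, and then counts word occurrences; the function names and the initialization are assumptions for this example.

```python
def kmeans(vectors, k, iters=20):
    """Flat K-means: returns (centers, labels). Centers start at the first k vectors."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def word_histogram(labels, k):
    """Bag-of-words signature: how many descriptors fell into each word."""
    hist = [0] * k
    for lab in labels:
        hist[lab] += 1
    return hist
```

Each cluster center plays the role of one word, and the word histogram of a video clip can then serve as its signature for classification.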
Results using 3D SIFT for action recognition are reported to be quite good compared to other similar methods; see reference [195].
Summary
In this chapter we surveyed a wide range of local interest point detectors and feature descriptor methods to learn ‘what’ practitioners are doing, including both 2D and 3D methods. The vision taxonomy from Chapter 5 was used to divide the feature descriptor survey along the lines of descriptor families, such as local binary methods, spectra methods, and polygon shape methods. There is some overlap between local and regional descriptors; however, this chapter tries to focus on local descriptor methods, leaving regional methods to Chapter 3. Local interest point detectors are discussed in a simple taxonomy including intensity-based region methods, edge-based region methods, and shape-based region methods, including background on key concepts and mathematics used by many interest point detector methods. Some of the difficulties in choosing an appropriate interest point detector were discussed, and several detector methods were surveyed.
This chapter also highlighted retrofits to common descriptor methods. For example, many descriptors are retrofitted by changing the descriptor spectra used, such as LBP vs. gradient methods, or by swapping out the interest point detector for a different method. Summary information was provided for feature descriptors following the taxonomy attributes developed in Chapter 5 to enable limited comparisons, using the analysis of local feature description design concepts presented in Chapter 4.