21.1 Introduction: SIFT, CDVS, and the TMuC

The extensive use of images and video streams in modern applications requires algorithms that automatically search for information within a frame. One of the most widely adopted algorithms is the Scale Invariant Feature Transform (SIFT) proposed by D.G. Lowe [1].

The circuit proposed in this paper is part of a contribution to the CDVS (Compact Descriptors for Visual Search) standardization project. This project aims to produce a standard for the extraction and interchange of image-derived information. The process is supported by the adoption of an Evaluation Framework (the Test Model under Consideration, TMuC), which is used to simulate and validate the proposals made by the members of the standardization committee: the algorithm currently considered by the CDVS committee is based on SIFT.

The SIFT algorithm [1] detects a number of points of interest (keypoints, KPs) in an input image, which are then described by means of a vector known as the SIFT descriptor. Such a descriptor is based on a statistical characterization of the luminance gradients of the pixels surrounding the KP itself.

SIFT identifies the KPs by building a band-pass pyramid of images and characterizes them (i.e. builds the descriptors) using a low-pass pyramid of images. In [1] the image pyramids are generated by filtering the input image with a set of Gaussian kernels to produce the low-pass images and by subtracting pairs of low-pass images to obtain the Difference of Gaussian (DoG) images. These images, as shown by Lowe, constitute an approximation of the scale-normalized Laplacian of Gaussian (LoG). The LoG is an operator for obtaining the band-pass pyramid of an image, discussed in [2] and shown in [3] to produce the most stable features when compared to a range of other operators.
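As an illustration of this construction, the following Python sketch builds one octave of Gaussian-smoothed images and the corresponding DoG levels. It is a minimal reference model only: it assumes SciPy's gaussian_filter, and the scale parameters are indicative of Lowe's choices rather than of the TMuC settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma0=1.6, scales_per_octave=3):
    """Build one octave of Gaussian-smoothed images and the
    Difference-of-Gaussian (DoG) levels that approximate the
    scale-normalized LoG."""
    k = 2.0 ** (1.0 / scales_per_octave)
    # scales_per_octave + 3 Gaussian images are produced so that the
    # extrema search can cover scales_per_octave complete DoG intervals
    sigmas = [sigma0 * k ** i for i in range(scales_per_octave + 3)]
    gaussians = [gaussian_filter(image.astype(np.float32), s) for s in sigmas]
    dogs = [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]
    return gaussians, dogs

# Next octave: downsample the appropriate Gaussian image by 2 and repeat.
```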

During the process, an iterative downsampling is performed and the whole cascade of filtering and subtraction is repeated to produce the next octave, until the complete pyramids are generated.

In the subsequent steps of the algorithm, local extrema are searched for across the scales, leading to a first set of candidate KP locations. A stability analysis is then carried out for these locations. Next, a first statistical characterization of the image gradients around each location is performed to assign one or more intrinsic orientations to the keypoint; finally, each KP is normalized with respect to its orientations and its SIFT descriptor is computed.
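A naive sketch of the first of these steps, the search for local extrema across scales, is given below. The 3x3x3 neighbourhood test follows [1], while the contrast threshold value is purely illustrative, and the subsequent stability analysis and orientation assignment are omitted.

```python
import numpy as np

def local_extrema(dogs, contrast_thr=0.03):
    """Naive 3x3x3 extrema search over one octave of DoG images;
    returns (scale, y, x) candidate keypoint locations."""
    stack = np.stack(dogs)                      # shape: (scales, H, W)
    candidates = []
    for s in range(1, stack.shape[0] - 1):
        for y in range(1, stack.shape[1] - 1):
            for x in range(1, stack.shape[2] - 1):
                v = stack[s, y, x]
                if abs(v) < contrast_thr:       # weak responses are discarded early
                    continue
                cube = stack[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                if v >= cube.max() or v <= cube.min():
                    candidates.append((s, y, x))
    return candidates
```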

This process, while providing the best performance reported to date, is very demanding in terms of computational resources; as a consequence, various hardware approaches to SIFT elaboration in real-time scenarios have been proposed.

21.2 State of the Art Approaches to SIFT Elaboration

Current scientific literature provides several approaches to real-time SIFT elaboration: some of them, [4], leverage the processing power of general-purpose GPUs; others exploit multi-processor and/or multi-core systems, [5].

For embedded systems applications, where limited general-purpose processing capabilities and low power consumption are major concerns, typical solutions rely on specialized hardware, either in the form of ASICs or of FPGA deployments. In [6], a mixed hardware/soft-processor environment is proposed, while in [7] an FPGA and a DSP processor are jointly used to speed up the overall processing. Other recent works, [8], propose all-hardware solutions that operate in the space domain and adopt DoG filters. This paper presents a study towards the implementation of a SIFT keypoint detector that operates in the frequency domain and adopts LoG filters.

21.3 Frequency Based Approach and Block-Based LoG

In this study we explore the frequency-domain approach and evaluate its performance with respect to the TMuC reference (space-domain) floating-point implementation. Furthermore, this gives us the opportunity to use the LoG filters, which characterize the Scale Space theory image pyramid in its original formulation as given in [2], without incurring the high computational cost of the 2D convolutions required in the space domain.

Processing the whole VGA image in the frequency domain would require large buffers (i.e. large enough to contain the whole Discrete Fourier Transform of the image, which would be composed of exactly \(640\,\times \,480\) samples). This is a typical downside of operating in the frequency domain, which is less suitable for streaming applications, and a common solution is to partition the frame into blocks. Following the approach proposed in [9], and taking into account the length of the finite impulse response of the filters to be used, we choose the optimal block size.

This optimal choice is obtained by minimizing the product of the computational cost of a single block and the total number of blocks: if we denote the image width and height respectively by W and H, the block size by N and the maximum filter kernel size by L, the total number of blocks is

$$\begin{aligned} N_{blocks}=\left\lceil \frac{W}{N-L+1}\right\rceil \cdot \left\lceil \frac{H}{N-L+1}\right\rceil \end{aligned}$$
(21.1)

The next point is determining the computational load for the elaboration of a single block. For each block we first perform N FFTs in the row direction and N FFTs in the column direction to obtain its frequency spectrum. Then, for each filter of the image pyramids, we perform \(N^2\) multiplications. Finally, for each filtered spectrum, we perform N IFFTs in the row direction and N IFFTs in the column direction to return to the space domain. Since each FFT requires \(N\log _2(N)\) operations and the filter bank is composed of 8 filters, consistently with the TMuC formulation of SIFT, the single forward pass and the 8 inverse passes amount to 9 row/column transform passes, and we obtain:

$$\begin{aligned} N_{ops}=9\cdot ( 2N\cdot (N\log _2(N)))+8N^2 \end{aligned}$$
(21.2)

Limiting the analysis to block sizes that are integer powers of two, we obtain the computational loads shown in Table 21.1, which refer to the VGA image resolution and to a maximum kernel size L equal to 33, as in the TMuC SIFT formulation. Table 21.1 indicates that operating on blocks of \(128\,\times \,128\) pixels is the optimal choice.

Table 21.1 Computational loads for block sizes which are integer powers of two
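The trend in Table 21.1 can be cross-checked with a short script such as the one below. It assumes an overlap-save style partitioning in which each \(N\,\times \,N\) block yields \((N-L+1)^2\) valid output pixels, as in Eq. (21.1), and applies the operation count of Eq. (21.2); it is only a sanity-check sketch, not the tool used to generate the table.

```python
import math

W, H, L = 640, 480, 33          # VGA resolution, maximum kernel size

def total_ops(N, n_filters=8):
    # Blocks needed to cover the image when each N x N block yields
    # (N - L + 1)^2 valid output pixels (Eq. 21.1)
    n_blocks = math.ceil(W / (N - L + 1)) * math.ceil(H / (N - L + 1))
    # 1 forward + n_filters inverse transform passes, each made of
    # 2N one-dimensional N-point FFTs, plus N^2 products per filter (Eq. 21.2)
    ops_per_block = (1 + n_filters) * 2 * N * N * math.log2(N) + n_filters * N * N
    return n_blocks * ops_per_block

for N in (64, 128, 256, 512):
    print(f"N = {N:4d}  total operations = {total_ops(N):,.0f}")
# With these assumptions the minimum is reached at N = 128.
```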

21.4 Proposed Filtering Processor Architecture

The proposed architecture is depicted in Fig. 21.1. The fundamental blocks of the processor are a 1D FFT unit which, used iteratively, transforms the block from the space domain to the frequency domain and vice versa, and a filter bank which allows the calculation of both the LoG and the Gaussian-smoothed images needed by the SIFT algorithm to construct the keypoint descriptors.

Fig. 21.1 Structure of the block processor: memory blocks are shown in red

The FFT unit is a mixed-radix dual-channel unit which operates following a decimation-in-frequency approach. The internal multiply operations are optimized in terms of speed and hardware resource utilization by exploiting the circuits and the design techniques proposed in [10–13]. The output of the FFT is not reordered inside the unit itself, to avoid incurring the associated resource utilization penalty: for this reason, a memory address generator issues to Block Buffers 1 and 2 the correct sequence of read/write addresses, so that the output data are stored in natural order. The memory address generator also allows the buffers to be read and written in row/column order, which is needed to transform the rows and the columns separately via the 1D FFT unit.
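To make the reordering strategy concrete, the sketch below shows the address pattern for the plain radix-2 case. The actual unit is mixed-radix, so the real generator issues the corresponding digit-reversed (rather than bit-reversed) sequence, but the principle of writing to permuted addresses so that the data land in natural order is the same.

```python
def bit_reverse(index, bits):
    """Bit-reversed value of `index` on `bits` address bits."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (index & 1)
        index >>= 1
    return r

# For N = 128 (7 address bits) a radix-2 decimation-in-frequency FFT fed in
# natural order emits frequency sample bit_reverse(p, 7) at output position p,
# so writing the word produced at position p to address bit_reverse(p, 7)
# stores the spectrum in natural order without reordering logic in the FFT unit.
write_addresses = [bit_reverse(p, 7) for p in range(128)]
```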

The multiplexers at the sides of the FFT unit direct the data flow to allow an iterated use of the unit; we can distinguish 4 phases of operation, of which the last two are repeated for all the filters in the bank (a functional sketch of the phases follows the list):

  1. The block is fed from the circuit input to the FFT unit row-wise and the output is stored, half transformed, in Block Buffer 1.

  2. The block is fetched column-wise from Buffer 1, transformed, and stored again in the buffer, which now holds the frequency spectrum of the block.

  3. The samples of the frequency spectrum are multiplied by the corresponding samples of the filter, inversely transformed row-wise and then stored in Buffer 2, so that the spectrum in Buffer 1 remains available for the other filters.

  4. The final inverse FFT pass (in the column direction) is performed and the result is fed to the circuit output.

21.5 Architecture Tuning and Experimental Results

Since the processor operates on fixed-point data, an accurate tuning of the datapath and of the internal signal representation is needed in order to obtain an acceptable fidelity with respect to the floating-point elaboration, [14, 15]. In this case the aspects of the coprocessor that have been studied mainly concern the intermediate data widths and the lsb (least significant bit) weights of the input and output images. An exhaustive PSNR measurement with respect to the TMuC floating-point elaboration has thus been conducted to find the best possible combination of resource allocations: considering that each FFT pass increases the dynamic range of the samples by \(\log _2(N)\) bits (in our case, being \(N=128\), the increment is 7 bits per pass), we have a set of constraints for our allocation problem. If we denote the datapath width by D, the weight of the lsb by l, and the scaling factors after the first and second FFT by f and s, we have:

  • the input is given as 8-bit unsigned samples, which are extended to 9 bits to include sign information;

  • the condition \(9 + 2\cdot 7 + l - (f + s) \le D\) must hold to avoid overflows or saturations, with the equality ensuring the maximum precision for the datapath width D.

From these constraints a set of possible resource allocation schemes emerges; the best combinations (for each of the considered datapath widths) are shown in Table 21.2.
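A toy enumeration of the admissible allocations under the equality condition above can be written as follows. The interpretation of l as a number of fractional bits retained at the input, the search ranges and the function name are assumptions made only for illustration and do not reflect the actual exploration carried out for Table 21.2.

```python
# Enumerate (l, f, s) triples satisfying the constraint of Sect. 21.5
# with equality (no wasted headroom) for a given datapath width D.
def allocations(D, max_shift=7, max_frac_bits=8):
    schemes = []
    for l in range(max_frac_bits + 1):       # lsb weight 2**(-l), assumed
        for f in range(max_shift + 1):       # scaling after the first FFT pass
            for s in range(max_shift + 1):   # scaling after the second FFT pass
                if 9 + 2 * 7 + l - (f + s) == D:
                    schemes.append((l, f, s))
    return schemes

print(allocations(D=20))   # candidate (l, f, s) triples for a 20-bit datapath
```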

Table 21.2 Best resource allocations for various datapath widths
Table 21.3 FPGA resource occupation for the 20-bit datapath version of the coprocessor

Table 21.3 shows the implementation results for the 20-bit datapath version of the coprocessor on an Altera Stratix IV family FPGA.

21.6 Conclusions

The proposed architecture constitutes an approach to SIFT elaboration in the frequency domain which achieves a PSNR of almost 50 dB with respect to the TMuC floating-point elaboration while keeping a small footprint on the target FPGA. The architecture is currently under development, with further stages to be introduced to complete the whole SIFT pipeline in a full-hardware environment.