21.1 Introduction: SIFT, CDVS, and the TMuC

The extensive use of images and video streams in modern applications requires algorithms that automatically search for information within a frame. One of the most widely adopted algorithms is the Scale Invariant Feature Transform (SIFT) proposed by D.G. Lowe [1].

The circuit proposed in this paper is part of a contribution to the CDVS (Compact Descriptors for Visual Search) standardization project. This project aims to produce a standard for the extraction and interchange of image-derived information. The process is supported by the adoption of an Evaluation Framework (the Test Model under Consideration, TMuC), which is used to simulate and validate the proposals made by the members of the standardization committee: the algorithm currently considered by the CDVS committee is based on SIFT.

The SIFT algorithm [1] detects a number of points of interest (keypoints, KPs) in an input image, which are then described by means of a vector known as the SIFT descriptor. Such a descriptor is based on a statistical characterization of the luminance gradients of the pixels surrounding the KP itself.

SIFT identifies the KPs by building a band-pass pyramid of images and characterizes them (i.e. builds the descriptors) using a low-pass pyramid of images. In [1] the image pyramids are generated by filtering the input image with a set of Gaussian kernels to produce the low-pass images and by subtracting pairs of low-pass images to obtain the Difference of Gaussian (DoG) images. These images, as shown by Lowe, constitute an approximation of the scale-normalized Laplacian of Gaussian (LoG). The LoG is an operator for obtaining the band-pass pyramid of an image, discussed in [2] and shown in [3] to produce the most stable features when compared to a range of other operators.
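As an illustration of this construction, the following Python sketch builds one octave of Gaussian-smoothed images and the corresponding DoG levels. It is a minimal reference model only: it assumes SciPy's gaussian_filter, and the scale parameters are indicative of Lowe's choices rather than of the TMuC settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma0=1.6, scales_per_octave=3):
    """Build one octave of Gaussian-smoothed images and the
    Difference-of-Gaussian (DoG) levels that approximate the
    scale-normalized LoG."""
    k = 2.0 ** (1.0 / scales_per_octave)
    # scales_per_octave + 3 Gaussian images are produced so that the
    # extrema search can cover scales_per_octave complete DoG intervals
    sigmas = [sigma0 * k ** i for i in range(scales_per_octave + 3)]
    gaussians = [gaussian_filter(image.astype(np.float32), s) for s in sigmas]
    dogs = [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]
    return gaussians, dogs

# Next octave: downsample the appropriate Gaussian image by 2 and repeat.
```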

During the process, an iterative downsampling is performed and the whole cascade of filtering and subtraction is repeated to produce the next octave, until the complete pyramids are generated.

In the subsequent steps of the algorithm, local extrema are searched for across the scales, leading to a first set of candidate KP locations. A stability analysis is then carried out for these locations. Next, a first statistical characterization of the image gradients around each location is performed to assign one or more intrinsic orientations to the keypoint; finally, each KP is normalized with respect to its orientations and its SIFT descriptor is computed.
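A naive sketch of the first of these steps, the search for local extrema across scales, is given below. The 3x3x3 neighbourhood test follows [1], while the contrast threshold value is purely illustrative, and the subsequent stability analysis and orientation assignment are omitted.

```python
import numpy as np

def local_extrema(dogs, contrast_thr=0.03):
    """Naive 3x3x3 extrema search over one octave of DoG images;
    returns (scale, y, x) candidate keypoint locations."""
    stack = np.stack(dogs)                      # shape: (scales, H, W)
    candidates = []
    for s in range(1, stack.shape[0] - 1):
        for y in range(1, stack.shape[1] - 1):
            for x in range(1, stack.shape[2] - 1):
                v = stack[s, y, x]
                if abs(v) < contrast_thr:       # weak responses are discarded early
                    continue
                cube = stack[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                if v >= cube.max() or v <= cube.min():
                    candidates.append((s, y, x))
    return candidates
```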

This process, while providing the best performance reported to date, is very demanding in terms of computational resources; as a consequence, various hardware approaches to SIFT elaboration in real-time scenarios have been proposed.

21.2 State of the Art Approaches to SIFT Elaboration

Current scientific literature provides several approaches to real-time SIFT elaboration: some of them, [4], leverage the processing power of general-purpose GPUs; others exploit multi-processor and/or multi-core systems, [5].

For embedded systems applications, where limited general-purpose processing capabilities and low power consumption are major concerns, typical solutions rely on specialized hardware, either in the form of ASICs or of FPGA deployments. In [6], a mixed hardware/soft-processor environment is proposed, while in [7] an FPGA and a DSP processor are jointly used to speed up the overall processing. Other recent works, [8], propose all-hardware solutions that operate in the space domain and adopt DoG filters. This paper presents a study towards the implementation of a SIFT keypoint detector that operates in the frequency domain and adopts LoG filters.

21.3 Frequency Based Approach and Block-Based LoG

In this study we explore the frequency-domain approach and evaluate its performance with respect to the TMuC reference (space-domain) floating-point implementation. Furthermore, this gives us the opportunity to use the LoG filters, which characterize the Scale Space theory image pyramid in its original formulation as given in [2], without incurring the high computational cost of the 2D convolutions required in the space domain.

Processing the whole VGA image in the frequency domain would require large buffers (i.e. large enough to contain the whole Discrete Fourier Transform of the image, which would be composed of exactly \(640\,\times \,480\) samples). This is a typical downside of operating in the frequency domain, which is less suitable for streaming applications, and a common solution is to partition the frame into blocks. Following the approach proposed in [9], and taking into account the length of the finite impulse response of the filters to be used, we choose the optimal block size.

This optimal choice is obtained by minimizing the product of the computational cost of a single block and the total number of blocks: if we denote the image width and height respectively by W and H, the block size by N and the maximum filter kernel size by L, the total number of blocks is

$$\begin{aligned} N_{blocks}=\left\lceil \frac{W}{N-L+1}\right\rceil \cdot \left\lceil \frac{H}{N-L+1}\right\rceil \end{aligned}$$
(21.1)

The next point is determining the computational load for the elaboration of a single block. For each block we first perform N FFTs in the row direction and N FFTs in the column direction to obtain its frequency spectrum. Then, for each filter of the image pyramids, we perform \(N^2\) multiplications. Finally, for each filtered spectrum, we perform N IFFTs in the row direction and N IFFTs in the column direction to return to the space domain. Since each FFT requires \(N\log _2(N)\) operations and the filter bank is composed of 8 filters, consistently with the TMuC formulation of SIFT, the single forward pass and the 8 inverse passes amount to 9 row/column transform passes, and we obtain:

$$\begin{aligned} N_{ops}=9\cdot ( 2N\cdot (N\log _2(N)))+8N^2 \end{aligned}$$
(21.2)

Limiting the analysis to block sizes that are integer powers of two, we obtain the computational loads shown in Table 21.1, which refer to the VGA image resolution and to a maximum kernel size L equal to 33, as in the TMuC SIFT formulation. Table 21.1 indicates that operating on blocks of \(128\,\times \,128\) pixels is the optimal choice.

Table 21.1 Computational loads for block sizes which are integer powers of two
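The trend in Table 21.1 can be cross-checked with a short script such as the one below. It assumes an overlap-save style partitioning in which each \(N\,\times \,N\) block yields \((N-L+1)^2\) valid output pixels, as in Eq. (21.1), and applies the operation count of Eq. (21.2); it is only a sanity-check sketch, not the tool used to generate the table.

```python
import math

W, H, L = 640, 480, 33          # VGA resolution, maximum kernel size

def total_ops(N, n_filters=8):
    # Blocks needed to cover the image when each N x N block yields
    # (N - L + 1)^2 valid output pixels (Eq. 21.1)
    n_blocks = math.ceil(W / (N - L + 1)) * math.ceil(H / (N - L + 1))
    # 1 forward + n_filters inverse transform passes, each made of
    # 2N one-dimensional N-point FFTs, plus N^2 products per filter (Eq. 21.2)
    ops_per_block = (1 + n_filters) * 2 * N * N * math.log2(N) + n_filters * N * N
    return n_blocks * ops_per_block

for N in (64, 128, 256, 512):
    print(f"N = {N:4d}  total operations = {total_ops(N):,.0f}")
# With these assumptions the minimum is reached at N = 128.
```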

21.4 Proposed Filtering Processor Architecture

The proposed architecture is depicted in Fig. 21.1. The fundamental blocks of the processor are a 1D FFT unit which, used iteratively, transforms the block from the space domain to the frequency domain and vice versa, and a filter bank which allows the calculation of both the LoG and the Gaussian-smoothed images needed by the SIFT algorithm to construct the keypoint descriptors.

Fig. 21.1 Structure of the block processor: memory blocks are shown in red

The FFT unit is a mixed-radix dual-channel unit which operates following a decimation-in-frequency approach. The internal multiply operations are optimized in terms of speed and hardware resource utilization by exploiting the circuits and the design techniques proposed in [10–13]. The output of the FFT is not reordered inside the unit itself, to avoid incurring the associated resource utilization penalty: for this reason, a memory address generator issues to Block Buffers 1 and 2 the correct sequence of read/write addresses, so that the output data are stored in natural order. The memory address generator also allows the buffers to be read and written in row/column order, which is needed to transform the rows and the columns separately via the 1D FFT unit.
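To make the reordering strategy concrete, the sketch below shows the address pattern for the plain radix-2 case. The actual unit is mixed-radix, so the real generator issues the corresponding digit-reversed (rather than bit-reversed) sequence, but the principle of writing to permuted addresses so that the data land in natural order is the same.

```python
def bit_reverse(index, bits):
    """Bit-reversed value of `index` on `bits` address bits."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (index & 1)
        index >>= 1
    return r

# For N = 128 (7 address bits) a radix-2 decimation-in-frequency FFT fed in
# natural order emits frequency sample bit_reverse(p, 7) at output position p,
# so writing the word produced at position p to address bit_reverse(p, 7)
# stores the spectrum in natural order without reordering logic in the FFT unit.
write_addresses = [bit_reverse(p, 7) for p in range(128)]
```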

The multiplexers at the sides of the FFT unit direct the data flow to allow an iterated use of the unit; we can distinguish 4 phases of operation, of which the last two are repeated for all the filters in the bank (a functional sketch of the phases follows the list):

  1. The block is fed from the circuit input to the FFT unit row-wise and the output is stored, half transformed, in Block Buffer 1.

  2. The block is fetched column-wise from Buffer 1, transformed, and stored again in the buffer, which now holds the frequency spectrum of the block.

  3. The samples of the frequency spectrum are multiplied by the corresponding samples of the filter, inversely transformed row-wise and then stored in Buffer 2, so that the spectrum in Buffer 1 remains available for the other filters.

  4. The final inverse FFT pass (in the column direction) is performed and the result is fed to the circuit output.

21.5 Architecture Tuning and Experimental Results

Since the processor operates on fixed-point data, an accurate tuning of the datapath and of the internal signal representation is needed in order to obtain an acceptable fidelity with respect to the floating-point elaboration, [14, 15]. In this case the aspects of the coprocessor that have been studied mainly concern the intermediate data widths and the lsb (least significant bit) weights of the input and output images. An exhaustive PSNR measurement with respect to the TMuC floating-point elaboration has thus been conducted to find the best possible combination of resource allocations: considering that each FFT pass increases the dynamic range of the samples by \(\log _2(N)\) bits (in our case, being \(N=128\), the increment is 7 bits per pass), we have a set of constraints for our allocation problem. If we denote the datapath width by D, the weight of the lsb by l, and the scaling factors after the first and second FFT by f and s, we have:

  • the input is given as 8-bit unsigned samples, which are extended to 9 bits to include sign information;

  • the condition \(9 + 2\cdot 7 + l - (f + s) \le D\) must hold to avoid overflows or saturations, with the equality ensuring the maximum precision for the datapath width D.

From these constraints a set of possible resource allocation schemes emerges; the best combinations (for each of the considered datapath widths) are shown in Table 21.2.
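A toy enumeration of the admissible allocations under the equality condition above can be written as follows. The interpretation of l as a number of fractional bits retained at the input, the search ranges and the function name are assumptions made only for illustration and do not reflect the actual exploration carried out for Table 21.2.

```python
# Enumerate (l, f, s) triples satisfying the constraint of Sect. 21.5
# with equality (no wasted headroom) for a given datapath width D.
def allocations(D, max_shift=7, max_frac_bits=8):
    schemes = []
    for l in range(max_frac_bits + 1):       # lsb weight 2**(-l), assumed
        for f in range(max_shift + 1):       # scaling after the first FFT pass
            for s in range(max_shift + 1):   # scaling after the second FFT pass
                if 9 + 2 * 7 + l - (f + s) == D:
                    schemes.append((l, f, s))
    return schemes

print(allocations(D=20))   # candidate (l, f, s) triples for a 20-bit datapath
```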

Table 21.2 Best resource allocations for various datapath widths
Table 21.3 FPGA resource occupation for the 20-bit datapath version of the coprocessor

Table 21.3 shows the implementation results for the 20-bit datapath version of the coprocessor on an Altera Stratix IV family FPGA.

21.6 Conclusions

The proposed architecture constitutes an approach to SIFT elaboration in the frequency domain which achieves a PSNR of almost 50 dB with respect to the TMuC floating-point elaboration while keeping a small footprint on the target FPGA. The architecture is currently under development, with further stages to be introduced to complete the whole SIFT pipeline in a full-hardware environment.