The Challenge of Onboard SAR Processing: A GPU Opportunity
Abstract
Data acquired by a Synthetic Aperture Radar (SAR), onboard a satellite or an airborne platform, must be processed to produce a visible image. To this end, data are transferred to the ground station and processed through a time- and compute-consuming focusing algorithm. Thanks to advances in avionic technology, GPUs are now available for onboard processing, opening an opportunity for onboard SAR focusing. Due to the unavailability of avionic platforms for this research, we developed a GPU-parallel algorithm on commercial off-the-shelf graphics cards and, with the help of a proper scaling factor, projected execution times for the case of an avionic GPU. We evaluated performance using ENVISAT (Environmental Satellite) ASAR Image Mode level 0 data on both NVIDIA Kepler and Turing architectures.
Keywords
Onboard SAR focusing · GPU-parallel · Range-Doppler algorithm

1 Introduction
Thanks to their synthetic aperture, SAR systems can acquire very long land swaths organized in proper data structures. However, to form a comprehensible final image, a processing procedure (focusing) is needed.
The focusing of a SAR image can be seen as an inherently space-variant two-dimensional correlation of the received echo data with the impulse response of the system. Radar echo data and the resulting Single-Look Complex (SLC) image are stored in matrices of complex numbers representing the in-phase and quadrature (i/q) components of the SAR signal. Several processors are available, based on three main algorithms: Range-Doppler, \(\omega \)-k, and Chirp Scaling [7].
Usually, this processing takes time and needs HPC algorithms to process data quickly. Heretofore, considering the limited computing hardware onboard, data have been transmitted to ground stations for processing. Nevertheless, the vast amount of acquired data and the severely limited downlink bandwidth imply that any SAR system also needs an efficient raw data compression tool. Because raw SAR data exhibit structures with apparently higher entropy, with quasi-independent in-phase and quadrature components showing nearly Gaussian histograms of identical variance, conventional image compression techniques are ill-suited, and the resulting compression rates are low.
Thanks to advances in the development of avionic specialized computing accelerators (GPUs) [1, 12], onboard SAR processing with real-time GPU-parallel focusing algorithms is now possible. This could improve sensor data usability from both strategic and tactical points of view. For example, we can think of an onboard computer equipped with a GPU directly connected to both a ground transmitter and a SAR sensor through GPUDirect [13] RDMA [5] technology.
Several efforts have been made to implement GPU SAR processors for different raw SAR data using the CUDA Toolkit. In [4], the focusing of an ERS-2 image with \(26,880 \times 4,912\) samples on an NVIDIA Tesla C1060 was obtained in 4.4 s using a Range-Doppler algorithm. A similar result is presented in [14], where a COSMO-SkyMed image of \(16,384 \times 8,192\) samples was processed with both Range-Doppler and \(\omega \)-k algorithms in 6.7 s. Another implementation of the \(\omega \)-k algorithm, described in [20], focused a Sentinel-1 image with \(22,018 \times 18,903\) samples in 10.87 s on a single Tesla K40, and in 6.48 s in a two-GPU configuration. In [15], an \(\omega \)-k-based SAR processor implemented in OpenCL and run on four Tesla K20 cards was used to focus an ENVISAT ASAR IM image of \(30,000 \times 6,000\) samples in 8.5 s and a Sentinel-1 IW image of \(52,500 \times 20,000\) samples in 65 s. All these results address the ground-station case, where one or more Tesla GPU products have been used.
Our idea is to exploit onboard avionic GPU computing resources, which are usually more limited than the Tesla series. For example, the avionic EXK107 GPU of the Kepler generation has 2 Streaming Multiprocessors (SMs), each with 192 CUDA cores, whereas the Tesla K20c, of the same architecture generation, has 13 SMs, also with 192 CUDA cores each.
This paper shares the experience gathered while testing a prototype HPC platform, whose details are subject to a non-disclosure agreement and are therefore excluded from this presentation. Nevertheless, several insights are useful to discuss new approaches in the design of SAR processing procedures and strategies. Indeed, from previous experience in GPU computing, which also included special devices [8, 9, 10, 11, 16], we can make some assumptions. Furthermore, the reasoning applied to an off-the-shelf hardware solution can to some extent be translated to an avionic product, accepting that the algorithmic logic does not change. To develop and test our algorithm, with the intent of exploiting the massive parallelism of GPUs, we applied the approach proposed in [2].
In the next section, we provide a schematic description of the Range-Doppler algorithm and focus on data-parallel kernels that can be efficiently implemented on a GPU. Section 3 presents the kernels implemented and their memory footprint in the perspective of avionic hardware. Testing is presented in Sect. 4, with an estimation of the execution time on an avionic GPU. Finally, we discuss results and conclude in Sect. 5.
2 Range-Doppler Algorithm and Identification of Data-Parallel Kernels
The GMTSAR processing system [17] relies on precise orbits (sub-meter accuracy) to simplify the processing algorithms, so techniques such as clutter-lock and autofocus are not necessary to derive the orbital parameters from the data.
In the esarp focusing component, data are processed by patches so as not to overload the computing platform. Each patch contains all the samples along the range direction and a partial record along the azimuth direction. Several patches are concatenated to obtain the image of the complete strip.
 1.
Range Compression – In the ENVISAT signal, there are 5681 points along the range direction that must be compressed into a sharp radar pulse by deconvolution with the chirp used during signal transmission. The operation is done in the frequency domain: first the chirp is transformed, then the complex product of each row with the conjugate of the transformed chirp is computed. A Fast Fourier Transform (FFT) is therefore needed before the product and an inverse FFT after it. To take advantage of the speed of radix-2 FFT, data are zero-padded to a length of 8192. This procedure yields phase information for a longer strip, which is later reduced to 6144 points for further processing.
 2.
Patch Processing – To focus the image in the azimuth direction, data must be transformed into the range-Doppler domain, i.e., the frequency domain along the azimuth direction, by applying an FFT to the transposed matrix representing the range-compressed image. For the ENVISAT radar, the synthetic aperture is 2800 points long. Again, to exploit the speed of radix-2 FFT, 4096 rows are loaded and processed, constituting a patch. The last 1296 rows overlap with the following patch.
 3.
Range Migration – As the platform moves along the flight path, the distance between the antenna and a point target changes, and that point appears as a hyperbola-shaped reflection. To compensate for this effect, the samples in the range-Doppler domain must be remapped through an interpolator. The migration path can be computed from the orbital information required by the GMTSAR implementation and must be applied to all the samples in the range direction.
 4.
Azimuth Compression – To complete the focusing in the azimuth direction, a procedure similar to Range Compression is applied. In the range-Doppler domain, a frequency-modulated chirp is created to filter the phase shift of the target. This chirp depends on the pulse repetition frequency, the range, and the velocity along the azimuth direction. As before, after the complex product, the result is inverse Fourier transformed back to the spatial domain to provide the focused image.
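The four steps above can be sketched end to end in NumPy on a toy patch. This is an illustrative sketch only: the dimensions, chirps, and the integer-shift "migration" are stand-ins of our choosing, not the ENVISAT parameters or the esarp interpolator.

```python
import numpy as np

# Toy patch: rows are azimuth lines, columns are range samples
# (stand-ins for the 4096 x 8192 ENVISAT patch described above).
n_az, n_rng = 16, 8
rng = np.random.default_rng(0)
raw = rng.standard_normal((n_az, n_rng)) + 1j * rng.standard_normal((n_az, n_rng))

# 1. Range compression: row-wise FFT, product with the conjugate
#    of the transformed range chirp, inverse FFT.
rng_chirp = np.exp(1j * np.linspace(0, np.pi, n_rng))
rc = np.fft.ifft(np.fft.fft(raw, axis=1) * np.conj(np.fft.fft(rng_chirp)), axis=1)

# 2. Patch processing: FFT along azimuth moves data to the range-Doppler domain.
rd = np.fft.fft(rc, axis=0)

# 3. Range migration: remap samples along range (a trivial integer shift
#    stands in for the orbit-driven interpolator of the real processor).
rd = np.roll(rd, 1, axis=1)

# 4. Azimuth compression: product with the conjugate azimuth chirp spectrum,
#    then inverse FFT back to the spatial domain.
az_chirp = np.exp(-1j * np.linspace(0, np.pi, n_az))
slc = np.fft.ifft(rd * np.conj(np.fft.fft(az_chirp))[:, None], axis=0)
```

The result `slc` plays the role of the focused Single-Look Complex patch; every step is a pointwise product or an FFT, which is what makes the procedure amenable to GPU kernels plus cuFFT calls.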
The filtering sub-algorithms can be easily organized as pointwise matrix operations, assuming the chirps are available in device memory for reading. This is efficiently achievable by building the range chirp directly on the GPU, as it is a one-dimensional array with spatial properties, and then transforming it into the frequency domain through a proper FFT. Similarly, the azimuth chirp can be built and transformed directly on the GPU, but this time it is a 2D array.
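As a minimal sketch of this construction, the range reference can be generated as a linear FM pulse, zero-padded to the radix-2 length, and transformed once. The sampling rate, chirp rate, and pulse length below are illustrative values we chose, not the ENVISAT ASAR parameters:

```python
import numpy as np

fs = 19.2e6        # complex sampling rate in Hz (illustrative value)
slope = 4.2e11     # chirp rate in Hz/s (illustrative value)
n_pulse = 704      # pulse duration in samples (illustrative value)
n_fft = 8192       # radix-2 length used by the processor (from the text)

t = np.arange(n_pulse) / fs
chirp = np.exp(1j * np.pi * slope * t ** 2)   # linear FM pulse, unit amplitude

ref = np.zeros(n_fft, dtype=complex)
ref[:n_pulse] = chirp                         # zero-pad to the radix-2 length
REF = np.fft.fft(ref)                         # frequency-domain reference, built once
```

`REF` is then reused for every row of every patch, which is why building and transforming the chirp once on the device, rather than per row, pays off.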
As for the remapping of the samples in the range direction, assuming enough memory is available to store the migrated samples, it too can be seen as a pointwise matrix operation, as each sample corresponds to previously patch-processed data subject to operations involving orbital information.
3 GPU Kernels and Memory Footprint
In order to evaluate the feasibility of onboard processing, we present an analysis of the resources needed.
In Algorithm 1, GPU-parallel pseudocode presents the kernels and the cuFFT calls of the GPU-parallel version of esarp. In the following, we analyze the kernels, their possible sources of Algorithmic Overhead [3], and their memory footprint.

d_orbit_coef: to remap the range samples and compensate for platform movement in the range migration step, each sample in the range is associated with 8 parameters describing the orbit characteristics and their influence on the migration. These parameters are the same for each row of the patch and are scaled according to the position in the synthetic aperture, i.e., the position in the azimuth direction. They are also used to build the chirp in the azimuth direction. To avoid needless recomputation, this kernel precomputes 8 arrays of 6144 elements, with a corresponding memory footprint of 384 KBytes. Their values can be computed independently by 6144 threads in an appropriate thread-block configuration that takes into account the number of SMs in the GPU.

d_ref_rng: this kernel populates an array with the chirp in the range direction, based on the pulse emitted by the sensor. The array is zero-padded to the next power of 2 to exploit the efficiency of the subsequent radix-2 FFT. For the ENVISAT data, the array consists of 8192 complex numbers of 8 bytes each, i.e., 64 KBytes. The workload of this kernel is proportional to the number of elements in the array. Moreover, each element can be processed independently of the others, so the workload can be split among threads. If these are organized in a number of blocks that is a multiple of the number of SMs in the GPU, we can achieve good occupancy of the device. Also, the divergence induced by the zero-padding can be minimized through the thread-block configuration.

d_ref_az: using the previously calculated orbital parameters, a 2D array of the same size as the patch is populated with the chirp in the azimuth direction, which is different for each column. Hence, the memory footprint is \(6144 \cdot 4096 \cdot 8=192\) MBytes. Beforehand, the array is reset to zero, since not all the samples are involved in the filtering. To limit divergence, each element in the array can be assigned to a thread that populates it if necessary, or waits for completion otherwise. Since the same stored orbital parameters are used for each row, the threads can be arranged in blocks with column-wise memory access in mind, to limit collisions among different SMs. Hence, the execution configuration can be organized as a 2D grid with blocks of threads on the same column.

d_mul_r: implements a pointwise multiplication of each row of the patch by the conjugate of the chirp in the frequency domain. The workload can be assigned to independent threads with coalesced memory accesses. Following reasoning similar to d_ref_az, with the idea of limiting memory collisions, each thread in a block can compute one column of the patch in a for loop, realizing a coalesced write of the results together with the other threads in the same warp. This kernel does not require additional memory.

d_scale: after the inverse FFT needed to transform the patch back to the spatial domain, a pointwise scaling is needed. As before, independent threads can work with coalesced memory accesses, and efficient workload assignments can be configured.

d_trans_mat: this kernel follows the highly efficient sample proposed in [6]. In this case, the memory footprint corresponds to a new array with the same dimensions as the patch, i.e., 192 MBytes.

d_intp_tot: the remapping of the samples is carried out as a pointwise procedure. The output patch must reside in a different memory location, so the memory footprint again includes an additional 192 MBytes. By reasoning about memory accesses as we did for the d_ref_az kernel, we can configure the execution to minimize global memory collisions, optimizing block dimensions for occupancy.

d_mul_a: this kernel filters the patch to focus the final image in the frequency domain. The operations consist of element-wise matrix products and need no additional work area in memory. An efficient thread-block configuration can follow the reasoning made for the previous kernels.
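The thread-block sizing argument that recurs in the kernels above can be made concrete with a small helper. The rounding policy and default block size below are our illustration of the principle, not the configuration used in the actual code:

```python
def launch_config(n_elems, n_sms, block_size=256):
    """Return (blocks, block_size) covering n_elems elements, with the
    block count rounded up to a multiple of the SM count so that every
    SM receives work (illustrative policy, not the esarp one)."""
    blocks = -(-n_elems // block_size)        # ceil(n_elems / block_size)
    blocks = -(-blocks // n_sms) * n_sms      # round up to a multiple of n_sms
    return blocks, block_size

# e.g. the 8192-element range chirp on the 2-SM avionic EXK107
print(launch_config(8192, 2))   # (32, 256)
```

With 32 blocks on 2 SMs, each SM processes 16 blocks, so no SM sits idle; the same helper applied to the 6144-element orbit arrays on a 13-SM Tesla K20c would round the block count up to a multiple of 13.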
To summarize the memory footprint of the whole procedure to focus a patch: \(192\times 2\) MBytes are needed to swap the patch when transposing and remapping data in several kernels, 256 MBytes are needed for the most demanding FFT, and the preliminary computation of chirps and orbit data requires \(\approx \)192.5 MBytes. The total is less than 1 GByte, an amount available on every GPU considered.
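The footprint arithmetic can be re-derived directly, assuming, as in the text, 8-byte complex samples and taking the 256 MBytes quoted for the most demanding FFT as given:

```python
MB = 1024 * 1024

patch_mb = 6144 * 4096 * 8 / MB   # one 6144 x 4096 complex patch -> 192.0 MB
orbit_mb = 8 * 6144 * 8 / MB      # eight orbit-coefficient arrays -> 0.375 MB (384 KB)
chirp_mb = 8192 * 8 / MB          # zero-padded range chirp        -> 0.0625 MB (64 KB)

# Two patch-sized swap buffers + FFT work area + azimuth chirp + preliminaries
total_mb = 2 * patch_mb + 256 + patch_mb + orbit_mb + chirp_mb
print(patch_mb, total_mb)   # 192.0 832.4375
```

The total of about 832 MBytes is comfortably below the 1-GByte bound stated above.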
4 Testing on Workstation and Reasoning on Avionic Platform
As mentioned in the introduction, we had access to a prototype avionic platform for testing purposes, and we had the opportunity to run our algorithm repeatedly. Even if we cannot disclose details about the platform architecture and testing outcomes due to an NDA, we can refer to the GPU installed, an NVIDIA EXK107 with Kepler architecture.
Workstation used for testing on Kepler architecture

Workstation Kepler
OS    Ubuntu 18.04
CPU   Intel Core i5 650 @3.20 GHz
RAM   6 GB DDR3 1333 MT/s
GPU   GeForce GTX 780 (12 SMs with 192 cores each)
Let us consider the execution time of our GPU version of the esarp processor, excluding any memory transfer between host and device, i.e., considering data already in GPU memory. This is a fair assumption, since all the focusing steps are executed locally without memory transfers between host and device. In an avionic setting, only two RDMA transfers occur: the input of a raw patch from the sensor and the output of a focused patch to the transmitter (Fig. 4).
Workstation used for testing on Turing architecture

Workstation Turing
OS    CentOS 7.6
CPU   Intel Xeon Gold 5215
RAM   94 GB
GPU   Quadro RTX 6000 (72 SMs with 64 cores each)
In Table 3 we present the execution times of the GPU-esarp software on Workstation Kepler, broken down by the steps of the Range-Doppler algorithm. The preliminary processing step, which includes the creation of the arrays containing orbital information and the chirps in both range and azimuth direction, is executed only for the first patch, as the precomputed data do not change for other patches within the same swath. The total execution time needed to focus the whole image is \(t_{wk}=1.12\) s, excluding input/output overhead and the related memory transfers between host and device.
Execution times in milliseconds for each step of the GPU-esarp software on the Workstation Kepler

Patch                   1      2      3      4      5      6      7      8      9
Preliminary processing  21.7   –      –      –      –      –      –      –      –
Range compression       46.9   46.2   46.2   46.1   46.3   46     45.8   45.9   46
Patch processing        4.8    4.8    4.8    4.8    4.8    4.8    4.8    4.7    4.7
Range migration         47.8   47.1   46.9   46.9   47.2   46.9   47.1   47.2   47.3
Azimuth compression     24.4   24.1   24.1   24.2   24.3   24.2   24.8   24.1   24.1
Total (excl. I/O)       145.4  122.2  122    122    122.6  121.9  122.5  121.9  122.1
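As a consistency check, summing the per-patch totals of Table 3 reproduces the quoted whole-image time \(t_{wk}\):

```python
# Per-patch totals (excl. I/O), in milliseconds, from Table 3.
patch_ms = [145.4, 122.2, 122.0, 122.0, 122.6, 121.9, 122.5, 121.9, 122.1]

t_wk = sum(patch_ms) / 1000.0   # whole 9-patch image, in seconds
print(round(t_wk, 2))           # 1.12
```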
Execution times in milliseconds of the GPU-esarp software on the Workstation Turing

Patch              1     2     3     4     5     6     7    8    9
Total (excl. I/O)  28.5  24.3  22.9  22.3  22.3  22.2  22   22   22
If we consider the execution times on Workstation Turing (Table 4), the total time needed to focus the whole image is \(t_{wt}=0.208\) s, excluding input/output transfers, which is very promising for the next generation of avionic GPUs. Moreover, considering the spare time available for further processing during downlink transmission, we can think about computing the Azimuth FM rate and Doppler Centroid estimators. Those algorithms provide parameters for the Range Migration and Azimuth Compression steps in case of non-uniform movements of the platform, as happens in airborne SAR.
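Since the actual scale factor is covered by the NDA, only a deliberately naive projection can be shown here: scaling the Kepler workstation time by SM count alone, under the assumption (ours, not the paper's) that per-SM throughput is comparable within the same architecture generation. Real scaling also depends on clock rates, memory bandwidth, and occupancy, so this is an upper-level sketch, not the projection used in the study:

```python
# Crude SM-count-only projection from the 12-SM GTX 780 to the 2-SM EXK107.
t_wk = 1.12                       # seconds on Workstation Kepler (from the text)
sm_workstation, sm_avionic = 12, 2

t_avionic = t_wk * sm_workstation / sm_avionic
print(round(t_avionic, 2))        # 6.72
```

Even under this pessimistic linear scaling, a whole image would focus in a few seconds on the avionic Kepler part, consistent with the paper's claim that onboard focusing is feasible.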
5 Conclusions
When thinking about SAR sensing, a common approach is to consider it an instrument for delayed operational support. Usually, SAR raw data are compressed, downlinked, and processed at ground stations to support several earth-science research activities, as well as disaster relief and military operations. In some cases, timely information is advisable, and onboard processing is becoming feasible thanks to advances in GPU technology with reduced power consumption.
In this work, we developed a GPU-parallel algorithm based on the Range-Doppler algorithm as implemented in the open-source GMTSAR processing system. The results, in terms of execution time on off-the-shelf graphics cards, are encouraging when scaled to proper avionic products. Even if we did not present actual results on an avionic GPU, thanks to insights acquired during the testing of a prototype avionic computing platform and a constant scale factor, we showed that onboard processing is possible when an efficient GPU-parallel algorithm is employed.
Since this result is based on the algorithmic assumption that orbital information is available, some processing techniques such as clutter-lock and autofocus have been avoided. That is the case for many satellite SAR sensors, but further experiments must be carried out to verify the feasibility of onboard processing on airborne platforms, where parameters like altitude and velocity may change slightly during data acquisition. In this sense, as future work, we plan to implement a GPU-parallel algorithm for parameter estimation.
References
1. GRA112 graphics board, July 2018. https://www.abaco.com/products/gra112graphicsboard
2. D'Amore, L., Laccetti, G., Romano, D., Scotti, G., Murli, A.: Towards a parallel component in a GPU-CUDA environment: a case study with the L-BFGS Harwell routine. Int. J. Comput. Math. 92(1), 59–76 (2015). https://doi.org/10.1080/00207160.2014.899589
3. D'Amore, L., Mele, V., Romano, D., Laccetti, G.: Multilevel algebraic approach for performance analysis of parallel algorithms. Comput. Inform. 38(4), 817–850 (2019). https://doi.org/10.31577/cai_2019_4_817
4. di Bisceglie, M., Di Santo, M., Galdi, C., Lanari, R., Ranaldo, N.: Synthetic aperture radar processing with GPGPU. IEEE Signal Process. Mag. 27(2), 69–78 (2010). https://doi.org/10.1109/MSP.2009.935383
5. Franklin, D.: Exploiting GPGPU RDMA capabilities overcomes performance limits. COTS J. 15(4), 16–20 (2013)
6. Harris, M.: An efficient matrix transpose in CUDA C/C++, February 2013. https://devblogs.nvidia.com/efficientmatrixtransposecudacc/
7. Hein, A.: Processing of SAR Data: Fundamentals, Signal Processing, Interferometry, 1st edn. Springer, Heidelberg (2010)
8. Laccetti, G., Lapegna, M., Mele, V., Montella, R.: An adaptive algorithm for high-dimensional integrals on heterogeneous CPU-GPU systems. Concurr. Comput.: Pract. Exper. 31(19), e4945 (2019). https://doi.org/10.1002/cpe.4945
9. Laccetti, G., Lapegna, M., Mele, V., Romano, D.: A study on adaptive algorithms for numerical quadrature on heterogeneous GPU and multicore based systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2013. LNCS, vol. 8384, pp. 704–713. Springer, Heidelberg (2014). https://doi.org/10.1007/9783642552243_66
10. Marcellino, L., et al.: Using GPGPU accelerated interpolation algorithms for marine bathymetry processing with on-premises and cloud based computational resources. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) PPAM 2017. LNCS, vol. 10778, pp. 14–24. Springer, Cham (2018). https://doi.org/10.1007/9783319780542_2
11. Montella, R., Giunta, G., Laccetti, G.: Virtualizing high-end GPGPUs on ARM clusters for the next generation of high performance cloud computing. Cluster Comput. 17(1), 139–152 (2014). https://doi.org/10.1007/s1058601303410
12. Munir, A., Ranka, S., Gordon-Ross, A.: High-performance energy-efficient multicore embedded computing. IEEE Trans. Parallel Distrib. Syst. 23(4), 684–700 (2012). https://doi.org/10.1109/TPDS.2011.214
13. NVIDIA Corporation: Developing a Linux Kernel Module Using RDMA for GPUDirect (2019). http://docs.nvidia.com/cuda/gpudirectrdma/index.html, version 10.1
14. Passerone, C., Sansoè, C., Maggiora, R.: High performance SAR focusing algorithm and implementation. In: 2014 IEEE Aerospace Conference, pp. 1–10, March 2014. https://doi.org/10.1109/AERO.2014.6836383
15. Peternier, A., Boncori, J.P.M., Pasquali, P.: Near-real-time focusing of ENVISAT ASAR Stripmap and Sentinel-1 TOPS imagery exploiting OpenCL GPGPU technology. Remote Sens. Environ. 202, 45–53 (2017). https://doi.org/10.1016/j.rse.2017.04.006
16. Rea, D., Perrino, G., di Bernardo, D., Marcellino, L., Romano, D.: A GPU algorithm for tracking yeast cells in phase-contrast microscopy images. Int. J. High Perform. Comput. Appl. 33(4), 651–659 (2019). https://doi.org/10.1177/1094342018801482
17. Sandwell, D., Mellors, R., Tong, X., Wei, M., Wessel, P.: GMTSAR: an InSAR processing system based on Generic Mapping Tools (2011)
18. Schättler, B.: ASAR level 0 product analysis for image, wide-swath and wave mode. In: Proceedings of the ENVISAT Calibration Review (2002)
19. Střelák, D., Filipovič, J.: Performance analysis and autotuning setup of the cuFFT library. In: Proceedings of the 2nd Workshop on AutotuniNg and ADaptivity AppRoaches for Energy Efficient HPC Systems, ANDARE 2018. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/3295816.3295817
20. Tiriticco, D., Fratarcangeli, M., Ferrara, R., Marra, S.: Near real-time multi-GPU \(\omega \)-k algorithm for SAR processing. In: Big Data from Space (BiDS), pp. 277–280, October 2014. https://doi.org/10.2788/1823