Architecture of an FPGA-Based Heterogeneous System for Code-Search Problems
Abstract
Code search problems involve searching for a particular bit pattern that satisfies given constraints. Obtaining such codes is very important in fields such as data encoding, error correction, and cryptography. Unfortunately, the search time increases exponentially with the number of bits in the code, and finding large codes typically requires many months of computation. On the other hand, the search method mostly consists of 1-bit computations, so reconfigurable hardware such as FPGAs (field-programmable gate arrays) can be used to obtain a massive degree of parallelism. In this paper, we propose a heterogeneous system with a CPU and an FPGA to speed up code search problems. According to the evaluation, we obtain a speedup of over 86 times compared to a typical CPU-based implementation for the extremal doubly even self-dual code search problem of length 128.
1 Introduction
Fields such as cryptography, data encoding, and error correction often require bit patterns that satisfy particular conditions. However, finding such bit patterns is very time consuming. For example, to find a 64-bit code that satisfies particular conditions, we have to search \(2^{64}\) bit patterns. Even if we can search one bit pattern per clock cycle using a 4 GHz CPU, it takes over 146 years to search all combinations. For a 128-bit code search problem, the required processing time exceeds the age of the universe. How, then, can we solve such code search problems? Mathematicians have proposed many algorithms that generate only codes satisfying the conditions, instead of searching all possible bit patterns. For example, to find all 64-bit numbers that are divisible by four, we can fix the two least significant bits to zero and generate all combinations of the other 62 bits. This reduces the search space, and hence the processing time, by a factor of four. For more complex problems, many different methods are available to reduce the amount of searching. Most of those methods use bit operations or fixed-point computations. On the other hand, CPUs and GPUs are specialized for floating-point computations, and using them for such simple bit operations is not efficient.
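The 146-year figure and the factor-of-four saving can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming one pattern per clock cycle and a 365.25-day year; the function name and the `fixed_bits` parameter are illustrative and do not come from the paper.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def exhaustive_search_years(bits, clock_hz, fixed_bits=0):
    """Years needed to test every pattern at one pattern per clock cycle.

    fixed_bits models constraints that pin some bits in advance, such as
    forcing the two low bits to zero when only multiples of four are wanted.
    """
    patterns = 2 ** (bits - fixed_bits)
    return patterns / clock_hz / SECONDS_PER_YEAR


full = exhaustive_search_years(64, 4e9)
pruned = exhaustive_search_years(64, 4e9, fixed_bits=2)
print(f"all 64-bit patterns:    {full:.1f} years")
print(f"multiples of four only: {pruned:.1f} years")
```

Fixing two bits removes a factor of \(2^2 = 4\) from the search space, which is where the fourfold reduction comes from.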
In this paper, we use reconfigurable hardware called an FPGA (field-programmable gate array) [1] to efficiently compute bit operations. FPGAs contain millions of multi-input logic gates and registers [2]. Since all these logic gates and their interconnections are configurable, we can design custom processing elements and datapaths to execute the required operations efficiently with a massive degree of parallelism. Recently, OpenCL-based FPGA design [3] has been introduced to design accelerators using C-like high-level programming. This design method allows us to exploit the full potential of an FPGA while reducing the design time [4]. OpenCL can be used not only to design an FPGA accelerator, but also to design a whole heterogeneous system, including the computations on a CPU and the data transfers between a CPU and an FPGA.
In this paper, we propose an FPGA-based heterogeneous accelerator to speed up the “extremal doubly even self-dual code” search problem [5, 6, 7]. To solve this problem, the work in [8] proposes a method that contains many bit operations that can be done in parallel. The proposed FPGA accelerator contains thousands of processing elements to perform bit operations in parallel, while the CPU computes complex but sequential operations at a higher clock frequency than the FPGA. The FPGA accelerator design and the heterogeneous system implementation are done using OpenCL. According to the evaluation, we obtain a speedup of over 86 times compared to a typical CPU-based implementation for the extremal doubly even self-dual code search problem of length 128.
2 Code-Search Problems
In this paper, we consider the acceleration of the extremal doubly even self-dual code search [8] as an example to show the efficiency of the FPGA-based heterogeneous system for such problems. Self-dual codes are an important class of linear codes with both theoretical importance and practical applications [7]. They are important in fields such as cryptography and error correction. In this section, we briefly explain the extremal doubly even self-dual code search algorithm. Note that we limit the details of the mathematical background since it is outside the scope of this paper. Readers can refer to [8] for the details. We focus on the types of computations required in such code search problems and on how to accelerate those computations using an FPGA-based heterogeneous system.
2.1 Extremal Doubly Even Self-dual Code Search
In [8], the extremal doubly even self-dual code is described as follows. “A binary self-dual code C of length n is a code over \(\mathbb{F}_2\) satisfying \(C = C^\bot\) where the dual code \(C^\bot\) of C is defined as \(C^\bot = \{ x \in \mathbb{F}^n_2 \mid x \cdot y = 0 ~\mathrm{for~all}~ y \in C \}\) under the standard inner product \(x \cdot y\). A self-dual code C is doubly even if all codewords of C have Hamming weight divisible by four, and singly even if there is at least one codeword of Hamming weight \(\equiv 2 \pmod 4\). Note that a doubly even self-dual code of length n exists if and only if n is divisible by eight. It was shown in [9] that the minimum Hamming weight d of a doubly even self-dual code of length n is bounded by \(d \le 4\lfloor n/24 \rfloor + 4\). A doubly even self-dual code meeting this upper bound is called extremal.”
For example, an extremal doubly even self-dual code C of length 128 satisfies the following three conditions.
 1. Hamming weight \(\equiv 0 \pmod 4\)
 2. \(C = C^\bot\)
 3. \(d(C) = 24\)
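The value \(d(C) = 24\) in the third condition is the bound \(4\lfloor n/24 \rfloor + 4\) evaluated at \(n = 128\). A minimal sketch of that arithmetic and of the doubly even weight test follows; the function names are illustrative and are not taken from [8].

```python
def extremal_min_weight(n):
    """Upper bound 4*floor(n/24) + 4 on d for a doubly even self-dual code."""
    if n % 8 != 0:
        raise ValueError("a doubly even self-dual code needs n divisible by 8")
    return 4 * (n // 24) + 4


def has_doubly_even_weight(word):
    """True if the Hamming weight of an integer-encoded word is 0 mod 4."""
    return bin(word).count("1") % 4 == 0


for n in (64, 112, 128):
    print(n, extremal_min_weight(n))
```

For lengths 64 and 112 the same formula gives 12 and 20, which matches the parameters quoted in the titles of [6] and [8].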
To find such a code, the work in [8] proposes the following four-step algorithm.

Step 1: Choose \(x \in \mathbb{F}^{64}_2\) with \(wt(x) \equiv 3 \pmod 4\).

Step 2: If \(AA^{\mathrm{T}} + BB^{\mathrm{T}} \ne I_{32}\), go to step 1. A and B are the circulant matrices given by Eq. (1).
$$A = \left( \begin{array}{cccc} x_1 & x_2 & \cdots & x_{32} \\ x_{32} & x_1 & \cdots & x_{31} \\ \vdots & \vdots & & \vdots \\ x_2 & x_3 & \cdots & x_1 \end{array} \right),\quad B = \left( \begin{array}{cccc} x_{33} & x_{34} & \cdots & x_{64} \\ x_{64} & x_{33} & \cdots & x_{63} \\ \vdots & \vdots & & \vdots \\ x_{34} & x_{35} & \cdots & x_{33} \end{array} \right) \tag{1}$$

Step 3: The matrices G and H in Eq. (2) are the generator matrices of C and \(C^\bot\), respectively. If the Hamming weight of the sum up to the \(10^{th}\) row of G is less than 20, go to step 1. If the Hamming weight of the sum up to the \(10^{th}\) row of H is less than 20, go to step 1.
$$M = \left( \begin{array}{cc} A & B \\ B^{\mathrm{T}} & A^{\mathrm{T}} \end{array} \right),\quad G = \left( I_{64}, M \right),\quad H = \left( M^{\mathrm{T}}, I_{64} \right) \tag{2}$$

Step 4: A code is found; exit.
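Steps 1 and 2 can be sketched in a few lines. This is a plain software illustration, not the accelerator code: it stores bits in Python lists, works for any even length so that it can be tried on a toy 8-bit vector instead of 64 bits, and all function names are hypothetical.

```python
def circulant(first_row):
    """Circulant matrix whose row i is first_row rotated right by i, as in Eq. (1)."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def gram_mod2(m):
    """Compute M * M^T over F_2 for a 0/1 matrix given as a list of rows."""
    return [[sum(a & b for a, b in zip(r, s)) % 2 for s in m] for r in m]


def passes_step1(x):
    """Step 1: the Hamming weight of x must be 3 modulo 4."""
    return sum(x) % 4 == 3


def passes_step2(x):
    """Step 2: A A^T + B B^T must equal the identity over F_2."""
    half = len(x) // 2
    ga = gram_mod2(circulant(x[:half]))
    gb = gram_mod2(circulant(x[half:]))
    return all((ga[i][j] ^ gb[i][j]) == int(i == j)
               for i in range(half) for j in range(half))


# Toy vector of length 8 (the paper uses length 64, giving 32 x 32 blocks).
x = [1, 1, 1, 0, 0, 0, 0, 0]
print(passes_step1(x), passes_step2(x))
```

Every entry of the two Gram matrices is an AND followed by a parity, which is the kind of 1-bit work that maps well onto FPGA logic.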
To satisfy step 3 of the code search algorithm, the Hamming weight of \(\{x_1 \cdots x_{64}\}\) must be greater than or equal to 19. That is, at least 19 of the 64 bits in the code must be ones. Therefore, we have to search k-out-of-64 codes where \(19 \le k \le 64\). Searching for such a code is a very time-consuming problem.
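To get a feel for the size of this search, the candidate weights can be listed and the corresponding vectors counted. The sketch below combines the range \(19 \le k \le 64\) with the step 1 condition \(wt(x) \equiv 3 \pmod 4\); reading the two conditions together is an assumption of this example, and the variable names are illustrative.

```python
from math import comb

N = 64

# Weights k in [19, 64] that also satisfy wt(x) = 3 (mod 4) from step 1.
candidate_weights = [k for k in range(19, N + 1) if k % 4 == 3]

# Number of k-out-of-64 vectors summed over all candidate weights.
search_space = sum(comb(N, k) for k in candidate_weights)

print(candidate_weights)
print(f"{search_space:.3e} candidate vectors vs {2**N:.3e} unrestricted")
```

The count is smaller than \(2^{64}\) but still far too large for a brute-force scan on a single CPU, which is consistent with the need for acceleration.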
3 FPGA-Based Heterogeneous Architecture
3.1 Exploiting the Parallelism
3.2 Overall Architecture of the FPGA-Based Heterogeneous System
Figure 2 shows the overall architecture of the FPGA-based heterogeneous system. It consists of a CPU and an FPGA accelerator. The CPU and the FPGA work together to generate k-out-of-n codes that satisfy step 1 of the algorithm explained in Sect. 2.1. For each k-out-of-n code, the matrix calculations in step 2 are performed in parallel. After the matrix computation is done, the Hamming weight is calculated as explained in step 3.
3.3 k-out-of-n Code Generation
The algorithm to generate circular permutations [12] is a serial one. Therefore, the only way to increase the processing speed of the circular permutation generation is to increase the clock frequency. Unfortunately, the clock frequency of an FPGA is usually less than 300 MHz. Therefore, we use a CPU for the permutation generation, since its clock frequency is more than 10 times higher than that of an FPGA. Once the permutations are generated, they are transferred to the external memory (DRAM) of the FPGA board. The FPGA accelerator accesses the data, performs 63 shift operations in parallel, and selects the codes that satisfy step 1 of the algorithm explained in Sect. 2.1. The permutation generation and the bit shifts can be done in parallel, as shown in Fig. 4. The times required for circular permutation generation and for the parallel bit-shift operations are nearly equal. In this way, the CPU and the FPGA operate in parallel.
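The division of labour can be illustrated in software: one side produces a single representative per class of cyclically equivalent words, and the other side expands each representative into all of its rotations. The sketch below is only an illustration under stated assumptions. It finds representatives by brute force, which is not the fast algorithm of [12], it runs sequentially, and the function names are hypothetical.

```python
from itertools import combinations


def rotl(word, shift, n):
    """Rotate an n-bit word left by shift positions."""
    mask = (1 << n) - 1
    shift %= n
    return ((word << shift) | (word >> (n - shift))) & mask


def all_rotations(word, n):
    """The word itself plus its n-1 cyclic shifts (63 shifts when n = 64)."""
    return [rotl(word, s, n) for s in range(n)]


def necklace_representatives(n, k):
    """Brute force: keep each k-out-of-n word that is the smallest of its rotations."""
    reps = []
    for ones in combinations(range(n), k):
        word = sum(1 << b for b in ones)
        if word == min(all_rotations(word, n)):
            reps.append(word)
    return sorted(reps)


def expand(reps, n):
    """Expand representatives back into every k-out-of-n word (the FPGA side)."""
    return {w for r in reps for w in all_rotations(r, n)}


reps = necklace_representatives(8, 3)
print(len(reps), len(expand(reps, 8)))
```

On hardware, the rotations of one word are independent of each other, so they can all be produced in the same clock cycle by fixed wiring, whereas generating the next representative depends on the previous one and therefore stays on the CPU.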
3.4 Matrix Calculation and Hamming Weight
In step 3, the Hamming weight up to the \(10^{th}\) row is calculated. However, most codes can be rejected by computing the Hamming weight for only the first few rows. Therefore, we divide step 3 into two stages. In the first stage, the Hamming weight up to the first 5 rows is computed. Codes that satisfy this condition go to the second stage. We use only 2 modules in the second stage since a smaller degree of parallelism is required. Part of the program code for step 3 is shown in Fig. 6.
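The early-rejection idea can be sketched as a two-stage filter. The example assumes one possible reading of step 3, namely that every XOR-sum of up to t rows of a generator matrix must have Hamming weight at least a threshold; this interpretation, the function names, and the toy matrices are assumptions for illustration and are not the code of Fig. 6.

```python
from itertools import combinations


def min_weight_of_sums(rows, max_terms):
    """Smallest Hamming weight over all XOR-sums of 1..max_terms distinct rows."""
    best = None
    for t in range(1, max_terms + 1):
        for combo in combinations(rows, t):
            acc = 0
            for r in combo:
                acc ^= r  # addition over F_2 is XOR
            w = bin(acc).count("1")
            if best is None or w < best:
                best = w
    return best


def two_stage_check(rows, threshold, stage1_terms, stage2_terms):
    """Cheap stage first; only survivors pay for the expensive stage."""
    if min_weight_of_sums(rows, stage1_terms) < threshold:
        return False  # rejected in stage 1
    return min_weight_of_sums(rows, stage2_terms) >= threshold


# Toy rows encoded as integers (the paper uses 128-bit rows, stages of 5 and 10).
print(two_stage_check([0b1100, 0b0110, 0b0011], 2, 1, 3))
print(two_stage_check([0b111, 0b110], 2, 1, 2))
```

Because most candidates fail in the cheap first stage, far fewer reach the second stage, which is in line with allocating only 2 modules to it. For brevity this sketch recomputes the stage 1 sums in stage 2; a real design would avoid that.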
4 Evaluation
We used two systems for the evaluation: one contains only a CPU, and the other contains a CPU and an FPGA. In the CPU-only system, the CPU is an Intel Xeon E5-1650 v3 (3.50 GHz). In the heterogeneous system, the CPU is an Intel Xeon E5-2643, and the FPGA board is a Terasic DE5a-Net [13] with an Intel Arria 10 FPGA. The FPGA is configured using Quartus Prime Pro 16.1 and the Intel FPGA SDK for OpenCL [14]. The CPU code is compiled using Intel compiler 17.0 with OpenMP directives for parallel computation.
Results

|  | Conventional | Proposed |
| --- | --- | --- |
| Device | CPU only | CPU & FPGA |
| Clock frequency (MHz) | 3500 | 207 |
| Processing time (s) | 29.13 | 0.33 |
| Number of clock cycles (\(\times 10^9\)) | 10.21 | 0.07 |
| Codes checked per clock cycle | 0.04 | 63.5 |
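The headline speedup follows from the two processing times in the results table. As a quick check, with variable names chosen for illustration only:

```python
cpu_time_s = 29.13    # CPU-only processing time from the results table
fpga_time_s = 0.33    # CPU & FPGA processing time from the results table
fpga_clock_hz = 207e6  # FPGA clock frequency from the results table

speedup = cpu_time_s / fpga_time_s
fpga_cycles = fpga_time_s * fpga_clock_hz

print(f"speedup: {speedup:.1f}x")
print(f"FPGA clock cycles: {fpga_cycles / 1e9:.2f} x 10^9")
```

The ratio of the rounded times is slightly above the reported figure of over 86 times, and the FPGA cycle count agrees with the table after rounding.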
FPGA resource utilization

| Resource type | Utilization (percentage) |
| --- | --- |
| Logic | 143,186 (34) |
| Memory bits | 3,899,280 (7) |
| RAM blocks | 425 (16) |
The evaluation is done only for k-out-of-64 codes where k equals 8. We also found similar results for 19-out-of-64 codes. Neither of those searches has given a solution for the extremal doubly even self-dual code of length 128. Therefore, it is still an unsolved problem. To find a solution, we have to search other k values, which is part of our future work. Also note that the only data transferred from the CPU to the FPGA are the circular permutation data generated on the CPU. Therefore, the DRAM access by the FPGA is minimal. Although the FPGA board has a small memory bandwidth of 25.6 GB/s, it is not a bottleneck for the code search problems.
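A rough estimate suggests why the bandwidth is not limiting. The sketch below assumes that each circular permutation is stored as one 64-bit word (8 bytes) and that each word read from DRAM yields 64 candidate codes through the parallel shifts; both are assumptions made for this estimate and are not stated measurements.

```python
codes_per_cycle = 63.5      # from the results table
fpga_clock_hz = 207e6       # from the results table
codes_per_word = 64         # assumption: one word expands to 64 rotations
bytes_per_word = 8          # assumption: one 64-bit word per permutation
board_bandwidth_gb_s = 25.6  # board memory bandwidth quoted in the text

words_per_second = codes_per_cycle / codes_per_word * fpga_clock_hz
required_gb_per_s = words_per_second * bytes_per_word / 1e9

print(f"estimated DRAM read rate: {required_gb_per_s:.2f} GB/s "
      f"of {board_bandwidth_gb_s} GB/s available")
```

Under these assumptions the accelerator reads well under a tenth of the available bandwidth.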
5 Conclusion
In this paper, we propose an FPGA-based heterogeneous system for the extremal doubly even self-dual code search. Although we are yet to solve the problem, there is great potential to find a solution in the near future, since the proposed system is over 86 times faster than a conventional one with only a CPU. Moreover, we used only 34% of the FPGA logic resources, so a further increase in speed is possible. It is very important to explore the possibility of accelerating other code search problems using FPGAs in the future.
References
 1. Marchal, P.: Field-programmable gate arrays. Commun. ACM 42(4), 57–59 (1999)
 2.
 3. Czajkowski, T.S., Neto, D., Kinsner, M., Aydonat, U., Wong, J., Denisenko, D., Yiannacouras, P., Freeman, J., Singh, D.P., Brown, S.D.: OpenCL for FPGAs: prototyping a compiler. In: Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), p. 1 (2012)
 4. Waidyasooriya, H.M., Hariyama, M., Uchiyama, K.: Design of FPGA-Based Computing Systems with OpenCL (2017)
 5. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes. North-Holland, Amsterdam (1977)
 6. Pasquier, G.: A binary extremal doubly even self-dual code (64, 32, 12) obtained from an extended Reed-Solomon code over \(F_{16}\). IEEE Trans. Inform. Theory 27, 807–808 (1981)
 7. Rains, E., Sloane, N.J.A.: Self-dual codes. In: Pless, V.S., Huffman, W.C. (eds.) Handbook of Coding Theory, pp. 177–294. Elsevier, Amsterdam (1998)
 8. Harada, M.: An extremal doubly even self-dual code of length 112. Electron. J. Comb. 15, 1–5 (2008)
 9. Mallows, C.L., Sloane, N.J.A.: An upper bound for self-dual codes. Inform. Control 22, 188–200 (1973)
 10. Harbison, S.P., Steele Jr., G.L.: C: A Reference Manual. Prentice Hall, Englewood Cliffs (1987)
 11.
 12. Sawada, J.: A fast algorithm to generate necklaces with fixed content. Theoret. Comput. Sci. 301, 477–489 (2003)
 13. Terasic, DE5-Net FPGA Development Kit. http://www.terasic.com.tw/cgibin/page/archive.pl?Language=English&CategoryNo=158&No=526
 14. Intel FPGA SDK for OpenCL, Programming Guide. https://www.altera.com/en_US/pdfs/literature/hb/openclsdk/aocl_programming_guide.pdf