1 Introduction

SRAM-based FPGAs are susceptible to radiation-induced upsets, more specifically Single Event Upsets (SEUs) in their configuration memory bits and embedded memory cells. SEUs can also occur in the Flip-Flops (FFs) of the Configurable Logic Blocks (CLBs) used to implement the user's sequential logic. In this case, the bit-flip has a transient effect, and the next load of the flip-flop can correct it. Multiple Bit Upsets (MBUs) can also occur in SRAM-based FPGAs due to charge sharing and the accumulation of upsets. Thus, the majority of the errors observed in SRAM-based FPGAs used in harsh environments come from bit-flips (SEUs, MBUs) in the configuration memory bits; therefore, Triple Modular Redundancy (TMR) with majority voters is commonly used to mask errors, combined with reconfiguration [1]. Bit-flips in the bitstream are only corrected by partial or full reconfiguration. However, depending on the reconfiguration rate, upsets can accumulate in the FPGA configuration memory.

TMR is usually applied at the Register Transfer Level (RTL) or to gate-level descriptions in the FPGA design flow. It can be implemented manually or automatically if appropriate tools are available. There are many challenges in applying TMR to a design that will be synthesized into an SRAM-based FPGA. The first is to ensure that commercial synthesis tools will not remove any redundant logic [2]. The second is to explore the TMR implementation in a way that achieves high error coverage with an efficient area and performance overhead. Depending on the architecture of the design implemented in an SRAM-based FPGA, more or fewer configuration bits are used, and more or fewer susceptible bits may be responsible for provoking an error in the design output. However, it is not only the number of used bits that determines the sensitivity of a design: the error masking effect of the application algorithm and of the TMR implementation plays an important role. Moreover, there are trade-offs in the architecture, such as area, performance, execution time, and the types of resources utilized, that may contribute directly to the SEU susceptibility of designs in FPGAs.

Hardware accelerators are built from SRAM-based FPGAs to improve the performance of applications running on embedded hard-core and soft-core processors. In this context, High-Level Synthesis (HLS) is widely used to reduce development time and to explore efficiently the design space of algorithms with different architectures. HLS is an automated design process that interprets an algorithm described in a high-level software programming language (e.g., C, C++) and automatically produces RTL hardware that performs the same function.

However, SRAM-based FPGAs and APSoCs are in demand in many high-reliability applications such as satellites, autonomous vehicles, servers, and others. Therefore, the code executed in the processor and the hardware accelerator must be able to mitigate SEUs. With regard to HLS-based designs, applying TMR in the high-level algorithm so that the resulting RTL code is protected is challenging, because there are different ways to implement the TMR scheme and its voters, as well as the input and output interfaces of the design.

This work investigates the use of TMR in HLS-based designs for mitigating multiple bit upsets. TMR schemes are implemented directly in the algorithms described in the C programming language, to be synthesized by the Xilinx Vivado HLS [3] tool for use in Xilinx SRAM-based devices. Nevertheless, we believe the proposed approach and the achieved results are generic and extendable to other HLS tools. Our objective is to evaluate different TMR implementations at the C level under soft errors. Area resources, performance overheads, and the error rate under multiple bit upsets are evaluated for the different TMR approaches. TMR can mitigate SEUs but not necessarily MBUs. However, since the implemented voters mask signals bit by bit, many errors due to MBUs that do not affect the same bit can still be masked. Previous works [4] have shown that the use of Diverse TMR (DTMR) may work properly under SEU accumulation in the configuration memory bits. In this work, we observe how TMR implemented at the C level is also able to mitigate accumulated upsets.

Some previous studies related to HLS have investigated the trade-offs among performance, area, and the types of resources used [5,6,7]. Other studies have investigated the use of TMR in RTL designs generated by HLS for use in Application Specific Integrated Circuit (ASIC) devices [8]. However, to the best of our knowledge, no study has investigated the use of TMR applied at the C language level, synthesized by HLS, and evaluated for SEUs in SRAM-based FPGAs.

The case-study FPGA is a 28-nm Artix-7 FPGA from Xilinx. Different TMR approaches were implemented in a matrix multiplication algorithm described in C and connected to a soft-core Microblaze responsible for sending and receiving the workload data stream. Bit-flips were injected into the FPGA bitstream by a fault injection framework developed in our research group [5]. Several fault injection campaigns were performed for all the designs in order to identify the error rate under accumulated bit-flips. Results show that TMR can mask multiple errors as expected, but redundancy in the voters and in the interface is mandatory to increase reliability. Results show that, by using a coarse grain TMR with triplicated inputs, voters, and outputs, it is possible to reach 95% reliability while accumulating up to 61 bit-flips, and 99% reliability while accumulating up to 17 bit-flips in the configuration memory bits. These numbers imply a Mean Time Between Failures (MTBF) of the coarse grain TMR at ground level that is 50% to 70% higher than the MTBF of the unhardened version for the same reliability confidence.

2 TMR in Hardware Accelerators Generated by HLS

The concept of TMR is to have three identical copies processing data and a majority voter voting on their outputs to mask errors in any one of the copies. TMR can be implemented in hardware at the gate level, for instance, where each module is triplicated and voters are added, but it can also be implemented in software, where part of the code is triplicated and its outputs are voted. According to the granularity of the TMR and the location of the majority voters, there is coarse grain TMR (CGTMR), in which voters are placed only at the outputs of the design, and fine grain TMR (FGTMR), in which voters are placed at the outputs of all or selected flip-flops and/or combinational logic, according to the design requirements. In this work, we implement TMR in a piece of high-level code to generate a hardware block through HLS. Thus, after synthesis, redundant hardware and majority voters are automatically generated. The input/output interfaces can be triplicated or not. However, if the interface is not triplicated, single points of failure remain in the TMR design.

When describing an algorithm to be synthesized by an HLS tool, one can consider that the algorithm source code is composed of operations, conditional statements, loops, and functions. Therefore, TMR must be implemented in these code structures. The question is how to triplicate all these structures to generate coarse or fine grain TMR in an efficient way, ensuring that the redundant logic will not be removed and, at the same time, being able to take advantage of some of the optimization strategies usually provided by HLS tools.

By default, an HLS tool translates each high-level function call into an RTL block. As a consequence, if a function is called three times, three identical RTL blocks will be generated, and the HLS tool will interpret that they can be executed in parallel if no data dependencies exist among them. Conversely, if we perform an operation three times in sequence inside the same function, the HLS tool will generate serial hardware in which each operation is executed sequentially, one at a time. With regard to the majority voters, since they are always implemented as function calls, they are always synthesized as independent RTL blocks. These are the main principles on which our investigation relies. Lastly, based on these approaches, one can observe that in a modularized (parallel) design the majority voters are placed separately from the TMR blocks, while in a non-modularized (serial) design the majority voters are placed together with the TMR circuitry. In this work, we investigate coarse grain TMR implemented in parallel, named CGPTMR.

For hardware accelerators, the interface used to receive the workload data stream is very important. In Xilinx devices, high-performance hardware accelerators are usually connected to soft- or hard-core processors through a Direct Memory Access (DMA) interface and Advanced eXtensible Interface Stream (AXI-S) ports. This interconnect infrastructure provides pipelined control that enables the software running on the processor to queue multiple task requests, reducing latency. According to [9], each accelerator operates as an independent thread, synchronized in hardware at the transport level by AXI-S handshaking, with the input arrival and accelerator hardware "start/done" synchronization barriers realized by the stream interface of the DMA.

The architecture of the proposed evaluation setup is composed of the design generated by HLS (here referred to as the Design Under Test, DUT), a Microblaze soft-core processor, which is a 32-bit 5-stage pipeline Reduced Instruction Set Computer (RISC) soft processor, Advanced eXtensible Interface (AXI) units, memories (BRAMs), a Direct Memory Access (DMA) unit, and the fault injector framework, as described in Fig. 1. Note that in Fig. 1(a) there is only one interface for communication, while in the setup in Fig. 1(b) the input and output interfaces are triplicated.

Fig. 1.

Block diagram of the (a) CGPTMR SingleStream and (b) CGPTMR MultipleStream case-study designs connected to the Microblaze soft-core processor and the fault injection framework.

Figure 2 shows an execution-time representation of a piece of code implemented in an HLS tool in terms of the number of steps needed to perform input reads, execution, and output writes. Each step can take several clock cycles. The algorithm execution contains the read of inputs, the main execution code, and the write of outputs (Fig. 2(a)). In the case of TMR, the redundancy can be implemented in parallel by triplicating the functions, as represented in Fig. 2(b), while maintaining the single-stream AXI port interface. In this case, each function is triplicated and a single voter is placed at the end of the code to vote the data outputs. This scheme is named coarse grain parallel TMR with single stream (CGPTMR SingleStream). The voters and interfaces can also be triplicated, as shown in Fig. 2(c). This scheme is named coarse grain parallel TMR with multiple stream (CGPTMR MultipleStream). In this work, we explore these two implementations to analyze how area and performance overheads are impacted and to compare them with the reliability of the TMR scheme. During synthesis, resource allocation and binding select the RTL resources needed to implement the behavioral functionality efficiently.

Fig. 2.

The three versions of the M × M implementation: unhardened with single stream (a), CGPTMR with single stream (b), and CGPTMR with multiple stream (c), represented by the number of steps to run the applications.

We selected the matrix multiplication (MxM) algorithm, shown in Fig. 3, to start our investigation, as this algorithm is rich in parallelism and loops. Each input matrix is a 6 × 6 8-bit array, generating a 6 × 6 16-bit output array. Three versions of the M × M algorithm were implemented and generated with the Xilinx Vivado HLS tool from the C source code: the coarse grain parallel TMR version (CGPTMR) without optimization and with single-stream input and output data, the CGPTMR version without optimization and with multi-stream input and output data, and the unhardened version without optimization and with single-stream input and output data.

Fig. 3.

Unhardened matrix multiplication algorithm without optimizations.

It is important to mention that, for TMR implementations, it is not advisable to use the Vivado HLS optimization option named function inline, which optimizes designs for area. Function inlining removes the function hierarchy so that the components within the function can be shared or optimized with the logic in the calling function, which is not acceptable for redundant circuits.

The CGPTMR version code is represented in Fig. 4 with single stream and in Fig. 5 with multiple streams. Each function call is replicated, and optimizations performed in the function are extended to all the replicas. The majority voter votes the data output bit by bit after the three redundant functions are called. The status output is used to check bit by bit whether there is any difference among the three modules: a status equal to zero means that all bits match; otherwise, the status is set to one.

Fig. 4.

Coarse Grain Parallel TMR (CGPTMR) with single stream.

Fig. 5.

Coarse Grain Parallel TMR (CGPTMR) with multi stream.

3 Fault Injection Method for Accumulated SEUs

Fault injection (FI) by emulation is a well-known method to analyze the reliability of a design implemented in an SRAM-based FPGA. The original bitstream configured into the FPGA can be modified by an embedded design or a computer program that flips one bit of the bitstream at a time. Each bit-flip emulates an SEU in the configuration memory.

The fault injector platform used in this work is based on the work presented in [5]. Our fault injection platform is composed of an ICAP controller circuit embedded in the FPGA and a script running on a monitor computer. The ICAP controller circuit controls the Internal Configuration Access Port (ICAP) and is connected to the script, which defines the injection area and type of fault injection (sequential or random) and controls the campaign. Faults are only injected in the area of the DUT and in their configuration bits related to CLBs (LUTs, user FFs, and interconnections), DSP resources (DSP48E), and clock distribution interconnections. Faults are not injected in BRAM configuration bits in order not to affect the inputs and outputs of the DUT. The flow and the design floorplanning are shown in Fig. 6.

Fig. 6.

The fault injection methodology in (a) and the FPGA floorplanning of designs in (b).

The Microblaze is responsible for sending the input data as a data stream through AXI connections, receiving the output data stream, and comparing the received values with the reference ones. The data is sent as 288 bits (a 6 × 6 8-bit matrix) through the AXI interface. The whole system runs at 100 MHz. The execution time of the Microblaze is around 175,727 clock cycles, which includes the time to send the control to the DUT, send the data inputs, wait for the DUT execution, read the data outputs, compare the values, and wait for the next injection of the fault injection framework. For the DUT design, the execution time comprises the number of clock cycles for reading the input data, executing the matrix multiplication, voting, and writing the data output. As an example, the CGPTMR SingleStream needs 216 clock cycles to read the input data, 710 clock cycles to execute the HLS application, 156 clock cycles to execute the majority voter, 36 clock cycles to write the voter data, and 36 clock cycles to write the status data. The total time spent to perform all operations is 1,154 clock cycles.

4 Experimental Results

Table 1 presents the area resources and performance. The area can be evaluated by the number of LUTs, flip-flops, and DSP blocks. One can notice that the TMR designs present very similar areas. In this work, we mapped all the designs to the same target area of 388 frames. The area overhead of the TMR designs is three times or more, as expected. The maximum overhead is reached when the input and output AXI interfaces are triplicated. In terms of performance, each TMR design presents a very different execution time compared to the unhardened version. As explained, the execution time is calculated from the number of clock cycles needed to read the input matrices, execute, vote, and write the output matrices. The performance overhead of the TMR designs comes from the fact that the data input and output are now triplicated in time as well, and the voting phase also takes a share of the total execution time, as shown in Fig. 2.

Table 1. Resource usage and performance results of each case-study design

Accumulated SEUs were injected as described in Sect. 3. Although each design uses a different amount of resources, as detailed in Table 1, the fault injection campaigns considered the same injection area for all designs. Thus, we establish a similar condition for all designs, which emulates, for instance, the same particle fluence on the device surface.

In this work, each DUT was implemented in a rectangular physical block of 388 configuration memory frames. Since a frame in the Xilinx Artix-7 FPGA has 3,232 bits, the total injection area comprises 1,254,016 bits. The number of essential bits is obtained from the Vivado Design Suite tool [3]. In this case, only the HLS accelerator design is considered under fault injection. In the fault injection campaigns, the number of SEUs injected was limited to 300 bits. Since the number of faults injected is small compared to the total number of configuration bits in the fault injection area, the likelihood of the same bit being hit more than once is small, allowing the error rate to be estimated as the average number of errors over the total injected faults. The average error rate for the different designs, along with its upper and lower quartiles, is presented in Fig. 7.

Fig. 7.

Average number of bit flips required to provoke an error.

A more detailed comparison of the designs can be seen in Fig. 8, where reliability is presented as the complement of the cumulative failure distribution (R(t) = 1 − F(t)). The failure distribution F(t) of the system is the probability that one or more modules have failed by time t. In our case, Fig. 8 represents the reliability in terms of the accumulated bit-flips.

Fig. 8.

Observed reliability of the different designs.

The inferiority of the CGPTMR SingleStream design, even when compared to the unhardened design, can be related to the amount of data that is serialized through the stream, as can be seen in the CGPTMR SingleStream steps of Fig. 2(b), and to the single point of failure of the DMA interconnection. Being a single point of failure, it not only makes a communication failure more likely, due to the larger amount of data being serialized, but also jeopardizes the effort placed on the TMR implementation. On the other hand, the CGPTMR MultipleStream clearly gives a reliability improvement over the unhardened design along the range of SEUs injected in this experiment. Nevertheless, as with any TMR implementation, there may be a crossing point ahead in the reliability curves where the unhardened design performs better than the TMR implementation.

Even with these experiments limited to 300 injected SEUs, the expected exponential behavior of the reliability curves and the relationship among the reliabilities of the designs can be seen when the same data is plotted in semi-log coordinates, as presented in Fig. 9. Two useful observations contributing to further engineering decisions can be extracted from Fig. 9. First, if any recovery strategy, such as scrubbing or system reconfiguration by reset, is activated before approximately 10 SEUs have accumulated, the power of TMR will not be exploited and its implementation brings no benefit to the system. Second, the trend indicates that the crossing from better TMR performance to better unhardened performance occurs somewhere between 300 and 1,000 SEUs, which defines the upper bound of the TMR benefit.

Fig. 9.

Semi-log view of the observed reliability of the different designs.

Considering the neutron flux at New York as a reference (13 n/cm2·h) [10] and the static neutron cross-section of Artix-7 FPGAs (7 × 10−15 cm2/bit) [11], we can estimate the static neutron cross-section of the target area (388 frames × 3,232 bits = 1,254,016 bits), which is 8.78 × 10−9 cm2. The expression to obtain the static neutron cross-section of the target area is:

$$ \sigma_{static,target\;area} = \sigma_{static,device} \times Bits_{target\;area} $$

Failure rate is the most common reliability metric. The failure rate itself is either time-dependent or time-independent [12]. The failure rate of the target area (1.14 × 10−7 h−1) is calculated by multiplying the static neutron cross-section of the target area by the neutron flux at New York, as follows:

$$ Failure\;rate_{target\;area} = \sigma_{static,target\;area} \times Flux $$

Mean time between failures (MTBF) is defined as the average amount of time a device or product works before it fails. We calculate the MTBF for bit-flips in the target area (8.7 × 106 h) as follows:

$$ MTBF_{target\;area} = \frac{1}{{Failure\,rate_{target\;area} }} $$

Then we can calculate the MTBF of the design as follows:

$$ MTBF_{design} = MTBF_{target\;area} \times Accumulated\;bits $$

For instance, at a reliability of 99%, the unhardened version can accumulate on average up to 10 bit-flips, which implies an MTBF of 8.7 × 107 h, while the CGPTMR MultiStream can accumulate on average up to 17 bit-flips in the configuration memory, which implies an MTBF of 1.48 × 108 h, 70% higher. The improvement in MTBF reduces as the reliability level reduces. For instance, at a reliability of 95%, the unhardened version can accumulate on average up to 41 bit-flips, which implies an MTBF of 3.6 × 108 h, while the CGPTMR MultiStream can accumulate on average up to 61 bit-flips in the configuration memory, which implies an MTBF of 5.4 × 108 h, 50% higher (Table 2).

Table 2. Reliability of accumulated bit-flips and MTBF for the unhardened version and the CGTMR multistream version

5 Conclusions

This work demonstrated the feasibility of generating hardware designs or hardware accelerators that are intrinsically hardened, by introducing the hardening in the high-level specification, in this case C/C++ code processed by high-level synthesis. The adopted methodology of accumulated SEU injection allowed the characterization and comparison of alternative designs and evidenced design pitfalls, such as in the case of the CGPTMR SingleStream. This methodology also allows the proper calibration of recovery mechanisms and enabled further analysis of the operating point of the voting mechanism.