A Flexible ASIC for Time-Domain Decision-Directed Channel Estimation in MIMO-OFDM Systems
Abstract
Channel estimation is a crucial task for the overall communication performance of a wireless receiver. Compared to traditional approaches, the estimate of the wireless channel can be improved by iterative estimation with feedback from other receiver components. However, the VLSI implementation of such iterative channel estimation in multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems is challenging due to its high computational complexity. In this chapter we introduce the first ASIC for decision-directed MIMO-OFDM channel estimation, which tracks channel variations using feedback from a decoder and supports M-QAM. Furthermore, timing and power dissipation trade-offs are analyzed.
Keywords
MIMO-OFDM, VLSI, ASIC, Channel estimation, Expectation maximization, SAGE
1 Introduction
Orthogonal frequency division multiplexing (OFDM) and spatial multiplexing over multiple-input multiple-output (MIMO) transmission schemes are adopted by several recent wireless communication standards such as 3GPP Long Term Evolution (LTE) or IEEE 802.11n and beyond. Because MIMO-OFDM receivers rely on coherent detection, channel estimation (CE) is a crucial and computationally intensive part of the overall system. It has a significant impact on the communication performance in terms of frame error rate and thus directly influences the maximum achievable throughput. Traditionally, pilot-aided channel estimation (PACE) is applied, where the channel is estimated at predefined pilot positions and the complete channel over all sub-carriers is obtained via interpolation. Iterative channel estimation can improve these estimates and delivers promising SNR gains [1]. In iterative channel estimation algorithms, a priori knowledge from the detector or the decoder is used to improve the channel estimates iteratively.
However, the gain in algorithmic performance is paid for with high latencies, since each block that participates in the iterations has to complete its processing before the next one can start using the improved input. An alternative that does not increase the latency but still provides significant benefits is the channel tracking approach. This approach [2, 3] provides channel estimation updates for time instant \(n+U_x\) based on the detector or decoder decisions for time instant n, as shown in Fig. 1.
Fast-fading channels are particularly interesting scenarios for such a solution. An algorithmic investigation for a communication system similar to LTE is performed in [4] using the simplified frequency-domain (FD) SAGE (space-alternating generalized expectation-maximization) algorithm, which calculates updates of the channel impulse response (CIR) estimates for M-QAM constellations. The authors of [3] present a modification that provides a gain of 2 dB for 64-QAM. This modification requires a non-trivial matrix inversion. Fortunately, this matrix inversion can be avoided by iteratively processing each tap of the CIR in the time domain, as proposed by the time-domain (TD) SAGE algorithm presented in [5].
Furthermore, the authors of [6] present an analysis of the TD-SAGE algorithm in the context of LTE-Advanced. The results show that it has the potential to double the system throughput at high user mobility due to a better channel estimate and, therefore, a lower frame error rate. In addition, the authors present simulation results on how a variable number of feedback symbols from the decoder to the channel estimation block affects the system performance. These results show interesting trade-offs between the computational complexity and the algorithmic performance of the TD-SAGE algorithm. Therefore, the number of feedback symbols is an interesting parameter for a trade-off analysis between energy dissipation and algorithmic performance of a dedicated VLSI architecture.
Contributions: This chapter introduces an extension of the first ASIC implementation of the TD-SAGE algorithm, which is presented in [7]. The TD-SAGE algorithm is transformed into a novel variant with further reduced computational complexity, termed the tap alternating (TA) SAGE. In addition, this work discusses two options to further reduce the computational complexity at the cost of a loss in algorithmic performance: either reducing the number of feedback symbols as discussed in [6], or reducing the frequency of updating the channel estimate. A suitable VLSI architecture is presented, and area, timing and power numbers are provided. The support for a variable number of feedback symbols introduces a modification to the state machine that has an impact on the critical path. Therefore, implementation results for architectures with and without variable feedback support are presented. This work evaluates the hardware costs of channel tracking in a MIMO-OFDM system.
Fig. 1. The frame structure of the system
2 System Model
Fig. 2. Receiver model
The channel model used in this chapter is a frequency-selective Rayleigh fading channel with a power delay profile according to the typical urban COST259 model. It is time-variant with a correlation according to Jakes' model with a normalized Doppler frequency of \(f_d = 1.4468\cdot 10^{-5}\), a sub-carrier spacing of 15 kHz, a user velocity \(v=50\,\mathrm{km/h}\) and a carrier frequency \(f_\mathrm {c} = 2.4\,\mathrm{GHz}\).
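The quoted normalized Doppler frequency can be reproduced with a short calculation. The sketch below rests on assumptions that the text does not state explicitly: the speed of light is rounded to 3e8 m/s, the FFT size is \(N_K=512\) (the value used for the implementation results), and the Doppler frequency is normalized to the sampling rate \(N_K \cdot 15\) kHz. The function name is hypothetical.

```python
# Worked check of the normalized Doppler frequency.
# Assumptions (not stated in the text): c = 3e8 m/s, N_K = 512,
# and normalization to the sampling rate N_K * sub-carrier spacing.
C_LIGHT = 3.0e8      # speed of light in m/s (rounded)
V_KMH = 50.0         # user velocity in km/h
F_C = 2.4e9          # carrier frequency in Hz
DELTA_F = 15.0e3     # sub-carrier spacing in Hz
N_K = 512            # number of sub-carriers / TD samples per OFDM symbol


def normalized_doppler(v_kmh, f_c, delta_f, n_k, c=C_LIGHT):
    """Maximum Doppler shift divided by the sampling rate."""
    v_ms = v_kmh / 3.6                 # km/h -> m/s
    f_doppler = v_ms * f_c / c         # maximum Doppler shift in Hz
    f_sample = n_k * delta_f           # sampling rate in Hz
    return f_doppler / f_sample


fd_norm = normalized_doppler(V_KMH, F_C, DELTA_F, N_K)
print(f"normalized Doppler frequency: {fd_norm:.4e}")
```

Under these assumptions the maximum Doppler shift is roughly 111 Hz, and the normalized value agrees with the figure quoted above to the printed precision.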
The frame structure of the system setup used throughout this chapter is shown in Fig. 1. The first OFDM symbol of a frame is a preamble following an orthogonal preamble scheme over the transmit antennas as proposed in [8]. The subsequent OFDM symbols consist only of data symbol vectors across all sub-carriers.
The receiver model considered throughout this chapter is depicted in Fig. 2. Each received data symbol vector is iteratively processed by a soft-in soft-out (SISO) detector and a SISO decoder. A data symbol vector is defined as the vector over all receive antennas at the kth sub-carrier. Detection is performed by a max-log MAP SISO sphere detector (SD) with QR decomposition [9], while the channel decoder is a BCJR decoder providing soft information of the coded bits \(\{c\}\). For the simulation results, each processing block is executed twice, which corresponds to one complete iteration between the detector and the decoder.
The channel estimation provides the required estimate of the channel frequency response to the detector. As depicted in Fig. 2 the CE is split into two parts. First, the PACE processing block calculates an initial CIR estimate based on the preamble. This initial estimate is used to decode the second OFDM symbol of the frame. Second, the DD-CE block uses the decoder decisions of time n in order to provide an updated estimate at time \(n + U_x\) for the detection of the next OFDM symbol. Thus, the third OFDM symbol is detected and decoded based on the updated channel estimate. The update frequency (\(U_x\)) can vary depending on the considered Doppler frequency.
The computational complexity of a CIR update can be adjusted, as presented in [6]. The idea is to reduce the feedback from the decoder and thereby the number of computations. Either the full feedback is used, meaning that all modulation symbols of the previous OFDM symbol contribute to the CIR estimate, or only every second, third, etc. symbol is used in the calculation.
3 The TA-SAGE Algorithm
The \(\varvec{H}_{i,j}\) from (1) can be expressed as the DFT of the CIR: \(\varvec{H}_{i,j} = \text {DFT}(\varvec{h}_{i,j})\), where \(\varvec{h}_{i,j}\) is the CIR between the ith transmit antenna and the jth receive antenna. Only the calculation of the CIR estimate \(\hat{\varvec{h}}_{i,j}=( \hat{h}_{i,j}[0], ..., \hat{h}_{i,j}[N_\mathrm {L}-1] )^T\) will be investigated in the remainder of this chapter.
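As a small illustration of this relation, the following sketch evaluates the DFT of a short CIR directly. The sign and scaling convention (negative exponent, no normalization), the function name and the 3-tap example values are assumptions made for illustration only.

```python
import cmath


def cir_to_cfr(h, n_k):
    """Channel frequency response H[k] = sum_l h[l] * exp(-j*2*pi*k*l/n_k).

    h   : CIR taps (length N_L <= n_k); taps beyond N_L are implicitly zero
    n_k : number of sub-carriers
    """
    return [
        sum(h_l * cmath.exp(-2j * cmath.pi * k * l / n_k)
            for l, h_l in enumerate(h))
        for k in range(n_k)
    ]


# Hypothetical 3-tap CIR for one antenna pair (i, j)
h_ij = [1.0 + 0.0j, 0.5j, -0.25 + 0.0j]
H_ij = cir_to_cfr(h_ij, n_k=8)
```

Because only \(N_\mathrm{L}\) taps are non-zero, estimating the CIR means estimating far fewer unknowns than estimating every sub-carrier of the frequency response individually.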
The description of the TD-SAGE algorithm in [5] is modified in this chapter to remove redundant calculations for an efficient VLSI implementation.
The iterations are performed independently for each receive antenna. The whole processing of the algorithm operates on the TD samples and can be split into four steps that have to be executed for each receive antenna. For these steps a new variable is introduced: the residual \(\varvec{\epsilon }_j^{(m)}\) is the vector that remains after subtracting the reconstruction of the observation, given the current CIR estimate, from the actual observation \(\varvec{y}_j\). It is essentially the current estimation error. The steps of the modified SAGE algorithm are the following:
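A minimal floating-point sketch of one possible reading of these four steps is given below for a single receive antenna. It is reconstructed from the prose description of the residual and of the processing schedule in Sect. 5.1, so it is an assumption rather than the exact equations of this chapter. The function names, the noiseless test signal and the parameter values are hypothetical.

```python
import random


def circ_shift(s, l):
    """Return s delayed by l samples with wrap-around (circular shift)."""
    n = len(s)
    return [s[(k - l) % n] for k in range(n)]


def inner(a, b):
    """Inner product <a, b> = sum_k conj(a[k]) * b[k]."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def tapwise_cir_update(y, s_hat, h_hat, n_iter):
    """Refine the CIR estimate tap by tap for one receive antenna.

    y     : received TD samples (length N_K)
    s_hat : per transmit antenna, TD samples of the remapped decisions
    h_hat : per transmit antenna, current CIR estimate (length N_L)
    """
    n_t, n_l = len(h_hat), len(h_hat[0])
    h = [list(taps) for taps in h_hat]
    # Step 1: reciprocal symbol energies and the initial residual
    inv_energy = [1.0 / inner(s, s).real for s in s_hat]
    eps = list(y)
    for i in range(n_t):
        for l in range(n_l):
            shifted = circ_shift(s_hat[i], l)
            eps = [e - h[i][l] * x for e, x in zip(eps, shifted)]
    for _ in range(n_iter):
        for i in range(n_t):
            for l in range(n_l):  # Step 2: select tap l of transmit antenna i
                shifted = circ_shift(s_hat[i], l)
                # Step 3: project the residual onto the shifted symbol vector
                delta = inner(shifted, eps) * inv_energy[i]
                # Step 4: update the tap, then remove its share of the residual
                h[i][l] += delta
                eps = [e - delta * x for e, x in zip(eps, shifted)]
    return h, eps


# Noiseless toy example with hypothetical sizes (not the chapter's parameters)
rng = random.Random(0)
N_K, N_L, N_T = 256, 4, 2


def cgauss():
    return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


s = [[cgauss() for _ in range(N_K)] for _ in range(N_T)]
h_true = [[cgauss() for _ in range(N_L)] for _ in range(N_T)]
y = [0j] * N_K
for i in range(N_T):
    for l in range(N_L):
        y = [a + h_true[i][l] * x for a, x in zip(y, circ_shift(s[i], l))]

h_start = [[0j] * N_L for _ in range(N_T)]
h_est, eps = tapwise_cir_update(y, s, h_start, n_iter=30)
max_err = max(abs(a - b)
              for ta, tb in zip(h_est, h_true) for a, b in zip(ta, tb))
print(f"largest tap error: {max_err:.2e}")
```

In this reading no matrix inversion is needed: each tap update is one inner product, one scaling by a precomputed reciprocal, and one vector update of the residual. The circular shift preserves the vector norm, which is why a single energy value per transmit antenna suffices.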
Fig. 3. Block error rate for the TA-SAGE with 4-, 16- and 64-QAM, using all available time-domain samples (\(N_K\))
4 Algorithm Evaluation
Figure 3 shows the block error rate (BLER) for the investigated modulation schemes 4-, 16- and 64-QAM using a floating-point implementation. A block is defined as one code word, which is spread over one OFDM symbol. The simulations for 4-, 16- and 64-QAM were performed with \(N_i=3\) iterations. The number of estimated taps is \(N_L=32\), which equals the length of the cyclic prefix. This is a worst-case assumption for the presented OFDM system, which is used throughout this work. In addition to the floating-point simulations, the results for a fixed-point implementation are shown in Fig. 3. The degradation due to the fixed-point arithmetic is negligible.
For the following evaluation, 4-QAM was chosen as an example and the number of internal iterations is \(N_i=1\). In Fig. 1 the update frequency is given by \(U_x\). In the following, two different options to reduce the computational complexity by a factor of four are evaluated for two exemplary operating points. Figure 4 shows the algorithmic performance for 4-QAM and a mobile device speed of \(v=50\, \mathrm{km/h}\) with \(U_x=1\) and \(U_x=4\). The loss in algorithmic performance is about 1 dB at a BLER of 1 %. Additionally, the BLER for reduced feedback from the decoder is plotted. Here \(F_b\) equals 4, which means that every 4th symbol from the decoder is used to calculate the update of the CIR estimate. This leads to the same computational complexity as using the full feedback but updating the CIR estimate only every 4th OFDM symbol. From an algorithmic perspective, it can be concluded from Fig. 4 that for the given mobile speed and the same computational complexity it is the better choice to update the CIR estimate every 4th OFDM symbol using the full decoder feedback.
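The equal-complexity comparison can be made explicit with a simple cost model. The sketch assumes that the workload of one CIR update scales linearly with the number of feedback symbols used and that an update is computed once every \(U_x\) OFDM symbols. This linear model and the function name are illustrative assumptions.

```python
def relative_update_cost(u_x, f_b):
    """Average CIR-update workload per OFDM symbol.

    The result is relative to full feedback (f_b = 1) with an update
    after every OFDM symbol (u_x = 1). Assumes linear scaling.
    """
    return 1.0 / (u_x * f_b)


full_tracking = relative_update_cost(u_x=1, f_b=1)      # reference point
sparse_updates = relative_update_cost(u_x=4, f_b=1)     # update every 4th symbol
reduced_feedback = relative_update_cost(u_x=1, f_b=4)   # every 4th feedback symbol
print(full_tracking, sparse_updates, reduced_feedback)
```

Both reduced options land on a quarter of the reference workload in this model, so the BLER comparison between them is made at equal computational cost.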
Fig. 4. Block error rate for the TA-SAGE with 4-QAM at a speed of \(v=50\,\mathrm{km/h}\)
Fig. 5. Block error rate for the TA-SAGE with 4-QAM at a speed of \(v=100\,\mathrm{km/h}\)
5 TA-SAGE VLSI Architecture
Fig. 6. High level architecture
5.1 Processing Schedule
The processing is split into four phases: load, pre-computation, iteration and write-back. The load and write-back phases do not include any computation but are necessary to load the input data into the memories and to write back the results. These phases are considered for completeness of the hardware complexity analysis. In the load phase the received data \(\varvec{y}_j\), the current CIR estimate \(\varvec{\hat{h}}\) and the decoder feedback \(\varvec{\hat{s}}_i\) are loaded into the memories depicted in Fig. 6.
The first processing phase (pre-computation) corresponds to step 1 of the algorithm description. First, the scalar \(||{\hat{\varvec{s}}_i}||^2\) is calculated for all transmit antennas in the SP unit. Second, the residual vector \(\varvec{\epsilon }_j^{(0)}\) for all receive antennas is calculated in the SP and RU units. Both units run concurrently, processing different receive antennas. In parallel to the \(\varvec{\epsilon }_j^{(0)}\) calculation, the reciprocal of \(||{\hat{\varvec{s}}_i}||^2\) is pre-computed for all transmit antennas, since it does not change over the internal iterations.
The second processing phase is the iteration phase, which corresponds to steps 2, 3 and 4 of the algorithm. Step 2 is reflected in the dedicated address generation of each memory. Step 3 is executed by the SP unit, which calculates the inner product of (9) and the multiplication with the scaling factor \(\frac{1}{||{\hat{\varvec{s}}_i}||^2}\). The last steps of the algorithm are (11), executed on the RU unit, and (10), calculated by the tap unit. To achieve full utilization of the processing units and to account for the data dependencies between (9) and (11), the SP and RU units are separated by a pipeline register and execute the calculations concurrently for different receive antennas. This is possible since there is no data dependency between different receive antennas, which is a property of the SAGE algorithm.
In the write-back phase the newly calculated estimate of the CIR is written from the tap memory to the output ports.
5.2 Processing Units
Fig. 7. SP Unit
SP Unit. Section 5.1 discussed that the SP unit is used in two phases and calculates (5), (6) and (9). It can be seen from (5) that all complex multiplications can be executed in parallel. Therefore, it is possible to have a data path parallelism up to \(N_\mathrm {K}\). In (6) and (9) it is necessary to accumulate the result of the concurrent calculations. This is implemented via an adder tree. Due to the high data path parallelism (up to \(w=32\)) the maximum achievable frequency is determined by these adder trees. This leads to the design decision to have a dedicated pipeline stage as shown in Fig. 7 (third pipeline stage). The separation into the first and second pipeline stage is done to avoid two real multipliers in chain. This unit includes \(6 \cdot w\) multipliers and \(3\cdot \log (w) + 7\cdot w\) adders. The multipliers in the first pipeline stage are active in the pre-computation phase and the iteration phase. The dotted part in Fig. 7 is only active during the pre-computation phase, where first \(||{\hat{\varvec{s}}_i}||^2\) is computed and written into a register file and then the initial residual vector \(\varvec{\epsilon }_j^{(0)}\) is computed while the sequential divider concurrently outputs all \(\frac{1}{||{\hat{\varvec{s}}_i}||^2}\). The dashed part is active in the iteration phase calculating \(\delta ^{(m)}\).
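To illustrate why the accumulation stage grows with the data path parallelism, the following sketch models a generic pairwise adder tree for one real-valued sum. It is a behavioural illustration and not the SP unit's actual netlist; the function name is hypothetical, and the adder count it reports covers a single tree only, so it is not meant to reproduce the unit-level figure given above.

```python
def adder_tree(values):
    """Sum a non-empty list pairwise, level by level.

    Returns (total, levels, adders), where `levels` is the depth of the
    tree and `adders` is the number of two-input adders it contains.
    """
    level = list(values)
    levels = 0
    adders = 0
    while len(level) > 1:
        # Add neighbouring pairs; each pair costs one two-input adder
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])   # odd element passes through unchanged
        level = nxt
        levels += 1
    return level[0], levels, adders


total_32, levels_32, adders_32 = adder_tree(range(32))   # w = 32
total_8, levels_8, adders_8 = adder_tree(range(8))       # w = 8
print(levels_8, levels_32)
```

The depth grows with \(\log_2(w)\), so going from \(w=8\) to \(w=32\) adds two adder levels to the combinational path. This is consistent with giving the adder tree a pipeline stage of its own.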
Fig. 8. RU Unit
Tap Update Unit. This unit updates the current tap (10). Due to its low throughput requirements and its low complexity (2 adders) compared to the RU and SP units, it is not discussed further.
5.3 Memory Architecture
As shown in Fig. 6 the design has three different memories. Each of these memories has a dedicated controller that includes an address generation unit and multiplexers to realize the different data access schemes. The first memory is the tap memory, which stores the \(N_\mathrm {R}N_\mathrm {T}N_\mathrm {L}\) taps of the CIR. This memory has the most relaxed constraints in the architecture. During the initialization phase it is read in every cycle from the RU and the SP unit with a linear addressing scheme. In the iteration phase the tap memory is read and written once per tap update, i.e. every \(\frac{N_\mathrm {K}}{w}\) cycles. Therefore, one read/write port is sufficient (Fig. 8).
The TX memory stores the \(N_\mathrm {T}N_\mathrm {K}\) TD samples of the complex symbols of the remapped decisions from the decoder. The circular shift in (3) is realized as part of the address calculations. This memory is read by the RU and SP unit during the pre-computation phase and the iteration phase every cycle in parallel and written in the load phase. Each access reads/writes w elements in parallel. Therefore, two read/write ports with a word width of w elements are implemented.
The third memory is the residual memory. It stores the \(N_\mathrm {R}N_\mathrm {K}\) \(\epsilon \)-values and needs to be read by the SP unit (9) and read and written by the RU unit (11) independently and concurrently with a word width of w elements. Furthermore, during the pre-computation phase the SP and RU units independently read and write the residual memory (5). Thus, the residual memory has two read and two write ports with a data width matching the data parallelism w.
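The memory depths follow directly from the algorithmic parameters. The sketch below evaluates them for the parameter set used in the implementation results (\(N_T=2\), \(N_R=2\), \(N_L=32\), \(N_K=512\)). Sizes are counted in complex-valued entries because the word widths in bits are not given here; the function and key names are hypothetical.

```python
def memory_entries(n_t, n_r, n_l, n_k):
    """Number of complex-valued entries held by each of the three memories."""
    return {
        "tap": n_r * n_t * n_l,    # one CIR of n_l taps per antenna pair
        "tx": n_t * n_k,           # TD samples of the remapped decisions
        "residual": n_r * n_k,     # residual samples per receive antenna
    }


sizes = memory_entries(n_t=2, n_r=2, n_l=32, n_k=512)

# Cycles between two tap-memory accesses in the iteration phase (N_K / w)
cycles_per_tap = {w: 512 // w for w in (8, 16, 32)}
print(sizes, cycles_per_tap)
```

With these parameters the TX and residual memories each hold eight times as many entries as the tap memory, and the tap memory is accessed only once every 64, 32 or 16 cycles for \(w=8\), 16 and 32. This matches its description as the memory with the most relaxed constraints.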
With w and the algorithmic parameters \(N_\mathrm {T}\), \(N_\mathrm {R}\), \(N_\mathrm {L}\), \(N_\mathrm {K}\) and \(N_i\), the cycle count of each phase can be calculated using the following equations.
6 Implementation Results
Fig. 9. Area-time trade-offs for different degrees of parallelism w and different synthesis/layout constraints. The algorithmic parameters are \(N_i=3\), \(N_T=2\), \(N_R=2\), \(N_L=32\) and \(N_K=512\).
6.1 Full Feedback Architecture
Three different configurations of the architecture were implemented. A configuration is defined by its data path parallelism \(w=\{8,16,32\}\). Each configuration was synthesized and laid out for its maximum achievable frequency and additionally for 400 MHz and 200 MHz. There are only two design points for \(w=32\) since its maximum achievable frequency is 400 MHz.
The area-time trade-off diagram for the architecture variants is shown in Fig. 9. In this diagram \(T_\mathrm {exec}\) is defined as the time that the specific architecture requires to calculate a complete update of the CIR.
The best area-time product, \(AT_\mathrm {exec} = 53.55\,\mathrm{mm}^2\upmu \mathrm{s}\), is achieved by the configuration with \(w=32\) and a synthesis and layout constraint of 400 MHz. However, the following discussion focuses on the configurations with an execution time around \(70\,{\upmu }\mathrm{s}\). The configuration with data path parallelism \(w=8\) @ 400 MHz has an \(AT_\mathrm {exec}\) product of \(122.81\, \mathrm{mm}^2\upmu \mathrm{s}\). Doubling the parallelism to \(w=16\) and halving the frequency (\(AT_\mathrm {exec} = 138.9\, \mathrm{mm}^2\upmu \mathrm{s}\)) leads to the same execution time and only a slight increase in area. This stems from the fact that the architecture is memory dominated, and an increase in the data path parallelism does not affect the memory as much as the data path (Table 1).
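A rough estimate shows why these two configurations share the same execution time. The sketch covers the iteration phase only and assumes that every tap update occupies \(N_\mathrm{K}/w\) cycles, as stated for the tap memory access pattern in Sect. 5.3. It ignores the load, pre-computation and write-back phases as well as pipeline fill, so it is a lower bound and not the chapter's exact cycle-count model. The function names are hypothetical.

```python
def iteration_phase_cycles(n_i, n_t, n_r, n_l, n_k, w):
    """Cycles of the iteration phase, assuming n_k / w cycles per tap update."""
    taps = n_r * n_t * n_l
    return n_i * taps * (n_k // w)


def iteration_phase_time_us(freq_hz, w, n_i=3, n_t=2, n_r=2, n_l=32, n_k=512):
    """Iteration-phase duration in microseconds at clock frequency freq_hz."""
    return iteration_phase_cycles(n_i, n_t, n_r, n_l, n_k, w) / freq_hz * 1e6


t_w8_400 = iteration_phase_time_us(400e6, w=8)
t_w16_200 = iteration_phase_time_us(200e6, w=16)
t_w32_400 = iteration_phase_time_us(400e6, w=32)
print(f"{t_w8_400:.2f} us, {t_w16_200:.2f} us, {t_w32_400:.2f} us")
```

Under this assumption both \(w=8\) @ 400 MHz and \(w=16\) @ 200 MHz spend about 61 µs in the iteration phase, which is plausible against the total execution time of around 70 µs once the remaining phases are added.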
The memories in the presented architecture are implemented as standard cell based memories (SCM) [10]. In this work flip-flop SCMs are used. The TX memory needs to be split into w banks, each providing one word, to allow for non-aligned vector accesses. With macro cell memories this would lead to 32 macro cells for the maximum configuration, rendering floor-planning difficult. Therefore, SCMs were used in this architecture.
Besides area and timing analysis, post-layout simulations were performed to obtain power estimates for the different configurations. The post-layout simulations with timing annotations were executed with independent test vectors for each configuration in order to obtain statistical toggling information. Synopsys Power Compiler uses the post-layout netlist and the annotated toggling information to calculate the average power estimates.
Area breakdown for the TA-SAGE \(w=16\) @ 200 MHz and \(w=8\) @ 400 MHz
Power-time trade-offs for different configurations of the TA-SAGE and different synthesis/layout constraints. The algorithmic parameters are \(N_i=3\), \(N_T=2\), \(N_R=2\), \(N_L=32\) and \(N_K=512\).
Layout of the TA-SAGE ASIC for a parallelism degree of \(w=8\).
6.2 Flexible Feedback Architecture
The extension of the architecture to support the flexible feedback involved adjustments in the state machine and therefore in the schedule. Figure 12 compares the implementation results for the full and the flexible feedback architectures.
Fig. 12. Power comparison of the full and flexible feedback architectures at different frequencies and \(w=8\)
Area-time trade-offs for different degrees of parallelism w and different synthesis/layout constraints. The algorithmic parameters are \(N_i=1\), \(N_T=2\), \(N_R=2\), \(N_L=32\) and \(N_K=512\).
7 Conclusion
In this chapter we present, to the best of our knowledge, the first ASIC implementation of decision-directed channel estimation for MIMO-OFDM with M-QAM. The architecture is described, and formulas for the run-time of the algorithm on the architecture as a function of its parameters are presented. The implementation is characterized in terms of area-time trade-offs and power dissipation.
It was shown that the additional hardware costs of a channel tracking algorithm like the TA-SAGE are high compared to traditional PACE as presented in [11], but the implementation is feasible and therefore worth further investigation.
Future work will include the influence of using latch-based SCMs as presented in [10] and the evaluation of mixing macro cell memories (e.g. for the residual and the tap memory) with the SCM approach for the TX memory. Furthermore, this architecture will be compared with simplified algorithms, for example those in [12].
Notes
Acknowledgments
This work has been supported by the UMIC Research Centre, RWTH Aachen University. The authors would like to thank Ernst Martin Witte, David Kammler, Martin Senst, Filippo Borlenghi and Uwe Deidersen for the valuable discussions and their feedback.
References
- 1. Gao, J., Liu, H.: Low-complexity MAP channel estimation for mobile MIMO-OFDM systems. IEEE Trans. Wireless Commun. 7(3), 774–780 (2008)
- 2. Li, Y., Seshadri, N., Ariyavisitakul, S.: Channel estimation for OFDM systems with transmitter diversity in mobile wireless channels. IEEE J. Sel. Areas Commun. 17, 461–471 (1999)
- 3. Ylioinas, J., Juntti, M.: Iterative joint detection, decoding, and channel estimation in turbo-coded MIMO-OFDM. IEEE Trans. Veh. Technol. 58, 1784–1796 (2009)
- 4. Xie, Y., Georghiades, C.: Two EM-type channel estimation algorithms for OFDM with transmitter diversity. IEEE Trans. Commun. 51, 106–115 (2003)
- 5. Ylioinas, J., Raghavendra, M., Juntti, M.: Avoiding matrix inversion in DD SAGE channel estimation in MIMO-OFDM with M-QAM. In: 2009 IEEE 70th Vehicular Technology Conference Fall (VTC 2009-Fall), pp. 1–5, September 2009
- 6. Ketonen, J., Juntti, M., Ylioinas, J.: Decision directed channel estimation for improving performance in LTE-A. In: 2010 Conference Record of the Forty Fourth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1503–1507, November 2010
- 7. Minwegen, A., Auras, D., Ascheid, G.: A multimode decision-directed channel estimation ASIC for MIMO-OFDM. In: 2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pp. 65–70, IEEE (2012)
- 8. Li, Y.: Simplified channel estimation for OFDM systems with multiple transmit antennas. IEEE Trans. Wireless Commun. 1, 67–75 (2002)
- 9. Studer, C., Bölcskei, H.: Soft-input soft-output sphere decoding. In: IEEE International Symposium on Information Theory, 2008, ISIT 2008, pp. 2007–2011, July 2008
- 10. Meinerzhagen, P., Roth, C., Burg, A.: Towards generic low-power area-efficient standard cell based memory architectures. In: 2010 53rd IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 129–132, August 2010
- 11. Simko, M., Wu, D., Mehlfuehrer, C., Eilert, J., Liu, D.: Implementation aspects of channel estimation for 3GPP LTE terminals. In: 11th European Wireless Conference 2011 - Sustainable Wireless Technologies (European Wireless), pp. 1–5, April 2011
- 12. Qiao, X., Zhao, H., Han, Z., Sun, Y.: Decision-directed channel estimation for MIMO-OFDM systems. In: 5th International Conference on Wireless Communications, Networking and Mobile Computing, 2009, WiCom 2009, Beijing, pp. 1–4 (2009)