A Flexible ASIC for Time-Domain Decision-Directed Channel Estimation in MIMO-OFDM Systems
Abstract
Channel estimation is a crucial task for the overall communication performance of a wireless receiver. Compared to traditional approaches, the estimation of the wireless channel can be improved by iterative estimation with feedback from other receiver components. However, the VLSI implementation of such iterative channel estimation in multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems is challenging due to its high computational complexity. In this chapter we introduce the first ASIC for decision-directed MIMO-OFDM channel estimation, which tracks channel variations using feedback from a decoder and supports M-QAM. Furthermore, timing and power dissipation trade-offs are analyzed.
Keywords
MIMO-OFDM, VLSI, ASIC, Channel estimation, Expectation maximization, SAGE
1 Introduction
Orthogonal frequency division multiplexing (OFDM) and spatial multiplexing over multiple-input multiple-output (MIMO) transmission schemes have been adopted by several recent wireless communication standards such as 3GPP Long Term Evolution (LTE) and IEEE 802.11n and beyond. Because MIMO-OFDM receivers rely on coherent detection, channel estimation (CE) is a crucial and computationally intensive part of the overall system. It has a significant impact on the communication performance in terms of frame error rate and thus directly influences the maximum achievable throughput. Traditionally, pilot-aided channel estimation (PACE) is applied, where the channel is estimated at predefined pilot positions and the complete channel over all subcarriers is obtained via interpolation. Iterative channel estimation can improve these estimates and delivers promising SNR gains [1]. In iterative channel estimation algorithms, a priori knowledge from the detector or the decoder is used to refine the channel estimates iteratively.
However, the gain in algorithmic performance is paid for with high latencies, since each block that participates in the iterations has to complete its processing before the next one can start using the improved input. An alternative that does not increase the latency but still provides significant benefits is the channel tracking approach. This approach [2, 3] provides channel estimation updates for time instance \(n+U_x\) based on the detector or decoder decisions for time instance n, as shown in Fig. 1.
Fast fading channels are particularly interesting scenarios for such a solution. An algorithmic investigation for a communication system similar to LTE is performed in [4] using the simplified frequency-domain (FD) SAGE (space-alternating generalized expectation-maximization) algorithm, which calculates updates of the channel impulse response (CIR) estimates for M-QAM constellations. The authors of [3] present a modification that provides a gain of 2 dB for 64-QAM. This modification requires a non-trivial matrix inversion. Fortunately, this matrix inversion can be avoided by iteratively processing each tap of the CIR in the time domain, as proposed by the time-domain (TD) SAGE algorithm presented in [5].
Furthermore, the authors of [6] present an analysis of the TD-SAGE algorithm in the context of LTE-Advanced. The results show that it has the potential to double the system throughput at high user mobility due to a better channel estimate and, therefore, a lower frame error rate. In addition, the authors present simulation results on how the number of feedback symbols passed from the decoder to the channel estimation block affects the system performance. These results reveal interesting trade-offs between the computational complexity and the algorithmic performance of the TD-SAGE algorithm. The number of feedback symbols is therefore an interesting parameter for a trade-off analysis between energy dissipation and algorithmic performance of a dedicated VLSI architecture.
Contributions: This chapter extends the first ASIC implementation of the TD-SAGE algorithm, which is presented in [7]. The TD-SAGE algorithm is transformed into a novel variant, termed tap-alternating (TA) SAGE, which further reduces the computational complexity. In addition, this work discusses two options that reduce the computational complexity further at the cost of a loss in algorithmic performance: reducing the number of feedback symbols, as discussed in [6], or reducing the frequency of channel estimate updates. A suitable VLSI architecture is presented, and area, timing and power numbers are provided. The support for a variable number of feedback symbols requires a modification of the state machine that has an impact on the critical path. Therefore, implementation results are presented for architectures with and without variable feedback support. This work evaluates the hardware costs of channel tracking in a MIMO-OFDM system.
2 System Model
The channel model used in this chapter is a frequency-selective Rayleigh fading channel with a power delay profile according to the typical urban COST259 model. It is time-variant with a correlation according to Jakes' model with a normalized Doppler frequency of \(f_d = 1.4468\cdot 10^{-5}\), a subcarrier spacing of 15 kHz, a user velocity of \(v=50\,\mathrm{km/h}\) and a carrier frequency of \(f_\mathrm{c} = 2.4\,\mathrm{GHz}\).
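The normalized Doppler frequency can be cross-checked with a few lines of Python. This is only a plausibility sketch: the text does not state how the Doppler frequency is normalized, so the code assumes a normalization to the sampling rate, a hypothetical FFT size of 512 (i.e. a sampling rate of 7.68 MHz at 15 kHz subcarrier spacing) and a speed of light rounded to \(3\cdot 10^{8}\,\mathrm{m/s}\). Under these assumptions the digits 1.4468 with an exponent of \(-5\) are reproduced.

```python
# Assumptions (not stated in the text): rounded speed of light, FFT size 512,
# and normalization of the Doppler shift to the sampling rate.
SPEED_OF_LIGHT = 3.0e8       # m/s, rounded (assumption)
SUBCARRIER_SPACING = 15e3    # Hz, from the text
FFT_SIZE = 512               # assumption


def max_doppler_hz(v_kmh: float, fc_hz: float) -> float:
    """Maximum Doppler shift f_D = v * f_c / c for a velocity given in km/h."""
    return (v_kmh / 3.6) * fc_hz / SPEED_OF_LIGHT


f_doppler = max_doppler_hz(50.0, 2.4e9)
sample_rate = FFT_SIZE * SUBCARRIER_SPACING
f_d_normalized = f_doppler / sample_rate
print(f"f_D = {f_doppler:.2f} Hz, normalized = {f_d_normalized:.4e}")
```

With these values the Doppler shift is about 111 Hz, which is small compared to the 15 kHz subcarrier spacing.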
The frame structure of the system setup used throughout this chapter is shown in Fig. 1. The first OFDM symbol of a frame is a preamble following an orthogonal preamble scheme over the transmit antennas as proposed in [8]. The subsequent OFDM symbols consist only of data symbol vectors across all subcarriers.
The receiver model considered throughout this chapter is depicted in Fig. 2. Each received data symbol vector is iteratively processed by a soft-in soft-out (SISO) detector and a SISO decoder. A data symbol vector is defined as the vector over all receive antennas at the k-th subcarrier. Detection is performed by a max-log MAP SISO sphere detector (SD) with QR decomposition [9], while the channel decoder is a BCJR decoder providing soft information on the coded bits \(\{c\}\). For the simulation results, each processing block is executed twice, which corresponds to one complete iteration between the detector and the decoder.
The channel estimation provides the required estimate of the channel frequency response to the detector. As depicted in Fig. 2 the CE is split into two parts. First, the PACE processing block calculates an initial CIR estimate based on the preamble. This initial estimate is used to decode the second OFDM symbol of the frame. Second, the DDCE block uses the decoder decisions of time n in order to provide an updated estimate at time \(n + U_x\) for the detection of the next OFDM symbol. Thus, the third OFDM symbol is detected and decoded based on the updated channel estimate. The update frequency (\(U_x\)) can vary depending on the considered Doppler frequency.
The computational complexity of a CIR update can be adjusted, as presented in [6]. The idea is to reduce the feedback from the decoder and thereby the number of computations. Either the full feedback is used, meaning that all modulation symbols of the previous OFDM symbol are used to calculate the CIR estimate, or only every second, third, etc. symbol is used in the calculation.
3 The TA-SAGE Algorithm
The \(\varvec{H}_{i,j}\) from (1) can be expressed as the DFT of the CIR: \(\varvec{H}_{i,j} = \text {DFT}(\varvec{h}_{i,j})\), where \(\varvec{h}_{i,j}\) is the CIR between the i-th transmit antenna and the j-th receive antenna. Only the calculation of the CIR estimate \(\hat{\varvec{h}}_{i,j}=( \hat{h}_{i,j}[0], \ldots, \hat{h}_{i,j}[N_\mathrm {L}-1] )^T\) will be investigated in the remainder of this chapter.
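The relation between the CIR and the channel frequency response can be illustrated with a direct DFT. The sketch below assumes the usual convention \(H[k] = \sum_l h[l]\, e^{-j 2\pi k l / N_\mathrm{K}}\), with the \(N_\mathrm{L}\) taps implicitly zero-padded to \(N_\mathrm{K}\) subcarriers. The function name `cir_to_cfr` and the example taps are hypothetical.

```python
import cmath


def cir_to_cfr(h, n_k):
    """Channel frequency response H[k] = sum_l h[l] * exp(-j*2*pi*k*l/n_k).

    h   : list of complex CIR taps (length N_L <= n_k)
    n_k : number of subcarriers
    """
    return [
        sum(h_l * cmath.exp(-2j * cmath.pi * k * l / n_k) for l, h_l in enumerate(h))
        for k in range(n_k)
    ]


h_example = [1.0 + 0.0j, 0.5j, 0.25 + 0.0j]   # hypothetical 3-tap CIR
H_example = cir_to_cfr(h_example, n_k=8)       # response on 8 subcarriers
```

Because only \(N_\mathrm{L}\) taps have to be estimated instead of \(N_\mathrm{K}\) frequency-domain coefficients, working on the CIR keeps the number of unknowns small.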
The description of the TD-SAGE algorithm in [5] is modified in this chapter to remove redundant calculations for an efficient VLSI implementation.
The iterations are done for each receive antenna independently. The whole processing of the algorithm is done on the TD samples and can be split into four steps that have to be executed for each receive antenna. For these steps a new variable is introduced: the residual \(\varvec{\epsilon}_j^{(m)}\) is the vector that remains after subtracting the reconstruction of the observation, given the current CIR estimate, from the actual observation \(\varvec{y}_j\). It is essentially the current estimation error. The modified SAGE algorithm then consists of four steps: step 1 precomputes \({\hat{\varvec{s}}_i}^2\) for all transmit antennas and the initial residual \(\varvec{\epsilon}_j^{(0)}\); step 2 selects the tap to be processed; step 3 calculates the inner product (9) scaled by \(\frac{1}{{\hat{\varvec{s}}_i}^2}\); and step 4 updates the tap (10) and the residual (11).
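The four steps can be sketched in Python for a single receive antenna. This is a minimal sketch reconstructed from the processing description in Sect. 5.1, not the exact formulation of [5] or of this chapter. It assumes that the observation is the sum over all transmit antennas of the circular convolution of the feedback samples with the CIR, and that the scaling factor is the reciprocal of the feedback energy. All function and variable names are hypothetical.

```python
import random


def circ_shift(x, l):
    """Delay x circularly by l samples: out[n] = x[(n - l) mod N]."""
    n = len(x)
    return [x[(k - l) % n] for k in range(n)]


def ta_sage_update(y, s, h_init, n_iter):
    """Refine the CIR taps seen by one receive antenna.

    y      : received TD samples of this antenna (length N_K)
    s      : s[i] holds the TD feedback samples of transmit antenna i
    h_init : h_init[i][l] is the current estimate of tap l of antenna i
    """
    h = [list(taps) for taps in h_init]
    # Step 1: feedback energy per transmit antenna and initial residual.
    energy = [sum(abs(v) ** 2 for v in s_i) for s_i in s]
    eps = list(y)
    for i, taps in enumerate(h):
        for l, tap in enumerate(taps):
            eps = [e - tap * x for e, x in zip(eps, circ_shift(s[i], l))]
    for _ in range(n_iter):
        for i, taps in enumerate(h):
            for l in range(len(taps)):
                # Step 2: select tap (i, l), i.e. the shift of the feedback.
                shifted = circ_shift(s[i], l)
                # Step 3: scaled inner product of shifted feedback and residual.
                delta = sum(x.conjugate() * e for x, e in zip(shifted, eps)) / energy[i]
                # Step 4: update the tap and remove its share from the residual.
                taps[l] += delta
                eps = [e - delta * x for e, x in zip(eps, shifted)]
    return h, eps


# Noise-free toy example with hypothetical dimensions.
rng = random.Random(0)


def crandn():
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))


N_K, N_T, N_L = 256, 2, 4
s = [[crandn() for _ in range(N_K)] for _ in range(N_T)]
h_true = [[crandn() for _ in range(N_L)] for _ in range(N_T)]
y = [0j] * N_K
for i in range(N_T):
    for l in range(N_L):
        y = [a + h_true[i][l] * b for a, b in zip(y, circ_shift(s[i], l))]
h_zero = [[0j] * N_L for _ in range(N_T)]
h_est, eps = ta_sage_update(y, s, h_zero, n_iter=30)
```

Each tap update is a one-dimensional least-squares step on the residual, so no matrix inversion is needed and the residual energy never increases from one update to the next.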
4 Algorithm Evaluation
Figure 3 shows the block error rate (BLER) for the investigated modulation schemes 4-, 16- and 64-QAM using a floating-point implementation. A block is defined as one code word which is spread over one OFDM symbol. The simulations for 4-, 16- and 64-QAM were performed with \(N_i=3\) iterations. The number of estimated taps is \(N_L=32\), which equals the length of the cyclic prefix. This is a worst-case assumption for the presented OFDM system, which is used throughout this work. Besides the floating-point simulations, the results for a fixed-point implementation are shown in Fig. 3. The degradation due to the fixed-point arithmetic is negligible.
For the following evaluation, 4-QAM was chosen as an example modulation and the number of internal iterations is \(N_i=1\). In Fig. 1 the update frequency is given by \(U_x\). In the following, two different options to reduce the computational complexity by a factor of four are evaluated for two exemplary operating points. Figure 4 shows the algorithmic performance for 4-QAM at a mobile device speed of \(v=50\, \mathrm{km/h}\) for \(U_x=1\) and \(U_x=4\). The loss in algorithmic performance is about 1 dB at a BLER of 1 %. Additionally, the BLER for reduced feedback from the decoder is plotted. Here \(F_b=4\), which means that only every 4th symbol from the decoder is used to calculate the update of the CIR estimate. This leads to the same computational complexity as using the full feedback but updating the CIR estimate only every 4th OFDM symbol. From an algorithmic perspective, Fig. 4 shows that for the given mobile speed and the same computational complexity it is the better choice to update the CIR estimate every 4th OFDM symbol using the full decoder feedback.
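The equivalence of the two options in terms of complexity can be expressed with a simple cost model. The sketch below assumes that the cost of one CIR update scales linearly with the number of feedback symbols used, which is consistent with the factor of four stated above but is not derived from the cycle count of the architecture. The function names are hypothetical.

```python
def feedback_indices(n_symbols, f_b):
    """Indices of the decoder feedback symbols kept when every f_b-th one is used."""
    return list(range(0, n_symbols, f_b))


def relative_complexity(f_b, u_x):
    """Average update cost per OFDM symbol, relative to F_b = 1 and U_x = 1.

    Assumption: the cost of one update is proportional to the number of
    feedback symbols, and an update is computed every u_x-th OFDM symbol.
    """
    return 1.0 / (f_b * u_x)


reduced_feedback = relative_complexity(f_b=4, u_x=1)   # every 4th symbol, update each OFDM symbol
reduced_updates = relative_complexity(f_b=1, u_x=4)    # full feedback, update every 4th OFDM symbol
```

Both operating points yield a relative complexity of 0.25, so the comparison in Fig. 4 is made at equal computational effort.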
5 TA-SAGE VLSI Architecture
5.1 Processing Schedule
The processing is split into four different phases: load, precomputation, iteration and write-back. The load and write-back phases do not include any computation but are necessary to load the input data into the memories and to write back the results. These phases are considered for completeness of the hardware complexity analysis. In the load phase the received data \(\varvec{y}_j\), the current CIR estimate \(\varvec{\hat{h}}\) and the decoder feedback \(\varvec{\hat{s}}_i\) are loaded into the memories depicted in Fig. 6.
The first processing phase (precomputation) corresponds to step 1 of the algorithm description. First, the scalar \({\hat{\varvec{s}}_i}^2\) is calculated for all transmit antennas in the SP unit. Second, the residual vector \(\varvec{\epsilon }_j^{(0)}\) for all receive antennas is calculated in the SP and RU units. Both units run concurrently, processing different receive antennas. In parallel to the \(\varvec{\epsilon }_j^{(0)}\) calculation, the reciprocal of \({\hat{\varvec{s}}_i}^2\) is precomputed for all transmit antennas, since it does not change over the internal iterations.
The second processing phase is the iteration phase, which corresponds to steps 2, 3 and 4 of the algorithm. Step 2 is reflected in the dedicated address generation of each memory. Step 3 is executed by the SP unit, which calculates the inner product of (9) and the multiplication with the scaling factor \(\frac{1}{{\hat{\varvec{s}}_i}^2}\). The last steps of the algorithm are (11), executed on the RU unit, and (10), calculated by the tap unit. To achieve full utilization of the processing units and to account for the data dependencies between (9) and (11), the SP and RU units are separated by a pipeline register and execute the calculations concurrently for different receive antennas. This is possible since there is no data dependency between different receive antennas, which is a property of the SAGE algorithm.
In the write-back phase the newly calculated estimate of the CIR is written from the tap memory to the output ports.
5.2 Processing Units
SP Unit. As discussed in Section 5.1, the SP unit is used in two phases and calculates (5), (6) and (9). It can be seen from (5) that all complex multiplications can be executed in parallel. Therefore, a data path parallelism of up to \(N_\mathrm {K}\) is possible. In (6) and (9) the results of the concurrent calculations have to be accumulated, which is implemented via adder trees. Due to the high data path parallelism (up to \(w=32\)), the maximum achievable frequency is determined by these adder trees. This leads to the design decision to give them a dedicated pipeline stage, as shown in Fig. 7 (third pipeline stage). The separation into the first and second pipeline stage avoids two real multipliers in a chain. This unit includes \(6 \cdot w\) multipliers and \(3\cdot \log (w) + 7\cdot w\) adders. The multipliers in the first pipeline stage are active in the precomputation phase and the iteration phase. The dotted part in Fig. 7 is only active during the precomputation phase, where first \({\hat{\varvec{s}}_i}^2\) is computed and written into a register file and then the initial residual vector \(\varvec{\epsilon }_j^{(0)}\) is computed while the sequential divider concurrently outputs all \(\frac{1}{{\hat{\varvec{s}}_i}^2}\). The dashed part is active in the iteration phase, calculating \(\delta ^{(m)}\).
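Why the adder trees limit the clock frequency can be seen from a behavioral model of a balanced binary tree. The sketch below is a generic illustration and not a model of the actual data path in Fig. 7. It assumes a power-of-two number of inputs, and the function name `adder_tree` is hypothetical.

```python
def adder_tree(values):
    """Sum a power-of-two number of values with a balanced binary adder tree.

    Returns the sum and the tree depth, i.e. the number of adders that a
    value passes through on its way to the output.
    """
    level = list(values)
    if not level or len(level) & (len(level) - 1):
        raise ValueError("number of inputs must be a power of two")
    depth = 0
    while len(level) > 1:
        # One tree level: add neighbouring pairs in parallel.
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


total, depth = adder_tree(range(32))   # w = 32 parallel products
```

For \(w=32\) a value passes through five adders in series, whereas \(w=8\) needs only three, which is why the accumulation is given its own pipeline stage.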
Tap Update Unit. This unit updates the current tap (10). Due to its low throughput requirements and its low complexity (2 adders) compared to the RU and SP units, it is not discussed further.
5.3 Memory Architecture
As shown in Fig. 6, the design has three different memories. Each of these memories has a dedicated controller that includes an address generation unit and multiplexers to realize the different data access schemes. The first memory is the tap memory, which stores the \(N_\mathrm {R}N_\mathrm {T}N_\mathrm {L}\) taps of the CIR. This memory has the most relaxed constraints in the architecture. During the initialization phase it is read in every cycle by the RU and SP units with a linear addressing scheme. In the iteration phase the tap memory is read and written once per tap update, i.e. every \(\frac{N_\mathrm {K}}{w}\) cycles. Therefore, one read/write port is sufficient (Fig. 8).
The TX memory stores the \(N_\mathrm {T}N_\mathrm {K}\) TD samples of the complex symbols of the remapped decisions from the decoder. The circular shift in (3) is realized as part of the address calculations. This memory is read by the RU and SP units in parallel in every cycle of the precomputation and iteration phases, and it is written in the load phase. Each access reads or writes w elements in parallel. Therefore, two read/write ports with a word width of w elements are implemented.
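How a circular shift can be folded into the address calculation of a banked memory is sketched below. The mapping of sample n to bank n mod w and row n div w is an assumption chosen for illustration, as are the function names. The sketch only shows that w circularly shifted, non-aligned samples can be fetched in one access without bank conflicts when \(N_\mathrm{K}\) is a multiple of w.

```python
def banked_layout(samples, w):
    """Distribute samples over w banks: sample n goes to bank n % w, row n // w."""
    banks = [[] for _ in range(w)]
    for n, value in enumerate(samples):
        banks[n % w].append(value)
    return banks


def read_shifted(banks, start, shift, n_k):
    """Read w consecutive samples of the sequence delayed circularly by `shift`.

    Lane k returns x[(start + k - shift) mod n_k]. Every lane addresses a
    different bank, provided that n_k is a multiple of the number of banks.
    """
    w = len(banks)
    out, used_banks = [], set()
    for k in range(w):
        idx = (start + k - shift) % n_k      # circular shift as address offset
        bank, row = idx % w, idx // w
        used_banks.add(bank)
        out.append(banks[bank][row])
    assert len(used_banks) == w              # one access per bank, no conflict
    return out


banks = banked_layout(list(range(16)), w=4)          # toy sizes: N_K = 16, w = 4
first_word = read_shifted(banks, start=0, shift=3, n_k=16)
```

In this toy example the first word of the sequence shifted by three samples consists of the samples 13, 14, 15 and 0, each coming from a different bank.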
The third memory is the residual memory. It stores the \(N_\mathrm {R}N_\mathrm {K}\) \(\epsilon\)-values and needs to be read by the SP unit (9) and read and written by the RU unit (11) independently and concurrently with a word width of w elements. Furthermore, during the precomputation phase the SP and RU units read and write the residual memory independently (5). Thus, the residual memory has two read and two write ports with a data width matching the data parallelism w.
With w and the algorithmic parameters \(N_\mathrm {T}\), \(N_\mathrm {R}\), \(N_\mathrm {L}\), \(N_\mathrm {K}\) and \(N_i\), the cycle count of each phase can be calculated using the following equations.
6 Implementation Results
6.1 Full Feedback Architecture
Three different configurations of the architecture were implemented. A configuration is defined by its data path parallelism \(w \in \{8,16,32\}\). Each configuration was synthesized and laid out for its maximum achievable frequency and additionally for 400 MHz and 200 MHz. There are only two different design points for \(w=32\), since its maximum achievable frequency is 400 MHz.
The area-time trade-off diagram for the architecture variants is shown in Fig. 9. In this diagram \(T_\mathrm {exec}\) is defined as the time that the specific architecture requires to calculate a complete update of the CIR.
The best \(AT_\mathrm {exec}\) product of \(53.55\,\mathrm{mm}^2\,\upmu \mathrm{s}\) is achieved by the configuration with \(w=32\) and a synthesis and layout constraint of 400 MHz. However, the following discussion will focus on the configurations with an execution time around \(70\,{\upmu }\mathrm{s}\). The configuration with data path parallelism \(w=8\) @ 400 MHz has an \(AT_\mathrm {exec}\) product of \(122.81\, \mathrm{mm}^2\,\upmu \mathrm{s}\). Doubling the parallelism to \(w=16\) and halving the frequency (\(AT_\mathrm {exec} = 138.9\, \mathrm{mm}^2\,\upmu \mathrm{s}\)) leads to the same execution time and only a slight increase in area. This stems from the fact that the architecture is memory dominated and an increase in the data path parallelism does not influence the memory as much as the data path (Table 1).
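The size of this area increase follows directly from the two quoted products. Assuming that both configurations have exactly the same execution time, \(T_\mathrm{exec}\) cancels and the area ratio equals the ratio of the \(AT_\mathrm{exec}\) products:

\[
\frac{A_{w=16}}{A_{w=8}} = \frac{AT_\mathrm{exec}\big|_{w=16}}{AT_\mathrm{exec}\big|_{w=8}} = \frac{138.9}{122.81} \approx 1.13
\]

Under this assumption, doubling the data path parallelism costs roughly 13 % more area.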
The memories in the presented architecture are implemented using standard cell based memories (SCM) [10]. In this work flip-flop SCMs are used. The TX memory needs to be split into w banks, each providing one word, to allow for non-aligned vector accesses. With macro cell memories this would lead to 32 macro cells for the maximum configuration, rendering floorplanning difficult. Therefore, SCMs were used in this architecture.
Besides area and timing analysis, post-layout simulations were performed to obtain power estimates for the different configurations. The post-layout simulations with timing annotations were executed with independent test vectors for each configuration in order to obtain statistical toggling information. Synopsys Power Compiler uses the post-layout netlist and the annotated toggling information to calculate the average power estimates.
Area breakdown for the TA-SAGE \(w=16\) @ 200 MHz and \(w=8\) @ 400 MHz

6.2 Flexible Feedback Architecture
The extension of the architecture to support the flexible feedback involved adjustments in the state machine and therefore in the schedule. Figure 12 compares the implementation results for the full and the flexible feedback architectures.
Power comparison of the full and flexible feedback architectures at different frequencies for \(w=8\)

7 Conclusion
In this chapter we present, to the best of our knowledge, the first ASIC implementation of decision-directed channel estimation for MIMO-OFDM with M-QAM. The architecture is described, and formulas for calculating the runtime of the algorithm on the architecture as a function of its parameters are presented. The implementation is characterized in terms of area-time trade-offs and power dissipation.
It was shown that the additional hardware costs of a channel tracking algorithm like TA-SAGE are high compared to traditional PACE as presented in [11], but that its implementation is feasible and therefore worth further investigation.
Future work will include the influence of using latch-based SCMs as presented in [10] and the evaluation of mixing macro cell memories (e.g. for the residual and the tap memory) with the SCM approach for the TX memory. Furthermore, this architecture will be compared with simplified algorithms such as the one in [12].
Acknowledgments
This work has been supported by the UMIC Research Centre, RWTH Aachen University. The authors would like to thank Ernst Martin Witte, David Kammler, Martin Senst, Filippo Borlenghi and Uwe Deidersen for the valuable discussions and their feedback.
References
1. Gao, J., Liu, H.: Low-complexity MAP channel estimation for mobile MIMO-OFDM systems. IEEE Trans. Wireless Commun. 7(3), 774–780 (2008)
2. Li, Y., Seshadri, N., Ariyavisitakul, S.: Channel estimation for OFDM systems with transmitter diversity in mobile wireless channels. IEEE J. Sel. Areas Commun. 17, 461–471 (1999)
3. Ylioinas, J., Juntti, M.: Iterative joint detection, decoding, and channel estimation in turbo-coded MIMO-OFDM. IEEE Trans. Veh. Technol. 58, 1784–1796 (2009)
4. Xie, Y., Georghiades, C.: Two EM-type channel estimation algorithms for OFDM with transmitter diversity. IEEE Trans. Commun. 51, 106–115 (2003)
5. Ylioinas, J., Raghavendra, M., Juntti, M.: Avoiding matrix inversion in DD SAGE channel estimation in MIMO-OFDM with M-QAM. In: 2009 IEEE 70th Vehicular Technology Conference Fall (VTC 2009-Fall), pp. 1–5, September 2009
6. Ketonen, J., Juntti, M., Ylioinas, J.: Decision directed channel estimation for improving performance in LTE-A. In: 2010 Conference Record of the Forty Fourth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1503–1507, November 2010
7. Minwegen, A., Auras, D., Ascheid, G.: A multimode decision-directed channel estimation ASIC for MIMO-OFDM. In: 2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pp. 65–70, IEEE (2012)
8. Li, Y.: Simplified channel estimation for OFDM systems with multiple transmit antennas. IEEE Trans. Wireless Commun. 1, 67–75 (2002)
9. Studer, C., Bölcskei, H.: Soft-input soft-output sphere decoding. In: IEEE International Symposium on Information Theory, 2008, ISIT 2008, pp. 2007–2011, July 2008
10. Meinerzhagen, P., Roth, C., Burg, A.: Towards generic low-power area-efficient standard cell based memory architectures. In: 2010 53rd IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 129–132, August 2010
11. Simko, M., Wu, D., Mehlfuehrer, C., Eilert, J., Liu, D.: Implementation aspects of channel estimation for 3GPP LTE terminals. In: 11th European Wireless Conference 2011 - Sustainable Wireless Technologies (European Wireless), pp. 1–5, April 2011
12. Qiao, X., Zhao, H., Han, Z., Sun, Y.: Decision-directed channel estimation for MIMO-OFDM systems. In: 5th International Conference on Wireless Communications, Networking and Mobile Computing, 2009, WiCom 2009, Beijing, pp. 1–4 (2009)