1 Introduction

With the widespread adoption of MIMO (multi-antenna) technology in current and future wireless communication systems, such as those based on the IEEE 802.11n standard [1], academia and industry are searching for MIMO detectors that combine reasonable implementation complexity with good algorithmic performance. Especially for systems using bit-interleaved coded modulation with iterative decoding (BICM-ID) [2], a major challenge for VLSI implementation is the required soft-input soft-output (SISO) detector, since optimal detection has exponential complexity.

Iterative MIMO decoding can yield impressive algorithmic performance gains in the form of significantly reduced signal-to-noise ratio (SNR) requirements for a given target error rate [3]. This SNR gain can be used in several ways, among others: to extend the transmission range, to serve more users (i.e. tolerate more interference), to lower the transmission power and thus save energy (while also reducing interference to other users), or to transmit at a higher throughput in the same bandwidth.

Possible detectors can be roughly put into two categories: linear detectors, e.g. MMSE-filter based [4–6], and non-linear detectors, e.g. [3, 7–9]. Basically, linear detectors try to suppress noise using linear filtering, then decode the estimate. In contrast to this, non-linear detectors perform a search, e.g. a randomly guided one, in the space of possibly transmitted data vectors. Stochastic detection based on Markov chain Monte Carlo (MCMC) methods [10] belongs to this class. It enables small configurable detectors that can cover a large design space. Furthermore, when iterating between detector and channel decoder, MCMC detection shows a communications performance close to max-log detection for certain SNR regimes [10].

To date, only limited research effort has been directed towards this field. Only a handful of publications on MCMC detector architectures exist at the moment [11–14]. None of them correlates communications performance with VLSI implementation results.

Related Work. An MCMC-based SISO MIMO detector ASIC design supporting independent parallel Gibbs Samplers is presented in [12]. Amongst other things, [12] introduces an initialization scheme for the completely recursive, and thus simplified, computation of the detector states, and shows how to reuse the circuitry to draw independent first samples. However, a multiplier in the timing critical path yields a limited throughput and a relatively large area consumption.

In [11], the authors propose an MCMC-based SISO MIMO detector architecture mapped on an FPGA. It features one multiplier-free Gibbs Sampler pipelined at the symbol vector level. The architecture uses a simple recursive metric computation, but requires one dot-product per cycle. The first sample of every chain needs to be generated externally.

The hybrid soft-output only MCMC detector architecture [14] combined with a hard-output fixed-complexity sphere detector (FSD) features parallel multiplier-free Gibbs Samplers that start with the best candidates found by the FSD. However, the design requires the QR-decomposition of the channel matrix, and the results are only given in terms of operation counts.

Contribution. We present a complete redesign of the MCMC-based MIMO detector architecture presented in [12], with multiplier-free Gibbs Samplers and further architectural improvements that result in a significant area reduction and timing improvement. Post-layout area and clock period reduce by about 50 % and 40 % respectively. In extension to our previous publication [13], we additionally provide our detector’s communications performance results and present an analysis showing how to trade off throughput for improved communications performance at run-time.

Outline. After the system model (Sect. 2), we introduce the general concept of MCMC-based MIMO detection (Sect. 3), describe the implemented algorithm (Sect. 4), and propose the redesigned architecture (Sect. 5). Subsequently, we explicitly highlight the differences to the reference design [12] in Sect. 6. Our implementation results are presented in Sect. 7. The communications performance results are analyzed in Sect. 8.

Fig. 1. Assumed MIMO BICM-ID system model. Detector and decoder iteratively exchange information to improve the final decoding result.

2 System Model

We consider a spatial-multiplexing \({N_t} \times {N_r} \) MIMO system with BICM-ID, as depicted in Fig. 1. A message \({\varvec{b}} \in \{ 0, 1 \}^{{N_b}}\) is encoded with rate \(r = {N_b}/ {N_c} \) and interleaved, yielding the code word \({\varvec{c}} \in \{ 0, 1 \}^{N_c} \). Let \({\mathcal {X}} \subset {\mathbb {C}}\) be a modulation alphabet with \({K} = \log _2 | {\mathcal {X}} |\) bits per symbol. The code word is partitioned into multiple subvectors \({\varvec{c}}_{n} \in \{ 0, 1 \}^{{K} {N_t}}\). They are subsequently mapped to symbol vectors \({\varvec{x}}_n \in {\mathcal {X}} ^{N_t} \) that are transmitted independently. Assuming a frequency-flat fading channel characterized by \({\varvec{H}}_n \in {\mathbb {C}}^{{N_r} \times {N_t}}\), the received symbol vector at time n is \({\varvec{y}}_n = {\varvec{H}}_n {\varvec{x}}_n + {\varvec{w}}_n\) where \({\varvec{w}}_n \in {\mathbb {C}}^ {N_r} \) is a white Gaussian noise process with \({{\mathbb {E}}[{ {\varvec{w}}_n {\varvec{w}}_n^H }]} = {N_0} {\varvec{I}}_{{N_r}}\). In the remainder, the time index n is dropped for convenience. Using iterative MIMO decoding following the Turbo Principle [15], detector and channel decoder exchange extrinsic information \({\varvec{\lambda }}^e = {\varvec{\lambda }}^p - {\varvec{\lambda }}^a\) in terms of log-likelihood ratios (LLRs), where \({\varvec{\lambda }}^p\) are the detector’s posterior LLRs and \({\varvec{\lambda }}^a\) are the prior LLRs fed back from the decoder.

3 MCMC-Based MIMO Detection

The Markov chain Monte Carlo based MIMO detector class that we consider performs a randomly guided search in the space \({\varvec{c}} \in \{ 0, 1 \}^{{K} {N_t}}\). It starts with a random candidate, then walks around randomly. On its way, it evaluates and saves metric values of the current candidates, which are later used to approximate the posterior LLRs. The random process (Monte Carlo) from which it draws new candidates evolves recursively (Markov chain). By design the search converges towards candidates of high probability [10].

We select independent first samples \({\varvec{c}}^{(q,0)} \in \{ 0, 1 \}^{{K} {N_t}}\), one per chain \(q = 1 \dots N_q\), either randomly from the prior distribution \({\varvec{c}}^{(q,0)} \sim p({\varvec{c}}) = f( {\varvec{\lambda ^a}} )\) or given by an external hard-output detector \({\varvec{c}}^{(q,0)} = {\varvec{c}}^{\text {ext}}\) (usually for at most one chain). Every sample \(s = 1 \dots N_s\) is drawn in \({K} {N_t} \) steps. The algorithm sequentially replaces every bit with 0 and 1, computes the metric for those two candidates, then selects one of them as the next partial sample.

Let \(\varphi : \{ 0, 1 \}^{{N_t} {K}} \mapsto {\mathcal {X}} ^{{N_t}}\) be a rule that maps bit labels onto symbol vectors \({\varvec{x}} \in {\mathcal {X}} ^{N_t} \). We define the metric

$$\begin{aligned} \mu ( {\varvec{c}} ) = - \frac{1}{N_0} \left\| {\varvec{y}} - {\varvec{H}} \varphi ( {\varvec{c}} ) \right\| ^2 - {\varvec{c}}^T {\varvec{\lambda }}^a \end{aligned}$$
(1)

for the candidate \({\varvec{c}} \in \{ 0, 1 \}^{{K} {N_t}}\), which is related to the posterior probability \(P({\varvec{c}} | {\varvec{y}}, {\varvec{H}}, {\varvec{\lambda }}^a)\). Furthermore, let

$$\begin{aligned} {\varvec{c}}_{b\beta } = ( c_1 , \cdots , c_{b-1}, \; \beta \; , c_{b+1}, \cdots , c_{{K} {N_t}}) \end{aligned}$$
(2)

be the vector \({\varvec{c}}\) with the b-th bit replaced by \(\beta \). The detector approximates the posterior LLRs as

$$\begin{aligned} \lambda ^p_{b} \approx \mathop {\text {max}}\limits _{{q,s}} \mu ( {\varvec{c}}_{b0}^{(q,s)} ) - \mathop {\text {max}}\limits _{{q,s}} \mu ({\varvec{c}}_{b1}^{(q,s)}) \end{aligned}$$
(3)

where we search for the two maxima for every bit over all chains and samples.
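The search described in this section can be illustrated with a small floating-point reference model. The following is only a sketch under stated assumptions, not the implemented algorithm of Sect. 4: it recomputes Eq. (1) from scratch for every candidate, uses the exact logistic function without temperature scaling, and assumes a hypothetical 4-QAM labelling `phi` (bit 0 maps to +1, bit 1 maps to -1 on each axis) together with the LLR convention \(\lambda = \ln P(c=0)/P(c=1)\) implied by Eqs. (1) and (3). All function names are illustrative.

```python
import math
import random


def phi(c, n_t):
    """Hypothetical 4-QAM labelling (K = 2): bit 0 -> +1, bit 1 -> -1 per axis."""
    return [complex(1 - 2 * c[2 * t], 1 - 2 * c[2 * t + 1]) for t in range(n_t)]


def metric(c, y, H, n0, lam_a):
    """Eq. (1): -||y - H phi(c)||^2 / N0 - c^T lambda^a."""
    x = phi(c, len(H[0]))
    dist = 0.0
    for k, row in enumerate(H):
        err = y[k] - sum(h * xt for h, xt in zip(row, x))
        dist += abs(err) ** 2
    return -dist / n0 - sum(cb * lb for cb, lb in zip(c, lam_a))


def logistic(v):
    """Numerically safe 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    z = math.exp(v)
    return z / (1.0 + z)


def mcmc_llrs(y, H, n0, lam_a, n_q, n_s, rng):
    """Approximate the posterior LLRs of Eq. (3) with n_q chains of n_s samples."""
    n_bits = len(lam_a)
    # best[b][beta]: largest metric seen so far with bit b equal to beta
    best = [[-math.inf, -math.inf] for _ in range(n_bits)]
    for _ in range(n_q):
        # First sample drawn from the prior: P(c_b = 1) = logistic(-lambda^a_b)
        c = [int(rng.random() < logistic(-l)) for l in lam_a]
        for _ in range(n_s):
            for b in range(n_bits):
                mu = []
                for beta in (0, 1):
                    c[b] = beta
                    mu.append(metric(c, y, H, n0, lam_a))
                    best[b][beta] = max(best[b][beta], mu[beta])
                # Keep one of the two candidates as the next partial sample
                c[b] = int(rng.random() < logistic(mu[1] - mu[0]))
    return [m0 - m1 for m0, m1 in best]
```

For a noiseless identity channel, e.g. `mcmc_llrs(phi([0, 1, 1, 0], 2), [[1, 0], [0, 1]], 0.01, [0.0] * 4, 2, 2, random.Random(1))`, the signs of the returned LLRs reproduce the transmitted bits, with a negative LLR indicating a 1.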

4 Low-Level Algorithm

The presented algorithm implements the max-log variant of the Rao-Blackwellized MCMC detection algorithm with uniform sampling described in [10]. Its basic idea is to recursively compute the metric in Eq. (1) by tracking the changes while drawing bits [12]. First, we introduce the basic concepts required for understanding the algorithm, then describe the algorithm in detail. For the theoretic background, the reader is kindly referred to [10, 12].

4.1 Basic Concepts

Matched Filter. The algorithm in [12] replaces \({\varvec{H}}\) with the Gram matrix \({\varvec{R}} = {\varvec{H}}^H {\varvec{H}}\) and the received symbol vector \({\varvec{y}}\) with the matched filter output \({\varvec{y}}^{\text {mf}} = {\varvec{H}}^H {\varvec{y}}\) in the metric. This does not influence the posterior LLR calculation, but it allows exploiting the symmetry \({\varvec{R}} = {\varvec{R}}^H\).
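The invariance follows from expanding the squared distance in Eq. (1). The step below is a standard identity, written with the shorthand \({\varvec{x}} = \varphi ( {\varvec{c}} )\):

$$\begin{aligned} \left\| {\varvec{y}} - {\varvec{H}} {\varvec{x}} \right\| ^2 = \left\| {\varvec{y}} \right\| ^2 - 2 \mathrm {Re}\lbrace {\varvec{x}}^H {\varvec{y}}^{\text {mf}} \rbrace + {\varvec{x}}^H {\varvec{R}} {\varvec{x}} \end{aligned}$$

The term \(\left\| {\varvec{y}} \right\| ^2\) is common to all candidates and therefore cancels in the difference of the two maxima in Eq. (3), so only \({\varvec{y}}^{\text {mf}}\) and \({\varvec{R}}\) are needed.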

Gibbs Sampler (GS). We realize the Markov chains with Gibbs Sampling. To this end, each GS draws bits sequentially according to an approximation of the conditional distribution \(P(c_b | c_1, \cdots , c_{b-1}, c_{b+1}, \cdots , c_{{K} {N_t}})\). The state of the q-th GS at the s-th sample after drawing the b-th bit is denoted as

$$\begin{aligned} {\varvec{c}}_b^{(q,s)} = ( c^{(q,s)}_{1}, \cdots , c^{(q,s)}_{b}, c^{(q,s-1)}_{b+1}, \cdots , c^{(q,s-1)}_{{K} {N_t}} ) \end{aligned}$$
(4)

and thus contains bits from the previous sample \({\varvec{c}}^{(q,s-1)}\) and the current sample \({\varvec{c}}^{(q,s)}\).

Common Starting Point. All chains start with \({\varvec{c}}^{(-1)}\), which maps onto \({\varvec{x}}^{(-1)}\) with \(x_t = 1+j\), i.e. we have \(\varphi ( {\varvec{c}}^{(-1)} ) = {\varvec{x}}^{(-1)}\). This concept enables the initialization of parallel independent Gibbs Samplers [12].

Symbol Deltas. When the GS state changes, at most one bit is different. We introduce the notation

$$\begin{aligned} \begin{aligned} \left| \varDelta \right| ^2 _b ( {\varvec{c}} )&= | \varphi _n ( {\varvec{c}}_{b1} ) |^2 - | \varphi _n ( {\varvec{c}}_{b0} ) |^2 \\ \varDelta _b ( {\varvec{c}} )&= \varphi _n ( {\varvec{c}}_{b1} ) - \varphi _n ( {\varvec{c}}_{b0} ) \end{aligned} \end{aligned}$$
(5)

where \(\varphi _n\) is the mapping rule for the n-th antenna, and the b-th bit belongs to the n-th antenna.

Recursive Dot-Product. The algorithm tracks the current value of

$$\begin{aligned} {\varvec{S}} = {\varvec{y}}^{\text {mf}} - \tilde{{\varvec{R}}} \varphi ( {\varvec{c}}_b^{(q,s)} ) \end{aligned}$$
(6)

where \(\tilde{{\varvec{R}}}\) is the matrix \({\varvec{R}}\) with the diagonal set to zero. Starting from \({\varvec{S}}^{(-1)} = {\varvec{y}}^{\text {mf}} - \tilde{{\varvec{R}}} {\varvec{x}}^{(-1)}\), it updates \({\varvec{S}}\) recursively when \({\varvec{c}}_b^{(q,s)}\) changes.
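The recursion can be checked numerically. The sketch below is an illustration in floating point, not the fixed-point data path: it builds \({\varvec{S}}\) according to Eq. (6) and then updates it when the symbol of one antenna changes by `delta`. The helper names are hypothetical, and matrices are plain lists of rows of complex numbers.

```python
def gram(H):
    """Gram matrix R = H^H H; H is a list of N_r rows with N_t complex entries."""
    n_r, n_t = len(H), len(H[0])
    return [[sum(H[k][i].conjugate() * H[k][j] for k in range(n_r))
             for j in range(n_t)] for i in range(n_t)]


def matched_filter(H, y):
    """Matched filter output y_mf = H^H y."""
    n_r, n_t = len(H), len(H[0])
    return [sum(H[k][i].conjugate() * y[k] for k in range(n_r)) for i in range(n_t)]


def init_s(R, y_mf, x):
    """Eq. (6): S = y_mf - R~ x, with R~ equal to R except for a zero diagonal."""
    n_t = len(x)
    return [y_mf[i] - sum(R[i][t] * x[t] for t in range(n_t) if t != i)
            for i in range(n_t)]


def update_s(S, R, n, delta):
    """Symbol n changed by delta: entry n stays, the other N_t - 1 entries move."""
    return [S[i] if i == n else S[i] - R[i][n] * delta for i in range(len(S))]
```

Because the diagonal of \(\tilde{{\varvec{R}}}\) is zero, the entry of \({\varvec{S}}\) that belongs to the changed antenna is not affected, so only the remaining \({N_t}-1\) entries need an update.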

Recursive Metric Computation. We introduce an arbitrary offset such that \(\mu ({\varvec{c}}^{(-1)}) = 0\), which cancels out in Eq. (3). Let the distance update be

$$\begin{aligned} \delta ^{(q,s)}_{b} = \mathrm {Re}\lbrace r_{nn} \rbrace \left| \varDelta \right| ^2 _b ( {\varvec{c}}^{(q,s-1)} ) - 2 \mathrm {Re}\lbrace S_n^{*} \varDelta _b ( {\varvec{c}}^{(q,s-1)} ) \rbrace \end{aligned}$$
(7)

where the b-th bit belongs to the n-th antenna, then the metric update is

$$\begin{aligned} \varDelta \mu = \frac{1}{N_0} \delta ^{(q,s)}_{b} + \lambda ^a_b \end{aligned}$$
(8)

which we either subtract from or add to the current metric \(\mu ({\varvec{c}}^{(q,s)})\), depending on the bit flip direction, if the b-th bit changes.
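As a plausibility check, the distance update of Eq. (7) should equal the difference of the two squared Euclidean distances \(\Vert {\varvec{y}} - {\varvec{H}} \varphi ({\varvec{c}}_{b1}) \Vert ^2 - \Vert {\varvec{y}} - {\varvec{H}} \varphi ({\varvec{c}}_{b0}) \Vert ^2\). The following floating-point sketch compares both; it leaves out the scaling by \(1/N_0\), and the function names and argument order are assumptions made for this illustration.

```python
def dist2(y, H, x):
    """Squared Euclidean distance ||y - H x||^2."""
    return sum(abs(y[k] - sum(h * xt for h, xt in zip(H[k], x))) ** 2
               for k in range(len(H)))


def delta_eq7(y, H, x, n, x0, x1):
    """Distance update of Eq. (7) when antenna n switches from symbol x0 to x1."""
    n_r, n_t = len(H), len(x)

    def r(i, j):  # entry (i, j) of R = H^H H
        return sum(H[k][i].conjugate() * H[k][j] for k in range(n_r))

    y_mf_n = sum(H[k][n].conjugate() * y[k] for k in range(n_r))
    s_n = y_mf_n - sum(r(n, t) * x[t] for t in range(n_t) if t != n)  # Eq. (6)
    abs2_delta = abs(x1) ** 2 - abs(x0) ** 2  # |Delta|^2_b of Eq. (5)
    delta = x1 - x0                           # Delta_b of Eq. (5)
    return r(n, n).real * abs2_delta - 2.0 * (s_n.conjugate() * delta).real
```

With this sign convention and the metric of Eq. (1), \(\mu ({\varvec{c}}_{b1}) = \mu ({\varvec{c}}_{b0}) - \varDelta \mu \), which is why the update is subtracted for a flip from 0 to 1 and added for a flip from 1 to 0.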

Log-Domain Bit Probability. The term

$$\begin{aligned} \gamma = \frac{1}{\eta N_0} \delta ^{(q,s)}_{b} + \lambda ^a_b \end{aligned}$$
(9)

expresses the probability of the next bit being 1 in the log-domain, where the temperature parameter \(\eta \) mitigates lock-in effects in the high-SNR regime [10]. For the conversion to the linear domain, we apply a piece-wise linear approximation to \({\text {logistic}}(\gamma ) = 1/(1+e^{-\gamma })\) as in [11, 12]. To this end, the GS simply limits \(\gamma \) to the range \([-4, 4)\) and compares \(-\gamma \) to a uniformly distributed pseudo-random number \(u \sim U(-4,4)\) in the same range.
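The saturate-and-compare step can be sketched as follows. This is an illustration under two assumptions: the bit becomes 1 when \(u < -\gamma \), a direction derived here from the sign convention of Eqs. (1) and (3) rather than stated in the text, and the upper saturation limit is one least significant bit below 4, using the [3.29] word length listed for \(\gamma \) in Sect. 7.1. The function names are hypothetical.

```python
GAMMA_LSB = 2.0 ** -29  # resolution of gamma, assuming the [3.29] format of Sect. 7.1


def saturate(gamma):
    """Limit gamma to the half-open range [-4, 4)."""
    return max(-4.0, min(gamma, 4.0 - GAMMA_LSB))


def draw_bit(gamma, u):
    """Draw the next bit from a sample u ~ U(-4, 4).

    Assumption: the bit becomes 1 when u < -gamma; the comparison direction
    is inferred from Eqs. (1) and (3) and is not spelled out in the text.
    """
    return int(u < -saturate(gamma))


def p_one(gamma):
    """Probability of drawing a 1 that results from the comparison above."""
    return (4.0 - saturate(gamma)) / 8.0
```

The resulting probability is a straight line through 0.5 at \(\gamma = 0\) that is clipped outside \([-4, 4)\), which is the piece-wise linear shape referred to above.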

Fig. 2. Partitioning of the low-level algorithm: Front-end Processing, Gibbs Sampler, Metric Update, LLR Computation

4.2 Overall Algorithm Design

Figure 2 depicts the algorithm partitioned into four different parts: the Front-end Processing (FEP), which transforms the channel observations, the parallel Gibbs Samplers (GS) realizing the Markov chains, the Metric Update (M) tracking the current metric state, and the LLR Computation, which searches for the two maximum metric values per bit.

4.3 Front-end Processing

First, choose \(\varGamma = 2^\alpha / (\eta N_0)\) with \(\alpha \) such that \(\varGamma \in [0.5,1)\). We assume \(\eta = 2\). The FEP computes

$$\begin{aligned} \begin{aligned} {\varvec{R}}&= \varGamma {\varvec{H}}^H {\varvec{H}} \\ {\varvec{S}}^{(-1)}&= \varGamma {\varvec{H}}^H {\varvec{y}} - \tilde{{\varvec{R}}} {\varvec{x}}^{(-1)} \\ \end{aligned} \end{aligned}$$
(10)

as described in Sect. 4.1 (Recursive Dot-Product) but scaled by \(\varGamma \).
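One way to obtain \(\alpha \) and \(\varGamma \) in a software model is the mantissa/exponent decomposition of \(1/(\eta N_0)\). The sketch below uses `math.frexp` from the Python standard library for this; it only illustrates the normalization and makes no claim about how the hardware or its driver determines \(\alpha \). The function name is hypothetical.

```python
import math


def choose_scaling(n0, eta=2.0):
    """Return (alpha, Gamma) with Gamma = 2**alpha / (eta * n0) in [0.5, 1)."""
    mantissa, exponent = math.frexp(1.0 / (eta * n0))
    # frexp guarantees 1/(eta*n0) = mantissa * 2**exponent with 0.5 <= mantissa < 1,
    # so Gamma is the mantissa and alpha is the negated exponent.
    return -exponent, mantissa
```

For example, \(N_0 = 0.1\) and \(\eta = 2\) give \(1/(\eta N_0) = 5\), hence \(\alpha = -3\) and \(\varGamma = 0.625\).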

4.4 Gibbs Sampler

Algorithm 1 describes how the GS sequentially draws bits of the candidate sequence \({\varvec{c}}^{(q,s)}\). GS and Metric Update share the term \(\delta _b^{(q,s)}\) computed in line 6. Note the back-shifting with \(\alpha \) to compensate the normalization of \(\varGamma \). For the first sample (\(s = 0\)), only the prior LLRs are used, in order to draw \({\varvec{c}}^{(q,0)} \sim {\varvec{\lambda }}^a\) (line 7). The saturation in line 8 produces a threshold in the range \([-4, 4)\) (cf. Sect. 4.1 (Log-Domain Bit Probability)). The comparison to a uniformly distributed pseudo-random number in the same range (line 13) yields the new bit value. Afterwards, we need to update the \({\varvec{S}}\) state (lines 14–16).

Algorithm 1. Gibbs Sampler (listing omitted)

4.5 Metric Update

Algorithm 2 recursively computes the current candidate’s metric \(\mu ({\varvec{c}}_b^{(q)})\), using the state \(\mu ^{(q)}\), and produces the two metrics for the current bit \(\mu ({\varvec{c}}_{b0/1})\). As stated earlier, we arbitrarily set the metric for the common starting point to zero (line 1). Lines 4 to 9 show the underlying metric update. Of the two possible states, one is identical to the current state, and thus has the same metric value (line 4). The other one is updated according to the direction of the bit flip (lines 6 and 8). In line 9, we select one of the two as the new current metric. It remains unaltered if the bit does not change.

Algorithm 2. Metric Update (listing omitted)
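The update rule described above can be written compactly. The following sketch is reconstructed from Eqs. (1) and (8) and from the prose description, not taken from the authors' listing, so the function signature and the variable names are assumptions.

```python
def metric_update(mu, old_bit, new_bit, d_mu):
    """Return (mu0, mu1, mu_next) for the bit that is currently drawn.

    mu      : metric of the current state, whose bit b equals old_bit
    d_mu    : metric update of Eq. (8), i.e. mu(c_b0) - mu(c_b1)
    new_bit : value selected by the Gibbs Sampler
    """
    if old_bit == 0:
        mu0, mu1 = mu, mu - d_mu  # flipping 0 -> 1 subtracts the update
    else:
        mu0, mu1 = mu + d_mu, mu  # flipping 1 -> 0 adds the update
    return mu0, mu1, (mu1 if new_bit else mu0)
```

If the drawn bit equals the old bit, the returned state metric is identical to the input metric, which corresponds to a register that is written only when the bit flips.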

4.6 LLR Computation

Algorithm 3 searches for the maximum metrics among all chains, then compares these local maxima with the current global maxima. It excludes the \(s = 0\) step, which is the transition from \({\varvec{c}}^{(-1)}\) to \({\varvec{c}}^{(q,0)}\), from the search (line 3). The computation of the extrinsic LLRs in line 7 is included, as it can be easily implemented in hardware.

Algorithm 3. LLR Computation (listing omitted)

5 VLSI Architecture

5.1 Overview

The macro pipeline of FEP-Circuit and MCMC core, shown in Fig. 3, constitutes the proposed MCMC detector. Both components require multiple clock cycles per input vector, but double buffering between FEP and core ensures that the computations can overlap. The MCMC core in turn contains four stages connected via registers. The stages exchange information in every clock cycle and thus effectively operate as a pipeline.

The FSM and the multiplexers (e.g. \(\lambda _b^a\), and for the column of \({\varvec{R}}\)) are part of the Mux stage. There are \(N_p\) GS-Circuits implementing Algorithm 1. For every GS-Circuit, there is one corresponding M-Circuit executing Algorithm 2. The L-Circuit performs the LLR Computation in Algorithm 3. Every GS/M-Circuit can run several chains sequentially. For example \(N_q = 8\) chains can be run on \(N_p = 4\) GS/M-Circuits by executing two chains sequentially per GS/M-Circuit. We can also turn off some GS/M-Circuits, e.g. run \(N_q = 4\) chains on \(N_p = 8\) GS/M-Circuits with four inactive circuits.

Fig. 3. Architecture design of the MCMC detector. The n-th column \({\varvec{r}}_n\) of \({\varvec{R}}\) and \(\lambda _b^a\) are selected in the Mux stage.

5.2 FEP-Circuit

The architecture, depicted in Fig. 4, contains five multipliers in total. Using four of these, the dot-product for the terms \({\varvec{H}}^H {\varvec{y}}\) and \({\varvec{R}} = {\varvec{H}}^H {\varvec{H}}\) requires \({N_r} \) cycles per complex entry. We need only the lower triangular part of \({\varvec{R}}\) due to \({\varvec{R}}^H = {\varvec{R}}\). The architecture computes either one complex off-diagonal entry or two real diagonal entries in parallel. The fifth multiplier alternately multiplies real and imaginary parts with \(\varGamma = \frac{2^\alpha }{N_0 \eta }\). In parallel, we multiply the entries of \({\varvec{R}}\) with \(x_t^{(-1)} = 1+j\) (cf. Sect. 4.1 (Common Starting Point)) using only adders and multiplexers, and accumulate the results to obtain \({\varvec{S}}\).

Fig. 4. FEP-Circuit

5.3 GS/M-Circuit

Figure 5 depicts the GS-Circuit. The \(\left| \varDelta \right| ^2 \)-multiplier, depicted in detail in Fig. 6(a), exploits the limited range of \(\left| \varDelta \right| ^2 \in \{ -3,-2, \cdots , 3 \} \times \{8, 16\}\), which assumes only 14 different values for 4-/16-/64-QAM. The factor \(\varDelta \) is either purely real or purely imaginary. We define \(|\varDelta | = |\mathrm {Re}\lbrace \varDelta \rbrace | + j |\mathrm {Im}\lbrace \varDelta \rbrace |\). Then we have \(\mathrm {Re}\lbrace S_n^* |\varDelta | \rbrace = \mathrm {Re}\lbrace S_n\rbrace \mathrm {Re}\lbrace |\varDelta |\rbrace + \mathrm {Im}\lbrace S_n\rbrace \mathrm {Im}\lbrace |\varDelta |\rbrace \). For 4-/16-/64-QAM, the non-zero part of \(|\varDelta |\) assumes only the four values \(\{1,3,5,7\} \times 2\), which greatly simplifies the \(\varDelta \)-multiplier (Fig. 6(b), only shifts, adders and multiplexers). The control of the subsequent adder-subtractor considers whether \(\varDelta < 0\) and whether \(\varDelta \) is imaginary to decide whether to add or subtract. To generate the independent first samples, a multiplexer ensures \(\gamma = \lambda _b^a\). For the external initialization, a further multiplexer selects \(c_b^{(q,s)} = c_b^{\text {ext}}\). The circuit uses a 32-bit maximum length Galois-LFSR that generates one 32-bit word per clock cycle. The timing critical path of the whole MCMC detector starts in the \(\left| \varDelta \right| ^2 \)-control, goes through the multiplexers in the \(\left| \varDelta \right| ^2 \)-multiplier towards \(c_b^{(q,s)}\), then finishes in the write-enable control for the \({\varvec{S}}\) registers.
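The restriction to the four factors \(\{1,3,5,7\} \times 2\) is what makes a multiplier built from shifts and adders possible. As an illustration only (the actual structure of Fig. 6(b) may differ), each product can be formed from at most two shifted copies of an integer operand. The function name and the chosen decomposition are assumptions.

```python
def mul_2m(s, m):
    """Multiply the integer s by 2*m for m in {1, 3, 5, 7} without a multiplier."""
    if m == 1:
        return s << 1               # 2s
    if m == 3:
        return (s << 1) + (s << 2)  # 2s + 4s = 6s
    if m == 5:
        return (s << 1) + (s << 3)  # 2s + 8s = 10s
    if m == 7:
        return (s << 4) - (s << 1)  # 16s - 2s = 14s
    raise ValueError("m must be one of 1, 3, 5, 7")
```

In this decomposition the selection of the shifted operands depends only on \(m\), so it maps naturally onto multiplexers in front of a single adder-subtractor.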

Fig. 5. GS-Circuit. The arithmetic shifter (ASH) reverts the normalization of \(\varGamma \).

Fig. 6. Detailed view of the simplified multipliers

Fig. 7. M-Circuit

The M-Circuit, shown in Fig. 7, implements Algorithm 2 using a write-enabled register for the current metric, which is updated when we flip the current bit. The multiplication with \(\eta \) is implemented as a constant shift.

Update-S-Circuit. The Update-S-Circuit shown in Fig. 8 has \(({N_t}-1)\) complex-valued \(\varDelta \)-multipliers, i.e. \(2({N_t}-1)\) instances of Fig. 6(b). Using multiplexers, we can update all \({N_t} \) elements of \({\varvec{S}}\), although only \({N_t}-1\) change per clock cycle. The entries of \({\varvec{R}}\), e.g. \(r_{1n}, r_{2n}\), are selected in the Mux stage. Similar to the GS-Circuit, the adder-subtractor control considers whether \(\varDelta < 0\), whether \(|\varDelta |\) is imaginary, and additionally the old bit \(c_b^{(q,s-1)}\) and whether the input needs to be conjugated, i.e. \(\mathrm {Im}\lbrace r_{tn}\rbrace = -\mathrm {Im}\lbrace r_{nt}\rbrace \). The write-enabled \({\varvec{S}}\) registers are updated if the current bit flips. This write-enable control is part of the aforementioned critical path.

Fig. 8. Update-S-Circuit. Example for \({N_t} = 4\) antennas. All units exist for the real and for the imaginary parts respectively (not drawn).

5.4 L-Circuit

The L-Circuit shown in Fig. 9 contains two register files (RFs) for the current maximum metrics with \({K} {N_t} \) entries each. We use tokens propagating alongside the data to indicate whether a value is valid. The Compare Select (CS) elements select the maximum of the valid inputs. The registers also store tokens per entry, which are reset to zero when the processing of a symbol vector starts. After the scalar subtractor, we saturate the extrinsic LLRs to limit their dynamic range. The saturation has a positive influence on the communications performance.

Fig. 9. L-Circuit
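The behaviour of the L-Circuit can be modelled in a few lines. The sketch below is a behavioural illustration, not a description of the circuit: `None` stands in for a cleared validity token, the class and method names are hypothetical, and the symmetric saturation limit is a free parameter because the exact limits are not stated here.

```python
class MaxMetricFile:
    """Two register files holding the best metric per bit for bit values 0 and 1."""

    def __init__(self, n_bits):
        # None plays the role of a cleared validity token.
        self.best = [[None, None] for _ in range(n_bits)]

    @staticmethod
    def compare_select(a, b):
        """Maximum of the valid inputs; an invalid input (None) never wins."""
        if a is None:
            return b
        if b is None:
            return a
        return max(a, b)

    def update(self, b, metrics):
        """Merge the (mu0, mu1) pairs of all parallel chains for bit b."""
        for mu0, mu1 in metrics:
            self.best[b][0] = self.compare_select(self.best[b][0], mu0)
            self.best[b][1] = self.compare_select(self.best[b][1], mu1)

    def extrinsic(self, lam_a, limit):
        """Eq. (3), then lambda^e = lambda^p - lambda^a, then saturation."""
        out = []
        for (m0, m1), la in zip(self.best, lam_a):
            lam_e = (m0 - m1) - la
            out.append(max(-limit, min(limit, lam_e)))
        return out
```

Creating a new `MaxMetricFile` corresponds to resetting all tokens at the start of a symbol vector. `extrinsic` expects every bit to have been updated at least once, which holds here because each Gibbs Sampler step produces both metrics of the current bit.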

6 Differences to Reference Architecture

The proposed architecture is a complete redesign of [12]. This section explicitly highlights the architectural modifications. Both the original and the new timing critical path are located in the GS-Circuit.

Multiplier-Free Gibbs Sampler: Similar to [11], we move the multiplication with \(1/(\eta N_0)\) out of the GS into the FEP, by scaling \({\varvec{R}}\) and \({\varvec{S}}\) with \(\varGamma \). This removes the multiplier from the detector’s critical path, but increases the required word lengths.

Dynamic Scaling: The normalization of \(\varGamma \in [0.5,1)\) allows the use of smaller word lengths, mitigating the previously mentioned increase. Consequently, we need an arithmetic shifter in the GS-Circuit at the previous location of the multiplier, which reverts the normalization.

Pipelined Input Multiplexers: Our MCMC detector selects the column of \({\varvec{R}}\) and the entry of \({\varvec{\lambda }}^a\) in the new Mux stage in front of the GS stage. While this removes those multiplexers from the detector’s critical path, it adds an additional latency cycle.

Reduced Update-S-Circuit: We remove two \(\varDelta \)-multipliers (one per real and imaginary part) from the Update-S-Circuit, since in every cycle one of the entries of \({\varvec{S}}\) does not change. This requires multiplexers for the resource sharing, which are however not in the critical path and are smaller than the removed \(\varDelta \)-multipliers.

Shared Maximum Metric Register File: The RFs are moved from the M-Circuit [12] to the L-Circuit. This reduces the required RFs from \(N_p\) to one. We also add a pipeline register after the L-Circuit to improve timing, which requires another extra latency cycle. Also, our M-Circuit in Fig. 7 has one adder-subtractor instead of two adders, similar to [11].

Adder-Subtractor Units: These new units, placed right after the \(\varDelta \)-multipliers in the GS-Circuit and the Update-S-Circuit, replace the original adders and the conditional negation units. The control selects addition or subtraction depending on the sign of \(\varDelta \), whether \(\varDelta \) is imaginary, the old bit \(c_b^{(q,s-1)}\), and whether the entry needs to be conjugated, i.e. \(\mathrm {Im}\lbrace r_{tn}\rbrace = - \mathrm {Im}\lbrace r_{nt}\rbrace \).

Simplified Delta Multiplier: Our \(\varDelta \)-multipliers, used for \(\gamma \) and \({\varvec{S}}\), compute the absolute value \(|\varDelta |\). This removes one multiplexer stage from the critical path.

Postponed Conjugation: We store only the lower half of \({\varvec{R}}\). Due to the Hermitian property of \({\varvec{R}}\), we have \(\mathrm {Im}\lbrace r_{tn}\rbrace = - \mathrm {Im}\lbrace r_{nt}\rbrace \). The control of the subsequent adder-subtractor units accounts for the required negation, which replaces the explicit conjugation of [12].

7 Results

With the word lengths given in Sect. 7.1 and the throughput equations in Sect. 7.2, we first compare our model to the reference architecture [12] based on gate-level synthesis results, then we present post-layout results for different design-time variants of our architecture. Section 8 presents the algorithmic evaluations.

7.1 Simulation Setup

An 802.11n-like \(4\times 4\) MIMO system is considered assuming a spatially uncorrelated Rayleigh channel, perfect channel knowledge and a max-log BCJR decoder. For all results, we assumed a rate-5/6 tail-biting binary convolutional code with generator polynomials 0133 and 0171 and puncturing, a random interleaver and 64-QAM modulation (\(K = 6\)). The frame length of 2160 information bits equals the interleaver’s length, which is one OFDM symbol for this setup. For every data point, we simulated at least \(10^5\) frames. The average signal-to-noise ratio (SNR) per receive antenna is defined as \(\text{ SNR } = {{\mathbb {E}}[{\Vert {\varvec{H}}{\varvec{x}} \Vert ^2}]} / ({N_r} N_0)\). The required word lengths for an SNR loss of \(\le 0.1\,\mathrm{{dB}}\) compared to the floating-point model at a frame error rate (FER) of 10 % are: [integer.fractional] \({\varvec{y}}\) [7.8], \({\varvec{H}}\) [3.8], \({\varvec{\lambda }}^a\) [5.4], \(1 / N_0\) [6.11], \({\varvec{R}}\) [6.10], \({\varvec{S}}\) [9.9], \(\delta \) [17.6], \(\mu \) [19.5], \(\gamma \) [3.29], \(2^\alpha \delta \) [14.6], \(\alpha \) [4.0], \({\varvec{\lambda }}^e\) [8.4]. All word lengths are signed, given per entry, and identical for real and imaginary parts. The first chain (\(q = 0\)) is always initialized with the result of a hard-output zero-forcing MIMO detector. We assume \(N_q = 8\) chains with \(N_s = 8\) samples per chain (i.e. \(N_{gs} = 64\) in [12]) for the next three sections, but vary those parameters in Sect. 8.

7.2 Architecture

Our parameterized architecture implementation currently supports up to \(4 \times 4\) MIMO and 64-QAM. MIMO mode and QAM scheme can be configured at run-time within the supported range, which in turn can be configured at design-time. Each GS/M-pair can process up to 16 chains sequentially, with up to 16 samples per chain. The FEP-Circuit requires

$$\begin{aligned} n_{\text {fep}} = {N_r} \left( \left( {N_t} + 1 \right) {N_t}/ 2 + \lceil {N_t}/ 2 \rceil \right) + 3 \end{aligned}$$
(11)

cycles for its computation. This is slightly faster than the FEP-Circuit in [12]. The MCMC core runs for

$$\begin{aligned} n_{\text {gs}} = \frac{ N_q }{ N_p } \left( N_s + 1 \right) {K} {N_t} + 5 \end{aligned}$$
(12)

cycles. Compared to [12], we need two extra latency cycles (cf. Sect. 6, Pipelined Input Multiplexers and Shared Maximum Metric Register File). The code bit throughput of the architecture is \(\theta _c = \frac{ {K} {N_t}}{ n_{\text {gs}} } f_{\text {clk}}\) assuming \(n_{\text {gs}} \ge n_{\text {fep}} \) and sufficient input data.
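Equations (11) and (12) and the throughput expression are easy to evaluate in software. The sketch below is a direct transcription with one labelled assumption: \(N_q / N_p\) is rounded up so that configurations with inactive GS/M-Circuits (\(N_q < N_p\)) are handled. The function names are hypothetical.

```python
import math


def n_fep(n_t, n_r):
    """Eq. (11): cycles needed by the FEP-Circuit."""
    return n_r * ((n_t + 1) * n_t // 2 + math.ceil(n_t / 2)) + 3


def n_gs(n_q, n_p, n_s, k, n_t):
    """Eq. (12): cycles needed by the MCMC core.

    Assumption: ceil(N_q / N_p) replaces N_q / N_p, which is identical
    whenever N_p divides N_q and also covers the case N_q < N_p.
    """
    return math.ceil(n_q / n_p) * (n_s + 1) * k * n_t + 5


def code_bit_throughput(f_clk, n_q, n_p, n_s, k, n_t):
    """theta_c = K N_t / n_gs * f_clk, valid while n_gs >= n_fep."""
    return k * n_t / n_gs(n_q, n_p, n_s, k, n_t) * f_clk
```

For \({N_t} = {N_r} = 4\), \(K = 6\) and \(N_p = N_q = N_s = 8\), this gives \(n_{\text {fep}} = 51\) and \(n_{\text {gs}} = 221\) cycles. At the post-layout clock frequency of 479 MHz reported in Sect. 7.4, the throughput evaluates to about 52 Mbit/s, in line with the figure given there.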

7.3 Synthesis Results

We synthesized the design with Synopsys Design Compiler I-2013.12-SP2 in topographical mode using a 1.0 V standard-performance standard cell library for the UMC 90 nm SP-RVT LowK CMOS process. One gate-equivalent (GE) is the area of one 2-input drive-1 NAND gate. Figure 10 compares the four instances \(N_p = \{ 1,2,4,8 \}\) to [12]. While the most efficient design in [12] has an \(AT_{exec}\)-product of 181.7 kGE\(\upmu \)s, our proposed design achieves 50.0 kGE\(\upmu \)s, which is 3.6 times more efficient.

Fig. 10. Area vs. execution time based on the MCMC detector’s synthesis results, comparing this work to [12], assuming \({N_t} = 4\), \({K} = 6\)

Table 1. MCMC detector synthesis results

Table 1 lists the synthesis results for our fastest design instance and the reference design [12]. The FEP is larger (5 kGE), while the GS is smaller (\(-\)6.2 kGE per GS), since we moved the multiplier from the GS to the FEP. The additional area of the new arithmetic shifter is partially compensated for by the other improvements. The Update-S-Circuit becomes smaller (\(-\)2.5 kGE) since we save one complex \(\varDelta \)-multiplier and use \(|\varDelta |\) now. The saving effect is larger than the additional area from the multiplexers required for the resource sharing. The M-Circuit exhibits only about 7.4 % of the original area, since we moved the RFs to the L-Circuit, which consequently became larger (10 kGE). The remainder of the area (\(-\)12.9 kGE) is occupied amongst others by the \({\varvec{R}}\) column multiplexers. The area is reduced because the multiplexers are no longer in the timing critical path.

In total, the redesigned architecture occupies only about 48 % of the original area for \(N_p = 8\). The saving depends on the number of GS/M-Circuits. The critical path was shortened by about 40 %, i.e. the maximum clock frequency increased from 312 MHz to 526 MHz.

7.4 Layout Results

A layout was obtained with Cadence SoC Encounter 9.1 for each configuration’s fastest design instance in order to further study the proposed architecture’s implementation complexity and to enable more precise comparison with future related work. All following area figures are taken from the layout results, depicted in Fig. 11. The consumed area slightly increased, while the achievable clock frequency decreased. It is interesting that the throughput mainly depends on the number of parallel GS/M-Circuits and the chain parameters, i.e.

$$\begin{aligned} \theta _c = \frac{ {K} {N_t}}{ n_{\text {gs}} } f_{\text {clk}} \approx \frac{ N_p }{ N_q ( N_s + 1) } f_{\text {clk}} \end{aligned}$$
(13)

as can be seen in Fig. 11.

Fig. 11. Area vs. throughput based on the MCMC detector’s layout results. For each design-time configuration, the ASIC with the fastest clock is shown. As an example, the 16-QAM \(2 \times 2\) design supports one or two antennas and 4- or 16-QAM at run-time.

The largest instance, for 64-QAM, \({N_t} = 4\) and \(N_p = 8\), requires 149.5 kGE or 0.47 mm\(^{2}\) and achieves a maximum clock frequency of 479 MHz, yielding a code bit throughput of 52 Mbit/s. The fastest instance in terms of throughput supports 4-QAM, \({N_t} = 2\) and has \(N_p = 8\) GS/M-Circuits. It occupies in total an area of 70.7 kGE or 0.22 mm\(^{2}\) and runs at 664 MHz, which results in a throughput of 66 Mbit/s.

To determine the smallest instance, which should be the lower corner of the covered design space in Fig. 11, we synthesized the detector with \({N_t} = 2\), 4-QAM and one GS/M-Circuit for a target of 100 MHz. This ASIC consumes 19.2 kGE or 0.06 mm\(^{2}\), runs at 165 MHz and yields a 2.27 Mbit/s throughput. The FEP-Circuit and MCMC core require 10.9 kGE and 8.3 kGE respectively. Further word length optimizations could yield additional area reductions.

Table 2. Comparison to other reported MIMO detectors

Table 2 compares our work to a selection of reported MIMO detector implementations. We make three observations. First, in terms of hardware efficiency expressed in Mbit/s/kGE, the MCMC detector resides in about the same order of magnitude as the single-tree-search sphere decoder (STS-SD) [7], though our architecture is more than two times more efficient than our reference architecture [12]. The MCMC detector exhibits a deterministic run-time, which eases the receiver system design, while the SD can in principle always achieve near-capacity performance at the cost of a strongly varying run-time. Secondly, the MCMC detectors (and the STS-SD) are about one order of magnitude less efficient than the linear [4, 5], iterative-linear [6] detectors, and most notably the fixed-complexity sphere decoder (FCSD) [8], which achieves close-to-optimal communications performance at a deterministic run-time. In this perspective, the FCSD [8] is the best choice. In case that a particularly small implementation is needed, the MCMC might have an advantage, depending on how well the FCSD scales. Lastly, there are three cases for the preprocessing circuitry. Some implementations include it [4–6], it is optional for the MCMC detectors [12], and definitely required for the other reported work [7–9, 16]. This of course makes the area-throughput efficiency comparison difficult.

8 Algorithmic Considerations

In this section, we put the code bit throughput \(\theta _c\), as an implementation property of our architecture, in relation to our design’s communications performance in terms of SNR required to achieve a 10 % frame error rate. With this data, we can determine for example appropriate run-time parameters, or an appropriate run-time strategy to adapt them. Depending on the optimization criterion, the parameter choices might be different. Possible criteria are for example spectral efficiency or energy efficiency (as future work, we plan to perform energy estimations). The first part of this section gives a general overview, while the second part explains in more detail the iterative receiver figures.

In the remainder, we use the post-layout implementation results of the 64-QAM, \(N_t = 4\), \(N_p = 8\) instance that runs at 479 MHz. The simulation setup that we select resembles the highest-throughput mode of the 802.11n standard, which requires a high SNR. However, our experiments show that MCMC-based detection performs best in a mid-range SNR regime, in combination with lower-order modulation schemes. This setup can therefore be considered a kind of worst-case scenario for the MCMC detector.

We assume the same simulation setup as in Sect. 7.1. Additionally, we perform up to two detector-decoder iterations, i.e. per frame, we execute the MCMC detector and BCJR decoder twice. This gives us four run-time parameters: the number of chains \(N_{q1}\) and samples \(N_{s1}\) in the first iteration, and \(N_{q2}\) and \(N_{s2}\) in the second iteration. The short-hand notation GS1 8x6 denotes \(N_{q1} = 8\) and \(N_{s1} = 6\); similarly, we use GS2 \(N_{q2}\)x\(N_{s2}\) for the second iteration. We simulated the parameter set \(N_{q1/2} \in \{ 8, 16 \}\) and \(N_{s1/2} \in \{1,2, \dots , 16\}\). Thus all \(N_p = 8\) GS/M-Circuits are always active. The total number of samples per iteration, defined as \(N_{gs1/2} = N_{q1/2} \cdot N_{s1/2}\), is our measure for the invested effort.
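To make the size of this parameter space concrete, the following minimal sketch enumerates the simulated configurations and computes the effort measure \(N_{gs} = N_q \cdot N_s\). The names `CHAINS`, `SAMPLES`, `total_samples` and `label` are hypothetical helpers introduced only for illustration; `label` is one possible spelling of the short-hand notation.

```python
from itertools import product

CHAINS = (8, 16)         # simulated values of N_q per iteration
SAMPLES = range(1, 17)   # simulated values of N_s per iteration: 1..16


def total_samples(n_q, n_s):
    """Effort measure N_gs = N_q * N_s for one iteration."""
    return n_q * n_s


def label(iteration, n_q, n_s):
    """Short-hand label, e.g. iteration 1 with 8 chains and 6 samples."""
    return f"GS{iteration} {n_q}x{n_s}"


# All (N_q, N_s) pairs for a single iteration, and all pairs of such
# configurations for a receiver that runs two iterations.
per_iteration = list(product(CHAINS, SAMPLES))
two_iterations = list(product(per_iteration, per_iteration))

print(len(per_iteration), len(two_iterations))
print(label(1, 8, 6), total_samples(8, 6))
```

Each iteration therefore has 32 candidate configurations, and a two-iteration receiver has 1024 combinations from which the operating points are selected.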

Figure 12 shows four curves: two for the first iteration, and two for the second. The last part of this section explains how we determine the two second-iteration curves. They are Pareto-optimal in terms of SNR versus throughput.

Fig. 12. Code bit throughput over SNR required to achieve a 10 % frame error rate

Fig. 12 clearly shows a run-time tradeoff between SNR and throughput. As could be expected, more effort (i.e. more samples, more chains) results in a better algorithmic performance (lower SNR). An SNR gain has several possible uses, amongst others: we can extend the transmission range, we can serve more users (i.e. tolerate more interference), or we can lower the transmission power to save energy (and reduce interference to other users).

In the non-iterative case (first iteration), we observe a vanishing gain beyond five samples, both for eight and 16 chains. At around 33.5 dB, it is better to use 16 instead of eight chains. Interestingly, the best configuration switches from GS1 8x8 at 33.52 dB to GS1 16x4 at 32.53 dB. The total number of samples for both configurations is 64, but we gain about 1 dB SNR while approximately maintaining the throughput. The throughput is not completely identical due to the pipeline delays of the architecture.

Instead of using GS1 8x6 after GS1 8x5, a good decision would be to switch to the second iteration, and therefore never to use 16 chains in the non-iterative case. This yields a large SNR gain of about 2.7 dB at a similar throughput. At this transition point, we switch from GS1 8x5 to GS1 8x2-GS2 8x2. The throughput drops slightly from 77.15 Mbit/s to 74.65 Mbit/s. With \(N_{gs1} = 40\) compared to \(N'_{gs1} + N'_{gs2} = 32\), the MCMC detector’s effort remains very similar.

MCMC-based detection benefits greatly from iterative MIMO decoding. Switching from one to two iterations yields SNR gains as large as 6 dB. While the first iteration reaches at best about 31.7 dB, all SNR operating points of the second iteration are lower than 31 dB. A possible explanation is the guidance from the channel decoder in the form of prior LLRs. It helps the MCMC-based detection in two ways: we select the initial samples \({\varvec{c}}^{(q,0)} \sim p({\varvec{c}}) = f( {\varvec{\lambda ^a}} )\), and the transition probability \(\gamma \) depends on \({\varvec{\lambda ^a}}\). This seems to let the chains converge faster (in fewer samples) to interesting regions.

Fig. 13. Iterative receiver: two detector-decoder activations per symbol vector. For the throughput curve, dots and crosses denote whether eight or 16 chains are used in the first of the two iterations. For the two total-samples curves, the dots and crosses likewise denote whether eight or 16 chains are used in the respective iteration. The figure shows only the Pareto-optimal points in terms of SNR versus throughput determined from the data set with two iterations.

We now take a closer look at the second iteration. There are four parameters, \(N_{q1/2}\) and \(N_{s1/2}\). For a given SNR, we determine the parameter combination that yields the highest throughput. These Pareto-optimal points are shown in Fig. 13. For the two second-iteration curves in Fig. 12, we fix the number of chains in the first iteration to \(N_{q1} = 8\) and \(N_{q1} = 16\) respectively, then optimize over the remaining three parameters.
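The selection of Pareto-optimal points can be sketched as follows. This is a generic Pareto filter under our own assumptions, not necessarily the procedure used to produce the figures: each operating point is a tuple of required SNR, throughput and a label, and a point is kept only if no other point offers a higher or equal throughput at a lower or equal SNR. The sample points are illustrative placeholders, not measured values.

```python
def pareto_front(points):
    """Return the points that are Pareto-optimal for (low SNR, high throughput).

    points: iterable of (snr_db, throughput, label) tuples.
    """
    # Sweep from low to high SNR; at equal SNR, visit the fastest point first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    front = []
    best_throughput = float("-inf")
    for snr_db, throughput, label in ordered:
        # A point survives only if it is faster than every point that
        # needs the same or a lower SNR.
        if throughput > best_throughput:
            front.append((snr_db, throughput, label))
            best_throughput = throughput
    return front


# Placeholder operating points (SNR in dB, throughput in Mbit/s, label).
# These numbers are made up purely to exercise the filter.
candidates = [
    (25.0, 30.0, "A"),
    (26.0, 28.0, "B"),
    (26.0, 40.0, "C"),
    (27.0, 35.0, "D"),
    (28.0, 50.0, "E"),
]

front = pareto_front(candidates)
print([label for _, _, label in front])
```

In this toy data set, points B and D are dropped because C reaches a higher throughput at the same or a lower SNR.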

For our calculations, we assume that the channel decoder and the buffering between decoder and detector cause no additional delay. This is a somewhat idealized scenario, since it may imply a large area consumption, e.g. for the buffers, but it provides an upper bound for the achievable throughput. Thus, the throughput is given as

$$\begin{aligned} \theta _{c,2} = \frac{ {K} {N_t}}{ n_{\text {gs},1} + n_{\text {gs},2} } f_{\text {clk}} \end{aligned}$$
(14)

with \(n_{\text {gs},1/2} = \frac{ N_{q1/2} }{ N_p } \left( N_{s1/2} + 1 \right) {K} {N_t} + 5\) and fixing \(N_p = 8\) here.
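As a plausibility check, Eq. (14) can be evaluated numerically. The sketch below rests on assumptions that are ours: we read \(n_{\text {gs}}\) as the number of clock cycles spent per symbol vector in one iteration, take \({K} = 6\) bits per symbol for 64-QAM, and obtain the single-iteration throughput by using only one term in the denominator. The function names are hypothetical.

```python
def cycles_per_vector(n_q, n_s, bits_per_symbol, n_t, n_p=8, pipeline=5):
    """n_gs = N_q / N_p * (N_s + 1) * K * N_t + 5 for one iteration."""
    return n_q / n_p * (n_s + 1) * bits_per_symbol * n_t + pipeline


def throughput_mbit_s(configs, bits_per_symbol=6, n_t=4, n_p=8, f_clk_mhz=479.0):
    """Code bit throughput for a list of (N_q, N_s) pairs, one per iteration.

    With two pairs this is Eq. (14); with one pair it is the assumed
    single-iteration counterpart.
    """
    total_cycles = sum(
        cycles_per_vector(n_q, n_s, bits_per_symbol, n_t, n_p)
        for n_q, n_s in configs
    )
    return bits_per_symbol * n_t / total_cycles * f_clk_mhz


# 8 chains with 5 samples in a single iteration.
single = throughput_mbit_s([(8, 5)])
# 8 chains with 2 samples in each of two iterations.
double = throughput_mbit_s([(8, 2), (8, 2)])

print(f"{single:.2f} Mbit/s, {double:.2f} Mbit/s")
```

Under these assumptions, the two configurations evaluate to 77.15 Mbit/s and 74.65 Mbit/s, which matches the throughput values quoted earlier for the transition from one to two iterations.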

We observe that more effort is required in the first iteration. For example, around 27 dB, the two configurations GS1 8x7 and GS2 8x4 are in use, i.e. \(N_{gs1} = 56\) total samples for the first iteration, and \(N_{gs2} = 32\) for the second.

At about 26.8 dB, we switch from eight to 16 chains in the first iteration. It appears that multiple short chains are favorable for the first iteration. Only at around 24 dB should the detector switch from eight to 16 chains in the second iteration. This is also the region where the effort rises significantly (near 23.6 dB), especially for the second iteration. This could be an indication for switching to three iterations.

From a pure SNR-throughput perspective, two iterations are better than a single one. As previously stated, we observe large SNR gains from iterating, and the best non-iterative operating point is off by about 0.7 dB compared to the worst second-iteration point. However, this ignores the hardware cost caused by the required buffering and the increased throughput requirement on the detector and decoder architectures. A realistic comparison depends on the overall objective, e.g. lowest energy, small area or best spectral efficiency, and on additional constraints such as the minimum supported bandwidth. While this is out of scope here, we think that our data outlines the run-time adaptability of the MCMC-based MIMO detection architecture. It also shows that the architecture performs particularly well in iterative receivers, which makes it a reasonable candidate to consider in the design of such a system.

9 Conclusions and Outlook

We have presented synthesis and layout results of the proposed MCMC detector architecture. The area reduction of up to 52 % and the clock period shortened by up to 40 % indicate that the proposed architectural modifications to the reference design are effective. Our extensive data set for the communications performance further highlights the available tradeoff between signal-to-noise ratio and architecture throughput. With its run-time adaptability covering a large design space, our detector can cope with a wide range of channel conditions at the appropriate effort. Although it is a stochastic detector, its completely deterministic run-time eases scheduling at the system level, i.e. inside a complex iterative receiver.

Still, the architecture suffers from a relatively low but deterministic throughput, which stems from the MCMC detection method itself. The main advantage appears to be its simple scalability through \(N_p\) and configurability through \({N_t} \) and \({K} \). This allows the architecture to cover a large design space. Practically, only the availability of sufficient data might limit the architectural parallelism.

As future work, we plan to correlate algorithmic performance with energy consumption, which might reveal another tradeoff capability of the proposed design.