**17 Aug 2016**

# Xilinx FPGA Enables Scalable MIMO Precoding Core

### Researchers at Bell Labs Ireland built a frequency-dependent precoding core with high-performance FPGAs for generalized MIMO-based wireless communications systems.

Massive-MIMO (multiple-input, multiple-output) wireless systems have risen to the forefront as the preferred foundation architecture for 5G wireless networks. A low-latency precoding implementation scheme is critical for enjoying the benefits of the multi-transmission architecture inherent in the MIMO approach. Our team built a high-speed, low-latency precoding core with Xilinx System Generator and the Vivado Design Suite that is simple and scalable.

Due to their intrinsic multiuser spatial-multiplexing transmission capability, massive-MIMO systems significantly increase the signal-to-interference-and-noise ratio at both the legacy single-antenna user equipment and the evolved multi-antenna user terminals. The result is more network capacity, higher data throughput and more efficient spectral utilization.

But every stick has two ends, and so does massive-MIMO technology. To use it, telecom engineers need to build multiple RF transceivers and multiple antennas based on a radiating phased array. They also have to utilize digital horsepower to perform the so-called precoding function.

Our solution was to build a low-latency and scalable frequency-dependent precoding piece of intellectual property (IP), which can be used in Lego fashion for both centralized and distributed massive-MIMO architectures. Key to this DSP R&D project were high-performance Xilinx 7 series FPGAs, along with Xilinx’s Vivado Design Suite 2015.1 with System Generator and MATLAB/Simulink.

## Precoding in Generalized MIMO Systems

In a cellular network, user data streams that radiate from generalized MIMO transmitters will be “shaped” in the air by the so-called channel response between each transmitter and receiver at a particular frequency. In other words, different data streams will go through different paths, reaching the receiver at the other end of the airspace. Even the same data stream will behave differently because of a different “experience” in the frequency domain.

This inherent wireless-transmission phenomenon is equivalent to applying a finite impulse response (FIR) filter with particular frequency response on each data stream, resulting in poor system performance due to the introduced frequency “distortion” by the wireless channels. If we treat the wireless channel as a big black box, only the inputs (transmitter outputs) and outputs (receiver inputs) are apparent at the system level.

We can actually add a pre-equalization black box at the MIMO transmitter side with inversed channel response to precompensate the channel black-box effects, and then the cascade system will provide reasonable “corrected” data streams at the receiver equipment.
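This cascade can be illustrated with a toy numerical model. The sketch below is an illustration only, not the article's algorithm: it assumes a short complex FIR channel applied as circular convolution, and a precoder that inverts the channel response in the frequency domain so the "corrected" stream arrives intact.

```python
import numpy as np

# Toy pre-equalization demo (assumptions: a hypothetical 3-tap channel,
# circular convolution via the FFT, QPSK-like symbols).
rng = np.random.default_rng(0)
N = 256
h = np.array([1.0, 0.4 + 0.2j, 0.1j])                     # channel impulse response
x = rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)  # transmit data stream

H = np.fft.fft(h, N)                       # channel frequency response
x_pre = np.fft.ifft(np.fft.fft(x) / H)     # pre-equalized ("precoded") signal
y = np.fft.ifft(np.fft.fft(x_pre) * H)     # what the channel does in the air

print(np.allclose(y, x))                   # the channel distortion is undone
```

The precoder and the channel cancel, so the receiver sees the original stream; in practice the channel response must first be estimated, which this toy model skips.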

We call this pre-equalization approach precoding, which basically means applying a group of “reshaping” coefficients in the transmitter chain. For example, if we are going to transmit NRX independent data streams with NTX (number of transmitters) antennas, we will need to perform pre-equalization precoding at a cost of NRX × NTX complex linear convolution operations and the corresponding combining operations before radiating NTX RF signals into the air.

A straightforward low-latency implementation of complex linear convolution is a FIR-type complex discrete digital filter in the time domain.
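As a minimal sketch of that statement, here is a direct-form complex FIR in Python; the 4-tap filter is a hypothetical stand-in for the article's 100-plus-tap branches.

```python
import numpy as np

# Direct-form complex FIR: y[n] = sum_k taps[k] * x[n-k]
# (hypothetical short tap set for brevity)
taps = np.array([0.5 + 0.1j, -0.2j, 0.3, 0.1 - 0.1j])

def fir(x, taps):
    y = np.zeros(len(x), dtype=complex)
    state = np.zeros(len(taps), dtype=complex)   # delay line, newest first
    for n, sample in enumerate(x):
        state = np.roll(state, 1)
        state[0] = sample
        y[n] = np.dot(taps, state)               # one MAC pass per output
    return y

x = np.arange(8) + 1j * np.arange(8)
# The direct form matches NumPy's linear convolution (truncated to len(x)):
assert np.allclose(fir(x, taps), np.convolve(x, taps)[:len(x)])
```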

## System Functional Requirements

Under the mission to create a low-latency precoding IP core, our team faced several essential requirements:

- Precode one data stream into multiple-branch parallel data streams with different sets of coefficients.
- Place a 100-plus-tap complex asymmetric FIR function at each branch to provide reasonable precoding performance.
- Update the precoding coefficients frequently.
- Make the core easy to update and expand to support different scalable system architectures.
- Keep precoding latency as low as possible within the given resource constraints.

Moreover, besides attending to the functional requirements for a particular design, we had to be mindful of hardware resource constraints as well. In other words, creating a resource-friendly algorithm implementation would be beneficial in terms of key limited hardware resources such as DSP48 slices, the dedicated hardware multipliers on Xilinx FPGAs.

## High-speed Low-latency Precoding (HLP) Core Design

Essentially, scalability is a key feature that must be addressed before you begin a design of this nature. A scalable design will enable a sustainable infrastructure evolution in the long term and lead to an optimal, cost-effective deployment strategy in the short term. Scalability comes from modularity. Following this philosophy, we created a modularized generic complex FIR filter evaluation platform in Simulink with Xilinx System Generator.

##### Figure 1 Top-level system architecture

Figure 1 illustrates the top-level system architecture. `Simulink_HLP_core` describes multibranch complex FIR filters with discrete digital filter blocks in Simulink, while `FPGA_HLP_core` realizes multibranch complex FIR filters with Xilinx resource blocks in System Generator, as shown in Figure 2.

##### Figure 2 Xilinx resource blocks in System Generator

Different FIR implementation architectures lead to different FPGA resource utilizations. Table 1 compares the complex multipliers (CM) used in a 128-tap complex asymmetric FIR filter in different implementation architectures. We assume the I/Q data rate is 30.72 Msamples/second (20-MHz-bandwidth LTE-Advanced signal).

The full parallel implementation architecture is quite straightforward, mapping directly onto the direct-form I FIR architecture, but it uses a lot of CM resources. A full serial implementation architecture uses the fewest CM resources by sharing the same CM unit across 128 operations in a time-division multiplexing (TDM) manner, but requires a clock rate beyond the reach of state-of-the-art FPGAs.

A practical solution is to choose a partially parallel implementation architecture, which splits the sequential long filter chain into several segmental parallel stages. Two examples are shown in Table 1.

##### Table 1 Complex multiplier (CM) utilisation comparison for a 128-tap complex asymmetric FIR

| | Full Parallel | Full Serial | Partial Parallel-A* | Partial Parallel-B |
|---|---|---|---|---|
| Required Fclk (MHz) | 30.72 × 1 = 30.72 | 30.72 × 128 = 3932.16 | 30.72 × 16 = 491.52 | 30.72 × 8 = 245.76 |
| 1-branch LC core | 128 CM | 1 CM | 8 CM | 16 CM |
| 4-branch LC core | 512 CM | 4 CM | 32 CM | 64 CM |

\* Suggested high-speed, low-latency precoding (HLP) core architecture.

We went for plan A due to its minimal CM utilization and reasonable clock rate. We can actually determine the final architecture by manipulating the data rate, clock rate and number of sequential stages thus:

F_clk = F_data × N_TAP / N_SS

where N_TAP and N_SS represent the length of the filters and the number of sequential stages.
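As a quick sanity check, the relationship F_clk = F_data × N_TAP / N_SS (each CM is time-shared across N_TAP / N_SS taps) reproduces the required clock rates of Table 1:

```python
# Required internal clock rates, assuming F_clk = F_data * N_TAP / N_SS
# with one complex multiplier per segment.
F_data, N_TAP = 30.72, 128   # I/Q sample rate (MHz) and filter length

# N_SS: 128 = full parallel, 1 = full serial, 8 = plan A, 16 = plan B
f_clk = {n_ss: F_data * N_TAP / n_ss for n_ss in (128, 1, 8, 16)}
for n_ss in (128, 1, 8, 16):
    print(f"N_SS = {n_ss:3d}  ->  F_clk = {f_clk[n_ss]:.2f} MHz")
```

Plan A (N_SS = 8) lands at 491.52 MHz, a clock rate a 7 series device can close timing on, which is why it was chosen.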

Then we created three main modules:

Coefficients storage module: We utilized high-performance dual-port Block RAMs to store IQ coefficients that need to be loaded to the FIR coefficient RAMs. Users can choose when to upload the coefficients to this storage and when to update the coefficients of the FIR filters by wr and rd control signals.

Data TDM pipelining module: We multiplexed the incoming IQ data at a 30.72-MHz sampling rate to create eight pipelined (NSS = 8) data streams at a higher sampling rate of 30.72×128÷8 = 491.52 MHz. We then fed those data streams to a four-branch linear convolution (4B-LC) module.

4B-LC module: This module contains four independent complex FIR filter chains, each implemented with the same partially parallel architecture. For example, branch 1 is illustrated in Figure 3.

##### Figure 3 FIR Filter chain branch

Branch 1 includes four subprocessing stages isolated by registers for better timing: a FIR coefficients RAM (cRAM) sequential write and parallel read stage; a complex multiplication stage; a complex addition stage; and a segmental accumulation and downsample stage.

In order to minimize the I/O numbers for the core, our first stage involved creating a sequential write operation to load the coefficients from storage to the FIR cRAM in a TDM manner (each cRAM contains 16 = 128/8 IQ coefficients). We designed a parallel read operation to feed the FIR coefficients to the CM core simultaneously.

In the complex multiplication stage, in order to minimize the DSP48 utilization, we chose the efficient, fully pipelined three-multiplier architecture to perform complex multiplication at a cost of six clock cycles of latency.
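The three-multiplier trick is the classic Karatsuba-style identity: it trades one of the four real multiplications of a direct complex multiply for extra additions, saving a DSP48 slice per CM. A minimal sketch:

```python
def cmul3(a, b, c, d):
    """(a + jb)(c + jd) using three real multiplies instead of four.
    On an FPGA each real multiply maps to a DSP48 slice; the extra
    adds are cheap fabric logic."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2   # (real, imag)

re, im = cmul3(3.0, 4.0, 5.0, 6.0)
assert complex(re, im) == (3 + 4j) * (5 + 6j)   # -9 + 38j
```

The pipeline registers between k1/k2/k3 and the final add/subtract are what produce the six-cycle latency in the hardware version.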

Next, the complex addition stage aggregates the outputs of the CMs into a single stream. Finally, the segmental accumulation and downsample stage accumulates the temporary substreams for 16 clock cycles to derive the corresponding linear convolution results of a 128-tap FIR filter, and downsamples the high-speed streams back to match the data-sampling rate of the system—here, 30.72 MHz.
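Functionally (this is a behavioral model, not cycle-accurate, and the taps are random placeholders), the partially parallel branch amounts to eight 16-tap segments whose partial accumulations are summed, which can be checked against a plain linear convolution:

```python
import numpy as np

# Behavioral model of one partially parallel 128-tap branch:
# 8 parallel segments x 16 TDM accumulation cycles per output sample.
rng = np.random.default_rng(1)
N_TAP, N_SS = 128, 8
seg_len = N_TAP // N_SS                       # 16 taps per segment
taps = rng.normal(size=N_TAP) + 1j * rng.normal(size=N_TAP)
x = rng.normal(size=512) + 1j * rng.normal(size=512)

xp = np.concatenate([np.zeros(N_TAP - 1, complex), x])  # zero initial state
y = np.zeros(len(x), dtype=complex)
for s in range(N_SS):                         # the 8 parallel segments
    for k in range(seg_len):                  # 16 accumulation cycles each
        m = s * seg_len + k                   # absolute tap index
        y += taps[m] * xp[N_TAP - 1 - m : N_TAP - 1 - m + len(x)]

# Segment partials sum to the full 128-tap linear convolution:
assert np.allclose(y, np.convolve(x, taps)[:len(x)])
```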

## Design Verification

We performed the IP verification in two steps. First, we compared the outputs of the `FPGA_HLP_core` with the referenced double-precision multibranch FIR core in Simulink, and found that we had achieved a relative amplitude error of less than 0.04 percent for a 16-bit-resolution version. A wider data width will provide better performance at the cost of more resources.
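The spirit of that first step can be imitated offline with a quick fixed-point experiment. This is only a sketch: the Q1.15 format and the signal scaling below are assumptions, so the error figure will differ from the article's 0.04 percent.

```python
import numpy as np

# Quantize taps and data to 16-bit fixed point (assumed Q1.15) and
# measure relative amplitude error against the double-precision reference.
def q15(v):
    return np.round(v * 2**15) / 2**15

rng = np.random.default_rng(2)
taps = (rng.uniform(-1, 1, 128) + 1j * rng.uniform(-1, 1, 128)) / 128
x = (rng.uniform(-1, 1, 1024) + 1j * rng.uniform(-1, 1, 1024)) * 0.5

ref = np.convolve(x, taps)[:len(x)]                       # double precision
fx = np.convolve(q15(x.real) + 1j * q15(x.imag),
                 q15(taps.real) + 1j * q15(taps.imag))[:len(x)]

rel_err = np.max(np.abs(fx - ref)) / np.max(np.abs(ref))
print(f"relative amplitude error: {100 * rel_err:.4f} %")
```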

After verifying the function, it was time to validate the silicon performance. So our second step was to synthesize and implement the created IP in the Vivado Design Suite 2015.1, targeting a Kintex®-7 FPGA (xc7k325tffg900-2 device). With the full-hierarchy synthesis setting and default implementation settings, it was easy to achieve the required timing at a 491.52-MHz internal processing clock rate, since we created a fully pipelined design with clear registered hierarchies.

## Scalability Illustration

The HLP IP we designed can be easily used to create a larger massive-MIMO precoding core. Table 2 presents selected application scenarios, with key resource utilisations.

##### Table 2 Example resource utilisation in different application scenarios based on the suggested high-speed, low-latency precoding (HLP) core

| Massive MIMO Configuration (NRX × NTX) | Number of HLP Cores | Complex Multipliers | Distributed Memory (kb) | Block Memory (RAMB18E1) |
|---|---|---|---|---|
| 4 × 4 | 4 | 128 | 16 | 4 |
| 4 × 8 | 8 | 256 | 32 | 8 |
| 4 × 16 | 16 | 512 | 64 | 16 |
| 4 × 32 | 32 | 1024 | 128 | 32 |
| 8 × 8 | 16 | 512 | 64 | 16 |
| 8 × 16 | 32 | 1024 | 128 | 32 |

You will need an extra aggregation stage to deliver the final precoding results. For example, as shown in Figure 4, it’s easy to build a 4 x 4 precoding core by plugging in four HLP cores and one extra pipelined data aggregation stage.

##### Figure 4 4x4 MIMO precoding core
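Behaviorally, the 4 x 4 arrangement of Figure 4 amounts to four per-stream HLP cores plus a per-antenna summation. The sketch below uses random placeholder taps, not real precoding coefficients:

```python
import numpy as np

# 4x4 precoding sketch: one 4-branch HLP core per data stream, then an
# aggregation stage that sums the four streams' contributions per antenna.
rng = np.random.default_rng(3)
N_RX, N_TX, N_TAP, N = 4, 4, 128, 256
streams = rng.normal(size=(N_RX, N)) + 1j * rng.normal(size=(N_RX, N))
W = rng.normal(size=(N_RX, N_TX, N_TAP)) + 1j * rng.normal(size=(N_RX, N_TX, N_TAP))

def hlp_core(stream, branch_taps):
    """One HLP core: filter a single stream with N_TX branch FIR filters."""
    return np.stack([np.convolve(stream, taps)[:N] for taps in branch_taps])

# Four cores in parallel, then the pipelined data aggregation stage.
antenna_out = sum(hlp_core(streams[r], W[r]) for r in range(N_RX))
print(antenna_out.shape)   # one precoded signal per transmit antenna: (4, 256)
```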

## Conclusion

We have illustrated how to quickly build an efficient and scalable DSP linear convolution application in the form of a massive-MIMO precoding core with Xilinx System Generator and the Vivado design tools. You could expand this core to support longer-tap FIR applications either by using more sequential stages in the partially parallel architecture, or by raising the processing clock rate where the silicon allows.

For the latter case, it would be helpful to identify the bottleneck and critical path of the target devices regarding the actual implementation architecture. Then, co-optimization of hardware and algorithms would be a good approach to tune the system performance, such as development of a more compact precoding algorithm regarding hardware utilisation.

Initially, we focused on a precoding solution with the lowest latency. For our next step, we are going to explore an alternative solution for better resource utilization and power consumption.


## About the author

Lei Guan received the B.E. and M.E. degrees, both in electronic engineering, from Harbin Institute of Technology (HIT), Harbin, China, in 2006 and 2008, respectively. He was awarded the Ph.D. degree in electronic engineering in early 2012 by University College Dublin (UCD), Dublin, Ireland. He served as a senior research engineer in the UCD School of Electrical, Electronic and Communications Engineering for one year and then held a research fellowship at CTVR, the telecommunications research centre at Trinity College Dublin (TCD), Ireland. Currently he is a member of technical staff at Bell Labs Ireland.

Bell Labs, the industrial research division of Nokia, continues to conduct innovative and game-changing research around the big issues affecting the ICT industry. Using its wide-reaching expertise and collaborating with the global innovation community (both inside and outside Bell Labs), the organization is focused on finding solutions that offer a 10x (or more) improvement in multiple dimensions. These solutions will then be used to create cross-discipline ‘Future X’ initiatives that will shape the future communications landscape.
