Design of a Sigma-Delta-Based Radio-over-Fiber Massive MIMO Antenna System

Academic year 2019-2020

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering - main subject Electronic Circuits and Systems

Supervisors: Prof. dr. ir. Guy Torfs, Prof. dr. ir. Sam Lemey
Counsellors: Dr. ir. Haolin Li, Chia-Yi Wu, Prof. dr. ir. Johan Bauwelinck

Student number: 01505378

Permission of Use on Loan

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.

Preface

This master thesis finalises my engineering studies at Ghent University. For the past 9 months, I had the opportunity to further explore topics close to my heart in the most pleasant environment that is IDLab-Design. This includes hands-on experience in both programming FPGAs (Field-Programmable Gate Arrays) and RF (Radio Frequency) PCB (Printed Circuit Board) design. This would not have been possible, however, without the help and support of the following people.

First and foremost, I would like to thank prof. dr. ir. Guy Torfs, and prof. dr. ir. Sam Lemey for promoting this work, and for guiding me through the biweekly meetings. Next, I would like to thank dr. ir. Haolin Li, Chia-Yi Wu, prof. dr. ir. Johan Bauwelinck, and dr. ir. Olivier Caytan for their continuous assistance. Additionally, I would also like to thank ir. Joris Van Kerrebrouck for milling several of my PCBs, and helping me finalise the Gerbers for ordering. Furthermore, I would like to thank my fellow thesis students Jakob, Reinier and Borre for creating a pleasant atmosphere in the thesis room. In this regard, I would like to express my deepest ingratitude towards COVID-19, for ending this prematurely. Last but not least, I would like to thank my family, girlfriend, and friends for their unconditional support.

Caro Meysmans 31st May 2020

Design of a Sigma-Delta-Based Radio-over-Fiber

Massive MIMO Antenna System

Caro Meysmans Student number: 01505378

Supervisors: Prof. dr. ir. Guy Torfs, Prof. dr. ir. Sam Lemey

Counsellors: Dr. ir. Haolin Li, Chia-Yi Wu, Prof. dr. Johan Bauwelinck, Dr. ir. Olivier Caytan

Master’s dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering - main subject Electronic Circuits and Systems

Academic Year 2019 – 2020

MaMIMO (Massive MIMO) is one of the promising technologies satisfying the extensive demands of 5G. Currently, most testbeds investigate the co-located deployment scenario, where all antennas are located in a compact area. However, a distributed architecture, where all antennas are spread out over a large area, offers a much higher probability of coverage [1]. In this master thesis, a distributed massive MIMO testbed is developed. To overcome the difficulties imposed by the distributed architecture, the SDoF (Sigma-Delta Modulated Radio-over-Fiber)-based fronthaul network of the 5G C-RAN (Centralized/Cloud Radio Access Network) architecture is targeted. First, the system is thoroughly studied from an architectural point of view. Next, several crucial building blocks of the system are implemented on high-end FPGAs. Finally, an FMC (FPGA Mezzanine Card) is designed to connect an RRU (Remote Radio Unit) with an antenna array of up to 8 antennas.

5G, Massive MIMO, Sigma-Delta Modulated Radio-over-Fiber, FPGA, Radio Frequency

Design of a Sigma-Delta-Based Radio-over-Fiber

Massive MIMO Antenna System

Caro Meysmans

Supervisors: Prof. dr. ir. Guy Torfs, Prof. dr. ir. Sam Lemey

Counsellors: Dr. ir. Haolin Li, Chia-Yi Wu, Prof. dr. Johan Bauwelinck, Dr. ir. Olivier Caytan

Abstract—MaMIMO (Massive MIMO) is one of the promising technologies satisfying the extensive demands of 5G. Currently, most testbeds investigate the co-located deployment scenario, where all antennas are located in a compact area. However, a distributed architecture, where all antennas are spread out over a large area, offers a much higher probability of coverage [1]. In this master thesis, a distributed massive MIMO testbed is developed. To overcome the difficulties imposed by the distributed architecture, the SDoF (Sigma-Delta Modulated Radio-over-Fiber)-based fronthaul network of the 5G C-RAN (Centralized/Cloud Radio Access Network) architecture is targeted. First, the system is thoroughly studied from an architectural point of view. Next, several crucial building blocks of the system are implemented on high-end FPGAs (Field-Programmable Gate Arrays). Finally, an FMC (FPGA Mezzanine Card) is designed to connect an RRU (Remote Radio Unit) with an antenna array of up to 8 antennas.

Index Terms—5G, Massive MIMO, Sigma-Delta Modulated Radio-over-Fiber, FPGA, Radio Frequency

I. INTRODUCTION

To support the demands of 5G, an efficient wireless technology is required. Among the promising wireless technologies are MaMIMO antenna systems, consisting of a few hundred antennas. MaMIMO antenna systems dramatically improve the wireless spectral efficiency by serving multiple users at the same time and frequency [2]. At the time of writing, most of the existing testbeds investigate the co-located deployment scenario, where all antennas are located in a compact area. However, it has been proven that a distributed architecture, where antennas are spread out over a large area, offers a much higher probability of coverage [1]. In practice, however, a distributed architecture faces more implementation difficulties than a co-located architecture. These challenges include the synchronization of the distributed antenna arrays in time and frequency, as well as the deployment cost. The ability to guarantee carrier frequency synchronism and the simple RRU architecture make the SDoF-based fronthaul network an optimal approach to implement distributed MaMIMO.

II. SYSTEM ARCHITECTURE

In this master thesis, a system consisting of one CO (Central Office) and 4 RRUs is targeted, as shown in Fig. 1. Each RRU

Hence, the RRU includes the necessary switches to choose between the uplink and downlink paths. Finally, the system targets the 5G high-frequency band around 3.5 GHz. The MaMIMO processing at the CO requires intensive processing power. To ease the development, the channel estimation and the antenna calibration algorithms are implemented on a computer using MATLAB, while an FPGA handles the physical layer, including sigma-delta modulation and E/O interfacing. To exchange data, PCIe (Peripheral Component Interconnect Express) is used as a high-bandwidth connection between the computer and the FPGA. The RRU also includes an FPGA, which offers the large number of high-speed transceivers required, as well as the necessary signal processing bandwidth.

Fig. 1: System Architecture

To limit the number of optical fibers between CO and RRU, several sigma-delta modulated signals and a control signal will be interleaved, as depicted in Fig. 2. The control signal will be used to align the data streams and to pass control and timing information.

Fig. 2: Interleaved Sigma-Delta Modulated Signals

A. Central Office

for each antenna in the system a continuous stream of baseband symbols to the computer’s memory, along with control information for each RRU. The baseband symbols and control information are subsequently fetched by the FPGA through the PCIe bus. Per antenna, the corresponding symbols are then pulse shaped and sigma-delta modulated. Next, multiple sigma-delta modulated signals are interleaved with (part of) the control information, and sent over fiber to an RRU. Multiple fibers may be needed per RRU depending on the number of antennas and the sigma-delta modulator sampling rate.

Fig. 3: CO Architecture

B. Remote Radio Unit

The RRU comprises an FPGA, an RF (Radio Frequency) receiver, several RF transmitters including some switches, and an antenna array, as shown in Fig. 4. The FPGA extracts the sigma-delta modulated signals and the control information from the incoming CO data. Next, the sigma-delta modulated signals are digitally upconverted and passed on to the RF transmitters. Subsequently, the RF transmitters buffer, filter, and amplify the upconverted signals. The control information is used to configure the RF receiver, the RF transmitters and the switches. Because the RRU features only one RF receiver, an extra switch is needed to switch between the different uplink paths.

C. Signal-Processing Chain

The downlink signal-processing chain is shown in Fig. 5. A carrier frequency $f_c$ of 3.6864 GHz is selected. Note that the interleaving and de-interleaving between CO and RRU are omitted for clarity. However, the number of interleaved sigma-delta modulated signals $2n$, with $n$ the number of antennas per fiber, their bitrate $2f_c/L_3$, and the bitrate of the complementary control signal determine the optical bitrate required between CO and RRU. For simplicity, the same bitrate is chosen for

Fig. 4: RRU Architecture

Fig. 5: Signal Processing Chain per Antenna

are available between CO and RRU, because there is only 1 QSFP port available at the RRU. Without upsampling at the RRU, i.e. $L_3 = 1$, driving $n = 2$ antennas per fiber would exceed this bitrate for all possible carrier frequencies. As a result, we choose to upsample with a factor $L_3 = 2$, giving a bitrate of $5f_c$. Lastly, $L_1$ and $L_2$ are chosen to be 5 and 16 respectively, giving a symbol rate of 46.08 Mbaud for both the I (In-Phase) and the Q (Quadrature) signal.
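The rate bookkeeping above can be checked with a few lines of Python. This is only a sketch under stated assumptions: the control stream is taken to run at the same bitrate as each sigma-delta stream (which reproduces the bitrate of $5f_c$ quoted in the text), and the symbol rate is taken to be the stream bitrate divided by $L_1 L_2$. The function names are hypothetical and not part of the thesis.

```python
def stream_bitrate(fc_hz, l3):
    """Bitrate of one sigma-delta stream on the fiber: 2*fc / L3."""
    return 2 * fc_hz / l3


def optical_bitrate(fc_hz, n_antennas, l3, n_control=1):
    """Bitrate on one fiber: 2n I/Q streams plus the control stream.

    Assumption: the control stream uses the same bitrate as an I/Q stream.
    """
    n_streams = 2 * n_antennas + n_control
    return n_streams * stream_bitrate(fc_hz, l3)


def symbol_rate(fc_hz, l1, l2, l3):
    """Symbol rate per I or Q stream, assuming upsampling factors L1 and L2."""
    return stream_bitrate(fc_hz, l3) / (l1 * l2)


FC = 3.6864e9  # carrier frequency selected in the text

# n = 2 antennas per fiber, L3 = 2: five streams at fc each, i.e. 5*fc.
print(optical_bitrate(FC, n_antennas=2, l3=2) / 1e9, "Gbps")
# L1 = 5, L2 = 16: about 46.08 Mbaud.
print(symbol_rate(FC, l1=5, l2=16, l3=2) / 1e6, "Mbaud")
```

With $L_3 = 1$ the same function returns $10f_c$, which illustrates why upsampling at the RRU halves the required optical bitrate.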

III. FPGA IMPLEMENTATION

Both the CO and every RRU feature a high-end FPGA. These programmable devices enable the massive I/O and signal processing bandwidth required by our application. The CO houses a HiTech Global HTG-930, comprising one Xilinx UltraScale+ VU13P. This PCI Express development platform enables high-speed communication with a desktop computer. Additionally, the platform is expanded with one 4-port QSFP28 FMC+ module to connect the CO with every RRU optically. More QSFP28 FMC+ modules can be connected to support more RRUs for future upscaling. Every RRU houses a Xilinx Virtex UltraScale VCU108 evaluation kit, featuring one QSFP28 port. The platform is expanded with one commercial high-speed analog FMC module for RF

(a) CO FPGA: HTG-930 (b) RRU FPGA: VCU108

Fig. 6: FPGA platforms

A. Central Office

The CO FPGA implementation is shown in Fig. 7. Two possible symbol sources are available. The FPGA can fetch symbols from the computer using PCI Express, or the FPGA can generate symbols using a PRBS generator and symbol mapper. The next stage in our signal-processing chain is the FIR filter, which pulse-shapes the symbols. A square-root raised cosine filter is designed in MATLAB and implemented on the FPGA using Xilinx’ FIR (Finite Impulse Response) Compiler [3]. The FIR Compiler maps the calculated filter coefficients to a set of polyphase subfilters to efficiently combine both the upsampling and the filtering. The last stage in our signal-processing chain is the SDM (Sigma-Delta Modulator), which oversamples our signals and quantizes them to two levels, so that they can be transported by the GTH transceivers. As the SDM needs to sample at gigahertz frequencies, a conventional implementation of an SDM would not be possible in FPGA fabric. Fortunately, IDLab Design has a parallelized implementation of a second-order SDM available [4]. The SDM is configured to accept and return 16 samples each clock cycle, so the SDM runs 16 times slower than the effective sample rate. All 16 inputs are physically connected to the same FIR filter output, to efficiently implement the ZOH (Zero-Order Hold).
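To make the last two stages concrete, the following sketch models a zero-order hold followed by a two-level second-order SDM in plain Python. It is a serial, textbook error-feedback structure with noise transfer function $(1 - z^{-1})^2$, chosen here purely for illustration; it is not the parallelized implementation of [4], and the function names are hypothetical.

```python
def zero_order_hold(samples, factor=16):
    """Repeat every input sample `factor` times (ZOH upsampling)."""
    return [s for s in samples for _ in range(factor)]


def second_order_sdm(samples):
    """Serial second-order error-feedback SDM with a two-level output.

    v[n] = x[n] - 2*e[n-1] + e[n-2],  y[n] = sign(v[n]),  e[n] = y[n] - v[n]
    so that y[n] = x[n] + e[n] - 2*e[n-1] + e[n-2].
    Inputs are assumed to stay well inside (-1, 1).
    """
    e1 = e2 = 0.0  # previous two quantization errors
    out = []
    for x in samples:
        v = x - 2.0 * e1 + e2
        y = 1.0 if v >= 0.0 else -1.0
        e2, e1 = e1, y - v
        out.append(y)
    return out


# A constant input of 0.3: the average of the two-level output tracks it.
held = zero_order_hold([0.3] * 250, factor=16)
bits = second_order_sdm(held)
print(len(bits), sum(bits) / len(bits))
```

Feeding 16 identical samples per input value mirrors the statement that all 16 SDM inputs are connected to the same FIR filter output.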

Fig. 7: FPGA Implementation of the Signal Processing Chain per Antenna m at the CO

B. Remote Radio Unit

The RRU FPGA implementation is shown in Fig. 8. Each clock cycle, 16 samples of both the I and Q streams are processed in parallel by the upsampler. First, the DC level is removed from these 1-bit samples by converting them to a signed 2-bit representation, {−1, 1}. Next, these signed samples are upsampled by 2 and filtered in parallel by Xilinx’ FIR Compiler, resulting in 32 16-bit samples. These samples are

stream, and gives us 64 samples of the upconverted signal, each clock cycle.

Fig. 8: FPGA Implementation of the Signal Processing Chain per Antenna m at the RRUs

IV. INTERLEAVING AND DEINTERLEAVING SIGMA-DELTA STREAMS

To transport radio signals from CO to RRU, several sigma-delta streams are interleaved to support multiple antennas with one optical fiber. As a result, we need a way to properly deinterleave the bitstream at the RRU. For this, a control signal $C_n$ is introduced, which is interleaved with the sigma-delta streams $I_m$ and $Q_m$. The control signal has two purposes. First of all, as a constant pattern, it can be used to align the parallel data. When alignment is done, it is possible to send different patterns to the RRUs, corresponding to control and timing information, for example to enable the amplifiers or to switch one of the antennas to the receiver. The alignment procedure is implemented with a state machine, as shown in Fig. 10. The reset of the state machine is deasserted when the transceivers are initialized.

Fig. 9: Interleaver and Deinterleaver per Fiber n

In the start state, the control signal is compared with the alignment pattern. If there is no match, the state machine enters the slide state for two clock cycles. This shifts the parallel 80-bit data by 1 bit. Next, the state machine waits 32 clock cycles before comparing the control signal again. When a match is found, a leaky bucket algorithm is used to determine whether the link is up or down. If the link is down, the alignment procedure is restarted.
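The slide-and-compare idea can be illustrated with a small software model. Everything specific in it is an assumption made for the sake of the example: the lane order (I1, Q1, I2, Q2, C), bit-wise interleaving, and the 16-bit alignment pattern are hypothetical, and the loop replaces the clocked state machine with its slide and wait states.

```python
# Hypothetical 16-bit alignment pattern (not taken from the thesis); it is
# chosen so that no cyclic shift of the pattern equals the pattern itself.
ALIGN_PATTERN = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0]
N_LANES = 5                                 # assumed order: I1, Q1, I2, Q2, C
WORD_BITS = N_LANES * len(ALIGN_PATTERN)    # 80-bit parallel word


def interleave(lanes):
    """Bit-interleave equally long lanes: lane0[0], lane1[0], ..., lane0[1], ..."""
    return [bit for group in zip(*lanes) for bit in group]


def deinterleave(bits, n_lanes=N_LANES):
    """Split an aligned bit sequence back into its lanes."""
    return [bits[i::n_lanes] for i in range(n_lanes)]


def find_alignment(rx_bits, pattern=ALIGN_PATTERN, n_lanes=N_LANES):
    """Slide one bit at a time until the control lane equals the pattern."""
    word_bits = n_lanes * len(pattern)
    for slip in range(word_bits):
        word = rx_bits[slip:slip + word_bits]
        if len(word) == word_bits and deinterleave(word, n_lanes)[-1] == pattern:
            return slip
    return None  # no alignment found in the available data


# Three words of test data; the receiver starts at an unknown bit offset.
words = 3
lanes = [[0] * 16 * words, [1] * 16 * words,
         [0, 1] * 8 * words, [1, 0] * 8 * words,
         ALIGN_PATTERN * words]
rx = interleave(lanes)[37:]
slip = find_alignment(rx)
print(slip, deinterleave(rx[slip:slip + WORD_BITS])[-1] == ALIGN_PATTERN)
```

Random payload data could occasionally imitate the pattern in a misaligned position, which is one reason a single match is not treated as proof that the link is up.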

The leaky bucket algorithm is demonstrated in Fig. 11. The link counter is used as a measure of the current link quality. When a match occurs, the link counter is incremented. If there is no match, the link counter is divided by two. This means that single bit errors are tolerated, but bursts of mismatches (for example a bit slip) quickly result in the state machine resynchronizing the transceiver.
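A minimal software model of this counter is sketched below. The increment-on-match and halve-on-mismatch rules follow the description above; the saturation value and the threshold that separates "link up" from "link down" are hypothetical, since the text does not state them.

```python
class LeakyBucket:
    """Link-quality counter: +1 on a match, halved on a mismatch."""

    def __init__(self, threshold=8, max_count=31):
        # Both values are assumptions chosen for illustration only.
        self.threshold = threshold
        self.max_count = max_count
        self.count = 0

    @property
    def link_up(self):
        return self.count >= self.threshold

    def update(self, match):
        """Process one comparison result and return the link status."""
        if match:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count //= 2
        return self.link_up


bucket = LeakyBucket()
for _ in range(40):          # a long run of matches saturates the counter
    bucket.update(True)
print(bucket.count, bucket.link_up)
print(bucket.update(False))  # a single mismatch is tolerated
print(bucket.update(False))  # a second consecutive mismatch drops the link
```

Halving on a mismatch makes the counter collapse within a few consecutive errors regardless of how long the link has been stable, while a single error only costs half of the accumulated margin.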

Fig. 11: Deinterleaver Leaky Bucket Algorithm

A. Implementation Cost

Finally, the implementation cost of the complete CO FPGA design is investigated. First, the cost of the different blocks is analysed for an 8-antenna CO. Given these results, an estimate is given for the maximum number of antennas the system may support. Finally, the current bottlenecks are discussed.

In Table I the device utilisation is shown for an 8-antenna CO. This data, however, includes the cost of common blocks whose contribution will not increase with an increasing number of antennas. Thus, the cost per antenna is required. This cost is given in Table II, along with the relative contribution of the different blocks. The SDM holds the biggest contribution in the LUT, LUTRAM and FF resources, while the FIR filter and FIFO hold the biggest contribution in the DSP and BRAM resources respectively.

TABLE I: Device Utilisation

Resource   Used     Available   Utilisation [%]
LUT        128795   1728000      7.45
LUTRAM      19558    791040      2.47
FF         154159   3456000      4.46
BRAM          233      2688      8.67
DSP           115     12288      0.94
IO             20       702      2.85
GT             24        76     31.58
BUFG           20      1344      1.49
MMCM            1        16      6.25
PCIe            1         4     25.00

TABLE II: Device Utilisation per Antenna and the Relative Contribution of the Individual Blocks

Resource   Used per Antenna   SDM [%]   Filter [%]   FIFO [%]
LUT        13319              97.07       2.04         0.90
LUTRAM      2269              90.55       9.45         0.00
FF         16789              92.71       5.63         1.20
BRAM           1               0.00       0.00       100.00
DSP           14               0.00     100.00         0.00

Finally, using the data of Tables I and II, the maximum number of antennas the system may support is estimated. This number is determined for each resource, as shown in Table III. The current bottleneck is the available LUTs, of which the SDM is the main contributor. Furthermore, even if more LUTs were available, the next two bottlenecks are also mainly determined by the SDM. Hence, the SDM is the main bottleneck of the system, with respect to the available resources.

TABLE III: Maximal Number of Antennas per Resource

Resource   Number of Antennas
LUT         134
LUTRAM      376
FF          227
BRAM       2463
DSP         877

However, the system is limited not only by the available resources, but also by the bandwidth between CO and RRUs, and the bandwidth between computer and CO. The bandwidth between CO and RRUs is determined by the number of GTY transceivers and the FPGA board layout. 56 GTY transceivers of the Xilinx VU13P are accessible using the HTG-930. Since each GTY transceiver drives 2 antennas, 112 antennas are supported in total. Furthermore, during the measurements the bandwidth between computer and CO never exceeded 9 GBps. Since each antenna requires a bandwidth of 46.08 Mbaud · 2 · 16 bits = 1474.56 Mbps = 184.32 MBps, only 48 antennas are supported in total. However, the original 32-antenna CO is feasible with the current configuration.
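The three limits quoted above can be compared in a short calculation. The sketch below only restates numbers from the text and Table III; reading "9 GBps" as 9 · 10^9 bytes per second is an assumption, and it is the reading that reproduces the 48 antennas mentioned above.

```python
SYMBOL_RATE = 46_080_000            # symbols per second for each I or Q stream
BITS_PER_SAMPLE = 16                # bits per I or Q sample
PCIE_BYTES_PER_S = 9_000_000_000    # assumed decimal reading of "9 GBps"
GTY_ACCESSIBLE = 56                 # GTY transceivers reachable on the HTG-930
ANTENNAS_PER_GTY = 2
LUT_LIMIT = 134                     # tightest resource limit from Table III

# I and Q, 16 bits each, converted from bits to bytes.
bytes_per_antenna = SYMBOL_RATE * 2 * BITS_PER_SAMPLE // 8

limits = {
    "computer-CO bandwidth": PCIE_BYTES_PER_S // bytes_per_antenna,
    "CO-RRU transceivers": GTY_ACCESSIBLE * ANTENNAS_PER_GTY,
    "FPGA resources (LUT)": LUT_LIMIT,
}
bottleneck = min(limits, key=limits.get)

print(bytes_per_antenna / 1e6, "MBps per antenna")
for name, antennas in limits.items():
    print(name, antennas)
print("first bottleneck:", bottleneck)
```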

To conclude, the bottlenecks are listed below in chronological order, along with possible improvements:

• The first bottleneck is the bandwidth between computer and CO. This bandwidth can be increased by using a better computer. Alternatively, the bandwidth per antenna can be reduced by using fewer bits per symbol or by reducing the symbol rate.
• The next bottleneck is the bandwidth between CO and RRUs. The number of antennas can be increased by driving more antennas per fiber, or by lowering the SDM rate.
• Finally, the SDM can be optimized to use fewer resources.

V. RF FRONT-END DESIGN

To interface the FPGA at the RRU with an antenna array, a custom FMC card is designed. The PCB (Printed Circuit Board) buffers the high-speed signals (1), removes the quantization noise (2), and further amplifies them (3), to make them suitable for transmission. Additionally, the PCB is able to switch one of the antennas to a common output (4), (5), which is connected to a receiver. A corresponding block design is shown in Fig. 12. The partially soldered PCB is shown in Figs. 13 and 14.

CONCLUSIONS

Fig. 12: PCB Block Design

Fig. 13: Top View Partially Soldered PCB

Next, several crucial building blocks of the abovementioned system were implemented on the FPGA. This required in-depth knowledge of the FPGA’s transceivers, a major building block of this thesis. All signal processing blocks were implemented, including a high-bandwidth connection at the CO with a desktop. Furthermore, a state machine was designed to properly deinterleave the interleaved sigma-delta modulated signals. This required a control signal, which can also be used in the future to send commands downstream and timing information upstream. Finally, the implementation cost and the current bottlenecks were analysed.

Lastly, a custom FMC card was designed to connect an RRU to antenna arrays of up to 8 antennas.

REFERENCES

[1] H. Q. Ngo, A. Ashikhmin, H. Yang, E. G. Larsson, and T. L. Marzetta, “Cell-Free Massive MIMO Versus Small Cells,” Trans. Wireless. Comm., vol. 16, no. 3, pp. 1834–1850, Mar. 2017, ISSN: 1536-1276. DOI: 10.1109/TWC.2017.2655515. [Online]. Available: https://doi.org/10.1109/TWC.2017.2655515.

[2] C. Chen, S. Blandino, A. Gaber, C. Desset, A. Bourdoux, L. Van der Perre, and S. Pollin, “Distributed massive MIMO: A diversity combining method for TDD reciprocity calibration,” in GLOBECOM 2017 - 2017 IEEE Global Communications Conference, 2017, pp. 1–7.

[3] Xilinx. “FIR Compiler Product Guide,” [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/fir_compiler/v7_2/pg149-fir-compiler.pdf.

[4] H. Li, “Wireless and radio-over-fiber technologies for 5G communication systems,” Ph.D. dissertation, Ghent University, 2019, ISBN: 9789463552400.

List of Figures

1.1. ARoF Architecture
1.2. DRoF Architecture
1.3. SDoF Architecture
2.1. 4-QAM Constellation
2.2. Rectangular Pulse
2.3. Raised Cosine Pulse
2.4. First-Order Sigma-Delta Modulator
2.5. Noise transfer function of an L-th order SDM
2.6. SQNR of an L-th order SDM for different OSRs
3.1. System Architecture
3.2. Interleaved Sigma-Delta Modulated Signals
3.3. CO Architecture
3.4. RRU Architecture
3.5. Signal Processing Chain per Antenna
3.6. Normalized 16-QAM Constellation
3.7. Impulse Response of Pulse-Shaping Filter with a roll-off factor (β) of 0.28
3.8. Spectrum at (1)
3.9. Spectrum at (2)
3.10. Spectrum at (3)
3.11. Expansion Followed by a Sigma-Delta Modulator
3.12. Spectrum at (4) Using Approach of Figure 3.11 with A = 1 and n = 2
3.13. Spectrum at (4) Using Approach of Figure 3.11 with n = 3
3.14. Spectrum at (4) Using Approach of Figure 3.11 with A = 2^14 and n = 16
3.15. Expansion Followed by a FIR Filter and Sigma-Delta Modulator
3.16. Spectrum at (4) Using the Approach of Figure 3.15
3.17. Zero-order hold
3.18. Spectrum at (4) Using Approach of Figure 3.17
3.19. Sine and Cosine Sampled at Multiples of π/2
3.20. Spectrum at (5) Using the Approach of Figure 3.15
4.1. FPGA platforms
4.2. Transceiver Clocking Architecture
4.3. Simplified Transceiver TX Block Diagram. PISO: Parallel-In Serial-Out
4.4. Simplified Transceiver RX Block Diagram. CDR: Clock and Data Recovery; SIPO: Serial-In Parallel-Out
4.5. AXIS Block Diagram
4.6. AXI4S (AXI4-Stream) Timing Diagram
4.7. FPGA Implementation of the Signal Processing Chain per Antenna m at the CO
4.8. Symbol Representation
4.9. PCI Express
4.10. PCIe Performance
4.11. AXI4S Data Width Converter Timing Diagram
4.12. Combination of AXI4S Data Width Converter and Broadcaster Timing Diagram
4.13. PRBS (Pseudo Random Bit Sequence) Generator and Symbol Mapper
4.14. Clocking at the CO
4.15. FPGA Implementation of the Signal Processing Chain per Antenna m at the RRUs
4.16. Clocking at the RRUs
4.17. Reference Clock Distribution
4.18. Interleaver and Deinterleaver per Fiber n
4.19. Deinterleaver State Machine
4.20. Deinterleaver Leaky Bucket Algorithm
4.21. Logic Analyzer Capturing the State Machine Tolerating Bit Errors
4.22. Logic Analyzer Capturing the State Machine Resynchronizing the Link
4.23. Eye Scans at QSFP Port of RRU
4.24. Device Utilisation after Implementation for 8 Antennas
5.1. PCB Block Design
5.2. PCB Buildup
5.3. Single-Ended Grounded Coplanar Waveguide
5.4. Simulated S-parameters of a 30 mm Long Single-Ended Transmission Line
5.5. Differential Grounded Coplanar Waveguide
5.6. Simulated S-parameters of a 30 mm Long Differential Transmission Line
5.7. Samtec SeaRay High-Speed Connector [16]
5.8. Test PCB Buffer
5.9. Test PCB and Characteristics of Bandpass Filter
5.10. Matching Circuit Power Amplifier
5.11. Test PCB and Gain of Power Amplifier
5.12. Matching of Power Amplifier
5.13. Top View Partially Soldered PCB
5.14. Bottom View Partially Soldered PCB
5.15. Layout to Simulate Crosstalk
5.16. Results Crosstalk Simulation
5.17. Layout Power Amplifier
5.18. Matching Power Amplifier
5.19. S21: Gain
A.1. Conductor 1

List of Tables

3.1. Possible Carrier Frequencies and Corresponding Bitrates
4.1. Device Utilisation
4.2. Device Utilisation per Antenna and the Relative Contribution of the Individual Blocks
4.3. Maximal Number of Antennas per Resource
5.1. Optimized Single-Ended Grounded Coplanar Waveguide Dimensions
5.2. Optimized Differential Grounded Coplanar Waveguide Dimensions
5.3. Key Features PHA-1H+
5.4. Key Features BFCV-3641+
5.5. Key Features HMC327MS8G
5.6. Values Matching Circuit Power Amplifier for Rogers RO4350B Material
5.7. Key Features HMC8038
5.8. Key Features HMC321ALP4E
5.9. Maximum Total Power Consumption

List of Abbreviations

ARoF Analog Radio-over-Fiber.

AXI4S AXI4-Stream.

C-RAN Centralized/Cloud Radio Access Network.

CDR Clock and Data Recovery.

CO Central Office.

DAC Digital-to-Analog Converter.

DBB Digital Baseband.

DRoF Digitized Radio-over-Fiber.

DUC Digital Upconverter.

EVM Error Vector Magnitude.

FMC FPGA Mezzanine Card.

FPGA Field-Programmable Gate Array.

HPC High Pin Count.

I In-Phase.

IIC Inter IC.

IP Intellectual Property.

ISI Inter-Symbol Interference.

LPC Low Pin Count.

MaMIMO Massive MIMO.

NF Noise Figure.

PathWave ADS Advanced Design System.

PCB Printed Circuit Board.

PCIe Peripheral Component Interconnect Express.

PCS Physical Coding Sublayer.

PMA Physical Medium Attachment Sublayer.

PRBS Pseudo Random Bit Sequence.

Q Quadrature.

RAN Radio Access Network.

RF Radio Frequency.

RoF Radio-over-Fiber.

RRU Remote Radio Unit.

SDM Sigma-Delta Modulator.

SDoF Sigma-Delta Modulated Radio-over-Fiber.

SQNR Signal-to-Quantization-Noise Ratio.

SRRC Square-Root-Raised Cosine Filter.

TDD Time-Division Duplexing.

UART Universal Asynchronous Receiver-Transmitter.

1. Introduction

1.1. 5G and C-RAN

5G is the 5th generation of mobile communication technologies. Compared to previous generations, 5G will not only provide higher data rates, but also more capacity, lower end-to-end latency, massive device connectivity, reduced costs, and a consistent quality of experience. To support these extensive demands, an effective RAN (Radio Access Network) and complementary transport network architecture is required. One of the key proposals is the C-RAN architecture [2]. In 5G C-RAN the CO (Central Office) should be able to control dozens of RRUs via the fronthaul network. This architecture brings great advantages, but also new challenges. In particular, the fronthaul network requires a reliable interconnection technology for a large number of cells in a cost- and energy-efficient manner, while satisfying the capacity and delay requirements. Considering these factors, the RoF (Radio-over-Fiber) technologies are among the most convincing candidates [3].

1.1.1. Radio-Over-Fiber Technologies

Three main RoF technologies exist: ARoF (Analog Radio-over-Fiber), DRoF (Digitized Radio-over-Fiber), and SDoF [3], [4], as depicted in Figures 1.1 to 1.3.

Figure 1.1.: ARoF Architecture

The first approach, ARoF, uses a DAC (Digital-to-Analog Converter) to convert the DBB (Digital Baseband) signal to an analog signal. Then, the analog signal is upconverted and transmitted through the optical link. This architecture has the best optical spectrum efficiency and the simplest RRU architecture. However, linear optical modulators are required to obtain good performance.

Figure 1.2.: DRoF Architecture

The next approach, DRoF, serialises the digital data (SER) and sends this binary signal through the optical link. The receiver deserialises the data (DES) and uses a DAC to obtain the analog signal. Then, the analog signal is upconverted and transmitted. This architecture allows the use of non-linear optical modulators, because the optical signal consists of only two levels, making it immune to non-linearities. However, this architecture has the lowest optical spectrum efficiency. Furthermore, the DACs, oscillators and mixers required at the RRU consume a lot of power, and it is difficult to obtain a fixed phase relationship between the different RRUs.

Figure 1.3.: SDoF Architecture

An SDM (Sigma-Delta Modulator) oversamples the baseband signal and quantizes it to a 1-bit signal. Then, a DUC (Digital Upconverter) moves the signal to the required center frequency, while preserving the binary nature of the signal [5]. Thus, like DRoF, non-linear optical modulators may be used. A band-pass filter at the RRU removes the out-of-band quantization noise introduced by the SDM. Like ARoF, no power-hungry DACs, oscillators and mixers are required at the RRU, so a very simple and power-efficient RRU is obtained. This RRU is also compatible with ARoF. Finally, this architecture has a high tolerance to bit errors over the fiber [6].

1.2. Co-located and Distributed Massive MIMO

To support the demands of 5G, there is also a need for an efficient wireless technology. Among the promising wireless technologies are MaMIMO antenna systems, consisting of a few hundred antennas. MaMIMO antenna systems dramatically improve the wireless spectral efficiency by serving multiple users at the same time and frequency [7]. At the time of writing, most of the existing testbeds investigate the co-located deployment scenario, where all antennas are located in a compact area. However, it has been proven that a distributed architecture, where antennas are spread out over a large area, offers a much higher probability of coverage [1]. In practice, however, a distributed architecture proves more difficult to implement than a co-located architecture. The challenges include the synchronization of the distributed antenna arrays in time and frequency, as well as the deployment cost. The ability to guarantee carrier frequency synchronism and the simple RRU architecture make the SDoF-based fronthaul network an optimal platform to implement distributed MaMIMO.

1.3. Thesis Objective and Outline

In this thesis, a distributed MaMIMO testbed, consisting of one CO and several RRUs, is developed. To efficiently transport radio frequency signals between the CO and RRUs, SDoF is used. Concretely, the work consists of four parts:

• The system architecture design. This result will be presented by a block diagram in Chapter 3, describing the different processing steps. From this, the different system requirements are deduced.

• The FPGA implementation of the abovementioned system, as described in Chapter 4. The implementation cost and its bottlenecks will be evaluated, in order to obtain an estimate for the number of antennas and RRUs the system can support.

• The PCB design to interface the FPGA functioning as an RRU with an antenna array. The PCB should filter out the quantization noise contained in the sigma-delta modulated signals before amplification and subsequent transmission through the antenna array. Additionally, the PCB can switch to the RF transmitters for the downlink transmission and the RF receiver for the uplink reception. The design will be elaborated in Chapter 5.

• The evaluation of the distributed MaMIMO system. The system will be evaluated in two different phases. First, radio frequency signals will be generated and transmitted from one RRU with only analog beamforming in an anechoic chamber, and the performance will be evaluated. In the second phase, distributed MaMIMO will be evaluated with digital beamforming in typical indoor conditions. For this, the channel estimation and compensation will be implemented using MATLAB.

Some theoretical background will be provided in Chapter 2. Finally, the conclusions and possible future work will be discussed in Chapter 6.

1.4. Consequences of the COVID-19 Pandemic

Due to the COVID-19 pandemic and the subsequent closure of the lab, the PCB could not be assembled and tested, nor could the performance of the system be evaluated. As an alternative, more time was devoted to the parts that could be evaluated at home, such as the system architecture and the FPGA implementation. This preamble was drawn up in mutual consultation between the student and the promoters, and was approved by all parties.

2. Theory

2.1. Digital Baseband

The first step in every digital transmitter is to convert the message to a digital baseband signal. This message must first be translated into a string of bits. These bits are subsequently gathered in groups. For example

. . .001011011100 . . . → . . . 00 10 11 01 11 00 . . . (2.1)

Next, these groups are mapped onto symbols. The set of possible symbols is referred to as a constellation. Most constellations are complex-valued and can be decomposed into an I (In-Phase) and a Q (Quadrature) part. For example, a widely used constellation is 4-QAM.

[Figure: 4-QAM constellation in the I/Q plane, with the symbols labelled 11, 10, 01, and 00]
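The grouping of (2.1) and the subsequent mapping onto 4-QAM symbols can be sketched in a few lines of Python. This is only an illustration: the assignment of bit pairs to constellation points (`GRAY_4QAM`) is a hypothetical Gray mapping and may differ from the assignment in the constellation figure, and the function names are chosen for this example.

```python
import math

# Hypothetical Gray mapping of bit pairs onto the four 4-QAM points (I + jQ).
GRAY_4QAM = {
    "00": complex(+1, +1),
    "01": complex(-1, +1),
    "11": complex(-1, -1),
    "10": complex(+1, -1),
}

def group_bits(bits: str, k: int = 2) -> list[str]:
    """Split a bit string into consecutive groups of k bits."""
    if len(bits) % k:
        raise ValueError("bit string length must be a multiple of k")
    return [bits[i:i + k] for i in range(0, len(bits), k)]

def map_4qam(bits: str) -> list[complex]:
    """Map bit pairs onto unit-energy 4-QAM symbols (I = real, Q = imaginary)."""
    scale = 1 / math.sqrt(2)
    return [GRAY_4QAM[group] * scale for group in group_bits(bits)]

groups = group_bits("001011011100")   # the bit string of (2.1)
symbols = map_4qam("001011011100")    # one complex symbol per bit pair
```

Each symbol carries two bits, so the symbol rate is half the bit rate for this constellation.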

Finally, the string of symbols s(k) must be turned into a waveform sa(t)

$$ s_a(t) = \sum_{k} s(k)\, p(t - kT) \tag{2.2} $$

where p(t) is the pulse shaping function. Ideally, the pulses of different symbols should be zero at subsequent T -spaced sample instants, to avoid ISI (Inter-Symbol Interference)

$$ p(kT) = \begin{cases} 1, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{2.3} $$

A pulse satisfying this property is called a Nyquist pulse. A rectangular pulse with a width equal to the symbol period T is evidently a Nyquist pulse, as shown in Figure 2.2a. However, using this pulse, the final waveform will occupy an infinite bandwidth, as depicted in Figure 2.2b.

(a) Time Domain (b) Frequency Domain

Figure 2.2.: Rectangular Pulse

In practice, the most commonly used pulse is the raised cosine pulse, as shown in Figure 2.3a. The final waveform will have a bandwidth of (1 + β)/T, with β the roll-off factor, as depicted in Figure 2.3b. The most bandwidth-efficient pulse is obtained with β = 0 and is given by

$$ p(t) = \frac{1}{T}\, \mathrm{sinc}\!\left(\frac{t}{T}\right) \tag{2.4} $$

(a) Time Domain (b) Frequency Domain

Figure 2.3.: Raised Cosine Pulse

However, while this pulse is limited in frequency, it decays the most slowly in time. Increasing β trades bandwidth for faster decay of the pulse.

At the receiver, the signal is corrupted by noise added by the channel. However, one can prove that the signal-to-noise ratio is maximized by using a receive filter that is matched to the pulse shape. This receive filter is given by p*(−t). Furthermore, it is not the transmit pulse that should be a Nyquist pulse, but the combination of the transmit pulse and the receive filter. The raised cosine pulse fails in this respect. Nonetheless, a new pulse shape can be defined whose frequency response is the square root of that of the raised cosine pulse. The resulting pulse is called the SRRC (Square-Root-Raised Cosine) pulse, and will be used throughout this thesis. The cascade of two SRRC pulses is equivalent to the raised cosine pulse.
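As a quick numerical check of the Nyquist property (2.3), the raised cosine pulse, i.e. the cascade of the transmit SRRC and its matched receive filter, can be sampled at the T-spaced instants. The sketch below assumes the common closed-form expression of the raised cosine pulse, normalised such that p(0) = 1 (so without the 1/T factor), and uses β = 0.28 purely as an example value.

```python
import math

def sinc(x: float) -> float:
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def raised_cosine(t: float, T: float = 1.0, beta: float = 0.28) -> float:
    """Raised cosine pulse with roll-off beta, normalised so that p(0) = 1."""
    x = t / T
    denom = 1.0 - (2.0 * beta * x) ** 2
    if abs(denom) < 1e-12:
        # Removable singularity at |t| = T / (2 beta).
        return (math.pi / 4.0) * sinc(1.0 / (2.0 * beta))
    return sinc(x) * math.cos(math.pi * beta * x) / denom

# Sample the pulse at t = kT for k = -5 ... 5 (T = 1).
samples = [raised_cosine(float(k)) for k in range(-5, 6)]
```

Only the centre sample is non-zero, so neighbouring symbols do not interfere at the sampling instants.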

2.2. Sigma-Delta Modulation Working Principle

A key element of the SDoF architecture is the SDM, whose working principle can be explained by looking at a first-order SDM [8]. The system consists of an integrator and a quantizer, as shown in Figure 2.4.

Figure 2.4.: First-Order Sigma-Delta Modulator (signals u[n], s[n], w[n], and v[n]; delay element z^{-1})

Because the quantizer can be modelled as a white-noise source e[n], the output of the SDM can be expressed in the Z-domain as

$$ V(z) = S(z) + E(z) \tag{2.5} $$

where V(z), S(z), and E(z) are the Z-transforms of v[n], s[n], and e[n], respectively. Also,

$$ W(z) = S(z) - V(z) \tag{2.6} $$

and

$$ S(z) = U(z) + z^{-1} \cdot W(z) \tag{2.7} $$

From (2.5) to (2.7), V(z) can be written as

$$ V(z) = U(z) + \left(1 - z^{-1}\right) \cdot E(z) \tag{2.8} $$

Note that the spectrum of the output signal is the sum of the spectrum of the input signal and a reshaped version of the white-noise spectrum. The noise transfer function NTF(z) that describes this reshaping is

$$ \mathrm{NTF}(z) = \left. \frac{V(z)}{E(z)} \right|_{U(z) = 0} = 1 - z^{-1} \tag{2.9} $$
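The relations (2.5) to (2.8) can be verified with a short time-domain simulation. The sketch below is a minimal behavioural model, assuming a two-level quantizer with outputs −1 and +1, a zero initial state, and a constant input of 0.3. These choices are illustrative and not taken from the implementation used later in this thesis.

```python
def first_order_sdm(u):
    """Simulate (2.5)-(2.7) with a two-level quantizer (outputs -1 or +1).

    Returns the output v[n] and the quantization error e[n] = v[n] - s[n].
    """
    v, e = [], []
    w_prev = 0.0                          # w[n-1], initial state assumed zero
    for u_n in u:
        s_n = u_n + w_prev                # (2.7): S = U + z^-1 W
        v_n = 1.0 if s_n >= 0 else -1.0   # two-level quantizer
        e_n = v_n - s_n                   # (2.5): V = S + E
        w_prev = s_n - v_n                # (2.6): W = S - V
        v.append(v_n)
        e.append(e_n)
    return v, e

u = [0.3] * 10_000                        # constant input with |u| < 1
v, e = first_order_sdm(u)
mean_v = sum(v) / len(v)                  # average of the two-level output

# Time-domain form of (2.8): v[n] = u[n] + e[n] - e[n-1]
residual = max(abs(v[n] - (u[n] + e[n] - e[n - 1])) for n in range(1, len(u)))
```

The average of the two-level output converges to the input value, which illustrates that the quantization error is pushed away from DC.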

More generally, the noise transfer function NTF(z) of an L-th order SDM can be expressed as

$$ \mathrm{NTF}(z) = \left(1 - z^{-1}\right)^{L} \tag{2.10} $$

The corresponding frequency response can be written as

$$ |\mathrm{NTF}(z)| = \left| \left(1 - z^{-1}\right)^{L} \right|_{z = \exp\{j 2\pi f / f_s\}} = \left[ 2 \sin\!\left( \pi \frac{f}{f_s} \right) \right]^{L} \tag{2.11} $$

This expression is evaluated for several values of L in Figure 2.5.

Figure 2.5.: Noise transfer function of an L-th order SDM

Note that the quantization noise is decreased significantly in the frequency band of interest, resulting in a very high SQNR (Signal-to-Quantization-Noise Ratio). The following expression can be found for the maximal SQNR.

$$ \mathrm{SQNR}_{\mathrm{MAX}} = 6.02\,M + 1.76 + 10 \log_{10}\!\left[ \frac{(2L + 1)\,\mathrm{OSR}^{2L+1}}{\pi^{2L}} \right] \tag{2.12} $$

where M is the number of quantizer bits and OSR is the oversampling ratio f_s/(2 f_b), with f_s and f_b the sampling rate and the input bandwidth, respectively. This expression is evaluated for several values of L and OSR, as shown in Figure 2.6. A higher-order SDM results in a higher SQNR for the same OSR. As expected, increasing the OSR also increases the SQNR.
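Expressions (2.11) and (2.12) are straightforward to evaluate numerically. The sketch below uses illustrative function names and an example OSR of 64. It only restates the two formulas and is not the script used to generate Figures 2.5 and 2.6.

```python
import math

def ntf_magnitude(f: float, fs: float, order: int) -> float:
    """|NTF| of an L-th order SDM at frequency f, following (2.11)."""
    return (2.0 * abs(math.sin(math.pi * f / fs))) ** order

def sqnr_max_db(m_bits: int, order: int, osr: float) -> float:
    """Maximal SQNR in dB, following (2.12)."""
    shaping = (2 * order + 1) * osr ** (2 * order + 1) / math.pi ** (2 * order)
    return 6.02 * m_bits + 1.76 + 10.0 * math.log10(shaping)

# Example: 1-bit quantizer, OSR = 64, orders 1 to 3.
for order in (1, 2, 3):
    print(order, round(sqnr_max_db(1, order, 64), 1))
```

From (2.12), every doubling of the OSR adds 10(2L + 1) log10(2) dB, i.e. roughly 3(2L + 1) dB, to the maximal SQNR.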


3. System Architecture

As discussed in Chapter 1, a distributed MaMIMO testbed is developed. To validate the concept, a system consisting of one CO and 4 RRUs is targeted, as shown in Figure 3.1. Each RRU features an antenna array consisting of 8 antennas, resulting in 32 distributed antennas in total. TDD (Time-Division Duplexing) is used in this thesis to separate uplink and downlink signals. Hence, the RRU includes the necessary switches to choose between the uplink and downlink paths. Finally, the system targets the 5G high-frequency band around 3.5 GHz.

MaMIMO requires intensive processing power at the CO. To ease the development, the channel estimation and the antenna calibration algorithms are implemented on a computer using MATLAB, while an FPGA handles the physical layer, including sigma-delta modulation and E/O interfacing. To exchange data, PCIe (Peripheral Component Interconnect Express) is used as a high-bandwidth connection between computer and FPGA. The RRU also includes an FPGA, which offers the large number of high-speed transceivers required, as well as the necessary signal processing bandwidth.

[Figure 3.1: System overview. The CO (computer and FPGA, connected through PCIe) is linked over SDoF to the RRU (FPGA, TX, RX, switch, and antenna array).]

To limit the number of optical fibers between CO and RRU, several sigma-delta modulated signals and a control signal will be interleaved, as depicted in Figure 3.2. The control signal will be used to align the data streams and to pass control and timing information.

Figure 3.2.: Interleaved Sigma-Delta Modulated Signals (streams 1 to m and the control stream are interleaved sample by sample: sample 1 of every stream, then sample 2 of every stream, and so on)
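The interleaving of Figure 3.2 can be sketched as a round-robin multiplexer. The code below is a behavioural illustration only: the function names are hypothetical, and placing the control bit last in each group is an assumption made for this example.

```python
def interleave(streams, control):
    """Interleave m equally long streams plus one control stream, sample by sample."""
    lanes = list(streams) + [control]
    if len({len(lane) for lane in lanes}) != 1:
        raise ValueError("all streams must have the same length")
    return [sample for group in zip(*lanes) for sample in group]

def deinterleave(serial, m):
    """Recover the m data streams and the control stream from the serial stream."""
    lanes = [serial[i::m + 1] for i in range(m + 1)]
    return lanes[:m], lanes[m]

# Example: the 1-bit I and Q streams of 2 antennas (m = 4) plus one control stream.
streams = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 1, 1]]
control = [1, 0, 0, 0]
serial = interleave(streams, control)
recovered, recovered_control = deinterleave(serial, m=4)
```

The serial stream runs m + 1 times faster than each individual stream, which is why the number of streams per fiber directly determines the optical bitrate.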

This chapter gives an insight into the system architecture and into what needs to be implemented.

[Figure 3.3: CO architecture. Computer (script, memory) connected through PCIe to the FPGA, with symbols, pulse shaping, and SDM per antenna, an interleaver per fiber, and control per RRU, feeding the RRUs.]

Figure 3.4.: RRU Architecture (FPGA with a deinterleaver per fiber, digital upconversion per antenna, and control; TX, TX/RX switch, RX switch, and RX; antenna array)

3.1. Central Office

The CO comprises a computer and an FPGA, as shown in Figure 3.3. A MATLAB script, running on the computer, writes for each antenna in the system a continuous stream of baseband symbols to the computer’s memory, along with control information for each RRU. The baseband symbols and control information are subsequently fetched by the FPGA through the PCIe bus. Per antenna, the corresponding symbols are then pulse shaped and sigma-delta modulated. Next, multiple sigma-delta modulated signals are interleaved with (part of) the control information, and sent over fiber to an RRU. Multiple fibers may be needed per RRU depending on the number of antennas and the sigma-delta modulator sampling rate.


3.2. Remote Radio Unit

The RRU comprises an FPGA, an RF receiver, several RF transmitters including some switches, and an antenna array, as shown in Figure 3.4. The FPGA extracts the sigma-delta modulated signals and the control information from the incoming CO data. Next, the sigma-delta modulated signals are digitally upconverted and passed on to the RF transmitters. Subsequently, the RF transmitters buffer, filter, and amplify the upconverted signals. The control information is used to configure the RF receiver, the RF transmitters and the switches. Because the RF receiver features only one receiver, an extra switch is needed to switch between the different uplink paths.

3.3. Signal-Processing Chain

The downlink signal-processing chain is shown in Figure 3.5. When designing this chain, the platform on which it is implemented must be taken into account. Indeed, the FPGAs will impose boundary conditions (of which the origins will be clarified in Chapter 4) on the choice of the carrier frequency fc and the three upsampling factors L1, L2 and L3. Note that the interleaving and de-interleaving between CO and RRU are omitted for clarity. Nonetheless, this step introduces an important boundary condition.

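[Figure 3.5: Downlink signal-processing chain. The 16-bit I and Q streams pass through upsampling by L1, pulse shaping, upsampling by L2, the SDM, interleaving and de-interleaving, upsampling by L3, and the digital upconverter. The sample rates along the chain are 2fc/(L1L2L3), 2fc/(L2L3), 2fc/L3, 2fc, and finally 4fc. The probe points are numbered 1 to 5.]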

3.3.1. Remote Radio Unit

To synchronize the transmitters with the receiver at the RRU, the bitrate after digital upconversion, 4fc, should be related to 122.88 MHz by a simple ratio. To simplify the derivation, we choose an integer multiple x. Since the carrier frequency fc should lie in the 3.5 GHz range (3.3 GHz – 3.8 GHz), x should lie between 108 and 123, corresponding to bitrates of 13.27104 Gbps and 15.11424 Gbps, respectively.
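The bounds on x follow directly from 4fc = x · 122.88 MHz and the band limits. A minimal sketch of this arithmetic, with variable names chosen for this example:

```python
import math

F_REF = 122.88e6                    # reference frequency in Hz
FC_MIN, FC_MAX = 3.3e9, 3.8e9       # limits of the 3.5 GHz range in Hz

# The bitrate after digital upconversion is 4 * fc = x * F_REF.
x_min = math.ceil(4 * FC_MIN / F_REF)     # smallest integer multiple inside the band
x_max = math.floor(4 * FC_MAX / F_REF)    # largest integer multiple inside the band
bitrate_min_gbps = x_min * F_REF / 1e9
bitrate_max_gbps = x_max * F_REF / 1e9
```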

3.3.2. Interleaving and De-interleaving

To synchronize the different RRUs, the optical bitrate per fiber between the CO and each RRU should also be related to 122.88 MHz by a simple ratio. To simplify the derivation, we choose an integer multiple y. This bitrate depends on the number of sigma-delta modulated signals per fiber 2n, with n the number of antennas per fiber, their bitrate 2fc/L3, and the bitrate of the control signal. For simplicity, the same bitrate is chosen for all, giving an optical bitrate per fiber of (2n + 1) · 2fc/L3. In practice, each optical fiber is limited to 25 Gbps, and only 4 optical fibers are available between CO and RRU, because there is only 1 QSFP port available at the RRU. Without upsampling at the RRU, i.e. L3 = 1, driving n = 2 antennas per fiber would require 10fc, or at least 33 Gbps, which exceeds this limit for all possible carrier frequencies. As a result, we choose to upsample with a factor L3 = 2, giving a bitrate of 5fc.

Taking the above into account, this results in the possible carrier frequencies of Table 3.1. The column with x = 120 is used, as the resulting carrier frequency is the closest to the center frequency of the bandpass filter used in the analog processing.
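The entries of Table 3.1 can be reproduced from the two constraints: 4fc = x · 122.88 MHz and an optical bitrate of (2n + 1) · 2fc/L3 = y · 122.88 MHz, with x and y integers. The sketch below assumes n = 2 and L3 = 2 as chosen above. Exact rational arithmetic is used to test whether y is an integer, and the function name is illustrative.

```python
from fractions import Fraction

F_REF_MHZ = Fraction(12288, 100)     # 122.88 MHz
N_ANTENNAS, L3 = 2, 2                # antennas per fiber and RRU upsampling factor

def table_rows(x_min=108, x_max=123):
    """Return tuples (x, y, fc [GHz], upconverted bitrate [Gbps], optical bitrate [Gbps])."""
    rows = []
    for x in range(x_min, x_max + 1):
        fc = x * F_REF_MHZ / 4                          # 4 * fc = x * 122.88 MHz
        optical = (2 * N_ANTENNAS + 1) * 2 * fc / L3    # (2n + 1) * 2 * fc / L3
        y = optical / F_REF_MHZ
        if y.denominator == 1:                          # keep integer multiples only
            rows.append((x, int(y), float(fc) / 1e3,
                         float(4 * fc) / 1e3, float(optical) / 1e3))
    return rows

rows = table_rows()
for row in rows:
    print(row)
```

With these parameters y = 5x/4, so only values of x that are multiples of 4 remain, and all resulting optical bitrates stay below the 25 Gbps limit per fiber.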

3.3.3. Central Office

Lastly, L1 and L2 are chosen to be 5 and 16 respectively, giving a symbol rate of 46.08 Mbaud for both the I and the Q signal. The first upsampling with the factor L1 inserts L1−1 zeros between every two samples, while the upsampling with L2 repeats each sample L2 times.

Table 3.1.: Possible Carrier Frequencies and Corresponding Bitrates

x                                            108      112      116      120
y                                            135      140      145      150
Carrier Frequency [GHz]                      3.3178   3.4406   3.5635   3.6864
Bitrate after Digital Upconversion [Gbps]    13.271   13.763   14.254   14.746
Optical Bitrate [Gbps]                       16.589   17.203   17.818   18.432

3.3.4. Upsampling Sigma-Delta Modulated Signals

What still requires investigation is how to upsample the sigma-delta modulated signals coming from the CO by L3 at the RRU, while preserving the binary nature of the signal. In order to quantitatively compare possible solutions, we will look at the spectra of the different signals in the system and compare the amount of quantization noise present. The latter is quantified by determining the bandwidth BW0 over which the quantization noise power is 30 dB lower than the signal power. The higher this bandwidth, the better, as this makes the filtering in the analog processing easier.

The symbol source in the simulation generates random 16-QAM symbols. The 16-QAM constellation is shown in Figure 3.6. The pulse-shaping filter is an SRRC with a roll-off factor β of 0.28, as shown in Figure 3.7. To reduce the computational complexity, this filter is truncated to 10 symbol intervals. The spectra of the signals at points 1 to 3 in Figure 3.5 are shown in Figures 3.8 to 3.10, respectively. A bandwidth BW0 of 346 MHz is observed at point 3. Ideally, the upsampling by L3 does not decrease this bandwidth at point 4.

Expansion Followed by a Sigma-Delta Modulator

The first approach is shown in Figure 3.11. First, the 1-bit sigma-delta modulated signal is mapped to an n-bit signed signal. To maintain the DC balance, 0 and 1 are mapped to −A and A, respectively. Next, we expand this signal by adding L3−1 zeros between consecutive samples. Finally, an SDM is used to quantize and low-pass filter the previous signal. The latter is necessary to suppress the aliasing introduced by the upsampling.

Figure 3.6.: Normalized 16-QAM Constellation

Figure 3.8.: Spectrum at 1

Figure 3.10.: Spectrum at 3

Figure 3.11.: Expansion Followed by a Sigma-Delta Modulator (1-bit {0, 1} → n-bit {−A, A} → upsampling by L3 giving {−A, 0, A} → SDM → 1-bit {0, 1})

The simplest configuration, A = 1 and n = 2, delivers inadequate results, as seen in Figure 3.12. Strong peaks are observed at integer multiples of 921.6 MHz, which is 1/8 of the sampling rate. The signal power has also decreased, resulting in a lower bandwidth BW0.

Increasing n to 3 does not get rid of these peaks, nor does it increase the signal power. The spectra for different values of A are shown in Figure 3.13. However, the peaks can be explained by looking at n = 16 and A = 2^14. The resulting spectrum is shown in Figure 3.14a. The peaks are still present. However, if we dither the signal before the SDM (by adding a relatively small random signal, hence the large n and A), we get the spectrum of Figure 3.14b. Indeed, because an SDM is a finite state machine, the quantization error may form short and repeating patterns if the input is not sufficiently random [8]. This causes strong unwanted tones in the output spectrum.

Figure 3.12.: Spectrum at 4 Using Approach of Figure 3.11 with A = 1 and n = 2

[Figure 3.13, panels: (a) A = 1 (b) A = 3]

(a) Without Dithering (b) With Dithering

Figure 3.14.: Spectrum at 4 Using Approach of Figure 3.11 with A = 2^14 and n = 16

Expansion Followed by a FIR Filter and Sigma-Delta Modulator

The first approach deteriorates the bandwidth BW0, because of the lower signal power, and because the original quantization noise is still present. The second approach, as depicted in Figure 3.15, uses an FIR (Finite Impulse Response) filter of length 60 to remove the quantization noise introduced by the first SDM, before quantizing the signal again with a second SDM. The filter has a passband frequency of 29.4912 MHz, which is equal to the bandwidth of the useful signal. Its stopband frequency and attenuation are 170 MHz and 30 dB, respectively. Meanwhile, the second SDM runs twice as fast. For these two reasons, the bandwidth BW0 at point 4 has more than doubled compared to point 3, as shown in Figure 3.16. A drawback is the added complexity.

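[Figure 3.15: Expansion followed by an FIR filter and sigma-delta modulator (1-bit {0, 1} → 2-bit {−1, 1} → upsampling by L3 giving {−1, 0, 1} → FIR with 16-bit output → SDM → 1-bit {0, 1})]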

Zero-Order Hold

If the hardware is not capable of such filter lengths, satisfactory results are also obtained using a ZOH (Zero-Order Hold), as shown in Figure 3.17. The bandwidth at point 4 remains the same as at point 3, as shown in Figure 3.18.

Figure 3.17.: Zero-order hold (1-bit {0, 1} → ZOH → 1-bit {0, 1})

Figure 3.18.: Spectrum at 4 Using Approach of Figure 3.17

3.3.5. Digital Upconversion

The upconverted signal at point 5 can be written as

$$ X(t) = I(t) \cos(2\pi f_c t) - Q(t) \sin(2\pi f_c t) \tag{3.1} $$

Because this signal is sampled at four times the carrier frequency (t = n/(4 f_c)), we can write

$$ X(n) = I(n) \cos\!\left(\frac{\pi}{2} n\right) - Q(n) \sin\!\left(\frac{\pi}{2} n\right) \tag{3.2} $$

Or, by looking at Figure 3.19, we can write this as

$$ X(n) = \begin{cases} I(n), & n \equiv 0 \pmod 4 \\ -Q(n), & n \equiv 1 \pmod 4 \\ -I(n), & n \equiv 2 \pmod 4 \\ Q(n), & n \equiv 3 \pmod 4 \end{cases} \tag{3.3} $$

The odd samples of I and the even samples of Q are never used, so we halve both input sample rates:

$$ I'(n) = I(2n) \tag{3.4} $$

$$ Q'(n) = Q(2n + 1) \tag{3.5} $$

resulting in

$$ X(n) = \begin{cases} I'\!\left(\frac{n}{2}\right), & n \equiv 0 \pmod 4 \\ -Q'\!\left(\frac{n-1}{2}\right), & n \equiv 1 \pmod 4 \\ -I'\!\left(\frac{n}{2}\right), & n \equiv 2 \pmod 4 \\ Q'\!\left(\frac{n-1}{2}\right), & n \equiv 3 \pmod 4 \end{cases} \tag{3.6} $$

Applying Equation (3.6), the I and Q streams can be easily upconverted through inverting and interleaving. The resulting spectrum, when upsampling using the approach of Figure 3.15, is shown in Figure 3.20. However, in the hardware implementation only the even samples of Q are available:

$$ Q'(n) = Q(2n) \tag{3.7} $$

resulting in a slight IQ timing mismatch of 1/(4fc). The image created by this slight phase shift can be compensated by applying a fractional delay interpolation to the Q stream after pulse-shaping. However, with respect to the chosen symbol rate, this mismatch can be neglected.
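The upconversion by inverting and interleaving can be checked against the direct multiplication with the sampled cosine and sine of (3.2). The sketch below uses arbitrary test signals and the ideal half-rate streams of (3.4) and (3.5), i.e. without the timing mismatch of (3.7). The function name is illustrative.

```python
import math

def upconvert(i_half, q_half):
    """Interleave and invert the half-rate streams I'(n) and Q'(n), following (3.6)."""
    x = []
    for k in range(0, len(i_half) - 1, 2):
        # Output order per carrier period: I', -Q', -I', Q'
        x += [i_half[k], -q_half[k], -i_half[k + 1], q_half[k + 1]]
    return x

# Arbitrary full-rate test signals sampled at 4 * fc.
i_full = [math.cos(0.05 * n) for n in range(16)]
q_full = [math.sin(0.03 * n) for n in range(16)]

i_half = i_full[0::2]      # (3.4): I'(n) = I(2n)
q_half = q_full[1::2]      # (3.5): Q'(n) = Q(2n + 1)
x = upconvert(i_half, q_half)

# Reference: direct evaluation of (3.2).
x_ref = [i_full[n] * math.cos(math.pi * n / 2) - q_full[n] * math.sin(math.pi * n / 2)
         for n in range(16)]
max_err = max(abs(a - b) for a, b in zip(x, x_ref))
```

No multipliers are needed: the upconverter reduces to sign inversion and multiplexing, which also preserves the two-level nature of sigma-delta modulated inputs.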

Figure 3.19.: Sine and Cosine Sampled at Multiples of π/2

3.4. Antenna Calibration and Channel Estimation

A return path is needed from RRU to CO for the uplink paths. In this setup, the uplink paths are used for antenna calibration and channel estimation. We will use an Analog Devices FMCOMMS1 to downconvert and digitize the incoming RF signals. The I and Q signals are then sigma-delta modulated. Subsequently, the sigma-delta modulated signals and timing information are interleaved and sent over fiber to the CO. There, the I and Q signals and the timing information are written to the computer’s memory through the PCIe bus. Finally, the computer can calibrate the antennas and estimate the channels.


4. FPGA Implementation

Both the CO and every RRU feature a high-end FPGA. These programmable devices enable the massive I/O and signal processing bandwidth required by our application. The CO houses a Hitech Global HTG-930, comprising one Xilinx UltraScale+ VU13P. This PCI Express development platform enables high-speed communication with a desktop computer. Additionally, the platform is expanded with one 4-port QSFP28 FMC+ module to connect the CO with every RRU optically. More QSFP28 FMC+ modules can be connected to support more RRUs for future upscaling. Every RRU houses a Xilinx Virtex UltraScale VCU108 evaluation kit, featuring one QSFP28 port. The platform is expanded with one commercial high-speed analog FMC module for the RF receiver, and one custom FMC module for the RF transmitters and switching. This chapter discusses the FPGA implementation of both CO and RRU.

(a) CO FPGA: HTG-930

(b) RRU FPGA: VCU108


4.1. High-Speed Serial I/O

The key design block for this thesis is the high-speed serial I/O, for both the communication between CO and RRU and the digital-to-analog conversion of the sigma-delta modulated signals at the RRU [9]–[11]. The communication between CO and RRU uses the GTY transceivers, while the digital-to-analog conversion of the sigma-delta modulated signals at the RRU uses the GTH transceivers. This choice is mainly determined by the board layout: the QSFP port uses the GTY transceivers, while the FMC interface uses the GTH transceivers. The GTY transceivers in Virtex UltraScale and UltraScale+ devices support speeds up to 30.5 Gbps and 32.75 Gbps respectively, while the GTH transceivers support speeds up to 16.375 Gbps. Both types of transceivers are based on the same architecture. The following will elaborate on certain aspects of this architecture.

First of all, transceivers are organized in groups of four, called quads. Each transceiver contains one ring-based channel PLL (Phase-Locked Loop), the CPLL. Additionally, each quad contains two LC-based PLLs, QPLL0 and QPLL1. The output of one of these PLLs feeds the TX and RX clock divider blocks of each transceiver, which control the generation of the serial and parallel clocks used by the PMA (Physical Medium Attachment Sublayer) and PCS (Physical Coding Sublayer). These clocks determine the parallel and serial data rates of the transceivers. The reference clock of the quad is used to drive the quad or channel PLLs. The PMA is mostly responsible for serialisation and deserialisation, while the PCS is responsible for encoding for transmission, decoding, lane alignment, and so on.

[Figure: simplified transceiver architecture, showing the RX and TX PMA and PCS, the clock dividers, the CPLL, QPLL0/1, and the reference clocks]

4.1.1. Transmitter

Each transceiver includes an independent transmitter, which consists of a PCS and a PMA. Parallel data flows from the device logic into the TX interface, through the PCS, and out of the PMA.

Figure 4.3.: Simplified Transceiver TX Block Diagram. PISO: Parallel-In Serial-Out

Device logic interacts with the TX datapath of the transceiver through the TX interface. Two clocks must be provided: TXUSRCLK and TXUSRCLK2. Data is transmitted by writing to the TXDATA port on the positive edge of TXUSRCLK2, while TXUSRCLK is used for the internal PCS logic in the transmitter. Depending on the internal data width, TXUSRCLK runs at either the same rate or half the rate of TXUSRCLK2. Both clocks must be positive-edge aligned.

Another clock domain is present in the PCS: the XCLK domain. To transmit data between the two clock domains, the XCLK rate must match the TXUSRCLK rate, and all phase differences between the two domains must be resolved. The latter can be ignored by using the TX FIFO buffer, while the former must be satisfied to not over- or underrun the TX FIFO buffer. To match both rates, the device logic can access one of the transceiver parallel clocks through TXOUTCLK.

If multiple transceivers are using the same reference clock, the transmitters will be running at the same data rate, so the TXOUTCLK of one transceiver can be used to derive the TXUSRCLK and the TXUSRCLK2 for all transceivers. There will, however, still be a non-deterministic skew between the different serial data streams. This is of no concern for the downlink path (as this can be included in the wireless channel), but it could prevent us from having all RRUs execute a certain command at the same point in time.

4.1.2. Receiver

Each GTY transceiver includes an independent receiver, made up of a PCS and a PMA. High-speed serial data flows from traces on the board into the PMA, into the PCS, and finally into the device logic.

Figure 4.4.: Simplified Transceiver RX Block Diagram. CDR: Clock and Data Recovery; SIPO: Serial-In Parallel-Out

Device logic interacts with the RX datapath of the transceiver through the RX interface. Two clocks must be provided: RXUSRCLK and RXUSRCLK2. Data is received by reading from the RXDATA port on the positive edge of RXUSRCLK2, while RXUSRCLK is used for the internal PCS logic in the receiver. Depending on the internal data width, RXUSRCLK runs at either the same rate or half the rate of RXUSRCLK2. Both clocks must be positive-edge aligned.

Another clock domain is present in the PCS: the XCLK domain. To transfer data between the two clock domains, the XCLK rate must be sufficiently close to the RXUSRCLK rate, and all phase differences between the two domains must be resolved. The latter can be ignored by using the RX elastic buffer. XCLK is derived from the recovered clock provided by the CDR (Clock and Data Recovery). To eliminate the need for clock correction¹, RXUSRCLK will also be derived from the recovered clock. The device logic can access the recovered clock through RXOUTCLK.

If multiple transceivers receive data from transmitters using the same reference clock, the transmitters, and consequently the receivers, will be running at the same data rate, so the RXOUTCLK of one transceiver can be used to derive the RXUSRCLK and the RXUSRCLK2 for all transceivers. There will, however, still be a non-deterministic skew between the different serial data streams. Additionally, we cannot assume the transmitter and receiver within the same transceiver to be running at the same rate, because the transmitter will be running synchronous to the reference clock, while the receiver will be running synchronous to the recovered clock. There is no guarantee that both clocks are running at the exact same rate. This poses a problem when a single clock source is used to derive TXUSRCLK, TXUSRCLK2, RXUSRCLK and RXUSRCLK2. If TXOUTCLK is used, clock correction will be needed at the receiver. If RXOUTCLK is used, the TX FIFO buffer will over- or underrun.

To overcome this, the VCU108 features a jitter attenuator. This jitter attenuator has two inputs. At startup, the jitter attenuator generates a reference clock for the transceivers derived from the first input, a crystal. After a short while the CDR will recover the serial clock (since the CDR requires a stable reference clock to function) and RXOUTCLK will be stable. At this point the jitter attenuator switches to the second input, RXOUTCLK. Consequently, the reference clock will be synchronous to RXOUTCLK, and both transmitter and receiver will run at the exact same rate.

The transceivers also provide functionality to align the incoming serial data to word boundaries. In the manual alignment mode, the device logic asserts the RXSLIDE for 2 RXUSRCLK2 cycles to shift the parallel data by one bit. RXSLIDE must be low for 32 RXUSRCLK2 cycles before it can be used again.

¹ Clock correction: the transmitter can send certain patterns that may be discarded or repeated by the elastic buffer.

4.2. High-Speed Streaming Data

Inside each FPGA a large amount of data will be processed by different processing blocks. To pass data from one processing block to another, the AXI4S protocol will be used. One processing block will be the master, providing the data, another will be the slave, accepting the data.

Figure 4.5.: AXI4S Block Diagram (master and slave connected through TDATA, TVALID, and TREADY)

Its two-way flow control mechanism enables both the master and the slave to control the rate at which the data is transmitted across the interface. For a transfer to occur, both the TVALID and the TREADY signal must be asserted. An example of multiple AXI4S transfers is shown in Figure 4.6.

Figure 4.6.: AXI4S Timing Diagram (signals ACLK, TDATA, TVALID, and TREADY; transfers A, B, and C)

Note that the AXI4S specification defines more (optional) signals, but these are outside the scope of this thesis.
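The handshake rule can be captured in a small cycle-based model: a word is transferred in exactly those clock cycles in which both TVALID and TREADY are high. The signal values below are a hypothetical example and are not read from Figure 4.6.

```python
def axi4s_transfers(tdata, tvalid, tready):
    """Return the data words transferred, given per-cycle TDATA/TVALID/TREADY values.

    A transfer happens in every cycle in which both TVALID and TREADY are high.
    """
    return [d for d, v, r in zip(tdata, tvalid, tready) if v and r]

# One list entry per ACLK cycle (hypothetical example).
# The master holds TDATA stable while TVALID is high and TREADY is low.
tdata  = ["A", "A", "B", "C", "C", "C"]
tvalid = [1,   1,   1,   1,   1,   0]
tready = [0,   1,   1,   0,   1,   1]
transferred = axi4s_transfers(tdata, tvalid, tready)
```

In this example the slave stalls the master in cycles 0 and 3 (back-pressure), while in cycle 5 the master has no data to offer.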


4.3. Central Office

In Chapter 3, we explain the signal processing chain at the CO, as shown in Figure 3.5. Next, its FPGA implementation, as shown in Figure 4.7, is discussed.

Figure 4.7.: FPGA Implementation of the Signal Processing Chain per Antenna m at the CO (symbol generator, AXI4S FIFO, FIR filter, ZOH, and SDM for both Im and Qm; clock domains ACLK (250 MHz) and TXOUTCLK (230.4 MHz))

4.3.1. Symbol Generator

Two possible symbol sources are available. The FPGA can fetch symbols from the computer using PCI Express, or the FPGA can generate symbols using a PRBS generator and symbol mapper.

Symbols are represented as the concatenation of a signed 16-bit I and a signed 16-bit Q, as shown in Figure 4.8.

[Figure 4.8: layout of the 32-bit symbol word (bits 31 to 0), consisting of a signed 16-bit I field and a signed 16-bit Q field]

PCI Express

As PCI Express is a fairly complex protocol to implement from scratch, the DMA subsystem for PCI Express provided by Xilinx [12] will be used. This IP (Intellectual Property) moves data between the computer’s memory and the DMA subsystem. The DMA subsystem is configured to have one AXI4S interface for all 4 possible channels. To transfer data, the host writes a linked list of descriptors for each channel to its memory, which the DMA fetches and processes. One such descriptor specifies the source or destination address, and the length of one transfer.

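Figure 4.9.: PCI Express (the computer memory, holding descriptors and data, is connected through PCIe to the DMA subsystem for PCI Express on the FPGA; its 512-bit AXI4S interface is split by an AXI4S splitter into 32-bit AXI4S interfaces)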

The PCI Express communication uses 16 lanes, each running at 8 Gbps. Taking into account the 128b/130b encoding, this gives a total throughput of 15.75 GBps. To support these rates in the FPGA fabric, each AXI4S interface is 512 bits wide and runs at 250 MHz. This theoretical limit is not achieved in practice, however, as shown in Figures 4.10a and 4.10b: in both the uplink and the downlink direction, the speed never exceeds 9 GBps. We do not want to throttle the PCI Express communication (TREADY low) in the process of splitting the 512-bit-wide AXI4S interfaces into 32-bit-wide AXI4S interfaces, each representing a symbol. This would be the case when using the AXI4S Data Width Converter² to downsize the interface, as demonstrated in Figure 4.11.

Instead, the AXI4S Data Width Converter can be used to upsize the interface, depending on how many symbols need to be available in parallel. Subsequently, the AXI4S Broadcaster splits the upsized interface into multiple smaller interfaces. This is shown in Figure 4.12.

² The AXI4S Data Width Converter and AXI4S Broadcaster are part of Xilinx’ AXI4S Infrastructure

(a) Downlink (Computer to FPGA) (b) Uplink (FPGA to Computer)

Figure 4.10.: PCIe Performance

[Figure 4.11: timing diagram of the AXI4S Data Width Converter downsizing a 256-bit word (8 symbols) into symbols 0 to 7, during which the PCIe side is throttled]
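The throughput figures quoted for the PCIe link and the AXI4S interface follow from simple arithmetic, sketched below. The last quantity is an additional back-of-the-envelope estimate, not stated in the text: it assumes that all 32 antennas are fed from the computer with 32-bit symbols at the 46.08 Mbaud symbol rate of Chapter 3.

```python
LANES = 16
LANE_RATE_GBPS = 8.0                 # raw line rate per lane
ENCODING = 128 / 130                 # 128b/130b line coding efficiency

# Theoretical PCIe throughput in gigabytes per second.
pcie_gbytes_per_s = LANES * LANE_RATE_GBPS * ENCODING / 8

# Capacity of one 512-bit AXI4S interface clocked at 250 MHz.
axi_gbytes_per_s = 512 * 250e6 / 8 / 1e9

# Number of 32-bit symbols carried by one 512-bit word.
symbols_per_word = 512 // 32

# Assumed requirement: 32 antennas x 46.08 Msymbols/s x 4 bytes per symbol.
required_gbytes_per_s = 32 * 46.08e6 * 4 / 1e9
```

Under these assumptions the required rate of roughly 5.9 GBps lies below the 9 GBps observed in practice, and the 512-bit interface at 250 MHz is slightly faster than the theoretical PCIe limit, so the fabric side does not form the bottleneck.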

PRBS Generator and Symbol Mapper

When the computer is not used as a symbol source, we can generate a PRBS, which is subsequently mapped onto symbols. The number of bits transferred per clock cycle between the PRBS generator and the symbol mapper can be configured, to support different constellations.

Both the PRBS generator and the symbol mapper expose AXI4S interfaces, as does the PCIe symbol source, so that they are compatible with the IP provided by Xilinx.

Additionally, the symbol mapper uses a skid buffer [13] to decouple the input and output handshaking to allow back-to-back transfers without a combinatorial path between input and output, as dictated by the AXI specification.
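The behaviour of the PRBS generator and symbol mapper can be sketched in software. The code below is a behavioural illustration under several assumptions: the PRBS-7 polynomial (x^7 + x^6 + 1), the Gray-coded 16-QAM level mapping, and the amplitude scale are all hypothetical choices for this example, and the AXI4S handshaking and skid buffer are not modelled.

```python
def prbs7(n_bits, state=0x7F):
    """Generate n_bits of a PRBS-7 sequence (x^7 + x^6 + 1) from a 7-bit LFSR."""
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1   # XOR of the two tap bits
        state = ((state << 1) | new_bit) & 0x7F       # shift in the feedback bit
        bits.append(new_bit)
    return bits

# Hypothetical Gray-coded mapping of 2 bits onto one amplitude level:
# 00 -> -3, 01 -> -1, 10 -> +3, 11 -> +1 (index = 2 * b0 + b1).
LEVELS = (-3, -1, 3, 1)

def map_16qam(bits, scale=8192):
    """Map groups of 4 bits onto (I, Q) pairs of signed 16-bit integers."""
    symbols = []
    for k in range(0, len(bits) - 3, 4):
        i_level = LEVELS[2 * bits[k] + bits[k + 1]]
        q_level = LEVELS[2 * bits[k + 2] + bits[k + 3]]
        symbols.append((i_level * scale, q_level * scale))
    return symbols

bits = prbs7(4 * 127)          # four periods of the 127-bit sequence
symbols = map_16qam(bits)      # 4 bits per 16-QAM symbol
```

Changing the number of bits consumed per symbol (here 4) is what allows the same structure to support different constellations.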

4.3.2. Pulse-Shaping Filter

The next stage in our signal-processing chain is the FIR filter, which pulse-shapes the symbols. A square-root raised cosine filter is designed in MATLAB and implemented on the FPGA using Xilinx’ FIR Compiler [14]. The FIR Compiler maps the calculated filter coefficients to a set of polyphase subfilters to efficiently combine both the upsampling and the filtering.

Figure 4.12.: Combination of AXI4S Data Width Converter and Broadcaster Timing Diagram

Figure 4.13.: PRBS Generator and Symbol Mapper

Depending on the input sampling frequency and the frequency of the clock driving the FIR Compiler, the FIR Compiler will accept one or more input samples in parallel, or wait multiple clock cycles per input sample. Additionally, the FIR Compiler may return one or more output samples in parallel, or wait multiple clock cycles per output sample. In our case, the FIR Compiler accepts one input sample every 5 clock cycles (corresponding to an input sample rate of 46.08 MHz, i.e. the symbol rate), while an output sample is returned every clock cycle. The filter itself is an SRRC, truncated to 51 coefficients, each 16 bits wide.

Out of the box, the FIR Compiler is not configured to support back-pressure, as this saves resources and likely results in higher performance. The need for back-pressure is avoided by matching the output rate of the FIR Compiler (after ZOH) to the input rate of the sigma-delta modulator. The output rate of the sigma-delta modulator is assumed to match the output rate of the GTH transmitter.
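The idea behind the polyphase structure can be illustrated in software: filtering a zero-stuffed signal gives the same output as running L shorter subfilters at the input rate and interleaving their outputs. The sketch below is a reference model only, not the FIR Compiler's implementation. The 51 coefficients are hypothetical placeholder values (a triangular window), not the SRRC coefficients used in this thesis.

```python
def interpolate_direct(x, h, L):
    """Insert L-1 zeros between samples, then filter with h (reference model)."""
    up = []
    for sample in x:
        up += [sample] + [0.0] * (L - 1)
    return [sum(h[k] * up[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(up))]

def interpolate_polyphase(x, h, L):
    """Same result using the L subfilters h[p::L], each running at the input rate."""
    y = [0.0] * (len(x) * L)
    for p in range(L):
        sub = h[p::L]                     # coefficients h[p], h[p + L], h[p + 2L], ...
        for m in range(len(x)):
            y[m * L + p] = sum(sub[j] * x[m - j]
                               for j in range(len(sub)) if m - j >= 0)
    return y

L1 = 5
h = [1.0 - abs(k - 25) / 26 for k in range(51)]     # placeholder coefficients
x = [1.0, -1.0, 3.0, -3.0, 1.0, 1.0, -3.0, 3.0]     # example input samples
y_direct = interpolate_direct(x, h, L1)
y_poly = interpolate_polyphase(x, h, L1)
max_diff = max(abs(a - b) for a, b in zip(y_direct, y_poly))
```

With 51 coefficients and L1 = 5, each output sample requires at most 11 multiplications in the polyphase form, instead of 51 multiplications of which most operate on inserted zeros.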

4.3.3. Zero-Order Hold and Parallel Sigma-Delta Modulator

The last stage in our signal-processing chain is the SDM, which oversamples and quantizes our signals to two levels, so that they can be transported by the GTH transceivers. As the SDM needs to sample at gigahertz frequencies, a conventional implementation of an SDM would not be possible in FPGA fabric. Fortunately, IDLab Design has a parallelized implementation of a second-order SDM available [8]. The SDM is configured to accept and return 16 samples each clock cycle, so the SDM runs 16 times slower than the effective sample rate. All 16 inputs are physically connected to the same FIR filter output, to efficiently implement the ZOH.


4.3.4. Clocking

As shown in Figure 4.7 and Figure 4.14, the signal-processing chain consists of two clock domains: the ACLK domain and the TXOUTCLK domain. The ACLK is generated by the DMA subsystem for PCIe, while the TXOUTCLK is generated by one of the GTH transmitters. To transfer data between the two domains, a Xilinx AXI4S Data FIFO is configured to use independent read and write clocks. For this to work, however, the reference clocks of the GTY transceivers must be sourced from the same clock generator.

Figure 4.14.: Clocking at the CO (PCIe provides ACLK; one GTY transceiver provides TXOUTCLK, from which TXUSRCLK and TXUSRCLK2 of all GTY transceivers and the user logic clock are derived)

4.4. Remote Radio Unit

In Chapter 3, we explain the signal processing chain at the RRUs, as shown in Figure 3.5. Next, its FPGA implementation, as shown in Figure 4.15, is discussed.