On designing coarse grain reconfigurable arrays to operate in weak inversion


Dian Marie Ross

B.Eng., University of Victoria, 2010

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTERS OF APPLIED SCIENCE

in the Department of Electrical and Computer Engineering

© Dian Marie Ross, 2012

University of Victoria


On Designing Coarse Grain Reconfigurable Arrays to Operate in Weak Inversion

by

Dian Marie Ross

B.Eng., University of Victoria, 2010

Supervisory Committee

Dr. Mihai Sima, Co-Supervisor

(Department of Electrical and Computer Engineering)

Dr. Curran Crawford, Co-Supervisor (Department of Mechanical Engineering)

Dr. Michael McGuire, Departmental Member


ABSTRACT

Field Programmable Gate Arrays (FPGAs) support the reconfigurable computing paradigm by providing an integrated circuit hardware platform that facilitates software-like reconfigurability. The addition of an embedded microprocessor and peripherals to traditional FPGA Configurable Logic Blocks (CLBs) interleaved with interconnections has effectively resulted in a programmable system on-chip. FPGAs are used to support flexible implementations of Application Specific Integrated Circuit (ASIC) functions. Because FPGAs are reconfigurable, they are often used in place of ASICs during the circuit design process. FPGAs are also used when only a small number of ICs are required: ASICs necessitate large manufacturing runs to be economically viable; for smaller runs the use of FPGAs is an economic alternative.

Application domains of interest, such as intelligent guidance systems, medical devices, and sensors, often require low power, inexpensive calculation of transcendental functions. COordinate Rotation DIgital Computer (CORDIC) is an iterative algorithm used to emulate hardware expensive multipliers, such as Multiply/ACcumulate (MAC) units, with only shift and add operations. However, because CORDIC is a sequential algorithm, characterized as having the latency of a serial multiplier, techniques that speed up computational performance have many applications.


To this end, three implementations of standard CORDIC, (i) unrolled hardwired, (ii) unrolled programmable, and (iii) rolled programmable, were implemented on four Xilinx FPGA families: Virtex-4, -5, and -6, and Spartan-6. Although hardwired unrolled was found to have the greatest speed at the expense of no runtime flexibility, and rolled programmable was found to have the greatest flexibility and lowest silicon area consumption at the expense of the longest propagation delay, improvements to CORDIC implementations were still sought.

Three parallelized CORDIC techniques, P-CORDIC, Flat-CORDIC, and Para-CORDIC, were implemented on the same four FPGA families. P-CORDIC and Flat-CORDIC were shown to have the lowest latency under various conditions; Para-CORDIC was found to perform well in deeply pipelined, high throughput circuits. Design rules for when to use standard versus precomputation CORDIC techniques are presented.

To address the low power requirements of many applications of interest, the Unfolded Multiplexor-LRB (UMUX-LRB), patent held by Sima et al., was analyzed in weak inversion across four transistor technology nodes (180nm, 130nm, 90nm, and 65nm). Previous work was also expanded from strong inversion across 180nm, 130nm, and 90nm technology nodes to also include 65nm.

The UMUX-LRB interconnection network is based upon the Xilinx commercial interconnection network. Therefore, this network (MUX-LRB), and another static circuit technique, CMOS-Transmission Gates (CMOS-TG), were profiled across all four technology nodes to provide a baseline of comparison. This analysis found the UMUX-LRB to have the smallest and most balanced rising and falling edge propagation delay, in addition to having the greatest reliability for temperature and process variation.


List of Tables

Table 3.1 CORDIC implementation on FPGA: Area (number of resources utilized)

Table 3.2 CORDIC implementation on FPGA: Delay (ns); L = Logic, R = Routing, T = Total

Table 3.3 CORDIC implementation on FPGAs: the Area-Delay Product (slices × ns)

Table 4.1 Sign precomputation delays on Xilinx FPGAs

Table 5.1 Sub-threshold operation

Table 5.2 Near-threshold operation

Table 5.3 Super-threshold operation of sub-threshold optimized ULRB and CMOS circuits versus super-threshold optimized MUX-LRB

Table 5.4 Temperature variation

Table 5.5 Process variation


List of Figures

Figure 2.1 CORDIC operation modes

Figure 2.2 FPGA Interconnection Network

Figure 2.3 FPGA CLB

Figure 3.1 Unrolled hardwired CORDIC

Figure 3.2 Unrolled programmable CORDIC

Figure 3.3 Rolled CORDIC

Figure 3.4 Area-delay product curve for 32-bit word-length (UH = Unrolled Hardwired, UP = Unrolled Programmable, R = Rolled)

Figure 4.1 P-CORDIC Implementation

Figure 4.2 CORDIC CORE Detail

Figure 4.3 Flat-CORDIC Implementation

Figure 4.4 Para-CORDIC Implementation

Figure 5.1 Standard FPGA pseudo-nMOS interconnect network structure (MC = Memory Cell)


ACKNOWLEDGEMENTS

I would like to thank:

Dr. Mihai Sima, for his strong mentorship and support through a challenging Masters.

Dr. Curran Crawford, for his continuous support throughout the program.

My Mom, for teaching me Vision (and because one always has to thank her Mother).

My Dad, for teaching me Stubbornness (and math, and stubbornness while attempting math...).

My Sister, Darien, for being my best friend, and her kitty, Whiskers, for being her best friend.

My Kitty, Misty, for all the late night musings and support.

& Everyone Else, for teaching me courage and fortitude.

Quality is a term that . . .

can split a world into hip and square, classic and romantic, technological and humanistic, is an entity that can unite a world already split along these lines into one. A real understanding of Quality doesn’t just serve the System, or even beat it or even escape it. A real understanding of Quality captures the System, tames it, and puts it to work for one’s own personal use, while leaving one completely free to fulfill [her] inner destiny.


Chapter 1 Introduction

Reconfigurable devices, such as Field-Programmable Gate Arrays (FPGAs), are becoming increasingly accepted for implementing digital designs due to their flexible, post-fabrication software programmability, with hardware-like performance. Recent developments in FPGA architecture have effectively created a programmable system on-chip by combining traditional FPGA logic blocks and interconnection networks with modern embedded microprocessors and peripherals.

FPGA flexibility reduces the non-recurring engineering expense of designing a full-custom circuit, in addition to eliminating the time and expense associated with custom silicon circuit fabrication; these factors combine to reduce the overall time to market when compared with an Application Specific Integrated Circuit (ASIC). Nevertheless, this flexibility comes at the cost of increased power consumption, propagation delay, and silicon area overhead [27]. The surge in popularity of portable and embedded devices has resulted in power consumption becoming an increasingly critical design constraint within the semiconductor industry. Therefore, reconfigurable solutions that can lower power requirements without too great a sacrifice of delay and silicon area overhead will support a significant market need.

Recent reports indicate that upwards of 70% of power dissipation in reconfigurable devices, including FPGAs, occurs in the interconnection fabric, which includes signal and clock interconnection networks [28, 31]. This dissipation is intrinsic to the structure of the fabric itself: series of long wire segments with large parasitic capacitances are connected by programmable nMOS switches. Depending upon the circuit design, some FPGA resources go unused, adding to the interconnection silicon area and delay overhead when compared with ASIC circuits, which utilize only as much (faster and smaller area) logic as is required to implement one particular circuit. Furthermore, dedicated computational units on-board the FPGA, such as Multiply/ACcumulate (MAC) units, are area and power heavy, and scarce.

The COordinate Rotation DIgital Computer (CORDIC) algorithm is presented as a case study for reducing area and power expensive operations on an FPGA. As a means of providing cost effective solutions for the evaluation of transcendental functions, CORDIC performs vector rotations in a number of different coordinate systems [8, 9]. Such functions have many application domains of great interest today, including intelligent guidance systems, wireless communications, medical devices and Digital Signal Processing (DSP). CORDIC has the potential to offer a reduction in power consumption by utilizing shift and add operations to emulate more expensive on-chip multiplications. However, as a sequential algorithm, hardware solutions are required to improve the computation speed of CORDIC implementations.

FPGAs are configured using a hardware description language (HDL) such as VHDL or Verilog; both languages were used in these CORDIC implementations. To optimize high-level implementation of functions, such as CORDIC, simplified, coarse grain code must be traded off against device-specific fine grain code. Behavioural code allows the compiler to specify hardware implementation; Structural code allows the circuit designer to specify the function mapping to particular devices on the FPGA circuit board.

Three standard CORDIC implementation schemes, (i) unrolled hardwired, (ii) unrolled programmable, and (iii) rolled programmable, were implemented on FPGAs using Structural code and analyzed in terms of area, delay and scalability [21]. The four FPGA families considered include a legacy device, Virtex-4, a state-of-the-art device, Virtex-6, and two in-between devices, Virtex-5 and the newer, economy Spartan-6. Extensive simulations were carried out and a complete set of numerical figures is provided. A number of design rules for implementing standard CORDIC on commercial reconfigurable FPGA devices are proposed based upon use-case, speed and area requirements.

From this profiling of standard CORDIC, it was determined that further improvements to the algorithm latency when mapped onto an FPGA could be made by utilizing a priori knowledge of the direction or ‘sign’ of many CORDIC vector rotations. This knowledge effectively leads to parallelization of the CORDIC algorithm. Three methods of sign precomputation, (i) P-CORDIC [19], (ii) Flat-CORDIC [16, 17], and (iii) Para-CORDIC [18], have been previously proposed in literature as methods for reducing algorithm logic delay when implemented on an ASIC; delay is reported in terms of full adders [17–19]. Nevertheless, little analysis exists on reconfigurable implementations where the major algorithm optimization design goal is to reduce interconnection delay.

All three sign precomputation techniques are shown to improve delay and logic utilization when compared with standard CORDIC. On state-of-the-art FPGAs, such as the Virtex-6, P-CORDIC is found to perform best; on older devices such as Virtex-4, Flat-CORDIC has the best performance. On in-between FPGAs, such as the Virtex-5 or Spartan-6, there is no clear winner between P-CORDIC and Flat-CORDIC. Para-CORDIC never outperforms either P-CORDIC or Flat-CORDIC, but still represents an improvement over standard CORDIC implementations. Furthermore, Para-CORDIC can be pipelined for applications where high throughput is the main design goal. Design rules directing when to use each type of precomputation CORDIC are presented.

Although Flat-CORDIC is the winner in terms of latency, P-CORDIC still has much potential when Flat-CORDIC cannot be used:

• On architectures with small LUTs (4-input) or without dedicated multiplexors (MUXes), such as the Virtex-4;

• On small FPGAs or when LUTs need to be used for other functions; and

• On coarse grain architectures, such as the Shift-Enabled Embedded Reconfigurable Array (ShEERA), patent held by Sima et al. [49].

Unlike commercial FPGA Configurable Logic Blocks (CLBs), the ShEERA architecture is deemed coarse grain as it does not allow the bitwise manipulations required to implement such algorithms as Flat-CORDIC or Para-CORDIC. Instead, it accepts data buses as inputs, as are provided by the P-CORDIC algorithm, and similarly outputs a data bus. Consequently, the design only has access to a coarse grain unit through its I/O ports.

Although the CORDIC algorithm does offer improvements on FPGAs in terms of area and power consumption compared to dedicated multipliers, circuit techniques that offer even greater reduction in power consumption are required for the applications of interest. Therefore, in a bid to reduce power consumption in reconfigurable arrays, devices have been proposed that operate at sub-threshold voltages; this reduction comes at the expense of delay.


Circuit energy in one clock period is calculated as:

$$E_{\text{one clock period}} = V_{\text{SUPPLY}} \cdot I \tag{1.1}$$

where $V_{\text{SUPPLY}}$ is the supply voltage and $I$ is the instantaneous current through the circuit integrated over one clock period. Therefore, operation at a lower supply voltage automatically reduces the energy consumption. Furthermore, the dominant current in weak inversion is leakage current, where $I_{\text{leakage}} \ll I_{\text{drain}}$.
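To make the dependence on supply voltage explicit, Equation 1.1 can be written with the integral spelled out. The sketch below assumes a constant supply voltage over a clock period $T$ and, as a standard first-order model that is not taken from the analysis in this dissertation, that the charge drawn per cycle is that of a switched load capacitance $C_L$. The symbols $T$, $i(t)$, $Q$ and $C_L$ are introduced here for illustration only.

$$E_{\text{one clock period}} = V_{\text{SUPPLY}} \int_0^T i(t)\,dt = V_{\text{SUPPLY}} \cdot Q, \qquad Q \approx C_L V_{\text{SUPPLY}} \;\Rightarrow\; E_{\text{one clock period}} \approx C_L V_{\text{SUPPLY}}^2$$

Under these assumptions, lowering the supply from an illustrative 1.0 V to 0.3 V would scale the switching energy per cycle by $(0.3/1.0)^2 = 0.09$. Leakage energy, which accumulates over the longer clock period needed in weak inversion, is not captured by this simple model.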

For the devices of interest, such as wireless biomedical devices (e.g., hearing aids), cellular phones, or sensors, energy consumption is a critical design constraint. In these applications, the device is required to operate on battery power for extended periods of time; low power circuits improve device reliability.

Standard programmable interconnection circuit techniques cannot be directly applied to weak inversion operation, and previously proposed alternative techniques greatly increase latency and silicon area. Techniques from prior art are analyzed across four technology nodes (180nm, 130nm, 90nm, and 65nm) by operating them in both weak and strong inversion. A new logic family, the Unfolded Multiplexor - Level Restoring Buffer (UMUX-LRB) reconfigurable interconnection, is proposed for operation at both sub- and super-threshold voltages. With smaller silicon area compared with other sub-threshold circuit techniques, UMUX-LRB logic is shown to both reduce power consumption in strong inversion and improve propagation delay performance in weak inversion; the end user is able to prioritize for power or delay with the same device. Design rules and transistor sizing guidelines are presented for designing interconnection networks with reconfigurable supply voltages that can be operated across the range of sub- to super-threshold supply voltages.

This dissertation is organized as follows. In Chapter 2, a background to the CORDIC algorithm and its implementation on commercial FPGAs is highlighted. The FPGA architecture is explained in terms of its interconnection network, interleaved with CLBs which contain: general purpose Look Up Tables (LUTs) that trigger dedicated 2:1 MUX carry chains, and on more modern FPGAs, 5 to 8:1 wide MUXes. Chapter 3 presents three standard CORDIC implementation schemes, (i) unrolled hardwired, (ii) unrolled programmable, and (iii) rolled programmable, on four Xilinx FPGA families: Virtex-4, -5, -6 and Spartan-6. From this analysis, design rules are presented for when to use each implementation.


Chapter 4 expands upon the standard CORDIC in Chapter 3 by presenting and analyzing three precomputation methods of implementing CORDIC, namely (i) P-CORDIC [19], (ii) Flat-CORDIC [16, 17], and (iii) Para-CORDIC [18] on the same four FPGA families.

Chapter 5 presents a low power solution for reconfigurable devices by profiling the Unfolded Multiplexor - LRB (UMUX-LRB) interconnection network at sub-threshold voltages across four technology nodes (180nm, 130nm, 90nm, and 65nm). The UMUX-LRB is also compared against commercial FPGA interconnection networks (MUX-LRB), and another prior art interconnection network, the CMOS - Transmission Gate (CMOS-TG), to demonstrate its reduced propagation delay and power consumption in weak inversion and, with its reconfigurable voltage supply, its acceptable performance in strong inversion.

Chapter 6 concludes the dissertation by summarizing its contributions and highlighting potential future work.


Chapter 2 Background

2.1 The CORDIC Algorithm

COordinate Rotation DIgital Computer (CORDIC) is an iterative method of performing rotations of a 2-D vector by arbitrary angles, θ, using shift and add operations [8, 9], where each angle may be decomposed into a sum of elementary angles. In this way, CORDIC avoids the use of expensive, limited hardware resources, such as Multiply and ACcumulate (MAC) units for computation. Using CORDIC, many transcendental functions may be computed with the latency of a serial multiplier [3]. The CORDIC equations for a given input vector (x, y) are shown in Equation 2.1.

$$\begin{cases} x(j+1) = x(j) - \sigma(j)\,2^{-j}\,y(j) \\ y(j+1) = y(j) + \sigma(j)\,2^{-j}\,x(j) \end{cases} \tag{2.1}$$

where σ(j) ∈ {−1, +1} is the direction of the CORDIC rotation, set from the sign of z(j) or y(j) depending on the mode of operation. A z-path is used to describe the accumulation of the rotation angles according to Equation 2.2:

$$z(j+1) = z(j) - \sigma(j)\arctan 2^{-j} \tag{2.2}$$

such that

$$\sigma(0)\cdot z(0) + \dots + \sigma(j)\cdot z(j) + \dots + \sigma(n)\cdot z(n) = \theta \tag{2.3}$$

where θ is the total angle of the vector rotation, represented as a sum of elementary rotations, z(j).


The CORDIC algorithm is operated in one of two modes: rotation or vectoring. In rotation mode, the angle accumulator is initialized with the desired rotation angle, such that z(0) = θ and σ(0) = sign(z(0)). At each iteration, the rotation decision is made to decrease the magnitude of the residual angle (σ(j) = +1 if z(j) ≥ 0, otherwise σ(j) = −1). In vectoring mode, the CORDIC unit rotates the input vector to align the result vector with the x axis, such that y approaches 0. The accumulator is initialized to zero (z(0) = 0), and σ(j) = −1 if y(j) ≥ 0, otherwise σ(j) = +1. The two modes of operation are pictorially represented in Figure 2.1.

Figure 2.1: CORDIC operation modes.
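The two modes can be illustrated with a short floating-point model of Equations 2.1 and 2.2. This is a minimal sketch rather than the hardware implementations analyzed in this dissertation: the function names `cordic` and `cordic_gain`, the default of 32 iterations, and the use of floating point are assumptions made for illustration. The helper `cordic_gain` computes the constant magnitude growth that the microrotations accumulate, a standard property of the algorithm that is not discussed in the text above.

```python
import math

def cordic(x, y, z, n_iter=32, mode="rotation"):
    """Iterate Equations 2.1 and 2.2; returns the unscaled (x, y, z)."""
    for j in range(n_iter):
        if mode == "rotation":
            sigma = 1 if z >= 0 else -1   # reduce the residual angle z
        else:                             # vectoring
            sigma = -1 if y >= 0 else 1   # rotate the vector toward the x axis
        shift = 2.0 ** -j                 # a right shift by j bits in hardware
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z = z - sigma * math.atan(shift)  # arctan values would come from a table
    return x, y, z

def cordic_gain(n_iter=32):
    """Magnitude growth accumulated by n_iter microrotations (about 1.6468)."""
    gain = 1.0
    for j in range(n_iter):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * j))
    return gain
```

In rotation mode, calling `cordic(1.0 / cordic_gain(), 0.0, theta)` returns approximations of cos(theta) and sin(theta) in `x` and `y`, provided theta lies within the convergence range of the elementary angles (roughly ±1.74 rad). In vectoring mode, `cordic(x, y, 0.0, mode="vectoring")` returns the gain-scaled magnitude in `x` and the angle of the input vector in `z`.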

The CORDIC implementations considered in this analysis use fixed-point arithmetic. As a result, each successive iteration produces one bit of accuracy. In the case of a programmable CORDIC implementation, as examined in Chapter 3, the precision of the algorithm can be dynamically set at runtime with minimal increases in hardware complexity. To prevent round-off errors from accumulating throughout the CORDIC rotation iteration stages, and thereby contaminating the final result, at least log₂ n additional low-order bits are necessary in CORDIC for intermediate values [9].

The algorithmic multiplication by $2^{-j}$ to calculate the new (x, y) coordinates can be implemented with a series of shift operations, eliminating the need for dedicated and expensive multiplication hardware, such as Multiply/ACcumulate (MAC) units. Therefore, the critical algorithm operations are shift and addition/subtraction (which can be implemented as one unit in hardware). The $\arctan(2^{-j})$ values can be read in from values stored in advance in an on-chip ROM, which removes this operation from the critical path of the algorithm; no additional hardware is needed to calculate arctan at runtime.
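A fixed-point version of the rotation-mode iteration shows how shifts, add/subtract operations and a precomputed arctangent table fit together. The sketch below is an illustration under assumed parameters, not one of the implementations evaluated in later chapters: the 16-bit fractional word length, the names `FRAC_BITS`, `GUARD`, `ATAN_ROM` and `cordic_rotate_fixed`, and the choice of n as the iteration count when sizing the guard bits are all hypothetical.

```python
import math

FRAC_BITS = 16                         # assumed fractional word length
N_ITER = 16                            # one iteration per bit of accuracy
GUARD = math.ceil(math.log2(N_ITER))   # extra low-order bits for intermediates
SCALE = 1 << (FRAC_BITS + GUARD)       # integer value that represents 1.0

# Table standing in for the on-chip ROM: arctan(2^-j), quantized ahead of time.
ATAN_ROM = [round(math.atan(2.0 ** -j) * SCALE) for j in range(N_ITER)]

def cordic_rotate_fixed(x, y, z):
    """Rotation-mode CORDIC on integers using only shift, add and subtract."""
    for j in range(N_ITER):
        if z >= 0:   # sigma = +1
            x, y, z = x - (y >> j), y + (x >> j), z - ATAN_ROM[j]
        else:        # sigma = -1
            x, y, z = x + (y >> j), y - (x >> j), z + ATAN_ROM[j]
    return x, y, z
```

Inputs and outputs are integers scaled by `SCALE`; for example, an angle of 0.5 rad is passed as `round(0.5 * SCALE)`. Python's `>>` on negative integers is an arithmetic shift, which matches the sign-extending shift a two's complement datapath would use.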

Since only addition and shift operations are needed to implement rotations according to Equation (2.1), CORDIC is considered hardware inexpensive and is appropriate for implementations on FPGAs and embedded microprocessors. As is apparent in Equations (2.1) and (2.2), CORDIC calculations are recursive; in each iteration, the direction of the next microrotation is dependent upon the previous iterations’ directions. As a result, in traditional CORDIC, the implementation is sequential and therefore slow. Theoretically, to reduce this serial latency, the directions of the microrotations, σ(j), need to be predicted using a minimum number of steps. Then, the recursion described in Equation (2.1) can be fully unrolled, and the resulting combinational CORDIC circuit can be optimized as a single entity. This offers the possibility of tight optimizations based on time and delay criteria, as explored in Chapter 4.

Field-Programmable Gate Array (FPGA) solutions are therefore proposed as a means of providing software-like flexibility with performance near that of dedicated hardware. This flexibility comes at the expense of lower logic density and pre-fabricated logic that can decrease the effective speed and throughput performance of the algorithm.

2.2 Field Programmable Gate Arrays

Field Programmable Gate Arrays (FPGAs) support the reconfigurable computing paradigm by providing an integrated circuit hardware platform that facilitates software reconfigurability [4]. Recent developments in FPGA architecture have resulted in the addition of embedded microprocessors and peripherals to traditional FPGA configurable logic blocks (CLBs) interleaved with interconnection, as shown in Figure 2.3. This combination has effectively resulted in a programmable system-on-chip that has many applications today.

As general-purpose computing devices that can be reconfigured via software, FPGAs are used to support flexible implementations of Application Specific Integrated Circuit (ASIC) functions. Because FPGAs are reconfigurable, they are often used in place of ASICs during the circuit design process. FPGAs are also used when only a small number of integrated circuits are required; ASICs require a large manufacturing run to be economically viable.


Examples of these types of FPGAs include those analyzed in Chapters 3 and 4: the Virtex-4 [11], Virtex-5 [10], Virtex-6 [13], and Spartan-6 [12] platforms from Xilinx, Inc.

An FPGA consists mainly of Look-Up Table (LUT) based programmable logic blocks, and reconfigurable interconnections built with nMOS-tree multiplexors. Compared to ASIC designs, FPGA designs are slower due to the significant extra delay introduced by interconnections [2]. Thus, the latency of an FPGA circuit is determined by two factors: the logic delay in LUTs; and the routing delay in the interconnection paths. Therefore, an understanding of the FPGA’s architecture, the synthesis tool, and the routing and mapping software is essential in obtaining satisfactory system speed.

To improve computation speed, most FPGA architectures provide dedicated resources for specific operations. For example, addition is supported by fast, dedicated carry chain routing for 2:1 multiplexors (MUXes) that are triggered by LUT outputs. Consequently, addition is typically faster than LUT-based Boolean function evaluation, which requires passes through the slow, global interconnection network.

State-of-the-art FPGAs, such as Virtex-6 and Spartan-6, also include a number of small-capacity Random Access Memories (RAM) to support data-intensive applications, embedded multipliers to provide hardware support for filter implementation, and wide multiplexors to support barrel shift operations.

As discussed in Section 2.1, the CORDIC algorithm is implemented using only additions and shift operations. While addition is well supported in FPGAs by means of dedicated carry chains, variable shift operations are difficult to build due to the high cost of multiplexing logic [5]. For this reason, after preliminary analysis of the standard CORDIC algorithm, the CORDIC recursion was completely unrolled and analyzed in Chapter 3. Since the shift operations are then carried out over a known number of positions, they can be hardwired, and thus incur delay only from general routing.
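The cost difference between variable and hardwired shifts can be made concrete with a behavioural model of a logarithmic barrel shifter, one common way of building a variable shifter from 2:1 multiplexors. The sketch below is an assumption-laden illustration, not the shifter structure used in the implementations of Chapter 3; the function names `barrel_shift_right` and `mux_count` are hypothetical, and the model performs a logical (unsigned) right shift.

```python
def barrel_shift_right(value, shift, width=32):
    """Logarithmic barrel shifter modelled as stages of 2:1 multiplexors."""
    mask = (1 << width) - 1
    value &= mask
    stages = (width - 1).bit_length()    # log2(width) for power-of-two widths
    for k in range(stages):
        shifted = value >> (1 << k)      # fixed shift by 2^k: wiring only
        select = (shift >> k) & 1        # bit k of the shift amount
        value = shifted if select else value  # one row of `width` 2:1 MUXes
    return value

def mux_count(width):
    """Number of 2:1 multiplexors used by the model above."""
    return width * (width - 1).bit_length()
```

For a 32-bit word, this model uses five stages of 32 multiplexors, all of which sit on the data path. When the shift amount is fixed, as in each stage of an unrolled CORDIC, every `select` bit is a constant and the multiplexors reduce to wiring.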

When mapped to an FPGA, the CORDIC calculations utilize the flexible Look-Up Table (LUT) hardware resource. The Virtex-4 architecture includes 4-input LUTs that support the implementation of any 4-input function; the Virtex-5, Virtex-6 and Spartan-6 architectures all support up to 6-input LUTs. The output of each LUT is associated with a 2:1 multiplexor (MUX) that is locally connected to form a fast carry chain for implementing highly parallelized functions. In addition, on Virtex-6 and Spartan-6 FPGAs, there are dedicated local wide MUXes (5:1 to 8:1) that were utilized in the CORDIC implementations. Finally, in the case of CORDIC rotation angle precomputation, read-only memory (ROM) blocks were also used; one block per CORDIC implementation.

As reconfigurable computing units, FPGAs have the functionality to implement ASIC functions by using multiple levels of LUTs. However, there is a large time penalty associated with going through global interconnect between LUT levels. Given the fact that many modern computing applications require real-time processing, it is of fundamental importance to use dedicated local resources, which incur a propagation delay that is on the order of 5% of that of global interconnect.

FPGA CLBs are comprised of dedicated primitive resources including look-up tables (LUTs), which can implement any 4- to 6-input function depending upon architecture type, and dedicated fast carry chains of 2:1 MUXes, which are triggered by the LUT outputs. Of the four Xilinx FPGA families analyzed, Virtex-4 is limited to 4-input LUTs; Virtex-5, Virtex-6 and Spartan-6 all have 6-input LUTs [20]. In addition to these primitive resources, modern FPGAs also provide limited on-board coarse grained DSP units, including multiply and accumulate (MAC) units; embedded memory blocks, such as SRAM; and, on state-of-the-art FPGAs, local wide MUXes (5- to 8-input). Spartan-6 FPGAs also contain dedicated CORDIC cores, but these implementations focus on maximizing system throughput by increasing clock speed, as opposed to minimizing latency.

The major trade-off for FPGAs is fast logic computation using dedicated LUT and carry chain primitives, at the expense of slow propagation through global interconnect. Nevertheless, Virtex-6 FPGAs have been found to improve upon routing delay compared to logic delay [20]. Spartan-6 FPGAs have been found to have the worst routing delay; this is the trade-off associated with being a low-cost modern FPGA.

If implemented as an ASIC, the parallelized CORDIC methods presented in this dissertation can be optimized in hardware to reduce logic delay; interconnection speed is known to be fast because of fast routing for shifters using tri-state logic. On an FPGA, however, the interconnection network is much slower than its ASIC counterpart: tri-state logic is not available, so CORDIC shifters must instead be implemented through multi-level look-up tables (LUTs) configured as multiplexors (MUXes). As a result, unlike ASIC implemented designs, FPGA implemented designs must be optimized to reduce interconnection delay. Given the different optimization goals associated with FPGAs versus ASICs, the potential benefits of these CORDIC algorithms when mapped to an FPGA have not yet been quantified or evaluated.


Field Programmable Gate Arrays (FPGAs) support the reconfigurable computing paradigm by providing an integrated circuit hardware platform that facilitates software-like reconfigurability. The addition of embedded microprocessors and peripherals to traditional FPGA Configurable Logic Blocks (CLBs) interleaved with interconnections has effectively resulted in a programmable system on-chip. FPGAs are used to support flexible implementations of Application-Specific Integrated Circuit (ASIC) functions. Because FPGAs are reconfigurable, they are often used in place of ASICs during the circuit design process. FPGAs are also used when only a small number of integrated circuits are required: ASICs require a large manufacturing run to be economically viable; for smaller runs the use of FPGAs is an economic alternative.

FPGA CLBs are comprised of dedicated primitive resources including look-up tables (LUTs), which can implement any 4- to 6-input function depending upon architecture type, and dedicated fast carry chains of 2:1 MUXes, which are triggered by the LUT outputs. Of the four Xilinx FPGA families analyzed, Virtex-4 is limited to 4-input LUTs; Virtex-5, -6 and Spartan-6 all have 6-input LUTs [20]. In addition to these primitive computing resources, modern FPGAs also provide limited on-board coarse grained DSP units, such as multiply and accumulate (MAC) units, embedded memory blocks, such as SRAM units, and local wide MUXes (5- to 8-input). Spartan-6 FPGAs also contain dedicated CORDIC cores, but core implementations focus on maximizing system throughput by increasing clock speed, as opposed to minimizing latency through algorithm design.

The major trade-off for FPGAs is fast logic computation, using LUTs and carry-chain primitives, balanced against the expense of slow signal propagation through a global interconnect. Nevertheless, state-of-the-art Virtex-6 FPGAs have been found to improve upon this trade-off by balancing routing and logic delay [20]. Spartan-6 FPGAs were found to have the worst routing delay amongst current generation FPGAs tested; this is the trade-off associated with being a low-cost modern FPGA implementation [21].

Figure 2.2 shows the configuration of the standard Xilinx-style FPGA interconnection network. Figure 2.3 shows the detail of a Configurable Logic Block (CLB), the squares in Figure 2.2.

If these methods are implemented as an ASIC, the hardware design of the parallelized CORDIC schemes can be optimized to reduce logic delay; fast routing using tri-state logic can be used for the interconnection of the shifters. On an FPGA, however, the interconnection network is much slower than its ASIC counterpart: tri-state logic is not available, so CORDIC shifters must instead be implemented through multi-level look-up tables (LUTs) configured as multiplexors (MUXes). As a result, unlike ASIC implemented designs, FPGA implemented designs must be optimized for a different design constraint: the reduction of interconnection delay. Because of the different optimization goals associated with FPGAs versus ASICs, the potential benefits of mapping these CORDIC algorithms onto an FPGA have not yet been quantified or evaluated in literature.

Figure 2.2: FPGA Interconnection Network.

Figure 2.3: FPGA CLB.


Chapter 3 The Standard CORDIC Approach on FPGA

This Chapter summarizes analysis previously presented at The IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim) 2011 and included in the proceedings: Design Rules for Implementing CORDIC on FPGAs, Ross et al, pg. 797-802.[21]. In this analysis, the performance of three FPGA-mapped CORDIC implementations is analyzed in terms of area, delay and scalability. These hardware implementations are comprised of (i) fully unrolled units with hardwired shift (unrolled hardwired), (ii) fully unrolled units with programmable shift (unrolled programmable), and (iii) rolled units with programmable shift (rolled programmable) which are mapped to multiple families of modern Xilinx FPGAs [10–13].

The main results show that the unrolled hardwired scheme is the best option when latency is more important than throughput, whereas the rolled programmable scheme is more appropriate when a large number of CORDIC operations are to be executed in parallel. Building upon the parallel nature of many CORDIC applications, this analysis then expands to consider precomputation-based CORDIC implementations. These parallelized implementations have been previously mapped to ASICs, but given the different architecture and design constraints between ASICs and FPGAs, they must be re-assessed for FPGA operation. The contributions are as follows:

• Comparative analysis of three standard CORDIC implementation schemes onto modern FPGA families.

• Design recommendations for mapping standard CORDIC onto four types of modern FPGAs.


Section 3.1 presents three CORDIC implementation schemes on FPGA; Section 3.2 discusses the numerical results, proposes design rules, and highlights the need for further CORDIC design solutions. These alternative solutions are then explored in Chapter 4.

3.1 CORDIC Implementation Schemes

Three implementation schemes of the CORDIC algorithm (unrolled hardwired, unrolled programmable, and rolled programmable) were implemented on FPGA and tested for five bit widths. These bit widths were chosen to address both the parallelized nature of FPGA routing for word lengths of powers of 2 (16, 32, and 64), and the intermediate word lengths (24, 48) often used in cryptographic applications. The FPGA families considered in this analysis range from legacy devices, Virtex-4 and Virtex-5, to state-of-the-art devices, Virtex-6 and Spartan-6. Spartan-6 FPGAs are included as a cost-effective, modern FPGA that provides a good implementation trade-off between available on-chip resources and increased propagation delay.

As mentioned in Chapter 2, runtime operations include shift and add; the arctangent rotation values are precalculated and stored at compile time in ROM. To ensure a fair comparison, a tight area is needed for each scheme. As a result, for the purposes of standardizing the comparison, many CORDIC units are deployed on-chip, with each implementation scheme pipelined between each CORDIC unit.

Each of the mentioned schemes was mapped onto four FPGA devices from different families (Virtex-4 LX60, Virtex-5 LX50, Virtex-6 XC6VLX75T, and Spartan-6 XC6SLX75). The FPGA implementation was completed by compiling Verilog source code, then confirming accurate algorithm operation post-place-and-route on-chip. The shift operations in the rolled scheme make intensive use of dedicated FPGA resources: carry chains in the Virtex-4, and wide multiplexors in the Virtex-5, Virtex-6, and Spartan-6. The implementation schemes are discussed below.

1. Unrolled hardwired scheme (Fig. 3.1), in which each CORDIC iteration is given individual hardware support. The bit width and, therefore, the number of iterations, are statically configured at compile time. As a result, the shift operations are hardwired. This scheme offers the most efficient CORDIC implementation in terms of area and reduced delay, but comes at the cost of runtime flexibility.


Figure 3.1: Unrolled hardwired CORDIC.

2. Unrolled programmable scheme (Fig. 3.2), in which each CORDIC iteration is given individual hardware support, but the bit width and the number of iterations can be dynamically configured (programmed) at runtime to provide system scalability and flexibility. This flexibility comes at the cost of increased delay and area. However, as can be seen in Fig. 3.2, the Z-path of the CORDIC algorithm is calculated the same as for the unrolled hardwired scheme; the additional area is required for calculating the X- and Y-paths.

3. Rolled programmable scheme (Fig. 3.3), in which only a single iteration is given hardware support. The hardware is then reused for subsequent CORDIC iterations. A controller is used to keep track of the current CORDIC iteration and manage the X-, Y-, and Z-paths with runtime configurable shift amounts. Rolled programmable represents the most flexible and area efficient scheme, at the cost of increased hardware complexity and the greatest delay.

Any other standard CORDIC implementation scheme can be described in terms of these three basic schemes. Section 3.2 quantifies the performance of these implementation schemes by providing numerical results for area and delay. A comparative analysis of the three CORDIC implementation schemes is then presented.
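As an informal software illustration of the rolled programmable scheme, the following Python sketch reuses one shift-and-add stage inside a loop, with a precomputed arctangent table standing in for the ROM. It is a minimal model under stated assumptions, not the Verilog used for the reported results: the names `FRAC_BITS`, `N_ITER`, `ATAN_ROM`, and `rolled_cordic_rotate` are hypothetical, the 16-bit fraction is an arbitrary choice, and the sign convention (a clockwise rotation) is assumed.

```python
import math

FRAC_BITS = 16   # assumed fixed-point fraction width (hypothetical)
N_ITER = 16      # assumed number of CORDIC iterations (hypothetical)

# Arctangent table, computed once up front to stand in for the ROM.
ATAN_ROM = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(N_ITER)]

# CORDIC gain accumulated over N_ITER microrotations.
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))

def to_fixed(v):
    """Convert a float to the assumed fixed-point format."""
    return round(v * (1 << FRAC_BITS))

def to_float(v):
    """Convert an assumed fixed-point value back to a float."""
    return v / (1 << FRAC_BITS)

def rolled_cordic_rotate(x, y, z):
    """Rotation-mode CORDIC on integers using only shifts and adds.

    Each loop pass reuses the same stage, as a rolled unit would.
    Assumed sign convention: the vector is rotated clockwise by z.
    """
    for i in range(N_ITER):            # controller: iteration counter
        sigma = 1 if z >= 0 else -1    # direction taken from the Z-path
        # X- and Y-paths: programmable shift by i, then add or subtract.
        x, y = x + sigma * (y >> i), y - sigma * (x >> i)
        z -= sigma * ATAN_ROM[i]       # Z-path: subtract the table entry
    return x, y, z
```

In this model an unrolled scheme would correspond to instantiating the loop body once per iteration, with the shift amount `i` fixed in each instance (hardwired) or selectable at runtime (programmable).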


Figure 3.2: Unrolled programmable CORDIC.

Figure 3.3: Rolled CORDIC.

3.2 Results and Discussions

The three CORDIC implementation schemes (unrolled hardwired, unrolled programmable, and rolled programmable) were each mapped to four modern FPGA devices: Virtex-4 LX60, Virtex-5 LX50, Virtex-6 XC6VLX75T, and Spartan-6 XC6SLX75. From this analysis, three types of measurements are reported: (i) silicon area (utilization figures), (ii) propagation delay, and (iii) area-delay product.


Table 3.1 presents the area figures. From the results, it is apparent that, for all bit widths, the rolled programmable scheme requires the smallest area of the three schemes. Since the two shift units and two adders of the rolled programmable scheme occupy less area than the 2N adders of the unrolled hardwired scheme, the shift unit is never implemented as a multiplier by a power of 2. Furthermore, the ratio of the unrolled programmable area to the rolled programmable area increases with bit width and logic family, which means that the two shift units used in the rolled programmable scheme exhibit a decreasing contribution to the total CORDIC unit area with increasing bit width. In addition, the unrolled hardwired and unrolled programmable schemes have comparable slice utilization figures when they are mapped onto modern FPGA families (Virtex-5, Virtex-6, and Spartan-6). For the older Virtex-4 family, where shift is implemented using dedicated carry chains, the unrolled programmable scheme uses a 4 to 5 times larger area than the unrolled hardwired scheme. These observations lead to the conclusion that the more recent FPGA architectures with wide multiplexor primitives (Virtex-5, Virtex-6, and Spartan-6) support shift operations better than the older ones with only carry chains (Virtex-4).

An application domain that can benefit from CORDIC units mapped onto FPGAs is wireless communications, for which QR decomposition (QRD) of the wireless channel matrix is required [6, 7]. Assuming a 4 × 4 complex-valued matrix, which is common in Multiple-Input Multiple-Output (MIMO) communications, a fully parallel implementation of QRD requires 32 CORDIC units. For a precision of 32 bits, only the rolled programmable scheme leads to a single chip solution on medium-sized FPGAs. Since the CORDIC units will operate in parallel, the longer latency of the rolled programmable scheme is actually hidden.
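The arithmetic behind the single-chip observation can be sketched as follows, using the 32-bit slice counts reported for Virtex-5 / LX50 in Table 3.1. The helper names are hypothetical, and the device capacity is deliberately left as a parameter to be taken from the vendor data sheet, since it is not stated in this section.

```python
# Slices per 32-bit CORDIC unit on Virtex-5 / LX50, copied from Table 3.1.
SLICES_32BIT_VIRTEX5 = {
    "unrolled hardwired": 2715,
    "unrolled programmable": 2715,
    "rolled": 200,
}

QRD_UNITS = 32  # fully parallel QRD of a 4 x 4 complex-valued matrix (from the text)

def total_slices(slices_per_unit, units=QRD_UNITS):
    """Slices needed when every CORDIC unit is instantiated on-chip."""
    return slices_per_unit * units

def fits(slices_per_unit, device_slices, units=QRD_UNITS):
    """device_slices is a placeholder supplied by the reader from the data sheet."""
    return total_slices(slices_per_unit, units) <= device_slices

totals = {name: total_slices(s) for name, s in SLICES_32BIT_VIRTEX5.items()}
```

Under these figures the rolled scheme needs 6400 slices for 32 units, whereas either unrolled scheme needs 86880, which is more than an order of magnitude larger.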

Table 3.2 shows propagation delay figures per pipeline stage. The unrolled hardwired scheme exhibits the smallest delay, mainly because the shift operation is hardwired, while the largest delay is encountered in the rolled programmable scheme. The propagation delay through logic is larger than the propagation delay through interconnect for the unrolled hardwired scheme mapped onto Virtex devices (since the interconnect is short when pipelines are built), but it is smaller for the rolled programmable scheme. These two delay components are more balanced when the CORDIC is mapped onto the cost-effective (and as a result, slower) Spartan-6 family. Therefore, the main design effort for unrolled CORDIC should be directed toward the implementation of logic functions (for example, by selecting an FPGA with faster logic, or with architectural features to support shift operations), whereas an FPGA family with good routing capabilities is needed when rolled CORDIC is desired (since the optimization should address interconnect rather than logic). If a trade-off is required, then the unrolled programmable scheme represents the best choice.

Area-Delay Product (ADP) figures are presented in Table 3.3. The rolled programmable scheme exhibits the lowest ADP; in fact, its ADP is much lower than that of the unrolled hardwired scheme (which is the next competitor). Therefore, the rolled programmable scheme is the best option when a large number of vector rotations need to be calculated in parallel (for example, in QR decomposition), whereas the unrolled hardwired scheme should be used when the propagation delay is critical.
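The ADP metric can be illustrated with a short calculation on the 32-bit Virtex-6 / XC6VLX75T figures from Tables 3.1 and 3.2. This is only a sketch: the function name is hypothetical, and because the delays in Table 3.2 are rounded, the products below agree with Table 3.3 only approximately.

```python
# 32-bit figures for Virtex-6 / XC6VLX75T, copied from Tables 3.1 and 3.2.
# UH = unrolled hardwired, UP = unrolled programmable, R = rolled.
SLICES = {"UH": 2716, "UP": 2715, "R": 203}
TOTAL_DELAY_NS = {"UH": 2.0, "UP": 3.3, "R": 3.5}

def area_delay_product(slices, delay_ns):
    """Area-delay product in slices x ns."""
    return slices * delay_ns

adp = {k: area_delay_product(SLICES[k], TOTAL_DELAY_NS[k]) for k in SLICES}

# Schemes ordered from lowest (best) to highest ADP.
ranking = sorted(adp, key=adp.get)
```

With these inputs the ordering is rolled, then unrolled hardwired, then unrolled programmable, matching the ordering of the corresponding Table 3.3 entries.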

A better view of these results is shown in Figure 3.4. It is apparent that the best area-delay product is achieved for the rolled programmable scheme. The unrolled hardwired scheme is a good trade-off between reduced delay and area consumption.

Figure 3.4: Area-delay product curve for 32-bit word-length (UH – Unrolled Hardwired, UP – Unrolled Programmable, R – Rolled).

Based on these observations, a set of design rules and recommendations is presented:

1. Use the unrolled hardwired scheme when a small number of CORDIC operations is to be executed and latency is more important than throughput.

2. Use the rolled programmable scheme when a large number of CORDIC operations are to be executed in parallel (for example, in QR decomposition, or channel derotation in Orthogonal Frequency Division Multiplexing (OFDM) communications).


3. Use an FPGA architecture with as many levels of embedded multiplexors as possible to implement shift operations without going through global interconnect.

4. Use an FPGA with good routing architecture when the rolled programmable scheme is required.

5. Use an FPGA having as many inputs as possible for the look-up tables when the unrolled hardwired scheme is required.

6. Use the unrolled programmable scheme when a trade-off in terms of area and propagation delay is desired.

7. Cost-effective FPGA families (e.g., Spartan-6) provide a good trade-off: ample on-chip resources and area, at the cost of some additional propagation delay.


Table 3.1: CORDIC implementation on FPGA: Area (number of resources utilized)

Family / Device       Scheme                            Bit-width 16           Bit-width 32           Bit-width 64
                                                   Slice     FF   LUT4    Slice     FF   LUT4    Slice     FF   LUT4
Virtex-4 / LX60       Unrolled hardwired             353    647    647     1430   2741   2777     6376  11386  11527
                      Unrolled programmable         1242    636   2353     6839   2741  13057    32383  11444  60180
                      Rolled                         140    106    266      341    201    659      739    409   1408
Virtex-5 / LX50       Unrolled hardwired             636    663    663     2715   2772   2772    11443  11505  11505
                      Unrolled programmable          636   1339   1339     2715   7980   7980    11443  34156  34156
                      Rolled                         103    159    159      200    489    489      393    993    993
Virtex-6 / XC6VLX75T  Unrolled hardwired             637   1066   1066     2716   4536   4536    11452  18998  18998
                      Unrolled programmable          636   1712   1712     2715   9843   9843    11444  44776  44776
                      Rolled                         104    207    207      203    553    553      393   1150   1150
Spartan-6 / XC6SLX75  Unrolled hardwired             674   1066   1066     2848   4536   4536    11779  18998  18998
                      Unrolled programmable          648   1712   1712     2742   9712   9712    11443  44855  44855
                      Rolled                         110    211    211      218    560    560      397   1156   1156


Table 3.2: CORDIC implementation on FPGA: Delay (ns). L = Logic, R = Routing, T = Total

Family / Device       Scheme                               Bit-width 16              Bit-width 32              Bit-width 64
                                                    L     R     T   L/R       L     R     T   L/R       L     R     T   L/R
Virtex-4 / LX60       Unrolled hardwired          1.6   1.3   2.9  1.25     2.2   1.5   3.7  1.47     3.6   2.0   5.6  1.80
                      Unrolled programmable       2.5   1.8   4.3  1.39     3.0   2.5   5.5  1.20     4.3   2.9   7.2  1.48
                      Rolled                      2.5   3.5   6.0  0.71     3.5   4.0   7.5  0.86     5.0   3.5   8.5  1.43
Virtex-5 / LX50       Unrolled hardwired          1.5   0.9   2.4  1.67     1.9   0.9   2.8  2.11     2.6   1.0   3.6  2.60
                      Unrolled programmable       2.0   2.5   4.5  0.80     2.2   2.1   4.3  1.05     2.8   2.8   5.6  1.00
                      Rolled                      1.8   2.5   4.3  0.72     2.0   2.6   4.6  0.77     2.9   3.4   6.3  0.85
Virtex-6 / XC6VLX75T  Unrolled hardwired          1.0   0.8   1.8  1.25     1.2   0.8   2.0  1.50     1.9   0.9   2.8  2.11
                      Unrolled programmable       1.1   1.5   2.6  0.73     1.7   1.6   3.3  1.06     2.1   2.4   4.5  0.86
                      Rolled                      1.3   1.8   3.1  0.72     1.4   2.1   3.5  0.67     2.2   3.2   5.4  0.69
Spartan-6 / XC6SLX75  Unrolled hardwired          1.4   1.9   3.3  0.74     1.8   1.9   3.7  0.95     2.3   2.0   4.3  1.15
                      Unrolled programmable       1.9   3.0   4.9  0.63     2.6   3.6   6.2  0.72     3.1   4.7   7.8  0.66
                      Rolled                      2.1   4.4   6.5  0.48     2.5   4.5   7.0  0.56     2.9   7.0   9.9  0.41


Table 3.3: CORDIC implementation on FPGAs: the Area-Delay Product (slices × ns)

Family / Device       Scheme                   Bit-width
                                                   16      24      32      48      64
Virtex-4 / LX60       Unrolled hardwired         1031    2555    5154   16317   36037
                      Unrolled programmable      5241  184097   37984  136990  232154
                      Rolled                      842    1750    2566    6851    6214
Virtex-5 / LX50       Unrolled hardwired         1530    3774    7518   20300   41183
                      Unrolled programmable      2851    7540   11905   42164   63726
                      Rolled                      435     836     931    2101    2496
Virtex-6 / XC6VLX75T  Unrolled hardwired         1143    2794    5524   14269   33131
                      Unrolled programmable      1621    5214    9160   29166   52047
                      Rolled                      319     641     708    1438    2118
Spartan-6 / XC6SLX75  Unrolled hardwired         2265    5358   10421   25942   51156
                      Unrolled programmable      3215   11304   17214   56175   89278
                      Rolled                      714    1368    1503    3084    4015


Chapter 4 Precomputation CORDIC

To improve upon the serial performance of standard CORDIC, as presented in Chapter 3, parallelized algorithms have been proposed [16–19]. Each algorithm involves the precomputation of the direction of the rotation at each iteration; once this step is completed, the algorithm delay is reduced to the combinational logic of a fully-unrolled CORDIC unit. It should be noted that these approaches are limited to rotation mode only, while standard CORDIC allows for both rotation and vectoring modes.

If these methods are implemented on an ASIC, the hardware design can be optimized to reduce logic delay; fast routing using tri-state logic can be used for the interconnection of the shifters. On an FPGA, however, the interconnection network is much slower than its ASIC counterpart: tri-state logic is not available, so CORDIC shifters must instead be implemented through multi-level look-up tables (LUTs) configured as multiplexors (MUXes). As a result, unlike ASIC implemented designs, FPGA implemented designs must be optimized to reduce interconnection delay. The different optimization goals associated with FPGAs versus ASICs for CORDIC implementations and the potential benefits of these algorithms when mapped onto an FPGA have not yet been quantified or evaluated in the literature.
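To make the notion of a shifter built from multiple levels of multiplexors concrete, the following Python sketch models a logarithmic shifter in which level k either passes its input through or selects a copy shifted by 2^k positions. It is a behavioural illustration only; the function name and the 16-bit default width are hypothetical, and no claim is made that the synthesized FPGA shifters have exactly this structure.

```python
def mux_shift_right(value, amount, width=16):
    """Arithmetic right shift built from log2(width) levels of 2:1 multiplexors.

    Level k selects between its input and the same input shifted by 2**k,
    controlled by bit k of the shift amount.
    """
    levels = (width - 1).bit_length()      # 4 levels for a 16-bit word
    for k in range(levels):
        shifted = value >> (1 << k)        # fixed shift: pure wiring in hardware
        select = (amount >> k) & 1         # mux select line for this level
        value = shifted if select else value
    return value
```

Each additional level of multiplexing corresponds to another pass through LUTs and routing, which is why the interconnection delay of the shifter grows with the word length in this model.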

An analysis of parallelized CORDIC implementations and system-level metrics for implementations on state-of-the-art commercial FPGAs is provided. This analysis was first presented at the 45th Annual Asilomar Conference on Signals, Systems, and Computers (Asilomar 2011), and is included in the proceedings as Exploration of Sign Precomputation-based CORDIC in Reconfigurable Systems, Ross et al., pp. 2186–2191 [22]. For this analysis, three implementations of precomputation-based CORDIC were mapped to four families of FPGAs available from Xilinx Inc. [20]. Each implementation was then analyzed in terms of area, delay, and routing to provide rules for implementing CORDIC on FPGAs [21]. The standard CORDIC algorithm was used as a baseline measure of performance; the analysis was then extended to include alternate parallelized CORDIC implementations and system-level considerations.

Previous analysis of each technique was based on compiled ASIC circuits; delays were reported in terms of full adders [17–19]. The P-CORDIC [19], Flat-CORDIC [16, 17], and Para-CORDIC [18] algorithms were mapped to four Xilinx FPGA families, Virtex-4, -5, -6, and Spartan-6, for 16-, 24-, and 32-bit word lengths. Additionally, as a further point of comparison, a set of metrics for an Ideal CORDIC algorithm is also presented to represent the best case boundary condition. In the Ideal case, all microrotation directions are known in advance and the final result, (x′, y′), can be directly calculated with no recursion.

All designs were compiled from VHDL code with the Xilinx ISE toolset. Layouts were verified post place-and-route to ensure the efficient utilization of on-board resources, such as fast carry chains and wide MUXes, and to provide device-level optimization. Further analysis addresses the predicted bottlenecks in an FPGA-based CORDIC system. The delay associated with sign precomputation is evaluated; design recommendations for mapping parallelized CORDIC implementations onto modern FPGAs are presented. To summarize, the contributions are as follows:

• Comparative analysis of three previously proposed parallelized CORDIC implementation schemes with standard rotation and Ideal CORDIC implementations mapped onto four Xilinx FPGA families.

• Design recommendations for mapping parallelized CORDIC onto modern FPGAs.

This chapter is organized as follows: Section 4.1 provides a background to the three parallelized precomputation CORDIC algorithms and implementation techniques. In Section 4.2, the Xilinx simulation framework and the parallelized CORDIC results are presented. In Section 4.3, these results are compared with standard and Ideal CORDIC implementations; based upon this analysis, design rules are presented with consideration to variable word-lengths (16-, 24-, and 32-bits), and four different FPGA families (Virtex-4, -5 and -6, and Spartan-6).


4.1 Precomputation CORDIC

This analysis implements and compares three prior art algorithms for precomputing the directions of the CORDIC microrotations: P-CORDIC [19], Flat-CORDIC [16, 17], and Para-CORDIC [18]. Prior to this exploration, these algorithms had only been proposed for custom silicon hardware. Section 4.2 addresses the computational delays and hardware costs associated with implementing these algorithms on FPGAs. The three parallel algorithms analyzed in this paper are presented below. Note that while traditional CORDIC can be calculated in two modes, vectoring and rotation, the proposed parallel algorithms may only be calculated in rotation mode.

4.1.1 P-CORDIC

P-CORDIC [19] presents a parallelized approach over the standard CORDIC algorithm for precomputing the microrotation directions, based upon two key observations:

• For large elementary angles, the difference $\epsilon_i = 2^{-i} - \arctan(2^{-i})$ is small and decreases approximately by a factor of 8 with increasing i.

• For small angles, $\arctan(2^{-i}) \approx 2^{-i}$ (within the precision of N bits) for $2^{-i} \ll 1$ (for large i).
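Both observations can be checked numerically. The sketch below evaluates the offset between the shift term $2^{-i}$ and its arctangent, compares it with the leading Taylor term $2^{-3i}/3$, and computes the ratio between consecutive offsets. The function name `epsilon` is an illustrative choice, and double-precision floating point is assumed to be accurate enough for the small range of i used here.

```python
import math

def epsilon(i):
    """Offset between the shift term 2**-i and its arctangent."""
    return 2.0 ** -i - math.atan(2.0 ** -i)

# Ratio between consecutive offsets: approaches 8 as i grows.
ratios = [epsilon(i) / epsilon(i + 1) for i in range(3, 10)]

# Leading Taylor term of the offset, 2**(-3i) / 3, for comparison.
leading_terms = [2.0 ** (-3 * i) / 3.0 for i in range(3, 10)]
relative_gap = [abs(epsilon(i) - t) / t
                for i, t in zip(range(3, 10), leading_terms)]
```

The ratios stay slightly below 8 and the offset is within about one percent of its leading Taylor term from i = 3 onward, which is consistent with the factor-of-8 decrease stated above.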

Assume the rotation angle θ is described by:

$$\theta = \sum_{i=0}^{N} \sigma(i) \cdot \arctan(2^{-i}) \tag{4.1}$$

where σ(i) ∈ {−1, +1} represents the direction of the microrotations at iteration i. The goal of the P-CORDIC algorithm is to calculate an N-bit number, d, whose value in 2's complement representation is given by:

$$d = \sum_{i=0}^{N} d(i)\, 2^{-i} \tag{4.2}$$

where the binary digits d(i) represent a Bipolar-to-Binary Recoding (BBR) of the microrotation signs σ(i):


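The BBR of the microrotation signs, d, is written in terms of the rotation angle θ as:

$$d = \sum_{i=0}^{N} d_i 2^{-i} = \frac{\theta}{2} + \underbrace{1 + \frac{1}{2}\sum_{i=0}^{N} \epsilon_i}_{c_1} - \operatorname{sgn}(\theta)\,\epsilon_0 - \underbrace{\sum_{i=1}^{N} d_i \epsilon_i}_{\delta} \tag{4.4}$$

where the partial offset scaling factors, $\epsilon_i$, are given by:

$$\begin{cases} \epsilon_0 = 1 - \arctan(1) \\ \epsilon_i = 2^{-i} - \arctan(2^{-i}) \end{cases} \tag{4.5}$$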

To implement the P-CORDIC algorithm, Equation 4.4 is solved to obtain d. This calculation is refactored such that the terms that can be calculated offline ($c_1, \epsilon_0, \ldots, \epsilon_i, \ldots, \epsilon_N$) are separated from the critical path calculations that must be completed online (δ), as given by:

$$\delta = \sum_{i=1}^{N/3} d_i \epsilon_i. \tag{4.6}$$

The partial offsets, $\epsilon_i$, a direct result of the P-CORDIC equation refactoring, are known to decrease by a factor of 8 each iteration, rapidly approaching zero for higher order iterations [19]. As a result, $\epsilon_i$ only needs to be calculated for i = 0, 1, ..., N/3. These terms are precalculated offline and all possible sums from Equation (4.6) are stored in ROM. During the evaluation of the sum shown in Equation (4.6), the N/3 Most-Significant bits (MSb) of the rotation angle θ are used to address the ROM block and obtain δ.
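The idea of storing all possible sums of Equation (4.6) can be sketched in Python as follows. This is an illustration under simplifying assumptions rather than the P-CORDIC address mapping itself: the table here is indexed directly by the bits $d_1, \ldots, d_k$ (with $d_1$ assumed to be the most significant address bit), whereas the text addresses the ROM with the MSbs of θ, and that mapping is not modelled. The names `epsilon` and `build_delta_rom` are hypothetical.

```python
import math
from itertools import product

def epsilon(i):
    """Partial offset 2**-i - atan(2**-i), following Equation 4.5."""
    return 2.0 ** -i - math.atan(2.0 ** -i)

def build_delta_rom(k):
    """All 2**k sums of d_i * epsilon_i for i = 1..k.

    The table index is the binary number d_1 d_2 ... d_k, with d_1 taken
    as the most significant address bit (an assumed ordering).
    """
    rom = []
    for bits in product((0, 1), repeat=k):
        rom.append(sum(d * epsilon(i) for i, d in enumerate(bits, start=1)))
    return rom

rom = build_delta_rom(8)   # k = N/3 for a 24-bit word length
```

A single look-up such as `rom[0b10100000]` then replaces the online additions of the corresponding offsets, which is the trade of memory for latency that the scheme relies on.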

To address the truncation error associated with the calculation of δ, an offset version of the lowest order MSb is also calculated and stored in ROM, effectively doubling the required nominal ROM size. The highest order Least-Significant bit (LSb) selects the appropriate value of δ. The remaining 2N/3 LSb microrotation directions are directly inferred according to the small angle approximation. Once all microrotation directions are known from precomputation and direct inference, the input x(i) and y(i) paths are combinationally calculated online; the accumulating z-path of the standard CORDIC algorithm is eliminated.

Using this methodology, the number of online additions increases relatively slowly at the rate of log₂(N) for increasing wordlength, N. As a result, once δ has been retrieved from memory, all CORDIC rotations are effectively completed in parallel. The trade-off for this parallelization, however, is that P-CORDIC is limited in its scalability: the size of the ROM table increases exponentially for increasing wordlength according to $2^{N/3}$. Many FPGAs only contain ROM blocks of up to $1024 = 2^{10}$ entries, so P-CORDIC on-chip applications are limited to word-lengths of up to 32 bits when two layers of ROM are used (N/3 = 11 bits). Nevertheless, 32 bits of precision are sufficient for most DSP applications of practical interest; for example, most current generation smart phones and portable multimedia devices use only 16-bit wordlengths for their DSP calculations.
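The ROM scaling can be worked through for the three word lengths considered in this chapter. In the sketch below, rounding N/3 up to a whole number of address bits is an assumption (it reproduces the 11 bits quoted for N = 32), the 1024-entry block size is the figure quoted in the text, and the additional offset copy of the table described earlier is not included. The function name is hypothetical.

```python
import math

ROM_BLOCK_ENTRIES = 1024  # 2**10 entries per block, as quoted in the text

def pcordic_rom(n_bits):
    """Return (address bits, table entries, ROM blocks) for an n_bits word."""
    addr_bits = math.ceil(n_bits / 3)      # rounding up is an assumption
    entries = 2 ** addr_bits               # table grows as 2**(N/3)
    blocks = math.ceil(entries / ROM_BLOCK_ENTRIES)
    return addr_bits, entries, blocks

sizes = {n: pcordic_rom(n) for n in (16, 24, 32)}
```

Under these assumptions the 16- and 24-bit tables (64 and 256 entries) fit in a single block, while the 32-bit table (2048 entries) needs two blocks.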

The structure of the P-CORDIC algorithm is shown in Figure 4.1. The CORDIC core included in Figure 4.1 is shown in greater detail in Figure 4.2.

Figure 4.1: P-CORDIC Implementation. (Block diagram labels: Rotation Angle in 2's Complement, α[N−1] ... α[0]; N/3; 2N/3; RAM; Binary-to-Bipolar Recoding; Residual angle; σ[N−1] ... σ[0]; to CORDIC CORE.)

The latency of the critical path in P-CORDIC is comprised of the additions associated with the microrotation recoding, the memory look-up operation associated with retrieving the N/3 microrotation directions from ROM, a multiplexing operation to select the correct precomputed value, the combining of MSb and LSb microrotation directions, and the piping through of an ideal CORDIC calculation where all microrotation directions are known simultaneously. As a result, the main penalty to the P-CORDIC sign-precomputation approach is the latency associated with the memory look-up.


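[CORDIC CORE block diagram: stage 0 takes the sign input σ[0] and stage i takes the sign input σ[i].]

$$x[1] = x[0] + \sigma[0]\, 2^{0}\, y[0], \qquad y[1] = y[0] - \sigma[0]\, 2^{0}\, x[0]$$

$$x[i+1] = x[i] + \sigma[i]\, 2^{-i}\, y[i], \qquad y[i+1] = y[i] - \sigma[i]\, 2^{-i}\, x[i]$$

Figure 4.2: CORDIC CORE Detail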
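The recurrences of the CORDIC core can be modelled in a few lines of Python once every direction σ[i] is available, which is the situation the sign-precomputation schemes aim for. This is a floating-point sketch under stated assumptions: the names `cordic_core`, `greedy_signs`, and `gain` are hypothetical, and `greedy_signs` simply uses the sequential Z-path of standard CORDIC as a stand-in for any of the precomputation methods.

```python
import math

def cordic_core(x, y, sigmas):
    """Apply every microrotation with the directions already known."""
    for i, s in enumerate(sigmas):
        # Simultaneous update of both paths, following the core recurrences.
        x, y = x + s * 2.0 ** -i * y, y - s * 2.0 ** -i * x
    return x, y

def greedy_signs(theta, n):
    """Stand-in for sign precomputation: the sequential Z-path of standard CORDIC."""
    signs, z = [], theta
    for i in range(n):
        s = 1 if z >= 0 else -1
        signs.append(s)
        z -= s * math.atan(2.0 ** -i)
    return signs

def gain(n):
    """Scaling factor accumulated over n microrotations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n))
```

With this sign convention the core rotates the input vector clockwise by approximately θ and scales it by `gain(n)`, so `cordic_core(1.0, 0.0, greedy_signs(theta, n))` divided by the gain approximates (cos θ, −sin θ).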

4.1.2 Flat-CORDIC

Flat-CORDIC [16, 17] is a completely unrolled, purely combinational circuit implementation of CORDIC. Similar in approach to P-CORDIC, Flat-CORDIC utilizes the same two observations: that, for large elementary angles, the difference $\epsilon_i = 2^{-i} - \arctan(2^{-i})$ is small (decreasing approximately by a factor of 8 with increasing i), and that $\arctan(2^{-i}) \approx 2^{-i}$ for large i. However, instead of precomputing the direction of the N/3 MSb microrotations and storing them in ROM, Flat-CORDIC uses a so-called Split Decomposition Algorithm (SDA) that consists of a small number of multiplexors and adders to infer the MSb microrotation directions. SDA is based upon observations of the regular pattern among the microrotation directions and residual angles [16, 17], which allows the emulation of a large ROM with a smaller combinational circuit.

The drawback of the Flat-CORDIC scheme is poor scalability; the complexity of the SDA circuit increases exponentially for increasing word-length. Therefore, Flat-CORDIC can be considered a brute-force approach to implementing the CORDIC algorithm for smaller word-lengths.

To improve upon the hardware complexity of Flat-CORDIC, an alternate approach has been proposed that eliminates the need for microrotation sign precomputation by using direct inference of the bipolar form of the input angle, θ [23]. However, this approach increases the number of microrotations exponentially, and is not considered further in this analysis.

Figure 4.3 gives a pictorial representation of the Flat-CORDIC implementation. The CORDIC Core, as before, is shown in Figure 4.2.


Figure 4.3: Flat-CORDIC Implementation. (Block diagram labels: Rotation Angle in 2's Complement, α[N−1] ... α[0]; N/3; 2N/3; Combinational circuit; Binary-to-Bipolar Recoding; Residual angle; σ[N−1] ... σ[0]; to CORDIC CORE.)

The critical path latency associated with standard Flat-CORDIC is comprised of a binary-to-bipolar recoding of the input angle, θ, the operations associated with the SDA, the combination of the MSb and LSb microrotation directions, and the piping through of an ideal CORDIC calculation where all microrotation directions are known simultaneously.

4.1.3 Para-CORDIC

The Para-CORDIC algorithm presents a parallelized CORDIC implementation by using a-priori knowledge of the Taylor series expansion associated with predetermined high order bits of the input rotation angle. The Taylor series is shown in Equation 4.7:

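$$\arctan x = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \cdots \;\Rightarrow\; x = \arctan x + \frac{x^3}{3} - \frac{x^5}{5} + \frac{x^7}{7} - \cdots \tag{4.7}$$

When $2^{-i}$, the shift term, is substituted for x, Equation 4.7 becomes:

$$2^{-i} = \arctan(2^{-i}) + \frac{2^{-3i}}{3} - \frac{2^{-5i}}{5} + \frac{2^{-7i}}{7} - \cdots \tag{4.8}$$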

As with the approaches implemented in P-CORDIC and Flat-CORDIC, Para-CORDIC first splits the input angle, θ, into high-order LSbs, $\theta_H$, and lower-order MSbs, $\theta_L$, according to Equation 4.9:

$$\theta = \theta_L + \theta_H = \underbrace{(-b_0) + \sum_{j=1}^{m-1} b_j 2^{-j}}_{\theta_L} + \underbrace{\sum_{j=m}^{N} b_j 2^{-j}}_{\theta_H} \tag{4.9}$$

where m is the smallest index value i such that the arctangent function $\arctan(2^{-i})$ is approximated by its argument within the considered N-bit precision:

$$m = \frac{N - \log_2 3}{3} \tag{4.10}$$

$$2^{-m} - \arctan(2^{-m}) < 2^{-N} \tag{4.11}$$

Para-CORDIC is comprised of two main operations: Binary to Bipolar Representation (BBR) and Microrotation Angle Recoding (MAR). The lower order portion of the input rotation angle is recoded as:

$$\theta_L = \sum_{i=1}^{m} \sigma(i) \left[ \sum_{j=1}^{n(i)} \arctan(2^{-s(i,j)}) + e(i) \right] - 2^{-m}. \tag{4.12}$$

For the angles corresponding to the lower m−1 indices, MAR is used to decompose each positional binary weighting, $2^{-i}$, into a predetermined combination of Taylor series arctangent terms, plus an error term e(i):

$$2^{-i} = \arctan(2^{-i}) + \sum_{j=2}^{n(i)} \arctan(2^{-s(i,j)}) + e(i) \tag{4.13}$$

where n(i) represents the number of additional arctangent terms needed to approximate $2^{-i}$. Consequently, the microrotation directions for large angles are known in advance, but require additional microrotations beyond those required in traditional CORDIC implementations. The number and order of the additional microrotations is provided in [18].
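The threshold index m of Equations 4.10 and 4.11 can be checked numerically for the word lengths used in this chapter. In the sketch below, `para_cordic_m` searches for the smallest index satisfying Equation 4.11 directly, while `closed_form_m` evaluates Equation 4.10 and rounds up to an integer; the rounding is an assumption, since an index must be a whole number. Both function names are hypothetical.

```python
import math

def para_cordic_m(n_bits):
    """Smallest m with 2**-m - atan(2**-m) < 2**-n_bits (Equation 4.11)."""
    m = 0
    while 2.0 ** -m - math.atan(2.0 ** -m) >= 2.0 ** -n_bits:
        m += 1
    return m

def closed_form_m(n_bits):
    """Equation 4.10, rounded up to an integer index (rounding is an assumption)."""
    return math.ceil((n_bits - math.log2(3)) / 3)

thresholds = {n: (para_cordic_m(n), closed_form_m(n)) for n in (16, 24, 32)}
```

For N = 16, 24, and 32 both routes give m = 5, 8, and 11, respectively. The closed form follows from keeping only the leading Taylor term of Equation 4.8, which gives $2^{-3m}/3 < 2^{-N}$.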

As with the P-CORDIC and Flat-CORDIC algorithms, the residual angle after the MSb rotations is added to the remaining n−m+1 LSbs; the microrotation directions of the LSb are then directly inferred according to the small angle approximation. The compensated LSb portion of the input rotation angle, θ, is given by:

$$\hat{\theta}_H = \theta_H + \sum_{i=1}^{m} \sigma(i)\, e(i) - 2^{-m}. \tag{4.14}$$

The Para-CORDIC algorithm is shown in Figure 4.4. CORDIC Core is shown in Figure 4.2.

Figure 4.4: Para-CORDIC Implementation. (Block diagram labels: Rotation Angle in 2's Complement, α[N−1] ... α[0]; θ[N−1], θ[N−2]; N/3; 2N/3; two Binary-to-Bipolar Recoding blocks; Residual angles; σ[N−1] ... σ[l], σ[k] ... σ[0]; to CORDIC CORE.)

The critical path latency associated with Para-CORDIC is comprised of the addition operations associated with the BBR of the MSbs, the standard CORDIC x/y path iterations for the m−1 MSb, $\theta_L$, microrotations, the additional microrotations for $\theta_L$ as predetermined by MAR, the addition associated with compensating the n − m + 1 LSb angle, $\theta_H$, with the residual angle from the $\theta_L$ microrotations, and the standard CORDIC rotations for the directly inferred $\theta_H$ microrotations.

In Section 4.2, we present the latency and area results of the three parallel algorithm mappings for 16-, 24-, and 32-bit word lengths. We then compare these results to standard CORDIC implementations and an Ideal CORDIC implementation where it is assumed that there is no precomputation overhead; sign bits are available for immediate transmission.


4.2 Simulation Framework and Results

A comparative analysis of four CORDIC schemes is presented here: (1) Ideal CORDIC, (2) P-CORDIC, (3) Flat-CORDIC, and (4) Para-CORDIC. Except for Ideal CORDIC, where all the sign bits are known in advance, the other three schemes exhibit a similar computation pattern, consisting of a first phase during which the microrotation signs are all precomputed, and a second phase during which the microrotations are performed in a combinational fashion. Within the second phase, as described in Section 4.1, each scheme has additional operations above that of the standard CORDIC algorithms. These operations include the calculation of correction terms for the precomputed angle and performing a binary-to-bipolar recoding (BBR) to maintain a constant scaling factor over the microrotations. As a result, the design goal of this paper is to determine which precomputation scheme is faster and also which scheme requires the lowest computing resources.

All designs were compiled from VHDL code using the Xilinx ISE 13.2 toolset [20]. The mapped circuits were verified post place-and-route to ensure the utilization of on-board resources, such as fast carry-chains and wide MUXes, and to provide device-level optimization. As the dominant component in the total propagation delay is delay over the FPGA interconnection network, appropriate mapping of shift operations was confirmed. In particular, to expose the shift operations to the VHDL compiler, the shift operations have been implemented without the use of a high-level operator. To ensure a tight optimization of the circuitry corresponding to each of the two phases, sign precomputation and CORDIC rotation, described above, the hierarchy of the design was preserved during synthesis.

All the schemes were mapped onto four state-of-the-art FPGA families provided by Xilinx: Virtex-4, -5, -6, and Spartan-6. The main design target is the reduction of the delay associated with sign precomputation. To achieve this goal, a number of design recommendations for mapping parallelized CORDIC onto modern FPGAs are presented in Section 4.3.

Three levels of numerical precision, 16-, 24-, and 32-bit, are considered in this analysis as being of greatest interest in the modern digital-signal processing (DSP) domain. To maintain the desired external precision for fixed-point arithmetic, word lengths of N + log₂N are used internally; the residual log₂N bits are truncated before the final vector coordinates are externally reported. The reported figures of merit are the place and route propagation delay and FPGA LUT resource utilization.
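The internal word lengths implied by the N + log₂N rule can be tabulated with a short calculation. Rounding log₂N up to a whole number of guard bits is an assumption made here for the 24-bit case, where log₂N is not an integer; the function name is hypothetical.

```python
import math

def internal_word_length(n_bits):
    """Return (internal width, guard bits) for an external precision of n_bits."""
    guard = math.ceil(math.log2(n_bits))   # rounding up is an assumption
    return n_bits + guard, guard

widths = {n: internal_word_length(n) for n in (16, 24, 32)}
```

Under this assumption the internal datapaths are 20, 29, and 37 bits wide for external precisions of 16, 24, and 32 bits, with 4, 5, and 5 guard bits truncated before the result is reported.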


The results of mapping the sign precomputation schemes to the four Xilinx FPGA families are shown in Table 4.1. Section 4.3 presents an extensive discussion of these results.

4.3 Discussion

The Ideal scheme assumes that the microrotation sign bit directions are known in advance and that additional sign computations are not required. It is, therefore, included as the lower boundary condition in terms of latency and logic utilization. The analysis of the numerical results addresses the performance and penalties of the three sign-precomputation schemes versus the standard CORDIC scheme, as previously presented [21]. The communications and computational bottlenecks of the three parallelized CORDIC schemes as compared against Ideal and standard CORDIC are analyzed.

As demonstrated in Table 4.1, on Virtex-4, the oldest FPGA family considered, the P-CORDIC scheme exhibits the lowest latency among the three considered sign-precomputation schemes: 30.7 ns, 49.2 ns, and 78.3 ns for a word-length of 16-bits, 24-bits, and 32-bits, respectively. Virtex-4 architecture includes LUTs with only four inputs, a feature that makes the mapping of the Flat-CORDIC combinational logic for sign precomputation slow and resource expensive. On such an architecture, retrieving information from an SRAM block is faster than computing bit-level functions.

Also apparent in Table 4.1 is that, on the state-of-the-art Virtex-6 family, the Flat-CORDIC scheme exhibits the lowest latency among the three sign-precomputation schemes, with latencies of 24.9 ns, 39.3 ns, and 52.5 ns for word-lengths of 16, 24, and 32 bits, respectively. Virtex-6 devices include 6-input LUTs and an improved, fast global interconnection network well suited to the Flat-CORDIC scheme implementation.

On the older Virtex-5 family, no precomputation CORDIC scheme is clearly superior: P-CORDIC has the lowest latency of 38.5 ns for the 24-bit word-length, but the Flat-CORDIC scheme exhibits the lowest latencies of 23.7 ns and 64.0 ns for the 16- and 32-bit word-lengths. A similar result occurs on the modern, cost-effective Spartan-6 family: P-CORDIC has the lowest latency of 80.4 ns for the 32-bit word-length, but Flat-CORDIC ranks first with latencies of 31.4 ns and 56.0 ns for the 16- and 24-bit word-lengths. Despite the availability of 6-input LUTs on both the Virtex-5 and Spartan-6 architectures, the interconnection networks of these two families are slower than their Virtex-6 counterpart. As a result, the efficiency of bit-level functions, such as the Flat-CORDIC sign-precomputation logic, is reduced.

Due to the large number of additional microrotations needed to pre-calculate the rotation signs, the Para-CORDIC never ranks first in terms of latency. However, Para-CORDIC can be deeply pipelined; therefore, it is the best option when a high throughput is required.

Nevertheless, the standard CORDIC scheme is the slowest amongst all of the presented schemes, due to its highly sequential nature. As mentioned, P-CORDIC is the fastest scheme on Virtex-4 FPGAs; it is apparent in Table 4.1 that, on average, P-CORDIC is only 4.0% slower than the Ideal CORDIC. Flat-CORDIC is the fastest scheme on Virtex-6 FPGAs; on average, it is only 4.2% slower than the Ideal CORDIC. Therefore, it can be concluded that sign precomputation is a viable design option on FPGAs and should be used when rotation mode CORDIC is required. If vectoring mode CORDIC is required, standard CORDIC should be used based upon the design recommendations provided in Chapter 3.
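
The averaged slowdown figures can be checked against Table 4.1 with a short Python sketch. The text does not state how the average is formed; taking the ratio of the summed delays over the three word-lengths is an assumption made here, chosen because it yields the quoted 4.0% and 4.2% after rounding. The function and variable names are hypothetical.

```python
# Place-and-route delays [ns] copied from Table 4.1, ordered by
# word-length: 16, 24, and 32 bits.
IDEAL_VIRTEX4 = [29.6, 47.0, 75.5]
P_CORDIC_VIRTEX4 = [30.7, 49.2, 78.3]
IDEAL_VIRTEX6 = [22.9, 37.4, 51.7]
FLAT_CORDIC_VIRTEX6 = [24.9, 39.3, 52.5]


def overhead_percent(scheme_delays, ideal_delays):
    """Slowdown relative to Ideal CORDIC, in percent.

    Assumption: the average is the ratio of summed delays across the
    word-lengths, not the mean of the per-word-length ratios.
    """
    return (sum(scheme_delays) / sum(ideal_delays) - 1.0) * 100.0


if __name__ == "__main__":
    p_v4 = overhead_percent(P_CORDIC_VIRTEX4, IDEAL_VIRTEX4)
    flat_v6 = overhead_percent(FLAT_CORDIC_VIRTEX6, IDEAL_VIRTEX6)
    print(f"P-CORDIC on Virtex-4:    {p_v4:.1f}% slower than Ideal")
    print(f"Flat-CORDIC on Virtex-6: {flat_v6:.1f}% slower than Ideal")
```

Other averaging conventions, such as the mean of the per-word-length ratios, give somewhat different percentages, so the figures should be read with the chosen convention in mind.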


Table 4.1: Sign precomputation delays on Xilinx FPGAs.

                         Ideal CORDIC    Standard CORDIC [21]  P-CORDIC                      Flat-CORDIC     Para-CORDIC
Word-length  FPGA        Delay  Area     Delay  Area           Delay  Area    RAM            Delay  Area     Delay  Area
(bits)       family      [ns]   [LUTs]   [ns]   [LUTs]         [ns]   [LUTs]  [1Kx16]        [ns]   [LUTs]   [ns]   [LUTs]
16           Virtex-4    29.6   755      32.7   935            30.7   761     2              31.3   787      33.7
             Virtex-5    22.3   693      29.5   933            27.9   700                    23.7   723      26.6
             Virtex-6    22.9   652      29.2   892            27.6   658                    24.9   682      24.4
             Spartan-6   28.9   652      34.1   892            37.0   658                    31.4   682      33.7
24           Virtex-4    47.0   1569     54.5   2029           49.2   1577    2              50.7   1617     54.5   1751
             Virtex-5    38.4   1475     51.3   2027           38.5   1484                   48.3   1519     44.1   1649
             Virtex-6    37.4   1408     46.1   1960           43.4   1417                   39.3   1453     43.5   1978
             Spartan-6   53.1   1408     68.0   1960           57.6   1417                   56.0   1453     62.2   1978
32           Virtex-4    75.5   2609     89.7   3509           78.3   2710    4 (2-layers)   78.6   2688     79.5   3210
             Virtex-5    61.5   2483     76.0   3507           65.4   2584                   64.0   2563     71.1   3075
             Virtex-6    51.7   2392     67.3   3448           57.1   2493                   52.5   2472     65.0   3911
             Spartan-6   68.0   2392     92.4   3448           80.4   2493                   82.8   2472     87.2   3911


Chapter 5 Super- and Sub-threshold Programmable Interconnection Networks with Reconfigurable Supply Voltage

As described in Chapter 2, recent developments in reconfigurable devices, such as Field-Programmable Gate Arrays, have effectively resulted in a programmable system-on-chip. Because of this hardware-like performance with software-like programmability, FPGAs are increasingly being accepted for implementing digital designs with a lower non-recurring engineering cost when compared with Application Specific Integrated Circuits (ASICs). However, the FPGA programmability increases power consumption, propagation delay, and silicon area overhead [27]; recent reports indicate that upwards of 70% of power dissipation in reconfigurable devices occurs in the interconnection fabric.

The increasing popularity of devices of interest, such as wireless, mobile, biomedical, or sensor devices, has made power consumption an increasingly significant design constraint. Therefore, in a bid to improve the power consumption and, as a result, the battery life of portable electronic devices, circuits have been designed to operate solely at a sub-threshold supply voltage (VDD < VP), where VP represents the transistor pinch-off (threshold) voltage. For devices operated at sub-threshold voltages, the transistor channel is said to be weakly inverted; henceforth, such operation is referred to as weak inversion. In circuits that operate at
