VHDL Implementation and Performance Analysis of two Division Algorithms


by

Salman Khan

B.S., Sir Syed University of Engineering and Technology, 2010

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Applied Science

in the Department of Electrical and Computer Engineering

© Salman Khan, 2015

University of Victoria

All rights reserved. This thesis may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.


VHDL Implementation and Performance Analysis of two Division Algorithms

by

Salman Khan

B.S., Sir Syed University of Engineering and Technology, 2010

Supervisory Committee

Dr. Fayez Gebali, Supervisor

(Department of Electrical and Computer Engineering)

Dr. Atef Ibrahim, Member


Supervisory Committee

Dr. Fayez Gebali, Supervisor

(Department of Electrical and Computer Engineering)

Dr. Atef Ibrahim, Member

(Department of Electrical and Computer Engineering)

ABSTRACT

Division is one of the most fundamental arithmetic operations and is used extensively in engineering, scientific, mathematical and cryptographic applications. The implementation of an arithmetic operation such as division is complex and expensive in hardware. Unlike addition and subtraction, division requires several iterative computational steps on the given operands to produce the result. Division has often been perceived as an infrequently used operation and has received comparatively little attention, yet it is one of the most difficult operations in computer arithmetic. The technique used to implement such an iterative computation in hardware impacts the speed, area and power of the digital circuit. For this reason, we consider two division algorithms that differ in their shift step size. Algorithm 1 operates with a fixed shift step size and a fixed number of iterations, while Algorithm 2 operates with a variable shift step size and requires considerably fewer iterations. This thesis presents a technique to save power and speed up the overall computation. It also examines different design goal strategies and presents a comparative study to assess how each of the two designs performs in terms of area, delay and power consumption.


Contents

Supervisory Committee ii

Abstract iii

Table of Contents iv

List of Tables vii

List of Figures viii

Acknowledgements x

Dedication xi

1 Introduction 1

1.1 Overview . . . 1

1.2 Motivation for this work . . . 2

1.3 Contributions . . . 3

1.4 Thesis Organization . . . 3

2 Division Background 5

2.1 Division Fundamentals . . . 5

2.2 Division Algorithms Classes . . . 8

2.2.1 Digit Recurrence Algorithms . . . 8

2.2.2 Functional Iteration Algorithms . . . 8

2.2.3 Very High Radix Algorithms . . . 8

2.2.4 Look-up Tables . . . 9

2.2.5 Variable Latency Algorithms . . . 9

2.3 Related work in the area . . . 9


3 Considered Division Algorithms 11

3.1 Division Approach . . . 11

3.1.1 Reasons For Considerations . . . 11

3.1.2 Overview of Operation . . . 11

3.2 Division Algorithm 1 : Fixed Shift Algorithm . . . 12

3.2.1 Mode 1 : Range reduction of Y . . . 13

3.2.2 Mode 2 : Post processing of Y and Z . . . 14

3.3 Division Algorithm 2 : Adaptive Shift Algorithm . . . 15

3.3.1 Mode 1 : Range reduction of Y . . . 15

3.3.2 Mode 2 : Post processing of Y and Z . . . 16

3.4 Chapter Summary . . . 17

4 Design and Implementation 18

4.1 Hardware entities for Algorithm 1 . . . 18

4.1.1 X, Y and Z Registers . . . 19

4.1.2 Data Multiplexer . . . 20

4.1.3 Comparator for Y . . . 21

4.1.4 The Look-up table . . . 21

4.1.5 The ALU unit . . . 22

4.1.6 Counter . . . 23

4.1.7 Finite State Machine . . . 23

4.1.8 FSM : State transition diagram . . . 24

4.2 Hardware entities for Algorithm 2 . . . 25

4.2.1 Delta Address Generator . . . 27

4.2.2 DAG Implementation . . . 32

4.2.3 Finite State Machine . . . 36

4.2.4 FSM : State transition diagram . . . 36

4.3 Circuit Implementations . . . 38

4.3.1 Algorithm 1 : Fixed Shift division algorithm . . . 38

4.3.2 DAG overall layout . . . 40

4.3.3 Algorithm 2: Adaptive Shift division algorithm . . . 40

4.4 Chapter summary . . . 42

5 Results and Evaluation 43

5.1 Numerical Simulation using MATLAB . . . 43


5.1.1 Numerical Simulation of Algorithm 1 . . . 44

5.1.2 Numerical Simulation of Algorithm 2 . . . 45

5.2 Hardware Simulation . . . 45

5.2.1 VHDL Simulation of Algorithm 1 . . . 46

5.2.2 VHDL Simulation of Algorithm 2 . . . 48

5.3 Performance Evaluation . . . 50

5.3.1 Device Utilization . . . 50

5.3.2 Timing Analysis . . . 51

5.3.3 Power Consumption . . . 53

5.3.4 Power-Delay Product . . . 54

5.3.5 Area-Delay Product . . . 55

5.4 Comparison of Work in Related Area . . . 55

5.5 Chapter Summary . . . 57

6 Conclusion, Contributions and Future Work 59

6.1 Conclusion . . . 59

6.2 Contributions . . . 60

6.3 Future Work . . . 60

Bibliography 62

7 Additional Information 64

7.1 Interpretation of signals . . . 65


List of Tables

Table 4.1 Truth Table when Y is positive . . . 33

Table 4.2 Truth Table when Y is negative . . . 33

Table 5.1 Iterations for Algorithm 1 . . . 44

Table 5.2 Iterations for Algorithm 2 . . . 45

Table 5.3 On-chip device utilization of Algorithm 1 . . . 51

Table 5.4 On-chip device utilization of Algorithm 2 . . . 51

Table 5.5 Timing Summary of Algorithm 1 . . . 52

Table 5.6 Timing Summary of Algorithm 2 . . . 52

Table 5.7 On-chip power consumptions. . . 54

Table 5.8 Power-delay product for Algorithm 1 and 2. . . 55

Table 5.9 Area-delay product for Algorithm 1 and 2. . . 55


List of Figures

Figure 2.1 Nonzero bits of X and Y at the start of division . . . 7

Figure 2.2 Nonzero bits of X and Y at the end of division . . . 7

Figure 4.1 Algorithm 1 system level . . . 19

Figure 4.2 Registers X, Y and Z in the bank . . . 20

Figure 4.3 Data multiplexer for register bank . . . 21

Figure 4.4 Comparator for Y . . . 21

Figure 4.5 LUT block . . . 22

Figure 4.6 ALU block . . . 22

Figure 4.7 Logical operation of ALU during ith iteration . . . 23

Figure 4.8 Counter block diagram . . . 23

Figure 4.9 Finite State Machine block . . . 24

Figure 4.10 State transition diagram for Algorithm 1 . . . 25

Figure 4.11 Algorithm 2 system level . . . 27

Figure 4.12 Delta (δ) Address Generator . . . 28

Figure 4.13 DAG system level . . . 28

Figure 4.14 Position finder unit block . . . 29

Figure 4.15 Multiplexer for flag input . . . 30

Figure 4.16 Multiplexer for data input . . . 30

Figure 4.17 The Px Register . . . 31

Figure 4.18 The number subtractor block in DAG . . . 31

Figure 4.19 2-bits scan unit 0 in level 1 . . . 32

Figure 4.20 Hierarchical approach between level 1 and 2 . . . 33

Figure 4.21 Hierarchical arrangement of position finder unit . . . 35

Figure 4.22 Finite State Machine block . . . 36

Figure 4.23 State transition diagram for Algorithm 2 . . . 37

Figure 4.24 Top level block of fixed shift division algorithm . . . 38


Figure 4.26 Delta address generator RTL schematic . . . 40

Figure 4.27 Top level block of adaptive shift division algorithm . . . 40

Figure 4.28 Adaptive shift division algorithm RTL schematic . . . 41

Figure 5.1 All iterations for Algorithm 1 . . . 46

Figure 5.2 Iterations 0 to 2 for Algorithm 1 . . . 47

Figure 5.3 Iterations 3 to 7 for Algorithm 1 . . . 47

Figure 5.4 Iterations 8 to 11 for Algorithm 1 . . . 48

Figure 5.5 Iterations 12 to 14 for Algorithm 1 . . . 48


ACKNOWLEDGMENTS

In the name of Allah, the Most Gracious and the Most Merciful.

All praises belong to Allah the merciful for his guidance and blessings to enable me complete this thesis. I would like to thank:

My parents, for their prayers, love, patience, emotional support and assurance in difficult and frustrating moments, and for their constant motivation. Despite financial constraints, they were always ready to support me financially.

My Supervisor, Dr. Fayez Gebali, for all the mentoring and support which enabled me to achieve my academic and research objectives, and for helping me cope with off-school problems and settle in as an international student. He shared his ideas, concepts and experiences, and it would not have been possible to complete my research without his invaluable guidance.

My Committee, Dr. Atef Ibrahim, for devoting precious time and providing valu-able suggestions to improve the quality of the thesis.

My Manager at BC Hydro, Djordje Atanackovic, for his encouragement and support to help me focus on my thesis completion.

UVIC ECE Dept admin and lab staff, Kevin Jones, Janice Closson, Paul Fedrigo and Brent Sirna for assisting me during the course of my degree.


DEDICATION

To my father, Muhammad Khalid Zahid, and my mother, Imtiaz Khalid, for having a lifelong dream to see me achieve my graduate qualification at a world class foreign institution. In difficult times, it proved a key motivating factor and enabled me to maintain focus.

To my grandmother, Rasool Fatima, for her countless prayers and believing in me.

To my Supervisor, Dr. Fayez Gebali, one of the most knowledgeable, kindest

Chapter 1

Introduction

1.1 Overview

Implementation of mathematical algorithms, such as those required by a random number generator (RNG), requires complex and expensive arithmetic operations like division and multiplication, as well as iterative computations on the given inputs to obtain the required output. The techniques used to implement these operations and iterations in hardware significantly impact the speed, area and power of the hardware. The division of two integers, the divisor and the dividend, results in an integer remainder and an integer quotient. Integer division is one of the most fundamental arithmetic operations and is heavily used in engineering, scientific, mathematical and statistical computations. Implementing and performing the division operation in hardware is complex, expensive and requires more computational power than the addition and subtraction operations. According to [1], division is the most difficult operation in computer arithmetic, and it is a common perception to think of division as an infrequently used operation whose implementation does not receive much attention. Division in modern microprocessors takes many clock cycles; furthermore, the number of clock cycles required for integer division also depends on the operands' values [2]: larger integer operands require more clock cycles to perform the division. The more clock cycles or iterations the divider needs, the more power it consumes and the slower the operation becomes. Ignoring the implementation has been shown to result in significant system performance degradation [3]. In applications that employ the division operation, an efficient implementation of the division hardware can significantly improve the overall performance of the system, so it is imperative to find the best implementation method for the division algorithm in hardware. A divider with lower heat dissipation is also a desirable attribute in terms of performance and security.

1.2 Motivation for this work

This work is part of on-going research on the design, development and implementation of a low power Pseudo Random Number Generator (PRNG), and it focuses on the implementation and performance analysis of division algorithms to be incorporated in the PRNG. These division algorithms are implemented as co-processor designs which will later be required by the PRNG to implement the mathematical algorithm that generates the random numbers. Although the implementation of the overall PRNG exceeds the scope of this work, the targeted PRNG is based on the Park-Miller algorithm, a fairly popular choice for the generation of random numbers; the algorithm requires an initial seed value, a special prime number, a quotient and a remainder to generate a random number [4]. Two hardware divider designs are considered and implemented to generate the quotient and remainder through a division algorithm for the Park-Miller algorithm so that the random number can be generated by the PRNG.

The hardware for 32 bit integer division is based on the digit-recurrence, non-restoring division algorithm. The divider designs are later analyzed for their performance and their impact on the parameters that matter for the choice of application. There has been quite a bit of work on hardware dividers with reference to the application of algorithms, particularly dealing with higher radix and floating point implementations. Most researchers compare the performance results of the overall divider in terms of speed and area, while the methodology of implementation and how changes in implementation can affect the performance, especially power consumption, in fixed point integer division has not been explained very clearly. This motivated us to determine the best implementation in terms of the performance parameters of a hardware divider and to study the two dividers to see which one is best suited to a low power, high speed or low cost implementation. Another motivation of this work was to come up with a simplified design approach that would allow new designers and researchers to understand and re-implement integer division in hardware. From the academic and learning point of view, this work enabled the understanding of iterative algorithms, their design and implementation, and state machine synchronization, which are skills useful for anyone learning practical hardware design implementation.

1.3 Contributions

Two division algorithms based on digit-recurrence, non-restoring division are considered and implemented. The first algorithm is called the "fixed shift division algorithm" while the second is the "adaptive shift algorithm". The second algorithm is an improvement over the first in terms of performance. Our work contributes the following:

1. Designed and implemented two signed integer division algorithms for performing the division operation in hardware.

2. Verified the hardware design by developing MATLAB code to confirm the correctness and accuracy of the hardware implemented in VHDL.

3. Compared performance of two division algorithms from the viewpoint of device utilization (area), power consumption and timing analysis (delay).

4. The high-radix technique proposed in [5] for floating point arithmetic is adapted to integer arithmetic.

Our work will help designers in choosing a division implementation for an application-specific purpose. If the application demands high speed or low power computation, such as RNGs, cryptographic and encryption processors, then the adaptive shift algorithm is the preferred choice, whereas in applications such as smart cards, which have area and cost constraints, the fixed shift algorithm is better suited.

1.4 Thesis Organization

This section outlines the organization of the thesis and is intended to present the reader with the brief summary of main focus of each chapter.

Chapter 1 introduces the reader to the subject and the scope of the research. The motivation for the research and its contributions, which were the fundamental objectives of the thesis, are discussed.

Chapter 2 describes the background and fundamentals of division in hardware. A brief classification of division algorithms is provided to aid the reader's understanding of the related previous work done in the area.

Chapter 3 describes our approach towards the division operation. The two considered division algorithms, known as the fixed shift division algorithm and the adaptive shift division algorithm, are presented, and the methodology used to achieve the correct result of the division operation is explained.

Chapter 4 describes the hardware design and implementation. The system hardware entities that are common to both algorithms, as well as those specific to each of the two algorithms, are explained. The circuit implementations of both algorithms are presented.

Chapter 5 contains the results and evaluations of the two algorithms. The numerical simulation results are obtained to verify that the algorithms work, and then the results of the hardware simulations (in VHDL) are presented to confirm that the implementation of the two algorithms has been done correctly. A performance analysis of the two algorithms is also conducted in this chapter.

Chapter 6 contains the concluding statements, a short description of the work and what was achieved through it.


Chapter 2

Division Background

2.1 Division Fundamentals

There are various references, such as [6][7][8], by authors who have worked on number division. The fundamental principle of division is that the division of a dividend by a divisor can be realized in cycles of shifting and adding (in practice, subtracting), with hardware or software control of the loop, iteratively converging on the correct result of the division through the hardware divider.

In this work, we refer to Y as the dividend and X as the divisor. We wish to divide the integer Y by a positive integer X; the result of this division operation should be two integers, the quotient and the remainder, denoted by q and r respectively, so that the following equation is satisfied:

Y = qX + r (2.1)

q and r can be expressed as:

q = ⌊Y/X⌋ (2.2)

0 ≤ r < X (2.3)

The floor in eqn (2.2) gives a whole number rounded down to the lower integer; the fractional part that is discarded is the difference between the actual value of Y/X and the rounded value. This whole number is the quotient, while the fractional part of Y/X, multiplied by X, gives us the remainder. Using this concept we can rewrite:

r = Y − qX (2.4)

The above equation states that the remainder r can be obtained if X is subtracted from Y a total of q times, until the condition in (2.3) is satisfied; at this point the value of Y is the desired remainder, r. Most hardware dividers operate in the same manner. Very similar to long division by hand, the hardware divider updates the value of Y as per the equation:

Y ← Y − δX (2.5)

The δ is the partial quotient and the updated value of Y is the partial remainder. The hardware divider, in the similar manner as long division method by hand, keeps track of the quotients by adding their values in a register Z, which is given by:

Z ← Z + δ (2.6)

From (2.5) and (2.6) we see that δX is subtracted from Y while δ is added to Z.

The choice of the value of δ can be arbitrary while still achieving the correct result of the division, provided that the following two conditions are met (a short worked example follows the two conditions):

1. The updated value for Y in (2.5) should converge to the range 0 ≤ Y < X, so that this will produce the desired remainder. If Y is positive, the factor δX is subtracted, if Y is negative, the factor δX is added to Y ;

2. The updated value for Z in (2.6) should add or subtract to produce the desired quotient.
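As a small worked illustration of (2.5) and (2.6) (the numbers are chosen here and are not taken from the thesis), let Y = 23, X = 5 and Z = 0. Choosing δ = 4 gives

Y ← 23 − 4 · 5 = 3,    Z ← 0 + 4 = 4

and since 0 ≤ 3 < 5 the division is finished, with quotient q = Z = 4 and remainder r = Y = 3; indeed 23 = 4 · 5 + 3. Two steps of δ = 2 (Y: 23 → 13 → 3, Z: 0 → 2 → 4) reach the same result, which is why the choice of δ at each step can be left to the implementation as long as the two conditions above hold.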

We represent the dividend Y of n bits in 2’s complement so that the range of Y can be given as:

−2^(n−1) ≤ Y < 2^(n−1) (2.7)

Our divisor X is assumed to require only m bits for its representation, such that m ≤ n. Figure 2.1 shows the nonzero bits in Y and X at the start of the division operation. Our goal is to iteratively reduce the nonzero bits of Y to m bits so that Y comes into the range:

0 ≤ Y < X (2.8)

Figure 2.2 shows the nonzero bits of Y at the end of the division operation, where Y stores the value of the remainder, which falls in the range 0 ≤ r < X.

The choice of the value of δ at each iteration to implement (2.5) and (2.6) is what differentiates the division algorithms that we implement in our work; this will be demonstrated in the chapters to follow.

Figure 2.1: Nonzero bits of X and Y at the start of division

2.2 Division Algorithms Classes

Oberman and Flynn presented a taxonomy of division algorithms in [3], which classifies the algorithms based on their hardware implementations into five classes: digit recurrence, functional iteration, very high radix, table look-up and variable latency. Many practical division algorithms are hybrids that combine techniques from several of these classes in the overall algorithm.

2.2.1 Digit Recurrence Algorithms

Digit recurrence is the simplest and most widely implemented of all division algorithm classes. It uses subtractive methods to deduce digits of the quotient in every iteration, retiring a fixed number of quotient bits per iteration, meaning that the step size of bits retired in each iteration is the same. The implementation of digit recurrence algorithms requires less complexity and area.

2.2.2 Functional Iteration Algorithms

Functional iteration uses the multiplication operation as the basis of the division operation. It takes advantage of a high speed multiplier to converge on the result quadratically, unlike subtractive division, through which the result is converged upon linearly; this reduces the latency and the length of each iteration cycle. Therefore, instead of retiring a fixed number of bits per iteration, this class of algorithms retires an increasing number of bits at each iteration.

2.2.3 Very High Radix Algorithms

Digit recurrence algorithms are suited to low radix division; as the radix increases, the hardware and the divisor-multiple process become more complicated and consume more area and computation time. A variant that avoids the constraints posed by the higher radix is the very high radix algorithm; the term "very high radix" applies to dividers that retire more than 10 bits in each iteration.

2.2.4 Look-up Tables

When a low-precision quotient is required, it may be feasible to perform division using a look-up table implementation without an iterative algorithm. This implementation uses direct and linear approximation methods to compute the quotient bits. The table can be implemented as a ROM; the advantage of this is fast processing, since no arithmetic calculation is needed, but on the down side, the size of the look-up table grows exponentially with each bit added for accuracy.

2.2.5 Variable Latency Algorithms

The digit recurrence and very high radix algorithms retire a fixed number of bits in every iteration, while the functional iteration based algorithms retire an increasing number of bits in every iteration, but all three of these classes complete the operation in a fixed number of cycles. Dividers based on variable latency algorithms perform division in a variable amount of time.

2.3 Related work in the area

The main algorithm classes for division in hardware were highlighted in the previous section and each methodology has its own applications and benefits; however, digit recurrence is the most commonly used approach for hardware division implementation, with procedures such as restoring, non-restoring, SRT division (Sweeney, Robertson and Tocher), approximation algorithms, the CORDIC algorithm, multiplicative algorithms and continued product algorithms [9]. According to Sutter and Deschamps in [10], binary non-restoring digit recurrence algorithms are the preferred procedure for FPGA based dividers. The authors of [9] implemented high speed non-restoring division using a high speed adder/subtractor approach to speed up the division operation. Sutter and Deschamps implemented high speed fixed point dividers in [10] based on FPGA characteristics such as: adders/subtractors or conditional adders having the same delay as simple adders; the existence of dedicated, fast carry generation and propagation logic; and multiplexers additional to the general purpose LUTs, in sequential, combinational and pipelined circuits. Achieving higher speed is desirable in a hardware implementation, but some applications may also require power efficiency. Nannarelli and Lang proposed a low power divider [11], which discussed power saving techniques such as: re-timing the recurrence, changing redundant representations to reduce the number of flip flops, using gates with lower drive capability, equalizing the paths of the input signals of the blocks to reduce glitches, and switching off inactive blocks.

We focused our implementation of the division algorithms on the non-restoring division methodology: we designed a fixed iteration division algorithm and then utilized Dr. Gebali's HCORDIC technique [5], an adaptive algorithm methodology based on a hierarchical design, to reduce the number of iterations for the adaptive shift algorithm. Dr. Gebali developed this technique for floating point arithmetic and we adapted it to make it applicable to integer arithmetic.

2.4 Chapter Summary

This chapter highlighted the basics of division in hardware, which will enable the reader to understand the algorithms we present in Chapter 3. An overview of some of the known division algorithm classes was presented to show the high level differences between implementations. Related work in the area of division was also discussed to provide the reader with additional information and a better understanding of the intended work.


Chapter 3

Considered Division Algorithms

The non-restoring division algorithm is based on retiring a fixed number of quotient bits in each iteration; the basis of our algorithms is the shift, or δ, which was introduced in the previous chapter. The difference in the size of δ defines our two algorithms, with fixed δ and adaptive δ, which we refer to as the fixed shift algorithm and the adaptive shift algorithm respectively.

3.1 Division Approach

3.1.1 Reasons For Considerations

We chose these two division algorithms for the following reasons:

1. They are popular for the implementation of division in integer arithmetic.

2. No multiplier is needed (reduced power and area).

3. No dedicated adder, multiplier or look-up table primitive is required, so they can be implemented in non-Xilinx programmable logic devices; hence these algorithms are not device specific.

4. Simplicity of the algorithms.

3.1.2 Overview of Operation

Both division algorithms carry out the division operation in two modes:

1. Range reduction mode of Y - in this mode, the algorithm takes multiple steps/iterations to reduce the dividend to converge on to the result.

2. Post processing mode of Y and Z - this is a single step to process the remainder and quotient when the result of mode 1 does not fall in the desired range.

To begin the operation in mode 1, the sign of the current value of the dividend Y is checked: if the value is negative, the product of δ and the divisor X is added to Y to obtain the next value of Y; if the value of Y is positive, the product δ·X is subtracted from the current value of Y. These steps yield the value of the remainder.

The quotient is produced in simultaneous steps: δ is added to or subtracted from the current value of the quotient Z, depending on the operation performed on Y, since the two always have opposite operations performed on them. At each of these steps, the range of Y is also kept in check; if, at the end of the iterations, the value of Y is in the desired range, that value of Y is the remainder and the corresponding value of Z is the quotient.

If the value is not in the range at the end of the range reduction mode, the algorithm will jump to mode 2, which will be a single step to adjust the range so that we have the correct quotient and remainder at the next step. This methodology is mathematically explained in the next section.

3.2 Division Algorithm 1 : Fixed Shift Algorithm

This algorithm performs a fixed minimal number of iterative steps to give the quotient and the remainder when we perform the division of Y by X. In our work, Y is a 32 bit signed integer, so the value n, the number of bits in the dividend, is 32. X is m = 17 bits long, since this is the minimum width needed by Dr. Gebali for the initial quotient to implement the random number generator. The sign of X is arbitrary and is therefore assumed to be positive.

The fixed shift division algorithm has the following properties:

1. The required number of iterations is equal to n − m + 1.


2. The sign of the current value of Y determines if the operation needed on the next iteration is addition or subtraction.

3. The value of Z will converge on to the quotient with the opposite operation to the operation of Y in property number 2.

4. The δ at every iteration is determined by the equation (3.6) below.

3.2.1 Mode 1 : Range reduction of Y

The step size δ is given by the iteration index and not by the intermediate values of Y. The iterations are applied to Y and Z as per the following equations:

Y(i+1) = Y(i) − µ_i δ_i X, 0 ≤ i ≤ n − m (3.1)

Z(i+1) = Z(i) + µ_i δ_i (3.2)

where the initial values of Y and Z are:

Y(0) = Y (3.3)

Z(0) = 0 (3.4)

The µ_i in equations (3.1) and (3.2) denotes the addition or subtraction operation at a given iteration index i, and δ_i is the step size; they are given by the following equations:

µ_i = +1 when Y(i) ≥ 0, −1 when Y(i) < 0 (3.5)

δ_i = 2^(n−m−i), 0 ≤ i ≤ n − m (3.6)

Once again, it is important to remember that the step size δ depends on the iteration index and not on the intermediate values of the partial quotient and remainder; this step size is realized as a binary shift and is used by the ALU of the divider to compute the result.

3.2.2 Mode 2 : Post processing of Y and Z

On the completion of Mode 1, the value of the remainder, Y(n−m+1), needs to fall in the range:

−2^(m−1) ≤ Y(n−m+1) ≤ 2^(m−1) − 1 (3.7)

This range may not be satisfied due to the following:

1. The value of Y(n−m+1) is negative.

2. The value of Y(n−m+1) is positive but greater than X.

In either outcome, the post processing mode becomes applicable so that the inequality below is satisfied in order to achieve the correct remainder:

0 ≤ Y(n−m+1) < X (3.8)

The value of the quotient, Z(n−m+1), also needs to be updated whenever Y is changed. In order to bring the result Y(n−m+1) into the desired range, the following process needs to be applied:

Y(n−m+1) = Y(n−m+1) − µX (3.9)

Z(n−m+1) = Z(n−m+1) + µ (3.10)

where µ works in the same way as in the range reduction mode to determine the addition and subtraction operation in equations (3.9) and (3.10), based on the following condition:

µ = +1 when Y(n−m+1) ≥ X, −1 when Y(n−m+1) < 0 (3.11)

To satisfy (3.8), this process is needed only once. The total number of iterations needed in Algorithm 1 is n − m + 1 if the result of the division is achieved in mode 1. If the result is not achieved in mode 1, a total of n − m + 2 iterations is required.
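To make the two modes concrete, the sketch below models the fixed shift recurrence (3.1)-(3.11) as a single clocked VHDL process. It is an illustrative behavioural model only, not the multi-entity design developed in Chapter 4; the entity and signal names are invented here, and two guard bits are added internally so the intermediate values of Y in mode 1 cannot overflow.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fixed_shift_div_model is
  generic (n : positive := 32;    -- dividend width
           m : positive := 17);   -- divisor width
  port (clk, rst, start : in  std_logic;
        x_in, y_in      : in  signed(n - 1 downto 0);  -- divisor (positive), dividend
        quotient        : out signed(n - 1 downto 0);  -- final Z
        remainder       : out signed(n - 1 downto 0);  -- final Y
        done            : out std_logic);
end entity;

architecture behavioural of fixed_shift_div_model is
begin
  process (clk)
    -- two guard bits so Y - delta*X never overflows during mode 1
    variable y, z, x : signed(n + 1 downto 0);
    variable i       : integer range 0 to n + 1 := 0;
    variable busy    : boolean := false;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        busy := false;  done <= '0';
      elsif not busy and start = '1' then
        y := resize(y_in, n + 2);  x := resize(x_in, n + 2);   -- (3.3)
        z := (others => '0');                                  -- (3.4)
        i := 0;  busy := true;  done <= '0';
      elsif busy and i <= n - m then
        -- mode 1: delta_i = 2^(n-m-i), mu_i from the sign of Y  (3.1), (3.2)
        if y >= 0 then
          y := y - shift_left(x, n - m - i);
          z := z + shift_left(to_signed(1, n + 2), n - m - i);
        else
          y := y + shift_left(x, n - m - i);
          z := z - shift_left(to_signed(1, n + 2), n - m - i);
        end if;
        i := i + 1;
      elsif busy then
        -- mode 2: single post-processing step  (3.9)-(3.11)
        if y < 0 then
          y := y + x;  z := z - 1;
        elsif y >= x then
          y := y - x;  z := z + 1;
        end if;
        quotient  <= resize(z, n);
        remainder <= resize(y, n);
        done <= '1';  busy := false;
      end if;
    end if;
  end process;
end architecture;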

3.3 Division Algorithm 2 : Adaptive Shift Algorithm

This algorithm does not perform a fixed number of iterative steps to compute the quotient and the remainder; instead, at each iteration it determines the step size δ from the magnitude of the input data. Since the step size of the shift is not fixed, we call this the adaptive shift. This algorithm requires fewer iterations than the fixed shift algorithm. Similar to our assumptions for the fixed shift algorithm, we consider the divisor X to have m bits and the dividend Y to have n bits, inclusive of the sign bit.

The adaptive shift division algorithm has the following properties:

1. The required number of iterations is determined by the input data.

2. The sign of the current value of Y determines if the operation needed on the next iteration is addition or subtraction.

3. The value of Z will converge on to the quotient with the opposite operation to the operation of Y in property number 2.

4. The location of the most significant bit value of Y and X determines the value of δ at every iteration by the equation (3.17) below.

3.3.1 Mode 1 : Range reduction of Y

The step size of δ in the adaptive shift algorithm is obtained by the magnitude of the input data and not by the iteration index, as it was obtained in the fixed shift algorithm. The iterations on Y and Z occur as per the following equations:

Y(i+1) = Y(i) − µ_i δ_i X, 0 ≤ i ≤ n − m (3.12)

Z(i+1) = Z(i) + µ_i δ_i (3.13)

where the initial values of Y and Z are:

Y(0) = Y (3.14)

Z(0) = 0 (3.15)

The µ_i in equations (3.12) and (3.13) denotes the addition or subtraction operation at a given iteration index i, and δ_i is the step size; they are given respectively by the following equations:

µ_i = +1 when Y(i) ≥ 0, −1 when Y(i) < 0 (3.16)

δ_i = 2^(Py−Px), |Y| ≥ X (3.17)

where Px is the position of the most significant set bit of X (since the sign of X is arbitrary, our notation assumes it is positive), while Py is defined as:

Py = position of most significant 1 when Y > 0,
   = 0 when Y = 0,
   = position of most significant 0 when Y < 0 (3.18)

When Py ≤ Px, the iterations of the range reduction mode are stopped.

3.3.2 Mode 2 : Post processing of Y and Z

On the completion of Mode 1, the final value Y(n−m+1) needs to fall in the range:

−2^(m−1) ≤ Y(n−m+1) ≤ 2^(m−1) − 1 (3.19)

Just as in the fixed shift algorithm post processing, the range may not be satisfied because the value of Y(n−m+1) is either negative, or positive but greater than X, and thus this value needs to be processed so that it satisfies the range:

0 ≤ Y(n−m+1) < X (3.20)

The value of the quotient, Z(n−m+1), also needs to be updated whenever Y is changed. In order to bring the result Y(n−m+1) into the desired range, the following process needs to be applied:

Y(n−m+1) = Y(n−m+1) − µX (3.21)

Z(n−m+1) = Z(n−m+1) + µ (3.22)

where µ works in the same way as in the range reduction mode to determine the addition and subtraction operation in equations (3.21) and (3.22), based on the following condition:

µ = +1 when Y(n−m+1) ≥ X, −1 when Y(n−m+1) < 0 (3.23)

This process is needed so that the range in equation (3.20) is satisfied. The total number of iterations needed in Algorithm 2 is at most n − m if the result of the division is achieved in mode 1. If the result is not achieved in mode 1, one more iteration is needed in mode 2.
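The only new ingredient of Algorithm 2 is the computation of Py, Px and the resulting step size. The sketch below expresses that computation as a behavioural VHDL function; it is illustrative only (the thesis implements it with the hierarchical DAG of Chapter 4, not with a loop), and the package and function names are invented here.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package adaptive_step_sketch is
  -- Py per (3.18): position of the most significant '1' when Y > 0,
  -- position of the most significant '0' when Y < 0, and 0 when Y = 0.
  -- For the positive divisor X the same function returns Px.
  function position_p (num : signed) return natural;
end package;

package body adaptive_step_sketch is
  function position_p (num : signed) return natural is
    variable ref : std_logic;
  begin
    if num = 0 then
      return 0;
    elsif num > 0 then
      ref := '1';                 -- scan for the most significant set bit
    else
      ref := '0';                 -- scan for the most significant cleared bit
    end if;
    for k in num'high downto num'low loop
      if num(k) = ref then
        return k;
      end if;
    end loop;
    return 0;
  end function;
end package body;

-- One mode 1 iteration of Algorithm 2 would then use (pseudocode):
--   shift := position_p(Y) - position_p(X);        -- Py - Px, eqn (3.17)
--   if Y >= 0 then Y := Y - shift_left(X, shift); Z := Z + 2**shift;
--   else           Y := Y + shift_left(X, shift); Z := Z - 2**shift;
--   end if;
-- and mode 1 stops as soon as position_p(Y) <= position_p(X).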

3.4 Chapter Summary

In this chapter, we considered the two division algorithms: the fixed shift algorithm and the adaptive shift algorithm. The equations and conditions required by the algorithms were explained and represented mathematically. The difference between the two algorithms lies primarily in the step size δ: in the fixed shift algorithm, δ is determined by the iteration index, while in the adaptive shift algorithm, δ is governed by the input data, that is, the difference between the position of the most significant "1" or "0" (based on the sign of Y) and the position of the most significant "1" in X, since X is assumed to be positive. In both algorithms, the idea is to reduce Y, as determined by δ, until it is positive and smaller than X in magnitude. When Y fails to fall in the correct range, a post processing step is required to obtain the correct values of Y and Z.


Chapter 4

Design and Implementation

The hardware realization of the division algorithms requires the identification and design of the individual system blocks and their interconnectivity in the divider designs. This chapter presents that design methodology.

4.1 Hardware entities for Algorithm 1

The division methodology, equations, conditions and operations explained in chapter 3, will be used to determine the hardware entities required for each of the division algorithms. In this section we look at the hardware entities that are required for implementation of Algorithm 1. In every iteration the hardware needs to implement:

• One shift.

• One addition and one subtraction (two operations performed by the ALU)

To implement this, Algorithm 1 needs the following entities:

• X, Y and Z registers

• Data multiplexer

• Comparator for Y

• Look-up table

• ALU

• Counter

• Finite state machine

The system block-level diagram of Algorithm 1 is shown in fig. 4.1.

Figure 4.1: Algorithm 1 system level

4.1.1 X, Y and Z Registers

Division requires four operands in total: the divisor, the dividend, the quotient and the remainder. In our implementation, however, only three operands are needed, since we reduce the dividend such that it yields the remainder. Therefore we need to store only three values in the registers: the remainder Y, the quotient Z and the divisor X. The word width of Y is 32 bits, therefore we set the registers of X and Z to a 32 bit word width too. Having a uniform word width for the three registers simplifies the applicability of arithmetic operations on these operands.

Moreover, the registers are required to hold values from the following:

• The initial values from the external data lines

• The intermediate values of Y and Z from the data feedback from the ALU during each iteration.

• The final values of Y and Z once the iterations are complete and division result is obtained.

To meet the above requirements, we need control signals for the register bank to enable the read/write capability on the register contents, and we also need the ability to selectively switch between the external data and the internal feedback data. The block level view of our register bank is shown in fig. 4.2 below.

Figure 4.2: Registers X, Y and Z in the bank

4.1.2 Data Multiplexer

The multiplexer has a control signal input from the controller to select between the external data lines and the feedback data lines from the ALU; the output data lines from the multiplexer feed the data into the registers. The block level view of the multiplexer is shown in fig. 4.3.


Figure 4.3: Data multiplexer for register bank
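As an illustration of how the register bank and the data multiplexer of figs. 4.2 and 4.3 fit together, the sketch below shows one 32 bit register whose input is selected between the external data lines and the ALU feedback. It is a minimal sketch with assumed entity and port names, not the thesis listing.

library ieee;
use ieee.std_logic_1164.all;

entity reg_with_mux is
  port (clk, enable, sel : in  std_logic;
        ext_data         : in  std_logic_vector(31 downto 0);  -- external input
        alu_feedback     : in  std_logic_vector(31 downto 0);  -- feedback from the ALU
        q                : out std_logic_vector(31 downto 0));
end entity;

architecture rtl of reg_with_mux is
  signal d : std_logic_vector(31 downto 0);
begin
  d <= ext_data when sel = '1' else alu_feedback;  -- data multiplexer
  process (clk)
  begin
    if rising_edge(clk) then
      if enable = '1' then
        q <= d;                                    -- register write
      end if;
    end if;
  end process;
end architecture;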

4.1.3 Comparator for Y

The comparator that scans Y is an important part of the hardware, since it determines whether an addition or a subtraction operation is needed on the next values of Y and Z. The operands X and Y are fed into the comparator, which sets its flags as follows:

• f_ypos = 0 when the value of Y goes negative

• f_ygtex = 1 when the value of Y is greater than or equal to X

The block level view of the comparator is shown in fig. 4.4 below.

Figure 4.4: Comparator for Y
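A minimal sketch of the comparator of fig. 4.4 is shown below. The flag polarities (f_ypos = '1' while Y is non-negative, f_ygtex = '1' while Y ≥ X) are inferred from the FSM description later in this chapter, and the entity name is assumed.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity comparator_y is
  port (x       : in  signed(31 downto 0);   -- divisor, assumed positive
        y       : in  signed(31 downto 0);   -- current value of Y
        f_ypos  : out std_logic;             -- '1' while Y >= 0
        f_ygtex : out std_logic);            -- '1' while Y >= X
end entity;

architecture rtl of comparator_y is
begin
  f_ypos  <= '1' when y >= 0 else '0';
  f_ygtex <= '1' when y >= x else '0';
end architecture;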

4.1.4 The Look-up table

The look-up table (LUT) is implemented as a ROM in the system, with contents stored as weights of binary shifts. The value of δ calculated in the two algorithms corresponds to an address in the LUT, whose content is picked up by the ALU during the computation in the iteration. The LUT block is shown in fig. 4.5.


Figure 4.5: LUT block
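The following sketch shows one plausible way to describe such a LUT ROM: each address a holds the binary weight 2^a, so the address supplied by the counter (Algorithm 1) or the DAG (Algorithm 2) selects the shift value directly. The depth of 16 entries and all names are assumptions, not taken from the thesis listings.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity delta_lut is
  port (addr  : in  unsigned(3 downto 0);          -- delta address, 0 .. 15
        delta : out unsigned(31 downto 0));        -- 2**addr
end entity;

architecture rom of delta_lut is
  type rom_t is array (0 to 15) of unsigned(31 downto 0);
  function init_rom return rom_t is
    variable r : rom_t;
  begin
    for a in r'range loop
      r(a) := shift_left(to_unsigned(1, 32), a);   -- store the binary weight 2**a
    end loop;
    return r;
  end function;
  constant ROM_C : rom_t := init_rom;
begin
  delta <= ROM_C(to_integer(addr));
end architecture;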

4.1.5 The ALU unit

The ALU unit computes equations (3.1), (3.2), (3.9), (3.10), (3.12), (3.13), (3.21) and (3.22) and is comprised of three ALUs that perform the following:

• Multiplication between δ_i and X.

• Addition/subtraction (based on the sign bit of the current Y) of the product δ_i X from/to Y_i to obtain Y_(i+1).

• Addition/subtraction (based on the sign bit of the current Y) of δ_i to/from Z_i to obtain Z_(i+1).

The ALU requires a control signal, based on the status of the comparator flags, to perform the addition or subtraction operation. The ALU block is shown in fig. 4.6 and the logical operation during an iteration is shown in fig. 4.7.


Figure 4.7: Logical operation of ALU during ith iteration
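A condensed sketch of the per-iteration ALU operation of fig. 4.7 is given below. Because δ is always a power of two, the δ·X product is formed here by a left shift; the add_sub_y control decides the direction of the Y update, and Z always receives the opposite operation. Widths and names are illustrative only.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity div_alu is
  port (x, y, z     : in  signed(31 downto 0);
        delta_shift : in  natural range 0 to 31;   -- shift count from the LUT address
        add_sub_y   : in  std_logic;               -- '0': Y - delta*X, '1': Y + delta*X
        y_next      : out signed(31 downto 0);
        z_next      : out signed(31 downto 0));
end entity;

architecture rtl of div_alu is
  signal delta_x : signed(31 downto 0);
  signal delta_z : signed(31 downto 0);
begin
  delta_x <= shift_left(x, delta_shift);                  -- delta * X
  delta_z <= shift_left(to_signed(1, 32), delta_shift);   -- delta itself

  -- Y and Z always receive opposite operations (eqns 3.1 and 3.2)
  y_next <= y + delta_x when add_sub_y = '1' else y - delta_x;
  z_next <= z - delta_z when add_sub_y = '1' else z + delta_z;
end architecture;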

4.1.6 Counter

To perform the shift we need a counter. Recall from section 3.2.1 that the step size δ is given by the iteration index and not by the intermediate values of Y. The counter is employed in Algorithm 1 to produce the iteration index at each iteration, which pulls the corresponding value out of the LUT for the ALU. When the iterations are complete, a flag is raised and its status is provided to the controlling unit. The counter block is shown below in fig. 4.8.

Figure 4.8: Counter block diagram
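A minimal sketch of such a counter is shown below; the generic COUNTS (n − m + 1 iterations in the text) and the signal names are assumptions.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity iter_counter is
  generic (COUNTS : natural := 16);                -- number of mode 1 iterations
  port (clk, clear, enable : in  std_logic;
        index : out unsigned(4 downto 0);          -- iteration index i, used as LUT address
        f_i   : out std_logic);                    -- '1' when all counts are done
end entity;

architecture rtl of iter_counter is
  signal cnt : unsigned(4 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if clear = '1' then
        cnt <= (others => '0');
      elsif enable = '1' and cnt < COUNTS then
        cnt <= cnt + 1;
      end if;
    end if;
  end process;
  index <= cnt;
  f_i   <= '1' when cnt = COUNTS else '0';
end architecture;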

4.1.7 Finite State Machine

The finite state machine (FSM) is the controlling unit of the system; it sends and receives control signals to and from the other hardware entities in the system. The FSM block is shown in fig. 4.9. The FSM of Algorithm 1 is fairly simple and has only four states: initial, iterate, adjust and final.


Figure 4.9: Finite State Machine block

4.1.8 FSM : State transition diagram

In the initial state the FSM is in idle mode and scans for an external "start" input control signal. The initial state is used as a system initialization mode which occurs upon reset: the counter is cleared and "sel" (select) is set high so that the external data inputs are selected and those values are ready to be loaded into the registers X, Y and Z. The enable_x and enable_yz signals are set high, which enables writing into the registers, while the "done" signal is set to "0" and add_sub_y is essentially in the don't care state.

Once the "start" is received, the FSM goes into the iterate mode, which implements the "range reduction of Y" mode; for this, the counter is enabled and the "sel" control is set to "0" so that the internal feedback data lines from the ALU are selected for the next iteration. The flags f_ypos = 0 and f_ygtex = 1 mean that Y is negative or is positive but greater than or equal to X, respectively, and add_sub_y is controlled accordingly: if the value of Y is negative, an addition is performed; if it is positive and greater than or equal to X, a subtraction is performed. When the counter has reached the pre-determined "counts", the f_i flag is raised to "1", which signals to the FSM that the iterations of mode 1 are complete.

The FSM then checks the status of the flags: if f_ypos = 0 or f_ygtex = 1, the FSM goes into the adjust mode to "post process Y and Z". Otherwise, if the flags have the opposite status (f_ypos = 1 and f_ygtex = 0), Y is in the correct range and the FSM goes directly into the final state. In the final state, the write capability of registers Y and Z is disabled through enable_yz and the "done" signal is set to "1", which indicates that the division operation is complete. The state transition diagram is shown in fig. 4.10.

Figure 4.10: State transition diagram for Algorithm 1
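The sketch below captures only the state transitions just described, written as a clocked VHDL process; the output control signals (sel, enable_x, enable_yz, add_sub_y, done) are omitted for brevity, the names are assumed, and the adjust condition is written as f_ypos = '0' or f_ygtex = '1', which is the logically consistent reading of the text (Y negative, or Y ≥ X).

library ieee;
use ieee.std_logic_1164.all;

entity fsm_alg1_sketch is
  port (clk, rst, start      : in  std_logic;
        f_i, f_ypos, f_ygtex : in  std_logic;
        state_dbg            : out std_logic_vector(1 downto 0));
end entity;

architecture rtl of fsm_alg1_sketch is
  type state_t is (s_initial, s_iterate, s_adjust, s_final);
  signal state : state_t := s_initial;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= s_initial;
      else
        case state is
          when s_initial =>
            if start = '1' then state <= s_iterate; end if;
          when s_iterate =>
            if f_i = '1' then                    -- mode 1 iterations complete
              if f_ypos = '0' or f_ygtex = '1' then
                state <= s_adjust;               -- Y out of range: post process
              else
                state <= s_final;
              end if;
            end if;
          when s_adjust => state <= s_final;     -- single correction step
          when s_final  => null;                 -- 'done' would be asserted here
        end case;
      end if;
    end if;
  end process;

  with state select
    state_dbg <= "00" when s_initial,
                 "01" when s_iterate,
                 "10" when s_adjust,
                 "11" when s_final;
end architecture;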

4.2 Hardware entities for Algorithm 2

We know from section 3.3.1 that the step size δ in the adaptive shift algorithm is obtained from the magnitude of the input data and not from the iteration index, therefore we do not use a counter in the implementation of this algorithm. Instead, we need a special hardware unit that checks for the most significant 1's or 0's in the operand, depending on whether the number is positive or negative respectively. In our design, we call this unit the delta address generator (DAG). In every iteration the hardware needs to implement the following operations:

• Determine the location of most significant 1 or 0 for Yi.


• One addition and one subtraction (two operations performed by the ALU)

To implement this, Algorithm 2 needs the following entities:

• X, Y and Z registers

• Data multiplexer

• Comparator for Y

• Look-up table

• ALU

• Delta (δ) address generator

• Finite state machine

The system block-level diagram of Algorithm 2 is shown in fig. 4.11. We only discuss the DAG and the finite state machine for Algorithm 2, because they are specific to the adaptive shift algorithm, while the rest of the entities are implemented in exactly the same way as in Algorithm 1. One key difference between the two designs is that the counter used in Algorithm 1 is not used in Algorithm 2; instead, the DAG generates the shifts in δ.


Figure 4.11: Algorithm 2 system level

4.2.1 Delta Address Generator

This unit determines the location of the most significant 1 or 0 by scanning for the positions Py and Px and generating an address from the difference of the two positions, in order to obtain the corresponding shift value δ from the LUT ROM, which is then used by the ALU for the computation in the iteration step. The DAG block level diagram is shown in fig. 4.12 and the overall system block-level diagram is given in fig. 4.13.


Figure 4.12: Delta (δ) Address Generator

The DAG is composed of several hardware entities:

• position finder unit

• multiplexer for flag

• multiplexer for data lines

• Px register

• number subtractor

Position finder unit

The purpose of this unit is to find Py and Px from Y and X respectively, based on the flag input f_ypos. If f_ypos = 1, the position finder unit detects the most significant 1 bit in Y; if f_ypos = 0, the unit detects the most significant 0 bit. Since X is assumed to be positive, the unit always looks for the most significant 1 in X. See fig. 4.14 below: the number at the input is either X or Y, depending on the data multiplexer input. Similarly, f_mux, the flag forwarded by the flag multiplexer, indicates the sign of the number operand at the input of the position finder unit. For the case of Y, "f_mux" takes its input from the f_ypos output of the comparator; for the case of X, "f_mux" sends a "1" to the position finder unit, which tells the unit to look for the most significant "1" in X. The output "position" holds the value of Py or Px from Y or X respectively. The "flag_out" signal results from the hierarchical implementation of the position finder and is not used in the computation of Py and Px or in the division operation.


Multiplexer for flag

This is just a simple multiplexer that enables re-using the same position finder unit for Px and Py. It reads the status of the flag f_ypos to decide whether the unit needs to look for 1's or 0's in Y. For the case of Px, we feed a "1" through the multiplexer input so that the unit always looks for the most significant "1" in X, since X is always positive. Figure 4.15 below highlights this: the "sel_x" input comes from the FSM, and when it is high, the multiplexer sends a "1" at the output; otherwise, when it is low ("0"), it forwards "f_ypos" at the output as "f_mux".

Figure 4.15: Multiplexer for flag input

Multiplexer for data lines

This works in exactly the same way as the multiplexer for the flag and shares the same control input "sel_x". Since we re-use the position finder unit for both Px and Py, this multiplexer controls which data lines are selected as input for the position finder unit, as shown in fig. 4.16 below.


Px register

To employ the re-usability of the position finder unit, we need a register that stores Px for the number subtractor. Since this register is only used for Px, it functions only when "sel_x = 1", and is therefore controlled by the signal "enable_reg_Px". Figure 4.17 illustrates this block.

Figure 4.17: The Px Register

Number subtractor

This hardware entity performs the subtraction Py − Px, which is used as an address for the LUT, and it also raises the flag "f_i" when the result of the subtraction is less than or equal to "0", which indicates to the FSM that the "range reduction of Y" mode is complete. The delta address takes its value from the result of Py − Px, while the "position_x" and "position" signals represent Px and Py respectively. Figure 4.18 illustrates this block.
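A minimal sketch of this subtractor is shown below; the widths and names are assumptions.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity position_subtractor is
  port (position_y : in  unsigned(4 downto 0);   -- Py from the position finder
        position_x : in  unsigned(4 downto 0);   -- Px held in the Px register
        delta_addr : out unsigned(4 downto 0);   -- Py - Px, address into the LUT
        f_i        : out std_logic);             -- '1' when Py <= Px
end entity;

architecture rtl of position_subtractor is
begin
  -- when Py <= Px the address is unused, so it is simply forced to zero
  delta_addr <= position_y - position_x when position_y > position_x
                else (others => '0');
  f_i <= '1' when position_y <= position_x else '0';   -- stop range reduction
end architecture;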

4.2.2 DAG Implementation

The DAG is the most important hardware unit of Algorithm 2, since it computes the adaptive shift δ for this algorithm. Recall that in Algorithm 1 we employed the counter to compute the fixed shifts based on the iteration index i; in the adaptive shift based division technique, we instead scan the words Y and X for the bit positions of the most significant 1's or 0's and then use the difference between the bit locations to obtain the value of δ.

The DAG is implemented as a hierarchical arrangement of five levels, which follows from the relation below, since we have a 32 bit operand:

2^x = 32 (4.1)

therefore, x = 5.

"Level 1" is comprised of 16 2-bits scan units that each scan two bits at a time across the entire word width of Y, starting from bit locations Y0Y1 up to Y30Y31. Each unit checks for the presence of a 1 or a 0 (depending on the sign of Y) in its more significant bit, otherwise it checks its less significant bit, and sends the resulting flag and position to the next hierarchical level. Each unit also accepts a starting base value n, from which the position value passed on to the next level is obtained. Figure 4.19 below shows two of these units to help illustrate the concept.

Tables 4.1 and 4.2 show how the 2-bits scan unit works when Y is positive or negative.

Figure 4.19: 2-bits scan unit 0 in level 1

Y1  Y0  n0   f0
0   0   0    0
0   1   n    1
1   0   n+1  1
1   1   n+1  1

Table 4.1: Truth Table when Y is positive

Y1  Y0  n0   f0
0   0   n+1  1
0   1   n+1  1
1   0   n    1
1   1   0    0

Table 4.2: Truth Table when Y is negative

"Level 2" is comprised of 8 scan blocks that each scan, in essence, 4 bits: the two numbers and the two flags from a pair of 2-bits scan units in level 1, from scan block 0 up to scan block 7. If the flag "f1" of 2-bits scan unit 1 is a "1", then the number on the output of scan block 0 is "n1"; if the flag "f0" is a "1" and "f1" is a "0", then the number on the output of scan block 0 is "n0". We demonstrate this relation between scan block 0 in level 2 and the two 2-bits scan units 0 and 1 from level 1 in fig. 4.20.

The approach for level 2 extends in the same manner all the way down to level 5, through levels 3 and 4. The scan block shown in fig. 4.20 is exactly the same for the remaining levels and works on the same principle, accepting two numbers and two flags from the previous level and updating the number output depending on the status of the flag(s). As we go up a level, the number of scan blocks needed is reduced by a factor of 2, hence we have only four scan blocks in level 3, two in level 4 and one in level 5.

Figure 4.20: Hierarchical approach between level 1 and 2

The output "n0_L5" obtained at the output of level 5 is the position of the most significant 1 or 0, depending on the sign of the operand.

The n at the top of each "2-bits scan unit", referred to in the figure as "u", is the base value for that unit. Notice that the whole word width of 32 bits is covered by the 16 2-bits scan units; each scan unit forwards its respective bit position output (n0...n15) and flag output (f0...f15) to the scan blocks in level 2.

Although the methodology of operation is the same for the scan blocks as for the 2-bits scan units, different notation is used for the number and flag outputs to highlight the difference. The numbers are the base value plus the position of the most significant 1 or 0 within that unit, and the flags determine which scan block holds the most significant 1 or 0; in other words, if the flag from a higher order scan block is high, the number output of that scan block is forwarded to the output.
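A behavioural sketch of one 2-bits scan unit, written directly from Tables 4.1 and 4.2, is shown below; the entity and port names are assumptions.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity scan_unit_2bit is
  port (y1, y0 : in  std_logic;                 -- the two operand bits scanned
        f_mux  : in  std_logic;                 -- '1': search for '1', '0': search for '0'
        n_base : in  unsigned(4 downto 0);      -- base position n of bit y0
        n0     : out unsigned(4 downto 0);      -- position found (n or n+1)
        f0     : out std_logic);                -- '1' if the searched bit was found
end entity;

architecture rtl of scan_unit_2bit is
begin
  process (y1, y0, f_mux, n_base)
  begin
    if y1 = f_mux then                          -- higher-order bit matches
      n0 <= n_base + 1;
      f0 <= '1';
    elsif y0 = f_mux then                       -- lower-order bit matches
      n0 <= n_base;
      f0 <= '1';
    else                                        -- nothing found in this pair
      n0 <= (others => '0');
      f0 <= '0';
    end if;
  end process;
end architecture;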

4.2.3 Finite State Machine

The finite state machine (FSM) is the controlling unit of the system; it sends and receives control signals to and from the other hardware entities in the system. The FSM block is shown below in fig. 4.22. The FSM of Algorithm 2 has one more state than that of Algorithm 1, for a total of five states: initial, load X (initialize X), iterate, adjust and final.

Figure 4.22: Finite State Machine block

4.2.4 FSM : State transition diagram

In the initial state the FSM is in idle mode and scans for an external "start" input control signal. Once the "start" is received, the FSM goes into the "load X" mode. The load X state is an additional initialization state, along with the initial state, that loads the value of Px into the Px register so that the iterations are synchronized with Py when the iterate mode is reached. The initial state is used as a system initialization mode which occurs upon reset: "sel" (select) is set high so that the external data inputs are selected and those values are ready to be loaded into the registers X, Y and Z. The enable_x and enable_yz signals are set high, which enables writing into the registers, while the "done" signal is set to "0" and add_sub_y is essentially in the don't care state. We have two additional control signals, "sel_x" (select x) and "enable_reg_x" (enable register x), which are associated with obtaining the value of Px, the position in "X". After the load X state, "sel_x" and "enable_reg_x" are disabled so that the DAG will fetch values of Y in order to obtain the value of Py.

On the next clock cycle the FSM goes into the iterate mode, which implements the "range reduction of Y" mode. The flags f_ypos = 0 and f_ygtex = 1 mean that Y is negative or is positive but greater than or equal to X, respectively, and add_sub_y is controlled accordingly: if the value of Y is negative, an addition is performed; if it is positive and greater than or equal to X, a subtraction is performed. When the result of Py − Px ≤ 0, the f_i flag is raised to "1", which signals to the FSM that the iterations of mode 1 are complete. The state transition diagram is shown in fig. 4.23.

The FSM checks the status of the flags: if f_ypos = 0 or f_ygtex = 1, the FSM goes into the adjust mode to "post process Y and Z". Otherwise, if the flags have the opposite status (f_ypos = 1 and f_ygtex = 0), Y is in the correct range and the FSM goes directly into the final state. In the final state, the write capability of registers Y and Z is disabled through enable_yz and the "done" signal is set to "1", which indicates that the division operation is complete.

4.3 Circuit Implementations

The general description of the system and its blocks has been covered in the previous sections. In this section we look at the Register Transfer Level (RTL) view of the top level block and the overall RTL schematic of the division and allied hardware implementation. The signal paths are shown in red and the data paths are shown in black in the schematics.

4.3.1 Algorithm 1 : Fixed Shift division algorithm

The top level block and the RTL schematic for the fixed shift division algorithm are shown in figs. 4.24 and 4.25.

4.3.2 DAG overall layout

The overall RTL schematic for the DAG used in algorithm 2 is shown in fig. 4.26.

Figure 4.26: Delta address generator RTL schematic

4.3.3 Algorithm 2: Adaptive Shift division algorithm

The schematics for the adaptive shift division algorithm are shown in figs. 4.27 and 4.28.

4.4 Chapter summary

In this chapter, the design overview and methodology were explained for each of the two division algorithms: Algorithm 1, the fixed shift division algorithm, and Algorithm 2, the adaptive shift division algorithm. The difference in operation and implementation between the two algorithms was explained with reference to the step size δ. In Algorithm 1, the iterations are pre-determined and this is achieved through a counter, while in Algorithm 2, the shifts in δ are produced by a special hardware unit called the DAG. The DAG is a hierarchical arrangement of scan units and scan blocks whose purpose is to calculate the difference Py − Px; this difference corresponds to an address in the LUT that holds the shifted binary value of δ.


Chapter 5

Results and Evaluation

The aim of this chapter is to demonstrate that the two division algorithm designs of the previous chapter work according to the algorithms discussed in chapter 3. The implementation phase proved to be very challenging and required a respectable amount of testing, debugging and design revision to ensure the proper functionality of the intended hardware. This chapter documents the tests and simulation results used to analyze the functionality and the performance of the two algorithms. Initially, Algorithm 1, based on the fixed shift, was constructed to achieve a working division algorithm, and then Algorithm 2, based on the adaptive shift technique, was constructed to produce the same division result. A comparative analysis was conducted between the two algorithms for their power consumption, device utilization, timing, area-delay product and power-delay product, based on design goals for balanced, timing performance and power optimization. Some of the related work in the area is also compared in this chapter.

5.1 Numerical Simulation using MATLAB

The two algorithms were first implemented in software using MATLAB in order to verify that the division algorithms yielded the correct values of the quotient and the remainder when the dividend was divided by the divisor. The purpose of this numerical simulation was also to have a reference benchmark of the numerical values in each iteration, so that comparisons could be drawn during the hardware implementation phase. These simulation numbers were not only important from the verification point of view, but were also very beneficial when debugging the hardware description.


5.1.1 Numerical Simulation of Algorithm 1

Table 5.1 shows the numerical simulation in each iteration when Y = 1,176,349 is divided by X = 127,773.

Range Reduction Mode of Y, Algorithm 1

i           Y_{i+1}          Z_{i+1}   δ_i = 2^(n−m−i)   µ_i × δ_i
Initialize  1,176,349        0         -                 -
0           -2092256483      16384     2^14              16384
1           -1045540067      8192      2^13              -8192
2           -522181859       4096      2^12              -4096
3           -260502755       2048      2^11              -2048
4           -129663203       1024      2^10              -1024
5           -64243427        512       2^9               -512
6           -31533539        256       2^8               -256
7           -15178595        128       2^7               -128
8           -7001123         64        2^6               -64
9           -2912387         32        2^5               -32
10          -868019          16        2^4               -16
11          154165           8         2^3               -8
12          -356927          12        2^2               4
13          -101381          10        2^1               -2
14          26392            9         2^0               -1

Post Processing Mode of Y and Z, Algorithm 1
Not required, results are obtained

Table 5.1: Iterations for Algorithm 1

The table above lists the values of Y_{i+1} and Z_{i+1} for each iteration i. Notice that the shift δ_i decreases by exactly 1 bit per iteration; for this reason we call this algorithm the fixed shift division algorithm. As discussed in chapter 3, in our work Y is 32 bits wide and X must be at least 17 bits wide, denoted by n and m respectively. The difference n − m = 15 bits gives the number of iterations required for the division operation, so a total of 15 iterations is performed. Since the result of the division on the chosen operands Y and X satisfies equation (3.8), the "post processing mode of Y and Z" is not needed in this case.
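A compact MATLAB sketch of this recurrence is given below. It is written only to reproduce Table 5.1 for the operands above: the shift δ is coded to start at 2^14 and halve every iteration, matching the table, and the final correction is written as a generic loop that brings Y into [0, X), whereas the hardware post-processing mode follows equation (3.8).

% Fixed shift division (Algorithm 1), behavioural MATLAB model that
% reproduces Table 5.1. n = 32, m = 17 as stated in chapter 3.
Y = 1176349;            % dividend
X = 127773;             % divisor
Z = 0;                  % quotient accumulator
delta = 2^14;           % initial shift value (see Table 5.1)

for i = 0:14            % n - m = 15 iterations
    if Y >= 0
        mu = 1;
    else
        mu = -1;
    end
    Y = Y - mu * delta * X;          % range reduction of Y
    Z = Z + mu * delta;              % accumulate the quotient
    fprintf('%2d  %12d  %6d  %6d\n', i, Y, Z, mu * delta);
    delta = delta / 2;               % fixed shift: drop by one bit each iteration
end

% Post processing mode of Y and Z: generic correction to 0 <= Y < X.
% (Not needed for these operands, as noted above.)
while Y < 0
    Y = Y + X;  Z = Z - 1;
end
while Y >= X
    Y = Y - X;  Z = Z + 1;
end
fprintf('quotient = %d, remainder = %d\n', Z, Y);

Running this script prints the iteration rows of Table 5.1 and ends with quotient 9 and remainder 26392, i.e. 1,176,349 = 9 × 127,773 + 26,392.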


5.1.2 Numerical Simulation of Algorithm 2

Table 5.2 shows the numerical simulation in each iteration when Y = 1,176,349 is divided by X = 127,773.

Range Reduction Mode of Y, Algorithm 2

i           Y_{i+1}    Z_{i+1}   δ_i = 2^(Py−Px)   µ_i × δ_i
Initialize  1,176,349  0         -                 -
0           -868019    16        2^4               16
1           154165     8         2^3               -8
2           -101381    10        2^1               2

Post Processing Mode of Y and Z, Algorithm 2
            26392      9         2^0               -1

Table 5.2: Iterations for Algorithm 2

The table lists the values of Y_{i+1} and Z_{i+1} for each iteration i. Notice that the shifts in δ_i do not decrease by a fixed amount as in algorithm 1; for this reason we call this algorithm the adaptive shift division algorithm. As discussed in chapter 3, the number of iterations for the adaptive shift division algorithm is given by Py − Px. The most significant 1 of Y is at bit position 20 (counting from bit position 0, the least significant bit of Y), while the most significant 1 of X is at bit position 16 (again counting from the least significant bit). The difference between the two positions is 20 − 16 = 4, so the algorithm takes a total of 4 iterations to produce the result. The iterations of the "range reduction mode of Y" end when Py ≤ Px; at this point the value of Y did not satisfy equation (3.20), so the algorithm enters the "post processing mode of Y and Z" to obtain the correct result. Table 5.2 shows that far fewer iterations are required to reach the division result than in Table 5.1.
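The following MATLAB sketch mirrors Table 5.2. The leading-one positions Py and Px are computed with floor(log2(·)) for brevity (in hardware this is the DAG's job), and the post-processing step is again written as a generic correction loop; both are behavioural simplifications rather than a description of the implemented circuit.

% Adaptive shift division (Algorithm 2), behavioural MATLAB model that
% reproduces Table 5.2. Py and Px are the leading-one bit positions.
Y = 1176349;            % dividend
X = 127773;             % divisor
Z = 0;                  % quotient accumulator
Px = floor(log2(X));    % most significant 1 of X (bit 16 for this operand)

i = 0;
while abs(Y) > 0 && floor(log2(abs(Y))) > Px      % range reduction mode of Y
    Py    = floor(log2(abs(Y)));                  % most significant 1 of |Y|
    delta = 2^(Py - Px);                          % adaptive shift from the DAG/LUT
    mu    = sign(Y);                              % +1 or -1
    Y = Y - mu * delta * X;
    Z = Z + mu * delta;
    fprintf('%2d  %10d  %4d  %4d\n', i, Y, Z, mu * delta);
    i = i + 1;
end

% Post processing mode of Y and Z: bring Y into the range [0, X).
while Y < 0
    Y = Y + X;  Z = Z - 1;
end
while Y >= X
    Y = Y - X;  Z = Z + 1;
end
fprintf('quotient = %d, remainder = %d\n', Z, Y);

The script prints the three range-reduction rows of Table 5.2 and, after one post-processing correction, again yields quotient 9 and remainder 26392.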

5.2 Hardware Simulation

The two algorithms were designed, synthesized and implemented in VHDL using Xilinx ISE Project Navigator 13.4. The top level and overall RTL schematics of both division algorithms and the allied hardware modules were presented in chapter 4. VHDL test benches were created and simulated to verify that the hardware performs the division correctly.


5.2.1 VHDL Simulation of Algorithm 1

Figure 5.1 shows screen shots of the test bench output.


Observing Y and Z in fig. 5.1, once "start" becomes "1" the iterations proceed on every rising clock edge until the quotient and remainder of the division are obtained. For clarity, the figure is broken down into zoomed views of the iterations in fig. 5.2 to fig. 5.6 so that the iteration data can be verified against the numerical simulation data.

Figure 5.2: Iterations 0 to 2 for Algorithm 1


Figure 5.4: Iterations 8 to 11 for Algorithm 1

Figure 5.5: Iterations 12 to 14 for Algorithm 1

From the test bench screen shots above, it can be seen that the iteration data from the VHDL simulation is consistent with the numerical simulation data obtained in section 5.1.

5.2.2 VHDL Simulation of Algorithm 2

We now assess the functionality of division algorithm 2, the adaptive shift division algorithm. As in section 5.1, the adaptive shift technique requires considerably fewer iterations than the fixed shift technique; this is verified by observing Y and Z in fig. 5.7.


5.3 Performance Evaluation

The hardware device chosen for the implementation is the Xilinx Spartan-3E xc3s1200e-4fg320. This Spartan-3E FPGA contains 1,200,000 system gates, 19,512 equivalent logic cells and 8,672 slices [12]; the available logic resources comprise 17,344 flip flops, 17,344 4-input LUTs and 250 bonded IOBs. Apart from implementing logic, slices are also used for routing signals within the device. The test study analyzes and compares the two division algorithms for their power consumption, device utilization and timing using the Xilinx ISE tool with respect to three design goals:

• Balanced.
• Timing performance.
• Power optimization.

These design goal profiles are pre-defined in the ISE Navigator tool and can be selected in the synthesis properties. In the balanced profile the optimization goal is "speed" with a "normal" optimization effort, while the timing performance profile uses a "speed" goal with a "high" optimization effort. In the power optimization profile the optimization goal is "area" with a "high" optimization effort.

5.3.1 Device Utilization

Once the two division algorithms were successfully compiled, they were synthesized to assess device utilization and performance. The device utilization results for the fixed shift division algorithm and the adaptive shift division algorithm are shown in Tables 5.3 and 5.4 respectively.

The device utilization summary can be obtained through the following:

Go to the ISE Navigator Design pane > select the Implementation view > select the design as "top module". In the Process pane > select Synthesize - XST > view the Design Summary (Synthesized) window.


Device Utilization Summary : Algorithm 1

Design goal Balanced Timing Performance Power Optimization

Number of Blocks 457 462 426

Flip Flops 70 86 70

4-Input LUTs 253 242 222

Occupied Slices 129 159 134

Table 5.3: On-chip device utilization of Algorithm 1

The on-chip logic utilization summary shows that Algorithm 1 uses a total of 457 blocks for the balanced profile, 462 blocks for timing performance and 426 blocks for power optimization, which in each case includes 132 IOBs, 1 BUFGMUX and 1 MULT18X18SIO. All three profiles result in a device utilization of 1% of our target device.

Device Utilization Summary : Algorithm 2

Design goal Balanced Timing Performance Power Optimization

Number of Blocks 607 582 585

Flip Flops 105 104 104

4-Input LUTs 368 344 347

Occupied Slices 194 182 192

Table 5.4: On-chip device utilization of Algorithm 2

Algorithm 2 uses more area than Algorithm 1. In addition to the flip flops and 4-input LUTs listed in Table 5.4, the total number of blocks again includes 132 IOBs, 1 BUFGMUX and 1 MULT18X18SIO for each of the three profiles. All three profiles result in a device utilization of 2% of our target device.

5.3.2 Timing Analysis

In the timing analysis we look at the two division algorithms in terms of clock frequency, critical path delay and the overall completion time of the division operation. Table 5.5 and Table 5.6 show the timing summaries of division algorithms 1 and 2 respectively.


Timing Analysis : Algorithm 1

Design goal Balanced Timing Performance Power Optimization

Clock Frequency [MHz] 81.374 83.008 77.48

Critical Path Delay [ns] 12.289 12.047 12.907

Division operation completion time = 155 ns

Table 5.5: Timing Summary of Algorithm 1

Timing Analysis : Algorithm 2

Design goal Balanced Timing Performance Power Optimization

Clock Frequency [MHz] 39.022 41.722 37.198

Critical Path Delay [ns] 25.626 23.968 26.883

Division operation completion time = 70 ns

Table 5.6: Timing Summary of Algorithm 2

The critical path delay determines the maximum clock frequency and can be obtained by running the Synthesize - XST option from the process panel of the ISE tool. Once synthesis is complete, the timing report can be viewed by right clicking Synthesize - XST. This report also reveals the source and destination of the critical path for each of the profiles.
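As a quick check against the balanced-profile numbers in Table 5.5, the maximum clock frequency is the reciprocal of the critical path delay:

f_max = 1 / t_cp = 1 / (12.289 ns) ≈ 81.37 MHz,

which matches the clock frequency reported for Algorithm 1 under the balanced design goal.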

The data paths that caused the critical path delay for Algorithm 1 were the following:

• Balanced.
source : counter_instance/temp_count_3 (FF), destination : regbank_instance/y_reg_alu_31 (FF).

• Timing performance.
source : regbank_instance/y_reg_alu_1_1 (FF), destination : regbank_instance/z_reg_alu_31 (FF).

• Power optimization.
source : regbank_instance/y_reg_alu_0 (FF), destination : regbank_instance/z_reg_alu_31 (FF).

The data paths that caused the critical path delay for Algorithm 2 were the following:

• Balanced.
source : fsm_instance/pstate_internal_FSM_FFd3 (FF), destination : regbank_instance/y_reg_alu_31 (FF).

• Timing performance.
source : fsm_instance/pstate_internal_FSM_FFd2 (FF), destination : regbank_instance/y_reg_alu_31 (FF).

• Power optimization.
source : fsm_instance/pstate_internal_FSM_FFd2 (FF), destination : regbank_instance/y_reg_alu_31 (FF).

The overall division operation completion time was obtained by running the ISim simulation from the "Simulation" view of the ISE tool and double clicking "Simulate Behavioral Model", which shows the test bench output. Vertical markers were then used to measure the time difference between the rising edge at which the "start" signal goes high and the rising edge at which the "done" signal is set high.

With respect to the empirical timing data, it was observed that Algorithm 2 has roughly 50% lower clock frequency, corresponding to approximately double the clock period. The addition of the DAG hardware increases the delay per clock cycle and thereby lowers the clock frequency, which in turn reduces circuit power consumption and increases reliability [13]: dynamic power consumption is related to clock frequency, so the higher the switching activity (or clock frequency) in a circuit, the higher its dynamic power consumption. Despite the lower clock frequency, the DAG hardware also yields a shorter overall completion time for the division operation, confirming that Algorithm 2 is more than 50% faster than Algorithm 1.

5.3.3 Power Consumption

The total on-chip power is the sum of the static power and the dynamic power. The static power results mainly from transistor leakage current within the device and is present even when a transistor is logically "OFF". The dynamic power depends on the switching activity, as defined in [14]. Because of this, the total power consumption of the same design changes when different target devices are used; therefore, the results presented in Table 5.7 refer to dynamic power.
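As background, the standard first-order CMOS model (not a formula taken from this thesis) summarizes the two components as

P_total = P_static + P_dynamic,    P_dynamic ≈ α · C_L · V_DD^2 · f_clk,

where α is the switching activity factor, C_L the switched load capacitance, V_DD the supply voltage and f_clk the clock frequency. This relation is why the lower clock frequency of Algorithm 2 observed in section 5.3.2 is expected to reduce its dynamic power for a comparable amount of switched logic.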
