by
Kareem Moeen
B.Sc., Kuwait University, 2009
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
MASTER OF APPLIED SCIENCE
in the Department of Electrical and Computer Engineering
© Kareem Moeen, 2016
University of Victoria
All rights reserved. This thesis may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.
Progressive Product Reduction for Polynomial Basis Multiplication over GF(3m)
by
Kareem Moeen
B.Sc., Kuwait University, 2009
Supervisory Committee
Dr. Fayez Gebali, Supervisor
(Department of Electrical and Computer Engineering)
Dr. Kin Li, Department Member
(Department of Electrical and Computer Engineering)
Dr. Alex Thomo, Outside Member (Department of Computer Science)
ABSTRACT
Galois fields are essential building blocks of many cryptographic schemes. The main advantages of applying Galois fields in cryptographic applications are reduced cost and improved performance. In the past, most cryptosystem implementations were based on Galois fields of characteristic 2, but researchers have since begun to work on Galois fields of odd characteristic, which have applications in many areas such as Elliptic Curve Cryptography, Identity-based Encryption, and Short Signature Schemes.
In this thesis, an odd-characteristic Galois field is implemented. In particular, the thesis focuses on the implementation of multiplication and reduction over GF(3^m). An overview of the thesis is presented at the beginning. Finite field arithmetic is then discussed, covering important definitions and properties of Galois fields, irreducible polynomials over GF(p) where p is prime, and basic addition and multiplication over GF(p^m). The proposed implementation is introduced starting with the arithmetic of Galois fields of characteristic 3. The problem formulation is given by its mathematical representation together with the Progressive Product Reduction (PPR) technique used in this thesis. Three different semi-systolic array architectures are implemented using different projection functions. This stage is followed by modeling assumptions for the complexity analysis of both area and delay, which are used to compare the proposed designs with other published designs. The proposed designs are verified by a Matlab implementation at the end of this thesis.
Contents
Supervisory Committee
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
List of Abbreviations

1 Introduction
  1.1 Overview
  1.2 Background
  1.3 Software Tools for Parallelizing
  1.4 Thesis Organization

2 Finite Field Arithmetic
  2.1 Mathematical Background
    2.1.1 Galois Field
    2.1.2 Definitions of Galois Field
    2.1.3 Properties of Galois Field
    2.1.4 Irreducible Polynomials
    2.1.5 Addition over Finite Fields
    2.1.6 Multiplication over Finite Fields
  2.2 Arithmetic over GF(3)

3 Progressive Product Reduction Technique
  3.1 Problem Formulation
  3.2 Mathematical Representation
  3.3 Parallelizing the PPR Technique
    3.3.1 Scheduling Function Design for PPR Technique
    3.3.2 Projection Function Design for PPR Technique
  3.4 Design Space Exploration for PPR Technique
    3.4.1 Design #1: using s and d_1
    3.4.2 Design #2: using s and d_2
    3.4.3 Design #3: using s and d_3

4 Complexity Analysis
  4.1 Performance Modeling
  4.2 Clock Duration
  4.3 Area and Delay Analysis
  4.4 Results
  4.5 Verification
    4.5.1 Golden Code
    4.5.2 Verifying Dependency Graph for PPR
    4.5.3 Verifying Design #1
    4.5.4 Verifying Design #2
    4.5.5 Verifying Design #3

5 Conclusion and Future Work
  5.1 Thesis Contributions
  5.2 Conclusion
  5.3 Future Work
List of Tables
Table 2.1 Multiplication table in GF(3) for c_k = a_i·b_j.
Table 2.2 Addition table in GF(3) for c_k = a_i + b_j.
Table 4.1 Components comparison with respect to normalized area and delay.
Table 4.2 Comparison between different GF(3^m) multipliers.
List of Figures
Figure 3.1 Dependence graph of the PPR algorithm for m = 9 and k = 4.
Figure 3.2 Dependence graph cell details: (a) red cells, where c_{m-1} is the feedback and the box labeled D is a 1-trit flip-flop with clear and load control inputs; (b) blue cells, where a_in and b_in are multiplied and added to c_in.
Figure 3.3 Node timing for the PPR algorithm using scheduling function s for m = 9 and k = 4.
Figure 3.4 Design #1 for m = 9 and k = 4: (a) semi-systolic array; (b) PE_j details when j ≠ 0 or k; (c) PE_j details when j = 0, k.
Figure 3.5 Processor activity for Design #2 for m = 9 and k = 4 based on (3.41).
Figure 3.6 Processor activity for Design #2 for m = 9 and k = 4 based on (3.42).
Figure 3.7 Design #2 for m = 9 and k = 4: (a) semi-systolic array; (b) PE details.
Figure 3.8 Processor activity for Design #3 for m = 9 and k = 4.
Figure 3.9 Design #3 for m = 9 and k = 4: (a) semi-systolic array; (b) PE details.
Figure 6.1 Multiplier Logic Gate Level Representation.
Figure 6.2 Addition Logic Gate Level Representation.
ACKNOWLEDGEMENTS
In the name of Allah, the Most Gracious and the Most Merciful
Alhamdulillah, all praise belongs to Allah the Merciful for His blessing and guidance. He gave me the strength to reach what I desire. I would like to thank:
My family, for supporting me at all stages of my education and for their unconditional love.
My Supervisor, Dr. Fayez Gebali, for all the support and encouragement he provided to me during my work under his supervision. It would not have been possible to finish my research without his invaluable help, constructive comments, and suggestions.
The winners in life are the ones who are constantly thinking of a way: I can, I'll do, I'll be
DEDICATION
To the memory of my father, Moeen Moeen (1950 - 2013), to my mother and my siblings for their love, prayers, and encouragement.
List of Abbreviations
AOP All One Polynomials
AOTP All One or Two Polynomials
CUDA Compute Unified Device Architecture
DAG Directed Acyclic Graph
DLC Down Literal Circuit
ECC Elliptic Curve Cryptosystems
FFT Fast Fourier Transform
FPGA Field Programmable Gate Array
gcd Greatest Common Divisor
GF Galois Field
GPU Graphics Processing Unit
HPC High Performance Computing
HPRC High Performance Reconfigurable Computing
MMM Montgomery Modular Multiplication
MVL Multiple Valued Logic
NTT Number Theoretic Transform
PE Processing Element
PPR Progressive Product Reduction
Chapter 1
Introduction
1.1 Overview
The use of computational tools for accelerating engineering and scientific computing applications has been a fundamental theme in computer engineering research. The acceleration of demanding computational applications has been achieved effectively using High Performance Computing (HPC), and these computations keep growing more complicated as researchers formulate more sophisticated problems. These reasons encourage the exploration of new technologies for HPC, known as High Performance Reconfigurable Computing (HPRC), where speedup is achieved by exploiting the synergism between hardware and software execution [9].
Galois fields received considerable attention in the mid-1960s because of their applications in coding theory and the implementation of error-correcting codes. Later, in the 1970s, public-key cryptography was invented by Diffie and Hellman [8]. Most work on arithmetic architectures appeared after two public-key cryptosystems based on finite fields were introduced in the 1980s: elliptic curve cryptosystems, introduced by Miller and Koblitz [19, 29], and hyperelliptic cryptosystems, introduced by Koblitz [20].
Until late 90’s, the researcher was mainly focused on fields of characteristic 2 that because it’s straight forward manner in which the field elements can be represented by the logical values “0” or “1”. Also, applications of fields GF (pm) for odd p were scarce in the
literature.
Early 2000’s, [18,39] the researcher start to consider working over characteristic 3 while arithmetic over GF (2) has been extensively studied. Also, Brian Hares in his study shows that the engineering advantages of GF (2) which is the most studied field in literature have
nothing to do with the basic properties of the decimal and binary numbering systems. On the other hand, GF (3) does have a genuine mathematical distinction in its favor. By one plausible measure, it is the most efficient of all integer bases; it offers the most economical way of representing numbers, more details available in [16].
1.2 Background
Finite field arithmetic operations are important in many digital system applications. Finite fields can be applied in number theory, computer algebra, information theory, switching theory, error control coding, digital signal processing, image processing, and public-key cryptosystems, e.g., Elliptic Curve Cryptosystems (ECC) [19].
An ECC defined over GF(3^m) has been implemented on a Field Programmable Gate Array (FPGA). The Montgomery Modular Multiplication (MMM) algorithm [30] has been used for the modular multiplication operation, which has a considerable effect on ECC performance [40].
Multiple Valued Logic (MVL) gates were used in a ternary systolic product-sum circuit over GF(3^m) using neuron-MOS, which replaces complex threshold operations in MVL [33]. The authors then compared their GF(3^2) design with a binary circuit for GF(2^3) in terms of the number of transistors and interconnections [31]. Other researchers extended the All One Polynomials (AOP) multiplication algorithm [24] to GF(3) and applied reducible All One or Two Polynomials (AOTP), which have non-zero coefficients over GF(3), to the AOP algorithm, using neuron-MOS Down-Literal Circuits (DLC) [41].
Semi-systolic multipliers over finite fields that use primitive polynomials selected under several criteria can achieve low-area and low-power multipliers [35].
1.3 Software Tools for Parallelizing
A variety of software tools for parallelization are available to the user for software implementation. The user is able to control the number of threads and the workload assigned to each thread, and can also control the synchronization of the different threads to ensure proper program execution. Using such techniques, the programmer is able to generate a parallel code, that is, a code that contains several threads [13]. Some of those software tools are as follows:
1. Cilk++ is a language-extension programming tool suited for divide-and-conquer problems, where the problem can be divided into parallel independent tasks and the results can be combined afterward.
2. OpenMP (Open Multi-Processing) is a concurrency platform for multithreaded, shared-memory parallel processing architectures for C, C++, and Fortran. By using OpenMP, the programmer is able to incrementally parallelize the program with little programming effort.
3. CUDA is a software architecture that enables the graphics processing unit (GPU) to be programmed using high-level programming languages such as C and C++. The programmer writes a C program with CUDA extensions, very much like Cilk++ and OpenMP as previously discussed.
1.4 Thesis Organization
This section gives an overall map of the thesis and a short description of each chapter. Chapter 2 reviews the basic background of arithmetic over GF(p) and introduces arithmetic over GF(3). Chapter 3 presents the mathematical representation of the Progressive Product Reduction (PPR) technique and its parallelization; it also describes the proposed scheduling function design and projection function design, and presents the three proposed designs. Chapter 4 presents the modeling assumptions for the proposed designs, in addition to a comparison between the three proposed designs and other published designs with respect to area and delay. Chapter 5 contains the conclusion of this thesis and future work related to the proposed work.
Chapter 2
Finite Field Arithmetic
2.1 Mathematical Background
An overview of the fundamentals of finite fields is given in this section. A comprehensive review of finite fields, with important definitions and properties, can be found in [25]. A field is a set of elements in which it is possible to apply all the mathematical operations (+, −, ·, /) such that the commutative, associative, and distributive properties are satisfied. A finite field is a field that contains a finite number of elements.
2.1.1 Galois Field
The Galois field was named in honor of the French mathematician Evariste Galois. It is also called a finite field. The elements of the Galois field GF(p^n) are defined as [3]:

GF(p^n) = {0, 1, 2, ..., p − 1}
        ∪ {p, p + 1, p + 2, ..., p + p − 1}
        ∪ {p^2, p^2 + 1, p^2 + 2, ..., p^2 + p − 1}
        ∪ · · ·
        ∪ {p^{n−1}, p^{n−1} + 1, p^{n−1} + 2, ..., p^{n−1} + p − 1}    (2.1)

where p ∈ P and n ∈ Z+. The order of the field is given by p^n, while p is the characteristic of the field.
2.1.2 Definitions of Galois Field
Definition 1. A finite field {F, +, ·} consists of a finite set F and two operations + and · that satisfy the following properties:

1. ∀a, b ∈ F: a + b ∈ F and a·b ∈ F.
2. ∀a, b ∈ F: a + b = b + a and a·b = b·a.
3. ∀a, b, c ∈ F: a + (b + c) = (a + b) + c and (a·b)·c = a·(b·c).
4. ∀a, b, c ∈ F: a·(b + c) = (a·b) + (a·c).
5. ∃0, 1 ∈ F: a + 0 = 0 + a = a and a·1 = 1·a = a.
6. ∀a ∈ F, ∃ −a ∈ F such that a + (−a) = (−a) + a = 0; ∀a ≠ 0 ∈ F, ∃ a^{−1} ∈ F such that a·a^{−1} = a^{−1}·a = 1.
A finite field with q elements is denoted GF(q). The number of elements in a field is either a prime or a power of a prime. GF(p^m) is the field of p^m elements; it is also called an extension field of GF(p), where p is the characteristic.
The following definitions and properties of finite fields will help in understanding the material presented in this thesis.
Definition 2. The order of a finite field is the number of elements in the field.
Definition 3. A polynomial whose coefficients are elements of GF(p^m) is said to be a polynomial over GF(p^m).

Definition 4. A polynomial over GF(p^m) is irreducible if it cannot be factored into non-trivial polynomials over the same field.

Definition 5. A primitive polynomial is a polynomial F(X) with coefficients in GF(p) which has a root α in GF(p^m) such that {0, 1, α, α^2, α^3, ..., α^{p^m−2}} is the entire extension field GF(p^m), and moreover, F(X) is the smallest-degree polynomial having α as a root.
2.1.3 Properties of Galois Field
The following are some basic properties of Galois field:
1. If F is a Galois field, it contains p^m elements for some prime p and positive integer m ≥ 1. Two fields are isomorphic if they have the same structure, even if their elements are represented differently.

2. A Galois field of order q, where q = p^m and p is a prime number, has characteristic p. Also, GF(q) contains a copy of GF(p) as a subfield; in this case q is called an extension of the field of p elements of degree m.

3. Every subfield of the Galois field of order q has order p^n, where n is a positive divisor of m. Conversely, if n is a positive divisor of m, there is exactly one subfield of GF(q) of order p^n.

(a) A ∈ GF(q) is in the subfield GF(p^n) if and only if A^{p^n} ≡ A.

(b) The multiplicative group of GF(q) consists of the non-zero elements of GF(q).

(c) This group, denoted GF(q)∗, is cyclic of order q − 1, and A^q = A for all A ∈ GF(q). A generator of GF(q)∗ is called a primitive element of GF(q).

4. The multiplicative inverse of A ∈ GF(q) is A^{−1} ≡ A^{q−2}. Alternatively, we can use the extended Euclidean algorithm for polynomials to compute S(α) and T(α) such that S(α)A(α) + T(α)P(α) = 1, where P(x) is an irreducible polynomial of degree m over GF(p). Then A^{−1} = S(α).

5. If A, B ∈ GF(q), with GF(q) a Galois field of characteristic p, then

(A + B)^{p^t} = A^{p^t} + B^{p^t}    (2.2)

for all t ≥ 0.
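Property 4 above (the inverse A^{−1} ≡ A^{q−2}) can be sketched in a few lines of Python. GF(7) is an arbitrary prime field chosen only for this demonstration; it is not a field used elsewhere in the thesis.

```python
def gf_inv(a, q):
    """Multiplicative inverse in GF(q) for prime q, using A^-1 = A^(q-2)."""
    assert a % q != 0, "zero has no multiplicative inverse"
    return pow(a, q - 2, q)  # fast modular exponentiation

# every nonzero element of GF(7) times its inverse gives 1
for a in range(1, 7):
    assert (a * gf_inv(a, 7)) % 7 == 1
```

The same exponentiation rule applies unchanged in an extension field GF(p^m), with the power computed modulo the irreducible polynomial instead of an integer.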
2.1.4 Irreducible Polynomials
In this section, irreducible polynomials are discussed, since they enable efficient implementations of Galois field arithmetic. We note that irreducible polynomials are dispersed throughout the literature, especially when the field characteristic is odd, whereas there is a wealth of published work on fields of characteristic two. Some useful theorems that yield alternative methods to construct irreducible polynomials are also presented [26].
Let F_q denote the finite field with q = p^m elements, for some prime p, and let F_q[x] denote the set of polynomials over F_q. We say that a polynomial P(x) ∈ F_q[x] is irreducible if it cannot be factored into a non-trivial product of lower-degree polynomials in F_q[x].
Definition 6. The order (or exponent, or period) of a non-zero polynomial P(x) ∈ F_q[x] with P(0) ≠ 0 is the smallest positive integer e such that P(x) divides x^e − 1; it is denoted by ord(P) = ord(P(x)). If P(0) = 0, then P(x) = x^h G(x) for some h ∈ N and G(x) ∈ F_q[x] with G(0) ≠ 0, and ord(P) is defined to be ord(G).
Notice that if P(x) is irreducible over F_q[x], then P(0) ≠ 0 by definition. A polynomial P(x) is called primitive if it has degree n and ord(P) = q^n − 1. It is sometimes convenient to define the notion of the index of P(x) as (q^n − 1)/e, where e = ord(P). We denote by I_{q,n} the number of irreducible polynomials of degree n over F_q. Then, it can be shown (see [26], Theorem 3.25) that

I_{q,n} = (1/n) Σ_{d|n} μ(n/d) q^d = (1/n) Σ_{d|n} μ(d) q^{n/d}    (2.3)

where Σ_{d|n} means the summation over all positive integer divisors d of n, and μ(·) is the Moebius function, defined as follows.
Definition 7. The Moebius function μ is the function on N defined by

μ(n) = 1        if n = 1,
       (−1)^k   if n is the product of k distinct primes,
       0        if n is divisible by the square of a prime.    (2.4)
Equation (2.3) can also be used to show that there exists an irreducible polynomial in F_q[x] of degree n for every Galois field F_q and every integer n ∈ N [26].
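As a sanity check, (2.3) and Definition 7 can be evaluated directly. The trial-division factorization below is a simple illustrative choice, not an efficient one.

```python
def mobius(n):
    """Moebius function per Definition 7, via trial-division factorization."""
    if n == 1:
        return 1
    result, m, p = 1, n, 2
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0          # divisible by the square of a prime
            result = -result      # one more distinct prime factor
        p += 1
    if m > 1:
        result = -result          # one remaining prime factor
    return result

def num_irreducible(q, n):
    """I_{q,n} = (1/n) * sum over d|n of mu(d) * q^(n/d), per (2.3)."""
    total = sum(mobius(d) * q ** (n // d) for d in range(1, n + 1) if n % d == 0)
    return total // n
```

For example, `num_irreducible(3, 2)` returns 3, matching the three irreducible quadratics over GF(3): x^2 + 1, x^2 + x + 2, and x^2 + 2x + 2.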
Fields of the form F_{3^{6m}} with 3^{6m} ≥ 2^{1024} have been used in cryptography [2, 4, 5, 12, 18, 32, 39].
Definition 8. Let F ∈ K[x] be of positive degree and E an extension field of K. Then F splits in E if F can be written as a product of linear factors in E[x], that is, if there exist elements α_1, α_2, ..., α_n ∈ E such that

F(x) = a(x − α_1)(x − α_2) · · · (x − α_n),    (2.5)

where a is the leading coefficient of F. The field E is the splitting field of F over K if F splits in E and if, moreover, E = K(α_1, α_2, ..., α_n).
Definition 9. Let F ∈ K[x] be a polynomial of degree n ≥ 2 and suppose that F(x) = a(x − α_1)(x − α_2) · · · (x − α_n) with α_1, α_2, ..., α_n in the splitting field of F over K. Then the discriminant of F is

D(F) = a^{2n−2} ∏_{1≤i<j≤n} (α_i − α_j)^2    (2.6)
It is clear from (2.6) that D(F) = 0 if and only if the polynomial has a multiple root.

Theorem 1. [38] Let F(x) = x^n + a x^k + b ∈ F_q[x], for odd q, n > k > 0, and d = gcd(n, k) with n = n_1 d and k = k_1 d. Then

D(F) = (−1)^{n(n−1)/2} · b^{k−1} · [n^{n_1} · b^{n_1−k_1} − (−1)^{n_1} · (n − k)^{n_1−k_1} · k^{k_1} · a^{n_1}]^d    (2.7)
2.1.5 Addition over Finite Fields
Let a(x) and b(x) be two elements of the Galois field GF(p^m), as follows:

a(x) = Σ_{i=1}^{m} a_{m−i} x^{m−i},  b(x) = Σ_{i=1}^{m} b_{m−i} x^{m−i}    (2.8)

Then their addition over GF(p^m) is [22]:

a(x) + b(x) = c(x) = Σ_{i=1}^{m} c_{m−i} x^{m−i}    (2.9)

where

c_i ≡ (a_i + b_i) mod p,  i = 0, 1, ..., m − 1.    (2.10)
Addition of field elements is straightforward and is performed coefficient-wise, thus requiring only s word operations, where s is the number of words used to store a field element.
Algorithm 1 Addition in F_{2^m}
Input: binary polynomials a(x) and b(x) of degree at most m − 1.
Output: c(x) = a(x) + b(x).
for i from 0 to m − 1 do
    C[i] ← A[i] ⊕ B[i]
end for
return c
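The coefficient-wise addition in (2.9)-(2.10) can be sketched as follows. The coefficient ordering [a_0, ..., a_{m−1}] is an arbitrary convention chosen for the demonstration.

```python
def gf_add(a, b, p):
    """Add two GF(p^m) elements coefficient-wise per (2.10).
    a, b are length-m coefficient lists [a_0, ..., a_{m-1}] over GF(p)."""
    return [(x + y) % p for x, y in zip(a, b)]

# for p = 2 this reduces to the bitwise XOR of Algorithm 1
assert gf_add([1, 0, 1], [1, 1, 0], 2) == [0, 1, 1]
# over GF(3): (2+1, 1+2, 0+2) mod 3 = (0, 0, 2)
assert gf_add([2, 1, 0], [1, 2, 2], 3) == [0, 0, 2]
```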
2.1.6 Multiplication over Finite Fields
Recall that for every prime p and any positive integer m ≥ 1 we fix an irreducible polynomial

f(x) = x^m + Σ_{i=1}^{m} f_{m−i} x^{m−i}    (2.11)
Multiplication over GF(p^m) is more complicated [22]. To calculate the product of two elements, we first introduce g(x) as follows:

g(x) = a(x)·b(x) = g_{2m−2} x^{2m−2} + g_{2m−3} x^{2m−3} + · · · + g_2 x^2 + g_1 x + g_0    (2.12)

where a(x) and b(x) are as shown in Equation (2.8) and the g_i are as follows:

g_0 ≡ (a_0·b_0) mod p,
g_1 ≡ (a_1·b_0 + a_0·b_1) mod p,
g_2 ≡ (a_2·b_0 + a_0·b_2 + a_1·b_1) mod p,
...
g_{2m−3} ≡ (a_{m−1}·b_{m−2} + a_{m−2}·b_{m−1}) mod p,
g_{2m−2} ≡ (a_{m−1}·b_{m−1}) mod p.    (2.13)

The result of multiplication in GF(p^m) is then:

h(x) ≡ g(x) mod f(x).    (2.14)
Algorithm 2 Right-to-left shift-and-add multiplication
Input: binary polynomials a(x) and b(x) of degree at most m − 1.
Output: c(x) = a(x)·b(x) mod f(x).
if a_0 = 1 then c ← b else c ← 0 end if
for i from 1 to m − 1 do
    b ← b·x mod f(x)
    if a_i = 1 then c ← c + b end if
end for
return c

Algorithm 2 is based on the observation that iteration i computes x^i b(x) mod f(x) and adds it to the accumulator c if a_i = 1:
a(x)·b(x) = a_{m−1} x^{m−1} b(x) + · · · + a_2 x^2 b(x) + a_1 x b(x) + a_0 b(x).    (2.15)
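Algorithm 2 can be sketched compactly using integer bitmasks (bit i holding the coefficient of x^i); GF(2^3) with f(x) = x^3 + x + 1 is an arbitrary small field chosen for the demonstration.

```python
def gf2m_mul(a, b, f, m):
    """Right-to-left shift-and-add multiplication in GF(2^m) (Algorithm 2).
    a, b are coefficient bitmasks; f is the reduction polynomial bitmask,
    including its x^m term."""
    c = b if (a & 1) else 0          # step 1: c <- b when a_0 = 1
    for i in range(1, m):
        b <<= 1                      # b <- b * x
        if (b >> m) & 1:             # degree reached m: reduce by f
            b ^= f
        if (a >> i) & 1:
            c ^= b                   # c <- c + b when a_i = 1
    return c

# GF(2^3) with f = x^3 + x + 1:  x * x^2 = x^3 = x + 1
assert gf2m_mul(0b010, 0b100, 0b1011, 3) == 0b011
```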
The multiplication operation over finite fields plays a central role in accelerating reverse engineering of genetic networks and other finite field applications. Consequently, designing efficient multipliers is essential.
Montgomery multiplication, a multiplication method used in cryptosystems, was first proposed for efficient integer modular multiplication [30]. It was later extended to multiplication over finite fields of characteristic 2 [21].
Let f(x) be an irreducible polynomial that defines the field GF(2^m), and let r(x) be a fixed element in GF(2^m) such that gcd(f(x), r(x)) = 1. Then the extended Euclidean algorithm can be used to determine f(x)^{−1} and r(x)^{−1} that satisfy

r(x)·r(x)^{−1} + f(x)·f(x)^{−1} = 1    (2.16)

where r(x)^{−1} is the inverse of r(x). Given two field elements a(x), b(x) ∈ GF(2^m), the Montgomery multiplication is given by

c(x) = a(x) b(x) r(x)^{−1} mod f(x)    (2.17)
The efficiency of the Montgomery multiplier depends on the fixed field element r(x). Implementation of this multiplier on FPGAs is discussed in detail in [28].
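A bit-serial sketch of (2.17) follows, using the common choice r(x) = x^m (an assumption; any r coprime to f works). It relies on f_0 = 1, which holds for any irreducible f, and is not the specific architecture of [28].

```python
def gf2m_montgomery(a, b, f, m):
    """Bit-serial Montgomery product a(x) b(x) r(x)^-1 mod f(x) in GF(2^m),
    with r(x) = x^m. a, b, f are coefficient bitmasks; f includes x^m."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b                   # accumulate a_i * b
        if c & 1:
            c ^= f                   # make c divisible by x (uses f_0 = 1)
        c >>= 1                      # exact division by x
    return c

# GF(2^3), f = x^3 + x + 1:  (x+1) * 1 * x^-3 = 1, since x^3 mod f = x + 1
assert gf2m_montgomery(0b011, 0b001, 0b1011, 3) == 0b001
```

Dividing by x each iteration keeps every intermediate result within m bits, which is the attraction of the method in hardware.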
Finite field multiplication consists of a polynomial multiplication followed by a modular reduction. Many researchers have observed that reducing the computation can improve multiplier performance; some proposed combining the two steps and computing them simultaneously, while others precompute the first step.
Systolic array architectures are useful for speeding up computations by exploiting bit-level parallelism and pipelining. Finite field multiplication using systolic architectures has been presented in [7, 23].
For a serial-parallel implementation of a finite field multiplier, a semi-systolic linear array is used, as presented in [14]: the polynomial multiplication is computed in a serial design, while the reduction is performed by a bidirectional modulo reduction technique.
Another approach was developed by Mastrovito in [27] for a standard basis multiplier, where the multiplication C = A·B is performed by means of the matrix-vector product c⃗ = Z b⃗, where c⃗ and b⃗ are the component vectors of C and B, and Z is a matrix whose elements are obtained by XOR operations over some components of A; by computing Z, the reduction step is precomputed.
Due to its ability to reduce time and space complexity, this multiplier has been broadly used. Given that the number of operations in the multiplier is determined by the irreducible polynomial which defines the field, some researchers proposed architectures based on specific irreducible polynomials [1, 37], while other variants of multipliers are based on the Mastrovito matrix [15, 34, 36].
2.2 Arithmetic over GF(3)
The coefficients a_i, b_j ∈ GF(3) and c_k take the values 0, 1, 2, which are represented by two bits each. The truth table for trit modular multiplication is shown in Table 2.1. Since each trit is represented by two bits, the value 3 could occur due to glitches or system errors and must be considered in the presented operations. This explains why we use arithmetic over 0, 1, 2, and 3 in GF(3) in the presented tables.
Table 2.1: Multiplication table in GF(3) for c_k = a_i·b_j.

            b1b0
a1a0    00  01  10  11
 00     00  00  00  00
 01     00  01  10  00
 10     00  10  01  00
 11     00  00  00  00
Karnaugh map reduction of Table 2.1 produces the following expressions for c_0 and c_1, where x′ denotes the complement of bit x:

c_0 = a_1′a_0·b_1′b_0 + a_1 a_0′·b_1 b_0′    (2.18)

c_1 = a_1′a_0·b_1 b_0′ + a_1 a_0′·b_1′b_0    (2.19)
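The sum-of-products expressions above can be sanity-checked exhaustively against mod-3 arithmetic; the complement x′ is written as (1 − x) in this sketch.

```python
def trit_mul_bits(a1, a0, b1, b0):
    """Two-bit trit product per (2.18)-(2.19); complement x' written as 1 - x."""
    c0 = ((1 - a1) & a0 & (1 - b1) & b0) | (a1 & (1 - a0) & b1 & (1 - b0))
    c1 = ((1 - a1) & a0 & b1 & (1 - b0)) | (a1 & (1 - a0) & (1 - b1) & b0)
    return (c1, c0)

# compare against mod-3 arithmetic for the legal encodings 0->00, 1->01, 2->10
enc = {0: (0, 0), 1: (0, 1), 2: (1, 0)}
for x in range(3):
    for y in range(3):
        assert trit_mul_bits(*enc[x], *enc[y]) == enc[(x * y) % 3]
```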
Table 2.2: Addition table in GF(3) for c_k = a_i + b_j.

            b1b0
a1a0    00  01  10  11
 00     00  01  10  00
 01     01  10  00  01
 10     10  00  01  10
 11     00  01  10  00
The truth table for the trit modular addition operation c_k = a_i + b_j is shown in Table 2.2.
Karnaugh map reduction of Table 2.2 produces the following expressions for c_0 and c_1, where x′ denotes the complement of bit x:

c_0 = (a_1′a_0′ + a_1 a_0)·b_1′b_0 + (b_1′b_0′ + b_1 b_0)·a_1′a_0 + a_1 a_0′·b_1 b_0′    (2.20)

c_1 = (a_1′a_0′ + a_1 a_0)·b_1 b_0′ + (b_1′b_0′ + b_1 b_0)·a_1 a_0′ + a_1′a_0·b_1′b_0    (2.21)
Figure 3.2 in the sequel shows that multiplication is followed by addition in GF(3) in the presented hardware. Reference [10] merged the multiply and add operations as a unified multiply/accumulate:

d = a b + c    (2.22)

where a, b, c, d ∈ Z. The resulting multiply/accumulate hardware was four times the speed of a multiplication followed by a double-precision addition. Therefore, in the following section we investigated merging the multiply and add operations in GF(3) to explore any potential speedup. The truth tables for d_0 and d_1 in (2.22) would have six inputs.
Chapter 3
Progressive Product Reduction Technique
3.1 Problem Formulation
A finite field GF(3^m) can be defined using an irreducible polynomial:

Q(x) = x^m + q_{m−1} x^{m−1} + · · · + q_1 x + q_0    (3.1)

where q_i ∈ GF(3) for 0 ≤ i < m and q_0 ≠ 0. In this thesis we consider trinomial irreducible polynomials of the form:

Q(x) = x^m + q_k x^k + q_0    (3.2)

where q_k, q_0 ∈ GF(3) and q_0 ≠ 0. Two GF(3^m) field polynomials of this form are given by the parameter values (m, k, q_k, q_0) = (23, 3, 2, 1) and (31, 5, 2, 1) [6]. This motivated several systolic implementations using these polynomials.
The two field elements A and B to be multiplied are represented by the polynomials:
A = Σ_{i=0}^{m−1} a_i α^i    (3.3)

B = Σ_{j=0}^{m−1} b_j α^j    (3.4)

Since α is a root of Q(x), we can write Q(α) using (3.2) in the form:

Q(α) = α^m + q_k α^k + q_0 = 0    (3.5)

or

α^m = −q_k α^k − q_0    (3.6)
After the modulo operation, the product C will be m trits long and is given by:

C = A × B mod Q(α) = [Σ_{i=0}^{m−1} Σ_{j=0}^{m−1} a_i b_j α^{i+j}] mod Q(α)    (3.7)
  = Σ_{k=0}^{m−1} c_k α^k    (3.8)

where c_k represents trit k of the product C. It is not practical to perform the modulo operation on the polynomial in (3.7), whose degree is 2m − 2. Since the modulo operation is distributive, we can write (3.7) in the form:

C = Σ_{i=0}^{m−1} b_i α^i A mod Q(α)    (3.9)
  = Σ_{i=0}^{m−1} [C_i mod Q(α)]    (3.10)
Note from (3.9) or (3.10) that each partial product polynomial C_i is of degree m + i − 1. More specifically, the partial product polynomial C_i is given by:

C_i = b_i α^i A mod Q(α)    (3.11)

Each partial product polynomial C_i is of degree m + i − 1 and must be reduced before the addition operation is performed, which is not very practical. Since the addition operation is associative, we instead iteratively perform the reduction operation on the different powers of the partial products C_i in (3.11).
3.2 Mathematical Representation
Equation (3.9) or (3.10) can be converted using (3.11) into an iteration over decreasing powers of the summation index i. Our strategy is to use Theorem 2 in the sequel to reduce the degree of term C_i by one so we can add it to the adjacent lower-degree term C_{i−1}, in the following form:

C = ⟨C_0 + α⟨C_1 + α⟨C_2 + · · · + α⟨C_{m−2} + αC_{m−1}⟩ · · · ⟩⟩⟩    (3.12)

where

⟨C_i + αC_{i+1}⟩ ≡ C_i + [αC_{i+1} mod Q(α)]    (3.13)

The expression in (3.12) can be expressed iteratively as

c^{(m)} = 0    (3.14)

c^{(i)} = ⟨C_i + α c^{(i+1)}⟩,  0 ≤ i < m    (3.15)

C = c^{(0)}    (3.16)

where c^{(i)} is the intermediate value of the iteration at step i. Notice from (3.15) that we reduce the polynomial α c^{(i+1)}, of degree m, to another one of degree m − 1.
The theorem below shows how a polynomial of degree i + 1 ≥ m can be reduced to a polynomial of degree i using the irreducible polynomial Q(α) in (3.5).
Theorem 2. Assume an m-term polynomial R with degree l + m − 1, of a form similar to one of the partial products in (3.7) or (3.8):

R(α) = α^l Σ_{i=0}^{m−1} r_i α^i    (3.17)

The degree of this polynomial can be reduced by one using (3.6).

Proof. We can write R(α) in (3.17) in the form:

R(α) = α^{l−1} Σ_{i=0}^{m−1} r_i α^{i+1}    (3.18)

The expression for α^m in (3.6) is used in (3.18) to get:

R(α) = α^{l−1} [r_{m−1}(−q_k α^k − q_0) + Σ_{i=0}^{m−2} r_i α^{i+1}]    (3.19)

Thus the degree of the input polynomial R(α) has been reduced by one, as required.
From (3.14)-(3.16) and (3.19), the bit-level iterations of our PPR algorithm are defined as:

c^{(m)}_j = 0    (3.20)

c^{(i)}_j = c^{(i+1)}_{j−1} + b_i a_j,  j ≠ 0, k    (3.21)

c^{(i)}_j = c^{(i+1)}_{j−1} + b_i a_j − q_0 c^{(i+1)}_{m−1}  for j = 0    (3.22)

c^{(i)}_j = c^{(i+1)}_{j−1} + b_i a_j − q_k c^{(i+1)}_{m−1}  for j = k    (3.23)

c_j = c^{(0)}_j    (3.24)

where c^{(i+1)}_{−1} = 0.
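The iterations (3.20)-(3.24) can be sketched directly in software and checked against a schoolbook golden model. The trinomial Q(x) = x^9 + 2x^4 + 1 below only mirrors the running-example size (m = 9, k = 4) from the figures; its irreducibility is not checked here, and the equivalence of the two routines holds for any monic trinomial.

```python
def ppr_multiply(a, b, m, k, qk, q0, p=3):
    """Progressive Product Reduction per (3.20)-(3.24) over GF(p).
    a, b: length-m coefficient lists; Q(x) = x^m + qk*x^k + q0."""
    c = [0] * m                          # c^(m) = 0, per (3.20)
    for i in range(m - 1, -1, -1):       # fold in b_i from high to low
        nxt = [0] * m
        for j in range(m):
            t = (c[j - 1] if j > 0 else 0) + b[i] * a[j]
            if j == 0:
                t -= q0 * c[m - 1]       # (3.22): fold alpha^m into alpha^0
            elif j == k:
                t -= qk * c[m - 1]       # (3.23): fold alpha^m into alpha^k
            nxt[j] = t % p
        c = nxt
    return c                             # c^(0), per (3.24)

def golden_multiply(a, b, m, k, qk, q0, p=3):
    """Golden model: schoolbook product, then reduce via x^m = -qk*x^k - q0."""
    g = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            g[i + j] = (g[i + j] + a[i] * b[j]) % p
    for d in range(2 * m - 2, m - 1, -1):    # reduce high powers downward
        g[d - m + k] = (g[d - m + k] - qk * g[d]) % p
        g[d - m] = (g[d - m] - q0 * g[d]) % p
        g[d] = 0
    return g[:m]

# spot-check for m = 9, k = 4 and the assumed trinomial Q = x^9 + 2x^4 + 1
a = [1, 2, 0, 1, 0, 2, 1, 0, 1]
b = [2, 1, 1, 0, 2, 0, 0, 1, 2]
assert ppr_multiply(a, b, 9, 4, 2, 1) == golden_multiply(a, b, 9, 4, 2, 1)
```

Note how the reduction in (3.22)-(3.23) is interleaved with the accumulation, so every intermediate c^{(i)} stays m trits long, which is what makes the dependence graph in the sequel regular.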
3.3 Parallelizing the PPR Technique
Previous literature developed hardware implementations for a given iterative algorithm in an ad hoc fashion, based on observing the inter-data dependencies. We developed a new approach based on the index dependencies of the iteration variables. This approach performs two basic operations: scheduling the tasks and projecting the tasks onto different Processing Elements (PEs) or software threads [13].
The iterations in (3.20)-(3.24) define an alternative algorithm for calculating the finite field multiplication. The algorithm has two input variables A and B, intermediate variables c^{(i)}_j, and an output variable C.
Figure 3.1: Dependence graph of the PPR algorithm for m = 9 and k = 4.
Input trits b_i are shown as the horizontal lines in Figure 3.1. The intermediate variables c^{(i)}_j are shown by the diagonal lines. The output variable C is obtained at the top row, as shown.
Figure 3.2: Dependence graph cell details. (a) Details of the red cells, where c_{m−1} is the feedback and the box labeled D is a 1-trit flip-flop with clear and load control inputs. (b) Details of the blue cells, where we multiply a_in and b_in and add the result to c_in.
3.3.1 Scheduling Function Design for PPR Technique
Scheduling uses an affine scheduling function such that a point p = [i j]^t ∈ D is assigned a time value n(p) given by:

n(p) = s p − γ    (3.25)
     = iα + jβ − γ    (3.26)

where s = [α β] is the scheduling vector and γ is a scalar constant. The scheduling function assigns a time index value to each node in the graph. Furthermore, several points in the graph will have the same time index value, depending on the choice of s. This indicates that the computational tasks associated with these points are to be executed at the same time. Of course, we must be certain that our choice of s satisfies the algorithm timing constraints and data dependencies. A detailed discussion of this point can be found in [13].
Thus the data moving between the nodes are now governed by a time relationship. Therefore we can argue that the scheduling function converts the dependence graph D into a Directed Acyclic Graph (DAG).
The iterations in (3.20)-(3.24) pose two restrictions on our choice of s. First, the iterative calculation of c^{(i)}_j in (3.21) at point (i, j) must be executed after the task at point (i + 1, j − 1) is completed. This can be written in vector form as:

[α β][i j]^t > [α β][i + 1  j − 1]^t    (3.27)

This results in the inequality:

α < β    (3.28)

Another timing restriction stems from (3.22): point (i, 0) can only proceed after point (i + 1, m − 1) has been evaluated:

[α β][i 0]^t > [α β][i + 1  m − 1]^t    (3.29)

This results in the inequality:

α < −(m − 1)β    (3.30)
The above restrictions result in a scheduling function of the form:

n(p) = s p − γ = m − i − 1    (3.31)

s = [−1 0]    (3.32)

γ = 1 − m    (3.33)
Figure 3.3: Node timing for the PPR algorithm using scheduling function s for m = 9 and k = 4.
Figure 3.3 shows the node timing for the PPR algorithm using the scheduling function s for m = 9 and k = 4. The grey boxes indicate equitemporal regions: the nodes in each region execute at the same time. The numbers on the right of the figure indicate the time steps. The PPR technique requires m time steps to complete.
3.3.2
Projection Function Design for PPR Technique
In this section, we discuss how we can map our two-dimensional dependence graph D ⊂ Z2 to a one-dimensional space D̄ ⊂ Z. In other words, we map the nodes in D to nodes in D̄, which correspond to our 1-D systolic or processor array.
The mapping is accomplished using the affine mapping function [13]:
p̄ = Pp − δ (3.34)
where P is the projection row vector and δ is some scalar value. P can be obtained through our choice of projection direction d, where d is a column vector orthogonal to P. Detailed explanation of these concepts can be found in [13].
A restriction on choosing d is proposed in [13]:
sd ≠ 0 (3.35)
Given the scheduling function s, we can derive three associated projection directions that satisfy (3.35):
d1 = [1 0]t (3.36)
d2 = [1 1]t (3.37)
d3 = [1 −1]t (3.38)
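Restriction (3.35) is easy to verify for the three candidate directions; a small sketch (Python, not part of the thesis toolchain):

```python
# Sketch: check that each candidate projection direction d satisfies
# the restriction (3.35), s . d != 0, for s = [-1, 0].
s = (-1, 0)
d1, d2, d3 = (1, 0), (1, 1), (1, -1)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

for d in (d1, d2, d3):
    # A zero dot product would project equitemporal nodes onto one PE,
    # forcing that PE to execute several tasks in the same time step.
    assert dot(s, d) != 0

print([dot(s, d) for d in (d1, d2, d3)])  # -> [-1, -1, -1]
```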
3.4
Design Space Exploration for PPR Technique
The following subsections discuss the different designs associated with each choice of d.
3.4.1
Design #1: using s and d1
In this case, the projection matrix P1 associated with d1 is given by:
P1 = [0 1]
A point p = [i j]t ∈ D will be mapped by the projection onto the point:
p̄ = P1p = j (3.39)
Thus all nodes in a column map to a single PE, and the nodes in a column execute at different time steps due to our choice of the scheduling function. Figure 3.4 shows the hardware details for Design #1.
Figure 3.4: Design #1 for m = 9 and k = 4. (a) Semi-systolic array. (b) PEj details when j ≠ 0 or k. (c) PEj details when j = 0, k.
Figure 3.4(a) shows the resulting semi-systolic array when m = 9 and k = 4. Communication between adjacent PE's requires only a one-trit line for transmitting c_j^(i), while the multiplier trit a_j is stored locally in PEj. The multiplicand trit b_i is broadcast to the PE's. The feedback signal f is obtained from the output of PEm−1 at each clock. This signal is fed back to the two processors PE0 and PEk. Figure 3.4(b) shows the details of PEj when j ≠ 0, k: register D1 stores the multiplier trit a_j and register D2 stores the output signal c_j^(i). Figure 3.4(c) shows the details of PEj when j = 0 or j = k: register D1 stores the multiplier trit a_j, register D2 stores the output signal c_j^(i), and register D3 stores −q0 when j = 0 or −qk when j = k.
It is best to summarize the operation of each PEj (0 ≤ j < m) for Design #1.
1. At time n = 0, the registers D2 in Figure 3.4(b) or 3.4(c) accumulating c_j^(i) are cleared to 0.
2. At time n ≥ 0, the input multiplier trit a_j is fed to PEj and the input multiplicand trit b_i is
broadcast to all PE’s.
3. The feedback signal f is obtained from the output of PEm−1. This signal is fed back
to the two processors PE0 and PEk.
4. At time n = m − 1, output product trit cj is obtained from PEj.
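The steps above can be captured in a behavioral model (a sketch, not the thesis's Matlab code or a gate-level model). It assumes a monic reduction trinomial P(x) = x^m + q_k x^k + q_0 over GF(3); the coefficient values q0 = 1, qk = 2 below are illustrative choices, not taken from the thesis, and the accumulation recurrence is inferred from the feedback description:

```python
# Behavioral sketch of Design #1: m PEs hold the partial product c;
# at each step the multiplicand trit b_i is broadcast, each PE_j
# accumulates a_j * b_i, and the feedback trit f from PE_{m-1} is
# absorbed by PE_0 and PE_k through the stored constants -q0, -qk.
# Assumes P(x) = x^m + q_k x^k + q_0 over GF(3) (illustrative q's).

def ppr_multiply(a, b, m, k, q0, qk):
    """Compute C(x) = A(x)*B(x) mod P(x) over GF(3), MSB-first."""
    c = [0] * m
    for i in range(m - 1, -1, -1):       # time step n = m - 1 - i
        f = c[m - 1]                      # feedback from PE_{m-1}
        c = [0] + c[:-1]                  # shift: c <- x * c
        c[0] = (c[0] - q0 * f) % 3        # PE_0 absorbs -q0 * f
        c[k] = (c[k] - qk * f) % 3        # PE_k absorbs -qk * f
        for j in range(m):                # broadcast b_i to all PEs
            c[j] = (c[j] + a[j] * b[i]) % 3
    return c

def direct_mod_mult(a, b, m, k, q0, qk):
    """Reference: schoolbook product followed by reduction by P(x)."""
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] = (prod[i + j] + a[i] * b[j]) % 3
    for d in range(2 * m - 2, m - 1, -1):  # x^m -> -qk x^k - q0
        f = prod[d]
        prod[d] = 0
        prod[d - m + k] = (prod[d - m + k] - qk * f) % 3
        prod[d - m] = (prod[d - m] - q0 * f) % 3
    return prod[:m]

m, k, q0, qk = 9, 4, 1, 2                 # illustrative trinomial choice
a = [1, 0, 2, 1, 0, 0, 2, 1, 2]
b = [2, 1, 0, 0, 1, 2, 0, 2, 1]
assert ppr_multiply(a, b, m, k, q0, qk) == direct_mod_mult(a, b, m, k, q0, qk)
```

Both routines return the same m coefficients for any inputs, since the interleaved feedback reduction and the final reduction apply the same substitution x^m ≡ −q_k x^k − q_0.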
3.4.2
Design #2: using s and d2
In this case, the projection matrix P2 associated with d2 is given by:
P2 = [1 −1]
A point p = [i j]t ∈ D will be mapped by the projection onto the point:
p̄ = P2p = i − j (3.40)
The resulting processor array corresponding to the projection matrix P2 consists of 2m − 1 PE's. Examination of Figure 3.3 reveals that only m PE's are active at a given time step. Thus this projection direction results in PE's that are not well utilized. To improve PE utilization, we reduce the number of processors using a nonlinear mapping operator:
p̄ = P2p mod m = (i − j) mod m (3.41)
Based on our choice of s and P2, the activity of the processors is illustrated in Figure 3.5.
Figure 3.5: Processor activity for Design #2 for m = 9 and k = 4 based on (3.41).
Examination of Figure 3.5 reveals that each PE communicates with its next-to-nearest neighbour, i.e., PEk communicates with PEk±2. To circumvent this problem, we modify our nonlinear projection operation in (3.41) to the following form:
p̄ = (⌊m/2⌋ (i − j)) mod m (3.42)
Figure 3.6: Processor activity for Design #2 for m = 9 and k = 4 based on (3.42).
It is noticed that the communication between PEs is now neighbour-to-neighbour. Applying the above nonlinear projection operation produces the processor array shown in Figure 3.7(a).
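The effect of the ⌊m/2⌋ factor can be checked directly: the c-trit produced at DG node (i + 1, j − 1) is consumed at (i, j), so under (3.41) the producing and consuming PE indices always differ by 2, while under (3.42) they differ by 1 (for odd m, since 2⌊m/2⌋ ≡ −1 mod m). A small sketch (Python, for illustration only):

```python
# Sketch: compare PE communication strides under the two nonlinear
# projections for m = 9. The c-dependency runs (i+1, j-1) -> (i, j).
m = 9

def pe_341(i, j):                 # p = (i - j) mod m            (3.41)
    return (i - j) % m

def pe_342(i, j):                 # p = floor(m/2)*(i - j) mod m (3.42)
    return ((m // 2) * (i - j)) % m

def strides(pe):
    """Distinct cyclic PE distances between producer and consumer."""
    out = set()
    for i in range(m - 1):
        for j in range(1, m):
            d = (pe(i, j) - pe(i + 1, j - 1)) % m
            out.add(min(d, m - d))
    return out

assert strides(pe_341) == {2}     # next-to-nearest-neighbour links
assert strides(pe_342) == {1}     # nearest-neighbour links
```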
Figure 3.7: Design #2 for m = 9 and k = 4. (a) Semi-systolic array. (b) PE details.
Figure 3.7(a) shows the resulting semi-systolic array when m = 9 and k = 4, where the multiplicand trit b_i is broadcast to the PE's. The feedback signal f is obtained and broadcast at each clock. Figure 3.7(b) shows the PE details: shift register D1 stores the multiplier trit a_j, register D2 stores the output signal c_j^(i), register D3 stores −q0, and register D4 stores −qk.
3.4.3
Design #3: using s and d3
In this case, the projection matrix P3 associated with d3 is given by:
P3 = [1 1]
A point p = [i j]t ∈ D will be mapped by the projection onto the point:
p̄ = P3p = i + j (3.43)
The resulting processor array corresponding to the projection matrix P3 consists of 2m − 1 PE's. As in Design #2, we reduce the number of processors using a nonlinear mapping operator:
p̄ = P3p mod m = (i + j) mod m (3.44)
The activity of the processors is illustrated in Figure 3.8, where the numbers inside the circles indicate the PE index.
Figure 3.8: Processor activity for Design #3 for m = 9 and k = 4.
Applying the above nonlinear projection operation produces the processor array shown in Figure 3.9(a).
Figure 3.9: Design #3 for m = 9 and k = 4. (a) Semi-systolic array. (b) PE details.
Figure 3.9(b) shows the PE details. Register D1 stores the multiplier trit a_j, register D2 stores the output signal c_j^(i), register D3 stores −q0, and register D4 stores −qk. At certain times, the partial product output from D2 is used to drive the feedback signal f using the tri-state buffer.
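One property of this projection, consistent with the local D2 accumulation described above (a derived observation, not stated explicitly in the text), is that under p̄ = (i + j) mod m the c-dependency (i + 1, j − 1) → (i, j) maps producer and consumer to the same PE, so each partial product accumulates locally:

```python
# Sketch: under the Design #3 projection p = (i + j) mod m, the
# c-update dependency (i+1, j-1) -> (i, j) stays inside one PE,
# so each partial product c_j^(i) accumulates locally in D2.
m = 9
pe = lambda i, j: (i + j) % m
for i in range(m - 1):
    for j in range(1, m):
        assert pe(i, j) == pe(i + 1, j - 1)   # (i+j) == (i+1)+(j-1)
print("c-updates are PE-local for Design #3")
```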
Chapter 4
Complexity Analysis
4.1
Performance Modeling
The systolic architectures for modular multiplication over GF (3m) discussed in Section 3.3 contain the following elements: multipliers and adders over GF (3), flip-flops, inverters, MUXs, and tri-state buffers. Therefore, complexity analyses were developed here for the area and delay of these components. Our analyses are based on the following assumptions:
1. Static CMOS technology is used.
2. All logic modules will be implemented in terms of minimum-area, two-input NAND gates.
3. The pMOS transistors are adjusted so as to give equal rise and fall times.
4. Component areas are normalized relative to the area of our prototype NAND gate.
5. Component delays are normalized relative to the delay of our prototype NAND gate.
6. The setup and hold delays of the edge-triggered flip-flops, as well as clock skew, are ignored relative to the gate delays used here.
The areas and delays of the system components are obtained based on the designs of [11, 17] along with our assumptions in this section and Section 2.2. Table 4.1 summarizes the areas and delays of the components to be used in the sequel.
Table 4.1: Components comparison with respect to normalized area and delay.
Component            Area    Delay
Multiplication       14.0    5.0
Addition             25.0    7.0
Multiply/Accumulate  52.5    9.0
Inverter              0.5    1.0
MUX                   3.5    3.0
Flip-flop             9.0   10.0
Tri-state buffer      0.5    1.0
We notice from the table that the merged multiply/accumulate structure has more area than the combined sum of the adder and multiplier (52.5 versus 39.0). On the other hand, the multiply/accumulate module has less delay than a multiplier followed by an adder (9.0 versus 12.0).
4.2
Clock Duration
The clock cycle for the proposed designs is calculated by the following equation:
τClock = max(τc, τp) (4.1)
where τc is the communication delay and τp is the processing delay. From equation (4.1), we can discuss the two types of delay as follows:
1. τc represents the delay of the bus used to broadcast the trit b_i to all the PE's, and is calculated by the following equation [11]:
τc = (1/2) τd · m(m − 1)/2 (4.2)
where τd = rc, r is the resistance of the bus, and c is the capacitance of the bus.
2. τp is represented by the sum of the individual delays of every element in the PE's critical path.
4.3
Area and Delay Analysis
The area and delay complexities of the three proposed designs can be determined from Figures 3.4, 3.7 and 3.9. In this section, a comparison is made between the proposed designs and other published designs.
First, the design by Byoung Hee Yoon, Sung Han, Young-Hee Choi, Jong-Hak Hwang, Hyeon-Kyeong Seong, and Heung Soo Kim [41], who proposed a parallel input/output modulo multiplier applying the AOTP multiplicative algorithm over GF (3m), using ternary multiplication followed by ternary addition built with DLC and pass-transistor logic. The gate level of the design was reviewed carefully, taking into consideration all of the authors' assumptions, and the logic gates needed to implement this design were counted.
Second, the design by Ilker Yavuz, Siddika Berna Ors Yalcin, and Çetin Kaya Koç [40], who designed an elliptic curve processor over GF (3m) using a systolic Montgomery multiplication algorithm. By applying the bit-level computation as Koç and Acar show in Section 4 of [21], the gate level of their design was obtained so that it can be compared to our proposed designs under the same performance modeling assumptions.
Table 4.2 compares the complexities of our designs and the two other published designs.
Table 4.2: Comparison between different GF (3m) multipliers.
Design       Area        Delay
Design [41]  56(m + 1)   28(m + 1)
Design [40]  249m        61m
Design #1    57m + 96    29m
Design #2    121.5m      43m
Design #3    121.5m      43m
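The entries of Table 4.3 follow by evaluating these expressions at m = 9; a quick numerical check (Python, for illustration only):

```python
# Sketch: evaluate the area/delay expressions of Table 4.2 at m = 9
# to reproduce the entries of Table 4.3.
m = 9
designs = {
    "Design [41]": (56 * (m + 1), 28 * (m + 1)),
    "Design [40]": (249 * m, 61 * m),
    "Design #1": (57 * m + 96, 29 * m),
    "Design #2": (121.5 * m, 43 * m),
    "Design #3": (121.5 * m, 43 * m),
}
assert designs["Design [41]"] == (560, 280)
assert designs["Design [40]"] == (2241, 549)
assert designs["Design #1"] == (609, 261)
assert designs["Design #2"] == (1093.5, 387)
assert designs["Design #3"] == (1093.5, 387)
```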
4.4
Results
In this section, a comparison takes place using the previous complexity analysis. It shows that the proposed Design #1 performs better than design [40] from both the area and delay perspectives. Proposed Design #1 also performs better than design [41] from the delay perspective, while design [41] performs better from the area perspective. The comparison is given in Table 4.3.
Table 4.3: Comparison between different GF (3m) multipliers at m=9.
Design       Area     Delay
Design [41]   560      280
Design [40]  2241      549
Design #1     609      261
Design #2    1093.5    387
Design #3    1093.5    387
4.5
Verification
In this section, verification was performed to ensure that the dependency graph, the three proposed designs, and the golden code have the same functionality. The golden code was implemented using built-in Matlab functions.
4.5.1
Golden Code
The golden code is based on two main Matlab built-in functions.
1. gfconv - performs the basic convolution operation between vectors a and b over the Galois field GF(p).
2. gfdeconv - performs the basic deconvolution operation between vector c (the convolution result) and Q (the proposed trinomial) over the Galois field GF(p).
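The same golden-model check can be sketched without the Matlab toolbox. The following Python stand-ins for gfconv and gfdeconv (an assumption for illustration, not the thesis code) convolve two coefficient vectors over GF(p) and reduce by a trinomial Q; the m = 9, k = 4 trinomial below is an illustrative choice:

```python
# Hedged Python stand-ins for Matlab's gfconv / gfdeconv over GF(p).
# gfconv ~ polynomial multiplication (convolution) mod p;
# gfdeconv ~ polynomial division, used here for the remainder mod Q.
# Vectors list coefficients in ascending powers, as in Matlab.

def gfconv(a, b, p):
    """Convolution of coefficient vectors a and b over GF(p)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % p
    return out

def gfdeconv_rem(c, q, p):
    """Remainder of c divided by q (nonzero leading coeff) over GF(p)."""
    r = list(c)
    inv_lead = pow(q[-1], p - 2, p)          # Fermat inverse of lead coeff
    for d in range(len(r) - 1, len(q) - 2, -1):
        f = (r[d] * inv_lead) % p
        base = d - (len(q) - 1)
        for t, qt in enumerate(q):           # cancel the degree-d term
            r[base + t] = (r[base + t] - f * qt) % p
    return r[: len(q) - 1]

# Illustrative m = 9, k = 4 trinomial Q(x) = x^9 + 2x^4 + 1 over GF(3).
p = 3
Q = [1, 0, 0, 0, 2, 0, 0, 0, 0, 1]
a = [1, 0, 2, 1, 0, 0, 2, 1, 2]
b = [2, 1, 0, 0, 1, 2, 0, 2, 1]
c = gfdeconv_rem(gfconv(a, b, p), Q, p)
assert len(c) == 9 and all(0 <= t < 3 for t in c)
```

A sanity check on the stand-ins: multiplying by the constant polynomial 1 and reducing must return the original operand unchanged, since its degree is already below that of Q.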
4.5.2
Verifying Dependency Graph for PPR
With reference to Figure 3.1 on page 18, Matlab code was written.
4.5.3
Verifying Design# 1
With reference to Figure 3.4 on page 23, Matlab code was written.
4.5.4
Verifying Design# 2
4.5.5
Verifying Design# 3
Chapter 5
Conclusion and Future Work
5.1
Thesis Contributions
The main goal of this thesis is to enhance or maintain the performance of the multiplication operation over Galois fields by implementing different designs.
The contributions of this work can be summarized as follows:
1. The PPR technique over GF (3) was implemented.
2. Three different designs were derived for the given scheduling function s and its three associated projection directions.
3. The designs' results were verified using Matlab.
4. A comparison between the proposed designs and other published works was presented.
5.2
Conclusion
Finite field multipliers over odd-prime extension fields have been introduced. Three different semi-systolic array multipliers over GF (3m) were designed and verified using Matlab. These structures enhance the performance of the multiplication algorithm over GF (3) while avoiding a separate modular reduction step in the multiplication process.
5.3
Future Work
Future work would be to develop other efficient multipliers over finite fields via the Number Theoretic Transform (NTT) or the Fast Fourier Transform (FFT), for either GF (3m) or general GF (pm).
Bibliography
[1] Mohsen Bahramali and Hadi Shahriar Shahhoseini. The best irreducible pentanomials for a Mastrovito GF multiplier. In IEEE International Conference on Computer Systems and Applications, 2006, pages 493–499. IEEE, 2006.
[2] Paulo SLM Barreto, Hae Y Kim, Ben Lynn, and Michael Scott. Efficient algorithms for pairing-based cryptosystems. In Annual International Cryptology Conference, pages 354–369. Springer, 2002.
[3] Christoforus Juan Benvenuto. Galois field in cryptography. University of Washington, 2012.
[4] Dan Boneh and Matt Franklin. Identity-based encryption from the Weil pairing. In Annual International Cryptology Conference, pages 213–229. Springer, 2001.
[5] Dan Boneh, Ben Lynn, and Hovav Shacham. Short signatures from the Weil pairing. Journal of Cryptology, 17(4):297–319, 2004.
[6] Grzegorz Borowik and Andrzej Paszkiewicz. Method of generating irreducible polynomials over GF(3) on the basis of trinomials. In International Conference on Computer Aided Systems Theory, pages 335–342. Springer, 2011.
[7] Amir Daneshbeh and Anwarul Hasan. A class of unidirectional bit serial systolic architectures for multiplicative inversion and division over GF(2m). IEEE Transactions on Computers, 54(3):370–380, 2005.
[8] Whitfield Diffie and Martin Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644–654, 1976.
[9] Tarek El-Ghazawi. Is high-performance, reconfigurable computing the next supercomputing paradigm? In SC 2006 Conference, Proceedings of the ACM/IEEE, pages xv–xv. IEEE, 2006.
[10] Fayez Elguibaly. Merged inner-product processor using the modified Booth algorithm. Canadian Journal of Electrical and Computer Engineering, (4):133–139, 2000.
[11] Kamran Eshraghian and Neil Weste. Principles of CMOS VLSI Design. Addison-Wesley Pub. Company, 1993.
[12] Steven Galbraith, Keith Harrison, and David Soldera. Implementing the Tate pairing. In International Algorithmic Number Theory Symposium, pages 324–337. Springer, 2002.
[13] Fayez Gebali. Algorithms and parallel computing. John Wiley & Sons, 2011.
[14] Fayez Gebali and Atef Ibrahim. Low space-complexity and low power semi-systolic multiplier architectures over GF(2m) based on irreducible trinomial. Microprocessors and Microsystems, 40:45–52, 2016.
[15] Alper Halbutogullari and Çetin Kaya Koç. Mastrovito multiplier for general irreducible polynomials. IEEE Transactions on Computers, 49(5):503–518, 2000.
[16] Brian Hayes. Computing science: Third base. American Scientist, 89(6):490–494, 2001.
[17] Atef Ibrahim and Fayez Gebali. Optimized structures of hybrid ripple carry and hierarchical carry lookahead adders. Microelectronics Journal, (9):783–794, 2015.
[18] Antoine Joux. A one round protocol for tripartite Diffie–Hellman. In International Algorithmic Number Theory Symposium, pages 385–393. Springer, 2000.
[19] Neal Koblitz. Elliptic curve cryptosystems. Mathematics of computation, 48:203– 209, 1987.
[20] Neal Koblitz. Hyperelliptic cryptosystems. Journal of Cryptology, 1(3):139–150, 1989.
[21] Çetin Kaya Koç and Tolga Acar. Montgomery multiplication in GF(2k). Designs, Codes and Cryptography, 14(1):57–69, 1998.
[22] Czesław Kościelny. Computing in GF(pm) and in gff(nm) using Maple. Quasigroups and Related Systems, 13:245–264, 2005.
[23] Chiou-Yng Lee. Low-complexity bit-parallel systolic multipliers over GF(2m). Integration, the VLSI Journal, 41(1):106–112, 2008.
[24] Chiou-Yng Lee, Erl-Huei Lu, and Jau-Yien Lee. Bit-parallel systolic multipliers for GF(2m) fields defined by all-one and equally spaced polynomials. IEEE Transactions on Computers, 50(5):385–393, May 2001.
[25] Rudolf Lidl and Harald Niederreiter. Introduction to finite fields and their applica-tions. Cambridge university press, 1994.
[26] Rudolf Lidl and Harald Niederreiter. Finite fields, volume 20. Cambridge university press, 1997.
[27] Edoardo Mastrovito. VLSI Architectures for Computations in Galois Fields. Linköping University, Department of Electrical Engineering, Linköping, 1991.
[28] Ciaran McIvor, Máire McLoone, and John McCanny. FPGA Montgomery multiplier architectures - a comparison. In Field-Programmable Custom Computing Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium on, pages 279–282. IEEE, 2004.
[29] Victor Miller. Use of elliptic curves in cryptography. In Conference on the Theory and Application of Cryptographic Techniques, pages 417–426. Springer, 1985.
[30] Peter Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, 1985.
[31] Noriaki Muranaka, Shigenobu Arai, Shigeru Imanishi, and Michael Miller. A ternary systolic product-sum circuit for GF(3m) using neuron MOSFETs. In Multiple-Valued Logic, 1996. Proceedings., 26th International Symposium on, pages 92–97. IEEE, 1996.
[32] Dan Page and Nigel Smart. Hardware implementation of finite fields of characteristic three. In International Workshop on Cryptographic Hardware and Embedded Systems, pages 529–539. Springer, 2002.
[33] Wang Pengjun, Lu Jingang, and Xu Jian. Application of neuron MOS in multiple-valued logic. Neural Computing and Applications, 17(2):139–143, 2008.
[34] Nicola Petra, Davide De Caro, and Antonio Strollo. A novel architecture for Galois fields GF(2m) multipliers based on Mastrovito scheme. IEEE Transactions on Computers, 56(11):1470–1483, 2007.
[35] Leilei Song and Keshab Parhi. Optimum primitive polynomials for low-area low-power finite field semi-systolic multipliers. In Signal Processing Systems, 1997. SIPS 97 - Design and Implementation., 1997 IEEE Workshop on, pages 375–384, Nov 1997.
[36] Leilei Song and Keshab Parhi. Low-complexity modified Mastrovito multipliers over finite fields GF(2m). In Circuits and Systems, 1999. ISCAS'99. Proceedings of the 1999 IEEE International Symposium on, volume 1, pages 508–512. IEEE, 1999.
[37] Berk Sunar and Çetin Kaya Koç. Mastrovito multiplier for all trinomials. IEEE Transactions on Computers, 48(5):522–527, 1999.
[38] Richard Swan et al. Factorization of polynomials over finite fields. Pacific J. Math, 12(3):1099–1106, 1962.
[39] Eric Verheul. Self-blindable credential certificates from the Weil pairing. In International Conference on the Theory and Application of Cryptology and Information Security, pages 533–551. Springer, 2001.
[40] Ilker Yavuz, Siddika Berna Ors Yalcin, and Çetin Kaya Koç. FPGA implementation of an elliptic curve cryptosystem over GF(3m). In ReConFig, pages 397–402, 2008.
[41] Byoung Hee Yoon, Sung Han, Young-Hee Choi, Jong-Hak Hwang, Hyeon-Kyeong Seong, and Heung Soo Kim. A systolic parallel multiplier over GF(3m) using neuron-MOS DLC [down-literal circuit]. In Multiple-Valued Logic, 2004. Proceedings. 34th International Symposium on, pages 135–140, 2004.
Chapter 6
Appendices
IC representations for the multiplication, addition, and multiply/accumulate circuits.