
Synthesis of a parallel data stream processor from data flow process networks

Zissulescu-Ianculescu, C.

Citation

Zissulescu-Ianculescu, C. (2008, November 13). Synthesis of a parallel data stream processor from data flow process networks. Retrieved from

https://hdl.handle.net/1887/13262

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/13262

Note: To cite this publication please use the final published version (if applicable).

Synthesis of a Parallel Data Stream Processor from Data Flow Process Networks

Claudiu Zissulescu-Ianculescu


Synthesis of a Parallel Data Stream Processor from Data Flow Process Networks

Proefschrift

Thesis submitted in fulfilment of the requirements for the degree of Doctor at Leiden University, on the authority of the Rector Magnificus prof. mr. P.F. van der Heijden, by decision of the Doctorate Board, to be defended on Thursday 13 November 2008 at 16:15

by

Claudiu Zissulescu-Ianculescu
born in București, România, in 1976


Composition of the doctoral committee:

promotor: Prof. dr. Ed Deprettere
co-promotor: Dr. A.C.J. Kienhuis
referent: Dr. Steven Derrien (INRIA, France)
other members: Prof. dr. Harry Wijshoff
Prof. dr. Joost Kok
Prof. dr. Kees Goossens (Technische Universiteit Delft, NXP Eindhoven)
Dr. Laurens Bierens (Eonic BV, Delft)

Synthesis of a Parallel Data Stream Processor from Data Flow Process Networks
Claudiu Zissulescu-Ianculescu.
Thesis Universiteit Leiden. With index, ref. With summary in Dutch.
ISBN/EAN 978-90-9023643-8

Copyright © 2008 by Claudiu Zissulescu-Ianculescu, Leiden, The Netherlands.

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without permission from the author.

Printed in the Netherlands


To my wife, Dani


Contents

1 Introduction 1

1.1 COMPAAN Data Flow Process Network . . . 3

1.2 Problem Definition . . . 6

1.3 Solution Approach . . . 7

1.4 Thesis Contribution . . . 9

1.5 Related Work . . . 9

1.5.1 Hardware Architecture Implementations that Use the Polyhedral Model . . . 9

1.5.2 Hardware Architecture Implementations that Use the Process Network MoC . . . 11

1.6 Thesis Outline . . . 12

2 From COMPAAN Data Flow Process Network to Abstract Architecture 13

2.1 Background . . . 13

2.2 Topological Mapping . . . 18

2.3 Semantic Mapping . . . 19

2.4 Conclusions . . . 23

3 Control Synthesis 25

3.1 Control Synthesis in Read/Write Control Units . . . 26

3.2 The Look-up Table Controller . . . 28

3.2.1 Example . . . 29

3.2.2 Discussion . . . 30

3.3 The Parameterized Predicate Controller . . . 30

3.3.1 Example . . . 32

3.3.2 Discussion . . . 33

3.4 The Partitioned Parameterized Predicate Controller . . . 34

3.4.1 Example . . . 36

3.4.2 Discussion . . . 36

3.5 Conclusions . . . 37


4 Communication Synthesis 39

4.1 Background . . . 39

4.1.1 The Order of Producing and Consuming Tokens . . . 40

4.1.2 The Lifetime of a Token . . . 41

4.1.3 Communication Types: Overview . . . 42

4.2 Communication Channel Template in Abstract Architecture . . . 43

4.2.1 The Channel Template for the Extended Linearization Model Realization . . . 44

4.2.2 The Extended Linearization Model Modifications in the Read Unit and Write Unit Controllers . . . 45

4.3 FPGA Realization . . . 46

4.3.1 Example . . . 46

4.4 Conclusions . . . 48

5 Memory Bound Estimation 49

5.1 Maximum Size of FIFO channels . . . 49

5.2 Volumes and Lexical Addressing Functions . . . 51

5.3 In-Order Channel Upper-bound Memory Estimation . . . 52

5.4 Self-loops Channel Memory Estimation . . . 53

5.5 Memory Estimation using Bounding Boxes . . . 57

5.5.1 Background . . . 57

5.5.2 Deriving the Bounding Boxes . . . 58

5.6 Conclusions . . . 62

6 Expression Synthesis 63

6.1 Related Work . . . 64

6.2 The Approach . . . 65

6.2.1 High-Level Optimizer . . . 65

6.2.2 Low-Level Optimizer . . . 66

6.3 Simplification . . . 66

6.3.1 Example . . . 68

6.4 Method of Differences (MoD) . . . 69

6.5 Predicated Static Single Assignment . . . 69

6.6 Examples of Implementations of Expressions . . . 73

6.6.1 Example 1 . . . 73

6.6.2 Example 2 . . . 74

6.7 Conclusions . . . 78

7 IP Core Integration 81

7.1 Embedding an IP Core . . . 82

7.1.1 Handling of Pipeline’s Stalls . . . 83

7.2 Profiler . . . 84

7.2.1 Increasing the Pipeline Utilization of a Virtual Processor . . . 87

7.2.2 Case Study . . . 89

7.2.3 Discussion . . . 92

7.3 Conclusion . . . 93


8 Case Studies 95

8.1 Subspace Tracking . . . 95

8.2 The Matrix-Matrix Multiplication Algorithm . . . 96

8.2.1 Discussion Matrix-Matrix Multiplication Implementation . . . 98

8.3 The QR Factorization Algorithm . . . 99

8.3.1 Discussion Matrix QR Factorization Implementation . . . 100

8.4 The Matrix SVD Decomposition Algorithm . . . 100

8.4.1 Discussion Matrix SVD Decomposition Implementation . . . 104

8.5 Discussion . . . 105

9 Conclusions 107

Bibliography 109

Index 117

Acknowledgment 119

Samenvatting 121

Curriculum Vitae 123


Chapter 1

Introduction

An embedded system is an information processing system that is application domain specific (e.g., signal processing, multimedia, automotive, communications) and tightly coupled to its environment. Tightly coupled to the environment means that the system must react to incoming data at a speed that is imposed by the type and properties of that data. For example, a DVD player has to read, decode and display the movie (the incoming data) at a rate such that the user can observe a smooth transition between two consecutive frames. Thus, embedded systems are reactive systems that are very often real-time systems.

The computational requirements of today's embedded systems are such that a single processor cannot provide the compute power. Instead, new platforms are emerging that are able to satisfy the performance needs of tomorrow's embedded applications. These new platforms are usually multi-processor or multi-core execution platforms consisting of a number of processing elements and a communication, synchronization and storage infrastructure, all integrated on a single chip. These systems are called multiprocessor systems-on-chip (MPSoC). An MPSoC may be homogeneous or heterogeneous. All processing elements in a homogeneous MPSoC are of the same type, e.g., instruction set architecture (ISA) elements.

On the other hand, heterogeneous MPSoC systems are composed of processing elements that are not all of the same type. These elements may be software programmable (ISA), hardware programmable, or even dedicated. The processing elements may operate autonomously, or may be co-processing elements. A co-processing element is a processing element that executes complex ISA instructions in a shorter time than the ISA element that outsources them could.

Moreover, multi-processor execution platforms may be given or may be dedicated. A given platform is a platform that has properties of its own (e.g., Intel processors, the IBM Cell processor, or graphics processing units). A dedicated platform is a platform that can be partially or totally reconfigured. Such a platform is the Field Programmable Gate Array (FPGA).

The FPGA execution platforms are special in that they are not pre-defined (except for fairly general admissible organizations), but can be application customized without the programmer having to deal with the platform specification. FPGAs may contain embedded CPUs or DSP blocks, distributed RAMs, specialized input/output blocks, and configurable logic blocks (CLBs).

To use the parallelism available in FPGA execution platforms, we need to program them in such a way that we can exploit distributed control and distributed memory. Distributed control means that the individual components on a platform can proceed autonomously in time without considering other components. Distributed memory means that data is not pooled in a large global memory, but distributed over the platform. Although distributed memory and control are key requirements to take advantage of the new emerging platforms, we observe that applications are typically cast in the form of a sequential imperative programming language, i.e., a Matlab, a C/C++, or a Java program. A strong point of the imperative model of computation is that it is easy to reason about a program, as only a single thread of control needs to be considered. Also, the memory space is global, i.e., all data comes from the same memory source. However, the single memory and the single thread of control are contradictory to the need for distributed control and memory. Therefore, programming these new platforms is a very tedious, error-prone, and time-consuming process.

There are two ways in which we can overcome the programming problem. One way is to require application developers to specify their applications in a parallel programming language (textual or graphical). Graphical or visual programming styles have been proposed and successfully used to specify streaming data signal processing and multimedia applications.

Typical examples of such parallel programming styles rely on dataflow graph and dataflow process network models of computation (MoC) [1, 2]. In these models, a program consists of active entities (functions, threads, processes) that communicate point-to-point over FIFO channels. Application developers are reluctant to provide specifications in terms of these models for several reasons. Firstly, the models are either not expressive enough or are undecidable. Secondly, practical applications cannot be specified only in terms of dataflow models, which do not really take dynamic control flow into account.

Figure 1.1: Mapping of an application to an FPGA execution platform

The other way to overcome the mismatch between a sequential imperative application specification and a targeted parallel execution platform is to convert (to parallelize) the sequential specification into an input-output equivalent parallel specification that is a better match for a targeted multi-processor execution platform. Then, the parallelized code is mapped onto the multi-processor execution platform. The action of converting the parallelized code to an executable code suited for the multi-processor execution platform is sometimes called mapping [3] and sometimes called synthesis [4].


In this last approach, the programmer (almost) does not need to know about parallel programming or parallel architectures in order to exploit the inherent parallelism in the application. This approach is illustrated in Figure 1.1, in which the multi-processor execution platform is embedded in an FPGA execution platform. Application developers will most likely continue using sequential imperative languages to specify applications, because these languages are expressive general-purpose languages. Thus, the second approach deserves further investigation.

Not every sequential imperative language program can be easily - and preferably automatically - converted to an input-output equivalent parallel specification. However, in signal processing, multimedia, molecular biology, and other related application domains, there are nested loop programs (NLP) of which many can be converted to input-output equivalent parallel specifications. In particular, those that are so-called affine nested loop programs can be converted automatically. The conversion to an input-output equivalent parallel specification of a subset of these nested loop programs has been amply studied and reported in the literature [5–8]. This subset of the nested loop programs is called static affine nested loop programs. In this thesis, we focus on the problem of mapping the input-output equivalent parallel specification of a static affine nested loop program to FPGA multi-processor execution platforms.

More specifically, we address the problem of synthesizing Process Network specifications to FPGA multi-processor execution platforms. The process networks we consider are special cases of Kahn Process Networks [2]. We call them COMPAAN Data Flow Process Networks (CDFPN) because they are provided by the COMPAAN compiler, a translator that automatically translates affine nested loop programs to input-output equivalent (COMPAAN) process network specifications [5]. The COMPAAN Data Flow Process Network is a model of computation that expresses an application naturally in terms of distributed control and distributed memory. CDFPN programs are parallel programs that specify networks of active entities (threads, processes, or actors) that communicate point-to-point over unbounded communication channels. The inter-process synchronization is done by means of a blocking read protocol. This protocol states that a process can always write to a channel, but it blocks when it attempts to read from a channel that is empty.

Our objective is to provide an effective and efficient implementation of CDFPNs on an FPGA execution platform, where our implementation is close to a one-to-one mapping of the originating CDFPN. However, in our implementation we do not make use of the embedded CPU blocks, specialized DSP blocks, or soft-core processors. The execution platform emerges as part of the mapping process, resulting in a dedicated multi-processor execution platform for a given CDFPN specification.

1.1 COMPAAN Data Flow Process Network

The COMPAAN Data Flow Process Network is a model of computation well suited to represent applications from the realm of digital signal processing. In this section, we sketch the behavior of a CDFPN that is equivalent to an imperative nested loop program.

The CDFPN MoC communication semantics is similar to the Kahn Process Network (KPN) communication semantics [2]. The KPN MoC is more general than the CDFPN MoC, which is closer to the Dataflow Process Network (DPN) MoC [1], while preserving the monotonicity property [2] (i.e., a CDFPN needs only partial information of the input stream in order to produce partial information of the output stream). As in the case of the DPN, the CDFPN processes map input tokens into output tokens in accordance with a set of firing rules. These rules dictate precisely what tokens must be available at the input for the process to fire. A firing consumes input tokens and produces output tokens. In the CDFPN case, the firing rules are derived by the COMPAAN compiler using the Polyhedral Model [7, 9, 10]. The construction of the CDFPN firing rules is covered in [3], where the notion of variants is introduced, representing a set of firing rules. A CDFPN process produces a single scalar token when it fires.

Listing 1.1: A simple Matlab program

    for i = 1:1:1,
      [z(i)] = Init();
    end
    for i = 1:1:N,
      [z(i+1)] = bar(z(i));
    end
    for i = N+1:1:N+1,
      [] = Sink(z(i));
    end

Figure 1.2: The CDFPN network of the program listed in Listing 1.1

Consider the CDFPN shown in Figure 1.2. As in most graphical programming environments, the nodes of the graph can be viewed as processes that run concurrently and exchange data over the arcs of the graph (communication channels). The given CDFPN consists of three processes, P1, P2, and P3, and three communication channels, ED1, ED2, and ED3. This graph is referred to as a network, and it is the input-output parallel specification equivalent to the affine nested loop program listed in Listing 1.1. The network communication channels ED1, ED2, and ED3 implement the global memory, represented by the vector z, in a distributed manner. Each network process contains a subprogram that calls a specific function of the affine nested loop program (i.e., P1 calls Init(), P2 calls bar(), and P3 calls Sink()). The network is constructed in such a way that no control is shared between the processes. Hence, the functions from the affine nested loop program are also executed independently (i.e., no explicit synchronization signal is shared among them).

In a CDFPN, concurrent processes communicate only through one-way communication channels with unbounded capacity. The interface between a communication channel and a process is called a port. A port is either an Input Port or an Output Port. An Input Port is used for a read operation from a communication channel. An Output Port is used for a write operation to a communication channel. Each channel carries a sequence (a stream) of atomic data objects called tokens. Each token is written (produced) exactly once, and read (consumed) possibly more than once. Writes to the channels are nonblocking, but reads are blocking. This means that a process that attempts to read from an empty input channel stalls until the buffer has sufficient tokens to satisfy the read. Hence, when a process stalls, the function will not evaluate. Each call of the function results in output tokens that are sent to the appropriate outgoing ports. The order in which a channel is read or written is given (i.e., a schedule).
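The blocking-read protocol can be made concrete with a small simulation. The following Python sketch is purely illustrative and not part of the COMPAAN/LAURA tools: the Channel class and the run() interleaving are constructs invented for this example. A consumer that attempts to read from an empty channel stalls for that step.

```python
from collections import deque

class Channel:
    """Unbounded FIFO channel: writes never block, reads do."""
    def __init__(self):
        self._q = deque()

    def put(self, token):
        self._q.append(token)      # non-blocking write

    def can_get(self):
        return len(self._q) > 0

    def get(self):
        return self._q.popleft()   # only legal when can_get() holds

def run(producer_steps, consumer_steps):
    """Interleave a producer and a consumer, consumer first.
    A read attempt on an empty channel stalls the consumer for
    that step, modeling the blocking-read semantics."""
    ch = Channel()
    produced, consumed, stalls = 0, [], 0
    for step in range(producer_steps + consumer_steps):
        if step % 2 == 1 and produced < producer_steps:
            ch.put(produced)               # producer fires
            produced += 1
        elif ch.can_get():
            consumed.append(ch.get())      # consumer fires
        else:
            stalls += 1                    # blocked on an empty channel
    return consumed, stalls
```

With run(3, 3), the consumer stalls exactly once, on the very first step, before the producer has emitted any token; thereafter producer and consumer alternate without further stalls.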

A process in the CDFPN model is a subprogram. This subprogram is a module wrapper that isolates the computation (the affine nested loop function call) from the communication (the schedule according to which a channel is read or written). Such a subprogram is given in Figure 1.3, which shows the implementation of process P2:

    void P2::main() {
      for (int i = 1; i <= N; i += 1) {
        // READ
        if (i-1 == 0) {
          in_0 = IP1.get();   // read a token via input port IP1
        }
        if (i-2 >= 0) {
          in_0 = IP2.get();   // read a token via input port IP2
        }
        // EXECUTE
        out_0 = bar(in_0);
        // WRITE
        if (-i+N-1 >= 0) {
          OP1.put(out_0);     // write a token via output port OP1
        }
        if (i-N == 0) {
          OP2.put(out_0);     // write a token via output port OP2
        }
      }
    } // Process P2

Figure 1.3: A CDFPN process unveiled

The process has four ports: two input ports (i.e., IP1 and IP2) and two output ports (i.e., OP1 and OP2). We observe that the tokens written to ED2 via OP1 are read back by the same process via IP2. We refer to this kind of communication channel as a self-loop.

The schedule according to which a channel is read or written is given by the surrounding nested for-loop and if-statements. Each input port is guarded by if-statements that control the read from a communication channel. For example, when the for-loop iterator i is equal to one, a token is read from the IP1 input port. In all other cases, a token is read from the IP2 input port. The read token is placed in an internal temporary variable in_0 that is used as an input argument for the function bar. The call of this function produces an output token stored in the variable out_0. Next, this variable is written to one of the output ports. As in the case of the input ports, the output ports are also guarded by if-statements. The if-statements enforce the right order of writing tokens into the network channels. For example, when the for-loop iterator i is equal to N, the Output Port OP2 is active. In all other cases, the Output Port OP1 is active.

Process P2 reads either from channel ED1 or from channel ED2, processes the token, and finally writes the processed token to either channel ED2 or channel ED3, depending on loop iterator conditions. Hence, we can always distinguish three behavioral parts in any of our processes: the READ, the EXECUTE, and the WRITE parts. The part that reads tokens from communication channels via input ports is called the READ part, as shown in Figure 1.3.

Operations on the read tokens take place in the EXECUTE part. In Figure 1.3, the EXECUTE part evaluates the bar function. The part that writes tokens to communication channels via output ports is called the WRITE part.
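The interplay of the guard conditions can be checked by replaying P2's schedule in software. The sketch below (Python, for illustration only) mirrors the READ, EXECUTE, and WRITE parts of Figure 1.3; the channels are modeled as plain lists, and the definition of bar is a stand-in, since the function is left unspecified in the network.

```python
def bar(x):
    # bar() is not specified by the network; any unary function
    # serves for this illustration
    return x + 1

def run_p2(N, ed1):
    """Replay process P2's schedule for i = 1..N.

    ed1 carries the single token produced by Init(); ed2 is the
    self-loop channel; ed3 collects the token destined for Sink().
    """
    ed2, ed3 = [], []
    for i in range(1, N + 1):
        # READ: i == 1 selects input port IP1 (ED1), else IP2 (ED2)
        in_0 = ed1.pop(0) if i - 1 == 0 else ed2.pop(0)
        # EXECUTE
        out_0 = bar(in_0)
        # WRITE: i == N selects output port OP2 (ED3), else OP1 (ED2)
        if -i + N - 1 >= 0:
            ed2.append(out_0)
        if i - N == 0:
            ed3.append(out_0)
    return ed3
```

With bar incrementing its argument, Sink receives the initial token incremented N times: run_p2(4, [0]) yields [4], exactly what the sequential program in Listing 1.1 computes.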


1.2 Problem Definition

Given an affine nested loop program and its input-output equivalent COMPAAN Data Flow Process Network specification, how can we implement that specification on an FPGA-based multi-processor execution platform? In fact, this can be done in several ways. Current FPGA chips are powerful enough to allow heterogeneous architecture implementations. Thus, an architecture consisting of hardcore and/or softcore ISA components, configurable components, dedicated components, point-to-point, bus-based, or crossbar-based communication structures, and shared or distributed memory components can be implemented in FPGA chips.

This diversity of options has been investigated in [11]. However, the most efficient way to implement a CDFPN specification in an FPGA fabric is a dedicated one-to-one mapping of the former into the latter. Although this seems to be a straightforward approach, it is not, because the CDFPN processes are threads that read data, evaluate functions, and write data in a sequential order. When implemented as such, resource utilization and performance will not, and cannot, be optimal.

Figure 1.4: Process network architecture example: the processor template

Considering the CDFPN example shown in Figure 1.2, a possible implementation model and the issues to be addressed are depicted in Figure 1.4. In this model, a process is mapped to a processor. Each processor is decomposed into a Read Unit, an Execute Unit, and a Write Unit that operate in a pipelined fashion. The Execute Unit evaluates one or more functions that are enclosed in an intellectual property (IP) core. Function input arguments are delivered by the Read Unit, which selects the arguments from the process input channels. Function results are delivered to the Write Unit, which distributes them across the process output channels.

The main issue in mapping a CDFPN to an FPGA is how to design an architecture such that it achieves the maximum data throughput for the given CDFPN. Addressing this issue requires answering the following questions:

• What is the actual capacity of a communication channel? The CDFPN MoC specifies that the inter-process communication is done over unbounded FIFOs; thus, we need to bound the channel capacities for an implementation. In [7], the communication channels are bounded to an over-dimensioned value. Hence, we have to determine the communication channel capacity so as to avoid memory spilling and network deadlocks.

• What are the actual communication primitives and protocols? Because the initial nested loop programs are streaming applications that enforce a throughput, the way in which a communication channel handles read and write operations should not obstruct the flow of data.

• Can the processor always be pipelined? If so, the CDFPN specification has to be transformed to exploit this option.

• How to embed an IP core in the processor template? The computation in a COMPAAN Process Network is not fully specified. Functions embedded in a process of a CDFPN are specified as mathematical functions, [out0, out1] = F(in0, in1, in2), i.e., the implementation code is not included.

• How to implement the Read and Write units without hindering the performance of the IP core? The Read and Write units implement the blocking read and blocking write synchronization primitives. At each occurrence of a blocking read or write situation, the processor stalls. Moreover, the Read and Write units have to determine the next read and write sequence at each clock cycle. Hence, the design of the Read and Write units is critical in obtaining an implementation that has a maximal data throughput.
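For a static affine nested loop program, the order of writes and reads on each channel is known at compile time, so a safe channel capacity can be read off the access trace. The following sketch illustrates the principle only; the actual estimation techniques are the subject of Chapter 5, and the two traces below are hypothetical examples, not derived from a real network.

```python
def required_capacity(trace):
    """Given a channel's compile-time access trace as a sequence of
    'put'/'get' events, return the smallest FIFO capacity that holds
    every in-flight token for that trace."""
    occupancy = highest = 0
    for event in trace:
        occupancy += 1 if event == 'put' else -1
        # a get before the matching put would deadlock the network
        assert occupancy >= 0, "get from an empty channel: deadlock"
        highest = max(highest, occupancy)
    return highest

# in-order channel: each token is consumed right after it is produced
in_order = ['put', 'get'] * 4
# reordering consumer: all tokens are live before the first read
reordered = ['put'] * 4 + ['get'] * 4
```

The in-order trace needs a capacity of only one token, whereas the reordering trace needs room for all four tokens at once; an under-dimensioned FIFO would deadlock the network, and an over-dimensioned one wastes FPGA memory.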

1.3 Solution Approach

The problem of mapping a CDFPN to an FPGA is the subject of this thesis, resulting in a solution approach and an implementation called the LAURA tool. This solution is part of the COMPAAN/LAURA tool chain shown in Figure 1.5: the synthesis of applications specified as nested loop programs to an FPGA platform.

The mapping of a CDFPN onto an FPGA consists of two parts. The first part is a platform-independent step. In this step, a given CDFPN specification is converted into an Abstract Architecture. The Abstract Architecture is a set of hierarchically interconnected modules representing an architecture. Each module isolates computation from communication. We call this platform-independent step the PN to Abstract Architecture step, as no information on the actual targeted platform is taken into account. The second part is a platform-dependent part that synthesizes an actual FPGA-based multi-processor execution organization from the Abstract Architecture. In this step, the Abstract Architecture is converted into a Network of Synthesizable Processors. We call this the Architecture Synthesis step, as platform-specific information is taken into account. Also, in this step we embed platform-specific IP cores that implement the Execute Unit functionality of the synthesized CDFPN processes. The Network of Synthesizable Processors is specified using a hardware description language. In this thesis we limit ourselves to VHDL output.

The CDFPN specification, the Abstract Architecture, and the Network of Synthesizable Processors (the FPGA implementation) are topologically identical. However, their semantics are different. The semantics are related through a number of operations in the PN to Abstract Architecture and the Architecture Synthesis steps. The PN to Abstract Architecture step consists of the following operations:


Figure 1.5: The LAURA flow in the COMPAAN/LAURA tool chain

• Topological Mapping converts the CDFPN to a network of virtual processors. The resulting network has the same topology as the CDFPN, as we employ one-to-one mapping in which each process becomes a Virtual Processor and each communication channel a Dedicated Channel;

• Semantic Mapping decomposes a PN process specification into the components of a Virtual Processor. These components are the Read Unit, the Write Unit, and the Execute Unit. The controller is distributed in the Read, Execute and Write units.

The Architecture Synthesis step consists of the following operations:

• Control Synthesis derives the control structure for the Read or Write units;

• Communication Synthesis determines the type and the capacity of a dedicated channel;

• Expression Synthesis translates a linear or pseudo-linear expression to a form that is free of multiplication and integer division operations. The Expression Synthesis operation is used by the Control Synthesis operation;

• IP Core Integration embeds a functional IP core.
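The kind of rewriting performed by Expression Synthesis can be illustrated on a one-dimensional affine index expression. The method of differences (Chapter 6) replaces a per-iteration multiplication by a single running addition, which is far cheaper to realize in FPGA logic. The two functions below are an illustrative sketch, not the LAURA implementation.

```python
def direct(coef, const, n):
    """Evaluate coef*i + const for i = 0..n-1 with one
    multiplication per iteration."""
    return [coef * i + const for i in range(n)]

def method_of_differences(coef, const, n):
    """Produce the same values multiplication-free: successive
    terms differ by the constant coef, so an accumulator and one
    addition per iteration suffice."""
    values, acc = [], const
    for _ in range(n):
        values.append(acc)
        acc += coef          # difference between successive terms
    return values
```

Both evaluations agree, e.g. for 3*i + 5 over six iterations: [5, 8, 11, 14, 17, 20]. In hardware, the accumulator variant becomes a small adder and a register instead of a multiplier.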

The Network of Synthesizable Processors is captured in the LAURA tool in terms of a set of both generic and platform-specific VHDL templates. In this dissertation, we specifically target the Virtex-II Pro platform from Xilinx, as this platform was available to us.

We use the Xilinx resources (e.g., embedded memories, serial interfaces) to implement the platform specific VHDL templates. When we embed an IP core in our network, a tailored processor is made to accommodate the IP core. The resulting processor inherits the execution model of the IP, the computational resources available, and the clock cycles needed for an execution.


1.4 Thesis Contribution

In this thesis, we present the LAURA approach, which implements our methodology to map PNs generated by the COMPAAN compiler onto a reconfigurable platform such as an FPGA. The main contributions are:

• The development of an approach that allows mapping of a Process Network specification onto reconfigurable platforms in a systematic and automatic way;

• The introduction of the notions of Abstract Architecture, Virtual Processor and Dedicated Channel to capture and model the Process Network behavior on the FPGA;

• The development of a technique that improves the efficiency of the IP cores embedded in our architecture;

• The development of a technique that can estimate at compile time the type and capacity of the dedicated channels;

• The development of a number of techniques that allow us to map a CDFPN efficiently (in terms of speed and resources) onto an FPGA platform;

• The validation of the present approach with real-life industrial experiments;

• The prototyping of the present approach in software.

A number of experiments have been conducted for applications in the field of image processing and signal processing. The experiments show that we are able to fully automatically derive an FPGA implementation from a given sequential imperative application specification.

1.5 Related Work

Many researchers have addressed the problem of mapping sequential imperative programs to FPGA execution platforms [12–21]. In the literature, the contributions differ in the way programs and platforms are constrained. Mapping an arbitrary sequential imperative program to an FPGA execution platform is almost equivalent to mapping such programs to homogeneous multi-processor architectures. Our approach is different, as we generate FPGA implementations using a constrained Process Network MoC that uses the polyhedral model to derive the firing sequence of each process. In this section, we discuss some approaches that use the Process Network MoC or the polyhedral model to generate hardware architectures. These approaches are discussed in two sections. The first deals with tools that generate an FPGA implementation using the polyhedral model. The second deals with tools that use the PN MoC to describe hardware architectures.

1.5.1 Hardware Architecture Implementations that Use the Polyhedral Model

The polyhedral model is used in numerous projects to synthesize multi-processor FPGA-based architectures. The CLooGVHDL [22] project is one of them. Each C statement processed by CLooG [23] is synthesized by the VHDL back-end using a sequential execution model. Thus, only the parallelism within one statement is exploited by CLooGVHDL. In [24], Teich and Thiele describe a systematic way to design a processor array. Although processor arrays are a good solution to map applications that are data-flow dominated, these processor arrays are not suited for more control-dominated applications.

Such control-dominated applications require a more complex global controller to be synthesized. In the Paro compiler [25], the authors extend the work presented in [24] with a methodology that reduces the hardware cost of the global controller and of the memory address generators by avoiding costly multiplication and division operations.

In the Alpha environment [26], a program is described as a system of affine recurrence equations (SARE). Starting from such a specification, both the synthesis of regular architectures and the compilation to sequential or parallel machines are considered. The rationale behind writing programs in Alpha rather than in some imperative language is that a functional/mathematical specification matches the way people think of an algorithm and that all the parallelism in the algorithm is naturally preserved. AlpHard [27] is a subset of the Alpha language that enables the hardware generation of regular architectures, like systolic arrays.

Another example is the Atomium [28] project, which consists of a set of tools that operate at the behavioral level of an application, expressed in C. The output is a transformed C description, functionally equivalent to the original program, but typically leading to strongly reduced execution times, memory size, and power consumption. Related to our work is the part of Atomium dealing with memory issues when mapping applications onto platforms with distributed memory architectures. The Memory Architect is a component tool allowing the designer to explore the effects of timing constraints on the required memory architecture. This tool translates the timing constraints into optimized memory architecture constraints: for a given set of timing constraints, it generates an optimized set of architectural constraints and a cost estimate for the resulting architecture.

The high-level synthesis methodology Phideo [29] starts with a specification in single assignment form and converts this description into an instance of a target architecture template. An important part of Phideo is the address generation method for memories that are introduced by the synthesis tool. The address generation in Phideo is a special case of the address generation present in the COMPAAN Data Flow Process Networks. This is due to a somewhat more restricted geometry of the iteration domain used in Phideo. The hardware designed by Phideo is synchronous and a schedule is derived. This represents an important constraint, and therefore the class of applications Phideo accepts as input is restricted to single assignment perfectly nested loop programs.

The PICO project [30] at HP Labs (later spun out into a startup called Synfora [31]) is an effort that aims to automate the mapping of applications onto platforms consisting of a VLIW processor and custom nonprogrammable accelerators (NPA) connected to a two-level cache subsystem that is in turn connected to the system bus. Each accelerator is customized to execute a compute intensive loop nest that would otherwise have been executed on the VLIW. Different from our network representation, an NPA is represented by a fixed size (non-parametric) array of processing units activated by a global schedule. The NPA is derived by the PICO-NPA compiler, which accepts a perfect loop nest in C and produces, based on a template, structural Verilog/VHDL that defines the NPA at the register transfer level, together with the C code that repeatedly invokes the NPA hardware. This code is compiled onto the host processor along with the remainder of the application.


1.5.2 Hardware Architecture Implementations that Use the Process Network MoC

The usage of the Process Network MoC for FPGA implementation is not new. A special subset of the Process Network MoC called Communicating Sequential Processes (CSP) [32] has been used by a number of projects for modeling applications. The Stream-C project [33] lets the user specify coarse grain, process level parallelism. The compiler infers fine grain, loop level parallelism. Stream-C is also based on the CSP model and allows users to specify independent parallel processes and their mapping to a multiple FPGA platform. Another approach is to add constructs and annotations to a subset of a programming language to specify parallelism and event sensitivity. An example of this approach is the Handel-C project at Oxford [34], where the compiler can produce hardware from an input description. Handel-C is based on Hoare's CSP model and is a modified form of C in which the user can specify concurrent operations and bit-widths of data.

C-HEAP is a top-down design methodology presented in [35] using the Kahn Process Network (KPN) MoC. It generates instances of an architecture template containing dedicated hardware components, multiple software programmable processors (e.g., CPUs, DSPs), local cache memories, a global shared memory, and a communication network. Although the communication between the various processors is modeled using KPNs, the hardware implementation is a bus oriented architecture, and problems with cache coherence are reported for it. In our approach we do not use global shared memory, and thus memory contention is avoided.

In [36], the authors present a Kahn Process Network methodology based on the DISYDENT platform (DIgital SYstem Design ENvironmenT). The system is described by a set of communicating Kahn processes. These processes are C POSIX threads representing both software and hardware tasks. Each thread communicates with the others using channel-read/channel-write primitives. System realization consists of synthesizing the hardware tasks to RTL-VHDL. However, this methodology is suited for control dominated applications and requires low level information to be provided by the user.

The use of KPN as a model of computation is also reported in the COSY [37] and Prophid [38] tools. Prophid is a heterogeneous multi-processor architecture template. This template distinguishes between control-oriented tasks and high performance media processing tasks. A CPU connected to a central bus is used for control-oriented tasks and possibly low to medium performance signal-processing tasks. A number of application-specific processors implement high performance media processing tasks. These processors are connected to a reconfigurable high-throughput communication network to meet the high communication bandwidth requirements. Hardware FIFO buffers are placed between the communication network and the inputs and outputs of the application-specific processors to efficiently implement stream-based communication.

The COSY methodology [37] provides a gradual path for communication refinement in a top-down fashion for a given platform and a communication protocol. The main goal of the COSY methodology is to perform design space exploration of a system at a high abstraction level. In order to achieve this, the methodology provides a mechanism for modeling communication interfaces at a high level of abstraction (including the behavior of the selected protocol) with various parameters, e.g., the delay of executing the protocol itself. The COSY methodology uses a message passing communication protocol with read and write primitives.


The architecture generated by us has close ties with Globally Asynchronous Locally Synchronous (GALS) systems [39]. GALS systems contain several independent synchronous blocks which operate with their own local clocks and communicate asynchronously with each other. The main feature of these systems is the absence of a global timing reference and the use of several distinct local clocks (or clock domains), possibly running at different frequencies. However, we know of no tools that generate GALS systems out of an imperative program like our proposed COMPAAN/LAURA approach does.

1.6 Thesis Outline

The general outline for the LAURAapproach is shown in Figure 1.6.

[Figure: the Laura flow, from a Process Network through Topologic Mapping and Semantic Mapping (Chapter 2) to an Abstract Architecture, followed by Control Synthesis (Chapter 3), Communication Synthesis (Chapter 4), Memory Estimation (Chapter 5), Expression Synthesis (Chapter 6), and IP Core Integration with an IP Core Library (Chapter 7), producing a Network of Synthesizable Processors in VHDL (Chapter 8).]

Figure 1.6: The Laura Flow

In Chapter 2, we present the step of mapping a process network onto an Abstract Architecture. In Chapter 3, we present our methodology to map the associated control for each Read and Write unit of the Virtual Processor. We pay attention to constraints such as clock speed and area in the mapping of those units. In Chapter 4, we analyze the communication behavior between processors and propose four channel realizations. In Chapter 5, we present a methodology to determine at compile time the memory requirements for a particular communication channel and thus, for the entire network. In Chapter 6, we present our methodology to synthesize the complex expressions which are found in the control units. The topic of IP core integration is discussed in Chapter 7. Here we present how an IP core is embedded into our Virtual Processor and how we can determine its utilization in the case of pipelined IP cores in the presence of self-loops. In the case of a low utilization of the IP pipeline, we propose a number of transformations that may increase this utilization. In Chapter 8, we present three software kernels which are used in smart antenna applications. We conclude the thesis in Chapter 9.


Chapter 2

From COMPAAN Data Flow Process Network to Abstract Architecture

In this chapter, we deal with the first two operations that are present in the first step of the LAURA approach, called PN to Abstract Architecture, see Figure 2.1. These two operations are the Topological Mapping and the Semantic Mapping. The Topological Mapping translates a CDFPN network topology to an architectural communication network. The Semantic Mapping structures each process of the network to a form that is suitable for synthesis. The core of this structure is the Read-Execute-Write (REW) synthesis template. The result specifies the CDFPN model in architectural terms as an Abstract Architecture. Before we further explain the PN to Abstract Architecture step, we first provide some background information in Section 2.1 concerning the model of computation used. In Section 2.2, the Topological Mapping relates the topology of the COMPAAN Data Flow Process Network (CDFPN) MoC with the topology of the Abstract Architecture. In Section 2.3, we discuss the Semantic Mapping, which translates the behavior of the processors in the CDFPN MoC to the behavior of the processors in the Abstract Architecture. The chapter is concluded in Section 2.4.

2.1 Background

The model of computation (MoC) that we consider in this thesis is a process network model which we call the COMPAAN Data Flow Process Network (CDFPN). A CDFPN is a special case of the Kahn Process Network [1, 2] MoC, sharing many of the characteristics of the KPN MoC (e.g., the communication semantics of a CDFPN is the same as for Kahn networks).

A Kahn Process Network (KPN) consists of a set of processes that communicate point-to-point over unbounded FIFO channels. A process that wants to read from a channel will block when that channel is empty, waiting for data to arrive. Compared to Kahn networks, a COMPAAN network does not allow a process or a network to be nondeterministic, and the network is always static (i.e., no additional nodes or arcs can be added at run-time). Also, the hierarchical characteristic of a generic KPN is not kept by the CDFPN MoC. However, KPN


[Figure: the PN to Abstract Architecture step, refining a Process Network into an Abstract Architecture via the Topologic Mapping (network generation) and the Semantic Mapping (processor generation).]

Figure 2.1: PN to Abstract Architecture step in more detail

has the following favorable characteristics shared also by CDFPN networks:

• The KPN model is deterministic, which means that irrespective of the schedule chosen to evaluate the network, always the same input/output relation exists;

• The inter-process synchronization is done by a blocking read. This is a very simple synchronization protocol that can be realized easily and efficiently in FPGAs;

• Processes run autonomously and synchronize via the blocking read. When mapping processes to an FPGA, you get autonomous islands on the FPGA that are only syn- chronized via blocking reads;

• As control is completely distributed to the individual processes, there is no global scheduler present. As a consequence, partitioning a KPN over a number of reconfigurable components or microprocessors is a simple task;

• As the exchange of data has been distributed over the FIFOs, there is no notion of a global memory that has to be accessed by multiple processes. Therefore, no resource contention occurs.

Listing 2.1: Static Affine Nested Loop Program example

for i = 1:1:M,
  for j = 1:1:N,
    if i+j <= T,
      [ r(i,j) ] = F( ... );
    end
  end
end

Our model is special in that the processes are static affine nested loop programs (SANLP).

In an SANLP, the loop bounds, condition statements, and variable indexing functions are all affine or pseudo affine functions of the loop iterators and static parameters. A static parameter is a parameter that is constant during an SANLP execution. An example is shown in Listing 2.1, in which M, N, and T are static parameters, and i and j are the index variables of the surrounding loops. Hence, the condition i + j <= T and the access r(i, j) are affine. A function of one or more variables i1, i2, . . . , in is affine if it can be expressed as a sum of a constant plus constant multiples of the variables, i.e., c0 + c1·i1 + c2·i2 + . . . + cn·in, where c0, c1, . . . , cn are constants.

In CDFPN networks, the firing rules are derived in a particular manner that respects the SANLP nature of the COMPAANsequential imperative input programs:

• A firing rule dictates how the tokens are consumed in one process fire.

• A process interacts with the rest of the network only through its FIFO links. A CDFPN is input-output equivalent to an SANLP, and can be automatically derived from that equivalent program by the COMPAAN compiler. Internally, the COMPAAN compiler uses a representation of loops called the polyhedral program model [7, 9, 10].

It is important to understand how the COMPAAN compiler generates a CDFPN from an SANLP input program. Consider the program shown in Listing 2.2. This program uses the Matlab programming language, as this is the input programming language accepted by the COMPAAN compiler.

Listing 2.2: SANLP compilation example

for j = 1:1:N,
  for i = 1:1:N,
    [ x(j,i) ] = F1( ... );
    if i+j <= N,
      [ ] = F2( x(j,i) );
    end
  end
end

The above program can be represented using the polyhedral program model, where each statement is guarded by a set of linear equations. A statement is a line of a program without control (e.g., [] = F2(x(j, i)) from Listing 2.2). A statement is executed for a set of values of the iteration vector, i.e., the vector containing the iterators of the surrounding loops (for the statement containing the function F2(), the iteration vector is (j, i)). The iteration domain is the set of values of the iteration vector for which the statement is executed. The iteration domain of the statement containing the function F2() is shown in Figure 2.2.

The program in Listing 2.2 is equivalent to the process network presented in Figure 2.3. The network consists of two processes, called producer and consumer. The producer process wraps the function call F1() of the given SANLP example, and the consumer process wraps the function call F2(). The producer process is characterized by the C program:

for (int j1 = 1; j1 <= N; j1 += 1) {
  for (int i1 = 1; i1 <= N; i1 += 1) {
    out0 = F1();
    if (-j1 - i1 + N >= 0) {
      OP1.put(out0);
    }
  }
}


Figure 2.2: The graphical representation of the iteration domain of the statement containing the function F2() from Listing 2.2

And the consumer process is characterized by the C program:

for (int j2 = 1; j2 <= N - 1; j2 += 1) {
  for (int i2 = 1; i2 <= -j2 + N; i2 += 1) {
    in0 = IP1.get();
    F2(in0);
  }
}

The communication between these two processes is done using an unbounded FIFO buffer (i.e., ED1). The interface between the producer process and the FIFO buffer is denoted by a black point (i.e., OP1). This interface is called the Output Port. An Output Port is used for a write operation to a FIFO. The interface between the FIFO buffer and the consumer process is also denoted by a black point (i.e., IP1). This interface is called the Input Port. An Input Port is used for a read operation from a FIFO.

[Figure: producer process F1 connected through output port OP1, FIFO buffer ED1, and input port IP1 to consumer process F2.]

Figure 2.3: CDFPN network of the SANLP presented in Listing 2.2

The two processes from Figure 2.3 and their relation can be represented by a producer-consumer pair (P/C pair) [40] and an affine mapping.

Definition 2.1 A P/C pair is a tuple < C(p), f, P(p), ≺ >, where C(p) ⊂ Q^n is a parameterized polytope, f : Z^n → Z^m is an affine function, P(p) = f(C(p) ∩ Z^n), and ≺ is the lexicographical order.


Thus,

    C(p) = { x ∈ Q^n | Ax + Bp ≥ C },                                      (2.1)

with A, B, and C integral matrices of appropriate dimension, and p a static integral parameter vector. C(p) ∩ Z^n is called the consumer domain. A consumer domain is an iteration domain.

    P(p) = f(C(p) ∩ Z^n) = { i ∈ Z^m | i = Mk + O ∧ k ∈ (C(p) ∩ Z^n) },    (2.2)

with M and O integral matrices of appropriate dimension. Usually the matrix O is zero, and the matrix M is called the mapping matrix. P(p) is called the producer domain. A producer domain is an iteration domain.

In our example, the mapping function f,

    f( (j2, i2)^T ) = [ 1 0 ; 0 1 ] · (j2, i2)^T = (j1, i1)^T,

maps the points (j2, i2) from the consumer domain

    C ∩ Z^2 = { (j2, i2)^T ∈ Z^2 |

        [  1  0 ]                [ 0 ]          [ 1 ]
        [ -1  0 ]   [ j2 ]       [ 1 ]          [ 0 ]
        [  0  1 ] · [ i2 ]   +   [ 0 ] · [N] ≥  [ 1 ]   }
        [  0 -1 ]                [ 1 ]          [ 0 ]
        [ -1 -1 ]                [ 1 ]          [ 0 ]

to the points (j1, i1) from the producer domain

    P = { (j1, i1)^T ∈ Z^2 |

        [  1  0 ]                [ 0 ]          [ 1 ]
        [ -1  0 ]   [ j1 ]       [ 1 ]          [ 0 ]
        [  0  1 ] · [ i1 ]   +   [ 0 ] · [N] ≥  [ 1 ]   }
        [  0 -1 ]                [ 1 ]          [ 0 ]
        [ -1 -1 ]                [ 1 ]          [ 0 ]

The lexicographic order ≺ is the order defined by the loop nests. The lexicographic order is also referred to as the local schedule. Hence, the local schedule of the producer process is given by the following nested for-loops:

for (int j1 = 1; j1 <= N; j1 += 1)
  for (int i1 = 1; i1 <= N; i1 += 1)

and the local schedule of the consumer process is given by the following nested for-loops:

for (int j2 = 1; j2 <= N - 1; j2 += 1)
  for (int i2 = 1; i2 <= -j2 + N; i2 += 1)

The understanding of the P/C pair is essential, as a CDFPN is a collection of these P/C pairs arranged in a network [7]. In [41], the authors present four types of communication that can exist in a P/C pair. The four types differ in order and multiplicity. Multiplicity means that a token that is sent by the producer is read more than once at the consumer side. A channel is out-of-order when a token sent by the producer is not read by the consumer in the order in which it was produced. Depending on the order and the presence of multiplicity, an arbitrary communication channel belongs to one of the following four classes:


In-Order without multiplicity (IOM-) a producer writes data into the channel in the same order and quantity as the consumer reads from the channel;

In-Order with multiplicity (IOM+) the order in which data is produced is the same as the order in which data is consumed. However, some data is consumed more than once, breaking the communication model of a FIFO, where a get operation is destructive. In this model, the life-time of a token needs to be taken into account;

Out-of-Order without multiplicity (OOM-) a consumer reads data in a different order than it has been written by the producer;

Out-of-Order with multiplicity (OOM+) a channel has the same characteristics as in the Out-of-Order case. Additional release logic is added at the consumer side to keep track of the life-time of tokens, to determine the release moment.

To synthesize a CDFPN to an FPGA multi-processor execution platform, we need a model that facilitates this synthesis. This model is the Abstract Architecture. It is defined in terms of concurrent autonomous Virtual Processors (VP) that communicate in a point to point fashion over bounded channels. In the next sections we show how a CDFPN is mapped onto an Abstract Architecture. The mapping to an Abstract Architecture is made using two operations. The first operation generates an interconnection network between a number of processors. The processors are generated by the second operation, in which we make use of a predefined processor architecture template. Because the first operation deals with the topology of an architecture, it is called the Topological Mapping. The second operation is called the Semantic Mapping. The semantic model of a processor is defined in this second operation, i.e., how the various components interact with each other: autonomous processors that communicate over channels using blocking read and blocking write semantics.

2.2 Topological Mapping

The topological mapping operation creates an architectural interconnection network that has the same topology as the given CDFPN. This is due to the one-to-one topological mapping used by our methodology. The architectural interconnection network is captured by our Abstract Architecture model. The one-to-one mapping ensures that the task level parallelism (TLP) captured by the CDFPN MoC is propagated to the FPGA level. The Abstract Architecture naturally fits the CDFPN MoC when:

• The Abstract Architecture communication and synchronization primitives match the CDFPN MoC communication and synchronization primitives;

• The operational semantics of the Abstract Architecture match the operational semantics of the CDFPN MoC;

• The data types used in the CDFPN MoC match the data types used in the architecture.

Hence, the Abstract Architecture interconnection network has at least the following implementation characteristics:


• Communication and Synchronization:

– FIFO buffer used to realize the interprocess communication;

– The blocking read synchronization primitive is part of the FIFO behavior. Any virtual processor implements the read-token-from-FIFO function as an execution blocking function.

• Operational semantics. An FPGA multi-processor execution platform provides a large number of opportunities to create various complex operations. These operations can be realized using the embedded CPUs or DSP blocks, distributed block RAMs, input/output blocks, and Configurable Logic Blocks (CLBs).

• An FPGA multi-processor execution platform is optimized to work with scalar data types rather than with packages of data. Hence, in this dissertation, we consider that a CDFPN token (or datum) is always a scalar.

Additionally, because a FIFO buffer is bounded in real life, a new synchronization primitive is introduced: the blocking write synchronization primitive. This protocol states that a virtual processor halts when it attempts to write to a channel that is full. Thus, a virtual processor can continue its execution whenever there is neither a blocking read nor a blocking write situation.

Not all the communication channels of the Abstract Architecture are FIFOs, as is the case in the CDFPN. This is due to a different approach in handling the four types of communication that exist in a CDFPN. These communication types have been introduced in [41] and enumerated in Section 2.1. This derogation does not change the topology of the Abstract Architecture with respect to the topology of the COMPAAN generated PN. However, the communication channels may have a different execution behavior than a FIFO (First In First Out), e.g., a LIFO (Last In First Out) execution behavior, or circular buffers.

Our philosophy is to keep a virtual processor of an Abstract Architecture memoryless; all the communication related memory is handled by the Abstract Architecture interconnection network. Thus, we clearly separate computation from communication, as a virtual processor only deals with the computational part of the CDFPN MoC and the Abstract Architecture interconnection network with the communication. The different communication primitives that the Abstract Architecture interconnection network employs are discussed in Chapter 4. The output ports and the input ports of a CDFPN are instantiated according to the type of channel used. An example of an abstract interconnection network is the CDFPN given in Figure 2.3; in the Abstract Architecture interconnection network, however, the FIFO buffer is bounded. We discuss in Chapter 5 how to determine the upper limit of all the channel types used in an implementation. More complex network examples are given in the next sections and chapters.

2.3 Semantic Mapping

We showed before how the topology of the Abstract Architecture is generated using the topological mapping operation. This operation defines the communication channels in terms of types, bounds, and synchronization primitives. Each of these communication channels is connected to a processor. This processor is called a Virtual Processor (VP). In this section, we show the synthesis template of a VP, and how to properly synthesize a VP into an Abstract Architecture.

[Figure: a Virtual Processor consisting of a Read unit (input ports IP_1, IP_2, ..., a selector with data path select signals, and a unit controller), an Execute unit (an IP core with input and output arguments, embedded in an IP core wrapper driving the IP core control signals), and a Write unit (output ports OP_1, OP_2, ..., a switch with data path switch signals, and a unit controller), connected by unit synchronization signals.]

Figure 2.4: The Virtual Processor synthesis template

We start by presenting the synthesis template of a VP in Figure 2.4. A virtual processor is composed of three units: a Read Unit, a Write Unit, and an Execute Unit. The Execute unit is the computational part of a virtual processor. It has a number of input arguments that provide the unit with the necessary data for execution, and a number of output arguments that hold the results of the computation. In our model of implementation, the Execute unit fires when all the input arguments have data, and it always produces data on all the output arguments at once. The functionality of the Execute unit is realized using an IP core, taken from an IP library. The control of the Execute unit firing and of the embedded IP core is done by the IP core wrapper. The IP core wrapper is discussed in Chapter 7. The Read unit is responsible for assigning all the input arguments of the Execute unit with valid data. Since there may be more input ports than input arguments, the Read unit has to select from which port to read data. This selection is realized by a selector. The selector is controlled by the local unit controller. An Input Port is the input interface that connects the virtual processor with a communication channel. An Input Port is characterized by an iteration domain called the Input Port Domain (IPD). The Output Port is the output interface that connects the virtual processor with a communication channel. An Output Port is characterized by an iteration domain called the Output Port Domain (OPD). The Write unit is responsible for distributing the results of the Execute unit to the relevant processors in the network. A write operation can be executed only when all the output arguments of the Execute unit are available to the Write unit. Since there may be more output ports than output arguments, a switch is used for the proper selection of the Output Port. The switch is controlled by a local unit controller. The local unit controller selects the proper Output Port according to the current state (i.e., iteration) of the virtual processor.

A key characteristic of the VP is that it models an overlay of these three units, realized using inter-unit synchronization signals. The overlapped execution model of a VP is shown in Figure 2.5, where R is the execution of the Read unit, E is the execution of the Execute unit, and W is the execution of the Write unit.

The processes in a CDFPN are static affine nested loop programs. These programs are mapped by the semantic mapping operation to a VP. Thus, the Von-Neumann model of execution
