
Bachelor Informatica

Building and testing a processor architecture with out-of-order execution in CλaSH

Harold Aptroot

June 8, 2017

Supervisor(s): Taco Walstra

Informatica

Universiteit van Amsterdam


Contents

1 Introduction
  1.1 Related Work
2 Out-of-order Execution
  2.1 Register Renaming
    2.1.1 Tag-indexed register file style
    2.1.2 Reservation Station style
  2.2 Instruction Selection
  2.3 Precise Exceptions
3 Implementation
  3.1 ISA and Assembly Language
  3.2 Register Renaming Implementation
    3.2.1 Zeroing Idioms
  3.3 Reorder Buffer
  3.4 Instruction Queue
  3.5 Load-Store Unit
  3.6 Other Implementation Details
4 Benchmarking
5 Results
6 Discussion and Future Work
  6.1 Discussion
  6.2 Future Work
References
Appendices
A Instruction Set Reference
B Count Trailing Zeros


Abstract

This paper investigates the effects of the size of the reorder buffer on the ability of an architecture with out-of-order execution to tolerate a high memory latency. The effects are measured by running benchmarks on a simulated architecture written in CλaSH. The exact figures depend strongly on the code being run. If there are enough independent load operations, the size of the reorder buffer strongly affects the ability of the architecture to tolerate a high memory latency.

1 Introduction

The goal of this project is to implement a processor architecture with out-of-order execution in CλaSH and to quantify its performance for various memory latencies and reorder buffer sizes. The differences in clock speed caused by changes in reorder buffer size are not considered, in order to limit the scope of the project. The expectation is that a larger reorder buffer improves the architecture’s ability to cope with high memory latencies.

CλaSH is a hardware description language based on Haskell [2]. Code written in CλaSH is Haskell code, so it can be compiled and run directly. The ability to run CλaSH code directly makes it possible to do behavioural debugging and verification at the Haskell level, avoiding the need for most test benches in VHDL or Verilog.

A synthesizable subset of the language can be compiled into VHDL, Verilog or SystemVerilog by the CλaSH compiler. Not all constructs that are valid in Haskell are synthesizable; the most important ones that cannot be compiled (but can be simulated) are normal (non-polymorphic) recursion and data types that combine multiple constructors with data fields. This project uses only the synthesizable subset of Haskell, but the resulting VHDL/(System)Verilog code has not been used for testing or benchmarking.

Normal recursion poses no problem in software, but with the “function application is instantiation” view that CλaSH takes, it would imply an infinitely large circuit that accounts for every possible recursion depth. Polymorphic recursion can be used to limit the depth of the recursion to a constant expressed in the type, as is done for example in the implementation of map applied to a vector (a fixed-length replacement for lists) from CλaSH’s Prelude, seen in code listing 1.

map :: (a -> b) -> Vec n a -> Vec n b

map _ Nil = Nil

map f (x `Cons` xs) = f x `Cons` map f xs

Code Listing 1: The implementation of Vector.map in CλaSH’s Prelude.

The trick that makes this recursion synthesizable is that the type of the second argument changes from Vec n to Vec (n-1) and so on until finally map is called on a vector of size zero and the recursion ends. This is in contrast with a typical recursion on a list where the type does not change and the depth of the recursion is not bounded by the type.

Data types with multiple constructors that also have data fields are not supported, though this seems to be a missing feature rather than an inherent limitation. It can be worked around by manually implementing the type as an explicit tagged union; for example, a type such as the one in code listing 2 can be reworked to the types shown in code listing 3.


These limitations are further described in “From Haskell to Hardware” [1] and “CλaSH: Structural Descriptions of Synchronous Hardware using Haskell” [2].

data Broadcast
  = NoBroadcast
  | Completed ROBIndex Tag Word32

Code Listing 2: A type that combines multiple constructors and data fields.

data Broadcast = Broadcast
  { bc_type  :: BroadcastType
  , bc_index :: ROBIndex
  , bc_tag   :: Tag
  , bc_value :: Word32
  }

data BroadcastType = NoBroadcast | Completed

Code Listing 3: Like code listing 2, but refactored into an explicit tagged union.

1.1 Related Work

CλaSH has mostly been used to implement data processing algorithms on field-programmable gate arrays (FPGAs), but it has also been used to implement different kinds of processors. Two examples of processors implemented in CλaSH are a data flow processor (Niedermeier et al. [8]) and a very long instruction word (VLIW) processor (Bos [3]).

The effect of the number of physical registers on the performance of an out-of-order processor with a realistic memory latency has been investigated by Espasa et al. [4]. Mutlu et al. [7] investigated the effect of the size of the instruction window and presented runahead execution as an alternative to a very large reorder buffer.

2 Out-of-order Execution

The history of out-of-order execution starts with the CDC 6600 (1964), designed at the Control Data Corporation. The CDC 6600 implemented a limited form of dynamic instruction scheduling (Thornton’s “scoreboarding” algorithm, [10]) to extract instruction level parallelism (ILP) from scalar code. The central processor of the CDC 6600 was capable of issuing independent instructions in parallel as long as the required functional units were free, but stalled on true dependencies.

Full out-of-order execution, allowing complete reordering of independent instructions through the use of register renaming, was first seen in the floating point section of the IBM System/360 Model 91 (1966), which used Tomasulo’s algorithm [11].

The main goal in 1966 was to extract ILP, as can be seen from the title of Tomasulo’s paper, “An efficient algorithm for exploiting multiple arithmetic units”. With the widening of the memory gap it has since been argued that tolerance of high-latency loads is the more important benefit of out-of-order execution [5].

2.1 Register Renaming

In order to enable full out-of-order execution, write-after-read (WAR) and write-after-write (WAW) dependencies have to be removed. WAR and WAW dependencies occur because a register is written before an earlier read or write of it has completed, but this is purely a naming problem. If there were enough registers (ignoring the need to merge control flow paths) to never write to the same register twice, WAR and WAW dependencies would not exist. That can be emulated by renaming dynamically in hardware, giving a new “name” to an architectural register every time it is written to, and reclaiming names when they are no longer in use.
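As an illustration (using the assembly syntax of the ISA from section 3; the registers and values are arbitrary), the second write to r1 below creates WAW and WAR dependencies that renaming removes, while the true dependency of the first add remains:

mov r1, [r2 + 0] ; write r1 (a possibly slow load)
add r3, r1       ; read r1: a true dependency on the load
mov r1, 5        ; WAW with the load, WAR with the add: naming only
add r4, r1       ; independent of the load once r1 is renamed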

The two most frequently mentioned schemes for register renaming can be classified by the namespace they rename to and by where values are eventually stored in the machine: the tag-indexed register file style and the reservation station style.

2.1.1 Tag-indexed register file style

In the tag-indexed register file style there is a central physical register file (PRF). Architectural registers are renamed to physical registers and the corresponding index in the PRF is used as a “tag” to track dependencies. When an instruction is decoded, its operands are looked up in the remap file which keeps track of the newest name for each architectural register. Its destination(s) are renamed to free physical registers and the remap file is updated accordingly.

In this style the reorder buffer and architectural register file (the non-speculative state, updated when an instruction retires; see section 2.3) contain tags instead of values. A future file, from which the reservation station style loads the newest available value of a register, is not used: all values are loaded from the physical register file.

The free tag reclaiming mechanism is more complicated in this style than in the reservation station style. Since the architectural register file contains tags as well (it should be in the same style as the remap file, since the remap file will be restored from the architectural register file in case of a pipeline flush), tags cannot be freed until they leave the architectural register file or the instruction writing to it is flushed from the pipeline. In the simplest case, with no shared mapping (renaming multiple architectural registers to the same physical register), a tag can be freed immediately when it leaves the architectural register file. When shared mapping is desired, some form of reference counting is required. Tags can then be freed only when their reference count is decreased to zero.

The tag-indexed register file style is used by, for example, the MIPS R10000 [12].

2.1.2 Reservation Station style

In the reservation station style values are stored in a more distributed manner and the tag is not inherently meaningful. The tag can be supplied, for example, by a sufficiently wide counter.

In this style the architectural register file directly contains the non-speculative values of architectural registers, instead of tags. The result values of instructions must be stored in the reorder buffer so they can be written out to the architectural register file when the instruction retires. The remap file also exists in this style (although the terminology varies), but it is only used to look up the tags of values that have not been computed yet. Values that have been computed are supplied by the future file, which contains the newest computed value of each architectural register. The reservation stations store both the μops and their operands: either the tag or the value itself, depending on whether the value is available. μops, or micro-operations, are the internal operations that instructions are decoded to. They are intended to need little further decoding and can store information that ISA instructions have no fields (or too narrow fields) for.

The latency from instruction selection to execution can be lower than in the tag-indexed register file style, because there is no indirection through the physical register file. Instead, the value is “captured” in the reservation station. A downside of this style is that the reservation stations need content-addressable memory (CAM) to capture broadcasted values in the corresponding operands of the μops they contain. A CAM is more expensive to implement in terms of hardware (and more complicated to implement in CλaSH) than a RAM. The instruction queues in the tag-indexed register file style can be implemented with only RAM.

The S/360 Model 91 (the first microarchitecture with true out-of-order execution) used this style; perhaps for that reason it tends to be the style covered by textbooks such as “Computer architecture: a quantitative approach” [9].

2.2 Instruction Selection

The mechanism that actually reorders the execution of instructions is the instruction selection. After decoding and renaming, μops are placed in a buffer where they wait until they can be executed. Though the principle is the same, the name of that buffer depends on the style of register renaming. In the reservation station style, the buffer is the reservation station. In the tag-indexed register file style, the buffer is called the issue queue or instruction queue. The name “queue” may be unfortunate, since it suggests a first-in-first-out buffer, while the purpose is precisely to change the order in which μops are executed.

A μop can be issued (sent to an execution unit) when its inputs are available and an execution unit that can execute it is ready. It is not inherently necessary to issue the oldest eligible μop. However, the reorder buffer (see section 2.3) has a limited size and instructions are removed from it in program order, so a μop that remains unexecuted for a very long time will block other μops from retiring and the reorder buffer will fill up. When the reorder buffer is full, no new instructions can be decoded, causing the instruction queue to slowly drain as instructions are issued from it. Therefore, as long as the instruction selection process definitely chooses some eligible μop (if one exists) every cycle, all decoded instructions will eventually be executed.

Relying on the ROB filling up to drain the instruction queue of old μops is bad for performance. This technique causes more front-end (instruction fetcher and decoder) stalls than necessary and also delays the execution of instructions that are only allowed to be executed once they have become the oldest active instruction. Therefore, most instruction selector designs prioritise the execution of old μops. The two main ways to do so are a collapsing queue, in which μops are sorted by age, and an unordered buffer with explicit priority encoding.

When a μop is issued from a collapsing queue, the “hole” left by removing the μop is filled by shifting in the newer μops, “collapsing” the hole. Priority is given to the first eligible instruction in the queue, and new instructions from the decoder are always inserted at the end of the queue. Shifting the μops to fill the hole is a fast operation in hardware (all entries move at the same time; it is not a one-by-one process as it would be in software), but it requires a rather large circuit that can read and write the entire buffer in one cycle. An unordered buffer requires an explicit priority encoding to keep track of the oldest μop in the buffer.


2.3 Precise Exceptions

Exceptions (including branch mispredictions) pose a problem in out-of-order execution, because when they are detected the machine is usually not in the state an in-order architecture would have been in. The reordering of instructions should not be directly observable, as that would significantly complicate programming, or even make it essentially impossible. Consider, for example, a mispredicted branch: depending on how soon it is detected, some number of instructions on the incorrect path may already have been executed. A number of solutions to this problem exist, for example the reorder buffer and the history buffer.

A history buffer stores (along with other information about the instruction) the old values of the architectural registers that an instruction wrote to. The architectural state is updated immediately as results become available. An old state can be reconstructed by rewinding the history buffer, effectively stepping back in time, undoing the effect of the instructions one by one. This process is inherently serial, so it scales poorly with the size of the buffer. Removing an instruction from the buffer is usually trivial, however: nothing needs to be done at all unless the instruction has generated an exception.

A reorder buffer stores the new value that an architectural register will have; the architectural state is then updated in program order. This process is also serial, but it is done incrementally instead of all at once. The architectural state is consistent at all times, and it can be reverted to by discarding all speculative state. A frequently mentioned downside of using a reorder buffer is that a large CAM is needed to look up the newest value (or tag) of a register in it; however, with a future file or remap file (depending on the style of renaming), the newest value is available outside the reorder buffer in a RAM-indexed structure.

3 Implementation

3.1 ISA and Assembly Language

The instruction set architecture (ISA) for this design largely follows a RISC philosophy, to simplify the implementation. As such, it is a load/store architecture and it features simple instruction formats, a mostly orthogonal instruction set, and a Harvard memory model. In order to better show off register renaming, this is a two-operand ISA with few registers. Having a lot of architectural registers would also complicate debugging, since it increases the amount of state to look through.

In addition to the eight general purpose registers, there are also the flags register and the link register.

The flags register contains the usual zero (indicating the result was zero), carry (indicating a carry out of the top bit, or a borrow out of the top bit for subtractions), sign (a copy of the sign of the result) and overflow (indicating signed overflow, defined as the XOR between the carry into the sign and the carry out of the sign) flags. The flags are affected by most arithmetic instructions and read by most branches. Having a flags register is perhaps not very RISC-like, but it is often useful and it does not really hurt the design. There are no instructions that write only some but not all flags, to avoid having to either rename parts of the flags separately or read the old value of the flags.
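As a sketch of these definitions (not the code of this design), the flags for a 32-bit addition could be computed as follows; for addition, the carry-in/carry-out definition of overflow is equivalent to checking that the operands share a sign that the result lacks:

import Data.Bits (testBit)
import Data.Word (Word32, Word64)

addFlags :: Word32 -> Word32 -> (Word32, Bool, Bool, Bool, Bool)
addFlags a b = (res, zero, carry, sign, overflow)
  where
    wide     = fromIntegral a + fromIntegral b :: Word64
    res      = fromIntegral wide :: Word32
    zero     = res == 0            -- the result was zero
    carry    = testBit wide 32     -- carry out of the top bit
    sign     = testBit res 31      -- copy of the sign of the result
    overflow = testBit a 31 == testBit b 31
            && testBit res 31 /= testBit a 31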

The link register is used for subroutines: the jal (jump and link) instruction sets it to the return address, and the ret (return) instruction is a jump to the address contained in the link register. A jal/ret pair that works through a link register is easy to implement without decoding instructions to more than one μop, and it avoids storing and loading the return address in leaf functions. With a more CISC-like call/ret pair that implicitly pushes and pops the return address on the stack, it may be necessary or advantageous to split the instruction into more than one μop. Such a ret instruction would both modify the stack pointer and load the return address from memory. If the stack pointer modification is not split off as a separate μop, a later instruction that uses the stack pointer has to wait for the entire ret to complete, including the (possibly slow) implicit load. Splitting the ret during decoding solves that problem, but splitting the return operation at the ISA level additionally allows leaf function calls to be implemented with two fewer memory accesses.

Instructions that should be decoded into multiple μops are avoided in this ISA to simplify the implementation. Decoding an instruction into multiple μops would mean dealing with more than one μop in parts of the pipeline or stalling the front-end while μops are generated.

Originally the ISA had fixed-size instructions, but the limit this placed on immediate values was too restrictive, often leading to many arithmetic operations just to load a constant value. Loading the values from data memory did not solve that problem since it would still take several instructions to construct the address of the value. To address that problem, an instruction format with a 32-bit immediate operand was added.

The memory ordering is designed without multiprocessing in mind. Stores must not be reordered and cannot be executed speculatively. Loads can be reordered with respect to independent stores and other loads, but through-memory dependencies must be maintained. Memory barriers and atomic instructions are not part of the ISA.

All instructions are listed with their encoding and description in appendix A.

3.2 Register Renaming Implementation

This design follows the tag-indexed register file style, so the renaming namespace is the set of names of physical registers. There are 64 physical registers, making the tags 6 bits wide. Since physical registers and tags correspond to each other in this way, “tag” and “physical register” are used almost interchangeably in this section.

Register 0 is reserved as the zero-register, with a constant value of zero, and is never written to or freed. Having a dedicated zero-register that is never freed allows multiple architectural registers to be remapped to it (shared mapping) without reference counting (more details in section 3.3 about the ROB). It also allows zeroing idioms to be handled by the renamer, which can simply rename the destination to the zero-register.

The operands of an instruction passed into the renamer are changed from architectural register names to physical register names (tags). The tag that an operand is replaced with is the newest tag that its architectural register has been renamed to, which is found (by RAM lookup) in the remap vector. If the instruction writes to an architectural register, a free tag is allocated from the free-queue and the remap entry for that architectural register is updated with the new tag.
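A minimal sketch (not the code of this design) of a single rename step in this style, using plain Haskell lists in place of the design’s hardware structures; all names are illustrative:

type Tag     = Int   -- physical register name, 0..63 in this design
type ArchReg = Int   -- architectural register index

-- Look up the source operand in the remap file, rename the destination
-- to a fresh tag taken from the free-queue, and update the remap file.
rename :: [Tag] -> Tag -> ArchReg -> ArchReg -> ((Tag, Tag), [Tag])
rename remap fresh dst src =
  ((fresh, remap !! src),
   [ if i == dst then fresh else t | (i, t) <- zip [0 ..] remap ])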

This is simpler (both in CλaSH and in hardware) than the mostly outdated technique where the mapping is stored in an associative CAM with priority encoding. That CAM would match all the physical registers that the given architectural register has recently been renamed to, and the priority encoding selects the most recent of those names.

The free-queue is not a true queue but a bitmap with 64 bits, each bit corresponding to a tag. If a bit is set, the corresponding tag is free. A free tag can be taken from the queue by computing the number of trailing zeroes (ntz), which can be done with an O(log #tags)-depth circuit. The CλaSH prelude has a definition of countTrailingZeros that does not guarantee that logarithmic depth, so an algorithm based on fold is implemented; see appendix B. The bit corresponding to the selected tag can be reset in parallel with the ntz computation by computing free .&. (free - 1), which resets the lowest set bit without needing to know its index [6]. Subtraction can be implemented with a circuit of logarithmic depth as well; this is left up to the synthesis tools, however. The main motivation for storing the free-queue as a bitmask is that the ROB may have to free many tags at once when flushing out many entries, which is easier to implement as the bitwise OR between two bitmasks than as enqueuing an arbitrary number of items in an actual queue.
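A sketch of taking a tag from the free-queue, with Data.Bits standing in for the CλaSH BitVector operations (the function name is illustrative):

import Data.Bits ((.&.), countTrailingZeros)
import Data.Word (Word64)

takeFreeTag :: Word64 -> Maybe (Int, Word64)
takeFreeTag free
  | free == 0 = Nothing              -- no free tags available
  | otherwise = Just (tag, free')
  where
    tag   = countTrailingZeros free  -- ntz: index of the lowest set bit
    free' = free .&. (free - 1)      -- clear that bit without knowing its index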

3.2.1 Zeroing Idioms

Two common zeroing idioms are supported, xor same,same and sub same,same, where same is some register that is used as both operands. The renamer checks whether an operation is of that form, and handles the instruction specially if it is.
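The check itself is simple; a hypothetical sketch of its shape:

data Op = Xor | Sub | Add | Mov deriving Eq  -- illustrative subset

-- xor same,same and sub same,same always produce zero,
-- regardless of the register's current value.
isZeroingIdiom :: Op -> Int -> Int -> Bool
isZeroingIdiom op dst src = (op == Xor || op == Sub) && dst == src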

When the renamer detects a zeroing idiom, the destination and the flags are renamed to the zero-tag and a special ROB-only μop is emitted. ROB-only μops still go to the ROB (registers were renamed while decoding the instruction, so something is needed to update the architectural state with the new names at retirement), but they do not go to the instruction queue. Since the μop does not need to be executed, it is marked done immediately when it is inserted into the ROB.

3.3 Reorder Buffer

This implementation uses a reorder buffer to implement precise exceptions. The list of ROB entries is implemented as a circular queue: instead of keeping the oldest entry at index 0 and shifting all the other entries when the oldest is removed, the entries never move, and a pair of indices indicates where the oldest and newest entries are. This is mainly motivated by the need to mark entries as done (for normal completion of the corresponding μop) or failed (when the corresponding μop generated an exception or misprediction). Since the index where the ROB entry of a μop is stored does not change with a circular queue implementation, μops can carry their own ROB index, which can be broadcasted and used by the ROB to mark the corresponding entry done or failed with a RAM instead of a CAM.

Another motivation is the need to communicate to the load-store unit (LSU) when a store instruction has become the oldest instruction in the machine; with a stable ROB index this can be done by passing the index of the oldest entry to the LSU. This way store instructions do not need a physical register allocated to them just to be able to compare tags. More details are in section 3.5, which covers the load-store unit. A third motivation is that conditionally moving all entries would need a large (though simple) circuit to multiplex every entry with the next-newest entry, whereas moving them virtually only requires a (narrow) conditional incrementer.

The ROB retires an instruction when it is the oldest instruction in the machine and it is either done or failed. If the instruction that retires is done, the architectural registers it writes to in the architectural register file (the non-speculative state) are updated with the tags of the results. The tags that leave the architectural state by being overwritten that way are marked free, except the tag corresponding to the zero-register, which is never freed. The retiring instruction is (by definition) the oldest instruction in the machine, and when it was decoded the architectural registers it writes to were renamed, so the previous names of those registers cannot be referred to by any newer instructions. That means that when a tag leaves the architectural state there can be no active instruction that refers to it, so the tag is safe to re-use.

That invariant would be broken by allowing shared mapping, necessitating reference counting. To prevent the need for reference counting, general shared mapping is not supported by this implementation. The zero-tag is an exception: multiple architectural registers can be mapped to the zero-tag at the same time, but since it is never freed there is no need to implement reference counting for it.

Branch mispredictions and other exceptions are only handled when the mispredicting instruction retires. When a mispredicted branch retires, the back end is flushed completely, the remap state is restored from the architectural state (which, since all instructions older than the branch have retired at that point, reflects the proper pre-branch state), the front end is redirected to the proper address, and the destination tags of the flushed instructions are marked free. Other schemes that aim to continue down the right execution path immediately upon discovering the misprediction, such as the eager misprediction recovery presented by Zhou et al. [13], are significantly more complicated.

3.4 Instruction Queue

The instruction queue is implemented as a collapsing queue, meaning that the age of a μop is encoded by its position in the queue and no auxiliary data structure is needed. The instruction queue is typically smaller than the ROB, so the size of the queue-collapsing circuit is less of a concern, and the instruction queue index of a μop is not needed outside the instruction queue, so there is no benefit to keeping it stable over the life of a μop.

When an instruction is inserted into the queue, its operand tags are translated into a bitmask in which a set bit indicates a physical register that the instruction does not depend on; that mask is inserted into the vector deps.

The instruction queue manages its own bitmask that indicates which physical registers are “ready”. A physical register is considered “ready” when the newest instruction that writes to it has been executed and it has not since been freed. When an instruction that takes a single cycle to execute is selected for execution, the bit in ready corresponding to its destination tag is set immediately, making use of forwarding to enable dependent single-cycle instructions to execute back-to-back. Because of this optimisation, ready has to be tracked locally in the instruction queue.

The readiness mask of all μops in the queue can then be computed as ready = v2bv . reverse $ map (reduceAnd . (r .|.)) deps, where r is the mask of “ready” physical registers. The reverse puts the mask in oldest-first order (CλaSH normally considers the bit at index 0 of a vector to be the top-most bit of the corresponding BitVector), so that it matches the order of the mask that tracks which entries of the queue are filled. The index of the oldest ready μop can then be found by computing the number of trailing zeroes of the computed mask.
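The per-μop readiness test, written out for a single deps entry in plain Haskell (a sketch; the design computes all entries at once as shown above):

import Data.Bits ((.|.), complement)
import Data.Word (Word64)

-- A μop is ready when (r .|. deps) has every bit set, i.e. every register
-- it does depend on is marked ready in r. This is the AND-reduction.
isReady :: Word64 -> Word64 -> Bool
isReady r deps = complement (r .|. deps) == 0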

This way of computing the readiness of the μops avoids the need for a CAM, at the cost of evaluating a bitwise OR and an AND-reduction in the critical path of instruction selection. Another common method uses a CAM to capture the readiness of the registers read by a μop and store it in the corresponding entry of the instruction queue; that way the readiness of a μop can be computed with a single OR-gate.

A selection mode that gives priority to the address-generation μops of all loads and stores is also implemented. This mode calculates an extra mask priority with a 1 for every address generation μop; if (priority .&. ready) /= 0, the μop to execute is selected with countTrailingZeros (priority .&. ready). Giving priority to address generation μops is only intended for an experiment, not as an integral part of the architecture.
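A sketch of that selection rule, with bitmasks over queue entries in oldest-first order (plain Data.Bits instead of CλaSH BitVector; names are illustrative):

import Data.Bits ((.&.), countTrailingZeros)
import Data.Word (Word64)

selectEntry :: Word64 -> Word64 -> Maybe Int
selectEntry priority ready
  | pr    /= 0 = Just (countTrailingZeros pr)     -- oldest ready address generation μop
  | ready /= 0 = Just (countTrailingZeros ready)  -- otherwise the oldest ready μop
  | otherwise  = Nothing                          -- nothing eligible this cycle
  where pr = priority .&. ready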

3.5 Load-Store Unit

The load-store unit (LSU) in this design is responsible for the execution of all loads and stores, managing the cache, and interfacing with main memory. Main memory is modelled as a blockRam with additional delay, to model slow memory.

The LSU is not responsible for address calculation. After decoding, the load or store μop is sent to both the LSU and the ALU instruction queue. The copy that goes to the ALU instruction queue is interpreted as an add-immediate, and its result goes from the ALU directly to the LSU, bypassing the normal broadcasting mechanism. Using a normal broadcast would not be appropriate, since the completion of the address calculation signals neither the completion of a μop (there is only one ROB entry allocated to a load or store) nor the completion of a value in a physical register.

This LSU is capable of store-load forwarding, but it does not apply memory dependence prediction. Without store-load forwarding, if a load and an earlier store access the same effective address, the load would have to wait until the store retires. Store-load forwarding enables that load to execute by taking the value directly from the store. Without memory dependence prediction, store-load forwarding is allowed only when the addresses of both the load and the store are available, there are no blocking stores between the load and the store, and the value-operand of the store is available. Blocking stores are stores that may write to the same effective address: both stores that actually write to the same effective address and stores whose address has not been computed yet.
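A sketch of the forwarding decision for a single load whose address is known, scanning the older stores from youngest to oldest (plain lists instead of the LSU’s vectors; all names are illustrative):

import Data.Word (Word32)

data LoadAction = FromMemory | Forward Word32 | Wait deriving Show

-- Older stores, youngest first, as (maybe-known address, maybe-known value).
resolveLoad :: Word32 -> [(Maybe Word32, Maybe Word32)] -> LoadAction
resolveLoad _    []                      = FromMemory
resolveLoad addr ((mAddr, mVal) : older) =
  case mAddr of
    Nothing -> Wait                          -- address unknown: blocking store
    Just a
      | a /= addr -> resolveLoad addr older  -- independent store, keep scanning
      | otherwise -> maybe Wait Forward mVal -- match: forward if the value is known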

The ISA specifies only aligned loads and stores of a single size, so memory operations never partially overlap. The selection of forwarding candidates and blocking stores can therefore be based on exact equality of the effective address. It also means no special mechanism is needed to support loads and stores that cross a cache line boundary.

The store-load forwarding circuit is the worst-scaling circuit in the entire design. Every entry in the queue is matched with every previous entry, giving a circuit size of O(#entries²). The depth scales logarithmically thanks to the logarithmic circuit depth of fold, which is used extensively in the LSU. A common pattern is to map a function that produces results wrapped in Maybe over all the entries and then fold the results with the helper function keepOlder, shown below.

keepOlder older newer = maybe newer Just older

maybe is a function from the standard Haskell library with the type signature b -> (a -> b) -> Maybe a -> b: if the last argument is Nothing it returns the first argument, otherwise it applies the function in the second argument to the value inside the last argument. In this usage, that means that if older is Just x, Just x is returned; otherwise, if older is Nothing, newer is returned. Folding the keepOlder function over a vector of Maybes therefore selects the first (oldest) item that is not Nothing, and this is accomplished with a circuit of logarithmic depth.
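A small usage sketch: because keepOlder is associative, a balanced fold over the oldest-first results gives the same answer as a left-to-right scan, in logarithmic depth:

keepOlder :: Maybe a -> Maybe a -> Maybe a
keepOlder older newer = maybe newer Just older

oldest :: Maybe Int
oldest = foldr keepOlder Nothing [Nothing, Just 7, Nothing, Just 9]
-- oldest == Just 7, the first (oldest) result that is not Nothing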

The cache is very simple because it is not the focus of the experiments. It is two-way set-associative with LRU replacement; the block size is 16 and there are 8 sets.


3.6 Other Implementation Details

The implementation is intentionally limited to issuing only one instruction from the ALU instruction queue per cycle, because issuing more than one would complicate printing and reading an execution trace. The multiplier unit already needs a broadcast channel separate from the single-cycle functional units due to its different latency (the alternative would be changing the scheduling to actively avoid write-back conflicts), so it would be easy to change the architecture to issue up to two μops per cycle where one of them goes to the multiplier.

Branches are predicted with a static predictor that predicts backward branches taken and forward branches not taken. There is no branch target prediction for normal branches, because the actual target can be computed in time to do the instruction fetch from the new address. The jump target of a return instruction is predicted using a 16-entry ring buffer.

Instructions with a 32-bit immediate operand are decoded in three cycles, one for every instruction word (16 bits) worth of data. The decoder is a small state machine that keeps track of what it has read so far. The state is reset when the pipeline is flushed. The decoder does not transition between states while the front end is stalled, unless the pipeline is flushed during that time.

4 Benchmarking

The main benchmark tests the performance of a small program that calculates an integer dot product. The dot product of two arrays is the sum of the pairwise products, that is $\sum_i a[i] \cdot b[i]$. Calculating a dot product is a relatively common operation, making it relevant as a benchmark. Its dependency structure is in a sense “between” that of a simple mapping operation (for example vector addition), where every item is completely independent, and a completely serial dependency structure (for example iterating through a linked list).

The exact code is shown in code listing 4; it calculates an integer dot product because floating point operations are not implemented in this architecture. The dependency structure of the code is shown in figure 1; loop-carried dependency chains are visualised as loops in the diagram. The relatively loose structure should allow the address computation and the branch to “run ahead” of the main computation by a number of instructions that mainly depends on the reorder buffer size. The time taken by the whole program is defined as the time from the first cycle to the earliest cycle in which the jmp stop instruction is committed. By that time all earlier instructions are (by definition) also committed, so the fact that the end of the program does not depend on the result of the computation does not allow work to be postponed past the detection of the stop condition. This is a pessimistic time: it takes a couple of cycles before the first instruction can be executed, and the measurement then waits the longest reasonable time to conclude that the program is finished. When placed in an actual context as a function call, the code would appear to take less time.

The times for different combinations of memory latency and reorder buffer size are shown in figure 3. The effect of decreasing the size of the vectors to 16 elements each is shown in figure 6.

The discrete derivatives of the times (with respect to memory latency) from figures 3 and 6 are shown in figures 4 and 7 respectively. The discrete derivative of the time with respect to memory latency is the number of cycles the total time increases per cycle of added memory latency: a dimensionless number that indicates how performance scales with memory latency “locally”, around a given memory latency. I will refer to this measure as “incremental memory latency impact” (IMLI).
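In symbols, writing $T(\ell)$ for the total time at a memory latency of $\ell$ cycles (a restatement of the definition above, with a one-cycle step between measurements), the measure plotted in figures 4 and 7 is

$$\mathrm{IMLI}(\ell) = T(\ell + 1) - T(\ell)$$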


Figures 5 and 8 show how many cycles the front-end (fetch and decode/rename stages) was stalled because the reorder buffer was full.

Figures 9 and 10 show the time and the IMLI, respectively, of the benchmark when the instruction selector is changed to schedule address generation μops with priority. The priority works by first attempting to select the oldest eligible address generation μop, and only considering other types of μops if none was found.

Another benchmark tests one of the worst cases for out-of-order execution: iterating through a linked list in a bad memory order. The linked list has been generated by taking ((i + 13) · 13) mod 64 for i = 0 to i = 63, generating a random-looking pattern while guaranteeing that the linked list is a single cycle, so that no part of it is inadvertently skipped. The program iterates through the entire cycle and stops when it reaches the beginning again. The code is shown in code listing 5; the time for a range of memory latencies is shown in figure 11.
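A sketch reproducing that construction and checking the single-cycle property (the step has an odd increment and a multiplier that is 1 mod 4, so by the Hull-Dobell conditions for linear congruential generators it permutes 0..63 in one cycle of length 64):

next :: Int -> Int
next i = ((i + 13) * 13) `mod` 64

cycleLength :: Int
cycleLength = 1 + length (takeWhile (/= 0) (tail (iterate next 0)))
-- cycleLength == 64: the walk visits every node before returning to 0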

.section data
.dd 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
.dd 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
.dd 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
.dd 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
.section text
        mov r2, 128 ; change to 64 for vector size 16
        mov r5, r2
dotloop:
        mov r3, [r1 + 0]
        add r1, 4
        mov r4, [r2 + 0]
        add r2, 4
        mul r3, r4
        add r0, r3
        cmp r1, r5
        jnz dotloop
        ; result in r0
stop:   jmp stop

Code Listing 4: Code listing of the dot product program.


[Figure 1: the dependency structure of the dot product loop body (mov/add/mul/cmp/jnz); loop-carried dependency chains are drawn as loops.]


.section data
.dd 164, 216, 12, 64, 116, 168, 220, 16
.dd 68, 120, 172, 224, 20, 72, 124, 176
.dd 228, 24, 76, 128, 180, 232, 28, 80
.dd 132, 184, 236, 32, 84, 136, 188, 240
.dd 36, 88, 140, 192, 244, 40, 92, 144
.dd 196, 248, 44, 96, 148, 200, 252, 48
.dd 100, 152, 204, 0, 52, 104, 156, 208
.dd 4, 56, 108, 160, 212, 8, 60, 112
.section text
loop:   mov r1, [r1 + 0]
        test r1, r1
        jnz loop
stop:   jmp stop

Code Listing 5: Code listing of the linked list program.

[Figure: the dependency structure of the linked-list loop body (mov/test/jnz); every load depends on the previous one.]

5 Results

[Plot: time in clock cycles (0 to 1200) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32).]

Figure 3: Time in cycles for the integer dot product benchmark with different reorder buffer sizes. For vectors of size 32.

[Figure 4: incremental memory latency impact (-5 to 25) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32); vectors of size 32.]

[Plot: cycles of front-end stall due to full ROB (0 to 1000) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32).]

Figure 5: The number of cycles the front end was stalled because the ROB was full, for vectors of size 32.

[Plot: time in clock cycles (0 to 300) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32).]

Figure 6: Time in cycles for the integer dot product benchmark with different reorder buffer sizes. For vectors of size 16.

[Figure 7: incremental memory latency impact (0 to 5) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32); vectors of size 16.]

[Plot: cycles of front-end stall due to full ROB (0 to 180) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32).]

Figure 8: The number of cycles the front end was stalled because the ROB was full, for vectors of size 16.

[Plot: time in clock cycles (0 to 1200) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32).]

Figure 9: Time in cycles for the integer dot product benchmark with different reorder buffer sizes. For vectors of size 32. Prioritise calculation of addresses.

[Plot: incremental memory latency impact (0 to 35) against memory latency in clock cycles (0 to 40), one curve per ROB size (8, 12, 16, 24, 32).]

Figure 10: IMLI for the integer dot product with vector size 32. Prioritise calculation of addresses.

[Plot: time in clock cycles (0 to 2000) against memory latency in clock cycles (0 to 40); the curves for all five ROB sizes coincide.]

Figure 11: Time in cycles for the linked list benchmark with different reorder buffer sizes. The five graphs are exactly identical.

6 Discussion and Future Work

6.1 Discussion

In the first dot product benchmark, with vectors of size 32, the effect of adding a cycle of memory latency depends strongly on the size of the reorder buffer. For a reorder buffer size of 8, the incremental memory latency impact (IMLI, figure 4) is between 7 and 24 cycles. The effect is much smaller for a reorder buffer size of 32, with an IMLI between -4 and 9 cycles. For all ROB sizes there is a pattern of starting with a low IMLI and, via a messy transition, ending with a high IMLI that stays constant. Extending the benchmark to even higher memory latencies did not change that pattern.

A negative IMLI indicates that increasing the memory latency by 1 decreases the total time. This is likely caused by instruction scheduling interference, where a less important instruction is scheduled earlier than a more important one. It appears to be important to compute memory addresses early: at certain memory latencies, the multiplication or accumulation can “steal” the cycle in which it would have been best to execute the address generation of a load. This explanation is supported by figures 9 and 10, which show the time and the IMLI respectively when μops that calculate an address are given priority during instruction selection. With this change the IMLI is never negative, possibly indicating less scheduling interference. Most times improve, except those for a ROB size of 16.

The number of stalled cycles due to a full ROB (figure 5) shows a remarkable similarity to the total time taken (figure 3), indicating that the slowdown is indeed due to the ROB filling up.

The linked-list benchmark (figure 11) shows that the ROB size does not always have a large impact. The loads in the linked-list benchmark are dependent, so the cache misses they generate are completely non-overlapping. There are not enough other instructions to make use of the time spent waiting for cache misses.

6.2 Future Work

A more varied set of benchmarks has been left to future work. For example, benchmarks that include a mixture of dependent and independent loads may exhibit behaviour that is in a sense “between” the two extremes that are considered in this paper. Benchmarks with store operations may also be interesting, but could not be run because stores are not fully implemented.

It may also be interesting to see how the ability to tolerate memory latency is affected by other aspects of the architecture and by mechanisms that were not implemented. For example, how strong is the effect of the size of the load and store queues and what is the effect of memory dependence speculation?

The effect of the size of the reorder buffer on the cycle time of a physical realisation of the architecture, which would require timing analysis of the VHDL/(System)Verilog generated by the CλaSH compiler, has also been left to future work.

References

[1] Christiaan Baaij. CλaSH: From Haskell to Hardware, December 2009. URL http://essay.utwente.nl/59482/.


[2] Christiaan Baaij, Matthijs Kooijman, Jan Kuper, Arjan Boeijink, and Marco Gerards. CλaSH: Structural descriptions of synchronous hardware using Haskell. In Proceedings of the 13th EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, pages 714–721. IEEE Computer Society, September 2010. URL http://doc.utwente.nl/73124/.

[3] J.C.H. Bos. Synthesizable specification of a VLIW processor in the functional hardware description language CλaSH, September 2014. URL http://essay.utwente.nl/66086/.

[4] Roger Espasa, Mateo Valero, and James E. Smith. Out-of-order vector architectures. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 30, pages 160–170, Washington, DC, USA, 1997. IEEE Computer Society. ISBN 0-8186-7977-8. URL http://dl.acm.org/citation.cfm?id=266800.266816.

[5] Andrew Glew. MLP yes! ILP no. ASPLOS Wild and Crazy Idea Session’98, 1998.

[6] Donald E. Knuth. The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams. Addison-Wesley Professional, 12th edition, 2009. ISBN 0321580508, 9780321580504.

[7] Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pages 129–140. IEEE, 2003.

[8] Anja Niedermeier, Rinse Wester, Kenneth Rovers, Christiaan Baaij, Jan Kuper, and Gerard Smit. Designing a dataflow processor using CλaSH. In Proceedings of the 28th Norchip Conference, NORCHIP 2010, page 69. IEEE Circuits & Systems Society, November 2010. URL http://doc.utwente.nl/74963/.

[9] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2011.

[10] James E. Thornton. The CDC 6600 project. Annals of the History of Computing, 2(4):338–348, 1980.

[11] Robert M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. Dev., 11(1):25–33, January 1967. ISSN 0018-8646. doi: 10.1147/rd.111.0025. URL http://dx.doi.org/10.1147/rd.111.0025.

[12] Kenneth C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2):28–41, 1996.

[13] Peng Zhou, Soner Önder, and Steve Carr. Fast branch misprediction recovery in out-of-order superscalar processors. In Proceedings of the 19th Annual International Conference on Supercomputing, pages 41–50. ACM, 2005.


Appendices

A Instruction Set Reference

The instruction formats are:

RR(p, r, a) 00000 rrr aaa ppppp. rrr encodes the first source register and, if applicable, the destination. aaa encodes the second source register. ppppp encodes the operation.

RI(p, r, imm8) 01 ppp rrr imm8. ppp encodes the operation. rrr encodes the first source register and, if applicable, the destination. imm8 is an 8-bit immediate value. The immediate value cannot be zero.

RIL(p, r, imm32) 01 ppp rrr 00000000 imm32. ppp encodes the operation. rrr encodes the first source register and, if applicable, the destination. imm32 is a 32-bit immediate value, stored in little-endian order.

JMP(p, rel11) 00000 p rel11. p determines whether to store the return address in the link register (p = 0) or not (p = 1). rel11 is a signed 11-bit offset from the address just after the jump, measured in instructions (not bytes).

JCC(c, rel8) 11000 ccc rel8. ccc encodes the branch condition. rel8 is a signed 8-bit offset from the address just after the branch.

SP(p, r, a) 10011 rrr aaa ppppp. rrr encodes the first source register and the destination. aaa encodes the second source register. ppppp encodes the operation.

MEM(p, r, a, ofs8) 1000 p rrr aaa ofs8. p determines whether the operation is a load (p = 0) or a store (p = 1). For loads, rrr encodes the destination register, for stores it encodes the register whose value will be stored. aaa encodes the base-address register. ofs8 is a sign-extended offset, measured in words (not bytes).

MEML(p, a, ofs5) 1110100 p aaa ofs5. p determines whether the operation is a load (p = 0) or a store (p = 1). The destination register (for loads) or source register (for stores) is always the link register. aaa encodes the base-address register. ofs5 is a sign-extended offset, measured in words (not bytes).


Instructions

Instruction           Encoding            Description
add r, a              RR(5, r, a)         Add a to r. The flags are affected as defined.
add r, imm8           RI(5, r, imm8)      Add imm8 to r. imm8 is sign-extended. The flags are affected as defined.
add r, imm32          RIL(5, r, imm32)    Add imm32 to r. The flags are affected as defined.
and r, a              RR(3, r, a)         r AND a. carry and overflow are reset; zero and sign are affected as defined.
and r, imm8           RI(3, r, imm8)      r AND imm8. imm8 is sign-extended. carry and overflow are reset; zero and sign are affected as defined.
and r, imm32          RIL(3, r, imm32)    r AND imm32. carry and overflow are reset; zero and sign are affected as defined.
cmp r, a              RR(7, r, a)         Compare r with a. The flags are defined as if a was subtracted from r, but r is not written to.
cmp r, imm8           RI(7, r, imm8)      Compare r with imm8. imm8 is sign-extended. The flags are defined as if imm8 was subtracted from r, but r is not written to.
cmp r, imm32          RIL(7, r, imm32)    Compare r with imm32. The flags are defined as if imm32 was subtracted from r, but r is not written to.
jal rel11             JMP(0, rel11)       Transfer control to the address specified by rel11 and store the address of the next instruction in the link register.
jmp rel11             JMP(1, rel11)       Transfer control to the address specified by rel11.
jz rel8               JCC(0, rel8)        Jump if zero = 1.
jnz rel8              JCC(1, rel8)        Jump if zero = 0.
jc rel8               JCC(2, rel8)        Jump if carry = 1.
jnc rel8              JCC(3, rel8)        Jump if carry = 0.
js rel8               JCC(4, rel8)        Jump if sign = 1.
jns rel8              JCC(5, rel8)        Jump if sign = 0.
jl rel8               JCC(6, rel8)        Jump if sign ≠ overflow.
jge rel8              JCC(7, rel8)        Jump if sign = overflow.
mov r, a              RR(0, r, a)         Copy a into r. The flags are not affected.
mov r, imm8           RI(0, r, imm8)      Copy imm8 (sign-extended) into r. The flags are not affected.
mov r, imm32          RIL(0, r, imm32)    Copy imm32 into r. The flags are not affected.
mov r, [a + ofs8]     MEM(0, r, a, ofs8)  Load the word at the effective address into r. The effective address is a + 4*ofs8; ofs8 is sign-extended. An unaligned effective address is not allowed. The flags are not affected.
mov [a + ofs8], r     MEM(1, r, a, ofs8)  Store the value of r to memory at the effective address. The effective address is a + 4*ofs8; ofs8 is sign-extended. An unaligned effective address is not allowed.
mov link, [a + ofs5]  MEML(0, a, ofs5)    Load the word at the effective address into link. The effective address is a + 4*ofs5; ofs5 is sign-extended. An unaligned effective address is not allowed. The flags are not affected.
mov [a + ofs5], link  MEML(1, a, ofs5)    Store the value of link to memory at the effective address. The effective address is a + 4*ofs5; ofs5 is sign-extended. An unaligned effective address is not allowed.
mov r, flags          SP(12, r, 0)        Copy the value of the flags register to r. The flags are sozc (sign, overflow, zero, carry), padded on the left with zeroes. The zero flag is inverted.
mov flags, r          SP(13, r, 0)        Copy the lowest 4 bits of r to the flags. The other bits are reserved and should be zero.
mul r, a              SP(0, r, a)         Multiply r by a. The flags are not affected.
mulh r, a             SP(1, r, a)         Multiply (unsigned) r by a and extract the upper 32 bits of the full product. The flags are not affected.
mulhs r, a            SP(2, r, a)         Multiply (signed) r by a and extract the upper 32 bits of the full product. The flags are not affected.
or r, a               RR(1, r, a)         r OR a. carry and overflow are reset; zero and sign are affected as defined.
or r, imm8            RI(1, r, imm8)      r OR imm8. imm8 is sign-extended. carry and overflow are reset; zero and sign are affected as defined.
or r, imm32           RIL(1, r, imm32)    r OR imm32. carry and overflow are reset; zero and sign are affected as defined.
ret                   E0 00               Transfer control to the address contained in the link register.
sar r, a              SP(6, r, a)         Shift r right by a while preserving the sign. The carry flag is set to the least significant bit of r; it is intended to be used for shifts by 1. The other flags are affected as defined.
shl r, a              SP(4, r, a)         Shift r left by a while shifting in zeroes. The carry flag is set to the most significant bit of r; it is intended to be used for shifts by 1. The other flags are affected as defined.
shr r, a              SP(5, r, a)         Shift r right by a while shifting in zeroes. The carry flag is set to the least significant bit of r; it is intended to be used for shifts by 1. The other flags are affected as defined.
sub r, a              RR(6, r, a)         Subtract a from r. The flags are affected as defined.
sub r, imm8           RI(6, r, imm8)      Subtract imm8 from r. imm8 is sign-extended. The flags are affected as defined.
sub r, imm32          RIL(6, r, imm32)    Subtract imm32 from r. The flags are affected as defined.
test r, a             RR(4, r, a)         The flags are defined as if r was AND-ed with a, but r is not written to.
test r, imm8          RI(4, r, imm8)      The flags are defined as if r was AND-ed with imm8, but r is not written to. imm8 is sign-extended.
test r, imm32         RIL(4, r, imm32)    The flags are defined as if r was AND-ed with imm32, but r is not written to.
xor r, a              RR(2, r, a)         r XOR a. carry and overflow are reset; zero and sign are affected as defined.
xor r, imm8           RI(2, r, imm8)      r XOR imm8. imm8 is sign-extended. carry and overflow are reset; zero and sign are affected as defined.
xor r, imm32          RIL(2, r, imm32)    r XOR imm32. carry and overflow are reset; zero and sign are affected as defined.


B Count Trailing Zeros

This algorithm is a divide-and-conquer approach to counting the number of trailing zeros. The fold at the heart of the algorithm corresponds to a tree-shaped reduction of ⌊log n⌋ layers. It implements the following recursive definition of ntz (the number of trailing zeros):

$$
\mathrm{ntz}(\mathit{high} : \mathit{low}) =
\begin{cases}
\mathrm{ntz}(\mathit{low}) & \text{if } \mathit{low} \neq 0 \\
\mathrm{ntz}(\mathit{high}) + \mathrm{size}(\mathit{low}) & \text{if } \mathit{low} = 0
\end{cases}
\qquad
\mathrm{ntz}(\mathit{bit}) = 1 - \mathit{bit}
\tag{1}
$$

The test whether the low part is zero can be optimised by calculating it bottom-up. The addition can be turned into prepending a bit to the binary representation. This relies on the result type at every merge step being BitVector k, where k is the depth, so it ranges from 0 to n (the input vector has size 2^n). That size is one bit too short to represent all possible results, because the input (or the effective input of every invocation of combine) can be zero, and then the result would just overflow. For example, if the input is 0 :: BitVector 32 then naively the result would be 32 :: BitVector 5, which is impossible. The trick lies in handling the case where the input (on any layer) is 0 in the next level, including one final extra step that tests whether the complete input is zero. The next level automatically handles that case, because if its low part is zero then size(low) is used (implicitly, by prepending a 1 in binary to a bit string of length log(size(low))) instead of ntz(low) itself. The case where the high part is zero is only relevant if the low part is also zero, and in that case it is again the next level that handles the problem.

-- count the number of trailing zeros, O(log n) depth version
countTrailingZeros :: KnownNat n
                   => BitVector (2^n) -> BitVector (n+1)
countTrailingZeros x =
  if x == 0
    then fromIntegral $ size# x      -- final extra step: the whole input is zero
    else v2bv $ low :> ctz           -- otherwise prepend a 0 to the folded count
  where
    (ctz, _) = dtfold (Proxy @ICtzTuple)
                      (\b -> (Nil, b))   -- base case: an empty count and the bit itself
                      combine
                      (bv2v x)
    -- Merge two halves: if the low half contains a 1, keep its count and
    -- prepend a 0; otherwise add size(low) by prepending a 1 to the high count.
    combine _ (ctzH, have1H) (ctzL, have1L) =
      if have1L == high
        then (low :> ctzL, high)
        else (high :> ctzH, have1H)

-- Motive for dtfold; its Apply instance, which maps depth k to the result
-- type (Vec k Bit, Bit), is not shown here.
data ICtzTuple (f :: TyFun Nat *) :: *
