
Performance and Energy Efficiency of

Clustered Processors

Sepehr Zarrabi

M.A.Sc., University of Victoria, 2004

A Thesis Submitted in Partial Fulfillment of the

Requirements for the Degree of

MASTER OF APPLIED SCIENCE

in the Department of Electrical and Computer Engineering

We accept this thesis as conforming

to the required standard

© Sepehr Zarrabi, 2004
University of Victoria

All rights reserved. This thesis may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.


Supervisor: Dr. Amirali Baniasadi

ABSTRACT

Modern processors aim to achieve ILP by utilizing numerous functional units, large on-chip structures and wider issue windows. This leads to extremely complex designs, which in turn adversely affect clock rate and energy efficiency. Hence, clustered processors have been introduced as an alternative, which allow high levels of ILP while maintaining a desirable clock rate and manageable power consumption. Nonetheless, clustering has its drawbacks. In this work we discuss the two types of clustering-induced delays caused by limited intra-cluster issue bandwidth and inter-cluster communication latencies. We use simulation results to show that the stalls caused by inter-cluster communication delays are the dominant factor impeding the performance of clustered processors. We also illustrate that microarchitectures become more energy efficient as the number of clusters grows. We study branch misprediction as a source of energy loss and examine how pipeline gating can alleviate this problem in centralized and distributed processors.


Table of Contents

Title Page
ABSTRACT
List of Figures
List of Tables
Acknowledgements
1. Introduction
2. Brief Overview of Superscalar Processors
3. Performance Analysis of Clustered Processors
   3.1 Introduction
   3.2 Related Work
   3.3 Performance Analysis
   3.4 Methodology
   3.5 Simulation Results
   3.6 Other Architectures
   3.7 Conclusion
4. Energy Efficiency Analysis of Clustered Processors
   4.1 Introduction
   4.2 Energy-Delay Metric
   4.3 Related Work
   4.4 Energy Efficiency in Multi-Cluster Systems
      4.4.1 Energy-Efficiency versus Number of Clusters
      4.4.2 Pipeline Gating in Distributed Architectures
   4.5 Methodology
   4.6 Conclusion
5. Conclusion and Future Work
6. References


List of Figures

Figure 1. Top view of a 4-cluster processor
Figure 2. Pipelined execution
Figure 3. Clustering-induced stalls
Figure 4. Inter-cluster vs. intra-cluster stalls
Figure 5. Percentage of cycles issue width is full
Figure 6. RUU occupancy
Figure 7. IPC gain for I-NC & NI-C
Figure 8. Stall distribution, 8-way machine dual cluster vs. quad cluster
Figure 9. IPC gain for I-NC with inter-cluster communication delay of 2 cycles
Figure 10. Energy-delay vs. number of clusters for different benchmarks
Figure 11. Energy-delay improvement with ideal pipeline gating
Figure 12. Energy savings with ideal pipeline gating
Figure 13. Energy-delay obtained using pipeline gating
Figure 14. Energy-delay without gating relative to energy-delay with pipeline gating
Figure 15. Performance improvement without pipeline gating
Figure 16. Energy improvements obtained by utilizing pipeline gating


List of Tables

Table 1. Base processor configuration
Table 2. SPEC'2K benchmarks used in this work
Table 3. Distribution of inter-cluster and intra-cluster stalls
Table 4. Stall distribution, 8-way dual cluster vs. quad cluster


Acknowledgements

Firstly I want to thank my family, especially my parents, for their dedication and support throughout the years, as well as for instilling the value of education in me.

I would also like to thank my supervisor, Dr. Baniasadi, whose encouragement and invaluable technical insight made this work possible.

I thank all my friends who showed interest in my work and supported me during the years of my studies.


1. Introduction

For many years, hardware designers have strived to produce processors that can process more data in less time by utilizing faster clock rates or by performing more operations per clock cycle. This trend has given way to high-speed processors that have the ability to execute several concurrent instructions per clock cycle. This is also referred to as instruction-level parallelism (ILP).

Hence, utilizing these state-of-the-art processors to their fullest extent necessitates finding more independent instructions that can be executed in parallel. This task is not possible without employing wider issue logic and sophisticated structures such as bypass logic, which are discussed in more detail below and in chapter 2.

After being decoded and renamed, instructions are placed in a pool of instructions in the issue window. Issue logic is responsible for selecting and moving these instructions to the execution stage based on data (operand) and resource availability. Multiple-issue processors often employ a dynamically scheduled out-of-order execution scheme to identify and execute independent instructions in parallel. As the name implies, instructions may not necessarily be executed in the sequential order in which they were coded. To clarify, this work solely concentrates on instruction-level parallelism, which refers to executing multiple instructions simultaneously on a single processor. Such multiple-issue processors require a wider issue window. The idea is that a wider and consequently more complex issue window will expose more available instructions, increasing the likelihood of finding independent instructions that may be executed simultaneously, thus improving ILP. Hence, issue logic is one of the portions of the processor whose complexity grows with increasing instruction-level parallelism.
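As a rough illustration of this selection step, the following Python sketch picks up to issue_width ready instructions from a window. It is a minimal sketch, not the issue logic of any real processor; the Instr record and its fields are hypothetical.

    # Minimal sketch of dynamic issue selection: scan the window and pick
    # instructions whose source operands and functional unit are available.
    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        dest: str                                  # register written
        srcs: list = field(default_factory=list)   # source registers
        fu: str = "ALU"                            # functional unit class

    def select_ready(window, ready_regs, free_units, issue_width):
        issued = []
        for ins in window:                         # window order, not program order
            if len(issued) == issue_width:         # issue bandwidth exhausted
                break
            if all(s in ready_regs for s in ins.srcs) and free_units.get(ins.fu, 0) > 0:
                free_units[ins.fu] -= 1
                issued.append(ins)
        return issued

Note that the scan ignores program order entirely; real select logic must also arbitrate when more instructions are ready than there are issue slots.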

The role of data bypass logic is to forward result values from completed instructions to dependent instructions, bypassing the register file [1]. This is due to the fact that committing the results of instructions to the register file is one of the last steps in the superscalar pipeline, and waiting for instructions to be committed in order to access their results implies stalling the pipeline and underutilizing its valuable resources, which may diminish performance. Data bypass logic is used to avoid such stalls by delivering results from completed instructions to waiting instructions as soon as possible.
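The effect of bypassing on a waiting instruction can be pictured with a toy operand read, sketched below under the assumption of a simple per-register bypass buffer (all names are illustrative):

    # Toy operand read with bypassing: results of just-completed
    # instructions sit in a bypass buffer before they are committed to
    # the register file, so consumers need not wait for the commit stage.
    def read_operand(reg, bypass_buffer, register_file):
        if reg in bypass_buffer:
            return bypass_buffer[reg]    # forwarded result, no stall
        return register_file[reg]        # committed architectural value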

Of course, executing more instructions in parallel requires exploiting larger and wider issue logic and more aggressive bypass logic. However, the extra structures come at the price of added complexity that poses serious practical limitations: a slower clock cycle and higher power consumption, both of which limit the overall performance.

Implementing a more sophisticated issue window and bypass logic entails utilizing more complex circuitry, which can increase the delay of the critical path. This is due to the fact that the issue window cannot be pipelined, as it is required to execute dependent instructions in consecutive cycles (back-to-back), and the delay of the window logic must be less than one clock cycle [12]. This negatively impacts the clock speed, as the clock period is determined by wire delays. Complexity analysis shows that logic associated with the issue window is likely to be one of the key limiters of clock speed as we move towards more complex structures and more advanced technology [1,22,23]. This directly affects and degrades the performance.

Furthermore, the added complexity translates to higher power consumption. In recent years, power-aware design has gained importance in high-performance domains such as desktop computers and servers, as heat dissipation can limit performance. It is expected that the maximum processor cycle time, hence frequency, will be limited by thermal constraints because the heat generated by high-speed circuits will be too high to be cooled in an affordable and effective manner [10,14].

Thus, there are two issues that may restrict our ability to manufacture superior processors as we attempt to exploit more ILP: prolonged clock cycle time and excessive power consumption. The key challenge is inventing methods that allow us to achieve most of the benefits of a complex architecture, while maintaining a very fast clock in the implementation and retaining the power consumption at manageable levels.


One such approach is decentralized architectures. Many decentralized designs have been proposed and studied [2], however our focus is only on the multi-cluster architecture. Hence, we will use the terms decentralized and multi-cluster interchangeably. Clustered architectures are an attractive alternative to wide and deep organizations. As illustrated in figure 1, a multi-cluster architecture is composed of several smaller issue windows with their respective functional units and datapaths. Throughout this work, we assume uniform clusters (i.e. all clusters are identical). Issue width, also known as the width of a processor, refers to the maximum number of instructions that may be initiated for execution per clock cycle. As demonstrated in figure 1, in our simulations, issue width is divided equally between the clusters.

Figure 1 - Top view of a 4-cluster processor

IW denotes issue width (number of instructions that may be executed concurrently in each clock cycle).

Mathematical models developed in [1] show that the delays of the most critical structures are proportional to the square of the issue width. As well, simulation results in [12] demonstrate quadratic growth in energy consumption as issue width increases. Therefore, splitting a large structure such as the issue logic between two or more clusters can reduce the delay and power consumption quadratically. This is why multi-clustered architectures have been proposed and pursued as a viable solution to the problems associated with centralized architectures, discussed above.
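As a quick arithmetic illustration of this quadratic model (illustrative units only, not measurements from [1] or [12]):

    # If window delay grows as the square of the issue width, an 8-wide
    # centralized window is roughly four times slower than each 4-wide
    # window of a dual-cluster design.
    def relative_window_delay(issue_width):
        return issue_width ** 2          # arbitrary units; only ratios matter

    print(relative_window_delay(8))      # 64: centralized 8-wide window
    print(relative_window_delay(8 // 2)) # 16: each window of a dual cluster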

Nonetheless, clustering has its drawbacks. The distributed nature of clustered architectures leads to clustering-induced stalls. Generally, these stalls are of two types: stalls due to intra-cluster issue bandwidth limitation and stalls due to inter-cluster communication latencies. Although such stalls may be minimized by specific cluster assignment techniques, they cannot be eliminated. Therefore, performance issues associated with clustering require thorough examination.

Another aspect of distributed systems that is of interest is their energy efficiency. As discussed above, clustered architectures simplify the design and reduce power consumption by breaking up large complex structures into smaller ones. However, the disadvantage of clustering is that a program may take longer to execute due to the clustering-induced stalls. This may increase the energy required to run a piece of code, as energy is directly proportional to time (Energy = Power × Time). Additionally, transferring and synchronizing data between clusters is done through cluster communication ports, which add to the power overhead associated with clustering. This raises the question of whether multi-clustered architectures are energy-efficient.

The contributions of this work are outlined below:

- We utilize SimpleScalar [20] and Wattch [21] simulation toolsets to perform a detailed analysis of the performance and energy efficiency of clustered processors using the SPEC'2K benchmark suite [25]. We use the above tools to simulate processors with 1, 2, 4, and 8 clusters and issue widths of 4, 6, 8, and 16.


- In order to better understand the performance aspect of clustered processors, we study the types of delays introduced by clustering. We conclude that inter-cluster communication stalls are the predominant issue impairing the performance of clustered processors. This will help designers focus their efforts on implementing cluster assignment techniques that will reduce that type of stall.

- We discuss how clustering affects resource utilization. We note that processor resources may be more effectively utilized if stalls due to inter-cluster communication are diminished.

- We then analyze and compare the energy efficiency of several centralized architectures with their decentralized counterparts (i.e. same issue width, but distributed among several clusters). We demonstrate that microprocessors become more energy-efficient as the number of clusters increases. We use the energy-delay metric for measuring energy efficiency.

- Branch prediction and the cost of mispredictions are investigated in the context of clustered processors. We notice that as the number of clusters increases, the energy waste due to branch misprediction intensifies. We implement pipeline gating and show that although pipeline gating may be somewhat detrimental to performance, it can improve energy-delay in highly distributed systems (8-cluster or 4-cluster processors).


2. Brief Overview of Superscalar Processors

In this chapter we briefly discuss the microarchitecture of a typical superscalar processor. It is outside the scope of this work to describe the detailed operation of a superscalar processor; therefore, we refer readers who are interested in a more comprehensive explanation to the references enumerated at the end of this document. Multiple-issue processors come in two basic flavors: superscalar processors and VLIW (very long instruction word) processors. The focus of this work is on superscalar processors, i.e. processors that are capable of executing more than one instruction per clock cycle. In order to initiate multiple instructions for execution in parallel, instructions need to be scheduled based on the availability of operand data rather than the original program sequence. This important feature of superscalar processors is known as out-of-order execution. In other words, superscalar processors strive to execute instructions whose operands are available, regardless of program order. Of course, this is unbeknownst to the user or programmer. Hence, in order to imitate in-order (or sequential) execution, instructions are stored in a buffer upon completing the execution stage so they can be committed (written to architectural components) in the original program order. Common examples of superscalar processors are the Intel Pentium series and AMD IA32 processors.
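The buffer mentioned above (a reorder buffer) can be sketched in a few lines; the structure and field names here are hypothetical, not those of any particular design:

    # Sketch of in-order commit: instructions may finish execution out of
    # order, but they leave the buffer and update architectural state
    # strictly in program order.
    from collections import deque

    def commit(rob, commit_width):
        # rob: deque of entries like {'done': bool, 'mispredicted': bool}
        committed = 0
        while rob and rob[0]["done"] and committed < commit_width:
            entry = rob.popleft()          # oldest instruction first
            committed += 1                 # architectural state updated here
            if entry.get("mispredicted"):  # squash younger, wrong-path entries
                rob.clear()
                break
        return committed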

The main advantage of, and motivation behind, the invention of superscalar processors is their ability to execute multiple instructions per clock cycle, which allows for better utilization of chip resources and yields shorter execution times for applications and benchmarks. The disadvantage of superscalar processors is their growing complexity, which makes them power-hungry and difficult to design.

In a typical superscalar processor, instructions move through these major stages: instruction fetch, instruction decode, execution, memory access, and commit. These stages are pipelined, allowing multiple instructions to be in flight concurrently, as depicted in figure 2. For instance, when the first instruction is being executed at clock cycle 3, the second instruction is being decoded and the third instruction is being fetched. Each stage is discussed in more detail below.

Instruction fetch is responsible for supplying the rest of the processor pipeline with instructions. It is logical that the instruction fetch rate should at least match the instruction decode and execution rate; otherwise the processor resources will be underutilized. This is also the stage where branch prediction occurs. Upon detecting a conditional branch, the branch prediction unit speculates the outcome of the branch (taken or not taken), allowing the fetch unit to calculate the address of the next instruction that needs to be fetched from memory. This enables the processor to continue execution speculatively while the branch outcome is computed and resolved. The alternative would be to stop fetching new instructions once a conditional branch is detected, until the branch instruction completes the execution stage and its outcome is known for certain. However, this would entail stalling the pipeline and would cause the processor resources to idle, which would be inefficient. We will revisit branch prediction and its consequences in section 4.4.2 where we discuss pipeline gating.
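As background for the gating discussion in section 4.4.2, the sketch below shows a bimodal predictor built from 2-bit saturating counters, one common prediction scheme. The table size and indexing are illustrative, and this is not necessarily the exact predictor configuration simulated in this thesis.

    # Bimodal branch predictor: a table of 2-bit saturating counters
    # indexed by the branch address; counter values >= 2 predict taken.
    class BimodalPredictor:
        def __init__(self, entries=4096):
            self.table = [1] * entries               # start weakly not-taken

        def predict(self, pc):
            return self.table[pc % len(self.table)] >= 2

        def update(self, pc, taken):
            i = pc % len(self.table)
            if taken:
                self.table[i] = min(3, self.table[i] + 1)
            else:
                self.table[i] = max(0, self.table[i] - 1)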

Figure 2 - Pipelined execution

Each block represents one stage of the pipeline: Fetch, Decode, Execution, Memory access, Commit.

During the instruction decode stage, instructions are removed from the fetch buffer, examined for data dependencies, and distributed or dispatched to buffers associated with hardware functional units. This phase also includes register renaming, whose purpose is to resolve register hazards as much as possible. Hazards are situations in which the next instruction in the stream cannot be executed in its designated clock cycle, either because the hardware resource(s) it requires are currently in use by a previous instruction or because it depends on the result(s) of an in-flight instruction that has not completed yet. Once the instructions are decoded and renamed, the next step is to determine which instructions can be issued to the execution unit, depending on data and resource availability. In multiple-issue processors, the issue logic becomes more complex because not only does it need to find eligible instructions whose input operands and hardware resources are available, but it also needs to find multiple instructions that are independent of each other and may be operated on in parallel. The probability of finding independent instructions varies among benchmarks. Hence, a wider instruction window (a larger pool of ready-to-execute instructions) improves the likelihood of finding independent instructions.
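Register renaming itself can be pictured with a small sketch (a minimal model with a map table and free list; the data layout is hypothetical):

    # Renaming at decode: each architectural destination gets a fresh
    # physical register, removing write-after-write and write-after-read
    # hazards; true data dependences remain.
    def rename(instr, map_table, free_list):
        srcs = [map_table[s] for s in instr["srcs"]]  # current mappings
        phys = free_list.pop()                        # fresh physical register
        map_table[instr["dest"]] = phys               # later readers use this
        return {"srcs": srcs, "dest": phys}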

The memory access stage requires special consideration due to the large latency associated with storage structures. To cope with long access times, memory operations are often overlapped and sometimes performed out of order. A store buffer is utilized to ensure that hazards are properly resolved and sequential execution semantics are observed [7].

The final stage is the commit stage, where instructions are allowed to modify the processor state. The purpose of this stage is to implement the appearance of a sequential execution model, although the actual execution is quite likely non-sequential [7]. Of course, speculative instructions that are incorrectly executed are not committed; they are flushed without altering the processor state.


3. Performance Analysis of Clustered Processors

In this section, we discuss how distributing the resources of a processor among several clusters affects the performance.

3.1 Introduction

Many modern processors such as the Alpha 21164 and Intel Pentium series use pipelining to overlap the execution of instructions (also known as Instruction Level Parallelism or ILP) in an attempt to improve performance. In order to achieve substantial levels of ILP in common applications, current processors employ numerous functional units and large on-chip structures such as caches, register files and branch predictors. Utilizing these resources effectively requires larger and wider instruction windows. The idea is that a larger instruction window will expose more instructions that may be executed in parallel, which in turn will lead to higher performance. The trade-off, however, is that as instruction windows grow larger, so does their complexity. This added complexity necessitates a reduced clock rate because of the trade-off between clock speed and complexity [1].

Therefore, clustering has been proposed as an alternative. In clustering, the large and deep instruction window is replaced by a collection of smaller windows with their own respective functional units and datapaths. Smaller issue windows result in simplified designs that yield improved clock rates. Compared to a centralized organization, clustered designs trade scheduling flexibility for higher clock rates. The downside of clustering, nonetheless, is that it will inevitably lead to clustering-induced stalls. Generally, these stalls are of two types: stalls due to intra-cluster issue bandwidth limitation and stalls due to inter-cluster communication latencies [3].

In a non-clustered architecture, dependent instructions can execute in successive cycles. In a clustered architecture, however, inter-cluster bypass delay prevents dependent instructions located in different clusters from issuing in successive cycles, as a delay is incurred while the data travels across clusters [2]. Another challenge is balancing the load between the different clusters, because each cluster only has access to a fraction of the total available issue bandwidth. Hence, if the load is not distributed evenly among different clusters, some instructions may end up stalling due to the issue bandwidth limitation in one cluster while other clusters remain idle or underutilized. Both these issues are further discussed in section 3.3 and clarified using figure 3. Therefore, if an intelligent "steering" logic is not employed for dispersing instructions among the different clusters, the clustering overhead can potentially negate the benefits of utilizing a clustered architecture.

In order to develop more advanced steering logic, we need to acquire better insight into the sources and patterns of clustering-induced stalls. We use previously suggested models [3] to measure and analyze the effects of intra-cluster and inter-cluster delays. Our goal is to determine where the bottlenecks lie. We provide a better understanding of such stalls by investigating numerous SPEC'2K benchmark programs [25] on dual and quad-cluster machines with issue widths of 4, 6, and 8. The simulation toolset and the benchmark suite are discussed in more detail in section 3.4 where we explain our methodology.

3.2 Related Work

Numerous studies have been conducted and published on clustered processors over the past decade [2, 3, 9, 10]. The theme of all these investigations is to present a suitable alternative to large monolithic structures that will allow for further ILP without increasing the design complexity or impairing the clock rate. Palacharla, Jouppi, and Smith studied key structures in a generic superscalar processor and their respective delays [1]. They showed that as designers exploit wider issue widths, bigger windows and smaller feature sizes, the clock rates would have to slow down to cope with the extra complexity. They proposed clustering as a solution and introduced an innovative "dependence-based" dual-cluster architecture where dependent instructions are steered into chains of FIFOs and are scheduled for in-order execution.

Many bodies of work have focused on developing more intelligent steering heuristics to minimize the impact of clustering-induced stalls. Baniasadi and Moshovos proposed various steering logics including Mod-N and First-Fit [3]. Canal, Parcerisa and González studied a variety of non-adaptive instruction distribution methods for a non-uniform dual-clustered architecture [4]. They also explained how slice information can be extracted dynamically and proposed the slice-based method.

Most studies so far have concentrated on dual or quad cluster machines with centralized caches. Balasubramonian, Dwarkadas, and Albonesi have recently published a paper that analyzes topologies with as many as 16 clusters and discusses the use of decentralized caches [2].

In this chapter we will solely focus on sources and causes of clustering induced stalls. We will also analyze how each type of stall (inter-cluster vs. intra-cluster) contributes to clustering induced latencies. Additionally, we will investigate the effects of changing different clustering parameters such as issue width, number of clusters, and communication delay between clusters.

3.3 Performance Analysis

Clustering is an effective solution for tackling the growing impact of wire delays and increasing complexity of parts of the processor such as the issue and rename logic. Nonetheless, clustering introduces its own set of challenges, namely the two types of clustering induced stalls.

Inter-cluster latency refers to instances when an instruction gets stalled waiting for the result of another instruction to propagate across clusters. Figure 3 illustrates an example of such a stall. Part (b) shows the register dependence graph (RDG) for a simple piece of code illustrated in part (a). Parts (c) and (d) demonstrate two different cluster assignment policies. The cluster assignment technique used in figure 3 part (c) is designed to minimize inter-cluster communication as much as possible, whereas the cluster assignment technique utilized in part (d) does not aim at reducing inter-cluster communication. As displayed in figure 3 part (d), the latter steering logic results in a stall during the second clock cycle while R2 waits for the result of R1 to traverse across clusters. Consequently, it takes 4 clock cycles to execute the whole code using the algorithm shown in part (d), while the algorithm shown in part (c) can execute the code in 3 clock cycles. Figure 3 part (d) does not use a particular cluster assignment technique; this example is specifically depicted to illustrate the effects of inter-cluster communication. In part (c), we have used the dependence method [3], which is discussed later in this chapter.

Figure 3 - Clustering-induced stalls

An example depicting the effects of inter-cluster stalls. (a) Exemplifies a piece of code. (b) Shows the RDG for the code in (a). (c) Illustrates an optimized cluster assignment that does not cause any extra stalls. (d) Demonstrates a non-optimized cluster assignment that yields a one-cycle delay while R2 waits for R1 to propagate across clusters.

Intra-cluster stalls refer to the limited per-cluster issue bandwidth. Compared to a monolithic architecture, each cluster is restricted to a fraction of the total issue slots. For example, in a quad-cluster 8-way machine, each cluster has access to only 2 issue slots (8 ÷ 4 = 2). This can lead to unbalanced load distribution, which causes inefficient resource utilization as well as unnecessary stalls. As an example, a simple partitioning scheme utilized by older processors is to separate floating point (FP) and integer (INT) instructions into their respective clusters [4]. Although this results in minimum inter-cluster communication, it is less than optimal in terms of bandwidth utilization because programs rarely have an equal number of FP and INT instructions. Hence, while one cluster is underutilized, instructions are stalled due to limited issue bandwidth in the other cluster.

One of the instruction distribution heuristics is the dependence method [3]. This method uses the data dependences of the program in an attempt to minimize the costly communication between clusters by steering dependent instructions to the same cluster. This is performed in the following manner: when decoding a new instruction, the algorithm aims to assign it to the same cluster as its parents. If the parents are in different clusters, the instruction is assigned to the cluster containing the least number of instructions. If the new instruction does not have any parents (or its parents have already completed), the cluster with the fewest instructions is chosen. This scheme strikes a good balance between inter-cluster communication costs and efficient use of the limited issue bandwidth available to each cluster. It is the cluster assignment technique used throughout our simulations.
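The following Python sketch captures the heuristic as described above; the data structures are illustrative and not taken from our simulator:

    # Dependence steering: send an instruction to its parents' cluster
    # when the in-flight parents agree on one cluster; otherwise (parents
    # split, or no in-flight parents) pick the least-loaded cluster.
    def steer(instr, cluster_of, loads):
        # cluster_of maps in-flight producer registers to cluster ids;
        # loads holds the instruction count per cluster.
        parents = {cluster_of[s] for s in instr["srcs"] if s in cluster_of}
        if len(parents) == 1:
            target = parents.pop()           # keep the dependence chain local
        else:
            target = loads.index(min(loads)) # balance the load
        loads[target] += 1
        return target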

In order to better understand the effects and limitations that each type of clustering-induced stall imposes, we use the machine models proposed in [3]: I-C, I-NC and NI-C, where "I" stands for per-cluster issue bandwidth limitation, "C" stands for inter-cluster communication delay, and "N" signifies the absence of one of these features. For instance, I-NC denotes a theoretical machine that has limited per-cluster issue bandwidth but no inter-cluster communication delay (i.e. it takes zero cycles for data to travel across clusters). Although unrealistic, this model reveals how much performance is lost solely due to issue bandwidth limitations and how much performance would have been gained if the processor were not limited by it. In contrast, the NI-C model demonstrates how inter-cluster communication latencies would affect performance if issue bandwidth were perfectly utilized. Knowing this information will help us focus our efforts on inventing cluster assignment algorithms that eliminate the type of stall that is more significant. I-C refers to a typical clustered processor that suffers from both types of stalls.
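For reference, the three models can be summarized as a pair of simulation parameters each (an illustrative encoding; the idealized values follow the definitions above):

    # I = per-cluster issue bandwidth limited; C = inter-cluster
    # communication delay present; N negates the feature that follows it.
    MODELS = {
        "I-C":  {"issue_limited": True,  "xcluster_delay_cycles": 1},  # realistic
        "I-NC": {"issue_limited": True,  "xcluster_delay_cycles": 0},  # free communication
        "NI-C": {"issue_limited": False, "xcluster_delay_cycles": 1},  # unlimited issue
    }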

3.4 Methodology

Before we analyze the simulation results, we need to explain our methodology. All our simulations have been performed using the SimpleScalar simulation toolset [20] on the first one billion instructions of a subset of SPEC'2K [25] benchmarks, compiled for SimpleScalar. SPEC'2K is an industry-standardized CPU-intensive benchmark suite, designed to provide a comparative measure of compute intensive performance across the widest practical range of hardware. These benchmarks are developed from real user applications and measure the performance of the processor.

We assume homogeneous cluster configurations throughout this work. In other words, the issue width is divided equally between clusters and all clusters are identical. In section 3.5, we assume dual-cluster machines with issue widths of 4, 6 and 8 and inter-cluster communication delay of one cycle. Later in section 3.6, we analyze the effects of clustering induced stalls on quad-cluster machines and architectures with inter-cluster communication delays of greater than one cycle.

We assume that each cluster includes its own scheduler, which schedules instructions for execution based on operand and resource availability. Additionally, we assume that once an instruction is assigned to a cluster, the decision is final (an alternative would be to decouple execution resources and schedulers, as done in the dual-cluster Alpha 21264 [24]).

The base processor configuration is shown below in table 1. SPEC'2K benchmarks that we have used in this work are outlined in table 2. Details of the simulation methodology are discussed in Appendix A.


Table 1 - Base processor configuration

Branch Confidence Estimator:              BOS - 32K gshare + 32K bimodal
Scheduler:                                128 entries, RUU-like
Fetch Unit:                               64 entries
Load/Store Queue:                         64 entries
Fetch, Issue, Decode, Commit Bandwidth:   any 16/8/6/4 instructions per cycle in 16-way/8-way/6-way/4-way processors respectively
Integer ALUs:                             16 units
Floating Point ALUs:                      16 units
Functional Unit Latencies:                same as MIPS R10000
L1 Instruction/Data Caches:               64K, 4-way SA, 32-byte blocks, 3-cycle hit latency
Unified L2 Cache:                         256K, 4-way SA, 64-byte blocks, 16-cycle hit latency
Memory:                                   infinite, 100-cycle latency

Table 2 - SPEC'2K benchmarks used in this work

Benchmark   Abbreviation   Category                              INT/FP
gzip        gzip           Compression                           Integer
gcc         gcc            C Programming Language Compiler       Integer
mesa        mes            3-D Graphics Library                  Floating Point
mcf         mcf            Combinatorial Optimization            Integer
equake      equ            Seismic Wave Propagation Simulation   Floating Point
ammp        amm            Computational Chemistry               Floating Point
parser      pars           Word Processing                       Integer
bzip2       bzip2          Compression                           Integer


3.5 Simulation Results

We start this section by presenting the distribution of stalls, i.e. the number of stalls due to inter-cluster communication versus the number of stalls resulting from the limited issue bandwidth. The average numbers of stalls are shown in table 3 and the results for each benchmark are plotted in figure 4. As shown in figure 4, the majority of stalls are due to inter-cluster communication delays, and this becomes more of a bottleneck as issue width increases. For instance, in a 4-way dual-cluster machine, 59% of the stalls are due to inter-cluster communication, whereas in an 8-way dual-cluster machine, this number escalates to 75%. This can be attributed to the fact that an 8-way machine has more issue slots; therefore it is less likely that instructions will get delayed due to issue width restrictions. Additionally, an 8-way machine introduces more ILP. Hence, it has more instructions in flight at any given time, which translates into more communication among instructions, and therefore clusters.

It is also interesting to know how efficiently the issue bandwidth is utilized. For this we report the percentage of cycles that the issue bandwidth is fully utilized in figure 5.

Table 3 - Distribution of inter-cluster and intra-cluster stalls

Machine          Avg. Inter-Cluster Communication Stalls   Avg. Intra-Cluster Stalls
4-Way Machine    298,768,529 (59%)                         211,152,171 (41%)
6-Way Machine    312,742,362 (70%)                         131,321,614 (30%)
8-Way Machine    324,791,365 (75%)                         107,531,569 (25%)

This table shows the average number of each type of stall for dual-cluster 4, 6, and 8-way machines. These numbers are plotted in the last column of figure 4, below.


[Figure 4: three bar charts, Stall Distribution (4-Way), (6-Way), and (8-Way), plotting inter-cluster and intra-cluster stalls (in millions) for gzip, gcc, mes, mcf, equ, amm, par, bzip2, and AVG.]

Figure 4 - Inter-cluster vs. intra-cluster stalls

These graphs illustrate how many of the stalls (in millions of instructions) in dual-cluster 4, 6, and 8-way machines are due to inter-cluster communication delays and how many are due to intra-cluster issue bandwidth limitation for a set of SPEC'2K benchmarks.

It is noticeable that as issue width increases, the issue bandwidth is utilized less efficiently. The 4-way machine on average fully utilizes its issue bandwidth 52% of the time. For the 6-way machine, this number declines to 36%, and it diminishes to 16% for the 8-way machine. We justify these results as follows: finding ILP within instructions is not a simple undertaking. The probability of discovering 4 instructions that can be executed in parallel is higher than that of discovering 8. Hence, the 4-way machine can better utilize its resources than the 6 and 8-way machines.

[Figure 5: bar chart, Issue Width Full, for gzip, gcc, mes, mcf, equ, amm, pars, bzip2, and AVG.]

Figure 5 - Percentage of cycles when issue width is fully utilized

From left to right, bars report how often the issue width is fully utilized (all issue slots are used) for 4, 6, and 8-way dual-cluster machines respectively.

We use average RUU occupancy to analyze how fast instructions flow through the pipeline. We use a unified Register Update Unit (RUU) to model the reorder buffer and reservation stations. High RUU occupancy suggests that instructions spend a long time in the pipeline before being committed, which may be attributed to a high number of stalls, whereas low RUU occupancy indicates fast and uninterrupted instruction flow. Figure 6 illustrates the average RUU occupancy for different models of an 8-way machine, all of which have 128 register update units.

With the I-NC model, on average, 64% of the RUU entries are utilized at all times, whereas the I-C and NI-C models have an average RUU occupancy of 67%, indicating that the I-NC model allows for the most uninterrupted flow of instructions in the pipeline, as instructions do not occupy reservation stations for long periods of time. We conclude that removing inter-cluster communication delays results in a faster instruction flow in the pipeline.


Results for 4-way and 6-way machines are very similar, hence we opted to show the graph for the 8-way machine only.

[Figure 6: bar chart, RUU Occupancy (8-way), y-axis 0-100%, for gzip, gcc, mes, mcf, equ, amm, pars, bzip2, and AVG.]

Figure 6 - RUU occupancy

This graph shows the average Register Update Unit occupancy for an 8-way machine with the different models: I-C, I-NC, NI-C respectively (left to right).

Figure 7 shows the instructions per cycle (IPC) for each benchmark for the NI-C and I-NC models relative to the I-C model. In other words, figure 7 illustrates how much performance could be gained if there were no inter-cluster communication delays (I-NC model) or if there were no issue bandwidth limitation (NI-C model). It is apparent that, with the exception of very few benchmarks, the I-NC model always has a higher IPC compared to the NI-C model, suggesting that communication stalls play a more critical role in the performance of clustered architectures.

The difference becomes more evident as the issue width increases. In the case of the 4-way machine, both I-NC and NI-C models improve the average IPC by 6% relative to the I-C model. Comparing this with the 8-way machine, the relative performance gain of the I-NC model is 14% versus a 2% enhancement by the NI-C model.


[Figure 7: three bar charts, IPC Improvement (4-Way), (6-Way), and (8-Way), showing I-NC and NI-C gains relative to I-C for gzip, gcc, mes, mcf, equ, amm, pars, bzip2, and AVG; the 8-way chart's y-axis spans 0-16%.]

Figure 7 - IPC gain for I-NC & NI-C

This graph illustrates how much IPC could improve relative to the I-C model if stalls due to inter-cluster communication delay were removed (I-NC model) or stalls due to limited issue width were removed (NI-C model).

As we observed in figure 5, machines with wider issue logic do not utilize their resources very efficiently anyway; therefore it is intuitive that removing the issue bandwidth limitation (NI-C model) will not improve performance dramatically for the 8-way machine. Additionally, the abundance of issue bandwidth in wider machines (such as the 8-way) yields higher ILP (more instructions executing in parallel), which inherently prompts more inter-cluster communication, hence more stalls of this nature. These results are in concert with the findings in figure 4, which show that the majority of stalls are due to inter-cluster communication, especially as issue width increases.

We explain the unusual behavior of the bzip2 benchmark in figure 7 on the 4-way machine as follows: this benchmark seems to be very susceptible to low issue width; hence, due to the scarcity of issue width in the dual-clustered 4-way machine, removing the inter-cluster communication does not enhance the performance. In other words, inter-cluster communication is not the bottleneck; rather, it is the intra-cluster bandwidth limitation that inhibits the performance of bzip2.

3.6 Other Architectures

We start this section by analyzing the effects of clustering-induced stalls on an 8-way quad-cluster machine. Figure 8 compares the stall distribution in an 8-way dual-cluster machine (a machine with 2 clusters, each of which has 4 issue slots) with an 8-way quad-cluster machine (a machine with 4 clusters, each of which has 2 issue slots). As can be seen from the graphs, in the case of the quad-cluster machine, the inter-cluster communication stalls alone outnumber both types of stalls combined for the dual-cluster machine. The growth for the intra-cluster bandwidth limitation, however, is not as substantial. This is an indication that as the number of clusters increases, the inter-cluster communication stalls become more of an obstacle.

Our simulation results also show the IPC gain of an 8-way quad-cluster machine for the I-NC and NI-C models relative to the conventional I-C model. The average gains for the I-NC and NI-C models are 17% and 3% respectively. These numbers were 11% and 1% for the 8-way dual-cluster machine (figure 7). This is yet another confirmation of our hypothesis that machines with more clusters suffer more severely from inter-cluster communication stalls (rather than stalls due to limited issue width) and they would benefit greatly if such stalls were reduced.

Table 4 - Stall distribution, 8-way dual cluster vs. quad cluster

Machine              Avg. Inter-Cluster Communication Stalls   Avg. Intra-Cluster Stalls
8-Way Dual Cluster   324,791,365 (75%)                         107,531,569 (25%)
8-Way Quad Cluster   452,632,612 (78%)                         125,993,475 (22%)

This table shows the average number of each type of stall for 8-way dual and quad cluster machines, which are plotted below in figure 8.

[Figure 8: bar chart, Stall Distribution (Dual vs. Quad Cluster), in millions of stalls, for gzip, gcc, mes, mcf, equ, amm, par, bzip2, and AVG.]

Figure 8 - Stall distribution, 8-way machine dual cluster vs. quad cluster

This graph shows the stall distribution (in millions of instructions) for an 8-way dual-cluster machine (left bars) versus an 8-way quad-cluster machine (right bars) for a set of SPEC'2K benchmarks.

Next we study the effects of increasing the number of inter-cluster communication delay cycles on clustered processors. Figure 9 demonstrates how much IPC could be improved for dual-cluster 4, 6, and 8-way machines with an inter-cluster communication delay of 2 cycles if inter-cluster communication stalls were eliminated. In order to put these results in perspective, we compare the numbers with the results obtained from similar machines with an inter-cluster communication delay of one cycle (figure 7). The average IPC gain for the I-NC model for 4, 6, and 8-way machines with a communication delay of one cycle is 6%, 9% and 11% respectively. These numbers more than double for similar machines with a communication delay of 2 cycles, to 14%, 19% and 24%. However, varying the communication delay cycles has negligible effects on the results of the NI-C model, which is yet another indication that the major factor impeding the performance of clustered processors is inter-cluster communication delays rather than the limited issue width. Intuitively, the effects of inter-cluster communication become more critical for architectures that have longer inter-cluster communication delays (i.e. it takes more clock cycles to transfer data between clusters).

[Figure 9: bar chart, Relative IPC Gain for I-NC (communication delay = 2), for gzip, gcc, mes, mcf, equ, amm, pars, bzip2, and AVG.]

Figure 9 - IPC gain for I-NC with inter-cluster communication delay of 2 cycles

This graph represents how much IPC could be improved for dual-cluster machines with an inter-cluster communication delay of 2 cycles if inter-cluster communication stalls were removed.

3.7 Conclusion

We studied three different clustered machines with different issue widths and reported the simulation results of several SPEC'2K benchmarks on each machine. We investigated how each type of clustering-induced stall affects performance and resource utilization. Our findings demonstrate that of the two types of clustering-induced stalls, inter-cluster communication latencies are the dominant type impeding performance, and efficient resource utilization may not be possible unless such stalls are diminished.


We noticed that in a machine with few issue slots (a 4-way machine in our research), intra-cluster issue width limitation and inter-cluster communication stalls both contribute almost equally to the total number of stalls. As the number of issue slots increases, however, it is the communication across the clusters that becomes the major bottleneck, which in turn prevents efficient resource utilization. This is supported by our simulation results, which show that as the issue width increases, the number of cycles in which the issue width is completely full decreases.

Our findings provide evidence that inter-cluster communication delays become more of a hurdle in clustered processors as the issue width or the number of clusters increases. This comes at a time when the industry is moving towards wider architectures and more clusters on the die due to rapid advancements in the transistor technology [8]. Therefore, although clock rates will continue to rise, IPC may not improve considerably unless better cluster assignment techniques are developed to reduce inter-cluster communication stalls.


4. Energy Efficiency Analysis of Clustered Processors

In this section we study and examine multi-cluster architectures from an energy efficiency standpoint. Our goal is to show that processors' energy efficiency increases with more clusters.

4.1 Introduction

In recent years, low-power design has become a major consideration for hardware architects. This is due to the fact that problems associated with power-hungry architectures are no longer confined to mobile and battery-powered devices, where energy budgets are restricted by size and battery life. Power-aware design has also gained importance in high-performance domains such as desktop computers and servers, due to heat dissipation costs and reliability concerns. It is expected that the maximum processor cycle time, hence frequency, will be limited by thermal constraints because the heat generated by high-speed circuits will be too high to be cooled in an affordable and effective manner [11,14].

The key challenge is inventing methods that reduce power dissipation without adversely affecting performance. One such approach is decentralized architectures. Many decentralized designs have been proposed and studied [12], however our focus is only on the multi-cluster architecture. Hence, we will use the terms decentralized and multi-cluster interchangeably. We analyze and compare the performance and energy consumption of several centralized architectures with their decentralized counterparts (i.e. same issue width, but distributed among several clusters). We show that microprocessors become more energy-efficient as the number of clusters increases.


4.2 Energy-Delay Metric

As mentioned earlier, performance is no longer the single most important feature of a processor; energy and power are also key concerns. This has led to an increasing diversity in the processors available. Comparing the energy efficiency of processors across this wide spectrum is a complex task, which requires an appropriate metric that can capture both energy and performance. Power is not the most suitable metric, as it is proportional to the clock frequency; therefore, dramatic power savings can be achieved simply by reducing the clock speed. In this scenario, while the design becomes more power-efficient, the performance suffers. Additionally, the longer execution time caused by the slower clock frequency can result in higher energy consumption (Energy = Power × Time).

An alternative metric is energy per instruction. This metric also has its shortcomings: energy is proportional to CV², meaning that energy per instruction may be reduced by lowering the supply voltage or reducing the capacitance (using smaller transistors). Both of these increase the delay of the circuit and adversely impact the performance. Therefore, a suitable metric is one that encompasses both energy and performance simultaneously. Hence, we use the product of energy and delay, simply known as the energy-delay.

The energy-delay product, E × D = (energy/operation) × (cycles/operation), is a reasonable metric for evaluating the energy efficiency of a microprocessor [13]. Using the energy-delay metric, an energy-efficient architecture can be defined as an architecture that delivers the highest performance among all architectures while dissipating the same amount of energy. An alternative definition is the architecture that dissipates the least energy while delivering the same performance [12].

When analyzing the graphs and illustrations pertaining to energy-delay, readers should note that this metric improves as it decreases. In other words, lower energy-delay indicates better energy efficiency and/or enhanced performance.
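Computing the metric from simulator outputs is straightforward, as in the sketch below; the argument names mimic the totals reported by SimpleScalar/Wattch-style runs but are assumptions of this sketch, not the tools' actual field names:

    # Energy-delay product from run totals: (energy per instruction) x
    # (cycles per instruction). Lower is better.
    def energy_delay(total_energy_joules, total_cycles, committed_insts):
        energy_per_op = total_energy_joules / committed_insts
        cycles_per_op = total_cycles / committed_insts   # = 1 / IPC
        return energy_per_op * cycles_per_op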


4.3 Related Work

Numerous studies have been conducted and published on clustered processors [2, 5, 9, 10] over the past decade. The majority of these studies are concerned with processor complexity and how large centralized architectures impair clock rate. Balasubramonian et al. [2] show that shrinking CMOS process technology and the trend towards faster clock rates make it quite difficult to design large monolithic architectures without compromising clock speed and hence suggest clustered architectures as an alternative. Others [15] study the negative effects of clustering and propose solutions to alleviate the drawbacks associated with distributed architectures.

Zyuban and Kogge [12] have presented ideas for low-power high-performance superscalar architectures. As part of their research, they investigate the energy-delay of distributed architectures. Although the findings are similar, their work is different from the work presented in this thesis. Zyuban and Kogge use an architectural simulator to measure the number of accesses to various parts of the processor for SPEC95 benchmarks. This data is then applied to an energy model of functional units within the processor to estimate energy and power. The results are presented as plots of energy versus IPC for 1, 2, and 4-cluster architectures. In this thesis, we use SimpleScalar [20] and Wattch [21] simulation tools to more accurately calculate the power and energy consumption of 1, 2, 4, and 8-cluster processors for a subset of SPEC'2K benchmarks [25]. Furthermore, we investigate the effects of branch misprediction in distributed architectures and analyze pipeline gating in the context of clustered processors.

Our data is presented as energy-delay versus number of clusters for 1, 2, 4, and 8-cluster processors (more on this in section 4.4). We also provide simulation results and analysis for theoretical and practical pipeline gating methods to show how much energy efficiency can be improved by preventing the pipeline from executing down the wrong path.


4.4 Energy Efficiency in Multi-Cluster Systems

The multi-cluster architecture is a decentralized, dynamically scheduled architecture in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural registers [12]. The advantage of the multi-cluster architecture is that it allows for a faster clock frequency, compared to centralized architectures with the same number of hardware resources, by reducing the size and complexity of components on critical paths, such as the issue window. The distributed nature of clustered configurations, nonetheless, requires data access across multiple clusters, which is inherently slow [15]. This results in diminished performance in terms of instructions per clock cycle (IPC) and reduces the number of instructions that can be simultaneously in execution. The faster clock achieved by the less complex hardware, however, could potentially compensate for the negative effects on IPC.

4.4.1 Energy-Efficiency versus Number of Clusters

In this section, we study the effects on energy-delay as the number of clusters varies. Figure 10 shows the total energy-delay for architectures of different widths (4, 6, 8 and 16-way) with varying numbers of clusters (1, 2, 4, 8). We have simulated wider machines with larger numbers of clusters; for instance, the 16-way machine is simulated with as many as 8 clusters. Narrower machines such as the 4-way, on the other hand, are simulated with fewer clusters; for example, a 4-cluster 4-way machine is not a practical configuration and hence is not simulated. Our simulation results establish that the energy-delay decreases (improves) as more clusters become available. Closer analysis of the results reveals that as the number of clusters grows, IPC is adversely affected whereas energy consumption improves. The results of these simulations confirm our hypothesis that the energy enhancements associated with multi-cluster architectures outweigh the decline in their performance; hence energy efficiency as a whole improves.


[Figure 10: four bar charts, E-D vs. No. of Clusters for 4-way, 6-way, 8-way, and 16-way machines, plotting energy-delay for gzip, gcc, mesa, mcf, equake, ammp, parser, bzip2, and AVG.]

Figure 10 - Energy-delay vs. number of clusters for different benchmarks


Nonetheless, we observe some anomalies in figure 10, specifically with respect to the mcf and bzip2 benchmarks. The mcf benchmark benefits from high IPC (it has the least number of stalls, as seen in figure 4). Therefore, its energy-delay is so low relative to other benchmarks that it does not appear on the graphs of figure 10. The bzip2 benchmark, on the other hand, is very susceptible to issue width, and its performance is degraded dramatically on machines that have a narrow issue width (see the 4-way machine in figure 7). Therefore, although the energy consumption of the bzip2 benchmark improves somewhat with clustering, its performance deteriorates so much that its energy-delay suffers from clustering on machines that are not very wide (i.e. 4-way and 6-way machines).

4.4.2 Pipeline Gating in Distributed Architectures

As discussed in chapter 2, modern superscalar processors exploit ILP to enhance performance, which entails having several instructions in flight concurrently. This implies that when executing conditional branches, the outcome of the branch may not be known (committed) yet.

In order to avoid stalling the pipeline, modern processors attempt to predict the direction of the conditional branch and continue execution speculatively. Discussing the details and different kinds of branch prediction is outside the scope of this document. Nonetheless, regardless of the type of branch predictor utilized, mispredictions are inevitable. Speculatively executed instructions are not committed (do not change the processor state) until the branch outcome is computed. Once the outcome of the branch is known, the speculated instructions are committed if the branch outcome was correctly predicted, or they are flushed if the branch outcome was mispredicted. Therefore, a major source of energy waste in microprocessors is execution of incorrectly speculated instructions. Such instructions consume energy but do not contribute to IPC as they are flushed before being committed.


Pipeline gating has been proposed as a means to reduce the amount of energy waste caused by misspeculated instructions. This approach attempts to reduce flushed (misspeculated) instructions by impeding instruction fetch once the number of low-confidence branches in the pipeline exceeds a certain threshold (five in our simulations). The performance and accuracy of pipeline gating depend on the branch confidence estimator [17] and branch predictor [18] used. Our pipeline gating simulator uses the Both Strong (BOS) approach [16] for branch confidence estimation. The Both Strong method is a form of McFarling branch confidence estimator that combines gshare and bimodal predictors and marks a branch as high confidence only if both predictors are in the strong state and have the same prediction direction. The BOS confidence estimator and its corresponding branch predictors are further explained in section 4.5 where we discuss our methodology.
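The gating mechanism and the BOS estimate, as described above, can be sketched as follows; predictor internals are elided, and the counter encoding (0-3, with 0 and 3 the strong states) is an assumption of this sketch:

    # Pipeline gating with a Both Strong (BOS) confidence estimate: a
    # branch is high confidence only when the gshare and bimodal counters
    # are both in a strong state and agree; fetch is gated while more
    # than THRESHOLD low-confidence branches are in flight.
    THRESHOLD = 5                        # gating threshold used in our simulations

    def both_strong(gshare_ctr, bimodal_ctr):
        strong = gshare_ctr in (0, 3) and bimodal_ctr in (0, 3)
        agree = (gshare_ctr >= 2) == (bimodal_ctr >= 2)
        return strong and agree          # True -> high confidence

    class GatedFetch:
        def __init__(self):
            self.low_conf_in_flight = 0

        def on_branch_fetched(self, gshare_ctr, bimodal_ctr):
            if not both_strong(gshare_ctr, bimodal_ctr):
                self.low_conf_in_flight += 1

        def on_branch_resolved(self, was_low_confidence):
            if was_low_confidence:
                self.low_conf_in_flight -= 1

        def fetch_enabled(self):
            return self.low_conf_in_flight <= THRESHOLD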

A high-confidence estimate means the prediction of this branch is likely to be correct. A low-confidence estimate indicates that the prediction of this branch is likely to be a misprediction and subsequent instructions will not be committed due to misspeculation [18]. Hence, by reducing the number of speculatively executed instructions that are likely to be mispredicted, energy waste due to misspeculation and pipeline flushing may be minimized.

Misspeculation unavoidably occurs in any microprocessor regardless of the number of clusters. Nonetheless, as the number of clusters increases, so does the cost of misspeculation. As mentioned earlier, the distributed nature of multi-cluster architectures entails a number of inter-cluster ports to allow the transfer of data among clusters. Therefore, accessing data in other clusters is an expensive operation, both in terms of the energy required to activate the inter-cluster ports and the number of clock cycles necessary to transfer the data. In a distributed environment, the effects of a misprediction in one cluster can propagate throughout the system. In other words, a misprediction in one cluster may cause the data in all other clusters to be flushed as well. Therefore, the cost associated with mispredictions rises with more clusters. To verify this theory, we demonstrate the simulation results of ideal pipeline gating (a theoretical model) and regular pipeline gating (a practical model) and show that highly distributed structures benefit more from pipeline gating.

Before presenting the simulation results, we should explain the ideal and regular pipeline gating schemes and the difference between them. The ideal pipeline gating scheme always knows whether the outcome of a branch has been predicted correctly. If the branch predictor makes an incorrect prediction, the pipeline front-end is gated to prevent further execution down the wrong path. Although unrealistic, this model demonstrates how poor branch prediction affects centralized and distributed architectures, as well as how much energy efficiency can be improved by inhibiting instruction fetch once an incorrectly predicted branch enters the pipeline.

Of course, a perfect branch predictor or ideal pipeline gating is unattainable in practice. We therefore present simulation results for both ideal and regular pipeline gating and investigate the effects of each below.
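
The two schemes then differ only in the signal that drives the gating decision. Continuing the earlier sketch (same caveats: illustrative names, not simulator code):

/* Regular gating: act on confidence estimates, which can be wrong.   */
int gate_regular(int low_conf_branches_in_flight)
{
    return low_conf_branches_in_flight > GATING_THRESHOLD;
}

/* Ideal gating: an oracle reveals actual mispredictions, so fetch is
 * stopped as soon as a wrongly predicted branch enters the pipeline.
 * The branch still takes the usual cycles to resolve, so the scheme
 * saves wrong-path energy without changing execution time.           */
int gate_ideal(int mispredicted_branches_in_flight)
{
    return mispredicted_branches_in_flight > 0;
}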

Figure 11 compares the energy-delay of regular 16-way processors distributed among different numbers of clusters with that of similar processors utilizing ideal pipeline gating. As can be seen in figure 11, the energy-delay of a non-clustered (centralized) architecture improves on average by 8% when using ideal pipeline gating, whereas an 8-cluster processor of the same width experiences a 23% energy-delay enhancement.
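
Reading these percentages in terms of the energy-delay product of section 4.2 (our own restatement, for concreteness): an improvement of $x$ means the gated product is $(1 - x)$ times the ungated one, i.e. $ED_{\mathrm{gated}} = 0.92\,ED_{\mathrm{base}}$ for the centralized machine versus $ED_{\mathrm{gated}} = 0.77\,ED_{\mathrm{base}}$ for the 8-cluster machine.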

It could be argued that the improved energy-delay of the multi-cluster processor with ideal pipeline gating is due to superior performance rather than enhanced energy efficiency. To address this, we present figure 12, which illustrates the energy savings obtained by utilizing ideal pipeline gating. Notably, the two graphs, energy-delay improvement and energy improvement, are fairly similar; in other words, it is the energy improvements that lead to the enhanced energy-delay. Evidently, better branch prediction is more beneficial to machines with more clusters, which supports our earlier claim that multi-cluster machines are more severely impacted by misspeculation.


Figure 11 - Energy-delay improvement with ideal pipeline gating (16-way)

This graph illustrates how much energy-delay can be improved with ideal pipeline gating. CL1, CL2, CL4 and CL8 designations represent 1-, 2-, 4- and 8-cluster architectures.

Figure 12 - Energy savings with ideal pipeline gating (16-way)

This graph demonstrates how much energy can be saved with ideal pipeline gating. CL1, CL2, CL4 and CL8 designations represent 1-, 2-, 4- and 8-cluster architectures.

It is important to clarify the difference between a perfect branch predictor and ideal pipeline gating. A perfect branch predictor always forecasts the outcome of every branch correctly. Obviously, this leads to better performance, as no misspeculated instructions are executed, which in turn results in better energy efficiency because 1) it takes less time to execute a piece of code and 2) no energy is wasted on wrong-path execution. Ideal pipeline gating, on the other hand, does not alter the accuracy of the branch predictor; rather, it always knows whether the branch predictor has correctly speculated the outcome of the branch. If the branch is incorrectly predicted, the pipeline is gated to prevent further execution down the wrong path. However, it still takes several clock cycles for the processor to execute and resolve the branch instruction, recognize that it was incorrectly speculated, restore the processor state, and resume execution down the correct path. Hence, ideal pipeline gating does not alter the execution time of a benchmark, allowing us to isolate and study the effects of wrong-path execution in distributed architectures. This is why our simulations are performed using ideal pipeline gating rather than a perfect branch predictor.

Figure 13 shows the energy-delay measurements obtained using the practical (non-ideal) pipeline gating method. Conspicuously, it follows the same pattern as the previous simulations: as the number of clusters grows, the energy-delay metric improves.

Clearly, this method is not as advantageous as the theoretical ideal method. However, it still offers attractive energy-delay improvements. The centralized processor experiences a 1% energy-delay improvement, whereas the 8-cluster machine is enhanced by 13% (figure 14). These numbers were 8% and 23%, respectively, when utilizing ideal pipeline gating.

Figure 13 - Energy-delay obtained using pipeline gating (16-way)

This graph illustrates the energy-delay for 16-way processors with 1, 2, 4 and 8 clusters. Processors with more clusters have better energy-delay. CL1, CL2, CL4 and CL8 designations represent 1-, 2-, 4- and 8-cluster architectures.


Figure 14 - Energy-delay without pipeline gating relative to energy-delay with pipeline gating (16-way)

This graph shows how much energy-delay can be improved (or degraded) by utilizing pipeline gating, relative to processors that do not employ it. It demonstrates that pipeline gating can improve energy-delay for some benchmarks, especially as the number of clusters increases. CL1, CL2, CL4 and CL8 designations represent 1-, 2-, 4- and 8-cluster architectures.

It is noticeable in figure 14 that some benchmarks are actually affected negatively by pipeline gating. We explain this phenomenon as follows. Pipeline gating bases its decisions on the confidence of the branch prediction. Low-confidence branches are deemed precarious and cause the fetch unit to be gated, which may sometimes be unnecessary: pipeline gating does not account for the fact that even low-confidence branches may occasionally be predicted correctly. Hence, pipeline gating is imperfect and can cause excessive slowdowns. In highly distributed architectures (4 and 8 clusters), where the cost of misspeculation is quite substantial, the energy savings obtained by pipeline gating are significant enough to offset the performance hit caused by superfluous stalls. Hence, the 8-cluster architecture always benefits from pipeline gating. In the single-cluster machine, where misspeculation is not as detrimental, it is sometimes more constructive to let the pipeline continue down the wrong path and squash the incorrectly executed instructions rather than risk excessive stalls in the pipeline. Nonetheless, on average, all machines benefit from pipeline gating, especially as we move towards highly distributed architectures.
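
In energy-delay terms, this trade-off can be stated concisely. The inequality below is our own summary of the argument, not a formula from the simulator. Writing $E$ and $D$ for the energy and delay without gating, and $E'$ and $D'$ with gating, pipeline gating improves the energy-delay product whenever

$E'D' < ED \iff \frac{E'}{E} < \frac{D}{D'}$,

that is, to first order, whenever the fractional energy saving exceeds the fractional slowdown. With many clusters, misspeculation is expensive and $E'/E$ drops sharply, so the inequality holds despite the extra stalls; with a single cluster, it can fail for some benchmarks.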

In order to gain a more thorough understanding of the effects of pipeline gating on the performance and energy efficiency of distributed architectures, we provide figures 15 and 16, which illustrate energy and delay (performance) on separate graphs. The following conclusions may be drawn from these figures.

Figure 15 - Performance improvement without pipeline gating (16-way)

This graph shows that pipeline gating slightly diminishes performance; better performance may be obtained without pipeline gating. CL1, CL2, CL4 and CL8 designations represent 1-, 2-, 4- and 8-cluster architectures.

Figure 16 - Energy improvements obtained by utilizing pipeline gating (16-way)

This graph illustrates that substantial energy savings are achievable with pipeline gating. CL1, CL2, CL4 and CL8 designations represent 1-, 2-, 4- and 8-cluster architectures.
