
Bachelor Informatica

Design-Space Exploration for Embedded System Caches through Simulation

Lars Wenker

June 9, 2017

Supervisor(s): Sebastian Altmeyer

Informatica, Universiteit van Amsterdam


Abstract

This thesis documents the design and implementation of a flexible simulator for embedded system caches. The cache simulator provides an exact and fast simulation of a configuration that is defined using parameters such as block size, associativity, and replacement policy. It also contains a framework that allows a user to easily define and implement additional replacement policies to add functionality.

Evaluating the performance of a certain cache configuration is necessary when performing design-space exploration, which is the process of searching through the space of possible configurations to find an optimal design, in this case for an embedded system cache.

We perform experiments on the cache simulator to determine the simulation time and evaluate its accuracy. To further establish the use of the simulator in finding an optimal cache configuration, a 'cache optimizer' is implemented that uses the simulator to determine configuration fitness in a genetic algorithm. The use of metaheuristics to explore the design space is intended as a proof-of-concept, as the research needed to apply this method exceeds the scope of this thesis.


Contents

1 Introduction
  1.1 Context
    1.1.1 Caches
    1.1.2 Embedded Systems
    1.1.3 Design Space Exploration
    1.1.4 Cost Functions
  1.2 Outline
  1.3 Research question

2 Background
  2.1 Caches
    2.1.1 Locality
    2.1.2 Cache Performance
    2.1.3 Cache Parameters
    2.1.4 Addressing
  2.2 Cache simulators
  2.3 Metaheuristics

3 Cache Simulation
  3.1 Specification
    3.1.1 Features
    3.1.2 Parameters
    3.1.3 Assumptions and Limitations
  3.2 Design Choices
    3.2.1 Performance
    3.2.2 Ease of Use & Flexibility
    3.2.3 Extensibility
    3.2.4 Replacement Policy Extensibility
  3.3 Implementation
    3.3.1 Parsing Tracefiles
    3.3.2 Parsing Arguments
    3.3.3 Cachesim: Simulating the Cache
    3.3.4 Replacement Policies
    3.3.5 Write Policies

4 Cache Optimization
  4.1 Cost Function
  4.2 Genetic Algorithm
    4.2.1 Overview
    4.2.2 Implementation

5 Experiments
  5.1 Performance Impacts of Cache Parameters
    5.1.1 Block Size
    5.1.2 Number of Blocks
    5.1.3 Associativity
  5.2 Cache Simulator Performance
    5.2.1 Associativity
    5.2.2 Number of Accesses
  5.3 Cache Optimizer
  5.4 Discussion
    5.4.1 Simulator Accuracy
    5.4.2 Simulator Performance
    5.4.3 Optimizer Performance

6 Conclusion
  6.1 Further Work


CHAPTER 1

Introduction

1.1 Context

1.1.1 Caches

As processor performance has continued to improve, the performance of memory and caching has become increasingly important [1]. Because the time taken to retrieve information from system memory tends to be much higher than the CPU cycle time, memory performance can be a large bottleneck for overall system performance. The introduction of a cache can mitigate the effects of this bottleneck by providing storage with faster retrieval for frequently referenced information.

1.1.2 Embedded Systems

An embedded system is a computer system that is part of a larger system with a specialized function. Examples of embedded systems include wristwatches, refrigerators, cell phones, printers and thermostats. When designing an embedded system, there are often extra constraints compared to a general-purpose computer system, such as power consumption, hardware cost, and hardware size [2]. It is therefore desirable for an embedded system to have components that achieve the performance necessary for its specific application, while minimizing these other factors.

1.1.3 Design Space Exploration

Caches for embedded systems are often designed and built specifically for each application. For each embedded system, an optimal cache needs to be found that has the right parameters to satisfy its performance needs and hardware constraints. It has been shown that the cache can account for up to 40-50% of CPU power consumption [3] [4]. This makes it even more important to find an optimal cache design in a context where limiting energy consumption is important, such as in an embedded systems environment. Design Space Exploration is the process of systematically searching the space of possible hardware designs so that the right design can be found and implemented. In the case of a cache, we attempt to find the values for a number of parameters that yield the optimal cache configuration, based on application-specific performance needs and cost factors.

1.1.4 Cost Functions

The cost of a certain cache configuration can be determined using a given cost function. The cost function assigns weights to the different parameters of the configuration and combines them into a single number that reflects the specific needs of an embedded system, such as hardware cost or energy consumption. The cost function is different for every application and can be tailored to suit its needs.
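As an illustration (and not the cost function used later in this thesis), such a function could be a weighted sum of the simulated miss rate and hardware-related terms, with hypothetical application-specific weights w1, w2 and w3:

cost = w1 · miss rate + w2 · cache capacity + w3 · associativity

A battery-powered system might choose large w2 and w3 to penalize large, power-hungry caches, while a performance-critical system might emphasize w1.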


1.2 Outline

Finding a cache configuration for an embedded system is a process that is repeated every time a new embedded system is developed because caches are specifically designed for each system, and the optimal cache configuration differs for each application. A simple process for evaluating cache configurations is needed so that a suitable cache design can easily be found.

This thesis proposes a method for finding the optimal cache configuration for a given application within the context of an embedded system. A parametric embedded system cache simulator is used to determine the performance of a cache configuration. The relevant cost of a cache configuration is determined by evaluating the cost function, and the optimal cache configuration is found using a genetic algorithm.

The main focus of the thesis will be the design choices, implementation and performance of the cache simulator. The use and implementation of a genetic algorithm to find an optimal configuration, referred to as the cache optimizer in this thesis, is intended as a proof of concept to show that the simulator can feasibly be used for cache design.

1.3 Research question

The research question for this thesis is:

"How can a fast and flexible cache simulator be designed for application-specific design-space exploration?"


CHAPTER 2

Background

2.1 Caches

A cache works by storing chunks of memory containing multiple bytes (Blocks) that have been referenced recently and are likely to be referenced in the future. Accessing information that is stored in a cache is much faster than accessing information that is stored in system memory because the cache is located much closer to the CPU physically and uses faster memory technology (such as SRAM) [5].

2.1.1 Locality

According to [5], programs only access a relatively small portion of their address space at any instant of time. This principle is called locality. There are two different types of locality:

• Temporal locality. This means that items that have recently been referenced are likely to be referenced soon. Caches use this by keeping recently referenced memory in the cache so that it can be accessed again faster.

• Spatial locality. This means that items that are near recently referenced items are also likely to be referenced. Caches make use of this by storing information in blocks that contain multiple bytes. Therefore, when an address is referenced, the bytes adjacent to it (the amount being determined by block size) are also placed in the cache.

2.1.2 Cache Performance

When the system requests an address that is in the cache, it is referred to as a hit. The time it takes to access the memory if it is present is referred to as the hit time. If the information is not in the cache, this is called a miss. There are three types of cache misses: cold misses, conflict misses and capacity misses [6].

• Cold misses occur when the cache is initialized, and information has to be written into the cache for the first time. These misses are unavoidable.

• Conflict misses occur when there is space in the cache, but the information cannot be stored because the set mapped to it is full. Fully associative caches (detailed under 'Associativity' in the next section) have no conflict misses, since items can reside anywhere in the cache. The number of conflict misses can be decreased by increasing associativity.

• Capacity misses occur because the entire cache is full. The number of capacity misses can be decreased by using a different replacement policy or by increasing cache size.

When a cache miss occurs, the cache must retrieve the information from system memory. The time it takes to do this and to store the information in the cache is referred to as the miss penalty.


The performance of the cache then depends on hit time, miss penalty, and the rate of misses (miss rate) [5]. Changing the configuration of the cache can have a large impact on all of these parameters.
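A common way to summarize these effects (see e.g. [5]) is the average memory access time:

average memory access time = hit time + miss rate × miss penalty

A configuration change that lowers the miss rate can therefore still hurt overall performance if it increases the hit time or the miss penalty.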

2.1.3 Cache Parameters

To define a specific cache design, we use a number of parameters. A set of values for all parameters is considered to be one cache configuration. This section will detail the different cache parameters included in the cache configurations in this thesis, and their effects on cache performance.

Block Size and Number of Blocks

A block is a chunk of memory that contains a number of bytes that are loaded into the cache when one of the bytes within the block is addressed by the application. The total capacity of a cache can be defined as:

C = BS ∗ NB (2.1)

With BS as the block size in bytes and NB the number of blocks in the cache.

When the number of blocks is increased, the number of cache misses decreases, as more blocks can be stored in the cache before anything needs to be replaced. This is useful when the same addresses are referenced multiple times. Increasing the size of each individual block decreases the number of cache misses for references to addresses that are close to addresses already stored in the cache.

Associativity

Associativity is the property that determines where a block of memory can reside in the cache. The simplest way to construct a cache is to have one possible location in the cache for each block of memory. This is called a direct-mapped cache. Another possibility is to make a cache where information can reside in any position in the cache. This is called a fully associative cache. In practice, direct-mapped caches provide poor performance due to an increased number of conflict misses, and fully associative caches have high hardware costs. Consequently, most caches exist somewhere between these two extremes [5]. These caches are referred to as set-associative caches [6] [5]. In this case, a block can reside in any of n places in the cache.

We can also consider all caches as set-associative caches. We define a direct-mapped cache as 1-way set associative, and a fully associative cache as n-way set associative, with n being the number of blocks in the cache [5]. Increasing associativity tends to decrease the cache miss rate, because it reduces the number of conflict misses created when there is space in the cache, but not in the relevant set for the cache item.

Replacement Policy

When a new entry is placed in a full cache, one entry must be replaced. The policy that determines which cache address to replace is called the replacement policy. The included replacement policies are:

• First In First Out (FIFO). The to-be-replaced cache item is the first item that has been added to the cache.

• Last In First Out (LIFO). The to-be-replaced cache item is the last item that has been added to the cache.

• Least Recently Used (LRU). The to-be-replaced cache item is the item that was least recently used. While this is an effective policy, it is often too costly to implement for hierarchies with more than a small degree of associativity because it is costly to track the most recent access time for each cache item [5] [7].


• Random Replacement (RR). The to-be-replaced cache item is selected at random.

• Pseudo Least Recently Used (PLRU). This is an approximation of LRU that is cheaper to implement. Instead of storing the exact age of memory addresses in the cache, an approximate measure is used. The information for the PLRU policy is stored in a binary search tree. For every node of the tree, there is a bit that stores the most recently traversed direction of the node. When a new block of memory is added to the cache, the pseudo least recently used tag is found by following the direction bits [8].

Figure 2.1: An example of the PLRU algorithm [9].

Write Policy

The write policy determines how data is written from the cache to memory. When data is written into the cache on a write instruction, an inconsistency exists between the cache and the memory [5]. This inconsistency must be resolved by writing the changed data back to memory at some point. While the data is being written to memory, unrelated instructions may be performed to improve performance. The two write policies considered in this thesis are write-through and write-back:

• Write-through writes the new data straight into memory from the cache. This means that for every write instruction, a write to memory must be performed. This scheme is simple, but it is often inconvenient because memory writes are expensive and many writes are performed on the same data. This can cause a significant decrease in performance [5].

• Write-back only writes the information to system memory when the data is replaced in the cache. When a write is performed while the data is present in the cache, there is no need to perform a write to memory. While this can improve performance, write-back schemes are also more complex to implement than write-through [5].

Separate Instruction and Data Cache

It is possible for instructions and data to be stored in separate caches. Doing this reduces the number of cache misses, as instructions and data no longer need to compete for space in the cache. It does, however, increase hardware cost, complexity and power consumption, because an entirely separate cache must be integrated into the design.


2.1.4 Addressing

Figure 2.2: A cache address, consisting of a tag, an index, and a block offset field [5]

A cache address contains three parts: a tag, an index, and a block offset field. These three fields each play a different role in cache addressing.

A cache can determine whether the correct block of memory is present by comparing the tag of the address. The tag field consists of a number of bits in the cache address. The number of bits used for the tag field depends on the number of bits in the address, and on the number of bits used for the index and block offset fields.

The index field determines in which set a block can be contained. The size of the index field depends on the number of sets: doubling the associativity halves the number of sets and therefore decreases the number of index bits by one. The number of bits needed for the index field is given by log2(#sets), with

#sets = #blocks / associativity

When the right set is selected, every entry must be checked to see if it contains the same tag as the cache address. To speed this up, the search is executed in parallel. In a fully associative cache, there is only one 'set', so there is no index field, and all cache blocks are checked in parallel [5].

The block offset is used to address the specific byte inside the cache block that contains the requested memory. The size of this field is dependent on the block size in bytes, and is determined by

log2(block size)

For example, if a 4-way set-associative cache has 32 blocks that contain 8 bytes each, and we assume the address has a length of 2 bytes (16 bits), the number of bits is log2(8) = 3 for the block offset field, log2(32/4) = log2(8) = 3 for the index field, and 16 − 3 − 3 = 10 for the tag field.

2.2 Cache simulators

A cache simulator works by evaluating an execution trace, which is a file that contains all the memory fetch instructions and addresses requested by a CPU during the execution of an application. By running through this trace file in a simulator, we can determine the exact hit/miss ratio of a given cache configuration. A simple cache simulator can be expanded to achieve greater accuracy at the possible expense of performance.

There are two ways of performing cache simulation to evaluate a cache: to evaluate one cache configuration at a time or to evaluate multiple cache configurations simultaneously [10]. Single cache configuration simulators include Dinero IV [11], Drcachesim [12] and Cachegrind [13]. These simulators are unfortunately often lacking in either performance or flexibility, particularly when it comes to performing parallel simulations with different replacement policies.

In the case of simultaneous cache simulations, properties are used that allow for single-pass cache simulation. This means that the execution trace is only read once for simulating every cache configuration. These simulators include binomial tree and stack-based implementations [10]. An example of such a property is [6], [10]:


When a cache hit occurs for an address MA in a cache configuration (B, S, A), all other configurations (B, S', A) with S' > S and an LRU replacement policy are also guaranteed to have a hit for MA.

The issue with this approach to simulation is that the decrease in evaluation time comes at the expense of flexibility. Often these properties are only valid for LRU or, to a lesser extent, FIFO [10]. While LRU is often claimed to be a common replacement policy in academia [5], [14], it is often too costly to implement in practice, especially in an embedded systems environment [10]. An analysis of different replacement policies conducted by Daniel Grund [7] shows that LRU is the most expensive replacement policy to implement, and is rarely implemented. The LRU implementations that are used in practice tend to use only low associativities (2- or 4-way) [7]. Consequently, the current simulation techniques that rely on or assume an LRU/FIFO replacement policy are less useful, as they lack flexibility for practical cache optimization. A cache simulator that can provide more choice in replacement policies while still maintaining good performance would be a relevant and useful tool.

In this thesis we create a cache simulator that aims to provide flexibility and extensibility while also maintaining good performance. The goal is to be able to easily simulate different replacement policies and other parameters at a high level, and to provide an extensible framework for adding new replacement policies. This allows for cache simulation to occur in a way that is more useful for designing practical cache configurations, especially for embedded systems.

2.3 Metaheuristics

According to [15], metaheuristics are high-level strategies for exploring search spaces using different methods. Examples of metaheuristics include Ant Colony Optimization, Evolutionary Computation (including Genetic Algorithms), Iterated Local Search, Simulated Annealing and Tabu Search. These can be used to solve Combinatorial Optimization problems, which are problems that consist of finding an 'optimal' object from a finite set [16] [17]. Finding an optimal cache configuration can be considered one of these Combinatorial Optimization problems, as we have a set of parameters with discrete values and a finite search space (assuming there is a finite upper limit to the number of blocks). Metaheuristics are useful for these problems when it is impractical to perform an exhaustive search of the search space. In the case of cache optimization, the simulation time needed to evaluate every possible cache configuration would simply be too long, so a non-exhaustive way of finding an optimal configuration is needed.

Genetic Algorithms (GA) are a type of Evolutionary Computation algorithm that is commonly used to solve combinatorial optimization problems [15] and have been proposed as a solution for SoC design space exploration [18]. They are inspired by evolution through natural selection, proposed by Charles Darwin in 1859 [19]. The algorithm was introduced by John Holland [20]. A genetic algorithm begins with a randomly generated population of solutions (chromosomes). A function is used to calculate the fitness of each member of the population. The fitness of an individual is determined by how 'optimal' it is as a solution to the given problem. Fit individuals then have a higher chance to 'reproduce' than unfit individuals, providing a selection pressure for better solutions to evolve [21]. When individuals reproduce, their child's parameter values are chosen by randomly selecting from the parents' values, and a slight chance of random mutation is introduced. The end result is a slowly evolving population that, given enough generations, will produce a (near) optimal solution.

Another type of metaheuristic that is commonly used for Combinatorial Optimization problems is Simulated Annealing (SA) [17]. This algorithm is inspired by a process called annealing in metallurgy, where a substance is melted and the temperature is slowly lowered, spending a long time near the freezing point [17]. The algorithm introduces a temperature component that determines the chance that the solution will be altered, and iteratively improves a single solution until the solution has 'frozen'.


Because using a metaheuristic is an approximate search [15], a good solution can usually be found in a reasonable amount of time, but there is no guarantee that an optimal solution will be found. Furthermore, while there is some research that suggests Genetic Algorithms are the better heuristic [22], more research on the specific problem of cache optimization is needed to conclusively determine which metaheuristic is better.

For this thesis the implementation of a metaheuristic for design space exploration is intended more as a proof of concept, so a simple implementation of a classic Genetic Algorithm with tournament selection is used.
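To make the procedure concrete, the sketch below shows the general shape of such a classic genetic algorithm with tournament selection, written in C. The CacheConfig type, the placeholder fitness function, and all constants are hypothetical and only illustrate the control flow; in the actual cache optimizer of Chapter 4, the fitness of a configuration is obtained by running the cache simulator and applying the cost function.

#include <stdlib.h>

#define POP_SIZE    32
#define GENERATIONS 100
#define TOURNAMENT  4
#define MUTATE_PROB 0.05

/* Hypothetical configuration; the real parameter set is listed in Section 3.1.2. */
typedef struct { int block_size, num_blocks, associativity, policy; } CacheConfig;

/* Placeholder fitness: the real optimizer runs the cache simulator on a trace
 * and converts the result into a fitness value via the cost function. */
static double fitness(const CacheConfig *c) {
    return -(double)(c->block_size * c->num_blocks);  /* dummy: prefer small caches */
}

static double frand(void) { return (double)rand() / RAND_MAX; }

/* Tournament selection: return the fittest of TOURNAMENT randomly drawn individuals. */
static CacheConfig select_parent(const CacheConfig pop[], const double fit[]) {
    int best = rand() % POP_SIZE;
    for (int i = 1; i < TOURNAMENT; i++) {
        int cand = rand() % POP_SIZE;
        if (fit[cand] > fit[best]) best = cand;
    }
    return pop[best];
}

/* Crossover: each parameter is copied from a randomly chosen parent. */
static CacheConfig crossover(CacheConfig a, CacheConfig b) {
    CacheConfig c;
    c.block_size    = (rand() % 2) ? a.block_size    : b.block_size;
    c.num_blocks    = (rand() % 2) ? a.num_blocks    : b.num_blocks;
    c.associativity = (rand() % 2) ? a.associativity : b.associativity;
    c.policy        = (rand() % 2) ? a.policy        : b.policy;
    return c;
}

/* Mutation: with a small probability, replace a parameter by a random power of two.
 * A real implementation must also enforce the constraints of Section 3.1.2. */
static void mutate(CacheConfig *c) {
    if (frand() < MUTATE_PROB) c->block_size = 1 << (rand() % 7);
    if (frand() < MUTATE_PROB) c->num_blocks = 1 << (rand() % 10);
}

/* Evolve the population and return the fittest configuration of the last generation. */
CacheConfig optimize(CacheConfig pop[POP_SIZE]) {
    CacheConfig next[POP_SIZE];
    double fit[POP_SIZE];
    for (int gen = 0; gen < GENERATIONS; gen++) {
        for (int i = 0; i < POP_SIZE; i++) fit[i] = fitness(&pop[i]);
        for (int i = 0; i < POP_SIZE; i++) {
            CacheConfig child = crossover(select_parent(pop, fit),
                                          select_parent(pop, fit));
            mutate(&child);
            next[i] = child;
        }
        for (int i = 0; i < POP_SIZE; i++) pop[i] = next[i];
    }
    int best = 0;
    for (int i = 1; i < POP_SIZE; i++)
        if (fitness(&pop[i]) > fitness(&pop[best])) best = i;
    return pop[best];
}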


CHAPTER 3

Cache Simulation

This chapter will detail the implementation and design choices involved in creating the main product of this thesis, the Cache Simulator. First we define the specification and the design goals for the simulator. Next, we detail the choices made when designing the simulator, and provide the reasoning for making these choices. Finally, the software implementation of the simulator itself will be covered.

3.1 Specification

3.1.1 Features

When designing the Cache Simulator, the objective was to make a simulator that provided a number of features:

• Flexibility. In contrast to similar cache simulators, this simulator must make it easy to simulate a wide range of different cache configurations. This is especially important when it comes to simulating different replacement policies. The goal is to provide the ability to easily switch between replacement policies when testing different cache configurations, and to implement a wide range of policies to choose from.

• Ease of Use. The cache simulator must make it easy for a hardware designer to change and supply parameters (arguments) to the simulator. The parameters must be specified in some way that offers a high level of abstraction, so that it is straightforward to test and adjust different cache configurations.

The results of the simulator must be presented in a readable manner that is uncomplicated and facilitates comparison. This allows the user to easily perform tests iteratively on a cache configuration and compare the results to determine an optimal configuration. It must be possible to use the simulator together with other pieces of code or programs, such as the cache optimization method detailed in Chapter 4, or standalone as a tool for evaluating a single configuration.

• Extensibility. While the simulator supports a variety of parameters, it has to be possible to extend the functionality of the simulator in a straightforward fashion. When creating the simulator it is better to start with a simple simulator, and provide additional functionality after each component has been tested. An effort must be made to create the different parts and functions of the simulator in a modular manner so that functionality can be expanded upon or changed easily.

Particular attention is paid to providing a simple way to add additional replacement policies to the simulator. Because one of the key features of the simulator is its flexibility in choosing replacement policies, providing a simple way to include additional replacement policies only increases this flexibility.

• Performance. One of the main problems with trace-driven cache simulation is the simulation time involved. According to [10], a trace for only a few seconds of an application can consist of millions of memory accesses. A cache simulator is ineffective if it cannot deliver a simulation result in a reasonable amount of time.

By making sure that the cache simulator has a short simulation time, the feasibility of using the simulator to provide practical designs for an embedded system cache increases. Having an optimized simulator also makes it possible to run search algorithms like the one detailed in Chapter 4 to evaluate large numbers of different cache configurations, thereby increasing the likelihood of finding an optimal configuration.

When designing and implementing the cache simulator, choices and trade-offs need to be made to balance and realize the features listed above. The detailed account of these choices is provided in Section 3.2: Design Choices.

3.1.2 Parameters

This section provides a quick overview of the different parameters included in the cache simulator. It is purely intended as a reference list. For a more detailed explanation of what each parameter means, refer to Section 2.1.3: Cache Parameters. Also included in brackets are the names of the arguments that are used to specify these parameters in the simulator.

• Block Size. (-i block size | -d block size).

• Number of Blocks. (-i num blocks | -d num blocks).

• Associativity. (-i associativity | -d associativity).

• Replacement Policy. (-i policy | -d policy).

There are five replacement policies implemented in the cache simulator, with the ability for additional replacement policies to be added in a simple manner. The included replacement policies are:

– First In First Out (FIFO).

– Last In First Out (LIFO).

– Random Replacement (RR).

– Least Recently Used (LRU).

– Pseudo Least Recently Used (PLRU).

• Shared or Unified Data and Instruction Cache. (-shared).

The simulator has the option for a separate instruction and data cache. If there is a separate data cache, a separate set of the above four parameters exists for the data cache. This means that data and instruction cache can have different values for Block Size, Number of Blocks, Associativity and/or Replacement Policy. The default value for this is a separate cache, which can be changed to shared using the provided argument name.

• Write Policy. (-writethrough).

The implemented write policies are write-through and write-back. The default value for this is write-back, which can be changed to write-through using the provided argument name.


• Verbose. (-verbose).

An additional argument can be given for step-by-step logging of the cache simulator. This is useful for testing that the simulator works as intended, and verifying the simulation.

There are two constraints on these values that are necessary to simulate a cache correctly and accurately. The values provided for Block Size, Number of Blocks and Associativity must be powers of 2, and the associativity must be equal to or less than the number of blocks.

3.1.3 Assumptions and Limitations

To limit the scope of the thesis and provide a simulation that more closely approaches an embedded systems environment (which can be simpler in terms of hardware complexity than general-purpose computing), a few assumptions and limitations are made regarding the cache simulator. It is possible to provide a more general-purpose or complex simulator by extending the simulator to remove some or all of these limitations. The assumptions/limitations are as follows:

• The cache simulator will only simulate L1 caches. Multilevel caches are not included in the simulator, although it is theoretically possible for them to be added in the future.

• We assume a uniprocessor is being simulated for this cache. This means that all application traces are executed sequentially, and there are no multiprocessors sharing memory.

• We assume that upon write instructions, other instructions can continue without stalling the processor while the write is occurring. This means that there is no additional latency when a block has to be written from the cache back to main memory.

• The cache simulator only returns figures for hit/miss rates, the types of misses, and memory writes. We assume that any variations in hit time and miss penalty are included in the Cost Function, and are constant throughout trace execution.

3.2 Design Choices

To realize the goals of the simulator detailed in Section 3.1, a number of trade-offs had to be considered. This section will detail the design choices made when creating the simulator, and the reasoning behind them.

3.2.1 Performance

Performance of the simulator is important for the simulator to be practical, and it was necessary to choose a programming language that would allow for quick execution and easy extensibility. Ultimately C was chosen as the language to implement the simulator. C is one of the fastest programming languages in existence [23], and one of the most widely adopted [24]. Using C also allows for direct control of the memory when allocating and deallocating the resources needed for the simulator. The simulator is written according to the C99 standard of the language.

While simulators such as drcachesim [12] allow simulating on multiple threads and processes, for this simulator we decided to focus on maximizing performance while running on a single thread. When finding an optimal cache configuration (such as using the method in Chapter 4), it is likely that multiple independent cache simulations need to be performed successively. It was therefore attempted to maximize the performance of a single simulation running on a single thread, while allowing for parallelization to occur at the optimization level. This reduces simulation overhead and allows for the possibility to simultaneously simulate multiple cache configurations that can have completely different sizes, associativities, replacement policies, or any of the other included parameters.


3.2.2 Ease of Use & Flexibility

To provide ease of use and flexibility, arguments are supplied to the simulator at a high level of abstraction. The user provides the simulator with the parameters mentioned in Section 3.1.2. These can be provided through the command line or through an arguments file. The simulator also performs some sanity checks (such as checking that the associativity does not exceed the number of blocks, and that certain arguments are given as powers of 2). There is also a framework for integrating the simulator within another program. Arguments can easily be provided and results obtained programmatically through the use of structs that are provided and obtained when calling the cachesim function.

3.2.3 Extensibility

Extensibility for the program is provided by producing a structure that is as modular as possible. There are four different modules of the cache simulator:

• The cachesim module: performs the simulation of the cache itself.

• The replacement policy module: calls the relevant replacement policy. All replacement policy implementations are linked here.

• The parser module: parses the tracefile and provides addresses to the cache.

• The argparser module: parses arguments, either in command-line format or in a file. Default arguments are provided in a default Argfile.

By ensuring this modularity, it becomes easy to change or extend the functionality of the simulator. For example, if a different format of tracefile needs to be compatible with the simulator, one only needs to change the parser module. New replacement policies can be added by simply adding them to the replacement policy module. The connection between these four modules is displayed in the figure below.


Figure 3.1: Modular structure of the cache simulator. Simulator.c creates the executable that calls the simulator itself, and the parser that parses arguments from the command line or file. The simulator calls the tracefile parser and the framework for the replacement policies (FIFO, LIFO, RR, LRU, PLRU).

3.2.4 Replacement Policy Extensibility

Providing extensibility for replacement policies is a key feature of the simulator. To ensure this, a general framework for replacement policies is provided in the simulator. A key observation is that many replacement policies can easily be implemented through the use of a single-linked list:

• FIFO has the same behavior as a queue. New items are added to the back of the queue, and items are removed from the front of the queue. A queue can be implemented using a linked list by simply traversing to the end of the list to add items, and removing the first item to remove items.

• LIFO has the same behavior as a stack. New items are added to and 'popped' off the top of the stack, so that the most recently added item is the first one to be removed. Stacks can be implemented using a linked list by adding new items to the front of the list, and linking them to the previously first item. Items are removed from the front of the list.

• LRU can be implemented by traversing the linked list and moving items to the back every time they are referenced by the simulator. The list item that was least recently used will then always be at the front of the list. New items are added to the back of the list, as they are the ’most recently used’ when they are added.

• RR can be implemented by generating a random number between 0 and the size of the list, then traversing the list and removing the item with the index that matches the random number. New items are added to the front of the list.

Implementing these replacement policies using the same data structure means that, to add a new policy, one only needs to add custom functions for adding, updating, and removing items from the list. This limits the amount of code that must be written to add new replacement policies, and provides easy extensibility.

The downside of implementing replacement policies using a linked list is the impact it has on performance. The decision to use linked lists for most replacement policies was a trade-off between extensibility and performance. Most of the policies listed above could be implemented more efficiently if they used some other data structure.

The data structures for implementing the replacement policies are stored in the simulator separately from the cache arrays themselves. An item in the replacement policy list only provides a pointer to the relevant place in the cache. When a cache is searched for a hit or miss, only the cache arrays are searched. The replacement policy lists are only traversed when an item is added, replaced or removed from the cache (for a miss), or when the list is reordered (when a hit occurs, for some replacement policies).

The algorithmic complexity of the linked list implementations is as follows:

• FIFO has a complexity of O(1) for adding items and removing items.

• LIFO has a complexity of O(1) for adding items and removing items.

• RR has a complexity of O(1) for adding items, and a worst-case complexity of O(n) for removing items.

• LRU has a complexity of O(1) for adding items and removing items, and a worst-case complexity of O(n) for updating the list.

The performance difference between O(n) for a linked list and a possible O(1) for a more efficient implementation tends to be low, since the maximum value of n is the associativity of the cache. While associativities can theoretically be high, in practice they are often limited to low values (2, 4, 8), and values as high as 64-way are typically only used with FIFO or RR policies [7]. Therefore, we can assume that linked list traversals are cheap in practice, since the lists are short. For FIFO, LIFO and RR, these list traversals only happen when new items are added to or removed from the cache. Because this happens rarely compared to the number of memory references, these traversals have a negligible impact on performance. LRU does perform a list traversal every time a memory reference happens in the simulator, as it needs to update the list usage. This has the potential to cause a large decrease in simulator performance. However, LRU implementations in practice only tend to have very low associativities (usually 2-way, sometimes 4-way) [7], because higher associativities are too costly to implement [5]. Thus, even though there are many more list traversals, the performance impact of these list updates is limited because the linked lists tend to be short. We can therefore assume that the performance impact of using linked lists is minor.

Some replacement policies, such as PLRU, cannot be implemented with a linked list. For example, PLRU uses a binary search tree structure. Therefore, while the simulator provides an easy framework for adding new replacement policies with a linked list, it is also possible to define custom data structures to implement replacement policies. The replacement policy framework allows the user to create functions for initializing, updating, and deallocating these data structures. Having a non-linked-list implementation for a replacement policy allows for testing of the above performance assumptions. In Chapter 5: Experiments we compare the performance of the binary search tree of PLRU with the performance of linked-list implementations such as FIFO, LIFO, RR and LRU.


3.3 Implementation

This section provides a more technical overview of the implementation of the cache simulator. It will cover the modules detailed in Section 3.2.3: Extensibility, as well as all the cache parameters included in the simulator.

3.3.1 Parsing Tracefiles

The simulator reads a file called a tracefile, containing an execution trace that lists all the memory instructions executed by a CPU during the execution of an application. For each step of the simulation, a new memory address is requested by the simulator and parsed by the parser. The information stored in the tracefile was chosen to ensure compatibility with [25], and is in the following format:

Instruction Address   1   Instruction Type   Data Address (Optional)

Figure 3.2: The format for execution traces parsed by the simulator

• The Instruction Address. This is always present in every line of the trace. It contains the address in memory of the instruction that needs to be executed. The address contains 4 bytes (32 bit).

• The 1 here refers to the time taken to execute an instruction. This field was originally intended to be used in the simulator, but did not end up in the final version; that decision was made outside of and prior to this thesis. This value is always equal to 1. It is possible to change the parser in the future to remove this field.

• The instruction type. It can either be read (’r’), write (’w’), or execute (’e’). If there is a read or write instruction, a data address should also be present. For an execute instruction, only an instruction cache address will be given.

• The Data Address. This contains the address in memory of the data to be read or written in case of one of those instructions. Like the Instruction Address, this address also has a size of 4 bytes.


Figure 3.3: An example of part of an execution trace. The trace is a sequence of instructions, and each instruction is one memory reference made during the execution of an application.

Initialization

First, the parser must be initialized. The parser contains its own struct, called TraceBuffer, that it uses to store the information needed for the parser to parse a new address for use by the simulator. This contains values used by the parser such as pointers to buffers and a file pointer to the tracefile, as well as the tag and index fields for a parsed address. Rather than completely read the tracefile into memory, only one line of the execution trace is read at a time, and the TraceBuffer is reused. This is necessary to reduce memory usage as tracefiles can contain millions of lines, and multiple simulations could be running simultaneously.
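The exact layout of the TraceBuffer struct is not reproduced in this thesis; the sketch below only illustrates what such a struct could contain, based on the description above. All field names are illustrative assumptions.

#include <stdio.h>

/* Illustrative sketch only; the actual TraceBuffer fields may differ. */
typedef struct TraceBuffer {
    FILE *tracefile;        /* open handle to the execution trace                 */
    char line[256];         /* buffer for the current line of the trace           */
    char type;              /* instruction type: 'r', 'w' or 'e'                  */
    unsigned long i_tag;    /* tag field of the parsed instruction address        */
    unsigned long i_set;    /* set index field of the parsed instruction address  */
    unsigned long d_tag;    /* tag field of the parsed data address (r/w only)    */
    unsigned long d_set;    /* set index field of the parsed data address         */
} TraceBuffer;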

Parsing an Address

For each step of the simulation, the cache simulator calls the getAddress() function to get the next memory addresses. This function reads a new line of the tracefile word by word. Once the instruction address is read and stored in the buffer, it is converted to an unsigned long using the strtoul() function.

Next, the values for the tag and index fields (see Section 2.1.4 for more information on addressing) must be extracted. Since the simulator only stores the tag field of the address, the specific byte address inside the block is irrelevant. To remove the byte offset from the address, the address is right-shifted by the byte offset size in bits. In this section, the term 'fixed address' will be used to refer to an address without its byte offset.

The set index field of the fixed address is extracted using a bit mask, in which the bits to be extracted are set to 1 and the rest are set to 0. A bitwise AND is applied between the fixed address and the mask to extract the field. For example, suppose an address has a value of 010110 and the number of bits in the index field is 3. We extract the last three bits of the address by performing a bitwise AND with the mask 000111. The first three bits of the result will evaluate to zero, since

0 ∧ x = 0

regardless of whether x is equal to 1 or 0. The last three bits of the result will be equal to the last three bits of the address, since

1 ∧ x = x

The result of applying the mask is 000110.

The only remaining field to be extracted is the tag field. This can be obtained by right-shifting the fixed address by the number of bits in the index field. To summarize:

Tag = A >> (B + I) (3.1)

SetIndex = (A >> B) ∧ M (3.2)

With A as the address, B as the size of the byte offset field in bits, I as the size in bits of the index field, and M as the bit mask. We use >> as the right-shift operator, and ∧ as the bitwise AND operator. The size in bits of the byte offset and index field can be calculated with the equations given in Section 2.1.4.

The parser stores the extracted instruction tag and set index in the TraceBuffer. If it detects a read or write instruction, it will also extract and store fields from the data address using the same process.
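The two equations above map directly onto shift-and-mask operations in C. The sketch below shows the idea; the function name and exact interface are illustrative and not necessarily those of the parser module.

/* Illustrative sketch of Equations (3.1) and (3.2): offset_bits is the size of
 * the byte offset field (log2 of the block size) and index_bits is the size of
 * the index field (log2 of the number of sets). */
static void split_address(unsigned long address, int offset_bits, int index_bits,
                          unsigned long *tag, unsigned long *set_index) {
    unsigned long fixed = address >> offset_bits;    /* drop the byte offset       */
    unsigned long mask  = (1UL << index_bits) - 1;   /* index_bits ones, rest zero */

    *set_index = fixed & mask;                       /* Equation (3.2)             */
    *tag       = fixed >> index_bits;                /* Equation (3.1)             */
}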

3.3.2 Parsing Arguments

The argument parser reads an arguments file or command line for the cache parameters. The commands for changing parameters are detailed in Section 3.1.2. Once the arguments are read, sanity checks are performed. The arguments are stored in an argument struct. A collection of values for each simulation parameter is one cache configuration.

typedef struct Args {
    // Global arguments
    char* TRACE_FILE;
    bool VERBOSE;
    bool WRITEBACK;
    // Instruction cache arguments
    int I_NUM_BLOCKS;
    int I_BLOCK_SIZE;
    int I_ASSOCIATIVITY;
    int I_POLICY;
    // Data cache arguments
    int D_NUM_BLOCKS;
    int D_BLOCK_SIZE;
    int D_ASSOCIATIVITY;
    int D_POLICY;
    bool SEPARATE;
} Args;

Figure 3.4: The struct that contains all cache simulator arguments. These include global arguments such as tracefile path and write policy, as well as two sets of arguments for instruction and data cache.

3.3.3 Cachesim: Simulating the Cache

The main cache simulator implements a cache by storing tags in an array of unsigned longs. Addresses are stored as unsigned longs since they can represent the full value range of the 4-byte addresses given in the tracefile. This means that there is no possibility for an overflow.


Algorithm

First, the cache simulator initializes the cache array, and the datastructure that implements the replacement policy. The cache array is a 2D array with the dimensions being associativity and the number of sets, which can be calculated by dividing the number of blocks by the associativity of the cache. If there is a separate data cache, this must be done twice so that both caches are initialized separately. Finally, the simulator initializes the tracefile parser. The simulator is now ready to begin the cache simulation.

while (!last)
    last = getAddress();              // Get the next address from parser

    if (VERBOSE)
        printStep();

    step(i_cache, i_tag, i_set);      // Evaluate instruction cache
    // If data tag is obtained from parser, evaluate data cache as well
    if (d_tag)
        if (separate_cache)
            step(d_cache, d_tag, d_set);  // Separate data cache
        else
            step(i_cache, d_tag, d_set);  // Shared data cache

Figure 3.5: Pseudocode for the cache simulation algorithm

For each step of the simulation, a new set of instruction and (optionally) data addresses is obtained from the tracefile parser. With the new address, the step() function is called. The step() function performs one step of the simulation for one address. When both a data and instruction address are obtained from the parser, step() is called twice. After the last line from the tracefile has been successfully simulated, the simulator frees all memory used by the simulator and returns the results.

The step() function

When the step() function is called, all items in the relevant set of the cache are searched. If the tag is found in the cache, a hit is added to the results. If the tag is not found in the cache, the simulator evaluates what type of miss is encountered. If the address is referenced for the first time, it is counted as a cold miss. If any of the other sets have empty places, the miss is added as a conflict miss. Finally, if the address was referenced before, and the cache is full, the miss is counted as a capacity miss.

In case of a conflict or capacity miss, an item must be replaced from the cache according to its replacement policy. This is handled using the replacement policy framework. The popAddress() function removes an address from the replacement policy list according to its policy and returns it to the simulator. The simulator then overwrites that block in the cache with the given tag. When a new block is added to the cache, it is also added to the replacement policy list using the pushAddress() function.


for (int i = 0 ... associativity)
    if (tag in cache)
        Update replacement policy list
        hits++;
        return;

// If tag is not present in the cache
misses++;
for (int i = 0 ... associativity)
    if (cache element empty)
        Insert tag in empty cache element
        Push to replacement policy list
        cold_misses++;
        return;

Get pointer to replacement address from replacement policy datastructure
Replace pointer content with new tag
Push to replacement policy list

// Check if miss is a conflict miss
for (int i = 0 ... #sets)
    if (any set not full)
        conflict_misses++;
        return;
capacity_misses++;
return;

Figure 3.6: Pseudocode for the step() function that searches the cache for a tag and replaces an item in the cache if necessary.

Results

Storing the results of the simulation is done using a specially defined struct called simResult. This struct can be printed to the console when the simulator is used standalone, or returned to be evaluated programmatically (such as in the cache optimizer). The simulator uses the clock library in C to keep track of the amount of time taken to perform the simulation.

struct simResult {
    int hits;
    int misses;
    int cold_misses;
    int capacity_misses;
    int conflict_misses;
    int memory_writes;
    int end_dirty_bits;
    double processor_time;
    double miss_percent;
};

Figure 3.7: The simResults struct returned by the simulator. These include hits, misses, types of misses, number of memory writes, dirty bits at the end of the simulation, time needed to perform the simulation, and the percentage of misses.
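As an illustration of the programmatic use mentioned in Section 3.2.2, the fragment below fills an Args struct and reads the miss percentage from the returned simResult. The call assumes a hypothetical signature struct simResult cachesim(Args args), an illustrative trace path, and the policy identifiers defined in Section 3.3.4; the actual interface of the simulator may differ.

/* Hypothetical usage fragment; assumes: struct simResult cachesim(Args args). */
Args args = {
    .TRACE_FILE      = "traces/example.trace",  /* illustrative path         */
    .VERBOSE         = false,
    .WRITEBACK       = true,                    /* write-back (the default)  */
    .I_NUM_BLOCKS    = 64,
    .I_BLOCK_SIZE    = 16,
    .I_ASSOCIATIVITY = 4,
    .I_POLICY        = LRU,
    .D_NUM_BLOCKS    = 64,
    .D_BLOCK_SIZE    = 16,
    .D_ASSOCIATIVITY = 4,
    .D_POLICY        = FIFO,
    .SEPARATE        = true                     /* separate data cache       */
};

struct simResult result = cachesim(args);
printf("misses: %d (%.2f%%)\n", result.misses, result.miss_percent);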


3.3.4 Replacement Policies

This section will first detail the implementation of the replacement policy framework. Next, we will discuss the implementation for each included policy. Finally, the process for extending the simulator with a new replacement policy will be described.

Replacement Policy Lists

The data structure that administers the replacement policies is implemented in the framework as a struct called a replacement policy list, or RPList. This struct points to a linked list by default, but also supports different data structures for replacement policies. To provide this flexibility, a void pointer is used. RPLists also contain an identifier value for the replacement policy, and information that is used when implementing the replacement policy, such as list size and cache associativity.

The framework also defines a struct for a node in a single-linked list, called RPListItem. This struct simply contains a pointer to an element in the cache, and a pointer to the next item in the linked list.

struct RPListItem {
    unsigned long* address;
    RPListItem* next;
};

struct RPList {
    void* first;
    void* last;
    int policy;
    int associativity;
    int size;
    unsigned long* cache;
};

Figure 3.8: The RPList and RPListItem structs. The void pointers in RPList either point to the first and last RPListItem in the linked list, or to the items in a custom data structure.

Framework

The purpose of the replacement policy framework is to call the correct function for every included replacement policy. The simulator initializes and frees the replacement policy lists using initList and freeList, and performs replacement policy operations using pushAddress, popAddress, and updateList. The replacement policy framework then calls the function that implements the correct replacement policy.

In order to define the policies that are included, each policy is assigned a number in the replacement policies header file:

#define FIFO 0
#define LIFO 1
#define LRU  2
#define PLRU 3
#define RR   4

A replacement policy’s function can then be called through the use of a switch statement. As an example, the popAddress function is included below:


function popAddress(list):
    switch (policy)
        case FIFO:
            return fifo_pop(list);
        case LIFO:
            return lifo_pop(list);
        case LRU:
            return lru_pop(list);
        case PLRU:
            return plru_pop(list);
        case RR:
            return rr_pop(list);

Figure 3.9: Pseudocode for popAddress(), one of the functions included in the framework. The framework simply calls the function for the relevant replacement policy and returns the result.

Some functions, such as updateList, are only relevant for some replacement policies (LRU and PLRU in the included set). In this case, the switch only contains the policies that implement the function. Other functions such as initList and freeList need to be executed in a different manner for replacement policies that use a custom data structure. In these functions, either a custom free or initialize function can be called, or a linked list is initialized/freed by default.

FIFO

The FIFO replacement policy is implemented with a linked list through the functions fifo_pop and fifo_push. FIFO behaves like a queue, so items are pushed to the back of the list and popped from the front of the list. The fifo_pop function removes the first item from the RP list, retrieves its address and returns it. The first item's successor then becomes the new top of the list. The fifo_push function pushes new items to the back of the list. It creates a new linked list element that contains the given address. This element is inserted at the back of the list by making it the last element's successor, using the last pointer in the RPList struct.
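As a concrete illustration of this queue behavior, a minimal sketch of the two operations is shown below. The struct definitions are simplified versions of those in Figure 3.8 (the real structs use void pointers), and details such as error handling may differ from the actual implementation.

#include <stdlib.h>

/* Simplified versions of the structs from Figure 3.8. */
typedef struct RPListItem {
    unsigned long *address;           /* pointer to a tag slot in the cache array */
    struct RPListItem *next;
} RPListItem;

typedef struct RPList {
    RPListItem *first;                /* front of the queue (oldest item) */
    RPListItem *last;                 /* back of the queue (newest item)  */
    int size;
} RPList;

/* Sketch of fifo_push: append an item pointing at a cache element to the back. */
static void fifo_push(RPList *list, unsigned long *address) {
    RPListItem *item = malloc(sizeof(*item));
    item->address = address;
    item->next = NULL;
    if (list->last != NULL)
        list->last->next = item;      /* link behind the current tail */
    else
        list->first = item;           /* list was empty               */
    list->last = item;
    list->size++;
}

/* Sketch of fifo_pop: remove the oldest item and return its cache pointer.
 * Assumes the list is non-empty when called (the set it models is full). */
static unsigned long *fifo_pop(RPList *list) {
    RPListItem *item = list->first;
    unsigned long *address = item->address;
    list->first = item->next;
    if (list->first == NULL)
        list->last = NULL;
    free(item);
    list->size--;
    return address;
}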

LIFO

Like FIFO, LIFO is implemented using the functions lifo_pop and lifo_push. LIFO behaves as a stack, so items are both pushed to and popped from the front of the list. Since items are popped from the front of the list with both LIFO and FIFO, the pop function is identical. The lifo_push function must push new items to the front of the list. First, it creates a new linked list element with the given address. The next field of the new element points to the first item of the list. Pointing the first pointer in the RPList struct to it causes it to be placed at the front of the list.

LRU

LRU is implemented by three functions: lru_pop, lru_push and lru_update. Like with FIFO, elements are popped from the front of the list and pushed to the back of the list. Thus, the pop and push functions are identical to the FIFO implementation.

The lru_update function is called when a cache hit occurs and must move referenced blocks to the end of the linked list. First, the list is traversed to find the item that contains the referenced address. This item is then removed from the list by pointing the next field of its predecessor to its successor. Finally, the item is added to the end of the list using the same process as fifo_push.
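A minimal sketch of this update step, reusing the simplified RPList/RPListItem declarations from the FIFO sketch above, could look as follows; it is an illustration of the idea rather than the simulator's exact code.

/* Sketch of lru_update: on a cache hit, move the item that points at the
 * referenced cache element to the back of the list, marking it as the most
 * recently used. */
static void lru_update(RPList *list, unsigned long *address) {
    RPListItem *prev = NULL;
    RPListItem *item = list->first;
    /* Find the list item that points at the referenced cache element. */
    while (item != NULL && item->address != address) {
        prev = item;
        item = item->next;
    }
    if (item == NULL || item->next == NULL)
        return;                       /* not found, or already the most recent */
    /* Unlink the item from its current position... */
    if (prev == NULL)
        list->first = item->next;
    else
        prev->next = item->next;
    /* ...and re-append it at the back of the list. */
    item->next = NULL;
    list->last->next = item;
    list->last = item;
}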


RR

The RR implementation has three functions: rr_init, rr_push and rr_pop. To get pseudo-random behavior, the rand function from the stdlib library is used. The rr_init function serves only to set the seed of the random number generator. To ensure reproducible results, the same seed is always used. The seed has a value of 55555.

Because items are popped from the list randomly, it is irrelevant whether items are added to the front or back of the list, so the RR implementation uses the same push function as LIFO’s implementation.

To pop items from the list, an index between 0 and the size of the list is randomly generated. Next, the list is traversed until the item that has the same index as the generated number is found. That item is then removed from the list by pointing its predecessor's next field to its successor. Finally, the removed item's address is returned.

PLRU

PLRU is implemented in a completely different way than the above replacement policies, because it uses a different datastructure. For an explanation of PLRU, see Section 2.1.3. To implement PLRU, we must first define and implement the binary search tree datastructure, followed by the functions that perform operations on the tree.

The first void pointer of the RPList struct points to the top node of the tree. The binary tree is implemented through multiple instances of a struct called PLRUTreeNode. The contents of this struct are displayed in Figure 3.10.

#define LEFT 0
#define RIGHT 1

struct PLRUTreeNode {
    int direction;
    bool leaf;
    void *left;
    void *right;
};

Figure 3.10: The PLRUTreeNode struct. It contains an integer value that can either point left or right, a boolean that indicates whether it is a leaf node, and void pointers for the left and right child nodes. These can point to other PLRUTreeNodes or elements in the cache arrays, depending on whether the node is a leaf or not.

When the RPList is initialized, the plru_init function is called. This function initializes the binary search tree. Nodes are initialized from the bottom up, with leaf nodes being initialized first. The pseudocode for this function is shown in Figure 3.11.


function plru_init(list):
    numNodes = associativity / 2
    if (numNodes == 0)    // Set numNodes to 1 if associativity was 1.
        numNodes = 1

    PLRUTreeNode lowNodes[]
    PLRUTreeNode highNodes[]

    // Initialize leaf nodes.
    for (i = 0 ... numNodes)
        initialize a new leaf node that points to two elements in the cache array,
        and store it in lowNodes

    // Initialize other tree nodes.
    while (numNodes > 1):
        numNodes = numNodes / 2
        initialize a new empty highNodes array
        for (i = 0 ... numNodes)
            initialize a tree node that points to two nodes in the lowNodes array,
            and store it in highNodes
        lowNodes = highNodes

    // Point the first pointer in the list to the top node of the tree.
    list->first = lowNodes[0]

Figure 3.11: The algorithm for initializing the binary search tree. Leaf nodes are first initialized, followed by one level of the tree at a time. The number of nodes decreases by half for every upwards layer of the tree constructed, so numNodes is halved.

There are three functions that are called during the simulation of the cache: plru_update, plru_pop and plru_push. plru_update searches the tree to find the correct index, changing the directions of the nodes it traverses on the way. The tree is searched to find the referenced address using an iterative binary search. The direction of every node that is traversed during this search is pointed away from the direction taken in the search, thus updating the tree. When an address is requested by the simulator, the plru_pop function traverses the search tree, following the direction value of each node, and returns the address that the leaf node points to. When new items are ’pushed’, the simulator has already changed the contents of the tag stored in the cache array element, and the binary tree already contains all elements of the cache array. Thus, when a push occurs, only plru_update needs to be called to reflect the newly referenced cache address.
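A sketch of plru_pop using the PLRUTreeNode struct from Figure 3.10. It assumes that leaf nodes point to uint64_t tag elements in the cache array; the simulator's actual element type may differ.

uint64_t plru_pop(RPList *list) {
    struct PLRUTreeNode *node = list->first;
    /* Descend through internal nodes, following each node's direction bit. */
    while (!node->leaf)
        node = (node->direction == LEFT) ? node->left : node->right;
    /* At a leaf, the direction bit selects one of the two cache-array elements. */
    uint64_t *element = (node->direction == LEFT) ? node->left : node->right;
    return *element;
}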

Adding new policies

The included set of replacement policies can be extended through a simple process, consisting of the following steps:

1. If the replacement policy uses a custom data structure, implement the datastructure. The void pointer of the RPList will point to this datastructure, and the replacement policy implementation is responsible for allocating and deallocating the memory used by the data structure. The init and free functions can be used for this purpose.

2. Write implementations for the relevant functions. The required functions are push and pop for adding and removing items. The update function can be defined for updating the replacement policy every time an item is referenced.


3. Create a header file for the replacement policy that contains all the implemented functions. The header file must be included in replacementpolicy.c.

4. Assign the replacement policy an integer value in replacementpolicy.h.

5. Finally, add all of the implemented functions to the framework by appending them to the switch statements in replacementpolicy.c. The new implementation will now be called when a cache configuration that uses it is simulated.
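As an illustration of step 5, a pop dispatch in replacementpolicy.c might look roughly like the sketch below. The function name rp_pop, its signature, and the FIFO, LIFO, LRU and MYPOLICY constants are hypothetical; the actual switch statements and policy values in the simulator may be named differently.

uint64_t rp_pop(RPList *list, int policy) {
    switch (policy) {
    case FIFO:
        return fifo_pop(list);
    case LIFO:
        return lifo_pop(list);
    case LRU:
        return lru_pop(list);
    /* ... remaining built-in policies ... */
    case MYPOLICY:                /* the integer value assigned in replacementpolicy.h */
        return mypolicy_pop(list);
    default:
        return 0;
    }
}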

3.3.5 Write Policies

There are two different write policies implemented in the simulator: write-through and write-back. The impact of the chosen write policy is modelled by keeping track of the number of memory writes performed during simulation. If the policy is write-back, the simulator also returns the number of blocks still marked ’dirty’ in the cache at the end of the simulation. For more details on these two write policies, see Section 2.1.3.

Write-through

If the write policy is write-through, a memory write is performed for every write instruction. When the tracefile parser encounters a write instruction, it indicates this to the simulator using the TraceBuffer. The simulator then increments the number of memory writes if the write policy is write-through.

Write-back

With the write-back policy, blocks that have been overwritten are only written back to system memory when they are replaced from the cache. To indicate that a block needs to be written to memory when it is replaced, a dirty bit is added to the tag. Once a block is added to the cache, its tag is left-shifted by one bit, and the least significant bit denotes whether the block is ’dirty’ or not. Because the tag has been extracted from the memory address, the only situation in which information can be lost during this shift is a fully associative cache with a block size of 1 byte. Such a cache would not be practical, so it is unlikely that adding a dirty bit affects the accuracy of the simulator.

When a referenced tag is compared with the tags in the cache, the tags in the cache are right-shifted back by one bit so that a valid comparison can be made. This right-shift is not stored, so the value of the dirty bit is not lost. When the simulator encounters a write instruction under the write-back policy, it checks whether the referenced tag already has its dirty bit set, and sets it to 1 if this is not the case. The number of memory writes is incremented whenever a block whose tag has the dirty bit set is replaced from the cache. At the end of the simulation, all the tags that are still present in the cache are examined and the number of tags with dirty bits set is stored with the results.
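A minimal sketch of this dirty-bit bookkeeping, assuming 64-bit tags; the helper names are illustrative and do not appear in the simulator:

#include <stdint.h>
#include <stdbool.h>

/* Reserve the least significant bit of a stored tag as the dirty flag. */
static inline uint64_t store_tag(uint64_t tag)        { return tag << 1; }
static inline uint64_t stored_tag_value(uint64_t t)   { return t >> 1; }
static inline bool     is_dirty(uint64_t t)           { return (t & 1) != 0; }
static inline uint64_t mark_dirty(uint64_t t)         { return t | 1; }

/* Compare a referenced tag against a stored (left-shifted) tag without
 * losing the dirty bit. */
static inline bool tag_matches(uint64_t stored, uint64_t referenced) {
    return stored_tag_value(stored) == referenced;
}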


CHAPTER 4

Cache Optimization

This chapter will discuss the implementation of the Cache Optimizer. The optimizer is implemented as a proof of concept, to show that it is feasible to use the simulator together with metaheuristics to find an optimal cache configuration. As such, the optimizer uses a classic genetic algorithm with a relatively simple cost function. First, we will discuss the cost function, followed by the details of the genetic algorithm. Finally, the implementation of the cache optimizer is detailed.

4.1 Cost Function

A cost function determines the cost of a cache configuration given its values for the cache parameters. This can be simple, such as

C = #blocks ∗ blocksize    (4.1)

with C being the cost of the configuration. The cost function can then progressively be extended. Because the optimizer is intended as a proof of concept, a simple cost function is used. This cost function does not account for write policy, assumes a shared data and instruction cache, and only includes three replacement policies (FIFO, LRU and PLRU). The implemented cost function is:

C = (#blocks ∗ blocksize ∗ X) + (#blocks ∗ ReplacementPolicy(a) ∗ Y) + (a ∗ Z)    (4.2)

where X, Y and Z are constants, a is the associativity, and ReplacementPolicy(a) refers to the cost to implement a replacement policy for a given associativity.

The costs for the replacement policies are:

ReplacementPolicy_FIFO(a) = 1
ReplacementPolicy_PLRU(a) = log2(a)
ReplacementPolicy_LRU(a) = a    (4.3)

The objective of the cache optimizer is to find a cache configuration that achieves a given performance (defined as percentage of misses) and has a minimal cost.
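A sketch of this cost function, with Equations 4.2 and 4.3 written out in C. The CacheConfig struct, the policy constants and the way the weights X, Y and Z are passed in are assumptions for illustration:

#include <math.h>

enum { RP_FIFO, RP_LRU, RP_PLRU };      /* illustrative policy constants */

typedef struct {
    int num_blocks;
    int block_size;
    int associativity;
    int policy;
} CacheConfig;

static double replacement_policy_cost(int policy, int assoc) {
    switch (policy) {
    case RP_FIFO: return 1.0;
    case RP_PLRU: return log2((double)assoc);
    case RP_LRU:  return (double)assoc;
    default:      return 0.0;
    }
}

double config_cost(const CacheConfig *c, double X, double Y, double Z) {
    return (c->num_blocks * c->block_size * X)
         + (c->num_blocks * replacement_policy_cost(c->policy, c->associativity) * Y)
         + (c->associativity * Z);
}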


4.2 Genetic Algorithm

4.2.1 Overview

The metaheuristic used to explore the cache design space is a genetic algorithm. For a brief overview of metaheuristics and genetic algorithms, see Section 2.3. As stated in [21],

”It is helpful to view the execution of the genetic algorithm as a two stage process. It starts with the current population. Selection is applied to the current population to create an intermediate population. Then recombination and mutation are applied to the intermediate population to create the next population. The process of going from the current population to the next population constitutes one generation in the execution of a genetic algorithm.”

We can define the execution of a genetic algorithm along the following stages:

Initialization → Fitness Evaluation → Selection → Crossover (Recombination) → Mutation → (repeat until a termination condition is reached) → Termination, where one pass from Fitness Evaluation through Mutation constitutes one generation of the algorithm.

Figure 4.1: The different stages of execution in a genetic algorithm. Selecting the fittest individuals of the current population creates the intermediate population. The next population is created by applying crossover and mutation.

The population in the cache optimizer is initialized by generating random cache configurations. Evaluating the fitness of a population is done using the cost function. Individuals must meet the minimum performance requirement to be selected, meaning that the miss percentage must be below a given value.

To create an intermediate population, the fittest individuals are selected using a process called tournament selection. Tournament selection seeds a tournament with n random individuals from the population. The individual with the highest fitness (lowest cost) is selected for the intermediate population.


Two individuals from the intermediate population are combined in a process called crossover to create one individual for the next population. The parameters for this ’offspring’ individual are randomly chosen from either ’parent’. Finally, the new ’offspring’ individual is mutated, where a small chance exists for each property to be randomly altered. A fraction of the new population consists of newly generated random solutions, and the best solution in the current population will always be passed on to the next population. The whole process is repeated until a certain condition is fulfilled. This can be when a configuration with a certain cost value is found, or after a number of generations have been executed. For the proof-of-concept cache optimizer, a set number of generations is executed until termination.

4.2.2 Implementation

This section will focus on the implementation of the genetic algorithm in the cache optimizer, with a detailed explanation of every stage of the genetic algorithm (see Section 4.2.1). The following arguments can be adjusted when running the cache optimizer:

• Population size.

• Maximum miss rate. This parameter defines the minimum performance that the selected configurations must conform to.

• Chance of a mutation occurring.

• Tracefile.

• Number of generations until termination occurs.

• The size of the tournament used for selection.

Initializing the Population

When initializing the population, configurations are generated with random values for all parameters until a population has been created with the correct size. The random generation takes the following into account:

• The generated associativity must be smaller than the generated number of blocks.

• The generated configuration must conform to all constraints of the simulator, meaning that the generated associativity must be smaller than the generated number of blocks and that the values for associativity, number of blocks and block size must be powers of 2.

• Only the replacement policies that are listed in Section 4.1 are generated.

• All configurations will have a shared instruction and data cache.
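A sketch of this generation step, reusing the CacheConfig type assumed in the cost-function sketch of Section 4.1; the value ranges are illustrative assumptions:

#include <stdlib.h>

CacheConfig random_config(void) {
    CacheConfig c;
    c.block_size = 1 << (2 + rand() % 6);        /* 4 ... 128 bytes, power of 2 */
    c.num_blocks = 1 << (2 + rand() % 10);       /* 4 ... 2048 blocks, power of 2 */
    do {
        c.associativity = 1 << (rand() % 8);     /* 1 ... 128, power of 2 */
    } while (c.associativity >= c.num_blocks);   /* must stay below the number of blocks */
    int policies[] = { RP_FIFO, RP_LRU, RP_PLRU };
    c.policy = policies[rand() % 3];
    return c;
}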

Fitness Evaluation

When evaluating the fitness of a cache configuration in the population, two things must be considered:

1. The cost of the configuration, which is determined by evaluating the cost function.

2. Whether the performance of the configuration meets the minimum requirements. Cache simulation using the simulator (Chapter 3) is used to determine cache configuration performance.

To minimize the time taken to simulate the entire population, the OpenMP library [26] was used to perform cache simulation in parallel.
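A sketch of the parallel evaluation loop; simulate_miss_rate() stands in for a call into the cache simulator and is an assumption for this illustration:

#include <omp.h>

/* Hypothetical wrapper around the simulator. */
double simulate_miss_rate(const CacheConfig *c, const char *tracefile);

void evaluate_population(CacheConfig *pop, double *miss_rate, int pop_size,
                         const char *tracefile) {
    #pragma omp parallel for
    for (int i = 0; i < pop_size; i++) {
        /* Each configuration is simulated independently, so iterations can
         * run on separate threads without synchronization. */
        miss_rate[i] = simulate_miss_rate(&pop[i], tracefile);
    }
}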

Selection

The fittest individuals of the population are selected through tournament selection. A brief description of tournament selection is given in Section 4.2.1. The cache optimizer seeds the ’tournament’ with a given number of configurations from the population with adequate performance, and selects the individual with the lowest cost.
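A sketch of this selection step, reusing the CacheConfig, config_cost and miss-rate array from the earlier sketches; the parameter names are assumptions:

int tournament_select(const CacheConfig *pop, const double *miss_rate,
                      int pop_size, int tournament_size,
                      double max_miss_rate, double X, double Y, double Z) {
    int best = -1;
    double best_cost = 0.0;
    for (int i = 0; i < tournament_size; i++) {
        int cand = rand() % pop_size;            /* seed the tournament randomly */
        if (miss_rate[cand] > max_miss_rate)
            continue;                            /* does not meet the performance requirement */
        double cost = config_cost(&pop[cand], X, Y, Z);
        if (best == -1 || cost < best_cost) {
            best = cand;
            best_cost = cost;
        }
    }
    return best;    /* -1 if no candidate met the performance requirement */
}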


Crossover & Mutation

To generate an individual for the next population of the genetic algorithm, two ’parent’ individuals are crossed over. The optimizer constructs an ’offspring’ configuration by randomly choosing the value of each parameter from either parent, making sure that the associativity does not exceed the number of blocks. Finally, the offspring individual is ’mutated’, adding an extra degree of randomness to the search.
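A sketch of these two steps on the assumed CacheConfig layout; random_config() is the generation sketch from earlier and mutation_chance is a probability between 0 and 1:

CacheConfig crossover(const CacheConfig *a, const CacheConfig *b) {
    CacheConfig child;
    child.block_size    = (rand() % 2) ? a->block_size    : b->block_size;
    child.num_blocks    = (rand() % 2) ? a->num_blocks    : b->num_blocks;
    child.associativity = (rand() % 2) ? a->associativity : b->associativity;
    child.policy        = (rand() % 2) ? a->policy        : b->policy;
    if (child.associativity >= child.num_blocks)     /* keep the constraint intact */
        child.associativity = child.num_blocks / 2;
    return child;
}

void mutate(CacheConfig *c, double mutation_chance) {
    CacheConfig fresh = random_config();             /* source of mutated values */
    if ((double)rand() / RAND_MAX < mutation_chance) c->block_size    = fresh.block_size;
    if ((double)rand() / RAND_MAX < mutation_chance) c->num_blocks    = fresh.num_blocks;
    if ((double)rand() / RAND_MAX < mutation_chance) c->associativity = fresh.associativity;
    if ((double)rand() / RAND_MAX < mutation_chance) c->policy        = fresh.policy;
    if (c->associativity >= c->num_blocks)
        c->associativity = c->num_blocks / 2;
}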


CHAPTER 5

Experiments

This chapter will detail the results of experiments carried out on the cache simulator and optimizer. To determine the accuracy of the simulator, we examine the impacts of different cache parameters on the miss rate. The performance of the simulator itself is assessed by looking at the impact that different parameters have on simulation time. Finally, we examine the performance of the cache optimizer in finding desirable cache configurations.

5.1 Performance Impacts of Cache Parameters

In this section, we examine the impact that different parameters have on cache performance (miss rate). These parameters are block size, number of blocks and associativity. The purpose of these experiments is not to establish the relation between these parameters and cache performance, as this is already known. Instead, the results allow us to determine whether the simulator is accurate. The results can differ when using different configurations and traces, and are only intended to support general conclusions on the effects of different cache parameters. These experiments are performed on the aifftr01 tracefile, which contains approximately 3 million memory references. This trace was generated by running the EEMBC AutoBench suite [27]. This trace was chosen because the benchmark suite includes a variety of different algorithms and workload tests, so a more general measure of performance may be obtained.


5.1.1 Block Size

Figure 5.1: Relation between block size and miss rate. The cache configuration in this experiment has 64 blocks. Results for the FIFO and LRU replacement policies are shown.
