
Thermal models for the exploration of embedded system architectures



Bachelor Informatica

Thermal models for the exploration of embedded system architectures

Matthijs Jansen

June 7, 2018

Supervisor(s): Dr. Andy D. Pimentel (UvA)

Informatica
Universiteit van Amsterdam


Abstract

The introduction of small embedded systems, together with the increase in computing power predicted by Moore's law, has led to a higher risk of embedded systems overheating. This calls for a way to monitor the temperature of embedded systems, and this thesis delivers such a monitoring system in the form of a simulator for early-stage embedded system architecture design. The new simulator combines Sesame[12], a modeling and simulation framework for embedded system architectures that provides an abstract application and architecture model, with HotSpot[25], a thermal modeling tool. A power model is built from scratch to complete the interface between Sesame and HotSpot, and it is calibrated against the lower-level Sniper multicore simulator[8], which provides very detailed application, architecture and power models and also uses HotSpot for its thermal model. The result is a simulator that correctly predicts relative temperature differences between the components of an architecture, together with a method with which the accuracy of these predictions can be improved.


Contents

1 Introduction
  1.1 Embedded system requirements
  1.2 Problem definition
  1.3 Setting
  1.4 Prior work
2 Sesame
  2.1 Application model
  2.2 Mapping layer
  2.3 Architecture model
  2.4 Power model
3 HotSpot
  3.1 Floor plan
  3.2 Power consumption
  3.3 Thermal model
4 Sniper
  4.1 Application model
  4.2 Architecture model
  4.3 Power and temperature
5 Power model calibration
  5.1 Application calibration
  5.2 Architecture configuration
  5.3 Calibration
6 Results
  6.1 Used architectures
  6.2 Instruction cost
  6.3 Workload balancing
  6.4 Power model
  6.5 Thermal model
  6.6 Simulation time
7 Conclusion
A Used applications
  A.1 Sniper's applications
  A.2 Sesame's applications
B HotSpot configurations


List of figures

1.1 Active cooling system
1.2 Processing trends
2.1 Sesame's Y-chart design
2.2 Simple application
2.3 Simple architecture
2.4 Clock cycle schedule
3.1 Floor plan of a simple 4 core architecture with a shared memory
3.2 Visualization of a thermal model from HotSpot
4.1 Different versions of the application model
4.2 Detailed floor plan of a simple 4 core architecture with a shared memory
4.3 Interaction between Sniper, McPAT and HotSpot
6.1 Two core floor plan for Sesame
6.2 Four core floor plan for Sniper
6.3 Instruction cost for Sniper
6.4 Instruction cost for Sesame
6.5 Power usage per component per time interval of 10000 nanoseconds for Sniper
6.6 Power usage per component per state
6.7 Power usage per component per time interval of 5000 nanoseconds
6.8 Thermal model of the 2-core architecture with 1 core stressed
6.9 Thermal model of the 2-core architecture with all cores stressed
6.10 Thermal model of the 4-core architecture with 1 core stressed
6.11 Thermal model of the 4-core architecture with all cores stressed
6.12 Temperature ranges in Kelvin for the thermal models of HotSpot
6.13 Simulation time for the Sesame and Sniper simulator in seconds


CHAPTER 1

Introduction

At the beginning of the information era, computers were so massive that they could easily span a whole room, as was the case with the first computer in the Netherlands, the ARRA 1[1]. Over the years computers became much smaller, which led to the introduction of the personal computer (PC), and eventually computers became portable enough to carry around. These were enormous advances in technology that offer many opportunities for the future, but many challenges remain. This thesis focuses on a class of computers called embedded systems. Where traditional, general-purpose computers such as PCs and laptops come in just a few varieties, embedded systems exist in many different sorts and sizes. Some examples are mobile phones, remote controls, televisions and smart refrigerators.

1.1 Embedded system requirements

General-purpose computers do not have many requirements: if the computer is fast enough and produces little noise, most consumers are satisfied and will buy it. Embedded systems, however, exist for many different purposes, and each purpose can impose very specific and conflicting requirements[23]. A mobile phone, for example, should be energy efficient so the battery lasts as long as possible, but it should also be cheap and small, work in real time (a consumer wants to see a result immediately when using the touchscreen) and must not become too hot. There is also a category of embedded systems with much stricter requirements: if a sensor in a car or factory does not work properly, it can lead to large financial losses or even death[20]. This shows how important it is that every aspect of a system is checked thoroughly to ensure that it functions properly[14].

1.2 Problem definition

This thesis focuses on one specific aspect of embedded systems: temperature. Systems like personal computers often have a lot of free space internally through which the heat generated by the system can escape. Some of these systems also have active cooling: a metal construction with a large contact surface with the air, called the heat sink, absorbs the heat from the part to which it is connected, such as the central processing unit (CPU) or the graphics processing unit (GPU). A fan is then attached to the heat sink to cool the metal, see figure 1.1 below. A less effective variant is the passive cooling system, which has only the heat sink; its advantage is that it makes no noise, because there is no fan.


Figure 1.1: Active cooling system

As a result, most general-purpose computers do not have any serious heat problems. For most embedded systems, however, this is a completely different story. These systems are very small, so the heat sources are close to each other, and active or passive cooling is often not possible because the system is too small to fit a large heat sink or a fan. A requirement for an embedded system can also be that it must not make too much noise, which likewise rules out the use of a fan. This creates the need to monitor and account for the heat such a system produces, since overheating can affect the performance and lifetime of a system or even result in failure[4][2].

1.3 Setting

The main problem addressed by this thesis is described in the previous section: most embedded systems cannot effectively remove the heat they produce, so the thermal behaviour of such systems should be analyzed when designing a new system. It is, however, not feasible to physically build and rebuild a system in search of the best configuration, because that would be very costly and take too much time. This kind of problem is well known in computer science, so simulators are used to simulate and model an embedded system architecture. The advantage of a simulator is that a system can be tested and adapted very quickly, since changing software is much faster than changing hardware. The only disadvantage is that a simulator can never reach the same accuracy as a test with real hardware[11]. For this reason, tests with real hardware are always done after the best configurations have been explored using simulators[23].

This thesis provides a solution to this problem using two different software frameworks, Sesame[12] and HotSpot[25]. Sesame is a modeling and simulation framework for embedded system architectures and will be explained in chapter 2. Sesame can simulate an application on an architecture using a performance model and a very basic power model, but a thermal model is missing. Once Sesame's power model is extended, it can serve as input for HotSpot: a thermal modeling tool which can predict the temperature of a system from Sesame's power model. The workings of HotSpot will be explained in chapter 3. Since Sesame is a very abstract simulator, the temperature predictions of HotSpot should only be used as the first step in the design pipeline of a new architecture, as the results will not be very accurate.

This power model is part of a new interface which connects Sesame and HotSpot; it will be discussed in chapters 2, 3 and 5. The interface handles all missing connections between Sesame and HotSpot and has to be calibrated, since the end result, HotSpot's temperature predictions, must approximate the temperature of real hardware as accurately as possible. For this purpose the Sniper multicore simulator is used. This simulator has very detailed architecture, application and power models and also uses HotSpot as its thermal model. In reality the HotSniper framework[22] is used, because it already has HotSpot built in, but since it does not provide anything beyond that, we will use the name Sniper for this framework. It is intended as a second step in the design pipeline, which makes it a perfect candidate to calibrate and validate the Sesame-HotSpot interface against. The main research question of this thesis is therefore whether the temperature predictions of the Sesame-HotSpot interface can approximate the temperature predictions of Sniper in combination with HotSpot.

The workings of the Sniper multicore simulator are explained in chapter 4, after which the calibration methods are described in chapter 5. The results of the calibration, and finally the validation, follow in chapter 6. The thesis ends with a conclusion in which the usability of the new interface is discussed and follow-up questions are presented.

1.4 Prior work

The most famous law in computer science is Moore's law: in 1965 Gordon Moore predicted that the number of transistors on a chip would double every 12 months, a period later revised to 18 months[21]. The prediction came true, and CPUs became faster as a result. This eventually led to small single-core CPUs that were densely packed with transistors and ran at very high clock frequencies. In the search for better performance, computer architects ran into a problem between 2005 and 2010: the CPU needed so much power to run all its components at a high frequency that the temperature of the processor became dangerously high. Clock speeds had to be reduced to lower the power consumption. This was the end of the single-core era, as can be seen in figure 1.2 below[5].

Figure 1.2: Processing trends

This example shows that temperature control became an important part of system design out of necessity. Since this example is more than 10 years old, the interface between Sesame and HotSpot is not the first simulator to take temperature into account. There are simulators which use HotSpot to create a thermal model, like the Sniper multicore simulator[8], and many simulators that use their own thermal model[19][24]. What sets the combination of Sesame and HotSpot apart is the abstraction level and simplicity of Sesame: it was built to be the first step in the design pipeline of a new system architecture, while the other simulators mentioned above are much more detailed at the architecture and application level. This will be explained further in chapter 2, which discusses Sesame, and chapter 4, which discusses Sniper.


CHAPTER 2

Sesame

Sesame is a very abstract modeling and simulation framework for embedded system architectures, developed by the System and Network Engineering Lab at the University of Amsterdam[28]. The purpose of Sesame is to explore different system architecture designs in a very abstract manner. One advantage of this abstraction is that simulations execute very fast, so many different architecture designs and application-to-resource mappings can be explored. The obvious disadvantage is that the results from Sesame are not as accurate as those of a simulator with an instruction-level application model and a more detailed architecture model; this is also why Sesame is meant to be used at the beginning of a design pipeline for a new embedded system architecture.

Figure 2.1: Sesame’s Y-chart design

The power of Sesame is that it separates the architecture and the application into different modules, visualized in figure 2.1[27]. The Y-chart symbolizes this separation between architecture and application. This thesis focuses on three modules: the application model, in which a program can be specified that generates workload; the architecture model, in which all the parts of an architecture and their connections are described; and the mapping layer, in which the connections between the generated workload and the parts of the architecture are described[12]. The separation of the application and architecture models speeds up the prototyping of a new architecture: once made, applications and architectures can be reused in different combinations, since only the mapping layer has to be changed.


2.1 Application model

Creating an application model for a less abstract simulator than Sesame is a very difficult task. System architectures are in general very complex, and multiple levels of communication exist between the different parts of a system. This has resulted in many different instruction sets for the different architectures, such as ARM and x86. Since the aim of Sesame is to make designing and prototyping applications on architectures easy and fast, abstractions have been made so that knowledge about the exact instruction set of the architecture is not needed. For this reason Sesame uses a Kahn Process Network (KPN)[13] as its application model.

A KPN is a network of concurrently running processes[27]. An application is therefore divided into the smallest possible units of work, called processes. A process can be viewed as a single algorithm used within the whole application. This leads to a dependency graph of processes, where processes need to read from and write to each other to make the whole application work as intended, see figure 2.2 below. The individual processes and their dependencies must be defined when creating a new setup. Since a KPN is an event-driven network, a process is nothing more than some code which generates workload in the form of events. The following events are currently implemented in Sesame: read, write and execute. Since an event simulates one or more instructions on a real system, it is important to specify the workload of each event in terms of performance (for instance, processor cycles needed or power consumption). Components of a computer system (like a processor or memory) can, however, differ in performance; in a heterogeneous system, for example, one processor is fast while another is slower. This can be expressed in Sesame because the workload for each event can be set per component, and in the same way a memory hierarchy can be implemented. The read and write events are generated when two processes communicate with each other, just as they normally would.

Figure 2.2: Simple application

A Kahn Process Network uses Kahn channels for communication between processes. A Kahn channel is a unidirectional first-in, first-out (FIFO) buffer with unlimited capacity[27]. These channels support the read and write events of Sesame. Since the channels are the only way of communicating, global or shared memory must be recreated in Sesame by abiding by the rules of a shared memory architecture while using Kahn channels[26]. This is important to keep in mind, since almost all embedded systems use some form of shared memory. A consequence of the unlimited size of a Kahn channel is that writes are non-blocking: there is no limit to the number of writes that can be queued. An important side note is that while Kahn channels are of unlimited size in the application model, this is no longer the case once the application model is mapped onto the architecture model: the conceptual buffer of the Kahn channel is then mapped to a real buffer with a real maximum size. This means the buffer can become full, in which case a write does block until it can successfully write its data into the buffer, but this is a rare situation. A read event is always blocking and is therefore the only (normal) way of synchronizing between parts of an application, and thus between parts of the architecture. A result of the blocking read events is that a Kahn Process Network is deterministic: the result of an application, executed on an architecture, is always the same regardless of the scheduling.
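
The channel semantics described above, unbounded FIFO buffers with non-blocking writes and blocking reads as the only synchronization, can be sketched in a few lines of Python. The class and function names are illustrative and are not Sesame's API:

```python
import queue
import threading

class KahnChannel:
    """Unidirectional FIFO channel: writes never block (conceptually an
    unbounded buffer), reads block until data is available."""
    def __init__(self):
        self._q = queue.Queue()       # unbounded by default

    def write(self, value):
        self._q.put(value)            # non-blocking: no capacity limit

    def read(self):
        return self._q.get()          # blocking: the only synchronization point

def producer(out_ch):
    for i in range(3):
        out_ch.write(i * i)           # "write" events

def consumer(in_ch, results):
    for _ in range(3):
        results.append(in_ch.read())  # "read" events block until data arrives

ch = KahnChannel()
results = []
t1 = threading.Thread(target=producer, args=(ch,))
t2 = threading.Thread(target=consumer, args=(ch, results))
t2.start(); t1.start()
t1.join(); t2.join()
print(results)  # deterministic regardless of scheduling: [0, 1, 4]
```

Because reads block until data is available, the consumer always observes the values in the order the producer wrote them, which is exactly why a KPN is deterministic.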

Execute events are handled on the processor on which the process runs, without any communication. As mentioned earlier, it is possible to specify the workload of each event on every processor. In this way, normal computer instructions can be simulated with the application model without needing to check which instruction set the architecture supports.


2.2 Mapping layer

Since the Y-chart design of Sesame separates the application and the architecture, a mapping has to be made between these layers for Sesame to function. At first glance it seems that processes from the application model are mapped directly onto the parts of the architecture model. In reality, a more complex mapping layer, also called the virtual or synchronization layer, resides between the two models, because some complexities make a direct mapping too difficult.

One of the problems the mapping layer must deal with is that the result of an application is a collection of events. These events have no notion of time, while time is important for the scheduling in the architecture model (this will be explained in detail in section 2.3). Another problem is that two processes which communicate with each other in the application can be mapped to the same processor; in that case the Kahn channel between the processes must not be mapped to a communication pipe between two different processors but from a processor to itself. A more detailed explanation of the mapping layer is not required to understand this thesis, and the layer is mostly generated automatically: only a mapping from each process to an architecture part has to be defined. More information about the mapping layer can be found in other papers[12].

2.3 Architecture model

The last core part of the Sesame Y-chart design, and the most important one for this thesis, is the architecture model. In this model a new architecture can be built from components included in a Sesame library, such as processors, buses, memories and FIFO channels. All these components can be fully configured, for instance what the workload of a read, write or a certain execute event is when run on that component (see section 2.1). In this way a heterogeneous system or a memory hierarchy can be simulated. A visualization of a simple architecture in Sesame is shown in figure 2.3.

Figure 2.3: Simple architecture

A consequence of the Kahn Process Network in the application model is that the events from the generated event traces must run concurrently on the architecture. The events also have to be translated into discrete-event time, since the architecture model is implemented in the Pearl discrete-event simulation language[12][27]. First the events are divided over the components: each process (and thus all the events it generates) is mapped onto a previously defined component (see section 2.2). Each component then has a list of events to execute, so the second step is to schedule these events per component. This leads to a discrete-event time schedule per component, with events that can be executed concurrently.

Each component has an architecture model component (environment) of its own, which makes the components highly configurable. Since the components are separated, communication between them is handled by Pearl behind the scenes: read and write events are translated into simple calls to the correct function of the correct architecture part. In this way blocking reads can be simulated by the architecture and a shared memory can be mimicked. Non-functional aspects like power consumption can also be taken into account in these architecture model components. This is the key focus of the interface between Sesame and HotSpot, because a reasonably realistic (but still abstract) power model in Sesame is needed to create an accurate thermal model with HotSpot. This is the topic of the next section.
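
As a sketch of this configurability, the per-component event workloads described in section 2.1 can be thought of as a table of event costs in clock cycles. The dictionary below is purely illustrative and does not reflect Sesame's actual configuration format:

```python
# Illustrative per-component event costs in clock cycles (not Sesame's
# actual configuration syntax): each component defines its own workload
# for read, write and execute events, which is how heterogeneous
# systems and memory hierarchies are expressed.
event_cost = {
    "procP0": {"read": 10, "write": 10, "execute": 5},   # fast core
    "procP1": {"read": 10, "write": 10, "execute": 20},  # slow core
    "memory": {"read": 30, "write": 40, "execute": 0},   # shared memory
}

def cycles_for(component, trace):
    """Total busy cycles a component needs for a list of events."""
    return sum(event_cost[component][ev] for ev in trace)

print(cycles_for("procP0", ["read", "execute", "write"]))  # 25
print(cycles_for("procP1", ["read", "execute", "write"]))  # 40
```

The same event trace costs a different number of cycles on each component, which is all that is needed to model a heterogeneous system at this abstraction level.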

2.4 Power model

A suitable power model did not yet exist in Sesame at the beginning of this thesis, so a new model had to be built from scratch. A basic power model for a simulation of an application on an architecture should keep track of the number of busy and idle clock cycles of each component, and it should know the power consumption of each component in a given state (busy or idle). When these two parts are linked, the power consumption of each component as a result of the executed application is known. The clock cycle behaviour is explained in this section, since it can be determined in Sesame without any specific knowledge of the architecture, while the power consumption per component is covered in chapter 5. This is because the power consumption of each component of the architecture used for HotSpot is not known, so the Sniper multicore simulator is used to calibrate Sesame's power model: Sniper uses a very detailed tool to compute power consumption. With this information a validated and functional power model for Sesame can be built.


In Sesame, each architecture model component used (see the previous section) is initialized at the start of the application. The application then starts to run, and components will sometimes execute an event (busy cycles) and sometimes do nothing (idle cycles), either because they have to wait on a blocking read or because they have no more events to execute. An execution of an application can thus be seen as a time schedule of busy and idle cycles per component, generated from events, since the architecture model runs in discrete-event time. A visualization of this concept is given in figure 2.4: three components are listed at the top, each with a different clock cycle schedule. The first step in Sesame's power model is to create a file, called a trace file, which contains such a schedule, derived from the different architecture components that are used.

For this, each architecture model component uses timers which all start at the beginning of the application. When an event is to be executed on a component, a function in its architecture model component is called which writes data to the trace file. The start time of the event can be derived from the timers in that specific architecture model component, while the length of the event (the number of busy cycles) is always known, since this information comes from the application model. The only information left is when the component is idling, i.e. not executing an event. If an architecture model component knows when its last event ended and when the current event starts, it can check whether there is a time gap in between. For example, in figure 2.4, the first event for the memory component ended after 30 clock cycles while the second event starts after 40 clock cycles; this gap of 10 cycles means the component was idling in between.
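
A minimal sketch of this bookkeeping, using the memory example from figure 2.4 (the data layout is hypothetical, not the actual trace-file format):

```python
def cycle_schedule(events):
    """Given (start_cycle, length) pairs of executed events for one
    component, return a list of (state, start, end) entries where the
    gaps between events are marked as idle time."""
    schedule, cursor = [], 0
    for start, length in sorted(events):
        if start > cursor:                      # gap: component was idle
            schedule.append(("idle", cursor, start))
        schedule.append(("busy", start, start + length))
        cursor = start + length
    return schedule

# The memory example from figure 2.4: the first event ends at cycle 30,
# the next starts at cycle 40, so cycles 30-40 are idle.
print(cycle_schedule([(0, 30), (40, 20)]))
# [('busy', 0, 30), ('idle', 30, 40), ('busy', 40, 60)]
```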

This schedule can be used by the interface between Sesame and HotSpot to build a power model for the simulation. If the interface knows the power consumption of each component in a certain state (busy or idle) and combines this with the busy and idle cycle information from the schedule, a power schedule can be made which contains the power consumption per time interval. This is one of the inputs HotSpot needs to create a thermal model; its usage is explained in the next chapter.


CHAPTER 3

HotSpot

HotSpot is a tool for creating thermal models of a system architecture, developed by the departments of Electrical and Computer Engineering and of Computer Science at the University of Virginia[25]. In this thesis, HotSpot is the second part of the tool chain: Sesame is the starting point and simulates an application on an architecture using a simple power model. To create a thermal model, HotSpot needs two forms of input data from Sesame. The first is a floor plan of the architecture, with accurate information about the dimensions of each component and their placement on the chip; this is the topic of section 3.1. The second is the power usage of all the components of the architecture; this is the topic of section 3.2 and uses the information given in section 2.4. HotSpot furthermore offers a few different ways to present the resulting thermal model, where the major trade-off is time versus accuracy; this is the final topic of this chapter, in section 3.3.

3.1 Floor plan

The architecture modeled in Sesame is very simple, as is visible in the example in figure 2.3. A lot of attention goes to the workings of a component, but not to exterior characteristics such as size and exact position on the chip, because under Sesame's abstractions those characteristics have no effect on the simulation. This is a problem, since HotSpot does need a very detailed floor plan with the exact size and position of each component on the chip. The interface between Sesame and HotSpot therefore has to create a floor plan based on the components that Sesame uses, and the physical characteristics of those components have to come from another source, since Sesame's data is not suited for this. An example floor plan is shown in figure 3.1 below, which shows the level of detail needed for HotSpot when combining it with Sesame. The image does not show the full level of detail of HotSpot's thermal model, however: HotSpot needs much more information about the architecture, which is discussed in appendix B.

Note that there is no place on the floor plan for a bus or any other component used for communication between components, since those components are so small that they cannot be placed on a floor plan. The bus is, however, used in every Sesame architecture in this thesis, since it is a very easy way to model a shared memory architecture. This results in a workaround where the bus is left out of the floor plan and power model for HotSpot, and we pretend, for the sake of simplicity and abstraction, that the processors communicate directly with the shared memory instead of via the bus.

A floor plan has no limit on the number of components it can contain. For this reason a more detailed floor plan will be used with Sniper, because Sniper uses a far more detailed architecture model; this will be shown in the next chapter. The advantage of this floor plan system is that it is easy to compare a multicore architecture whose floor plan represents each core as a single component with the same architecture using a floor plan where a core consists of several components, such as an execution unit and a level 2 cache. In this way the influence of this difference in detail can be examined, to answer the research question of this thesis: can the abstract simulator Sesame create a thermal model with HotSpot that approximates the thermal model of the more detailed Sniper multicore simulator, which also uses HotSpot?

Figure 3.1: Floor plan of a simple 4 core architecture with a shared memory
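
For reference, HotSpot reads floor plans as plain-text .flp files in which each line gives a unit's width, height and the x/y coordinates of its lower-left corner, in meters (see the HotSpot documentation for the full format). A small generator for a plan in the spirit of figure 3.1 might look as follows; the names and dimensions are invented for illustration and do not correspond to the floor plans used in this thesis:

```python
def write_flp(path, units):
    """Write a HotSpot-style .flp floor plan: one line per unit with its
    width, height and lower-left corner position, all in meters."""
    with open(path, "w") as f:
        for name, w, h, x, y in units:
            f.write(f"{name}\t{w:.6f}\t{h:.6f}\t{x:.6f}\t{y:.6f}\n")

# Hypothetical 4-core chip with a shared memory strip on top,
# loosely modeled on figure 3.1 (sizes are invented for the example).
units = [
    ("procP0", 0.004, 0.004, 0.000, 0.000),
    ("procP1", 0.004, 0.004, 0.004, 0.000),
    ("procP2", 0.004, 0.004, 0.008, 0.000),
    ("procP3", 0.004, 0.004, 0.012, 0.000),
    ("memory", 0.016, 0.004, 0.000, 0.004),
]
write_flp("example.flp", units)
```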

3.2 Power consumption

Any formula related to thermodynamics needs information about power consumption, since it is directly related to temperature. For HotSpot this information takes the form of a power trace: a table in which every column indicates the power usage of a single component and every row the power usage at a certain time. A constant time interval must be chosen at which entries occur; this is important to keep in mind, since a change in interval can change the result and accuracy of the formulas used[16]. For Sniper this format is no problem at all, since it uses McPAT[18] to create a power trace; McPAT already produces a power trace that is usable for HotSpot, so no changes are needed. Since McPAT needs a time interval value to create the power trace, the interval for HotSpot is already known. More information about McPAT and Sniper's power model is given in section 4.3.


The constant interval is, however, a problem for Sesame, since the application is scheduled and executed in discrete-event time on the architecture. This eventually results in the cycle schedule explained in section 2.4, which already required some major changes to obtain. The only thing left is to convert this schedule of busy and idle cycles per component into power, which can be done once the power usage per state is known. If the clock speed of the processor is also known, a conversion from cycles to time can be made to create a power trace that accurately follows a certain time interval. Since this is a key process in making the Sesame-HotSpot interface work, a separate chapter, "Power model calibration" (chapter 5), discusses the exact methods to get the right configuration for the power model.
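
A sketch of this conversion, with illustrative names: given a busy/idle schedule in cycles, a per-state power value and the clock frequency, the average power per fixed time interval can be computed by time-weighting each schedule entry over the intervals it overlaps.

```python
def power_trace(schedule, p_busy, p_idle, clock_hz, interval_s, total_s):
    """Convert a (state, start_cycle, end_cycle) schedule of one
    component into average power (watts) per fixed time interval."""
    n = int(total_s / interval_s)
    trace = [0.0] * n
    for state, start, end in schedule:
        t0, t1 = start / clock_hz, end / clock_hz    # cycles -> seconds
        power = p_busy if state == "busy" else p_idle
        for i in range(n):                           # overlap with each interval
            lo, hi = i * interval_s, (i + 1) * interval_s
            overlap = max(0.0, min(t1, hi) - max(t0, lo))
            trace[i] += power * overlap / interval_s  # time-weighted average
    return trace

# 1 GHz clock, 10 us intervals: busy for the first 15000 cycles (15 us),
# idle afterwards, with invented power values of 2 W busy and 0.5 W idle.
sched = [("busy", 0, 15000), ("idle", 15000, 20000)]
print([round(p, 6) for p in power_trace(sched, p_busy=2.0, p_idle=0.5,
                                        clock_hz=1e9, interval_s=1e-5,
                                        total_s=2e-5)])  # [2.0, 1.25]
```

Each row of such a trace (one value per component) is exactly what HotSpot consumes; the choice of `interval_s` is the constant interval discussed above.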

3.3 Thermal model

When the correct input data is given, HotSpot applies various thermal formulas to create a thermal model. This thermal model can come in different forms to suit the needs of the user[15]. The most important choice is whether the block model or the grid model is used. The block model is the fastest but least accurate option. It works intuitively when considering the input: it models the temperature of each component in the system while taking the influence of temperature differences between those components into account. The result is a trace file with temperatures instead of power values: for each time interval, a temperature is given for each component, from which conclusions can be drawn.

The grid model is far more detailed but not always usable, since it can become computationally infeasible when the floor plan is too detailed and too many power values are given[16]. Instead of treating the system as a handful of components, the grid model divides it into a grid of many smaller subcomponents. The resolution of this grid can be set manually but defaults to 64 by 64[15]. The result is a far more detailed temperature trace file with many more temperature values per time interval, one for each grid cell.

(Figure 3.2 residue: thermal visualization with a colour scale ranging from 322.39 K to 331.45 K, over the components memory, procP0, procP1, procP2 and procP3.)


This thesis uses the block model by default, since the purpose of Sesame is to be abstract and fast, which fits this model best[16]. The results of the block model are also accurate enough to check whether a part of the system heats up too quickly. The grid model is used when a visualization of the results is required, since only the grid model supports this. For this purpose it does not matter that the grid model is slow, because a visualization is not always required by the user and has a weaker time requirement than producing a normal block model with HotSpot. An example visualization can be seen in figure 3.2: it looks like a floor plan with different colours indicating the temperature differences. The floor plan is the same as in figure 3.1, and the application was a simple broadcast in which core B (the red core in the image) generates some random data and then sends it to the three other cores. Those three cores all have the same workload and only read the data. It can clearly be seen that core B is the most heavily used component; the shared memory is also used intensively but does not reach as high a temperature because its surface is larger. The other three cores have a lower temperature the further away they are from core B, as is expected from the rules of thermodynamics.

The last important aspect of the thermal model is the initial temperature: any thermal model that predicts temperature from power usage needs a starting temperature from which changes can be computed. HotSpot has no direct option to solve this, so the best alternative is to run each thermal model more than once. In the simplest form, HotSpot is run once with the power trace from a Sesame or Sniper simulation, and then run again using the temperature output of the first run as the initial temperature for the second run. The second run then produces more accurate results, while the extra run costs little time since the block model is very fast. This method is used for both the Sesame and the Sniper simulator, for the block as well as the grid model.
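The two-pass scheme can be summarized as follows. The runner is passed in as a callable, so the actual HotSpot command line (whose flag names may differ per version) stays outside this sketch; all parameter names here are illustrative assumptions.

```python
# Sketch of the two-pass scheme described above: HotSpot is run once to
# obtain steady-state temperatures, which are then fed back as the
# initial temperatures of a second run.

def two_pass(run_hotspot, ptrace, steady_file, ttrace):
    # First pass: default initial temperatures, write the steady state.
    run_hotspot(ptrace=ptrace, init_file=None,
                steady_file=steady_file, ttrace=ttrace)
    # Second pass: start from the steady state of the first pass.
    run_hotspot(ptrace=ptrace, init_file=steady_file,
                steady_file=None, ttrace=ttrace)
```

In practice `run_hotspot` would shell out to the HotSpot binary with the matching flags; keeping it as a parameter makes the two-pass logic itself independent of any particular HotSpot version.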


CHAPTER 4

Sniper

Sniper is a simulator which, like Sesame, can simulate the execution of an application on an architecture, but with much higher detail and with a focus on multi- and many-core systems. While the application model of Sesame consists of abstract processes inside the main application, which themselves consist of user-defined instructions with custom workload, Sniper is able to execute any binary and derives the workload from its machine code. This is a much more realistic and detailed approach, since a real architecture also executes machine code and not some generic workload as is the case with Sesame. This difference in detail also exists in the architecture model, where Sniper divides each core into multiple smaller parts, such as the execution unit and the load-store unit, instead of viewing each core as a single component.

While Sniper also uses HotSpot as its thermal model, it uses McPAT as its power model, which is the main point of interest when comparing Sniper to Sesame. Sesame's power model has to be built from scratch (see section 2.4) and has to be abstract, since that is its purpose, while Sniper uses a very detailed power model. If this detailed power model is used with the same architectures and applications as Sesame, conclusions can hopefully be drawn from the comparison and Sesame's power model can be calibrated correctly. The workings of Sniper and all the models it uses are explained in the upcoming sections, while the use of Sniper in calibrating the power model of Sesame is the topic of chapter 5.

4.1 Application model

The application model of Sniper is built around the execution of binaries, as was mentioned earlier, so at machine code level. Sniper uses Graphite and Pin as its simulation infrastructure, which enables it to simulate the behaviour of binaries[9]. As a result, Sniper is able to run multi-threaded workloads instead of only single-threaded ones, which is a big advantage of this simulator. So besides running a single single-threaded application, Sniper can run a multi-threaded application and even multiple single- or multi-threaded applications at once. This leaves the user with a lot of freedom when choosing the application(s) to run. All these options are visualized in figure 4.1 below. These are only the standard options and they can be mixed, so a workload containing both a multi-threaded and a single-threaded application is also possible, for instance.

For this thesis only single multi-threaded applications are used, since this workload is closest to what Sesame can handle: a Sesame application consists of multiple processes, each mapped to a separate core, which is essentially a multi-threaded application. The applications for Sniper use pthreads (POSIX Threads[3]) to obtain a multi-threaded program. Specifically, thread pinning is used to guarantee that each user-defined function that generates workload is executed on a different thread. To further change the Sniper applications so that they behave more like Sesame's application model, a so-called region of interest (ROI) is used[7]. For this, the different simulation modes of Sniper have to be discussed first.


Figure 4.1: Different versions of the application model

The Pin software tool supports three different simulation modes in which an application can be run[6]. These modes affect what is simulated, in how much detail, and how fast the simulation runs. The slowest mode is called detailed: the whole architecture is simulated along with the application, including all the different parts of the processors and the cache behaviour. The second mode, which is ten times faster than the detailed one, is called cache-only. In this mode only the cache hits and misses are simulated, not the actual execution of the application. This mode can be used to warm up the caches and branch predictors, since information is gathered both on cache hit rates and for the branch predictor. The final and fastest mode, a thousand times faster than the detailed one, is called fast-forward. In this mode nothing is simulated and all the workload is skipped.

The region of interest mentioned earlier is a tool to switch between these modes within an application. Sniper offers a library for the C language, which is the language used to write the Sniper applications in this thesis. The library offers markers that can be placed in a program: the SimRoiBegin() and SimRoiEnd() markers indicate the region of code that is most important. When the right flags are used when running Sniper, the simulation starts in cache-only mode to warm up the branch predictor, switches to detailed mode when the begin marker occurs, runs in detailed mode until the end marker is seen, and then switches to fast-forward mode until the end of the simulation. Using this method, the code at the beginning and end of an application that performs initialization and clean-up can be ignored. This is a useful feature, because Sesame's application model is too abstract to take those parts of the simulation of a real binary into account. This way only the interesting part of the code is simulated with Sniper, so a workload can be created that matches Sesame's version of the binary as closely as possible.


4.2 Architecture model

The architecture model of Sniper is capable of handling very detailed system architectures. For the Intel Gainestown architecture used in this thesis, which is discussed in section 6.1, this means that the cores are split up into smaller components compared to Sesame's architecture model shown in figure 2.3. The figure below shows the floor plan used for HotSpot with the same architecture as in figure 3.1, but now with only two cores and using Sniper. Each core is divided into an execution unit (EU), which also includes the level 1 data cache, an instruction fetch unit (IFU), a renaming unit (RU), a memory managing unit (MMU), a load-store unit (LSU) and a level 2 cache (l2). These are just the components chosen for this architecture; Sniper supports many more[22].

Figure 4.2: Detailed floor plan of a simple 2 core architecture with a shared memory

The floor plan is only used for HotSpot; Sniper itself uses configuration files to specify the architecture and simulation details. A few examples illustrate what the configuration files are capable of. One part describes the memory hierarchy and the cache prediction behaviour, since wrong predictions can lead to a penalty of many clock cycles depending on the memory level in which the miss occurred. It is also possible to indicate the cost of different instructions such as integer addition or jump; this topic returns in section 6.2. For convenience the configuration can be divided over more than one file: a basic configuration file with all the default values that Sniper uses, and optionally a user-made configuration file in which values from the default file can be overwritten with values specific to an architecture. These files can be found in appendix C. A third option is to pass configuration values on the command line when starting Sniper. This option is also used in this thesis, since it allows easy last-minute configuration changes, and it is explained further in chapter 5. It is even possible to switch between different thread scheduling methods, but since the applications in this thesis use exactly as many threads as there are cores available, and Sniper simulates only one thread per core, every thread runs on a separate core from the beginning and never needs to switch. This means that any scheduling option suffices, since only the start core for each thread has to be determined.


4.3 Power and temperature

During the simulation of the application on the architecture, McPAT is run to generate estimations of power consumption. McPAT is a power, area and timing modeling framework[10]. As the name suggests, it can model power consumption while taking into account the architecture and the way the application is simulated on it, which ensures a high level of detail when combined with the Sniper simulator. The exact workings of McPAT[18] are not discussed in this thesis, since Sesame needs a very simple and abstract power model and McPAT is too detailed to help build Sesame's power model directly.

Figure 4.3 shows the interaction between Sniper, McPAT and HotSpot. The most important thing to note is that McPAT is not called once at the end of the whole Sniper simulation but at a constant time interval, which must be given as input when running Sniper. This means that Sniper stops simulating after a certain amount of time has passed and calls McPAT with all the information about the simulated application and architecture from that one time interval. In this way McPAT appends new entries to the power trace for HotSpot every interval, instead of building the whole power trace at the end of the simulation.

Figure 4.3: Interaction between Sniper, McPAT and HotSpot

The same could have been done for HotSpot (calling it once every time interval with the power consumption information McPAT just created), but since it was decided to run HotSpot twice for better accuracy (see chapter 3), the default behaviour of Sniper has been changed: per-interval calls would generate too much overhead and slow the whole simulation down. The applications used in this thesis do not generate a large workload, so there is no real time constraint and the whole simulation is always fast enough. HotSpot is therefore only called after the whole power trace has been generated, just as with Sesame. The same change could have been made for McPAT, but that would be too much work for a change that would not affect Sniper's results at all and would only make the simulation slightly faster at best.

Since McPAT needs a time interval to function, it can directly transform the power consumption output into a power trace for HotSpot, as was explained in section 3.2. When the user also provides a floor plan of the architecture used, HotSpot can be run just as in the Sesame case. For HotSpot it does not matter how detailed the provided floor plan and power trace are, since this does not change the model; the only disadvantage is that HotSpot takes more time to complete the thermal predictions. Since these thermal results have exactly the same format as with Sesame's usage of HotSpot, the thermal predictions can be compared without any problem. This helps the configuration of the power model, which is the topic of the next chapter.


CHAPTER 5

Power model calibration

The main purpose of this thesis is to build an interface between the modeling and simulation framework Sesame and the thermal model HotSpot. In section 2.4 a clock cycle schedule was made with busy and idle cycles per component, as can be seen in figure 2.4. Now a power model is needed to translate this clock cycle schedule into a power trace for HotSpot. Since Sesame does not yet have a useful power model, one has to be made from scratch. This power model should include the power usage of every component of the used architectures in both the idle and the busy state. These values have to be calibrated so that the resulting power consumption, and thus the temperature prediction from HotSpot, reflects real-world power behaviour as closely as possible. The real-world behaviour could be obtained by running an application on physical hardware, but this thesis does not do so, since that would be rather difficult and would require physical access to the real system under study. Instead, the Sniper multicore simulator is used to calibrate the power model of Sesame. The workings of Sniper were explained in the previous chapter; this chapter explains the details of the calibration methods.

5.1 Application calibration

The power model has to be calibrated using an architecture and an application, and it must be guaranteed that these are almost the same for Sesame and Sniper: calibrating Sesame's power model against Sniper's only works if both workloads are roughly the same. If an application that generates a lot of workload is compared with one that generates almost none, the power consumption will obviously differ, and calibrating one power model to fit the other makes no sense. To create applications with roughly the same workload per component of the system, the workload of the binary application for Sniper has to be measured first, after which the workload of the Sesame application is adjusted to match. It has to be done in this order because a binary application for Sniper is much more detailed than a Sesame application: it is much harder to control the workload of a binary, which generates clock cycles for components of the system at instruction level, while a Sesame application generates workload in a much simpler way, as was explained in section 2.1.

The first step in creating applications with the same workload is to make both simulators use roughly the same instructions with the same costs; these are the building blocks of an application and generate its workload. For Sesame the names of the usable instructions have to be defined by the user, while for Sniper they are predefined by the simulation framework. What the two application models have in common is that the workload of each instruction can be set by the user, as was explained in the Sesame (2) and Sniper (4) chapters. Section 6.2 lists the resulting instructions with their costs. After this is done, the actual applications can be created using the instruction costs as a guideline. The resulting applications can be found in appendix A.


The final step in creating applications for Sesame and Sniper with the same workload is to write simple applications for Sniper and measure the workload they generate for each component of the architecture. The Sniper framework has an option to dump all the statistics and data about the last simulation run, from which the clock cycles generated per component can be extracted. For Sesame it is very easy to monitor how many clock cycles are generated per component, since the whole framework is much more abstract. The Sesame applications are then calibrated against the data from Sniper until the workloads are about the same, after which the applications are ready to use.
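The comparison step can be expressed as a simple per-component relative difference, which can be driven towards zero while tuning the Sesame application. The function and component names below are illustrative.

```python
# Sketch: compare per-component cycle counts from the two simulators and
# report the relative difference, so the Sesame application can be tuned
# until the workloads roughly match.

def workload_difference(sesame_cycles, sniper_cycles):
    """Both arguments map component name -> busy cycles.
    Returns {component: relative difference}, where 0.0 means equal."""
    diff = {}
    for comp in sesame_cycles:
        a, b = sesame_cycles[comp], sniper_cycles[comp]
        diff[comp] = abs(a - b) / max(a, b)
    return diff
```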

5.2 Architecture configuration

For Sniper an architecture is needed for which all the details are available for its configuration files, and an abstract version of the same architecture must exist for Sesame. For this, the Intel Xeon X5550 processor is used as a baseline in this thesis[17]. This is a processor in the Gainestown line, which is based on the Nehalem microarchitecture[29]. The normal Intel Xeon X5550 processor has 4 cores, but this thesis uses architectures with different numbers of cores to test whether this influences the thermal prediction in an unforeseen way. Using more architectures, and thus more applications, makes the resulting power model more accurate, which gives a better result. The exact floor plans for HotSpot that reflect these architectures can be found in section 6.1.

As mentioned in chapter 4, the configuration of the simulator and the hardware is very detailed. The general configurations for Sniper and the Gainestown architecture are set in their respective configuration files (given in appendix C), but some details still change when switching between the numbers of cores used. The most important parameter is of course the number of cores, which also changes the size of the shared memory, also known as the level 3 cache: when 4 cores are used instead of 2, the cache also doubles from 4 megabytes to 8 megabytes. For Sesame there is almost nothing to configure except the number of cores used. This makes Sesame less accurate, but much faster for prototyping.

5.3 Calibration

To calibrate the power model of Sesame, the power trace (section 3.2) of each Sesame simulation has to be compared to its Sniper counterpart. This means that the power traces of Sesame and Sniper need to have about the same number of rows, which makes comparing the two easier. Since each row represents the power consumption of each component during one time interval, a time interval has to be chosen for Sesame such that its power trace has the right number of rows. For Sniper the intended time interval is 10000 nanoseconds, since this is the default value. The simplest way to determine a time interval for Sesame, especially since the numbers of rows do not have to match exactly, is to start with an arbitrary value and adjust it until the number of rows in the power trace is correct. During the calibrations, the time interval of Sniper, and thus of Sesame, is also varied, since this value can change the accuracy of the results and the duration of the simulation, which is an interesting effect to investigate.
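Instead of guessing, a matching interval can also be computed directly from the length of the Sesame simulation. This is a sketch under stated assumptions: `total_cycles` and `freq_hz` describe the Sesame simulation, and `sniper_rows` is the row count of Sniper's power trace; the function name is illustrative.

```python
# Sketch: pick a time interval for Sesame so that its power trace gets
# roughly as many rows as Sniper's.

def sesame_interval(total_cycles, freq_hz, sniper_rows):
    total_time_s = total_cycles / freq_hz  # cycles -> seconds
    return total_time_s / sniper_rows      # seconds per power-trace row
```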

Now the power usage of each type of component (in this case there are only two types, memory and processor) in the idle and busy state has to be determined. This is done by fitting the idle and busy state power usage of each component in Sesame to the power consumption predicted by Sniper. For the memory this is very simple, since the memory component in Sesame corresponds exactly to the level 3 cache component in Sniper, but for the processor it is a bit more difficult, because a processor in Sniper is divided over 6 subcomponents. For this reason the mean power usage and temperature from these 6 components are used when comparing them to a single processor component from Sesame. The results can be found in section 6.4.
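One way to perform such a fit is a simple least-squares line fit: modeling the power of an interval as a mix of busy and idle power weighted by the component's utilization in that interval. This is a sketch of one possible fitting method, not necessarily the exact procedure used in the thesis; all names are illustrative.

```python
# Sketch: fit busy and idle power for one component type by least
# squares. utilization[i] is the fraction of interval i the component
# was busy (known from Sesame's cycle schedule); power[i] is Sniper's
# prediction for the same interval.
# Model: P = u * P_busy + (1 - u) * P_idle = P_idle + u * (P_busy - P_idle)

def fit_power(utilization, power):
    n = len(utilization)
    mean_u = sum(utilization) / n
    mean_p = sum(power) / n
    cov = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilization, power))
    var = sum((u - mean_u) ** 2 for u in utilization)
    slope = cov / var              # P_busy - P_idle
    p_idle = mean_p - slope * mean_u
    return p_idle + slope, p_idle  # (P_busy, P_idle)
```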


CHAPTER 6

Results

This chapter presents the results of the methods described in chapter 5, "Power model calibration". First, the architectures and applications used are explained. Then Sesame's resulting power model is shown and the power usage of every component is explained, using the results of Sniper as a comparison. The thermal predictions of HotSpot are also shown, since these are the actual results in which Sesame must approximate Sniper as closely as possible, and they are thus also needed for the calibration.

6.1 Used architectures

As was explained in chapter 5, the Gainestown architecture is used with the Xeon X5550 processor as a basis. This has led to the use of two different architectures, namely a two-core and the default four-core variant, each with all cores connected to one shared memory (also called the level 3 cache). Sesame also uses a slightly simpler version compared to Sniper, since Sesame treats a core as a single component, in contrast to Sniper's 6 subcomponents. The floor plan for the four-core Sesame architecture can be found in figure 3.1 and the two-core Sniper architecture in figure 4.2. The two missing floor plans are given below:


Figure 6.2: Four core floor plan for Sniper

6.2 Instruction cost

As mentioned in section 5.1, the types of instructions and their costs in clock cycles for the Sesame and Sniper simulators have to be about the same, since this guarantees that the building blocks of an application are about the same. Under this assumption, applications can be built for Sesame and Sniper that generate about the same workload for each component of the system.

Figure 6.3 shows that Sniper supports many different instructions. Some instructions have a cost of zero clock cycles and are thus ignored by the power model. Furthermore, the table does not show on which types of components these instructions can be executed, but this matters little, since it can be derived from the kind of instruction: an "iadd" is an integer addition and is executed on the processor, while a "mem access" instruction stands for a memory access and is executed on any form of memory (cache). Note that the cost of the "mem access" instruction is zero; this is because the cost of accessing memory differs per level of the memory hierarchy. Accessing a level 1 cache is much cheaper than accessing a level 3 cache, for example. These so-called read and write delays are set in the configuration files, which can be found in appendix C. The number of clock cycles an instruction takes to complete is the same for every component in the architecture, but the time it takes to execute can differ, since different components can have different clock frequencies, which determine how many clock cycles a component executes per unit of time. In this way a heterogeneous system could be simulated, but this is not done in this thesis, since it would make the calibration between Sniper and Sesame much more difficult.
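The cycles-versus-time distinction above amounts to a single division. The sketch below is illustrative; the instruction names follow the tables in this section, while the function name is an assumption.

```python
# Sketch: the same instruction costs in cycles translate to different
# execution times on components with different clock frequencies.

def execution_time_s(instruction_counts, cost_cycles, freq_hz):
    """instruction_counts: {instruction: count};
    cost_cycles: {instruction: cycles per instruction};
    returns total execution time in seconds."""
    total = sum(n * cost_cycles[i] for i, n in instruction_counts.items())
    return total / freq_hz
```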


Instruction      Cost (clock cycles)
iadd             1
isub             1
imul             3
idiv             18
fadd             3
fsub             3
fmul             5
fdiv             6
generic          1
jmp              1
string           1
branch           1
dynamic misc     1
recv             1
sync             0
spawn            0
tlb miss         0
mem access       0
delay            0
unknown          0

Figure 6.3: Instruction cost for Sniper

Component   Instruction        Cost (clock cycles)
processor   generic            1
processor   iadd               1
processor   isub               1
processor   imul               3
processor   idiv               18
processor   local mem access   3
memory      access             30

Figure 6.4: Instruction cost for Sesame

As was mentioned in section 5.1, Sniper's instructions are predefined, but their costs can be changed; figure 6.3 shows the default values, which are used here. For Sesame the whole instruction cost table has to be built from scratch using Sniper's data. In Sesame, instructions can be assigned directly to components, while for Sniper this can only be derived from the name; this can be seen in figure 6.4. As explained in section 3.1, the bus component is not taken into account, since HotSpot cannot handle small communication components. For this reason any bus-related delay is ignored during the Sesame simulations. There is, however, a memory delay (for the shared memory, the level 3 cache) which compensates for this, so reads and writes are not instantaneous. The application does not need to contain explicit instructions for memory delays, since these are applied automatically during a read or write; all processor instructions, however, have to be called explicitly.

The instruction table for Sesame has to be as simple as possible, since imitating a Sniper binary exactly is too complex. This led to the use of a subset of the instructions from figure 6.3 with their costs. As for the mathematical operations, only integer operations are used, since floating point operations would make the application unnecessarily complicated. There also is one generic instruction to compensate for instructions that Sniper supports but Sesame does not. The last processor instruction is called "local mem access", which simulates a memory access to the level 1 cache, part of the local memory of each processor. This compensates for the fact that a core consists of


only one component in Sesame, in contrast to Sniper, so a read or write event to local memory is not possible. The costs of these instructions are all directly derived from the costs in Sniper. Instructions are defined for all architecture components just as with Sniper (so there is an "iadd" instruction for every processor), but Sesame does not have a clock frequency per component and therefore uses a different approach to change the cost of an instruction. While Sniper does not allow changing the cost in clock cycles of an instruction, Sesame does: for each component, all the clock cycle costs of the instructions it can use can be changed. This is Sesame's way to simulate a heterogeneous system or a memory hierarchy, but for the applications and architectures used in this thesis the cost of each instruction is the same for all components. This means that each instruction has the same cost on every core, and since Sesame's architecture contains only one physical memory, a shared memory without distinct local memories (although the local mem access instruction compensates for this), no memory hierarchy has to be simulated, so reads and writes of a fixed data size always cost the same.
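The per-component cost table of figure 6.4 can be represented as a simple lookup. The nested-dict layout below is an illustrative sketch, not Sesame's actual configuration format; the values follow figure 6.4.

```python
# Sketch of a per-component instruction cost table, as in figure 6.4.
# The layout is illustrative; only the numbers come from the table.

COSTS = {
    "processor": {"generic": 1, "iadd": 1, "isub": 1, "imul": 3,
                  "idiv": 18, "local mem access": 3},
    "memory": {"access": 30},
}

def cost(component_type, instruction):
    """Clock-cycle cost of one instruction on one component type."""
    return COSTS[component_type][instruction]
```

A heterogeneous system would simply use a different inner dict per concrete component instead of one per component type.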

6.3 Workload balancing

As was explained in section 5.1, the final step of application calibration is workload balancing. Now that the instructions, the building blocks of the applications, are calibrated, applications can be created that generate an equal amount of workload for each component of the architecture. Since a 2-core and a 4-core architecture are used, and Sesame and Sniper need different versions of each application because their application models are totally different, the 3 application ideas that are implemented result in a total of 12 applications. The important parts of the code can be found in appendix A.

A problem arose when using the Sniper framework to obtain statistics about the workload generated by an application: most of the statistics are incorrect or do not make sense. For instance, the number of instructions and the number of clock cycles per component generated by the application are much too low compared to the predicted numbers. Furthermore, for some statistics the difference between the region-of-interest-only figures and those for the whole application is too small or absent. Some statistics may well be correct, since the list of statistics gathered by the Sniper framework is enormous, but because so many are wrong, the credibility of all of Sniper's own statistics is gone. This does not apply to the power and thermal results of Sniper, however, since these come from a different part of the framework and those values do match expectations.

This causes a significant problem for the scientific value of this thesis, since the power model cannot be correctly calibrated if the generated workloads are not almost the same to begin with. The problem could be resolved by writing a new program that collects statistics from a Sniper simulation, but this would take more time than is available for this thesis; it is a topic for further research and is discussed in chapter 7. Because there is no time to solve this properly, a different approach has to be taken, one that balances the workload by estimation instead of precise measurement. This is done by translating the code of Sniper's application to Sesame's application model, which is of course far from optimal. The resulting applications can be found in appendix A.

Another problem that occurred during the creation of the applications is related to the complexity of Sniper's application model. One of the intended applications for Sesame and Sniper was a broadcast application in which one core writes data to shared memory and the other cores read it. For Sesame this is very simple: each core is connected to the bus, the bus is connected to the memory, and the logic behind the application and architecture models is so abstract that simple writes and reads can be made without any form of data races. For Sniper this is much more complex, because the thread pinning functionality of the pthreads library is used, and communication between threads relies on mutex locks and condition variables[3]. The prediction was that the communication between threads, thus between cores, since Sniper uses


one thread per core in the simulations in this thesis, would use the level 3 cache, since this is the only form of shared memory. This is however not the case: Sniper shows no sign that this data is read and written through the level 3 cache. Since this component is apparently not used for communication between threads, another way to stress the level 3 cache had to be found. The problem is that the inner workings of the memory hierarchy in Sniper are of such complexity that an equally complex application would have to be built to stress this component, and the resulting application would be too complex to implement in Sesame, whose application model can only handle simple applications. This has led to the conclusion that the idea of an application with communication between cores has to be dropped for this thesis, since it cannot (yet) be implemented in a simple way in Sniper.

6.4 Power model

The power traces of the Sniper simulations have been transformed for this calibration, as was explained in section 5.3: the sum of the power consumption and the mean temperature of the 6 subcomponents of the processor are used, so that the power traces of Sesame and Sniper have the same format. For the shared memory component no changes were needed, since it has the same format in both simulators. There is, however, one problem with the power usage of the shared memory in the busy state: the application that would stress the shared memory could not be built for Sniper (appendix A). This means that no application stresses the shared memory, so the power usage of this component in the busy state could not be measured.

The first step is to find the right time interval for Sesame so that the power traces have about the same number of entries for both simulators. For Sniper, the default time interval of 10000 nanoseconds was used[22], but a problem occurred when the resulting power traces of Sniper were inspected. Figure 6.5 shows the power trace of an application where only one core (core 1) is stressed on a 2-core architecture. The first three entries of the power trace show a high power consumption for both cores, but the last two entries show almost no power consumption at all. This should not happen for the second core, since the stress function should keep it busy during the whole region of interest and thus during the whole simulation.

Level 3 cache   Core 0      Core 1
1.789876        12.957153   22.935132
1.787496        14.022822   22.927315
1.787630        14.234287   23.275669
1.793704         2.711972    2.711972
1.787677         5.370840    7.688971

Figure 6.5: Power usage per component per time interval of 10000 nanoseconds for Sniper

When the time interval for Sniper was lowered to 5000 and then 1000 nanoseconds, more problems occurred. The time interval indicates how often a Sniper simulation is interrupted to let McPAT generate power consumption data, as was explained in the Sniper chapter. We would therefore expect that lowering the time interval from 10000 to 5000 nanoseconds would make the last four entries (instead of the last two) of the power trace contain extremely low values, with higher values in the rest of the entries; lowering the interval to 1000 nanoseconds should likewise affect the last twenty entries. However, for both the 5000 and the 1000 nanosecond simulations, only the last two entries of the power trace indicate a very low power consumption. This is inconsistent: the last two entries of the original 10000 nanosecond power trace cover the last 40 percent of the total simulation, while with a lower time interval the last two entries reflect a much smaller share of the data. The most likely explanation is that the last two entries of a Sniper power trace are always incorrect. For this reason, the last two lines of the power traces are ignored in the calibration of Sesame's power model, and consequently also the last two lines of the temperature predictions of HotSpot. The time interval is furthermore lowered to 5000 nanoseconds to obtain enough entries in the power trace to calibrate the power model of Sesame more precisely.
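The trimming step can be expressed directly on the parsed trace; the rows below are the 10000-nanosecond trace from figure 6.5:

```python
def trim_trace(trace, n_drop=2):
    """Drop the last n_drop intervals of a Sniper power trace,
    since those entries turned out to be unreliable."""
    return trace[:-n_drop] if n_drop > 0 else list(trace)

# columns: level 3 cache, core 0, core 1 (W per interval)
trace = [
    [1.789876, 12.957153, 22.935132],
    [1.787496, 14.022822, 22.927315],
    [1.787630, 14.234287, 23.275669],
    [1.793704,  2.711972,  2.711972],  # suspect entry
    [1.787677,  5.370840,  7.688971],  # suspect entry
]
trimmed = trim_trace(trace)
```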

This time interval for Sesame is measured in clock cycles, since a simulation in Sesame generates a clock cycle schedule (section 2.4). After some testing, this value turned out to be 150000 clock cycles for Sesame.
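One way to derive such an interval is to divide the total number of simulated clock cycles by the number of entries in the matching Sniper trace; the totals below are hypothetical, as the thesis arrived at the 150000-cycle value by testing:

```python
def interval_in_cycles(total_cycles, n_entries):
    """Pick a Sesame sampling interval (in simulated clock cycles)
    that yields roughly n_entries power-trace entries."""
    return total_cycles // n_entries

# hypothetical: 750000 total cycles matched against a 5-entry trace
interval = interval_in_cycles(750000, 5)
```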

Another important change concerns the order of the stress and no stress functions in the applications where only one thread is stressed and the rest not at all (appendix A). It turned out that the thread scheduling algorithm of Sniper schedules the threads in a different order than the one in which they were created by the pthreads application. If the stressed function is executed on a different core in each simulator, the thermal predictions of HotSpot will differ, because the temperature of one core affects the surrounding components. The Sesame application was chosen to be changed, since this is easy to do in its framework, and the stressed function was moved to another thread. As a result, the stressed function is now executed on the same core in both simulators.

The parameters found for the power usage in the busy and idle state of each component are given in the table below. The memory component has no busy state power usage, as was mentioned earlier. The idle state power usage of the shared memory was easy to find, since it is always idle in the applications for Sniper. The busy state power usage of the processor was found using the stressed cores, since those cores are always busy. The only remaining variable is the processor power usage in the idle state, which should be found using the power consumption of the processors that run the no stress function. These processors have both busy and idle cycles, but since the busy state power usage was already known, the idle state power usage could be found by adjusting its value until these processors show the same power consumption as was calculated by Sniper.

Component               State   Power usage (W)
Memory (2 processors)   Busy    -
Memory (2 processors)   Idle    0.119
Memory (4 processors)   Busy    -
Memory (4 processors)   Idle    0.225
Processor               Busy    1.53
Processor               Idle    -

Figure 6.6: Power usage per component per state
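The calibration of the idle-state power described above amounts to solving a one-variable linear equation: with the busy-state power fixed, choose the idle-state power so that the modelled average power of a partially idle core matches Sniper's measurement. A minimal sketch, with a hypothetical busy fraction and target power:

```python
def modelled_power(busy_fraction, p_busy, p_idle):
    """Average power of a core over an interval, given the
    fraction of cycles spent in the busy state."""
    return busy_fraction * p_busy + (1.0 - busy_fraction) * p_idle

def calibrate_idle(busy_fraction, p_busy, target_power):
    """Solve target = f * p_busy + (1 - f) * p_idle for p_idle."""
    return (target_power - busy_fraction * p_busy) / (1.0 - busy_fraction)

# busy-state power from figure 6.6; fraction and target hypothetical
p_idle = calibrate_idle(busy_fraction=0.4, p_busy=1.53, target_power=0.9)
```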

The power usage of the shared memory is divided into two categories, since the memory size of the 2-core architecture is 4 megabyte while it is 8 megabyte for the 4-core architecture. The power usage in the busy and idle state therefore differs between those two architectures for the shared memory, and this is reflected in the table. In the previous paragraph it was said that the power usage of a processor in idle state could be found using the processors which run the no stress function. However, the power consumption that Sniper calculated was not what was expected, as can be seen in figure 6.7b. The first power consumption entry of core 0, which runs the no stress function, is lower than the entries that follow for this core. This behaviour is very odd: the power consumption of core 0 was expected to be relatively high at the start, while it executes its workload, and to then drop to a constant, lower level for the rest of the simulation once its workload is finished. The first power value as predicted by Sesame can be seen in figure 6.7a, where the idle state power usage of a processor has been set to zero. This first value is already larger than the first entry of core 0 in the Sniper trace of figure 6.7b. The difference in predicted power consumption for the first entry of core 0 would thus become even bigger when the idle state power consumption of a processor is set to a number larger than zero, which it has to be, since every component uses power in the idle state. This makes no sense, and these arguments have led to the conclusion that the idle state power usage of a processor cannot be calculated due to incorrect data. All power usage values from figure 6.6 that could not be calculated have been set to zero for the remaining tests and results.

(a) Sesame

Memory   Core 0      Core 1
1.785    15.300153   22.95
1.785    -           22.95
1.785    -           22.95
1.785    -           22.95
1.785    -           22.95

(b) Sniper

Level 3 cache   Core 0      Core 1
1.792020        11.383983   22.943056
1.787571        14.022987   22.927389
1.787571        14.022857   22.927275
1.787571        14.018013   22.927552
1.787614        14.040790   23.381719

Figure 6.7: Power usage per component per time interval of 5000 nanoseconds

Figure 6.7 shows the comparison of the power traces for a Sesame and a Sniper simulation using the 2-core architecture and an application where core 0 runs the no stress function and core 1 the stress function from appendix A. The results for the shared memory and the busy core 1 are very similar for both simulators, which shows that the values in figure 6.6 are chosen correctly. They are not perfect, however, since the applications do not have exactly the same workload, as was discussed earlier. The results are similar for the other simulations with different architectures and applications.
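The similarity noted above can be quantified with a simple per-column comparison; the values below are the core 1 columns from figure 6.7:

```python
def max_abs_diff(col_a, col_b):
    """Largest per-interval difference between two trace columns."""
    return max(abs(a - b) for a, b in zip(col_a, col_b))

# core 1 power (W) in figure 6.7, ignoring the unreliable tail
sesame_core1 = [22.95, 22.95, 22.95]
sniper_core1 = [22.943056, 22.927389, 22.927275]
delta = max_abs_diff(sesame_core1, sniper_core1)  # below 0.03 W
```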

6.5 Thermal model

Now that a power model has been established (the unknown power usage values from figure 6.6 have been set to 0), HotSpot can be used to generate a thermal model for each simulation. The results for each application and architecture combination, for both the Sesame and the Sniper simulation, are shown side by side in the following figures:

(a) Sesame (b) Sniper


(a) Sesame (b) Sniper

Figure 6.9: Thermal model of the 2-core architecture with all cores stressed

(a) Sesame (b) Sniper
