
Modeling CPU energy consumption using machine learning


Academic year: 2021

Share "Modeling CPU energy consumption using machine learning"

Copied!
30
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst


Layout: typeset by the author using LATEX.


Modeling CPU energy consumption using machine learning

Kevin Pawiroredjo 11043199

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie (Artificial Intelligence)

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor

dr. ir. A.L. Varbanescu
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2020

Contents

1 Introduction
  1.1 Research question and approach
  1.2 Thesis outline
2 Background and Related Work
  2.1 Platforms
  2.2 Parallel Programming
  2.3 Measuring tools
  2.4 Related work
    2.4.1 A Comparative Study of Methods for Measurement of Energy of Computing
    2.4.2 Measuring Energy and Power with PAPI
    2.4.3 MIPP: A Microbenchmark Suite for Performance, Power, and Energy Consumption Characterization of GPU architectures
    2.4.4 Measuring Energy Consumption for Short Code Paths Using RAPL
3 Data collection: measuring data and energy consumption
  3.1 Experimental setup
    3.1.1 Case-study application: LUD
    3.1.2 Benchmarking method
    3.1.3 Platform
    3.1.4 Energy Measurement
  3.2 Results
4 Energy consumption modeling
  4.1 The Dataset
  4.2 Preprocessing
    4.2.1 Polynomial features
  4.3 Evaluation
  4.4 Models
    4.4.1 1-socket Model using normal threading
    4.4.2 1-socket Model using hyperthreading
    4.4.3 2-socket Model using normal threading
  4.5 Discussion
5 Conclusion
  5.1 Summary and findings


Chapter 1

Introduction

Sustainability has come more and more into focus as we enter the new decade. Global warming has long been a looming threat, and the emission-reduction targets we set in the past have not been met. While ICT has seen tremendous growth, it needs to contribute to reducing energy consumption as a whole, as recent studies [5] have indicated. As we slowly transition into a future that uses more renewable energy sources every day, we will need to lower our energy consumption to accommodate this shift in priorities.

One way to reduce the ICT energy footprint is to improve the energy efficiency of heavy computational tasks. Hardware has continually evolved, becoming more powerful with every iteration, but sustainability has taken the backseat. Technologies like multi-core processors and hyper-threading have been developed to complete computationally heavy tasks faster, but their impact on energy consumption when running parallel applications is less studied. This thesis aims to raise awareness about the energy efficiency of parallel applications running on multi-core architectures. To this end, we aim to demonstrate how the energy consumption of a given application changes depending on its parallelism.

Additionally, we aim to determine whether the energy consumption of parallel applications can be predicted. If this is possible, we can reduce the energy footprint of applications by selecting a good trade-off between their execution time and energy. However, to predict energy consumption for parallel applications running on multi-core systems, we need to build predictive models. In this work, we propose to investigate whether such models are feasible.


1.1 Research question and approach

The research presented in this thesis focuses on the following question:

Is it possible to use ML models to predict energy consumption for par-allel applications?

The use of a machine learning model requires representative training data; thus, a large part of this thesis project will be dedicated to collecting data. The first research sub-question we propose is:

RQ1. How can we collect representative data from benchmarking tools that is indicative of the CPU?

To answer this question, we performed a literature study on benchmarking software. The outcome of this study is presented in Section 2.3. To evaluate the results from the benchmarking software, we calculate the standard deviation to check whether outliers muddy the results.

To further build the model, we have to find the variables that influence energy consumption, such as the number of threads that the application uses and how they are distributed across the CPU. We also need to investigate how energy consumption changes with different input data for the parallel application. Therefore, the second research sub-question we propose is:

RQ2. What features influence the energy consumption of a given par-allel application on a multi-core processor?

We expect that the number of used CPU hardware threads, cores, and sockets all impact the energy consumption of a CPU. To determine how an application's energy consumption changes with the use of these CPU resources, we will conduct an empirical analysis, where we will measure the impact of each of these features using the tools and methods found in RQ1.

Finally, we also build a first predictive model, using the data and features from RQ1 and RQ2. We further evaluate this model's ability to correctly predict energy consumption for an application. Thus, we formulate a third sub-question as:

RQ3. How can we build a linear regression model to predict the energy consumption of an application for unseen input data and a given CPU configuration?

To approach this question, we will attempt to build a machine learning model and assess its accuracy.

1.2 Thesis outline

In Chapter 2 we introduce the necessary concepts required to understand this work. Additionally, we analyse earlier studies that have been done in the field of ML and computer systems. In Chapter 3 we present the benchmarks we have used for data collection, and we analyse their results to determine the features of interest for our model. Chapter 4 presents the model we have constructed, and its performance analysis. Finally, we conclude the thesis with Chapter 5, which highlights our main findings and proposes future work.


Chapter 2

Background and Related Work

In this chapter, we discuss the necessary background information for understanding the platforms and methods used in this thesis, and we provide a brief overview of relevant related work.

2.1 Platforms

The central processing unit (i.e., the CPU) is the workhorse of the computer. Although many components have become more specialised over the years, the design of the CPU core has not changed fundamentally. All CPU cores consist of a control unit, registers, processing units (needed to perform the actual computation, like arithmetic or bitwise-logic operations), and, in most cases, some cache.

In the past 20 years, a lot of effort has been put into designing and using multi-core processors [10]. In this design, instead of the processor having one single core, it has multiple cores and, when necessary, even multiple sockets with multiple cores each. An example of a multi-core processor can be seen in Figure 2.1.

To make a CPU faster, a logical solution is to add more parallelism by adding more sockets or cores per socket and making each core do simpler things. This line of thinking eventually resulted in the development of the GPU. A GPU has significantly more cores, but each core does much simpler work. Why, then, do we still use CPUs? The CPU is designed to be as flexible as possible. The major drawback of GPUs is that they are more difficult to program, and that splitting an application into many small, fine-grained tasks can be very difficult. The CPU has more advanced cores, and building parallel applications for multi-core CPUs is easier.

Another way to improve CPU performance is to clock the CPU faster and to improve the logic units. However, these approaches have become ineffective due to technological limitations [13]. Instead, multi-core systems have emerged.


Figure 2.1: A multicore architecture (figure taken from "Overview of Performance Measurement and Analytical Modeling Techniques for Multi-core Processors", summarized at https://www.cse.wustl.edu/~jain/cse567-11/ftp/mulitcore/).

Moreover, in 2002, Intel introduced the first instance of hyper-threading (an implementation of the concept of simultaneous multi-threading, or SMT [16]), an innovative solution to handle work more efficiently with the same number of cores. Hyper-threading is a technique in which each core works on multiple tasks (2 or 4, in general) in a bid to take parallelism further: when a core has to wait for work in one task to finish, it schedules a second task simultaneously. Using this method, downtime should be reduced, the CPU should see an increase in utilization, and the application should see an increase in performance [3]. In this thesis, we will evaluate results with hyper-threading both enabled and disabled.

2.2 Parallel Programming

When using multi-core processors, applications see performance improvements only when efficiently using the available parallelism. To achieve this, applications must use techniques such as multi-threading or multi-processing [7] to enable multiple tasks in the application to execute in parallel. The mapping of these tasks to cores is done by the operating system. As we will see in Chapter 3, this placement may have consequences for both the performance and the energy consumption of a task.

Parallel programming has multiple paradigms for threading an application. We could use either Pthreads or OpenMP to run multiple threads in Rodinia.


Pthreads is a low-level API designed to give a relatively high level of control to the programmer. In contrast, OpenMP is the high-level variant and can be scaled more easily. Because we aim only to increase or decrease the number of threads and to pin the threads using LIKWID, OpenMP is the clear pick for use in this thesis.

2.3 Measuring tools

The energy consumption of an application is proportional to its execution time. Thus, we will provide two types of measurements in this work: execution time and energy consumption.

We can measure execution time by simply placing timers in the code. Specifically, we start a timer when the application starts to run the threads on the multiple cores, and we stop it when all threads are finished and the matrix is solved. The difference between these points is the time we measure as execution time. We have chosen these regions to more accurately discern the difference between normal threading and hyper-threading.

We can measure the energy consumption of a given application in multiple ways [9]. We could, for example, install external hardware to measure the energy passing through it before it reaches the machine. This gives us an independent measurement of energy consumption, since it simply reads the number of joules passing through; a disadvantage is that we need physical access to the machine in question. Alternatively, we could install software that measures energy consumption based on the internal readings of the machine. This has the advantage that we don't need changes to the hardware. There are multiple counters on the CPU that record performance data for software to read, with the added benefit that research has been done on the accuracy of these readings [6].

In this work, we chose to rely on the internal RAPL counters to measure energy consumption [12]. This is because hardware access to DAS5 is restricted, and external metering would add complexity to the research that is not relevant. Another reason is that we can access the RAPL counters on Intel CPUs far more easily using LIKWID. This is discussed further in Chapter 3.

2.4 Related work

In this section we present four different approaches for estimating the energy consumption and/or efficiency of a given application.


2.4.1 A Comparative Study of Methods for Measurement of Energy of Computing

The paper [9] presents an evaluation of the effectiveness of tools for measuring dynamic energy consumption.

Using an external power measuring method, cutting-edge chips were compared to predictive energy models to see which approach had better results. Obtaining the readings was different for each set of on-chip counters: the authors used RAPL for Intel multicore CPUs, NVML for Nvidia GPUs, and the Intel System Management Controller chip (SMC) for Intel Xeon Phis. To compare both approaches, a methodology was presented, making use of the HCLWattsUp API, to determine the component-level dynamic energy consumption of an application using system-level physical measurements.

For the study, various CPUs and GPUs were benchmarked using dense matrix-matrix multiplication (DGEMM) and 2D fast Fourier transform (2D FFT). The experiments concluded that the average error between the energy profiles and the external controller was between 8% and 73%. RAPL gives significantly different readings of the energy consumption compared to the external power meter. For the 2D FFT application executing on accelerators (GPUs or Intel Xeon Phis), RAPL reports higher dynamic energy than the on-chip power sensors in the accelerators; RAPL counters only measure energy consumption on the CPU and DRAM. Furthermore, Intel MPSS and NVML provide more accurate dynamic energy consumption readings and exhibit trends similar to those of HCLWattsUp. For DGEMM, however, RAPL measurements are significantly lower than the on-chip sensor values of the accelerators. This indicates that computations contribute a large part of the total dynamic consumption.

The paper came to the conclusion that calibration cannot improve the accuracy of the on-chip sensors to an extent that would allow them to be used in optimising applications for dynamic energy. This is relevant to the thesis, because we will be using the RAPL counters to measure our energy consumption. We will not go into depth disputing the accuracy of RAPL, as we only care about the precision of the measurements.

The comparisons of the benchmarks were done on an Intel Haswell multi-core CPU and an Intel Skylake multi-core CPU. The average error ranged between 14% and 32% for energy prediction models using performance monitoring counters and the external power measurer, while the maximum error reached 100%. The paper concludes that methods solely based on correlation with energy to select PMCs are not effective in improving the average prediction accuracy. This is, however, not a problem for our modeling approach, as we aim to use past energy measurements as predictors.


2.4.2 Measuring Energy and Power with PAPI

In this paper [18], the authors sought to make measuring and reporting energy and power values easier. They further developed the Performance API (PAPI) analysis library to standardise these tasks. PAPI is used to instrument application code, and reports different performance metrics [1]. PAPI provides, by default, a lot of low-level performance indicators (i.e., directly related to the performance counters of the processor [14]), but lacks metrics for energy and power consumption. This study shows how to add such metrics to PAPI. Thus, old application code using PAPI immediately also provides these metrics, together with CPU performance counters, GPU counters, network, and I/O.

There are, however, some limitations when using PAPI for energy consumption. PAPI's readings on energy consumption are system-wide, and PAPI itself also consumes energy (i.e., it adds measurement overhead). Bottlenecks may be created because of PAPI, and energy consumption is not clearly centred around one application. PAPI can be used with both external and internal measuring tools. In the paper, the authors indicate that RAPL and NVIDIA counters are compatible with PAPI; they are investigating support for AMD Application Power Management.

The paper presents results, for Cholesky factorization, on energy consumption, power draw, and performance. PAPI could be interesting to use as an API to assess the information given by RAPL.

2.4.3 MIPP: A Microbenchmark Suite for Performance, Power, and Energy Consumption Characterization of GPU architectures

Predicting performance based on benchmarking results has been done in [6]. This paper demonstrated the use of microbenchmarking in measuring performance, power, and energy consumption. Microbenchmarks are small benchmarks specifically designed to introduce stress-points in the machine, in order to gather data about the individual components that are used instead of the entire system. The data from these microbenchmarks is then used to build a model.

Collecting data about instruction units, shared memory, cache and DRAM was the primary goal. Each microbenchmark run returns information like execution time, actual throughput (to compare with the theoretical throughput from the device specifications), average and max power consumption, energy consumption and energy efficiency. Microbenchmarking could be relevant to identify bottlenecks of machines early and can be useful in defining what the limit of each component in each machine is. This paper also illustrates the fact that benchmarking is a valid way of receiving feedback about internal changes.


2.4.4 Measuring Energy Consumption for Short Code Paths Using RAPL

The paper [12] is a report on the use of RAPL energy sensors on the latest Intel CPUs. The authors illustrate the problems they encountered and their own experience. They used RAPL to measure the internal power consumption of individual components. They noticed deviations in the update counter when experimenting, which prompted them to investigate further. The authors mention that RAPL's updates are not accurately timed, and therefore cannot be used to report energy consumption at every millisecond. In turn, this means that RAPL is unusable for measuring short-term energy consumption.

The paper also presents HAECER, a new framework for instrumenting applications and performing fine-grained energy measurements at sub-update frequencies, and demonstrates the applicability of HAECER for an accurate evaluation of the energy consumption of video slice decoding. This paper is important to the project because it details the experience of using RAPL.


Chapter 3

Data collection: measuring data and energy consumption

In this chapter we present our measurements for the performance and the energy consumed by the application we target as our case-study.

3.1 Experimental setup

3.1.1 Case-study application: LUD

To make sure we select a representative application, we decided to pick a parallel application from a well-known benchmark suite: Rodinia [8]. Rodinia has many different applications to benchmark performance with [2]. We have chosen to use its LUD (lower-upper decomposition) application for this thesis. LUD solves a matrix via Gaussian elimination: given a matrix A, we can decompose it into the product of a lower triangular matrix and an upper triangular matrix.

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}
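The factorization above can be checked numerically. The following is a minimal sketch (Doolittle's method, no pivoting), not the Rodinia implementation:

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU decomposition without pivoting: returns L, U with A = L @ U."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]        # multiplier stored in L
            U[i, k:] -= L[i, k] * U[k, k:]     # eliminate below the pivot
    return L, U

A = np.array([[4.0, 3.0], [6.0, 3.0]])
L, U = lu_decompose(A)
assert np.allclose(L @ U, A)       # A is recovered from the factors
assert np.allclose(np.tril(L), L)  # L is lower triangular
assert np.allclose(np.triu(U), U)  # U is upper triangular
```
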

The thesis will not go into more detail in explaining how LUD works - we simply use LUD as an application of reasonable difficulty, for which we want to provide an energy consumption model.

3.1.2 Benchmarking method

For benchmarking, we will use different matrix sizes as input, because random matrices of any size can be generated very easily by Rodinia's internal tools. Using multiple matrices of different sizes gives us consistent test and validation data. To this end, we have generated random square matrices of sizes 1024 to 2560, at 256-size intervals.

We also made a couple of changes to Rodinia’s benchmarking method to give us more consistent data. As expected, LUD solves the matrix in one iteration of the application. However, this leads to very short execution time (especially for small matrices), which in turn leads to less accurate energy measurements with RAPL. To increase the execution time, we repeated the LUD solving 10 times before reporting the results. We measure the execution time and energy consumption for all 10 iterations, and report the average per iteration (i.e., we divide the measured data by 10).
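The repeat-and-average scheme described above can be sketched as follows; this is a generic timing helper for illustration, not the actual C harness used in the thesis:

```python
import time

def average_per_iteration(kernel, iterations=10):
    """Time `iterations` back-to-back runs of `kernel` and return the
    per-iteration average, mirroring how the thesis divides the measured
    totals by 10 to normalise very short runs."""
    start = time.perf_counter()
    for _ in range(iterations):
        kernel()
    return (time.perf_counter() - start) / iterations
```

The same division by the iteration count is applied to the measured energy, so both metrics are reported per LUD solve.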

Finally, the LUD implementation provided by Rodinia uses OpenMP to achieve parallelism across multiple cores. We will test different parallelizations by changing the number of threads used by the application. This is done by directly instructing OpenMP to use different numbers of threads. To control the mapping of threads to cores, we set thread affinity [17] with the help of LIKWID.
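As a sketch, one measured run could be launched as below; the command shape follows the likwid-pin invocation shown later in Figure 3.1, and the paths and helper names are illustrative:

```python
import subprocess

def pin_mask(threads, socket=0):
    """LIKWID affinity expression for the first `threads` cores of `socket`;
    pin_mask(1) matches the "S0:0-0" seen in Figure 3.1."""
    return f"S{socket}:0-{threads - 1}"

def lud_command(threads, matrix_file, socket=0):
    """Command line for one measured run (flags follow Figure 3.1)."""
    return ["likwid-powermeter", "likwid-pin", "-c", pin_mask(threads, socket),
            "./omp/lud_omp", "-n", str(threads), "-i", matrix_file]

# On the target machine, the report would be captured with e.g.:
# report = subprocess.run(lud_command(4, "1280.dat"),
#                         capture_output=True, text=True).stdout
```
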

3.1.3 Platform

It is paramount to this work that the performance and energy consumption measurements are stable throughout the benchmarking process. To this end, we have used nodes from the DAS5 supercomputer as target machines [4]. All the nodes we used have the same internal specs: a dual 8-core 2.4 GHz (Intel Haswell E5-2630-v3) CPU and 64 GB memory. To make sure no interference from other tasks influences our results, we pre-reserved the nodes prior to our experiments.

Because of the time it took to generate the data, we did the benchmarking in several runs: first on all matrices smaller than 2176 × 2176, and then on matrices of sizes 2304 × 2304, 2432 × 2432, and 2560 × 2560.

3.1.4 Energy Measurement

All the energy measurements in this thesis use LIKWID-4.2.0 and the adapted LUD version from Rodinia. As a reminder, LIKWID reads both standard performance counters and Intel's RAPL counters, and compiles a report on all the measurements. We processed this report using regular expressions to extract the relevant information. In the end, we collected the data for runtime, energy consumed, and max power usage. All the data is available here: https://github.com/isthatajojomeme/Thesis_energy_prediction.

In Figure 3.1 we present the report LIKWID gives us at the end of solving a matrix. Here we can see how much energy is consumed by the process and the execution time it took.

    likwid-powermeter likwid-pin -c S0:0-0 ./omp/lud_omp -n 1 -i \
        ../../data/lud/1280.dat
    --------------------------------------------------------------------
    CPU name:  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    CPU type:  Intel Xeon Haswell EN/EP/EX processor
    CPU clock: 2.39 GHz
    --------------------------------------------------------------------
    Reading matrix from file ../../data/lud/1280.dat
    num of threads = 1
    Time consumed(ms): 33331.934000
    [likwid-pin] Main PID -> core 0 - OK
    --------------------------------------------------------------------
    Runtime: 34.0119 s
    Measure for socket 0 on CPU 0
    Domain PKG:  Energy consumed: 822.006 Joules   Power consumed: 24.1682 Watt
    Domain PP0:  Energy consumed: 0 Joules         Power consumed: 0 Watt
    Domain DRAM: Energy consumed: 76.4481 Joules   Power consumed: 2.24769 Watt
    Measure for socket 1 on CPU 8
    Domain PKG:  Energy consumed: 530.226 Joules   Power consumed: 15.5895 Watt
    Domain PP0:  Energy consumed: 0 Joules         Power consumed: 0 Watt
    Domain DRAM: Energy consumed: 101.123 Joules   Power consumed: 2.97318 Watt

Figure 3.1: Example LIKWID report for a single-threaded LUD run on a 1280 × 1280 matrix.

From this report, we are interested in the following fields:

• runtime: the time it takes to execute the code

• Energy consumed PKG: the total energy consumption of the processor package

• Energy consumed DRAM: the energy consumed by the DRAM DIMMs (not relevant here, since it is roughly constant and not affected by the threading method)

• Socket number: we always use socket 0 for the experiments with hyperthreading; for normal threading, both sockets are used and their energy consumption is added together
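Extracting these fields from such a report with regular expressions can be sketched as below. The patterns and the embedded report excerpt are illustrative (shaped like Figure 3.1); the thesis's actual expressions are not shown:

```python
import re

REPORT = """\
Runtime: 34.0119 s
Measure for socket 0 on CPU 0
Domain PKG: Energy consumed: 822.006 Joules Power consumed: 24.1682 Watt
Measure for socket 1 on CPU 8
Domain PKG: Energy consumed: 530.226 Joules Power consumed: 15.5895 Watt
"""

def parse_report(text):
    """Extract runtime (s) and per-socket PKG energy (J) from a LIKWID report."""
    runtime = float(re.search(r"Runtime:\s*([\d.]+)\s*s", text).group(1))
    pkg = [float(v) for v in
           re.findall(r"Domain PKG: Energy consumed:\s*([\d.]+)\s*Joules", text)]
    return runtime, pkg

runtime, pkg = parse_report(REPORT)
# For normal threading the two PKG values are summed: 822.006 + 530.226
total_pkg = sum(pkg)
assert abs(total_pkg - 1352.232) < 1e-6
```
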

3.2 Results

We now turn to the experiments. As a reminder, we have matrices of sizes 1024 to 2560, in steps of 256, and we run the LUD calculation 10 times. The graphs below detail the runtime and energy consumption of each LUD reading, divided by 10 to normalise the results. We performed every measurement, for every thread count, 5 times, and report the average across these runs.

In figures 3.2 and 3.3, we observe the runtime of LUD when increasing the number of threads. The one-socket data shows the effect of hyperthreading (HT): runs with 9 to 16 threads on a single 8-core socket rely on HT.

Figure 3.2: Execution time of the LUD application using 1 and 2 sockets - small matrices.


Figure 3.3: Execution time of the LUD application using 1 and 2 sockets - large matrices.

When using both sockets, we make sure (through pinning) that we always have one thread per physical core (thus, no HT is used). Looking at both figures, we clearly see that the performance on 1 and 2 sockets is similar until the 8-thread mark has been passed - which is to be expected, because the CPU uses 1 core per thread until the number of threads exceeds the number of cores on the CPU. After the 8-thread mark, however, we see a sharp increase in the execution time when using hyper-threading, whilst the 2-socket graph sees a further reduction in execution time. We also note that this hyperthreading overhead grows as the matrix sizes increase. We assume that this increase is a result of the added CPU time to finish the second assigned thread on the first core. This results in a less efficient system, which runs 7 cores with normal threading and 1 core with HT. As the number of threads increases, the overhead diminishes, because the extra workload is now spread out over more cores that run using HT.

Figures 3.4 and 3.5 show the energy consumption for the same set of benchmarks. We observe a trend similar to the runtime graphs: after the 8-thread mark, we see a sharp increase in energy consumption when hyper-threading, whilst the normal-threading graph continuously consumes less energy. Thus, there is also a penalty in terms of energy consumption for using hyperthreading; the penalty decreases as the number of threads increases. However, using 2 sockets remains more efficient in terms of energy consumption, too. The similar trends for energy consumption and runtime are not surprising, given that the two metrics are proportional by definition.

The different behaviour observed with and without hyper-threading is a strong argument for using separate models to predict energy consumption in the two situations.


Figure 3.4: Energy consumption for the LUD application using 1 and 2 sockets - small matrices.

Figure 3.5: Energy consumption for the LUD application using 1 and 2 sockets - large matrices.


Chapter 4

Energy consumption modeling

Our goal is to provide a model that can predict the energy consumption of LUD when given an unseen data size, a number of threads, and knowledge of how these threads are distributed - i.e., 1 or 2 sockets, and with or without hyperthreading. In this chapter we present our modeling strategy and preprocessing method. We explain the choices made in regard to modeling and introduce the techniques used for modeling the data. We also delve deeper into the resulting graphs and evaluate each model's accuracy and precision.

Based on our benchmarking results, we aim to infer the correlation between multi-threading and efficiency/power consumption. Additionally, we aim to determine the impact of hyperthreading on energy consumption (as a reminder, hyperthreading is a technique in which more threads are run than there are available cores - see Chapter 2).

4.1 The Dataset

We separate the data into 2 sets: a training dataset and a validation dataset. From here on out, we will refer to the data of sizes 1028, 1280, 1532, 1792, and 2176 as the training dataset. The data of sizes 1152, 1408, 1658, and 2048 will be used as the validation dataset when assessing the accuracy of the model.
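A minimal sketch of this split by matrix size; the helper and the record layout (dicts with a 'size' key) are hypothetical, not the thesis's actual data structures:

```python
def split_by_size(records, valid_sizes):
    """Partition measurement records into training and validation sets by
    matrix size. `records` is any iterable of dicts with a 'size' key."""
    train = [r for r in records if r["size"] not in valid_sizes]
    valid = [r for r in records if r["size"] in valid_sizes]
    return train, valid
```

With the validation sizes listed above, the call would be `split_by_size(records, {1152, 1408, 1658, 2048})`.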

4.2 Preprocessing

To analyse the data further, we need to preprocess it. Why is this important? The inputs (the size of the processed matrix and the number of threads) and the outputs (be it power, energy consumption, or runtime) are related, but not linearly: going from a 1x1 matrix to a 2x2 matrix is not a linear increase in work but a quadratic one. Applying ML to the raw data is therefore prone to result in faulty or poor models. Preprocessing transforms the data in such a way that an ML application can build better models. In this thesis we use several preprocessing methods to turn the data we already have into something more usable, and then run the data through several ML algorithms to observe the change in prediction accuracy. Where a preprocessing method takes a parameter, the best-performing setting is used in the comparison.

4.2.1 Polynomial features

This preprocessing method generates polynomial and interaction features from the data. It takes a degree as input to create the polynomial features. For example, given degree 2 and the numpy array [x, y], the resulting transform is [1, x, y, x², xy, y²]. This has given the clearest results we have observed.
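For two inputs and degree 2, the expansion can be written out directly. This is a hand-rolled sketch; scikit-learn's PolynomialFeatures(degree=2) yields the same column order:

```python
import numpy as np

def poly2_features(x, y):
    """Degree-2 polynomial/interaction expansion of two inputs,
    matching the [1, x, y, x^2, xy, y^2] layout described above."""
    return np.array([1.0, x, y, x**2, x * y, y**2])

assert np.allclose(poly2_features(2.0, 3.0), [1, 2, 3, 4, 6, 9])
```
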

4.3 Evaluation

To evaluate the models and give them a clear and objective score, we opted to use the normalised mean square error (NMSQE) as our evaluation method. Using the validation dataset, we measure the distance between each actual result and the model's prediction for the given data size. We chose this evaluation method because we want to measure the distance between the predicted and actual energy consumption while giving more weight to large differences than to smaller ones. The same squared-error criterion is also used to fit the linear regression model on the training data.
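One common definition of a normalised MSE is sketched below; the thesis does not spell out its exact normalisation, so the variance-based denominator here is an assumption:

```python
import numpy as np

def nmse(actual, predicted):
    """Mean squared error normalised by the variance of the actual values
    (one common convention; the thesis's exact normalisation is not stated)."""
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    return np.mean((actual - predicted) ** 2) / np.var(actual)
```

A perfect prediction gives 0; a model no better than predicting the mean gives roughly 1.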

4.4 Models

For the thesis, we have developed 3 models, each focusing on a different range of threads and sockets. These models are:

1. A model for 1 socket and a maximum of 8 threads, using normal threading.

2. A model for 2 sockets and a maximum of 16 threads, also using normal threading.

3. A model for 1 socket with a maximum of 16 threads, using hyper-threading.
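Each such model can be fitted with ordinary least squares on polynomial features. The sketch below is illustrative (function names are assumptions; the default degree of 3 follows the setting mentioned for the 2-socket model):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree):
    """All monomials of the columns of X up to `degree`, bias included,
    mirroring the polynomial-feature preprocessing of Section 4.2.1."""
    n, d = X.shape
    cols = [np.ones(n)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, idx], axis=1))  # product of chosen columns
    return np.column_stack(cols)

def fit_model(X, y, degree=3):
    """Least-squares linear regression on polynomial features; one such
    model would be trained per socket/threading configuration."""
    coef, *_ = np.linalg.lstsq(poly_features(X, degree), y, rcond=None)
    return lambda X_new: poly_features(np.asarray(X_new, float), degree) @ coef
```

For this dataset, X would hold the matrix size and thread count per run, and y the measured energy.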


4.4.1 1-socket Model using normal threading

Figures 4.1a and 4.1b present the predicted energy consumption for LUD (small and large matrices, respectively) for the application running on a single socket, using up to 8 threads.

(a) Small data sizes (b) Large data sizes

Figure 4.1: Predicted vs. actual energy consumption for LUD running on a single socket, with no hyperthreading.

We expected this to be our most accurate model, because the energy consumption is clearly decreasing as the number of threads increases. However, we can observe in 4.1a that the prediction follows a curve. For the smallest data size, 1152, the model seems to incorrectly underpredict the energy consumption after 2 threads; the prediction slowly approaches the actual energy consumption after 5 threads. The figure also shows that the model follows the actual result fairly closely for the 1408 data size, up to a certain point: the prediction incorrectly underpredicts the energy consumption after 3 threads, and this worsens until, at 5 threads, the model begins to predict higher results, eventually overpredicting the energy consumption at 8 threads.

In 4.1b, the model seems to consistently predict the energy consumption to be higher than the actual values for both sizes. The drop at the start of the prediction curve is less extreme than in the actual energy consumption. Interestingly, the prediction for data size 1658 is better than the prediction for data size 2048, as it is much closer to the real result.

Finally, we note that the NMSQE of this model is 27175.186.

4.4.2 1-socket Model using hyperthreading

Figures 4.2a and 4.2b illustrate the predicted energy consumption of LUD when using one socket with hyperthreading.


(a) Small data sizes (b) Large data sizes

Figure 4.2: Predicted vs. actual energy consumption for LUD running on a single socket, with hyperthreading.

We make the following observations. In 4.2a, the prediction model starts with a drop in energy consumption as the number of threads increases from 1 to 5 threads. This drop then levels off, and the predicted energy consumption rises between 6 and 13 threads before curving steeply down for the remainder. For the larger data sizes in 4.2b, the model predicts a drop between 1 and 8 threads, then a rise between 8 and 13 threads, followed by a slow decrease in energy consumption. The NMSQE of this model is 52901.606.

4.4.3 2-socket Model using normal threading

Figures 4.3a and 4.3b illustrate the predicted energy consumption of LUD when using two sockets with no hyperthreading.

First, we analyse the data in 4.3a, which shows the predicted energy consumption for the smaller matrices. We observe from the graph that the prediction for small numbers of threads is fairly accurate. However, the prediction becomes worse for 4-9 threads, where it underpredicts the actual consumed energy. In the region of 10 to 15 threads, the results are accurate again, but at 16 threads the prediction is again off, once more underpredicting the actual consumption. The prediction has this shape because we set the polynomial degree to 3. What does look promising is the accuracy of the model for the large matrices, as seen in 4.3b: here, the prediction follows the actual energy consumption very closely. It seems that the larger the input data size, the more accurate the model becomes. The NMSQE of this model is 48059.821.


(a) Small data sizes (b) Large data sizes

Figure 4.3: Predicted vs. actual energy consumption for LUD running on two sockets, with no hyperthreading.
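The shape of these prediction curves follows directly from feeding degree-3 polynomial features into a linear regression. A minimal numpy-only sketch of this kind of model (the training values below are made up for illustration, not the thesis measurements, and the actual pipeline may differ in detail):

```python
import numpy as np

def poly_features(X, degree=3):
    # Expand (threads, size) rows into all monomials up to `degree`,
    # a small stand-in for a polynomial-features preprocessing step.
    t, s = X[:, 0], X[:, 1]
    cols = [np.ones_like(t)]
    for d in range(1, degree + 1):
        for i in range(d + 1):
            cols.append(t ** (d - i) * s ** i)
    return np.stack(cols, axis=1)

# Hypothetical measurements (illustrative only): energy roughly falls
# with the thread count and grows with the matrix size.
X = np.array([[t, s] for s in (1152.0, 1408.0, 1920.0)
              for t in range(1, 17)], dtype=float)
y = 500.0 / X[:, 0] + 0.1 * X[:, 1]

# Ordinary least squares on the expanded features.
coef, *_ = np.linalg.lstsq(poly_features(X), y, rcond=None)

# Predict energy for an unseen matrix size across all thread counts.
X_new = np.array([[t, 1664.0] for t in range(1, 17)])
pred = poly_features(X_new) @ coef
```

Because the fitted function is a cubic in the number of threads, the prediction necessarily bends back up or down at the edges of the thread range, which matches the over- and underprediction seen at 1 and 16 threads.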

4.5 Discussion

The models we trained to predict energy consumption for LUD have varying accuracy. We have observed that most models predict the beginning of the curve closely, but fail to predict energy consumption as the number of threads grows. They are also not consistently over- or underpredicting the energy consumption. A running theme throughout these models is that the data sizes in the middle of the range are consistently predicted better than those at the edges. Based on this, we can theorise about two possible causes for the mismatch.

First, the data could have been inconclusive or otherwise lacking. This seems the most likely option, because the results showed a predictable curve in the simplest models. We used 5 datasets for training and 4 datasets for validating the model. All were evenly spaced, and we noticed that the datasets at the edges of the range were the most out of sync with the models. Another issue was the last data point, at 16 threads, for most models: the model prediction there was fairly poor, given that no data was generated beyond 16 threads. But, most importantly, the relatively small amount of data (compared to the typical requirements for ML) is probably what limited our accuracy. This happened because we had to generate our own dataset, and collecting the data took a long time, especially when being limited by the availability of DAS5.

Second, the generated data used for the models could also not be conclusive enough to produce good models. While we have searched for papers that ran comparable experiments to build energy-predicting models, most did so with more variables and more complex models. Our experiments only focused on the possibility of predicting energy efficiency, with no option to increase the number of possible inputs that might smooth the model. It is possible that we could build better models by increasing the number of variables or, alternatively, by decreasing the number of input variables. For example, we could fix the number of threads to an arbitrary value and only focus on predicting the energy consumption for an unseen data size; we speculate that, in this case, the models could become more accurate.


Chapter 5

Conclusion

Sustainability is becoming a global concern because of the irreversible course of climate change. As one of the industries that has grown the most in the past decades, ICT has a responsibility to stand at the forefront of this goal. ICT is projected to be one of the primary consumers of energy in the future [15], and an increase in efficiency now will pay dividends in the future.

To improve our understanding of how energy is consumed by parallel applications running on modern multi-core processors, this thesis has focused on investigating a simple method to build predictive models for energy consumption. When such models are accurate, we can expect users and infrastructure providers to make better choices to improve the overall efficiency of their applications and systems.

5.1 Summary and findings

One way to reduce energy consumption is to select the most efficient way to use the CPU. For modern multi-core processors, which have several cores and/or several sockets, this means we need to find the best way to divide the work across all these CPU resources. This is done using parallel, multi-threaded applications, but selecting the number of threads that yields the most energy-efficient version of the application remains an unsolved problem. This thesis targeted predictive models that can help select this most energy-efficient version. Moreover, using such prediction models, we can make comparisons between two different parallel systems, which allows us to select the most efficient one.

Using a case study, we investigated how to build such predictive models. Our test application is Lower-Upper Decomposition (LUD) from the Rodinia benchmarking suite [8]. We have benchmarked the application on 9 different data sizes to generate the data used for modeling1. The empirical analysis of the data indicated a consistently lower energy consumption when more threads were used. We also observed that the use of hyper-threading (HT) may lead to additional energy consumption compared to avoiding it. Finally, we proposed a prediction model based on linear regression, using the benchmark data as training and testing set. Linear regression is used to describe the relationship between the independent variables, specifically the number of threads used and HT vs. non-HT, and the dependent variable, energy consumption.
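This setup can be expressed as a small design matrix, with HT encoded as a 0/1 indicator column next to the thread count and data size. A minimal sketch of the idea (the feature values and energies below are made up for illustration, and the actual model additionally applies polynomial features):

```python
import numpy as np

# Each row: (threads, matrix size, HT flag); the HT flag is 1 when
# hyperthreading is enabled and 0 otherwise. All values here are
# illustrative, not the thesis measurements.
X = np.array([
    [2,  1152, 0],
    [4,  1152, 0],
    [8,  1152, 0],
    [8,  1152, 1],
    [8,  2048, 0],
    [16, 2048, 1],
], dtype=float)
y = np.array([420.0, 350.0, 300.0, 330.0, 980.0, 900.0])  # energy (J), made up

# Plain linear regression: add an intercept column and solve the
# least-squares problem y ~ X_aug @ w.
X_aug = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# w[3] is then the estimated energy effect of enabling HT, holding
# the number of threads and the data size fixed.
```

Fitting one model per configuration, as done in this chapter, is equivalent to dropping the HT column and training on the matching subset of rows.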

While the models we built did show some promising trends, their predictions were not highly accurate. We believe this is the result of multiple factors, such as insufficient data, unsuitable ML models, or irregular application results.

To conclude, we revisit the research question we sought to answer in this work: "Is it possible to use ML models to predict energy consumption for parallel applications?". Our results show that it is partially possible to predict energy consumption using linear regression, but our current models are not accurate enough to predict energy consumption for unseen data sizes, different numbers of threads, and HT or non-HT as variables.

5.2 Future work

While this thesis resulted in some promising models, the results showed limited model accuracy. To improve the models, we propose focusing on collecting more and/or better data. As a start, using more data should give better indications of whether our independent variables were conclusive enough: using only 5 data sizes for training and 4 for validation was optimistic. Moreover, having smaller differences between the input data sizes is also likely to improve results, and extending the measurements beyond the maximum thread count should reduce the inaccuracies at the end of the curves.

Looking beyond the data, we also see opportunities in building more complex models. This work only used the number of threads, HT vs. non-HT, and the data sizes as independent variables. Adding more independent variables, or searching for stronger ones, could lead to more accurate models.

1The code and data used in this thesis are available here: https://github.com/


Bibliography

[1] ***. Introduction to PAPI. Published at https://icl.cs.utk.edu/papi/docs/index.html (accessed on 21-1-2021).

[2] ***. Rodinia: Accelerating compute-intensive applications. Published at http://rodinia.cs.virginia.edu/doku.php?id=start (accessed: 21.01.2021).

[3] Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis, and Nectarios Koziris. Exploring the performance limits of simultaneous multithreading for memory intensive applications. The Journal of Supercomputing, 44(1):64–97, 2008.

[4] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(05):54–63, may 2016.

[5] Lotfi Belkhir and Ahmed Elmeligi. Assessing ict global emissions footprint: Trends to 2040 & recommendations. Journal of Cleaner Production, 177:448 – 463, 2018.

[6] N. Bombieri, F. Busato, F. Fummi, and M. Scala. Mipp: A microbenchmark suite for performance, power, and energy consumption characterization of gpu architectures. In 2016 11th IEEE Symposium on Industrial Embedded Systems (SIES), pages 1–6, 2016.

[7] Randal E. Bryant and David R. O’Hallaron. Computer Systems: A Programmer’s Perspective. Addison-Wesley Publishing Company, USA, 3rd edition, 2016.

[8] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC ’09, pages 44–54, USA, 2009. IEEE Computer Society.


[9] Muhammad Fahad, Arsalan Shahid, Ravi Reddy Manumachu, and Alexey Lastovetsky. A comparative study of methods for measurement of energy of computing. Energies, 12(11):2204, 2019.

[10] P. Gepner and M. F. Kowalik. Multi-core processors: New way to achieve high system performance. In International Symposium on Parallel Computing in Electrical Engineering (PARELEC’06), pages 9–13, 2006.

[11] Google. Circular economy of google data centers.

[12] Marcus Hähnel, Björn Döbel, Marcus Völp, and Hermann Härtig. Measuring energy consumption for short code paths using rapl. SIGMETRICS Perform. Eval. Rev., 40(3):13–17, January 2012.

[13] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2019.

[14] Intel. Intel’s processors performance counters. Published at https://download.01.org/perfmon/index/ (accessed on 21-1-2021).

[15] Energy Innovation LLC. How much energy do data centers really use? Published at https://energyinnovation.org/2020/03/17/how-much-energy-do-data-centers-really-use (accessed: 10.01.2021).

[16] Deborah T Marr, Frank Binns, David L Hill, Glenn Hinton, David A Koufaty, J Alan Miller, and Michael Upton. Hyper-threading technology architecture and microarchitecture. Intel Technology Journal, 6(1), 2002.

[17] Jan Treibig, Georg Hager, Gerhard Wellein, and Michael Meier. Poster: Likwid: lightweight performance tools. In Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion, pages 29–30. 2011.

[18] V. M. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek, D. Terpstra, and S. Moore. Measuring energy and power with PAPI. In 2012 41st International Conference on Parallel Processing Workshops, pages 262–268, 2012.
