
Investigation of Gaussian Mixture Models for power estimation in a computing cluster

Jos Wezenberg 6213278

Bachelor thesis Credits: 18 EC

Bachelor programme Artificial Intelligence (Kunstmatige Intelligentie), University of Amsterdam
Faculty of Science, Science Park 904, 1098 XH Amsterdam

Supervisors

H. Zhu, dr. P. Grosso (& dr. M.W. van Someren), SNE (System and Network Engineering Group)

Informatics Institute (IvI), University of Amsterdam, Science Park 904, 1098 XH Amsterdam


Abstract

Server administrators require an accurate and precise power estimation for the different tasks their servers perform, in order to create an efficient schedule. An estimation can be made by using the performance data from those servers, but some data, such as I/O data, can be difficult to acquire. This project investigates the effect of using different subsets of performance data on the accuracy and precision of power consumption estimates made with a Gaussian Mixture Model (GMM). The subsets investigated are CPU, CPU+Memory, CPU+I/O and the Full Set of CPU+Memory+I/O data. The Root Mean Squared Error (RMSE) is used as an accuracy measure and the Success Rate indicates precision. Finally, the added value of acquiring each subset is evaluated. Both the Linear Model and the GMM showed relatively poor performance when using only the first three subsets. The Full Set showed on average a 40%-60% decrease in RMSE and an average increase of 20%-30% in Success Rate, compared to the first three subsets. A significant improvement over the Linear Regression Model in both accuracy and precision was shown when using the Full Set of available features in the GMM, thereby justifying the increased effort required to acquire the complete data set.


Contents

1 Introduction
2 Literature review
3 Method
   3.1 Evaluation measures
   3.2 Gaussian Mixture Model
   3.3 Data
   3.4 Parameters
   3.5 Algorithm pseudo code
   3.6 Results
4 Conclusion
5 Discussion & Future work


Acknowledgements.

I would like to thank dr. Maarten van Someren for his assistance in acquiring this assignment and dr. Paola Grosso and Hao Zhu for agreeing to supervise this project in addition to their regular duties. I would also like to thank my parents for granting me the opportunity to have an education and for helping me eliminate any obstacle I have faced during my academic track. The support of friends and family is very important to the success of any project, including this one.


1 Introduction.

Power consumption continues to be a bottleneck in the design of large-scale data-centres. Since it accounts for over half of the monthly expenses of an average data-centre (Berl et al., 2010), investigating ways to minimise power consumption could be highly lucrative. Apart from monetary gain, efficient scheduling could decrease the carbon footprint of data-centres worldwide, which are estimated to account for over 2% of global greenhouse gas emissions (Cook & Van Horn, 2011).

Manufacturers are becoming increasingly successful in designing energy-efficient hardware, yet that is only one aspect of energy-aware systems design. Another principal way of increasing the efficiency of data-centres is proper planning of tasks across the server nodes. Scheduling more tasks onto one server node might increase its power consumption, but it can also leave other nodes idle, which can then be dynamically shut down to conserve power. There are several different strategies for scheduling; however, those lie outside the scope of this project.

A system administrator requires an accurate indication of the power consumed by a server for any given task, in order to create an effective schedule. Since this is a dynamic value, dependent on several variables, only an approximate Power Consumption Estimation (PCE) can be made, based on measurements taken from a server's internal components.

Machine learning can be utilised to make accurate PCEs for specific tasks, using historic data of similar tasks where Operating System (OS) data and Performance Monitoring Counters (PMCs) are measured. Notably, Dhiman et al. (2010) showed that the Gaussian Mixture Model (GMM) (Reynolds, 2008) is more accurate than multi-variate Linear Regression approaches (hereafter referred to as the 'Linear model'), due to its ability to incorporate the multidimensional effects of different components and their correlations in a PCE prediction model. In this thesis, we focus on two research questions about using the GMM for PCE.

Acquiring the right data set to feed the GMM for PCE is a time-consuming and complex process (Zhu et al., 2013). The CPU and memory data is relatively easy to obtain, but good data on I/O components is difficult to acquire. Firstly, focus is placed on evaluating the effects of utilising different sets of variables within the GMM model used for PCE, to determine whether the increase in the accuracy of the PCE justifies the costly acquisition of I/O data. We want to see how the accuracy changes when different features are removed in different combinations for the most effective model choice (GMM) in the given applications, try to calculate the predictive power of the feature subsets and attempt to explain the findings. Afterwards, the results will be compared to a Linear model to show whether the GMM improves upon it. This research adds to the field by investigating the predictive values of the performance variables in the GMM model for system-level power consumption estimation in server systems. It improves upon previous approaches by considering the system level instead of just VMs, and it considers all major components in the system and their correlations instead of a small subset (for a Cmap depicting the positioning of this research in the current state of the art, see Appendix A).

The structure of this thesis is as follows. First, an overview of the current state of the art is provided in a literature review (section 2). Subsequently, the approach with regards to the experiments conducted is described (section 3). The results are then evaluated and conclusions are drawn (section 4). Any issues encountered during the project, shortcomings of this research and possible suggestions for future work are discussed in the Discussion & Future Work (section 5).

2 Literature review.

Most of the research in PCE models focuses on Virtual Machine level power consumption estimation or uses only a small feature set. The standard approach of using only CPU or Memory features in a Linear model (Ardagna et al., 2012) is insufficient for modelling complex systems with shared processing capabilities and a high number of I/O operations. A system-level model is necessary to capture the correlations between components and make accurate predictions for the power consumption on a system level (Piga et al., 2014). Dhiman et al. (2010) showed that the Gaussian Mixture Model (GMM) (Reynolds, 2008) is most accurate in incorporating the multidimensional effects of different components and their correlations in a power estimation prediction model.

The varying nature of the tasks performed by individual components in a system creates non-linear relations. The GMM only copes with this issue to a certain degree. When undesirable results are produced by a GMM due to this issue, the data could be pre-clustered using Gaussian Mixture Vector Quantisation (GMVQ), after which the model is trained using the resulting clusters in order to make better predictions (Dhiman et al., 2010).

Real power consumption can be measured in several ways. For example, SEFLab (Ferreira et al., 2013) physically measured the power consumption of individual components. By creating a lab setup in which each component in a single server node is physically measured, attaching sensors to each power cable found to be the primary supply of each investigated component, they were able to isolate the contribution of each component to the overall power consumption under various stress tests. This research shows clear promise for investigating the power consumption of individual components on a hardware level; however, this project endeavours to incorporate each component into a system-level estimation model that can use measurements of just the overall power consumption, OS-data and PMCs. Nevertheless, the distribution of power consumption amongst the components in that research can serve as a guide to selecting the subsets of data worth exploring. The main result showed that the primary power consumers in the tested server node are the CPUs. Therefore this component is used in each investigated subset and separately, for comparison. The memory consumed a relatively high amount of power but showed little variation under stress compared to the baseline values. The CPU and Memory combination is widely used in PCE research and is therefore included in this research in order to make a proper comparison to other work, but is expected to perform worse than the remaining combined subsets. The remainder of the power consumption is attributed to "the disk, the network, the I/O and peripherals, the power supplies, the regulators and the rest of the glue and circuitry in the server." (Ferreira et al., 2013). Measurements from OS-data of the first three of these elements are included in the data set used for this research, combined under the label 'I/O data'. The individual elements hardly show up in the big picture; as a set, however, they were shown to contribute nearly 30% of the power consumed (Ferreira et al., 2013). This leads to the conclusion that in order to make an accurate PCE, the I/O data should be included in the model. The resulting combinations are CPU and I/O data, to check whether it performs better than the CPU and Memory set, and finally the Full Set including CPU, Memory and I/O data, which should perform best.

In this project, focus is placed on evaluating the effects of utilising different sets of variables within the GMM model used for power consumption estimation of a Physical Machine (PM) server cluster. The accuracy of a model is a normalised evaluation measure which can be employed to compare the performance of different models in making power consumption estimations. In order to evaluate and incorporate the results of the feature investigation into the broader perspective of the power consumption estimation model, and to describe how they relate to the practical application in energy-aware system design, the framework used in the power measuring and profiling survey by Chen & Shi (2012) is employed.


3 Method.

The primary goal is to show the effect of using different subsets of the data set for PCE in GMM. The main hypothesis is that the GMM is more precise and has lower RMSE in PCE when I/O data are included in the data set. The secondary hypothesis is that the GMM is more precise and has lower RMSE in PCE than the Linear model. The data set is therefore divided into four combinations, which will be further elaborated upon in section 3.3, to map the difference between those combinations when used to train and validate both the GMM and Linear model. To verify that GMM is indeed more effective than the Linear model, the results of both models on the same combinations of the data set will be compared.

Component   Model                 Capacity
CPU         Intel Core i7-3517U   3GHz
Memory      DDR3 (SODIMM)         8GB
GPU         GT610M                1GB
Storage     SSD                   32GB
            HDD                   1TB

Table 1: Hardware configuration of the testing computer, a Lenovo IdeaPad U410 (2012 ed., OS: Win7)

The specifications of the primary machine on which the computations were executed are listed in table 1. Python (2.9.7.0) was used to run both the GMM and Linear algorithms. The computation time for both models, as shown further on in table 7, would probably be lower on a more powerful computer, but this was the strongest machine available at the time. The predictions made on the data were calculated using the pypr package (Petersen, 2010) for Python. For a full description of the package, see its website, listed in the References section. All applications and packages used in this experiment are open-source and available online.

To investigate the accuracy of each subset, a fixed algorithm was constructed, for both the GMM and the Linear model, which were executed using each subset of the data (see section 3.3) for a given number of runs (see 3.4).

In this chapter, a brief theoretical overview of both models is presented, as well as a description of the data and fixed parameters used in the execution of the models. Concluding this chapter is a description of both algorithms constructed for these experiments, with the aid of pseudo-code to accurately show which operations take place at what point in the algorithms.

3.1 Evaluation measures

Set          Portion
Training     60%
Test         20%
Validation   20%

Table 2: Data division

To make a (semi-)supervised machine learning application generate a predicted value, a training-set, a test-set and a validation-set of the data are required. A training-set is a labelled set of historic data, which the algorithm can utilise in order to make an estimation of new values, given the historic data in the training-set. The algorithm then uses a test-set to fine-tune the model, based on the accuracy of predictions made with the trained model, now using the data in the test-set. It measures the error, or cost (J), of the outcome as the difference between the predicted value and the actual measured value. Finally, the fine-tuned model is validated using the validation-set, to check its accuracy on a third (new) set of data, again measuring J. The ratio with which the data is divided into these three subsets can be found in table 2.

The overall accuracy of a model over a data set is measured with the Root Mean Squared Error (RMSE), which is easiest to explain using a simple Linear model, wherein it is calculated as follows.

First, the hypothetical value $h_\theta(x^{(i)})$ is generated from the historic data, where $x^{(i)}$ is the value of x for the i-th entry (row) in the data and $\theta$ is the gradient of the model. In the simple example, this is done by measuring the offset $\theta_1 x^{(i)}$ from the x-axis baseline $\theta_0$, resulting in a gradient for the Linear prediction.

$$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$$

Then for each entry, the measured value $y^{(i)}$ is subtracted from the hypothetical value $h_\theta(x^{(i)})$ and the result is squared. The sum over all entries is divided by the number of entries, leaving the mean value. Finally, the square root of the resulting value is taken so that no negative values are registered.

$$J^{(i)} = h_\theta(x^{(i)}) - y^{(i)}$$

$$\mathrm{RMSE}(\theta_0, \theta_1) = \sqrt{\frac{1}{m}\sum_{i=1}^{m} \left(J^{(i)}\right)^2}$$

The cost function for the GMM is different, because it uses Expectation-Maximisation of a Mixture of Gaussians, which will be elaborated upon in section 3.2. The resulting estimation can, however, still be evaluated with the RMSE function (by subtracting the measured value from the estimated value). This way, both models share a unified evaluation measure, enabling a fair comparison.
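As a concrete illustration of this shared evaluation measure, the short NumPy sketch below computes the RMSE of a vector of predictions against measured power values; the array names and numbers are hypothetical and not taken from the thesis code.

```python
import numpy as np

def rmse(predicted, measured):
    """Root Mean Squared Error between predicted and measured power values."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    cost = predicted - measured           # J^(i) per entry
    return np.sqrt(np.mean(cost ** 2))    # square root of the mean squared cost

# Hypothetical predictions vs. measured power (in Watts)
y_pred = np.array([240.1, 238.7, 251.3, 244.9])
y_true = np.array([243.0, 241.2, 249.8, 246.5])
print(rmse(y_pred, y_true))
```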

∆-level   percentage
1         1%
2         5%
3         10%

Table 3: ∆-levels for the success rate

Additionally, the success rate is calculated, which is expressed as a percentage of all entries for which a PCE was calculated. Three ∆-levels were chosen (see table 3) to depict the success rate of the models for each subset on each node. The Linear Model will not be discussed in further detail, as it is a widely used standard approach. The Gaussian Mixture Model, however, might need some additional explanation. It is not only a less used model in PCE research, but the additional theoretical framework will aid in determining the crucial differences between the two models.
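A minimal sketch of the success-rate measure, under the assumption that a prediction counts as a success when its absolute relative deviation from the measured value stays within the ∆-threshold; variable names are illustrative only.

```python
import numpy as np

def success_rate(predicted, measured, delta):
    """Fraction of entries whose relative error is within the delta threshold."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    relative_error = np.abs(predicted - measured) / measured
    return np.mean(relative_error <= delta)

y_pred = np.array([240.1, 238.7, 251.3, 244.9])
y_true = np.array([243.0, 241.2, 249.8, 246.5])
for delta in (0.01, 0.05, 0.10):          # the three delta-levels of table 3
    print(delta, success_rate(y_pred, y_true, delta))
```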

3.2 Gaussian Mixture Model

Please note: this section contains excerpts from the documentation of the pypr package (Petersen, 2010). To maintain the readability of the text, no quotation marks were used. Additional explanatory content is added, but some parts are direct quotes from the documentation of the package.

A multivariate Gaussian distribution is a generalisation of the one-dimensional Gaussian distribution to multiple dimensions (Petersen, 2010). The benefit of using a Gaussian distribution over a (multi-variate) Linear model is that it does not assume that the variables are independent. That assumption would lean more towards a naive Bayesian model, where (1) is assumed, with $\mathcal{L}$ the likelihood, the product of the probabilities of all $x_n$ over n.

$$\mathcal{L} = \prod_{n} p(x_n) \qquad (1)$$

The number of clusters (K) is fixed (see section 3.4), so the parameters the model needs to generate are the means and covariances of the distributions. In the GMM used in this experiment, the Expectation Maximisation (EM) algorithm is employed. The EM algorithm is an iterative refinement algorithm used for finding maximum likelihood estimates of parameters in probabilistic models. The likelihood is a measure of how well the data fits a given model, and is a function of the parameters of the statistical model (Petersen, 2010).

The multi-variate Gaussian distribution takes the covariance matrix into account, which also allows for a conditional distribution function. Note: the covariance matrix must be a non-negative definite matrix, otherwise the package will produce an error. Properly reshape the entry data and replace any Not-a-Number (NaN) values with ones.
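A small sketch of the preprocessing hinted at above, assuming the raw measurements arrive as a flat array; the reshaping to (samples, features) and the replacement of NaN entries with ones follow the note in the text, and the shapes and values used here are placeholders.

```python
import numpy as np

def prepare_matrix(raw, n_features):
    """Reshape flat measurements to (samples, features) and replace NaNs with ones."""
    X = np.asarray(raw, dtype=float).reshape(-1, n_features)
    X[np.isnan(X)] = 1.0   # NaN entries would make the covariance matrix invalid
    return X

raw = [0.31, 0.12, np.nan, 0.29, 0.10, 243.0]   # hypothetical cpu, mem, power readings
print(prepare_matrix(raw, 3))
```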

Under the independence assumption, the advantages of a Gaussian mixture are rendered useless; the power of this model lies in the dependence and co-dependence of variables across the data set. As a mixture of Gaussians is used as the model, we can write

$$p(x_n) = \sum_{k \in K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\, p(k) \qquad (2)$$

where $\mathcal{N}$ is the density function. The mixture comes from the summation over all clusters in K. In contrast to, for instance, the K-means algorithm, the EM algorithm for a Gaussian mixture does not assign each sample to only one cluster. Instead, it assigns each sample a set of weights representing the sample's probability of membership of each cluster (note: the sum of all weights must be 1). This can be expressed with a conditional probability, p(k|n), which, given a sample n, gives the probability of the sample being drawn from a certain cluster k (Petersen, 2010).

$$p_{nk} = p(k \mid n) = \frac{p(x_n \mid k)\, p(k)}{p(x_n)} = \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)\, p(k)}{p(x_n)} \qquad (3)$$

Given the data and the model parameters $\mu_k$, $\Sigma_k$, and p(k), we can now calculate the likelihood $\mathcal{L}$ and the probabilities $p_{nk}$. This is computed in the expectation (E) step of the EM algorithm.

In the maximisation (M) step we estimate the means (4), covariances (5) and mixing coefficients p(k) (6). As each point has a probability $p_{nk}$ of belonging to a cluster, we have to weight each sample's contribution to the parameter with that factor. The following equations are used to estimate the new set of model parameters.

$$\hat{\mu}_k = \frac{\sum_n p_{nk}\, x_n}{\sum_n p_{nk}} \qquad (4)$$

$$\hat{\Sigma}_k = \frac{\sum_n p_{nk}\,(x_n - \hat{\mu}_k) \otimes (x_n - \hat{\mu}_k)}{\sum_n p_{nk}} \qquad (5)$$

$$\hat{p}(k) = \frac{1}{N}\sum_n p_{nk} \qquad (6)$$
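The update equations (3)-(6) can be written down directly in NumPy. The sketch below performs a single E-step and M-step on a toy data set; it is a simplified illustration of the EM updates, not the pypr implementation used in the experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # toy data: 200 samples, 2 features
K = 3                                    # number of clusters

# Initial parameters: random means, identity covariances, uniform mixing coefficients
mu = X[rng.choice(len(X), K, replace=False)]
sigma = np.array([np.eye(2) for _ in range(K)])
p_k = np.full(K, 1.0 / K)

# E-step: responsibilities p_nk = N(x_n | mu_k, Sigma_k) p(k) / p(x_n)   (eq. 3)
dens = np.column_stack([multivariate_normal.pdf(X, mu[k], sigma[k]) for k in range(K)])
num = dens * p_k
p_nk = num / num.sum(axis=1, keepdims=True)

# M-step: weighted updates of means, covariances and mixing coefficients (eqs. 4-6)
Nk = p_nk.sum(axis=0)
mu = (p_nk.T @ X) / Nk[:, None]                                  # eq. 4
for k in range(K):
    diff = X - mu[k]
    sigma[k] = (p_nk[:, k, None] * diff).T @ diff / Nk[k]        # eq. 5
p_k = Nk / len(X)                                                # eq. 6
```

Iterating these two steps until the likelihood stops improving yields the fitted mixture parameters.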

These parameters are then used to generate the conditional distribution. The conditional distribution takes the centroids ($\hat{\mu}_k$), the combined covariances ($\hat{\Sigma}_k$) and the cluster probabilities ($\hat{p}(k)$) and generates, firstly, the conditional centroids, secondly, the conditional combined covariances and, lastly, the conditional p(k)'s, which are called the mixing coefficients. Based on the conditional distribution, the value for power consumption with the highest conditional probability can be calculated, which is the output estimation (or prediction) of the model. To sum up: for the GMM, the data is clustered with EM and the resulting clusters are used as a Gaussian mixture to make predictions based on the input features.
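To make the conditioning step concrete, the sketch below conditions a fitted Gaussian mixture on the observed feature values and returns the mixture-weighted conditional mean as the power estimate. It mirrors the cond_dist idea described above in plain NumPy; the partitioning assumes the power value is the last column of the joint data, the example parameters are invented, and the whole thing is an illustrative reimplementation rather than the pypr code used in the thesis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_power(x_features, centroids, covariances, mix_coeffs):
    """Estimate power for one feature vector from a GMM fitted on [features, power]."""
    d = len(x_features)                  # power is assumed to be the last dimension
    cond_means, weights = [], []
    for mu, cov, pk in zip(centroids, covariances, mix_coeffs):
        mu_x, mu_y = mu[:d], mu[d:]
        cov_xx, cov_yx = cov[:d, :d], cov[d:, :d]
        # Conditional mean of power given the observed features, for this cluster
        cond_mu = mu_y + cov_yx @ np.linalg.solve(cov_xx, x_features - mu_x)
        cond_means.append(cond_mu)
        # Conditional mixing coefficient: prior weight times feature likelihood
        weights.append(pk * multivariate_normal.pdf(x_features, mu_x, cov_xx))
    weights = np.array(weights) / np.sum(weights)
    return float(np.dot(weights, np.array(cond_means).ravel()))

# Invented two-component example: one feature (CPU usage) jointly modelled with power
centroids = [np.array([0.5, 110.0]), np.array([0.8, 150.0])]
covariances = [np.array([[0.04, 0.5], [0.5, 25.0]]),
               np.array([[0.05, 0.6], [0.6, 30.0]])]
mix_coeffs = [0.6, 0.4]
print(predict_power(np.array([0.7]), centroids, covariances, mix_coeffs))
```

The centroids, covariances and mixing coefficients would normally come from the EM fit on the training data (for example the step sketched above), with the training matrix arranged as [features, power].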


3.3 Data

Group          # of nodes   Processor                                               Memory     Storage     NIC              GPU
GPU-nodes      7            Dual E5-2620 (2GHz), 12 cores                           64GB RAM   2*1TB HDD   IB and GbE NIC   7 K20m, 1 Phi
Hadoop-nodes   8            Dual E5-2620 (2GHz), 12 cores, hyperthreading enabled   64GB RAM   2*1TB HDD   IB and GbE NIC   no

Table 4: Configuration of target nodes in UvA's DAS-4 server cluster (Zhu et al., 2013)

Type of resource feature   Corr(feature, power consumption)
CPU usage                  0.89
Memory usage               0.79
Disk I/O-time              0.48 (sda), 0.49 (sdb)
Disk read speed            0.46 (sda), 0.46 (sdb)
Disk write speed           0.38 (sda), 0.38 (sdb)
NIC incoming speed         0.37 (eth0), 0.36 (eth1)
NIC outgoing speed         0.38 (eth0), 0.18 (eth1)
All page fault             0.28
Major page fault           0.27

Table 5: Correlation of the resource features and power consumption, obtained by measurements of one typical node in the DAS-4 cluster. Some feature types were measured on two components (HDDs and NICs), the correlations of which are given separately (Zhu et al., 2013).

The data set used for this investigation is a 90-day record of component measurements (OS-data and PMCs) taken from the UvA's DAS-4 cluster, containing eight Hadoop nodes (I/O-intensive tasks) and seven GPU nodes (a wide range of tasks), supplied by H. Zhu and dr. P. Grosso (Zhu et al., 2013). The data is constructed in the following form. Firstly, there is a time-stamp indicating the moment of measuring. Two CPU values are measured (cpu-sys and cpu-user), which are combined under 'CPU usage' and used for the CPU subset test and all combined subsets. Additionally, the memory (mem) use is measured, which combined with the CPU subset creates the CPU+Mem subset. Subsequently, the following I/O values are measured: HDD (sda, sdb) and Network Interface Cards (NICs) (eth0:net in, eth0:net out, ib0:net in, ib0:net out). These combined with the CPU subset create the CPU+I/O subset. Finally, the overall power consumption is measured (power). This measured power consumption is used to compute the accuracy of the predictions made by both models. Table 5 shows that the correlation coefficient of CPU usage and power consumption is close to 1. However, the correlation coefficients for the I/O components are much lower. This could indicate a non-linear relation between the power consumption and the I/O data, or that the I/O components consume a relatively low proportion of the total power consumed.
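As an illustration of how the four feature subsets could be assembled, the sketch below selects column groups from a pandas DataFrame. The column names are inferred from the description above (with underscores instead of spaces) and should be treated as assumptions about the actual CSV layout, as should the file name in the usage comment.

```python
import pandas as pd

# Hypothetical column names, following the description of the measurement records
CPU = ["cpu-sys", "cpu-user"]
MEM = ["mem"]
IO = ["sda", "sdb", "eth0:net_in", "eth0:net_out", "ib0:net_in", "ib0:net_out"]

SUBSETS = {
    "CPU": CPU,
    "CPU+Mem": CPU + MEM,
    "CPU+I/O": CPU + IO,
    "Full Set": CPU + MEM + IO,
}

def load_subset(csv_path, subset_name):
    """Return the feature matrix X and the measured power y for one feature subset."""
    df = pd.read_csv(csv_path)
    X = df[SUBSETS[subset_name]].to_numpy()
    y = df["power"].to_numpy()
    return X, y

# Example usage (hypothetical file name):
# X, y = load_subset("node079.csv", "Full Set")
```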

Node078 through node085 are GPU nodes and node086 through node093 are Hadoop nodes. Due to the long computation time and the unavailability of additional hardware to run the calculations on, a sample of three nodes of each node type was taken. The nodes chosen are those which showed the strongest indication of clearly portraying the investigated effects for both the Linear model and the GMM after initial experimental runs. The selected GPU nodes are node079, node080 and node084. The selected Hadoop nodes are node086, node087 and node091.

3.4 Parameters

In order to compare the data between nodes, several variables were fixed after initial experimentation. The fixed parameters are listed in table 6. These parameters are only relevant when reproducing this work. Several different clustering algorithms can be employed in the clustering step of the pypr GMM package; here the k-means algorithm was chosen because it proved most effective after initial experimentation. The choice of clustering algorithm is communicated through the cluster initialisation parameters in cluster_init_kw.

Table 6: Fixed parameters used in the pypr package Python code. See the package documentation (Petersen, 2010) for the exact placement in the code.

(a) GMM parameters

Parameter           Value
Runs per subset     30
# of clusters (K)   20
max_iterations      800
max_tries           10
init_kw             cluster_init_kw

cluster_init_kw
cluster_init        kmeans
max_init_iter       5
cov_init            var
verbose             False

(b) Linear model parameters

Parameter           Value
Runs per subset     50
# of iterations     3000
α                   0.5
λ                   10
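For reproduction purposes, a sketch of how these fixed parameters could be passed to the pypr GMM routine is given below. The keyword names are transcribed from table 6 rather than verified against the package, so both the call signature and the dictionary keys should be treated as assumptions to be checked against the pypr documentation (Petersen, 2010).

```python
import numpy as np
import pypr.clustering.gmm as gmm  # pypr package (Petersen, 2010)

# Placeholder training matrix standing in for one node/subset combination
X = np.random.rand(500, 10)

# Cluster-initialisation keywords, transcribed from table 6(a); treat as assumptions
cluster_init_kw = {"cluster_init": "kmeans", "max_init_iter": 5,
                   "cov_init": "var", "verbose": False}

# K = 20 clusters, at most 800 EM iterations, at most 10 restarts
cen_lst, cov_lst, p_k, logL = gmm.em_gm(X, K=20, max_iter=800, max_tries=10,
                                        init_kw=cluster_init_kw)
```

The returned centroids, covariances and mixing coefficients are then the inputs to the conditional-distribution step (cond_dist) outlined in Algorithm 2.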


3.5 Algorithm pseudo code

In order to ensure the reader has a clear understanding of the proceedings in the experiments conducted, a brief and simplified version of the code is given in the form of pseudo-code. It combines all the theoretical elements into a practical experimental setup.

The parameters listed in the previous section are incorporated into the pseudo-code. First, the Linear Model code will be discussed, subsequently, the GMM code will be elaborated upon.

The Linear Model (see Algorithm 1) is relatively simple, in that it uses only the gradient descent package to teach the model, under the assumption that all features have independent values. The algorithm starts by listing the nodes it needs to work through.

After initial experimentation on all nodes for 1-5 runs, a selection of 3 GPU nodes and 3 Hadoop nodes was made, listed in the pseudo-code as Nodes. Then a definition of the combinations of data for the four subsets is given; these are listed as Combinations. The data set is loaded from a *.csv file, randomly shuffled and divided into the training-, test- and validation-sets in the getShuffledData function (line 11). The Linear model uses Gradient Descent for Linear Regression to minimise the cost of the model as a whole (line 17). The model is trained on the training-set data. It is tested on the test-set, and also tested on the training-set to ensure the model was properly trained. (Note: if the training-set shows a high RMSE after learning, then the parameters need to be (re-)adjusted.) Based on the learned model, predictions are made for the training-, test- and validation-set (line 21). The cost of each set is denoted by the Error value in the code (line 24), meaning the absolute value of the remainder of subtracting the predicted value from the real measured value. Using these cost values, the RMSE for each run is calculated and added to the sum of RMSEs for that combination, for that data set (line 27). The sums of the RMSE of all three data sets are averaged (line 39), first over the number of entries per run and subsequently over the number of runs. Finally, the success rate is calculated (line 29) for the three given ∆-levels as thresholds for the difference between the predicted value and the real measured value. The success rate for each ∆-level is averaged in the same manner as the cost (line 40). The average cost and average success rates are then written to the output file after each combination. A new output file is created for each node. Optionally, a time check could be inserted to clock the performance of the model on the particular machine running the code. The model runs over each node, for every combination, for a set number of runs.


Algorithm 1 Linear Model

 1: Combinations ← [0, 1, 2, 3]
 2: for all Nodes do
 3:     Create output file
 4:     for all Combinations do
 5:         Runs ← 50
 6:         Iterations ← 3000
 7:         α ← 0.5
 8:         λ ← 10
 9:         ∆ ← [0.01, 0.05, 0.1]
10:         for all Runs do
11:             function GETSHUFFLEDDATA(Node, Combination)
12:                 TrainingSet ← Data[0-60%]
13:                 TestSet ← Data[60-80%]
14:                 ValidationSet ← Data[80-100%]
15:                 return TrainingSet, TestSet, ValidationSet
16:             end function
17:             function GRADDESCENT(TrainingSet(X), TrainingSet(Y), α, Iterations, λ)
18:                 return θ
19:             end function
20:             Sets ← [TrainingSet, TestSet, ValidationSet]
21:             for all Sets do
22:                 function PREDICTION(Set(X), Set(Y), θ)
23:                     PredY ← Set(X) ⊗ θ
24:                     Error ← PredY - Set(Y)
25:                     return PredY, Error
26:                 end function
27:                 SumSetError ← SumSetError + Error
28:             end for
29:             for all ∆ do
30:                 function COUNTSUCCES(PredY, Set(Y), ∆)
31:                     if (PredY - Set(Y))/Set(Y) ≤ ∆ then
32:                         CountSet∆ ← CountSet∆ + 1
33:                     end if
34:                     SuccesRateSet∆ ← CountSet∆ / length(Set(Y))
35:                     SumSuccesRateSet∆ ← SumSuccesRateSet∆ + SuccesRateSet∆
36:                 end function
37:             end for
38:         end for
39:         AvgError ← ErrorSum / Runs
40:         AvgSuccesRateSet∆ ← SumSuccesRateSet∆ / Runs
41:     end for
42: end for
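For readers who prefer runnable code over pseudo-code, the following NumPy sketch mirrors the structure of Algorithm 1 for a single node/combination: shuffle and split, regularised gradient descent with the α and λ values of table 6, prediction, RMSE and success rates. It is a simplified re-implementation on synthetic data, not the original experiment code.

```python
import numpy as np

def shuffle_split(X, y, rng):
    """Shuffle the data and split it 60/20/20 into training-, test- and validation-sets."""
    idx = rng.permutation(len(y))
    X, y = X[idx], y[idx]
    n1, n2 = int(0.6 * len(y)), int(0.8 * len(y))
    return (X[:n1], y[:n1]), (X[n1:n2], y[n1:n2]), (X[n2:], y[n2:])

def gradient_descent(X, y, alpha=0.5, iterations=3000, lam=10.0):
    """Regularised gradient descent for linear regression (bias term added internally)."""
    Xb = np.hstack([np.ones((len(y), 1)), X])
    theta = np.zeros(Xb.shape[1])
    m = len(y)
    for _ in range(iterations):
        grad = Xb.T @ (Xb @ theta - y) / m
        grad[1:] += lam / m * theta[1:]      # do not regularise the bias term
        theta -= alpha * grad
    return theta

def evaluate(X, y, theta, deltas=(0.01, 0.05, 0.10)):
    """Return the RMSE and the success rates at the given delta-levels."""
    pred = np.hstack([np.ones((len(y), 1)), X]) @ theta
    rmse = np.sqrt(np.mean((pred - y) ** 2))
    rates = {d: np.mean(np.abs(pred - y) / y <= d) for d in deltas}
    return rmse, rates

# Hypothetical usage with synthetic data standing in for one node/subset combination
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = X @ np.array([30.0, 10.0, 5.0, 2.0]) + 100 + rng.normal(scale=2.0, size=1000)
train, test, val = shuffle_split(X, y, rng)
theta = gradient_descent(*train)
print(evaluate(*val, theta))
```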


The GMM algorithm (see Algorithm 2) also runs over a fixed set of nodes, for a fixed set of combinations, for a fixed number of runs. It keeps track of the RMSE and the success rate and writes them to the output file, the same as the Linear model. The ∆-levels are the same, to make sure we can compare the models. Instead of defining α- and λ-values, this algorithm defines only the parameter for the number of cluster centroids, for use in the EM's k-means clustering element.

However, the GMM is slightly more complex than the Linear Model in the sense that it does not simply assign a power consumption value to a combination of features, as is done in the Linear Regression Gradient Descent method. Instead, it assigns mixture weights to each of the conditional centroids, resulting in an estimation based on probabilities derived from the Gaussian models, constructed by applying Expectation Maximisation on the training-set data's distribution and covariance matrices (line 15). As discussed in section 3.2, the GMM can use the centroids of the (k-means) clustering, the covariance matrix and the mixing coefficients provided by the EM_GM method to generate the conditional distribution for both the test-set and the validation-set (line 21). Using the conditional centroids (centroid_cond) and the conditional mixing coefficients (mc_cond) generated by the conditional distribution method, the model can estimate the power consumption values for each entry in both the test- and the validation-sets (line 22). These estimated values are then compared to the actual measured power consumption values and the RMSE is calculated (line 23).

Both the RMSE and success rates are summed and averaged in the same way as the Linear Model algorithm.

After both models were run across all nodes, six nodes were chosen on which to run the GMM code with 30 runs per subset. The results are discussed in the subsequent section.


Algorithm 2 GMM

 1: Combinations ← [0, 1, 2, 3]
 2: for all Nodes do
 3:     Create output file
 4:     for all Combinations do
 5:         Runs ← 30
 6:         K ← 20
 7:         ∆ ← [0.01, 0.05, 0.1]
 8:         for all Runs do
 9:             function GETSHUFFLEDDATA(Node, Combination)
10:                 TrainingSet ← Data[0-60%]
11:                 TestSet ← Data[60-80%]
12:                 ValidationSet ← Data[80-100%]
13:                 return TrainingSet, TestSet, ValidationSet
14:             end function
15:             function EM_GM(TrainingSet(X), K, Iterations=800, clustering=kmeans)
16:                 return centroids, ccov, p_k, logL
17:             end function
18:             Sets ← [TestSet, ValidationSet]
19:             for all Sets do
20:                 function PREDICTION(Set(X), Set(Y), θ)
21:                     centroid_cond, mc_cond ← cond_dist(centroids, ccov, p_k)
22:                     PredY ← centroid_cond * mc_cond
23:                     Error ← PredY - Set(Y)
24:                     return PredY, Error
25:                 end function
26:                 SumSetError ← SumSetError + Error
27:                 for all ∆ do
28:                     function COUNTSUCCES(PredY, Set(Y), ∆)
29:                         if (PredY - Set(Y))/Set(Y) ≤ ∆ then
30:                             CountSet∆ ← CountSet∆ + 1
31:                         end if
32:                         SuccesRateSet∆ ← CountSet∆ / length(Set(Y))
33:                         SumSuccesRateSet∆ ← SumSuccesRateSet∆ +
34:                             SuccesRateSet∆
35:                     end function
36:                 end for
37:             end for
38:         end for
39:         AvgError ← ErrorSum / Runs
40:         AvgSuccesRateSet∆ ← SumSuccesRateSet∆ / Runs
41:     end for
42: end for
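Analogously, a runnable end-to-end sketch of the GMM pipeline of Algorithm 2 is given below, using scikit-learn's GaussianMixture as a stand-in for the pypr routines actually used in the thesis; the conditioning step is re-implemented by hand (as in the sketch of section 3.2) because scikit-learn provides no cond_dist. K = 20 and the 800-iteration cap follow table 6, and the data is synthetic.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def conditional_power(x, means, covs, weights):
    """Mixture-weighted conditional mean of the last dimension (power) given features x."""
    d = len(x)
    mus, ws = [], []
    for mu, cov, w in zip(means, covs, weights):
        cxx, cyx = cov[:d, :d], cov[d:, :d]
        mus.append(mu[d] + cyx @ np.linalg.solve(cxx, x - mu[:d]))
        ws.append(w * multivariate_normal.pdf(x, mu[:d], cxx))
    ws = np.array(ws) / np.sum(ws)
    return float(ws @ np.array(mus).ravel())

# Synthetic stand-in for one node/subset combination: features plus measured power
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = X @ np.array([30.0, 10.0, 5.0, 2.0]) + 100 + rng.normal(scale=2.0, size=1000)
joint = np.column_stack([X, y])
n_train = int(0.6 * len(y))

# Fit a 20-component mixture on the joint training data (K = 20, as in table 6)
gm = GaussianMixture(n_components=20, max_iter=800, random_state=0).fit(joint[:n_train])

# Predict power for the held-out entries and evaluate RMSE and the delta-1% success rate
pred = np.array([conditional_power(row, gm.means_, gm.covariances_, gm.weights_)
                 for row in joint[n_train:, :4]])
true = y[n_train:]
print("RMSE:", np.sqrt(np.mean((pred - true) ** 2)))
print("Success rate (1%):", np.mean(np.abs(pred - true) / true <= 0.01))
```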


3.6 Results

Model    Average run-time   # of runs
GMM      2:13:10            30
Linear   0:09:37            50

Table 7: Average runtime for each subset of features (h:mm:ss)

Table 7 shows the average run time per subset for both models. The high run time of the GMM explains why the number of runs was capped at 30 for the GMM, while the Linear model allowed more runs per execution per subset due to its lower run time.

In this section, the GPU node results are analysed first, comparing the Linear model and the GMM for the different subsets of features. After that, the Hadoop node results are discussed. Subsequently, the differences between the node types are explained.

We then continue with the success rate results, averaged over the three nodes of each node type, depicting the differences between the models on both node types. Subsequently, a brief analysis of the differences is provided and finally the overall conclusion of the results is drawn.

Examining the bar charts in figure 1, each showing the feature subsets (for the different combinations) on the y-axis and the corresponding average cost on the x-axis, the pattern of differing accuracy across the different subsets becomes apparent for both models. Figure 1a shows three bar charts for the three selected GPU nodes. Figure 1b shows three bar charts for the three selected Hadoop nodes. Each node is depicted separately to show the similarities, but also the differences, in each node for both models. Note that the first GPU node (node079) has a much higher average measured power consumption (with the corresponding standard deviation given) than the other two GPU nodes, but the cost values are very similar for both models. Keep in mind that the RMSE score needs to be as low as possible: in figure 1, the smaller the bar, the better. The bars are grouped per model for all subsets in a node. The name of the subset is inside the bar and the RMSE value for that subset is shown behind the bar. The only difference between the GPU nodes and the Hadoop nodes is the scale of the x-axis. Both node types show similar patterns, however both models performed significantly better on the Hadoop nodes.

It is clear that for both node types, the GMM performs at least twice as well on the RMSE scores for the Full Set. This confirms, at a single glance, the second part of the secondary hypothesis: the GMM has a lower RMSE in PCE than the Linear model. The first part, concerning precision, will be evaluated with the use of the success rate scores in the following section. The primary hypothesis, stating that the GMM is more precise and has a lower RMSE in PCE when I/O data are included in the data set, requires a closer look. When using the Full Set (CPU+Mem+I/O), the RMSE is significantly lower for both models and in both node types. However, if we look at the CPU+I/O subsets, we see that the RMSE generally varies little from the CPU and CPU+Mem subsets in the GPU nodes. In contrast, the Hadoop nodes' RMSE when using the GMM is about two Watts lower on average than when using only CPU, and also about one Watt lower on average than with the CPU+Mem subset. So the second part of the primary hypothesis can be confirmed for the Hadoop nodes, but is rejected for the GPU nodes: the use of I/O data with only CPU does not give a significant improvement there, and it even scored a higher RMSE in node084. The Full Set did, however, show a significant improvement over the CPU subset in node084, indicating a possible linking capacity of the Memory data between the CPU and I/O data.

Worth mentioning is the slight decrease of the RMSE when using I/O data in the Linear model for the Hadoop nodes. This indicates that, even under the assumption of independence of values, the I/O data still contributes to a better prediction than the memory data.

Figure 1: Average RMSE scores for the six selected nodes. Additionally, the mean value and standard deviation of each node's power data are given. The underlying values (RMSE of the power prediction, in Watts) are:

(a) GPU nodes
Node      Power data                Model    CPU       CPU+Mem   CPU+I/O   Full Set
node079   p̄ = 243.06, σ = 20.51    Linear   20.3123   20.3944   20.168    13.6463
                                    GMM      14.7936   15.7233   14.8354   7.2778
node080   p̄ = 144.07, σ = 20.77    Linear   20.4873   20.4574   20.3584   11.8245
                                    GMM      15.6752   15.0762   15.1938   5.4517
node084   p̄ = 141.90, σ = 20.33    Linear   19.7908   19.7570   19.711    11.4932
                                    GMM      17.0531   17.3107   21.6354   7.5079

(b) Hadoop nodes
Node      Power data                Model    CPU       CPU+Mem   CPU+I/O   Full Set
node086   p̄ = 243.06, σ = 20.51    Linear   10.9710   10.8508   10.0460   6.4829
                                    GMM      9.9206    8.7041    7.6723    2.9536
node087   p̄ = 124.31, σ = 14.34    Linear   10.5718   9.4867    9.2022    6.2956
                                    GMM      10.0338   8.2426    7.3113    2.8823
node091   p̄ = 124.27, σ = 13.37    Linear   10.8451   10.4051   9.8854    7.1588
                                    GMM      8.7807    7.4768    6.6117    2.6870


The primary difference between the two node types is the nature of the tasks they perform. GPU nodes perform mostly CPU-intensive tasks with a relatively low number of I/O events. Hadoop nodes perform I/O-intensive tasks, such as Big Data transfers. The internal management structure of the Hadoop nodes allows for load balancing amongst the Hadoop nodes within a cluster. This means that the CPU of each node is equally stressed, but it also increases the number of I/O events even further, because the processed data has to be communicated amongst the nodes. That is why the effect of I/O data is much clearer in the Hadoop nodes than in the GPU nodes. The use of the Full Set with the GMM in GPU nodes is still a significant improvement over the CPU and CPU+Mem subsets, and even when using a Linear model, the collection of I/O data results in a better PCE. The effect is greater in the Hadoop nodes, but still very much present in the GPU nodes as well.

To validate the first part of the two hypotheses, the increase in precision when using the GMM, an evaluation of the success rates is in order.

Figure 2 shows a bar chart of the average success rate over each node type for both models. The success rate is measured as discussed in section 3, with the ∆-levels shown in table 3. The bars are paired to compare the two models (top = Linear, bottom = GMM) for each ∆-level within each subset of features. For the bars in figure 2 the following rule applies: the larger the bar, the better. A larger bar indicates a higher precision, measured with that ∆-level as a threshold. The ∆-levels are shown inside the bars, the success rates behind them. Both charts are scaled identically.

Recapping the hypotheses, we need to check whether the GMM is more precise than the Linear model and whether the use of I/O data in the GMM improves the precision. The GMM is more precise than the Linear model for all subsets in the Hadoop nodes, and for all but one subset in the GPU nodes. The odd one out is the CPU subset at the 5% and 10% levels, which shows a lower success rate for the GMM than for the Linear model. There is, however, an increase of 375% in the success rate at the 1% level, although a 19% success rate is still undesirable. The most important value is the Full Set ∆1% value, which is 70% for the GMM and only 4% for the Linear model. It can be stated that the GMM is generally more precise and has a lower RMSE in PCE than the Linear model, confirming the secondary hypothesis. To confirm the primary hypothesis, the same ∆-level bar for the GMM in each subset is compared. This shows a clear hierarchy moving from the lowest scores (for CPU) to the highest scores (for the Full Set). In both node types, each subsequent subset shows an increase in success rate over the previous subset. This means the CPU+I/O subset is more precise than both the CPU and CPU+Mem subsets, and the Full Set scores even higher over all ∆-levels. It can therefore be stated that the GMM is more precise and has a lower RMSE in PCE when I/O data are included in the data set, confirming the primary hypothesis without exception.


Figure 2: Average success rate per node type (for each pair of values: Linear vs. GMM, corresponding to the top and bottom bars). The underlying values are:

(a) Average success rates of the chosen GPU nodes
Subset     ∆-level   Linear   GMM
CPU        1%        4%       19%
           5%        84%      74%
           10%       94%      88%
CPU+Mem    1%        2%       28%
           5%        33%      78%
           10%       82%      90%
CPU+I/O    1%        2%       45%
           5%        35%      83%
           10%       83%      90%
Full Set   1%        4%       70%
           5%        83%      92%
           10%       93%      97%

(b) Average success rates of the chosen Hadoop nodes
Subset     ∆-level   Linear   GMM
CPU        1%        18%      46%
           5%        68%      75%
           10%       85%      88%
CPU+Mem    1%        25%      61%
           5%        74%      86%
           10%       90%      92%
CPU+I/O    1%        26%      69%
           5%        77%      91%
           10%       91%      94%
Full Set   1%        28%      87%
           5%        88%      98%
           10%       94%      99%

It is clear that the effect of utilising the I/O data on the success rate is also stronger in the Hadoop nodes than in the GPU nodes. This is in line with the GMM showing a lower average RMSE for the Full Set. It was previously established that a lot of I/O operations, i.e. Big Data transfers, occur in Hadoop nodes. This, in combination with the low RMSE, also explains the high success rate for the GMM in the Hadoop nodes compared to the GPU nodes. This, however, does not mean that it is less beneficial to acquire I/O data for the GPU nodes: the relative increase in average success rate with the use of I/O data is still significant in the GPU nodes, especially considering the ∆1% level.

The results from the conducted experiments indicate the added benefit of using I/O data in PCE, with both the Linear Model and, especially, the GMM. In the following, the portability and usability of the PCE application are discussed.

The simpler the resource features of a model are, the higher its usability (Zhu et al., 2013). There are several evaluation criteria on which a model can be tested, such as accuracy, simplicity, responsiveness and granularity (Chen & Shi, 2012). In this research, however, the focus is placed on accuracy and simplicity. The model needs to give an accurate estimation of power consumption while also maintaining a level of simplicity that makes it easy to implement. This being a trade-off, an equilibrium must be found between the aforementioned two main evaluation criteria. Given the relative decrease in RMSE and the higher success rates for the GMM with the use of I/O data, the PCE made with the Full Set will have an overall lower margin of error. In the resource planning done by server administrators, the aim is not solely to minimise the margin of error, but also to consider the (monetary) acquisition cost of the input data. It then becomes a case of what the administrator deems acceptable with regards to the margin of error in their resource planning. Since the monetary cost of acquiring I/O data and the level of acceptable error margins vary strongly per administrator and per server system, attempting to calculate an example situation would be futile. In short, this research shows that when considering a system-level PCE, using a GMM with CPU+Mem+I/O data is most accurate and most precise. What system administrators choose to do with that information is up to them to decide.


4 Conclusion.

The work presented here confirmed that the best way to make PCEs is with the GMM and the Full Set of the data, including CPU, Memory and I/O data. The GMM was shown to be significantly more effective than the Linear model based on both the average RMSE and the average success rates. The relative decrease in the average RMSE of the PCE, combined with the high precision at the ∆1% level, should suffice to persuade server administrators to at least consider calculating the (monetary) cost of I/O data acquisition for their server structure, as it could assist in creating more efficient resource planning schemas based on a more accurate PCE. The added benefit of considering I/O data in PCE is most apparent in the tested Hadoop nodes, due to the nature of the tasks they perform: Hadoop nodes generally have a high number of I/O events compared to the tested GPU nodes. However, PCE with the GMM for the GPU nodes still benefited strongly from the addition of I/O data, regardless of the nature of the tasks they perform.

5 Discussion & Future work.

Due to the long time required to run the GMM for each subset per node, not all nodes could be fully investigated. Preferably, all nodes in the cluster would have been tested for all combinations for a high number of runs. A lower number of runs could have been chosen, but then the randomised data showed too much variation per run, giving an unclear average. The average values seemed to stabilise after 20 runs; 30 runs made sure the values were representative for this feature investigation.

In future work, a more fine-grained investigation of the individual components could be carried out to assess the added benefit of, for instance, each column in the I/O data. This might be beneficial to research into power consumption minimisation on a hardware level on the manufacturing end, but the point of this investigation was to show that the full I/O set gives the best PCE using the GMM, not which of the individual components consumed the most power. The engineers at the SEFLab have already made much more accurate measuring setups for that purpose.

Portability of the GMM and Linear model has been shown over homogeneous nodes, and different node types within one DAS-4 cluster. Additional tests on a variety of server clusters should be run to determine the portability over server structures.


References

Ardagna, D., Panicucci, B., Trubian, M., & Zhang, L. (2012). Energy-aware autonomic resource allocation in multitier virtualized environments. IEEE Transactions on Services Computing, 5(1), 2–19. doi: 10.1109/TSC.2010.42

Berl, A., Gelenbe, E., Di Girolamo, M., Giuliani, G., De Meer, H., Dang, M. Q., & Pentikousis, K. (2010). Energy-efficient cloud computing. The computer journal, 53(7), 1045–1051.

Chen, H., & Shi, W. (2012). Power Measuring and Profiling: The State of Art. In Handbook of energy-aware and green computing(pp. 649–674).

Cook, G., & Van Horn, J. (2011). How dirty is your data? a look at the energy choices that power cloud computing. Greenpeace (April 2011).

Dhiman, G., Mihic, K., & Rosing, T. (2010). A system for online power prediction in virtualized environments using Gaussian mixture models. In Design Automation Conference (DAC), 2010 47th ACM/IEEE (pp. 807–812).

Ferreira, M., Hoekstra, E., Merkus, B., Visser, B., & Visser, J. (2013, May). SEFLab: A lab for measuring software energy footprints. In Green and Sustainable Software (GREENS), 2013 2nd International Workshop on (pp. 30–37). doi: 10.1109/GREENS.2013.6606419

Petersen, J. P. (2010). Gaussian Mixture Models python package documentation. http://pypr.sourceforge.net/mog.html. (Accessed: 2015-06-01)

Piga, L., Bergamaschi, R. A., & Rigo, S. (2014). Empirical and analytical approaches for web server power modeling. doi: 10.1007/s10586-014-0373-0

Reynolds, D. A. (2008). Gaussian Mixture Models. Encyclopedia of Biometric Recognition, 31(2), 1047–64. doi: 10.1088/0967-3334/31/7/013

Zhu, H., Grosso, P., Liao, X., & de Laat, C. (2013). Evaluation of approaches for power estimation in a computing cluster.


Appendices.

Appendix A
