
A Comparative Study of Forecasting Methods for Workload Prediction Applied to a Practical Use Case


Academic year: 2021



A Comparative Study of Forecasting Methods for Workload Prediction Applied to a Practical Use Case

submitted in partial fulfillment for the degree of master of science

Tim Schijvenaars

13185713

master information studies

data science

faculty of science

university of amsterdam

23-06-2021

Internal Supervisor: Jamila Alsayed Kassem (UvA, IVI, MNS)
External Supervisor: Hans Kramer (Vattenfall)
3rd Examiner: Dr. Paola Grosso (UvA, IVI, MNS)


A Comparative Study of Forecasting Methods for Workload Prediction Applied to a Practical Use Case

ABSTRACT

In recent times, cloud services have shown a sharp rise in usage. Workload prediction for data centers benefits both the owner of the data center and its users. This research studies to what extent the LightGBM machine learning algorithm is useful, in terms of prediction accuracy, as a predictive algorithm for data center workload. The objective of this paper is to show how such a goal can be achieved by performing a comparative study between statistical time series forecasting techniques and the LightGBM machine learning algorithm. The comparative study describes how to prepare the data, how the various methods are implemented, which features are important when predicting the workload of resource pools, and how well the various prediction methods perform compared to each other. As statistical forecasting methods, Naïve regression, Seasonal Naïve regression, STL, ARIMA (with and without errors and external regressors) and TBATS are used; as machine learning algorithm, LightGBM is used. Experiments have been performed using these methods, with the objective of predicting eight data points into the future. For each method, the mean absolute error and run time of each experiment have been recorded. The outcome of the experiments shows that the LightGBM method outperforms all statistical forecasting methods in terms of prediction accuracy. The downside is that LightGBM is approximately fifty times slower than methods that perform only slightly worse. This suggests that when the goal is to achieve the lowest possible prediction error, LightGBM is the preferred choice for building a predictive model. However, when run times are also important, the statistical forecasting methods are the better choice.

KEYWORDS

workload prediction, workload estimation, time series forecasting, data centers, resource estimation, modeling and prediction, workload forecasting, lightgbm, machine learning

1 INTRODUCTION

Accurate resource estimation for data centers is needed to be able to perform better load balancing, holistic power usage management and dynamic scaling. This is a growing necessity, as in recent years the number of companies migrating their services to the cloud has substantially increased [15]. As a result, big cloud service providers like Amazon Web Services, Microsoft Azure and Google Cloud have seen substantial growth of at least 28 percent in the third quarter of 2020, due to more people working from home because of the COVID-19 pandemic [15]. As the use of cloud services keeps growing, its providers have to solve increasingly difficult issues around power consumption, cost management, security and complexity [15].

Workload prediction is a key factor in performing resource estimation. The term workload is often used to encompass certain key performance indicators (KPIs) of data centers, such as computing, storage and networking capacity. Accurate predictions of the workload of a virtual machine (VM), subsystem or data center can be used to preemptively up- and downscale system resources to prevent unnecessary idle time or resource shortages. Workload prediction falls under the category of time series forecasting, and will be treated as such. To limit the scope of this research, only the workload prediction of computing capacity is studied.

Currently, a research gap can be found within the field of workload prediction when it comes to applying workload prediction to different use cases. This study fills that research gap by performing a comparative study of statistical time series forecasting methods and the LightGBM machine learning algorithm, while applying these to a utility company use case. Moreover, existing literature describes many approaches to tackle workload prediction, but is often focused on either statistical forecasting methods or advanced neural networks in experimental environments [13]. With fewer studies focusing on applying workload prediction to practical use cases, this ultimately leads to a gap in research. This comparative study gives more insight into the performance of various forecasting techniques and addresses the literature gap by applying forecasting methods to a real-life use case. The data on which the experiments are performed is provided by Vattenfall¹, a Swedish utility company. The following research question is associated with the goal of this research: "To what extent can computing workload be predicted in terms of accuracy for resource pools using LightGBM based on data center snapshot data, compared to statistical forecasting methods?". The research can be divided into the following sub-questions:

• How can necessary data be extracted and formatted to be able to be useful to create the prediction model?
• How can resource pool workload be predicted using the LightGBM algorithm?
• Which features of the data are important for computing capacity workload prediction?
• What prediction accuracy can be achieved using statistical time series forecasting techniques?
• How accurately can resource pool workload be predicted based on a LightGBM based predictive model, compared to statistical time series forecasting methods?

The rest of the paper is structured as follows. Section 2 discusses existing approaches to workload prediction, presenting an overview of the most relevant related work to provide background on this topic. Section 3 presents the approach taken to perform the experiments, elaborating on the selected techniques. After that, the experiments and their results are included in Section 4. In Section 5 the results are evaluated. Finally, Section 6 contains the conclusion and suggestions for future work.


2 RELATED WORK

This paper looks into the prediction of resource utilization and workload for data centers. This section describes the related literature. In the search for related literature, Google Scholar² has been used in combination with keywords such as workload prediction, data center workload forecasting and resource estimation. Selection of appropriate papers has been based on the number of citations, related articles and filtering by date of publication. Besides manual searching, the related-work sections of the reviewed papers have also been analyzed for promising papers. A summary of the most relevant literature is shown in Table 1.

In recent years, the field of cloud computing has evolved significantly. Data center optimization plays a big role in cloud computing and involves many elements; from holistic power consumption and physical cooling of the server racks, to load balancing and resource management of physical servers or VMs. When looking at resource management of data centers, optimization for different objectives can be identified, such as ensuring service level agreements (SLAs), reducing power consumption and reducing the operational costs of managed services [29]. To predict workload, the CPU, memory, storage and network bandwidth are important factors [29]. A prominent challenge when forecasting data center workloads is unforeseen peaks. These variations cause underestimation and overestimation, which translate to resource waste, unnecessary power usage and SLA violations [2].

When predicting workloads, recent studies propose ensembles of different statistical forecasting methods (Linear Regression, Support-Vector Machine (SVM), ARMA and ARIMA) that predict workloads by dynamically determining the weight of each predictor using regression [17]. Other works also suggest ensemble methods to predict workloads for specific or combined cloud resources [4,5,14,23,28,30]. Some approaches view the workload forecasting problem specifically as a time series forecasting problem. These approaches use techniques like autoregression, ARIMA, exponential smoothing and neural network autoregression [3,7,9,19,31].

Over the past decade, contributions to the field of workload prediction for data centers have been made using machine learning techniques [24]. Neural networks are used for workload forecasting as well. One approach uses a neural network model with supervised learning [18]. Another example is the use of auto-encoders to predict CPU utilization of VMs [32]. Other examples of techniques that are used are multiple-layered restricted Boltzmann machines (RBMs) [22,33], evolutionary neural networks [21], long short-term memory [27] and recurrent neural networks [8]. A more general study has the goal of identifying the most appropriate machine learning method for predicting resource utilization given the data and recent observations [13]. However, that technique only allows for identification of the appropriate machine learning methods that are supported by that approach.

A persistent problem when using specific machine learning algorithms for workload prediction is the lack of research applied to real-life use cases. This study tries to fill that research gap by not only performing a comparative study between statistical and machine learning techniques, but also by applying the prediction methods to a practical use case using company data.

² https://scholar.google.com/

Workload forecasts can be combined with a dynamic resource allocation strategy to make optimal use of data center resources and reduce power consumption. An effective prediction can thus lead to performance improvements and optimized resource utilization [26], and put less strain on resource allocation strategies, preventing under- or over-provisioning, fragmentation and deficiency [10]. These follow-up topics are not discussed further in this research.

Reference       Observed Metric                 Approach
Chen [5]        CPU/Network                     Fuzzy neural network
Rahmanian [23]  CPU                             Ensemble method using automata theory to adjust weights
Tseng [30]      CPU/memory/energy consumption   Multi-objective genetic algorithm
Liao [19]       CPU                             Ensemble method using multiple time series predictors as input for a linear prediction model
Zhang [33]      CPU                             Autoencoders to predict future CPU utilization of VMs
Kumar [18]      CPU/memory                      Differential evolution algorithm with adaptation in the mutation and crossover phase
Subirats [28]   CPU                             Mathematical models
Vazquez [31]    Job scheduling                  Review of time series forecasting techniques

Table 1: A summary of the most relevant related work.

3 METHODOLOGY

To answer the research questions, a methodology is defined that is used to perform experiments. The outcomes of these experiments are ultimately used to answer the research questions. This section discusses the methods used for this research. The data and its preprocessing are discussed first. Afterwards, the various statistical time series forecasting techniques are described. Finally, the LightGBM machine learning algorithm is discussed, to which the statistical time series forecasting techniques are compared. This section answers the sub-questions: "How can necessary data be extracted and formatted to be able to be useful to create the prediction model?" and "How can resource pool workload be predicted using the LightGBM algorithm?".

3.1 Data Description

The data for this study is provided by Vattenfall³, a Swedish utility company. The data is recorded from their own data centers and comes in the form of snapshots, where data is recorded at a fixed point in time, with snapshots recurring at a fixed interval of at least twice a week. These snapshots contain information about 25 different topics, ranging from individual VMs and data storage to resource pools and VM clusters. The snapshot data can serve as a 'back-up' of the data center systems. As this study is aimed at predicting the computing capacity of resource pools specifically, only the data about resource pools is used.

³ https://group.vattenfall.com/

The data ranges from 09-07-2018 to 24-03-2021, with snapshots created twice a week, and contains data about a total of 722 resource pools spread out over three geographical locations. The resource pool data is grouped in separate datasets corresponding to each geographical location. This results in three datasets, identified in this research as the '470' dataset, the '480' dataset and the '490' dataset. These numbers correspond to the identifiers of the geographical locations used by Vattenfall. To be able to perform consistent experiments, only resource pools that have consistent logging from 09-07-2018 to 24-03-2021 are selected. This results in a total of 140 resource pools. The 470 dataset contains 77 resource pools, the 480 dataset contains 24 resource pools and the 490 dataset contains 39 resource pools.

An overview of the data can be seen in Appendix Section A.2, where all variables are summarized in a table. The table contains the name, data type and description of each variable. Variables regarding CPU are recorded in megahertz (MHz), whereas variables regarding memory are recorded in mebibytes (MiB). Some variables allow the setting '-1', which corresponds to unlimited. Exploratory graphs are included in Appendix Section A.1. Graphs 3, 4 and 5 describe the CPU usage, memory usage and maximum CPU usage over time, respectively, of one of the resource pools. Graphs 6, 7 and 8 show the plots of CPU usage for each resource pool per dataset.

As the main target for prediction is the computing capacity, CPU usage and memory usage are to be predicted. These targets are labelled as the variables 'CPU overallUsage' and 'Mem overallUsage' within the dataset. The goal of machine learning here is to train a model function that maps the independent variables 'x' (input variables) to the dependent variable 'y' (the target).

3.2 Data Preprocessing

Preprocessing can be used to remove unrealistic or unusable values from a dataset. The data is snapshot data, meaning it is copied at one point in time and saved as is. Since the data has to be usable as a back-up, no missing values are allowed. Accordingly, no missing values have been found and all data is within realistic ranges (no outliers).

Various variables have been removed from the dataset as well. The 'VI SDK Server' and 'VI SDK UUID' variables have been removed because they may contain sensitive company data and are not relevant to the experiments. All variables starting with 'QS' are removed as well, because these are aggregated values and, according to the data documentation, have the same values as the corresponding non-'QS' variables. The 'Day' variable has been manually added to the dataset, derived from the 'Time' variable. The 'Day' variable describes the day of the week in String format, so it can later be used as a categorical value for one-hot encoding. Another part of the preprocessing is the creation of a data pipeline to perform time series forecasting and machine learning prediction. Both pipelines differ in implementation, but share the same principal idea: formatting the data and dividing it into a training and a test set.

In terms of time series forecasting, the data has to be formatted into a 'ts' (time series) format. A time series is a series of values occurring at a fixed interval over time. In this study, the time series data is based on a frequency of 104 data points per year (as two snapshots are taken each week). For each resource pool, its name is saved in a list together with its time series. After completion of this list containing all time series, two new lists are made: a training set and a test set. Because of the nature of time series forecasting methods, producing a separate validation and test set is not necessary. The sets are split such that the test set contains the last 8 data points of the full time series, and the training set contains all data points leading up to the test set.
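As a minimal sketch of this split (with hypothetical pool names and toy values, not the actual Vattenfall series), the last 8 data points of each series are held out as follows:

```python
def split_series(values, horizon=8):
    """Split one time series, keeping the final `horizon` points for testing."""
    return values[:-horizon], values[-horizon:]

# One (name, series) pair per resource pool, as in the list described above.
pools = [("pool-a", list(range(20))), ("pool-b", list(range(100, 130)))]
splits = {name: split_series(series) for name, series in pools}

train, test = splits["pool-a"]
print(len(train), len(test))  # 12 8
```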

When preprocessing data for machine learning, the data does not need to be transformed into a specific time series format. First, the index column is removed and column names are cleaned. After this, data formatting and splitting are performed. First, all data points corresponding to a resource pool are grouped and sorted by ascending date. Second, the day-of-the-week variable is one-hot encoded, as it is categorical in nature. After this, a train-val-test split is made, where the validation and test sets each contain the last 8 data points. Finally, separate datasets are produced for the train, validation and test data, containing the dependent and independent variables.
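These steps can be sketched with pandas. The frame below is a hypothetical miniature; the column names mirror the dataset's 'Day', 'Time' and 'CPU overallUsage' variables, but the values are synthetic:

```python
import pandas as pd

# Hypothetical miniature frame standing in for one resource pool's snapshots.
dates = pd.date_range("2021-01-01", periods=30, freq="D")
df = pd.DataFrame({
    "Time": dates,
    "Day": dates.day_name(),              # weekday as a String, as in the paper
    "CPU overallUsage": range(30),
})
df = df.sort_values("Time")               # sort ascending by date
df = pd.get_dummies(df, columns=["Day"])  # one-hot encode the weekday

# Train / validation / test split: both held-out sets keep the last 8 points.
test = df.iloc[-8:]
val = df.iloc[-16:-8]
train = df.iloc[:-16]

# Separate dependent (y) and independent (x) variables per split.
y_train = train["CPU overallUsage"]
x_train = train.drop(columns=["CPU overallUsage", "Time"])
print(len(train), len(val), len(test))  # 14 8 8
```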

3.3 Performance Metrics

As this research is a comparative study, it is important to disclose which metrics are used for comparison. As the main performance metric, the Mean Absolute Error (MAE) is selected, because it is commonly used as a measurement within prediction and forecasting. In this paper, the term accuracy is used interchangeably with prediction error and the performance metric mean absolute error. The metric is calculated with the following formula:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - x_i|,$$

where $n$ is the number of errors and $|y_i - x_i|$ is the absolute difference between the predicted value $y_i$ and the real value $x_i$ (called the error). In short, the MAE is the mean of all differences between the predicted and real values. As the goal is to make predictions as close to the real values as possible, the objective is to minimize the MAE. As a secondary performance metric, run time is selected. This metric is measured in seconds and equals the run time of each performed experiment from start to finish. Run time is selected as the secondary metric because the time needed to calculate predictions can be critical in certain situations, and because it provides another benchmark against which to compare all forecasting methods.
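Written out in code, the metric amounts to a few lines (a generic sketch, not tied to any particular library):

```python
def mean_absolute_error(y_pred, y_true):
    """MAE = (1/n) * sum(|y_i - x_i|) over all prediction/actual pairs."""
    assert len(y_pred) == len(y_true)
    return sum(abs(p, ) if False else abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_pred)

print(mean_absolute_error([10, 12, 14], [11, 12, 10]))  # errors 1, 0, 4 -> 5/3
```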

3.4 Time Series Forecasting

As described in Section 2, workload prediction can be approached as a time series forecasting problem. Several ensemble and non-ensemble techniques have been identified in the related literature, and a selection of methods has been made. This section outlines the process of applying this selection of time series forecasting techniques, limited to:

• Naïve method
• Seasonal Naïve method
• STL decomposition
• ARIMA method
• TBATS method

These methods are implemented using the R programming language, in combination with the forecasting library 'fpp2'⁴. The library name is based on the corresponding book "Forecasting: Principles and Practice (2nd Edition)" [11]. The 'fpp2' library is an R library which combines the libraries 'ggplot2'⁵ (data visualization) and 'forecast'⁶ (forecasting methods).

Experiments are executed for each resource pool individually, using loops to calculate the mean accuracy. Each experiment produces one MAE score per resource pool forecast. After these forecasts, the average of the MAE scores of all resource pools is taken. This is repeated for each dataset, resulting in a final score per dataset: the average mean absolute error over all of its resource pools.
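This looping-and-averaging procedure can be sketched as follows, with a naive forecaster standing in for any of the methods described below and toy series in place of the real resource pools:

```python
def average_mae(pools, forecast_fn, horizon=8):
    """Forecast each pool, score it with MAE, then average the scores."""
    scores = []
    for series in pools:
        train, test = series[:-horizon], series[-horizon:]
        pred = forecast_fn(train, horizon)
        scores.append(sum(abs(p - t) for p, t in zip(pred, test)) / horizon)
    return sum(scores) / len(scores)

# Placeholder forecaster: repeat the last observed value (the naive method).
naive = lambda train, h: [train[-1]] * h
pools = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [5] * 10]
print(average_mae(pools, naive, horizon=4))  # (2.5 + 0.0) / 2 = 1.25
```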

3.4.1 Naïve Method. Naïve forecasting sets all future forecasts to the value of the last observation, which can be described by:

$$\hat{y}_{t+h|t} = y_t,$$

where $y_t$ is the last known data point at time $t$ and $\hat{y}_{t+h|t}$ is the data point to predict at time $t+h$, with $h$ a point in the future. This method literally sets all future data points equal to the last known data point.

3.4.2 Seasonal Naïve Method. The seasonal naïve method is similar to naïve forecasting. The difference is that this method sets each forecast to be equal to the last observed value from the same season, which is described by:

$$\hat{y}_{t+h|t} = y_{t+h-km},$$

where $y_{t+h-km}$ is the last observed value for the same season, with $m$ the seasonal period, $k$ the smallest integer greater than $(h-1)/m$ and $\hat{y}_{t+h|t}$ the data point to predict at time $t+h$. This method is effective only when the data is highly seasonal.

3.4.3 STL Decomposition. STL, short for "Seasonal and Trend decomposition using Loess", is a method used for time series decomposition. The method uses Loess, a method for estimating nonlinear relationships [25]. Given its strong focus on seasonality and trends, this forecasting technique only works well with seasonal data.
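As a minimal sketch, the naïve and seasonal naïve rules translate into a few lines of Python (toy series; season length m = 4 chosen purely for illustration):

```python
def naive_forecast(series, horizon):
    """All h future points equal the last observation y_t."""
    return [series[-1]] * horizon

def seasonal_naive_forecast(series, horizon, m):
    """Each future point equals the last observed value from the same season:
    y_hat(t+h) = y(t+h-k*m), with k the smallest integer > (h-1)/m."""
    forecasts = []
    for h in range(1, horizon + 1):
        k = (h - 1) // m + 1
        forecasts.append(series[len(series) + h - k * m - 1])
    return forecasts

history = [10, 20, 30, 40, 11, 21, 31, 41]   # toy series, season length m = 4
print(naive_forecast(history, 3))             # [41, 41, 41]
print(seasonal_naive_forecast(history, 3, 4)) # [11, 21, 31]
```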

3.4.4 ARIMA Method. ARIMA models can be used to describe the autocorrelations in data. ARIMA is an acronym for AutoRegressive Integrated Moving Average [11]. Such ARIMA models can be written mathematically as follows:

$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t,$$

where $y'_t$ is the differenced series at time $t$, which is equal to the intercept $c$, plus the lagged values denoted by $\phi_p y'_{t-p}$ and the lagged errors denoted by $\theta_q \varepsilon_{t-q} + \varepsilon_t$. Within the lagged values and lagged errors, $p$ stands for the number of autoregressive terms, whereas $q$ denotes the number of lagged forecast errors. These values are set manually by the user or can be calculated automatically using an algorithm. Changing the $p$ and $q$ variables highly influences the prediction accuracy of the model.

⁴ https://github.com/cran/fpp2
⁵ https://github.com/tidyverse/ggplot2
⁶ https://github.com/robjhyndman/forecast

For this approach, a special function of the fpp2⁷ package called 'auto.arima()' is used. This function uses a combination of techniques to obtain a well-fitted ARIMA model [12]. ARIMA is used in two phases. The first phase uses ARIMA to predict the target variables based only on previous occurrences, which will be called the no-error approach. The second phase introduces independent variables to obtain an ARIMA model, which will be called the with-error approach. The introduced independent variables might help the model to better explain the residuals, resulting in better prediction accuracy.

3.4.5 TBATS Method. TBATS is an acronym for the main features of the model: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components. The model combines these features in a completely automated manner [6]. This results in a regression where seasonality is allowed to change over time. The TBATS method can be slow to produce estimates.

3.5 Machine Learning

To predict the workload using machine learning, the LightGBM algorithm is used. LightGBM is short for Light Gradient Boosting Machine. It is an open-source gradient boosting framework which uses tree-based learning algorithms. This method has been shown to be several times faster than average gradient boosting trees [16]. The difference between LightGBM and conventional gradient boosting trees is that LightGBM grows trees leaf-wise instead of level-wise, which results in faster run times. The method is implemented using the programming language Python, with the additional libraries 'lightgbm'⁸, 'sklearn'⁹, 'pandas'¹⁰, 'numpy'¹¹, 'matplotlib'¹² and 'optuna'¹³. The 'sklearn' library provides tools for predictive analysis, such as automated calculation of performance metrics and implementations of cross-validation. The libraries 'pandas' and 'numpy' provide optimized implementations of data structures in Python for data frames and arrays. For improved data visualisation, 'matplotlib' is used.

As manual tuning of hyperparameters can be tedious, the 'optuna' package is used to perform automatic hyperparameter optimization. The Optuna framework offers user-defined search spaces and high modularity, and allows pruning of unpromising runs for fast results [1]. As there is a limited number of hyperparameter tuning frameworks, Optuna is selected because of its ease of use, broad set of features and support for automated tuning. Optuna can be configured to run a set number of trials, over which it searches for the most optimal hyperparameters.

⁷ https://github.com/cran/fpp2
⁸ https://lightgbm.readthedocs.io/en/latest/
⁹ https://scikit-learn.org/stable/
¹⁰ https://pandas.pydata.org/
¹¹ https://numpy.org/
¹² https://matplotlib.org/
¹³ https://optuna.org/


The choice of LightGBM is inspired by a forecasting challenge, the M5 Competition [20], aimed at predicting future sales using whatever technique possible. In its outcome, LightGBM models were among the best performing, slightly outperforming the standard statistical methods and heavily outperforming neural network approaches [20].

3.5.1 Optuna Hyperparameter Tuning. When using Optuna for hyperparameter optimization, constraints on the parameters have to be set before running the optimization trials. The constraints are based on the parameter documentation¹⁴, with the goal of scoring the lowest possible mean absolute error within a fixed number of optimization runs. Optimization runs, also called trials, are runs for which a model is trained and predictions are made. Based on the predictions and real values, the prediction error is calculated. A run is saved together with the selected training parameters and its prediction error. Within Optuna, multiple trials make up a 'study': the process of running a set number of trials to tune hyperparameters and obtain an as low as possible prediction error. The boosting type, objective and metric are already known parameters and have been set to gradient boosting decision tree (gbdt), regression and MAE respectively. The following parameter boundaries have been set:

• num_leaves: integer between 2 and 10
• feature_fraction: float between 0.3 and 0.98
• bagging_fraction: float between 0.3 and 0.98
• bagging_freq: integer between 1 and 8
• min_child_samples: integer between 2 and 100
• max_depth: integer between 5 and 50
• learning_rate: float between 0.005 and 0.6
• max_bin: integer between 150 and 500

The parameter boundaries have been chosen based on exploratory research. The exploration started with running experiments over a larger range of boundaries (0 to 1000 for integers and 0 to 1.0 for floats). Based on the outcomes of this exploration, the aforementioned parameter boundaries were set.

3.5.2 Approach. For the implementation of LightGBM, two approaches have been used. The first approach is the prediction of the aggregated series. This approach trains one model to fit all resource pools within the dataset. The model is trained over a hundred runs, of which the run with the lowest MAE is selected as the best fit. The optimal hyperparameters, run time, validation MAE and test MAE are recorded. This experiment is repeated for each dataset and target value. The aggregated forecasting approach is used because it requires less model training and is able to output feature importance for the whole aggregated series.

The second approach is aimed at the prediction of each resource pool individually, using a loop to preprocess, split, train, predict and validate each resource pool. For each resource pool, a model is trained over a hundred runs, of which the run with the best MAE is selected as the best fit. This results in a best-fitting model for each resource pool, of which the run times and MAE scores are calculated. To calculate the accuracy of this approach, the average of the best MAE per resource pool is taken. A sketch of both approaches has been included in Appendix Section A.3.

¹⁴ https://lightgbm.readthedocs.io/en/latest/

4 EXPERIMENTS & RESULTS

This section describes the results of all executed experiments. First, the system specifications on which the experiments have been performed are disclosed. Second, a description of the experiments is given. Afterwards, the results of the experiments are disclosed, starting with the feature importance; the remaining results are shared in the last subsection. For each experiment, the MAE and run time are recorded. This section answers the following sub-questions: "Which features of the data are important for computing capacity workload prediction?", "What prediction accuracy can be achieved using statistical time series forecasting techniques?" and "How accurately can resource pool workload be predicted based on a LightGBM based predictive model, compared to statistical time series forecasting methods?".

4.1 System Specifications

All experiments have been executed on a computer running Windows 10 Pro (version 2004). The system has an AMD Ryzen 5 3600 6-core processor @ 3.6 GHz and 16 GB of RAM. The experiments have been executed using R version 4.0.2 and Python version 3.8.3. The development environments used are RStudio¹⁵ version 1.3.1073 (R) and Jupyter Notebook¹⁶ version 6.1.4 (Python).

4.2 Experiment Description

The experiments have been executed in two stages and have been performed on the datasets described in Section 3.1, preprocessed as described in Section 3.2. The goal of the experiments was to predict the next eight data points in the time series and measure how well the methods and algorithms predicted these eight data points based on the prediction error. To calculate the prediction error, the predicted and real values are used to calculate the mean absolute error as described in Section 3.3.

The first stage of experiments consisted of the execution of the statistical forecasting methods. The code has been included in Appendix Section A.6.1. The experiments have been performed with a cleared cache before each experiment and no running background tasks, for accurate run time measurements. First, the dataset is loaded, preprocessed and split into a training and validation set. During preprocessing, the dependent variable is split from the rest of the dataset, to create a dataset with just the target variable (CPU overallUsage or Mem overallUsage). There is no need to make a separate test set because the statistical methods do not rely on training a model, so leaving out unseen data points would have no effect on possible overfitting. Hence, for these experiments the validation set serves as the test set. Next, a for-loop is used to iterate over each resource pool within the dataset, fit a model, predict the following 8 data points and calculate the prediction error. These prediction errors are saved. The recorded run time captures the preprocessing, model fitting and calculation of the prediction errors. This process is repeated for each statistical forecasting method, each target variable and each dataset, resulting in 42 experiment runs in total. Finally, the recorded prediction errors have been plotted in box plots and tables per target variable and dataset.

¹⁵ https://www.rstudio.com/
¹⁶ https://jupyter.org/


The second stage of experiments includes the execution of Light-GBM experiments. The code for these experiments have been

in-cluded in Appendix SectionA.6.2andA.6.3. These experiments

have been performed with a cleared and restarted kernel with no running background tasks for accurate run time measurements. The LightGBM experiments can be divided in two: aggregated series experiments and resource pool-wise experiments. The LightGBM experiments are repeated for each target variable and dataset, re-sulting in 12 experiment runs in total.

The aggregated series experiments first create a new Optuna tuning objective with a pre-set number of trials, which has been set to 100. Within this objective, the Optuna hyperparameter boundaries are set to limit the search scope for this experiment. After this, a for-loop runs through each resource pool within the dataset. For each resource pool, its data is preprocessed through one-hot encoding and a train/validation/test split is made. The training set contains all data points except the last 16. The 16th through 9th data points from the end are given to the validation set, and the last 8 data points form the test set. Further preprocessing is performed by splitting each of the train/validation/test datasets: the target variable is split into a 'y' dataset and the other variables into an 'x' dataset, corresponding with the dependent and independent variables. Afterwards, a LightGBM model is trained given the pre-set parameters, the training data and the validation data. Finally, the prediction errors are calculated by predicting the last eight data points and comparing them with the real values of the test set. This MAE score is saved for each resource pool. After all resource pools have been predicted, Optuna repeats this process for the pre-set number of trials and keeps track of which trial has the smallest MAE and its parameters. The feature importances of the best performing trial are also plotted.
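The split described above can be sketched directly with list slicing. The series here is a toy stand-in; the real data has one such series per resource pool:

```python
# Train/validation/test split: hold out the last 16 observations, with the
# first 8 of those used for validation and the final 8 for testing.

series = list(range(40))  # 40 hypothetical observations for one resource pool

train = series[:-16]   # everything up to the last 16 points
val   = series[-16:-8] # points 16 through 9 from the end
test  = series[-8:]    # final 8 points, matching the 8-step forecast horizon

assert len(val) == 8 and len(test) == 8
print(len(train), val[0], test[-1])
```

The same slicing applies regardless of series length, as long as the series has more than 16 observations.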

The resource pool-wise experiments are performed in a different way. The Optuna tuning objective does not contain the for-loop for iterating over each resource pool. Instead, for each resource pool Optuna is used to run 100 trials of hyperparameter optimization. This means that, instead of creating one model that fits all resource pools best, every resource pool gets a model with its own optimized parameters. Because of the structure of these experiments, reporting feature importance is not feasible, as feature importance plots would have to be compared across all resource pools in each dataset and for each target variable.

4.3 Feature Importance

Feature importance is a score given to input variables based on their importance for predicting the output. The higher the score, the more important the variable is for predicting the outcome. For this research, the feature importance of the aggregated series is obtained while running LightGBM. The output of the feature importance can be found in Appendix Section A.4. Based on these graphs, the most important and recurring features for each target variable and dataset are:

• Mem maxUsage
• Mem unreservedForPool
• Wednesday (one-hot encoded Day)
• Friday (one-hot encoded Day)
• CPU maxUsage
• # VMs
• # vCPUs

4.4 Experiment Results

The results of the experiments are discussed in this subsection. Experiment results are listed per target variable (CPU usage or memory usage) and per dataset (470, 480 or 490). When looking at the results, the MAE has to be interpreted in terms of the unit of the corresponding variable: for CPU prediction results the mean absolute error is in megahertz, whereas for memory prediction it is measured in mebibytes. Table 2 contains the experiment outcomes of each applied algorithm. The results cover both the mean absolute error as performance metric and the run times of each experiment over the different datasets. The result of the STL experiment for dataset 480 is reported as infinite because the prediction errors are extremely large. The STL outcome for dataset 490 shows similar behavior, because the data is not seasonal.
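Because MAE is an average of absolute differences, it carries the unit of the target variable. A toy calculation (the numbers below are illustrative, not taken from the experiments):

```python
# MAE is the mean absolute difference between actual and predicted values,
# so a CPU MAE is in MHz and a memory MAE is in MiB.

actual    = [2000.0, 2400.0, 2200.0, 2600.0]  # e.g. CPU usage in MHz
predicted = [2100.0, 2300.0, 2300.0, 2500.0]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # 100.0, i.e. the forecasts are off by 100 MHz on average
```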

Table 3 describes all outcomes of each experiment for the memory usage target variable in the same way as Table 2. In this table, all performance metrics are listed alongside the run times per prediction algorithm, grouped by the corresponding dataset. For both dataset 480 and 490, the STL prediction outcome is larger than any other outcome.

Figures 1 and 2 both contain the distribution of MAE scores per target (CPU or memory) and dataset (470, 480 and 490) based on predictions made with LightGBM. The distribution of MAE scores of predictions made with the statistical methods can be found in Appendix Section A.5. These distributions have been visualized as boxplots, shown in Figures 1 and 2, a standardized way of displaying the distribution of the data. For this purpose it is highly effective, because it clearly shows the minimum, first quartile, median, third quartile, maximum and the outliers of the data. The boxplots can help explain the prediction performance of the experiments.
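The five summary statistics a boxplot displays can be computed directly. The scores below are toy per-pool MAE values with one outlier; note that plotting libraries may use slightly different quartile conventions, the stdlib "exclusive" method is used here:

```python
# Five-number summary underlying a boxplot: min, Q1, median, Q3, max.
import statistics

scores = [1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 9000.0]  # one outlier pool

q1, median, q3 = statistics.quantiles(scores, n=4)  # default method="exclusive"
print(min(scores), q1, median, q3, max(scores))
```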

Figure 1: Boxplot distribution of mean absolute errors on predicted CPU using LightGBM of each dataset. The scores have to be interpreted in megahertz.


Algorithm | MAE (470) | Run time (470) | MAE (480) | Run time (480) | MAE (490) | Run time (490)
Naïve | 2626.099 | 4.936305 secs | 2880.656 | 0.9781171 secs | 6129.609 | 2.679172 secs
Seasonal Naïve | 17742.96 | 5.06342 secs | 3027.849 | 1.097333 secs | 10690.16 | 2.7902732 secs
STL | 7390.87 | 5.1333548 secs | - (Inf) | 1.7778421 secs | 2.130134e+128 | 3.460881 secs
ARIMA without errors | 2689.051 | 8.083822 secs | 1813.308 | 2.532529 secs | 6714.915 | 5.309384 secs
ARIMA with errors (CPU unreservedForPool) | 3821.15 | 12.530712 secs | 1768.488 | 4.684484 secs | 6676.7 | 5.761973 secs
ARIMA with errors (CPU maxUsage) | 4336.673 | 12.481209 secs | 1609.725 | 4.687488 secs | 6720.373 | 5.678407 secs
TBATS | 3151.098 | 100.29482 secs | 1744.497 | 23.03735 secs | 6794.194 | 65.32908 secs
LightGBM - Aggregated | 2846.695 | 182.386111 secs | 1587.176 | 48.489191 secs | 6259.111 | 76.54488 secs
LightGBM - RP-wise | 1665.917 | 260.110985 secs | 1029.027 | 78.449996 secs | 3644.505 | 126.602254 secs

Table 2: CPU prediction and run time results for all datasets. The MAE scores have to be interpreted as megahertz.

Algorithm | MAE (470) | Run time (470) | MAE (480) | Run time (480) | MAE (490) | Run time (490)
Naïve | 23831.21 | 4.9515199 secs | 2910.802 | 0.968108 secs | 110670.3 | 2.7012041 secs
Seasonal Naïve | 177519.84 | 4.963802 secs | 18225.02 | 0.9866279 secs | 160776.7 | 2.6971879 secs
STL | 25452.53 | 5.2501079 secs | - (Inf) | 1.774695 secs | 5.872075e+129 | 3.416348 secs
ARIMA without errors | 24955.24 | 9.755153 secs | 4130.552 | 1.7236019 secs | 99491.94 | 6.230085 secs
ARIMA with errors (Mem unreservedForPool) | 90279.58 | 11.971143 secs | 10739.97 | 5.503063 secs | 87348.24 | 12.62403 secs
ARIMA with errors (Mem maxUsage) | 121074.2 | 8.690546 secs | 8706.049 | 21.14682 secs | 87761.29 | 11.743865 secs
TBATS | 27174.42 | 88.87658 secs | 3953.179 | 19.62673 secs | 90976.96 | 50.94389 secs
LightGBM - Aggregated | 24399.051 | 190.462334 secs | 1614.133 | 53.014996 secs | 52458.008 | 83.649299 secs
LightGBM - RP-wise | 4453.624 | 268.012728 secs | 950.364 | 78.550164 secs | 28018.463 | 131.209974 secs

Table 3: Memory prediction and run time results for all datasets. The MAE scores have to be interpreted as mebibytes.

Figure 2: Boxplot distribution of mean absolute errors on predicted memory using LightGBM of each dataset. The scores have to be interpreted in mebibytes.

5 DISCUSSION

The discussion section describes the interpretations of all findings presented in Section 4, theoretical and real-world implications, and limits of the research. The interpretation of the results is performed based on the main research problem: "To what extent can computing workload be predicted in terms of accuracy for resource pools using LightGBM based on data center snapshot data, compared to statistical forecasting methods?".

5.1 Interpretation of Results

Tables 2 and 3 provide an overview of prediction performance and run times per algorithm and dataset. Several patterns can be identified. The first point of interest is the MAE scores of the statistical methods versus those of the LightGBM approach. It can be concluded that for every target variable (CPU usage and memory usage) and dataset (470, 480 and 490), the resource pool-wise LightGBM approach outperforms any other method. The approach of training one model for the aggregated series also has one of the smallest prediction errors, however it is sometimes overshadowed by statistical forecasting methods. For example, Table 2 shows the aggregated LightGBM method has smaller prediction errors when run on dataset 480: 1587.176 compared to 2880.656, while for datasets 470 and 490 the Naïve method has smaller prediction errors (for dataset 470 the aggregated approach has an average MAE of 2846.695 versus the Naïve average MAE of 2626.099). When looking at the statistical forecasting methods, the Naïve and ARIMA methods are the ones with the smallest prediction errors.


STL and Seasonal Naïve have the highest prediction errors. Based on this it can be concluded that the data is not seasonal, as both STL and Seasonal Naïve are purely seasonal time series methods. The prediction errors for the different ARIMA approaches, with and without errors, often differ per target and dataset. Performance of the ARIMA with errors method depends on the selected independent variable(s).

The second point of interest is the run time of each algorithm. The run times of the statistical forecasting methods are shorter compared to those of the LightGBM approaches. The Naïve forecasting method is the fastest method, while TBATS is the slowest statistical forecasting method, followed by the aggregated LightGBM approach. The longest overall run time is that of the RP-wise LightGBM approach.

When looking at the distribution of prediction errors shown in Figures 1 and 2 and all figures in Appendix Section A.5, it stands out that the final MAE score is heavily influenced by outliers. The final MAE score is calculated over all the prediction errors for each resource pool in the dataset. When some of the resource pools are difficult to predict, the final MAE score is affected by them. This brings up a new question: "How to deal with resource pools that are harder to predict?".
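The outlier sensitivity of the aggregate score can be illustrated with toy per-pool errors (the values below are invented): a single hard-to-predict pool pulls the mean far above the typical error, while the median barely moves.

```python
# The final score is a mean over per-pool MAEs, which outliers dominate;
# the median is one robust alternative summary.
import statistics

per_pool_mae = [900.0, 1000.0, 1100.0, 1200.0, 15000.0]  # one outlier pool

print(statistics.mean(per_pool_mae))    # pulled up by the outlier
print(statistics.median(per_pool_mae))  # barely affected by it
```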

Finally, the feature importance described in Appendix Section A.4 shows the independent variables that were most important when building a predictive model for the aggregated approach of LightGBM. The results are summarized in Section 4.3. The significance of the most important independent variables, such as 'CPU maxUsage' and 'Mem maxUsage', was expected. What is unexpected, however, is that for both CPU and memory prediction mostly memory-related variables were used. This means that memory is also very important for predicting CPU usage. Interesting as well is that the different weekdays (Wednesday and Friday) are less important than initially thought, which would imply the usage is not much affected by the time of the week. The same holds for the number of VMs and CPUs supplied by the resource pool, which are also less important for prediction than initially expected. Theoretical and practical implications of the results are described in the next subsection, followed by a breakdown of possible limitations.

5.2 Theoretical and Practical Implications

The results of the experiments were expected, because existing literature described in Section 2 shows machine learning outperforming statistical methods on prediction accuracy for time series. An increase in run time was expected as well, albeit to a lesser extent. Based on these findings, a comparative assessment has to be made between using statistical forecasting methods or LightGBM as prediction algorithm for workload prediction based on snapshot data. If the goal is sufficient prediction accuracy with the lowest possible training times, statistical forecasting methods are preferred. If run times are important to a lesser extent, LightGBM has proven to be the better prediction algorithm.

Theoretically, LightGBM outperforms statistical methods in every experiment based on the prediction errors. Purely on this metric, LightGBM would be the preferred algorithm. When also looking at the run times, however, statistical methods are faster. When comparing the Naïve approach with the LightGBM resource pool-wise approach, the run times are approximately fifty times higher for LightGBM, whereas the MAE scores improve to a lesser extent. This leads to a follow-up question: does the increase in prediction accuracy justify the approximately fifty-fold increase in run time? Real-world implications of these results are that the use of LightGBM to predict workloads is a valid option. Run time can possibly be improved, given that the used hardware is not top-tier and the run time includes data preparation, preprocessing, training and hyperparameter optimization. When real-time workload prediction is preferred, it would be advisable however to look at other options and to re-run the experiments using a dataset of a smaller interval.
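As a worked check, the "approximately fifty times" figure follows directly from the dataset 470 CPU row of Table 2 (Naïve versus resource pool-wise LightGBM):

```python
# Run time vs. accuracy trade-off, using the dataset 470 CPU figures
# from Table 2: Naïve vs. resource pool-wise LightGBM.
naive_mae, naive_secs = 2626.099, 4.936305
lgbm_mae, lgbm_secs = 1665.917, 260.110985

runtime_ratio = lgbm_secs / naive_secs       # how many times slower LightGBM is
mae_reduction = 1 - lgbm_mae / naive_mae     # relative error reduction
print(round(runtime_ratio, 1), round(mae_reduction * 100, 1))
```

So for this dataset LightGBM takes roughly 53 times longer for roughly a 37% lower MAE, which is the trade-off discussed above.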

5.3 Limitations

The research in this paper has several limitations. These limitations are not errors but acknowledgements of potential flaws. Existing research is often focused on predicting CPU usage in almost real time. This paper approaches workload prediction as prediction of CPU usage, but predicts based on only two data points a week because of the supplied dataset. This limits the extent to which the outcomes of this research can be compared to studies that also look into CPU usage prediction.

Another limiting factor is the used dataset. The dataset only includes variables that are captured through continuous logging. Spikes in computing capacity can be triggered by external factors as well, for example by performing system updates. These external factors are not included in the dataset. A further limitation of the dataset can be identified: as machine learning only learns from past data, predicting very low and very high resource utilization is difficult, because the provided data possibly does not contain these extremities.

The methodology also has limitations. For ARIMA, only a few variables have been experimented with as external regressor. There may be better performing ARIMA models with different combinations of external regressors, however finding them would have called for exhaustive experimentation.

6 CONCLUSION

This research studies to what extent using LightGBM as prediction algorithm is useful when predicting data center workload. The objective was to predict computing capacity, which can be divided into CPU usage and memory usage. This has been achieved by performing a comparative study between statistical forecasting methods and a state-of-the-art machine learning method called LightGBM.

Experiments have been performed for every forecasting method based on three datasets. Per dataset, both CPU usage and memory usage have been predicted in separate experiments. The statistical methods that have been used are Naïve regression, Seasonal Naïve regression, STL, ARIMA without errors, ARIMA with errors and external regressors, and TBATS. For the LightGBM experiments, two approaches have been suggested. The first approach views the model as an aggregated series: one model is trained to explain and


predict each resource pool. The second approach is a resource pool-wise approach, where a prediction model is created for each single resource pool.

The experiment results show LightGBM is better at predicting CPU and memory workload, but at the cost of a much longer run time. Comparing Naïve regression with the resource pool-wise LightGBM approach, run times are approximately 50 times longer for LightGBM, whereas the prediction accuracy improves by only a fraction of that.

In short, LightGBM has proven to be more successful at accurately predicting workloads compared to statistical forecasting methods. However, using LightGBM results in long run times. This is not a problem if workload prediction is used in a reporting-style manner, but when time is critical another approach should be taken.

7 FUTURE WORK

This section contains possible directions for future research. To improve on the findings in this paper, future research could address the following:

– Researching the generalizability of the approaches presented in this study. This can be done by retrieving datasets similar to the ones presented in this study and executing all experiments in a similar manner on those datasets. The goal is to confirm the validity of the research presented in this study.

– Creating more data using information retrieval. The provided long-term dataset only has two data points per week. Data of a smaller interval is also available, but only for shorter time periods (days/weeks), because of data storage limitations. Using information retrieval techniques, a new dataset could be created by combining the low and high interval datasets.
– Researching which variables are theoretically important for workload prediction and building a dataset around all those variables. The current dataset contains only logging data. Events like system updates might cause sudden spikes, but those events are not represented in the dataset. Creating a new dataset specifically aimed at workload prediction might significantly improve prediction accuracy.

– Using hierarchical forecasting to improve results. Instead of directly predicting the workload of resource pools, one could also predict the workload of the individual VMs that each resource pool hosts and translate that to the workload of their parent resource pool. As each VM has its own behavior, this approach might improve the overall prediction accuracy.
– Predicting CPU and memory together as one target, instead of predicting them separately. CPU and memory usage might be strongly correlated, and thus might influence the prediction accuracy.

– Developing a smart machine learning pipeline to include continual learning. Continual learning keeps training the model over time. Workload prediction would benefit from this, since a time series keeps growing in size.
– Looking into other machine learning algorithms for workload prediction. This research only focuses on LightGBM, inspired by the M5 Challenge [20]. Other algorithms might be as well suited or might perform better. Hence, future studies could look into different machine learning algorithms like Prophet17.

REFERENCES

[1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[2] Maryam Amiri, Leyli Mohammad-Khanli, and Raffaela Mirandola. 2018. An online learning model based on episode mining for workload prediction in cloud. Future Generation Computer Systems 87 (2018), 83–101.
[3] Rodrigo N Calheiros, Enayat Masoumi, Rajiv Ranjan, and Rajkumar Buyya. 2014. Workload prediction using ARIMA model and its impact on cloud applications' QoS. IEEE Transactions on Cloud Computing 3, 4 (2014), 449–458.
[4] Katja Cetinski and Matjaz B Juric. 2015. AME-WPC: Advanced model for efficient workload prediction in the cloud. Journal of Network and Computer Applications 55 (2015), 191–201.
[5] Zhijia Chen, Yuanchang Zhu, Yanqiang Di, and Shaochong Feng. 2015. Self-adaptive prediction of cloud resource demands using ensemble model and subtractive-fuzzy clustering based fuzzy neural network. Computational Intelligence and Neuroscience 2015 (2015).
[6] Alysha M De Livera, Rob J Hyndman, and Ralph D Snyder. 2011. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association 106, 496 (2011), 1513–1527.
[7] Ksenzovets Dmytro, Telenyk Sergii, and Pysarenko Andiy. 2017. ARIMA forecast models for scheduling usage of resources in IT-infrastructure. In 2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Vol. 1. IEEE, 356–360.
[8] Martin Duggan, Karl Mason, Jim Duggan, Enda Howley, and Enda Barrett. 2017. Predicting host CPU utilization in cloud computing using recurrent neural networks. In 2017 12th International Conference for Internet Technology and Secured Transactions (ICITST). IEEE, 67–72.
[9] Wei Fang, ZhiHui Lu, Jie Wu, and ZhenYin Cao. 2012. RPPS: A novel resource prediction and provisioning scheme in cloud data center. In 2012 IEEE Ninth International Conference on Services Computing. IEEE, 609–616.
[10] Noha Hamdy, Amal Elsayed, Nahla ElHaggar, and MS Mostafa. 2017. Resource allocation strategies in cloud computing overview. Int. J. Comput. Appl. Technol. 177 (2017), 18–22.
[11] Rob J Hyndman and George Athanasopoulos. 2018. Forecasting: Principles and Practice. OTexts.
[12] Rob J Hyndman, Yeasmin Khandakar, et al. 2008. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software 27, 3 (2008), 1–22.
[13] Waheed Iqbal, Josep Lluis Berral, Abdelkarim Erradi, David Carrera, et al. 2019. Adaptive prediction models for data center resources utilization estimation. IEEE Transactions on Network and Service Management 16, 4 (2019), 1681–1693.
[14] Yexi Jiang, Chang-Shing Perng, Tao Li, and Rong N Chang. 2013. Cloud analytics for capacity planning and instant VM provisioning. IEEE Transactions on Network and Service Management 10, 3 (2013), 312–325.
[15] Dashveenjit Kaur. 2021. Cloud computing spend increased by a third in 2020 - TechHQ. https://techhq.com/2021/02/cloud-computing-spend-increased-by-a-third-in-2020/
[16] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30 (NIPS 2017). https://www.microsoft.com/en-us/research/publication/lightgbm-a-highly-efficient-gradient-boosting-decision-tree/
[17] In Kee Kim, Wei Wang, Yanjun Qi, and Marty Humphrey. 2018. CloudInsight: Utilizing a council of experts to predict future cloud application workloads. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE, 41–48.
[18] Jitendra Kumar, Deepika Saxena, Ashutosh Kumar Singh, and Anand Mohan. 2020. Biphase adaptive learning-based neural network model for cloud datacenter workload forecasting. Soft Computing (2020), 1–18.
[19] Shasha Liao, Hongjie Zhang, Guansheng Shu, and Jing Li. 2017. Adaptive resource prediction in the cloud using linear stacking model. In 2017 Fifth International Conference on Advanced Cloud and Big Data (CBD). IEEE, 33–38.
[20] Spyros Makridakis, Evangelos Spiliotis, and Vassilis Assimakopoulos. 2020. The M5 Accuracy competition: Results, findings and conclusions.
[21] Karl Mason, Martin Duggan, Enda Barrett, Jim Duggan, and Enda Howley. 2018. Predicting host CPU utilization in the cloud using evolutionary neural networks. Future Generation Computer Systems 86 (2018), 162–173.
[22] Feng Qiu, Bin Zhang, and Jun Guo. 2016. A deep learning approach for VM workload prediction in the cloud. In 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). IEEE, 319–324.
[23] Ali Asghar Rahmanian, Mostafa Ghobaei-Arani, and Sajjad Tofighy. 2018. A learning automata-based ensemble resource usage prediction algorithm for cloud computing environment. Future Generation Computer Systems 79 (2018), 54–71.
[24] Md Rasheduzzaman, Md Amirul Islam, Tasvirul Islam, Tahmid Hossain, and Rashedur M Rahman. 2014. Study of different forecasting models on Google cluster trace. In 16th Int'l Conf. Computer and Information Technology. IEEE, 414–419.
[25] Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. 1990. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics 6, 1 (1990), 3–73.
[26] Zolfaghar Salmanian, Habib Izadkhah, and Ayaz Isazadeh. 2020. Auto-scale resource provisioning in IaaS Clouds. Comput. J. (2020).
[27] Binbin Song, Yao Yu, Yu Zhou, Ziqiang Wang, and Sidan Du. 2018. Host load prediction with long short-term memory in cloud computing. The Journal of Supercomputing 74, 12 (2018), 6554–6568.
[28] Josep Subirats and Jordi Guitart. 2015. Assessing and forecasting energy efficiency on cloud computing platforms. Future Generation Computer Systems 45 (2015), 70–94.
[29] Sergii Telenyk, Eduard Zharikov, and Oleksandr Rolik. 2018. Modeling of the Data Center Resource Management Using Reinforcement Learning. In 2018 International Scientific-Practical Conference Problems of Infocommunications. Science and Technology (PIC S&T). IEEE, 289–296.
[30] Fan-Hsun Tseng, Xiaofei Wang, Li-Der Chou, Han-Chieh Chao, and Victor CM Leung. 2017. Dynamic resource prediction and allocation for cloud data center using the multiobjective genetic algorithm. IEEE Systems Journal 12, 2 (2017), 1688–1699.
[31] Carlos Vazquez, Ram Krishnan, and Eugene John. 2015. Time Series Forecasting of Cloud Data Center Workloads for Dynamic Resource Provisioning. J. Wirel. Mob. Networks Ubiquitous Comput. Dependable Appl. 6, 3 (2015), 87–110.
[32] Qingchen Zhang, Laurence T Yang, Zheng Yan, Zhikui Chen, and Peng Li. 2018. An efficient deep learning model to predict cloud workload for industry informatics. IEEE Transactions on Industrial Informatics 14, 7 (2018), 3170–3178.
[33] Weishan Zhang, Pengcheng Duan, Laurence T Yang, Feng Xia, Zhongwei Li, Qinghua Lu, Wenjuan Gong, and Su Yang. 2017. Resource requests prediction in the cloud computing environment with a deep belief network. Software: Practice and Experience 47, 3 (2017), 473–488.

17 https://facebook.github.io/prophet/


A APPENDIX

A.1 Exploratory Plots

Figure 3: Plot of CPU usage in MHz over time for one of the resource pools

Figure 4: Plot of Memory usage in MiB over time for one of the resource pools

Figure 5: Plot of maximum allowed CPU in MHz over time for one of the resource pools

Figure 6: Plot of CPU usage in MHz over time for all of the resource pools in the 470 dataset

Figure 7: Plot of CPU usage in MHz over time for all of the resource pools in the 480 dataset

Figure 8: Plot of CPU usage in MHz over time for all of the resource pools in the 490 dataset


A.2 Table of all Dataset Variables

Attribute Type Description

Resource pool String Name of the RP

Status Int Status ID of RP

# VMs Int Number of VMs in the RP

# vCPUs Int Number of virtual CPUs in the RP

CPU limit Int Limit of CPU utilization

CPU overheadLimit Int Maximum allowed overhead memory

CPU reservation Int Resources guaranteed availability

CPU level String Level of allocation

CPU shares Int Number of shares allocated

CPU expandableReservation Bool Is allocated CPU expandable

CPU maxUsage Int Current upper-bound usage

CPU overallUsage Int CPU used

CPU reservationUsed Int Amount of resources used

CPU reservationUsedForVm Int Reserved CPU used for VMs

CPU unreservedForPool Int CPU unreserved for Pool

CPU unreservedForVm Int CPU unreserved for VM

Mem Configured Int Total Memory configured

Mem limit Int Limit of Memory

Mem overheadLimit Int Maximum allowed overhead memory

Mem reservation Int Reserved Memory guaranteed

Mem level String Allocation level

Mem shares Int Amount of shares allocated

Mem expandableReservation Bool Is allocated Memory expandable

Mem maxUsage Int Upper-bound on usage

Mem overallUsage Int Memory used

Mem reservationUsed Int How much of the reserved Memory is used

Mem reservationUsedForVm Int Reserved Memory used for VMs

Mem unreservedForPool Int Memory unreserved for Pool

Mem unreservedForVm Int Memory unreserved for VMs

QS overallCpuDemand Int Aggregated overall CPU demand

QS overallCpuUsage Int Aggregated overall CPU usage

QS staticCpuEntitlement Int Aggregated static CPU entitlement

QS distributedCpuEntitlement Int Distributed CPU

QS balloonedMemory Int Balloon driver size

QS compressedMemory Int Compressed Memory

QS consumedOverheadMemory Int Consumed overhead Memory

QS distributedMemoryEntitlement Int Distributed Memory entitlement

QS guestMemoryUsage Int Memory used by guest

QS hostMemoryUsage Int Memory used by host

QS overheadMemory Int Memory overhead

QS privateMemory Int Amount of private Memory

QS sharedMemory Int Amount of shared memory

QS staticMemoryEntitlement Int Static Memory entitlement

QS swappedMemory Int Memory granted for a host’s swap space

VI SDK Server String VI SDK Server Name

VI SDK UUID String Universally unique identifier

Time DateTime Date and Time the datapoint is recorded

Day String String of the Weekday

Table 4: Table including attribute names, data types and descriptions of all variables within the uncleaned dataset.


A.3 Processing Pipeline

Figure 9: Sketch of the data processing pipeline for the optimization of the aggregated series.

Figure 10: Sketch of the data processing pipeline where a model is trained for each individual resource pool.


A.4 Feature Importance

Figure 11: Plot of feature importance for predicting CPU within the 470 dataset.

Figure 12: Plot of feature importance for predicting CPU within the 480 dataset.

Figure 13: Plot of feature importance for predicting CPU within the 490 dataset.

Figure 14: Plot of feature importance for predicting Memory within the 470 dataset.

Figure 15: Plot of feature importance for predicting Memory within the 480 dataset.

Figure 16: Plot of feature importance for predicting Memory within the 490 dataset.


A.5 Boxplot Results

Figure 17: Boxplot distribution of mean absolute errors on predicted CPU on the 470 dataset using statistical forecasting methods.

Figure 18: Boxplot distribution of mean absolute errors on predicted CPU on the 480 dataset using statistical forecasting methods.

Figure 19: Boxplot distribution of mean absolute errors on predicted CPU on the 490 dataset using statistical forecasting methods.

Figure 20: Boxplot distribution of mean absolute errors on predicted Memory on the 470 dataset using statistical forecasting methods.

Figure 21: Boxplot distribution of mean absolute errors on predicted Memory on the 480 dataset using statistical forecasting methods.

Figure 22: Boxplot distribution of mean absolute errors on predicted Memory on the 490 dataset using statistical forecasting methods.


A.6 Project Code

A.6.1 R Code.

library(readr)
library(fpp2)
library(tidyverse)
library(kableExtra)
library(gridExtra)

CC_RP_470 <- read_csv("Path/To/Dataset/470", col_types = cols(X1 = col_skip(),
    Time = col_datetime(format = "%Y-%m-%d %H:%M:%S")))
CC_RP_480 <- read_csv("Path/To/Dataset/480", col_types = cols(X1 = col_skip(),
    Time = col_datetime(format = "%Y-%m-%d %H:%M:%S")))
CC_RP_490 <- read_csv("Path/To/Dataset/490", col_types = cols(X1 = col_skip(),
    Time = col_datetime(format = "%Y-%m-%d %H:%M:%S")))

CC_RP_470 <- CC_RP_470[order(CC_RP_470$Time), ]
CC_RP_480 <- CC_RP_480[order(CC_RP_480$Time), ]
CC_RP_490 <- CC_RP_490[order(CC_RP_490$Time), ]

# Multiple resource pool predictions _____

# Create list with all the time series
start_time <- Sys.time()
full_lst = list()

for (name in CC_RP_470$`Resource pool`) {
  ph <- CC_RP_470[CC_RP_470$`Resource pool` == name, ]
  values <- ph$`CPU overallUsage`
  full_lst[[name]] <- ts(values, frequency = 104)
}

# Split into training and validation set (last 8 observations are held out)
train = list()
val = list()
for (name in names(full_lst)) {
  train[[name]] <- subset(full_lst[[name]], end = length(full_lst[[name]]) - 8)
  val[[name]] <- subset(full_lst[[name]], start = length(full_lst[[name]]) - 7)
}
end_time <- Sys.time()
end_time - start_time

### Forecast using naive
h = 8
naive = list()
acc_naive = data.frame()
start_time <- Sys.time()
for (name in names(train)) {
  # R resolves naive(...) to the forecast-package function,
  # despite the results list of the same name
  naive[[name]] = forecast(naive(train[[name]]), h = h)
  acc_naive["NAIVE MAE", name] = forecast::accuracy(
    naive[[name]], val[[name]])["Test set", "MAE"]
  acc_naive["NAIVE MASE", name] = forecast::accuracy(
    naive[[name]], val[[name]])["Test set", "MASE"]
  print(paste(name, ": Done"))
}
end_time <- Sys.time()
acc_naive = round(acc_naive, digits = 4)
acc_naive

kbl(acc_naive, booktabs = T, linesep = "") %>%
  kable_styling(latex_options = "scale_down", font_size = 7)

mean_acc_naive = mean(as.numeric(acc_naive[1, ]))
end_time - start_time

### Forecast using snaive
h = 8
snaive = list()
acc_snaive = data.frame()
start_time <- Sys.time()
for (name in names(train)) {
  snaive[[name]] = forecast(snaive(train[[name]]), h = h)
  acc_snaive["SNAIVE MAE", name] = forecast::accuracy(
    snaive[[name]], val[[name]])["Test set", "MAE"]
  acc_snaive["SNAIVE MASE", name] = forecast::accuracy(
    snaive[[name]], val[[name]])["Test set", "MASE"]
  print(paste(name, ": Done"))
}
end_time <- Sys.time()
acc_snaive = round(acc_snaive, digits = 4)
acc_snaive

kbl(acc_snaive, booktabs = T, linesep = "") %>%
  kable_styling(latex_options = "scale_down", font_size = 7)

mean_acc_snaive = mean(as.numeric(acc_snaive[1, ]))
end_time - start_time

### Forecast using stlm
h = 8
stlm = list()
acc_stlm = data.frame()
start_time <- Sys.time()
for (name in names(train)) {
  stlm[[name]] = forecast(stlm(train[[name]],
    lambda = BoxCox.lambda(train[[name]])), h = h)
  acc_stlm["STLM MAE", name] = forecast::accuracy(
    stlm[[name]], val[[name]])["Test set", "MAE"]
  acc_stlm["STLM MASE", name] = forecast::accuracy(
    stlm[[name]], val[[name]])["Test set", "MASE"]
  print(paste(name, ": Done"))
}
end_time <- Sys.time()
acc_stlm = round(acc_stlm, digits = 4)
acc_stlm

kbl(acc_stlm, booktabs = T, linesep = "") %>%
  kable_styling(latex_options = "scale_down", font_size = 7)

mean_acc_stlm = mean(as.numeric(acc_stlm[1, ]))
end_time - start_time

### Forecast using ARIMA without error
h = 8
arimanoerror = list()
acc_arimanoerror = data.frame()
start_time <- Sys.time()
for (name in names(train)) {
  arimanoerror[[name]] = forecast(auto.arima(train[[name]]), h = h)
  acc_arimanoerror["ARIMANOERR MAE", name] = forecast::accuracy(
    arimanoerror[[name]], val[[name]])["Test set", "MAE"]
  acc_arimanoerror["ARIMANOERR MASE", name] = forecast::accuracy(
    arimanoerror[[name]], val[[name]])["Test set", "MASE"]
  print(paste(name, ": Done"))
}
end_time <- Sys.time()
acc_arimanoerror = round(acc_arimanoerror, digits = 4)
acc_arimanoerror

kbl(acc_arimanoerror, booktabs = T, linesep = "") %>%
  kable_styling(latex_options = "scale_down", font_size = 7)

mean_acc_arimanoerr = mean(as.numeric(acc_arimanoerror[1, ]))
end_time - start_time

### Forecast using ARIMA with error and external regressor
h = 8
arimaerror = list()
acc_arimaerror = data.frame()
start_time <- Sys.time()
for (name in names(train)) {
