
Predicting runtime behaviour of Linux processes using machine learning





Willemijn Beks

10775110

Bachelor thesis
Credits: 18 EC

Bachelor's programme in Artificial Intelligence (Bachelor Opleiding Kunstmatige Intelligentie)
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. Ing. S.J. Altmeyer
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 28th, 2019


Abstract

If we could approximate the execution time of a process, scheduling could be made more efficient. In this thesis, we try to predict CPU time based on a dataset that we collected with the help of ps, a Linux interface to proc. The collected data contains information about processes, of which the vast majority ran for only a few seconds. We choose two machine learning techniques: multivariate linear regression (MVLR) and a feedforward neural network (FFNN). This thesis shows that there is no linear relation between the features in the dataset and the CPU time of a process. It also shows that the chosen approach does not yield sufficient results to reliably predict the execution time of a process in Linux.


Contents

1 Introduction
2 Background
    2.1 Processes in Linux
    2.2 Regression
3 Method
    3.1 Acquiring data
    3.2 Pre-processing the dataset
    3.3 Feedforward Neural Network for Regression
        3.3.1 Training Details
    3.4 Multivariate Linear Regression
    3.5 Evaluation
4 Results
    4.1 Dataset
    4.2 Results of Predictions
        4.2.1 Predictions Feedforward Neural Nets trained with MSE
        4.2.2 Predictions Feedforward Neural Nets trained with MAE
        4.2.3 Predictions Feedforward Neural Nets trained with Huber loss
        4.2.4 Predictions linear models
        4.2.5 Predictions FFNN with 3 hidden layers
5 Discussion and Future Work
6 Conclusion
Appendices
    A Data acquisition script
    B PS features that were used

1 Introduction

Inspired by research on using machine learning to predict execution times in real-time systems (Bonenfant et al. 2017, Huybrechts, Mercelis, & Hellinckx 2018, Huybrechts, Cassimon, et al. 2018), this thesis tries to predict execution times in a more complicated environment.

The Completely Fair Scheduler (CFS), an implementation of weighted fair scheduling, performs scheduling in Linux environments: it controls how long a task can run before it is pre-empted (Lozi et al. 2016). When a task’s timeslice has expired, the scheduler pre-empts the running task to free the CPU for a different task (Lozi et al. 2016).

If the execution time of a task, and therefore its remaining execution time, were approximately known, the scheduler could decide whether and when to pre-empt that task. One possible use of this research is therefore to adjust the time quantum of a task accordingly, making scheduling more efficient. Improved scheduling is useful because it could reduce the turnaround time of tasks, which makes the system more efficient (Negi & Kumar 2005). Furthermore, smarter adaptation of the scheduler would mean more efficient use of resources.

In related research there have been attempts to predict the worst-case execution time (WCET) with machine learning (Bonenfant et al. 2017, Huybrechts, Mercelis, & Hellinckx 2018, Huybrechts, Cassimon, et al. 2018). WCET is the maximum execution time a task requires on a specific hardware environment (Wilhelm et al. 2008). WCET analysis focuses on finding a safe WCET bound; this bound must never be too low, otherwise it would not be useful. When predicting the execution time rather than a bound, slight deviations from the real execution time in either direction can be good enough.

Predicting WCET with machine learning is interesting because it can give insight into hardware needs early in the development of embedded systems (Huybrechts, Mercelis, & Hellinckx 2018). Huybrechts et al. developed a hybrid approach to WCET analysis for real-time systems that combines the static approach with machine learning, using a dataset of building blocks made by dividing the source code into smaller parts (Huybrechts, Mercelis, & Hellinckx 2018). They used several regression techniques, such as linear regression and support vector regression, to predict the WCET, with promising results; support vector regression yielded the best results.

Huybrechts et al. have also explored the use of neural networks for predicting WCET, again finding promising results, although not good enough to use as a reliable upper bound. They used feedforward neural nets and a tree recursive neural net, applied these to the building blocks, and then applied feature selection per block (Huybrechts, Cassimon, et al. 2018). The features were counts of certain operations per block (e.g. the number of bit-wise operations) (Huybrechts, Cassimon, et al. 2018). Huybrechts et al. found that their dataset was not sufficient for an FFNN to predict meaningful upper bounds (Huybrechts, Cassimon, et al. 2018). Believing that the approach is feasible, they envision further optimizing hyperparameters and expanding the dataset to improve results before trying other deep learning models such as Long Short-Term Memory networks (Hochreiter & Schmidhuber 1997). Tackling the similar problem of predicting execution times rather than upper bounds, while paying more attention to the hyperparameters and structure of the FFNN, should therefore be a reasonable approach.

The dataset is a point of consideration, because using source code to train models is not always possible in a Linux environment: source code is simply not available for every program. Executables are available in Linux but do not contain program input, so they might be missing information that is vital for predicting execution times. Furthermore, Linux often runs on more complicated architectures than the specific embedded systems on which the previously mentioned research has focused.

Less recent research has shown the need to predict, among other things, the execution time of a job when scheduling jobs in a distributed environment. Ali et al. show moderate success with relatively simple features and a history-based similarity-measuring model, achieving an accuracy of 80% (Ali et al. 2004). The history-based model classified tasks based on their similarity to programs that had been encountered earlier, using superficial features. These promising results suggest possible improvements in further research.

There has also been research stating that it is possible to predict scheduling behaviour, but that success depends on the features and the machine learning techniques used (Negi & Kumar 2005). Negi et al. researched the possibility of improving the process turnaround time of an older Linux scheduler (version 2.4.20-8) with the help of machine learning. The turnaround time is the delay between process submission and program completion, that is, idle time plus execution time and the processing of input/output. Negi et al. modified the kernel to increase the timeslice of a running program when it was almost finished. For this, a decision tree and k nearest neighbours were used, of which the decision tree algorithm resulted in the best performance in terms of execution time. In particular, they used data from the program traces of a few representative programs, whereas there are no readily available programs that would automatically generalise to all processes on a personal Linux computer. Negi et al. focus on training a model that learns to associate ideal time slices with implementations of standard programs to lower turnaround time, while this research focuses on applying machine learning to the more complicated setting to see whether that is a viable approach.

So how can the runtime behaviour of Linux processes be predicted using machine learning? The results of this research give insight into the possibilities of predicting execution time on a personal computer that runs Linux, based on data collected on the same machine. An exploratory look is taken at what could be viable approaches to this question. First, a dataset is created using the Linux ps command. To learn whether execution time can be predicted using machine learning, we use two techniques: multivariate linear regression (MVLR) and a feedforward neural network (FFNN). If a computationally light method like MVLR already yields acceptable results, it could be the basis for a computationally light implementation. If no linear relation can be found, which is quite possible given that this is a complex problem with an abundance of features that determine the execution time, a non-linear relation might still exist. Neither approach yields results that point towards a correlation between the data and execution time, so it is probable that the dataset does not contain enough information about the execution time, and more research on this will be needed.

2 Background

In this section we give a short explanation of the terms and techniques used. The background section is split into two parts: the first covers Linux processes and everything related to collecting the dataset, and the second gives an overview of the relevant machine learning techniques.

2.1 Processes in Linux

Processes in Linux are instances of running programs. Every running process has a process identifier (pid), which is a unique nonnegative integer (Kerrisk n.d.(a)). On creation, a process is assigned a new pid that is not currently taken. Pids wrap around at a certain value, specified in the kernel, meaning there is a maximum number of processes that can run at once (Kerrisk n.d.(b)).

In Linux, programs run either in user space or in kernel space. The Linux Information Project defines kernel space as follows: "Kernel space is where the kernel (i.e., the core of the operating system) executes and provides its services" (footnote 1). The main difference between code that runs in kernel space and code that runs in user space is that user space code has fewer privileges, so a vulnerability in it is less likely to compromise the whole system.

In Linux there is a process information pseudo-filesystem, or proc, which provides an interface to kernel data structures (Kerrisk n.d.(b)). So proc is a directory in Linux that contains information about running processes. There are two user space interfaces that provide information about proc, namely ps and top. According to the Linux manual pages, top provides a dynamic real-time view of a running system. It can display system summary information as well as a list of processes or threads currently being managed by the Linux kernel (Kerrisk n.d.(d)). Ps reports a snapshot of the current processes (Kerrisk n.d.(c)).
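To make the idea of a ps snapshot concrete, the following sketch parses ps-style output into one record per process. It is only an illustration: the column selection (`pid,cputime,comm`), the helper names `parse_ps_output` and `snapshot`, and the sample text are assumptions made for this example and are not taken from the thesis script.

```python
import subprocess


def parse_ps_output(text):
    """Turn whitespace-separated ps output (header line first) into dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split at most len(header) - 1 times so the last column may hold spaces.
        values = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, values)))
    return rows


def snapshot():
    """Take one snapshot of all processes (needs a Linux-style ps binary)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,cputime,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps_output(out)


# Hypothetical sample in the layout printed by `ps -eo pid,cputime,comm`.
SAMPLE = """\
  PID     TIME COMMAND
    1 00:00:03 systemd
 1042 00:12:41 firefox
"""
sample_rows = parse_ps_output(SAMPLE)
```

The command name is requested as the last column so that names containing spaces do not shift the other fields.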

2.2 Regression

In supervised machine learning, multiple techniques can be applied for regression. In regression, a function or model maps the input to a numerical output value, such as execution time. Multivariate linear regression (MVLR) is one of the best-known machine learning techniques; it combines the input variables linearly to estimate the output (Bishop 2006). The optimal function can be found exactly with least squares, although this becomes computationally expensive on datasets of higher dimensionality because it requires a matrix inverse.

Another standard method for regression in supervised machine learning is a feedforward neural network (FFNN). In essence, an FFNN is a composition of linear operations and non-linear activation functions. It turns out that, under some assumptions, an FFNN can approximate any continuous function (Hornik 1991). More specifically, an FFNN consists of three components. First, it has a number of input nodes equal to the number of features in a data sample. Second, it has any number of hidden layers, each with an arbitrary number of nodes, assuming sufficient computational resources. Third, it has an output layer whose number of nodes equals the dimension of the output (e.g. a single node when predicting a single value). All nodes in a layer n are connected to every node in layer n-1 and layer n+1, if these layers exist. Each connection has an associated weight that is applied to the value passed along it. The output of every node is determined by its activation function, which is what makes a neural network non-linear.

During training, the input first passes through the complete network until an output is produced; this is the forward pass. The loss function then determines the cost of the prediction. Loss functions map the predicted outcome and the ground truth to a cost, where a lower cost indicates a prediction closer to the target. This cost is then propagated back through the network and the weights are updated; this is back-propagation (Rumelhart, Hinton, Williams, et al. 1988). Neural networks use optimizers to learn the optimal weights. An optimizer is a method that adjusts the network parameters to improve performance in terms of the loss function. In deep learning, gradient-based optimizers are used, which means that the parameters are updated based on the partial derivatives of the loss function with respect to the parameters.
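The forward pass described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the layer sizes, the random initial weights, the input vector, and the target value of 1.5 are all made up, and the back-propagation step is left out.

```python
import random


def relu(x):
    """Rectified linear unit: negative values become zero."""
    return max(0.0, x)


def dense(inputs, weights, biases):
    """One fully connected layer: each output node sums its weighted inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x, params):
    """Forward pass: one hidden layer with ReLU, then a single ReLU output node."""
    hidden = [relu(h) for h in dense(x, params["w1"], params["b1"])]
    (out,) = dense(hidden, params["w2"], params["b2"])
    return relu(out)


def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]],
        "b2": [0.0],
    }


params = init_params(n_in=3, n_hidden=4)
prediction = forward([0.2, 0.7, 0.1], params)
squared_error = (prediction - 1.5) ** 2  # loss against a made-up target of 1.5
```

A gradient-based optimizer would next compute the derivative of `squared_error` with respect to every weight and bias and move each parameter a small step against that gradient.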

Several other aspects have to be taken into account when training a network. The dataset is usually divided into three parts: a training set, a validation set, and a test set, with the majority of the data assigned to the training set. The training set is used to train the model, the validation set is used to tune the hyperparameters, and the test set is used only to measure the final results. Hyperparameters are parameters that concern the structure and training procedure of the network rather than the internal weights it should learn; that is, they are not optimized by an optimizer. They include the choice of loss function and optimizer, the batch size, the number of epochs, and the learning rate. The training set is often divided into batches; the batch size is the number of training examples in one forward pass. Batches keep training computationally feasible and often make it faster. The batches of training data are fed through the network for a specified number of epochs, where one epoch is one full pass over the training set. A low number of epochs can cause underfitting, while a high number can cause overfitting. The learning rate can be seen as the speed at which the weights are updated based on the back-propagated error.
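The relation between the three data splits, batches, and epochs can be sketched as follows. The split fractions (70/15/15), the batch size, and the number of epochs are assumptions chosen for illustration; the text above only states that the training set receives the majority of the data.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle and split rows into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])


def batches(rows, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


train, val, test = split_dataset(range(100))
n_epochs = 3
for epoch in range(n_epochs):            # one epoch = one full pass over `train`
    for batch in batches(train, batch_size=16):
        pass  # a forward pass, loss computation, and weight update would go here
```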

To make sure an FFNN can learn from a given dataset, the dataset has to be pre-processed. Some features in a dataset are non-numerical; if these features are categorical, they have to be encoded as numerical features. An approach such as one-hot encoding is often chosen, which introduces a new feature for every value a specific feature can take. For features with many possible values, this may introduce a large number of feature dimensions, and the curse of dimensionality then has to be taken into account. The curse of dimensionality refers to machine learning models no longer working correctly in high-dimensional spaces (Bishop 2006).

To select the optimal hyperparameters, k-fold cross-validation can be applied. The training set is divided into k parts, of which k-1 parts are used as training set and the remaining part as validation set. This is repeated k times, each time with a different part of the data as validation set, and the performance is measured by averaging the k resulting performances. Each combination of hyperparameters is tested in this way, so the best combination can be found reliably. With k-fold cross-validation there is no need for a separate validation set, which makes it suitable for smaller datasets.
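The k-fold procedure can be sketched in a few lines. The helper names and the placeholder score function are assumptions for this example; real code would train a model on the training indices and return its validation error, and would normally shuffle the data before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, val) over the k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Placeholder score (an assumption): it just reports the validation fold size.
mean_score = cross_validate(10, 5, lambda train, val: len(val))
```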

Batch normalization makes the training of networks more stable and improves performance. It adds two learnable parameters per normalized activation: a scale parameter (γ), which sets the standard deviation, and a shift parameter (β), which sets the mean. Batch normalization ensures that the output of the previous layer is normalized, followed by an affine transformation defined by γ and β. It keeps other network properties, like the non-linearity of activation functions, intact (Ioffe & Szegedy 2015).
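For a single feature, the normalization followed by the affine transformation can be sketched as below. This shows training-time behaviour only; the running averages that are typically used at inference time are omitted, and the small constant `eps` is an assumed value that guards against division by zero.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)
```

With the default γ = 1 and β = 0 the batch ends up with mean zero and a standard deviation close to one; other values of γ and β move the batch to a different scale and mean.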

3 Method

3.1 Acquiring data

To train a machine learning model, a dataset that contains Linux processes is needed. No such dataset was available for this thesis, so one had to be created; the dataset is therefore part of the results of this thesis. The dataset should be created on the same machine on which the model will be used, so that hardware and software differences do not play a role.

Processes do not behave the same way on every machine. They are influenced not only by the hardware and software, but also by the user(s) of the system. The behaviour of a user can vary greatly over time, so a dataset that represents (relatively) current behaviour should be used.

The dataset was created on Linux kernel 4.4.0-21-generic, running Linux Mint 18 on an HP laptop with an Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz. While the script was running, the laptop was used as a student normally would: programming, using Google Chrome/Firefox, Spotify, and the occasional chat program (Franz).

In Linux, there are multiple ways to acquire data about processes without having to simulate an artificial environment first. A kernel module can be written, or a user space interface can be used or created. Using a user space interface is more straightforward, so this approach is chosen. As mentioned in section 2.1, there are two user space interfaces that use information from proc: ps and top. It would also be possible to create a new interface that reads from proc directly. These options are discussed below.

Top provides a view that keeps updating while it runs, whereas a snapshot is preferred for this thesis because process data has to be collected over an extended period. Top does have a batch mode in which its output can be written to a file, but ps contains the same information and directly gives a snapshot of all running processes. Because a snapshot was needed and ps offers a convenient interface, ps was chosen over top to collect the data.

Proc has a subdirectory for every running pid. These subdirectories contain some information that no ps command captures, but proc has no ready-to-use interface in the way that ps has. Ps can collect a lot of information about running processes, so there is currently no need for a self-defined interface to proc. In conclusion, ps is preferred over a new interface to proc for this thesis, because it provides a snapshot with enough promising features for this exploratory approach.

To assemble the dataset, a Python script was written. The script uses ps and writes to a csv file. The csv format was chosen because it can be used as input for PyTorch, the deep learning platform in which the machine learning models are built. Ps accepts many different output keywords. Redundant information is cut from the dataset through keyword selection: some keywords are aliases of each other, and only one alias of each was kept.

One of the objectives of this research is to predict the final execution time of processes, so processes must only be registered once their final execution time is known. The Python script executes a ps command, followed by a sleep(1) command and then the same ps command again. A sleep of one second is used because the execution time is measured in whole seconds, so a shorter sleep would not change the measured final execution time, while a longer sleep might give imprecise results. The script then registers which process identifiers (pids) from the first snapshot are no longer in the later one. The processes that belong to those pids are no longer running and are written to a file. The downside of this method is that processes that start and finish within a single time frame are missed.

while script runs do
    old list of processes ← get active processes with ps
    sleep(1)
    new list of processes ← get active processes with ps
    for every process in old list of processes do
        if process does not occur in new list and is not a ps command then
            prepare process line for writing to csv
            write process line to csv
        end
    end
end

Algorithm 1: Script that collects the dataset with the ps command and writes it to a csv file. The complete code can be found in Appendix A.
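The core step of the collection loop, as described in the text (pids present in the earlier snapshot but absent from the later one are written out, and ps itself is skipped), could look roughly like this in Python. The record layout, the field names, and the two example snapshots are hypothetical; the actual script is the one in Appendix A.

```python
import csv
import io


def finished_processes(old_snapshot, new_snapshot):
    """Return rows from the old snapshot whose pid is gone in the new one."""
    still_running = {row["pid"] for row in new_snapshot}
    return [row for row in old_snapshot
            if row["pid"] not in still_running and row["comm"] != "ps"]


def write_rows(rows, handle, fieldnames):
    """Append the finished processes to an open csv file handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    for row in rows:
        writer.writerow(row)


# Two hypothetical snapshots taken one second apart.
old = [{"pid": "10", "comm": "bash", "cputime": "00:00:02"},
       {"pid": "11", "comm": "ps", "cputime": "00:00:00"},
       {"pid": "12", "comm": "gcc", "cputime": "00:00:05"}]
new = [{"pid": "10", "comm": "bash", "cputime": "00:00:02"}]

done = finished_processes(old, new)
buffer = io.StringIO()  # stands in for the csv file on disk
write_rows(done, buffer, fieldnames=["pid", "comm", "cputime"])
```

In the real loop, the two snapshots would come from consecutive ps calls separated by `time.sleep(1)`.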

The script also records its own ps commands, which means that, in a way, it measures itself. Normally a user would not be executing ps commands constantly, so this would greatly distort the dataset. For this reason, ps processes were ignored. If the user does use ps a lot, this might make the model worse, because a model trained without ps usage will probably not know how to deal with a process that belongs to a ps command. The monitor script will probably increase the runtime of other processes, because the extra ps commands spawn processes that need time on the CPU. This slows down all processes in the dataset evenly, so it should not hurt prediction performance. For this thesis, the assumption is therefore made that the monitor script has no negative impact on the execution time of unrelated tasks.

A downside of writing to the file only when a process is about to finish is that there is just one snapshot of every feature, taken in the last second that the process is active. The potential problem is twofold. Firstly, if processes behave very similarly when they are about to end, there might be little differentiating information, which makes any machine learning model perform worse. Secondly, some features may contain information that is only useful when measured at different points in time. This sequential information might be necessary for a machine learning approach to perform well enough. Assuming that these problems will not be significant is a simplification that is necessary to reduce the complexity of the approach: there is no need to explore more complicated approaches immediately if the simple approach has not yet been explored.

To acquire the dataset, the script ran without interruption for a month. This means that the date feature has to be processed to be useful, because the raw date carries little information when all measurements come from a single month. The time elapsed since the starting point, however, can still be useful.

At this moment, no other steps are taken to prevent measuring the effects of the data collection script itself. The assumption is that processes spawned by the Python script will not be numerous enough to disturb the dataset in a way that hurts training. The Python time.sleep function that is used relies on the underlying system sleep function, which does not spawn new processes. The only other action the script performs is writing to a file. These actions are repeated constantly, so even if the script slightly disturbs the dataset, it does so evenly for all datapoints and should not hurt the performance of the machine learning model. In conclusion, the dataset is created by the Python script and saved in a csv file.

3.2 Pre-processing the dataset

To be able to use the dataset in a prediction model, it has to be pre-processed. Some rows might not be usable and some features might not be useful, for instance because they contain duplicate information or, for categorical features, have too many possible values. Because categorical features need to be encoded, such features could blow up the dimensionality of the dataset and invite the curse of dimensionality. Additionally, more samples are needed to prevent overfitting when dimensionality is high. So the need for pre-processing is clear.

In this section, we explain which features are dropped and why. We want to predict the execution time, which is the time a process spends on the CPU, so the prediction label should be one of the features offered by ps that contain this value. The prediction label is cputime, which is the cumulative CPU time according to the manual page of ps. Other features that contain the total CPU time are dropped (e.g. bsdtime or time).

To prevent the predictive model from being biased towards predicting an execution time of zero, all datapoints with a cputime of ’00:00:00’ are removed. If the predictive model is going to be used for more efficient scheduling, processes that are done within a second are ’flying under the radar’, because they are already finished before they can or need to be optimized. Removing processes with an execution time of zero shrinks the dataset by 81%: of the original 26128 lines, 4870 remain. Ideally, the model would learn the entire range of execution times. But if the majority of execution times is under one minute, the model might not be able to learn to predict higher execution times well. When model performance is hurt by these outliers, it is necessary to also test models without runtimes above a certain value, to ensure a more evenly distributed training set. Various cutoff points are selected, and datapoints above each cutoff point are dropped, which should improve the models’ performance. The chosen cutoff points are 7200 seconds, 500 seconds, 200 seconds, 20 seconds, and 5 seconds. This selection is made to see whether accuracy generally increases as the cutoff decreases.
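The filtering described above can be sketched as follows. The helper names and the example rows are assumptions; the parser accepts the `[DD-]HH:MM:SS` layout that ps uses for cputime, and whether a process exactly at a cutoff is kept (as it is here) is also an assumption.

```python
def cputime_to_seconds(value):
    """Convert a ps cputime string of the form [DD-]HH:MM:SS to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def filter_rows(rows, cutoff=None):
    """Drop zero-second processes and, optionally, those above a cutoff."""
    kept = []
    for row in rows:
        seconds = cputime_to_seconds(row["cputime"])
        if seconds == 0:
            continue
        if cutoff is not None and seconds > cutoff:
            continue
        kept.append(dict(row, cputime_seconds=seconds))
    return kept


# Hypothetical rows; only the cputime column matters for this step.
rows = [{"cputime": "00:00:00"}, {"cputime": "00:00:03"},
        {"cputime": "00:08:20"}, {"cputime": "05:31:58"}]
for cutoff in (7200, 500, 200, 20, 5):
    print(cutoff, len(filter_rows(rows, cutoff)))
```

As a check on the reported reduction: 4870 of 26128 lines is about 18.6%, so roughly 81% of the lines are removed.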

All categorical features with more than 10 different values are removed to keep the dimensionality as low as possible. Keeping the dimensionality low at the cost of removing features is necessary to keep the model working correctly. When features with many possible values are encoded, a given value may appear only a few times, which hurts a model’s performance without adding value. Furthermore, all features that contained the same single value in every row were discarded.
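Both selection rules can be expressed in one small function. This is a sketch under assumptions: the function name, the example rows, and the way categorical features are passed in are invented for illustration, although the feature names imitate ps keywords.

```python
def select_features(rows, categorical, max_values=10):
    """Return the feature names that survive both filters."""
    keep = []
    for name in rows[0]:
        distinct = {row[name] for row in rows}
        if len(distinct) <= 1:
            continue  # same value in every row: carries no information
        if name in categorical and len(distinct) > max_values:
            continue  # too many categories to encode without many sparse columns
        keep.append(name)
    return keep


# Hypothetical rows with 12 distinct fname values, 2 stat values, 1 tty value.
rows = [{"fname": f"prog{i}", "stat": "S" if i % 2 else "R", "tty": "?", "nice": i % 3}
        for i in range(12)]
kept = select_features(rows, categorical={"fname", "stat", "tty"})
```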

The date and start time features are removed because they contain the same information as etimes, the number of seconds elapsed since the process started.

The args feature is removed because it can, in theory, take infinitely many values. Although it contains useful information, this information is also contained in fname, so args can safely be dropped.

Subsequently, there are some pairs of features that contain equal information. One contains an ID in numbers and the other of the pair contains the ID in words. It is not needed to keep both, so the numerical ID is discarded.

To reduce dimensionality, and under the assumption that the current stack size contains information about execution time, the stack pointer (esp) and the stack base pointer (stackp) are combined into one feature that contains the stack size. After this initial phase of feature selection, 32 features are left, of which 19 are categorical. An overview of these features can be found in Appendix B. To be able to use the categorical features, they have to be encoded as numerical features, which is done with one-hot encoding. This increases the number of features to 103 on the dataset without execution times of zero.
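One-hot encoding replaces a categorical feature by one 0/1 column per observed value, which is why the feature count grows from 32 to 103. A minimal sketch, with an invented helper name, invented column naming (`feature=value`), and invented example rows:

```python
def one_hot_encode(rows, categorical):
    """Replace each categorical feature by one 0/1 column per observed value."""
    categories = {name: sorted({row[name] for row in rows}) for name in categorical}
    encoded = []
    for row in rows:
        # Copy the numerical features unchanged.
        new_row = {k: v for k, v in row.items() if k not in categories}
        for name, values in categories.items():
            for value in values:
                new_row[f"{name}={value}"] = 1 if row[name] == value else 0
        encoded.append(new_row)
    return encoded


rows = [{"stat": "S", "nice": 0}, {"stat": "R", "nice": 5}, {"stat": "Z", "nice": 0}]
encoded = one_hot_encode(rows, categorical=["stat"])
```

Here the single `stat` feature with three observed values becomes three columns, so each row goes from two features to four.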

3.3 Feedforward Neural Network for Regression

The machine learning method that is the focus of this thesis is an FFNN that should learn to predict execution time in seconds. The objective is to predict execution times based on features that are easily obtained through ps. Execution time is always a positive value in seconds, so the network should only predict positive values. This means that a rectified linear unit (ReLU) can be used as the activation function in the final layer. The other layers also achieve non-linearity with ReLU. The feedforward neural network is implemented in PyTorch (Paszke et al. 2017).

The relevant aspects of the neural network implementation are now discussed. A network can be trained with different optimizers, of which the Adam (adaptive moment estimation) optimizer (Kingma & Ba 2014) is the most popular for neural networks. However, Wilson et al. have shown that the Stochastic Gradient Descent (SGD) optimizer often finds different, and better, solutions than adaptive methods like Adam (Wilson et al. 2017). The networks are therefore trained with SGD.

Multiple loss functions can be used in a neural network for regression, of which the Mean Squared Error (MSE) loss is the most commonly used. A disadvantage of MSE is that it is sensitive to outliers. In the dataset, a small minority of processes, about 1.4% of the dataset, run longer than 500 seconds. While these are outliers, they are processes that the neural network would ideally learn to predict. Learning to predict smaller execution times would also be a step in the right direction, however, so treating the long-running processes as unimportant outliers might be needed to get results. The Mean Absolute Error (MAE) and Huber loss are less sensitive to outliers and prevent a model from being dominated by them. For this reason, we consider MSE, MAE, and Huber loss in our experiments.
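The difference in outlier sensitivity between the three losses can be seen in a small worked example. The threshold `delta=1.0` for the Huber loss and the example predictions and targets are assumptions chosen for illustration.

```python
def mse(pred, target):
    """Mean squared error: large errors are squared and dominate the mean."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error: every error counts linearly."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


targets = [1.0, 2.0, 3.0, 500.0]      # the last target plays the outlier
predictions = [1.0, 2.0, 3.0, 10.0]
losses = {"mse": mse(predictions, targets),
          "mae": mae(predictions, targets),
          "huber": huber(predictions, targets)}
```

With one error of 490 and three perfect predictions, the MSE is 60025, while the MAE is 122.5 and the Huber loss 122.375: the single outlier weighs far more heavily under MSE.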

The performance of a network is influenced by its hyperparameters, so hyperparameter tuning is necessary. K-fold cross-validation with a standard k of 5 is used to determine the optimal batch size and learning rate. The tested batch sizes are 8, 16, 32, and 64. The tested learning rates are 10^-2, 10^-3, 10^-4, and 10^-5. To see whether the results of the networks trained with one hidden layer were due to overfitting, networks are also trained with the smallest batch size of 8 and the smallest learning rate of 10^-5 to see if this performs better. For every setting, a computational budget of 500 epochs is made available.
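The hyperparameter search amounts to trying all 16 combinations of batch size and learning rate and keeping the one with the lowest validation loss. In the sketch below, `toy_score` is a stand-in (an assumption) for the real score, which would be the mean validation loss over the 5 folds after training for up to 500 epochs.

```python
from itertools import product

BATCH_SIZES = [8, 16, 32, 64]
LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5]


def grid_search(score_fn):
    """Return the (batch_size, learning_rate) pair with the lowest score."""
    best, best_score = None, float("inf")
    for batch_size, lr in product(BATCH_SIZES, LEARNING_RATES):
        score = score_fn(batch_size, lr)
        if score < best_score:
            best, best_score = (batch_size, lr), score
    return best, best_score


def toy_score(batch_size, lr):
    """Made-up score that is smallest at batch size 16 and learning rate 1e-3."""
    return abs(batch_size - 16) + abs(lr - 1e-3)


best_setting, best_value = grid_search(toy_score)
```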

The network structure depends on the number of hidden layers and the number of nodes per layer. The chosen network structure consists of one hidden layer with 60 nodes. If this model underfits, a model with more hidden layers is also tested to see if performance improves. This second model has three hidden layers of 60 nodes each. Figure 1 visualizes the basic network structure and Figure 2 the more complicated one.
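Since the thesis builds its models in PyTorch, the two architectures could be written roughly as follows. This is a sketch, not the thesis code: the helper `build_ffnn`, the position of batch normalization before each hidden activation, the learning rate, and the random batch are all assumptions.

```python
import torch
from torch import nn


def build_ffnn(n_inputs, n_hidden_layers=1, width=60):
    """FFNN with ReLU hidden layers and a ReLU on the single output node."""
    layers, size = [], n_inputs
    for _ in range(n_hidden_layers):
        # Placing batch norm before the activation is an assumption.
        layers += [nn.Linear(size, width), nn.BatchNorm1d(width), nn.ReLU()]
        size = width
    layers += [nn.Linear(size, 1), nn.ReLU()]
    return nn.Sequential(*layers)


torch.manual_seed(0)
model = build_ffnn(n_inputs=103)                       # 103 features after encoding
deep_model = build_ffnn(n_inputs=103, n_hidden_layers=3)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                                 # nn.L1Loss() would give MAE

x = torch.rand(8, 103)                                 # one random batch of 8 samples
y = torch.rand(8, 1)
optimizer.zero_grad()
loss = loss_fn(model(x), y)                            # forward pass and loss
loss.backward()                                        # back-propagation
optimizer.step()                                       # one SGD weight update
```

The final ReLU guarantees non-negative predictions, matching the requirement that execution times cannot be negative.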

[Figure 1: network diagram with an input layer (inputs #1 to #∼100), one hidden layer, and an output layer with a single output node.]

Figure 1: Depiction of the feedforward neural network used in this thesis. The hidden layer is 10 times as big in reality as in the drawing, because the full layer does not visualize well; in reality, the hidden layer consists of 60 nodes.

[Figure 2: network diagram with an input layer (inputs #1 to #∼100), three hidden layers, and an output layer with a single output node.]

Figure 2: Depiction of the feedforward neural network used in this thesis on the best performing models. Each hidden layer is 10 times as big in the actual model as in the drawing, because the full layers do not visualize well; in the actual model, each hidden layer consists of 60 nodes.


3.3.1 Training Details

Several optimizations are applied to improve the performance of the neural network. They are described in the following paragraphs.

When the input of a neural network is not normalized, some features have a much larger range than other features. When this occurs, these features count towards the total error more than their smaller counterparts. Likewise, features with extreme outliers can make it harder for the model to train. According to Sola and Sevilla, input normalization can reduce estimation errors and significantly shorten the training process (Sola & Sevilla 1997). The input normalization is done with the sklearn MinMaxScaler (Pedregosa et al. 2011).
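As an illustration of what min-max scaling does, the sketch below rescales every feature to the range [0, 1] in plain Python. It mirrors the idea behind MinMaxScaler but is not the library code; the helper names and the two example features are made up for illustration.

```python
def fit_min_max(rows):
    """Return the per-feature minima and maxima of a list of feature rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(rows, minima, maxima):
    """Scale every feature to [0, 1]; constant features map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, minima, maxima)
        ])
    return scaled


# Two illustrative features with very different ranges,
# for example a nice value and a resident set size in kilobytes.
train = [[0, 1_000], [10, 500_000], [19, 2_000_000]]
minima, maxima = fit_min_max(train)
scaled = transform_min_max(train, minima, maxima)
print(scaled)
```

After scaling, both features span the same range, so neither dominates the error purely because of its unit.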

As mentioned in section 2.2 and according to Ioffe and Szegedy, batch normalization can dramatically speed up the training process and improve a network’s performance (Ioffe & Szegedy 2015). Batch normalization is done with PyTorch’s BatchNorm (Paszke et al. 2017).

The cpu-time as measured by ps is in seconds. The shortest process measured only ran for one second, while the longest process ran for 19918 seconds (about 332 minutes, or roughly 5.5 hours). By the same logic that makes batch normalization work, it is useful to transform the targets so the values are closer to zero. The targets are transformed to minutes, so they keep the same meaning while being closer together. When higher execution times are dropped and the execution times are closer together, this might be an unnecessary step, so when training on lower execution times only, performance will also be tested without this step.
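The target transformation is a plain unit conversion that has to be inverted before predictions are compared with the measured cpu time. A minimal sketch, with illustrative function names:

```python
def seconds_to_minutes(seconds):
    """Transform a target from seconds to minutes before training."""
    return seconds / 60.0


def minutes_to_seconds(minutes):
    """Transform a prediction back to seconds before evaluation."""
    return minutes * 60.0


longest = 19918  # longest cpu time in the dataset, in seconds
print(seconds_to_minutes(longest))          # just under 332 minutes
print(seconds_to_minutes(longest) / 60.0)   # roughly 5.5 hours
print(minutes_to_seconds(seconds_to_minutes(longest)))
```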

3.4 Multivariate Linear Regression

As a starting point and to use as a baseline later, multivariate linear regression is used to predict the targets. If a simple multivariate linear regression model is not outperformed by a neural network implementation, it might not be worth it to pursue this approach further.

Multivariate linear regression is an algorithm that learns the parameters of a linear function. To learn these parameters gradient descent is used. Gradient descent adjusts the parameters of the linear function until the error converges to a minimum. At this minimum the function has optimal parameters.
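The following sketch shows gradient descent fitting a linear function on toy data. It uses full-batch updates on the MSE in plain Python; the learning rate, number of epochs and data are assumptions chosen so that the toy problem converges, not the settings used in the experiments.

```python
def predict(weights, bias, x):
    """Linear function: weighted sum of the features plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit weights and bias by full-batch gradient descent on the MSE."""
    n, d = len(xs), len(xs[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            for j in range(d):
                grad_w[j] += 2 * err * x[j] / n
            grad_b += 2 * err / n
        # Step against the gradient to reduce the error.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Toy data generated from y = 2*x1 - 3*x2 + 1, without noise.
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
ys = [2 * a - 3 * b + 1 for a, b in xs]
weights, bias = gradient_descent(xs, ys)
print(weights, bias)
```

On this noise-free toy problem the parameters approach the generating values 2, -3 and 1; on real data the minimum only gives the best linear fit, which need not be a good one.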

This model is implemented in PyTorch using its linear module and the SGD optimizer. This approach was preferred over least-squares because of the number of features present in the dataset after encoding, which makes least-squares too computationally intensive. The SGD optimizer has been chosen because of the shallow network structure (no hidden layers in linear regression). The MSE and MAE will both be tested. The learning rate will be set to 10^-5, which is relatively low, to avoid overfitting. For the batch size, different experiments will be done and the best performing one will be chosen.


Figure 3: Depiction of the PyTorch model used for multivariate linear regression, with an input layer of roughly 110 inputs connected directly to a single output.

3.5 Evaluation

To evaluate the models’ performance, we define an accuracy measure. For this regression problem, a prediction counts as accurate when the difference between the prediction and the actual target value is within a given range. The range is set to 0 seconds, 1 second, or 2 seconds. The models will all be tested on the same test set so they can be compared with each other. The test set consists of 10% of the original dataset, which is 487 datapoints. This comes as close as possible to how a model would perform in a real setting.
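The accuracy measure can be sketched as follows. One assumption is made here that the text does not state: predictions are rounded to whole seconds before they are compared with the target. The function name and the example values are illustrative.

```python
def accuracy_within(predictions, targets, tolerance):
    """Fraction of predictions within `tolerance` seconds of the target.

    Assumption: predictions are rounded to whole seconds first.
    """
    hits = sum(
        1 for p, t in zip(predictions, targets)
        if abs(round(p) - t) <= tolerance
    )
    return hits / len(targets)


targets = [1, 1, 2, 5, 40]                # measured cpu times in seconds
predictions = [1.2, 2.4, 2.0, 3.1, 6.0]   # model outputs in seconds

for tol in (0, 1, 2):
    print(f"accuracy({tol}s) = {accuracy_within(predictions, targets, tol):.2f}")
```

A wider range can only increase the score, which is why accuracy(2s) is always at least as high as accuracy(0s) for the same model.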


4 Results

4.1 Dataset

Cutoff point          no cutoff   7200s    500s    200s     20s      5s
Mean                       51.8    45.7    24.9    17.3     4.9     2.2
Min value                     1       1       1       1       1       1
Max value                 19918    6822     499     200      20       5
Standard deviation        405.5   252.2    57.1    30.5     4.9     1.4
Median                        5       5       5       5       3       2
Mode                          1       1       1       1       1       1
# of datapoints            4870    4868    4801    4674    3661    2542

Table 1: Statistics about the dataset. All values are in seconds, except for the number of datapoints. Measurements are taken after dropping all datapoints with an execution time of 0 seconds.
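Statistics like those in Table 1 can be computed per cutoff along the following lines. This is a sketch on made-up data: it assumes that a cutoff keeps all execution times up to and including the cutoff value and that the population standard deviation is meant, neither of which is stated in the text.

```python
import statistics


def summarize(times, cutoff=None):
    """Drop 0-second datapoints, apply an optional cutoff, and summarize."""
    kept = [t for t in times if t > 0 and (cutoff is None or t <= cutoff)]
    return {
        "mean": statistics.mean(kept),
        "min": min(kept),
        "max": max(kept),
        "stdev": statistics.pstdev(kept),  # assumption: population std. dev.
        "median": statistics.median(kept),
        "mode": statistics.mode(kept),
        "n": len(kept),
    }


# Made-up cpu times in seconds, heavily skewed towards short processes.
cpu_times = [0, 1, 1, 1, 2, 3, 5, 5, 18, 250, 7000]

for cutoff in (None, 200, 20):
    print(cutoff, summarize(cpu_times, cutoff))
```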

Below is an overview of the distributions of execution times in the dataset at various cutoff points. For every value on the x-axis, the number of datapoints with that execution time is shown on the y-axis. The bins are equally sized; the bin size is the x-axis range divided by the number of bins. Visualizations of every feature in the final dataset can be found in the appendix.

Figure 4: bin size equal to 2000

Figure 5: bin size equal to 700

Figure 8: bin size equal to 2

Figure 9: bin size equal to 0.4

4.2 Results of Predictions

A batch size of 8 and a learning rate of 10^-5 were used in all networks. All models are trained with the SGD optimizer. Table 2 and table 3 give an overview of all tested models. The models in table 3 were chosen to be trained with three hidden layers because of the performance of their one-layer counterparts, which can be seen in table 6. Table 4 and table 5 give an overview of the predictions on the test set. All predictions are visualized in figures 10 to 35.


Model #   Error Measure   Cutoff Point   ML Technique
#1        MSE             7200s          FFNN
#2        MSE             500s           FFNN
#3        MSE             200s           FFNN
#4        MSE             20s            FFNN
#5        MSE             5s             FFNN
#6        MAE             7200s          FFNN
#7        MAE             500s           FFNN
#8        MAE             200s           FFNN
#9        MAE             20s            FFNN
#10       MAE             5s             FFNN
#11       Huber           7200s          FFNN
#12       Huber           500s           FFNN
#13       Huber           200s           FFNN
#14       Huber           20s            FFNN
#15       Huber           5s             FFNN
#16       MSE             7200s          MVLR
#17       MSE             200s           MVLR
#18       MAE             7200s          MVLR
#19       MAE             200s           MVLR
#20       Huber           7200s          MVLR
#21       Huber           200s           MVLR

Table 2: An overview of all different models. A batch size of 8 and a learning rate of 10^-5 were used in all FFNNs. All FFNNs consist of one hidden layer with 60 nodes. Models trained with the same error function are grouped together. The colours correspond to the different cutoff points.

Model #   Error Measure   Cutoff Point   ML Technique   Scoring best on
#22       MSE             500s           FFNN           accuracy (2s)
#23       MAE             500s           FFNN           accuracy (1s)
#24       MAE             5s             FFNN           accuracy (0s)
#25       Huber           5s             FFNN           accuracy (0s)

Table 3: An overview of all different models that are tested with three hidden layers. One model for every accuracy level was tested again. For the accuracy measure of 0 seconds, two models scored very similarly, so both were tested. A batch size of 8 and a learning rate of 10^-5 were used in all FFNNs. All FFNNs consist of three hidden layers with 60 nodes each. Models trained with the same error function are grouped together. The colours correspond to the different cutoff points.


Model   Minimum output value   Maximum output value   Mean predicted value   Median output value   Standard deviation
#1      0s                     63s                    4.25                   0.0                   0.16
#2      1s                     3s                     2.15                   2.0                   0.38
#3      17s                    17s                    17.0                   17.0                  0.0
#4      5s                     5s                     5.0                    5.0                   0.0
#5      0s                     2s                     1.31                   1.0                   0.67
#6      1s                     2s                     1.55                   2.0                   0.50
#7      1s                     2s                     1.95                   2.0                   0.22
#8      0s                     2s                     1.61                   2.0                   0.57
#9      0s                     3s                     1.03                   1.0                   0.54
#10     0s                     2s                     1.04                   1.0                   0.48
#11     0s                     54s                    8.99                   7.0                   0.11
#12     0s                     3s                     0.96                   1.0                   0.44
#13     0s                     3s                     1.03                   1.0                   0.65
#14     0s                     4s                     0.82                   1.0                   0.86
#15     0s                     2s                     1.01                   1.0                   0.52
#16     35s                    54s                    46.53                  47.0                  2.89
#17     13s                    21s                    17.43                  17.0                  1.30
#18     1s                     5s                     4.72                   5.0                   0.78
#19     1s                     5s                     4.73                   5.0                   0.70
#20     1s                     6s                     4.83                   5.0                   0.68
#21     2s                     5s                     4.70                   5.0                   0.70

Table 4: An overview of all statistics of the different models that are tested with one hidden layer. A batch size of 8 and a learning rate of 10^-5 were used in all networks. All networks consist of one hidden layer with 60 nodes. The colours correspond to the different cutoff points.

Model   Minimum output value   Maximum output value   Mean predicted value   Median output value   Standard deviation
#22     24s                    24s                    24                     24.0                  0.0
#23     0s                     9s                     2.03                   2.0                   1.01
#24     0s                     3s                     0.84                   1.0                   0.64
#25     0s                     3s                     0.53                   0.0                   0.64

Table 5: An overview of all statistics of the different models that are tested with three hidden layers. A batch size of 8 and a learning rate of 10^-5 were used in all FFNNs. All FFNNs consist of three hidden layers with 60 nodes each. Models trained with the same error function are grouped together. The colours correspond to the different cutoff points.


Model   Accuracy(0s)   Accuracy(1s)   Accuracy(2s)
#1      0.41%          21.97%         29.98%
#2      10.27%         39.01%         47.84%
#3      0.8%           3.08%          4.72%
#4      6.57%          13.96%         23.2%
#5      17.66%         37.78%         44.14%
#6      19.91%         39.63%         45.37%
#7      12.94%         41.48%         46.20%
#8      12.32%         39.43%         44.97%
#9      19.51%         34.91%         41.68%
#10     20.94%         35.73%         41.89%
#11     1.44%          10.47%         10.47%
#12     22.38%         34.09%         41.07%
#13     17.25%         34.08%         40.45%
#14     10.88%         29.97%         39.2%
#15     20.33%         34.09%         41.2%
#16     0.20%          0.41%          0.81%
#17     0.62%          3.49%          5.34%
#18     6.98%          16.02%         26.69%
#19     6.57%          16.63%         27.10%
#20     6.98%          16.63%         26.07%
#21     6.37%          16.43%         27.10%

Table 6: Accuracy on the test set of all models with 1 hidden layer. A batch size of 8 and a learning rate of 10^-5 were used in all FFNNs. All FFNNs consist of one hidden layer with 60 nodes. Models trained with the same error function are grouped together. The colours correspond to the different cutoff points.

Model   Accuracy(0s)   Accuracy(1s)   Accuracy(2s)
#22     1.23%          2.87%          3.70%
#23     9.65%          34.29%         48.67%
#24     16.84%         33.06%         40.45%
#25     10.27%         29.78%         38.6%

Table 7: Accuracy on the test set of all models with 3 hidden layers. A batch size of 8 and a learning rate of 10^-5 were used in all FFNNs. All FFNNs consist of three hidden layers with 60 nodes each. Models trained with the same error function are grouped together. The colours correspond to the different cutoff points.


4.2.1 Predictions Feedforward Neural Nets trained with MSE

Figure 10: FFNN #1, 1 hidden layer, 7200 seconds max, MSE

Figure 11: FFNN #2, 1 hidden layer, 500 seconds max, MSE

Figure 12: FFNN #3, 1 hidden layer, 200 seconds max, MSE

Figure 13: FFNN #4, 1 hidden layer, 20 seconds max, MSE

Figure 14: FFNN #5, 1 hidden layer, 5 seconds max, MSE


4.2.2 Predictions Feedforward Neural Nets trained with MAE

Figure 15: FFNN #6, 1 hidden layer, 7200 seconds max, MAE

Figure 16: FFNN #7, 1 hidden layer, 500 seconds max, MAE

Figure 17: FFNN #8, 1 hidden layer, 200 seconds max, MAE

Figure 18: FFNN #9, 1 hidden layer, 20 seconds max, MAE

Figure 19: FFNN #10, 1 hidden layer, 5 seconds max, MAE


4.2.3 Predictions Feedforward Neural Nets trained with Huber loss

Figure 20: FFNN #11, 1 hidden layer, 7200 seconds max, Huber

Figure 21: FFNN #12, 1 hidden layer, 500 seconds max, Huber

Figure 22: FFNN #13, 1 hidden layer, 200 seconds max, Huber

Figure 23: FFNN #14, 1 hidden layer, 20 seconds max, Huber

Figure 24: FFNN #15, 1 hidden layer, 5 seconds max, Huber


4.2.4 Predictions linear models

Figure 25: MVLR #16, no hidden layers, 7200 seconds max, MSE

Figure 26: MVLR #16, no hidden layers, 7200 seconds max, MSE

Figure 27: MVLR #17, no hidden layers, 200 seconds max, MSE

Figure 28: MVLR #18, no hidden layers, 7200 seconds max, MAE

Figure 29: MVLR #19, no hidden layers, 200 seconds max, MAE

Figure 30: MVLR #20, no hidden layers, 7200 seconds max, Huber

Figure 31: MVLR #21, no hidden layers, 200 seconds max, Huber

4.2.5 Predictions FFNN with 3 hidden layers

Figure 32: FFNN #22, 3 hidden layers, 500 seconds max, MSE

Figure 33: FFNN #23, 3 hidden layers, 500 seconds max, MAE

Figure 34: FFNN #24, 3 hidden layers, 5 seconds max, MAE

Figure 35: FFNN #25, 3 hidden layers, 5 seconds max, Huber

5 Discussion and Future Work

The results presented in the previous section indicate that the proposed method cannot reliably predict execution times. Table 6 and table 7 show that even though the feedforward neural network outperforms the naive baseline approach of multivariate linear regression, it is not by much. Most models seem to converge on predicting the mean or the median value of the dataset they were trained on, as can be seen in table 4 and table 5. The mediocre results may be explained by various factors, which are discussed in the remainder of this section. Upon inspection of the dataset, it becomes clear that it is very unbalanced and that the vast majority of execution times is within a few seconds. The mediocre performance of the models may well be a result of this, because the other execution times might simply not be present often enough in the dataset for the models to generalise the relation that they should have discovered.

Because there are fewer and fewer samples as the execution times get higher, it might not be possible to learn to predict these processes with the current dataset, even though ideally the model should. Treating those datapoints as less important outliers, by using an error function that is less sensitive to outliers, does help performance, as can be seen in tables 6 and 7. The networks seem to perform better when trained on datapoints with lower execution times, even though they are all tested on the test set that contains datapoints from the original dataset, meaning that higher execution times are included. The range of predicted values increased only barely in some cases and even decreased in others, suggesting that it just becomes easier to guess a frequently occurring value rather than that the model learns better (see table 4). The performance also does not increase significantly when the most promising models in terms of accuracy are trained again with more layers (see tables 3 and 7). This again suggests that the combination of a dataset consisting of ps data and a feedforward neural network is not a promising approach.

The heatmaps would ideally show a line of squares that follows the diagonal (x = y) if the model were performing well. However, what actually happens is that all models focus on a very small range of seconds and stay there no matter the actual value. In some cases the networks do learn to focus on the values that actually occur most often, but one could get similar performance by just always guessing that a process lasts for 1 second.
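Both observations can be checked with a few lines of code: counting how often each pair of actual and predicted seconds occurs (the data behind such a heatmap), and scoring a baseline that always guesses 1 second. The sketch below uses made-up values and hypothetical function names, and it assumes that predictions are rounded to whole seconds.

```python
from collections import Counter


def heatmap_counts(predictions, targets):
    """Count how often each (actual, predicted) pair of whole seconds occurs."""
    return Counter((t, round(p)) for p, t in zip(predictions, targets))


def constant_baseline_accuracy(targets, guess=1, tolerance=0):
    """Accuracy of always predicting the same value."""
    hits = sum(1 for t in targets if abs(guess - t) <= tolerance)
    return hits / len(targets)


# Made-up example: the model stays near 1 second whatever the actual value is.
targets = [1, 1, 1, 2, 3, 5, 40]
predictions = [1.1, 0.9, 1.4, 1.2, 1.0, 1.3, 1.6]

counts = heatmap_counts(predictions, targets)
for (actual, predicted), n in sorted(counts.items()):
    print(f"actual={actual:>2}s predicted={predicted}s count={n}")

print(constant_baseline_accuracy(targets, guess=1, tolerance=0))
```

A well-performing model would concentrate its counts on pairs where the actual and predicted values are equal; here nearly all counts fall in the column for a predicted value of 1 second.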

The dataset might have been too small. If more datapoints were present, the neural network should be able to generalise better, provided the assumption holds that data from ps contains enough information. It could also be the case that the dataset does not contain the information needed to predict execution time behaviour at all, even when abundantly available. This would not immediately rule out data from ps, and thus proc, but it would call for a different approach to acquiring it. An improvement on the current approach could also be to acquire data for a longer time and to make sure the data is acquired in a totally non-intrusive way. This was out of scope for this research, but the machine might have been slightly slowed down by acquiring the data, making the dataset possibly less representative of actual use. More features could also help, meaning that an approach in which a dedicated interface for proc is created could be more successful.

Another explanation for why this approach was not successful is that the dataset comes from a personal computer with a complicated architecture. User behaviour might change a lot, and with it the processes that are started and stopped. For example, one might switch from Firefox to Chrome, or start watching a movie every night. This could impact the system in such a way that the predictive capability of the model is worsened. One might need to retrain regularly. This might even have happened within this dataset.

Huybrechts et al. found that their dataset was also not sufficient for a FFNN to predict meaningful upper bounds (Huybrechts, Cassimon, et al. 2018). They believe that this is a feasible approach when hyperparameters and the dataset are optimized further. However, after careful tuning of hyperparameters and experimenting with multiple structures for the FFNN, it turns out that this does not help significantly. The MVLR approach does not show an accuracy that is close to a reliable one, meaning that there is no linear connection between the features in the dataset and the cputime.

The experiments indicate that a straightforward approach of using simple proc features and a FFNN does not result in a satisfying execution time prediction model. However, the results obtained in this exploratory study should not be interpreted as evidence for infeasibility of the proposed approach. Instead, some other variations of this approach should be considered.

As mentioned before, the dataset could have been insufficient because of changes in user behaviour. Possible future research would therefore include detecting when this drift in user behaviour occurs, or when to discard old data. The problem might also be solvable with more data or a simpler architecture, for example on a system that has more regular tasks and fewer user-dependent tasks. A dataset with more types of processes than occur in normal use could also work better and would be good for comparison with the ’normal use’ dataset, especially if the execution times are more evenly distributed.

One other possible solution that would make acquiring data with ps still a worthwhile endeavour would be not to use a snapshot of all the features of a process in its last second of cputime, but to collect time series data about the processes instead. A recurrent neural network should then be trained with this data, because recurrent neural networks can make use of sequential data. If that approach is more successful than the one used in this thesis, it would show that processes themselves are not independent from each other and cannot be treated as such. However, there are two main reasons the FFNN and MVLR were chosen for this thesis instead. First, a recurrent neural net is a significantly more complex machine learning technique than a FFNN, so while possible, it is not a logical first step to take when exploring this problem. Second, the assumption that datapoints are independent simplifies the problem, which made it a good first approach.
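Collecting time series instead of a single final snapshot could look roughly like the sketch below, which keeps a growing list of feature vectors per process and hands over the whole sequence once the process disappears. The snapshot format (a mapping from pid to a feature list) and the function names are hypothetical; reuse of pids by the kernel is not handled.

```python
from collections import defaultdict


def append_snapshot(histories, snapshot):
    """Add one snapshot (pid -> feature vector) to the per-process histories."""
    for pid, features in snapshot.items():
        histories[pid].append(features)


def finished_sequences(histories, current_pids):
    """Remove and return the histories of processes that no longer appear."""
    done = {pid: seq for pid, seq in histories.items() if pid not in current_pids}
    for pid in done:
        del histories[pid]
    return done


# Three made-up snapshots, one per second; process 42 stops after the second.
snapshots = [
    {42: [0.1, 3], 7: [0.0, 1]},
    {42: [0.4, 3], 7: [0.0, 1]},
    {7: [0.1, 1]},
]

histories = defaultdict(list)
completed = {}
for snapshot in snapshots:
    completed.update(finished_sequences(histories, snapshot.keys()))
    append_snapshot(histories, snapshot)

print(completed)
```

Each completed sequence would then be one training example for a sequential model, with the final cpu time as its target.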

6 Conclusion

We researched predicting execution time behaviour in Linux with machine learning by collecting data on a personal Linux computer with the ps interface for proc. After careful pre-processing of the dataset, two machine learning algorithms were tested to see if they are able to learn to predict the cpu time. Both MVLR and FFNN were tested in PyTorch: the MVLR as a baseline and a good first step in this exploratory thesis, and the FFNN to see if there is a nonlinear, non-sequential relationship. We find that these basic machine learning models are not able to learn to predict execution times reliably, even when applying careful hyperparameter tuning. We believe that another machine learning approach that uses the dataset will probably be more successful. One possibly more successful approach is to use a machine learning technique that is able to handle the sequential information in the dataset.


References

Ali, Arshad, Ashiq Anjum, Julian Bunn, Richard Cavanaugh, Frank Van Lingen, Richard McClatchey, Muhammad Atif Mehmood, Harvey Newman, Conrad Steenberg, Michael Thomas, et al. (2004). Predicting the resource requirements of a job submission.

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer Science+Business Media.

Bonenfant, Armelle, Denis Claraz, Marianne de Michiel, & Pascal Sotin (2017). “Early WCET Prediction Using Machine Learning”. In: OASIcs-OpenAccess Series in Informatics. Vol. 57. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Hochreiter, Sepp & Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural computation 9.8, pp. 1735–1780.

Hornik, Kurt (1991). “Approximation capabilities of multilayer feedforward networks”. In: Neural networks 4.2, pp. 251–257.

Huybrechts, Thomas, Thomas Cassimon, Siegfried Mercelis, & Peter Hellinckx (2018). “Introduction of Deep Neural Network in Hybrid WCET Analysis”. In: International Conference on P2P, Parallel, Grid, Cloud and Internet Computing. Springer, pp. 415–425.

Huybrechts, Thomas, Siegfried Mercelis, & Peter Hellinckx (2018). “A new hybrid approach on WCET analysis for real-time systems using machine learning”. In: OASIcs: OpenAccess Series in Informatics. Place of publication unknown, pp. 1–12.

Ioffe, Sergey & Christian Szegedy (2015). “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167.

Kerrisk, Michael (n.d.[a]). credentials(7) - Linux Programmer’s Manual. Online; accessed June 26th 2019.

— (n.d.[b]). Proc(5) - Linux Programmer’s Manual. Online; accessed June 24th 2019.

— (n.d.[c]). PS(1) - Linux Programmer’s Manual. Online; accessed June 24th 2019.

— (n.d.[d]). top(1) - Linux Programmer’s Manual. Online; accessed June 11th 2019.

Kingma, Diederik P & Jimmy Ba (2014). “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980.

Lozi, Jean-Pierre, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quéma, & Alexandra Fedorova (2016). “The Linux scheduler: a decade of wasted cores”. In: Proceedings of the Eleventh European Conference on Computer Systems. ACM, p. 1.

Negi, Atul & P Kishore Kumar (2005). “Applying machine learning techniques to improve linux process scheduling”. In: TENCON 2005 2005 IEEE Region 10. IEEE, pp. 1–6.

Paszke, Adam, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, & Adam Lerer (2017). “Automatic Differentiation in PyTorch”. In: NIPS Autodiff Workshop.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, & E. Duchesnay (2011). “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12, pp. 2825–2830.

Rumelhart, David E, Geoffrey E Hinton, Ronald J Williams, et al. (1988). “Learning representations by back-propagating errors”. In: Cognitive modeling 5.3, p. 1.

Sola, J & Joaquin Sevilla (1997). “Importance of input data normalization for the application of neural networks to complex industrial problems”. In: IEEE Transactions on nuclear science 44.3, pp. 1464–1468.

Wilhelm, Reinhard, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, et al. (2008). “The worst-case execution-time problem—overview of methods and survey of tools”. In: ACM Transactions on Embedded Computing Systems (TECS) 7.3, p. 36.

Wilson, Ashia C, Rebecca Roelofs, Mitchell Stern, Nati Srebro, & Benjamin Recht (2017). “The marginal value of adaptive gradient methods in machine learning”. In: Advances in Neural Information Processing Systems, pp. 4148–4158.


Appendices

A Data acquisition script

'''
Script that creates a dataset of processes based on data from ps.
Willemijn Beks (10775110)
'''
import argparse
import csv
import datetime
import re
import subprocess
import sys
import time


def find_finished(FEATURES, N_ITERATIONS):
    '''
    Find finished processes by comparing list of active processes with one
    second between to see which processes have stopped. The stopped
    processes are written to a csv file.
    '''
    filename = 'dataset' + datetime.datetime.now().strftime('%m%d%H%M') + \
        '.csv'
    n_features = len(FEATURES.split(','))

    # The with-block closes the file when the script is interrupted.
    with open(filename, 'a', newline='') as file:
        dataset_writer = csv.writer(file)
        # N_ITERATIONS is accepted but not used in this listing: the loop
        # runs until the script is interrupted.
        while True:
            last = subprocess.getoutput('ps -eo ' + FEATURES)
            time.sleep(1)
            current = subprocess.getoutput('ps -eo ' + FEATURES)

            # The first column of every line is the pid.
            curr_pids = {line.split()[0] for line in current.splitlines()}
            for line in last.splitlines():
                pid = line.split()[0]
                # Skip the ps process that was started to take the snapshot.
                if pid not in curr_pids and ' ps ' not in line:
                    # Columns are 200 characters wide, so fields are
                    # separated by long runs of whitespace.
                    processed_line = re.split(r'\s{10,}', line)
                    dataset_writer.writerow(processed_line)


if __name__ == '__main__':
    p = argparse.ArgumentParser()
    # This should be one single uninterrupted string, but is broken up to
    # show what is happening
    default = (
        'pid:200,%cpu:200,%mem:200,args:200,blocked:200,bsdstart:200,bsdtime:200,'
        'c:200,caught:200,cgroup:200,class:200,cmd:200,comm:200,command:200,cp:200,cputime:200,'
        'egid:200,egroup:200,eip:200,esp:200,etimes:200,euid:200,f:200,fgid:200,fgroup:200,'
        'fname:200,fuid:200,fuser:200,gid:200,group:200,ignored:200,label:200,lstart:200,'
        'lwp:200,maj_flt:200,min_flt:200,ni:200,nice:200,nlwp:200,nwchan:200,pending:200,'
        'pgid:200,pgrp:200,ppid:200,pri:200,psr:200,rgid:200,rgroup:200,rss:200,'
        'rtprio:200,ruid:200,s:200,sched:200,sess:200,sgi_p:200,sgid:200,size:200,spid:200,'
        'stackp:200,start:200,start_time:200,stat:200,suid:200,supgid:200,'
        'supgrp:200,suser:200,sz:200,tname:200,tpgid:200,vsize:200,wchan:200'
    )
    # The listing is cut off around this point. The --FEATURES argument and
    # the call to find_finished are assumptions, reconstructed from the
    # signature of find_finished.
    p.add_argument('--FEATURES', help='ps output format string',
                   default=default)
    p.add_argument('--N_ITERATIONS', help='number of times main loop is '
                   'executed', default=100, type=int)
    args = p.parse_args(sys.argv[1:])
    find_finished(args.FEATURES, args.N_ITERATIONS)

B PS features that were used

Feature Name Description

%cpu CPU utilization of the process in "##.#" format. Currently, it is the CPU time used divided by the time the process has been running (cputime/real-time ratio), expressed as a percentage. It will not add up to 100% unless you are lucky. (alias: pcpu).

%mem ratio of the process’s resident set size to the physical memory on the machine, expressed as a percentage. (alias: pmem).

cgroup display control groups to which the process belongs.

egroup effective group ID of the process. This will be the textual group ID, if it can be obtained and the field width permits, or a decimal representation otherwise. (alias: group).

etimes elapsed time since the process was started, in seconds.

euid effective user ID (alias: uid).

f flags associated with the process, see the PROCESS FLAGS section. (alias: flag, flags).

fgroup filesystem access group ID. This will be the textual group ID, if it can be obtained and the field width permits, or a decimal representation otherwise. (alias: fsgroup).

fname first 8 bytes of the base name of the process’s executable file. The output in this column may contain spaces.

fuser filesystem access user ID. This will be the textual user ID, if it can be obtained and the field width permits, or a decimal representation otherwise.

maj_flt The number of major page faults that have occurred with this process.

min_flt The number of minor page faults that have occurred with this process.

nice nice value. This ranges from 19 (nicest) to -20 (not nice to others). (alias: ni).

nlwp number of lwps (threads) in the process. (alias: thcount).

Table 8: Overview of ps features that were used in training of the models. Descriptions are from the Linux manual page on ps (Kerrisk n.d.[c]).


pending mask of the pending signals. See signals. Signals pending on the process are distinct from signals pending on individual threads. Use the m option or the -m option to see both. According to the width of the field, a 32 or 64 bits mask in hexadecimal format is displayed. (alias: sig).

pri priority of the process. Higher number means lower priority.

psr processor that process is currently assigned to.

rss resident set size, the non-swapped physical memory that a task has used (in kilobytes). (alias: rssize, rsz).

rgroup real group name. This will be the textual group ID, if it can be obtained and the field width permits, or a decimal representation otherwise.

ruid real user ID.

s minimal state display (one character). See section PROCESS STATE CODES for the different values. See also stat if you want additional information displayed. (alias: state).

sgi_p processor that the process is currently executing on. Displays "*" if the process is not currently running or runnable.

sgid saved group ID. (alias: svgid).

size approximate amount of swap space that would be required if the process were to dirty all writable pages and then be swapped out. This approximation is very rough.

supgid group ids of supplementary groups, if any.

suser saved username. This will be the textual user ID, if it can be obtained and the field width permits, or a decimal representation otherwise. (alias: svuser).

sz size in physical pages of the core image of the process. This includes text, data, and stack space. Device mappings are currently excluded; this is subject to change. See vsz and rss.

vsize virtual memory size of the process in KiB (1024-byte units). Device mappings are currently excluded; this is subject to change (alias: vsz).

wchan name of the kernel function in which the process is sleeping, a "-" if the process is running, or a "*" if the process is multi-threaded and ps is not displaying threads.


Special features

Feature name Description

stacksize Instruction pointer and stack pointer combined to create stack size.

cputime The target. Cumulative CPU time.

Table 10: Remaining features that were used. These consist of the target value that the model should have learned to predict, and a special feature combined from two other features.

C Feature visualization of dataset

All following figures are bar graphs with either a normal or a logarithmic scale on the y-axis, and the values the feature can take on the x-axis. The percentages shown are the share of each bin in relation to the rest of the shown bins. All visualisations are made on the entire dataset after pre-processing. Some features are shown more than once where zooming in was required to give a better overview.


Figure 39: egroup


Figure 41: etimes zoomed in with max value 25000


Figure 43: f


Figure 46: fuser

Figure 48: maj_flt with max value 500

Figure 50: min_flt with max value 1e07


Figure 52: nlwp


Figure 54: pri


Figure 56: rgroup


Figure 58: ruid

Figure 60: sgi_p

Figure 62: size


Figure 65: suser


Figure 67: vsize
