Predicting resilience through classification of skewed incident data

A Blue Student Lab research

Dante C.C. de Lang
11014083

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie (Artificial Intelligence)

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: dr. S. van Splunter
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

July 10, 2020


Contents

1 Introduction 4
  1.1 Research environment 4
  1.2 Goal 6
2 Background 6
  2.1 Resilient networks 6
  2.2 Incidents and ITSM 8
  2.3 Binary classification: SVM and k-NN 9
3 Data 10
  3.1 Incident data 10
  3.2 Resilience feature 11
4 Method 12
  4.1 Evaluation metrics 12
  4.2 Preprocessing 13
  4.3 Scaling 14
5 Results 14
  5.1 Support Vector Machine 14
  5.2 k-Nearest Neighbors 15
6 Validation 16
7 Conclusion 18
9 Future work 20
Appendix A Features 24
  A.1 Selected features 24
  A.2 Distributions 25
Appendix B Results 26
  B.1 SVM 26


Abstract

This research proposes a machine learning based classification method to predict resilience. For this method, incident data from ABN AMRO is extracted and labelled based on expert knowledge. Using this labelled data, two binary classification algorithms were explored: k-NN and SVM. Based on the patterns found in the feature space of the labelled incident data, predictions were made on unlabelled data. To validate the correctness of this approach, the predictions were compared with resilience metrics used at ABN AMRO.

However, while a machine learning approach to the classification of resilience could be a viable new method, the metrics used at ABN AMRO are neither consistent nor adequate. Since the current metric only confirms whether a service is behaving non-resiliently, it was not possible to validate the correctness of the proposed method. A conclusion of this research is therefore that a more adequate resilience metric is needed, one that makes it possible to state whether a service is behaving in a resilient or a non-resilient manner. As future work, this research proposes improvements to the performance of the used algorithms, the resilience metric and the data quality at ABN AMRO.


1 Introduction

As mentioned by Salles and Marino [2012], computer networks continue to grow as complex systems. Their article further attributes this development to the exponential growth and usage of the Internet. A common problem for complex networks is that they are susceptible to random failures and malicious attacks. Najjar and Gaudiot [1990] were among the first to address this problem by implementing a metric for network resilience based on probabilistic measures of fault tolerance. Using this metric combined with secure computing concepts, a new approach was developed to create sustainable and reliable complex networks [Avižienis et al., 2004]. Ever since, resilience has been an increasingly hot topic and plays an important role in future developments in networking systems [Castet and Saleh, 2008].

According to the Cambridge Dictionary, the term resilient can refer to a property of a physical material and its "ability to quickly return to its usual shape after being bent, stretched, or pressed"1. According to Da Silva et al. [2015], resilience can also refer to a property of a network. In their paper, resilience is described as a state of a network in which it can maintain an acceptable level of service while being confronted with operational challenges. Since resilience is a latent variable, assessing a service or network on this property can be challenging; only through a combination of so-called predictor variables is it possible to observe resilience. Currently, the assessment of resilience at ABN AMRO is done manually, for instance by checking reports or assessing source code of parts of their network. Since large companies like ABN AMRO have an IT department consisting of hundreds of business services supported by thousands of IT products, manual assessment is a very costly method. Thus, this research proposes a method based on machine learning that could support this process. For this method, a part of their incident data is labelled on resilience per business service based on expert knowledge. An important assumption for this method is that the labelled incident data, grouped per business service, contains machine-learnable latent patterns of resilient behaviour.

1.1 Research environment

This research originated from a collaboration between ABN AMRO and the University of Amsterdam (UvA), referred to as the Blue Student Lab (BSL). Since 2018, this collaboration has made it possible to share knowledge and use research to create new, broader insights for future developments. Furthermore, it lowers the threshold between students and companies and therefore creates opportunities for both sides. Parallel to this research, another UvA student is also doing her research in collaboration with ABN AMRO. In the research of Michielsen [2020], she examines the quality of data objects


used in business processes. In her research she states that it is necessary to implement more verification steps around data objects that are vital for business processes. While her research will not be discussed further, conducting our research side by side did improve our understanding of ABN AMRO's inner workings.

With sizeable companies such as banks, the IT department has been developing continuously for decades. For ABN AMRO, this development towards their current IT structure started in 1991 after the merger between the ABN bank and the AMRO bank. During this development, numerous systems and services were added to and deleted from their digital environment. This has caused changes in how components interact and communicate with each other. Furthermore, automation of simple processes and the increasing pressure to deliver consistent and continuous services contribute to the complexity of a system. Since ABN AMRO offers thousands of products and services to external and internal clients, they have developed a complex combination of systems and services. To ensure a reliable and secure service, all services in their IT infrastructure are continuously monitored. The department Incident, Problem and Change Management plays an important part in this process by focusing on creating an IT infrastructure that is resilient, making it reliable and serviceable 24/7. A team of five employees from this department has assisted me during this research. One of the current developments in their department is a tool called the Resilience Health Check. With this check it should eventually be possible to assess business services and IT products on resilience.

Figure 1: A graphical example of the hierarchical IT structure found at ABN AMRO (levels shown: EUC, Domain, Business service, IT product, CI).

In Keizer et al. [2019] a first approach for this health check is described. In this internal report the Software Improvement Group has assessed the iDEAL IT section on isolation. For that research the main approach consisted of evaluating the code supporting iDEAL. Although code plays a very important part in IT products, evaluating it can be time consuming and difficult. The approach taken in this research will not be based on the code of services but on incident data. Using this data it is still possible to get an insight into the inner workings of services within the IT structure of ABN AMRO without the need for specialized knowledge.

To better understand the environment of this research it is important to explain the IT infrastructure at ABN AMRO. As at other large companies, it is hierarchically assembled, as seen in figure 1. At the highest level are the end-user chains (EUC), each of which can be described as a single service from an end-user point of view. An EUC is further divided into domains, each containing multiple business services. These services are supported through a combination of IT products. Each IT product is running on multiple


infrastructure, almost every business service has a different incident management team. While this increases the solution speed of incidents caused within a business service, it has a negative effect on solutions if an incident is caused by inter-dependencies. Therefore this research aims to create a method that can be used as a first holistic indication. For this method, incident data and expert knowledge are combined to create a labelled dataset that can be used for classification; this will be further explained in section 4.

1.2 Goal

This research aims to get a broader insight into the resilience of business services using incident data and expert knowledge. Therefore, a method is proposed that uses AI to predict the similarity of patterns found in the incident data with known patterns found at resilient and non-resilient services. As input for this algorithm, a large set of incident data from ABN AMRO is combined with expert information. The expert information is based on research and developments by employees at ABN AMRO and contains statements concerning the resilience of a selection of business services. These statements will be used to label the underlying incidents of these services as belonging to a resilient or non-resilient service. Using this labelled data, two classifiers have been applied: the k-Nearest Neighbors algorithm and a Support Vector Machine.

In section 3 the labelling of the data is discussed and the preprocessing is explained. In section 4 the method approached in this research, with both classifying models, is proposed to assess the data. In the last sections, 5-9, the results are reviewed and discussed, and suggestions are given as to how the model could be further implemented in future research.

2 Background

In this section background information is presented in three subsections, starting by positioning this paper within the current body of research on resilient networks in subsection 2.1. In subsection 2.2 background is given on how incident registration is approached at ABN AMRO. The last subsection, 2.3, discusses the relevant AI techniques used in this research.

2.1 Resilient networks

According to the Normal Accidents Theory (NAT) introduced by Charles Perrow [Perrow, 2011], the parts of a system will always be subject to failure. Even if multiple safeguards and buffers are implemented, a few small failures can still interact with each other in a manner that was not anticipated. Perrow states that these unexpected interactions can have different consequences depending on the complexity and coupling of a system. The two main axes used to distinguish systems range from linear to complex systems and from loose to tight coupling. A good example of a loosely coupled linear system is a manufacturing belt: if a problem arises it is possible to stop the process, fix the problem and continue manufacturing. Tightly coupled systems are mostly continuous processes, such as supply transports or power grids, which are much harder to stop temporarily. Complex systems have more interacting components, as seen with companies, universities or governmental agencies. Combinations of tightly coupled and complex systems are, for instance, space missions or nuclear plants. With such systems a high risk of major incidents arises, since the interaction of multiple smaller problems can create larger unexpected failures.

In the survey of Sterbenz et al. [2010] the increasing dependency and complexity of systems play an important role in defining resilience. A basic definition describes resilience as the ability of a network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation. Sterbenz et al. state that internally dependent systems are more prone to disruptions with increasing consequences. According to their paper it is advisable to implement resilience as a property of future networks by incorporating it already in the design phase. An important concept being discussed is the fault → error → failure chain. This process is divided by defend and detect steps; maintaining different thresholds for these steps changes the disruption and fault tolerance of a system. Trustworthiness is an important discipline within resilience that covers the assurance that a system will perform as expected. Here, dependability, security and performance are the main focus. Together they form the base of a resilience strategy that focuses on the robustness and complexity of a system.

A more developed resilience strategy based on Sterbenz et al. [2010] is proposed in the paper of Smith et al. [2011] and is referred to as D2R2 + DR: Defend, Detect, Remediate, Recover, and Diagnose and Refine. For this research it is important to understand how an incident can influence the operational state of a service. Enclosed in the D2R2 + DR strategy is the understanding that resilience can only be maintained if there exists a continuous control loop outside the framework addressed earlier by Sterbenz et al. [2010]. Furthermore, an incident can have multiple origins and effects depending on its location within a network and the amount of connectivity with its environment.

Earlier research claims that to maintain reliability and security of complex systems it is necessary to implement a certain degree of isolation [Barker et al., 2013]. The implementation of isolation in a system can be compared with implementing fire doors between compartments of a building [Bliek, 2017]: if a fire arises it will be limited to a smaller area of the building. Within complex IT systems, isolation would be implemented such that the propagation speed of failures is lowered. If well isolated, an error is less likely to spread to other components. Resilient networks therefore tend to have fewer failures that trigger new failures. This property of a resilient network will be


an important indicator for choosing the features of the data.

2.2 Incidents and ITSM

As mentioned above, the complexity and coupling of a network both contribute to the potential effect a failure can have on the overall system performance. Moreover, a larger network with more internal parts implies an increase in the number of failures. Large corporations have to manage thousands of incidents and events on a daily basis, which are managed in an IT Service Management (ITSM) environment. A common ITSM platform, also used by ABN AMRO, is called ServiceNow2. ABN AMRO maintains two ServiceNow platforms, a Blue and a Green platform. The Blue platform was the first platform and is still governed by IBM; currently the incident management department is gradually making the transition to the Green portal, which will be governed by ABN AMRO itself. Within ServiceNow a difference exists between logging problems as incidents or as events.

Incidents are logged manually and can be called in by an internal client (an employee of the organization) or an external client (a customer). For internal and external clients different help desks exist. When an external client contacts the public help desk via telephone, chat or email, the help desk first tries to solve the problem using knowledge articles. These articles describe common problems and function as a guide to solve a problem through a protocol. When knowledge articles are not sufficient to specify the problem and discover the root cause, the problem is documented in a help desk application (IBM Notes Helpdesk). From there an internal expert group further attempts to discover the root cause. If they succeed, the client is informed; otherwise the problem is documented as an incident in ServiceNow. Internal clients contact a separate help desk that undertakes similar steps in detecting the root cause of a problem. However, if that help desk is unable to identify the root cause, the problem is documented directly in ServiceNow.

Events are similar to incidents apart from being automatically logged into ServiceNow by a monitoring tool. Such monitoring tools are common at large corporations and are used to monitor critical applications by continuously testing them. When the monitoring tool encounters a problem with a certain service, it automatically fills an incident form with extracted information. For this research, events are not taken into account, but they could be of added value in future work; see section 9.

In all cases the documentation of an incident in ServiceNow implies an assignment to a solution group. Due to the number of newly opened incidents per day it is practically impossible to have direct communication between the help desk and the assignment group. Therefore the documentation of an incident is of absolute importance. Thus, within the documentation of an incident all information needed to guide and assist the assigned solution group should be present. At ABN AMRO each incident in ServiceNow contains up to 200 features, varying from its position within the IT infrastructure to a full description of the problem; in table 1 a schematic representation is given. In section 3 these features will be discussed further.

Number      | closed_at           | u_it_business_service | ... | close_notes                                     | major_incident
INC60150862 | 01-01-2020 15:02:53 | EMAIL SERVICES        | ... | Long time back resolved                         | False
INC59265846 | 20-04-2020 11:41:13 | ACCESS SYSTEMS        | ... | please check if document is not bigger than 7MB | False
INC60584166 | 15-02-2020 14:05:04 | CARDS SYSTEM          | ... | unauthorized acces                              | False
INC60484926 | 13-03-2020 08:25:12 | DATA NETWORK          | ... | overflow to server XYZ is redirected            | True
INC59615484 | 16-02-2020 13:43:36 | CORE BANKING          | ... | minor glitch in API                             | False

Table 1: Schematic representation of incident data as represented in the IBM ITSM ServiceNow environment.

2.3 Binary classification: SVM and k-NN

As already stated above, this research aims to make predictions on resilience using a machine learning (ML) based classification algorithm. As will be discussed in section 3, the data will be labeled into two categories. For two-class problems a binary classification algorithm is the most suitable. In this research two algorithms have been explored, namely a Support Vector Machine (SVM) and k-Nearest Neighbors (k-NN).

According to Santiago Hernández et al. [2012], SVM is currently one of the most important classification techniques. SVM is known for its exceptional performance on classification problems with balanced datasets. As also noted by Santiago Hernández et al. [2012], SVM's remarkable generalization power is based on statistical learning theory. Due to the vector representations used, it is also possible to manage high-dimensional data. The goal of an SVM is to create a maximum-margin separating hyperplane that divides two spaces, each containing the points of a certain class. A simple representation is seen in figure 2a. While SVM's performance is excellent on balanced datasets, it struggles with imbalanced datasets. However, multiple solutions are already available, as suggested by Santiago Hernández et al. [2012], such as applying SMOTE or using Exciting Support Vectors.
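The separating-hyperplane idea can be illustrated with a minimal sketch: once a hyperplane w·x + b = 0 has been fitted, new points are classified by the sign of the decision value. The weight vector `w` and bias `b` below are made-up values for a two-dimensional feature space like figure 2a, not parameters from this research.

```python
# Hypothetical hyperplane parameters (illustrative only).
w = (0.8, -0.6)  # normal vector of the separating hyperplane
b = -1.0         # bias term

def classify(point):
    """Classify a 2-d point by the sign of the decision function w.x + b."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "resilient" if score > 0 else "non-resilient"

print(classify((4, 1)))  # 0.8*4 - 0.6*1 - 1.0 = 1.6 > 0 -> resilient
print(classify((1, 5)))  # 0.8*1 - 0.6*5 - 1.0 = -3.2 < 0 -> non-resilient
```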

In a paper of Kim et al. [2012], k-NN is described as a method that classifies objects based on the surrounding k neighbors within the feature space. This local approach is among the simplest of machine learning techniques. In figure 2b a graphical representation is given of k-NN using Euclidean distance as its metric. This method also makes it feasible to handle data with a high-dimensional feature space. A known problem with k-NN is that while it computes similarities locally, it handles every feature equally, which makes its approach prone to classification errors.

(a) A maximum-margin hyperplane conceived by an SVM, represented as a separating line due to the two-dimensional feature space. (b) k-NN with k=2 finds the 2 nearest neighbors, classifying the red point as resilient.

Figure 2: Examples of a 2-d feature space with mathematical representations of both classification algorithms used in this research.
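The k-NN principle of figure 2b can be sketched in a few lines: classify a point by majority vote among its k nearest labelled neighbours under Euclidean distance. The sample points and labels below are hypothetical, not taken from the incident data.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical labelled points in a 2-d feature space.
labelled = [((1, 1), "resilient"), ((2, 1), "resilient"),
            ((8, 6), "non-resilient"), ((9, 7), "non-resilient")]

def knn_predict(point, k=2):
    """Majority vote among the k labelled points nearest to `point`."""
    neighbours = sorted(labelled, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((2, 2)))  # both nearest neighbours are resilient
```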

3 Data

3.1 Incident data

The incident data used for this research was directly extracted from the Blue (IBM) ServiceNow environment and only contains closed incidents. Closed incidents were chosen since they contain all information from opening until closing, including information on the process in between. The full dataset consists of 255101 incidents closed between the 1st of January 2020 00:00 and the 1st of June 2020 23:59. All incidents can be subdivided under 14286 affected CIs, which can be divided into 2067 IT products that in their turn support 717 different business services.

Within ServiceNow an incident can have 196 features, which can be divided into five classes depending on the sort of information they contain. The five classes contain identity, temporal, natural language, dependency and categorical descriptive features. More information on these classes can be found in the research of Knigge [2019]. His research also states that most of the features are inherently empty at the registration of an incident. Taking this into account, and after consulting with ABN AMRO, a selection of 31 features was chosen, as displayed in table 5 in appendix A.


3.2 Resilience feature

For this research extra information was needed to be able to classify incidents on resilience. Since resilience is a latent variable and is not present as a feature in ServiceNow, expert knowledge was called upon to label the data. After consultation with experts from the incident management department at ABN AMRO it was possible to confirm four resilient and four non-resilient business services. It is important to mention that the approach taken by the experts at ABN AMRO is primarily based on internal documents and research using former failures and their impact as a metric of performance. Therefore, classifying a business service as behaving in a non-resilient manner can be substantiated, but classifying one as behaving in a resilient manner cannot. Thus, the services that are labeled as behaving resiliently are mainly supported by expert judgement.

By adding an additional feature, this information was used to label all incidents underlying one of the eight indicated business services with either a 2 if resilient or a 1 if non-resilient. All remaining incidents were labeled as unknown with a 0. It is important to acknowledge that labeling incidents according to their overlying business service rests on an important assumption: namely, that a correlation exists between the patterns found in the incident data and the resilient behaviour of a business service. This assumption will be further elaborated in section 8.

Labelling the data according to the business services indicated by ABN AMRO created a skewed distribution between the labelled incidents, as seen in figure 3. The ratio between resilient and non-resilient is roughly 1:14. This means that only 6.6% of the known data represents resilient labelled incidents, compared to 93.4% non-resilient labelled incidents. Since there are four resilient and four non-resilient services indicated by the bank, this implies that the business services classified as resilient have fewer incidents than the non-resilient services. According to Santiago Hernández et al. [2012], such skewness in datasets can form problems for classification algorithms such as SVM.

Figure 3: The distribution found after labelling incidents based on expert knowledge. An important observation is the skewed distribution, with a 1:14 ratio, between resilient and non-resilient labelled incidents.
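As a quick sanity check, the 1:14 ratio reported above is consistent with the stated resilient share of the labelled data:

```python
# With a 1:14 ratio, the resilient share of the labelled incidents is
# 1 / (1 + 14), which is approximately the 6.6% reported in the text.
resilient_share = 1 / (1 + 14)
print(f"{resilient_share:.1%}")  # ~6.7%
```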


4 Method

The method proposed in this section covers the main contributions this research offers. Although this research is highly focused on the incident data given by ABN AMRO, the methodological approach could be deployed in other environments too. For all data-related computations in this research it was possible to run the code on an external server facilitated by ABN AMRO via Microsoft Azure. On their Databricks environment a cluster was reserved that runs on Databricks Runtime 6.5 (including Apache Spark 2.4.5, Scala 2.11) with 2 to 8 Workers (type: Standard-DS3-v2, 14 GB Memory, 4 Cores, 0.75 DBU) and a Driver (same type as workers).

4.1 Evaluation metrics

As mentioned by Santiago Hernández et al. [2012], evaluation metrics such as accuracy are normally sufficient but can produce wrong conclusions on skewed datasets. An example could be a spam filter where the ratio of spam in incoming mail is 1:100. If a classification algorithm were prone to overfitting, it could predict that no incoming mail is spam. While in such a case it would still be possible to achieve an accuracy of 99%, it also means the filter would never correctly classify a mail as spam.

A confusion matrix is a first step to create better insight into the performance of an algorithm. For a two-class classification problem the matrix consists of four cells, each defined by a combination of the predicted and the true value. Thus four combinations are possible, see table 2, where a true positive (tp) and a true negative (tn) value are correct classifications and a false positive (fp) and a false negative (fn) value are incorrect classifications. FP and FN are often referred to as type I and type II errors. For this research a positive value means non-resilient and a negative value resilient.

                       | Actual Positive (1) | Actual Negative (0)
Predicted Positive (1) | TP                  | FP: Type I error
Predicted Negative (0) | FN: Type II error   | TN

Table 2: An overview of a two class confusion matrix.

Using the values in the confusion matrix it is possible to calculate new values that can be used to evaluate performance. Precision (1) and recall (2) are two of the most commonly used metrics. Where precision gives the proportion of positive identifications that are correct, recall returns the proportion of actual positives that are identified correctly. A combination of both creates the F-score (4), a value that balances recall and precision.


Precision = Σtp / (Σtp + Σfp)    (1)

Recall/TPR = Σtp / (Σtp + Σfn)    (2)

FPR = Σfp / (Σfp + Σtn)    (3)

F-score = 2 · (Precision · Recall) / (Precision + Recall)    (4)

A last evaluation metric, also used by Santiago Hernández et al. [2012], is the area under the ROC curve. A Receiver Operating Characteristic (ROC) analysis creates a plot over two parameters, the true positive rate (2) and the false positive rate (3). The Area Under the Curve (AUC) value represents how separable the two classes are. A ROC analysis can be done by comparing the true labels of the test set with the predicted labels created by the classifier. The visual and numerical results obtained through this method make it a very appropriate evaluation metric for binary classifiers.
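Equations (1)-(4) can be worked through on a small example; the tp, fp, fn and tn counts below are made up for illustration, not taken from this research.

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 90, 10, 5, 45

precision = tp / (tp + fp)                                 # eq. (1)
recall = tp / (tp + fn)                                    # eq. (2), the TPR
fpr = fp / (fp + tn)                                       # eq. (3)
f_score = 2 * precision * recall / (precision + recall)    # eq. (4)

print(round(precision, 3), round(recall, 3),
      round(fpr, 3), round(f_score, 3))
```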

4.2 Preprocessing

For the approach taken in this research the data has to be converted into a numerical representation. Therefore, most features had to be preprocessed. As seen in table 5 in the appendix, there are multiple categorical features in the dataset. This research therefore started by exploring multiple approaches to convert these features to numerical representations, such as One Hot Encoding and Entity Embedding. However, after re-consultation with ABN AMRO it turned out that most categorical features, like priority and urgency, could already be converted to a number due to their hierarchical structure. Other categorical features, like made_sla and u_major_incident, were textual Boolean values which could readily be transformed to a numerical representation.

For the natural language features the numerical representations were based on their string length. Likewise, for the temporal features the time in hours between opening and resolving and between opening and closing was chosen as a numerical representation. Features describing identity and dependencies were only utilized to label or track the data during the process. The remaining continuous features needed no preprocessing except for scaling, which will be addressed in the next subsection. In figure 7 in the appendix a graphical overview is given of 15 selected unscaled features extracted through preprocessing.
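The conversions described above can be sketched for a single record; the field names and values below loosely follow table 1 but are hypothetical.

```python
from datetime import datetime

# One hypothetical incident record.
incident = {
    "u_major_incident": "False",  # textual Boolean feature
    "close_notes": "overflow to server XYZ is redirected",
    "opened_at": datetime(2020, 3, 13, 8, 25),
    "resolved_at": datetime(2020, 3, 13, 14, 25),
}

features = {
    # Textual Boolean -> 0/1.
    "major_incident": 1 if incident["u_major_incident"] == "True" else 0,
    # Natural language feature -> string length.
    "close_notes_len": len(incident["close_notes"]),
    # Temporal features -> hours between opening and resolving.
    "resolve_hours": (incident["resolved_at"]
                      - incident["opened_at"]).total_seconds() / 3600,
}
print(features)
```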


4.3 Scaling

After converting the feature values to a numerical representation the data was scaled. This is important so that features with high values are not capable of overruling others. Scaling the data was done using the broadly supported Scikit-learn library [Pedregosa et al., 2011]. Multiple scaling methods available in this library have been compared, which led to the implementation of the MinMaxScaler, which is also supported by Hsu et al. [2016]. Using this scaler all values within features were compressed into the range from 0 to 1 without changing the distributions.
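The min-max transform boils down to x' = (x - min) / (max - min), mapping each feature into [0, 1] without changing its distribution. A pure-Python sketch with made-up feature values:

```python
def min_max_scale(values):
    """Scale a list of numbers into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```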

5 Results

As already mentioned in section 2, two binary classifiers were used in this research. To get top performance for both classifiers, cross validation and grid search were used to obtain optimal parameters. For both classifiers five different configurations have been executed: two using standard parameters indicated by the Scikit-learn library and three using optimal parameters found with grid search [Pedregosa et al., 2011]. Furthermore, a distinction has been made between training and validating on scaled and unscaled data. The fifth configuration is a combination of optimal parameters trained on a scaled dataset that contained all labelled incidents of the resilient class twice. This way the algorithm encounters these points two times, giving the class more weight.
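The "doubling" step described above amounts to a naive form of oversampling: the rows of the minority class are simply appended to the dataset a second time. A sketch with made-up rows:

```python
# Hypothetical labelled rows: one resilient (minority), three non-resilient.
rows = [("inc1", "resilient"), ("inc2", "non-resilient"),
        ("inc3", "non-resilient"), ("inc4", "non-resilient")]

# Append a second copy of every minority-class row.
doubled = rows + [row for row in rows if row[1] == "resilient"]
print(len(doubled))  # 5: the single resilient row now appears twice
```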

In tables 3 and 4, presented in the next subsections, the five configurations with their results are presented. The first column indicates the configurations in an abbreviated manner. The first letter states whether the algorithm had standard (S) or optimal (O) parameters. The second and third letters state whether it was executed using unscaled (NS) or scaled (WS) data. For the fifth configuration a doubled (D) minority class was implemented. The other eight columns display the different performance values. First, for both classes the Precision and Recall are stated, followed by their compound F-score. The last two columns represent the mean Accuracy (mAcc) and the AUC value computed by measuring the area under the ROC curve.

5.1 Support Vector Machine

For SVM the optimization through grid search with cross validation led to a C of 100, a gamma of 0.06 and a Radial Basis Function (RBF) kernel. In table 3 the results for the five configurations are summarized. Only with the optimal configuration does scaling have a very limited positive effect on the performance. The OWS configuration seems to perform best, since it returns the highest AUC value and the second highest mAcc value. While doubling the minority class in the OWSD configuration does have a positive effect on classifying the resilient labeled incidents, it does not improve the overall performance. These findings are also supported by the confusion matrices displayed in figure 4. Further information on the performances is found in appendix B.1, where the ROC curves and Precision-Recall curves are displayed for the SWS, OWS and OWSD configurations.
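For context, the RBF kernel selected by the grid search measures similarity between two points as K(x, y) = exp(-gamma · ||x - y||²). The sketch below uses the gamma found above (0.06) with two made-up points:

```python
from math import exp, dist

def rbf_kernel(x, y, gamma=0.06):
    """RBF kernel value K(x, y) = exp(-gamma * ||x - y||^2)."""
    return exp(-gamma * dist(x, y) ** 2)

print(round(rbf_kernel((0, 0), (1, 1)), 3))  # exp(-0.06 * 2) ~ 0.887
```

Identical points yield a similarity of 1, and the value decays towards 0 as points move apart.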

SVM  | P1   | R1   | P2   | R2   | F1   | F2   | mAcc   | AUC
SNS  | 0.94 | 1.00 | 0.00 | 0.00 | 0.97 | 0.00 | 0.9406 | 0.550
SWS  | 0.94 | 1.00 | 0.00 | 0.00 | 0.97 | 0.00 | 0.9406 | 0.913
ONS  | 0.94 | 1.00 | 1.00 | 0.07 | 0.97 | 0.14 | 0.9450 | 0.661
OWS  | 0.95 | 1.00 | 0.73 | 0.11 | 0.97 | 0.19 | 0.9446 | 0.923
OWSD | 0.93 | 0.98 | 0.69 | 0.42 | 0.95 | 0.53 | 0.9150 | 0.900

Table 3: Configurations and results of the SVM algorithm. Green indicates highest values and yellow the lowest.

Figure 4: Confusion matrices for the two optimal configurations of the SVM: (a) OWS, (b) OWSD.

5.2 k -Nearest Neighbors

For the k-NN algorithm cross-validation and grid search led to a k of 2, a p of 1, uniform weights and a Manhattan metric. Notably, this even k value conflicts with Kim et al. [2012], who state that for a two-class classification problem k should be odd. In table 4 the results for the five configurations are summarized. In both the standard and the optimal configuration, scaling has an overall positive effect on the performance. Furthermore, as indicated by the colors, the last two configurations performed much better than the first. Depending on the evaluation metric, either OWS or OWSD performs best. As mentioned in section 3, the labelled data is skewed; it is therefore important that the classification algorithm is able to classify both classes correctly, and OWSD seems to do this better than OWS. This is also supported by the confusion matrices displayed in figure 5. Further information on the performances is found in appendix B.2, where the ROC-curves and Precision-Recall curves are displayed for the SNS, OWS and OWSD configurations.
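For reference, per-class Precision, Recall and F-score values like those in tables 3 and 4, together with the mean Accuracy and AUC, can be computed with scikit-learn's metric functions. The toy label and score vectors below are invented for illustration and are not taken from the thesis data:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])        # 1 = resilient (minority) class
y_pred = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 0])        # hypothetical predictions
scores = np.array([.1, .2, .1, .6, .3, .2, .9, .8, .4, .1])  # hypothetical decision scores

# per-class precision, recall and F-score for the two classes
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print("P1 R1 F1:", p[0], r[0], f[0])                     # majority (non-resilient) class
print("P2 R2 F2:", p[1], r[1], f[1])                     # minority (resilient) class
print("mAcc:", accuracy_score(y_true, y_pred))           # 0.8
print("AUC:", roc_auc_score(y_true, scores))
```

On a skewed set like this one, the majority-class scores stay high even when the minority class is classified poorly, which is why the tables report both classes separately.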

k-NN   P1    R1    P2    R2    F1    F2    mAcc    AUC
SNS    0.95  0.98  0.49  0.23  0.97  0.31  0.9398  0.745
SWS    0.98  0.98  0.74  0.68  0.98  0.71  0.9668  0.922
ONS    0.96  0.99  0.74  0.29  0.97  0.41  0.9517  0.726
OWS    0.98  0.99  0.83  0.67  0.99  0.74  0.9719  0.887
OWSD   0.97  0.98  0.83  0.76  0.98  0.79  0.9558  0.962

Table 4: Configurations and results of the k-NN algorithm. Green indicates highest values and yellow the lowest.

Figure 5: Confusion matrices for the two optimal configurations of the k-NN algorithm: (a) OWS, (b) OWSD.

6 Validation

To validate the performance of the model, it was required to classify all incidents, grouped per business service, on resilience. Since the k-NN classifier performed better than the SVM classifier, it was decided to use only k-NN for this classification. For the k-NN classifier the OWSD configuration was chosen, since it had the highest AUC value and therefore seemed the most capable configuration for classification.

With this method all incidents were grouped per business service and put through the trained k-NN algorithm to create a prediction on resilience. The number of incidents considered to belong to a resilient service was divided by the total number of incidents of that service, yielding a resilience percentage. As a result, a list of 717 business services was sorted on their predicted resilience percentage. The first 211 business services had a prediction above 0%, and at the top there were 14 services with a prediction above 50%. A visual representation of the top 211 is displayed in figure 6. All business services used for training were also predicted, providing a partial validation of the performance of the classifier: all eight services used for training were logically positioned in the list, meaning the four resilient services were in the top five and the four non-resilient services were at the bottom. An important assumption made for further validation of these results is that if a business service has more than 50% of its incidents classified as belonging to a resilient service, the overlying business service will also be classified as resilient.
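The aggregation step described above can be sketched as follows; the per-incident classifier output is mocked with a fixed list, where label 1 stands for an incident classified as belonging to a resilient service, and the service names are invented:

```python
from collections import defaultdict

# hypothetical (service, predicted_label) pairs from the trained classifier
predictions = [("A", 1), ("A", 1), ("A", 0),
               ("B", 0), ("B", 0),
               ("C", 1), ("C", 0)]

counts = defaultdict(lambda: [0, 0])          # service -> [resilient count, total count]
for service, label in predictions:
    counts[service][0] += label
    counts[service][1] += 1

# resilience percentage per service, sorted from highest to lowest
pct = {s: 100.0 * res / tot for s, (res, tot) in counts.items()}
ranking = sorted(pct.items(), key=lambda kv: kv[1], reverse=True)

# the >50%-of-incidents rule used above (here applied as >= 50%)
resilient = [s for s, p in ranking if p >= 50.0]
print(resilient)  # → ['A', 'C']
```

In the actual pipeline the `predictions` list would come from running every unlabelled incident of a service through the trained k-NN model.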


Figure 6: A graphical representation of the top 211 predictions made using the k-NN algorithm with the OWSD configuration. All points represent business services with a prediction above 0%, sorted, with green >= 50% and yellow < 50%.

To further validate the results, a sample was extracted from the business service prediction list and sent to ABN AMRO for expert validation. The sample consisted of 12 different services with corresponding percentages. All services in the sample had a similar number of incidents to the services used for training. Five services from the sample were classified as resilient, with percentages ranging from 67.41% to 86.11%. The other seven services were classified as non-resilient, ranging from 49.18% to 0.00%.

Three experts at ABN AMRO independently assessed the business services on resilience using four different methods. As already mentioned in section 3, the information on the performance and reliability of business services is mainly based on internal documents and research at ABN AMRO: based on the number of failures and their impact on the behavior of a service, a metric for resilience is created. All four methods applied by the experts are based on comparable approaches. Thus, this form of validation is only usable for the non-resilient services, since it does not offer any metric to validate whether a business service actually performs in a resilient manner. However, the first two methods (I and II) used for validation were also used to decide which non-resilient business services were used for labelling. Using these methods for validation therefore at least ensures a comparable metric for the classification of non-resilient business services.

According to methods I and II, 5 of the 12 business services used for validation were classified as non-resilient. The remaining 7 services received no classification, which can mean either that it was not possible to classify them at all using the two methods, or that it was possible but there was not enough evidence to classify them as non-resilient. Of the 5 non-resilient business services, only two corresponded with the classifications made by the k-NN algorithm. Taking the results and the validation into account, we can conclude that the validation was insufficient to support the correctness of the approach presented in this research. In section 8, the issues around labelling and validation are further discussed.

7 Conclusion

Taking the results and the validation into consideration, two conclusions are possible for this research. First of all, while the method in its current form could be directly implemented by ABN AMRO, it is difficult to validate the trustworthiness of the results; with certain alterations, discussed further in section 9, this could be resolved. The first conclusion is that there is a lack of metrics on resilience for incidents and their overlying business services. Currently it is only possible to give substantiated arguments on services behaving in a non-resilient manner. Moreover, working with skewed incident data and using it for classification creates problems for validation. After consultation, the experts that performed the validation acknowledged that their resilience metric was not sufficient to validate both classification classes. As stated in section 3, the classification of business services is currently done manually. This classification is based on internal information stating, for example, when a service failed or how often it influenced the performance of other services; in both cases this is past information. In addition, their metric is not as refined as the percentages obtained by the method proposed in this research. This creates a further imbalance between the metrics proposed in this research and those used by ABN AMRO.

The second conclusion is that, looking at the performances of the algorithms, they could already serve as a solid basis for the prediction of resilience. Not only did this research show that it is possible to classify business services based on incident patterns, it was also possible to classify services as either resilient or non-resilient. Therefore, the approach proposed in this research could function as a powerful tool to predict the resilience of business services. When implemented as a proactive monitoring tool, it could even classify incidents and assess the resilience of business services continuously. Using this information, large corporations such as ABN AMRO could get a better insight into the overall performance of the services they offer. However, based on the first conclusion, this method still requires an improved metric for resilience and a more balanced dataset to be able to give solid conclusions on the correctness of the predictions.

8 Discussion

There are multiple parts of this research that need broader discussion and exploration. Before discussing the method, it is important to mention the problems with the data used for this research. As mentioned in section 2.1, a resilient network is predominantly well isolated and therefore has fewer connected incidents, and possibly also fewer incidents in total. This is also reflected in the 1:14 ratio between resilient and non-resilient services in the incident data used for this research. This skewness in the data influences the classification process significantly, since the algorithms struggle with a minority class. A first solution could be adding more data to this minority class.

However, the quality of the data should also be questioned, as done in earlier research [Kamdhi, 2018]. A first important assumption that influences the quality is stated in section 3.2: that a resilience correlation exists between patterns found in incidents and the behaviour of their overlying business services. While incidents do represent failures within parts of a business service, it is not defined which incidents affect the resilience of a service. Therefore, while labelling all incidents based on expert knowledge, it is probable that incidents were labelled that had no effect on the resilience of a service at all. This imbalance between resilience-related and unrelated incidents subsequently increases the noise present in the classification process.

A second assumption worth discussing is the quality of the expert knowledge used for the labelling and validation in this research. While the research was done in close collaboration with ABN AMRO, the scope of this research did not allow further parsing of the information and methods used by the experts. Furthermore, during the validation process it became clear that ABN AMRO still uses multiple metrics for resilience. Standardization, or explicitly choosing one method, is necessary for an approach as proposed in this research. Thus, for this research it is difficult to guarantee a correct connection between the method proposed in this research and the information provided by the stakeholder.

As also discussed in section 1.1, the IT-structure found at ABN AMRO can be divided into multiple layers. Each business service has multiple IT-products and CIs running underneath. An important consequence of this hierarchical structure that is not explored in this research is the inter-connectivity of systems: incidents arising in a certain IT-product or CI can affect multiple business services. While this research does not take this into account, it could add important information on the resilience of business services and possibly reduce the noise present in the incident data.

9 Future work

The goal of this research was to propose an applicable method based on AI to support the resilience classification of business services. While the method as proposed in this research is directly applicable for ABN AMRO, it remains difficult to validate the correctness of the predictions made. However, with certain alterations this could be resolved. These alterations can be divided into improvements for the approach and method proposed in this research, and improvements for the logging of incidents and the resilience metrics used by ABN AMRO.

A first improvement focuses on the quality and selection of the incident data. For this research a decision was made on which features would represent the incidents best; however, since there are hundreds of possible features, countless combinations are possible. Creating a dataset with features that are optimal for the classification of resilience is still open for improvement. A second improvement for the method proposed in this research is increasing the performance of the algorithms by further analyzing and balancing the data. Altering the scaling or dealing with outliers differently could also create new insights into the data. Leaving the quality of the data out of the equation, other improvements are still possible: there are additional methods available that increase the performance of classification algorithms on skewed datasets. For both algorithms, using SMOTE to generate synthetic minority samples would be an interesting improvement [Chawla et al., 2002]. Furthermore, for SVM, using excited support vectors to obtain a better hyperplane could also be of added value [Santiago Hernández et al., 2012].
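As a rough illustration of the SMOTE idea — new minority samples are interpolated between a minority point and one of its minority-class nearest neighbours — the following self-contained sketch could be used; in practice the imbalanced-learn library provides a full, tested implementation:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a random minority point and one of its k nearest minority
    neighbours (a simplified sketch of SMOTE, Chawla et al., 2002)."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                       # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

# four hypothetical minority samples on the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=5)
print(X_new.shape)  # → (5, 2)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority region instead of merely duplicating it, as the OWSD configuration does.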

While the method and approach proposed in this research could still be improved, there are also changes possible at ABN AMRO that could increase the reliability of predictions. Through this research it became clear that there is not yet consensus on the definition of, or a consistent metric for, resilience. Any metric currently used to define resilience is based on failures that occurred within a service; therefore it is mainly possible to define when a service is non-resilient. A possible addition could be a threshold for the number of failures that a resilient service may have. This threshold could be adjusted based on, for example, the size of a business service measured in the number of underlying CIs.
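Such a size-adjusted threshold could take the following shape; the base rate and the scaling per 100 CIs are invented purely for illustration:

```python
def failure_threshold(n_cis, base=5, per_100_cis=2):
    """Hypothetical maximum number of failures per period for a service
    to still count as resilient, scaled by its size in underlying CIs."""
    return base + per_100_cis * (n_cis // 100)

print(failure_threshold(50))   # small service  → 5
print(failure_threshold(350))  # larger service → 11
```

A service would then be labelled resilient only if its observed failure count stays below this threshold, giving a positive criterion for resilience rather than only evidence of non-resilience.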

Another addition that could be implemented by ABN AMRO is a resilience feature in their ITSM environment. This way, if an incident occurs, it would be possible to state whether the incident affected the resilience, or a property of resilience, of the overlying service. With this addition it would be possible to create a better selection from the incident data, since it would already be labelled.


A last suggestion for future research would be creating a link between the incident data and event data. As mentioned in section 2.2, events are like incidents but are created automatically by monitoring tools. Furthermore, events also contain information on when the IT-department deliberately made changes to subsystems. Besides the events database, ABN AMRO could utilize many more internally available databases to get a more refined insight into incidents and their connection to the resilience of systems.


References

[1] Algirdas Avižienis, Jean Claude Laprie, Brian Randell, and Carl Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004. ISSN 15455971. doi: 10.1109/ TDSC.2004.2.

[2] Kash Barker, Jose Emmanuel Ramirez-Marquez, and Claudio M. Rocco. Resilience-based network component importance measures. Reliability Engineering and System Safety, 117:89–97, 2013. ISSN 09518320. doi: 10.1016/j.ress.2013.03.012. URL http://dx.doi.org/10.1016/j.ress.2013.03.012.

[3] Richard Bliek. Resilience Roadmap (DRAFT). pages 1–20, 2017.

[4] Jean Francois Castet and Joseph H. Saleh. Survivability and resiliency of spacecraft and space-based networks: A framework for characterization and analysis. Space 2008 Conference, (September), 2008. doi: 10.2514/6.2008-7707.

[5] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.

[6] Anderson Santos Da Silva, Paul Smith, Andreas Mauthe, and Alberto Schaeffer-Filho. Resilience support in software-defined networking: A survey. Computer Networks, 92:189–207, 2015. ISSN 13891286. doi: 10.1016/j.comnet.2015.09.012. URL http://dx.doi.org/10.1016/j.comnet.2015.09.012.

[7] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. 2016. URL http://www.csie.ntu.edu.tw/~cjlin.

[8] Alrian Kamdhi. Data Quality Bottlenecks in ABN AMRO’s Incident Management Process. 2018.

[9] Marcel De Keizer, Mohammed Bakkar, Željko Obrenović, and Sylvan Rigal. ABN AMRO Internal. (March), 2019.

[10] Jinho Kim, Byung-Soo Kim, and Silvio Savarese. Comparing Image Classification Methods: K-Nearest-Neighbor and Support-Vector-Machines. Applied Mathematics in Electrical and Computer Engineering, pages 133–138, 2012.

[11] David M Knigge. Event Correlation and Root Cause Analysis to Support Incident Management in ITSM Environments. 2019.

[12] Stefanie Michielsen. Data quality within incident management. BSL research, 2020.

[13] Walid Najjar and Jean Luc Gaudiot. Network Resilience: A Measure of Network Fault Tolerance. IEEE Transactions on Computers, 39(2):174–181, 1990. ISSN 00189340. doi: 10.1109/12.45203.


[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[15] Charles Perrow. Normal accidents: Living with high risk technologies-Updated edi-tion. Princeton university press, 2011.

[16] Ronaldo M. Salles and Donato A. Marino. Strategies and metric for resilience in computer networks. Computer Journal, 55(6):728–739, 2012. ISSN 00104620. doi: 10.1093/comjnl/bxr110.

[17] José Santiago Hernández, Jair Cervantes, Asdrúbal López-Chau, and Farid García Lamont. Enhancing the Performance of SVM on Skewed Data Sets by Exciting Support Vectors. 7637(April 2014):101–110, 2012.

[18] Paul Smith, David Hutchison, James P.G. Sterbenz, Marcus Schöller, Ali Fessi, Merkouris Karaliopoulos, Chidung Lac, and Bernhard Plattner. Network resilience: A systematic approach. IEEE Communications Magazine, 49(7):88–97, 2011. ISSN 01636804. doi: 10.1109/MCOM.2011.5936160.

[19] James P.G. Sterbenz, David Hutchison, Egemen K. Çetinkaya, Abdul Jabbar, Justin P. Rohrer, Marcus Schöller, and Paul Smith. Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines. Computer Networks, 54(8):1245–1265, 2010. ISSN 13891286. doi: 10.1016/j.comnet.2010.03.005. URL http://dx.doi.org/10.1016/j.comnet.2010.03.005.


Appendix A Features

A.1 Selected features

Feature type              Amount  Feature names
Identity                  1       number
Temporal                  3       opened_at, closed_at, resolved_at
Natural language          4       description, comments, close_notes, work_notes
Dependency                2       u_it_business_service, u_it_product
Categorical descriptive   14      u_caused_by_change, parent_incident, u_major_incident,
                                  u_business_value, priority, urgency, impact, made_sla,
                                  u_sla_breached, u_complaint_escalation, u_intensive_care,
                                  u_possible_problem, severity, u_security_incident_category
Continuous                7       u_rejected_count, reassignment_count, child_incidents,
                                  business_stc, reopen_count, u_number_users_impacted,
                                  u_waiting_time
Total                     31

Table 5: An overview of all 31 features, chosen in consultation with ABN AMRO and present in the raw dataset extracted from ServiceNow. Every feature is categorized by feature type. The categories are taken from earlier research by Knigge [2019].


A.2 Distributions

Figure 7: Distributions found in the preprocessed, unscaled dataset. In total 15 features are displayed, with in the bottom-right corner a general graph of incidents per business service. Note that for the three histograms (reopen-count, reass-count and child-incidents) the zero values were excluded, making it possible to better display the non-zero values.


Appendix B Results

B.1 SVM

In the following figures, multiple graphs are displayed corresponding to three configurations for SVM: SNS, OWS and OWSD.

Figure 8: ROC-curve for the standard SVM trained and validated on unscaled data (SNS).


Figure 9: ROC-curve for the optimal SVM trained and validated on scaled data (OWS).

Figure 10: ROC-curve for the optimal SVM trained and validated on scaled data with a double labelled minority class (OWSD).


Figure 11: PR-graph for the standard SVM trained and validated on unscaled data (SNS).

Figure 12: PR-graph for the optimal SVM trained and validated on scaled data (OWS).

Figure 13: PR-graph for the optimal SVM trained and validated on scaled data with a double labelled minority class (OWSD).


B.2 k -NN

In the following figures, multiple graphs are displayed corresponding to three configurations for k-NN: SNS, OWS and OWSD.

Figure 14: ROC-curve for the standard k-NN trained and validated on unscaled data (SNS).


Figure 15: ROC-curve for the optimal k-NN trained and validated on scaled data (OWS).

Figure 16: ROC-curve for the optimal k-NN trained and validated on scaled data with a double labelled minority class (OWSD).


Figure 17: PR-graph for the standard k-NN trained and validated on unscaled data (SNS).

Figure 18: PR-graph for the optimal k-NN trained and validated on scaled data (OWS).

Figure 19: PR-graph for the optimal k-NN trained and validated on scaled data with a double labelled minority class (OWSD).
