
Master Thesis

Alarming the Intrusion Expert

Alarm mining for a high performance

intrusion detection system

by

Jos van der Velde

100852573

October 26, 2016

42 ECTS

January 11, 2016 - October 26, 2016

Supervisors:

Dr M W van Someren (UvA)

Drs T Matselyukh (OPT/NET)

Assessor: Dr. J M Mooij (UvA)


Alarming the Intrusion Expert:

Alarm mining for a high performance intrusion

detection system

October 26, 2016

Abstract

Nowadays, intrusion detection systems are indispensable to reveal infiltrators and misconfigurations in networks. On large streams of network data, these systems will raise unmanageable numbers of alarms. This problem can be alleviated by grouping alarms together, so that experts have to check only a single instance of each alarm group instead of every alarm. Various alarm mining approaches have been proposed, using either unsupervised clusterers, supervised classifiers, or a combination. Although combined approaches show strong performance while requiring only a small labelled dataset, existing studies have applied them only to detectors that are based on machine learning. The present study employs an unsupervised algorithm to cluster the alarms of an existing detector that is based on manually created rules. Subsequently, a supervised classifier allows experts to refine the alarm groups. The results promise manageable numbers of homogeneous alarm groups. This approach is helpful because it makes the analysis of alarms feasible for large networks, where intrusion detectors based on machine learning would require too many resources.


Acknowledgements

Firstly, I would like to express my sincere gratitude to both my supervisors: Dr. Maarten van Someren from the University of Amsterdam and Drs. Taras Matselyukh from OPT/NET, for their excellent support. During the project, Dr. Van Someren was a fantastic guide and motivator, and provided precise and swift feedback. Drs. Matselyukh proved to be an outstanding source of domain knowledge and fruitful discussions, provided access to OPT/NET's systems and software, and even allowed me the opportunity to present the work at the Toulouse Space Show. Besides my supervisors, I would like to thank Dr. Joris Mooij for agreeing to be a part of my defense committee. Finally, I would like to thank Hilko van der Leij, Joshua Snell and especially Anne Martens for leaving no stone unturned while proofreading my thesis.


1 Introduction

1.1 Motivation

Networks are constantly bombarded by various types of attacks, including intrusions by masqueraders who obtained a password, data spying or modification, and denial of service attacks. Since the networks of most organisations are nowadays connected to the internet, even those containing highly sensitive information, vulnerability to such attacks is a problem of utmost significance.

When prevention of attacks fails, as it inevitably will, organisations rely on intrusion detection to ensure that (automatic) measures can be taken promptly. Two approaches to intrusion detection can be distinguished: signature-based and anomaly-based detection[29]. The former detects attacks by comparing the network activity with a database of known attack signatures. This ensures fast and reliable detection as long as the signature database stays up to date. The main drawback of signature-based intrusion detection is therefore the vulnerability to zero day attacks (i.e. attacks exploiting a software vulnerability of which the vendor is not aware). This problem is severe, since software updates continually leave new loopholes and the creativity to find new attacks never seems to dwindle.

A solution for patching this vulnerability is given by anomaly-based intrusion detection. This approach forms a profile of normal network activity instead of focusing directly on misuses. For the detection phase, it assumes that new attacks will lead to new patterns of activity. Such patterns arise when the attacker tries to obtain access, or afterwards while exploiting this access. For instance, a change of configuration might signify a malicious user corrupting the network. Given enough normal network activity, features and time to train, this approach will detect any abnormality.

The main drawback of anomaly-based intrusion detection is a high number of false alarms, resulting in an imposing burden of manual labour. These false alarms are inherent to the fundamental assumption of anomaly-based systems: that any new activity pattern equals an intrusion. In reality, researchers estimate that only one out of a hundred new activity patterns denotes an intrusion.[4, p.1, p.3][28, p.444][45, p.2][23, p.2] The other ninety-nine unseen activity patterns will raise false alarms. Another cause of false alarms is patterns of benign activity that cannot be distinguished from malicious patterns. The latter may be an unavoidable sacrifice when opting for near real-time performance on large streams of data, when not all information can be processed in time.

This thesis focuses on reducing the number of false alarms of an anomaly-based intrusion detection system. The problem of reducing false alarms is significant because false alarms require costly human intervention and may lead to human errors due to complacency.1

Moreover, we use an anomaly-based intrusion detection system that is based on manually created rules. Such a detector does not require a computationally expensive training phase, in contrast to detectors that are based on machine learning, making it

1

An extra motivation for reducing false alarms in anomaly-based intrusion detection is the genericity of the problem. When reduced to the broader problem of finding patterns that do not comply with expected behavior, i.e. anomaly detection, similar applications can be found in fraud detection, medical diagnosis, damage detection and image processing.[11, p. 5]


capable of handling larger streams of network data. The drawback, of course, is the reliance on experts to perform knowledge acquisition.

The motivation behind this thesis can thus be summarized as “minimizing human interventions when detecting intrusions on the largest streams of network data.”

1.2 Problem statement

The problem studied in this thesis concerns the reduction of false alarms in the OPTOSS intrusion detection system. We will apply alarm mining to this problem. The goal is to group similar alarms together, so that the human expert needs to assess only a single instance of each alarm group, instead of every single alarm. Although this is not a literal reduction of the false alarms, it reduces the number of false alarms that need to be checked. For example, the human expert might have assessed an alarm regarding a person that tries to log in to the system. The expert might decide to ignore the alarm, or to create an automatic script that should be executed every time such an alarm is encountered. Thereafter, new instances of such alarms should not be shown to the expert, but should be automatically handled. This way, the expert checks only one instance of each alarm group, knowing that the other alarms inside that group are similar. The number of (false) alarms that are shown to the expert is reduced.

We will thus not try to improve the accuracy of the intrusion detector, or indeed change the detector in any way, but only focus on grouping the alarms.

To give a clear grasp of the problem, the OPTOSS will be briefly introduced, whereafter the alarms will be characterized, including a discussion of the concept of similarity, culminating in an explicit problem statement.

OPTOSS is a proprietary intrusion detection system and stands for OPerator Time Optimized decision Support System for ICT infrastructures[1]. This system was conceived a decade ago, and performs the intrusion detection in five steps:

1. Collecting raw data, originating from logs of network activity. Each line of the log denotes an event. An event contains a description, a facility (indicating what kind of activity it originates from, for example “ssh” or “snmp”) and a time-stamp, but no duration: long-lasting activity might produce a single event, as a summary, or multiple consecutive events.

2. Each event receives a severity score which signifies the likelihood that the event is part of an attack.

3. The events are grouped together into event groups, containing consecutive events of the same device (e.g. a router). The events are split into two groups when the summed severity of all events in a time-span is low.

4. The detector raises an alarm over an event group if the severity per second (i.e. the summed severity of all events in that second) fluctuates significantly over time.

5. Similar alarms are grouped together in alarm groups, based on the severity scores of the events inside the alarms.
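To make steps 3 and 4 concrete, here is a minimal sketch in Python. The window size and both thresholds are hypothetical placeholders, not OPTOSS parameters, and the real system operates on richer event records than this toy `Event` type:

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: int  # seconds; events carry a time-stamp but no duration
    severity: int   # assigned by the manually created rules in step 2

def group_events(events, window=5, low_threshold=3):
    """Step 3 (sketch): start a new event group whenever the summed
    severity of the recent events drops below a threshold."""
    groups, current = [], []
    for event in sorted(events, key=lambda e: e.timestamp):
        if current:
            recent_severity = sum(e.severity for e in current
                                  if e.timestamp > event.timestamp - window)
            if recent_severity < low_threshold:
                groups.append(current)
                current = []
        current.append(event)
    if current:
        groups.append(current)
    return groups

def raises_alarm(group, fluctuation_threshold=10):
    """Step 4 (sketch): raise an alarm when the severity per second
    fluctuates significantly within the event group."""
    per_second = {}
    for e in group:
        per_second[e.timestamp] = per_second.get(e.timestamp, 0) + e.severity
    values = per_second.values()
    return max(values) - min(values) >= fluctuation_threshold
```

In this sketch, a burst of events followed by a quiet period yields one event group, and a group whose per-second severity swings sharply triggers an alarm.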

The detector contains two distinguishing features: it is based on manually created rules, and has a bag-of-events based alarm representation. The rules are made by human experts and are used to assign a severity score to each event, in the second step of the detection process. The detection is a direct consequence of these manually


created rules, which contrasts this detector with systems that are based on machine learning. Besides, the alarms are raised over event groups instead of over single events, giving the detector a unique, bag-of-events based alarm representation. An alarm thus contains all consecutive events inside an event group. The group starts and ends at places where the summed severity of the events is low, and originates from a single device. Typically, it ranges over a couple of seconds, containing a few dozen events, although this depends on the rules and the thresholds of the system. Alarms can be classified as true positives (containing intrusive events) and false positives (containing only harmless events). The absence of an alarm can be just, in which case it is a true negative, or the result of missed intrusive events, in which case it is a false negative.

A deficiency of the current OPTOSS is that the alarms are not grouped correctly. First of all, there are too many alarm groups to be manageable for the human operator. Secondly, many alarm groups are heterogeneous. Whereas an alarm contains a heterogeneous group of events, alarm groups should consist of similar alarms, in order to save time for the human expert: if the alarms are homogeneous, the expert needs to analyze only one alarm in each alarm group.

As a concept of similarity, the most practical metric would be to declare two alarms similar when the same action should be taken when the alarm is encountered. Since the action that needs to be taken typically does not follow from the type of events, we used the root cause of an alarm to form a notion of similarity. Examples of root causes are “attempt to obtain root access on device A by method B” or “misconfiguration of program C on device D.” This root cause can be established by an expert, based on an analysis of the events, so that the similarity within alarm groups can be evaluated. Two alarms are thus similar when they share the same root cause. In practice, the lines dividing two alarm groups may be cloudy: two Denial-of-Service attacks could differ in the size or ingenuity of the attack, requiring different countermeasures, and justifying the allocation of two different root causes. A different notion of similarity differentiates alarms based on whether they are true or false positives. This notion is more transparent, although it can still be ambiguous at times. In this thesis, we will use both definitions of similarity (sharing the same root cause, or both being true or both being false positives) to evaluate the similarity within alarm groups. These considerations lead us to the following problem statement:

Problem statement: To group alarms that are raised by the OPTOSS detector into alarm groups, in such a way that each alarm within an alarm group has the same root cause. In this problem statement:

• The OPTOSS detector is an existing intrusion detection component. We will not modify this component.

• An alarm is a group of events (log entries having a time-stamp but no duration). An alarm contains all events that were encountered at a single device, over a time-span of typically a few seconds.

• An alarm group is a group of alarms that were raised at a single device. If the grouping is performed correctly, an alarm group can be identified with a single root cause, that is, a root cause that is shared by all alarms within the alarm group. These alarm groups form the output of our new components, in combination with an assignment of each alarm to a single alarm group.

• The root cause of an alarm is the activity that generated most of the events that led to the alarm. An example is “attempt to obtain root access on device A by


method B.” The root cause of each alarm is determined by experts. Whenever the dividing line between two root causes was vague, we used the rule of thumb that the causes were seen as distinct if and only if they trigger the expert to take a different action. This means, for instance, that an alarm that is caused by a successful attack always belongs to a different alarm group than one that is caused by an unsuccessful attack. The root cause of an alarm is typically based on a subset of the events inside the alarm, the rest of the events being noise. While the OPTOSS runs, the root cause of each alarm is unknown to the algorithms. Instead, hints about the root cause need to be distilled out of the information that is available about each alarm.

• The result will be groups of alarms that have similar root causes. This is helpful, because the expert then has to assess only a single alarm from each alarm group, knowing that the rest of the alarms do not need to be checked, because they have a known cause. The expert can now even continue to set up automatic scripts, defending against new alarms that belong to known root causes.

For this research, an abundance of raw network data was available. A labelled dataset, consisting of alarms grouped correctly together, was not available.

1.3 Our Contributions

We propose to combine the existing, anomaly-based OPTOSS detector with alarm mining components. Since a labelled dataset for our test network was not available, which is the case for most applications of the OPTOSS, we primarily relied on an unsupervised K-means clusterer. Like any unsupervised component that works on a sufficiently complex problem, the clusterer makes mistakes. To correct such mistakes we included the possibility to refine the alarm groups. This way, the human expert can use their domain knowledge to combine two groups that were seen as separate by the clusterer, or to divide an alarm group into new alarm groups. The manual feedback should then be exploited by the system, by using it as a labelled dataset, to correctly group future alarms. This is the task of the supervised component, a Random Forest classifier. The requirement of human feedback is thereby reduced in two ways: firstly, experts need to cluster only some of the alarm groups, and not the individual events. Secondly, the clusterer already performs a grouping of the alarms. Given that the clusterer works well, the expert needs to make only small changes.
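The cluster-then-refine-then-classify pipeline can be sketched as follows, with scikit-learn's KMeans and RandomForestClassifier as stand-ins for our components. The feature vectors, cluster count and the expert's correction are all synthetic placeholders, not values from our system:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in for numerical alarm features (real features would be
# derived from the events inside each alarm).
alarms = np.vstack([rng.normal(0, 0.3, (40, 4)),
                    rng.normal(3, 0.3, (40, 4))])

# Unsupervised step: group the alarms without labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(alarms)

# Refinement step: the expert corrects the clusterer, here by
# splitting off a (hypothetical) subgroup into a new alarm group.
refined = groups.copy()
refined[:5] = 2

# Supervised step: a classifier learns the refined grouping and
# assigns future alarms to the expert-adjusted groups.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(alarms, refined)
new_alarm = rng.normal(3, 0.3, (1, 4))
predicted_group = clf.predict(new_alarm)[0]
```

The key design point is that the classifier trains on the refined groups rather than the raw cluster labels, so the expert's domain knowledge is carried forward to future alarms.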

The benefit of this approach is a system enabling real-time detection of previously unseen intrusions in large data streams, employing an unlabeled dataset. The drawbacks are a reliance on high quality rules for the detector and a reliance on manual feedback for the classifier. The proposed approach thus comes into its own for problems with large networks and high quality (expensive) experts, when labelled data is not available.

Of course we are not the first to propose to combine a clusterer and a classifier in sequential order to improve intrusion detection. But to the best of our knowledge, the architecture of our solution makes it unique by combining a rule-based anomaly detector with a clusterer and a classifier. The 2016 work of MIT research scientist Veeramachaneni and his colleagues from the company PatternEx, aptly titled “AI2, training a big data machine to defend”[56], is most closely related to this thesis. The main difference is that Veeramachaneni et al. proposed a detector which implements


outlier detection by an ensemble of clustering techniques, whereas we used an existing rule-based (and bag-of-events based) detector which we combine with a separate clusterer. Since our clusterer works on a significantly smaller subset of the data, namely the alarms and not every individual event, the computational complexity of training our system is lower than that of Veeramachaneni et al. Since the OPTOSS detector itself is efficient, this results in our solution using less processing power - an important factor when working with large streams of data - although it requires more expert knowledge to set the system up.

Other related works in alarm mining apply only a clusterer to an existing detector. The main difference with our system is that we refine the results of the clusterer with a supervised classifier, so that domain knowledge can be exploited to form better alarm groups.

1.4 Metrics

To evaluate the proposed system, we need to evaluate both the clusterer and the classifier. Firstly, we need to make sure that the computational requirements of the system are acceptable. This can be analyzed theoretically, and verified by measuring the performance of the system on a live network.

To evaluate the alarm clusterer, we will assess the number and the quality of the alarm groups. The reduction ratio can be used to evaluate the former, describing the number of alarm groups relative to the number of unique alarms:

Reduction ratio = (# unique alarms − # alarm groups) / # unique alarms

The reduction ratio should be high, signifying a relatively low number of alarm groups. Evaluating the quality (i.e. the homogeneity) of the alarm groups is less straight-forward. We will use the following metrics, the first two introduced by Pietraszek[46] and the last by Zhang[2]:

• The average fraction of true positives covered by alarm groups that contain more true positives than false positives. Ideally, this should be 1.

• The average fraction of false positives covered by alarm groups that contain more false positives than true positives. Ideally, this should be 1.

• The ratio of groups with only true positives, the ratio with only false positives, and the ratio of mixed groups. The ratio of mixed groups should ideally be 0.

These metrics indicate the similarity within alarm groups, differentiating only between true and false positives. Since we are also interested in differentiating alarms based on their root causes, as described in subsection 1.2, we also included the following metrics:

• The average group similarity: the ratio of alarms inside each group that originate from the same root cause, averaged over all alarm groups. Ideally, this should be 1.

• The ratio of mixed groups (containing alarms with different root causes). This ratio should ideally be 0.
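The reduction ratio and the two root-cause metrics can be computed as in the sketch below. The grouping in the accompanying example is a toy illustration with invented root-cause labels, not data from our experiments:

```python
def reduction_ratio(n_unique_alarms, n_groups):
    """(# unique alarms - # alarm groups) / # unique alarms."""
    return (n_unique_alarms - n_groups) / n_unique_alarms

def group_quality(groups):
    """Root-cause metrics over alarm groups, where each group is a
    list of root-cause labels (one label per alarm).

    Returns (average group similarity, ratio of mixed groups)."""
    similarities = []
    mixed = 0
    for group in groups:
        # Fraction of alarms sharing the group's dominant root cause.
        dominant = max(set(group), key=group.count)
        similarities.append(group.count(dominant) / len(group))
        if len(set(group)) > 1:
            mixed += 1
    return sum(similarities) / len(similarities), mixed / len(groups)
```

For example, three groups with root-cause labels `["a","a","a"]`, `["b","b","c"]` and `["d"]` give an average group similarity of 8/9 and a mixed-group ratio of 1/3.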


As baseline, we use the existing OPTOSS system. This system already has a clusterer, forming alarm groups based on the severity score. Our goal is to improve both its reduction ratio and its group quality.

To evaluate the alarm classifier, we will simply measure its accuracy. The accuracy is the ratio of alarms that are assigned the correct alarm group. In contrast with the evaluation of the clusterer, we do know the correct assignments for the classifier, since it has to mimic the assignments of the human expert and the clusterer. As baseline, we will use the accuracy of the clusterer, which will be excellent except for alarms similar to those that were changed by the human expert.

Comparison to other approaches is not straightforward, mainly because the quality of our approach depends highly upon the alarms that are outputted by a detector that has not been used before in the alarm mining literature. Comparing these alarms with the alarms of other detectors proves difficult, because OPTOSS uses a bag-of-events based representation. Consequently, using the same alarms is impossible, since the representation differs. Moreover, we would need to use the same dataset as other approaches. Although many researchers use the outdated 1999 DARPA [36] or the arguably better 2012 UNB ISCX [52] dataset, the most related work, of Veeramachaneni et al., uses its own data only. With these considerations in mind, we have used our own dataset as well. We are therefore not able to directly compare the performance of our system with other approaches.

1.5 Thesis outline

The rest of this thesis will be structured as follows. First of all, the necessary background of intrusion detection (section 2) and machine learning (section 3) will be covered quite thoroughly. These sections are covered in depth to allow novices to get up to speed, and, since this is a thesis, to display the fruits of UvA’s machine learning education. Once the necessary background is in place, the related works section (4) will list other approaches to solve similar problems. The current OPTOSS will be treated in section 5, whereafter the proposed changes are described in section 6. Following a famous statistician, who was quoted saying “in God we trust. All others must bring data,” we will do just that, describing the experiments in section 7 and the results in section 8. Finally, this thesis will be summarized in the conclusion (section 9).

2 Background intrusion detection

This section is aimed at those that are not familiar with the intrusion detection domain. In this section, intrusions will be characterized, intrusion detection techniques will be described, and approaches to reduce the false alarms will be listed. We suggest that seasoned intrusion detection specialists skip this section and continue with the machine learning background (section 3).

2.1 Intrusions

Intrusions can be classified in a multitude of ways. Here we will mention the five types as defined for the 1999 DARPA dataset[36]:



1. Remote to Local: attempt to obtain user privileges.

2. User to Root: attempt to obtain root access.

3. Data compromise: attempt to read or modify data.

4. Denial-of-Service: attempt to disrupt the network, e.g. by consuming the network’s resources.

5. Probe: scanning of a network, for instance to look for vulnerabilities.

To determine which attributes of the events we want to process, we need to account for these different attack types. To detect Remote to Local, User to Root and Data compromise attacks the system needs so-called content-based features, which simply give clues about the actions the events correspond to. Denial-of-Service and Probe attacks can be recognized by time-based and connection-based features, which show respectively the number of events over time and the number of events originating from the same source.[34][33, p.6]
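As an illustration, time-based and connection-based features amount to simple counts over the event stream. The event tuples, window size and addresses below are invented for the sketch and are not part of any particular detector:

```python
from collections import Counter

# Hypothetical events: (timestamp in seconds, source address).
events = [(0, "10.0.0.1"), (1, "10.0.0.1"), (2, "10.0.0.2"), (61, "10.0.0.1")]

def time_based(events, window=60):
    """Events per time window: a sudden burst hints at a Denial-of-Service."""
    return dict(Counter(ts // window for ts, _ in events))

def connection_based(events):
    """Events per source: many events from one source hint at a Probe."""
    return dict(Counter(src for _, src in events))
```

Content-based features, by contrast, would look at the textual description of each event rather than at counts.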

2.2 Detectors

Intrusion detection techniques are either signature-based (also known as misuse detection) or anomaly-based, a distinction made by Kumar in 1994.[29] On the one hand, misuse detection forms the most straightforward technique and is based on a database of known malicious events (signatures). The system simply compares the incoming events to the database, triggering on known intrusions. Anomaly-based techniques, on the other hand, as proposed by Denning[15], create a model of normal behavior, regarding every deviation from this model as an intrusion. This enables anomaly-based systems to detect unfamiliar events, but generally results in higher false-positive rates compared to their misuse-based counterparts.

Following Northcut[42], we can also distinguish intrusion detection systems based on the placement of the systems, being either on the network or on a host. In the former case, the so-called Network-based Intrusion Detection Systems (NIDSs) will directly monitor the packets of network traffic, enabling them to interfere with possible attacks before they have even reached a device. Host-based Intrusion Detection Systems (HIDSs), on the other hand, analyse system logs, making them capable of detecting intrusions once they have reached the system, even when the device is offline (in contrast to NIDSs). In many ways, NIDS and HIDS complement each other, and a combination of both is recommended in most cases, in which the HIDS can be regarded as the last line of defence.

Another distinction can be made between the type of information the system uses, separating systems based on machine learning from systems based on manually crafted rules. These manually generated rules can either specify normal behavior or malicious behavior (in the latter case, the rules are also referred to as “signatures”). Machine learning based systems, on the other hand, train to infer rules from a dataset. Many different machine learning techniques are applied in the intrusion detection domain: for instance neural networks, support vector machines, K-nearest neighbours, genetic algorithms and techniques based on association rules or fuzzy logic.[37]

As final distinction, some systems are capable of autonomous actions after a known intrusion is detected. Such systems are known as intrusion prevention systems


instead of mere intrusion detection systems. Typical actions of such intrusion prevention systems include terminating the connection, blocking the attacker’s IP address and changing the content of the attack.[41, pp. 2-2 and 2-3]

The OPTOSS detector can be labeled as a host-based intrusion prevention system that relies on manually crafted rules. It is neither completely signature-based nor completely anomaly-based, exhibiting characteristics from both approaches. This will become clear in section 5, where the architecture of the current system will be explained.

2.3 Reducing false alarms

Different approaches for reducing the number of false alarms can be found in the vast literature of anomaly-based intrusion detection: enhancement of the detector, alarm verification, alarm prioritization, alarm correlation, hybrid methods, and - the approach we used - alarm mining.3 All but the first rely on placing an extra component after the detector for further processing of the alarms. Let us first identify the benefits of alarm mining, followed by a discussion of why the other approaches by themselves are not sufficiently capable of reducing the number of false alarms, although they might form interesting additions.

Using alarm mining, we present the human expert a significantly reduced number of alarms by grouping the alarms together based on attributes such as protocol (e.g. syslog, snmp), type of job (e.g. traffic, sshd, cron), and message (e.g. “IP spoofing! From # to #, proto #. Occurred # times.”) Now the expert only needs to assess each type of attack once, reducing the number of seen alarms easily by two orders of magnitude. To implement such an alarm mining based approach, an extra component is put in place to reduce the output of the anomaly detector. Once we realize that the problem of grouping alarms together is reducible to finding similarities in a dataset, we can bring forward the extensive toolbox of machine learning. Algorithms can be borrowed from unsupervised learning (alarm clustering without expert feedback) and supervised learning (alarm classification based on expert feedback). The advantage of alarm mining, then, is that it allows the expert to assess each type of attack only once. Disadvantages include the sensitivity to network changes (e.g. reconfiguring the network might make new alarms unrecognizable [23, p. 14]), the inability to classify alarms as harmless or intrusion - it is not a panacea - and, in case of supervised learning, problems in obtaining a large and consistent dataset.
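A toy sketch of such attribute-based grouping, assuming exact matches on the three attributes mentioned above. Real alarm mining clusters on numerical features rather than exact string matches, and the third alarm’s message is invented for the example:

```python
from collections import defaultdict

# Hypothetical alarm records with the attributes mentioned above.
alarms = [
    {"protocol": "syslog", "job": "sshd",
     "message": "IP spoofing! From # to #, proto #. Occurred # times."},
    {"protocol": "syslog", "job": "sshd",
     "message": "IP spoofing! From # to #, proto #. Occurred # times."},
    {"protocol": "snmp", "job": "traffic",
     "message": "Link down on interface #."},
]

# Group alarms that agree on all three attributes.
groups = defaultdict(list)
for alarm in alarms:
    key = (alarm["protocol"], alarm["job"], alarm["message"])
    groups[key].append(alarm)

# The expert now assesses one representative alarm per group.
representatives = [members[0] for members in groups.values()]
```

Here three alarms collapse into two groups, so the expert inspects two representatives instead of three alarms; on real data the reduction is far larger.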

The other approaches are by themselves not sufficiently capable of reducing the number of false alarms. First of all, straightforward enhancement of the detector is not a feasible solution for reasons already stated in the introduction: there are performance limits when crunching large streams of data in real time.

The second option, alarm verification, might be the most ambitious. It attempts to identify whether an attack was successful. A distinction can be made between active and passive alarm verification, the former monitoring the network in real time, whereas the latter relies on a database of attack successes. Alarm verification brings a new advantage to the table: it makes it possible to distinguish between harmless activity and intrusions, promising to reduce the number of alerts drastically. Currently

3

This enumeration is similar to the one found in a signature-based survey [23].



however, such methods are not reliable: mimicry attacks can trick the network into believing the attack failed.[55][23, p. 12]

Alarm prioritization takes an alternative path by assigning a priority score to alerts, whereafter the number of alarms is reduced by simply ignoring all but the most important alarms. To compute the priority score the system is given (or learns) rules about the importance of the target entity, about how well that part of the network is configured and about the correlation to similar alerts. Although some priority indication may be a helpful addition to any intrusion detection system, the drawback of using it to ignore attacks seems obvious: the prioritization algorithm has essentially the same task, and thus the same limitations, as the detector, and we cannot expect it to perform better than the detector itself.

The literature does not agree on a single definition for the last unmixed approach: alarm correlation. Some use the term when grouping together alarms generated by various sensors[49], some when grouping together alarms from multiple intrusion detection systems[62]. We follow [23, p. 15] and find a common ground by extending the definition of alarm correlation to “any attempt to group non-similar alarms together in attack scenarios”. Here the adjective non-similar is of vital importance, differentiating this approach from alarm mining. But the two techniques are closely related: without clustering similar attacks first, the relation between occurrences of non-similar alarms might be less clear. Therefore, we see alarm correlation as a next step after alarm mining is performed. The result of alarm correlation is a number of attack scenarios, grouping non-similar alarms together that stem from a single process or a single attacker, possibly scattered over a longer time and over multiple devices. The (dis)advantages of this approach are similar to alarm mining, although we speculate that alarm correlation may require more (error-prone) fine-tuning, since measures of similarity are usually more obvious than parameters determining the correlation of non-similar entities (e.g. the influence of time on the probability of belonging to the same attack).

Finally, the listed techniques are not mutually exclusive and lend themselves well to be combined in hybrid systems.5 Any anomaly-based intrusion detection system will benefit from alarm mining and a prioritization score, whereafter the number of alarms can be reduced even further by applying alarm correlation. Moreover, alarm verification might assist to ignore alarms (this might be a risky procedure) or to enhance the prioritization score.

3 Background machine learning

This section is meant for those that do not have sufficient machine learning experience. It brings the reader up to speed on feature extraction, feature preprocessing, and unsupervised, supervised and semi-supervised machine learning techniques. We recommend that machine learning enthusiasts skip this section and continue with the related work (section 4).

Before we start with the machine learning concepts, let us give a quick preview of section 6 in order to get familiar with the overall framework of the machine learning solution of this thesis. Our goal is to group similar alarms together. Our dataset therefore consists of alarms. Each alarm consists of features that give clues about the similarity between alarms. An example of a feature is the most important description (the text of the log entry) of an alarm. We cannot simply use string attributes (i.e. textual data), because the machine learning algorithms expect numerical values, so we perform feature engineering. Once we have obtained a numerical dataset, we can cluster together those alarms that have similar feature values. This is done using unsupervised machine learning, and results in a number of alarm groups. Subsequently, we will allow the human expert to change some alarm groups (the expert might, for instance, correct mistakes of the clusterer). The resulting alarms, labeled with their alarm group, will form the dataset for our supervised machine learning algorithm. This algorithm trains on the labeled dataset, and should learn to predict the correct alarm group for all new alarms.

⁵ As a cautionary note: the usage of the term ‘hybrid’ is ambiguous in intrusion detection literature. The term ‘hybrid anomaly-based intrusion detection systems’ should not be confused with mere ‘hybrid intrusion detection systems’, which can denote a combination of an anomaly-based system with a signature-based counterpart (capable of increasing the accuracy, not of decreasing the amount of false alarms), or, depending on the author, any combination of intrusion detection techniques.

The rest of this section will explain the basics of machine learning, and will therefore not directly relate the mentioned techniques to the work of this thesis. Of course, we mainly treat techniques that we used or considered using, the exception being section 3.4 about semi-supervised learning, which is added for completeness.

3.1 Features

It all starts with a dataset. Before any pattern can be recognized, before any cluster can be discovered, we need to establish and optimize our dataset. The dataset is established by deciding which attributes to use (feature selection) and, if desired, by creating features using a function over one or more attributes (feature engineering). The features can then be polished by different preprocessing techniques.

The importance of using the right features cannot be overstressed. People tend to underestimate this importance, focusing solely on the machine learning algorithms. In practice, deciding which features to use, and how, is a vital part of any applied machine learning research, and demands a significant share of time. In the words of Pedro Domingos, writer of a paper titled A Few Useful Things to Know about Machine Learning: “some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”[17, p.82] With that being said, we will give a brief introduction to feature selection, feature engineering and feature preprocessing, highlighting the main considerations.

First of all, feature selection is concerned with choosing the attributes that supply the best information for the machine learner. The straight-forward approach of using all attributes is usually not optimal: some attributes will worsen the performance. The reason might be that they contain redundant or irrelevant information. Redundant information may be caused by correlated attributes (e.g. the number of seconds and the number of milliseconds of an event), which will bias the learner to overvalue this information, resulting in worse performance. Irrelevant information can worsen the performance when the machine learning algorithm does find a pattern in the training data, a pattern that is not likely to exist in new data instances. Choosing the right attributes might be done by expert intuition, although in practice it is often better to try different combinations of attributes, because machine learning is usually not applied to problems that are simple enough to enable good human intuition.

Secondly, when the right attributes are chosen, some might not be in a numeric format, and might need a transformation step to make it possible to feed them into the machine learning algorithms. This transformation of non-numeric into numeric attributes is denoted as feature engineering. The two most commonly encountered types of non-numeric data are categorical attributes (e.g. the protocol of an event, such as SNMP or SSH) and string attributes. We will describe feature engineering steps for both.

Regarding categorical data, most machine learning algorithms will assume that the distance between feature values is meaningful. That means that assigning a number to each category will jeopardize the system. When, for instance, the colors are encoded as (red: 1), (blue: 2) and (green: 3), many machine learning algorithms will incorrectly assume that red and blue are closer to each other than red and green. To fix this problem, one-hot encoding can be applied: for each category a binary feature is created, which is one if the data instance belongs to that category and zero otherwise.
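As an illustration (not the encoding code used in this thesis), a minimal one-hot encoder can be sketched in a few lines of pure Python:

```python
def one_hot(values):
    """Map each categorical value to a binary vector with a single 1."""
    categories = sorted(set(values))          # fix an order for the categories
    index = {c: i for i, c in enumerate(categories)}
    return categories, [[1 if index[v] == i else 0
                         for i in range(len(categories))]
                        for v in values]

categories, encoded = one_hot(["red", "blue", "green", "red"])
# categories: ['blue', 'green', 'red']; 'red' becomes [0, 0, 1]
```

With this encoding, the distance between any two different categories is equal, so no spurious ordering is introduced.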

Other non-numeric attributes, including string attributes, are less easily transformed into numeric values. A direct translation into categorical data is possible, for example when “authentication failed for user”, “failed password for user” and “server unreachable” are denoted as three completely separate categories. But this direct translation is usually not preferred, because the degree of similarity between separate strings is often informative, as the previous example illustrates: the first two strings describe events that are interchangeable for an intrusion detector. Multiple approaches are possible to establish a notion of similarity between non-numeric values, amongst others string metrics and Bag-Of-Words models.

When using the first option, we rely on string metrics to give similar strings similar numerical values. The most popular string metric, the Levenshtein distance[35], is defined as the minimum number of edits (insertion, deletion or substitution of a character) needed to transform one string into the other. Another example is the Jaro distance[26], which is based on the ratio of matching characters (the same characters at similar positions) and the number of transpositions needed to move these matching characters so that one string transforms into the other. Such a metric can be used to compute the distance between each string and a set of example strings; these distances can then serve as features. To obtain strong features, a representative subset of the vocabulary should be used for the example strings.
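To make the Levenshtein distance concrete, here is the standard dynamic-programming formulation (a textbook sketch, not the implementation used in this thesis):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform string a into string b."""
    previous = list(range(len(b) + 1))   # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]

levenshtein("kitten", "sitting")   # the classic example: distance 3
```

The row-by-row table keeps the memory use linear in the length of the shorter string, while the running time stays proportional to the product of the two lengths.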

The second option, the Bag-Of-Words model, counts the occurrences of each word in each string, creating a high-dimensional matrix that can be used as features. When the word order should be taken into account, other approaches are possible as well, but these require more computational power. For most feature engineering problems, therefore, string metrics or Bag-Of-Words models are the only feasible solutions for transforming string attributes into numeric values. Both approaches have their own applications. If two strings are similar when they contain many of the same words, and it is computationally feasible, the usage of a Bag-Of-Words model is recommended. Otherwise, string metrics are the way to go.
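A Bag-Of-Words feature matrix can be built with plain word counting. The sketch below is illustrative only: it uses whitespace tokenization and lowercasing, whereas a real pipeline would normalize punctuation and rare words more carefully:

```python
def bag_of_words(strings):
    """Return the vocabulary and a per-string word-count matrix."""
    vocabulary = sorted({w for s in strings for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocabulary)}
    matrix = []
    for s in strings:
        counts = [0] * len(vocabulary)
        for w in s.lower().split():
            counts[index[w]] += 1
        matrix.append(counts)
    return vocabulary, matrix

vocab, features = bag_of_words(["failed password for user",
                                "authentication failed for user"])
# The two log lines share the words 'failed', 'for' and 'user',
# so their count vectors end up close to each other.
```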

After the features are selected or created, they need to be polished. This stage is called feature preprocessing. It starts with feature cleaning, which should ensure that missing values are fixed, or that the data instances containing them are deleted. Afterwards, data transformations such as whitening (a transformation to obtain zero mean and unit variance) are often needed to ensure that the machine learning algorithm can perform well. Dimensionality reduction is also frequently used: to decrease the storage space, to speed up later algorithms and even to make them perform better by accentuating the most important information. Multiple methods for dimensionality reduction may be applied, most notably Principal Component Analysis (PCA), auto-encoders and Fisher’s linear discriminant. We will briefly introduce PCA, since we applied this method, after which we will indicate the differences between PCA, auto-encoders and Fisher’s linear discriminant.

Principal Component Analysis[21][22] (PCA) performs an orthogonal projection into a lower dimensional linear space. This lower dimensional space is spanned by vectors called the principal components. The algorithm first chooses the direction of the first principal component. It does this in such a way as to maximize the variance of the data in the dimension of this vector. To avoid trivial solutions, the principal component is restricted to unit length. This way, the first principal component will be the most informative, in the sense that the diversity between the datapoints, seen by looking only at the data projected to this dimension, is maximized (i.e. higher than the diversity of the data in any other direction). The other principal components are then iteratively added the same way, but orthogonal to the previous principal components. The second principal component will thus be the second most informative vector. As a bonus, PCA will reduce the noise, given that the variance of the noise is equal in all dimensions, since the variance of the signal will be higher than average in the first principal components, while the variance of the noise stays the same.

Implementations of Principal Component Analysis can make use of eigenvector decomposition, because the mathematics shows that the first principal component equals the eigenvector of the covariance matrix with the largest eigenvalue. Dimensionality reduction can now be performed by using only the first n principal components. PCA works well as long as three assumptions are met: the original dimensions of the data are comparable; the variance of the data projected onto a vector is a meaningful criterion for informativeness; and the variances of the data can be well separated in a linear subspace. The first assumption is usually met by performing feature normalization first. This way, each dimension will start off with unit variance. If the data is not normalized, PCA will consider the dimensions with the highest variance to be more important, which might indeed be preferred when the dimensions are of comparable origin. The second assumption states that PCA should not be used when the variance is not the most informative criterion. This might be the case when other information is available, for instance the class labels of each datapoint, in which case the use of Fisher’s linear discriminant[20] would be more appropriate. The last assumption can push the researcher towards non-linear dimensionality reduction methods such as auto-encoders[48]. Depending on the separability of the data in a linear projection, a non-linear reduction gives more informative principal components. This, of course, comes at a cost: the computational complexity of non-linear dimensionality reduction algorithms is higher than the complexity of plain PCA.
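The eigenvector-based recipe can be summarized in a short NumPy sketch (an illustration of the technique, not the PCA implementation used in this thesis; it assumes the features have already been normalized):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    Xc = X - X.mean(axis=0)                   # centre the data
    cov = np.cov(Xc, rowvar=False)            # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]            # the first principal components
    return Xc @ components                    # lower-dimensional projection
```

Note that `np.linalg.eigh` returns the eigenvalues in ascending order, so the columns must be reordered to obtain the components with the largest variance first.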

3.2 Unsupervised clustering

Unsupervised machine learning groups data into clusters without using labeled data: the algorithm does not receive any feedback. Instead, it is assigned a distance function (usually Euclidean distance) and uses one of two possible approaches: hierarchical or flat clustering.[58, p.645]

Hierarchical clustering creates a tree-like structure where leaves with the same parents represent data points that are close to each other, and intermediate nodes with the same parents represent clusters close to each other. From the bottom up, each level of intermediate nodes groups the data into a smaller number of clusters. The final clustering is determined by choosing a level, returning each node of this level, with its children, as a separate cluster. While basic approaches like single-linkage clustering have a high computational complexity of O(n²), newer algorithms like BIRCH can achieve O(n).[60] Hierarchical clustering is especially strong when the data exhibits an underlying ordering.

Flat clustering, on the other hand, groups the data by minimizing a certain distance-based criterion. As computing the optimal cluster locations and cluster assignments is typically NP-hard, approximation algorithms have been developed. Well-known algorithms, all minimizing a different criterion, include k-means, mixture models and DBSCAN:[58]

1. K-means: the well-known k-means⁶ algorithm is the go-to algorithm for most data scientists because of its simplicity and good performance. It minimizes the sum of squared distances between each data point and the centroid (mean) of its cluster, by iteratively performing two steps: keeping the centroids constant and assigning the data points, and subsequently keeping the assignments constant and refining the centroids based on these assignments.[6, pp. 424-428]

Pros: K-means converges fast (O(n)) and the implementation is simple.

Cons: K-means is sensitive to the initialization and can converge to local optima. Furthermore, the square makes the algorithm sensitive to outliers and a bad choice for data with clusters that do not have a hyperspherical shape. Lastly, if the distribution of points over the clusters is heavily skewed, a large ‘natural cluster’ might be divided at the expense of grouping small ‘natural clusters’ together.

2. Mixture model: a mixture model assumes that the data is generated by K distributions, where each distribution belongs to a cluster. After choosing a type of distribution (e.g. a Gaussian or Bernoulli distribution), the joint probability of each point belonging to its cluster is maximized by maximum likelihood estimation. To approximate the maximum likelihood solution, expectation-maximization is often applied. This algorithm works in a way similar to the k-means algorithm, by iteratively assigning each data point to a cluster and refining each cluster given these assignments.[6, pp. 430-448]

Pros: Firstly, a mixture model provides insight into its uncertainty by computing the probability that each data point belongs to each cluster (making so-called soft cluster assignments). Secondly, a mixture model has more parameters: not only the centroid of each cluster is estimated, but also the covariance. This makes mixture models better suited for non-hyperspherical (but hyperoval) cluster shapes of different sizes.

Cons: Expectation-maximization has a slow convergence rate: it needs more, and more expensive, iterations than k-means.[6, p.438] Furthermore, a mixture model needs more initial parameters than k-means, is sensitive to them, can converge to local optima, and cannot handle correlated features (i.e. a singular covariance matrix).

⁶ To be precise, the proper name would be k-medoids when a distance function other than Euclidean distance is used.


3. DBSCAN: DBSCAN is a density-based clustering algorithm. Clusters are formed by checking the density around a data point within a predetermined distance. If this density is higher than a predetermined threshold, a cluster is formed, containing all close-by data points (expanded iteratively as long as the density remains high). Data points that do not belong to any cluster, because the density around them is too low, are labeled as noise.[19]

Pros: The density-based approach makes DBSCAN suitable for any cluster shape. Furthermore, the algorithm is robust to outliers and needs only two initial parameters. The run-time complexity, O(n log n), is acceptable for most applications as well.[19, p.229]

Con: The algorithm will not work well unless the density is fairly constant across clusters.
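To make the alternating structure of flat clustering concrete, the two k-means steps described above can be sketched in pure Python (a toy version for illustration; a real implementation would add smarter initialization and a convergence check):

```python
import random

def k_means(points, k, iterations=20):
    """Toy k-means: points are tuples, distance is squared Euclidean."""
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 1: keep the centroids fixed, assign each point to the nearest one.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Step 2: keep the assignments fixed, move each centroid to its cluster mean.
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return centroids, clusters
```

Because the initial centroids are sampled randomly, different runs can converge to different local optima, which is exactly the sensitivity to initialization listed under the cons above.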

3.3 Supervised classification

Supervised machine learning trains on a labeled dataset to classify new data points into the right group. Many approaches have been proposed over the years, including kernel-based methods (e.g. Support Vector Machines), variational inference and graphical models like Decision Trees or Bayesian Networks.[6] And although the No Free Lunch Theorem states that each method has its benefits and problems, in recent years two models have stood out in overall performance: Neural Networks and Random Forests.[57]

Neural Networks originated from an attempt to mimic the human brain, but soon diverged from their natural counterpart by dropping restrictions that were deemed unnecessary. Nowadays the term Neural Network refers to a broad range of classification models, identifiable by their multilayered structure. A neural network consists of layers, which themselves consist of nodes. It all starts with an input layer including a node for each feature (e.g. for each processed attribute of the events). It ends with an output layer that has a node for each class (e.g. for each cluster of alarms). Whenever a data point is fed into the input layer, it will result in a value for each output node. The output class is then the output node with the highest value. In between there can be one or multiple hidden layers. In its simplest form, the network is called a Feedforward Neural Network, and all nodes in each layer are connected to all nodes in the previous and next layers. A Neural Network makes predictions by performing a feed-forward round, whereby the values of all hidden and output nodes are computed as a function of the nodes they are connected to in the previous layer(s). To be exact, the value of each node is computed as a nonlinear function (the activation function) of the weighted sum over all the values in the previous layer(s). The network learns by adjusting its weights to reduce the prediction error on the training data, for which the Backpropagation algorithm is responsible.[6, pp.225-236]
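The feed-forward round described above amounts to a chain of weighted sums and nonlinearities. A minimal NumPy sketch (illustrative only; the layer sizes, random weights and ReLU activation are arbitrary choices, and training via Backpropagation is omitted):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """One feed-forward pass; returns the index of the winning output node."""
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(W @ activation + b)      # hidden layers
    output = weights[-1] @ activation + biases[-1]  # output layer
    return int(np.argmax(output))                   # predicted class

# A 3-feature input, one hidden layer of 5 nodes, 2 output classes:
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
biases = [np.zeros(5), np.zeros(2)]
forward(np.ones(3), weights, biases)
```

The forward pass itself is cheap; it is the repeated weight adjustment during training that makes Neural Networks computationally expensive.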

The advantages of Neural Networks include a typically good accuracy and an interesting tendency to find hidden structures in the data: similar to neurons in the human brain, nodes in layers further from the input layer tend to activate on complex and specific input combinations. For example, when used on visual images, the input layer depicts single pixels, while more complex nodes would ‘fire’ on straight lines, or even on faces. This partly explains the recent tendency to use more layers than ever before, denoted by the hyped-up term ‘Deep Learning’. Other explanations include the improvement of algorithms, the improvement of hardware and the realization that Neural Network computations are well suited to be run on GPUs. The main disadvantage of Neural Networks can already be read between the previous lines: training is computationally expensive (a simple Feedforward Neural Network has a complexity of O(n²)). Furthermore, they are prone to overfitting and can be difficult to optimize due to the many tunable parameters.

This is where Random Forests come in: they promise good accuracy with a more modest computational training complexity (O(n log n)) and fewer tunable parameters, which is especially useful when there are no hidden structures to exploit. Furthermore, their design makes them highly resistant to overfitting.

Random Forests[10] are an ensemble of Decision Trees. Every tree gets a random subset of the features and then votes for the class it believes each data point belongs to. The algorithm then outputs the average or the majority vote as its prediction. Each tree consists of nodes, such that the top node checks every data point on one of the features, splitting the data points into two subsets, each of which flows to another node, until the leaf nodes are reached. Each leaf node then assigns every data point that reaches it to the class to which most of its training instances belonged. To train a tree, algorithms (most famously the C4.5 algorithm) try to choose splits that separate the classes as well as possible (e.g. by maximizing the information gain). The training of each Decision Tree is deliberately hindered by pruning the trees (i.e. reducing their size), making the Random Forest resistant to overfitting. The typically good accuracy is thus acquired from the combined votes of Decision Trees, which each receive a random sub-sample of the available features.
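The voting mechanism can be illustrated with a forest of depth-1 trees (“stumps”), each trained on one randomly chosen feature. This toy sketch captures only the feature subsampling and the majority vote, not bootstrapping, deeper trees or pruning:

```python
import random
from collections import Counter

def train_stump(data, feature):
    """Depth-1 tree: split one feature at its median, label each side by majority."""
    values = sorted(x[feature] for x, _ in data)
    threshold = values[len(values) // 2]
    majority = lambda ys: Counter(ys).most_common(1)[0][0] if ys else 0
    left = majority([y for x, y in data if x[feature] < threshold])
    right = majority([y for x, y in data if x[feature] >= threshold])
    return feature, threshold, left, right

def train_forest(data, n_trees):
    n_features = len(data[0][0])
    return [train_stump(data, random.randrange(n_features))
            for _ in range(n_trees)]

def predict(forest, x):
    """Each tree votes; the forest returns the majority vote."""
    votes = [left if x[f] < t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]
```

Even though each individual stump is a very weak learner, the combined vote of many of them over different features is already considerably more robust.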

To conclude, the decision between Neural Networks and Random Forests can be simplified to this consideration: is there enough hidden structure in the data to justify the computationally more expensive Neural Networks?

3.4 Semi-supervised learning

Semi-supervised machine learning relies on a dataset that is partly labeled. It can be split into two distinct approaches: semi-supervised clustering and semi-supervised classification.[61]

Semi-supervised clustering (also known as constrained clustering) learns from an unlabeled dataset that is enriched with labeled data. The labeled data consists of two types of pairwise relations: must-links (two datapoints that must be in the same cluster) and cannot-links (two datapoints that must be in different clusters). This approach relies on the same assumptions as normal clustering approaches, and is thus appropriate when datapoints within a single class have similar features.

Semi-supervised classification takes the opposite route by starting out with the labeled data. This approach will first fit a classifier; thereafter the unlabeled data is used to improve the classifications. At first sight, this might come across as an impossible task: what type of information could possibly be distilled from the unlabeled data? But researchers have shown that the unlabeled data can be used in multiple ways.

First of all, the labeled data can be used to fit multiple generative models, after which the unlabeled data allows the validation of these models. This is possible since generative models learn the distribution of datapoints in the labeled training set, which can be compared with the actual distribution of datapoints in the unlabeled dataset. Hereafter, the generative model that fits the unlabeled dataset best is used.


This way, the unlabeled data can be used to find the best generative model. This approach works well when the dataset can be split into well separated clusters, and when the assumptions of the generative model are appropriate (i.e. when the datapoints follow a type of distribution that is correctly identified).

An approach comparable to generative modelling is the use of semi-supervised Support Vector Machines (SVMs). While training, an SVM fits a decision boundary in order to separate two classes. To use SVMs on more classes, multiple SVMs can be combined into a multi-class SVM, by assigning each SVM a single class and training it to fit a decision boundary between its class and all other classes. Similar to the usage of multiple generative models, multiple multi-class SVMs can be trained on the labeled dataset and validated on the unlabeled dataset. This time, the chosen multi-class SVM is the one whose decision boundaries avoid the most unlabeled instances by the largest margin. This approach is a natural extension of normal SVMs, and is therefore recommended when an SVM has already been implemented for the classification problem.

A third approach, called self-training, iteratively grows its labeled dataset by adding unlabeled datapoints. In each round, it trains on the labeled dataset and makes a prediction for each unlabeled datapoint. The algorithm then adds unlabeled datapoints to the labeled dataset, but only those datapoints for which the algorithm is confident that it made a correct classification. This approach is mainly recommended when a complicated supervised classifier is already in use, since self-training is easy to implement.
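The self-training loop can be sketched generically. Here `fit` stands for any training routine that returns a model reporting both a label and a confidence; all names are illustrative and not part of any particular implementation:

```python
def self_train(fit, labeled, unlabeled, threshold=0.9, rounds=10):
    """Grow the labeled set with confidently classified unlabeled points."""
    labeled, unlabeled = list(labeled), set(unlabeled)
    for _ in range(rounds):
        model = fit(labeled)                  # model(x) -> (label, confidence)
        confident = [(x, model(x)[0]) for x in unlabeled
                     if model(x)[1] >= threshold]
        if not confident:
            break                             # nothing left to add confidently
        labeled += confident
        unlabeled -= {x for x, _ in confident}
    return fit(labeled)
```

Note how points labeled in an early round can, in later rounds, lend confidence to points that the initial model could not classify, which is where self-training gains over a single supervised fit.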

The last approach, co-training, is similar to self-training. This time, two classifiers are trained, each on a different half of the features of the labeled training set. Each of them then predicts the labels for the unlabeled data, and gives to the other classifier those datapoints for which it is confident that it made a correct classification. Both will train on their new dataset, after which the process repeats. Co-training can be used for problems similar to those of self-training, but relies on more assumptions: the feature set needs to be split into two sets that are conditionally independent given the class, whereby both sets should enable a classifier to give good predictions.

All semi-supervised classification approaches can be used for either transductive or inductive learning. In the former case, the model is trained to predict the unlabeled data, and will not be validated on unseen datapoints. In the latter case, the model is trained to generalize from the labeled and the unlabeled data, so that unseen datapoints will be classified correctly as well.

4 Related work

Now that the necessary background is in place, we can delve into solutions to problems similar to ours. To quickly recap the problem statement (section 1.2): we aim to group the alarms of the OPTOSS intrusion detector into alarm groups, so that the alarms inside an alarm group have the same root cause, a problem that belongs to the realm of alarm mining. Our contributions can be characterized by the usage of both an unsupervised clusterer and a supervised classifier to group together alarms that originate from a separate detector based on manually created rules. The advantage of the clusterer is that it provides good clusters without needing a manually labeled dataset; the advantage of the classifier is that it allows human experts to change the outcome of the clusterer.

To the best of our knowledge, we are the first to propose a “hybrid alarm mining” approach, as we like to call it, using both alarm clustering and alarm classification to group together the alerts of a separate detector. Closely related works may be found, though, in approaches that use clustering methods as detector, followed by an alarm classification module. We will first regard the similarities to and key differences from those works. Thereafter some other approaches in alarm clustering and alarm classification will be listed, in order to understand the possibilities and choices that underlie this thesis.

4.1 Works similar to hybrid alarm mining

Although we apply unsupervised machine learning to cluster alarms of a separate detector, clustering can also be used in the detector itself. During training, such a system will cluster all activity into groups of similar events. In the application phase, any event that falls outside the known clusters will trigger an alarm. This contrasts with alarm clustering, where alarms, instead of events, are used as input.

When the detector is based on clustering techniques, and the alarms are refined by an alarm classification module, the system comes reasonably close to our proposed system. We found five such approaches in the literature, originating from 2012 onwards, and differing chiefly in the algorithms used. These works are listed in Table 1. To the best of our knowledge, the first such approach was a work of Muniyandi et al. of the Tamil Nadu University titled “Network Anomaly Detection by Cascading K-means Clustering and C4.5 Decision Tree algorithm”[40].

The related works mostly employ K-means clusterers, in combination with a Decision Tree, Naive Bayes or Random Forest classifier. For us, their choice of classification algorithm is of higher importance than their choice of clustering algorithm, since they used the clusterer in the detector itself, not as an alarm clusterer. In terms of accuracy of the classifier, one would expect the Random Forest classifier to outperform both the Decision Tree and the Naive Bayes classifiers, in return for a higher computational complexity. We flag the work of Veeramachaneni et al.[56] as the paper most interesting to our research, since they employed a Random Forest classifier and showed a thorough and clear experiment setup. Although we do not directly compare the results of our thesis to this work, or indeed to any other work, it has functioned as inspiration for feature engineering and experiment design.

TABLE 1: Related works similar to hybrid alarm mining. * Veeramachaneni et al. apply an ensemble of Replicator Neural Networks, matrix-decomposition-based and density-based outlier analysis.

Year  Clusterer         Classifier          Authors
2012  K-means           C4.5 Decision Tree  Muniyandi et al. [40]
2013  Weighted K-means  Random Forest       Elbasiony et al. [18]
2014  Weighted K-means  Naive Bayes         Emami et al. [59]
2016  K-means           Naive Bayes         Muda et al. [39]
2016  Ensemble*         Random Forest       Veeramachaneni et al. [56]


The difference between the two approaches (i.e. using a clusterer as detector or using a clusterer to group the alarms) has a large impact on the requirements and training performance, and possibly also on the accuracy of the system. These contemplations hold for all the works mentioned in Table 1.

Firstly, regarding the requirements, the approaches are positioned at opposite sides of a trade-off between the need for a large training set and the need for expert availability. Our approach, relying on experts to form rules to capture all intrusions, is rooted in our observation that the network intrusion market contains many highly skilled professionals whose job is to do just that: finding intrusions in network data. Our main assumption is that those manual rules are able to filter out almost all intrusions with a fairly low number of false positives. The requirements for the other approach are mostly met by using training data that is supposed to contain no intrusions. This approach thus assumes that an up-to-date dataset exists that contains no intrusions, but otherwise contains events similar to the actual production environment. A changing environment, or new types of intrusions for which new information is needed to filter them out, could then lead to a necessity for a new dataset (or, equivalently, new expert rules), giving this trade-off long-lasting effects. To conclude this paragraph about the differences in requirements, it boils down to a matter of trust: do you put your trust in the ability of experts to form rules, or are you positive that you have a dataset good enough to train a detector?

The second difference lies in the training performance. In our approach, the presence of a rule-based detector ensures that the clusterer only needs to group alarms together, whereas it needs to group events together when the clusterer acts as detector. Our approach results in an enormous reduction in training instances, and therefore seems better suited to handle large streams of data, while the performance should be of the same order once the training phase has been completed.

For the last difference, concerning the accuracy of the system, we lack studies comparing the accuracy of both approaches. Ideally we would like to assess the true and false positives of both approaches on the same dataset. As mentioned in subsection 1.4, those studies would be difficult to perform, since intrusion detection datasets label single events, whereas our system flags bags of events. Lacking those studies, we can only refer to the results section, where we will try to make it plausible that our results are promising. It then remains an open question which approach yields better results under which conditions.

To conclude, our work differs from works in alarm classification that use a clusterer as detector, since we replaced the detector with a rule-based variant, and added a clusterer to group the alarms. We chose this direction because we have access to highly skilled experts, who we believe can make excellent rules for the intrusion detector. This belief remains an assumption throughout this thesis, since the accuracy of such expert-made rules, relative to clustering techniques, is not known. Since we relied on experts to manually create rules, we abandoned the dependency on a large, up-to-date, realistic and intrusion-free dataset. We can now use production data directly to train our clusterer and classifier. Lastly, we expect our approach to be better suited for large streams of data, since the number of instances to train on is drastically lower for our clusterer.


4.2 Alarm clustering

The alarm clustering module is an important segment of our approach, for which an extensive body of research exists. In contrast to the alarm classification methods treated in the next subsection, alarm clustering approaches are not able to label the alarm groups. They can therefore work on an unlabeled dataset, to which unsupervised machine learning algorithms can be applied. We will review four approaches to alarm clustering: based on Attribute-Oriented Induction, on K-means, on meta-alarms, and on soft clustering techniques.

Each of the works in this subsection differs from our work on at least two points: the use of a different detector (we used OPTOSS NG-NetMS whereas most others used Snort[47]) and the absence of an alarm classification module. The main difference between Snort and the OPTOSS detector is that Snort is network-based, whereas OPTOSS is host-based. Snort is therefore able to detect threats on the network before they have breached a specific device, whereas OPTOSS relies on logs from a host device and will thus only detect threats once they are already active on the device. On the other hand, this enables OPTOSS to offer broader protection, by detecting unexpected behavior independently of network activity. A second difference is that Snort is a purely signature-based detector, relying on a database of malicious events, while OPTOSS exhibits a more elaborate approach, alarming on a group of events. The more straightforward approach makes Snort "incapable of performing detection on the basis of a bunch of events, each of which only hints at the possibility of an attack"[13, p.83]: detections that are possible with the alarm representation of OPTOSS. A third difference is that the rules of Snort are open-source whereas the rules of OPTOSS are proprietary. Both approaches have pros and cons regarding the safety of the system: open source allows everybody to improve the rules, but it also gives attackers knowledge of how the rules might be circumvented.[32] Lastly, a more practical difference between the detectors, favouring Snort, is that Snort can build upon a larger body of research than OPTOSS, including the application of rule learners to automatically derive the rules (e.g. [50]).

Because the other works in this subsection lack an alarm classification module, they cannot incorporate expert knowledge. In our architecture, the classifier is a practical addition that allows the experts to change alarm groups that are inconveniently clustered together. These two differences, the different detector and the absence of a classifier, characterize all other works; further differences are mentioned when the respective approaches are discussed.

The alarm clustering problem was coined by former IBM researcher Julisch in his 2001 work "Mining alarm clusters to improve alarm handling efficiency"[27]. After proving that the problem is NP-complete, he continued to approximate a solution using Attribute-Oriented Induction, an O(n)[25] combination of machine learning techniques and insights from relational database operations. This algorithm creates taxonomies (trees) for each of the attributes, such that the root node and interior nodes are generalizations of the leaf nodes, which are the possible values of the attribute. An example of such a taxonomy is shown in Figure 1, with root node "All alarms", interior nodes such as "All durations" or "The facility is a service", and leaf nodes such as "duration = 7" (this is the duration node for Alarm 1). Clusters consist of a node for each attribute, such that each alarm of the cluster has attributes that are generalizable to those nodes (e.g. "duration = 7" is generalizable to "duration < 10").


The algorithm then minimizes, for each alarm, the number of generalizations to the cluster. It can do this by changing the taxonomy (by deleting generalizations, that is, interior nodes, or by adding possible generalizations that are specified by human experts), by assigning alarms to the most similar cluster, and by making all cluster nodes as specific (low-level) as possible. In the example, the algorithm will cluster Alarm 1 and Alarm 4 using interior node "duration < 10", which is more specific than "All durations", and it will delete the interior nodes "duration < 15" and "duration ≥ 15".

FIGURE 1: A fabricated example of a possible taxonomy created by Attribute-Oriented Induction. Only four alarms are shown. Alarm 1, for instance, has a duration of 7 and traffic as facility. Given this taxonomy, two alarm clusters might be formed: "duration < 10 and facility = traffic" and "duration ≥ 10 and facility = Services." Alarm 1 and Alarm 4 then need only two generalizations to reach their cluster, whereas both Alarm 2 and Alarm 3 need 4 generalizations.
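The generalization count that Attribute-Oriented Induction minimizes can be sketched as follows. The taxonomy below mirrors the fabricated duration taxonomy of Figure 1; all names (`PARENT`, `generalization_steps`) are illustrative and not taken from Julisch's implementation.

```python
# A minimal sketch of counting generalizations in an attribute taxonomy,
# as used by Attribute-Oriented Induction (illustrative; not Julisch's code).

# Taxonomy for the "duration" attribute from the Figure 1 example:
# each node maps to its parent (its generalization); the root has parent None.
PARENT = {
    "7": "duration < 10",
    "9": "duration < 10",
    "13": "duration >= 10",
    "16": "duration >= 10",
    "duration < 10": "All durations",
    "duration >= 10": "All durations",
    "All durations": None,
}

def generalization_steps(value: str, cluster_node: str) -> int:
    """Number of generalizations needed to lift `value` up to `cluster_node`."""
    steps, node = 0, value
    while node is not None:
        if node == cluster_node:
            return steps
        node = PARENT[node]
        steps += 1
    raise ValueError(f"{cluster_node!r} does not generalize {value!r}")

# Alarm 1 has duration 7: lifting it to the cluster node "duration < 10"
# costs one generalization, lifting it all the way to the root costs two.
print(generalization_steps("7", "duration < 10"))  # 1
print(generalization_steps("7", "All durations"))  # 2
```

Deleting an interior node (as the algorithm does with "duration < 15") corresponds to removing an entry from `PARENT` and re-linking its children to its parent, which shortens these paths.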

Multiple works of Julisch, as well as later work by Zhang et al.[2], showed good results for alarm clustering based on Attribute-Oriented Induction. The main difference between these works and our approach is the clustering mechanism: Attribute-Oriented Induction measures the difference to the cluster center separately for each attribute, whereby the alarm won't be assigned to the cluster when one attribute is too different, while K-means (the algorithm we used) looks at the total distance of the attributes to the cluster center. This means that a single alarm, having one attribute that differs from its cluster center, forces Attribute-Oriented Induction to generalize its rules, thereby penalizing all alarms inside the cluster, while K-means will assign a penalty only to the divergent alarm. As long as the concept of "total distance to cluster center" is meaningful (i.e. when the distance of one attribute can be compensated for by another attribute), the mechanism of K-means should be preferred: we expect that this difference results in Attribute-Oriented Induction needing more clusters in order to create meaningful (specific) clusters. This expectation is in line with the fact that the researchers only tried to create clusters for the alarms that occur most often, reasoning that those are probably misconfigurations that need to be fixed. Indeed, this contrasting goal is a second difference, directly stemming from (or preceding) the different clustering mechanism. The last difference of their approach is that it provides intuitive insight into each cluster, consisting of simple if-then rules regarding each attribute, which is not as easy to obtain for K-means.
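The contrast between the two assignment criteria can be illustrated numerically. The attribute values below are fabricated: under a total-distance criterion one divergent attribute can be compensated by the others, while a per-attribute criterion rejects the alarm outright.

```python
import math

# Fabricated illustration of the two clustering criteria compared above.
center = (5.0, 5.0, 5.0)
alarm = (5.0, 5.0, 11.0)  # one attribute diverges strongly

# K-means-style criterion: a single total (Euclidean) distance over all
# attributes; the two matching attributes soften the one divergent value.
total_distance = math.dist(alarm, center)
print(total_distance)  # 6.0 -- this may still be the nearest center overall

# AOI-style criterion: every attribute must stay within its own bound, so
# the single divergent attribute forces a generalization of the cluster.
per_attribute_ok = all(abs(a - c) < 2.0 for a, c in zip(alarm, center))
print(per_attribute_ok)  # False
```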

The second approach, using a K-means clusterer, is the approach we have used as well. In 2004, Law et al.[31] were the first to apply this algorithm to alarm clustering.⁷ Five years later, a PhD dissertation was written by Dey[16], combining the K-means algorithm with an Incremental Stream Clustering approach. Instead of the default K-means algorithm, an online learning approach is used, updating the cluster centers every time a new alarm is clustered. This differs from our offline learning approach, which could suffer from badly formed cluster centers. In practice, we have not found this to be a problem, since the number of completely new alarms is small when a sufficiently large dataset is used for training.
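The online variant described above can be sketched as a running-mean centroid update. This is a generic illustration of incremental K-means-style clustering, not Dey's actual algorithm; the function name and initial centers are made up.

```python
import numpy as np

# Sketch of an online (incremental) K-means-style update: each arriving
# alarm is assigned to its nearest cluster center, and that center is
# immediately moved towards the alarm (illustrative, not Dey's code).

def assign_and_update(alarm, centers, counts):
    """Assign one alarm vector to the nearest center and update that center."""
    distances = np.linalg.norm(centers - alarm, axis=1)
    k = int(np.argmin(distances))
    counts[k] += 1
    # Running-mean update: the center drifts towards each newly seen alarm.
    centers[k] += (alarm - centers[k]) / counts[k]
    return k

centers = np.array([[0.0, 0.0], [10.0, 10.0]])  # two fabricated centers
counts = np.array([1, 1])                       # alarms seen per cluster
cluster = assign_and_update(np.array([1.0, 1.0]), centers, counts)
print(cluster, centers[cluster])  # cluster 0, center moved to [0.5 0.5]
```

In contrast, the offline approach we used fixes the centers after training, which is where badly formed centers could, in principle, persist.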

The third approach, based on forming meta-alarms, can be considered a simpler version of K-means clustering and was proposed by Perdisci et al. in 2006[44]. First, a classifier is trained to label each alarm with a class, for instance portscan, DoS or NoClass. Next, the clusterer takes over and instantiates an empty list of meta-alarms for each class. For each alarm, the clusterer will check the distance to each meta-alarm of the same class. If the distance is below a prespecified threshold, the alarm is grouped into the same alarm group as the meta-alarm; otherwise, the alarm is promoted to meta-alarm. This will yield similar results to K-means clustering, with three differences: more information is added by the classifier, the cluster centers are determined only by the first alarm of that class, and a different value needs to be specified in advance (a distance threshold versus the number of cluster centers). The first might be an advantage for the method of Perdisci et al., although it does require a labeled dataset, which is a requirement that is often difficult to meet; furthermore, we believe that the information is easily visible by examining the alarms. The second difference puts the meta-alarm approach at a disadvantage compared to K-means, since K-means is able to create cluster centers fitting the alarms perfectly (at least during training). Regarding the last difference, the specification of a distance threshold might be more straightforward and more easily generalizable to other problems than the specification of a number of clusters.
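The meta-alarm mechanism just described reduces to a simple threshold loop per class. The sketch below is illustrative, not the authors' implementation; the class labels, feature vectors and Euclidean distance are placeholder assumptions.

```python
import math

# Sketch of meta-alarm grouping: per class, an alarm joins the first
# meta-alarm within `threshold`, otherwise it is promoted to a new
# meta-alarm itself (illustrative; not the original implementation).

def cluster_alarms(alarms, threshold):
    meta = {}    # class label -> list of meta-alarm feature vectors
    groups = []  # (class label, meta-alarm index) per input alarm
    for label, features in alarms:
        candidates = meta.setdefault(label, [])
        for i, m in enumerate(candidates):
            if math.dist(features, m) < threshold:
                groups.append((label, i))
                break
        else:
            candidates.append(features)  # promote to meta-alarm
            groups.append((label, len(candidates) - 1))
    return groups

alarms = [("portscan", (0.0, 0.0)),
          ("portscan", (0.5, 0.5)),   # close to the first meta-alarm
          ("portscan", (9.0, 9.0)),   # too far: promoted to meta-alarm
          ("DoS", (0.1, 0.1))]        # different class: own meta-alarm list
print(cluster_alarms(alarms, threshold=2.0))
# [('portscan', 0), ('portscan', 0), ('portscan', 1), ('DoS', 0)]
```

Note how the group centers here are frozen at the first promoted alarm, which is exactly the second difference with K-means discussed above.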

The last alarm clustering approach was established when Smith et al. decided to bring out the big guns in 2008[53]. Three clustering techniques were applied to alarm clustering: mixture models, self-organizing maps and a variation on auto-encoders. These algorithms perform soft clustering (each datapoint may belong to multiple clusters) and tend to perform better than K-means, at the cost of higher computational complexity and more involved algorithms and tuning. Of the three algorithms that were tested, self-organizing maps performed worst, and, since we already covered mixture models in subsection 3.2, we continue with a quick contemplation of the variation on the auto-encoder that the researchers developed.

An auto-encoder is a type of artificial neural network (see subsection 3.3) which is trained to recreate the input data. This may sound like a trivial task, which it indeed

⁷ Although Law et al. themselves saw it as a work in alarm classification, as demonstrated by the title of "IDS false alarm filtering using KNN classifier," the term 'semi-supervised classification' would have been more appropriate since they trained the machine on intrusion-less data. In any case, the algorithm remains a clustering algorithm.
