
Hypervisor-Based Cloud Anomaly Detection using

Supervised Learning Techniques

by

Onyekachi Nwamuo

B.Eng., Federal University of Technology, Owerri, 2007

MPM, University of Lagos, 2014

A Thesis Submitted in Partial Fulfillment

of the Requirements for the Degree of

Master of Applied Science

in the Department of Electrical and Computer Engineering

© Onyekachi Nwamuo, 2020 University of Victoria

All rights reserved. This thesis may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.


SUPERVISORY COMMITTEE

Hypervisor-Based Cloud Anomaly Detection using Supervised

Learning Techniques

by

Onyekachi Nwamuo

B.Eng., Federal University of Technology, Owerri, 2007

MPM, University of Lagos, 2014

Supervisory Committee

Dr. Issa Traore, Department of Electrical and Computer Engineering

Supervisor

Dr. Riham Altawy, Departmental Member


ABSTRACT

Although cloud network flows are similar to conventional network flows in many ways, there are some major differences in their statistical characteristics. However, due to the lack of adequate public datasets, the proponents of many existing cloud intrusion detection systems (IDS) have relied on the DARPA dataset, which was obtained by simulating a conventional network environment. In the current thesis, we show empirically that the DARPA dataset, by failing to meet important statistical characteristics of real-world cloud data center traffic, is inadequate for evaluating cloud IDS. As an alternative, we analyze a new public dataset, collected through cooperation between our lab and a non-profit cloud service provider, which contains benign data and a wide variety of attack data. Furthermore, we present a new hypervisor-based cloud IDS using an instance-oriented feature model and supervised machine learning techniques. We investigate three different classifiers: Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM). Experimental evaluation on a diversified dataset yields a detection rate of 92.08% and a false-positive rate of 1.49% for the random forest, the best performing of the three classifiers.


Table of Contents

SUPERVISORY COMMITTEE ... ii

ABSTRACT ... iii

Table of Contents ... iv

List of Tables ... vii

List of Figures ... ix

ACKNOWLEDGEMENTS ... x

DEDICATION ... xi

Chapter 1 ... 1

Introduction ... 1

1.1 Context ... 1

1.2 Cloud Network and Conventional Network Data Characterization ... 2

1.3 Problem Statement and Research Objectives ... 8

1.4 Thesis Contribution ... 11

1.5 Thesis Outline ... 13

Chapter 2 ... 14

Related Works ... 14

2.1 On Hypervisor-Based Cloud Intrusion Detection ... 14

2.2 On General Intrusion Detection Systems in the Cloud ... 17

2.3 Summary ... 25

Chapter 3 ... 26

Experiment Design and Dataset ... 26

3.1 Overview ... 26


3.2.2 Threat Model and Attacks ... 29

3.2.3 Composition of ISOT-CID Network Traffic Data ... 33

3.3 DARPA 1998 Intrusion Detection Dataset ... 35

3.5 Summary ... 36

Chapter 4 ... 37

Feature Model ... 37

4.1 Flow-based feature model ... 38

4.2 Instance-oriented feature characteristics ... 40

4.2.1 Frequency-Based Feature Extraction ... 40

4.2.2 Entropy-Based Feature Extraction ... 43

4.2.3 Load-Based Feature Extraction ... 44

4.3 Data Preparation Subsystem ... 44

4.3.1 Feature Computation ... 45

4.3.2 Feature Normalization ... 46

4.3.3 Feature Selection ... 47

4.4 Machine learning model for Cloud Intrusion Detection ... 49

4.4.1 Logistic Regression Algorithm ... 50

4.4.2 Random Forest Algorithm ... 50

4.4.3 Support Vector Machine Algorithm ... 51


Chapter 5 ... 53

Experiments and Results ... 53

5.1 Traffic Characterization ... 53

5.1.1 Number of Active Flows Characteristics ... 53

5.1.2 Flow Inter-arrival Time Characteristics ... 54

5.1.3 Flow-Level Communication Characteristics ... 56

5.2 Performance Metrics ... 58

5.3 Performance Evaluation based on ISOT-CID ... 59

5.3.1 Machine Learning Evaluation Results ... 59

5.3.2 Comparison of the Machine Learning Classification Results on ISOT-CID ... 64

5.4 Effect of Time Window on Machine Learning Classification Performance on ISOT-CID ... 67

5.5 Snort Results on ISOT-CID ... 69

5.6 Summary ... 70

Chapter 6 ... 72

Conclusion ... 72

6.1 Contribution Summary ... 73

6.2 Perspectives and Future Work ... 74

Bibliography ... 76

Appendix A: ISOT Cloud Dataset Description ... 80


A.2.1 ISOT-CID Structure ... 81

A.2.2 ISOT CID File Names ... 84

A.3 Data Labeling ... 86

A.3.1 Network Traffic ... 86

A.4 Attack Scenarios ... 88

A.4.1 Phase One Attack Scenarios ... 89

A.4.2 Phase Two Attack Scenarios ... 90

List of Tables

Table 3.1: ISOT-CID VM distributions [6] ... 28

Table 3.2: ISOT-CID Hypervisor nodes [6] ... 29

Table 3.3: Attacks covered in the ISOT-CID dataset [6] ... 31

Table 3.4: ISOT-CID Network Traffic Distribution [6] ... 34

Table 3.5: ISOT-CID Network Traffic Distribution in percentage (%) [6] ... 34

Table 4.1: Some of the raw features extracted from the hypervisor network traffic using the traffic analyzer tool (Tranalyzer) ... 39

Table 4.2: Description of the in and out frequency terms [6] ... 42

Table 4.3: Features and their overall importance ... 48

Table 5.1: Extra-Rack and Intra-Rack traffic composition for ISOT-CID and DARPA 1998, showing the number of packets. ... 56

Table 5.2: Percentage composition of Extra-Rack and Intra-Rack traffic for ISOT-CID and DARPA 1998 .. 57

Table 5.3: Representation of a two-class confusion matrix ... 59


Table 5.6: Phase 1 detection results using SVM Classifier ... 62

Table 5.7: Phase 2 detection results using Logistic Regression Classifier ... 63

Table 5.8: Phase 2 detection results using Random Forest Classifier ... 63

Table 5.9: Phase 2 detection results using SVM Classifier ... 64

Table 5.10: Comparison of overall performance for ISOT-CID Phase 1 ... 64

Table 5.11: Comparison of overall performance for ISOT-CID Phase 2 ... 65

Table 5.12: Comparison of overall performance for ISOT-CID ... 66

Table 5.13: Performance at 600 seconds observation time window ... 67

Table 5.14: Performance at 300 seconds observation time window ... 67

Table 5.15: Performance at 120 seconds observation time window ... 68

Table 5.16: Snort IDS detection and False-positive rates on ISOT-CID Phase 1 network data for hypervisor B ... 70

Table A.1: ISOT-CID Phase 1 packet capture ID. ... 85

Table A.2: ISOT-CID VM distribution ... 85

Table A.3: Phase 1: Network traffic distribution ... 87

Table A.4: Phase 2: Network traffic distribution ... 87

Table A.5: Phase 1: Network traffic details ... 87

Table A.6: Phase 2: Network traffic details ... 88


List of Figures

Figure 1.1: Network Flow in ISOT-CID [6] ... 7

Figure 1.2: 1999 DARPA IDS Evaluation Dataset Network Flow [6] ... 8

Figure 3.1: ISOT-CID Project Environment [6] ... 28

Figure 3.2: Timeline for Phase 1 Day 2 (2016-12-15) Inside and Outside Attacks ... 32

Figure 3.3: Timeline for Phase 2 Day 1 (2018-02-16) Inside and Outside Attacks ... 33

Figure 3.4: ISOT-CID Network Traffic Distribution in percentage (%) ... 34

Figure 3.5: DARPA Simulation Network Environment [33] ... 36

Figure 4.1: Network traffic flow samples for phase 2 hypervisor A day 1 (2018-02-16) ... 39

Figure 4.2: Data Preparation Subsystem ... 45

Figure 4.3: Graphical representation of the variable importance ... 49

Figure 5.1: The CDF of the distribution of the number of flows at the edge switch in ISOT and DARPA ... 55

Figure 5.2: The CDF of the distribution of the flow inter-arrival time in ISOT and DARPA ... 55

Figure 5.3: Comparison of the ratio of extra-rack to intra-rack traffic for ISOT-CID and DARPA 1998 datasets ... 57

Figure 5.4: Phase 1 Machine Learning Performance comparison ... 65

Figure 5.5: Phase 2 Machine Learning Performance comparison ... 66

Figure 5.6: Effect of observation time window on the detection rate ... 68


ACKNOWLEDGEMENTS

To begin, I am extremely grateful to God for life and for walking me through this season of my life.

I would like to express my gratitude to my supervisor, Dr. Issa Traore, for his encouragement, indefatigable mentorship, support and time. Throughout the research process, Dr. Issa was readily available to provide feedback, guidance, and affirmation that I was doing the right thing. Dr. Issa supported me financially and consistently made my work one of his top priorities. Without his insight and encouragement, this thesis would not have been possible.

I would like to thank my committee members, Dr. Riham Altawy and Dr. Yvonne Coady for all their valuable advice and critical feedback.

To my mentors, Dr. Benjamin Onyenwosa and Mr. Jonathan Oluwafikayo Adefisayo: thank you for your advice, encouragement, and support.

To my friends Adeshina Alani and Chinonye Egbejimba, thank you for your encouragement as well as your valuable insights and perspectives.

Thank you to Stanley Ihemanma and family for accommodating and supporting me when I first came to Canada.

Lastly, my sincere gratitude goes to my parents, brothers, sisters, uncles and in-laws for their love and support.


DEDICATION

To my darling wife, Maureen Onyekachi for her unconditional care, love and support

and

To my children, Queen-Esther and Rejoice, for understanding my absence when my research prevented us from sharing important moments of life. Without their support, this thesis would not have been concluded.


Chapter 1

Introduction

1.1 Context

In today’s IT and business world, there has been a significant increase in the public adoption of cloud computing for production systems and services, and there seems to be no end in sight [1]. Cloud and cloud-related services have paved the way for organizations to meet their business needs, especially in the areas of digital transformation, global expansion, data growth, and IT consumerization [1]. Key offerings of cloud computing include scalability, on-demand self-service, flexibility, virtualization, and efficiency. At the same time, the cloud computing environment suffers from some of the factors that affect traditional computer networks, especially in the areas of performance, compliance, and security.

Moreover, the growth in public adoption of the cloud paradigm has increased organizations’ exposure to a wide variety of cyber attacks and vulnerabilities. This makes security and privacy among the main issues faced by cloud computing adopters, as users have no idea where their data are stored and processed in the cloud [2].

In addition to the common security issues inherent in conventional networks, prior work [3] has identified cloud-specific security risks such as loss of governance, inadequate data protection, compliance risk, malicious insiders, management interface compromise, and insecure or incomplete data deletion. These security risks were further classified in [4] into two categories: security issues on the side of the cloud service provider and security issues faced by the cloud user. Other scholars classified cloud computing security issues into three categories, namely organizational, technical, and legal [5]. These classifications show that the cloud environment is more susceptible to attack than the conventional network because of its size, its complexity, and users' exposure to third-party services and interfaces.

According to [2], these attacks fall into two groups: insider and outsider attacks. An insider attack originates from a malicious insider within the cloud environment who uses his or her access to the cloud infrastructure to gain unauthorized privileges; the attack can target internal VMs or the outside world [6]. Outsider attacks target the cloud infrastructure from the outside world [2]. Furthermore, cloud consumers risk losing control over their data once it is uploaded to the cloud infrastructure, and the confidentiality and integrity of their information are also at stake due to the sharing of cloud resources [6].

1.2 Cloud Network and Conventional Network Data Characterization

A great disparity exists in the physical placement of cloud and conventional data centers. While cloud data centers are distributed globally across the U.S., Europe, and South America, non-cloud data centers are usually situated in close proximity to their users or on the premises of the serving organizations. The global placement of cloud data centers satisfies requirements for geo-diversity, geo-redundancy, and regulatory constraints [7]. Studies have further shown that the characteristics of cloud network traffic differ from conventional network traffic in many ways, as explained in the following:

a. Virtualization: In a cloud computing environment, virtualization is the main technology and the fundamental enabler of the platform where resources are globally networked and available to everyone as services; the cloud computing paradigm cannot exist without virtualization. As a result, cloud computing relies on virtualization for scalability, resource pooling, location independence, and load balancing. Virtualization in the cloud context is accomplished by having multiple operating systems running at the same time on one server controlled by a hypervisor [5]. One of its advantages is the logical partitioning, isolation, and encapsulation of the multiple VM instances running on the same physical machine. Although virtualization allows the VMs to be easily monitored, managed, and protected, it also poses a big security concern [8]. The use of virtualization in the conventional network is much more limited than in the cloud environment, implying that each IP address in the network typically corresponds to a distinct server. This leads to applications being clustered around the switches.

b. Network Flow: Empirical studies in [7], [9] have shown that the inter-arrival time for 80% of the flows in a cloud network is usually under 1 ms, while in a conventional network it can be between 4 ms and 40 ms, as the traffic does not change that quickly. These studies also noted that the number of active flows in any given second at a switch is at most 10,000, and that new flows can be highly instantaneous in arrival. They further explain how the flow inter-arrival time affects the kind of processing that can be done for each new flow, and the usefulness of logically centralized controllers for flow placement. Cloud network traffic is usually bursty in nature, with the ON/OFF intervals characterized by heavy-tailed distributions. Their analysis also shows that in a cloud computing environment, the load ratios of the internal/external traffic flow between two instances, or between an instance and other sources, are usually high. This applies to the ISOT Cloud Intrusion Detection (ISOT-CID) dataset, as will be seen in the graphical visualization of the network flow for a two-minute time window in the experiment section of Chapter 5. ISOT-CID is the first and only publicly available cloud intrusion detection dataset, and was collected using cloud resources hosted by Compute Canada [5]. In [9], it was also found that in conventional network data, 80% of flows are smaller than 10 KB in size, unlike in cloud network data. On the one hand, the flow communication pattern in a cloud network is usually dense, due to the numerous hosted applications and high link utilization across the cloud's multiple layers. On the other hand, in the traditional network, the communication flow pattern and the link utilization are usually small. Figures 1.1 and 1.2 show the network flows for a typical cloud environment based on the ISOT-CID dataset and for the DARPA 1998 Intrusion Detection System (IDS) evaluation dataset, which was collected by simulating a conventional network environment [6]. In the ISOT-CID environment shown in Figure 1.1, we can see significant variability in the network flows, including hypervisor-to-hypervisor flows, in/out traffic flows between a VM and an external source outside the cloud environment, VM-to-VM flows, and traffic flows between tenants' VMs [6]. The conventional network comprises limited network flows, as can be seen in Figure 1.2, which has only two network flows: external and internal traffic [6].

c. Packet-Level Communication: Analysis of the packet trace characteristics of a data center by [7], [9] shows that the packet-level traffic of the cloud data center is not only bursty in nature but also follows a heavy-tailed ON/OFF distribution. Their study also revealed that the conventional network packet inter-arrival process is characterized by an identical heavy-tailed distribution as the aggregate traffic flow.

d. Topology: The physical topology of a cloud data center follows a canonical 3-tier architecture that consists of the core (uppermost) layer, the aggregation (middle) layer, and the edge (lower link) layer. In contrast, traditional data centers follow a 2-tier topology in which the core and aggregation layers are collapsed into one layer [7]. In a typical cloud network, data is either centralized or outsourced and provided to users on demand irrespective of their geographic location. This relieves the data owner of full control over their data, as the cloud service providers now manage and maintain the data. Cloud data also has the flexibility of being scaled up or down by automated means. Some giant cloud service providers, such as Amazon, Google, and Microsoft, have cloud data centers dispersed geographically to provide universal data access to their various users [7].


e. Types of Applications: Cloud data centers offer various applications and a wide range of Internet-facing services, such as webmail, web portals, instant messaging, data mining, storage, and relational databases. This contrasts with the conventional network, where organizations physically possess the storage of their data [6], [10], [7]. The resources and infrastructures under the management of cloud service providers are highly dynamic in nature, unlike in conventional networks, where the managed resources are relatively stable [6]. Whereas in cloud data centers the applications are spread across the entire data center, conventional networks usually have applications clustered at the edge layer or around the switches, due to the lack of virtualization.
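To make the flow-level statistics above concrete, the fraction of sub-millisecond inter-arrival gaps can be computed directly from flow arrival timestamps. The sketch below uses synthetic, exponentially distributed arrivals as a stand-in for a real packet capture; the rate parameter is an illustrative assumption, not an ISOT-CID measurement.

```python
import random

def interarrival_stats(arrival_times, threshold=0.001):
    """Return the fraction of inter-arrival gaps below `threshold` seconds."""
    times = sorted(arrival_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(1 for g in gaps if g < threshold) / len(gaps)

# Synthetic cloud-like arrivals: exponential gaps with a 0.5 ms mean
random.seed(42)
t, cloud_arrivals = 0.0, []
for _ in range(10_000):
    t += random.expovariate(2000)  # rate 2000/s -> mean gap 0.5 ms
    cloud_arrivals.append(t)

frac = interarrival_stats(cloud_arrivals)
print(f"{frac:.0%} of gaps are under 1 ms")  # roughly 1 - e^-2 ~ 86% in expectation
```

On a real trace, `arrival_times` would be the timestamps of new flows observed at a switch, and the same function distinguishes the cloud regime (most gaps under 1 ms) from the conventional one (gaps of 4–40 ms).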

Figure 1.1 and Figure 1.2 represent the network flow diagrams of the ISOT-CID and DARPA 1998 datasets, respectively.

Figure 1.1: Network Flow in ISOT-CID [6]

Figure 1.2: 1999 DARPA IDS Evaluation Dataset Network Flow [6]

1.3 Problem Statement and Research Objectives

In this thesis, we explore the existing literature on cloud IDS to build a hypervisor-based intrusion detection system, and we further demonstrate some of the striking differences between a cloud IDS dataset and conventional network datasets.

While many of the existing security solutions can readily be used for cloud protection, some of the technologies need adaptation to address the multidimensional and evolving threat landscape of the cloud environment. Intrusion detection systems fall into this category [6], [11]. Two types of intrusion detection models exist in the literature: misuse detection and anomaly detection.

Misuse detection is also known as signature-based or knowledge-based detection, in that it depends on a database of previously known attack signatures [12]. Although this model accurately detects a wide range of known attacks, it is ineffective when confronted with novel attack methods [13].

Anomaly detection tracks deviation from normal system behavior and reports such deviation as an intrusion. The model first trains the system to establish its normal behavior and then uses this knowledge to detect anomalous occurrences. Being behavior-based, this model can detect new attacks. One drawback is that it generates more false positives than misuse detection, since abnormal activities are not always malicious [14].
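As a minimal illustration of this train-then-detect workflow (using hypothetical feature values, not the feature model of this thesis), normal behavior can be summarized by a mean and standard deviation learned from attack-free data, with large deviations flagged as anomalous:

```python
import statistics

def fit_baseline(normal_samples):
    """Learn the mean/stdev of a feature from attack-free training data."""
    return statistics.mean(normal_samples), statistics.stdev(normal_samples)

def is_anomalous(value, baseline, k=3.0):
    """Flag values more than k standard deviations away from the baseline."""
    mean, std = baseline
    return abs(value - mean) > k * std

# Hypothetical feature (e.g. packets per second) observed during benign operation
baseline = fit_baseline([100, 98, 103, 97, 101, 99, 102, 100])
print(is_anomalous(101, baseline))  # typical value -> False
print(is_anomalous(250, baseline))  # large deviation -> True
```

The false-positive weakness mentioned above is visible here: any benign burst beyond `k` standard deviations would also be flagged, which is why the threshold `k` trades detection rate against false alarms.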

From these two models derive four types of IDS deployments in use in a cloud computing environment: host-based IDS, network-based IDS, hypervisor-based IDS, and VM-based IDS [6]. The host-based IDS monitors and analyzes activities on a specific host. The network-based IDS, on the other hand, is usually installed on the network routers to inspect every packet entering the network and the hosts for anomalies. The hypervisor-based IDS (HypIDS) is usually positioned at the hypervisor or virtual machine monitor (VMM) level and, thanks to its wide visibility, is able to capture, monitor, and analyze both network and host communications for intrusions. The hypervisor-based IDS is very suitable for the cloud environment because, by default, the hypervisor has access to the performance information of its virtual environments, such as VM-to-VM or VM-to-VMM network communication, system calls and events, and operating system processes [6], [15]. Finally, the VM-based IDS is deployed on the VMs and monitors and analyzes the VM network traffic or the guest operating system data. It can take the form of a network IDS, a host-based IDS, or a hybrid IDS; being at the VM level means that the cloud consumer has total control over its operation and the type they choose.

Until recently, the lack of a publicly available cloud intrusion dataset was a major challenge hindering the advancement of work on cloud intrusion detection. This challenge forced researchers to use conventional datasets, such as the DARPA or KDD Cup 1999 intrusion detection datasets, which do not portray a true cloud environment, as explained earlier. Even those who collected cloud intrusion datasets withheld them from public access, thereby hampering the development of cloud IDS.

To address the aforementioned gap, the ISOT-CID dataset was collected by the ISOT Lab and released to the research community. The ISOT-CID dataset was captured in a real cloud environment and contains a wide variety of attack vectors.

On the one hand, as mentioned earlier, there are strong differences between cloud network data and conventional network data in terms of characteristics such as flow inter-arrival time, packet-level communication, load ratios of internal/external traffic flows, and so forth. On the other hand, the design of anomaly detection models involves constructing normal activity baselines from previously collected sample activity data. Hence, cloud anomaly detection models constructed from conventional network data would fail to adequately capture cloud network behavior, given the aforementioned differences.

The objective of the current thesis is to provide empirical justification for the need for a dataset collected in a real cloud environment, rather than a conventional network dataset, when developing cloud IDS. Furthermore, we explore the design of cloud anomaly detection using supervised machine learning techniques. Specifically, three machine learning algorithms are studied: logistic regression (LR), random forest (RF), and support vector machine (SVM). We also compare the proposed detection approach with Snort, which has been widely studied for cloud IDS in the literature.
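The train/evaluate pipeline for the three algorithms can be sketched with scikit-learn. This is an illustrative sketch only: the synthetic features below stand in for the actual flow features, and the hyperparameters are library defaults rather than the settings tuned in this thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for labeled flow features (0 = benign, 1 = malicious)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("RF", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("SVM", SVC(kernel="rbf"))]:
    clf.fit(X_tr, y_tr)                   # train on the labeled split
    scores[name] = clf.score(X_te, y_te)  # held-out accuracy
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

The same loop structure extends naturally to the detection-rate and false-positive-rate metrics used in Chapter 5, by replacing `score` with a confusion-matrix computation on the held-out split.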

1.4 Thesis Contribution

The contributions of this thesis are outlined below.

a. Proper structuring and labeling of the ISOT-CID dataset for public release: As a contribution to the ISOT-CID documentation, we updated the labeling of the ISOT Cloud Intrusion Dataset (ISOT-CID), as seen in Appendix A, by adding the omitted attack times. We also rearranged and grouped the IP addresses into three groups: the internal network IP addresses, which are legitimate by default; the external and legitimate IP addresses; and the external and malicious or blacklisted IP addresses. Finally, we provide a high-level description of how to get started with the ISOT-CID dataset, for easy understanding by an external user.

b. Validation of the empirical assumption that cloud IDS data and conventional IDS data differ strongly: The second contribution of this thesis is to validate one of the empirical assumptions made in [7], [9], [16]: that for cloud data centers, about 80% of the traffic from the servers or hypervisor nodes resides in the rack or node, in contrast to conventional network data centers, where the figure ranges between 40% and 90%, with the traffic traversing the network's interconnect. This validation was carried out by analyzing the extra-rack and intra-rack properties of the network flows for subsets of the ISOT-CID and DARPA 1998 network traffic data, respectively.

c. Establishing the effectiveness of hypervisor-based cloud anomaly intrusion detection using supervised learning: Another contribution of this research is to establish, through experimental analysis, the effectiveness of hypervisor-based anomaly intrusion detection using supervised learning. This was achieved using three machine learning classification algorithms: Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM). This contribution supports the rapidly evolving body of work on detecting anomalous network behavior in the cloud computing environment. The experimental results show that a hypervisor-based IDS can detect malicious events occurring in the cloud hosting environment.
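The extra-rack versus intra-rack split used in the validation above can be approximated from flow endpoints. The sketch below uses hypothetical IP addresses and a hypothetical internal subnet; a flow counts as intra-rack when both endpoints fall inside the internal network.

```python
import ipaddress

def traffic_split(flows, internal_net):
    """Split flows into intra-rack (both endpoints internal) vs extra-rack counts."""
    net = ipaddress.ip_network(internal_net)
    intra = sum(1 for src, dst in flows
                if ipaddress.ip_address(src) in net
                and ipaddress.ip_address(dst) in net)
    return intra, len(flows) - intra

# Hypothetical flows: (source IP, destination IP)
flows = [("10.0.0.5", "10.0.0.9"),   # VM to VM inside the rack
         ("10.0.0.5", "8.8.8.8"),    # VM to an external host
         ("10.0.0.7", "10.0.0.5"),
         ("172.16.1.2", "10.0.0.5")]
intra, extra = traffic_split(flows, "10.0.0.0/24")
print(f"intra-rack = {intra}, extra-rack = {extra}")
```

Running the same split over the ISOT-CID and DARPA packet captures, with each dataset's actual internal address ranges, is what yields the traffic-composition comparison reported in Chapter 5.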


1.5 Thesis Outline

The rest of this thesis is structured as follows:

(i) We present background information on cloud computing and related research work on cloud computing intrusion detection systems in Chapter 2.

(ii) In Chapter 3, we describe our experiment design and dataset, as well as the threat model and attacks of the cloud computing environment.

(iii) In Chapter 4, we present the feature model, feature extraction, and data preparation approaches. We also describe the supervised learning models used, and the Snort intrusion detection system.

(iv) Chapter 5 covers the experiment and results.

(v) In Chapter 6, we summarize the thesis, highlight the overall contributions, and make suggestions for future work.


Chapter 2

Related Works

In this fast-growing cloud computing era, it is practically impossible to overlook the urgent need to protect cloud computing systems and resources against attacks on the confidentiality, integrity, and authenticity of the information being transmitted. In this context, a growing body of literature has been published on cloud IDS in the last few years. Researchers are carrying out a significant amount of work to ensure adequate security in cloud computing by examining system logs, configurations, or network traffic using different methodologies. Unlike this work, in which a real cloud intrusion detection dataset (ISOT-CID) is used, most previous research on cloud IDS was based on synthetic data or conventional network datasets, which do not adequately portray the characteristics of a typical cloud environment.

This chapter discusses some of the conceptual frameworks proposed for cloud intrusion detection and the findings of various researchers, along with an appraisal of the strengths and weaknesses of these works.

2.1 On Hypervisor-Based Cloud Intrusion Detection

Aldribi et al. [6] proposed a new hypervisor-based cloud intrusion detection system (IDS) using online multivariate statistical change analysis for the detection of network intrusions. In their work, a new cloud intrusion dataset, ISOT-CID, was introduced, which includes a wide variety of attack vectors. ISOT-CID is the first publicly available dataset that portrays a typical cloud environment; it was collected in a production environment in partnership with a notable cloud service provider (CSP). The energy divisive (E-divisive) algorithm, a multivariate sequential change detection algorithm, was used in combination with the gradient descent algorithm to successfully detect anomalous network activities in the cloud by tracking statistical change points. Their work, which is closely related to the current research, uses a feature space based on the relationship between the behavior of the individual instances, or virtual machines, and the hypervisor. The feature model is instance-oriented and includes frequency-, entropy-, and load-based attributes of the virtual machines. Experimental evaluation of the proposed approach yielded a detection rate (DR) of 96.23% and a false-positive rate (FPR) of 7.56%. In this work, we augment their findings by adopting a supervised machine learning approach to detecting and evaluating cloud computing intrusions using the ISOT-CID dataset. While the authors highlighted the weaknesses of using conventional datasets for cloud intrusion detection, no experimental results were provided in support of that claim; we provide such empirical results in the current thesis.

Nikolai and Wang [15] presented an architecture for cloud intrusion detection using hypervisor performance metrics, such as the number of incoming or outgoing packets and CPU utilization, to detect potential misuse patterns. The authors simulated a cloud environment using the Eucalyptus test environment and implemented their proposal using IBM Stream Processing Language (SPL). With the help of a Python script and the libvirt application programming interfaces (APIs), they retrieved the following performance metrics from the hypervisor: network data sent/received, CPU utilization, and block device read/write data. The evaluation of their approach shows a 0.26% false-positive rate (FPR), a 0% false-negative rate (FNR), and a 0% misclassification rate, which indicates that hypervisor performance metrics can aid in the detection of denial-of-service (DoS) attacks within a cloud environment and from one cloud instance to another. Although their work was not based on a real cloud intrusion dataset, the authors showed that hypervisor performance signatures and patterns over time can be used to detect malicious activities in the cloud computing environment. They added that their proposed hypervisor-based cloud IDS can be integrated with, or used as a complement to, existing intrusion detection systems aimed at adequately securing the cloud. However, their work does not consider the protection of the individual instances within the hypervisor, and it focuses only on DoS attack detection.

Moorthy et al. [17] developed a cloud intrusion dataset (CIDD) and a virtual host-based intrusion detection system. The cloud network was simulated on a testbed using CloudStack with the KVM hypervisor, and their dataset was prepared based on specific open communication ports in the CloudStack platform. They applied a genetic algorithm to generate detection rules from malicious traffic collected using network sniffers. With this approach, they detected 80% of the cloud attacks with 0% false positives on both their collected cloud intrusion detection dataset and the DARPA IDS dataset. The drawback is that their dataset does not contain raw packets that could be used to extract a broader variety of features, and it is not available for public use.

Pandeeswari et al. [18] proposed an anomaly detection system for cloud networks called the Hypervisor Detector, based on the Fuzzy C-Means clustering-Artificial Neural Network (FCM-ANN), which can automatically detect the patterns of new attacks. They used a cloud simulator called CloudSim 3.0 for the implementation of the Hypervisor Detector. Their evaluation was done using the KDD Cup dataset. They also compared the performance of the proposed system on various attack types against the Naïve Bayes (NB) and Artificial Neural Network (ANN) classifiers, and showed that FCM-ANN gave a better performance. The performance results for FCM-ANN on different attack types, including probing, DoS, U2R, and R2L, are on average DR=98.6% and FPR=3.06%.

2.2

On General Intrusion Detection Systems in the Cloud

Mazzariello et al. [19] proposed an approach that uses Snort, a popular signature-based network intrusion detection system, to provide fast and cost-effective intrusion detection in a cloud environment. To implement their experimental scenario, the Eucalyptus infrastructure was used for the simulation of a cloud computing environment. They deployed the Snort IDS on each of the physical machines in their experimental setup. Though they claimed that this approach was successful, the detection and false-positive rates were not reported for reference purposes. Furthermore, as a signature-based detection system, Snort can detect only known attack patterns.


Bhat et al. [12] proposed an approach for detecting intrusions in the virtual machine environment on a cloud network using traditional and hybrid (multiclass) machine learning algorithms. The considered algorithms included the Naïve Bayes Tree (NB Tree), Random Forest, K-NN and SVM classifiers, as well as hybrid combinations such as NB Tree with Random Forest. The NSL-KDD’99 dataset was used for evaluation, and it was observed that hybrid machine learning models perform better than the traditional or individual algorithms. Using single classifiers, their evaluation produced accuracies of 95%, 91% and 98% for Random Forest, K-NN and SVM, respectively, while the hybrid of NB Tree and Random Forest resulted in a high accuracy of 99% and a low false-positive rate of 2%. Moreover, the hybrid of Random Forest and weighted K-Means yielded 94.7% accuracy and a 12% false-positive rate.

Modi and Patel [2] presented an approach that integrates a hybrid network intrusion detection system into a cloud computing network. The experimental setup used the Eucalyptus infrastructure for the simulation of a cloud computing environment, while the KDD-IDS dataset was used for the evaluation of their work. Their framework combined signature-based detection and anomaly detection. They utilized Snort for the signature-based intrusion detection, and three machine learning classifiers, namely the Bayesian, Associative (a machine learning model using association rules) and Decision Tree classifiers, singly and collectively, for the network anomaly detection. In the design of their experiment, the detection time was reduced by first running the signature-based technique to detect known and derived attacks, followed by the anomaly detection technique for detecting unknown attacks. The experimental results for the three classifiers and their collective ability yielded a true positive rate (TPR) of 97.14%, a false-positive rate (FPR) of 1.17%, an accuracy of 97.57%, a precision of 99.60% and a high F score of 0.98.
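The reported F score can be checked directly against the reported precision and TPR (recall), since the F score is their harmonic mean, 2PR/(P + R). A quick check in Python:

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported by Modi and Patel [2]: precision 99.60%, TPR 97.14%
f1 = f_score(precision=0.9960, recall=0.9714)
print(round(f1, 2))  # 0.98, matching the reported F score
```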

Lo et al. [20] proposed a cooperative intrusion detection system to detect distributed denial-of-service (DDoS) attacks in a cloud computing environment. In the proposed approach, an individual network intrusion detection system is deployed in every cloud computing region. Their implementation was based on adding three modules, namely block, communication and cooperation, to the Snort IDS. Their experiment shows a 97% detection rate against 97.2% for the traditional Snort IDS. As for computation time, their proposed IDS took 0.00269 seconds to process 10,000 data packets against 0.00263 seconds for Snort on the same number of packets. Though the proposed cooperative IDS performed slightly worse than pure Snort, it has the advantage that if one cloud region is affected by a DoS attack, the cooperative IDS alerts the other IDS sensors, which protects the system from a single point of failure. This cooperative participation, however, comes at the cost of additional computation time, as seen in their experimental results.

Muthurajkumar et al. [21] used the combination of a fuzzy SVM and a random feature selection algorithm (RSFSA) to propose a cloud intrusion detection model. In their experiment, they built two intrusion detection models, one with all the data features and the other after introducing feature selection. A dataset consisting of 10% of KDDCUP was used for the experiment and analysis of their approach. The average detection rates from their experimental results before and after applying the RSFSA to the fuzzy SVM classifier were 86.88% and 94.15%, respectively. Their work confirmed that feature selection plays an important role in classifier detection accuracy. It would have been interesting to evaluate this proposal using a real cloud intrusion detection dataset.

Vieira et al. [22] came up with a grid and cloud computing intrusion detection system (GCCIDS), which uses knowledge and behavior analysis to detect intrusions. They simulated a cloud environment prototype using Grid-M, a middleware developed at the Federal University of Santa Catarina, Brazil. They used a feed-forward neural network (FFNN) classifier in their experiment to identify user behavioral patterns and deviations in the simulated cloud computing environment. They measured the IDS efficiency using the false-positive and false-negative rates and obtained average scores of 6.7% and 10.54%, respectively. A limitation of their work is that the approach was not tested with a typical cloud computing dataset.

Chou et al. [23] proposed an adaptive network-based intrusion detection system for the cloud environment using the DARPA 2000 and the KDD Cup 1999 datasets. Their approach used spectral clustering, an unsupervised learning algorithm, to build a decision tree-based detection model for detecting anomalies in unlabeled network connection data. They used Bro-IDS to generate connection records from the raw packets. Their experiment on the DARPA dataset yielded a detection rate of 95% and a false-positive rate of 4.5%, while the KDD Cup 1999 dataset yielded a detection rate of 90% and a false-positive rate of 5%. Their approach is not robust enough, as it could not detect DoS and some probing attacks which create a great number of connections.


Ahmad et al. [24] presented an intrusion detection model that uses a dendritic cell algorithm for detecting intrusions in the cloud computing environment. The experimental evaluation was conducted using the DARPA 1998 dataset. The network-based attributes were used as signals in their experiments. They carried out their experiment on a total of 187 threat events of Week 4 and Week 5 of the DARPA 1998 dataset and the algorithm achieved a detection rate of 79.43% and a false-positive rate of 13.43%. In their work, they demonstrated that using a dendritic cell algorithm could provide a solution in detecting attacks in the cloud environment.

Kannan et al. [25] proposed a host-based cloud intrusion detection system that uses a genetic algorithm-based feature selection and a Fuzzy SVM based classifier for deciding if an event is an intrusion or not. The cloud environment was simulated with Proxmox VE 1.8 which is an open-source virtualization environment while the evaluation was done using the KDD’99 cup dataset. In the experimental results, a detection rate of 98.51% and a false-positive rate of 3.13% were obtained.

Bakshi et al. [26] proposed an approach to detect distributed denial of service (DDoS) attacks in cloud virtual machines (VMs) using Snort as the intrusion detection sensor. They simulated a cloud network environment using a VMware virtual ESX machine running over the Internet. Snort was installed in the virtual switch to capture, analyze and detect intrusions based upon known signatures. In their result analysis, Snort detected multiple TCP SYN scan attacks targeted at the receiving host. However, they did not specify the performance of this approach in terms of the detection rate and false-positive rate. Moreover, no real evaluation was made to ascertain the effectiveness of the approach.


Zhao et al. [27] put forward an anomaly intrusion detection system based on an unsupervised learning algorithm, namely the K-means clustering algorithm. The dataset chosen for their experiment was the KDD Cup 99 dataset. For the comparative analysis of the performance of their proposed approach, they used the Particle Swarm Optimization (PSO) and Backpropagation (BP) Neural Network algorithms. The K-means algorithm performed better than the other two algorithms, yielding a false-positive rate (FPR) of 3.56% against FPRs of 6.78% and 5.75% for the PSO and BP neural network algorithms, respectively. In terms of false-negative rate (FNR), 7.65% was achieved, in contrast to 10.46% and 13.75% obtained using the PSO and BP neural network algorithms, respectively. Their study was not carried out with a real cloud IDS dataset, but it is nevertheless valuable, as it highlighted the possibility of predicting several types of attacks in the cloud.
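The general idea behind such clustering-based detectors can be sketched as follows (a simplified illustration of the technique, not the exact procedure of [27]; the flow features and threshold are hypothetical): cluster normal traffic features with k-means, then flag any new observation whose distance to the nearest centroid exceeds a threshold.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Basic Lloyd's k-means clustering on numeric feature tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep old one if empty)
        centroids = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def is_anomaly(point, centroids, threshold):
    """Flag a point whose distance to every learned centroid exceeds the threshold."""
    return min(math.dist(point, c) for c in centroids) > threshold

# Hypothetical flow features: (packets per second, mean packet size)
normal_flows = [(10 + r, 500 + 3 * r) for r in range(20)]
centroids = kmeans(normal_flows, k=2)
print(is_anomaly((500, 40), centroids, threshold=50))   # True: far from normal traffic
print(is_anomaly((15, 515), centroids, threshold=50))   # False: close to normal traffic
```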

Xiong et al. [28] proposed an anomaly detection method for cloud computing systems based on two approaches, namely the Synergetic Neural Network (SNN) algorithm and the Catastrophe Theory (CT) algorithm. They used the DARPA dataset for their experiment and focused their work on the network traffic information. Their experiment yielded an overall average detection rate of 83% for the SNN method and 86.62% for the CT method, with overall average false-positive rates of 8.3% and 9.06%, respectively. Even though this approach was not evaluated on a real cloud dataset, their experimental results show that it can achieve a high detection rate and a low false-positive rate.


Ashad et al. [29] proposed an intrusion severity analysis for cloud computing using a hybrid approach consisting of misuse detection and anomaly-based detection. They used an artificial dataset generated by a computer program. Their experimental design was such that, for the misuse part, attacks were detected based on the signatures of known attacks in their attack database, while for the anomaly part, a profile engine based on the behavior of the monitored virtual machines (VMs) was consulted to detect intrusions. The C4.5 decision tree algorithm was used for intrusion severity analysis on the monitored system calls, implemented using the Weka machine learning software. The detection rate obtained in their experiment, which used 10-fold cross-validation, is 99.6%. Their proposal is quite comprehensive and also automated; however, the dataset does not depict a real cloud dataset, and the process of generating the dataset was not made public, which calls for further inquiry.

Kwon et al. [30] presented a self-similarity based lightweight intrusion detection method for a cloud computing environment. They used cosine similarity to estimate the self-similarity of a given system. The DARPA 1998 dataset was used in their experiment, which focused on the security ID (SID) and the event ID of the monitored Windows event log. Their experiment gave an overall false-positive ratio of 4.17%. Though their study is meant to protect the cloud environment, a real cloud dataset was not used in the first place. Secondly, as robust and cost-effective as they claim their work to be in detecting anomalous events, the Windows event log provides less security-relevant information than other operating systems' audit logs.


Santoso et al. [31] proposed a signature-based network intrusion detection system (NIDS) for the protection of the private cloud from malicious attackers. Their NIDS module was based on using the Snort engine for the analysis and monitoring of network traffic between the VMs for pattern matching. To obtain their dataset, a private cloud was simulated using the DevStack community version 0.0.1 of the OpenStack cloud platform in a computer network laboratory. Their proposal was based on UDP flooding attacks, and the experiment resulted in a detection rate of 100% and a false-positive rate of 11.42%. Not only was a single attack type evaluated, but the dataset also does not represent a typical cloud dataset.

Li et al. [32] proposed an artificial neural network (ANN) based cloud intrusion detection system with a distributed system architecture. The experimental part of their proposal involved simulating a cloud environment using Ubuntu Enterprise Cloud (UEC), a Eucalyptus-powered cloud platform, and evaluating the result on 10% of the KDD’99 dataset. The experiment yielded an average detection rate of 99% and a high average detection time of 37.1 seconds. One of the drawbacks here is that the ANN requires a huge training time for large databases; therefore, the anomaly detection algorithm may incur an increased cost if retraining is required due to changes in traffic behavior, as is the case in a cloud computing environment. Moreover, the simulated dataset cannot stand in for a real cloud dataset.


2.3 Summary

In this chapter, we gave an overview of why cloud computing intrusion detection is receiving a lot of attention from researchers and scholars. We also presented some existing works on cloud computing intrusion detection systems to provide a comparative basis for the approach proposed in this thesis, which builds an intrusion detection system for the cloud computing environment using a real cloud dataset, the ISOT-CID. It can be noted that several of the existing proposals are based on either the DARPA dataset or the KDD CUP dataset, which is derived from the DARPA dataset.


Chapter 3

Experiment Design and Dataset

3.1 Overview

Until now, the availability of a cloud dataset has been one of the major challenges hampering the progress of research on cloud intrusion detection systems. The majority of the work done so far on cloud intrusion detection systems was performed using conventional datasets like the DARPA 1998 or the KDD’99 datasets [6],[12]. Moreover, the datasets used in the works done on cloud network traffic are not made available for public use, in some cases for privacy reasons. These factors have deprived cloud researchers of an all-encompassing real-world cloud intrusion dataset on which to carry out their work.

In 2018, Aldribi et al. [6] tackled the cloud dataset challenge by presenting the cloud security community with the ISOT cloud intrusion detection dataset (ISOT-CID), which is cloud-specific and publicly available. The dataset consists of a variety of data sources, including network traffic, system logs, CPU performance data, etc. Only the network traffic data of the ISOT-CID will be used in the experiments of this thesis.

In this chapter, we present the ISOT-CID dataset in more detail and give a brief overview of the DARPA 1998 dataset. Since the ISOT-CID is relatively new, we present its collection procedures, architecture and the underlying threat model.


3.2 The ISOT Cloud Intrusion Detection Dataset

3.2.1 Data Collection Environment

The ISOT-CID is the first publicly available cloud dataset that was collected in a real-world environment. The cloud service provider is Compute Canada, a nonprofit organization in Canada that supports the computational needs of researchers [6]. There were two phases involved in the ISOT-CID data collection process, namely Phase 1 and Phase 2. The data in the two phases were collected on the same OpenStack-based production environment from various cloud layers, including the hypervisor (or virtual machine monitor) layer, the guest host layer and the network layer. The dataset size is more than 8 terabytes, and it contains data of different formats such as memory dumps, CPU and disk utilization, system call traces, system logs and network traffic [6]. Another advantage of the ISOT-CID is that it is labeled and includes both normal and attack activities. This thesis work is based on the network traffic attributes of the ISOT-CID.

The ISOT-CID collection environment contains three hypervisor nodes, namely node A, node B, and node C. It is also composed of 10 virtual machines or instances (VM1 to VM10) launched in three different cloud zones A, B, and C [6], as seen in Figure 3.1. The distribution of the instances, operating systems, zones, and their respective IP addresses is shown in Table 3.1. Table 3.2 shows the ISOT-CID hypervisor labeling and configurations.


Figure 3.1: ISOT-CID Project Environment [6]

Table 3.1 ISOT-CID VM distributions [6]

VM Label | Operating System | Zone | Internal / External IP Address
VM 1 | Centos | C | 172.16.1.10 / 206.12.59.162
VM 2 | Centos | A | 172.16.1.28
VM 3 | Debian | A | 172.16.1.23 / 206.12.96.142
VM 4 | Windows Server 12 | A | 172.16.1.26 / 206.12.96.149
VM 5 | Ubuntu | A | 192.168.0.10
VM 6 | Centos | A | 172.16.1.19 / 206.12.96.240
VM 7 | Ubuntu | B | 172.16.1.20 / 206.12.96.239
VM 8 | Ubuntu | B | 172.16.1.24 / 206.12.96.143
VM 9 | Windows Server 12 | B | 172.16.1.21 / 206.12.96.141
VM 10 | Centos | B | 172.16.1.27 / 206.12.96.150


Table 3.2 ISOT-CID Hypervisor nodes [6]

Hypervisor ID | Operating System | RAM Size | CPU Specification | Clock Speed
A (poseidon0050) | Centos | 125GB | 32 cores | 2.00GHz
B (poseidon0049) | Centos | 125GB | 32 cores | 2.00GHz
C | Centos | 264GB | 55 cores | 2.40GHz

In the ISOT-CID collection environment, the collector agents are classified and integrated into the three cloud layers as follows: VM (or instance) based agents, hypervisor-based agents and network-based agents [6]. The instance-based agents reside on each of the VMs and record the VM’s runtime data. The hypervisor-based agents run on each of the hypervisors to record the hypervisor runtime data, while the network-based agents also operate on the individual hypervisors to collect the network runtime data. In all three cases, the collected data is forwarded to the ISOT-lab log server for storage and further analysis.

3.2.2 Threat Model and Attacks

The ISOT-CID dataset is labeled and in addition to the normal activities, it also contains a wide range of malicious activities originating from both inside and outside the ISOT cloud environment. These malicious activities adversely affect the cloud user’s data through violation of the information security model policies, the CIA triad (Confidentiality, Integrity, and Availability).


The normal data in the ISOT-CID came from web applications/traffic and administrative activities [6]. Some of the web application activities include account registration, blog activities, and web browsing. The web traffic statistics revealed that more than 160 legitimate users were involved in the generation of the normal data, comprising 60 human users and 100 robots [6]. The administrative activities cut across instance routine maintenance, system rebooting, application updates, file creation, machine access via SSH and remote server access.

Table 3.3 presents the attacks covered in the ISOT-CID, which include probing, DoS, information disclosure, R2L, input validation, backdoors, and authentication breach. These attacks were grouped into insider or outsider attacks depending on their source [6]. On the one hand, the inside malicious activities were perpetrated either by an insider within the cloud environment who had root access on the hypervisor nodes or by a compromised VM within the cloud environment used as a stepping stone. Some of the inside attacks were backdoor and Trojan horse, network scanning, password cracking, and DoS attacks. On the other hand, the outside malicious activities emanated from outside the cloud environment, with the ISOT cloud environment being the primary target. The outside attacks include application layer (layer 7) and network layer (layer 3) DoS attacks, input validation attacks, SQL injection, path/directory traversal and crypto-jacking (unauthorized crypto-mining). For instance, Figure 3.3 shows a timeline for the attack scenario in Phase 1 Day 2 (2016-12-15), and Figure 3.4 depicts the timeline for Phase 2 Day 1 (2018-02-16).


Table 3.3: Attacks covered in the ISOT-CID dataset [6]

Application layer attacks:
- SQL Injection
- Web Vulnerabilities Scanning
- Cross-site Scripting (XSS)
- Dictionary/Brute Force login attack
- Fuzzers
- HTTP Flood DoS
- Directory/Path Traversal

Network layer attacks:
- Trojan Horse
- Backdoor (reverse shell)
- Unauthorized Crypto-mining (download/install/run crypto-miner)
- UDP Flood DoS
- Synflood DoS
- Unclassified (unsolicited traffic)
- DNS amplification DoS
- Ports and Network scanning
- Stepping Stone Attack
- Revealing Users' Credentials and Confidential Data by Insider

Figure 3.4: Timeline for Phase 2 Day 1 (2018-02-16) Inside and Outside Attacks

3.2.3 Composition of ISOT-CID Network Traffic Data

The ISOT-CID dataset is composed of three levels of network communications, namely external, internal and local traffic [6]. In the ISOT-CID context, the external traffic is the traffic between an instance and an outside machine. The internal traffic, or hypervisor traffic, is between the hypervisor nodes. Finally, the local traffic is the traffic between two VMs on the same hypervisor node. The ISOT-CID network data also comes in two kinds: one without payload, collected on both hypervisors and VMs, and the other containing the full network traffic, collected only on the hypervisors [6]. The collected network traffic is stored in packet capture (pcap) format and made available for public use. The ISOT-CID network traffic/packet distribution is shown in Table 3.3 and Table 3.4. Figure 3.4 shows the percentage distribution of ISOT-CID network traffic.


Table 3.3: ISOT-CID Network Traffic Distribution [6]

Phase | Total Normal Traffic | Total Malicious Traffic | Total Packets
1 | 22,356,769 | 15,649 | 22,372,418
2 | 9,502,872 | 2,006,382 | 11,509,254

Table 3.4: ISOT-CID Network Traffic Distribution in percentage (%) [6]

Phase | Total Normal Traffic | Total Malicious Traffic | Total Packets
1 | 99.93% | 0.07% | 100%
2 | 82.57% | 17.43% | 100%

Figure 3.4: ISOT-CID Network Traffic Distribution in percentage (%)
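The percentages in Table 3.4 follow directly from the packet counts in Table 3.3:

```python
# (normal, malicious) packet counts from Table 3.3
phases = {
    1: (22_356_769, 15_649),
    2: (9_502_872, 2_006_382),
}

for phase, (normal, malicious) in phases.items():
    total = normal + malicious
    print(f"Phase {phase}: normal {100 * normal / total:.2f}%, "
          f"malicious {100 * malicious / total:.2f}% of {total:,} packets")
# Phase 1: normal 99.93%, malicious 0.07% of 22,372,418 packets
# Phase 2: normal 82.57%, malicious 17.43% of 11,509,254 packets
```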


3.3 DARPA 1998 Intrusion Detection Dataset

Until recently, the DARPA and KDD’99 intrusion detection evaluation datasets have thrived as the benchmark datasets for building and testing intrusion detection systems. The DARPA 1998 IDS dataset was developed by the MIT Lincoln Laboratory with funding from DARPA and the Air Force Research Laboratory (AFRL) [33]. The DARPA dataset collection process involved a simulation to model the network traffic of a US Air Force base connected to the Internet [34],[35]. The outcome of the simulation includes background or user activities, attacks injected at well-defined points, gathered TCP dumps, Sun Basic Security Module (BSM) audit data, and process and file system information. All the network traffic was stored in tcpdump format and made available for evaluation. The DARPA dataset is a publicly available off-line evaluation dataset that provided direction for research efforts in intrusion detection system development and evaluation. The network data consists of seven weeks of data for training and two weeks for testing the detection performance. It contains labels for normal and malicious activities. The DARPA simulation network environment is shown in Figure 3.5. In addition to the DARPA dataset having been criticized over time by researchers and reviewers, it neither contains new advanced attack scenarios that are cloud-specific, nor was it collected from a real-world cloud environment like the ISOT-CID. Therefore, the DARPA dataset cannot be used as a substitute for building cloud IDS even though it is publicly available.


Figure 3.5: DARPA Simulation Network Environment [33]

3.5 Summary

In this chapter, we presented the two datasets used in our experiments: the ISOT-CID and DARPA 1998 datasets. We further described the ISOT-CID in more detail by presenting the collection environment and the involved attack vectors. The discussion was centered on the network traffic, since the scope of our work is to build a hypervisor-based cloud network IDS. We also discussed some of the reasons why the DARPA 1998 evaluation dataset cannot stand in the place of a real-world cloud intrusion detection dataset.

The focus of the next chapter will be on the feature model and feature extraction, the machine learning algorithms and the experimental designs.


Chapter 4

Feature Model

One of the most important ingredients of any intrusion detection system is the nature of the input data. In the cloud computing environment, the measurement and analysis of network traffic are quite challenging for the following reasons. Firstly, the density of network links and the number of features transferred at high speed are much higher than in a conventional network. Secondly, most of the proposed methods for network traffic measurement and analysis can only extract network traffic features for a few hundred hosts [36]. Thirdly, encryption and privacy regulations may restrict payload inspection in the cloud computing environment [6]. With these challenges in mind, analyzing the cloud network traffic at the flow level rather than the packet level helps to maximize aggregate network utilization. It also helps to achieve both online and offline intrusion detection for the cloud computing environment by decreasing the time required for learning. In addition, it enables us to eliminate redundancy by extracting only the relevant features, thereby increasing the classification accuracy and detection rate. The components of the network flow features used in this thesis are discussed in this chapter.


4.1 Flow-based feature model

Because the VM instances in the cloud environment share the same hypervisor, the feature extraction for a cloud intrusion detection system should take into account the correlated behavior of the instances.

A network flow can be seen as a bidirectional packet stream between two hosts, or the movement of network traffic across different network points, usually from a source to a destination and vice versa. A network flow is characterized by a five-tuple: source IP, source port, destination IP, destination port and protocol. Grouping network traffic into flows ensures that the individual packets are representative and saves the cost of keeping a higher volume of data for analysis. Network traffic flow samples are shown in Figure 4.1. We first grouped the captured hypervisor packets in pcap file format into a stream of packet flows based on a time window, using Tranalyzer, a flow-based forensic and network troubleshooting traffic analysis tool. Eighty raw features were extracted from the packet headers, some of which are presented in Table 4.1. The next step was to feed the generated packet flows into a Python script written to extract the instance-oriented feature vector components for use by the cloud intrusion detection system.
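The grouping step can be sketched as follows (a simplified, illustrative stand-in for what a flow exporter does; the packet dictionary field names are assumptions, not Tranalyzer's actual output format). The key idea is that both directions of a conversation map to one canonical five-tuple, and a flow is split when the gap between consecutive packets exceeds the time window:

```python
from collections import defaultdict

def flow_key(pkt):
    """Canonical bidirectional five-tuple: both directions map to one flow."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (min(a, b), max(a, b), pkt["proto"])

def group_into_flows(packets, window=60.0):
    """Group packets into flows, starting a new flow for a five-tuple when the
    gap between consecutive packets exceeds the time window (seconds)."""
    flows = defaultdict(list)  # key -> list of flows (each flow = list of packets)
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        key = flow_key(pkt)
        if flows[key] and pkt["ts"] - flows[key][-1][-1]["ts"] > window:
            flows[key].append([])   # gap too large: start a new flow
        if not flows[key]:
            flows[key].append([])
        flows[key][-1].append(pkt)
    return flows

packets = [
    {"ts": 0.0, "src_ip": "172.16.1.10", "src_port": 51000,
     "dst_ip": "206.12.96.142", "dst_port": 80, "proto": "tcp"},
    {"ts": 0.4, "src_ip": "206.12.96.142", "src_port": 80,
     "dst_ip": "172.16.1.10", "dst_port": 51000, "proto": "tcp"},
]
flows = group_into_flows(packets)
print(len(flows))  # 1: request and reply fall into the same bidirectional flow
```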


Figure 4.1: Network traffic flow samples for phase 2 hypervisor A day 1 (2018-02-16)

Table 4.1: Some of the raw features extracted from the hypervisor network traffic using the traffic analyzer tool (Tranalyzer)

Feature Description

flowInd The flow index

timeFirst Date time of first packet

timeLast Date time of last packet

duration Flow duration

srcIP Source IP

srcPort Source port

dstIP Destination IP

dstPort Destination port

numPksSnt Number of transmitted packets

numPksRcvd Number of received packets

numBytesSnt Number of transmitted bytes


4.2 Instance-oriented feature characteristics

In this thesis, we used a three-dimensional feature space, namely frequency-based features, entropy-based features and load-based features, based on the work of Aldribi et al. [6]. The following observations were made in [6]:

a. A change in the normal frequency of a packet flow between specific IP addresses could be an indication that the resulting traffic flow emanated from an attacker. A typical example of this unusual change can be seen when a victim’s machine is hit by a denial of service (DoS) attack. This motivates our consideration of the frequency-based features.

b. Malicious activities always introduce a high level of randomness into the network behavior, which necessitates the application of entropy to measure the degree of randomness.

c. During normal traffic flow, the received traffic pattern matches the sent traffic pattern, meaning that a deviation in the traffic load is an indication of malicious activity. That is why the consideration of the traffic load ratio is important.
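The degree of randomness in point (b) is typically quantified with Shannon entropy. As a generic illustration (not the exact entropy feature definition of [6]), the entropy of the destination ports observed in a time window rises sharply under scan-like behavior:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Normal window: traffic concentrated on a few service ports -> low entropy
print(shannon_entropy([80, 80, 80, 443, 443, 22]))
# Scan-like window: many distinct ports -> high entropy
print(shannon_entropy(list(range(1024, 1030))))  # log2(6) ≈ 2.58 bits
```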

4.2.1 Frequency-Based Feature Extraction

Two categories of frequency-based features were adopted for each VM instance, namely the ‘in-frequency’ and the ‘out-frequency’ features. The in-frequency represents the frequency of the packets incoming to a specific instance from any source or endpoint, while the out-frequency represents the outgoing packets from a specific instance back to the respective sources.


The in-frequency features, adopted from [6], are defined as follows:

f^in_{s,sp,i,ip}(t) = |F^in_{t,δt}(s, sp, i, ip)| / δt,    (4.1)

f^in_{sp,i}(t) = |∪_{∀s,ip} F^in_{t,δt}(s, sp, i, ip)| / δt,    (4.2)

f^in_{i,ip}(t) = |∪_{∀s,sp} F^in_{t,δt}(s, sp, i, ip)| / δt,    (4.3)

f^in_{i}(t) = |F^in_{t,δt}(i)| / δt,    (4.4)

max_{ip} { f^in_{i,ip}(t) },    (4.5)

In a like manner, the out-frequency features adopted from [6] are defined as follows:

f^out_{i,ip,d,dp}(t) = |F^out_{t,δt}(i, ip, d, dp)| / δt,    (4.6)

f^out_{i,dp}(t) = |∪_{∀ip,d} F^out_{t,δt}(i, ip, d, dp)| / δt,    (4.7)

f^out_{i,ip}(t) = |∪_{∀d,dp} F^out_{t,δt}(i, ip, d, dp)| / δt,    (4.8)

f^out_{i}(t) = |F^out_{t,δt}(i)| / δt,    (4.9)

max_{ip} { f^out_{i,ip}(t) },    (4.10)
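In practice, these frequency features amount to counting, within the window [t, t + δt], the packets associated with instance i (optionally broken down per instance port) and dividing by δt. A minimal sketch of the in-frequency side, assuming packets carry timestamp, destination IP and destination port fields (the field names and helper function are illustrative):

```python
def in_frequencies(packets, instance_ip, t, delta_t):
    """Total in-frequency (eq. 4.4), per-port in-frequency (eq. 4.3),
    and its maximum over ports (eq. 4.5) for one instance."""
    window = [p for p in packets
              if t <= p["ts"] < t + delta_t and p["dst_ip"] == instance_ip]
    total = len(window) / delta_t                      # eq. (4.4)
    per_port = {}
    for p in window:                                   # count packets per instance port
        per_port[p["dst_port"]] = per_port.get(p["dst_port"], 0) + 1
    per_port = {port: c / delta_t for port, c in per_port.items()}  # eq. (4.3)
    max_in = max(per_port.values(), default=0.0)       # eq. (4.5)
    return total, per_port, max_in

packets = [
    {"ts": 0.5, "dst_ip": "172.16.1.10", "dst_port": 80},
    {"ts": 1.2, "dst_ip": "172.16.1.10", "dst_port": 80},
    {"ts": 1.8, "dst_ip": "172.16.1.10", "dst_port": 22},
    {"ts": 9.0, "dst_ip": "172.16.1.23", "dst_port": 80},  # other instance, ignored
]
total, per_port, max_in = in_frequencies(packets, "172.16.1.10", t=0, delta_t=10)
print(total, per_port, max_in)  # 0.3 {80: 0.2, 22: 0.1} 0.2
```

The out-frequency features of equations (4.6) to (4.10) follow symmetrically, filtering on the source IP and source port instead.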

A detailed description of each of the in and out frequency feature terms is given in Table 4.2 [6].


Table 4.2: Description of the in and out frequency terms [6]

| Feature / Term | Description |
| --- | --- |
| $[t, t + \delta t]$ | Observation time interval. |
| endpoint ($\mathfrak{e}$) | A type of communication node, e.g. $\mathfrak{e}_i = (i, ip)$. |
| $i$ | A VM instance. |
| $ip$ | An instance port. |
| $s$ | A source of traffic. |
| $sp$ | A source port. |
| $d$ | A destination. |
| $dp$ | A destination port. |
| $f^{in}_{s,sp,i,ip}(t)$ | The number of packets flowing from the endpoint $\mathfrak{e}_s$ to $\mathfrak{e}_i$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $f^{in}_{sp,i}(t)$ | The number of packets flowing from a specific $sp$ in all endpoints $\mathfrak{e}_s$ to all $ip$ in $\mathfrak{e}_i$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $f^{in}_{i,ip}(t)$ | The number of packets flowing from all $sp$ in all endpoints $\mathfrak{e}_s$ to a specific $ip$ in $\mathfrak{e}_i$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $f^{in}_{i}(t)$ | The total number of packets flowing to $\mathfrak{e}_i$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $\max_{ip}\{f^{in}_{i,ip}(t)\}$ | The maximum number of packets over $ip$ flowing to $\mathfrak{e}_i$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $f^{out}_{i,ip,d,dp}(t)$ | The number of packets flowing from the endpoint $\mathfrak{e}_i$ to $\mathfrak{e}_d$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $f^{out}_{i,dp}(t)$ | The number of packets flowing from all $ip$ in $\mathfrak{e}_i$ to a specific $dp$ in all endpoints $\mathfrak{e}_d$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $f^{out}_{i,ip}(t)$ | The number of packets flowing from a specific $ip$ in $\mathfrak{e}_i$ to all $dp$ in all endpoints $\mathfrak{e}_d$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $f^{out}_{i}(t)$ | The total number of packets flowing from $\mathfrak{e}_i$ during $[t, t + \delta t]$, divided by $\delta t$. |
| $\max_{ip}\{f^{out}_{i,ip}(t)\}$ | The maximum number of packets over $ip$ flowing out of $\mathfrak{e}_i$ during $[t, t + \delta t]$, divided by $\delta t$. |
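The per-window counting behind Eqs. (4.1)–(4.5) can be sketched as follows. The function name, the dictionary layout of the result, and the flat `(s, sp, i, ip)` record format are illustrative assumptions, not the thesis's actual implementation:

```python
from collections import defaultdict

def in_frequency_features(packets, instance, delta_t):
    """Per-window in-frequency features for one VM instance.

    `packets` lists one (s, sp, i, ip) tuple per inbound packet observed in
    the window [t, t + delta_t]. The returned values mirror Eqs. (4.1)-(4.5);
    the out-frequency features (4.6)-(4.10) follow the same pattern with
    (i, ip, d, dp) tuples.
    """
    per_pair = defaultdict(int)  # packets per (s, sp, i, ip), Eq. (4.1)
    per_sp = defaultdict(int)    # packets per source port,    Eq. (4.2)
    per_ip = defaultdict(int)    # packets per instance port,  Eq. (4.3)
    total = 0                    # all packets to the instance, Eq. (4.4)

    for (s, sp, i, ip) in packets:
        if i != instance:
            continue
        per_pair[(s, sp, i, ip)] += 1
        per_sp[sp] += 1
        per_ip[ip] += 1
        total += 1

    return {
        "f_in_pair": {k: v / delta_t for k, v in per_pair.items()},
        "f_in_sp": {k: v / delta_t for k, v in per_sp.items()},
        "f_in_ip": {k: v / delta_t for k, v in per_ip.items()},
        "f_in_total": total / delta_t,
        "f_in_max_ip": max(per_ip.values(), default=0) / delta_t,  # Eq. (4.5)
    }

# Four packets reach vm1 in a 2-second window: f_in_total = 4 / 2 = 2.0,
# and port 80 dominates with 3 packets, so f_in_max_ip = 3 / 2 = 1.5.
window = [("10.0.0.1", 5000, "vm1", 80)] * 3 + [("10.0.0.2", 6000, "vm1", 22)]
print(in_frequency_features(window, "vm1", delta_t=2.0))
```

Dividing each count by the window length normalizes the features to packet rates, so windows of different lengths remain comparable to the classifier.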
