
A DSL for pattern detection in web server access logs

Kevin van Cleef

kevin@developmentstudio.nl

November 14, 2015, 41 pages

Supervisors: Anthonie van Dijk (Coolblue) & Magiel Bruntink (UvA)
Host organisation: Coolblue B.V., http://www.coolblue.nl

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract

1 Introduction
  1.1 Initial Study
  1.2 Problem Statement
    1.2.1 Research Questions
    1.2.2 Research Method and Solution Outline
  1.3 Related Work
  1.4 Contributions

2 Background
  2.1 Patterns
    2.1.1 Single Request Patterns
    2.1.2 Multi Request Patterns
  2.2 Domain-Specific Languages

3 A DSL to describe and detect security patterns
  3.1 Language
    3.1.1 Specification
    3.1.2 Example
  3.2 Application
    3.2.1 Concerns
    3.2.2 Implementation
  3.3 Access Logs

4 Research Method
  4.1 Scenarios
    4.1.1 Scenario 1: Automated ordering
    4.1.2 Scenario 2: Vulnerability scanners
    4.1.3 Scenario 3: Cookies used on a specific IP address
    4.1.4 Scenario 4: IP addresses using a specific cookie
    4.1.5 Scenario 5: Related IP addresses to a specific IP address
  4.2 Measurements
    4.2.1 Detector definition size
    4.2.2 Completion time
    4.2.3 Disk usage
  4.3 Test Setup
    4.3.1 Local setup
    4.3.2 Cluster setup

5 Experimental Results
  5.1 Scenarios
    5.1.1 Scenario 1: Automated ordering
    5.1.2 Scenario 2: Vulnerability scanners
    5.1.3 Scenario 3: Cookies used on a specific IP address
    5.1.4 Scenario 4: IP addresses using a specific cookie
    5.1.5 Scenario 5: Related IP addresses to a specific IP address
  5.2 Measurements
    5.2.1 Detector definition size
    5.2.2 Completion time
    5.2.3 Disk usage

6 Discussion
  6.1 Scenarios
  6.2 Measurements
    6.2.1 Detector definition size
    6.2.2 Completion time
    6.2.3 Disk usage

7 Conclusion
  7.1 Research Questions & Answers
    7.1.1 How to define security related patterns in web requests access logs?
    7.1.2 How can we define an efficient DSL for expressing security related patterns in web request access logs?
    7.1.3 Can the patterns, expressed in our DSL, be detected in our web request access logs?

Bibliography

A OWASP: Examples
  A.1 Cross Site Scripting (XSS)
  A.2 SQL Injection
  A.3 Directory traversal
  A.4 Command injection
  A.5 Forced browsing


Abstract

Each day more and more businesses move their services to the web. The web has a lot of potential, but it also has a downside: it is easily accessible to anyone who wants to find errors and weaknesses in your application and misuse them.

Coolblue, the host organisation during this work, is a big player in the e-commerce market in the Netherlands. At the time of writing, approximately 70 GB of log information is collected from their web servers each day. Apart from some monitoring, nothing is done with this data-set, and they were looking for an easy and efficient way to analyze it from a security perspective.

In this work we have tried to find a solution for detecting different kinds of security related patterns using only the access log files of the Coolblue web servers, for example scraping/automation attacks on the web shops and vulnerability scanners, and for relating these results to orders/accounts on the webshops. The research questions we have tried to answer during this work were:

1. How to define security related patterns in web requests access logs?

2. How can we define a DSL for expressing security related patterns in web request access logs in a fast, easy and reusable way?

3. Which patterns can be expressed in our DSL and can also be detected in our web request access logs?

During this work we started with a literature study to create an index of known security vulnerabilities which can be detected from access logs. This gave us an overview of what our solution has to support. Based on this index we created a domain-specific language, which we called Cool Security Language (CSL), to describe the identified patterns. Finally we created a prototype application which uses CSL to detect patterns in the access logs.

We experimented with the CSL language by describing multiple scenarios and trying to detect these patterns within two test setups: one setup with a limited log data-set, and one which uses the production log data-set of Coolblue. Based on these experiments we discuss whether Elasticsearch is the right data store for this type of analysis. The way in which we implemented the communication between our application and the Elasticsearch data store also has some flaws.

We can conclude that the prototype application, by which we wanted to show the possibilities of the CSL language, showed some flaws during the experiments. When querying a small data-set everything worked out fine; however, when querying a bigger data-set like the production one, memory related exceptions were thrown. By reasoning about the implementation of the prototype application we can conclude that these memory problems are related to the way in which we implemented the execution of the search queries to Elasticsearch. Based on the disk usage measurements we did during the experiments we can conclude that the Elasticsearch data store is not the ideal candidate for doing pattern detection in all scenarios.


Chapter 1

Introduction

1.1 Initial Study

Each day more and more businesses move their services to the web. The web has a lot of potential: it makes it easy, for example, to reach a big(ger) audience and to make your service available 24 hours a day. However, it also has some side effects; for example, it is easily accessible to everybody who wants to find errors and weaknesses in your application and misuse them [FO07]. Two kinds of options are available to act against this. The first one is prevention. To protect a system from attacks, every update/patch available for all parts of the system should be applied as it becomes available, so that known gaps can no longer be misused. However, there is always some period of time between the detection of a security gap and developing/applying a patch. New security issues are discovered each day, and every time it is necessary to check whether these issues exist in the system [BD12].

The second one is detection. Instead of guessing which security issues come into play in your application, you can analyze the log files for suspicious traffic/patterns and act based on these results. Examples of what can be detected in log files include, but are not limited to, parameter manipulation [FWCW14], automation, scraping, XSS attacks [KMD11] and SQL injections [LJYM12]; in addition, the behavior of how users are making use of the application gives a lot of information.

The Gartner Group estimates that over 70% of attacks against a company's web site or web application come at the application layer, not the network or system layer [FO07].

1.2 Problem Statement

Coolblue, the host organisation for this work, is a big player in the e-commerce market in the Netherlands. At the time of writing, approximately 70 GB of log information is collected from their web servers each day. Apart from some monitoring, nothing is done with this data-set, and they are looking for an easy and efficient way to analyze it from a security perspective. Questions raised are for example:

• Can we trace automated ordering happening on our application?
• Can we trace visitors trying to find vulnerabilities in our applications?
• Can we relate security patterns to specific users?
• Can we trace accounts that are being hijacked? For example, by the time between changing their password/address etcetera and placing an order?

Most of these questions can possibly be answered by analyzing this data-set. Multiple commercial tools are available for analyzing vulnerabilities such as SQL injections, path traversals, etcetera. Most of these tools are hard to configure, or maybe not efficient or exhaustive enough to detect all patterns. Using multiple tools can be a solution, but then there is no uniformity in your configurations, which is not efficient. Salgueiro et al. [SDBA11] presented a language to describe network security


patterns at the network packet level, which is the inspiration for this research project. They defined the NeMODe system, which consists of a DSL as a uniform interface for describing patterns and an application which generates code that can be used for detection, based on this DSL. In this work we mainly focus on describing an efficient uniform language for describing security patterns in web application access logs. Based on this language, configuration files for other systems can, for example, be generated, but also complete tools can be generated for analysis and detection [vdBvdS11].

1.2.1 Research Questions

The research we will do in this project is to present an efficient language for describing security related patterns in application web request access logs in such a way that a system can be analyzed quickly and easily for the presence of (new) vulnerabilities. The research questions are stated below:

1. How to define security related patterns in web requests access logs?

2. How can we define a DSL for expressing security related patterns in web request access logs in a fast, easy and reusable way?

3. Which patterns can be expressed in our DSL and can also be detected in our web request access logs?

1.2.2 Research Method and Solution Outline

To answer the first research question we will focus mostly on literature research to create an index of known security issues which can be detected from web application access logs. Furthermore, we will research how we can define patterns describing these issues. Roughly two categories can be identified: first, signature based intrusion detection, and second, anomaly based intrusion detection. With signature based intrusion detection, intrusions are described using their signatures: particular properties of a request access log entry. These properties are then looked up in the request logs to find the desired intrusion [SDBA11]. With anomaly based detection, the system models the normal behavior of the request access logs using statistical methods and/or data mining approaches. The behavior is then monitored, and if it is considered anomalous according to the model, there is a great probability of an attack [SDBA11]. In this work we will look at the first category.
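The signature based approach can be illustrated with a short sketch. The following Python fragment is an illustration only, not part of the prototype described in this thesis; the field names and the matches_signature helper are hypothetical. A signature is a set of required properties whose values may be exact strings/numbers or regular expressions, and it is looked up in a single log entry:

```python
import re

def matches_signature(entry, signature):
    """Return True if every property in the (hypothetical) signature matches the log entry."""
    for key, expected in signature.items():
        value = entry.get(key)
        if value is None:
            return False
        if isinstance(expected, re.Pattern):
            # Regular expressions match partially, anywhere in the value.
            if not expected.search(str(value)):
                return False
        elif value != expected:
            # Strings and numbers must match exactly.
            return False
    return True

log_entry = {"method": "GET", "url": "/search?q=<script>alert(1)</script>", "status": 200}
xss_signature = {"method": "GET", "url": re.compile(r"<script>")}
print(matches_signature(log_entry, xss_signature))  # → True
```

An anomaly based detector would instead build a statistical model of normal entries; that variant is out of scope here, as it is in the thesis.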

To answer the second research question we do some literature research to get more information about describing patterns in web application security. The focus will be on setting up a DSL in which we can describe patterns in a fast, easy and reusable way. To validate the concept we will develop a prototype application which uses the DSL to detect patterns in the access log data-set we have.

To answer the third research question we are going to experiment with describing patterns in our DSL and analyzing whether we can find the patterns in our data-set. These experiments will validate our setup and will function as our proof of concept. More information about the research method and setup can be found in chapter 4.

1.3 Related Work

In this section related work is discussed which inspired us to start with this project.

NeMODe

Salgueiro et al. [SDBA11] and [SA10] presented NeMODe, a domain specific language (DSL) for describing security/intrusion related patterns to be detected in network traffic. Where most tools are focused on signature based detection within a single packet, they have tried to develop a uniform interface for describing and detecting patterns which span multiple packets. Where NeMODe is focused on describing patterns on the network layer, in this work we reproduced this concept for a web application and its access logs.


Derric

van den Bos and van der Storm [vdBvdS11] present the concept of using the external domain specific language Derric to overcome the complexity and the time consuming process of extending and modifying an application for a specific use case. The language they developed is focused on file carving, which is the process of finding full or fragmented file structures on storage devices. File structures are heavily subject to change, and minor variations are used by all kinds of different vendors. With Derric they defined a uniform interface to easily describe, change, and reuse these file structure definitions. In our work we have made use of this concept to separate the pattern definitions from the pattern detection logic to make them reusable, and fast and easy to change.

1.4 Contributions

NeMODe and Derric are both interesting languages which were developed for a specific goal. In this work we converted their ideas to detect (security) patterns in web application access logs based on our own domain specific language, and we will show that pattern detection can be done with this domain specific language. In this work we will show:

• The patterns that can be detected in access log files, which will be discussed in chapter 2.
• CSL, our domain specific language, and the prototype application, which will be discussed in chapter 3.
• The research method used, in chapter 4.
• The experimental results and findings of our experiments, in chapter 5.
• Finally, we discuss and conclude in chapters 6 and 7.


Chapter 2

Background

In this chapter some background information is given about the patterns that we are interested in during this research and about what we mean by a domain specific language.

2.1 Patterns

Patterns that can be found in web request access logs can be divided into two categories: patterns which consist of a single request, and patterns which are spread across multiple requests.

Examples of single request patterns include parameter manipulation attacks, in which the attacker modifies the data sent by the client to the server in a way that compromises, or tries to compromise, one or more security services. Most of them are described in the OWASP Top Ten [OWA].

Example patterns where multiple requests are involved include, but are not limited to, information scraping, automated attacks (e.g. dictionary/brute-force attacks) and analyzing cookies for a specific IP address/device. More or less every flow over an application can be described.

2.1.1 Single Request Patterns

Single request patterns are simple patterns that can be identified from a single access log entry: search for the relevant properties in the data-set, and the result set is everything that matches the defined pattern. The most important single request patterns, seen from a security perspective, are described in the OWASP Top Ten [OWA]. In this section some of the most common web application security flaws are described, including Cross Site Scripting, injection techniques, session/cookie manipulation, forced browsing and encoding attacks. Vulnerable code examples of these attacks are given in appendix A.

Cross Site Scripting (XSS)

Cross Site Scripting, or XSS for short, is the most prevalent, obstinate, and dangerous vulnerability in web applications, as stated in the Top 25 Most Dangerous Software Errors list [Chr11].

An XSS vulnerability lets an attacker inject browser executable code via URLs/HTTP request parameters into a web page. These attacks can vary from injection of HTML tags, for example <h1>, to injection of executable JavaScript code like <script>alert('XSS attack')</script>, or worse, inclusion of an iframe which injects malware into the page.

The impact of such an attack differs depending on the vulnerability in the web application. If the modified parameter is saved to persistent storage, like a database or disk, the attack is persisted as well, and everyone that loads the specific page is automatically attacked by the executable code. When the modified parameter is not saved to persistent storage, the attack can still be distributed over a large group of users, for example by distribution of a modified URL. In this case the user has to click the modified URL first to be attacked. Code examples are given in appendix A.1.
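In an access log, such injected payloads show up (possibly percent-encoded) in the request URL. As a minimal illustration, not part of the prototype, the following Python sketch flags URLs whose decoded query string contains common XSS payload fragments; the pattern list and the looks_like_xss helper are our own simplification:

```python
import re
from urllib.parse import unquote

# A few well-known XSS payload fragments; a real rule set would be larger.
XSS_PATTERN = re.compile(r"<\s*script|javascript:|onerror\s*=|onload\s*=", re.IGNORECASE)

def looks_like_xss(url):
    # Decode percent-encoding first, so %3Cscript%3E is caught as well.
    return bool(XSS_PATTERN.search(unquote(url)))

print(looks_like_xss("/page?q=%3Cscript%3Ealert('XSS')%3C/script%3E"))  # True
print(looks_like_xss("/page?q=televisions"))                             # False
```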


SQL Injection

A SQL injection attack is an attack whereby SQL code is injected into an input or parameter field. When the input is not validated, a SQL query can be altered by the user, which enables the attacker to query the database themselves [OWA]: for example, reading sensitive data, modifying data or deleting the whole database. Along with Cross Site Scripting attacks, SQL injection attacks are among the most common vulnerabilities that occur in web applications.

SQL injection attacks can be classified under seven main categories: Tautologies, Illegal/Logically Incorrect Queries, Union Query, Piggy-Backed Queries, Stored Procedures, Inference, and Alternate Encodings [SZM13] [HVO06]. Examples are given in appendix A.2.
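As an illustration of how such signatures could be matched against logged parameter values, the Python sketch below covers only the tautology and union-query categories (plus SQL comment markers); it is a simplification of ours, not the thesis prototype, and a real detector would need signatures for all seven categories:

```python
import re
from urllib.parse import unquote_plus

# Tautologies like "OR '1'='1", UNION SELECT, and SQL comment markers.
SQLI_PATTERN = re.compile(
    r"(\bor\b\s+['\"]?\d+['\"]?\s*=\s*['\"]?\d+|\bunion\b\s+select\b|--|/\*)",
    re.IGNORECASE,
)

def looks_like_sqli(param_value):
    return bool(SQLI_PATTERN.search(unquote_plus(param_value)))

print(looks_like_sqli("1' OR '1'='1"))                        # True (tautology)
print(looks_like_sqli("1 UNION SELECT password FROM users"))  # True (union query)
print(looks_like_sqli("blue smartphone"))                     # False
```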

Directory traversal

Directory traversal attacks are attacks whereby an attacker tries to access files and directories outside the application's root directory. A simple example of such an attack is shown below, whereby an attacker modifies a request parameter. More extensive examples are given in appendix A.3.

http://example.com/files.php?file=picture.png
http://example.com/files.php?file=../../some/file/or/directory
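A single-request signature for this attack can simply look for upward path segments in decoded parameter values. The sketch below is our own illustration (the has_traversal helper is hypothetical); note that the standard query-string parser already decodes percent-encoding once, and we decode a second time to also catch double-encoded values:

```python
from urllib.parse import unquote, urlparse, parse_qs

def has_traversal(url):
    # parse_qs performs one round of percent-decoding on the values.
    for values in parse_qs(urlparse(url).query).values():
        for value in values:
            decoded = unquote(value)  # second decode, for double-encoded payloads
            if "../" in decoded or "..\\" in decoded:
                return True
    return False

print(has_traversal("http://example.com/files.php?file=../../some/file"))  # True
print(has_traversal("http://example.com/files.php?file=picture.png"))      # False
```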

Command injection

The goal of command injection is to execute arbitrary commands on the operating system of a vulnerable web application. These kinds of attacks are possible when unsafe data is passed to the application without validation. A vulnerable application acts as a pseudo system shell which runs as the user the web application is running under. When the web application is running under the root user, the attacker is able to execute every command on the system [KMD11]. Vulnerable code examples are given in appendix A.4.

Session manipulation

HTTP is a stateless protocol, which means that information about requests is not shared between pages/requests. Session tokens allow users to identify themselves to a service after an initial authentication procedure without needing to resend the authentication information (usually a user name and password) with every request/message [Cor14]. Session manipulation attacks are focused on these session tokens. Different types of session attacks can be distinguished, such as: predictable session tokens, client side attacks (for example XSS attacks), session sniffing and man in the middle/browser attacks [OWA]. The goal of session manipulation attacks is to modify/intercept session tokens in order to authenticate as another authenticated user.

Cookie poisoning

Cookie poisoning involves modifying a cookie so that the web application is deceived into giving away sensitive data. It is usually used to steal the identity of a user, so that the web application treats the attacker as the victim. Thus, the attacker can access the web application as the victim, and can then gain, damage or delete confidential information [HH03].

Forced browsing

Forced browsing, also known as fuzzing, enumeration or parameter/URL tampering, is an attack based on manipulation of parameters sent between client and server to get access to information the attacker is not authorized to see. A simple example is modifying the URI parameters, but this can be


done on every parameter exchanged between client and application, for example cookies, sessions, URI parameters, forms, etcetera [OWA]. Examples are given in appendix A.5.

URL Encoding

A very popular evasion technique is to obfuscate the URL and its parameters by using different encoding schemes to bypass sanity checking filters [Mey08]. This is possible because the web server accepts and processes client requests in many encoded forms. Multiple forms of encoding exist, for example UTF-8 Unicode and hexadecimal encoding, which are both allowed in URLs [Gun07]. Attacks where the values are double encoded are also known. By using double encoding it is possible to bypass security filters that only decode user input once. The second decoding process is executed by the back-end platform or modules that properly handle encoded data, but don't have the corresponding security checks in place [OWA]. Some examples of attacks where encoding is used are given in appendix A.6.
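The double-encoding evasion described above can be demonstrated in a few lines. In this illustrative sketch, a filter that decodes the input once sees no "../", while a second decode, as a back-end module might perform, reveals the traversal sequence:

```python
from urllib.parse import unquote

payload = "%252e%252e%252f"   # double-encoded "../" ("%25" is the encoded "%")
once = unquote(payload)       # "%2e%2e%2f" -- still looks harmless to a one-pass filter
twice = unquote(once)         # "../" -- the actual traversal sequence
print(once, twice)
```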

2.1.2 Multi Request Patterns

Multi request patterns concern patterns which involve multiple entries from the access logs to be analyzed before a result can be determined. In this section some examples will be described, including information scraping, automated ordering on e-commerce sites (or automation in general), vulnerability scanning, and analyzing cookies used on a specific IP address/device.

Information Scraping

A lot of information on the internet is indexed for different purposes, e.g. by price comparison websites, search engines or even competitors monitoring the website for changes. Detection of these kinds of patterns can be done by checking session duration or, for example, the number of requests in a specific time window.
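The request-rate check can be sketched as a sliding window per IP address. The following Python fragment is an illustration only; the thresholds (60-second window, more than 50 requests) and the find_scrapers helper are hypothetical, not values used in the thesis:

```python
from collections import deque, defaultdict

def find_scrapers(entries, window=60, limit=50):
    """entries: iterable of (timestamp_seconds, ip). Flags IPs exceeding
    `limit` requests within any `window`-second interval."""
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in sorted(entries):
        q = recent[ip]
        q.append(ts)
        while q and ts - q[0] > window:
            q.popleft()           # drop requests that fell out of the window
        if len(q) > limit:
            flagged.add(ip)
    return flagged

# One request per second from a scraper, two requests from a normal visitor.
entries = [(i, "172.16.167.19") for i in range(150)] + [(0, "10.0.25.210"), (30, "10.0.25.210")]
print(find_scrapers(entries))  # → {'172.16.167.19'}
```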

Automation

The normal way a visitor uses a website is by clicking around manually; however, there are ways activity on web pages can be automated [Fak]. This does not have to be immediately bad, but it is good to know if and when it happens. For example, someone who has written a tool to automate the process of adding products to the shopping cart can be performing a valid activity if they have to order a lot of products from the webshop. However, it can also be used for fast ordering via hijacked accounts, and then it is a case of fraud [vD15]. Detection of these kinds of patterns can be done by describing a flow pattern over the website combined with, for example, the number of requests in a specific time frame.

Scanning

Most attacks described in section 2.1.1 take place in an automated way, with tools called vulnerability scanners, for example [Nes] and [Acu], or with custom scripts made by an attacker. With such tools a high volume of pages can be scanned for vulnerabilities in a short amount of time. Most of these commercial scanners send, by default, custom headers with each request by which they can be identified in the access logs. However, most of these tools allow the default headers to be changed, which makes them harder to detect. By detecting individual OWASP related attacks and trying to establish a relationship between these requests, a pattern can be found. This can give information about, for example, the session duration of an attack versus the number of scanned vulnerabilities [vD15]. Another example of an automated scanning attack is scanning all (product) URLs to find unreleased or hidden product information and/or pages. See also forced browsing in section 2.1.1.

Tracing by cookies and IP information

When working in an e-commerce organisation you have to deal every now and then with fraudulent orders. In most cases, where the fraudster tries to hide his traces, the order will be created via a newly


created account with bogus user information and is therefore more or less meaningless. Or maybe not? What if we can match the IP address or cookie used to place the order with an account/order placed some time ago? As a main player in the online e-commerce market, there is a lot of information available that can be used to try to identify the fraudster's real information, for example via a user account of the fraudster from before he/she started committing fraud. An example scenario where identification is done based on an IP address1 is given below in table 2.1.

Date              IP address     Activity
01/01/2015 12:01  172.16.167.19  Visitor creates real account A on shop.example.com
01/01/2015 12:10  172.16.167.19  Visitor places order via account A
...               ...            ...
24/06/2015 03:56  172.16.167.19  Visitor creates fake account B on shop.example.com
24/06/2015 03:57  172.16.167.19  Visitor places fraudulent order via account B

Table 2.1: Identification based on IP address

With identification based on an IP address we can only say something about which internet connection was used to place the order. When it is traced back to the internet connection of a small household, the group of suspects will be relatively small, and it is possibly enough information to get it investigated by a local police station. But what if we trace it back to a big campus or company with hundreds or thousands of computers/laptops/smartphones which have visited shop.example.com? Identification based on IP address then probably results in tens of user accounts of our webshop. Ideally we want to go a step further and point out the device, and the user account, which was used to place the fraudulent order. Our access logs contain information about the cookies placed on the devices of the visitors while visiting our webshop. Matching this information with cookies found on a specific device can say something about the participation in the fraud case [ECL]. An example scenario is given in table 2.2.

Date              IP address     Cookie       Activity
01/01/2015 12:01  10.0.25.210    ppz8ZApxn..  Visitor creates real account A on shop.example.com via his personal laptop
01/01/2015 12:10  10.0.25.210    ppz8ZApxn..  Visitor places order via account A
...               ...            ...          ...
24/06/2015 03:56  172.16.167.19  ppz8ZApxn..  Visitor creates fake account B on shop.example.com via his personal laptop
24/06/2015 03:57  172.16.167.19  ppz8ZApxn..  Visitor places fraudulent order via account B

Table 2.2: Identification based on cookie information

Based on the scenario shown in table 2.2 we can relate multiple IP addresses to each other based on the cookie dropped on the device by shop.example.com. Cross referencing this information with the scenario described in table 2.1 can lead to the information needed to identify the fraudulent person.
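This cross-referencing step amounts to grouping log entries by cookie value and reporting cookies seen from more than one IP address. A minimal Python sketch of this idea, with hypothetical data shaped like tables 2.1 and 2.2:

```python
from collections import defaultdict

def related_ips_by_cookie(entries):
    """entries: iterable of (ip, cookie). Returns cookies seen from more
    than one IP address, with the sorted list of those IPs."""
    ips_per_cookie = defaultdict(set)
    for ip, cookie in entries:
        ips_per_cookie[cookie].add(ip)
    return {c: sorted(ips) for c, ips in ips_per_cookie.items() if len(ips) > 1}

entries = [
    ("10.0.25.210", "ppz8ZApxn"),    # real account created at home
    ("172.16.167.19", "ppz8ZApxn"),  # fraudulent order, same device elsewhere
    ("10.0.99.1", "qqx7YBqwm"),      # unrelated visitor
]
print(related_ips_by_cookie(entries))
# → {'ppz8ZApxn': ['10.0.25.210', '172.16.167.19']}
```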

2.2 Domain-Specific Languages

A domain-specific language, also called a DSL, is a small language focused on a certain problem domain. Most of them are declarative and can therefore be viewed as specification languages as well as programming languages. Van Deursen et al. [vDKV00] did a literature aggregation study on domain-specific languages and propose a definition for what a domain-specific language usually is.

1 All IP addresses used in this document are, for reasons of privacy, addresses from within the private network address spaces.


We quote their definition below.

A domain-specific language (DSL) is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain.


Chapter 3

A DSL to describe and detect security patterns

In this chapter we will present our DSL for describing security related patterns; furthermore, we will describe the architecture of our prototype and the structure of the request access logs used.

3.1 Language

In this work we present a simple, declarative, domain specific language [vDKV00], which we called Cool Security Language (CSL), for describing patterns in request access logs. The structure of our language is inspired by the NeMODe language presented in [SDBA11] and [SA10]. Instead of targeting network/packet level patterns, with CSL we focus on describing requests and responses, their properties and the relations between them, to describe a signature and finally detect it in the access logs. The main characteristics of CSL are the ease and the intuitive way in which patterns can be defined. CSL abstracts the way the access logs are searched and acts as a uniform language for searching across access logs. For this first prototype, searches have to be started manually, whereby a time frame has to be defined. The remainder of this section discusses the specification of CSL and ends with an example of a pattern described in CSL.

3.1.1 Specification

For the specification of the CSL language we were inspired by the NeMODe language described by Salgueiro et al. and the JSON data structure [Int]. In this section we will outline the structure and the possibilities of CSL.

With CSL we describe detectors, which can be configured to do exactly one thing: detect a specific pattern in access logs. The base setup for a detector is shown below in listing 3.1, whereby NAME can be replaced with a name/description for the detector.

detector "NAME" {}

Listing 3.1: Base Detector

The body of the detector consists of three types of elements: one or more request definitions, a find block and a result block. A request definition, of which an example is shown in listing 3.2, describes a specific request that is part of the pattern in which we are interested. In this example we are interested in all GET requests coming from remote address 10.0.167.19 with a response status code 200.


A = request {
  method: "GET"
  remoteAddress: "10.0.167.19"
} => response {
  status: 200
}

Listing 3.2: Example request definition

The definition consists of a request and a response part; request and response are keywords in CSL. Both the request and the response body consist of properties, each consisting of a key and a value. The keys are related 1:1 to the keys used in the access logs. More information about the access log structure we used can be found in section 3.3. For the value part, CSL provides three types: strings, numbers and regular expressions. Strings and numbers can be used for exact matches, whereas regular expressions can be used to search for partial matches. Examples are shown in listing 3.3.

method: "GET"                                 // String
remoteAddress: """(192|10)\.168\.167\.19"""   // Regex
status: 200                                   // Number

Listing 3.3: String, number and regex values
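The matching semantics of the three value types could be sketched as follows. This Python fragment is only an illustration of the intended behavior (the match_value helper is hypothetical, not part of the prototype): strings and numbers compare exactly, regular expressions match partially:

```python
import re

def match_value(expected, actual):
    """Exact match for strings/numbers, partial match for compiled regexes."""
    if isinstance(expected, re.Pattern):
        return expected.search(str(actual)) is not None
    return expected == actual

print(match_value("GET", "GET"))                                              # True
print(match_value(200, 200))                                                  # True
print(match_value(re.compile(r"(192|10)\.168\.167\.19"), "192.168.167.19"))   # True
print(match_value("GET", "POST"))                                             # False
```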

Multiple request definitions can be defined, as long as the name of each definition is unique. The request definitions can be placed anywhere in the detector body.

Besides request definitions, the detector body needs to contain exactly one find block. Within this find block everything related to the pattern to detect is configured, such as the pattern itself and the relationship between the request definitions. The period to search in can also be defined, as can the number of occurrences within a time frame. An example of a find block is shown in listing 3.4.

find {
  from: 01-01-2015
  till: 31-06-2015
  pattern {
    A -> B -> A -> C
  } with relation on {
    request.remoteAddress,
    request.url.host
  }
  times: 2
  within: 10 seconds
}

Listing 3.4: Example find block

The example shows the configuration for detecting a pattern consisting of a request definition A, followed by B, followed by A, followed by C, whereby the requests' remote addresses and host names are equal, or in other words, are related to each other. The period in which we are searching is the 1st of January 2015 till the 31st of June 2015. It is also possible to set only the from or the till date, to search only from or till a specific date. Defining the period is optional; when the period is not defined, all log information will be searched through. The last part of the find block describes the


number of occurrences of the pattern we are interested in. In this case the pattern has to be detected twice within 10 seconds before it will be included in the results.
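The times/within check can be understood as a window test over the timestamps of detected pattern occurrences. The following Python sketch is our own illustration of that semantics (the occurs_often_enough helper is hypothetical): it asks whether any `times` consecutive occurrences fall inside a `within`-second window:

```python
def occurs_often_enough(timestamps, times, within):
    """True if some `times` occurrences lie within a `within`-second window."""
    ts = sorted(timestamps)
    return any(ts[i + times - 1] - ts[i] <= within
               for i in range(len(ts) - times + 1))

print(occurs_often_enough([100, 104, 300], times=2, within=10))  # True (100 and 104)
print(occurs_often_enough([100, 300, 500], times=2, within=10))  # False
```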

Helper methods can be used to describe patterns more easily. The methods supported by CSL are in(), not(), repeat(), and the wildcards * and ?.

The in() method lets you create a pattern where a specific part can consist of more than one request definition; in(A, C, D), for example, matches request definition A, C or D. The not() method allows you to define a pattern where some part is known not to be a specific request definition; not(A, B), for example, matches every request except requests matching A or B. The repeat() method makes repetitive requests in a pattern cleaner and easier to write; repeat(C, 5), for example, is equal to C -> C -> C -> C -> C. Finally, the wildcard methods * and ? match all requests, respectively multiple times and one time.
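A sketch of how these helpers could be interpreted by a matcher. The representation of pattern elements (tuples for in()/not(), strings for wildcards) is our own illustration, not how the Scala prototype models them:

```python
def expand_repeat(name, n):
    """repeat(C, 5) is shorthand for C -> C -> C -> C -> C."""
    return [name] * n

def element_matches(element, request_name):
    """Match one pattern element against the name of the request
    definition a log entry matched. Elements are: a plain definition
    name, ('in', names), ('not', names), or the wildcards '*' / '?'."""
    if element in ('*', '?'):
        return True  # wildcards match any request
    if isinstance(element, tuple):
        op, names = element
        return request_name in names if op == 'in' else request_name not in names
    return element == request_name

# A -> repeat(C, 2) -> not(A, B)
pattern = ['A'] + expand_repeat('C', 2) + [('not', {'A', 'B'})]
print(element_matches(('in', {'A', 'C', 'D'}), 'C'))   # True
print(element_matches(('not', {'A', 'B'}), 'A'))       # False
```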

Finally, we are interested in the requests matching our pattern. The result block contains the information necessary for exporting the result set for further analysis. In the result block you can define which information fields of the request definitions you are interested in and how you would like the data set to be exported. In our prototype we have chosen to implement only the .csv format, but this can easily be extended with other formats. An example of a result block is shown in listing 3.5.

1 result {
2   csv {
3     request.timestamp,
4     request.remoteAddress,
5     request.cookie.PHPSESSID,
6     response.status
7   }
8 }

Listing 3.5: Result block

A detector with this result block defined will export a .csv file in which the first column is the fixed document identifier, followed by the columns request.timestamp, request.remoteAddress, request.cookie.PHPSESSID and response.status.
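The export logic can be sketched as follows, assuming the matched documents have already been flattened to a property-name to value mapping. The `export_csv` helper and its arguments are hypothetical, but the sketch mirrors the fixed document-identifier column described above:

```python
import csv
import io

def export_csv(rows, columns):
    """Write matched documents to CSV: a fixed document-identifier
    column first, then the configured property columns."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["document_id"] + columns)
    for doc_id, doc in rows:
        writer.writerow([doc_id] + [doc.get(c, "") for c in columns])
    return out.getvalue()

columns = ["request.timestamp", "request.remoteAddress", "response.status"]
rows = [("doc-1", {"request.timestamp": "2014-10-15T09:07:29Z",
                   "request.remoteAddress": "10.0.167.19",
                   "response.status": 200})]
print(export_csv(rows, columns))
```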

3.1.2

Example

An example of a fully described detector is shown in listing 3.6. The detector describes the situation where we are interested in visitors that add a specific product to their shopping cart without viewing the product page first; this can be an indication of an automated ordering attack. When the detector finishes, it will export the result set as a .csv file with the defined property columns. A detailed explanation is given after the CSL example.

1  detector "Automated Ordering" {
2
3    productPage = request {
4      url.uri: "/product/358243"
5    } => response {}
6
7    addProductToShoppingCart = request {
8      url.uri: "/shoppingcart/?add=358243"
9    } => response {}
10
11   shoppingCart = request {
12     url.uri: "/shoppingcart"
13   } => response {}
14
15   find {
16     from: 01-01-2015
17     pattern {
18       not(productPage) -> addProductToShoppingCart -> shoppingCart
19     } with relation on {
20       request.remoteAddress,
21       request.cookies.Coolblue-Session,
22       request.url.host
23     }
24     times: > 100
25     within: 2 minutes
26   }
27
28   result {
29     csv {
30       request.timestamp,
31       request.remoteAddress,
32       request.cookies.Coolblue-Session,
33       request.url.host,
34       request.url.uri,
35       response.status
36     }
37   }
38 }

Listing 3.6: Example detector (CSL)

On line 1, a new detector definition is started with the name Automated Ordering. This name is used for identification purposes and will be used as part of the file name when the result is exported. On lines 3-13 the request definitions which make up the pattern are defined, all three matching a specific uri. On lines 15-24 the find block is stated. In this find block we define on line 16 that we are interested in the period from the 1st of January 2015 onwards. On lines 17 and 18 the pattern is defined: we are interested in all the traffic which does not match the product page (productPage), followed by the request that adds the product to the shopping cart (addProductToShoppingCart), followed by the shopping cart page itself (shoppingCart). On lines 19-23 we define that there has to be a relation between the requests; the properties that have to be equal are the requests' remote-address, the session cookie Coolblue sets for all visitors, and the host name. On lines 24 and 25 we define that we are only interested in those results matching the pattern more than 100 times within a 2 minute time frame. Finally, on lines 28-37 we define which properties are of interest and need to be exported to the .csv file.

3.2

Application

3.2.1

Concerns

Standard search tools, such as the ELK stack 1 and command-line tools such as grep, are very good at searching for a specific event in log files at scale. This is enough when you are interested in whether or when a specific event happened. However, there are a lot of situations where this is not enough. Patterns which consist of multiple subsequent requests, for example, are hard to find with these tools: you have to manually create and execute search queries, and create and execute follow-up queries based on the previous result sets. With CSL and the prototype application developed, we have created a language which abstracts the creation of these search queries. By describing a pattern once, all the subsequent search queries are created and executed automatically.


3.2.2

Implementation

The application consists of multiple components; their relations are visualized in figure 3.1. Each of the components is written in Scala [Sca], and the source code can be found in our GitHub repository [CSL]. In this section we describe the responsibilities of each component and how they are implemented. During this work we have used elasticsearch [ela] as our main data-store, in which the access logs are centrally indexed.

Figure 3.1: Schematic view of the application components.

Interpreter

The interpreter acts as entry point for the application. It is responsible for combining all the different components together to form the application.

Parser

The parser is responsible for parsing the CSL pattern description into a detector abstract syntax tree (AST). This detector AST is used as input for the other components of the application.

Request Definition Collector

The pattern described in CSL consists of one or more request definition elements that are related to each other via properties. The request definition collector collects all occurrences matching a request definition from elasticsearch and saves this collection to temporary MySQL storage, where it awaits further analysis. Suppose, for example, that we are looking for the pattern A -> B -> A.


1. The collector filters the unique request definitions from the pattern and gets their information from the detector AST; in this case A and B.

2. The collector gets the definition descriptions from the detector AST and starts collecting all information from the access logs matching A and B independently.

3. The resulting matches are stored temporarily in a MySQL database.
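The three steps above can be sketched as follows, using an in-memory SQLite database as a stand-in for the temporary MySQL storage. The table layout and helper names are our own illustration, not the prototype's schema:

```python
import sqlite3

# Stand-in for the temporary MySQL storage: each document that matched a
# request definition is stored, keyed by the definition name.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE matches (
    definition TEXT, doc_id TEXT, timestamp TEXT, remote_address TEXT)""")

def store_matches(definition, docs):
    """Step 3: persist the documents collected for one definition."""
    conn.executemany(
        "INSERT INTO matches VALUES (?, ?, ?, ?)",
        [(definition, d["id"], d["@timestamp"], d["remoteAddress"]) for d in docs])

# Steps 1-2: collect documents for the unique definitions A and B
# (here represented by two hand-made documents instead of ES queries).
store_matches("A", [{"id": "1", "@timestamp": "t1", "remoteAddress": "10.0.0.1"}])
store_matches("B", [{"id": "2", "@timestamp": "t2", "remoteAddress": "10.0.0.1"}])
count = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
print(count)  # 2
```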

As soon as all necessary information is collected from the access logs, control returns to the interpreter.

Relation Collector

With the request definition collector we have collected all documents matching exactly the requests we are interested in. However, to check whether a specific path is followed, we also need all the documents related to the matched documents; in this way we can match the pattern exactly on the collection of documents for each relation. The relation collector takes care of extending the temporary result set with the related log information.

Filter Query Generator

The request definition collector and the relation collector both execute queries on our access log data-set based on the request properties defined in the CSL description. The filter query generator generates, as the name implies, an elasticsearch filter query to narrow down the data-set efficiently.

Pattern Detector

The pattern detector tries to detect the pattern defined in CSL in the temporary data-set. For each unique relation the documents are collected and sorted by request time. Thereafter, each document is checked against the defined pattern we are interested in.
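A simplified sketch of this detection step in Python: group documents by their relation key, sort each group by request time, and check whether the sequence of matched definitions contains the pattern. The prototype's actual matching, which also handles wildcards and time windows, is more involved; the `detect` function below is our own illustration:

```python
from itertools import groupby
from operator import itemgetter

def detect(docs, pattern):
    """Report relation keys whose time-ordered sequence of matched
    request definitions contains `pattern` as a contiguous run."""
    hits = []
    docs = sorted(docs, key=itemgetter("relation", "timestamp"))
    for relation, group in groupby(docs, key=itemgetter("relation")):
        seq = [d["definition"] for d in group]
        for i in range(len(seq) - len(pattern) + 1):
            if seq[i:i + len(pattern)] == pattern:
                hits.append(relation)
                break
    return hits

docs = [
    {"relation": "10.0.0.1", "timestamp": 1, "definition": "A"},
    {"relation": "10.0.0.1", "timestamp": 2, "definition": "B"},
    {"relation": "10.0.0.1", "timestamp": 3, "definition": "A"},
    {"relation": "10.0.0.2", "timestamp": 1, "definition": "B"},
]
print(detect(docs, ["A", "B", "A"]))  # ['10.0.0.1']
```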

Result

The resulting data set is exported in the format defined in CSL, including the properties defined in CSL. In this prototype only the .csv export format is supported, but this can easily be extended with other formats. Based on the information in the export file, further analysis and reporting can be done via applications of choice.

3.3

Access Logs

The access logs we have used during this work are an aggregation of eight web-servers. The logs of each of the servers are processed via logstash, a log processing tool, and indexed into an elasticsearch cluster. The elasticsearch cluster is our main source of data and the point to which our application connects. An example entry from the access logs is shown in listing 3.7. The values in the elasticsearch indices are not analyzed, which allows us to match on the whole value.

Each log entry is a JSON document consisting of three fields that are mandatory for our application, namely request, response and @timestamp. The request and response fields are objects which consist of properties, such as cookie and URL information. The @timestamp field is used for sorting the log information so that our pattern detector works. Further properties can be added or removed as desired. Field names used in CSL are mapped one-to-one to the structure of the access logs.
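This one-to-one mapping means a CSL property name can be resolved against a log entry by walking the nested JSON structure. A minimal Python sketch; the `lookup` helper is our own illustration:

```python
def lookup(document, path):
    """Resolve a CSL property name such as 'request.url.host' against
    the nested JSON structure of a log entry."""
    value = document
    for key in path.split("."):
        value = value[key]
    return value

entry = {
    "request": {"url": {"host": "www.boormachinestore.nl"},
                "remoteAddress": "10.0.167.19"},
    "response": {"status": 200},
    "@timestamp": "2014-10-15T09:07:29.377Z",
}
print(lookup(entry, "request.url.host"))   # www.boormachinestore.nl
print(lookup(entry, "response.status"))    # 200
```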


{
  "request": {
    "timestamp": "2014-10-15T09:07:29+02:00",
    "method": "GET",
    "cookies": {
      "SSID": "BwB9lx1GAAAAAAAgHT5UGekSASAdPlQBAAAAAA...",
      "SSLB": "1",
      "_gat": "1",
      "_gat_allshops": "1",
      "SSOD": "AJVTAAAAMAAKDAAAAQAAACkdPlQpHT5UUw0AAAEAAAApHT...",
      "Coolblue-Session": "5dc2773a4104ba41b8d01eb24265b802",
      "SSSC": "1.G6070321371036182809.1|95.3161:119.3640:190.5595:202.5814:203.5873",
      "SSRT": "KR0-VAIDAQ",
      "PHPSESSID": "mrbcgcghs2kd31l724tokq8ua3",
      "_ga": "GA1.2.771225263.1413356833"
    },
    "url": {
      "host": "www.boormachinestore.nl",
      "uri": "/category/196142/bosch-boormachines.html",
      "query": {}
    },
    "protocol": "HTTP/1.1",
    "headers": {
      "x-forwarded-host": "www.boormachinestore.nl",
      "accept-language": "en-US,en;q=0.8,nl-NL;q=0.5,nl;q=0.3",
      "connection": "Keep-Alive",
      "accept": "text/html, application/xhtml+xml, */*",
      "referer": "http://www.google.nl/url?sa=t&rct=j...",
      "host": "www.boormachinestore.nl",
      "x-forwarded-server": "www.laptopshop.nl",
      "sitespect": "1-1249",
      "x-forwarded-for": "10.0.167.19, 192.168.162.52, 192.168.162.197",
      "cookie": "SSLB=1; SSID=BwB9lx1GAAAAAAAgHT5UGekSASAdPlQBAAAAA...",
      "dnt": "1",
      "user-agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; MAMIJS; rv:11.0) like Gecko"
    },
    "remoteAddress": "10.0.167.19"
  },
  "response": {
    "status": 200,
    "bytesSent": 94888,
    "processingTime": 0.496,
    "headers": {
      "pragma": "no-cache",
      "expires": "Thu, 19 Nov 1981 08:52:00 GMT",
      "content-type": "text/html; charset=iso-8859-1",
      "x-frame-options": "SAMEORIGIN",
      "cache-control": "no-store, no-cache, must-revalidate, post-check=0, pre-check=0"
    }
  },
  "@timestamp": "2014-10-15T09:07:29.377Z"
}

Listing 3.7: Example access log entry


Chapter 4

Research Method

In this section we will describe the research method used to formulate an answer to research questions 2 and 3. Experiments will be done according to multiple scenarios, with measurements related to these scenarios. In this section more information is given about the scenarios and the measurements that are done. A description of the test setup is also given.

4.1

Scenarios

To formulate an answer to research question 3 we have to test which patterns can and cannot be described in the CSL language, and finally which patterns can and cannot be found in our data-set. In this section multiple scenarios are worked out which are used in our experiments to answer this question.

4.1.1

Scenario 1: Automated ordering

Automated attacks can be related to the number of requests in a certain amount of time. Automated ordering, by which we mean putting products in the shopping cart in an automated way instead of manually, is an example of such an attack. This scenario can possibly be detected by defining a pattern where the visitor has not visited the product page, followed by the request necessary to add a product to the shopping cart, and finally a request for the shopping cart page. This behaviour is not necessarily fraud, as we stated earlier in the background information section. However, it can be used to commit fraud, and therefore it is good to know if, when and from where these activities are happening. With this information specific orders can be monitored in more detail to check whether they are fraudulent or not.

4.1.2

Scenario 2: Vulnerability scanners

A lot of commercial tools are available today for scanning websites for vulnerabilities, e.g. Acunetix 1 and Nessus 2. The goal of these tools is to enable a developer or a company to check whether their systems and applications are safe. However, these tools are also used by third parties to check your website, whether it is an organization doing research on e-commerce application safety or a fraudster looking for vulnerabilities to misuse. In both cases it is good to know when such scans are taking place and from where. Most of these tools send identifiable information with each request by default, by which they can be detected; think, for example, of custom headers. Most tools, however, support changing this information, which makes them harder to detect. Other properties that can be used are attack signatures, for example those described by OWASP [OWA]. The number of requests in some period for a specific session can also give an indication, as these tools fire requests in a fast and automated way. For our experiment we will use this last property.

1www.acunetix.com


4.1.3

Scenario 3: Cookies used on a specific IP address

Fraudulent cases in e-commerce usually begin with a suspicious order being detected. An order usually includes the IP address used while placing the order. This IP address is a property on which an investigation can be started. As stated earlier, an IP address can lead to a physical address but not to the specific device used; with a cookie we can identify the device. In this scenario we will try to find the cookies used while placing the fraudulent order in our webshop, and will try to establish a connection with other IP addresses the device may have been used on.

4.1.4

Scenario 4: IP addresses using a specific cookie

In this scenario we start with a device, discovered during a police investigation, that was used to commit fraud in our webshop. Based on the cookie information found on this device, we try to find the IP addresses the device was used on, and which orders can be related to this device and thus to the fraud case.

4.1.5

Scenario 5: Related IP addresses to a specific IP address

Fraudulent cases in e-commerce usually begin with a suspicious order being detected, as already stated earlier. Sometimes it is an individual, other times it is a group of people working together to commit fraud. In scenario 3 we are interested in the cookies used on a specific IP address, and in scenario 4 in the IP addresses using a specific cookie. In this scenario we try to combine these two to get a list of IP addresses related to a specific IP address. This happens, for example, when a specific device is used on multiple locations/internet connections.

4.2

Measurements

To formulate an answer to research question 2, some measurements will be made based on the CSL descriptions created for the scenarios described in section 4.1. The measurements are related to the speed and ease of setting up a pattern definition in CSL, e.g. the number of characters in each detector definition. Some measurements are also done relating the size of the data-set to the time necessary for the detector to complete. More information about the measurements is given in the following sections.

4.2.1

Detector definition size

Speed and ease are related to the amount of time necessary to describe a pattern: the more characters a definition consists of, the more time it takes. For each of the scenarios in section 4.1 the total number of characters each definition consists of will be measured. For each detector definition three measurements will be made.

1. The total number of characters the detector definition consists of.
2. The number of characters related to the CSL structure.
3. The number of characters used to define the pattern without the CSL structure.

The results of these measurements will be outlined in graphs.

4.2.2

Completion time

The time the detector requires to detect a specific pattern is very dependent on the pattern description to search for. Searching for a pattern which occurs frequently in the data-set will, we expect, require more time to complete than searching for a rare pattern. The size of the pattern will also influence the time required to complete; for example, a pattern consisting of three rare unique parts will require less time than a pattern consisting of ten unique and very common parts. For each of the scenarios described in section 4.1 we will record the time required for each step in our application, along with other metrics such as the size of the data-set, the number of matches for each part of the pattern, and the total number of documents matching the full pattern.

The results will be outlined in tables combined with charts to make them insightful.

4.2.3

Disk usage

Our application uses a temporary MySQL database for storing the documents related to the pattern defined in our detector. The amount of data stored in this temporary database is related to the number of matches, and thus to the scenario we are looking for. For each scenario we will measure the maximum size of the temporary database during each application run. This information gives an idea of how much disk space is used and whether it is possible to run the application from a certain device for a specific scenario.

4.3

Test Setup

Two test setups are used for the experiments with the CSL application. The first setup is a single node which contains the application, including the database, and an elasticsearch index with a limited log data-set. The second setup consists of multiple nodes: one node which contains the application and database, and an elasticsearch cluster which holds the log information of the last 10 days for all the web-servers. More information about the test setups is given in the following sections.

4.3.1

Local setup

In the local test setup the experiments are run on a single machine which contains the application itself, the database to store the temporary results in, and the elasticsearch index. The elasticsearch index contains the log information of a single day from a single server. The data used during these experiments is a limited, fixed data-set compared to the cluster setup. In figure 4.1 a graphical visualization of the local setup is shown. Specifications of the machine are given in table 4.1.

Figure 4.1: Local test setup.

Local machine

MacBook Pro (2012), OS X Yosemite 10.10.4
16.00 GB RAM
2.3 GHz Intel Core i7

Table 4.1: Machine specification running the CSL application

4.3.2

Cluster setup

In the cluster setup we will use a local machine to run our CSL prototype application. The application connects to our elasticsearch cluster in the data center containing the log information data-set. The log information currently has a retention time of 10 days. The elasticsearch cluster consists of four data nodes, of which two store their data on HDD and two on SSD, and three master nodes, and is accessed through a HAproxy3 load balancer. A graphical visualisation of the test setup is shown in figure 4.2. Specifications of the local machine are shown in table 4.2. The specifications of the elasticsearch cluster are shown in tables 4.3 and 4.4.

Figure 4.2: Cluster test setup.


Local machine

64-bit Windows 7 Professional
8.00 GB RAM
Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz

Table 4.2: Machine specification running the CSL application

3 x Elasticsearch Master nodes
64-bit CentOS 6.6
8.00 GB RAM
Intel(R) Xeon(R) CPU E5-2650L v2 @ 1.70GHz
1 core
VM on VMware

Table 4.3: Elasticsearch master node specifications

2 x Elasticsearch Data nodes            2 x Elasticsearch Data nodes
64-bit CentOS 6.6                       64-bit CentOS 6.6
72 GB RAM                               64 GB RAM
Intel(R) Xeon(R) CPU X5680 @ 3.33GHz    Intel(R) Xeon(R) CPU E5-2430L 0 @ 2.00GHz
24 cores                                12 cores
Storage on HDD                          Storage on SSD

Table 4.4: Elasticsearch data node specifications


Chapter 5

Experimental Results

5.1

Scenarios

In this section the CSL pattern descriptions used in our experiments are shown, along with an explanation of each of them.

5.1.1

Scenario 1: Automated ordering

In section 4.1.1 we described a scenario of order automation on the webshop of Coolblue. Listing 5.1 shows the CSL definition for this scenario.

1  detector "Automated Ordering" {
2
3    productPage = request {
4      url.uri.raw: """/product/[0-9]+/.*"""
5    } => response {}
6
7    ourAssortment = request {
8      url.uri.raw: "/ons-assortiment"
9    } => response {}
10
11   addProductToShoppingCart = request {
12     url.uri.raw: """/winkelmandje\?add=[0-9]+"""
13   } => response {}
14
15   find {
16     from: 01-01-2015
17     pattern {
18       not(productPage) -> ourAssortment -> addProductToShoppingCart
19     } with relation on {
20       request.remoteAddress.raw
21     }
22     times: >3
23     interval: 20 seconds
24   }
25
26   result {
27     csv {
28       request.timestamp,
29       request.remoteAddress,
30       request.cookies.Coolblue-Session,
31       request.cookies.PHPSESSID,
32       request.url.host,
33       request.url.uri
34     }
35   }
36 }

Listing 5.1: Scenario 1: Automated Ordering (CSL)

On line 1, a new detector definition is started with the name Automated Ordering. This name is used for identification purposes and will be used as part of the file name when the result is exported. On lines 3-13 the request definitions which make up the pattern are defined, all three matching a specific uri. On lines 15-24 the find block is stated. In this find block we define on line 16 that we are interested in the period from the 1st of January 2015 onwards. On lines 17 and 18 the pattern is defined: we are interested in all the traffic which does not match the product page (productPage), followed by the assortment page (ourAssortment), followed by the request to add a product to the shopping cart (addProductToShoppingCart). On lines 19-21 we define that there has to be a relation between the requests; the property that has to be equal is the requests' remote-address. On lines 22 and 23 we define that we are only interested in those results matching the pattern more than 3 times within a 20 second time frame. Finally, on lines 26-35 we define which properties are of interest and need to be exported to the .csv file.

5.1.2

Scenario 2: Vulnerability scanners

In section 4.1.2 we described a scenario where tools are used to scan our website for vulnerabilities. Multiple properties can be used to detect these tools; for this scenario we will make use of the number of requests in a session. Listing 5.2 shows the CSL definition for this scenario.

1  detector "Vulnerability Scanners" {
2
3    someRequest = request {} => response {}
4
5    find {
6      pattern {
7        someRequest
8      } with relation on {
9        request.remoteAddress
10     }
11     times: >100
12     interval: 10 seconds
13   }
14
15   result {
16     csv {
17       request.timestamp,
18       request.remoteAddress,
19       request.cookies.Coolblue-Session,
20       request.cookies.PHPSESSID,
21       request.url.host,
22       request.url.uri
23     }
24   }
25 }

Listing 5.2: Scenario 2: Vulnerability Scanners (CSL)


On line 1, a new detector definition is started with the name Vulnerability Scanners. This name is used for identification purposes and will be used as part of the file name when the result is exported. On line 3 the request definition which makes up the pattern is defined; in this case the definition matches all traffic. On lines 5-13 the find block is stated. In this find block, on line 7, the pattern is defined. In this case we are not interested in traffic matching specific properties but only in the amount of traffic. For counting the traffic we want to group it by remote-address, so we have defined a relation on this property. On lines 11 and 12 we define that we are interested in all traffic for which the pattern matches more than 100 times within 10 seconds. Finally, on lines 15-24 we define which properties are of interest and need to be exported to the .csv file, in this scenario the cookie information.

5.1.3

Scenario 3: Cookies used on a specific IP address

In section 4.1.3 we described a scenario where we have found an IP address for a suspicious order in our ordering systems and want to know which cookies are placed on the device(s) used from this IP address.1 Listing 5.3 shows the CSL definition for this scenario.

1  detector "Cookies used on a specific IP address" {
2
3    ipAddressSuspect = request {
4      remoteAddress.raw: "10.0.167.63"
5    } => response {}
6
7    find {
8      from: 01-01-2015
9      till: 31-08-2015
10     pattern {
11       ipAddressSuspect
12     } with relation on {}
13   }
14
15   result {
16     csv {
17       request.timestamp,
18       request.remoteAddress,
19       request.cookies.Coolblue-Session,
20       request.cookies.PHPSESSID,
21       request.url.host,
22       request.url.uri
23     }
24   }
25 }

Listing 5.3: Scenario 3: Cookies used on a specific IP address (CSL)

On line 1, a new detector definition is started with the name Cookies used on a specific IP address. This name is used for identification purposes and will be used as part of the file name when the result is exported. On lines 3-5 the request definition which makes up the pattern is defined; in this case the definition matches all traffic from the specific IP address. On lines 7-13 the find block is stated. In this find block we define on lines 8 and 9 that we are interested in the period from the 1st of January 2015 till the 31st of August 2015. On line 11 the pattern is defined: we are interested in all the traffic matching the IP address of the suspicious order

1All IP addresses used in this document are, for reasons of privacy, addresses from within the private network address spaces. During the experiment real IP addresses are used.


(ipAddressSuspect). A relation is not defined, as we are interested only in the documents from the specific IP address. Finally, on lines 15-24 we define which properties are of interest and need to be exported to the .csv file, in this scenario the cookie information.

5.1.4

Scenario 4: IP addresses using a specific cookie

In section 4.1.4 we described a scenario where we have found cookie information during a police investigation and are interested in the IP addresses the cookie is used on. Listing 5.4 shows the CSL definition for this scenario.

1  detector "IP addresses using a specific cookie" {
2
3    cookieSuspect = request {
4      cookies.Coolblue-Session.raw: "375fd242a5a76b401b187705fc2477f6"
5    } => response {}
6
7    find {
8      pattern {
9        cookieSuspect
10     } with relation on {}
11   }
12
13   result {
14     csv {
15       request.timestamp,
16       request.remoteAddress,
17       request.cookies.Coolblue-Session,
18       request.cookies.PHPSESSID,
19       request.url.host,
20       request.url.uri
21     }
22   }
23 }

Listing 5.4: Scenario 4: IP addresses using a specific cookie (CSL)

On line 1, a new detector definition is started with the name IP addresses using a specific cookie. This name is used for identification purposes and will be used as part of the file name when the result is exported. On lines 3-5 the request definition which makes up the pattern is defined; in this case the definition matches all traffic that uses the specific cookie. On lines 7-11 the find block is stated. On line 9 the pattern is defined: we are interested in all the traffic matching the specific cookie found during the police investigation (cookieSuspect). A relation is not defined, as we are interested only in the documents matching the specific cookie. Finally, on lines 13-22 we define which properties are of interest and need to be exported to the .csv file, in this scenario the remote-address information.

5.1.5

Scenario 5: Related IP addresses to a specific IP address

In section 4.1.5 we described a scenario where we have found an IP address for a suspicious order in our ordering systems and want to know which IP addresses are related to this IP address. Listing 5.5 shows the CSL definition for this scenario.

1  detector "Related IP addresses to a specific IP address" {
2
3    ipAddressSuspect = request {
4      remoteAddress.raw: "83.128.124.67"
5    } => response {}
6
7    find {
8      pattern {
9        ipAddressSuspect
10     } with relation on {
11       request.cookies.Coolblue-Session.raw
12     }
13   }
14
15   result {
16     csv {
17       request.timestamp,
18       request.remoteAddress,
19       request.cookies.Coolblue-Session,
20       request.cookies.PHPSESSID,
21       request.url.host,
22       request.url.uri
23     }
24   }
25 }

Listing 5.5: Scenario 5: Related IP addresses to a specific IP address (CSL)

On line 1, a new detector definition is started with the name Related IP addresses to a specific IP address. This name is used for identification purposes and will be used as part of the file name when the result is exported. On lines 3-5 the request definition which makes up the pattern is defined; in this case the definition matches all traffic from the specific IP address. On lines 7-13 the find block is stated. On line 9 the pattern is defined: we are interested in all the traffic matching the IP address of the suspicious order (ipAddressSuspect). On line 11 we define the relation we are interested in; in this case all traffic which is related based on the cookie information is also returned. Finally, on lines 15-24 we define which properties are of interest and need to be exported to the .csv file, in this scenario the IP address and cookie information.

5.2

Measurements

During the experiments we did several measurements, which can be divided into three types. The first is the number of characters necessary to describe the scenarios in our language CSL, which gives an idea of the amount of work involved. The second is the time the detector takes to scan the log data for the specified pattern in each test setup. The third is the amount of disk space the application needs during the detection of a pattern in each test setup.

5.2.1

Detector definition size

To get an idea of how much work it takes to describe a scenario in our language CSL, we have measured the number of characters used to describe the scenarios. Three measurements are done for each scenario definition, namely:

1. The total number of characters the detector definition consists of.
2. The number of characters related to the CSL structure.
3. The number of characters used to define the pattern without the CSL structure.

The results of these measurements are shown in figure 5.1.


Figure 5.1: Measurement results for each scenario.

5.2.2

Completion time

To get an idea of how long it takes to search for a specific scenario within the log data-set, we timed our application. Several parts are measured, including:

1. The number of documents within the log data-set.
2. The number of matches for each part of the pattern description.
3. The time necessary to retrieve the documents matching the pattern parts.
4. The time it takes to retrieve all the related documents.
5. The time necessary to find the documents matching the whole pattern.
6. And finally, the total time needed to complete.

The time measurements for both setups are given in figures 5.2 and 5.3.

Local setup

Figure 5.2: Time measurements local setup.

Cluster setup

Figure 5.3: Time measurements cluster setup.

5.2.3

Disk usage

Because of the large data-set we search through, we have measured the disk usage of the CSL application during detection of a certain scenario. These measurements give us an idea of the disk space needed to execute a specific query, and of the scenarios for which we have to optimize the application. In figures 5.4 and 5.5 the measurement results for both setups are given. For the first and second scenario, which failed during execution, the disk usage at the time of failure is shown.

Local setup

Figure 5.4: Disk usage local setup.

Cluster setup


Chapter 6

Discussion

6.1 Scenarios

For the experiments we implemented the scenarios discussed in section 4.1. All of the scenarios could be expressed in the CSL language, although for some of them improvements can be made. In this section we discuss our findings during the execution of the experiments.

While defining the implementation for scenario 1 we noticed a few difficulties. First, we found that the pattern had to include a request that cannot be observed by simply browsing the website: the ourAssortment request definition. We discovered this request only after manually analyzing part of the log files. The log files also contain requests made in the background, which makes it hard(er) to define a pattern by just browsing the website.

Another thing we noticed is that logstash, responsible for aggregating the log files and putting them into elasticsearch indices, creates an analyzed index and adds another field which stores the raw, not analyzed, data. In our prototype we simply configured the elasticsearch index to be not analyzed, so it matched our use case. For the experiments to work we had to use the raw data fields, which can be accessed by appending .raw to each request description property.

An issue that cannot be expressed in the CSL language is sharing property values between request definitions. An example can be found in the product ids: in the current implementation we used a regex for the request definitions productPage and addProductPageToShoppingCart, but ideally these ids have to be identical before the pattern matches. While defining the other scenarios we did not encounter any difficulties.
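The .raw workaround can be illustrated with a small sketch. The field name (request) and query shape below are assumptions for illustration, not the prototype's actual code; the helper simply targets the not-analyzed sub-field that logstash adds alongside each analyzed field.

```python
# Hypothetical sketch: build a term query against the not-analyzed ".raw"
# sub-field instead of the analyzed field, so exact matching works.
def raw_term_query(field, value):
    """Return an elasticsearch-style term query on the raw sub-field."""
    return {"query": {"term": {field + ".raw": value}}}

# Exact match on the full request path, which would fail against the
# analyzed (tokenized) field but succeeds against the raw one.
query = raw_term_query("request", "/ourAssortment")
```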

6.2 Measurements

During the experiments we took measurements for five different scenarios implemented in the CSL language. We discovered that not all of the scenarios could be successfully executed by our prototype application and setups. While executing scenarios 1 and 2 we got multiple exceptions, all of them related to memory or to problems while querying elasticsearch. After analyzing the problem we found that the way we implemented the execution of the queries against elasticsearch is the probable cause of these exceptions. The application simply loops over all the queries and executes them asynchronously. This works fine with a very limited data-set, but in the experiment setups with more realistic data it resulted in so many requests to elasticsearch that even the elasticsearch query queue could no longer handle them.
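One way to avoid flooding the query queue is to bound the number of queries in flight. The sketch below is not the prototype's code; run_query is a hypothetical stand-in for the real elasticsearch call, and the semaphore caps the concurrency instead of firing all queries at once.

```python
import asyncio

async def run_query(query):
    """Stand-in for the actual elasticsearch round trip."""
    await asyncio.sleep(0)
    return {"query": query, "hits": []}

async def run_all(queries, max_concurrent=10):
    """Execute all queries asynchronously, but at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(q):
        async with sem:  # blocks while max_concurrent queries are in flight
            return await run_query(q)

    return await asyncio.gather(*(bounded(q) for q in queries))

results = asyncio.run(run_all([{"term": {"ip": "1.2.3.4"}}] * 100, max_concurrent=8))
```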

Although we encountered exceptions for the first two scenarios, we did obtain some measurements, which are discussed in the remainder of this section.

6.2.1 Detector definition size

For each of the scenarios we counted the number of characters and divided them into two categories: characters related to the CSL structure and characters related to the pattern description. We noted that the number of characters used to give variables a descriptive name has a large effect on the total number of characters in each scenario description. The complexity of a scenario affects the number of structure characters, although this effect is relatively small compared to the total number of characters.

6.2.2 Completion time

Based on the number of measurements we obtained, it is hard to draw a valid conclusion about the completion time of the detector. Two of the experiments failed due to memory issues and therefore did not complete; the other experiments had a much smaller result set and completed within a few seconds. Reasoning about the application, we can say that the number of documents matching the request definitions has a large effect on the completion time of the detector: all matching documents have to be retrieved by the application, so the more documents match, the longer it takes. The number of unique relationships detected, and the number of documents matching these relationships, also affect the duration.

6.2.3 Disk usage

The amount of disk space needed during detector execution mainly depends on the number of matching documents. All documents matching the request definitions are stored locally on disk and therefore have a large effect on disk usage. For patterns consisting of very rare request definitions, elasticsearch is probably a good choice of data store: it is very fast in searching for such definitions and disk usage remains limited. However, when we try to find patterns consisting of request definitions that are not that rare, we soon end up retrieving all, or a large part, of the documents from elasticsearch and storing them locally before analyzing them. In our situation this already amounts to about 80 gigabytes for each day we search through.
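This can be made concrete with a back-of-envelope calculation. The document count and average document size below are illustrative assumptions, chosen so that the total lands in the order of the roughly 80 gigabytes per day mentioned above.

```python
# Rough disk usage estimate: if a pattern's request definitions match most
# documents of a day, the prototype stores them all locally before analysis.
def disk_usage_gb(matching_docs, avg_doc_bytes):
    """Convert a matching-document count and average size to gigabytes."""
    return matching_docs * avg_doc_bytes / 1024 ** 3

# Assumed: 80 million matching log documents of about 1 KiB each.
estimate = disk_usage_gb(matching_docs=80_000_000, avg_doc_bytes=1024)
```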


Chapter 7

Conclusion

In this work we presented a domain-specific language, CSL, for detecting patterns in web server request access logs stored in elasticsearch. The experiments in this work show that it is possible to create a uniform language to describe patterns in an easy, efficient and reusable way. For all of the scenarios used during the experiments we were able to define a CSL detector description.

The prototype application, with which we wanted to show the possibilities of the CSL language, revealed some flaws during the experiments. When querying a small data-set everything worked fine, but when querying a bigger data-set, like the production one, memory-related exceptions were thrown. Reasoning about the implementation of the prototype, we can conclude that these memory problems are related to the way we implemented the execution of the search queries against elasticsearch.

Based on the disk usage measurements taken during the experiments, we conclude that elasticsearch as a data store is not the ideal candidate for pattern detection in all scenarios. Patterns consisting of very rare definitions are detected quickly, without using much disk space. More general patterns, however, require a large number of matching documents to be retrieved and stored, and therefore use a lot of disk space. For these scenarios a better data store can probably be found.

7.1 Research Questions &amp; Answers

7.1.1 How to define security related patterns in web request access logs?

Two main categories of patterns can be found in request access logs, namely single-request and multi-request patterns. Single-request patterns are patterns as described in the OWASP Top Ten, which includes cross-site scripting and various injection and traversal attacks. These attacks can target every data input exchanged between client and server, including cookies and sessions. Multi-request patterns focus more on sequences of requests and on the more automated attacks. When defining patterns to detect these kinds of attacks, encoding issues have to be taken into account.

Attacks carried in POST parameters cannot be detected because the POST bodies are absent from the request log files used in this work.
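As an illustration of single-request pattern detection, the sketch below applies two simplified signatures, one for directory traversal and one for reflected cross-site scripting, to a logged request line. These signatures are illustrative only; they are not the detection rules of CSL or of the OWASP Top Ten.

```python
import re

# Simplified single-request attack signatures (illustrative, not exhaustive).
SIGNATURES = {
    "traversal": re.compile(r"\.\./"),                 # "../" sequences in the path
    "xss":       re.compile(r"<script\b", re.IGNORECASE),  # injected script tags
}

def matched_signatures(request_line):
    """Return the names of all signatures matching one logged request line."""
    return [name for name, rx in SIGNATURES.items() if rx.search(request_line)]
```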

7.1.2 How can we define an efficient DSL for expressing security related patterns in web request access logs?

The HTTP request format is standardized in RFCs [htt]. The terminology used in these RFCs is adopted in CSL as much as possible to create a familiar interface. The JSON data structure standard, combined with the NeMODe language [SA10], acts as the basic structure of our CSL language. By combining known terminology, data structures, and the English language, we have created an easy and efficient DSL for describing security patterns.


7.1.3 Can the patterns, expressed in our DSL, be detected in our web request access logs?

The patterns described in CSL and interpreted by our prototype application can be detected in a limited data-set. The application handles only exact matches; by using wildcards, some flexibility is given in defining patterns.
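The wildcard flexibility can be sketched with shell-style matching. Treating a pattern property value as an fnmatch-style pattern is an assumption for illustration, not necessarily how the prototype implements wildcards.

```python
from fnmatch import fnmatch

def property_matches(pattern_value, actual_value):
    """Exact match unless the pattern contains shell-style wildcards ("*", "?")."""
    return fnmatch(actual_value, pattern_value)

# A wildcard in the path matches any product id, while other paths are rejected.
assert property_matches("/product/*", "/product/12345")
assert not property_matches("/product/*", "/cart")
```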
