
The Simpler

The Better


Layout: typeset by the author using LaTeX.


The Simpler

The Better

towards a non-parametric anomaly detector

based on Simplicity Theory

Darius Salvador Barsony 11234342

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Giovanni Sileno

Complex Cyber Infrastructure – Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Contents

0.1 Abstract
1 Introduction
  1.1 Introduction
2 Theoretical Framework
  2.1 Data
  2.2 Simplicity Theory
    2.2.1 Mental operators
    2.2.2 Conceptual spaces through contrast
    2.2.3 Description complexity of numbers
    2.2.4 Finding the best prototype
    2.2.5 Marginal description complexity
    2.2.6 Generation complexity
    2.2.7 Anomaly detection machine
3 Method & approach
  3.1 Data
  3.2 Program
    3.2.1 From contrast to normalized Manhattan distance
    3.2.2 Incremental prototype construction
    3.2.3 Simplicity Theory
  3.3 Testing
4 Results
  4.1 Results
5 Conclusion
  5.1 Conclusion


0.1 Abstract

This research project is aimed at creating a tool for anomaly detection based on Simplicity Theory. One of the main concepts of Simplicity Theory is unexpectedness, which is central to this project because, by the definitions of the theory, anomalous exemplars generate unexpectedness. In order to compute the unexpectedness of an exemplar, the model needs conceptual representations to operate on. These conceptual representations are embodied by prototypes, and it is important to construct such prototypes dynamically, updating them as more and more exemplars are presented to the program. Most of the research in this project has been devoted to this particular aspect of the problem. A tool for the creation of prototypes has been developed, and in a one-dimensional space the desired outcomes of a number of tests have been met.


Chapter 1

Introduction

1.1 Introduction

In machine learning, the detection of data points that are not normal has always been of great interest. This problem is more commonly referred to as anomaly detection, and it can be defined as the process of identifying unexpected items or events in datasets. In some cases, density-based algorithms such as K-Nearest Neighbors play a role in anomaly detection problems [1].

Simplicity Theory is a cognitive theory that seeks to explain the relevance of situations or events to the human mind. Unexpectedness is one of the two main components that constitute relevance (the other being emotional intensity). A situation is unexpected if it is easy to describe and/or hard to generate. Following this idea, anomalies are rare and yet descriptively simple, which gives rise to the assumption that the model lends itself well to anomaly detection. The theory has already shown its applicability to this problem: in a recent master project it was applied to network monitoring, resulting in an accuracy of 96.4% on the DARPA dataset, which contains (anomalous) network attacks [2]. Furthermore, Simplicity Theory has been shown empirically to explain a number of phenomena, such as coincidences [3] and conversational topic connectedness [4]. However, few use cases of the model have been explored.

The goal of this research project is to explore the applicability of the model to anomaly detection. Initially, the project was aimed at creating a tool able to detect deforestation in satellite image data. After investigating this problem for a while, it turned out that its scope was going to be too large, so the focus was laid mainly on the construction of prototypes, which, as will be discussed later, is an essential step for the detection of anomalies.


Chapter 2

Theoretical Framework

As was discussed in the introduction, prototype construction is important for anomaly detection: in order for anomalies to be detected, a notion of normality needs to be created. One approach to this is the process of prototype creation, which was also the approach taken in [2], which applies Simplicity Theory as well. In this chapter the relevant theory for this problem is introduced. A brief section on the data part of the project is provided before going into the theory behind the tool that was built.

2.1 Data

Initially the decision was made to have the program detect anomalous parts of large satellite images. Due to the extensive processing that would have been necessary to make the raw data compatible with the tool, the focus was shifted to the creation of the tool itself, using 'mock' data similar to what would have been produced in the desired format. Research was nevertheless conducted into possible ways of obtaining the right data output format. This underlying research is discussed first, before going into the process that led to generating the 'mock' data.

Originally it was decided to take satellite images as raw data input. Given that it would be difficult to work with these images directly, the decision was made to 'convert' these data points to instance data in csv format. One aspect of this conversion was deciding how and which features to extract from the images so as to give a solid representation of the underlying image data. A few techniques for image feature extraction were considered. First of all, it was decided to split up the images into equally sized cells and extract the features from these cells. The rgb values would be extracted using linear spectral unmixing executed by the SPRING program. This method is centered around decomposing images into their red, green and blue components and using these to reconstruct the images; the averages of each component would be the instantiations in the csv data.

Eventually the decision was made to generate sample data with a structure similar to what would have been obtained from processing the images, mainly because the focus was to be put on the creation of the tool. The generation of the sample data was done using the Pandas library. Using a mean and standard deviation for each of a number of categories, which corresponded to a range of fruits, a number of samples were drawn from a normal distribution for each fruit.
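As an illustration of this step, the sketch below generates such a dataset. The means, standard deviation, column names and sample counts are placeholders chosen for illustration (the actual values were provided by the supervisor and are not reproduced here).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder means per category, one value per dimension (length, width, sugar, water, fibre, r, g, b).
CATEGORIES = {
    "apple":   [9.0, 7.0, 50.0, 80.0, 2.4, 50.0, 195.0, 55.0],
    "apricot": [7.0, 4.8, 1.2, 83.0, 2.4, 240.0, 125.0, 50.0],
    "carrot":  [18.8, 1.9, 4.7, 88.0, 2.5, 255.0, 80.0, 32.0],
    "banana":  [17.0, 3.7, 2.5, 81.0, 0.9, 175.0, 190.0, 2.0],
}
COLUMNS = ["length", "width", "sugar", "water", "fibre", "r", "g", "b"]
STD = 1.0            # assumed common standard deviation
N_PER_CATEGORY = 50  # assumed number of samples per category

rows = []
for label, (name, means) in enumerate(CATEGORIES.items()):
    samples = rng.normal(loc=means, scale=STD, size=(N_PER_CATEGORY, len(means)))
    for i, sample in enumerate(samples):
        rows.append([f"{name}_{i}", label, *np.round(sample, 2)])

df = pd.DataFrame(rows, columns=["instance_name", "label", *COLUMNS])
df.to_csv("fruit_mock_data.csv", index=False)
```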

2.2 Simplicity Theory

The assessment of the unexpectedness of a situation is a unique competence that underlies much of human decision making. Simplicity Theory has been introduced as a cognitive, computable model which has empirically proven to have predictive power regarding interest in conversational narratives [5]. Aspects of the model are suggested to play a key role in the human experience of good or bad luck after the occurrence of an event [6], and it has been proposed as an alternative model to explain human sensitivity to coincidences [3]. One crucial difference from standard approaches is that it is non-extensional and considers a resource-bounded version of Kolmogorov complexity, whereas probabilistic approaches build upon considering the set of possible alternatives.

Three relevant ideas are those of unexpectedness, generation complexity and description complexity. Their formal definitions will be provided here; the frame in which the theory operates will be specified later (see subsection 2.2.1). The first and most important concept is that of description complexity (Cd(x)), which is defined as "the shortest possible description that an observer can produce to determine an event s, without ambiguity".

Another important concept is generation complexity (Cw(x)), which is defined as the length of the shortest possible program that the "W-machine" can execute to generate some state s. The W-machine is a computing machine that respects the constraints of the 'world' [7].


Consequently, unexpectedness is defined as the difference between generation and description complexity, leading to the formula:

U(x) = Cw(x) − Cd(x)    (2.1)

To give an idea of how the unexpectedness of an event arises according to the theory, the 'lottery draw' example is often referred to.

Imagine a lottery draw, where each draw has a length of five digits. For every draw in the lottery, the generation complexity is the same, because every draw is generated by the same sequence of operations, namely five digit draws (5 * log2(10)). Now imagine a lottery draw of 33333. For this draw the generation complexity is still 5 * log2(10), whilst its description requires only one digit draw (1 * log2(10)) and four copy operations. This leads to a much lower description complexity, which consequently leads to a much higher unexpectedness of the draw [8].
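As a rough numerical sketch of this example: the cost of a copy operation is not fixed in the text, so it is left as a parameter here.

```python
from math import log2

DIGITS = 5

def generation_complexity() -> float:
    # Every draw is produced by five independent digit draws.
    return DIGITS * log2(10)

def description_complexity_33333(copy_cost: float = 1.0) -> float:
    # One digit draw plus four copy operations (the copy cost is an assumed value).
    return 1 * log2(10) + (DIGITS - 1) * copy_cost

u = generation_complexity() - description_complexity_33333()
print(f"U(33333) ~ {u:.2f} bits")  # ~ 9.29 bits with a 1-bit copy cost
```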

Despite its potential for application, Simplicity Theory lacks a practical computational implementation. As the theory is described rather at a functional level, many implementations are possible, but at the moment there is no complete reusable framework. This research project aims at bridging this gap.

In order to detect anomalies, the unexpectedness of events must be assessable. Most of the challenge lies in defining a frame in which both generation and description complexity can be assessed. This frame is first defined in the mental operators section below.

2.2.1 Mental operators

Simplicity Theory operates on events/situations and tries to assess their unexpectedness and eventually their relevance (that is, when also looking at emotional intensity, which is not explored in this project). These events/situations are referred to as exemplars in this research project. In a mono-dimensional space δ, an exemplar is defined by a single value corresponding to a center, and a variance that is small (around 0.01) since we are dealing with a single exemplar. An example of a mono-dimensional exemplar is α = (a, b), where a corresponds to the center and b to the variance, which can for example be set to 0.01.

Prototypes form conceptual representations of underlying data points, and they are important as points of reference on which the model of Simplicity Theory can operate. A prototype can be defined as P = (p, q), where p corresponds to the center and q to the variance covering a region around the center, leading to the geometrical representation [p − 2q, p + 2q].

Additionally, two exemplars α = (a, b) and β = (c, d) can together form a prototype according to the following two equations:

p = (a + c) / 2    (2.2)

q = (max(a − 2b, c − 2d) − min(a − 2b, c − 2d)) / 2    (2.3)

More often, however, exemplars exist in a multi-dimensional space ∆, where an exemplar α is not defined by a single value but by a vector of values, each issued from a dimension δi, i.e. α = (a1, a2, ..., an). Extending the previous construction, two multi-dimensional exemplars α and β together form a multi-dimensional prototype P = (P1, P2, ..., Pn) through the dimension-wise prototype construction operation according to Equation 2.2 and Equation 2.3.
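A minimal sketch of these definitions in Python, under the assumption that exemplars and prototypes are both represented as one (center, variance) pair per dimension; the class and function names below are illustrative and not taken from the project code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]  # (center, variance)

@dataclass
class Concept:
    """An exemplar or prototype: one (center, variance) pair per dimension."""
    dims: List[Interval]

def construct_1d(x: Interval, y: Interval) -> Interval:
    """Form a mono-dimensional prototype from two (center, variance) pairs (Eq. 2.2 and 2.3)."""
    (a, b), (c, d) = x, y
    p = (a + c) / 2
    q = (max(a - 2 * b, c - 2 * d) - min(a - 2 * b, c - 2 * d)) / 2
    return (p, q)

def construct(alpha: Concept, beta: Concept) -> Concept:
    """Dimension-wise prototype construction from two multi-dimensional exemplars."""
    return Concept([construct_1d(x, y) for x, y in zip(alpha.dims, beta.dims)])

# Example: two mono-dimensional exemplars with a small variance of 0.01.
proto = construct(Concept([(2.0, 0.01)]), Concept([(3.0, 0.01)]))
print(proto.dims)  # [(2.5, 0.5)] -- a prototype roughly spanning the region between 2 and 3
```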

2.2.2 Conceptual spaces through contrast

We now have a practical definition of a prototype that serves as a conceptual representation; what remains is to carry out this construction dynamically. One approach could be to use a clustering algorithm: the clusters would then serve as the conceptual representations instead of the prototypes mentioned earlier. Most clustering algorithms assess data similarity through a distance function, Euclidean distance being the one used by many algorithms. When computing similarity, the distance in every dimension is usually taken into account. From a cognitive point of view, however, this approach is untenable, since the number of dimensions in which an object can be described is potentially infinite. Simplicity Theory suggests a contrast-based approach to similarity, and in turn to the dynamic process of prototype construction.

More generally, we can ask how humans distinguish a textbook from the conceptual representation of books. In [9] it is stated that conceptual spaces provide the medium on which such computations are performed. One crucial operation in this process is contrast. When performing an operation of contrast on any book with respect to a prototypical textbook, the lack of exercises may pop out. Another way to look at this operation is as extracting the most discriminating features. Dynamic contrasting is essential for translating perceptions into predicates.

The conceptual spaces just mentioned are embodied as prototypes in this research project. Since these emerge through an operation of contrast, a contrastor object is introduced. In a mono-dimensional space, this object is generated through an operation of contrast between a target concept A = (a, b) and a base concept (prototype) P = (p, q), and can be computed as CA,P = A − P:

center:   c = (a − p) / q    (2.4)

variance: d = (q − b) / q    (2.5)

After having deconstructed an exemplar into a contrastor, the exemplar can be re-created through a merge operation. More formally, exemplar A = (a, b) can be reconstructed by merging a contrastor CA,P = (c, d) with the original base P = (p, q) as P + CA,P by:

a = p + c × q    (2.6)

b = (1 − d) × q    (2.7)
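A small sketch of the contrast and merge operations in one dimension (function names are illustrative):

```python
def contrast(A, P):
    """Contrast exemplar A = (a, b) against prototype P = (p, q), per Eq. 2.4 and 2.5."""
    (a, b), (p, q) = A, P
    return ((a - p) / q, (q - b) / q)

def merge(C, P):
    """Re-create an exemplar from contrastor C = (c, d) and base P = (p, q), per Eq. 2.6 and 2.7."""
    (c, d), (p, q) = C, P
    return (p + c * q, (1 - d) * q)

A, P = (2.0, 0.01), (2.5, 0.5)
C = contrast(A, P)
print(C)            # (-1.0, 0.98): the contrastor of A with respect to P
print(merge(C, P))  # recovers (2.0, 0.01)
```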

Re-creating the exemplar through this operation provides a definition of that exemplar, which can be used to compute the description complexity Cd(A) needed for the computation of the unexpectedness of the exemplar with respect to a prototype:

Cd(A) = Cd(P + CA,P) = Cd(P) + Cd(CA,P) = Cd(p) + Cd(q) + Cd(c) + Cd(d)    (2.8)

2.2.3 Description complexity of numbers

Now, p, q, c and d are in general all real numbers (floats). One way to approximate their description complexity is to take the base-2 logarithm of their discretized version, where discretization is done by dividing them by their smallest grain of representation ∆X. So:

Cd(x) = log2(1 + |x| / ∆X)    (2.9)

For positive integers, no discretization is needed so description complexity of a number can be computed as Cd(n) = log2(1 + n).
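A sketch of how these complexities can be computed; the grain of representation used below is an assumed value, not one fixed by the project.

```python
from math import log2

def cd_number(x: float, grain: float = 0.01) -> float:
    """Description complexity of a real number (Eq. 2.9); for positive integers use grain = 1."""
    return log2(1 + abs(x) / grain)

def cd_exemplar(P, C) -> float:
    """Description complexity of an exemplar described as prototype P plus contrastor C (Eq. 2.8)."""
    (p, q), (c, d) = P, C
    return cd_number(p) + cd_number(q) + cd_number(c) + cd_number(d)

print(cd_number(5, grain=1))                  # ~ 2.58 bits for the positive integer 5
print(cd_exemplar((2.5, 0.5), (-1.0, 0.98)))  # Cd of the exemplar from the earlier contrast example
```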

2.2.4 Finding the best prototype

Now that we have a way of computing the description complexity of an exemplar, the issue arises of finding the prototype P in terms of which an exemplar A can best be described. This boils down to the formula Cd(A) = minP Cd(P + CA,P), which makes sense if the retrieval of P out of all prototypes can be separated from the rest of the equation, leading to Cd(A) = minP [Cd(CA,P) + Cd(P)].

Now let us assume that the retrieval of every prototype costs the same. Having NP prototypes, the description complexity of a prototype can then be computed as:

Cd(P) = log2(NP)    (2.10)

2.2.5 Marginal description complexity

In principle, if the retrieval of every prototype costs the same, the description complexity of retrieving any P can be left out of the equation, leading to the marginal description complexity of A:

Cd(A) = Cd(P + CA,P | P) = Cd(CA,P) = Cd(c) + Cd(d)    (2.11)

More commonly, however, the retrieval of different prototypes does not cost the same. In order to assign different values to the retrieval of different prototypes, a recency approach can be considered: the prototype that an exemplar has most recently been classified as is pushed onto a stack. Consequently, the description complexity of a prototype is determined by:


Cd(P) = log2(rr(P))    (2.12)

where rr(P) (the recency ranking) corresponds to the position of P in the stack.
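A sketch of this recency-based retrieval cost, using a simple list as the stack and assuming the ranking is 1-based so that the most recent prototype costs 0 bits (names are illustrative):

```python
from math import log2

class RecencyStack:
    """Keeps prototypes ordered by how recently an exemplar was classified as them."""
    def __init__(self):
        self._stack = []

    def touch(self, proto):
        """Move (or push) a prototype to the top of the stack after a classification."""
        if proto in self._stack:
            self._stack.remove(proto)
        self._stack.insert(0, proto)

    def cd(self, proto) -> float:
        """Description complexity of retrieving a prototype, Cd(P) = log2(rr(P)) (Eq. 2.12)."""
        return log2(self._stack.index(proto) + 1)

stack = RecencyStack()
stack.touch("P1"); stack.touch("P2")   # P2 is now the most recent prototype
print(stack.cd("P2"), stack.cd("P1"))  # 0.0 and 1.0 bits
```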

2.2.6 Generation complexity

In order to compute unexpectedness it is necessary to have an approach to computing generation complexity (Cw) as well. Coming back to the definition, generation complexity is the length of the shortest possible program that the "W-machine" can execute to generate some state s. It has a strong similarity to Kolmogorov complexity: originally, the Kolmogorov complexity of a finite object x is the length of the shortest effective binary description of x [10]. More generally, it is the length of a shortest computer program (in a predetermined programming language) that produces the object as output.

A theoretical connection is present between the probability of an object and its Kolmogorov complexity [11], which is illustrated by the following two formulas:

p ≈ 2^(−CK)    (2.13)

CK ≈ log2(1/p) = −log2(p)    (2.14)

We can then interpret the frequency (number of occurrences) of an exemplar A or prototype P in terms of probability. For exemplars, n stands for the number of times that particular exemplar has been presented to the program and N for the total number of exemplars presented so far. For prototypes, n is the number of times an exemplar has been categorized as that particular prototype and N again represents the total number of exemplars presented to the program (assuming that every exemplar is classified as some prototype). Classification here means that a prototype P is decided to be the most descriptive one for an exemplar (see subsection 2.2.4).

Further deconstruction of Equation 2.14, taking p = n/N, gives:

Cw = −log2(n/N) = log2(N) − log2(n)    (2.15)
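A sketch of this frequency-based estimate of generation complexity (parameter names are illustrative):

```python
from math import log2

def cw(n_occurrences: int, n_total: int) -> float:
    """Generation complexity estimated from frequency, Cw = -log2(n/N) (Eq. 2.15)."""
    return log2(n_total) - log2(n_occurrences)

# A prototype classified 11 times out of 50 presented exemplars is cheap to generate;
# one seen only once is expensive, and hence more likely to contribute to unexpectedness.
print(cw(11, 50))  # ~ 2.18 bits
print(cw(1, 50))   # ~ 5.64 bits
```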


2.2.7 Anomaly detection machine

Now that the theory for computing unexpectedness is in place, we can proceed to the problem of anomaly detection. One important issue that comes with it is deciding what the best grain of representation of an exemplar is: if an exemplar is classified as some prototype, it also resides within higher-level (larger) conceptual representations (prototypes). Unexpectedness is used to decide what the correct grain of representation is. This decision process is incorporated in the algorithm below, which shows both how the best prototype is found and the decision boundary for deciding whether an exemplar is an anomaly.

Algorithm 1 Anomaly detection machine

1: procedure AnomalyDetectionMachine(exemplar A)
2:   for all prototypes P that exemplar A resides in do
3:     compute Cw(A) as in Equation 2.15
4:     compute Cd(A) as Cd(A) (as in Equation 2.11) + Cd(P) (as in Equation 2.12)
5:   select P for which U = Cw − Cd is maximal
6:   if U > 0 then
7:     report A as an anomaly
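A compact sketch of this machine in Python, pulling together the pieces above. Prototypes are assumed to carry a classification counter and a recency rank; the grain of representation, the toy prototype configuration and all names are illustrative rather than taken from the project code or its tests.

```python
from math import log2

GRAIN = 1.0  # assumed grain of representation for contrastor values

def marginal_cd(A, P) -> float:
    """Cd of exemplar A given prototype P: complexity of its contrastor (Eq. 2.11 with Eq. 2.9)."""
    (a, b), (p, q) = A, P["region"]
    c, d = (a - p) / q, (q - b) / q
    return log2(1 + abs(c) / GRAIN) + log2(1 + abs(d) / GRAIN)

def unexpectedness(A, P, n_total: int) -> float:
    """U = Cw - Cd for exemplar A with respect to prototype P."""
    cw = log2(n_total) - log2(P["count"])               # Eq. 2.15
    cd = marginal_cd(A, P) + log2(P["recency_rank"])    # Eq. 2.11 + Eq. 2.12
    return cw - cd

def is_anomaly(A, prototypes, n_total: int) -> bool:
    """Select the prototype maximising U and flag A as an anomaly when that U is positive."""
    return max(unexpectedness(A, P, n_total) for P in prototypes) > 0

# Toy configuration after 14 exemplars: two well-populated prototypes and one rare prototype.
prototypes = [
    {"region": (2.5, 0.5), "count": 2, "recency_rank": 3},
    {"region": (10.5, 0.5), "count": 11, "recency_rank": 1},
    {"region": (25.0, 0.5), "count": 1, "recency_rank": 2},
]
print(is_anomaly((25.0, 0.01), prototypes, n_total=14))  # True: rare region, cheap to describe
print(is_anomaly((10.0, 0.01), prototypes, n_total=14))  # False: frequent region
```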


Chapter 3

Method & approach

The goal of this project is to create a tool for anomaly detection based on Simplicity Theory. For this problem it was decided to construct prototypes that constitute the sense of normality on which Simplicity Theory operates. The process of creating the data that constitute the events is discussed first, before diving into the creation of the actual tool. Finally, the testing part is discussed briefly.

3.1 Data

As was discussed earlier, the initial idea of the project was to create an anomaly detection tool based on Simplicity Theory and apply it to the problem of deforestation detection in satellite images. In order for the satellite image data to be usable by the program, they needed to be converted from images to numeric data. This gave rise to two issues. The first was that the images contained clouds and shadows, and the angle of the satellite caused distortions: a full preprocessing of the data would have been a project in itself. Secondly, deforestation is a temporal process. In order to tackle the problem incrementally, the work moved on to a synthetically generated multi-dimensional database of objects. Eventually it was decided to generate a dataset by sampling from normal distributions with means and standard deviations corresponding to a number of fruits. The dataset has a layout similar to what would have resulted from the original approach: it contains a number of dimensions, such as red, green and blue values, which would also have been extracted from the image data (averages over a region within the image). Furthermore it contains a number of categories, which would have corresponded to a variety of different terrain types.


Table 3.1 below shows some examples from the dataset that was created, one for each category: the length, width, sugar, water and fibre values of an apple, apricot, carrot and banana. They were sampled using a mean and standard deviation in every dimension; the 'raw' data containing the means were provided by the supervisor and are for illustration purposes only. Additionally, the average r(ed), g(reen) and b(lue) values can be seen. The overall dataset that was created by sampling contains 50 exemplars of these four categories.

Table 3.1: Example fruit data

instance_name label length width sugar water fibre r g b

apple_4 0 9.20 7.09 51.30 80.24 2.43 48.85 195.70 53.28

apricot_15 1 7.10 4.77 1.23 83.01 2.41 243.70 126.35 49.72

carrot_29 2 18.81 1.88 4.69 87.76 2.55 255.65 80.22 31.95

banana_10 3 16.93 3.73 2.52 80.96 0.901 174.073 189.94 1.69

3.2 Program

The following sections go over the choices made regarding the tool for anomaly detection. Subsection 3.2.1 discusses an important shift in approach, subsection 3.2.2 goes over the algorithm for dynamic prototype construction, and subsection 3.2.3 discusses the rest of the body of the tool.

3.2.1 From contrast to normalized Manhattan distance

The theory in section 2.2 provides basic definitions of exemplars and prototypes, as well as how to construct a prototype, perform an operation of contrast, and thereby compute the description complexity of an exemplar. However, the means for constructing prototypes dynamically is not yet present. Dynamic prototype construction means that, as more and more exemplars are presented to the program/tool, prototypes need to emerge that form good representations of the underlying data. An algorithm that utilises the basic definitions provided in the theoretical section has been developed to tackle this problem.

As was mentioned earlier in subsection 2.2.2, conceptual spaces emerge through an operation of contrast. Each time a new exemplar is encountered, the algorithm needs to decide how to update the prototypes it has constructed so far. An operation of contrast would have to be executed between the new exemplar and all of the prototypes constructed so far. The description complexity of the exemplar with respect to any prototype can be calculated according to Equation 2.11, and the value that is computed serves as a metric of similarity. It would then be necessary to decide, based on this description complexity, whether any of the prototypes is a good enough representation of the newly presented data point, which requires some decision bound. Because there was no straightforward way to set this bound, a different distance metric was considered instead in order to speed up the process.

It is necessary to decide how good a representation any prototype is of any exemplar. To make this judgement, the normalized Manhattan distance was used as a distance metric. Say we have a multi-dimensional exemplar A that has not yet been presented to the program, and a multi-dimensional prototype P. The normalized Manhattan distance between the two is computed by taking the absolute difference between the instantiations in every dimension of the exemplar and those of the center of the prototype P. This vector of differences is then divided dimension-wise by a normalization vector of the same length, defined as the absolute difference between the minimum and maximum in every dimension over all exemplars presented to the program so far.
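A sketch of this distance under the description above (names are illustrative; the aggregation over dimensions is assumed to be a plain sum, as in the standard Manhattan distance):

```python
import numpy as np

def normalized_manhattan(exemplar: np.ndarray, prototype_center: np.ndarray,
                         seen_exemplars: np.ndarray) -> float:
    """Sum of per-dimension |difference| / (max - min over the exemplars seen so far)."""
    span = seen_exemplars.max(axis=0) - seen_exemplars.min(axis=0)
    span[span == 0] = 1.0  # guard against a dimension with no spread yet
    return float(np.sum(np.abs(exemplar - prototype_center) / span))

seen = np.array([[2.0, 0.0], [3.0, 1.0], [7.0, 1.0]])
print(normalized_manhattan(np.array([2.5, 0.5]), np.array([3.0, 1.0]), seen))  # 0.6
```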

3.2.2 Incremental prototype construction

Here the algorithm for the dynamic construction of prototypes is illustrated with some figures. Four scenarios can occur. The first and simplest one is the situation in which no prototypes have been constructed so far. In this case the exemplar presented to the program forms a prototype whose center(s) correspond to those of the exemplar, and whose variance constitutes a small region around it (0.01).

The second case can be seen in Figure 3.1: x3 lies within the already constructed prototype (the largest one), while at the same time lying close enough to x2 for the program to construct a sub-prototype between the two.


Figure 3.1: new exemplar is within two variances of prototype

The third case is illustrated in Figure 3.2: x3 lies outside the already constructed prototype (the smaller of the two), so an all-encapsulating prototype is constructed, which then also encapsulates the outlier.


Figure 3.2: new exemplar is not within two variances of prototype

Lastly, an exemplar can lie within two variances of the prototype but not within one variance of the closest support vector, although in practice these two conditions always seem to occur together.

The pseudo-code that formalises the process can be seen below. A few of the functions used need further clarification. The function 'closest' returns the prototype that is closest to the new instance in terms of normalized Manhattan distance. Prototypes come about through construction between two exemplars, which can be called support vectors. The functions 'nearest' and 'furthest' return the nearest and furthest support vectors respectively. The function 'within' decides whether an exemplar is encapsulated (according to n times the variance) by a prototype or exemplar. 'Construct' executes the operation for prototype construction provided in the theoretical framework (see Equation 2.2 and Equation 2.3). Lastly, 'extremes' returns the two exemplars for which the normalized Manhattan distance between them is the largest.

3.2.3 Simplicity Theory

Efforts have been made towards implementing the remaining theory defined in the theoretical section. An object-oriented approach has been taken for the implementation, introducing objects for exemplars, prototypes and Simplicity Theory.


Algorithm 2 Incremental prototype construction

Input: new exemplar, prototype list, previous exemplars
Output: updated list of prototypes

1: procedure IncrementalPrototypeConstruction
2:   if prototype list not empty then
3:     best ← closest(new instance, prototype list)
4:     nearest ← nearest(best, new instance)
5:     furthest ← furthest(best, new instance)
6:     if new instance.within(best, 2) then
7:       if new instance.within(nearest, 1) then
8:         construct(new instance, nearest)
9:         remove best
10:        new ← construct(furthest, new instance)
11:        prototype list.add(new)
12:    else
13:      previous exemplars.add(new instance)
14:      minE, maxE ← extremes(previous exemplars)
15:      new ← construct(minE, maxE)
16:      prototype list.add(new)
17:  else
18:    construct a prototype containing just the new instance, with variance 0.01

Exemplar objects can be handled in numerous ways. They can be compared to one another for equality, which means that the centers of two exemplars are equal in every dimension (the variance of an exemplar is redundant here). It can be determined whether an exemplar object falls within some n number of variances of another object (bound to be a prototype). The Manhattan distance between an exemplar and another object can be computed in the way defined earlier. An exemplar object can be contrasted with a prototype object according to Equation 2.4 and Equation 2.5, resulting in a contrastor object, embodied by a prototype object. Two exemplars can together be used to output a prototype object. And, most importantly, the description complexity and unexpectedness of an exemplar with respect to a prototype can be computed.

The functionality of prototype objects is not much different from that of exemplar objects. The main difference is that internally every prototype has a counter, which counts the number of times an exemplar has been classified as it, allowing for the computation of generation complexity (Equation 2.15).

Thirdly, the Simplicity Theory object contains most of the functionality and works together with the prototype and exemplar objects. A main function is used to 'feed' the exemplar data one by one, while prototypes are constructed according to Algorithm 2, storing the updated prototypes in the object along the way. Most importantly, it contains the anomaly detection machine, which can be called with an exemplar as its argument and determines whether an anomaly has been encountered. It is based upon the algorithm defined in the theoretical framework (see Algorithm 1). Efforts have been made to make this 'machine' function properly.
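A skeleton of this object-oriented structure, as a sketch only; the actual class and method names in the project code may differ, and per-dimension variances are simplified to a single value here.

```python
class Exemplar:
    def __init__(self, centers, variance=0.01):
        self.centers = list(centers)
        self.variance = variance

    def within(self, other, n: int) -> bool:
        """True if this exemplar falls within n variances of `other` in every dimension."""
        return all(abs(c - oc) <= n * other.variance
                   for c, oc in zip(self.centers, other.centers))

class Prototype(Exemplar):
    def __init__(self, centers, variance):
        super().__init__(centers, variance)
        self.count = 0  # times an exemplar has been classified as this prototype

class SimplicityTheory:
    def __init__(self):
        self.prototypes = []
        self.seen = []

    def feed(self, exemplar: Exemplar):
        """Present one exemplar; update prototypes incrementally (Algorithm 2)."""
        ...

    def is_anomaly(self, exemplar: Exemplar) -> bool:
        """Anomaly detection machine (Algorithm 1): maximal U over candidate prototypes > 0."""
        ...
```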

3.3 Testing

In order to test the program, a number of tests were designed and implemented using unit testing in Python. A brief description of the tests is given here; their outcomes are discussed in the results section.

Initially, data in a mono-dimensional space was considered, consisting of a number of different combinations of numbers. In order to build the program in this mono-dimensional space, a number of tests were designed, based on assumptions about what would be good prototypes for different combinations of mono-dimensional input data. A combination of inputs could consist of the numbers 2, 3 and 7. The assumption was that both 2 and 3 would have to be represented by the same prototype, and that this prototype as well as the input number 7 in turn needed to be encapsulated by a larger prototype, resulting in an output of [[2 3] 7]. Furthermore, if the input consisted of the numbers 2, 3 and 3.5, the output [2 [3 3.5]] would have been desired.

Remaining in the mono-dimensional space, tests for finding the best prototype were designed. The exemplars 2, 3, 10 and 11 were considered as input, for which the desired result is the prototype structure [ [2 3] [10 11] ]. The tool was then presented with a test set containing the exemplars 2, 3, 9 and ten times the exemplar 10. The desired output is the prototype [2 3] having been classified as twice, the prototype [10 11] eleven times, and the all-encapsulating prototype [ [2 3] [10 11] ] never (since one of the smaller prototypes would always be a better representation).

One basic test for the detection of anomalies has been developed, where 2, 3, 10 and 11 are again presented as input, with the same desired prototype output as before. The occurrences of the prototypes were then incremented in the same way, before presenting the program with an anomaly (the exemplar 25) and a non-anomaly (the exemplar 10). The unexpectedness was checked to be larger than zero for the anomalous point and below zero for the non-anomalous point.

Tests were also developed for simple multi-dimensional (2D) data as well as for the dummy data mentioned in section 2.1. For the 2D case one simple test was developed, where the input consisted of the exemplars [2, 0], [3, 1] and [7, 1]; the assumed output was X = [[3, 1], [7, 1]] and [[2, 0], X]. For the multi-dimensional fruit example, the prototype construction algorithm was applied to the entire training dataset (see section 2.1) and a test set was presented to the program. Test exemplars were classified according to the prototype for which they had the lowest description complexity. The percentages of classification of the various categories are shown in the results section.
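A sketch of how the mono-dimensional prototype tests could be phrased with Python's unittest module; build_prototypes is a hypothetical placeholder for the project's construction routine, and the expected nested structures follow the assumptions described above.

```python
import unittest

def build_prototypes(values):
    """Placeholder for the project's incremental prototype construction (Algorithm 2)."""
    raise NotImplementedError

class TestPrototypeConstruction(unittest.TestCase):
    def test_two_close_one_far(self):
        # 2 and 3 should share a prototype; 7 joins them only at the top level.
        self.assertEqual(build_prototypes([2, 3, 7]), [[2, 3], 7])

    def test_two_clusters(self):
        # 2, 3 and 10, 11 should form two separate sub-prototypes.
        self.assertEqual(build_prototypes([2, 3, 10, 11]), [[2, 3], [10, 11]])

if __name__ == "__main__":
    unittest.main()
```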


Chapter 4

Results

4.1 Results

The overall goal was to create a program able to detect anomalous data based on Simplicity Theory. In order to do so, a system needed to be created that allows for the computation of description complexity (Cd) and generation complexity (Cw). The first part was being able to compute description complexity, for which prototypes needed to be created.

A number of tests were designed that were assumed to correspond to good representations of the input data presented to the program. First the results for one dimension are discussed, then the multi-dimensional results. All test outputs for prototype construction are presented in Figure 4.1. Figures 4.2, 4.3 and 4.4 show the outputs of the tests in mono-dimensional space, and Figure 4.5 shows the output of the simple 2D test. For the occurrence-update test the following result is achieved: [2 3] is classified as twice, and the prototype [10 11] eleven times. For the anomaly test, with inputs 2, 3, 10 and 11 and the exemplars 10 as non-anomaly and 25 as anomaly, the best prototype was the all-encapsulating prototype [ [2 3] [10 11] ] in both cases; the unexpectedness values of 10 and 25 were 0.49 and -1.03 respectively.

In Table 4.1 the percentages of exemplars that fall within a certain prototype can be seen; an exemplar falls within a prototype if it falls within two variances in each of the dimensions. The last column shows the mean squared error computed between the centers of a prototype and the means, in all dimensions, that the data was sampled from.

Figure 4.1: prototype construction

Figure 4.2: monodimensional: inputs = [2, 3, 7]

Figure 4.3: monodimensional: inputs = [2, 3, 10, 11]

Figure 4.4: monodimensional: inputs = [2, 3, 3.5, 7, 9, 25]

Figure 4.5: multi-dimensional: inputs = [[2, 0], [3, 1], [7, 1]]

What is striking about the data is that there are a lot of prototypes that singularly encapsulate apples and that their mean squared errors are all relatively high (protos 0 to 8). Then there are some prototypes that mostly encapsulate apricots (9 to 13), one of which encapsulates only apricots and at the same time has the lowest mean squared error, which implies that it is a good representation of the category of apricots.

The prototypes shown in Table 4.1 were assigned a label corresponding to the biggest percentage of fruits that they encapsulate. In order to further test the constructed prototypes, a test dataset was created, containing 10 data points for each fruit and sampled from the same means as the training set.

The description complexity of every exemplar in the test set with respect to every constructed prototype was computed according to Equation 2.11. The prototype for which the description complexity was lowest was chosen as the prototype that best describes the exemplar. If the exemplar's category corresponded to the label assigned to that prototype, the classification was considered accurate. This resulted in a total accuracy on the test set of 55%.


Table 4.1: Percentages of data points encapsulated by various prototypes and MSE of the most represented category

prototype  p_apples  p_apricots  p_carrot  p_banana  mse (biggest category)
proto0     100.0     0.0         0.0       0.0       50.50
proto1     100.0     0.0         0.0       0.0       152.59
proto2     100.0     0.0         0.0       0.0       823.02
proto3     100.0     0.0         0.0       0.0       764.26
proto4     100.0     0.0         0.0       0.0       305.62
proto5     100.0     0.0         0.0       0.0       540.78
proto6     100.0     0.0         0.0       0.0       367.12
proto7     100.0     0.0         0.0       0.0       489.43
proto8     100.0     0.0         0.0       0.0       651.92
proto9     27.78     63.89       8.33      0.0       2043.25
proto10    23.53     64.71       5.88      5.88      1452.30
proto11    14.29     71.43       7.14      7.14      1340.33
proto12    16.67     70.0        11.67     1.67      1274.84
proto13    0.0       100.0       0.0       0.0       6.87
proto14    35.29     36.13       17.65     10.92     1241.56
proto15    64.10     0.0         0.0       35.90     710.80
proto16    24.86     28.32       21.39     25.43     1247.17
proto17    44.26     40.98       1.64      13.11     1192.07
proto18    0.0       0.0         0.0       100.0     235.34
proto19    32.38     30.48       8.57      28.57     1092.89

A couple of side notes need to be placed with the results obtained here. First of all, in one dimension more testing needs to be done to see how the prototype construction behaves when presented with edge cases and different combinations of input data. Additional testing in multiple dimensions, as well as three-dimensional prototype construction, would also be desirable. Furthermore, the unexpectedness of the non-anomalous exemplar (10) turned out higher than that of the exemplar assumed to be anomalous (25); further research is needed to figure out the mechanisms behind this. Lastly, the table shows that many prototypes encapsulate a number of categories, whereas one would expect them to be more homogeneous.


Chapter 5

Conclusion

5.1 Conclusion

The goal of the project was to create a tool for anomaly detection based on Simplicity Theory. A framework containing a number of definitions and formulas was defined first, since the model of Simplicity Theory lacks the means for a straightforward implementation. One of the main components of the problem is the creation of prototypes, which serve as the conceptual representations necessary for computing description complexity and generation complexity. In turn, these two are necessary for the computation of unexpectedness, under the assumption that anomalous exemplars generate unexpectedness according to the model.

Initially the application domain would have been deforestation detection based on satellite images. However, because the data would have required a lot of preprocessing, it was decided to generate sample data containing a number of categories corresponding to various fruits and vegetables instead.

Most of the work on the actual tool has been devoted to the construction of prototypes, and to doing so dynamically. A Manhattan distance approach was considered instead of a contrastive approach, even though the latter would have been more consistent from a cognitive point of view. Furthermore, efforts were made towards extending the program to actually detect anomalies, but an accuracy worth mentioning has not yet been reached.

In one dimension, prototypes that are good representations of the underlying training data have been constructed: points that lie close to one another are conceptually represented by the same prototype, which is what is to be expected. The results also show that the program is able to construct prototypes when presented with simple multi-dimensional data.

Furthermore, it can be seen that the multi-dimensional fruit example data is not properly represented by the prototypes that have been constructed: a significant number of the prototypes encapsulate fruits that are not expected to be represented by the same prototype (e.g. apples and carrots). All in all, the theoretical frame defined in this research project is what constitutes the most value for further research into the translation of Simplicity Theory into a practical context. Further experiments based upon the theory need to be done in order to confirm the usability of the theory for the problem of anomaly detection.


Bibliography

[1] Guangjun Wu, Zhihui Zhao, Ge Fu, Haiping Wang, Yong Wang, Zhenyu Wang, Junteng Hou, and Liang Huang. A fast KNN-based approach for time sensitive anomaly detection over data streams. In International Conference on Computational Science, pages 59–74. Springer, 2019.

[2] Mar Badias Simó, Giacomo Casoni, Cees de Laat, and Giovanni Sileno. Anomaly detection on log files based on simplicity theory. 2020.

[3] Jean-Louis Dessalles. Coincidences and the encounter problem: A formal account. arXiv preprint arXiv:1106.3932, 2011.

[4] Jean-Louis Dessalles. Conversational topic connectedness predicted by simplicity theory. In CogSci, 2017.

[5] Adrian Dimulescu and Jean-Louis Dessalles. Understanding narrative interest: Some evidence on the role of unexpectedness. 2009.

[6] Jean-Louis Dessalles. Emotion in good luck and bad luck: predictions from simplicity theory. arXiv preprint arXiv:1108.4882, 2011.

[7] Jean-Louis Dessalles. Algorithmic simplicity and relevance. In Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pages 119–130. Springer, 2013.

[8] Jean-Louis Dessalles. A structural model of intuitive probability. arXiv preprint arXiv:1108.4884, 2011.

[9] Jean-Louis Dessalles. From conceptual spaces to predicates. In Applications of Conceptual Spaces, pages 17–31. Springer, 2015.

[10] Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.

[11] Paul M. B. Vitányi and Ming Li. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory, 46(2):446–464, 2000.
