
Reliability of attribute selection in forensic investigation

Graduation Research Project (36 EC)

At the Testimon - Forensic Focus Laboratory

Gjøvik University College

During the period of January 2013 – July 2013

By Ian Dashorst, 5730007

MSc in Forensic Science

At the University of Amsterdam

Supervisor: Dr. Katrin Franke

Professor

Gjøvik University College

Examiner: Marcel Worring

Intelligent Systems Lab Amsterdam

University of Amsterdam


Contents

1. Introduction
   1.1 Topic
   1.2 Keywords
   1.3 Problem description
   1.4 Justification, motivation and benefits
   1.5 Research questions
   1.7 Outline
2. Related Work
   2.1 Requirements for forensic methods
   2.2 Classification
   2.3 Feature Selection
   2.4 Forensic applications for feature selection
3. Choice of Methods
   3.1 Discretisation
   3.2 Feature selection methods
       3.2.1 Correlation-based Feature Selection (CFS)
       3.2.2 Minimum-Redundancy Maximum Relevance (mRMR)
       3.2.3 Generic-Feature Selection (GeFS)
   3.3 Greedy hill climbing search
   3.4 Learning algorithms
       3.4.1 Naïve Bayes
       3.4.2 Bayes Net
       3.4.3 Bagging
       3.4.4 C4.5
       3.4.5 Random Forest
       3.4.6 Support Vector Machines
   3.5 Performance estimation
       3.5.1 Reliability
4. Experiment
   4.1 Data sets
       4.1.1 NIPS data
       4.1.2 Enron data
       4.1.3 IDS data
   4.2 Design
5. Results
   5.1 Feature selection results
   5.2 Classification Results
6. Discussion
7. Conclusion


1. Introduction

1.1 Topic

Computational methods have become important tools in forensic science in recent years, enabling forensic scientists to analyse and identify traces in an objective and reproducible manner. They can also help to standardise investigative procedures, search large volumes of data efficiently, assist in the interpretation and argumentation of results, and reveal previously unknown patterns. Machine learning is one such method applied in forensic science. In machine learning, high-dimensional data is often encountered, such as chemical data, images or intrusion detection logs. In machine learning, a more common term for the assignment of a label to a given input value is pattern recognition. The performance of a pattern recognition system can be reduced by irrelevant or redundant features in the data sets. Feature selection is therefore considered an important pre-processing step in pattern recognition systems for high-dimensional data sets. Reducing the feature set may also speed up learning and classification. In practice, most of the work of determining a good set of features is still done manually and depends heavily on expert knowledge.

Automatic feature selection methods fall into three categories: the wrapper, the filter and the embedded models. Because of their computational efficiency and good generalisation ability, only filter methods are explored in this study.

1.2 Keywords

Feature selection, forensic investigation, computational forensics, machine learning, pattern recognition

1.3 Problem description

Many examination techniques in forensic investigations are expert based. The analysis of evidence in forensics is performed to determine the nature of the evidence and to identify, classify and individualise it. It is important that this is done objectively to ensure sound interpretation. However, it has been shown that decisions made by these experts are subjective and susceptible to extraneous influences [1]. There is therefore a need for more objective analytical results in forensic investigations. Computational methods are suitable to fill this need, but most of these methods are not yet on par with the performance of experts. Pattern recognition, for one, has been employed in various forensic disciplines, such as DNA profiling, chemometrics for illicit substances, handwriting comparison and automated fingerprint identification systems (AFIS).

In pattern recognition, labels are assigned to given input values. The input is a set of attributes describing the properties of an object to be classified. In machine learning an attribute is referred to as a feature. These features represent an object that exists in the real world, but not every feature of an object is equally important for classification into a particular category, and using more features to represent an object does not automatically improve classification accuracy. Features can be irrelevant to the class, redundant or even confusing. Moreover, having many features increases training and classification time. Feature selection is therefore considered an important step in the pre-processing of data for pattern recognition.

1.4 Justification, motivation and benefits

Since objective analysis results make it easier to interpret evidence objectively, computational methods are used in this thesis for the analysis phase of the forensic process. As mentioned before, forensic experts can be influenced and their decisions should therefore be considered subjective. It is hard to measure the error rates caused by this bias, and a more objective solution should be explored to eliminate unknown error factors in the reasoning process of a forensic expert. The performance of automated systems can be measured more easily and more completely.

Automated systems for classification have already been applied to different forensic domains and many use a form of dimensionality reduction, like feature selection. Examples of some forensic applications are:

– Signature/handwriting comparison [2]:

Two data sets, one containing 556 forms by 250 different writers and the other 8000 letters from 1600 different writers, are represented by 686 attributes describing the orientation and curvature of the writing, for writer identification from offline handwritten document images.

– Chemical profiling, such as:

– Quantification of illegal substances as cocaine using Raman spectra [3] [4]:

Data sets of Raman spectra commonly have 500-3000 attributes (data points) and consist of 20-200 samples.

– Pen ink comparison using LDI-MS [5]:

An LDI-MS (Laser Desorption Ionisation - Mass Spectrometer) generates approximately 64000 raw data points per mass spectrum. Sample sizes are generally small.

– Hybrid PMI (Postmortem interval) estimation:

Combining different PMI estimation methods to estimate the time of death more accurately.

– Forensic Botany [6]:

An example is a grass identification system that uses up to approximately 1200 base pairs (attributes). A DNA barcoding sequencing approach can be employed for identification at the lower taxonomic levels. Feature selection can help identify the base pairs that are important for the classification.

– Face recognition [7]:

Facial features in face recognition can either be represented by attributes that cover a small part of the image, which results in a large number of data points, or by larger areas or shape models in the image, which results in around 100 or fewer features.

– Footwear pattern classification [8]:

Depending on the method, between 72 and 25000 attributes can represent a shoe pattern.

– Glass Identification [9] [10]:

Mostly only 7 or 8 attributes are used for the identification of glass samples. Feature selection is used to find an optimal combination of those features.

Methods and systems that are employed as evidence in a forensic context are subject to legal requirements. Feature selection can therefore not be used directly in forensic investigations, as evidence generated using feature selection might be deemed inadmissible.

1.5 Research questions

This thesis aims to answer the following research questions:

– What are the possibilities of transferring the feature selection approach in pattern recognition to forensic investigations?

  o What are the requirements for feature selection methods for forensic investigations?
  o How can forensic investigations benefit from feature selection?

– What is the performance of feature selection for specific forensic domains?

  o Does feature selection improve classification on the data?
  o How reliable is feature selection on the data?

1.7 Outline

This section provides an overview of the contents of this thesis. The thesis starts with an overview of related work, followed by the methodology, experiment, results, discussion, conclusion and future work.

Chapter 2 presents a literature review of work on feature selection for forensic investigations. It starts with the requirements for forensic methods, followed by a definition of classification, then introducing the general principles of feature selection and concluding with applications of feature selection in current research.

Chapter 3 discusses the choice of methods used in the experiments to evaluate feature selection for forensic investigations. It first describes the discretisation method used, then briefly explains the three feature selection methods discussed in this thesis. This is followed by the search strategy for feature selection and the classification algorithms used to determine the performance of the feature selection methods. The chapter concludes with the method for estimating the performance of the feature selection.

Chapter 4 explains the experiment. It first describes the data sets used in the experiments, followed by the design of the experiment.

Chapter 5 reports on the results of the feature selection and the following classification. Chapter 6 discusses the results of the experiment based on the research questions posed. Chapter 7 provides a summary of the contents of this thesis.

2. Related Work

2.1 Requirements for forensic methods

When developing a new method for forensic purposes, its admissibility in court should be taken into account, and it is important to do so from the start of development. Machine learning techniques normally do not need to satisfy all the rules that come with being admissible in court. This section describes the requirements for forensic evidence and on which points they differ from the usual requirements for scientific publications.

To evaluate forensic software, Hildebrandt, Kiltz and Dittmann [11] proposed an evaluation scheme for admissibility in court. They describe their scheme as a common scheme for the evaluation of forensic software (COSEFOS); it is based on two major requirements in the US jurisdiction, the Federal Rules of Evidence and the Daubert Challenge.

The Federal Rules of Evidence apply to all potential kinds of evidence. Mainly Rules 901, 702, 1001 and 1008 are identified as very important for digital evidence. Rule 901 addresses the requirement of authentication or identification of evidence. Part b, clause 9 states that if an automated process or system is shown to produce an accurate result, the result of this automated process or system is automatically authentic evidence. Rule 702 addresses the qualification of the expert witness and states that a witness qualifies as an expert by knowledge, skill, experience, training or education. The testimony must also be based on sufficient data or facts, be the product of reliable principles and methods, and those principles and methods must have been reliably applied by the witness to the facts of the particular case. The last important rules are the best evidence rules (Rules 1001 and 1008). These rules state that if any original of the evidence is available it must be used in court, and a copy is not sufficient. Clause 3 of Rule 1001 defines every printout or other output that is readable by sight and shown to reflect the data accurately as an original. For data that is not readable by sight, the original source must not be altered and the integrity of every copy must be verifiable.

The Daubert Challenge addresses the admission of scientific evidence that is presented by expert witnesses. Within the Daubert Challenge, a judge has the role of a gatekeeper and if a particular piece of evidence is challenged, the judge has to evaluate the evidence and decide about admission or exclusion.

There are three major criteria on which the judge has to base the decision [11]:

– Reliability: Is the evidence genuine and valid knowledge of the expert's area of expertise?
– Relevance: Will the evidence assist the trier of fact in determining a fact at issue?
– Qualifications: Does the expert have specialised knowledge in the field relevant to the testimony?

Additionally, other factors can be taken into account. Five factors are defined within the Daubert decision to assess the reliability of the expert's testimony:

– Peer review and publication;
– General acceptance in the relevant expert community;
– Potential for testing or actual testing;
– Known or potential rate of error;
– Existence and maintenance of standards controlling the use of the technique or method.

In COSEFOS a distinction is made between hard-criteria, which can be determined with particular tests, and soft-criteria, which are subjective and not directly measurable. Of the hard-criteria, which are divided into must-, should- and can-criteria, only the must-criteria can be considered while developing the algorithm at the core of the forensic software; should-criteria can be fulfilled by external tools. The must-criteria address the core functionality of a particular forensic application and determine the benefit that should be achieved by using this application. In contrast to the other criteria, the tests for these criteria differ for every type of tool. The following properties of a particular forensic tool are suggested for the evaluation of the core functionality:

– Reproducibility of the results: Is the same result obtained in every experiment?
– Possible errors that can occur: Are there false positives or false negatives?
– Frequency of those errors: How often are there false positives or negatives?
– Significance of the errors: How large are the errors and what are their consequences?

The most important property of a forensic application is the reproducibility of its results. A tool is unsuitable for forensic investigations if it behaves non-deterministically, because it is then not possible to determine whether the results are valid. This is especially important for the requirements of Rule 901. Knowledge of possible errors and their frequency is also required due to the Daubert factors. It is suggested that the error rates of false negatives and false positives be determined during testing [11].

The soft-criteria are defined by Hildebrandt et al. [11] as follows:

– General acceptance within the expert community;
– Publication of the method;
– Standards for the usage of the application;
– Intention of the investigation;
– Personal familiarity with the application.

General acceptance within the expert community is a very important Daubert factor. In over 96% of the cases where the general acceptance of challenged evidence was rated unreliable, the evidence was excluded from the case. However, general acceptance is not sufficient on its own for the Daubert Challenge. Publication of the method is required for peer review, which might increase the general acceptance of the particular method. Having standards for the usage of the application is necessary especially for very complex applications or in cases where errors are very likely. The intention of the investigation is important for the evaluation of a tool: when the intent of a tool is only to collect data that would incriminate a suspect, while no data that might prove innocence is collected, the tool is biased. The last suggested soft-criterion is personal familiarity with the application. Better familiarity can reduce the time necessary for the investigation. The Federal Rules of Evidence (Rule 702) require that the expert is qualified by knowledge, skill, experience, training or education.

In essence, all of the criteria listed in Hildebrandt's paper are a matter of good scientific practice. The difference in forensic research, however, is that these criteria should be taken into account and addressed explicitly in the process of developing a new technique.

The evaluation criteria for the core functionality will be used in this thesis to evaluate feature selection methods and their influence on classification accuracy. The next section first describes the general principle of classification, then general feature selection on data sets, why it is used and how it can be useful for forensic applications.

2.2 Classification

In classification the goal is to learn a mapping from an input X to an output C. X is a set of features x_i, which are properties of an object, individual or event. The mapping is learned by training on a set of observations A for which the output C is known and in which each observation is represented by X; this is called the training data. The performance of the learned mapping can be evaluated on a test set by comparing the output of the trained mapping for an observation to the actual output belonging to that observation. To improve the feature set X, feature selection can be performed before classification.
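As a concrete illustration of this setting, the following is a minimal sketch that learns such a mapping on a training split and evaluates it on a held-out test split. It uses scikit-learn and a bundled example data set purely for illustration; the experiments in this thesis use Weka.

# Minimal example of the classification setting: learn the mapping X -> C on a
# training set and evaluate it on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, C = load_iris(return_X_y=True)                 # observations with known outputs
X_train, X_test, C_train, C_test = train_test_split(X, C, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, C_train)   # learn the mapping
print("test accuracy:", accuracy_score(C_test, clf.predict(X_test)))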

2.3 Feature Selection

Feature selection is primarily performed to select informative and relevant features in a data set, but it can have other motivations as well, such as [12]:

- General data reduction, to increase algorithm speed and limit storage requirements;
- Feature set reduction, to save resources in the next round of data collection or during utilisation;
- Performance improvement, to gain in predictive accuracy;
- Data understanding, to visualise the data or gain knowledge about the process that generated the data.

There are generally three acknowledged types of feature selection algorithms [13]. Filters select subsets of variables as a pre-processing step, independently of the chosen predictor; the characteristics used in the feature selection are thus uncorrelated to those of the learning methods, which gives a better generalisation property. These methods can be computed easily and very efficiently, as they select features based on intrinsic characteristics that determine their relevance or discriminant power with regard to the target classes. Wrappers utilise the learning machine of interest as a black box to score subsets of variables according to their predictive power. In other words, the feature selection is "wrapped" around a learning method. Features are judged by estimating the accuracy of the learning method. This often gives a low prediction error with a small number of non-redundant features. However, wrappers typically require extensive computation to obtain this feature set. Embedded methods perform variable selection in the process of training and are usually specific to given learning machines.

A formal definition for feature selection can be found in [14]. Let 𝐽(𝑆) be an evaluation measure to be optimised (say to maximise) defined as:

𝐽: 𝑆 ⊆ 𝑋 → ℝ

The selection of a feature subset can be seen under three considerations, with 𝑚 the number of features in the subset 𝑆 and 𝑛 the number of features in the full set 𝑋:

- Set |𝑆| = 𝑚 < 𝑛. Find 𝑆 ⊂ 𝑋 such that 𝐽(𝑆) is maximal.
- Set a value 𝐽0, that is, the minimum 𝐽 that will be tolerated. Find the 𝑆 ⊆ 𝑋 with the smallest |𝑆| such that 𝐽(𝑆) ≥ 𝐽0.
- Find a compromise between minimising |𝑆| and maximising 𝐽(𝑆) (general case).

An optimal subset of features need not be unique with these definitions. This definition represents all feature selection models, whether it is filter, wrapper or embedded. Within this definition, feature selection has three components in which different methods can be characterised [14]:

- Search organisation – General strategy to explore the space of all possible feature combinations more efficiently;

- Generation of successor – Mechanism by which possible variants of the current hypothesis are proposed;

- Evaluation measure – Method by which successor candidates are evaluated.

Filter and wrapper methods differ mostly by the evaluation measure. Where wrappers use the performance of a trained learning machine to evaluate a given feature subset, filters use criteria not involving any learning machine, but an index based on test statistics or a relevance index. Compared to wrappers, filters are faster and some filters (those based on mutual information criteria) provide a generic selection of variables, not tuned for/by a given learning machine. Filtering can also be used as a pre-processing step to reduce space dimensionality and overcome overfitting for wrappers.

Both filter and wrapper methods can make use of search strategies to explore the space of all possible feature combinations more efficiently. For filter methods, however, sometimes only single features instead of subsets are evaluated; they then become feature ranking methods, which is the simplest form of feature selection. Feature ranking is often used as a principal or auxiliary selection mechanism because of its simplicity, scalability and good empirical success. Variable ranking focusses solely on ranking individual variables, independently of the context of the others. Variable ranking is a filter step: it is a pre-processing step, independent of the choice of the predictor. There are, however, limitations to individual feature ranking because of its underlying assumptions. A common criticism of variable ranking is that it selects a redundant subset. However, noise reduction and consequently better class separation may be obtained by adding variables that are presumably redundant. Also, some features that are not individually relevant may become relevant in the context of others, and features may be redundant even though they are individually relevant. Although variable ranking is not optimal, it may be preferable to other feature selection methods due to its computational and statistical scalability.

It is useful to select subsets of features that together have good predictive power instead of ranking according to individual predictive power. Feature-subset-evaluating methods do not only estimate subsets by their relevance, but also by the feature-feature relations that can make certain features redundant. The performance of a machine learning system can be degraded by redundant features [12]. Due to the better generalisation property and cheaper computation costs, the filter model is chosen over the wrapper as the feature selection method in the experiments.

2.4 Forensic applications for feature selection

Madden and Ryder [3] explore the use of machine learning methods for developing automated methods for the identification and quantification of illicit materials using Raman spectroscopy. The goal of their paper is to estimate the concentration of cocaine in solid mixtures. The task is broken down into two sub-tasks: data reduction and prediction. For the data reduction sub-task a simple Local Maxima approach and a wrapper searching for the optimal solution with a Genetic Algorithm (GA) are used. Their experiments indicate that data reduction was necessary to achieve good results in the regression to predict the concentration of cocaine in solid mixtures; the reduction in dimensionality was needed because of the high dimensionality of the Raman spectra in combination with a low sample size. However, the data set size of 36 samples was too small to obtain any definitive results.

The previous study is extended by O'Connell, Howley, Ryder, Leger and Madden [15], where mixtures comprising a wide range of different materials are analysed for the presence of a single analyte, cocaine, using near infrared Raman spectroscopy. The efficiency of several machine learning techniques was compared with Principal Components Regression (PCR). The study showed that Support Vector Machines outperformed PCR in the identification of acetaminophen. However, no feature selection was used to reduce the dimensionality of the data, which might have further addressed problems such as the misclassification of samples.

In the paper by Howley, Madden, O'Connell and Ryder [4], NIPALS (Non-Linear Iterative Partial Least Squares) PCA is used to improve the performance of machine learning in the classification of high-dimensional chemical data. It is shown to be a promising approach, improving the performance of machine learning despite a major reduction of the data, from 1646 attributes in the original data set down to as few as six attributes. This result is also promising for other feature selection techniques, such as GeFS, ReliefF or CFS, to be used in the chemometrics field, but this has not been explored much yet.

The feature selection methods are also used in ranking features for glass evidence analysis [10]. In this paper the results show that feature selection can aid the choice of variable for the univariate modelling of the data for evidence evaluation with likelihood ratio estimates.

Another application of feature selection in a forensic setting is writer identification from handwritten document images [2]. As in Madden and Ryder [3], a GA is used for the feature selection. The features considered in the study are part of an existing writer identification framework. Feature selection helped categorise these features into three classes: indispensable, partially relevant and irrelevant. It showed that approximately half of the features can be eliminated without degrading the writer identification performance.

Kittelsen [16] showed in his research that the Golub score is more efficient than GeFS for selecting features for detecting malicious PDFs. This indicates that GeFS might not be suitable for use in every domain. However, the Golub score is not applicable in many cases, as calculating Golub scores requires all features to be numerical and the classification task to be binary. The latter issue may be overcome by using a one-vs-all strategy, but that still does not solve the first issue.

3. Choice of Methods

This chapter provides information on the methods used in the experiments. These methods were chosen to provide a proper analysis of the feature selection methods. It is important that this is done in a setting representing "real world" situations, where, for example, the data has no uniform format and there are no labels for the data to be classified. First the method of discretisation is explained, followed by a description of the chosen feature selection methods. Next, the search strategy used for feature selection is described and the chosen classification algorithms are given. Finally, the chapter finishes with the method of evaluation.

3.1 Discretisation

To ensure a uniform comparison of the feature selection methods, discretisation of data with numeric features is necessary as a pre-processing step, so that all features have the same type of input. For the discretisation a supervised technique developed by Fayyad and Irani [17] is used, which combines an entropy-based splitting criterion with a minimum description length stopping criterion. A number of studies [18, 19] have found this method to be superior overall, and it is therefore used in the experiments.

The discretisation is done as follows:

For a set of instances A, a feature x_i and a cut point T, the class information entropy of the partition induced by T is given by

E(x_i, T; A) = \frac{|A_1|}{|A|} \, Ent(A_1) + \frac{|A_2|}{|A|} \, Ent(A_2)

where A_1 and A_2 are the two intervals of A bounded by the cut point T, and Ent(A) is the class entropy of a set of instances, given by

Ent(A) = -\sum_{i=1}^{k} p(C_i, A) \log_2 p(C_i, A)

with k the number of classes and p(C_i, A) the proportion of instances in A that belong to class C_i.

Fayyad and Irani [17] employ a stopping criterion based on the minimum description length principle. A partition induced by T is accepted if and only if the cost of encoding the partition and the classes of the instances in the intervals induced by T is less than the cost of encoding the classes of the instances before splitting. The partition induced by cut point T is therefore accepted iff

Gain(x_i, T; A) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(x_i, T; A)}{N}

where N is the number of instances in A,

Gain(x_i, T; A) = Ent(A) - E(x_i, T; A)

and

\Delta(x_i, T; A) = \log_2(3^c - 2) - [\, c \cdot Ent(A) - c_1 \cdot Ent(A_1) - c_2 \cdot Ent(A_2) \,]

where c, c_1 and c_2 are the numbers of distinct classes present in A, A_1 and A_2 respectively.
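The following is a minimal sketch of this acceptance test for a single candidate cut point, assuming the feature values and class labels are given as NumPy arrays and that both resulting intervals are non-empty; the recursive search over candidate cut points that a complete discretiser performs is omitted. It illustrates the criterion above and is not the implementation used in the experiments.

import numpy as np

def class_entropy(labels):
    # Ent(A): entropy of the class distribution in a set of instances
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mdlp_accepts_cut(values, labels, T):
    # Evaluate the MDL acceptance criterion for one candidate cut point T on one feature.
    left, right = labels[values <= T], labels[values > T]   # intervals A1 and A2
    N = len(labels)
    ent_A, ent_1, ent_2 = class_entropy(labels), class_entropy(left), class_entropy(right)
    # E(x_i, T; A): class information entropy of the partition induced by T
    E = len(left) / N * ent_1 + len(right) / N * ent_2
    gain = ent_A - E
    c, c1, c2 = len(np.unique(labels)), len(np.unique(left)), len(np.unique(right))
    delta = np.log2(3 ** c - 2) - (c * ent_A - c1 * ent_1 - c2 * ent_2)
    return gain > (np.log2(N - 1) + delta) / N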

3.2 Feature selection methods

For the comparison of the generic feature selection (GeFS) measure with traditional feature selection methods, the correlation-based feature selection (CFS) measure and the minimal-redundancy-maximal-relevance (mRMR) measure, both using a greedy hill climbing search method, were chosen. These measures were selected because CFS is a good overall performer that is fast and selects few features [20]. However, CFS considers the linear relation between features; mRMR is a nonlinear feature selection method and was chosen as the second method for that reason. The result of every feature selection method is a (locally) optimal subset of features.

3.2.1 Correlation-based Feature Selection (CFS)

Correlation-based Feature Selection (CFS) is a simple filter algorithm proposed by Hall [18] that ranks feature subsets according to a correlation-based heuristic evaluation function. This heuristic favours feature subsets in which the features are relevant to the class and not redundant. Features are relevant if their values vary systematically with category membership: a feature x_i is said to be relevant iff there exists some feature value v_i and class c for which p(x_i = v_i) > 0 such that

p(C = c | x_i = v_i) \neq p(C = c).

A feature is said to be redundant if one or more of the other features are highly correlated with it.

The definitions used for relevance and redundancy lead to Hall’s hypothesis for CFS:

“A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.”

If the correlation between each of the features in the subset and the class is known, and the correlation between each pair of features is given, then the heuristic merit of a subset can be predicted from:

M_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}

where M_S is the heuristic "merit" of a feature subset S containing k features, \overline{r_{cf}} is the mean feature-class correlation and \overline{r_{ff}} is the average feature-feature inter-correlation. The numerator of this equation provides an indication of how predictive a set of features is of the class; the denominator indicates how much redundancy there is among the set of features.

Three variations of CFS, each using a different attribute quality measure, are described in Hall [18]. The one chosen in this experiment uses the symmetrical uncertainty coefficient. The other two variants use the Relief measure and the minimum description length (MDL) principle.

The symmetrical uncertainty coefficient uses entropy to model the feature-feature and feature-class correlations. Entropy is a measure of the uncertainty in a system. The entropy of a feature x_i is given by

H(x_i) = -\sum_{v_j \in x_i} p(v_j) \log_2 p(v_j)

where the probabilities p(v_j) of the individual values v_j \in x_i are estimated from the training data.

The conditional entropy of x_i is obtained by partitioning the observed values of x_i in the training data according to the values of a second feature y_k. If the entropy of x_i with respect to the partitions induced by y_k is less than the entropy of x_i prior to partitioning, there is a so-called information gain. The entropy of x_i after observing y_k is:

H(x_i \mid y_k) = -\sum_{w_m \in y_k} p(w_m) \sum_{v_j \in x_i} p(v_j \mid w_m) \log_2 p(v_j \mid w_m)

The amount by which the entropy of x_i decreases reflects the information that y_k provides about x_i, the information gain:

gain = H(x_i) - H(x_i \mid y_k)
     = H(y_k) - H(y_k \mid x_i)
     = H(x_i) + H(y_k) - H(x_i, y_k)

To compensate for information gain's bias toward features with more values, and to normalise its value to the range [0, 1], the symmetrical uncertainty coefficient is introduced:

symmetrical\ uncertainty\ coefficient = 2 \cdot \left[ \frac{gain}{H(x_i) + H(y_k)} \right]

To apply the heuristic merit measure with the symmetrical uncertainty coefficient, the value of \overline{r_{cf}} is the symmetrical uncertainty coefficient with c = y_k and f = x_i, and for \overline{r_{ff}} the symmetrical uncertainty coefficient between pairs of features is used.

Hall and Holmes [20] compare attribute selection techniques for discrete class data mining. In their paper they compare an unsupervised technique (PCA), a wrapper, information gain ranking, ReliefF, CFS and consistency-based subset evaluation. For the evaluation, the Naïve Bayes and C4.5 tree classifiers were used. The benchmarking shows that, in general, attribute selection is beneficial for improving the performance of common learning algorithms. There is, however, no single best approach for all situations. The wrapper is the best attribute selection scheme in terms of accuracy, but it is slow. CFS, consistency and ReliefF are good overall performers; CFS chooses fewer features and is faster, but consistency or ReliefF are better choices when there are strong attribute interactions that the learning scheme can exploit.
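As a concrete illustration of the CFS merit with the symmetrical uncertainty coefficient, the sketch below computes the merit of a candidate subset. It assumes X is a matrix of already-discretised features (one column per feature) and y the class labels, both as NumPy arrays, and it uses scikit-learn's mutual_info_score only as a convenient way to obtain the information gain; it is not the Weka implementation used in the experiments.

import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def entropy(values):
    # H(x): entropy of a discrete variable given as an array of observed values
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetrical_uncertainty(a, b):
    # 2 * gain / (H(a) + H(b)); mutual_info_score returns the gain in nats, hence / log(2)
    gain = mutual_info_score(a, b) / np.log(2)
    denom = entropy(a) + entropy(b)
    return 2.0 * gain / denom if denom > 0 else 0.0

def cfs_merit(X, y, subset):
    # M_S = k * mean(r_cf) / sqrt(k + k*(k-1)*mean(r_ff)) for a subset of column indices
    k = len(subset)
    r_cf = np.mean([symmetrical_uncertainty(X[:, j], y) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([symmetrical_uncertainty(X[:, i], X[:, j])
                    for i, j in combinations(subset, 2)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)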

3.2.2 Minimum-Redundancy Maximum Relevance (mRMR)

Besides the CFS measure, which considers the linear relation between features, there exist nonlinear feature selection methods based on mutual information from information theory [21]. This section introduces mutual-information-based feature selection. The goal is to find a feature subset S with k features from the full set of n features that jointly have the largest dependency on the target class C. This scheme is called Max-Dependency. The Max-Dependency feature selection scheme can be applied when the number of features is small and has theoretical value. It is, however, difficult to obtain an accurate estimation of the multivariate densities involved, because 1) the number of samples is often not large enough and 2) it is an ill-posed problem, as multivariate density estimation often involves computing the inverse of a high-dimensional covariance matrix. Peng [22] therefore proposed a heuristic approximation of the optimal Max-Dependency, called Minimal-Redundancy-Maximal-Relevance (mRMR). The main idea is to select features based on maximal relevance; however, features selected by maximal relevance can be redundant, and therefore the redundancy of the features in the subset is also evaluated so that redundant features can be removed.

The mRMR measure maximises the relevance of the features to the target class C and at the same time minimises the redundancy between these features. The maximal relevance is the mean value of all mutual information values I between the individual features x_i in the subset S_k and the target class C:

\max D(S_k, C), \quad D = \frac{1}{k} \sum_{i=1}^{k} I(x_i; C)

The features selected by maximal relevance can have a rich redundancy; the class-discriminative power would not change much if those redundant features were removed. Mutual information can also be utilised to estimate the redundancy of the k features in the subset S_k, and the goal is to minimise this value:

\min R(S_k), \quad R = \frac{1}{k^2} \sum_{x_i, x_j \in S_k} I(x_i; x_j)

The minimum-redundancy-maximal-relevance (mRMR) measure then combines the two criteria:

\max_{S_k} \left[ D(S_k, C) - R(S_k) \right]

It has been shown that the mRMR measure is equivalent to Max-Dependency for first-order incremental search [22]. Different heuristic search strategies can be used to solve the mRMR feature selection problem; here greedy hill climbing search is applied, which is described briefly in section 3.3.
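A minimal sketch of the mRMR criterion for a given candidate subset is shown below, under the same assumptions as the CFS sketch above (a discretised feature matrix X and class labels y); the search over subsets is handled separately by the strategy described in section 3.3.

import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_score(X, y, subset):
    # D(S, C) - R(S): mean relevance to the class minus mean pairwise redundancy (in bits)
    D = np.mean([mutual_info_score(X[:, j], y) for j in subset]) / np.log(2)
    R = np.mean([mutual_info_score(X[:, i], X[:, j])
                 for i in subset for j in subset]) / np.log(2)
    return D - R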

3.2.3 Generic-Feature Selection (GeFS)

Nguyen, Petrovic and Franke [23] combine the CFS and mRMR measures with a new globally optimal search method and validate this method against various feature selection algorithms for intrusion detection. Both measures were fused and generalised into a generic feature selection (GeFS) measure, with GeFS_CFS and GeFS_mRMR as its CFS and mRMR instances. This new generic feature selection (GeFS) method outperformed the other approaches (SVM-wrapper, Markov-blanket and CART algorithms) on the KDD Cup'99 data set by removing more than 30% of redundant features from the original data set, while maintaining or even improving the classification accuracy. In a follow-up study, Nguyen, Torrano-Gimenez, Alvarez, Petrovic and Franke [24] extended their research to feature selection for filtering HTTP traffic in web application firewalls. The experiments were conducted on the ECML/PKDD-2007 and CSIC-2010 data sets. This study too showed that by choosing the appropriate instances of the GeFS measure, 63% of irrelevant and redundant features were removed from the original data set, while the detection accuracy was reduced by only 0.12%. Further testing revealed that the GeFS measure is more reliable in feature selection than genetic search with CFS and the mRMR method [25]. Although these are promising results, the GeFS measure has only been tested on IDS data sets and further experimentation is necessary to determine its usability in other domains.

3.3 Greedy hill climbing search

Both CFS and mRMR are employed here with a greedy hill climbing search algorithm using forward selection. This is a simple search strategy that considers local changes to the current feature subset. The search algorithm considers all possible additions of a single feature to the current subset and in each iteration selects the change with the biggest improvement. At the point where no improvement on the current merit can be made, the search returns that subset. Once a change is accepted it is never reconsidered, so there is a risk of finding only a locally optimal solution.
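A minimal sketch of this search strategy is given below; merit is any subset evaluation function (for example the CFS merit or the mRMR score from the sketches above, with the data bound in a closure), and the function returns the locally optimal subset of feature indices.

def greedy_forward_selection(n_features, merit):
    # Greedy hill climbing with forward selection over a subset-merit function.
    selected, best_merit = [], float("-inf")
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            return selected
        # consider every possible single-feature addition to the current subset
        new_merit, best_f = max((merit(selected + [f]), f) for f in candidates)
        if new_merit <= best_merit:          # no improvement: return the local optimum
            return selected
        selected.append(best_f)
        best_merit = new_merit

For example, greedy_forward_selection(X.shape[1], lambda s: cfs_merit(X, y, s)) runs the search with the CFS merit as the evaluation measure.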

3.4 Learning algorithms

Weka is used to test the performance of the feature selection methods. For this, six different learning methods are chosen, because of the "No Free Lunch Theorem", which means, loosely, that if an algorithm works well on one problem this does not automatically guarantee that it will yield good results on another, and there might be a better algorithm for that problem. Different classification algorithms are therefore used to get a good overview of the results on different problems. The six learning algorithms chosen are Naïve Bayes, Bayesian Networks, Bagging, the C4.5 decision tree, Random Forest and Support Vector Machines.

3.4.1 Naïve Bayes

The Naïve Bayes classifier is based on applying Bayes' theorem with a naïve independence assumption. It is a simple probabilistic classifier that is applied in many complex real-world situations despite the assumptions it makes. The Naïve Bayes classifier assumes that features are conditionally independent given the class and unrelated to any other features, whether they are present or not. The implementation in Weka is used for the experiments.

3.4.2 Bayes Net

A Bayesian network is a directed acyclic graph whose nodes correspond to random variables; each node has a conditional distribution for the node, given its parents. It gives a concise way to represent conditional independence relationships within a domain. The Weka implementation of BayesNet assumes that for the data set all variables are discrete finite variables and that no instance has missing values. The learning of a Bayesian network is a two stage process, where first a network structure is learned and then the probability tables. The default setting from the Weka implementation is used in the experiments, which has a local K2 search algorithm and a simple estimator.

3.4.3 Bagging

Bagging is an ensemble meta-algorithm designed to improve the stability and accuracy of learning algorithms used in classification. In bagging, a training set A of size n is used to generate m new training sets, each of size n', by sampling from A uniformly with replacement. The selected classification algorithm is trained on each of the m new sets, and the output prediction is the class that receives the most votes. The default settings from the Weka implementation were used in the experiments.

3.4.4 C4.5

The C4.5 algorithm is used to generate a decision tree. J48, the open source Java implementation of C4.5 in Weka, is used here. It builds decision trees in the same way as ID3: at each node of the tree, the algorithm chooses the attribute that most efficiently splits the data into subsets. The splitting criterion is based on the difference in entropy, and the attribute with the highest difference in entropy is chosen for the split at that node. This is repeated on the resulting subsets until no improvement can be made. In C4.5 it is possible to prune the tree after creation, removing branches that do not help and replacing them with leaf nodes once the complete tree has been built. Pruning can prevent overfitting on the training data.

3.4.5 Random Forest

The RandomForest learning algorithm is an ensemble learning method for classification built specifically on decision trees. RandomForest combines bagging with random selection of features to construct a multitude of decision trees; voting is then used to determine the class output when classifying.

3.4.6 Support Vector Machines

The Support Vector Machine (SVM) classifier tries to separate the classes by placing an optimal hyperplane. The optimal hyperplane is equally distant from the nearest instances of both classes; these instances are called "support vectors" and the distance between them and the hyperplane is called the "margin". The optimal hyperplane is the one with the maximal margin. For class problems that are not linearly separable, the problem can be solved by mapping the input space to a high-dimensional feature space in which the classes are linearly separable. For multiclass problems the SVM reduces the single multiclass problem to multiple binary classification problems. As the SVM classifier in Weka, the LibSVM implementation was used.
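For illustration only, the snippet below lines up rough scikit-learn analogues of these learners; the experiments themselves use the Weka implementations with their default settings, and scikit-learn has no direct counterpart to Weka's BayesNet with K2 search, which is therefore omitted here.

# Rough scikit-learn analogues of the learners listed above, for illustration only.
from sklearn.naive_bayes import CategoricalNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    "NaiveBayes": CategoricalNB(),                              # Naive Bayes on discretised features
    "Bagging": BaggingClassifier(),                             # bags decision trees by default
    "C4.5-like": DecisionTreeClassifier(criterion="entropy"),   # entropy-based splits, as in C4.5/J48
    "RandomForest": RandomForestClassifier(),
    "SVM": SVC(),                                               # LibSVM-backed support vector classifier
}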

3.5 Performance estimation

According to the COSEFOS scheme by Hildebrandt, Kiltz and Dittmann [11], the results must be reproducible and the errors of those results must be known. Therefore accuracy and reliability are used as performance estimators for the different feature selection methods. Nguyen [25] introduced a formal definition of reliability for feature selection, which is discussed in the next section.

3.5.1 Reliability

Nguyen [25] has introduced a formal definition of a reliable feature-selection process. Reliability is measured by the steadiness of a classifier's performance and the consistency of the search for relevant features. Both aspects of reliability are crucial to the Daubert criterion of reproducibility of the results and the errors of those results: the consistency of the feature selection matters for the reproducibility and errors of the feature selection itself, while the steadiness of the classifier's performance concerns the frequency and significance of the possible errors that can occur.

Reliability of the feature selection is defined by Nguyen [25] as follows. Given a data set A, a classifier C and a feature-selection method FS, suppose that we run the FS algorithm M times to select features from the data set A with different executions of the search strategy utilised in the feature-selection process; the feature-selection results might differ between runs. Let S_i (i = 1, ..., M) be the feature subset selected in the i-th run, Acc_i (i = 1, ..., M) the classification accuracy of the classifier C performed on S_i, and Acc_F the classification accuracy of C performed on the full set of features.

Definition 1 (Consistency): A search strategy utilised in the feature-selection process is consistent with level of approximation α (α-consistency) if, for a given M,

\alpha = \frac{|S_1 \cap S_2 \cap \dots \cap S_M|}{|S_1 \cup S_2 \cup \dots \cup S_M|}

The numerator is the size of the set of features that are selected in every one of the M runs and the denominator is the size of the set of features that are selected at least once. The greater the value of α, the more consistent the search strategy is. When α = 1 for every M, the search strategy is said to be truly consistent; every feature that is selected at least once is then selected in every run.

Definition 2 (Steadiness): A feature-selection method generates steadiness β (β-steadiness) of the classifier C if, for a given M,

\beta = \frac{Acc_F - \frac{1}{M} \sum_{i=1}^{M} |Acc_F - Acc_i|}{Acc_F}

When Acc_F = Acc_i for every run, the value of β is 1; as the difference between Acc_F and Acc_i grows, the value of β decreases. The greater β, the better the classifier's steadiness is safeguarded by the feature-selection method. Thus, a wrong choice of feature-selection method might affect the steadiness of a classifier's performance.

Definition 3 (Reliability): A feature-selection method is called (α, β)-reliable if, for a given M, the search strategy utilised in the feature-selection process is α-consistent and the feature-selection method generates β-steadiness of the classifier C.
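Both quantities are straightforward to compute once the M selected subsets and the corresponding classification accuracies are available. The following is a minimal sketch in plain Python, assuming subsets is a list of collections of selected feature indices (one per run) and accuracies the corresponding per-run accuracies:

def consistency(subsets):
    # alpha = |S_1 ∩ ... ∩ S_M| / |S_1 ∪ ... ∪ S_M|
    sets = [set(s) for s in subsets]
    return len(set.intersection(*sets)) / len(set.union(*sets))
    # e.g. consistency([{2, 5, 9}, {2, 5, 9}]) == 1.0 (a truly consistent search)

def steadiness(acc_full, accuracies):
    # beta = (Acc_F - mean(|Acc_F - Acc_i|)) / Acc_F
    mean_diff = sum(abs(acc_full - a) for a in accuracies) / len(accuracies)
    return (acc_full - mean_diff) / acc_full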

4. Experiment

4.1 Data sets

For the evaluation of the feature selection methods seven data sets were used (Table 1). Four of these were general data sets, used to evaluate the general quality of the feature selection methods; they were taken from the Neural Information Processing Systems (NIPS) feature selection challenge, which made five data sets for feature selection available [12]. The other three data sets are forensic-specific and are used to evaluate the performance of the feature selection methods in forensic settings. The first forensic data set is the Enron author identification data set. The final two data sets are digital forensic data sets used for intrusion detection: the KDD Cup 1999 and the ECML/PKDD 2007 data sets [26]. On the KDD Cup 1999 and ECML 2007 data sets, the results of the CFS and mRMR measures using greedy hill climbing search are compared to those of their GeFS counterparts; for the GeFS measure the results reported in [21] are used in the comparison.

4.1.1 NIPS data

The goal of the NIPS challenge was to find algorithms that significantly outperform methods using all features. All tasks are two-class classification problems. Three of the data sets are dense data (Arcene, Gisette and Madelon) and one is sparse integer (Dexter). In this experiment the sparse binary data set (Dorothea) has been left out.

The task for the Arcene set is a biomedical application: to separate cancer samples from normal samples using mass-spectra data of blood serum. Dexter is a text categorisation task, where the goal is to identify texts about "corporate acquisition". Gisette is a handwritten digit recognition application, where the task is to separate written fours from nines. Madelon is an artificial task.

4.1.2 Enron data

The Enron data set is used to demonstrate the versatility of feature selection and its use on "real world" data. The Enron data set contains real emails of Enron employees, of which the extracted features and cleaned-up version as used by Chitrakar [27] are used here.

Table 1 – Data sets used in the experiment

Dataset                   Features   Type             Train samples   Validation samples   Classes
Arcene                    10000      Dense            100             100                  2
Dexter                    20000      Sparse Integer   300             300                  2
Gisette                   5000       Dense            6000            1000                 2
Madelon                   500        Dense            2000            600                  2
Enron                     549        Dense            33866           0                    118
KDD Cup 1999              41         Dense            494021          0                    5
KDDCup99 DoS&Normal       41         Dense            488736          0                    2
KDDCup99 Probe&Normal     41         Dense            101385          0                    2
ECML/PKDD 2007            30         Dense            50116           0                    2

4.1.3 IDS data

Both the KDD Cup 1999 and ECML/PKDD 2007 data sets are publicly available. From the KDD Cup 1999 data set (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), 10% of the complete data set of five million instances is used. The data set contains four attack classes (Denial of Service (DoS), Probe, Remote to Local (R2L) and User to Root (U2R)) and a normal traffic class. The attack classes were processed separately, as in Nguyen, Franke and Petrovic [28], because there is a large difference in the distribution of attack classes (e.g. the ratio of the number of U2R instances to the number of DoS instances is 1.3 × 10^-4) and the feature selection and classification might therefore concentrate only on the most frequent class and neglect the others. Only two attack classes (DoS and Probe) are considered, as there are several drawbacks in the KDD Cup'99 data set [29]. The data is split into two separate data sets, the first containing data from the normal traffic class and the DoS attack class and the second containing data from the normal traffic class and the Probe attack class, respectively KDDCup99 DoS&Normal and KDDCup99 Probe&Normal.

The ECML/PKDD 2007 data set was generated for the ECML/PKDD 2007 Discovery Challenge, and the training set of that challenge is used here. It consists of 50,000 samples, 20% of which are attacks. The requests are labelled with specifications of attack classes or normal classes. For this experiment the attack classes were grouped into one attack class and therefore only two classes were used. The same 30 features were extracted from the data set as in Nguyen [24] for a good comparison.

4.2 Design

Figure 1 is a flow chart of the experimental design. Every data set was evaluated using k-fold cross-validation with k = 10, whereby the data set was divided into 10 sets of equal size, with the samples drawn randomly from the original data set. The training set is then a combination of k − 1 sets and the remaining set is used as the test set. When a validation set was available, that set was used as the test set to evaluate the performance; for the other sets the remaining fold was used as the test set. The training sets were discretised using the Fayyad and Irani method and the corresponding test set was discretised accordingly. Feature selection was performed on the training set with CFS or mRMR, the classifier was trained, and the test set was classified and evaluated to determine the accuracy of that classification. This process was repeated k = 10 times and the mean accuracy was reported, as well as a box plot of the tenfold cross-validation, which shows the variation in accuracies across the folds.
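A minimal sketch of one run of this pipeline is given below. It is an illustration only: it uses scikit-learn rather than Weka, stratified folds, and an equal-frequency discretiser (KBinsDiscretizer) as a stand-in for the Fayyad-Irani method, and it assumes a feature-selection function select_features that returns the indices of the selected columns.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

def run_fold(X, y, train_idx, test_idx, select_features, clf):
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
    X_train = disc.fit_transform(X[train_idx])        # discretise on the training fold only
    X_test = disc.transform(X[test_idx])               # apply the same cut points to the test fold
    subset = select_features(X_train, y[train_idx])    # indices of the selected features
    clf.fit(X_train[:, subset], y[train_idx])
    return accuracy_score(y[test_idx], clf.predict(X_test[:, subset]))

def cross_validate(X, y, select_features, clf, k=10):
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    accs = [run_fold(X, y, tr, te, select_features, clf) for tr, te in folds.split(X, y)]
    return float(np.mean(accs)), accs                  # mean accuracy and per-fold accuracies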

Figure 1 – Flow chart of the experimental design

5. Results

This section discusses the results of the feature selection on the different data sets. First the consistency of the feature selection and the number of features selected for each data set are discussed. Then the accuracies of the different classifiers on the feature subsets obtained from feature selection are provided.

5.1 Feature selection results

In general the results show that CFS selects fewer features than mRMR, except for the Enron data set, where mRMR selects fewer features, and the Dexter data set, where the number of selected features is not significantly different (Table 2). For both heuristics the greedy hill climbing search strategy is 100% consistent, as is the GeFS method. The number of features selected by CFS for the Enron data is on average slightly higher than the smaller CFS feature set used by Chitrakar [27], 56.7 versus 53 features; mRMR selects fewer features than both, with a mean of 33.8 features selected.

Both the CFS and the mRMR heuristic with the greedy hill climbing algorithm are more conservative than the GeFS method used by Nguyen [21] for the KDD Cup 1999 and ECML 2007 data sets. The smallest difference is with the CFS heuristic: it selects on average 5, 7.7 and 2.9 features for the DoS, Probe and ECML 2007 data sets respectively, where GeFS_CFS selects 3, 6 and 2 features.

The difference between the number of features selected by CFS and mRMR on the Madelon data set is striking; it is much greater than for the other data sets. This might mean that many of its features are linearly correlated with each other [21].

Table 2 – Results of feature selection on the 8 different data sets.

Dataset                   Features   Mean selected features CFS   Mean selected features mRMR
Arcene                    10000      34,2                         37,3
Dexter                    20000      30,9                         30,2
Gisette                   5000       64                           58,2
Madelon                   500        6,7                          56,4
Enron                     549        56,7                         33,8
KDDCup99 DoS & Normal     41         5                            7,6
KDDCup99 Probe & Normal   41         7,7                          16,4

5.2 Classification Results

The classification results are presented in Tables 3 and 4 and are visualised as box plots for the different data sets in Figures 2-9.

For the Guyon data sets (Arcene, Dexter, Gisette and Madelon) the feature selection performs very well overall. Table 4 shows that for two data sets (Arcene and Dexter) a feature subset obtained by feature selection gives the highest single classification accuracy; in the other cases the best results are still obtained with the full feature set. For three data sets, the CFS feature selection method has the highest average classification accuracy.

For the Arcene data set (Figure 2), feature selection improves the classification accuracies for SVM and Naïve Bayes; the best performance was achieved by Naïve Bayes with the mRMR features, at 75.3% accuracy. No difference is observed for J48 classification and only a small degradation for RandomForest classification. Accuracy for Bagging degrades most after feature selection on this data set.

Figure 2 – Classification accuracies on the Arcene data set

For the Dexter set (Figure 3) the highest classification accuracy, 83.73%, is achieved by the CFS feature set with RandomForest classification. No large degradation in performance is observed after feature selection, and the CFS measure performs slightly better than mRMR. SVM has a surprisingly low accuracy for the full feature set; for all but one fold it classified all instances into a single class.


Figure 3 – Classification accuracies on the Dexter data set

The results on the Gisette data set (Figure 4) show that the Naïve Bayes and BayesNet classifiers do not perform well. The full feature set performs best for this data set, followed by CFS.

Figure 4 – Classification accuracies on the Gisette data set

Figure 5 shows that the classification result on the Madelon data set for the CFS feature subset has less variation than the full and mRMR sets. This is caused by the much smaller feature set that CFS was able to select. The highest single result is achieved by the full feature set, but for three classifiers (SVM, NB and BayesNet) feature selection improves the classification result.


Figure 5 – Classification accuracies on the Madelon data set

The results from the experiments on the NIPS data sets are below the best results from the NIPS 2003 feature selection challenge, but those were achieved by tweaking the classification algorithms as well as the feature selection for the specific data, whereas here only default classification settings and general feature selection methods were used. The number of features in the full Arcene and Dexter data sets was too large to obtain results with the BayesNet classifier, and therefore no results are reported here for BayesNet.

Figure 6 – Classification accuracies on the Enron data set

The Enron and ECML 2007 data sets (Figures 6 and 7) suffer more overall accuracy loss from feature selection than the NIPS data sets. For the Enron data set, SVM benefits greatly from discretising features for the full feature set, but for both the CFS and mRMR sets the results are comparable to the results from Chitrakar. Naïve Bayes also benefits greatly from discretising, performs better on all fronts than the results in the previous research, and shows no large degradation in accuracy after feature selection. For the other three classification methods worse performance is observed. With a steadiness of 94.12% and 93.53% for CFS and mRMR respectively, the feature selection performs reasonably well on steadiness of the classifier's performance.

Figure 7 – Classification accuracies on the ECML 2007 data set

A higher steadiness is achieved on the KDD Cup 1999 and ECML 2007 data sets. The results for the ECML 2007 data set (Figure 7) are consistent with those found by Nguyen [21]. Although the classification accuracies for both the full feature set and the mRMR feature set are lower than reported there, the accuracy for the CFS feature set is higher and the differences are smaller; the steadiness values for the ECML 2007 set are therefore higher than in the experiments in Nguyen [21]. Feature selection improves the classification accuracies on the ECML 2007 data set for Naïve Bayes and Bayes Net, where the mRMR measure performs better than the CFS measure, but the full set outperforms both for the other classification algorithms. An overall decrease in average classification accuracy for all classifiers after feature selection is observed. Feature selection on the KDD Cup 1999 data does not reduce classification accuracies by much. The steadiness for the classification of DoS attacks is 99.29% and 98.83% for CFS and mRMR respectively. Figure 8 shows that the full set performs best for all classifiers on the KDD99 DoS & Normal data set; second best is the CFS feature subset and mRMR performs worst. For both Naïve Bayes and Bayes Net there is no difference between the feature subsets.


Figure 8 – Classification accuracies on the KDD Cup 1999 data set with DoS and Normal classes

Figure 9 – Classification accuracies on the KDD Cup 1999 data set with Probe and Normal classes

For the KDD Cup 1999 Probe & Normal data set there is only a small difference in classification accuracy between the full data set and the CFS and mRMR subsets, as can be seen in figure 9. Naïve Bayes and BayesNet classification gives the lowest accuracies for all sets. For the Probe attack class the steadiness is higher than for the DoS attack class, at 99.91% for CFS and 99.87% for mRMR.

Table 3 – Steadiness values of CFS and mRMR feature-selection methods

        Arcene    Dexter    Gisette   Madelon   Enron     KDDCup99 DoS & Normal   KDDCup99 Probe & Normal   ECML07
CFS     97.03%    91.63%    99.23%    99.63%    94.12%    99.29%                  99.91%                    97.44%
mRMR    97.53%    92.62%    98.90%    96.58%    93.53%    98.83%                  99.87%                    98.73%


Table 4 – Classification results on all 8 datasets with 3 different feature subsets (accuracy in %)

Dataset / feature set      SVM      C4.5     RandomForest  NB       BayesNet  Bagging  Average
Arcene
  All Features             63.00    66.20    74.50         72.00    -         71.00    69.34
  CFS Features             73.40    66.50    72.80         74.50    74.00     67.20    71.40
  mRMR Features            73.80    65.30    70.30         75.30    74.60     67.00    71.05
Dexter
  All Features             52.77    79.13    83.33         83.60    -         82.67    76.30
  CFS Features             83.17    80.30    83.73         83.43    83.43     82.03    82.68
  mRMR Features            82.87    79.33    82.50         83.23    83.13     80.50    81.93
Gisette
  All Features             95.87    94.76    95.09         90.41    90.47     95.12    93.62
  CFS Features             95.45    93.63    94.76         89.95    89.97     93.61    92.90
  mRMR Features            94.57    93.52    94.01         89.99    90.04     93.40    92.59
Madelon
  All Features             62.80    69.00    69.13         60.45    60.43     68.48    65.05
  CFS Features             65.98    66.82    66.72         62.88    62.90     66.45    65.29
  mRMR Features            61.44    64.28    64.25         61.70    61.70     63.58    62.83
Enron
  All Features             56.09    46.48    45.59         52.90    53.58     42.87    49.58
  CFS Features             40.99    44.66    47.91         51.79    51.75     42.92    46.67
  mRMR Features            40.76    45.00    48.44         50.64    50.65     42.77    46.38
KDD99 V2 DoS & Norm
  All Features             99.99    99.99    100.00        99.30    99.30     99.99    99.76
  CFS Features             99.35    99.35    99.35         98.46    98.46     99.35    99.05
  mRMR Features            98.64    98.65    98.65         98.48    98.48     98.65    98.59
KDD99 V2 Probe & Norm
  All Features             99.92    99.93    99.94         99.46    99.48     99.95    99.78
  CFS Features             99.85    99.89    99.89         99.31    99.32     99.89    99.69
  mRMR Features            99.79    99.86    99.87         99.26    99.26     99.87    99.65
ECML07
  All Features             93.43    91.36    92.44         93.04    86.70     86.75    90.62
  CFS Features             88.31    88.31    88.31         88.31    88.30     88.30    88.31
  mRMR Features            89.45    89.45    89.49         89.49    89.47     89.47    89.47

(A dash indicates that no result could be obtained with the BayesNet classifier for that feature set; see the text above.)


For most data sets the CFS heuristic performs best on steadiness (Table 3), the exceptions being the ECML 2007, Arcene and Dexter sets. For the Arcene and Dexter data sets, however, the CFS feature subsets improve the average classification accuracy more than the mRMR subsets, so the steadiness value is misleading in those cases. A higher overall steadiness is achieved on the ECML 2007 data set than Nguyen [21] obtained with the GeFS-mRMR method and the mRMR heuristic with genetic search, although both the accuracy for the full feature set and the accuracy for the mRMR subset are lower to begin with.
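The steadiness values in Table 3 can be reproduced directly from the average accuracies in Table 4. The snippet below is a minimal sketch of this computation; it assumes the reading of Nguyen's [25] steadiness measure used here, namely the complement of the relative change in average classification accuracy caused by feature selection, and the two data sets shown are only examples with averages copied from Table 4.

    # Minimal sketch: reproduce the steadiness values of Table 3 from the
    # average accuracies of Table 4. Assumption: steadiness is the complement
    # of the relative change in average accuracy caused by feature selection.
    average_accuracy = {
        # data set: (full set, CFS subset, mRMR subset) -- averages from Table 4
        "Arcene": (69.34, 71.40, 71.05),
        "KDD99 DoS & Normal": (99.76, 99.05, 98.59),
    }

    def steadiness(full, selected):
        # 100% means the average accuracy is unchanged by feature selection.
        return 100.0 * (1.0 - abs(selected - full) / full)

    for name, (full, cfs, mrmr) in average_accuracy.items():
        print(f"{name}: CFS {steadiness(full, cfs):.2f}%, "
              f"mRMR {steadiness(full, mrmr):.2f}%")
    # Arcene: CFS 97.03%, mRMR 97.53%
    # KDD99 DoS & Normal: CFS 99.29%, mRMR 98.83%  -- both match Table 3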

6. Discussion

The goal of this thesis was to investigate the transfer of feature selection in pattern recognition to forensic investigations and to evaluate the performance of feature selection in a forensic domain. In section 2.3 several studies were reviewed in which a form of feature selection was applied to a forensic domain; however, none of them discusses the requirements that are necessary for admissibility of the techniques in court. Hildebrandt et al. [11] proposed an evaluation scheme (COSEFOS) for developing forensic software that is to be used in court. In essence the scheme follows the guidelines for proper scientific research, and if those are followed the evidence should be admissible in court; evaluating software with COSEFOS while it is being developed helps in achieving that goal. In this thesis the evaluation of feature selection was done using a reliability measure composed of steadiness and consistency, as proposed by Nguyen [25], together with the overall classification accuracy.

The results in this thesis can be compared to the results of two previous theses, by Nguyen [21] and Chitrakar [27] respectively. However, there are some methodological differences between those studies and this one. Firstly, all continuous features were discretised using the method of Fayyad and Irani [17]; previous research has shown that discretisation gives superior results when performed at the outset [18], and many feature selection algorithms, such as RELIEF and IB4, require numeric features to be discretised. Secondly, the test set used in cross-validation has not been used for feature selection. Both Chitrakar and Nguyen use the full data set to perform feature selection: Chitrakar performed cross-validation on the complete data set when selecting features with CFS, building one subset from the features chosen in every fold and another subset by removing only the features that were never selected, while in the experiments of Nguyen features were selected from the full data set and then evaluated by four classifiers with 10-fold cross-validation. Both therefore use, during feature selection, information from data that is not used for training in the classification stage, which can bias the measured performance. Lastly, for the Enron data set Chitrakar used stratified cross-validation to evaluate the classification accuracy, which means that each fold holds the same proportion of class values. This yields more stable classification results, but it may not fully represent the real-world situation, where the number of training samples per class can vary. These changes in experimental setup may change the results from the outset, but they were deemed a better methodology for simulating a real-world setting in which the information from a test set is not available to the feature selection process.
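To make the difference in protocol concrete, the sketch below expresses the evaluation used in this thesis in Python form: discretisation and feature selection are fitted on the training folds only, and the held-out fold is used solely to measure accuracy. The scikit-learn components (KBinsDiscretizer, SelectKBest with mutual_info_classif, DecisionTreeClassifier) and the parameter k_features are illustrative stand-ins only, not the Fayyad–Irani, CFS/mRMR and C4.5 implementations actually used; the point of the sketch is the fold discipline, not the specific estimators.

    # Sketch of the evaluation protocol: supervised pre-processing
    # (discretisation, feature selection) is fitted on the training folds only,
    # so the held-out fold never influences which features are selected.
    # The estimators are illustrative stand-ins, not the exact methods used.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.tree import DecisionTreeClassifier

    def cross_validated_accuracy(X, y, n_folds=10, k_features=20):
        accuracies = []
        for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True,
                                         random_state=0).split(X):
            pipe = Pipeline([
                ("discretise", KBinsDiscretizer(n_bins=5, encode="ordinal")),
                ("select", SelectKBest(mutual_info_classif, k=k_features)),
                ("classify", DecisionTreeClassifier()),
            ])
            pipe.fit(X[train_idx], y[train_idx])   # test fold not seen here
            accuracies.append(pipe.score(X[test_idx], y[test_idx]))
        return float(np.mean(accuracies))

By contrast, selecting the features once on the complete data set and only then cross-validating the classifier, as in the earlier studies, lets the test folds influence which features are kept.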

Eight data sets were used to test the performance of two feature selection methods (CFS and mRMR), evaluated with six classifiers (SVM, C4.5, RandomForest, Naïve Bayes, BayesNet and Bagging). The non-forensic data sets (Arcene, Dexter, Gisette and Madelon) were used to establish the general quality of the feature selection algorithms, while the forensic data sets (Enron, KDD Cup 1999 and ECML 2007) were used to check whether these results hold when applied to the forensic domain. The performance of the feature selection algorithms was expressed by the number of features selected and by the reliability of the method, which is measured through the steadiness of a classifier's performance and the consistency of the search for relevant features. Both the CFS and the mRMR heuristic showed 100% consistency with the greedy hill climbing search strategy on all data sets. This is similar to the GeFS method, so in that respect the methods are comparable. The genetic search used by Nguyen [21] for comparison with GeFS achieved much lower consistency, which can be ascribed to the random nature of that strategy and the different initial populations used. In that respect, greedy hill climbing is a more suitable search strategy than genetic search for feature selection in forensic investigations.
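The determinism of the search explains the 100% consistency: greedy hill climbing, sketched below as a forward search over a generic subset-merit function (for example the CFS or mRMR criterion), contains no random component, so repeated runs on the same data always return the same subset. The merit function is left abstract; this is an illustrative sketch, not the exact implementation used.

    # Greedy hill-climbing (forward) search over feature subsets: starting from
    # the empty set, repeatedly add the single feature that most improves the
    # subset merit (e.g. the CFS or mRMR criterion) and stop when no candidate
    # improves it. The procedure is deterministic, hence 100% consistent.
    def greedy_forward_search(n_features, merit):
        # merit: callable mapping a frozenset of feature indices to a score
        selected = frozenset()
        best_score = merit(selected)
        while True:
            candidates = [(merit(selected | {f}), f)
                          for f in range(n_features) if f not in selected]
            if not candidates:
                break                      # every feature already selected
            score, feature = max(candidates)
            if score <= best_score:
                break                      # local optimum: no improvement left
            selected, best_score = selected | {feature}, score
        return selected

A genetic search, by contrast, starts from a randomly generated initial population, which is why its consistency over repeated runs is lower.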

The steadiness of the classifier's performance for the best feature selection method was above 98% in 5 out of 8 cases; for the Arcene and Dexter data sets it was lower, but there the classification accuracy improved. The results on the Arcene, Dexter and Madelon data sets did expose a flaw in the definition of reliability for feature selection introduced by Nguyen. One of the goals of feature selection described in Section 2.2 is to improve classification results, but this goal conflicts with the definition of steadiness: a feature selection method that improves the classification accuracy automatically receives a lower steadiness than one that leaves the accuracy unchanged, even though the former result may well be preferred. On the Arcene data set, for example, the CFS subset raises the average accuracy from 69.34% to 71.40% yet receives a steadiness of only 97.03%. Results on the Enron data set show that this data set is not suitable for author classification with the features selected and the methods used in this thesis; similar results were obtained by Chitrakar. This demonstrates the importance of using an appropriate data set and appropriate methods for the classification problem.
