
Relevance scoring of digital traces and clustering

Literature Thesis

Master Forensic Science, University of Amsterdam

Version: 2.0

Ameya Puranik
ameya.puranik@student.uva.nl
Student ID: 11995254

Supervisor: Ruud Schramp

Organization: Netherlands Forensic Institute


Abstract

Advances in the semiconductor industry have resulted in the availability of cheap consumer electronics and embedded devices. As a consequence, the number of electronic devices found at crime scenes is rising significantly, alongside a recent rise in wearable electronics and the Internet of Things. The storage capacities of digital devices have also grown at a rapid rate. Combined, these factors pose new challenges in digital forensics. One of the most important is the amount of time and effort an analyst has to invest to find relevant pieces of evidence within the huge heap of data collected from devices. The existence of this problem is evidenced by the major backlogs observed in digital investigations, and these issues will likely be exacerbated in the future without focused research efforts. We explore and analyze the research conducted in this domain to date.

Keywords: Digital Forensics, automation, analysis, relevance scoring, digital traces

1 Introduction

The initial problem in digital investigations was the time spent on making forensic copies and extracting digital artefacts from the recovered devices, which added the first overhead to the overall investigation process. This problem, however, has been greatly reduced by implementing digital forensics as a service (Van Baar, Van Beek, and Van Eijk [1]). The Hansken framework introduced in this research aids in proper resource management and collaboration between investigators at an early stage to alleviate the overhead of extraction. Though the model has some disadvantages, such as latency and dependency on internet connections, the advantages outweigh them. A major problem still remains to be addressed: the time and cognitive overhead resulting from the manual analysis of the ever-increasing data in digital investigations. We will elaborate on this problem in detail, highlight a few related problems, and present the research that has already been done to tackle it. We have formulated the following research questions to achieve this goal:

1. Which methods exist that can add relevance scores within heterogeneous digital traces?

2. What are the approaches explored that assist in identification of relevant digital traces within a large dataset of heterogeneous digital traces extracted for analysis?

This literature review is structured in the following way. We start in section 2 with a benchmark study that highlights the problems in the field of digital forensics and the state of the field at the time of that study. Section 3 covers the different approaches put forth to tackle the main problem highlighted in the introduction above, along with its related problems and sub-problems. We then analyse a recent benchmark study in section 4. Section 5 highlights the most recent research and developments that tackle this difficult problem. We provide our own views and opinions on the research work analysed in the discussion (section 6) and conclude the review.

2 First Benchmark

We consider the article by Nicole Beebe from 2009 as the first benchmark for our review. This study provides a view of the state of digital forensics at the time. Beebe [2] highlighted the "Good, Bad and Unaddressed" parts of digital forensics; our research question falls mostly under the "Unaddressed" section. The author introduces the problem of volume and scalability: the search and analysis methods used, in light of the volume of data, carry a large overhead, and the critical part of this overhead is the processing time spent by a human investigator on non-relevant data. This is widely agreed upon by researchers even today. Beebe proposes a few solutions for this issue, selective acquisition of data being one of them. This is a reasonable way to reduce the data an investigator has to analyse, but it may also result in loss of valuable information pertaining to the case. The author acknowledges this by stating that research is needed to design selective acquisition schemes that are capable of identifying relevant data. The paper also points to the high recall rates required in digital forensic investigations.

A salient feature of this study is the set of potential solutions suggested, which includes exploring the use of existing research from other fields of computer science in a forensic context. The approaches suggested by Beebe in this study are:

1. Application of existing data warehousing and information retrieval research to digital forensics

2. Adaptation of data mining approaches for homogeneous data sets to deal with heterogeneous forensic datasets

3. Application of existing link analysis research to establish relationships between events and data

Feature-based extraction without file metadata, statistical methods, and signature-based systems are also suggested as potential intelligent analytical solutions. A temporal approach for analyzing the traces is suggested as well, which points to the importance of timelining digital traces.

3 Developed Approaches

3.1 Forensic Feature Extraction and Cross Drive Analysis

The number of storage devices recovered for a single case has increased significantly, and with it the resulting amount of data to be analyzed. Garfinkel [3] put forth two novel approaches, "Forensic Feature Extraction" (FFE) and "Cross-Drive Analysis" (CDA), to aid investigations in which large volumes of data are involved. The two approaches are designed to work independently but are more efficient when combined to correlate multiple drives and obtain a more complete picture. These approaches help in prioritizing the analysis of relevant drives from the recovered corpus, provide opportunities for data correlation, focus on evidentiary and investigatory goals rather than just recovering data, and also provide an opportunity for social network discovery if there is a common relation, such as organizational ownership, among the corpus of drives under investigation.

The CDA approach considers pseudo-unique identifiers as forensic features: identifiers with sufficient entropy that finding the exact same identifier twice within a given corpus is highly unlikely. The idea is to use the occurrence of these features across multiple drives in a corpus as a factor for correlation and for identification of drives of interest. The author gives message-ids from email messages as an example of such a pseudo-unique identifier or forensic feature (FF). The email address itself can also be used as a forensic feature for ownership determination of the drive. Some common email addresses can be found across multiple drives and are not unique, as they are part of OS files or software applications; the author suggests using these to build a stop list for filtering, so that only the more unique email addresses are used for drive attribution. The approach provides the flexibility of suppressive filtering or positive selection filtering of ubiquitous forensic features, depending on the requirements of the investigation. The forensic features can be adapted to the examination questions of the investigation and extracted across the drives as part of CDA to identify the drives that contain the largest number of these features, thereby prioritizing drives for analysis.
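To make the idea concrete, the following minimal Python sketch illustrates the CDA principle under simplifying assumptions: drives are given as raw byte blobs, email addresses stand in for pseudo-unique identifiers, and the stop list is a hypothetical placeholder. It is not Garfinkel's implementation.

```python
import re
from itertools import combinations

EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical stop list of ubiquitous addresses shipped with OS/application files.
STOP_LIST = {b"support@example.com", b"info@example.org"}

def extract_features(raw: bytes) -> set:
    """Extract candidate pseudo-unique identifiers (here: email addresses)."""
    return {m.lower() for m in EMAIL_RE.findall(raw)} - STOP_LIST

def cross_drive_scores(drives: dict) -> dict:
    """Score each drive pair by the number of shared pseudo-unique features."""
    feats = {name: extract_features(data) for name, data in drives.items()}
    return {(a, b): len(feats[a] & feats[b])
            for a, b in combinations(sorted(feats), 2)}

# Drive pairs with high scores, and drives rich in case-specific features,
# are prioritized for analysis.
```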

An interesting advantage of this approach is that source attribution can be done based on activity and supporting artefacts from the drive. A multi-drive correlator developed in this research can be used for clustering of identical drives, or case-specific features can be used to assign new drives to either malicious or benign clusters. The implementation maintains provenance information of the extracted features, which is important from a legal standpoint. However, the implementation of this approach is limited to extracting string-based features from textual data on the drives, which is a drawback in modern, complex digital investigation cases.

3.2 FACE

Case et al. [4] acknowledge the increasing capabilities of individual applications and the interactions between a number of interconnected systems, which make the digital investigation landscape more complex. The authors take a scenario-driven approach, developing a forensic tool that aids the digital investigation rather than just burdening the investigator with raw extracted traces. The approach focuses on solving the problem of heterogeneity of traces by developing a tool capable of extracting data from multiple sources and correlating them.

The tool "ramparser" is specifically designed for Linux systems and harnesses knowledge of the Linux operating system's data structures. This tool acts purely as an extractor that feeds data into the Forensics Automated Correlation Engine (FACE) developed by the authors. The engine correlates timestamps obtained from different sources, such as login files and network traces, to frame user actions during specific times. Similarly, the engine maps users to their user ids, group ids, login shells and home directories using the information extracted by ramparser from the passwd and group files.
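A minimal sketch of this kind of correlation, assuming the passwd text and event records have already been extracted (for example by a memory parser) and that the passwd data follows the standard colon-separated layout; this is not the actual FACE or ramparser code.

```python
def parse_passwd(text: str) -> dict:
    """Map numeric uids to user name, group id, home directory and login shell."""
    users = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _pw, uid, gid, _gecos, home, shell = line.split(":")
        users[int(uid)] = {"name": name, "gid": int(gid), "home": home, "shell": shell}
    return users

def annotate_events(events: list, users: dict) -> list:
    """Attach human-readable user names to events that only carry a raw uid."""
    for event in events:
        info = users.get(event.get("uid"))
        event["user"] = info["name"] if info else "unknown uid %s" % event.get("uid")
    return events
```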


The user interface of the engine is capable of displaying information with respect to user names rather than ids, making it easier for the investigator. FACE supports visualization in five data views: users, groups, processes, filesystem and network captures. It is capable of displaying all user-specific activities, such as open files and network connections, as well as the other sources extracted by ramparser. The correlation of data from the ramparser modules is fully automated.

One of the drawbacks of the system is that it is designed only for Linux memory analysis and currently relies solely on the tool "ramparser" developed by the authors to feed in the data. However, adding new sources of data is relatively easy if a correlation function can be written for the new sources. This makes FACE scalable, with potential for further development into a more generic application.

3.3 Automated Timeline Reconstruction

Timelines are vital in understanding the sequence of events in a digital investigation. Multiple tools have been developed to extract timestamps from different sources such as log files, metadata or registry files. The "super timelines" created by such tools are extensive and, with their large number of entries, difficult to visualize and understand manually. The automated timeline reconstruction approach of Hargreaves and Patterson [5] summarizes a high-level, human-understandable timeline from the complex and extensive super timeline. A software tool, the Python Digital Forensics Timeline (PyDFT), has been developed as part of this research.

The tool extracts low-level events, which essentially form a super timeline, and reconstructs high-level events from them. The extractor manager of PyDFT has multiple extractors to counter the heterogeneity problem and extract timestamps and events from multiple sources. The parsers used for extraction are developed based on data structures and can thus be reused in other related research, just as parsers developed as part of other research can be used in PyDFT. Another advantage of this design is that the provenance of data is preserved. The reconstruction phase uses analysers to map multiple low-level events from the super timeline to a high-level event based on predetermined rules; this is where the expertise of a human investigator is applied to automate the process. The reconstruction approach is based on temporal analysis of events.
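The following sketch shows the flavour of such a rule: a high-level event is emitted when a set of analyst-defined low-level patterns co-occur within a short time window. The rule format and the 30-second window are illustrative assumptions, not PyDFT's actual rule syntax.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LowLevelEvent:
    timestamp: datetime
    source: str          # e.g. "registry", "prefetch", "filesystem"
    description: str

def reconstruct(events, patterns, label, window=timedelta(seconds=30)):
    """Emit (time, label) whenever every pattern is seen within one window."""
    events = sorted(events, key=lambda e: e.timestamp)
    high_level = []
    for anchor in events:
        in_window = [e for e in events
                     if anchor.timestamp <= e.timestamp <= anchor.timestamp + window]
        if all(any(p in e.description for e in in_window) for p in patterns):
            high_level.append((anchor.timestamp, label))
    return high_level

# Hypothetical rule: browser start inferred from co-occurring low-level traces.
# reconstruct(super_timeline, ["firefox.exe", "places.sqlite"], "Browser started")
```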

The developers made it easier for investigators to query the backing store for reconstruction by using a querying framework that allows the use of regular expressions. The framework of "test events" based querying proves useful in hypothesis-driven search and investigation, which is an additional advantage. Another strength of this approach is that in some cases it may also provide timelines of devices related to the device under investigation. The authors acknowledge that there is a performance bottleneck in this design when scaled up to multiple analysers; optimization of the design is therefore a must before practical application. The design depends on the system clock for timeline generation and cannot detect anti-forensic techniques deliberately applied to circumvent it.

3.4 Directory Imager (Dirim)

A number of approaches have been developed for automating investigation based on metadata. Rowe and Garfinkel [6] came up with an interesting approach that uses only directory metadata for detecting anomalies on a drive. Their approach is based on the novel idea of "superclustering" of clusters.

The Directory Imager (Dirim) tool developed by the authors analyzes data from a large corpus of representative disks to set test statistics for anomaly detection from a wide perspective. File metadata such as file path, name, size, MAC times and NTFS flags are used for analysis. Heuristics are used to fill in missing characters in the strings of file names and file paths. The approach excludes operating system directories, considers only user-related directories and files, and groups them semantically. The semantic grouping is useful for understanding the distinctiveness of a drive. Dirim can detect anomalies either by comparison against predefined semantic groupings or by comparison against clusters derived from the corpus. File association factors such as temporal associations, spatial associations and co-occurrence tendency are used for clustering, and the investigator has the flexibility to weight these factors. A large cluster of OS and application software files is usually formed through these associations and can be excluded. Superclusters are formed from the clusters of the baseline corpus. Clusters of a new drive lying outside the superclusters, or belonging to smaller superclusters, can be deemed anomalous and prioritized for manual investigation. An example of cluster formation can be seen in figure 1 of Rowe and Garfinkel [6] in the appendix.
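A simplified sketch of metadata-only clustering in this spirit, using file size, modification time and path depth as stand-in features and DBSCAN as the clustering algorithm; Dirim's actual association factors and supercluster comparison are considerably richer than this.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def cluster_file_metadata(records):
    """Cluster files using directory metadata only (illustrative feature set)."""
    X = np.array([[r["size"], r["mtime"], r["path"].count("/")] for r in records],
                 dtype=float)
    return DBSCAN(eps=0.5, min_samples=5).fit_predict(StandardScaler().fit_transform(X))

def review_candidates(labels, min_size=5):
    """Noise points and very small clusters are flagged for manual review."""
    sizes = Counter(labels)
    return [i for i, lab in enumerate(labels) if lab == -1 or sizes[lab] < min_size]
```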

This approach depends on readily accessible metadata and does not perform very well when most of the metadata is lost, in which case the analogous correction applied by Dirim may fail. There may be a possibility of using this approach to establish relations among different drives based on cluster similarity and to rate the suspicion level of the drive being analysed. The clustering process itself is automated.

3.5 Machine Learning based Triage

Machine learning methods are becoming popular in the field of digital forensics. Marturana and Tacconi [7] proposed a triage model for digital evidence based on machine learning algorithms. The proposed model is a knowledge-based system and requires representative samples of devices with respect to the potential crime being investigated, which paves the way for training bias if a careful selection is not made. The model is aimed at triage of evidence on scene and may skip creation of a forensic image in time-critical situations, which could be questionable. Careful selection and weighing of features has to be done, since some of the features suggested by the authors in one of the case studies disregard personal preferences of users and classify them as crime-related features. Also, predefined lists of crime-related features would be required to triage on the scene; the type of crime is not always clear during the initial moments, which may hamper the practical application of this approach. The features considered by the authors are numerical, which raises the question of how thresholds are set up. The performance figures for this approach are not clear either, and these would be necessary to determine its suitability for on-scene triage of evidence for actionable intelligence. The workflow of the model can be seen in figure 2 of Marturana and Tacconi [7] in the appendix.

The authors acknowledge that the approach is intended as an aid rather than as a way to automate or replace manual investigation. Despite some pitfalls, the authors contribute valuable knowledge regarding the performance of various machine learning algorithms that can be used for triage of digital evidence.

3.6 Ontology based approach

Ontologies can be used to establish relations between different objects and entities with the help of their properties. Many of the approaches discussed so far use databases or other backing stores to hold event information and provide little semantic understanding of the event reconstruction. Chabot et al. [8] propose an ontology-based approach for automated event reconstruction and analysis to bridge this gap. One of the primary strengths of this approach is that it is based on the OWL2 DL language, which has a strong theoretical foundation. The framework of this model can be seen in figure 3 of Chabot et al. [8] in the appendix.

The approach segregates a digital artefact into subject, predicate and object form. The model is automated end to end, from extraction to analysis and visualization. A salient feature of this approach is that it can deduce new facts based on the existing knowledge from artefacts, thus enhancing the reconstruction. Another strength is that traceability of the artefacts extracted and analyzed is maintained throughout the process, and the data extraction is capable of handling heterogeneous data sources. Using the ontology, it is therefore possible to establish detailed semantic relationships between multiple entities, be it a user or a process related to an event or an artefact. The enhancement phase mentioned earlier uses a set of rules for deduction. One of the rules is based on temporal analysis, which requires careful definition of thresholds. An example rule (namely Rule 2) about identification of the user session for a particular event might not hold true for time-sharing systems and shared servers with multiple users; with careful application, though, this approach can come in handy for investigating multi-user systems as well. A distinction is made in the visualization between facts identified from the raw events and deduced knowledge. This is an appreciable factor as it maintains the transparency of the investigation, and the deduced knowledge is also scored with a lower value to take its uncertainty into account. The approach uses semantic links as well as correlation methods such as temporal analysis for reconstruction, and it allows expert-knowledge-based correlation criteria to be incorporated during reconstruction. The output is provided in a graphical view. This is an advantage as it gives the investigator an intuitive view, but also a disadvantage, since a graphical view can become counter-intuitive in complex cases. Specialised knowledge of SPARQL is also required for advanced query searches by the investigator. There is room for further research on this approach, to incorporate compatibility for extraction of more data sources and to optimize performance, before the model is ready for practical application.
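A small sketch of the subject–predicate–object representation and a SPARQL query over it, here using the rdflib library and a made-up case namespace; the actual ontology, rule set and scoring of Chabot et al. are far more elaborate.

```python
from rdflib import Graph, Literal, Namespace

CASE = Namespace("http://example.org/case#")   # hypothetical namespace

g = Graph()
g.add((CASE.event42, CASE.performedBy, CASE.userAlice))   # extracted or deduced fact
g.add((CASE.event42, CASE.occurredAt, Literal("2020-11-05T09:13:00")))
g.add((CASE.event42, CASE.involvesArtefact, CASE.file_report_pdf))

# All events attributed to one user, with their timestamps.
query = """
PREFIX case: <http://example.org/case#>
SELECT ?event ?time WHERE {
    ?event case:performedBy case:userAlice .
    ?event case:occurredAt ?time .
}
"""
for event, time in g.query(query):
    print(event, time)
```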

3.7 Data mining approaches

The amount of data to be analysed in digital investigations can be reduced by leveraging data mining techniques such as predictive data modelling, classification, descriptive data modelling and content-based retrieval. Beebe and Clark [9] highlight how these techniques can be utilized in digital investigations and also caution against the methods to be avoided. For instance, data characterization models may be useful for digital investigations, but if they are based on data aggregation they are not fit for forensic purposes. Characterization of data using descriptive models, however, can assist in focusing the digital investigation. Activity-based and event-based association rule mining can be used to generate behavioural profiles of standard users and aid in mining logs for anomalies.

A focused application can be seen in image mining for digital forensic investigations (Brown, Pham, and Vel [10]). In this approach, the image is segmented into multiple patches and the corresponding Haar coefficients are used as features. It is an expert-based system, in which an experienced investigator acts as the model trainer for the support vector machine. The expert sets up the parameters and constraints in the model so that it recognizes specific patches of images for classification. Once the model is trained, a less experienced query operator can submit an unclassified image for classification as improper imagery, such as partially clad people. An advantage of this system is the automated analysis of large image datasets. However, the grammar developed for querying is not user friendly, and an expert review is still needed after an image is classified.
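A rough sketch of the patch-classification idea, assuming equally sized grayscale patches, single-level Haar coefficients computed with PyWavelets as features, and an SVM as the classifier; the original system's feature design and query grammar are not reproduced here.

```python
import numpy as np
import pywt
from sklearn.svm import SVC

def haar_features(patch: np.ndarray) -> np.ndarray:
    """Single-level 2D Haar decomposition of a grayscale patch, flattened."""
    cA, (cH, cV, cD) = pywt.dwt2(patch.astype(float), "haar")
    return np.concatenate([c.ravel() for c in (cA, cH, cV, cD)])

def train_patch_classifier(patches, labels):
    """Expert-labelled patches (1 = of interest, 0 = benign) train the SVM."""
    X = np.stack([haar_features(p) for p in patches])
    return SVC(kernel="rbf").fit(X, labels)

def classify_patches(model, patches):
    """A query operator submits the patches of an unclassified image."""
    return model.predict(np.stack([haar_features(p) for p in patches]))
```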


3.8 Automated Target Definition

This approach focuses on automating the analysis by comparing the questioned datasets with predefined target definitions. A target definition can be derived specific to an incident or based on the evidence already observed during the investigation (Carrier, Spafford, et al. [11]); this approach is therefore also an expert-based system. A tool developed in this approach provides additional search suggestions to the investigator through target definitions derived from existing evidence. An additional feature is the refining or modification of the target definition for search and comparison. The modifications are based on spatial, temporal, keyword and content relationships observed in the existing evidence material. For instance, a spatial outlier can be detected using this technique, which may be of interest and should be reviewed by an investigator. The method works best as an aid during the investigation to uncover evidence from large datasets, but it does not automate the process entirely. The process model can be seen in figure 4 of Carrier, Spafford, et al. [11] in the appendix.

A different version of the above approach can be seen when the target is defined as user categories for user classification (Grillo et al. [12]). In this case, the investigator defines the user categories. This approach uses machine learning algorithms and a minimal set of features to describe the user profiles present on the device under investigation. The features used are mostly extracted from the registry and fall into two categories: user-specific features and system features. This method can be used to prioritize the devices to be analysed based on the user profiles that fit the incident. Figure 5 of Grillo et al. [12], available in the appendix, describes this process.

3.9 Intelligent Analytical Systems

A Multi-Agent System (MAS) consisting of Intelligent Software Agents (ISA) can be used for distributed processing and analysis of digital evidence (Hoelz, Ralha, and Geeverghese [13]). The MultiAgent Digital Investigation Toolkit (MADIK) developed by the authors consists of six different ISAs analysing large digital datasets in a distributed manner. Each ISA contains a knowledge base and a set of rules defined by the investigator based on the type of incident. These agents can perform temporal, spatial, keyword, content, registry and signature analysis in a distributed manner. The design follows a hierarchical model: a strategic agent decides the allocation of resources, while a tactical agent combines the results from the individual ISAs and presents them to the investigator for review. The traces selected by the investigator can be used to train the system further, to increase efficiency and to identify which ISAs are more reliable than others for a specific type of investigation. The framework of the system is shown in figure 6 of Hoelz, Ralha, and Geeverghese [13] in the appendix.


3.10 Signature based approaches

An event or action can result in the modification or updating of timestamps of various files across the system, and these timestamps can be used to create a unique signature for a specific event. James, Gladyshev, and Zhu [14] use this argument to create a knowledge-based signature system for identifying user events during an investigation. The study analyzes a few common processes, such as opening a browser, to identify the patterns in the changes of file timestamps. The results show that there are three categories of timestamps: those that always update on occurrence of the event, those that update only during the first run, and those whose update depends on user interaction. The authors then define the collection of these timestamps within a threshold of one minute as the signature of that event.
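The following sketch captures the signature idea under assumed inputs: each training run is a snapshot mapping file paths to their post-action timestamps, the core signature is the set of files updated in every run, and the one-minute threshold follows the paper. It is an illustration, not the authors' implementation.

```python
from datetime import timedelta

def build_signature(runs):
    """Split observed files into a core ('always update') set and a supporting set."""
    core = set.intersection(*(set(r) for r in runs))
    supporting = set().union(*runs) - core
    return core, supporting

def matches(core, observed, window=timedelta(minutes=1)):
    """An event matches if every core-signature file was updated within one window."""
    if not core:
        return False
    times = [observed[p] for p in core if p in observed]
    if len(times) < len(core):
        return False
    return max(times) - min(times) <= window
```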

A further implementation of this work can be seen in James and Gladyshev [15]. The study is based on a causal model between a user action, a process and the generation of a trace. The method provides a framework for creating signatures for events and for matching signatures, with the always-update category of timestamps forming the core signature and the other categories forming a supporting signature. The method also focuses on defining thresholds, since the updating of timestamps is not instantaneous. This method gives promising results but has a big disadvantage: the thresholds have to be defined by an expert for different processes and on different systems.

3.11 Enhanced Timeline Analysis

Super timelines can be extensive and challenging for an investigator to understand; visualization therefore plays an important role when using timelines as a method of analysing digital evidence. Inglot, Liu, and Antonopoulos [16] provide ideas on the visualization of super timelines, such as noise reduction, clustering of events and presentation. The work is built upon Zeitline (Buchholz and Falk [17]). Although the authors claim to address the shortcomings of Zeitline, that does not seem to be the case: the work focuses mostly on improving the visualization and user interface of Zeitline rather than enhancing its functionality. It still requires manual grouping of events and does not provide any automation. The authors claim that this enhanced timeline analysis meets real-world needs, but in our opinion it is not a viable option in the current landscape of digital investigations.

3.12 Forensic Data Reduction Approaches

Field triage is one of the ways to reduce the amount of data to be collected for examination, but it risks losing important relevant information. Shaw and Browne [18] acknowledge these risks and also highlight the risk that conducting a full forensic examination of all seized evidence poses for timely analysis within the criminal justice system, along with its implications. The authors try to find a trade-off point that balances the risks involved in both methods through a new way of previewing the data.

This method involves previewing the data on the evidential system itself instead of on a forensic copy. The previewing is performed using a remastered version of an existing forensic bootable CD, CAINE (Bassetti [19]). It provides a number of routines such as keyword search, chat message artefact recovery, graphic image discovery and file analysis, to name a few. It is not an automated decision-making system: an investigator has to look through the findings to decide whether the device is relevant for full forensic examination or can be excluded. A salient feature is that the filesystem structure, including the directories, is maintained in the visualization of the previewing process, so that if the system needs further examination, the investigator already has an idea of the places in the filesystem that may be of interest. Another feature is that cross-drive analysis for intelligence purposes is also possible through the tools in the remastered CAINE live CD. The method uses the resources of the system under investigation itself and provides a fast way of previewing for decision-making on further steps; it is hence a promising balance between the risks of field triage and complete analysis.

Another way to deal with large forensic datasets is to treat them as big data for analysis. Zawoad and Hasan [20] propose a model for big data forensics over the cloud. They also highlight possible opportunities such as cross-correlation of datasets to identify patterns and connections between cybercriminals, phishing blacklists and IoT device forensics. The study does not provide an implementation of the proposed model; its design can be seen in figure 7 of Zawoad and Hasan [20] in the appendix.

Similar to the previewing process mentioned earlier, a slightly different variant is to image devices selectively. Quick and Choo [21] provide a framework for the process of selective imaging while minimizing the risk of missing relevant evidence. This method takes the previewing process described earlier a step further by providing more filters and routines, such as event logs, database files, registry and logs, to name some from the extensive list in the paper. An additional feature is the conversion of video files into image snapshots for quick visual analysis by an investigator. This system is also not completely automated, and an investigator is needed for decision-making. A comparison of various triage methods and their performance is also summarised in the study. The framework for selective acquisition is shown in figure 8 of Quick and Choo [21] in the appendix.

4 Second Benchmark

We have analysed a number of approaches developed in the years around the first benchmark study. Another literature study on the field of digital forensics has been conducted more recently; we use this study [22] as the second benchmark point in time for our review.

We observe some recurring problems highlighted again in this benchmark, such as complexity, heterogeneity, correlation, volume and timelining. A new problem concerning the location of data with respect to cloud services is also introduced. The authors further highlight the complexity arising from IoT and mobile devices operating on different standards and systems. The problem of correlation is discussed with respect to the number of devices that need to be analysed in a single case; further work on some of the research mentioned in the previous section will alleviate, if not completely solve, this problem. A problem only lightly touched upon in earlier approaches, the use of anti-forensics, appears to be becoming more significant in recent years. Security constraints on data extracted from small battery-operated devices in modern digital cases are discussed as another challenge: since such devices have limited computational capacity, the data stored on them sacrifices integrity and authenticity due to the lack of cryptographic systems on board, which raises questions about the usefulness of such data in a legal context. The study also highlights the low importance given to the performance requirements of the forensic tools developed, both by users and by developers. This trend is also visible in our analysis of some of the approaches discussed in the previous section, and this negligence towards performance requirements is a major barrier to implementing novel approaches in the practical field. The lack of use of high performance computing and parallel processing by the digital forensic community is also highlighted; this idea was already suggested by [2] back in 2009 but still remains unexplored. The lack of use of GPUs and Field Programmable Gate Arrays (FPGAs) to counter performance bottlenecks is another issue that is raised. On-the-fly identification of incriminating traces during the acquisition process is suggested as a potential future research opportunity.

Significant research efforts have been made on the issues mentioned in the first benchmark; however, the lack of practical implementation means these problems are observed again in the second benchmark. A few areas mentioned in both benchmarks, mostly related to improving the efficiency of forensic tools, still remain untapped by the digital forensics community. The comparison between the two benchmarks helps us understand the areas where research is required, as well as the new challenges emerging in the digital age.

5 Current Research

The current research with respect to our research questions leans towards using metadata and timelines to distinguish between relevant and non-relevant digital traces during the investigation. We discuss the most recent studies in this area below.

5.1 Technology assisted analysis of timeline

The research conducted by [23] focuses not only on advanced cybercrime cases but is also applicable to more general cases in which digital evidence is recovered. The approach uses artificial intelligence to discover relations between digital traces. The research is centred on the analysis tool "Axiom" and is based on a forensic ontology. The approach is similar to the one discussed earlier by [8] but uses a different ontology to generate relational graphs. This makes it easy to identify all the evidence related to a particular link in the graph that piques the interest of the investigator. One advantage of this approach over other ontology-based approaches is the use of a simpler graph query language supported by multiple graph databases, making it easier for an investigator. An example of the output provided by Axiom is shown in figure 9 of Henseler and Hyde [23] in the appendix.
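A toy illustration of the relational-graph idea using networkx; the node names and relations are invented, and this is not Axiom's data model or its graph query language (which targets graph databases).

```python
import networkx as nx

g = nx.Graph()
g.add_edge("chat_msg_42", "account:alice", relation="sent_by")
g.add_edge("chat_msg_42", "device:phone_A", relation="recovered_from")
g.add_edge("photo_17", "device:phone_A", relation="recovered_from")

def related_evidence(graph, node):
    """All traces directly linked to a node of interest, with the linking relation."""
    return [(nbr, graph.edges[node, nbr]["relation"]) for nbr in graph.neighbors(node)]

# Everything recovered from, or otherwise linked to, one device:
print(related_evidence(g, "device:phone_A"))
```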

A feature of Axiom is that it allows relations identified from other sources to be added to the graph manually. It is debatable whether this is necessarily a good idea: since the reliability of other sources of information, such as witness and victim statements, is not known, incorporating them into a graph deduced from facts may create more confusion for the investigator, and it increases the possibility of introducing bias into a fairly objective system. The authors expect that graph queries could be used as a means of hypothesis testing, which can add value to the system. The future research suggested for this tool proposes the use of Graph Neural Networks to identify relevant sub-graphs within the complex graphs that real-life cases can generate.

We think that further development in this area would increase the practical usability of the tool. The research presented does not give details about the exact methodology used to form the links, so it is difficult to comment on the sophistication, reliability and validity of the system.

5.2 Metadata based classification of Incriminating Artefacts

Automating the analysis process requires multiple tools conducting different tasks of extraction, correlation and visualization to work in synchronization. [24] identify that tools designed to work within a framework are key to achieving this goal. Their research employs a machine learning based approach combined with a centralised, deduplicated digital evidence processing framework that contains information on previously encountered illegal file artefacts, in order to identify and classify suspicious artefacts in new cases.

The research provides a toolkit for dataset generation from disk images; the generated data is filtered through the centralised framework to identify known file artefacts. The known file artefacts are used to train the machine learning model, which is then used to predict the unknown file artefacts in the dataset and classify them as suspicious or benign. The research also provides guidelines on the selection of features from metadata for the machine learning model, as well as performance metrics for some of the machine learning algorithms used for prediction. The model can be seen in figure 10 of Du and Scanlon [24] in the appendix.
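A condensed sketch of the train-on-known, predict-on-unknown workflow, with an invented metadata feature vector and a random forest standing in for the algorithms benchmarked in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def to_vector(rec):
    """Illustrative metadata features only; see the paper for its feature guidelines."""
    return [rec["size"], rec["mtime"], rec["path_depth"], hash(rec["extension"]) % 1000]

def train_on_known(known_records, known_labels):
    """Artefacts matched in the centralised database provide the labelled training set."""
    X = np.array([to_vector(r) for r in known_records], dtype=float)
    return RandomForestClassifier(n_estimators=100).fit(X, known_labels)

def classify_unknown(model, unknown_records):
    """Remaining artefacts are predicted as suspicious or benign for review."""
    X = np.array([to_vector(r) for r in unknown_records], dtype=float)
    return model.predict(X)
```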

The authors conclude that the toolkit is designed for integration with a DFaaS framework rather than as a standalone application. The approach is practical to implement provided that a DFaaS framework with a centralised database of known file artefacts exists. This approach could be one of the keys to achieving the on-the-fly identification of incriminating traces highlighted in the second benchmark discussed earlier.

5.3 Zero Knowledge Event Reconstruction

Most of the approaches for reconstruction of events are knowledge-based systems. Kälber, Dewald, and Idler [25] propose an approach to reconstruct timelines without any prior application-level knowledge. The tool developed in this research is limited to NTFS filesystems and harnesses information about the update pattern of NTFS timestamps.

The authors argue that under normal usage, filesystem activity follows a Pareto distribution: an individual user action results in timestamp updates of multiple files on the filesystem within a time frame of a few seconds. This idea is used to cluster the accessed files based on metadata such as absolute path and file name, using the peaks of the Pareto distribution and temporal analysis. Since an application accesses multiple files stored across the filesystem, multiple clusters can be formed for one particular event, and analyzing these clusters can give an idea of which application was used at that instance in time. The tool developed also facilitates visualization of the events in a timeline.
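A minimal sketch of the burst-clustering step, assuming the input is a list of (timestamp, path) pairs taken from filesystem metadata; the five-second gap is an illustrative threshold, not the value used by the authors.

```python
from datetime import timedelta

def temporal_clusters(file_events, gap=timedelta(seconds=5)):
    """Group (timestamp, path) pairs into activity bursts separated by idle gaps."""
    if not file_events:
        return []
    events = sorted(file_events)
    clusters, current = [], [events[0]]
    for prev, cur in zip(events, events[1:]):
        if cur[0] - prev[0] > gap:
            clusters.append(current)
            current = []
        current.append(cur)
    clusters.append(current)
    return clusters

# Each burst approximates one user action; the paths inside it hint at the application used.
```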

This approach provides an idea of the possible activities that have been performed on a filesystem but cannot necessarily indicate the exact applications. The approach is limited to the NTFS filesystem, and a timeline can be generated only of the latest events, since only timestamps from files are considered. It is also possible to circumvent the tool, since file timestamps alone are easy to manipulate and this cannot be verified by the tool against other sources. The approach provides a research direction for zero-knowledge event reconstruction but is not feasible for practical application in real-life cases.

5.4 Compression based framework

Logs from intrusion detection systems and antivirus software can provide information regarding malicious actors in a system. These massive logs cannot be examined manually, and Torre-Abaitua, Lago-Fernández, and Arroyo [26] provide a methodology to analyse large amounts of textual data. The approach is based on the Normalised Compression Distance (NCD) of Cilibrasi and Vitányi [27].

The textual data is divided into two disjoint groups. One of these groups is used for generating attributes, while the other group is used to train a classifier such as a support vector machine (SVM). A number k of attributes can be generated, and an NCD is computed between each text string and each attribute to form a matrix. This matrix is used to train the SVM, which can then distinguish between anomalous and regular data. Attributes generated based on the type of investigation can help an investigator automatically skim through large amounts of textual information and identify only the pieces relevant to the investigation. The diagram for this model is shown in figure 11 (Cilibrasi and Vitányi [27]) in the appendix.
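The NCD itself is easy to state: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the length of the compressed input. A short sketch using zlib as the compressor; any real compressor works, and the attribute strings are whatever the investigator chooses, so this is an illustration rather than the paper's pipeline.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed length, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance (Cilibrasi and Vitányi [27])."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

def attribute_matrix(lines, attributes):
    """One row per log line: its NCD against each of the k chosen attribute strings.
    The resulting matrix is what the classifier (e.g. an SVM) is trained on."""
    return [[ncd(line, a) for a in attributes] for line in lines]
```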

5.5 Deduplicated Acquisition

An interesting and efficient way of performing selective data acquisition is to image only data that is new and has not been encountered in previous investigations. The design of such a system is possible with a centralised database coupled with a Digital Forensics as a Service (DFaaS) system. This methodology is put forth by Du, Ledwith, and Scanlon [28].

It is a client-server-database model. During the investigation, the client sends the metadata of a file to the server. The server checks, via the hash, whether the file already exists in the database; if it does, the file is not acquired. If the file does not exist, the server requests the file to be sent. Redundancy of data in the DFaaS system is thus avoided. Once the imaging process is complete, the entire disk image can be reconstructed using the existing files from the database and the information stored on the server during the acquisition process. An exact copy can be reconstructed using the metadata, and the hash of the reconstructed image matches the full disk hash of the system under investigation, which proves the forensic soundness of the method. An additional benefit is that incriminating evidence can be identified on the fly during the acquisition process when comparisons are made with the central database. This method combines acquisition and pre-processing. Figure 12 of Du, Ledwith, and Scanlon [28] in the appendix shows the model of this system.
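A toy client-server sketch of the deduplication check, with an in-memory dictionary standing in for the central DFaaS database; transport, authentication and image reconstruction are omitted.

```python
import hashlib

class DedupStore:
    """Stand-in for the central database of previously acquired files."""
    def __init__(self):
        self.files = {}                      # sha256 hex digest -> content

    def has(self, digest):
        return digest in self.files

    def put(self, digest, content):
        self.files[digest] = content

def acquire(store, path, content):
    """Client side: transfer only content the central store has not seen before."""
    digest = hashlib.sha256(content).hexdigest()
    transferred = not store.has(digest)
    if transferred:
        store.put(digest, content)           # only new data crosses the wire
    return {"path": path, "sha256": digest, "transferred": transferred}

# The stored (path, hash) records plus the deduplicated content are enough to
# rebuild a hash-identical disk image later.
```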

6 Discussion

Research in the field of digital forensics related to the problem of large volumes of data to be analysed truly started in 2005 and gained momentum from 2009; the first benchmark study appears to have been the catalyst. The second benchmark study acknowledges the advances in research until 2016 but also introduces new problems such as Internet of Things (IoT) device forensics and cloud forensics. The initial problem of coping with large amounts of data in investigations is still mentioned as unsolved. The reason is that, although much promising research aimed at solving the problem has been undertaken, most of it cannot be applied directly in practice. Below we offer some ideas that may add value to the existing methods or increase their viability in practice.

Methods such as Cross-Drive Analysis are promising for coping with large amounts of data but are designed with specific features, such as emails, credit card numbers and SSNs, in mind. Could a method such as Term Frequency-Inverse Document Frequency (TF-IDF) be used instead to select features for analysis? An answer to this question may make the system more generic and viable for practical application in all scenarios.
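As a sketch of what such a generic feature selector could look like, the snippet below ranks the most characteristic terms of each drive in a corpus with scikit-learn's TF-IDF implementation; this is our own illustration of the idea, not an evaluated method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def characteristic_terms(drive_texts, top_k=10):
    """Rank the terms that best distinguish each drive from the rest of the corpus."""
    names = list(drive_texts)
    vec = TfidfVectorizer(max_features=5000)
    tfidf = vec.fit_transform([drive_texts[n] for n in names])
    terms = vec.get_feature_names_out()
    ranked = {}
    for i, name in enumerate(names):
        row = tfidf[i].toarray().ravel()
        top = row.argsort()[::-1][:top_k]
        ranked[name] = [(terms[j], round(float(row[j]), 3)) for j in top if row[j] > 0]
    return ranked
```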

Timelining is crucial for recovering the relevant traces from large datasets; however, visualization of super timelines is a must for its practical application. Combining ontologies with high-level event reconstruction might yield better results, both in terms of performance and in terms of automated visualization. Similarly, the majority of standalone applications, such as "Dirim" and "ramparser", would be better used in combination with a DFaaS system than as standalone tools. This would also help overcome the limitation of systems such as FACE, which is applicable only to Linux systems.

Another way to cope with the large amount of forensic data in digital investigations is to leverage data mining approaches, although these may still suffer from performance bottlenecks. Combining data mining with distributed processing using intelligent analytical systems may be fruitful to streamline the investigation. Many of the developed systems are knowledge based even when automated; more research is needed on zero-knowledge reconstruction methods.

There are two ways to reduce the amount of data to be analysed during a digital investigation: triage, and reduction during acquisition. In our opinion, triage introduces a significant risk of overlooking critical information and jeopardizing the admissibility of evidence in court. This study does not recommend triage as a data reduction method when better alternatives such as previewing are available. Data reduction during acquisition can be performed either by selective acquisition or by deduplication during acquisition. We recommend deduplication, since it imposes the need for a centralised database and a DFaaS system and thereby enforces the availability of a framework for the integration of multiple tools. It thus not only boosts acquisition speeds, reduces the data to be analysed and provides pre-processing on the fly, but also enforces a common centralised framework.


7 Conclusion

We analysed the literature to find methods that can add relevance scores to heterogeneous digital traces. There are no methods that directly add relevance scores to observed digital traces per se; instead, this goal is achieved by clustering digital traces, after which an expert can easily distinguish between relevant and non-relevant clusters of traces. The FACE correlation engine and the Directory Imager "Dirim" are examples of such methods. This answers our first research question.

We found multiple approaches, ranging from standalone tools, timeline analysis, metadata analysis and triage to data mining and data reduction, for the identification of relevant digital traces within the large datasets provided for digital investigation. Timeline and metadata approaches are effective ways of identifying key information from a large corpus, though timeline analysis is more difficult to automate completely and requires expert knowledge for setting up thresholds consistently. There is no single approach that is generic and can be used in all scenarios; thus, a combination of multiple approaches complementing each other is vital for developing a generic digital forensic toolkit. Automation of the approach, the knowledge base the system requires for automating the process, performance bottlenecks, practical applicability and maintaining legal standards are the key challenges for any approach. Only a few of the described approaches meet most or all of these requirements and are in practical use. This stresses the need for collaboration and combined effort to tackle the big forensic data problem.


References

[1] RB Van Baar, HMA Van Beek, and EJ Van Eijk. "Digital Forensics as a Service: A game changer". In: Digital Investigation 11 (2014), S54-S62.

[2] Nicole Beebe. "Digital Forensic Research: The Good, the Bad and the Unaddressed". In: Advances in Digital Forensics V. Ed. by Gilbert Peterson and Sujeet Shenoi. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 17-36.

[3] Simson L. Garfinkel. "Forensic feature extraction and cross-drive analysis". In: Digital Investigation 3 (2006). The Proceedings of the 6th Annual Digital Forensic Research Workshop (DFRWS '06), pp. 71-81. issn: 1742-2876. doi: 10.1016/j.diin.2006.06.007. url: http://www.sciencedirect.com/science/article/pii/S1742287606000697.

[4] Andrew Case, Andrew Cristina, Lodovico Marziale, Golden G. Richard, and Vassil Roussev. "FACE: Automated digital evidence discovery and correlation". In: Digital Investigation 5 (2008). The Proceedings of the Eighth Annual DFRWS Conference, S65-S75. issn: 1742-2876. doi: 10.1016/j.diin.2008.05.008. url: http://www.sciencedirect.com/science/article/pii/S1742287608000340.

[5] Christopher Hargreaves and Jonathan Patterson. "An automated timeline reconstruction approach for digital forensic investigations". In: Digital Investigation 9 (2012). The Proceedings of the Twelfth Annual DFRWS Conference, S69-S79. issn: 1742-2876. doi: 10.1016/j.diin.2012.05.006. url: http://www.sciencedirect.com/science/article/pii/S174228761200031X.

[6] Neil C. Rowe and Simson L. Garfinkel. "Finding Anomalous and Suspicious Files from Directory Metadata on a Large Corpus". In: Digital Forensics and Cyber Crime. Ed. by Pavel Gladyshev and Marcus K. Rogers. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 115-130.

[7] Fabio Marturana and Simone Tacconi. "A Machine Learning-based Triage methodology for automated categorization of digital media". In: Digital Investigation 10.2 (2013). Triage in Digital Forensics, pp. 193-204. issn: 1742-2876. doi: 10.1016/j.diin.2013.01.001. url: http://www.sciencedirect.com/science/article/pii/S1742287613000029.

[8] Yoan Chabot, Aurélie Bertaux, Christophe Nicolle, and Tahar Kechadi. "An ontology-based approach for the reconstruction and analysis of digital incidents timelines". In: Digital Investigation 15 (2015). Special Issue: Big Data and Intelligent Data Analysis, pp. 83-100. issn: 1742-2876. doi: 10.1016/j.diin.2015.07.005. url: http://www.sciencedirect.com/science/article/pii/S1742287615000869.

[9] Nicole Beebe and Jan Clark. "Dealing with terabyte data sets in digital investigations". In: IFIP International Conference on Digital Forensics. Springer. 2005, pp. 3-16.

[10] Ross Brown, Binh Pham, and Olivier de Vel. "Design of a digital forensics image mining system". In: International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. Springer. 2005, pp. 395-404.

[11] Brian D Carrier, Eugene H Spafford, et al. "Automated Digital Evidence Target Definition Using Outlier Analysis and Existing Evidence." In: DFRWS. Citeseer. 2005.

[12] Antonio Grillo, Alessandro Lentini, Gianluigi Me, and Matteo Ottoni. "Fast user classifying to establish forensic analysis priorities". In: 2009 Fifth International Conference on IT Security Incident Management and IT Forensics. IEEE. 2009, pp. 69-77.

[13] Bruno WP Hoelz, Célia Ghedini Ralha, and Rajiv Geeverghese. "Artificial intelligence applied to computer forensics". In: Proceedings of the 2009 ACM Symposium on Applied Computing. ACM. 2009, pp. 883-888.

[14] Joshua Isaac James, Pavel Gladyshev, and Yuandong Zhu. "Signature based detection of user events for post-mortem forensic analysis". In: International Conference on Digital Forensics and Cyber Crime. Springer. 2010, pp. 96-109.

[15] Joshua I James and Pavel Gladyshev. "Automated inference of past action instances in digital investigations". In: International Journal of Information Security 14.3 (2015), pp. 249-261.

[16] Bartosz Inglot, Lu Liu, and Nick Antonopoulos. "A framework for enhanced timeline analysis in digital forensics". In: 2012 IEEE International Conference on Green Computing and Communications. IEEE. 2012, pp. 253-256.

[17] Florian P Buchholz and Courtney Falk. "Design and Implementation of Zeitline: a Forensic Timeline Editor." In: DFRWS. 2005.

[18] Adrian Shaw and Alan Browne. "A practical and robust approach to coping with large volumes of data submitted for digital forensic examination". In: Digital Investigation 10.2 (2013), pp. 116-128.

[19] Nanni Bassetti. CAINE Live USB/DVD - Computer Forensics Linux.

[20] Shams Zawoad and Ragib Hasan. "Digital forensics in the age of big data: Challenges, approaches, and opportunities". In: 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE. 2015, pp. 1320-1325.

[21] Darren Quick and Kim-Kwang Raymond Choo. "Big forensic data reduction: digital forensic images and electronic evidence". In: Cluster Computing 19.2 (2016), pp. 723-740.

[22] David Lillis, Brett Becker, Tadhg O'Sullivan, and Mark Scanlon. Current Challenges and Future Research Areas for Digital Forensic Investigation. 2016. arXiv: 1604.03850 [cs.CR].

[23] Hans Henseler and Jessica Hyde. "Technology assisted analysis of timeline and connections in digital forensic investigations". In: (2019).

[24] Xiaoyu Du and Mark Scanlon. "Methodology for the Automated Metadata-Based Classification of Incriminating Digital Forensic Artefacts". In: Proceedings of the 14th International Conference on Availability, Reliability and Security. ACM. 2019, p. 43.

[25] Sven Kälber, Andreas Dewald, and Steffen Idler. "Forensic zero-knowledge event reconstruction on filesystem metadata". In: Sicherheit 2014 - Sicherheit, Schutz und Zuverlässigkeit (2014).

[26] Gonzalo de la Torre-Abaitua, Luis F Lago-Fernández, and David Arroyo. "A compression based framework for the detection of anomalies in heterogeneous data sources". In: arXiv preprint arXiv:1908.00417 (2019).

[27] Rudi Cilibrasi and Paul MB Vitányi. "Clustering by compression". In: IEEE Transactions on Information Theory 51.4 (2005), pp. 1523-1545.

[28] Xiaoyu Du, Paul Ledwith, and Mark Scanlon. "Deduplicated disk image evidence acquisition and forensically-sound reconstruction". In: 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). IEEE. 2018, pp. 1674-1679.


Appendix


Figure 3: Ontology approach model


Figure 9: Example GUI output of Axiom


Figure 11: Compression based approach model


Appendix 2: Search Strategy

The search strategy for the literature was performed in the following manner:

● The initial literature was provided by the supervisor.
● The next phase involved keyword searches on Google Scholar using keywords in base form, for instance "forensic" instead of "forensics".
● Additional literature was found in the references of papers already acquired. This process involved going through the abstracts of the referenced papers to see whether they were relevant to the research questions defined. This was a recursive process for several highly cited papers and yielded good results.
