Data Prediction in Forensic Networks


Sophie Warringa^{a,*}

^{a} University of Amsterdam, MSc Forensic Science, Science Park 904, 1098 XH Amsterdam

Abstract. Recent studies have shown the possibility of representing (forensic) evidence in a mathematical graph or network. This study proposes the use of these forensic networks in order to predict missing data, including missing edges and/or (node) characteristics. A literature study is presented on the currently known methods of edge- and characteristic prediction. After a brief explanation on the computation of the predictions, the discussed methods are compared to one another with regard to their applicability to forensic networks. The conclusion states the promising methods for predicting missing data in mathematical network representations of evidence. Finally, the limitations are discussed and further research on the topic is recommended by the author.

Keywords: Forensics, Graph theory, Link prediction, Feature prediction

Date: 14th of December 2020

Supervisors: Dhr. prof. dr. R. Nunez Queija (UvA) and Dhr. dr. A.V. den Boer (UvA)

Examiner: Mw. prof. dr. M.J. Sjerps (UvA)

Number of words: 8778.

* Contact information: s.m.warringa@student.uva.nl, Student-ID: 12019410

1 Introduction

Evidence in a forensic investigation tends to consist of a set of unstructured data, scientific as well as non-scientific. Networks are a commonly applied tool for structuring data in many different fields of study (Tantardini, Ieva, Tajoli, & Piccardi, 2019). These networks can be used for, among other things, predicting missing data (Kushwah & Manjhvar, 2016) (Sha et al., 2018). Recently, networks have proven their use in forensic investigation, with studies on general modelling of investigations, e-mail evidence, cyber breaches and others (Haggerty, Karran, Lamb, & Taylor, 2011) (Easttom, 2020) (Easttom, 2017a). A forensic network is a graph modelling the forensic evidence of a specific case, consisting of nodes, characteristics and edges. However, in a forensic investigation there are always unknowns that the investigation aims to find. These unknowns might be missing data in the forensic network. Since the ultimate goal of the investigation is to find these unknowns, it is useful to be able to predict this missing data. Therefore, the main focus of this paper is to use graph theory to answer the following research question:

RQ: What are promising methods for predicting missing data in mathematical network representations of evidence?

In order to do so, the literature on forensic networking is first reviewed in section 2. Part of this literature has a more general aim of modelling all types of evidence. Other studies focus primarily on a specific type of evidence, such as e-mails (Haggerty et al., 2011), or on a specific type of crime, such as missing persons (Caridi, Dorso, Gallo, & Somigliana, 2011). Second, the other half of the research question is examined: predicting missing data. Within the field of graph theory, there are various types of data which might be missing from a network. Missing edges, missing nodes or even missing specific characteristics of present nodes might occur (Kim & Leskovec, 2011) (Eyal, Kraus, & Rosenfeld, 2011). The data prediction will, however, be focused solely on missing edges and missing characteristics. As will be explained in section 2 (Forensic Networking), the nodes of the forensic network represent entities, such as persons, locations and other forms of “evidence”. Intuitively, it is expected that the predictions on the network of the forensic investigation can never foresee new entities, but only their links and characteristics. Therefore, in section 3 methods of predicting missing edges are discussed and compared, focusing on whether these methods would be applicable to a forensic network. In section 4, methods for predicting missing characteristics are presented and compared for their forensic application.

The results of the literature review as presented in sections 2-4 lead to a conclusion and discussion in section 6. Lastly, some recommendations for further research will be presented in section 6.2.

2 Forensic Networking

The amount of research on forensic networks is limited. The field of study is fairly new, with the first paper on a generalized methodology published in 2017 (Easttom, 2017a). The same author concluded three years later that still “there is a notable absence of a rigorous, general application of graph theory to network forensics” (Easttom, 2020). A large share of the papers that have been published on the subject focus on the application to digital forensics, such as cyber breaches (Noel, Harley, Tam, Limiero, & Share, 2016). However, graph theory has also been applied to other types of crimes that require a different forensic approach. In 2006, Zufferey and co-authors aimed to recognize patterns in heroin cutting agents by representing these cutting agents in multiple graphs (Terretaz-Zufferey, Ribaux, P., & Kanevski, 2006). Moreover, forensic networks have been used to locate missing persons who disappeared in Argentina between 1974 and 1981 (Caridi et al., 2011). As stated before, a general working method for using graph theory to represent evidence and estimate missing information is still lacking. In this section, a definition of a graph will be provided. Moreover, the results of the research on forensic networking obtained to date will briefly be discussed.

2.1 Graph theory: nodes, edges and weights

Graph theory was first introduced by Leonhard Euler to solve the Königsberg Bridge Problem in 1735 (Ahmed, 2019). Nowadays, the field of graph theory is used for solving many problems. In the more recent past, problems in forensic science have been added to this list. In order to discuss methods for modelling forensic investigations, it is useful to present a general notation of a graph and its adjacency matrix.

Graph: A graph is denoted by G(V, E), in which V is the set of nodes and E the set of edges (also known as links). Each edge has two so-called endpoints: the, not necessarily distinct, vertices that are associated with that specific edge. In case of an undirected graph, these endpoints are defined by one mapping $\psi : E \to V \times V$ (West, 2002). In case G is a directed graph, the endpoints are defined by two mappings $\psi_1 : E \to V$ and $\psi_2 : E \to V$, in which the first mapping results in the initial vertex and the second in the terminal vertex (Diestel, 2000). A graph can also be expanded by associating a weight $w_e \in \mathbb{R}$ to each edge $e \in E$ (Rosen, 2012). This weight might for instance represent the ‘strength’ of the connection between the two nodes (Fornito, Zalesky, & Bullmore, 2016).


Unless stated otherwise, throughout this paper the notation G(V, E) represents an unweighted, undirected graph, in which V is the set of nodes (or: vertices) and E is the set of edges (or: links).

Adjacency matrix: A graph can also be represented as a so-called adjacency matrix, denoted by A. In this matrix, $A_{ij} = 0$ if and only if there is no edge from node i to node j in graph G(V, E). Moreover, if there is an edge from i to j, $A_{ij}$ equals the weight of this edge in case of a weighted graph. In case of an unweighted graph, $A_{ij} = 1$ if and only if there exists a link between nodes i and j (West, 2002).
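As an illustration of the adjacency matrix defined above, the following minimal Python sketch builds A from an edge list. The function name `adjacency_matrix`, the integer node labels and the example entities are assumptions made for this illustration only and do not stem from the cited literature.

```python
def adjacency_matrix(n_nodes, edges, directed=False):
    """Build the adjacency matrix A for nodes labelled 0..n_nodes-1.

    `edges` is a list of (i, j) or (i, j, weight) tuples; an edge given
    without a weight is stored as 1 (unweighted graph).
    """
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for edge in edges:
        i, j = edge[0], edge[1]
        w = edge[2] if len(edge) == 3 else 1
        A[i][j] = w
        if not directed:
            A[j][i] = w  # an undirected graph has a symmetric matrix
    return A


# Hypothetical example: 0 = suspect, 1 = phone, 2 = location, 3 = witness
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
for row in A:
    print(row)
```

For a directed, weighted forensic network one would pass `directed=True` and supply a weight as the third element of each tuple.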

2.2 Applications on evidence representation

American computer scientist Easttom is the first author to publish a paper on the general application of graph theory to forensic problems (Wilson, 2013) (Easttom, 2017b). Although the methodology is explained using digital evidence, the writer states that it can be applied to all forensic evidence. The idea is rather simple: the vertices represent all the relevant entities and the directed edges their connections to each other, with the initiator of the data flow as the initial vertex and the target entity as the terminal vertex. The edges are not only directed; each edge is also associated with a weight, representing the strength of the connection. With this basic methodology, many questions arise, such as: How do we establish the weights? And when exactly are two entities “connected”? There appears to be no unambiguous answer to these questions, as they depend on the type of evidence or the type of case to which the forensic networking methodology is applied. With regard to the first question, Easttom himself proposes using the number of occurring aspects of means, motive and opportunity as the weight of the edges associated with the suspect (Easttom, 2017a). These aspects are sometimes used in criminal investigation (Newburn, Williamson, & Wright, 2012). The weight of edges that do not involve a suspect is, however, not specified. In order to find a general methodology for finding edges and calculating their weights, it is useful to discuss papers on this topic with regard to specific types of evidence and/or crimes. However, for only two types of crimes has research progressed far enough to discuss the details of the forensic networks: cyber breaches (digital forensics) and groups of missing persons.

2.3 Research-gap: weight construction

There is a lack of research on the construction of forensic networks, for example specifically on the construction of the edge weights. To illustrate this issue, the methods of network construction with regard to two case-types are discussed in this subsection.

Most papers with regard to forensic modelling are focused on cyber breaches, e.g. (Noel et al., 2016) (Easttom, 2020). The modelling of computer networks, specifically the traffic of data in such a network, using graph theory was performed years ago by multiple scientists (W. Wang & Daniels, 2008) (Blazwicz, Brzezinski, & Gambosi, 1993). On the other hand, there are multiple tools available for analyzing network intrusions that do not use graph theory but do assign a vector of metrics describing the network attack, such as the Common Vulnerability Scoring System (CVSS) (Mell, Scarfone, & Romanosky, 2007). There are thus known methods to express the attack in scientific data (in the form of scores) and methods to represent the system on which the attack is aimed as a graph. The next step is to use tools from algebraic graph theory, such as the degree matrix and the Laplacian, to explore this graph in order to understand the cyber breach (Easttom, 2020).

An issue in digital forensics is that finding the perpetrator in the form of the computer that carried out the attack is not sufficient. It is often not known who controlled this computer during the attack (Chaski, 2005). For this reason, extra nodes need to be added to the model of the computer network. These nodes consist of the potential users: persons who might have had access to the computers during the attack (Easttom, 2017b). However, the links between these users and the computers are missing. Graph theory might be able to predict them.

A very case-specific study on applying graph theory to a forensic investigation has been carried out on the disappearance of a group of people in Tucumán (Argentina) from 1974 to 1981 (Caridi et al., 2011). A network representing the mass disappearance was created by letting each node of the network represent one missing person. The edges are created by following a set of rules, which fundamentally describe how much the two specific nodes are “alike” in, for instance, their political affiliation. Moreover, known information on the locations where a person was seen during their captivity is used to establish the rules. The constructed graph is then used to predict the possible destination of each person, resulting in a 71% success rate of the method.

2.4 Features of forensic networks

It is concluded that, although there are some ideas about constructing a forensic network, there is no conclusive solution to problems such as the weight assignment of the edges. There is no unambiguous approach to the construction of a (complex) forensic network. The papers that have been published on the combination of graph theory and forensics are mostly specialised in one type of case and/or evidence only. This issue is due to a lack of research, as will be discussed in section 6.2 (Recommended research). From the studies that have been found on the subject of forensic networks, it can be concluded that forensic networks have some common features. First of all, the edges are accompanied by weights representing the strength of the evidence and a direction determining the “initiator” and the “target”. Moreover, the nodes carry a number of attributes. The nodes represent different entities, which might include persons, locations and subjects. Among other things, the type of entity must be specified in an attribute.

3 Prediction of missing edges

In contrast to the construction of forensic networks, the prediction of missing edges is a rather active field of research (Abdolhosseini-Qomi, Yazdani, & Asadpour, 2020) (Krause, Huisman, Steglich, & Snijders, 2020) (Almquist & Butts, 2018). Recently, Kumar and co-authors reviewed the available literature on the topic (Kumar, Singh, Singh, & Biswas, 2020). In 2019, Pandey and co-authors reviewed the literature on edge prediction with a specific focus on social networks (Pandey, Bhanodia, Khamparia, & Pandey, 2019). The results of these two review articles will be considered in this section. Kumar and co-authors divide the existing methods for edge prediction into four categories: similarity-based methods, probabilistic and maximum likelihood models, methods based on dimensionality reduction and “other approaches”. These other approaches consist of learning-based frameworks, information theory-based link prediction, clustering-based link prediction and the Structural Perturbation Method (SPM). Pandey and co-authors do not cover all of these approaches, such as the structural perturbation method (SPM). However, the learning-automata approach and the Monte Carlo algorithm are considered by Pandey and co-authors.

Not all methods for link prediction described by Pandey et al. (2019) and Kumar et al. (2020) are represented in this paper. A first, intuitive selection of the methods has been made based on their expected suitability for forensic networks. For example, the link prediction methods based on dimensionality reduction are not elaborated on in this paper. Forensic evidence, and therefore the nodes in forensic networks, have a relatively low number of dimensions, all of which should be taken into account as evidence: evidence cannot be ignored. Moreover, some methods of link prediction have proven to be suitable only for very large-scale networks, such as the Top-L clustering method, which requires at least 100,000 links (Wu, Lin, Wan, & Jamil, 2016). It is not expected that a forensic network contains this number of links, particularly not at this stage of research in the forensic field.

3.1 Similarity-based methods

The first category of link prediction methods contains approaches based on similarity. In these approaches, a so-called similarity score is calculated for each pair of nodes (Bhanodia, Sethi, Khamparia, Pandey, & Prajapat, 2019). The similarity score between nodes $x, y \in V$ is denoted by $s_{xy}$. The similarity matrix S of G(V, E) contains as its elements the similarity scores between each pair of nodes. How the scores are calculated depends on the exact method. The train of thought behind the similarity methods is: the higher the score, the stronger the paths between the nodes and the more likely an edge will occur between the two nodes (Lü, Jin, & Zhou, 2009a). How these similarity scores are subsequently converted into the prediction of links is a subject of debate (Jiang, Chen, & Chen, 2015). However, the most common method for conversion is to set a threshold for the proposed similarity score: if the score between nodes $v_1, v_2 \in V$ is higher than the threshold and there is no edge between $v_1$ and $v_2$ present in the concerned graph G, then the edge $v_1 v_2$ is labelled as a missing edge (Lee & Tukhvatov, 2018).
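The threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `predict_missing_edges`, the threshold value and the similarity scores are invented for the example and are not taken from the cited papers.

```python
def predict_missing_edges(S, A, threshold):
    """Return pairs (i, j) with i < j whose similarity score exceeds the
    threshold while no edge is present in the adjacency matrix A."""
    n = len(A)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if A[i][j] == 0 and S[i][j] > threshold]


# Hypothetical 3-node graph with one known edge (0-1) and made-up scores
A = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
S = [[0.0, 0.9, 0.7],
     [0.9, 0.0, 0.2],
     [0.7, 0.2, 0.0]]
candidates = predict_missing_edges(S, A, threshold=0.5)
print(candidates)
```

The pair (0, 1) is skipped because its edge already exists, and (1, 2) is skipped because its score is below the threshold, so only (0, 2) is labelled as a missing edge.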

The different similarity indices are often divided into three groups: local, global and quasi-local. Global indices take into account the entire graph when calculating the similarity score of two nodes, while local indices only use the direct neighbourhoods. Quasi-local similarity scores, as the name suggests, are a mix between these two: part of the graph is considered, but more than only the direct neighbourhoods of the concerned nodes (Chroł, 2018). An issue with local indices is that the score can only be calculated for nodes separated by a path length of at most two. If the path length is larger than two, the nodes do not have any common direct neighbours to take into account and the score will automatically be zero. Therefore, potential links between nodes that occur “far away from each other” might be wrongfully ignored (Srilatha & Manjula, 2016). On the other hand, as one might intuitively expect, global similarity scores are far more complex to calculate than local scores. There is simply more information to take into account when using the entire graph. However, global indices are also more complete than local indices for this same reason: the scores are meaningful for nodes with a path length larger than two. Quasi-local similarity scores offer a solution to this trade-off. These indices combine the computational efficiency of local similarity scores with the completeness of considered features of global similarity scores (Kumar et al., 2020).


Both efficiency and completeness of predictions are highly important in the forensic context. Forensic investigations are often carried out under considerable time pressure for various reasons, which calls for efficient research (Ask & Alison, 2010). Moreover, ignoring evidence during a forensic investigation is undesirable or even not allowed (England, 2020). It is therefore detrimental to use only the direct neighbours within the graph. For these reasons, and because Kumar and co-authors conclude that quasi-local similarity scores perform well compared to local and global scores, this paper will focus solely on quasi-local indices (Kumar et al., 2020). It is expected that this group of indices is most suitable for link prediction in forensic networks.

Pandey and co-authors consider four different quasi-local similarity scores in their review paper: SimRank, hitting/commute time, PropFlow and supervised random walks (Pandey et al., 2019). Of these four quasi-local indices, one is also discussed by Kumar and co-authors: the supervised random walks. Kumar and co-authors, however, add two other quasi-local similarity scores: the Local Path index (LP) and the CH2-L3 method (Kumar et al., 2020). Therefore, a total of six different quasi-local indices are discussed in the remainder of this section.

3.1.1 Local Path index (LP)

The Local Path index, proposed by Lü et al. in 2009 as a trade-off between accuracy and complexity, counts the number of two-step paths and three-step paths between the two concerned nodes in order to calculate the similarity score (Tian, Li, Zhu, & Tia, 2019) (Lü, Jin, & Zhou, 2009b). Recall that the adjacency matrix A of an unweighted graph G(V, E) contains only zeros and ones. The number of two-step paths is then given by the entries of $A^2$ and the number of three-step paths by the entries of $A^3$. The similarity matrix is calculated by:

$$S^{LP}(G) = A^2 + \epsilon A^3, \tag{1}$$

where $\epsilon$ is a free parameter which decides in what proportion the three-step paths determine the score compared to the two-step paths. This method can also be extended to a more ‘global’ version, taking into account not only the two- and three-step paths, but also longer paths. Formula (1) can then be rewritten into:

$$S^{LP}(G) = A^2 + \epsilon A^3 + \epsilon^2 A^4 + \ldots + \epsilon^{n-2} A^n, \tag{2}$$

where n is the maximal order, that is, the maximal path length taken into account. The more extended the local path index is, the more global the index is and the higher the complexity of computing this similarity score. A trade-off between complexity and completeness needs to be made when choosing n.
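To make equations (1) and (2) concrete, the sketch below computes the Local Path similarity matrix for a small path graph in plain Python. The helper `matmul`, the function name `local_path_index`, the parameter name `max_order` and the value of the free parameter are assumptions chosen for this illustration.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def local_path_index(A, epsilon, max_order=3):
    """S = A^2 + eps*A^3 + ... + eps^(max_order-2) * A^max_order."""
    n = len(A)
    power = matmul(A, A)              # A^2
    S = [row[:] for row in power]
    for order in range(3, max_order + 1):
        power = matmul(power, A)      # A^order
        coeff = epsilon ** (order - 2)
        for i in range(n):
            for j in range(n):
                S[i][j] += coeff * power[i][j]
    return S


# Path graph 0 - 1 - 2 - 3 (unweighted, undirected)
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
S = local_path_index(A, epsilon=0.1)
print(S[0][2], S[0][3])
```

With `max_order=3` this corresponds to equation (1); larger values give the extended version of equation (2). In the example, nodes 0 and 2 share one two-step path, while nodes 0 and 3 are only connected through a single three-step path and therefore receive the smaller score.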

3.1.2 CH2-L3-Method

In a 2018 biomedical paper, written by Muscoloni, Abdelhamid and Cannistraci, another method based on length-three paths for computing similarity scores was presented: the so-called CH2-L3 method (Muscoloni, Abdelhamid, & Cannistraci, 2018). The CH2-L3 method, however, adds Cannistraci Resource Allocation (CRA). Let $v_1, v_2 \in V$ for the (unweighted, undirected) graph G(V, E). The degree-normalised number of paths of length three is calculated by:

$$L3(v_1, v_2) = \sum_{u, v \in L3} \frac{1}{\sqrt{d_u \times d_v}}, \tag{3}$$

where u and v represent the two other nodes on the considered path of length three, in between $v_1$ and $v_2$, and $d_u$ and $d_v$ are the degrees of nodes u and v. Like the Local Path index, this formula can also be generalized to paths of length n. In order to do so, the authors normalise by the geometric mean of the degrees of the intermediate nodes. The following formula is derived for the general path length:

$$LN(v_1, v_2) = \sum_{u_1, \ldots, u_{n-1} \in LN} \frac{1}{(d_{u_1} \times \ldots \times d_{u_{n-1}})^{\frac{1}{n-1}}}, \tag{4}$$

where $u_1, \ldots, u_{n-1}$ are the intermediate nodes of a path of length n, so that equation (4) reduces to equation (3) for n = 3. This formula is then extended using CRA. The authors expect a better performance, since it had already been shown that CRA as an extension of L2 outperforms L2 (Muscoloni & Cannistraci, 2017). The method then uses not only the paths of length two, but also the local community structure of the concerned nodes. Following this logic, Muscoloni and co-authors extend equation (4) with CRA and demonstrate its performance for L3 on a data set, leading to the CH2-L3 model. The basis of the CH2-L3 model thus lies on the one hand in local paths of length three, and on the other hand in the internal and external local community links of the common neighbours of the concerned nodes. Let $\Gamma(v)$ be the set of neighbours of vertex $v \in V$. The following equation is found for the corresponding similarity index (Kumar et al., 2020):

$$s^{CH2-L3}(v_1, v_2) = \sum_{z_1 \in \Gamma(v_1),\, z_2 \in \Gamma(v_2)} \frac{a_{z_1, z_2} \sqrt{(1 + c_{z_1})(1 + c_{z_2})}}{\sqrt{(1 + o_{z_1})(1 + o_{z_2})}}, \tag{5}$$

where:

$$a_{z_1, z_2} = \begin{cases} 1, & \text{if there is a link between } z_1 \text{ and } z_2, \\ 0, & \text{otherwise.} \end{cases}$$

Moreover, in equation (5), $c_{z_1}$ and $c_{z_2}$ represent the number of links between respectively $z_1$ and $z_2$ and the intermediate nodes of all paths of length three between $v_1$ and $v_2$. On the other hand, $o_{z_1}$ and $o_{z_2}$ represent the number of links between respectively $z_1$ and $z_2$ and the nodes that are not intermediate nodes of all paths of length three between $v_1$ and $v_2$.
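As a simplified illustration of this family of indices, the following sketch computes only the degree-normalised length-three score of equation (3); it deliberately leaves out the CRA extension of equation (5). The function name `l3_score`, the neighbour-set representation and the example graph are assumptions made for this example.

```python
from math import sqrt


def l3_score(adj, v1, v2):
    """Degree-normalised count of length-three paths v1 - u - v - v2.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    score = 0.0
    for u in adj[v1]:
        for v in adj[v2]:
            # u and v must be linked and all four nodes must be distinct
            if u in adj[v] and len({v1, u, v, v2}) == 4:
                score += 1.0 / sqrt(len(adj[u]) * len(adj[v]))
    return score


# Hypothetical graph with edges 0-1, 1-2, 2-3, 0-4 and 4-2
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3, 4}, 3: {2}, 4: {0, 2}}
print(l3_score(adj, 0, 3))
```

In this example, nodes 0 and 3 are connected by the two length-three paths 0-1-2-3 and 0-4-2-3, each contributing 1/sqrt(2 x 3) to the score.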

3.1.3 Supervised Random Walks (SRW)

The method of Supervised Random Walks is, in contrast to LP and CH2-L3, based on random walks rather than on path lengths. Random walks have been shown to contribute to many problems in graph theory, such as estimating commuting time and the use of eigenvalues (Lovasz, 1993). In the Supervised Random Walks, walkers leave the starting node $v_s$ continuously over time. Each of them walks independently. The choice to use multiple independent walkers is motivated by the issue of clustering, which typically occurs in social networks (Khotilin & Blagov, 2016). Due to this clustering, it might happen that a walker moves away from the starting node, while not getting closer to the target node. In SRW, for each of the walks the similarity score is calculated by the Local Random Walk similarity score ($S^{LRW}$) and finally summed into the final score $S^{SRW}$ (Liu & Lü, 2010). To calculate $S^{LRW}$, let $v_s, v_e \in V$ for graph G(V, E), in which $v_s$ is the starting node and $v_e$ the target node of a random walker. Moreover, consider the transition probability matrix P. The elements of matrix P are the probabilities that a random walker moves from one node to a second node in one time step. That is, $P_{v_s v_e}$ is the probability that a random walker at node $v_s$ moves to $v_e$ in the next time step. Since the walk is random and each step is independent, this probability can easily be computed: the number of links between $v_s$ and $v_e$ divided by the total degree of $v_s$, that is, $P_{v_s v_e} = a_{v_s v_e} / d_{v_s}$. The transpose of the transition matrix is now used to compute the probability that the random walker starting from $v_s$ is located at $v_e$ after $t \in \mathbb{N}$ (time) steps (Liu & Lü, 2010):

$$\vec{p}_{v_s}(t) = P^T \vec{p}_{v_s}(t-1), \tag{6}$$

where $\vec{p}_{v_s}(t)$ is a column vector whose $v_e$-th element is $p_{v_s v_e}(t)$. For the starting point $\vec{p}_{v_s}(0)$, the vector consists of 1 at the $v_s$-th element and zeros at all other elements. The Local Random Walk similarity score is now computed by:

$$S^{LRW}(v_s, v_e)(t) = \frac{k_{v_s}}{2|E|} p_{v_s v_e}(t) + \frac{k_{v_e}}{2|E|} p_{v_e v_s}(t), \tag{7}$$

where |E| represents the number of edges in graph G and $k_v$ the degree of node v (denoted by $d_v$ elsewhere in this paper). As stated before, the similarity score for the Supervised Random Walks (SRW) consists of the sum of local indices. Therefore, it is now possible to compute the SRW similarity score as follows (Kumar et al., 2020):

$$S^{SRW}(v_s, v_e)(t) = \sum_{l=1}^{t} S^{LRW}(v_s, v_e)(l). \tag{8}$$
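The recursion in equation (6) and the scores in equations (7) and (8) can be traced step by step with the following minimal sketch. It assumes an unweighted, undirected graph without isolated nodes, stored as a dictionary of neighbour sets; the function names `lrw_scores` and `srw_score` are chosen for this illustration.

```python
def lrw_scores(adj, vs, ve, t_max):
    """Return [S_LRW(vs, ve)(1), ..., S_LRW(vs, ve)(t_max)], equation (7)."""
    nodes = list(adj)
    n_edges = sum(len(adj[v]) for v in nodes) // 2

    def step(p):
        # One application of p(t) = P^T p(t-1), with P[u][v] = 1 / d_u
        q = {v: 0.0 for v in nodes}
        for u in nodes:
            for v in adj[u]:
                q[v] += p[u] / len(adj[u])
        return q

    p_s = {v: 1.0 if v == vs else 0.0 for v in nodes}  # walker starts at vs
    p_e = {v: 1.0 if v == ve else 0.0 for v in nodes}  # walker starts at ve
    scores = []
    for _ in range(t_max):
        p_s, p_e = step(p_s), step(p_e)
        scores.append(len(adj[vs]) / (2 * n_edges) * p_s[ve]
                      + len(adj[ve]) / (2 * n_edges) * p_e[vs])
    return scores


def srw_score(adj, vs, ve, t_max):
    """Sum of the local scores over l = 1..t_max, equation (8)."""
    return sum(lrw_scores(adj, vs, ve, t_max))


# Path graph 0 - 1 - 2
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(lrw_scores(path, 0, 2, 2), srw_score(path, 0, 2, 2))
```

In the example, a walker starting at node 0 cannot be at node 2 after one step, so the first local score is zero; after two steps it is at node 2 with probability 0.5, which yields a local score of 0.25.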

3.1.4 Expected Hitting and Commute Time (EHCT)

The method of Expected Hitting and Commute Time is, like SRW, based on random walks. In order to explain the method, the random walk is represented as a Markov chain $X_k$, $k \geq 0$, in which the probability of moving to a specific neighbour $v_1 \in \Gamma(v)$ of $v \in V$ is equal to $1/d_v$ (Huang, Li, & Xie, 2019). Note that the Markov chain now consists of all nodes that have been visited during this specific random walk. The hitting time of node $v_2 \in V$ from node $v_1 \in V$, denoted by $H(v_1, v_2)$, is the ‘first time’ the walk starting from $v_1$ reaches the concerned node, and thus the iteration in which this node is found in the Markov chain for the first time. That is:

$$H(v_1, v_2) = \inf\{k \geq 0 : X_k = v_2\}, \quad \text{with } X_0 = v_1.$$

Note that the hitting time can be computed for both directed and undirected graphs. For a directed graph, it is important to remark that the random walk can only move from one vertex to another if the direction of the edge corresponds to the moving direction of the walk. For undirected graphs, the commute time $C(v_1, v_2)$ is often used to provide a symmetric value (Pandey et al., 2019):

$$C(v_1, v_2) = H(v_1, v_2) + H(v_2, v_1).$$
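The expected values of the hitting and commute times can be approximated numerically. The sketch below is one possible approach, not the procedure of the cited authors: it assumes a connected, undirected, unweighted graph, takes the commute time to be the sum of the two hitting times, and repeatedly applies the relation that the expected hitting time from a node equals one step plus the average expected hitting time from its neighbours. The function names are hypothetical.

```python
def expected_hitting_time(adj, start, target, tol=1e-12, max_iter=100000):
    """Expected number of steps for a simple random walk from `start`
    to reach `target` for the first time (fixed-point iteration)."""
    h = {v: 0.0 for v in adj}
    for _ in range(max_iter):
        # h(target) = 0; otherwise one step plus the neighbours' average
        new_h = {v: (0.0 if v == target
                     else 1.0 + sum(h[u] for u in adj[v]) / len(adj[v]))
                 for v in adj}
        converged = max(abs(new_h[v] - h[v]) for v in adj) < tol
        h = new_h
        if converged:
            break
    return h[start]


def expected_commute_time(adj, v1, v2):
    """Expected time to walk from v1 to v2 and back to v1."""
    return (expected_hitting_time(adj, v1, v2)
            + expected_hitting_time(adj, v2, v1))


# Path graph 0 - 1 - 2
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(expected_hitting_time(path, 0, 2), expected_commute_time(path, 0, 2))
```

For the path graph, the expected hitting time from node 0 to node 2 is 4 steps and the expected commute time is 8 steps. Note that a larger commute time indicates nodes that are further apart, so in practice such a score is usually ranked in reverse or negated.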

The similarity score of $v_1$ and $v_2$ can now be defined as the expected commuting time and is therefore computed by (Chen & Zhang, 2008):

$$s^{HittingTime}(v_1, v_2) = \sum_{k=1}^{\infty} k \, P(C(v_1, v_2) = k), \tag{9}$$

where the probability $P(\cdot)$ is determined by the transition matrix of the random walk, in which element $P_{ij}$ is the probability that a random walker moves from i to j. That is, zero if i and j are not connected (note the direction), and $1/d_i$ if i and j are connected.

3.1.5 PropFlow (PF)

The third and last method based on random walks is the so-called PropFlow index. In contrast to the previously proposed indices, PropFlow uses the weights of the edges. These weights, which can be seen as the ‘activeness’ of the edge, provide information to define the probabilities of the possible next steps in the random walk and thus the Markov chain (Lichtenwalter, Lussier, & Chawla, 2010). Therefore, let G(V, E) be an undirected, weighted graph and $v_0, v_s \in V$. The computation of the PropFlow similarity index $S^{PropFlow}$ between nodes $v_0$ and $v_s$ is presented based on an algorithm rather than an equation. The algorithm first finds the shortest path between $v_0$ and $v_s$. Let $v_0, \ldots, v_s$ be the nodes in this shortest path. The similarity index between $v_0$ and $v_s$ can now be computed by (Munasinghe & Ichise, 2013):

$$S^{PropFlow}(v_0, v_s) = \prod_{i=0}^{s-1} \frac{w_{i,i+1}}{\sum_{k \in \Gamma_i} w_{ik}}, \tag{10}$$

where $w_{i,i+1}$ is the weight of the edge between $v_i$ and $v_{i+1}$, and $\Gamma_i$ is the set of neighbours of $v_i$.

3.1.6 SimRank

The similarity measure SimRank, proposed in 2002, is an iterative method based on neighbours. The approach relies on the thought that “two objects are similar if they are related to similar objects”: the PageRank approach (Jeh & Widom, 2002) (Shaojie Qiao et al., 2010). The maximum similarity score has been set to 1, which is the preassigned similarity score of a node to itself. The score of distinct nodes is based on the SimRank of their neighbours. Since the similarity scores are dependent on one another, there needs to be a ‘starting point’ in order to actually compute the indices. Let $v_1, v_2 \in V$ be two (not necessarily distinct) vertices of the (undirected and unweighted) graph G(V, E). Recall that $\Gamma(v_1)$ and $\Gamma(v_2)$ denote the sets of neighbours of $v_1$ and $v_2$ respectively. Jeh and Widom present a solution based on iteration to a fixed point and provide the following equations (Jeh & Widom, 2002):

$$R_{k+1}(v_1, v_2) = \begin{cases} \dfrac{C}{|\Gamma(v_1)||\Gamma(v_2)|} \sum_{i=1}^{|\Gamma(v_1)|} \sum_{j=1}^{|\Gamma(v_2)|} R_k(\Gamma_i(v_1), \Gamma_j(v_2)), & \text{if } v_1 \neq v_2, \\ 1, & \text{if } v_1 = v_2. \end{cases} \tag{11}$$

And for the starting point $R_0$:

$$R_0(v_1, v_2) = \begin{cases} 0, & \text{if } v_1 \neq v_2, \\ 1, & \text{if } v_1 = v_2. \end{cases}$$

In equation (11), C is a constant with C < 1. The value of this constant represents the level of decay, which can intuitively be seen as a ‘smaller confidence level’ for nodes with a larger path length between them. Note that in each iteration, the computation updates the similarity scores of the nodes and their neighbours. An issue with such an iteration process is deciding when it should stop. Jeh and Widom have, however, shown that the iteration process converges, resulting in a final similarity score for nodes $v_1, v_2 \in V$:

$$s^{SimRank}(v_1, v_2) = \lim_{k \to \infty} R_k(v_1, v_2). \tag{12}$$
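The iteration of equation (11) can be sketched as follows. The limit of equation (12) is approximated by a fixed number of iterations; the function name `simrank`, the default values of C and of the number of iterations, and the convention of assigning a score of zero to pairs involving a node without neighbours are assumptions made for this example.

```python
def simrank(adj, C=0.8, iterations=10):
    """Approximate SimRank scores R_k for all node pairs, equation (11).

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    nodes = list(adj)
    # Starting point R_0: 1 for identical nodes, 0 otherwise
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_R = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_R[(a, b)] = 1.0
                elif not adj[a] or not adj[b]:
                    new_R[(a, b)] = 0.0  # assumed convention, see lead-in
                else:
                    total = sum(R[(x, y)] for x in adj[a] for y in adj[b])
                    new_R[(a, b)] = C * total / (len(adj[a]) * len(adj[b]))
        R = new_R
    return R


# Star graph: node 0 is linked to nodes 1 and 2
star = {0: {1, 2}, 1: {0}, 2: {0}}
R = simrank(star)
print(R[(1, 2)], R[(0, 1)])
```

In the example, nodes 1 and 2 share their only neighbour and therefore obtain the score C, while the centre and a leaf have no similar neighbours and keep a score of zero.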

3.1.7 Path Entropy (PE)

Computing the similarity index based on path entropy leads to an information theory-based link prediction method (Kumar et al., 2020). Xu and co-authors suggested such a path entropy similarity measure in 2016 (Xu, Pu, & Yang, 2016). Note that $p(v_1, v_2 \text{ are linked}) = 1 - p(v_1, v_2 \text{ are not linked})$. Moreover, the probability that there is no link between $v_1$ and $v_2$ can be calculated by dividing the number of possible link sets that are incident to $v_2$ but not to $v_1$ by the number of possible link sets that are incident to $v_2$. Therefore:

\begin{align}
p(v_1, v_2 \text{ are linked}) &= 1 - \frac{\#(\text{possible link sets that are incident to } v_2 \text{ but not to } v_1)}{\#(\text{possible link sets that are incident to } v_2)} \tag{13} \\
&= 1 - \frac{\#L(v_2, \neg v_1)}{\#L(v_2)}. \tag{14}
\end{align}

Xu and co-authors then compute the probability of the occurrence of a simple path $D = v_0, \ldots, v_t$ by multiplying the link probabilities of the consecutive nodes in the path, that is (Pandey et al., 2019):

$$p(D) = \prod_{i=0}^{t-1} p(v_i, v_{i+1} \text{ are linked}). \tag{15}$$

With this probability of path occurrence, as calculated by equation (15), one can compute the uncertainty of the event. In information theory this is known as the entropy I(D) and is computed by taking the negative logarithm (Xu et al., 2016):

\begin{align}
I(D) &= -\log(p(D)) \tag{16} \\
&= -\log\left(\prod_{i=0}^{t-1} p(v_i, v_{i+1} \text{ are linked})\right) \tag{17} \\
&= \sum_{i=0}^{t-1} I(v_i, v_{i+1} \text{ are linked}). \tag{18}
\end{align}
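The step from equation (16) to equation (18) relies on the logarithm turning a product into a sum. The short sketch below checks this numerically for a path with made-up link probabilities; the function names and the probability values are assumptions for illustration and are not computed from any real network.

```python
from math import log, prod


def path_probability(link_probs):
    """Probability of a simple path as the product of the probabilities
    of its consecutive links, as in equation (15)."""
    return prod(link_probs)


def path_entropy(link_probs):
    """Entropy of a path as the sum of the entropies of its links,
    as in equation (18)."""
    return sum(-log(p) for p in link_probs)


# Hypothetical link probabilities along a path v0 - v1 - v2
link_probs = [0.5, 0.25]
print(path_probability(link_probs), path_entropy(link_probs))
```

Both routes give the same value: the negative logarithm of the path probability 0.125 equals the sum of the two link entropies.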

The similarity index proposed by Xu and co-authors is based on the entropy as computed in equation (16), conditional on all simple paths between the concerned nodes. Therefore, let l be the maximum length of the simple paths between nodes $v_1$ and $v_2$ and let $P^i_{v_1, v_2}$ be the set of simple paths of length i between $v_1$ and $v_2$. The similarity index of $v_1$ and $v_2$ can now be computed by:

\begin{align}
S^{PE}(v_1, v_2) = {} & \sum_{i=2}^{l} \left( \frac{1}{i-1} \sum_{P \in P^i_{v_1, v_2}} \left[ \sum_{j=0}^{i-1} \log \frac{\#L(v_j)}{\#L(v_j) - \#L(v_j, \neg v_{j+1})} \right] \right) \tag{19} \\
& - \log \frac{\#L(v_1)}{\#L(v_1) - \#L(v_1, \neg v_2)}. \tag{20}
\end{align}

3.2 Probabilistic and maximum likelihood models

In this section, the link prediction methods based on probabilistic and maximum likelihood models as presented in both (Pandey et al., 2019) and (Kumar et al., 2020) are discussed. These models each use a form of optimization with the characteristics (attribute information) of the concerned nodes and possibly the structural information of the graph. This section includes in total five different models: local probabilistic model, hierarchical structure model, probabilistic relational model, stochastic relational model and stochastic block model.

3.2.1 Local Probabilistic Model (LPM)

The first probabilistic model considered is the Local Probabilistic Model (LPM) as proposed by Wang and co-authors in 2007 (C. Wang, Satuluri, & Parthasarathy, 2007). LPM uses three different features in order to derive the maximum likelihood. These features are co-occurrence probability features, topological features and semantic features. Once obtained, the features are combined by a supervised learning framework in order to predict links.

The first feature to be acquired is the co-occurrence probability, which is intuitively the probability that the two concerned nodes ($v_1$ and $v_2$) will ever be linked in the future. To compute this probability, the so-called central neighbourhood set of $v_1$ and $v_2$ is considered. This set consists of nodes that lie on any path between $v_1$ and $v_2$. A maximum size for the central neighbourhood set is decided by a fixed parameter n. If n is smaller than the total number of nodes on any path between $v_1$ and $v_2$ in graph G(V, E), an algorithm is followed to decide which nodes are included in the central neighbourhood set. After having found the central neighbourhood set, the next step in deriving the co-occurrence probability is to find the underlying probability distribution in the central neighbourhood set. To calculate the co-occurrence probability, a distribution that is as uniform as possible is desirable for consistency. Therefore, Wang and co-authors use a maximum entropy approach. The maximum entropy distribution can be seen as the distribution whose entropy is at least as great as that of any other distribution consistent with the information found in the central neighbourhood set (Cover & Thomas, 2006). Wang and co-authors present an algorithm for using this distribution to compute the co-occurrence probability feature (C. Wang et al., 2007).

The second feature that needs to be derived is the topological feature. Since the Katz-index has proven to be the most effective topological feature, Wang and co-authors use this index in their model, restricted to a maximum path length to increase efficiency. The Katz-index is computed for nodes v1 and v2 with maximum path length m by (Liben-Nowell & Kleinberg, 2003):

$$Katz(v_1, v_2) = \sum_{i=1}^{m} \beta^{i} p_i, \qquad (21)$$

where $p_i$ represents the number of paths between v1 and v2 of length i and $0 < \beta < 1$ is a pre-determined parameter referred to as the damping factor.
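Equation (21) can be sketched in a few lines. The example below assumes an unweighted graph given as an adjacency matrix and counts the paths of length i through powers of that matrix, so that nodes may be revisited (walks); the function name and the use of NumPy are choices made for this illustration and are not taken from Wang and co-authors.

```python
import numpy as np

def truncated_katz(adjacency, beta, max_length):
    """Katz scores for all node pairs, truncated at max_length (equation 21).

    Entry (a, b) of the i-th power of the adjacency matrix is the number
    of paths of length i from a to b (counted as walks, an assumption of
    this sketch).
    """
    a = np.asarray(adjacency, dtype=float)
    power = np.eye(a.shape[0])
    scores = np.zeros_like(a)
    for i in range(1, max_length + 1):
        power = power @ a              # adjacency matrix to the power i
        scores += beta ** i * power    # add the damped path counts
    return scores

# Path graph 0 - 1 - 2: nodes 0 and 2 are only connected via node 1.
path_graph = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
katz = truncated_katz(path_graph, beta=0.5, max_length=2)
```

In this example the unlinked pair (0, 2) receives a positive score because of the path of length two through node 1, which is what makes the index usable for link prediction.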

Lastly, Wang and co-authors consider the semantic feature, consisting of all available useful information that cannot be quantified in topological or frequency-based features, such as non-scientific data (or: evidence) (C. Wang et al., 2007).

As briefly mentioned before, the ultimate goal is to combine the three considered features to predict links based on the local probabilistic model. A so-called supervised learning framework is used by Wang and co-authors to obtain this combination. The local probabilistic model is established on a training set of data, and tested on a testing set. Therefore, the model is different for each underlying graph.

3.2.2 Hierarchical Structure Model (HSM)

As the name suggests, hierarchical structure models make use of the hierarchical structure of a graph. Many real-life networks have proven to be hierarchically structured, which means that nodes can be divided into groups, subgroups, etc. (Clauset, Moore, & Newman, 2008). Due to this structure, the graph can be represented as a so-called dendrogram. This dendrogram is a binary tree in which each node of the graph is represented by a leaf (Everitt, 1998) (Kumar et al., 2020). Each internal node of the dendrogram has two children (leaves or internal nodes) and is associated with a probability that a pair of vertices in the left and right sub-trees of that node is connected (Clauset et al., 2008). The dendrogram is used to find the parameters of the maximum likelihood model of the graph. With the probabilities of links between clusters, the goal is to find the “complete” graph that best fits the representation of this dendrogram by maximum likelihood. Once the complete graph is found, the missing links can be identified. The equation for the likelihood model is based on Bayes’ theorem and is presented as follows (Kumar et al., 2020):

$$L(D^*, \{p_r\}) = \prod_{r \in D^*} p_r^{E_r} (1 - p_r)^{L_r R_r - E_r}, \qquad (22)$$

in which $D^*$ represents the dendrogram, $r \in D^*$ the internal nodes of the dendrogram and $0 < p_r < 1$ their associated probability of a link between the left and right sub-trees, which contain $L_r$ and $R_r$ leaves respectively. Moreover, $E_r$ is the number of links in the network whose endpoints have r as their lowest common ancestor in $D^*$. The likelihood in equation (22) is maximized by replacing $p_r$ by the maximizing probability: the number of links whose endpoints have r as their lowest common ancestor in $D^*$, $E_r$, divided by the product of the sizes of the left and right sub-trees:

$$p_r^{\max} = \frac{E_r}{L_r R_r}. \qquad (23)$$


same probability distribution. Then for each possibly missing link, the corresponding probabilities are selected from the maximum likelihood model and averaged. The possibly missing links with high probabilities are either selected by listing the probabilities in descending order and taking a fixed number of high probabilities, or by selecting all probabilities higher than a pre-defined fixed threshold probability.
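To make equations (22) and (23) concrete, the sketch below evaluates the logarithm of the likelihood of a dendrogram at the maximizing probabilities. The representation of the dendrogram as a list of tuples (E_r, L_r, R_r), one per internal node, and the function names are assumptions made for this illustration.

```python
from math import log

def max_link_probability(e_r, l_r, r_r):
    """Equation (23): observed links divided by possible links between sub-trees."""
    return e_r / (l_r * r_r)

def hsm_log_likelihood(internal_nodes):
    """Logarithm of equation (22), evaluated at the maximizing p_r.

    internal_nodes is a list of (E_r, L_r, R_r) tuples, one per internal
    node of the dendrogram (a hypothetical input format for this sketch).
    """
    total = 0.0
    for e_r, l_r, r_r in internal_nodes:
        pairs = l_r * r_r
        p = max_link_probability(e_r, l_r, r_r)
        # For p equal to 0 or 1 the factor in equation (22) equals 1,
        # so it contributes nothing to the log-likelihood.
        if 0.0 < p < 1.0:
            total += e_r * log(p) + (pairs - e_r) * log(1.0 - p)
    return total

# One internal node with 1 leaf on the left, 2 on the right and 1 observed link.
example = hsm_log_likelihood([(1, 1, 2)])
```

Comparing this value across candidate dendrograms indicates which hierarchy fits the observed links best; a higher log-likelihood means a better fit.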

3.2.3 Probabilistic Relational Model (PRM)

Probabilistic Relational Models (PRM) focus their link prediction not on the structure of the graph, as the previously discussed methods LPM and HSM do, but on the characteristics of the nodes and the edges. There are multiple studies concentrating on characteristics, applying slightly different PRMs (Taskar, Wong, Abbeel, & Koller, 2004) (Getoor, Friedman, Koller, & Taskar, 2002) (Neville, 2006). The common ground is that the models aim to use known characteristics to predict unknown characteristics that indicate whether or not a link is missing. The missing link prediction problem is thereby converted to a characteristic prediction problem (Kumar et al., 2020). This problem is then solved by a maximum likelihood approach. The parameters for maximizing the probability are based on the characteristics rather than on the structure of the graph. For example, Taskar and co-authors classify the nodes into a number of (potentially overlapping) subgroups based on their characteristics (Taskar et al., 2004). These subgroups each have their own sub-graph with links and nodes within the subgroup. Moreover, the sub-graphs are linked by inter-group links. On this “link-graph”, a probabilistic model can be applied, such as a relational Markov network.

3.2.4 Stochastic Relational Model (SRM)

A method of link prediction based on both the structure of the graph and the characteristics of nodes and edges is the Stochastic Relational Model (SRM). In SRM, a Gaussian process (GP) framework is used to model the entire profile of links of a network (Pandey et al., 2019). The method, proposed by Yu and co-authors in 2006, first assigns attributes (or: characteristics) to each node of the network (K. Yu, Chu, Yu, Tresp, & Xu, 2006). These attributes are represented as the attribute vector of the node. It is assumed that the present edges between the sub-groups were defined by a relational function based on the vectors of the sub-groups and that each link is only dependent on the vectors of the two concerned nodes (endpoints). The focus of the study by Yu and co-authors concerns the processes on the vector spaces of the two sub-groups that identify their relational function (K. Yu et al., 2006). These processes are (independent) entity-specific Gaussian processes acting on the vector spaces, which carry certain parameters. Yu and co-authors define a probability of linkage for each set of entity pairs between sub-groups, conditional on the parameters of their Gaussian processes. Let $U_1$ and $U_2$ be the considered sub-groups of nodes with $U_1, U_2 \subset V$ and with relational function $t : U_1 \times U_2 \to \mathbb{R}$. Moreover, let $\sigma_{U_1}, \sigma_{U_2}$ be the parameters of the Gaussian processes on $U_1$ and $U_2$. Recall that the relational function t determines the presence of link $r_{i,n}$, where $(i, n) \in I$ and I contains all possible pairs of nodes. Yu and co-authors now define the probability of the links between $U_1$ and $U_2$ by:

$$p(R_I \mid \sigma) = \int \prod_{(i,n) \in I} p(r_{i,n} \mid t_{i,n}) \, p(t \mid \sigma) \, dt, \qquad (24)$$

where $\sigma$ is the vector containing $\sigma_{U_1}$ and $\sigma_{U_2}$, and $R_I$ the vector containing $r_{i,n}$ with $(i, n) \in I$.


p(t|R_I, σ) can be calculated, which are the probabilities that the concerned pair of nodes should be linked. Note that in order to follow the above prescribed computation and, with that, equation (24), one should first find p(t|σ). This probability function is dependent on the Gaussian process. In this paper, Gaussian processes will not be extensively discussed. Even though different forms of Gaussian processes can lead to (slightly) different Stochastic Relational Models, in-depth elaboration on these processes does not fall within the scope of the research question of this study. Therefore, please refer to (K. Yu et al., 2006) for a number of computation techniques for p(t|σ).

3.2.5 Stochastic Block Model (SBM)

The basis of the Stochastic Block Model is similar to SRM, since both models classify the nodes into “groups”, without the further hierarchical structure that HSM uses. In SBM, the sub-groups of nodes, based on characteristics, are referred to as blocks (Kumar et al., 2020). The model does not only look into missing links with a high probability of existing, but also at existing links with a rather low probability, so-called spurious links. Guimerà and Sales-Pardo, who proposed SBM in 2009, compute the probability that a certain link actually exists in the observed network (link reliability) (Guimerà & Sales-Pardo, 2009). The computation of the reliability is based on the partitions P of the graph: all possible divisions of the nodes into sub-groups. Note that these contain all possible partitions and are therefore not (yet) dependent on the characteristics of the nodes. Let the endpoints of the concerned edge be v1 and v2. For each partition, it is identified to which sub-groups the concerned nodes belong (referred to as $\sigma_{v_1}$ and $\sigma_{v_2}$). Then the number of links in the observed network between the group of the first endpoint and the group of the second endpoint is counted, denoted by $l_{\sigma_{v_1},\sigma_{v_2}}$. The maximum number of possible edges between the sub-groups is also known, dependent on the number of nodes in the sub-groups. This maximum number of edges is denoted by $r_{\sigma_{v_1},\sigma_{v_2}}$. According to Guimerà and Sales-Pardo, the edge reliability can now be computed by:

$$R_{v_1,v_2} = \frac{1}{Z} \sum_{p \in P} \frac{l_{\sigma_{v_1},\sigma_{v_2}} + 1}{r_{\sigma_{v_1},\sigma_{v_2}} + 2} \exp[-H(p)], \qquad (25)$$

where H(p) is a function of the partition p, defined as:

$$H(p) = \sum_{\sigma_{v_1} \geq \sigma_{v_2}} \left[ \ln(r_{\sigma_{v_1},\sigma_{v_2}} + 1) + \ln \binom{r_{\sigma_{v_1},\sigma_{v_2}}}{l_{\sigma_{v_1},\sigma_{v_2}}} \right]. \qquad (26)$$

Moreover, Z is calculated dependent on H(p) as presented in equation (26):

$$Z = \sum_{p \in P} \exp[-H(p)]. \qquad (27)$$

Equations (25) - (27) can be used to find the link reliability between each pair of nodes, regardless of the sub-groups to which the nodes belong. One can then decide on thresholds for link reliability: if the link reliability is larger than a certain threshold and the concerned link does not occur in the observed network, one has found a missing link. To go a step further, this method can also be used to find (potentially) false links: if the link reliability is smaller than a certain threshold and the concerned link does occur in the observed network, one has found a (potentially) false link.
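For very small graphs, equations (25) to (27) can be evaluated exactly by enumerating all partitions. The sketch below does this for an undirected, unweighted graph without self-loops. It assumes that the sum in equation (26) runs over all pairs of groups of the partition and that the maximum number of links within a group of n nodes is n(n-1)/2; the function names are hypothetical. Since the number of partitions grows very quickly with the number of nodes, this approach is only meant as an illustration and does not scale to realistic networks.

```python
from math import comb, exp, log

def set_partitions(items):
    """Yield every partition of the list items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        yield [[first]] + partition                      # first in its own block
        for k in range(len(partition)):                  # first joins block k
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]

def link_reliability(nodes, edges, v1, v2):
    """Reliability of the link (v1, v2) following equations (25) to (27)."""
    edge_set = {frozenset(edge) for edge in edges}
    numerator, z = 0.0, 0.0
    for partition in set_partitions(list(nodes)):
        group_of = {v: g for g, block in enumerate(partition) for v in block}
        sizes = [len(block) for block in partition]
        k = len(partition)
        # links[a][b] with a <= b: observed links between groups a and b
        links = [[0] * k for _ in range(k)]
        for edge in edge_set:
            u, w = tuple(edge)
            a, b = sorted((group_of[u], group_of[w]))
            links[a][b] += 1

        def max_links(a, b):
            # maximum possible number of links between (or within) groups
            return sizes[a] * (sizes[a] - 1) // 2 if a == b else sizes[a] * sizes[b]

        h = sum(log(max_links(a, b) + 1) + log(comb(max_links(a, b), links[a][b]))
                for a in range(k) for b in range(a, k))    # equation (26)
        weight = exp(-h)
        a, b = sorted((group_of[v1], group_of[v2]))
        numerator += (links[a][b] + 1) / (max_links(a, b) + 2) * weight
        z += weight                                        # equation (27)
    return numerator / z                                   # equation (25)

# Triangle 0-1-2 with a pendant node 3: how reliable is the unobserved link (0, 3)?
reliability = link_reliability([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)], 0, 3)
```

The returned value lies between 0 and 1 and can be compared with a chosen threshold, as described above, to flag missing or spurious links.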


3.3 Other approaches

In (Pandey et al., 2019) and (Kumar et al., 2020) a number of methods that do not fall within the previously presented categories are discussed. A selection of these methods is made based on their expected suitability for forensic networks. Two link prediction methods are chosen. First, the Structural Perturbation Method (SPM) will be explained. SPM has been shown to outperform the previously discussed HSM and SBM. Moreover, the method does not make any a priori assumptions about the nodes and/or links (Lu, Pan, Zhou, Zhang, & Stanley, 2015). The second method is based on molecular mechanisms and interactions between cellular components, and uses not only direct links but also indirect correlations to predict missing links (Barzel & Barabási, 2013a). Since this approach does not rely on network topology and since indirect correlations might be of influence in a forensic network, the so-called global silencing of indirect correlations is chosen to be (briefly) explained below.

3.3.1 Structural Perturbation Method (SPM)

The Structural Perturbation Method (SPM), as proposed by Lü and co-authors, focuses on estimating link predictability, dependent on the consistency of the network after links have been randomly removed (Lu et al., 2015). The train of thought is that if the structural features of the network are preserved after the removal of a set of links, then the link predictability is rather high. To measure the structural consistency of the graph, a randomly selected fraction of links is removed, resulting in a new graph in which the set of nodes and the other links are preserved. The adjacency matrix of the new graph, referred to as $A^R$, is diagonalized to study the eigenvalues and with that the eigenvectors. The latter are used to indicate and compare the structural features: the more alike the eigenvectors of the new and original graph are, the less the structural features have changed. Lü and co-authors construct the perturbed matrix via first-order approximation, a commonly used method to test the sensitivity of a network (C. Wang, Peng, Wu, Sun, & Yuan, 2012). The consistency change caused by the specific removed links can also be compared. This information is used by Lü and co-authors to finally calculate one value for the structural consistency of the network, referred to as $\sigma_c$. Moreover, the entries of the perturbed matrix can be seen as “similarity scores”. These scores are ranked for all of the node pairs with no observed link in the original matrix. The pair with the highest score is indicated as a missing link.
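The perturbation step can be sketched as follows for an undirected, unweighted graph. The sketch removes one random set of links, diagonalizes the remaining adjacency matrix, shifts each eigenvalue by its first-order correction while keeping the eigenvectors fixed, and rebuilds the matrix. The function names, the default fraction of removed links and the use of a single random removal (instead of an average over several removals) are assumptions of this example, and the special treatment of repeated eigenvalues is omitted.

```python
import numpy as np

def spm_scores(adjacency, removed_fraction=0.1, seed=0):
    """Perturbed matrix of similarity scores after one random link removal."""
    rng = np.random.default_rng(seed)
    a = np.asarray(adjacency, dtype=float)
    rows, cols = np.triu_indices_from(a, k=1)
    edge_positions = np.flatnonzero(a[rows, cols] > 0)
    n_removed = max(1, int(round(removed_fraction * edge_positions.size)))
    chosen = rng.choice(edge_positions, size=n_removed, replace=False)

    # delta holds the removed links, a_r the remaining graph (A^R)
    delta = np.zeros_like(a)
    delta[rows[chosen], cols[chosen]] = 1.0
    delta = delta + delta.T
    a_r = a - delta

    # eigh returns orthonormal eigenvectors for a symmetric matrix
    eigenvalues, eigenvectors = np.linalg.eigh(a_r)
    # first-order eigenvalue shift: x_k^T * delta * x_k for each eigenvector x_k
    shifts = np.einsum("ik,ij,jk->k", eigenvectors, delta, eigenvectors)
    return (eigenvectors * (eigenvalues + shifts)) @ eigenvectors.T

def most_likely_missing_link(adjacency, scores):
    """Unlinked node pair with the highest similarity score."""
    a = np.asarray(adjacency)
    rows, cols = np.triu_indices_from(a, k=1)
    unlinked = a[rows, cols] == 0
    best = np.argmax(np.where(unlinked, scores[rows, cols], -np.inf))
    return int(rows[best]), int(cols[best])

graph = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]])
scores = spm_scores(graph, removed_fraction=0.25, seed=1)
candidate = most_likely_missing_link(graph, scores)
```

Because the result depends on which links happen to be removed, repeating the computation with different seeds and averaging the score matrices gives a more stable ranking.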

3.3.2 Global Silencing of Indirect Correlations (GSIC)

Barzel and Barabási note that pairwise interactions between cellular components are not only based on direct paths (links), but also on indirect paths (Barzel & Barabási, 2013a). The correlation matrix of a network is considered and transformed into a so-called silenced matrix S, which is based on the direct links and the paths between nodes. For an entry $S_{v_1 v_2}$ of the silenced matrix, all paths between nodes $v_1, v_2 \in V$ of graph G(V, E) are considered and combined by a number of general algebraic equations. The idea behind the silenced matrix is that the indirect responses on a node are silenced and thus not represented, so that only the direct paths are visible in the silenced matrix S. This leads to a clear separation of responses resulting from either direct or indirect influences, which makes it possible to predict direct paths, and therefore links, more accurately. Matrix S is also referred to as the local response matrix. Barzel and Barabási propose an equation to approximate this matrix in order to predict missing links, based on the separation of direct and indirect responses.


4 Missing characteristics

Since the nodes of a forensic network include, at least, different types of entities, it is useful to assign characteristics to the nodes. In the study on the group of missing persons, as already mentioned in section 2 Forensic Networking, these characteristics have proven to be very useful (Caridi et al., 2011). Therefore, in this section, different methods for predicting missing characteristics (or: features) of nodes in graphs will be discussed. The number of studies on the prediction of node features is smaller than on link prediction, but is still substantial. Therefore, in this section on missing characteristics prediction, the methods presented in a review article written by Altenburger and Ugander will be considered (Altenburger & Ugander, 2018). Altenburger and Ugander compare within-network and across-network attribute prediction, in which the first regards nodes within the same network and the latter nodes in multiple networks. Since there are no indications within the research field of forensic networking that the evidence will be spread across multiple networks, only the methods for attribute prediction within one network will be considered in this section.

4.1 Majority of the Votes: NetKit-SRL

The idea behind the Majority of the Votes principle is rather simple: a node is assigned the attribute value that most of its neighbours have (Macskassy & Provost, 2007). A classifier that only takes direct neighbours into account is considered to be local. The more global a classifier is, the higher its complexity. Macskassy and Provost propose a toolkit that generalizes well-performing known methods for majority of the vote classifiers: NetKit-SRL (Macskassy & Provost, 2007). The combined methods include a local classifier (LC), a relational classifier (RC) and a collective inferencing method (CI).

The LC is applied first to the data set, which consists of all the information of the concerned graphs: nodes, known attributes, links, etc. For a missing attribute $x_i$ of node $v_i \in V$, the LC considers the set of possibilities for attribute x and computes a probability distribution over all possibilities for attribute $x_i$. This process is carried out for each missing characteristic of the nodes. Secondly, the RC takes into account the original data set again. The RC does not only use the known attributes of node $v_i$ to compute a probability distribution over the options for $x_i$, but also considers the neighbourhood of $v_i$. LC and RC have now both found a set of estimates for the missing attributes of the graph. These sets are used by the Collective Inferencing (CI) method as the a priori estimate set. A CI can make predictions of values based on two different sources (LC and RC) for the same data set, which reduces the (potential) error (Jensen, Neville, & Gallagher, 2004).

Within the Majority of the Votes methods, many options for local classifiers, relational classifiers and collective inferencing models exist (Gago-Pallares, Fontenla-Romero, & Alonso-Betanzos, 2007) (Lichtenberk, 1983) (Niu, Moreno, & Neville, 2015). Macskassy and Provost compare four relational classifiers for their NetKit-SRL method, all of which use only the local neighbourhoods for classification. Moreover, three collective inference methods, with their own local classifiers, are studied. In total, the writers thus consider twelve LC-RC-CI combinations on a number of data sets. It is concluded that the choice of LCs, RCs and CIs is dependent on the ratio of nodes with known and nodes with unknown characteristics.
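The voting idea behind the relational classifier can be illustrated with a minimal sketch. It is not the NetKit-SRL implementation; the function name, the dictionary-based graph representation and the example labels are assumptions made for this illustration. The sketch counts the known attribute values among the direct neighbours of a node and returns both the most common value and the resulting probability distribution.

```python
from collections import Counter

def majority_vote(adjacency, labels, node):
    """Predict the attribute of node from the known attributes of its neighbours.

    adjacency maps each node to a list of neighbours; labels maps nodes
    with a known attribute to that attribute (hypothetical input format).
    """
    votes = Counter(labels[n] for n in adjacency[node] if labels.get(n) is not None)
    if not votes:
        return None, {}          # no neighbour with a known attribute
    total = sum(votes.values())
    distribution = {label: count / total for label, count in votes.items()}
    return votes.most_common(1)[0][0], distribution

adjacency = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
labels = {"b": "suspect", "c": "suspect", "d": "witness"}
prediction, distribution = majority_vote(adjacency, labels, "a")
```

In a collective inference setting, a step like this would be repeated, with predicted attributes of neighbours feeding into later rounds; that iteration is left out here.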


4.2 Link-based Models

As the name might suggest, link-based models base their predictions of missing characteristics on the links of the concerned node and the link distribution of the graph (Getoor, 2005). In 2003, Lu and Getoor proposed a method to model link distributions and use this distribution to classify attributes of nodes (Lu & Getoor, 2003). First, features are assigned to the links. These features are based on the nodes that the concerned edge is linking. Lu and Getoor propose statistics that convert the node attributes to a link feature. Three different models for assigning link attributes are mentioned in the study. To use these features in order to predict missing node characteristics, Lu and Getoor choose to use a regularized logistic regression model¹ and compare a number of models to find the best fit. An iteration is applied: from the starting position of (some) missing characteristics, features of the links are assigned, which help compute the posterior probability of the missing attributes. Once this first prediction is carried out, the features of the links are updated, leading to a new (better) posterior probability of the missing characteristics, and so on. Lu and Getoor propose an algorithm that executes this iterative process. Once the process converges, that is, the probabilities and attributes no longer change with the next iteration, the algorithm terminates and the process is completed. A probability distribution for the options of the missing characteristics is now found.

4.3 Recursive Feature eXtraction (ReFeX)

The method of ReFeX (Recursive Feature eXtraction) was proposed by Henderson and co-authors in 2011 (Henderson et al., 2011). The algorithm combines local (node-based) features with neighborhood (egonet-based) features in order to predict the missing characteristics of the node. The algorithm works recursively: the features that are extracted from the graph generate new, recursive features. These recursive features are combined with local and egonet attributes. The local attributes consist of the degrees of a node (in-, out- and total degree). The egonet attributes are computed on the egonet of the node, which consists of the node itself, its neighbours and the edges between these nodes. The ReFeX algorithm then computes the means and sums of the local and egonet features over the neighbours of the node, for example the mean of the in-degrees of the neighbours. ReFeX combines these features of the local neighbourhood of the node to capture the behaviour of the node, which describes to what kind of neighbours the node is connected. Henderson and co-authors do not extensively describe how the behaviour of the node can exactly predict the missing characteristics, but only its suitability to do so.
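The recursive step can be illustrated with a simplified sketch for an undirected graph. It uses two base features per node (the degree and the number of edges inside the egonet) and, in every iteration, appends the sums and means of the current features over the neighbours. The function names and the dictionary-based graph representation are assumptions of this example, and the pruning of strongly correlated features that is part of the original algorithm is left out.

```python
def base_features(adjacency):
    """Degree and number of edges inside the egonet, for every node."""
    features = {}
    for node, neighbours in adjacency.items():
        members = {node} | set(neighbours)
        # every edge inside the egonet is seen from both endpoints, hence // 2
        internal = sum(1 for u in members for w in adjacency[u] if w in members) // 2
        features[node] = [float(len(neighbours)), float(internal)]
    return features

def refex_features(adjacency, iterations=1):
    """Append sums and means of the neighbours' features, recursively."""
    features = base_features(adjacency)
    for _ in range(iterations):
        extended = {}
        for node, neighbours in adjacency.items():
            width = len(features[node])
            sums = [sum(features[n][k] for n in neighbours) for k in range(width)]
            means = [s / len(neighbours) if neighbours else 0.0 for s in sums]
            extended[node] = features[node] + sums + means
        features = extended
    return features

# Path graph a - b - c
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = refex_features(adjacency, iterations=1)
```

The resulting feature vectors could then serve as input for any standard classifier that predicts the missing characteristic; that final step is outside the scope of this sketch.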

5 Prediction of data in forensic networks

Now that both forensic networking and the prediction of missing edges and characteristics have been discussed, the information can be combined in order to find suitable methods for the prediction of data in forensic networks. As discussed in section 2, the forensic graph is weighted, directed and characterized. Moreover, in section 3.1 the importance of efficiency and completeness of prediction is stressed. Completeness refers to the prohibition of ignoring evidence. Therefore, it is useful to not only take into account the concerned node(s) of the prediction, but also associated information and therefore nodes in the neighbourhood. For these reasons, only quasi-local indices are considered with regard to similarity scores. Finally, it is also of importance to have a high accuracy of the prediction methods for both links and characteristics. The more links and characteristics are predicted correctly, the more useful the predictions on the forensic network are for the forensic investigation.

Taking into account the above described reasons, the following features are presented in table 1 and 2 in section 9 (Attachment B) in order to discuss the suitability of the data prediction methods for forensic networks:

• Table 1 and 3, Column 2: Whether the method is studied for its suitability for weighted graphs

• Table 1 and 3, Column 3: Whether the method is studied for its suitability for directed graphs

• Table 1 and 3, Column 4: Whether the method takes into account the node attributes

• Table 2 and 4, Column 2: The time complexity of the computation of the methods

• Table 2 and 4, Column 3: The completeness of the methods, that is, which part of the graph the method takes into account

• Table 2 and 4, Column 4: Validation of higher accuracy than one or more of the other methods considered in this paper²

5.1 Link prediction in forensic networks

In table 1 and 2 the suitability features of the link prediction methods considered in section 3 are presented. In half of the methods (7 out of 14) the node attributes are or can be considered during the prediction of missing links. All these methods are neither global nor local, but take into account a sub-graph. SBM, PRM and SimRank (SR) are quasi-local methods that consider characteristics and are suitable for directed and weighted graphs. Therefore, based on the information available, SBM, PRM and SR are best applicable to forensic networks. It is difficult to discuss efficiency, since this factor is unknown for more than half of the considered methods. There is also a lack of information on the performance of the link prediction methods; some have been tested in a number of studies, but most have not been compared to one another. Moreover, the data sets used in the studies may be very different from a forensic graph data set. Please refer to the discussion (section 6.1) for more information on limitations, including the lack of studies on suitability for weighted and directed graphs.

5.2 Characteristic prediction in forensic networks

In table 3 and 4 the suitability features of the characteristic prediction methods considered in section 4 are presented. The lack of studies on the performance of the methods (see discussion in section 6.1) forms a large gap in the possibility to compare the methods. ReFeX is the only method considered for predicting missing characteristics that is known to be suitable for weighted graphs. Note that all methods take the node attributes into account, since the missing attributes are what the methods aim to find. The link-based model proposed by Lu and Getoor in (Lu & Getoor, 2003) is the only one that takes into account the features of the entire graph, which makes it a global method. As discussed in section 3.1, global prediction methods might take a long time to compute, since a large amount of information is considered. This might result in less efficiency, making the methods less suitable for forensic networks³. Therefore, based only on the (small amount of) information present, it is expected that ReFeX is the most suitable method for the prediction of missing characteristics in forensic networks.

² Note that in many of the cases this is unknown; not all methods have been compared to each other in studies. Moreover, the performance might be highly dependent on the dataset considered in the study. Please refer to section

6 Conclusion and recommendations

The goal of this study is to answer the following research question, as presented in section 1:

RQ: What are promising methods for predicting missing data in mathematical network representations of the evidence?

The answer to this question is searched for in three steps. (1) In section 2 the forensic network with the evidence is mathematically presented. Moreover, its typical features and the research gap with regard to forensic networking are explained. (2) In section 3 and 4 respectively, a number of methods for link and characteristic prediction are presented. It is chosen to focus this study on missing links and characteristics, rather than on missing nodes, since links and characteristics are expected to be potentially missing in a forensic network. (3) Lastly, the proposed methods are researched for their suitability for forensic networks, based on a number of important features in section 5. The probabilistic and maximum likelihood models are very time consuming due to the large amount of information that is necessary.

It is found that it is rather difficult to find an unambiguous answer to the research question for the following reasons. A rather small amount of research has been carried out on forensic networking and there are a number of factors unknown (see section 2.3). However, there are some consistent features of forensic networks found in the literature, as presented in section 2.4. The methods of data prediction that are considered in this study are tested for these features, which is summarized in tables 1 to 4. Even though a larger number of studies is available on link and characteristic prediction, the features remain unknown for a part of the methods. Moreover, not all of this information necessarily holds when applied to a forensic network specifically⁴. Taking into account the information that is known and presented in this paper, it is concluded that the Stochastic Block Model, as presented in section 3.2.5, the Probabilistic Relational Model, as presented in section 3.2.3, and SimRank, as presented in section11, are the most promising methods for predicting missing links in the mathematical presentations of evidence. With regard to missing characteristics, the most promising method is found to be Recursive Feature eXtraction, as proposed by Henderson and co-authors in (Henderson et al., 2011) and as described in section 4.3.

6.1 Discussion

There are a number of uncertainties within this study. In this section, the most influential uncertainties will be discussed: the lack of information on forensic networks, the accuracy of the proposed methods and doubts on the completeness of the considered literature.

³ A more global method might also result in high accuracy. It is, however, unknown how accurate the considered methods are. For more information, please refer to section 6.1.

As mentioned in section 2 and 6, the amount of research on forensic networks is limited. Therefore, it is difficult to decide on the applicability of the data prediction methods for forensic networks. For instance, it is not clear how many nodes and edges a forensic network typically consists of. Moreover, nothing is known about clustering or hierarchical structures within forensic networks, which is of great importance for some methods, such as HSM (Clauset et al., 2008). Another limitation of this study is the lack of knowledge on the accuracy of the proposed methods. Some of the methods have been compared to one another in literature, such as LP and HSM in (Lovasz, 1993). However, even if the methods are compared, they have been tested on specific, non-forensic data sets. This might provide an indication of the accuracy on other data sets, but not a conclusive statement. It might be possible that a forensic network has an entirely different structure than the graphs on which the methods have been tested or compared.

As in every literature study, a limitation is the incompleteness of the list of literature that has been considered in this paper. Even though an in-depth search has been carried out for possibly useful literature, the possibility of missing studies remains. The search strategy is described in section 8 (Attachment B). It cannot be stated with one hundred percent certainty that all available literature on the subject of forensic networks and data prediction has been considered.

6.2 Further research

In order to reduce the limitations described in section 6.1 and to take the next step in finding methods for data prediction suitable for forensic networks, some recommendations for further research are made. Most of all, it is recommended to obtain more information on forensic networks. The studies that have been carried out on this subject are limited. It is recommended to complete more case studies in order to find a conclusive method for constructing forensic graphs. Due to the small amount of research on this subject, it is expected that large improvements are yet to be made. Secondly, it is useful to test the proposed methods on sets of forensic data in order to find their suitability, accuracy, performance and time complexity. For some of the discussed methods, there is a lack of research on the applicability to weighted and directed graphs. Since this is necessary in order to apply the methods to forensic networks, it is recommended, among other things, that analysis on this subject is carried out.

7 Acknowledgements

I would like to thank Dhr. prof. dr. R. Nunez Queija (University of Amsterdam) and Dhr. dr. A.V. den Boer (University of Amsterdam) for advising and challenging me during the execution of this study. Moreover, I am very grateful for their comments, which greatly improved this paper.


References

Abdolhosseini-Qomi, A. M., Yazdani, N., & Asadpour, M. (2020). Overlapping communities and the prediction of missing links in multiplex networks. Physica A: Statistical Mechanics and its Applications, 554, 124650.

Ahmed, H. (2019). Graph Routing Problem Using Euler’s Theorem and Its Applications. , 3, 1-5.

Allen, D., Ching, L., Huber, D., & Moon, H. (2011). Hierarchical random graphs for networks with weighted edges and multiple edge attributes.

Almquist, Z. W., & Butts, C. T. (2018). Dynamic network analysis with missing data: Theory and methods. Statistica Sinica.

Altenburger, K. M., & Ugander, J. (2018). Node attribute prediction: An evaluation of within-versus across-network tasks. 32nd Conference on Neural Information Processing Systems.

Antonellis, I., Garcia-Molina, H., & Chang, C.-C. (2008). Simrank++: Query rewriting through link analysis of the click graph. , 1.

Ask, K., & Alison, L. (2010). Investigators’ decision making.

Backstrom, L., & Leskovec, J. (2011). Supervised random walks: Predicting and recommending links in social networks. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, 635-644.

Barzel, B., & Barabási, A.-L. (2013a). Network link prediction by global silencing of indirect correlations. Nat Biotechnol, 31, 720-725.

Barzel, B., & Barabási, A.-L. (2013b). Supplementary material: Network link prediction by global silencing of indirect correlations. Nat Biotechnol, 31, 720-725.

Bhanodia, P., Sethi, K., Khamparia, A., Pandey, B., & Prajapat, S. (2019). Similarity-Based Indices or Metrics for Link Prediction.

Blazwicz, J., Brzezinski, J., & Gambosi, G. (1993). Graph theoretical issues in computer networks. European Journal of Operational Research, 71(1).

Caridi, I., Dorso, C., Gallo, P., & Somigliana, C. (2011). A framework to approach problems of forensic anthropology using complex networks. Elsevier Physica A, 390, 1662-1676.

Chaski, C. (2005). Who’s At The Keyboard? Authorship Attribution in Digital Evidence Investigations. Int. J. Digit. Evid., 4.

Chen, H., & Zhang, F. (2008). The expected hitting times for finite Markov chains. Linear Algebra and its Applications, 428(11), 2730 - 2749.

Chroł, B. (2018). Retrieved from https://cran.r-project.org/web/packages/linkprediction/vignettes/proxfun.html

Clauset, A., Moore, C., & Newman, M. E. J. (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191), 98–101.

Cover, T. M., & Thomas, J. A. (2006). Elements of information theory. Wiley, 2nd edition.

Diestel, R. (2000). Graph Theory; Electronic version. Springer-Verlag New York.

Easttom, C. (2017a). Applying Graph Theory to Modeling Investigations. IOSR Journal of Mathematics, 13(2V), 47-51.

Easttom, C. (2017b). Utilizing Graph Theory to Model Forensic Examination. International Journal of Innovative Research in Information, 04(02).

Easttom, C. (2020). On the application of Algebraic Graph Theory to Modeling Network Intrusions. CEC-Security LLC Plano.

Everitt, B. (1998). The Cambridge dictionary of statistics. Cambridge University Press.

Eyal, R., Kraus, S., & Rosenfeld, A. (2011). Identifying missing node information in social networks.

Faskowitz, J., Yan, X., Zuo, X.-N., & Sporns, O. (2018). Weighted stochastic block models of the human connectome across the life span. Scientific Reports, 8.

Fornito, A., Zalesky, A., & Bullmore, E. T. (2016). Fundamentals of Brain Network Analysis. Elsevier, Inc.

Gago-Pallares, Y., Fontenla-Romero, O., & Alonso-Betanzos, A. (2007). A Comparative Study of Local Classifiers Based on Clustering Techniques and One-Layer Neural Networks. Springer Berlin Heidelberg.

Getoor, L. (2005). Link-based classification. Springer London.

Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning Probabilistic Models of Link Structure. Journal of Machine Learning Research, 3, 679-707.

Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.

Haggerty, J., Karran, A., Lamb, D., & Taylor, M. (2011). A Framework for the Forensic Investigation of Unstructured Email Relationship Data. International Journal of Digital Crime and Forensics, 3(3), 1-18.

Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H., & Faloutsos, C. (2011). It’s who you know: Graph mining using recursive structural features.

Huang, J., Li, S., & Xie, Z. (2019). Further results on the expected hitting time, the cover cost and the related invariants of graphs. Discrete Mathematics, 342(1), 78 - 95.

Jeh, G., & Widom, J. (2002). Simrank: A measure of structural-context similarity. Association for Computing Machinery.

Jensen, D., Neville, J., & Gallagher, B. (2004). Why Collective Inference Improves Relational Classification. Association for Computing Machinery.

Jiang, M., Chen, Y., & Chen, L. (2015). Link prediction in networks with nodes attributes by similarity propagation.

Khotilin, M., & Blagov, A. (2016). Visualization and cluster analysis of social networks.

Kim, M., & Leskovec, J. (2011). The network completion problem: Inferring missing nodes and edges in networks.

Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems.

Kovács, I., Luck, K., Spirohn, K., Wang, Y., Pollis, C., Schlabach, S., . . . Barabási, A.-L. (2019). Network-based prediction of protein interactions.

Krause, R. W., Huisman, M., Steglich, C., & Snijders, T. (2020). Missing data in cross-sectional networks – An extensive comparison of missing data treatment methods. Social Networks, 62, 99 - 112.

Kumar, A., Singh, S. S., Singh, K., & Biswas, B. (2020). Link prediction techniques, applications, and performance: A survey. Physica A: Statistical Mechanics and its Applications, 553, 124-289.

Kushwah, A., & Manjhvar, A. (2016). A Review on Link Prediction in Social Network. Journal of Grid and Distributed Computing, 9(2), 43-50.

Liben-Nowell, D., & Kleinberg, J. (2003). The Link Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58.

Lichtenberk, F. (1983). Relational classifiers. Lingua, 60(2), 147 - 176.

Lichtenwalter, R. N., Lussier, J. T., & Chawla, N. V. (2010). New Perspectives and Methods in Link Prediction. Proc. 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1495-1502.

Liu, W., & Lü, L. (2010). Link prediction based on local random walk. , 89(5).

Lovasz, L. (1993). Random walks on graphs: A survey. Paul Erdos is Eighty, 2, 1-46.

Lü, L., Pan, L., Zhou, T., Zhang, Y.-C., & Stanley, H. E. (2015). Toward link predictability of complex networks. Proceedings of the National Academy of Sciences, 112(8), 2325-2330.

Lu, Q., & Getoor, L. (2003). Link-based classification. AAAI Press.

Lü, L., Jin, C.-H., & Zhou, T. (2009a). Link prediction techniques, applications, and performance: A survey. Physical Review W, 80.

Lü, L., Jin, C.-H., & Zhou, T. (2009b). Similarity index based on local paths for link prediction of complex networks. Physical Review E.

Macskassy, S. A. (n.d.). Srl. Retrieved from http://netkit-srl.sourceforge.net/

Macskassy, S. A., & Provost, F. (2007). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8(34), 935-983.

Martínez, V., Berzal, F., & Cubero, J.-C. (2016). A survey of link prediction in complex networks. ACM Computing Surveys, 49, 6-9.

Mell, P., Scarfone, K., & Romanosky, S. (2007). A complete guide to the common vulnerability scoring system version 2.0.

Munasinghe, L., & Ichise, R. (2013). Link prediction in social networks using information flow via active links. IEICE Transactions on Information and Systems, E96D, 1495-1502.

Muscoloni, A., Abdelhamid, I., & Cannistraci, C. V. (2018). Local-community network automata modelling based on length-three-paths for prediction of complex network structures in protein interactomes, food webs and more. BioRxiv.

Muscoloni, A., & Cannistraci, C. V. (2017). Local-ring network automata and the impact of hyperbolic geometry in complex network link-prediction. ArXiv.

Neville, J. (2006). Statistical models and analysis techniques for learning in relational data. Computer Science Department Publication Series.

Newburn, T., Williamson, T., & Wright, A. (2012). Handbook of criminal investigation. Willan London.

Niu, R., Moreno, S., & Neville, J. (2015). Analyzing the Transferability of Collective Inference Models Across Networks.

Noel, S., Harley, E., Tam, K., Limiero, M., & Share, M. (2016). CyGraph: Graph-based Analytics and Visualization for Cybersecurity.

Pan, L., Gao, L., & Gao, J. (2017). Link prediction in weighted networks via structural perturbations.

Pandey, B., Bhanodia, P. K., Khamparia, A., & Pandey, D. K. (2019). A comprehensive survey of edge prediction in social networks: Techniques, parameters and challenges. Expert Systems