From probability table to network


Master Thesis

Rogier Hetem

Supervised by: dr. L.J. Waldorp


June 5, 2017


Contents

1 From Probability Table to Network
  1.1 Introduction
  1.2 Creating Networks
    1.2.1 Joint Probability
    1.2.2 Marginal Probability
    1.2.3 Conditional Probability
    1.2.4 Connecting Variables
    1.2.5 Determining the Dependencies
    1.2.6 Conditional Dependencies
  1.3 Advanced Methods
    1.3.1 Junction Tree
    1.3.2 Elimination Algorithm
2 Simulation Study
  2.1 Introduction
  2.2 Materials and Methods
    2.2.1 Software
    2.2.2 Network Situations
    2.2.3 Analysis
    2.2.4 Sanity Check
    2.2.5 Simulation Procedure
  2.3 Results
  2.4 Discussion
References
3 Appendix


1 From Probability Table to Network

1.1 Introduction

"Networks consist of nodes and edges". In theory nodes can represent anything (van der Bork et al., 2017, p. 3). Usually nodes represent entities and edges their relationships. Networks can be used to describe many phenomena, such as biological and technological structures, social relationships, or symptoms of a disorder (Costantini et al., 2015, p. 1). "Network models in psychopathological fields are relatively new" (van der Bork et al., 2017, p. 2). They derive from existing models by not assuming a common cause (van der Bork et al., 2017; Borsboom, 2008). Instead, in network models the nodes themselves attract or cause the activation of other nodes, and loops are even allowed in this process (van der Bork et al., 2017, p. 2). For instance, in clinical psychology nodes can represent symptoms of a disorder and the edges then represent the relationships between those symptoms, without ascribing the cause of the disorder to a single, possibly latent, variable, but allowing symptoms to have undirected relationships with each other. Furthermore, network models clearly portray the (conditional) dependencies, as network analysis provides researchers with a good visualization of their model. In this paper we demonstrate how to form a network. Although there are multiple ways to form networks, for instance by using a correlation matrix (van der Bork et al., 2017, pp. 5-7), we restrict this demonstration to forming networks from probability tables using binary data. For a broader and clearer understanding of the difference between network models and common cause models, we recommend reading Borsboom and Cramer (2013) and Borsboom (2008). For a complete understanding of network analysis and its full potential we recommend reading van der Bork et al. (2017).

1.2 Creating Networks

This section demonstrates how to create a network from a probability table. Before demonstrating the actual process of creating a network, we first have to explain some basic principles: joint, marginal, and conditional probabilities, which originate from probability theory. Having a grasp of these basic principles will make it easier to form networks. To avoid giving a solely abstract description, we will demonstrate how to obtain the probabilities of interest using a probability table. This table will then also be used to form a network, which is the main goal of this section. As mentioned before, we restrict our explanation to binary data, which means we only use variables with two possible states (0/1). We do so to explain in great detail how to obtain a network from a probability table; this also makes it possible to do the calculations by hand.

1.2.1 Joint Probability

A probability table is a table that contains all the possible outcomes of the measured or chosen variables, and displays the probabilities for those combined outcomes. We refer to those combined outcomes as joint probabilities. Below such a table is depicted (Table 1). This table displays all the possible outcomes of three binary variables and the probability of each combined outcome. The first three columns are used for the measured variables and their outcomes; the variables are labeled X, Y and Z. The last column, labeled P, displays the probability of the combined outcome of that particular row. Because the variables are from a Bernoulli distribution there are only two possible outcomes, which we label 0 or 1. In this example we use three variables, which means all the joint probabilities consist of three variables. Formally, the joint probability of our three variables is written as P(X = x, Y = y, Z = z). Using this table, the joint probability of variable X having value "1" and both variables Y and Z having value "0" is 2/13; this can be found in the fourth row of the table. Formally, we write this as P(X = 1, Y = 0, Z = 0). For this example we could come up with another seven possible joint probabilities, all of which are already displayed in the table.

Table 1: Probability table with three variables

X   Y   Z   P
0   0   0   1/13
0   0   1   2/13
0   1   0   1/13
1   0   0   2/13
1   1   0   1/13
0   1   1   1/13
1   0   1   4/13
1   1   1   1/13
            13/13

Every probability of a combined outcome of two or more variables is called a joint probability. Although our example has only three variables, in theory there could have been more, but also fewer. Below, the general formula (Formula 1) for the joint probability is given.

Joint Probability

P(A = a, B = b, ..., N = n)   (1)

As a joint probability is the combined outcome of two or more variables, we could be interested in the joint probability of only the variables X and Y. In our case this would mean we have to leave out, or rather marginalize out, variable Z; we discuss this in the next subsection.

1.2.2 Marginal Probability

We speak of a marginal probability when we have to marginalize one or more variables out of the probability distribution. For instance, if our interest is the probability distribution of the variables X and Y, we would have to, in a sense, get rid of variable Z. This process is called marginalization. For example, let's say we are interested in P(X = 1, Y = 1). In this case the only situations of interest are those in which variables X and Y are both 1. Going back to Table 1, there are two situations (rows) in which this is the case. When we look at the probabilities for those situations, we find 1/13 and, again, 1/13. Taking the sum of those probabilities answers our question.

1/13 + 1/13 = 2/13

To answer our question explicitly:

P(X = 1, Y = 1) = 2/13


By doing so we marginalized variable Z out of our distribution. We could do this with every possible combination of variables and their outcomes. It is also possible to obtain the probability for a single variable; we would then have to marginalize out the two variables that are not of interest. For example, if we were interested in P(Z = 1), we would only be interested in the cases in which variable Z is 1. Again going back to Table 1, we would have to search for all the rows in which this is the case. This gives us 2/13, 1/13, 4/13 and 1/13. Taking the sum of those probabilities answers our question.

2/13 + 1/13 + 4/13 + 1/13 = 8/13

To answer our question explicitly:

P(Z = 1) = 8/13


The general formula for the marginal probability is given below (Formula 2).

Marginal Probability

P(A = a) = Σ_{b, ..., n} P(A = a, B = b, ..., N = n)   (2)

When working with a probability table, this method makes it possible to derive marginal probabilities for every variable or combination of variables. In the next subsection we take a look at situations in which the probability space is confined.
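Since the simulations later in this thesis use R, it may help to see these lookups as code. The sketch below is ours (the object name tab is not from any package): it stores Table 1 as a data frame and sums the joint probabilities over the matching rows.

tab <- data.frame(
  X = c(0, 0, 0, 1, 1, 0, 1, 1),
  Y = c(0, 0, 1, 0, 1, 1, 0, 1),
  Z = c(0, 1, 0, 0, 0, 1, 1, 1),
  P = c(1, 2, 1, 2, 1, 1, 4, 1) / 13  # joint probabilities from Table 1
)

# Marginalize by summing the joint probabilities over the matching rows.
sum(tab$P[tab$X == 1 & tab$Y == 1])  # P(X = 1, Y = 1) = 2/13
sum(tab$P[tab$Z == 1])               # P(Z = 1) = 8/13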


1.2.3 Conditional Probability

A conditional probability is the probability that an event will occur, given that another event has already occurred. In other words, the probability space for the variable of interest is confined by another variable. For example, if we are interested in the probability that variable Z will be 1 and we already know that variable X is 1, we are confined to solely the possibilities in which variable X is indeed 1. The formula for the conditional probability, using only two variables, is given below (Formula 3).

Conditional Probability

P(A = a|B = b) = P(A = a, B = b) / P(B = b)   (3)

The left side of the equality symbol states the probability that variable A has value a given that variable B has value b. A less abstract example: what is the probability that a person diagnosed with depression also has sleeping problems? In this case we know that the person suffers from depression, making depression our known/set variable, which confines our probability space. Given Formula 3, variable B has to be that known variable, depression, and variable A therefore has to be the other variable, sleeping problems. Formulating this example as it is written in the formula: what is the probability of having sleeping problems given that someone is diagnosed with depression? Looking at the right side of the equality symbol in Formula 3, we see that both the joint and the marginal probability of a single variable, which were described in the two previous subsections, are used to calculate the conditional probability. This means it will not be necessary to introduce other mathematical concepts. Still using Table 1 for our example, we are now interested in the probability that variable Z is "1" given that variable X is "1". Stated more formally: P(Z = 1|X = 1). As this is equal to the joint probability divided by the marginal probability of the second ("given that") variable, we repeat both the process for deriving the joint probability (P(Z = 1, X = 1)) and the process for deriving the marginal probability (P(X = 1)). This gives the following results.

P(Z = 1, X = 1) = 5/13

And,

P(X = 1) = 8/13

Dividing those two probabilities gives the solution to P(Z = 1|X = 1).

P(Z = 1|X = 1) = P(Z = 1, X = 1) / P(X = 1) = (5/13) / (8/13) = 5/8 = 0.625

Thus, given that variable X is "1", the probability of variable Z being "1" is equal to 0.625. This method can be used on any possible combination of two variables. Of course a conditional probability can also contain more than two variables; for instance, it is possible to use a joint probability as the probability of interest. If one wants to calculate such a conditional probability, the formula looks as follows (Formula 4).


P(A = a, B = b|C = c) = P(A = a, B = b, C = c) / P(C = c)   (4)

In Formula 4 we can see that the only change in the calculation, compared to Formula 3, is that one now has to take the joint probability of three variables, which is no more complex than taking the joint probability of only two variables. In theory one can add as many variables as wanted, also for the "known" variables, which will further confine the probability space; this requires only minor changes to the calculations. As the provided information is sufficient to form a network from our three-variable example, there is no need to elaborate further on this topic. The three basic principles from probability theory have now, to a certain extent, been explained and demonstrated. Using these basic principles makes it possible to form a network from a probability table (such as Table 1). Because our example probability table contains only three variables, the previous explanations should be sufficient to form a network; this will not be the case when working with more variables. In such a case one needs to gain a better grasp of the different possible situations which fall under conditional probability. The next subsection demonstrates how to go from our three-variable probability table to a network.
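The same table lookup yields the conditional probabilities via Formula 3. A small sketch, reusing the tab data frame from the previous sketch:

# Formula 3: P(Z = 1 | X = 1) = P(Z = 1, X = 1) / P(X = 1).
joint    <- sum(tab$P[tab$Z == 1 & tab$X == 1])  # 5/13
marginal <- sum(tab$P[tab$X == 1])               # 8/13
joint / marginal                                 # 5/8 = 0.625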

1.2.4 Connecting Variables

With the information about probabilities from the previous sections it is possible to transform our probability table (Table 1) into a network. The process of forming a network boils down to deciding whether or not to connect variables. Variables that are dependent on each other should be connected. If a variable is dependent on another variable, its outcome or state depends on that of the other variable. This does not have to mean that if variable A is 1, variable B will by definition also be 1; the dependency can be more subtle than that. It can even be that the probability of variable B being 1 increases by as little as 1% when variable A is 1. Whether such a weak connection is interesting enough to actually add to a network model is a different question and goes beyond the scope of this paper. In theory, if a variable is dependent on another variable there should be a connection between those variables. To claim that variables are dependent we have to be quite secure in determining so. Figuring out that the outcome of variable B is influenced by the outcome of variable A does not guarantee that those two variables are in fact dependent; for instance, a third variable might explain their dependency away. Less abstractly: consider a person who suffers from depression and also has sleeping problems. Depression and sleeping problems seem to depend on each other, but when we add worrying thoughts as another variable, we find that the dependency between sleeping problems and depression no longer exists. As depression happens to lead people to have worrying thoughts, and worrying thoughts lead to sleeping problems, there is no longer a direct connection between depression and sleeping problems. Of course we cannot just determine the directionality of a connection without a solid theory, and we do not attempt to do so in this paper. Concluding: two variables are dependent on each other when their dependency cannot be explained by another (set of) variable(s). Our decision on whether to connect variables is going to be based on whether or not the connections between variables can be explained away. In the next subsection we explain the practical side of this story.

1.2.5 Determining the Dependencies

We just established that we cannot claim that variables are dependent on each other merely because we figure out that they influence each other. It is, however, a good starting point to see whether variables influence each other at all, without controlling for other variables. If two variables turn out not to influence each other at all, we are already ensured that there is no connection between those two variables. Although in practice many variables within a study influence each other somewhat, it does sometimes happen that a variable completely falls out of a network. It can save a lot of time to figure out right from the start whether variables have anything to do with each other at all, so a logical starting point is to check, using nothing more than two variables at a time, whether they have anything to do with each other. To see whether two variables influence each other we check the following equation: P(A|B) = P(A). If the equation is true for a pair of variables, we can claim that those two variables are independent of each other. In words the equation states: the probability of variable A given variable B is equal to the probability of variable A on its own. Using our depression example, if the probability that a person has sleeping problems given that he/she has depression is equal to the overall probability that a person has sleeping problems, we would claim that depression has no influence on sleeping problems at all. If this is the case it probably feels quite intuitive that those variables should not be connected in a network. In the table below (Table 2) we use our three variables and check, solely in pairs, whether the equation holds.

Table 2: Checking whether variables influence each other at all

Conditional probability   Marginal probability   Equal   Influencing
P(X|Y) = .500             P(X) = .615            No      Yes
P(Y|Z) = .250             P(Y) = .308            No      Yes
P(Z|X) = .625             P(Z) = .615            No      Yes

In Table 2 the equation P(A|B) = P(A) is checked per pair of variables. When the equation is true, the pair of variables would be independent of each other. In the table we can see that the equation holds for no pair of variables. We can therefore not yet conclude that any variable is independent of another variable. If one were to create a network solely on this information, all variables would be connected and the network would look like Network 1. Although we cannot yet be sure whether this network model is correct, it is a nice visualization of what we have established so far.

Network 1
[Figure: nodes X, Y and Z, all pairwise connected.]

In Network 1 all the variables are dependent on each other, as all the variables are connected. We cannot be sure that this is actually the case before we have tried to explain connections away. This is what we will do in the next subsection.
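The three checks in Table 2 can be reproduced from the probability table. A sketch, again reusing the tab data frame; the helper functions cond and marg are ours, and only the "1" states are compared, as in Table 2:

# P(A = 1 | B = 1) and P(A = 1), looked up from the probability table.
cond <- function(a, b) {
  sum(tab$P[tab[[a]] == 1 & tab[[b]] == 1]) / sum(tab$P[tab[[b]] == 1])
}
marg <- function(a) sum(tab$P[tab[[a]] == 1])

cond("X", "Y")  # 0.500, while marg("X") = 0.615: X and Y influence each other
cond("Y", "Z")  # 0.250, while marg("Y") = 0.308
cond("Z", "X")  # 0.625, while marg("Z") = 0.615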

1.2.6 Conditional Dependencies

After checking whether variables influence each other at all, we move on and check whether the dependencies can be explained away. If a dependency can be explained away with another (set of) variable(s), we should not draw an edge between those two variables. To determine whether variables are dependent in our three-variable example, we check the equation P(A = a, B = b|C = c) = P(A = a|C = c)P(B = b|C = c). If this equation holds, then variables A and B are independent. This makes sense, as the equation states that the joint probability of A and B given C is just the same as the separate probability of A given C times that of B given C. This means that when we have the information about C (whether C is 1 or 0), we know what this does to the probabilities for A and B; no matter what value variable A then takes, it will not change a thing for what variable B will do, and vice versa. In other words, all we need is the information of variable C to determine the probabilities for A and B. So, if the equation holds for a set of three variables, it seems logical that the variables A and B should not be connected. In Table 3 the equation is checked for all possible combinations of our three variables.

Table 3: Explaining dependencies away

Conditional probability   Product of conditionals   Equal   Dependent (direct edge)
P(X, Y|Z) = .125          P(X|Z) P(Y|Z) = .156      No      Yes
P(X, Z|Y) = .250          P(X|Y) P(Z|Y) = .250      Yes     No
P(Y, Z|X) = .125          P(Y|X) P(Z|X) = .156      No      Yes
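As an illustration, the middle row of Table 3, which drives the decision to drop an edge, can be verified directly from the probability table (a sketch reusing the tab data frame):

# Is X independent of Z given Y (checking the "= 1" states)?
pY  <- sum(tab$P[tab$Y == 1])
lhs <- sum(tab$P[tab$X == 1 & tab$Z == 1 & tab$Y == 1]) / pY  # P(X, Z|Y) = 0.250
rhs <- (sum(tab$P[tab$X == 1 & tab$Y == 1]) / pY) *
       (sum(tab$P[tab$Z == 1 & tab$Y == 1]) / pY)             # P(X|Y) P(Z|Y) = 0.250
all.equal(lhs, rhs)  # TRUE: the X-Z dependency is explained away by Y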

In Table 3 we can see that only variables X and Z are conditionally independent given variable Y. This means that their dependency is explained away, so no direct edge should be drawn between those two variables. With this new information we can create a new network, which is depicted below in Network 2.

Network 2
[Figure: the chain X – Y – Z, with no direct edge between X and Z.]

With Network 2 as our final graphical representation, we have transformed our three-variable probability table (Table 1) into a network model. Using this process one can create networks from other probability tables as well, also when working with more variables. However, at a certain point a probability table becomes so large that the calculations will cost massive amounts of time, even on a computer. Just consider having a reasonable number of 15 variables in your experiment: the result is a probability table of 2^15 = 32,768 rows. Imagine summing over those rows just to calculate the marginal probability of one variable. "Now many applications involve graphs with thousands of variables", making the "computational complexity of the brute force approaches quickly become intractable" (Wainwright, 2015, p.16). To keep calculations manageable, researchers developed algorithms that work much faster than the method we introduced, which falls in the brute force category. In the next section one of those more advanced methods for calculating probabilities is introduced.

1.3 Advanced Methods

In this section a more advanced method for calculating probabilities is introduced. This method consists of two parts: (1) forming a so-called Junction Tree and (2) the Elimination Algorithm. This method and/or elaborations on it are used in software packages that perform network analysis. This section will not provide sufficient material to perform the advanced method yourself.


The goal of this section is to provide a clear overall feel of what needs to happen to calculate probabilities from networks no matter the number of variables.

1.3.1 Junction Tree

The main message so far is that using probability tables for calculating probabilities can become intractable when adding too many variables. Fast algorithms can, however, still do the job for us. But to use those fast algorithms a network has to be transformed into a Junction Tree. The necessary steps to form a Junction Tree are twofold: (1) triangulating and (2) clustering. We start with explaining triangulating.

Triangulating

Simply said, triangulating a network is done by adding edges between nodes in such a manner that the network has no chordless cycles of length greater than three (Wainwright, 2015, p.35). Below two different cycles are depicted: one satisfying the "no chordless cycles of length greater than three" criterion and one that does not.

Cycles
[Figure: panel A shows a cycle of three nodes (A, B, C); panel B shows a chordless cycle of four nodes (A, B, C, D).]

On the left-hand side (figure A) we see a cycle consisting of three variables; this cycle satisfies the criterion. On the right-hand side (figure B) we see a cycle with a length greater than three, which therefore does not satisfy the criterion. When triangulating properly, one should avoid cycles as depicted in figure B, and of course larger cycles too. In words: when a node is connected to two or more nodes, it should be possible to start and end at that same node without passing more than three nodes in total, including itself. Let's take a look at a (simple) network and triangulate it (Networks 3 & 4).

On the left-hand side we see Network 3, which is not triangulated as it contains a chordless cycle of length greater than three. On the right-hand side we see Network 4, the triangulated version of Network 3. This version is triangulated because it does not have any chordless cycles of length greater than three. So yes, triangulating can be as simple as adding one edge to a network. When working with bigger networks this process is still equally simple, only it will take some extra time to add all the edges. The real question is, of course, why we have to triangulate a network at all. As mentioned before, the goal is to use algorithms that can calculate probabilities in a fast way. However, those fast algorithms only work properly on a tree structured graph (Wainwright, 2015, p.31) instead of a network. Transforming a network into a tree structured graph is therefore necessary, and the first step to do so is triangulation.


Networks 3 & 4
[Figure: two networks on the nodes A, B, C, D and E. Network 3 (left) is not triangulated; Network 4 (right) is the triangulated version.]

To transform a network into such a tree structured graph we have to divide the network into small cliques, which can then become the branches of the tree. To form those small cliques we first have to connect variables in such a way that forming the cliques becomes possible, which brings us right back to the triangulation step. Let's move on to this next step and start dividing our triangulated network into clusters.

Clustering

Simply said, clustering a triangulated network means forming small cliques from the variables. This clustering process has to follow two basic rules. Rule 1: if a node is directly connected to another node, it must be in a clique with that node. Rule 2: the cliques must be as small as possible. Let's use our triangulated network (Network 4) as an example to demonstrate the clustering process (see the Clustering figure below).

Clustering
[Figure: Network 4 shown three times, each time with one clique highlighted in bold red letters. Clique 1: variables A, B and C. Clique 2: variables B, C and D. Clique 3: variables B, D and E.]

Here (in the Clustering figure) we see the same triangulated network three times. In each of those networks some nodes are highlighted (by bold, red letters); those highlighted nodes together are the cliques we formed. We did this while following the two basic rules, and therefore state that we have clustered this network. As this is not the easiest way to depict a clustered network, we will visually transform it into a single graph. This is done by joining the variables of each separate clique together into one single node, as depicted in Clique Graph 1.


Clique Graph 1
[Figure: three connected nodes in a row: Clique 1 (A, B, C), Clique 2 (B, C, D) and Clique 3 (B, D, E).]

We can see that the three cliques contain the same variables as was the case after the clustering process, only now they are placed together in single nodes. Also, this graph does not contain any cycles. This makes it a tree structured graph, which is exactly what we want.

Without going into the mathematical details, forming clique graphs this way has one drawback: in general, the marginals of this clique tree will not match the marginals of the original distribution (Wainwright, 2015, p.33). To prevent miscalculations, clique graphs need to meet the so-called Running Intersection Property. This is done by making sure that every two cliques that contain at least one common node are connected through a path in which every clique in between also contains that same node (Wainwright, 2015, pp. 33-34). Using Clique Graph 1 as an example: both clique 1 and clique 3 contain node B. To meet the Running Intersection Property, all the cliques that fall between cliques 1 and 3 must also contain node B, which happens to be the case. This clique graph therefore meets the Running Intersection Property. When a clique graph meets this property it is called a Junction Tree. In short, after triangulating a network and transforming it into a clique graph that meets the running intersection property, one has formed a Junction Tree. After transforming a network into a Junction Tree it is possible to use fast algorithms to calculate probabilities. In the next subsection one of those algorithms is explained and demonstrated.
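For readers who want to experiment, the gRbase package (used later in this thesis) automates triangulation and junction tree construction. The sketch below is only an illustration: the example graph is ours, not one from this chapter, and we assume gRbase's ug(), triangulate() and rip() functions as described in its documentation.

library(gRbase)

# An undirected graph containing a chordless 4-cycle A-B-D-E, plus node C.
g  <- ug(~ A:B + B:D + D:E + E:A + B:C)
tg <- triangulate(g)  # add an edge so no chordless cycle longer than 3 remains
rip(tg)               # cliques and separators: a junction tree of the graph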

1.3.2 Elimination Algorithm

The Elimination Algorithm is an "exact but rather naive method" (Wainwright, 2015, p.17) for calculating marginals from Junction Trees, which correctly implies that there are more advanced methods. Simply said, the algorithm obtains the marginal of the variable of interest by removing/eliminating all the other variables. This "eliminating" is performed by taking a partial sum. When dealing with variables that are present in multiple cliques, one first has to take the product of those cliques. For a broader explanation of the algorithm we refer to Wainwright (2015, pp. 17-21). Below a demonstration of the algorithm is given. For this demonstration we use our own three-variable network, depicted in Network 2. Although in practice it does not make any sense to use this algorithm on a three-variable network, it serves illustrative purposes here. As mentioned before, the fast algorithms only work on tree structured graphs, so let's pretend that our three-variable network actually is a tree structured graph. We do this by dividing our three-variable network model into three cliques: clique 1 contains only variable X, clique 2 contains variables X and Y, and clique 3 contains variables Y and Z. This clique graph is depicted below in Clique Graph 2. In this scenario we only have direct access to the probabilities within each clique. These probabilities are depicted in Table 4.

Clique Graph 2
[Figure: three connected nodes: Clique 1 (X), Clique 2 (X, Y) and Clique 3 (Y, Z).]


Table 4: Probabilities per clique

P(X)               P(Y|X)                   P(Z|Y)
P(X = 1) = 0.615   P(Y = 1|X = 1) = 0.250   P(Z = 1|Y = 1) = 0.500
P(X = 0) = 0.385   P(Y = 0|X = 1) = 0.750   P(Z = 0|Y = 1) = 0.500
                   P(Y = 1|X = 0) = 0.400   P(Z = 1|Y = 0) = 2/3
                   P(Y = 0|X = 0) = 0.600   P(Z = 0|Y = 0) = 1/3

If we now want to know a marginal probability for, let's say, a single variable, we have to use another method than the brute force method introduced in a previous section. This is what the elimination algorithm can do for us. As we already know that the marginal probability of our single variable Z is P(Z = 1) = 8/13 ≈ 0.615, we can check whether the outcome of the elimination algorithm matches that value. We start the process by eliminating variable X. Below the actual calculations are shown. What happens is the following: first we obtain the joint probabilities, just in a different manner than in the brute force method; secondly, we sum out variable X.

Forming the Joint Probability for X and Y

P(X = 1) P(Y = 1|X = 1) = 0.615 × 0.250 = 0.15375
P(X = 1) P(Y = 0|X = 1) = 0.615 × 0.750 = 0.46125
P(X = 0) P(Y = 1|X = 0) = 0.385 × 0.400 = 0.15400
P(X = 0) P(Y = 0|X = 0) = 0.385 × 0.600 = 0.23100

By taking the product we formed the joint probability of the two present variables. Now we need to sum variable X out. This means that we sum over the situations in which Y = 1 and over the situations in which Y = 0. This gives us P(Y = 1) = 0.30775 (0.15375 + 0.15400) and P(Y = 0) = 0.69225 (0.46125 + 0.23100). We have now eliminated variable X; a graphical representation of this is presented in Clique Graph 3.

Clique Graph 3
[Figure: two connected nodes: Clique (Y) and Clique (Y, Z), after eliminating X.]

By repeating the same process we can eliminate variable Y and obtain the marginal distribution of Z. The calculations for the joint probabilities are shown below. We first form the joint probability of the two present variables by taking the product; then we sum out variable Y. This gives us P(Z = 1) = 0.615375 (0.153875 + 0.461500) and P(Z = 0) = 0.384625 (0.153875 + 0.230750). As we can see (P(Z = 1) = 0.615375), the results are in agreement with our previous calculations. Concluding, we have performed our calculations correctly.


Forming the Joint Probability for Y and Z

P(Y = 1) P(Z = 1|Y = 1) = 0.30775 × 0.500 = 0.153875
P(Y = 1) P(Z = 0|Y = 1) = 0.30775 × 0.500 = 0.153875
P(Y = 0) P(Z = 1|Y = 0) = 0.69225 × 2/3 = 0.461500
P(Y = 0) P(Z = 0|Y = 0) = 0.69225 × 1/3 = 0.230750
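Both elimination steps are small matrix-vector products, so the whole calculation fits in a few lines of R. A sketch using the clique probabilities from Table 4 (the object names are ours, and the rows and columns must be entered in matching state order):

# Clique potentials from Table 4, for the chain X - Y - Z.
p_x  <- c(0.615, 0.385)         # P(X = 1), P(X = 0)
p_yx <- matrix(c(0.250, 0.400,  # rows: Y = 1, Y = 0
                 0.750, 0.600), # cols: X = 1, X = 0
               nrow = 2, byrow = TRUE)
p_zy <- matrix(c(0.5, 2/3,      # rows: Z = 1, Z = 0
                 0.5, 1/3),     # cols: Y = 1, Y = 0
               nrow = 2, byrow = TRUE)

p_y <- p_yx %*% p_x  # eliminate X: P(Y = 1) = 0.30775, P(Y = 0) = 0.69225
p_z <- p_zy %*% p_y  # eliminate Y: P(Z = 1) = 0.615375, P(Z = 0) = 0.384625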

Summarizing: to work with networks with a large number of variables one has to use more advanced methods for calculating marginals, as a simple brute force method, explained in the section "Creating Networks", will simply cost too much time. To make it possible to work with faster algorithms, one does however first have to transform the network into a tree structured graph. Using tree structured graphs enables the use of fast algorithms such as the Elimination Algorithm. As mentioned at the beginning of this section, this algorithm is still a naïve method; algorithms that elaborate on the Elimination Algorithm and work faster can be found in Wainwright (2015).


2 Simulation Study

Abstract

This simulation study researches the effect of unobserved variables on the dependencies within a network. Four different situations are tested, in which the unobserved variable is (1) a mediating variable, (2) a common effect of two observed variables, (3) a variable that accounts for unexplained variance in one of the observed variables, or (4) a confounding variable. Results: dependencies are robust when dealing with mediating and common effect variables. Dependencies are slightly distorted when dealing with an unobserved variable that accounts for some unexplained variance in one of the observed variables; this is moderated by sample size, with a bigger sample leading to a smaller distortion. Dependencies are not interpretable when dealing with a confound.

2.1 Introduction

Using graphical models, such as network models, one can infer marginal distributions. Of course, one is bound to the variables that were observed. It is therefore important to understand the studied concept, so that the variables that matter can be included. As it is not possible to observe, or even be aware of, all variables that might be of importance, inferring marginal distributions can unfortunately go wrong. Consider a study in which a new antidepressant is tested, with the hope that it improves general mood. In this study sleep quality is also taken into account, as it is known to be an important factor in depression (Augner, 2011; Tsuno et al., 2005). The results of the study demonstrated an improved mood. However, no causal relationship was found between the new antidepressant and mood. What happened was that the quality of sleep improved, which caused the better mood. This improvement in sleep quality could have been caused directly or indirectly by the new antidepressant, but it is also possible that it was caused by some other, unrelated factor. As the latter happened to be the case, the researchers concluded that the new antidepressant did not cause an improvement in mood. Now consider this exact same experiment, but this time the researchers did not take sleep quality into account. The results would demonstrate that the antidepressant has a positive effect on mood, which could lead the researchers to conclude that the new antidepressant caused this improvement. This conclusion would be false, making it an example of wrongly inferring a marginal distribution. On a more abstract level, when studying the effect of treatment X on response Y, it is common practice to take possible intervening factors into account. Some of those factors might affect the response, some might affect the treatment, and some may affect both (Pearl, 2001). By not taking these factors into account one could end up inferring incorrect marginal distributions. In this paper we research which situations could lead to such incorrectly inferred marginal distributions. Using simulations, different network situations are created in which we are dealing with an unobserved variable; in our example, sleep quality portrayed such an unobserved variable. Our goal is to demonstrate whether inferring the marginal distribution goes correctly or incorrectly in those different situations.

2.2 Materials and Methods

2.2.1 Software

All calculations are executed within the statistical computing software R (R Core Team, 2016). For simulating data and calculating the conditional probabilities the following R packages were used: gRbase (Dethlefsen et al., 2005), gRain (Højsgaard et al., 2012), gRim (Højsgaard, 2013), and stats (R Core Team, 2016). Given specified conditional probability tables, the package gRain can simulate data according to those tables; this package requires the gRbase package to run. From that simulated data a network is estimated and the conditional probabilities of interest are taken; for this, the package gRim is required.


2.2.2 Network Situations

In this simulation study four different network situations are simulated. In all of those situations the simulated network contains four variables, namely A, B, C and U. The variable U serves as the unobserved variable, hence the name. The process is quite straightforward: we start with simulating a network based on one of the four network situations. We then calculate the conditional probabilities twice, once while including all four variables and once without variable U. The difference between the calculated conditional probabilities in those two instances (all variables included or without variable U) is the effect the unobserved variable would have had. To keep results clear we only use one specific conditional probability instead of all possible conditional probabilities per network. Before continuing with further details, let's first take a look at the four different network situations that will be simulated.

Four Different Network Situations
[Figure: four networks on the variables A, B, C and U. Situation 1: Mediation Path. Situation 2: Common Effect. Situation 3: Unexplained Variance. Situation 4: Confound.]

Above, the four network situations that will be simulated are shown. Although the four situations might appear rather similar, the differences lie in the connections and their directions. In both situations 1 and 2 at least one of the observed variables (A, B and C) has a causal relationship with the unobserved variable (U), while in situations 3 and 4 this is the other way around: only the unobserved variable has a causal relationship with at least one of the observed variables. This is an important feature, because in situations 3 and 4 the unobserved variable is completely free from any influences from the network itself. This makes it possible for variable U to be biased. Just imagine variable U representing gender. We understand that such a variable is of major importance, and every paper explicitly states the gender proportions in its study. But hypothetically gender could be the unobserved variable in network situations 3 and 4. On top of that, it could even be biased, which would mean our sample could consist of only female or only male participants. With gender as an unobserved and biased variable in a study, it is very plausible that this would influence the results. Although it is unlikely that specifically gender is not taken into account by researchers, there are of course many other possibilities that could fill the spot of our unobserved variable. Some other options could be that the researchers only have participants who are smokers, from the same generation, educated at the same level, drug users, religious, working at the same company, or educated by the same teacher(s). This feature makes situations 3 and 4 extra interesting. Concretely, in situations 3 and 4 we research how an unobserved variable that is free from influences from the observed variables affects the conditional probabilities of those observed variables. We are specifically interested in the case when the unobserved variable is biased. As mentioned, to keep results clear we only use one conditional probability: the probability of variable C given variable A. In the next subsection a description of the analysis can be found, in which we also explain the different way of analyzing situations 3 and 4 in comparison with situations 1 and 2.

2.2.3 Analysis

The goal of the analysis is to answer the following question: what is the effect of different sorts of unobserved variables on the relationships between the observed variables? To answer this question, we look at four different network situations, in which the unobserved variable is (Situation 1) a mediating variable, (Situation 2) a common effect, (Situation 3) unexplained variance influencing the response variable, and (Situation 4) a confounding variable influencing both the response and the treatment variable. In situations 1 and 2 we calculate P(C = 1|A = 1) both when variable U is included in the model and when it is not. The difference between those two cases is the effect variable U would have on the relationship between variables A and C if it were not observed. In situations 3 and 4 we calculate P(C = 1|A = 1) and compare it with P(C = 1|A = 1, U = 1). The difference between those two quantities is the effect variable U would have as an unobserved and biased variable, which explains why variable U is set to 1 here. Furthermore, it is realistic to assume that N matters, as it does in many analyses, so we use two different sizes of N per simulation; each simulation procedure is repeated with both 100 and 500 observations. The argument for using 100 data points as our minimum is that the packages ran into too much trouble with a smaller N. The argument for using 500 data points as our maximum is that it seemed like a large number for N that can still be reached when conducting a big study. Lastly, each simulation is repeated with three different connection strengths between the unobserved variable and the observed variable(s). An example of the different connection strengths can be found in Table 5.

Table 5: The different connection strengths for situation 1

Conditional probability   Weak   Medium   Strong
P(U = 1|C = 1, A = 1)     .55    .65      .75
P(U = 0|C = 1, A = 1)     .45    .35      .25
P(U = 1|C = 0, A = 1)     .52    .58      .65
P(U = 0|C = 0, A = 1)     .48    .42      .35
P(U = 1|C = 1, A = 0)     .51    .57      .62
P(U = 0|C = 1, A = 0)     .49    .43      .38
P(U = 1|C = 0, A = 0)     .48    .45      .37
P(U = 0|C = 0, A = 0)     .52    .55      .63

In Table 5 the three different connection strengths are labeled weak, medium and strong. In the first row we can see that the connection strengths increase going from the weak to the strong column. When the connection strengths are closer to the extremes, which are either 0 or 1, the influence that variables have on each other is larger.

Before running the actual simulations we first perform a "sanity check". This is done to demonstrate the functionality of the packages. The sanity check is outlined in the following subsection.

2.2.4 Sanity Check

We run a sanity check to demonstrate the functionality of the packages. This is done as follows: we create a three-variable chain structured network (see the chain structured network below) and calculate the conditional probability of variable C given variable A, both when all three variables are present and when the middle variable, variable B, is removed. This procedure is similar to the procedure in the simulation study itself. The difference is that in this instance we know there should be no difference between the two calculations: there is only one path from variable A to C, so it should not matter whether the middle variable B is removed or not. If indeed no difference, or only a very small difference, is found in this sanity check, we can be sure that the packages are performing up to standard.

Chain Structured Network
[Figure: a three-variable chain A – B – C, used for the sanity check.]

Above we see the three-variable chain structured network that will be used for our sanity check. The results of the sanity check are outlined below.
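The reasoning behind the sanity check can be made concrete with a small numerical sketch: on a chain, P(C = 1|A = 1) is by definition the sum over the states of B, so removing B cannot change that value. The CPT values below are purely illustrative, not the ones used in the study.

# Hypothetical CPTs for the chain A -> B -> C.
p_b1_a1 <- 0.7                    # P(B = 1 | A = 1)
p_c1_b  <- c(b1 = 0.6, b0 = 0.2)  # P(C = 1 | B = 1), P(C = 1 | B = 0)

# P(C = 1 | A = 1), obtained by summing B out of the chain:
p_c1_a1 <- p_c1_b["b1"] * p_b1_a1 + p_c1_b["b0"] * (1 - p_b1_a1)
unname(p_c1_a1)  # 0.7 * 0.6 + 0.3 * 0.2 = 0.48, with or without B in the model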

Results

The mean absolute difference found in the simulations with 100 observations was .0047. The histogram below (Histogram 1) displays all absolute differences found and their occurrences in percentages. In Histogram 1 we can also see, above the bar on the left-hand side, "> 91%": in a little over 91% of all the simulations the difference found was either 0 or extremely close to 0 (smaller than 1e-15). The highest difference found during these simulations was .1098.


Histogram 1
[Figure: histogram of the absolute differences for the 100-observation simulations.]

The mean absolute difference found in the simulations with 500 observations was .0102. The histogram below (Histogram 2) displays all absolute differences found and their occurrences in percentages. In Histogram 2 we can also see, above the bar on the left-hand side, "> 61%": in a little over 61% of all the simulations the difference found was either 0 or extremely close to 0 (smaller than 1e-15). The highest difference found during these simulations was .0466.

Histogram 2
[Figure: histogram of the absolute differences for the 500-observation simulations.]

Concluding

The results of the sanity check confirm the quality of the algorithms within the packages. The simulations in which no difference was found at all clearly stand out the most. This, in combination with the low mean absolute differences, demonstrates that the functionality of the packages is trustworthy. Although the highest difference found, .1098, was higher than we had hoped for, we want to emphasize that such higher differences occurred very rarely. The differences that were found are probably due to sampling error and/or rounding differences and do not worry us. However, a more objective solution than simply accepting these small differences can be established by creating a baseline. A good baseline for the simulation study seems to be the mean absolute difference found during this sanity check. For clarity we chose to use only one baseline, and to be as fair as possible we chose the highest mean, which is .0102.

In the following subsection we give a description of the simulation process. The actual code for the simulations can be found in the appendix which is included at the end of this paper.

2.2.5 Simulation Procedure

The general simulation process is outlined below in the numbered list.

1. To simulate data the function “simulate” is used, which comes with the package stats (R Core Team, 2016).

2. Next, the simulated data is placed into a contingency table, using the "xtabs" function, which comes with the package stats (R Core Team, 2016).

3. The contingency table is used to form a log-linear saturated model using the “dmod” function, which comes with the package gRim (Højsgaard, 2013).

4. To estimate an appropriate model, we use a stepwise backward elimination procedure, where the AIC is the selection criterion (Højsgaard et al., 2012, p.17). To do this the function “stepwise” is used, which comes from the package gRbase (Dethlefsen et al., 2005).

5. To construct the model into a graph object the “ugList” function is used, which also comes from the package gRbase (Dethlefsen et al., 2005).

6. Steps 2 to 5 are repeated, only this time variable U is not taken into account.

7. The graph objects from steps 1 to 6 are used to build an independence network. The "grain" function, which comes with the gRain package (Højsgaard et al., 2012), is used for this.

8. With the function "querygrain", from the same package, the conditional probabilities of interest are requested and saved, both when all variables are present and when variable U is not taken into account.

9. This procedure repeats itself until the objects from step 8 contain 10,000 values.
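For orientation, the querying end of this pipeline (steps 7 and 8) looks roughly as follows. This is a minimal sketch with illustrative CPT values, mirroring the cptable/compileCPT calls shown in the appendix; we assume gRain's setEvidence() accepts an evidence list as written here.

library(gRain)

# A small network A -> B -> C with illustrative (not the study's) CPTs.
yn  <- c("yes", "no")
a   <- cptable(~ A,     values = c(.5, .5),         levels = yn)
b.a <- cptable(~ B + A, values = c(.7, .3, .4, .6), levels = yn)
c.b <- cptable(~ C + B, values = c(.6, .4, .2, .8), levels = yn)
net <- grain(compileCPT(list(a, b.a, c.b)))

querygrain(net, nodes = "C")  # marginal distribution of C
querygrain(setEvidence(net, evidence = list(A = "yes")), nodes = "C")  # P(C | A = yes)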

2.3 Results

Network Situation 1
[Figure: situation 1 (Mediation Path) on the variables A, B, C and U.]

In the simulations for network situation 1 we found the following results. Three of the six mean differences exceeded our baseline of .0102. The highest mean difference found was .0228, in the case with 100 observations and a strong connection strength between the unobserved and observed variables. The used connection strength had a significant influence on the outcome in both the 100 and 500 observations simulations. When comparing the weak and the strong connection strength within the 500 observations simulation, an effect size of 1.14 was found using Cohen's d (Cohen, 1988); this was the highest effect size.

The used sample size also had a significant influence on the outcome. When comparing the cases with a strong connection strength between the 100 and 500 observations simulations, the effect size was .34, which was the highest effect size. In Figures 1 and 2 the results are depicted in the form of distributions. Each distribution contains 10,000 differences. The red horizontal line represents our baseline of .0102. The black square accompanied by the white rectangle represents the median and the interquartile range.


Figures 1 & 2
[Figures: distributions of the differences per condition for network situation 1.]

Network Situation 2
[Figure: situation 2 (Common Effect) on the variables A, B, C and U.]

In the simulations for network situation 2 we found the following results. Two of the six mean differences exceeded our baseline of .0102. The highest mean difference found was .0113, in the case with 500 observations and a strong connection strength between the unobserved and the observed variables. The used connection strength had a significant influence on the outcome in both the 100 and 500 observations simulations. When comparing the weak and the strong connection strength within the 500 observations simulation, an effect size of 1.53 was found, which was the highest effect size. When using a weak or a medium connection strength the results differed significantly between the 100 and 500 observations simulations. When comparing the cases with a weak connection strength between the 100 and 500 observations, the effect size was .23, which was the highest effect size. In Figures 3 and 4 the results are depicted in the form of distributions. Each distribution contains 10,000 differences. The red horizontal line represents our baseline of .0102. The black square accompanied by the white rectangle represents the median and the interquartile range.

Figures 3 & 4
[Figures: distributions of the differences per condition for network situation 2.]

Network Situation 3
[Figure: situation 3 (Unexplained Variance) on the variables A, B, C and U.]

In the simulation for network situation 3 we found the following results. All six mean differences exceeded our baseline of .0102. The highest mean difference found was .0529, in the case with 100 observations and a weak connection strength between the unobserved and the observed variable. The used connection strength had a significant influence on the outcome in both the 100 and 500 observations simulations. When comparing the weak and the strong connection strength within the 500 observations simulation, an effect size of .2 was found, which was the highest effect size. Also, when comparing the cases in which a strong connection strength was used between the 100 and 500 observations simulations, the effect size was .93, which was the highest effect size. In Figures 5 and 6 the results are depicted in the form of distributions. Each distribution contains 10,000 differences. The red horizontal line represents our baseline of .0102. The black square accompanied by the white rectangle represents the median and the interquartile range.

Figures 5 & 6
[Figures: distributions of the differences per condition for network situation 3.]

Network Situation 4

[Figure: situation 4 (Confound) on the variables A, B, C and U.]

In the simulation for network situation 4 we found the following results. All six mean differences exceeded our baseline of .0102. The highest mean difference found was .2367, in the case with 100 observations and a strong connection strength between the unobserved and the observed variables. The used connection strength had a significant influence on the outcome in both the 100 and 500 observations simulations. Comparing the weak and the strong connection strength within the 500 observations simulation, an effect size of 6.19 was found, which was the highest effect size. When using a weak or a medium connection strength the results differed significantly between the 100 and 500 observations simulations. When comparing the weak connection strength between the 100 and 500 observations simulations we find an effect size of .34, which was the highest effect size. In Figures 7 and 8 the results are depicted in the form of distributions. For these two figures the y-axis has been rescaled to make the results fit in, so be cautious when comparing these figures to the previous ones. Each distribution contains 10,000 differences. The red horizontal line represents our baseline of .0102. The black square accompanied by the white rectangle represents the median and the interquartile range.

Figures 7 & 8
[Figures: distributions of the differences per condition for network situation 4, with a rescaled y-axis.]


2.4 Discussion

The simulation study reported here shows the following. First, in each simulation the connection strength between the observed and the unobserved variable(s) influenced the outcome: the higher the connection strength, the bigger the distortion of the results. We do want to remark that in network situation 3 (the unexplained variance situation) the differences between the connection strengths were exceptionally small and most likely negligible. We therefore conclude that the connection strength between the unobserved and the observed variables plays an important role in the distortion of the results. Secondly, in most of the simulations the number of observations influenced the outcome. Mostly this meant that the bigger sample had a smaller distortion of the results; notably, however, this was not the case in all situations. We do want to remark that the differences between the two sample sizes were rather small, with network situation 3 as an exception. We therefore conclude that using 100 or 500 observations will in most cases only affect the results in a minor way.

Network Situation 1 (Mediation Path). In network situation 1 half of the mean differences found stayed below the baseline. The differences that did exceed the baseline were quite low, as they stayed below .03. This means the distortion of the results can be as high as 3% when dealing with an unobserved mediating variable, which we argue is barely worth mentioning. We conclude that these results demonstrate that the dependencies within networks are robust against the influence of an unobserved mediating variable.

Network Situation 2 (Common Effect). In network situation 2, four of the six mean differences stayed below the baseline. The differences that did exceed the baseline were quite low, as they stayed below .02. This means the distortion of the results can be as high as 2% when dealing with an unobserved common effect variable. We conclude that these results demonstrate that the dependencies within networks are robust against the influence of an unobserved common effect variable.

Network Situation 3 (Unexplained Variance). In network situation 3 it was most clear that the sample size influenced the outcome: the bigger the sample size, the lower the distortion of the results. This simulation demonstrates that when using a smaller sample size a small but certain distortion of the results is unavoidable. We emphasize that we are dealing with a small distortion here, as the biggest mean difference found was (only) .0529. We conclude that these results demonstrate that the dependencies within a network are robust against influences of variables that account for unexplained variance, noting however that when working with a smaller sample one should be more cautious and aware that the results might be somewhat influenced by such variables.

Network Situation 4 (Confound). In network situation 4 it was most clear that the connection strength influenced the results: the higher the connection strength, the bigger the distortion of the results. The sample size did not have any practical influence on the results. This simulation demonstrates that when dealing with a confounding variable a certain, and probably large, distortion of the results is to be expected. We conclude that the dependencies within a network are not robust against the influences of confounding variables.

General Conclusion. The dependencies within networks seem to be robust against most of the influences of unobserved variables. Especially when working with bigger samples one can be confident that the dependencies are trustworthy estimates of the relationships within a network. We do have to add, as an unsurprising side note, that network analysis is no exception to the rule and that here too the results are not interpretable when one is dealing with a confounding variable.


References

Augner, C. (2011). Associations of subjective sleep quality with depression score, anxiety, phys-ical symptoms and sleep onset latency in students. Central European journal of public health, 19(2):115.

Borsboom, D. (2008). Psychometric perspectives on diagnostic systems. Journal of clinical psychology, 64(9):1089–1108.

Borsboom, D. and Cramer, A. O. (2013). Network analysis: an integrative approach to the structure of psychopathology. Annual review of clinical psychology, 9:91–121.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.

Costantini, G., Epskamp, S., Borsboom, D., Perugini, M., Mõttus, R., Waldorp, L. J., and Cramer, A. O. (2015). State of the art personality research: A tutorial on network analysis of personality data in R. Journal of Research in Personality, 54:13–29.

Dethlefsen, C., Højsgaard, S., et al. (2005). A common platform for graphical models in R: The gRbase package. Journal of Statistical Software, 14(17):1–12.

Højsgaard, S. (2013). gRim: Graphical Interaction Models. R package version 0.1-17.

Højsgaard, S. et al. (2012). Graphical independence networks with the gRain package for R. Journal of Statistical Software, 46(10):1–26.

Pearl, J. (2001). Causal inference in the health sciences: a conceptual introduction. Health services and outcomes research methodology, 2(3):189–220.

R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Tsuno, N., Besset, A., and Ritchie, K. (2005). Sleep and depression. The Journal of clinical psychiatry, 66(10):1254–1269.

van der Bork, R., van Borkulo, C., Cramer, A., Waldorp, L. J., and Borsboom, D. (2017). Network models for clinical psychology. In: Stevens' handbook, ed. E. J. Wagenmakers. Elsevier.

Wainwright, M. J. (2015). Graphical models and message-passing algorithms: Some introductory lectures. In Mathematical Foundations of Complex Networked Information Systems, pages 51–108. Springer.


3 Appendix

In this appendix, the R code used for simulating the four network situations is given.

# Loading packages.
require(gRbase)
require(gRain)
require(gRim)
require(dplyr)

# Information for network situation 1
# Network Situation 1 = Mediation sideways, Weak link
#####

# Connection strengths between variables.
x1 <- .5
x0 <- .5
y1x1 <- .7
y0x1 <- .3
y1x0 <- .4
y0x0 <- .6
z1y1c1 <- .58
z0y1c1 <- .42
z1y0c1 <- .54
z0y0c1 <- .46
z1y1c0 <- .53
z0y1c0 <- .47
z1y0c0 <- .49
z0y0c0 <- .51
c1x1 <- .54
c0x1 <- .46
c1x0 <- .35
c0x0 <- .65

# Containers for the estimates (ALL/MIN: 500 observations,
# ALL.2/MIN.2: 100 observations).
ALL <- c()
MIN <- c()
ALL.2 <- c()
MIN.2 <- c()

# Conditional probability tables and the true network.
yn <- c("yes", "no")
x <- cptable(~ X, values = c(x1, x0), levels = yn)
c.x <- cptable(~ C + X, values = c(c1x1, c0x1, c1x0, c0x0), levels = yn)
y.x <- cptable(~ Y + X, values = c(y1x1, y0x1, y1x0, y0x0), levels = yn)
z.yc <- cptable(~ Z + Y + C,
                values = c(z1y1c1, z0y1c1, z1y0c1, z0y0c1,
                           z1y1c0, z0y1c0, z1y0c0, z0y0c0),
                levels = yn)
plist <- compileCPT(list(x, c.x, y.x, z.yc))
Network1.weak <- grain(plist)
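# Sketch, not part of the original script: the true network can be
# queried directly, for instance for the conditional distribution of
# Z given X that the simulations below try to recover (querygrain()
# comes with the gRain package loaded above).
querygrain(Network1.weak, nodes = c("X", "Z"), type = "conditional")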

# Starting simulation.

prog <- dplyr::progress_estimated(550)  # 500 data points
for (j in 1:550) {
  if (length(ALL) > 10000) {
    stop("We reached the 10.000!")
  }
  for (i in 1:20) {
    # Simulate data from the true network and select a model by
    # stepwise search, once with and once without the variable C.
    Network1sim <- simulate(Network1.weak, nsim = 500)
    cad.tab <- xtabs(~ ., data = Network1sim)
    cad.sat.all <- dmod(~ .^., data = cad.tab,
                        marginal = c("X", "C", "Y", "Z"))
    cad.sel <- stepwise(cad.sat.all)
    cad.gc <- cad.sel$glist
    cad.ug <- ugList(cad.gc)
    cad.tab2 <- xtabs(~ ., data = Network1sim[, c(1, 3, 4)])
    cad.sat2 <- dmod(~ .^., data = cad.tab2,
                     marginal = c("X", "Y", "Z"))
    cad.sel2 <- stepwise(cad.sat2)
    cad.gc2 <- cad.sel2$glist
    cad.ug2 <- ugList(cad.gc2)
    # Only keep replications in which Z is connected in the reduced
    # graph, and store P(Z | X) from both networks.
    if (is.null(cad.ug2@edgeL$Z[[1]]) == FALSE) {
      Network1.min <- grain(cad.ug2, data = Network1sim[, c(1, 3, 4)])
      Network1.all <- grain(cad.ug, data = Network1sim)
      min <- querygrain(Network1.min, nodes = c("X", "Z"), type = "conditional")[1]
      all <- querygrain(Network1.all, nodes = c("X", "Z"), type = "conditional")[1]
      ALL <- c(ALL, all)
      MIN <- c(MIN, min)
    }
  }
  prog$tick()$print()
}

prog <- dplyr::progress_estimated(600)  # 100 data points
for (j in 1:600) {
  if (length(ALL.2) > 10000) {
    stop("We reached the 10.000!")
  }
  for (i in 1:20) {
    Network1sim <- simulate(Network1.weak, nsim = 100)
    cad.tab <- xtabs(~ ., data = Network1sim)
    cad.sat.all <- dmod(~ .^., data = cad.tab,
                        marginal = c("X", "C", "Y", "Z"))
    cad.sel <- stepwise(cad.sat.all)
    cad.gc <- cad.sel$glist
    cad.ug <- ugList(cad.gc)
    cad.tab2 <- xtabs(~ ., data = Network1sim[, c(1, 3, 4)])
    cad.sat2 <- dmod(~ .^., data = cad.tab2,
                     marginal = c("X", "Y", "Z"))
    cad.sel2 <- stepwise(cad.sat2)
    cad.gc2 <- cad.sel2$glist
    cad.ug2 <- ugList(cad.gc2)
    if (is.null(cad.ug2@edgeL$Z[[1]]) == FALSE) {
      Network1.min <- grain(cad.ug2, data = Network1sim[, c(1, 3, 4)])
      Network1.all <- grain(cad.ug, data = Network1sim)
      min <- querygrain(Network1.min, nodes = c("X", "Z"), type = "conditional")[1]
      all <- querygrain(Network1.all, nodes = c("X", "Z"), type = "conditional")[1]
      ALL.2 <- c(ALL.2, all)
      MIN.2 <- c(MIN.2, min)
    }
  }
  prog$tick()$print()
}

# Trim to exactly 10,000 estimates and store the weak-link results.
ALL <- ALL[1:10000]
MIN <- MIN[1:10000]
ALL.2 <- ALL.2[1:10000]
MIN.2 <- MIN.2[1:10000]
ALL.weak <- ALL
MIN.weak <- MIN
ALL.weak2 <- ALL.2
MIN.weak2 <- MIN.2

# Information for network situation 1.2
# Network Situation 1.2 = Mediation sideways, Medium link
#####

# Connection strengths between variables
# and which ones are higher/lower.
x1 <- .5
x0 <- .5
y1x1 <- .7
y0x1 <- .3
y1x0 <- .4
y0x0 <- .6
z1y1c1 <- .64  # HIGHER
z0y1c1 <- .36
z1y0c1 <- .59  # HIGHER
z0y0c1 <- .41
z1y1c0 <- .48  # LOWER
z0y1c0 <- .52
z1y0c0 <- .42  # LOWER
z0y0c0 <- .58
c1x1 <- .54
c0x1 <- .46
c1x0 <- .35
c0x0 <- .65

ALL <- c()
MIN <- c()
ALL.2 <- c()
MIN.2 <- c()

yn <- c("yes", "no")
x <- cptable(~ X, values = c(x1, x0), levels = yn)
c.x <- cptable(~ C + X, values = c(c1x1, c0x1, c1x0, c0x0), levels = yn)
y.x <- cptable(~ Y + X, values = c(y1x1, y0x1, y1x0, y0x0), levels = yn)
z.yc <- cptable(~ Z + Y + C,
                values = c(z1y1c1, z0y1c1, z1y0c1, z0y0c1,
                           z1y1c0, z0y1c0, z1y0c0, z0y0c0),
                levels = yn)
plist <- compileCPT(list(x, c.x, y.x, z.yc))
Network1.medium <- grain(plist)

# Simulation process start.
prog <- dplyr::progress_estimated(550)  # 500 data points
for (j in 1:550) {
  if (length(ALL) > 10000) {
    stop("We reached the 10.000!")
  }
  for (i in 1:20) {
    Network1sim <- simulate(Network1.medium, nsim = 500)
    cad.tab <- xtabs(~ ., data = Network1sim)
    cad.sat.all <- dmod(~ .^., data = cad.tab,
                        marginal = c("X", "C", "Y", "Z"))
    cad.sel <- stepwise(cad.sat.all)
    cad.gc <- cad.sel$glist
    cad.ug <- ugList(cad.gc)
    cad.tab2 <- xtabs(~ ., data = Network1sim[, c(1, 3, 4)])
    cad.sat2 <- dmod(~ .^., data = cad.tab2,
                     marginal = c("X", "Y", "Z"))
    cad.sel2 <- stepwise(cad.sat2)
    cad.gc2 <- cad.sel2$glist
    cad.ug2 <- ugList(cad.gc2)
    if (is.null(cad.ug2@edgeL$Z[[1]]) == FALSE) {
      Network1.min <- grain(cad.ug2, data = Network1sim[, c(1, 3, 4)])
      Network1.all <- grain(cad.ug, data = Network1sim)
      min <- querygrain(Network1.min, nodes = c("X", "Z"), type = "conditional")[1]
      all <- querygrain(Network1.all, nodes = c("X", "Z"), type = "conditional")[1]
      ALL <- c(ALL, all)
      MIN <- c(MIN, min)
    }
  }
  prog$tick()$print()
}

prog <- dplyr::progress_estimated(600)  # 100 data points
for (j in 1:600) {
  if (length(ALL.2) > 10000) {
    stop("We reached the 10.000!")
  }
  for (i in 1:20) {
    Network1sim <- simulate(Network1.medium, nsim = 100)
    cad.tab <- xtabs(~ ., data = Network1sim)
    cad.sat.all <- dmod(~ .^., data = cad.tab,
                        marginal = c("X", "C", "Y", "Z"))
    cad.sel <- stepwise(cad.sat.all)
    cad.gc <- cad.sel$glist
    cad.ug <- ugList(cad.gc)
    cad.tab2 <- xtabs(~ ., data = Network1sim[, c(1, 3, 4)])
    cad.sat2 <- dmod(~ .^., data = cad.tab2,
                     marginal = c("X", "Y", "Z"))
    cad.sel2 <- stepwise(cad.sat2)
    cad.gc2 <- cad.sel2$glist
    cad.ug2 <- ugList(cad.gc2)
    if (is.null(cad.ug2@edgeL$Z[[1]]) == FALSE) {
      Network1.min <- grain(cad.ug2, data = Network1sim[, c(1, 3, 4)])
      Network1.all <- grain(cad.ug, data = Network1sim)
      min <- querygrain(Network1.min, nodes = c("X", "Z"), type = "conditional")[1]
      all <- querygrain(Network1.all, nodes = c("X", "Z"), type = "conditional")[1]
      ALL.2 <- c(ALL.2, all)
      MIN.2 <- c(MIN.2, min)
    }
  }
  prog$tick()$print()
}

ALL <- ALL[1:10000]
MIN <- MIN[1:10000]
ALL.2 <- ALL.2[1:10000]
MIN.2 <- MIN.2[1:10000]
ALL.medium <- ALL
MIN.medium <- MIN
ALL.medium2 <- ALL.2
MIN.medium2 <- MIN.2

# Information for network situation 1.3
# Network Situation 1.3 = Mediation sideways, Strong link
#####

# Model information.
x1 <- .5
x0 <- .5
y1x1 <- .7
y0x1 <- .3
y1x0 <- .4
y0x0 <- .6
z1y1c1 <- .74  # HIGHER
z0y1c1 <- .26
z1y0c1 <- .69  # HIGHER
z0y0c1 <- .31
z1y1c0 <- .36  # LOWER
z0y1c0 <- .64
z1y0c0 <- .29  # LOWER
z0y0c0 <- .71
c1x1 <- .54
c0x1 <- .46
c1x0 <- .35
c0x0 <- .65

ALL <- c()
MIN <- c()
ALL.2 <- c()
MIN.2 <- c()

yn <- c("yes", "no")
x <- cptable(~ X, values = c(x1, x0), levels = yn)
c.x <- cptable(~ C + X, values = c(c1x1, c0x1, c1x0, c0x0), levels = yn)
y.x <- cptable(~ Y + X, values = c(y1x1, y0x1, y1x0, y0x0), levels = yn)
z.yc <- cptable(~ Z + Y + C,
                values = c(z1y1c1, z0y1c1, z1y0c1, z0y0c1,
                           z1y1c0, z0y1c0, z1y0c0, z0y0c0),
                levels = yn)
plist <- compileCPT(list(x, c.x, y.x, z.yc))
Network1.strong <- grain(plist)

prog <- dplyr::progress_estimated(550)  # 500 data points
for (j in 1:550) {
  if (length(ALL) > 10000) {
    stop("We reached the 10.000!")
  }
  for (i in 1:20) {
    Network1sim <- simulate(Network1.strong, nsim = 500)
    cad.tab <- xtabs(~ ., data = Network1sim)
    cad.sat.all <- dmod(~ .^., data = cad.tab,
                        marginal = c("X", "C", "Y", "Z"))
    cad.sel <- stepwise(cad.sat.all)
    cad.gc <- cad.sel$glist
    cad.ug <- ugList(cad.gc)
    cad.tab2 <- xtabs(~ ., data = Network1sim[, c(1, 3, 4)])
    cad.sat2 <- dmod(~ .^., data = cad.tab2,
                     marginal = c("X", "Y", "Z"))
    cad.sel2 <- stepwise(cad.sat2)
    cad.gc2 <- cad.sel2$glist
    cad.ug2 <- ugList(cad.gc2)
    if (is.null(cad.ug2@edgeL$Z[[1]]) == FALSE) {
      Network1.min <- grain(cad.ug2, data = Network1sim[, c(1, 3, 4)])
      Network1.all <- grain(cad.ug, data = Network1sim)
      min <- querygrain(Network1.min, nodes = c("X", "Z"), type = "conditional")[1]
      all <- querygrain(Network1.all, nodes = c("X", "Z"), type = "conditional")[1]
      ALL <- c(ALL, all)
      MIN <- c(MIN, min)
    }
  }
  prog$tick()$print()
}

prog <- dplyr::progress_estimated(600)  # 100 data points
for (j in 1:600) {
  if (length(ALL.2) > 10000) {
    stop("We reached the 10.000!")
  }
  for (i in 1:20) {
    Network1sim <- simulate(Network1.strong, nsim = 100)
    cad.tab <- xtabs(~ ., data = Network1sim)
    cad.sat.all <- dmod(~ .^., data = cad.tab,
                        marginal = c("X", "C", "Y", "Z"))
    cad.sel <- stepwise(cad.sat.all)
    cad.gc <- cad.sel$glist
    cad.ug <- ugList(cad.gc)
    cad.tab2 <- xtabs(~ ., data = Network1sim[, c(1, 3, 4)])
    cad.sat2 <- dmod(~ .^., data = cad.tab2,
                     marginal = c("X", "Y", "Z"))
    cad.sel2 <- stepwise(cad.sat2)
    cad.gc2 <- cad.sel2$glist
    cad.ug2 <- ugList(cad.gc2)
    if (is.null(cad.ug2@edgeL$Z[[1]]) == FALSE) {
      Network1.min <- grain(cad.ug2, data = Network1sim[, c(1, 3, 4)])
      Network1.all <- grain(cad.ug, data = Network1sim)
      min <- querygrain(Network1.min, nodes = c("X", "Z"), type = "conditional")[1]
      all <- querygrain(Network1.all, nodes = c("X", "Z"), type = "conditional")[1]
      ALL.2 <- c(ALL.2, all)
      MIN.2 <- c(MIN.2, min)
    }
  }
  prog$tick()$print()
}

ALL <- ALL[1:10000]
MIN <- MIN[1:10000]
ALL.2 <- ALL.2[1:10000]
MIN.2 <- MIN.2[1:10000]
ALL.strong <- ALL
MIN.strong <- MIN
ALL.strong2 <- ALL.2
MIN.strong2 <- MIN.2

# Difference scores: estimate from the full network minus the estimate
# from the reduced network, per condition.
Difference.weak500 <- ALL.weak - MIN.weak
Difference.weak100 <- ALL.weak2 - MIN.weak2
abs.Difference.weak500 <- abs(Difference.weak500)
abs.Difference.weak100 <- abs(Difference.weak100)
Difference.medium500 <- ALL.medium - MIN.medium
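The appendix pages reproduced here break off after these difference scores. As a minimal sketch, assuming the remaining conditions were processed in the same way, the mean differences reported in the results section could then be obtained as follows:

# Sketch, not part of the original script: mean absolute distortion
# per condition, analogous to the mean differences in the results.
abs.Difference.medium500 <- abs(Difference.medium500)
mean(abs.Difference.weak500)    # weak link, 500 observations
mean(abs.Difference.weak100)    # weak link, 100 observations
mean(abs.Difference.medium500)  # medium link, 500 observations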
