
Master Thesis

Artificial Intelligence

Estimating causal effects from observational data (2014-2015)

Author: Jerome Cremers
Supervisor: Joris Mooij

Abstract

In many scientific fields, researchers are interested in estimating the causal effect of one variable on another. Such an effect can be exposed by randomized controlled experiments, in which a certain variable is 'intervened on' by setting it to a certain value manually, thereby overruling the influences of its natural causes. However, in many settings these experiments are expensive, time consuming, or simply impossible. For example, 'forcing' someone to smoke in order to reveal and examine the causal effect on lung cancer would be unethical. In this project, we combine several rules from Claassen et al. (2011) into an algorithm that tries to find structures in the data for which there is theoretical proof (Entner et al., 2013) that the causal effect is identifiable from observations only. The focus is on the simplest scenario: no selection bias and only one variable in the conditioning set. We try to optimize this search and validate whether the estimated causal effect indeed holds empirically, using real and simulated data. The theoretical part of this report covers basic notions of causal inference (DAGs, (conditional) independence, d-separation, the back-door criterion) and more advanced rules that link correlation with causation (Claassen et al., 2011; Entner et al., 2013). The practical part of the project covers the implementation of the algorithms (the main algorithm and analysis scripts) and extensive testing of the strengths and weaknesses of the intended methods. The algorithm works well on data where the signal-to-noise ratio is very high, but its power declines if observations are scarce and/or the variance increases. This explains the poor performance when evaluating the algorithm on a real data set (Hughes et al., 2000).

Contents

Abstract
1 Introduction
2 Theory
  2.1 Directed Acyclic Graphs
  2.2 Independence, d-separation, faithfulness
  2.3 Finding indirect causal relations
  2.4 Estimating causal effects
3 Methods
  3.1 Related work
    3.1.1 Hughes data set
  3.2 Algorithm outline
  3.3 Testing (in)dependencies and conditional (in)dependencies between variables
  3.4 Validation
  3.5 Data simulation
  3.6 Bayes-Ball algorithm
  3.7 Position w.r.t. related work
4 Experiments and results
  4.1 High signal-to-noise data sets
    4.1.1 Single DAG
    4.1.2 Multiple DAGs
  4.2 More realistic data sets
5 Discussion and future work
6 Conclusion
Acknowledgements

1 Introduction

'I'm not a discriminator, let that be clear. I just ascertain the facts. There used to be nothing wrong with the environment here. Now we have all these immigrants, and suddenly there's a hole in the ozone layer. Now you tell me; how could that be?' [Theo Maassen, Dutch comedian]. Heedlessly interchanging 'correlation' and 'causation' can make a theater audience laugh instantly. Although the number of admitted immigrants could be highly correlated with the depletion of ozone in the atmosphere, no one would ever draw a causal relationship between these two variables. However, what happens if the absence of a causal connection is less obvious to common sense? For instance, suppose a politician replaced the ozone layer in this joke by the rising unemployment rate. Then people might forget that correlation and causation are two different things, and mindlessly draw causal conclusions based on correlation only.

Simpson's paradox (named after statistician Edward H. Simpson, who first described the phenomenon in 1951) is a well-known illustration of how easily one runs into difficulties when reasoning about causality by common sense. Let us walk through an example, taken from [1]; see table 1.1. Suppose the Department of History and the Department of Geology of a university are hiring. For History, five men and eight women apply; one and two of them are hired, respectively. For Geology, eight men and five women apply; six and four of them are hired, respectively.

                        Men          Women
Department of History   20% (1/5)    25% (2/8)
Department of Geology   75% (6/8)    80% (4/5)
Overall                 54% (7/13)   46% (6/13)

Table 1.1: Job application example.

The paradox is that in both departments a higher percentage of women succeed, while overall a higher percentage of men succeed. How can this possibly be true? The numbers might seem small, but this does not cause the paradox: if all numbers are multiplied by 1000, the phenomenon remains. Intuitively, there must be some kind of hidden bias, though it is hard to pinpoint directly. There is no difference in group sizes; in total, 13 men and 13 women apply. Also, both departments have an equal number of applicants. The bias is caused by the fact that a higher percentage of women apply to the department in which it is harder to succeed. In Geology 77% (10/13) of the applicants succeed, while in History this is only 23% (3/13). So to get rid of the paradox, one should take into account a variable representing the average success rate per department and correct for it. The table above is not just a rare combination of test results; there are many important real-world examples that exhibit Simpson's paradox and stress the mistakes one can make when inattentively conflating correlation and causation.
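The numbers are easy to verify; a minimal sketch in R (the vector names are ours, for illustration only):

# Reproducing table 1.1: within each department women have the higher
# success rate, yet the aggregate favours men.
hired   <- c(hist_men = 1, hist_women = 2, geo_men = 6, geo_women = 4)
applied <- c(hist_men = 5, hist_women = 8, geo_men = 8, geo_women = 5)
round(hired / applied, 2)                # per-department success rates
men   <- c("hist_men", "geo_men")
women <- c("hist_women", "geo_women")
sum(hired[men])   / sum(applied[men])    # 7/13 = 0.54, men overall
sum(hired[women]) / sum(applied[women])  # 6/13 = 0.46, women overall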

As another example, consider the relationship between cigarette smoking and lung cancer. In 1964 the United States Surgeon General issued a report [2] claiming that cigarette smoking causes lung cancer. The evidence for this causation was based mainly on correlations between these two variables. As a result, the report was questioned by tobacco companies and statisticians. They claimed that there could be a hidden confounding factor (like a gene) which triggers a cigarette smoking addiction and heightens the risk of lung cancer. If the tobacco companies were right, the decision to smoke or not would have no influence on whether or not a person gets lung cancer, even though these variables are correlated.

Figure 1.1: Surgeon General's point of view (a) vs. tobacco companies' point of view (b).

Is there a way to determine whether cigarette smoking heightens the risk of getting lung cancer? Yes, there is. A randomized controlled experiment would uncover the truth. Suppose an experimenter had the power to intervene on a person, literally forcing a person to smoke or not to smoke. The experimenter takes a large group of participants and randomly assigns them to two (equally sized) groups. One group is forced to smoke, the other group is forced not to smoke. By doing these interventions, the experimenter breaks the influence of any hidden factor that causes both smoking and lung cancer. There could still be a hidden factor (a gene) that causes lung cancer, but its influence would be marginalized out by the randomized assignment of the participants. Now, if the rate of people that get lung cancer in the smoking group is significantly higher than in the non-smoking group, you can conclude that there must be a causal relationship between smoking and lung cancer. This relationship need not be direct. For instance, smoking might cause an accumulation of toxic substances in lung cells. These substances, in turn, might have a more direct causal effect on lung cancer by turning normal cells into cancer cells.

Notice how the experimental design differs from one in which the experimenter simply gathers already smoking and already non-smoking people and observes the lung cancer rates in these two groups. The interventions are needed to overrule the influence of potential hidden factors that affect both variables simultaneously. They are the key to revealing the true relationship between cigarette smoking and lung cancer.

In many fields of science, researchers are interested in estimating the causal effect of one variable on another. In chemical biology, for instance, scientists are interested in the effect of gene knockouts on the expression of other genes. In economics, one may be interested in the influence of a person’s education level on his or her happiness in later life. The outcome could be crucial for predicting the effects of certain actions and policies.

In many cases, randomized controlled experiments are not feasible. Experimenters usually do not have the power to intervene on participants. Forcing people to smoke would be unethical, dictating the level of someone's education would be impossible, and manipulating the expression of a gene in a lab environment is expensive. Moreover, randomized controlled experiments sometimes involve strong assumptions. In the lung cancer example, the experimental design assumes that whether or not a participant has smoked cigarettes in the past (before the experiment) is irrelevant, which is highly doubtful.

In this thesis we try to find ways to expose and estimate causal relationships between variables based on observational data only, without doing a randomized controlled experiment. In chapter 2, all building blocks needed to understand our approaches are covered. In chapter 3, a related paper [3] is reviewed and our methods are explained. Chapter 4 reports the experimental findings, and chapters 5 and 6 contain a discussion and concluding words, respectively.

2 Theory

In this section, all important theoretical ’building blocks’ needed to understand the algorithm explained in the Methods chapter are covered.

2.1 Directed Acyclic Graphs

A graphical model is a common way to visualize a complex system. An intuitive way of representing causal systems is in the form of a directed acyclic graph (DAG). DAGs are a subset of a broad spectrum of graphical models. Figures 1.1a and 1.1b are examples of very simple DAGs. Consider figure 2.1, a hypothetical example of a more complex DAG. A node represents a variable in the system and an arrow between two nodes represents a direct causal effect in the direction of the arrow. Absence of an arrow between two nodes means that there is no direct causal effect between them. The graph is called 'directed' because only directed arrows exist: variable X3 can causally influence variable X4, but not vice versa simultaneously. Since there is an arrow from X1 to X2, we call X1 a parent of X2 and X2 a child of X1. A path is a sequence of distinct consecutive arrows, for example the sequence X3 → X4 ← X6 → X9. A directed path is a path in which the arrows are head-to-tail. Since there is a directed path from node X3 to node X8, there is an indirect causal relation X3 ⇒ X8. Also, X3 is an ancestor of X8 and X8 is a descendant of X3. Furthermore, acyclicity is assumed: there are no directed paths that have the same start and end node. An arrow pointing from X7 to X1 is not allowed, because it would generate the cycle X1 → X2 → X7 → X1. Acyclicity is a strong assumption; loops (or cycles) in a system are not uncommon. In molecular cell biology, for example, a protein synthesized at the end of a pathway can act as an initiator of that same pathway, thereby stimulating its own production.

Figure 2.1: A DAG representing the causal relationships between variables X1-X9.

Besides its parents, a variable can be influenced by many other factors. In the lung cancer example, whether or not a person smokes might also be caused by the smoking behavior of his/her friends. We will never be able to draw all causes of all variables in a DAG. All other causes (besides the parents) of a modelled variable are captured in a single random variable E, which is typically not drawn and unobserved.

DAGs are often incomplete; in many settings it is impossible to identify and include all variables in a system. A DAG is causally sufficient if there is no pair of modelled variables that have an unmodelled confounder (or ’common cause’). Rephrased: all Es are independent.

2.2 Independence, d-separation, faithfulness

A main reason for drawing a DAG is the ability to easily read off independence and conditional independence relationships among variables. Two variables Xa and Xb are independent if the occurrence of one does not affect the probability of the other, so that their joint probability equals the product of the individual probabilities:

P(Xa, Xb) = P(Xa)P(Xb)    (2.1)
⇒ Xa ⊥⊥ Xb    (2.2)

Variable Xa is conditionally independent of Xb given Xc if and only if, given knowledge that Xc occurs, knowledge of whether Xa occurs provides no information on the likelihood of Xb occurring, and knowledge of whether Xb occurs provides no information on the likelihood of Xa occurring. The joint conditional probability is then equal to the product of the individual conditional probabilities:

P(Xa, Xb | Xc) = P(Xa | Xc)P(Xb | Xc)    (2.3)
⇒ Xa ⊥⊥ Xb | Xc    (2.4)

To illustrate (in)dependence relationships, suppose person A and person B each toss separate coins. Let A represent the variable 'person A's toss outcome', and B represent the variable 'person B's toss outcome'. Both A and B have two possible values (heads and tails). It would be uncontroversial to assume that A and B are independent; evidence about B will not change our belief in A (figure 2.2a), and (2.2) holds.

Now suppose both persons toss the same coin, and assume there is a variable C that represents whether or not the coin is biased towards heads. The coin could be biased; we do not know this for sure. In this case A and B are dependent (through variable C). For example, observing that B is heads increases our belief in the coin being biased, and indirectly increases our belief in A being heads (figure 2.2b). The dependence between A and B can be derived analytically [4] by proving that (2.2) fails:

P(A, B) ?= P(A)P(B)
P(A, B) = Σ_C P(A | C)P(B | C)P(C)

In general, this does not factorize into the product P(A)P(B), so

⇒ A ⊥̸⊥ B | ∅.    (2.5)

A and B become independent again when C is observed or given. For instance, once we know that the coin is biased 70% towards heads, observing B does not change our belief about A, and vice versa: A and B are conditionally independent given C (figure 2.2c). This can be derived analytically by determining whether (2.4) holds:

P(A, B | C) ?= P(A | C)P(B | C)
P(A, B | C) = P(A, B, C) / P(C) = P(A | C)P(B | C)P(C) / P(C) = P(A | C)P(B | C)
⇒ A ⊥⊥ B | C.

Figure 2.2: Coin toss example. A and B independent (a), A and B dependent (b), A and B independent given C (c). In this report, conditioned variables are indicated by a blue color.
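This example can also be checked numerically. The following R sketch is our own illustration (not from the thesis): it simulates the biased-coin scenario and shows that A and B are marginally dependent but independent within each value of C.

# Coin example: A and B are marginally dependent but independent given C.
set.seed(42)
n <- 1e5
C <- rbinom(n, 1, 0.5)             # 1 = coin biased towards heads
p <- ifelse(C == 1, 0.7, 0.5)      # heads probability given the bias
A <- rbinom(n, 1, p)               # person A's toss outcome
B <- rbinom(n, 1, p)               # person B's toss outcome
cor(A, B)                          # clearly non-zero: A and B dependent
cor(A[C == 1], B[C == 1])          # ~0: independent given C = 1
cor(A[C == 0], B[C == 0])          # ~0: independent given C = 0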

In causal inference, the terms 'independence' and 'dependence' are usually replaced with d-separation and d-connection, respectively. The difference is that d-separation is a property of the nodes in the DAG, whereas independence is a property of the variables represented by the nodes. This distinction is often ignored. So, in figure 2.2b we can say that A and B are d-connected, and in figure 2.2c we can say that A and B are d-separated given C. A way to define d-separation is by using the concept of blocking paths [5]:

Definition 2.1. A path p is blocked by a set of nodes Z iff:
(i) p contains a chain A → C → B or a fork A ← C → B, such that node C is in Z; or
(ii) p contains a collider A → C ← B, such that neither node C nor any of its descendants is in Z.

Definition 2.2. If G is a directed graph in which Z is a set of observed nodes, then node X and node Y are d-connected given Z if there exists a path between X and Y that is not blocked by Z. X and Y are d-separated given Z if Z blocks every path between X and Y.

To make the definitions more intuitive, let us cover the three cases/structures in which a path is blocked.

1. Chain: A → C → B.
Consider a simplified model in which a high blood sugar level (A) causes hunger (B), but only indirectly, through stomach acidity (C). In general, information about someone's blood sugar level tells you something about whether that person is hungry. C 'passes through' the information; the path A → C → B is open and A and B are d-connected. However, conditioning on the stomach acidity makes information about the blood sugar level redundant. The path is blocked at C; A and B are d-separated given C.

2. Fork: A ← C → B.
Already covered in the coin example above.

3. Collider: A → C ← B, with descendant D of C.
Consider an example in which there are two events that could cause your car to refuse to start (C): having no gas (A) and having a dead battery (B). In the original figure, a dashed arrow between C and its descendant D indicates that neither C nor any of C's descendants is observed. Before you try to start your car for the first time (in other words, before observing C), these events are independent: being told that the battery is charged tells you nothing about whether there is gas; A and B are independent (d-separated). The moment your car refuses to start, C is observed and A and B become conditionally dependent (d-connected) given C. Now, knowing that the battery is charged tells you that the gas tank must be empty. All descendants of C, and C itself, are common effects of A and B. For instance, D could represent whether you are late for work. Observing any of those effects opens the path A → C ← B, making A and B conditionally dependent given D.

To complete the story about d-separation, consider the following statements:

• Regarding figure 2.3a: although the path X1 → X3 → X4 is blocked at X3, the path X1 → X2 → X4 is open. Therefore, X1 and X4 are d-connected given X3.
• Regarding figure 2.3b: the path X1 → X3 ← X4 is open, because a descendant of X3 is observed (X6). The other path, X1 ← X2 → X4, is blocked at X2. Therefore, X1 and X4 are d-connected given {X2, X6}.

Figure 2.3

Mindlessly interchanging d-separation with independence is possible if faithfulness is assumed. This means that all conditional (in)dependence relationships can be read off from the DAG using the d-separation and d-connection rules. Consider an example (taken from [6]), represented by figure 2.4: a causal graph that describes the relationships among exercise, smoking, and health, where the + and − signs indicate stimulating and inhibiting effects respectively. From the d-separation rule, we obtain no independencies. However, this DAG could produce probability distributions in which no correlation between 'smoking' and 'health' can be found, because the direct negative effect is balanced out by the indirect positive effects. In such a case, we say that the distribution is unfaithful to the DAG that generated it. Faithfulness implies that all (in)dependencies occurring in the distribution arise not from incredible coincidence but from structure.

Figure 2.4: Example in which the DAG could produce an unfaithful distribution [6].

Another important concept is that of minimal conditional (in)dependence. Here the word 'minimal' captures the notion that every variable in the minimal set, indicated by the brackets, plays a role in making two other variables (in)dependent. If any variable were eliminated from the minimal set, the (in)dependence relationship would be lost:

X ⊥⊥ Y | W ∪ [Z] ≡ X ⊥⊥ Y | W ∪ Z ∧ ∀Z′ ⊊ Z : X ⊥̸⊥ Y | W ∪ Z′    (2.6)
X ⊥̸⊥ Y | W ∪ [Z] ≡ X ⊥̸⊥ Y | W ∪ Z ∧ ∀Z′ ⊊ Z : X ⊥⊥ Y | W ∪ Z′    (2.7)

To visualize the meaning of these rules, consider figure 2.5, in which the set {A, B, C, D} is observed. All paths from X to Y are blocked, so X and Y are independent:

X ⊥⊥ Y | {A, B, C, D}.

Now, which variables in the observed set belong to the minimal set? It is easy to see that if A and/or D had not been observed, X and Y would still be independent. However, eliminating B and/or C from the observed set would make X and Y dependent. Therefore, B and C belong to the minimal set:

X ⊥⊥ Y | {A, D} ∪ [{B, C}].

Figure 2.5: Minimal conditional independence example.

2.3 Finding indirect causal relations

All concepts covered so far are based on correlations and are therefore applicable to observational data. This is desirable, since observational data is often the only data we have. Are there conditions under which we can transfuse correlational information into causal conclusions? Let X, Y and Z be observed nodes, W be a set of observed nodes, and S be a set of (unobserved) selection nodes¹ in a causal DAG G. A ⊢ B means that B is provable from A, A ⇒ B means that A is an ancestor of B, and A ⇏ B means that A is not an ancestor of B. Then, Claassen and Heskes [7] propose the following rules:

• X ⊥⊥ Y | W ∪ [Z] ⊢ Z ⇒ (X ∪ Y ∪ W ∪ S)
• X ⊥̸⊥ Y | W ∪ [Z] ⊢ Z ⇏ (X ∪ Y ∪ W ∪ S)
• X ⊥⊥ Y | [W ∪ Z] ⊢ Z ⇒ (X ∪ Y ∪ S).

By using (X ⇏ Y) ≝ ¬(X ⇒ Y), observed minimal (in)dependencies can be directly translated into logical statements about indirect causal relations:

1. X ⊥⊥ Y | W ∪ [Z] ⊢ (Z ⇒ X) ∨ (Z ⇒ Y) ∨ (Z ⇒ W) ∨ (Z ⇒ S)
2. X ⊥̸⊥ Y | W ∪ [Z] ⊢ (Z ⇏ X) ∧ (Z ⇏ Y) ∧ (Z ⇏ W) ∧ (Z ⇏ S)
3. X ⊥⊥ Y | [W ∪ Z] ⊢ (Z ⇒ X) ∨ (Z ⇒ Y) ∨ (Z ⇒ S).

Following Entner et al. [8], we will not consider selection variables here, which means we will always assume S = ∅. The theoretical background of these rules goes beyond the scope of this project. Here, we take them for granted and are only interested in their practical value, which will become clearer in chapter 3.

¹ Selection nodes are nodes that are always implicitly conditioned upon. As an example, consider a medical study that only takes patients who are more than 18 years old. Then the variable 'age' is a selection variable, as all data is conditioned upon 'age > 18'.

2.4 Estimating causal effects

In mathematics, an intervention is represented by a do-operator. We are interested in P(Y | do(X = x*)): the probability of outcome variable Y, given that the treatment variable X is set to a certain value x*. This probability is usually different from the conditional probability P(Y | X = x*), which can simply be determined by observing Y when X = x*. As explained in the introduction, in randomized controlled experiments, the bias caused by variables that influence both the treatment variable (X) and the outcome variable (Y) is marginalized out by the randomized assignment of the participants. In the lung cancer example in the introduction, the hidden gene is such a confounding variable, also called a covariate. When experiments are not available and only observational data is present, we need a theoretical way to properly 'correct for' or 'adjust for' this bias.

If the DAG is known, the back-door criterion [9] tests whether a certain set of variables is admissible for adjustment and whether a causal effect is identifiable:

Theorem 2.3. Let X and Y be observed nodes and let W be a set of observed nodes in DAG G. If a set of variables W′ ⊆ W satisfies the back-door criterion relative to variables X and Y, i.e.

• no node in W′ is a descendant of X; and
• W′ blocks every back-door path from X to Y, where a back-door path is a path that leads to Y and contains an arrow into X: X ← ... Y,

then W′ is admissible for adjustment and the causal effect of X on Y is identifiable and given by

P(Y | do(X)) = Σ_{W′} P(Y | X, W′)P(W′)    (2.8)

or, in the case of continuous variables,

P(Y | do(X)) = ∫ P(Y | X, W′)P(W′) dW′.    (2.9)

If P(X, Y, W′) is multivariate Gaussian, the causal effect of do(X = x*) on Y is given by the partial derivative of the expectation²

∂/∂x E(Y | do(X = x*))    (2.10)

and is equal to the regression coefficient of X in the linear regression of Y on X and W′. The name 'back-door' echoes the second bullet of the theorem, which requires that only paths with arrows pointing into X need to be blocked. These paths can be interpreted as entering X 'through the back door'.
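Under the Gaussian assumption, (2.10) is just a regression coefficient, so the adjustment can be sketched in two lines of R. This is a minimal sketch assuming a data frame d with columns Y, X and an admissible adjustment set {W1, W2}; the names are ours, not from the thesis:

# Back-door adjustment under the multivariate-Gaussian assumption: the
# causal effect of X on Y is the coefficient of X in a linear regression
# of Y on X and an admissible set W' (here {W1, W2}).
fit <- lm(Y ~ X + W1 + W2, data = d)
coef(fit)["X"]   # estimate of (d/dx) E(Y | do(X = x*))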

Let us visualize this theorem with an example. In figure 2.6 the DAG of a system is drawn. If we want to calculate P(Y | do(X)), for which of the observed variables W2, W3, W4, W5 do we need to adjust? The sets W′1 = {W2, W3} and W′2 = {W3, W5} meet the back-door criterion: none of their variables is a descendant of X and all back-door paths are blocked. W′1 and W′2 are admissible for adjustment.

Figure 2.6: A DAG to illustrate the back-door criterion. Blue nodes are observed. Inspired by [8].

In many problem settings, the DAG is not known and we cannot use the back-door criterion. Intuitively, it might seem plausible to adjust for all the observed variables. However, figure 2.7 shows why such a simple strategy can fail. Assume that all the variables in a system are known, but we do not know how the variables influence each other (figure 2.7a). If we want to identify P(Y | do(X)), should we adjust for the other observed variable W2? If the true DAG of the underlying system were as in figure 2.7b, the adjustment set W′ = {W2} meets the back-door criterion and is admissible for adjustment. However, if figure 2.7c represented the true DAG, adjusting for W2 would be incorrect: adding W2 to the adjustment set would open the back-door path X ← W1 → W2 ← W3 → Y. Here, the empty set ∅ is admissible.

² Suppose random variable X can take value x1 with probability p1, value x2 with probability p2, and so on, up to value xk with probability pk. Then the expectation of X is defined as E[X] = Σ_i x_i p_i.

Figure 2.7: An example showing that mindlessly adjusting for all observed variables can fail. Inspired by [8].

How do we know which variables to adjust for in the common case where the full DAG is unknown? Entner, Hoyer and Spirtes [8] propose rules to compose a proper adjustment set, based solely on testable dependencies, independencies and ancestral relations. These rules are applicable in settings where the causal structure of a system is represented by a DAG over a set of random variables V = X ∪ Y ∪ W ∪ U, where X represents the treatment variable, Y represents the outcome variable, W is a set of observed variables, and U is a set of unobserved/latent variables. It is assumed that V forms a causally sufficient set, i.e. any common cause of two or more variables in V is in V. Furthermore, 'partial ordering' is assumed: Y is not an ancestor of X, and neither X nor Y is an ancestor of any variable in W:

• Y ⇏ X
• X ⇏ W
• Y ⇏ W.

Under the problem setting and assumptions described above, Entner et al. (2013) prove the following rules using Pearl's back-door criterion:

4. If there exists a variable W ∈ W and a set W′ ⊆ W \ {W} such that W ⊥⊥ Y | W′ ∪ [X], then W′ is an admissible set for X and Y.
5. If there exists a variable W ∈ W and a set W′ ⊆ W \ {W} such that W ⊥̸⊥ X | W′ and W ⊥⊥ Y | W′, then the causal effect of X on Y is zero.
6. If there exists a set W′ ⊆ W such that X ⊥⊥ Y | W′, then the causal effect of X on Y is zero.

3 Methods

3.1 Related work

A major inspiration for this project comes from previous work by Maathuis et al. [3]. They combine two concepts:

• Given a set of observations, one can determine a set of DAGs that must contain the true, underlying one.
It is impossible to determine the underlying DAG of a system from a set of observations, because there are typically multiple DAGs consistent with the same set of (conditional) (in)dependencies. All such DAGs form an equivalence class. DAGs in an equivalence class share some properties: they have the same so-called skeleton and v-structures [10]. The skeleton of a DAG is the undirected graph that results from ignoring the directionality of all arrows. A v-structure is a triplet of nodes ⟨A, B, C⟩ that has an arrow from A to B and from C to B and no arrow between A and C. An equivalence class can be uniquely described by a completed partially directed acyclic graph (CPDAG). An edge in the CPDAG is directed if all DAGs in the equivalence class have the same directed edge; it is undirected if the direction of this particular edge is not consistent across all DAGs in the equivalence class. The CPDAG of a system can be estimated from observations using the PC-algorithm [11].

• Given a DAG, and assuming
1. the joint distribution of all variables is a multivariate Gaussian,
2. the DAG is causally sufficient (recall section 2.1),
3. faithfulness,
the back-door criterion can be used to identify variable pairs (X, Y) for which the causal effect of X on Y is given by (2.10). By applying (2.10) for all DAGs in the equivalence class, one gets a set of possible values for the causal effect of X on Y.

They call this combination ’intervention-calculus when the DAG is absent (IDA)’, and use it to obtain bounds on total causal effects for all possible X, Y variable pairs.

First, the PC-algorithm [11] is used to construct the CPDAG, based on statistical tests. If the number of variables is small (< 15), it is computationally feasible to compute all DAGs in the equivalence class defined by the CPDAG and apply (2.10) to estimate the causal effects. If the number of variables grows, determining all DAGs in the equivalence class becomes too expensive. Therefore, a method is used that, for every X, validates which local DAG structures (directly connected to X) are valid. This can be achieved by turning all possible combinations of undirected edges connected to X in the CPDAG into arrows into X, and validating that no improper v-structure with two arrowheads into X is formed. This local validation is computationally fast and asymptotically consistent in high-dimensional settings (number of covariates >> number of observations), and it is the main contribution of the paper. For every X, IDA outputs a set of estimates for the causal effect on Y. Since one value of this set stems from the true DAG, the minimum absolute value of this set can be considered a lower bound on the size of the true causal effect. Variables that have a high lower bound are considered most relevant.

3.1.1 Hughes data set

One of the real data sets to which IDA is applied is a compendium of gene expression profiles of Saccharomyces cerevisiae, a species of yeast [12]. We will refer to this data set as the Hughes data set. After initial data cleaning, the interventional part of the data consists of expression levels of 5361 genes after a single-gene knockout has been made on 234 different genes. The observational part of the data contains expression profiles of the same 5361 genes for 63 wild-type cultures. Maathuis et al. want to identify genes that have a causal effect on another gene's expression level and rank them according to the strength of this effect. In other words: which genes, after they have been knocked out, have the most influence on some other gene's expression level?

Out of the 234 × 5360¹ causal effects in the interventional part of the data, they show that the (absolute) largest effects predicted by IDA correspond significantly more often to the effects in the target set than would occur by random guessing.

¹ The causal effect of a knocked-out gene on itself is meaningless. This leads to 234 × (5361 − 1) effects that are of interest.

3.2 Algorithm outline

IDA has its shortcomings. It assumes causal sufficiency (i.e., there are no unmeasured confounders), which is a strong assumption in general. Furthermore, before estimating causal effects between single variables, the expensive PC-algorithm uses all variables to determine the equivalence class of the underlying DAG.

Given all the concepts described in chapter 2, can we design an algorithm that overcomes these shortcomings?

Equation (2.9) identifies P(Y | do(X)) from observable (in)dependencies, which is what we need. The simplest 'route' towards (2.9) is the following. We assume acyclicity and no selection bias (S = ∅). Suppose the set W consists of only a single variable W. Then, W′ ⊆ W \ {W} = ∅. Following the recipe proposed by Entner et al., we need to find structures in the data that meet the requirements:

1. Y ⇏ X
2. X ⇏ W
3. Y ⇏ W
4. W ⊥⊥ Y | [X].

The first three requirements come from partial ordering; the last one is dictated by rule 4 by Entner et al. To test requirements 1-3, rule 2 by Claassen et al. can be used, because it links minimal dependence relationships to negations of causal relationships (the ⇏ sign). Let us rewrite this rule for our setting (W = {W}, S = ∅) using different letters to avoid confusion:

A ⊥̸⊥ B | [C] ⊢ (C ⇏ A) ∧ (C ⇏ B).    (3.1)

In words: if you, for instance, want to make sure that C is not an ancestor of A, you have to find a variable B for which the minimal dependence relationship on the left side of rule (3.1) holds. Rewriting the requirements using this rule gives:

1. X ⊥̸⊥ Q1 | [Y]
2. W ⊥̸⊥ Q2 | [X]
3. W ⊥̸⊥ Q3 | [Y]
4. W ⊥⊥ Y | [X].

Note that variables Q1, Q2 and Q3 could be the same. If a sextet ⟨X, Y, W, Q1, Q2, Q3⟩ passes these requirements, we can say that W′ = ∅ is admissible for adjustment and (2.9) becomes

P(Y | do(X)) = ∫ P(Y | X, W′)P(W′) dW′ = P(Y | X).    (3.2)

The time complexity of searching for combinations of variables that pass these tests is O(n⁴), where n is the number of variables in the data set. Since we are using the relatively slow programming language R (compared to a lower-level language), minimizing the computation time is highly valued. After a closer look at the requirements, in this simple setting, requirements 1 and 3 turn out to be redundant because they can be deduced from the others:

Requirement 4 (W ⊥⊥ Y | [X]) implies X ⇒ W or X ⇒ Y. Since requirement 2 implies X ⇏ W, it must hold that X ⇒ Y. Using the acyclicity assumption, we can then say that Y ⇏ X. Also, X ⇒ Y implies Y ⇏ W, because if Y were an ancestor of W, then X would also be an ancestor of W, and X ⇏ W would not hold. This shrinkage of the requirements speeds up the algorithm by roughly a factor of three.

Summarizing, we are going to scan the complete set of observed variables for quadruples ⟨X, Y, W, Q⟩ that meet a minimal dependence and a minimal independence:

1. W ⊥̸⊥ Q | [X], which tests X ⇏ W. This minimal dependence can be split into an independence and a conditional dependence:
   (a) W ⊥⊥ Q
   (b) W ⊥̸⊥ Q | X.

2. W ⊥⊥ Y | [X], rule 4 by Entner et al. This minimal independence can be split into a dependence and a conditional independence:
   (a) W ⊥̸⊥ Y
   (b) W ⊥⊥ Y | X.

In pseudocode, this looks like algorithm 1.

Algorithm 1
 1: Input: Observations of N variables in a system.
 2: List_Of_Positives ← ∅
 3: List_Of_Negatives ← ∅
 4: List_Of_Unknowns ← ∅
 5: for node X ∈ 1 : N do
 6:   for node Y ∈ 1 : N \ {X} do
 7:     for node W ∈ 1 : N \ {X, Y} do
 8:       for node Q ∈ 1 : N \ {X, Y, W} do
 9:         if W ⊥̸⊥ Q | [X] == True AND W ⊥⊥ Y | [X] == True then
10:           append ⟨X, Y, W, Q⟩ to List_Of_Positives
11:         else if W ⊥̸⊥ Q | [X] == False OR W ⊥⊥ Y | [X] == False then
12:           append ⟨X, Y, W, Q⟩ to List_Of_Negatives
13:         else if W ⊥̸⊥ Q | [X] == Unknown OR W ⊥⊥ Y | [X] == Unknown then
14:           append ⟨X, Y, W, Q⟩ to List_Of_Unknowns

In the next section, it will become clear how lines 9, 11 and 13 are implemented and what the 'unknown' classification means. The main question of this report is whether, for quadruples ⟨X, Y, W, Q⟩ that pass the above criteria, P(Y | X) is indeed a good estimator of P(Y | do(X)), as predicted by (2.9).
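A rough R sketch of this loop is given below; min_dep_test and min_indep_test are hypothetical stand-ins for the threshold-based decision procedures of section 3.3, each returning "yes", "no" or "unknown".

# Sketch of Algorithm 1 in R; the two test functions are assumptions,
# not the thesis code.
search_quadruples <- function(data, min_dep_test, min_indep_test) {
  n <- ncol(data)
  positives <- list(); negatives <- list(); unknowns <- list()
  for (X in 1:n) {
    for (Y in setdiff(1:n, X)) {
      for (W in setdiff(1:n, c(X, Y))) {
        for (Q in setdiff(1:n, c(X, Y, W))) {
          r1 <- min_dep_test(data, W, Q, X)    # decides W dep Q | [X]
          r2 <- min_indep_test(data, W, Y, X)  # decides W indep Y | [X]
          quad <- c(X = X, Y = Y, W = W, Q = Q)
          if (r1 == "yes" && r2 == "yes") {
            positives[[length(positives) + 1]] <- quad
          } else if (r1 == "no" || r2 == "no") {
            negatives[[length(negatives) + 1]] <- quad
          } else {
            unknowns[[length(unknowns) + 1]] <- quad
          }
        }
      }
    }
  }
  list(positives = positives, negatives = negatives, unknowns = unknowns)
}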

Can passing quadruples be assigned to certain structures in the underlying DAG that can be recognized and visualized? If the DAG contains only four nodes, the only configuration a passing quadruple can have is a so-called Y structure, drawn in figure 3.1.

Figure 3.1: If there are exactly four variables, the only DAG that obeys W ⊥̸⊥ Q | [X] and W ⊥⊥ Y | [X] is a Y structure [13].

Obviously, as the number of variables in a DAG that contains a passing quadruple ⟨X, Y, W, Q⟩ grows, so does the number of possible configurations of the DAG. If there are five variables, two possible DAGs are drawn in figure 3.2. Determining the embedded structure of passing quadruples within their DAG is a difficult task; we are only interested in finding variables to which we can apply (2.9).

Figure 3.2: If there are five variables, subfigures (a) and (b) show two out of many more DAGs that meet W ⊥̸⊥ Q | [X] and W ⊥⊥ Y | [X].

If a quadruple does not pass the criteria, it does not necessarily mean that P(Y | X) is a bad estimate of P(Y | do(X)). For example, in figure 3.2b we have seen that quadruple ⟨X, Y, W, Q⟩ gives access to (2.9). Quadruple ⟨X, Y, W, Z⟩ does not meet W ⊥̸⊥ Z | [X], so it will end up in the list of negatives, meaning (2.9) cannot be applied. However, both quadruples contain the same X, Y-pair. Because of this ambiguity, 'negatives' might be a confusing label: a quadruple labeled 'negative' means that, based on a certain combination of W and Q, there is no evidence for access to (2.9).

3.3 Testing (in)dependencies and conditional (in)dependencies between variables

The whole algorithm relies on the ability to distinguish between (conditional) dependencies and (conditional) independencies among variables (see lines 9, 11 and 13 in algorithm 1). There are many ways to approach this. A useful quantity is Pearson's correlation coefficient ρX,Y, a measure of the linear correlation (dependence) between two variables X and Y, giving a value between −1 and 1, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation:

ρX,Y = E[(X − µX)(Y − µY)] / (σX σY),

where σX is the standard deviation of X, µX is the mean of X, and E is the expectation.

To 'decide' whether variables X and Y are independent, one could set a threshold on the absolute value of this coefficient directly. Here, instead, a statistical test is performed with H0: ρX,Y = 0 (X and Y are independent) and H1: ρX,Y ≠ 0 (X and Y are dependent). The test statistic is based on ρX,Y, and the corresponding p-value is used as the metric for classification. A low p-value (close to 0) implies that variables X and Y are dependent. A weak dependence between X and Y, for example given by ρX,Y = 0.08, is hard to distinguish from an independence, since an independence generally does not have a correlation coefficient of exactly 0. Therefore we use two p-value thresholds: a p-value smaller than the lower threshold plow is marked as a dependence, a p-value higher than the upper threshold pup is marked as an independence, and if the p-value falls in the domain between plow and pup, no decision is made. A similar method is used for identifying conditional (in)dependencies. If we condition on a variable Z, the test statistic is based on the partial correlation coefficient of order 1:

ρX,Y | Z = (ρX,Y − ρX,Z ρY,Z) / (√(1 − ρ²X,Z) √(1 − ρ²Y,Z)).
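A minimal sketch of this decision procedure in R follows. We assume a Fisher z-transform to turn the (partial) correlation into a p-value; this is one standard choice, not necessarily the exact test used in the thesis.

# Three-way (in)dependence decision with two p-value thresholds.
# z is an optional conditioning variable (order-1 partial correlation).
indep_decision <- function(x, y, z = NULL, p_low = 0.01, p_up = 0.1) {
  if (is.null(z)) {
    r <- cor(x, y)
    k <- 0                                  # no conditioning variable
  } else {
    rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
    r <- (rxy - rxz * ryz) / (sqrt(1 - rxz^2) * sqrt(1 - ryz^2))
    k <- 1                                  # one conditioning variable
  }
  z_stat <- sqrt(length(x) - k - 3) * atanh(r)   # Fisher z-transform
  p <- 2 * pnorm(-abs(z_stat))                   # p-value for H0: rho = 0
  if (p < p_low) "dependent"
  else if (p > p_up) "independent"
  else "unknown"
}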

Finding the optimal values for plow and pup became a major challenge during this project. First we try to find heuristics for the threshold values. We also want to know whether the settings are stable: for instance, if we combine an independence test and a conditional dependence test into a minimal dependence test, do we still find the same optima for plow and pup? Therefore, a brute-force search is done over the complete domain [0, 1] for both pup and plow.

3.4 Validation

There are two main questions regarding the performance of our methods:

• Given the assumptions and the simplest scenario (faithfulness, acyclicity, S = ∅, W = {W}), can we empirically find evidence for the correctness of (2.9) by Entner et al.?

• Does the algorithm properly distinguish between positive and negative quadruples?

To answer the first bullet, we obviously need one (and preferably more than one) response of variable Y to a randomized controlled experiment in which an intervention has been made on variable X. Then, we can compare P(Y | X) with P(Y | do(X)) to validate our estimation. Figure 3.3 visualizes this comparison. A linear model of the expectation of P(Y | X) (the green line) is fitted through the observations of X and Y. This model, after possible extrapolation, acts as a prediction for P(Y | do(X = x*)). A measure for the quality of the model is given by what we call the error:

|E(Y | X = x*) − E(Y | do(X = x*))|,    (3.3)

indicated by the blue arrow. If there is more than one experiment in which X was intervened on, the errors are averaged.

Figure 3.3: A visualization of the interpretation of the error, indicated by the blue arrow.
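The error can be sketched in a few lines of R; the column names X and Y and the single interventional outcome y_do are our own assumptions:

# Error (3.3): fit E(Y | X) on observational data and compare the
# (possibly extrapolated) prediction with an interventional outcome.
prediction_error <- function(obs, x_star, y_do) {
  fit   <- lm(Y ~ X, data = obs)           # linear model of E(Y | X)
  y_hat <- predict(fit, newdata = data.frame(X = x_star))
  abs(y_hat - y_do)                        # |E(Y|X=x*) - E(Y|do(X=x*))|
}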

In the interventional part of the Hughes data set, 234 of the 5361 different genes have been intervened on once. We can use (a subset of) this part for validation.

Obviously, we expect the error of X, Y-pairs belonging to positive quadruples to be significantly lower than that of pairs belonging to negatives. If this is not the case, it does not indisputably mean that (2.9) is somehow invalid given our assumptions: the data might be too noisy to correctly distinguish between dependence and independence relationships among variables based on correlation tests.

To answer the second bullet, we need to check intermediate steps. If the algorithm marks an independence, are those variables indeed d-separated? If the algorithm finds a dependence, are those variables indeed d-connected? To validate these classifications, the true structure of the underlying DAG must be known. The Hughes data set, like almost any real data set, is therefore insufficient. This is why we simulate our own DAGs, observations and interventions.

3.5 Data simulation

To illustrate how the data simulation works, suppose the data set we want to simulate contains only three variables. First, a DAG representing the causal structure of the variables is built. For each node, the number of children of that node is chosen randomly and the children are selected randomly out of all nodes with a higher index². The strengths are sampled uniformly from the domain [−√3, √3] and rescaled so that all variables have equal variance. If rescaling is omitted, the variance of a variable grows as its number of ancestors increases. Although Pearson's correlation coefficient is corrected for variances, and consequently the algorithm's classifications are invariant to variance differences among variables, assuming equal variance simplifies comparing errors of different X, Y-pairs.

² The index of a node is a number k ∈ 1 : N assigned to the node, where N is the number of variables.

The DAG is stored in a matrix G, where entry Gi,j represents the strength of the causal effect of variable Xi on Xj. A possible simulation could be the one drawn below.

Figure 3.4: A matrix G with its corresponding DAG.

Remember from the theory section that each variable has a single unobserved parent E that represents all other causes of that variable. X1, X2 and X3 can be expressed as follows:

X1 = E1
X2 = G1,2 X1 + E2
X3 = G1,3 X1 + G2,3 X2 + E3

For each E, observations are sampled from a normal distribution with mean 0 and configurable standard deviation. The above set of equations is solved for all nodes. These solutions form the so-called 'observational part' of the data set.

Then, interventions on randomly assigned variables are executed. These interventions represent fictitious randomized controlled experiments. An intervened variable Xi is 'set' to a value ξi, where ξi is sampled from a normal distribution with adjustable mean and standard deviation.³ An intervention X2 = ξ2 overrules its parents' influence: the second column of G becomes 0. The set of equations to be solved becomes:

X1 = E1
X2 = ξ2
X3 = G1,3 X1 + G2,3 X2 + E3

These solutions form the 'interventional part' of the data set, which we can use for validating our algorithm by calculating the error.

³ The content of the data mimics an existing biological data set [12] (also used by Maathuis et al. [3]), in which the variables are gene expression profiles of yeast deletion mutants. Interventions represent 'gene knockouts', where a gene is made inoperative, typically leading to a very low expression level.
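A minimal sketch of this simulation for the three-variable case is given below; rescaling to equal variance and measurement noise are omitted for brevity, and the constants mirror the settings used later in chapter 4:

# Simulate the three-variable system X1 -> X2, X1 -> X3, X2 -> X3.
set.seed(1)
e_sd <- 0.01                                # standard deviation of the E terms
G <- matrix(0, 3, 3)
G[1, 2] <- runif(1, -sqrt(3), sqrt(3))      # strength of X1 -> X2
G[1, 3] <- runif(1, -sqrt(3), sqrt(3))      # strength of X1 -> X3
G[2, 3] <- runif(1, -sqrt(3), sqrt(3))      # strength of X2 -> X3

simulate <- function(n, x2_intervention = NULL) {
  E  <- matrix(rnorm(3 * n, mean = 0, sd = e_sd), n, 3)
  X1 <- E[, 1]
  X2 <- if (is.null(x2_intervention)) {
    G[1, 2] * X1 + E[, 2]                   # natural mechanism
  } else {
    rnorm(n, x2_intervention, 0.44)         # do(X2 = xi2), parents overruled
  }
  X3 <- G[1, 3] * X1 + G[2, 3] * X2 + E[, 3]
  data.frame(X1, X2, X3)
}

obs <- simulate(3000)                          # observational part
int <- simulate(200, x2_intervention = -0.75)  # interventional part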

Besides the ability to validate the classifications (dependencies, independencies, conditional (in)dependencies, minimal (in)dependencies, et cetera) of the algorithm, simulated data can easily be modified as desired. Parameters can be adjusted individually to investigate their influence on the performance of the algorithm. What happens if the number of variables in the DAG grows? If the observations are noisier, will the algorithm still be able to classify correctly?

A disadvantage of simulated data is that it is hard to perfectly mimic a data set from the real world. We might make doubtful or invalid assumptions about the real data, which would be revealed by a weaker performance of the algorithm when applied to real data.

3.6 Bayes-Ball algorithm

Since we have simulated data for which we know the true structure of the underlying DAG, we can validate the correctness of the intermediate classifications made by the algorithm. To identify d-separated and d-connected variables, we do not have to draw the complete DAG. The 'Bayes-Ball' algorithm, introduced by Ross D. Shachter [14], lists all variables that are d-connected to and d-separated from a variable Xi, given a DAG (in our case G) and any possible conditioning variables. We consider the classifications made by Bayes-Ball as 'ground truth', and compare them with the outcomes of the correlation tests.
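For illustration, a d-connection test in this spirit can be written compactly. The sketch below follows the 'reachable' formulation of Bayes-Ball (tracking the direction from which each node is visited) and is our own simplification, not the thesis implementation:

# Nodes d-connected to src given Z, for a DAG with adjacency matrix A
# (A[i, j] == 1 iff i -> j).
d_connected <- function(A, src, Z = integer(0)) {
  n <- nrow(A)
  anc_of_Z <- Z                     # Z plus all ancestors of Z
  repeat {
    hits <- which(apply(A[, anc_of_Z, drop = FALSE] == 1, 1, any))
    hits <- setdiff(hits, anc_of_Z)
    if (length(hits) == 0) break
    anc_of_Z <- c(anc_of_Z, hits)
  }
  visited   <- matrix(FALSE, n, 2)  # col 1: reached from child; col 2: from parent
  reachable <- logical(n)
  queue <- list(c(src, 1))          # the ball starts at src, "from a child"
  while (length(queue) > 0) {
    y <- queue[[1]][1]; d <- queue[[1]][2]; queue <- queue[-1]
    if (visited[y, d]) next
    visited[y, d] <- TRUE
    if (!(y %in% Z)) reachable[y] <- TRUE
    parents  <- which(A[, y] == 1)
    children <- which(A[y, ] == 1)
    if (d == 1 && !(y %in% Z)) {    # chain/fork: pass through if unobserved
      for (p in parents)   queue <- c(queue, list(c(p, 1)))
      for (ch in children) queue <- c(queue, list(c(ch, 2)))
    } else if (d == 2) {
      if (!(y %in% Z))              # chain: continue down if unobserved
        for (ch in children) queue <- c(queue, list(c(ch, 2)))
      if (y %in% anc_of_Z)          # collider: bounce back up if y or a
        for (p in parents)   queue <- c(queue, list(c(p, 1)))  # descendant is in Z
    }
  }
  setdiff(which(reachable), src)    # nodes d-connected to src given Z
}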

3.7 Position w.r.t. related work

The problem setup of this project differs from Maathuis et al. We do not aim to estimate all possible effects and determine the most substantial ones; we only estimate the causal effects of quadruples that pass the requirements. We do not try to figure out the structure of the complete DAG, nor its equivalence class. Our method looks simpler: 'just' a loop over all possible quadruples and exposure of each quadruple to several statistical tests to extract interesting ones. IDA uses a cut-off at the minimum absolute value of all estimated effects to retrieve important variables. Our algorithm directly outputs important variables (those whose causal effect is not zero). Most importantly, our method does not assume causal sufficiency, as IDA does. This means it can handle latent variables, in particular confounders.

The difference in time complexity between the two methods is hard to determine. In our case, the search visits all possible combinations of quadruples, which is O(n⁴). In IDA, the search is lower bounded by n² (because it estimates the effect for all possible variable combinations), but upper bounded by the addition of the number of possible local structures (dictated by the CPDAG). This number varies per covariate and depends heavily on the density of the DAG.

4 Experiments and results

This section summarizes the most important steps in the development of the algorithm, which has been an extensive trial-and-error process of identifying and manipulating the many parameters and ideas that influence the simulation of the data and the performance of the algorithm.

4.1 High signal-to-noise data sets

4.1.1 Single DAG

Instead of evaluating our methods on the Hughes data set directly, let us focus on simulated data. A simulation using the following settings is generated:

• number of nodes N: 30.
• number of observations per node KObs: 3000.
• number of total interventions KInt: 200. Intervened nodes are chosen randomly, so on average a node is intervened on KInt/N times.
• Enoise: 0.01. This is the standard deviation of the Gaussian (mean 0) from which the E's are sampled.
• Mnoise: 0.02. This is the standard deviation of the Gaussian (mean 0) from which the measurement noise is sampled. Measurement noise is added to all observations.
• if an intervention is made on node X, the value to which X is set is sampled from N(−0.75, 0.44), analogous to the Hughes data set.

Obviously, this is quite an unrealistic data set. Although interventions are usually more expensive than observations (in terms of money or difficulty, as explained in the introduction), making 3000 observations will be impossible in many systems. Also, the noise is extremely low. If the algorithm does not work on this data set, we have to review our methodology.

As a start, the classifying performance of the correlation tests is ignored, and the Bayes Ball algorithm is used to detect positive quadruples. In this way we can validate the correctness of the 'simplest route' towards (2.9), based on the true causal structure. Out of the 30 · 29 · 28 · 27 = 657720 possible quadruples, only 276 are marked as positive. From the negatives, 3000 quadruples are sampled randomly. For every quadruple ⟨X, Y, W, Q⟩, the error (|E(Y | X = x*) − E(Y | do(X = x*))|) is calculated. The errors of the negatives are averaged: µBBneg = 0.181. Likewise, the errors of all positives are calculated and averaged: µBBpos = 0.0194. The difference is highly significant (two-sample t-test, H0: µBBpos = µBBneg, H1: µBBpos < µBBneg, p-value < 2.2 · 10⁻¹⁶). This empirically proves the correctness of the methodology; we can say that for positive quadruples, the average error is about ten times smaller than for negative quadruples.
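A comparison of this kind is a one-liner in R; errors_pos and errors_neg stand for hypothetical vectors holding the per-quadruple errors:

# One-sided two-sample t-test: are the positive-quadruple errors smaller?
t.test(errors_pos, errors_neg, alternative = "less")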

The main challenge is to correctly classify (conditional) dependencies and independencies among variables using the correlation test described in section 3.3. Classifications rely on the settings for plow and pup. To avoid an expensive search for these parameters, we use the Bayes Ball algorithm to derive some heuristics. In figure 4.1, for all possible combinations of variable pairs ⟨A, B⟩, the Bayes Ball algorithm is used to determine whether A and B are d-separated or d-connected. For each pair, the p-value of the correlation test (see section 3.3) is calculated, and histograms of all p-values are drawn.

Figure 4.1: A-B: Raw view of the distribution of p-values (H0: ρA,B = 0, H1: ρA,B ≠ 0) among all possible dependent (A) and independent (B) pairs of variables. C-D: Raw view of the distribution of conditionally dependent (C) and conditionally independent (D) p-values (H0: ρA,B | C = 0, H1: ρA,B | C ≠ 0) among all possible triples of variables. The Bayes Ball algorithm is used to make the distinctions.

Strictly, figure 4.1b is superfluous, because the p-values of independent variables are uniformly distributed between 0 and 1 at all times; the figure is included to visualize the classification more clearly. The clear peak in the distribution of p-values of dependencies and conditional dependencies is the signal that should enable a strong classification based on correlation tests. The recall is defined as (true positives detected by the algorithm / all positives detected by Bayes Ball). The precision is defined as (true positives / (true positives + false positives)). As long as the position of the peak in the distribution of dependencies is smaller than plow, the influences of the threshold settings are straightforward: a higher setting for plow will slightly increase the recall of dependencies, drastically decrease the precision of dependencies, drastically decrease the recall of independencies and slightly increase the precision of independencies. The same holds for the conditioned case. The algorithm is run with plow = 0.01 and pup = 0.1; the results are given in table 4.1.

Relationship                 Recall   Precision
Dependence                   0.85     0.99
Independence                 0.97     0.92
Conditional dependence       0.71     0.99
Conditional independence     0.91     0.75
Minimal dependence           0.27     0.43
Minimal independence         0.46     0.63
Quadruples of interest       0.25     0.40

Table 4.1: The classification performance for plow = 0.01 and pup = 0.1. 'Quadruples of interest' stands for quadruples for which W ⊥̸⊥ Q | [X] and W ⊥⊥ Y | [X] hold.

The classification performance for (in)dependencies and conditional (in)dependencies is satisfying, as expected given the clear peaks in figures 4.1a and 4.1c. However, the performance drops when two tests are combined into a minimal (in)dependence test, and again when a minimal independence test is combined with a minimal dependence test to find the quadruples of our interest. This ultimately leads to a recall of 25%, with a precision of 40%.

How much of a problem are false classifications by the algorithm? Although the Bayes Ball algorithm is treated as the 'holy truth', there are cases in which we might want to question it. Consider figure 4.2. Technically, X4 and X7 are d-connected, since common ancestor X1 creates an open path between them. The Bayes Ball algorithm will therefore classify X4 and X7 as 'dependent'. However, due to the very small weights, the influence of X1 on X4 and X7 is minuscule, perhaps even negligible. There will probably be no fingerprint of X1 in the test statistic of the independence test between X4 and X7, and the p-value will be high enough to mark an independence. Although this classification is technically wrong, we cannot 'blame' the algorithm for not finding a dependence. Moreover, since the Bayes Ball algorithm ignores the weights of the relationships between variables, the algorithm's classification might even be closer to the truth.

Figure 4.2: The Bayes Ball algorithm will mark X4 and X7 as d-connected. Due to the small weights, statistical tests are likely to conclude an independence.

The relatively low performance at the level of quadruples of interest (recall ≈ 25%, precision ≈ 40%) need not be a problem. If the average error of all positive quadruples is significantly lower than the average error of all negatives, the classification is still satisfying. The error results are given in table 4.2a.

(a)
                  Amount   Average error   Standard deviation
True positives    68       0.0142          0.012
True negatives    652928   0.192           0.401
False positives   101      0.461           0.553
False negatives   101      0.0242          0.010
Unknown           4522     0.171           0.349
All positives     169      0.281           0.480
All negatives     653029   0.179           0.363

(b)
                  Amount   Average error   Standard deviation
True positives    46       0.0127          0.0134
True negatives    657092   0.193           0.392
False positives   18       0.134           0.352
False negatives   126      0.0238          0.0107
Unknown           438      0.219           0.379
All positives     64       0.0469          0.191
All negatives     657218   0.194           0.395

Table 4.2: The average error results after testing for W ⊥⊥ Y | [X] and W ⊥̸⊥ Q | [X] (a), and additionally for Y ⊥⊥ Q | [X] (b).

The remarkably high error of the false positive quadruples drastically increases the average error of all positives, making the classification of the algorithm worthless. So, to answer the question above: yes, the low precision is a problem.

To try to reduce the number of false positives, we can add a test that also holds in the Y structure drawn in figure 3.1. Besides W ⊥̸⊥ Q | [X] and W ⊥⊥ Y | [X], we add Y ⊥⊥ Q | [X] to the requirements. Now W and Q are symmetric, and some true positive quadruples might become true negatives after this addition. Results are given in table 4.2b. Adding this extra minimal independence test indeed brings the number of false positive detections down to 18 (a decrease of 82%), with an average error of 0.134. Some true positives are lost: the recall drops from 25% to 17%. This is a price we are willing to pay, because now we can say that the average error of detected positive quadruples is about 80% lower than the average error of detected negative quadruples (two-sample t-test, H0: µpos = µneg, H1: µpos < µneg; p-value < 4.1 · 10⁻⁸).

Is it possible to bring the number of false positives down even more? Let us write down other relationships that hold in the quadruples of interest (see figure 3.1) and add them to the requirements: W ⊥̸⊥ X, X ⊥̸⊥ Y, X ⊥̸⊥ Q, Y ⊥̸⊥ Q, X ⊥̸⊥ Q | Y, X ⊥̸⊥ W | Y, Q ⊥⊥ W | Y, X ⊥̸⊥ W | Q, Y ⊥̸⊥ W | Q, X ⊥̸⊥ Y | W, X ⊥̸⊥ Q | W, Q ⊥̸⊥ Y | W. These tests can be considered redundant, because they can be deduced from the tests we are already using. However, they might form an extra barrier that false positives cannot pass, increasing the precision and lowering the average error of all positives. Results can be found in table 4.3. Now the recall drops to 12%, while the precision and the average positive error stay about constant. The performance (especially the recall) does not benefit from these additions.

(34)

Amount Average error Standard deviation True positives 32 0.0165 0.0145 True negatives 657354 0.196 0.392 False positives 14 0.0151 0.0149 False negatives 168 0.0201 0.0120 Unknown 152 0.109 0.176 All positives 46 0.0160 0.145 All Negatives 657522 0.184 0.362

Table 4.3: Average error results for the requirements W ⊥⊥ Y | [X], W ⊥̸⊥ Q | [X], Y ⊥⊥ Q | [X], W ⊥̸⊥ X, X ⊥̸⊥ Y, X ⊥̸⊥ Q, Y ⊥̸⊥ Q, X ⊥̸⊥ Q | Y, X ⊥̸⊥ W | Y, Q ⊥⊥ W | Y, X ⊥̸⊥ W | Q, Y ⊥̸⊥ W | Q, X ⊥̸⊥ Y | W, X ⊥̸⊥ Q | W, Q ⊥̸⊥ Y | W.

To rule out a clear shift of the optimal p-value thresholds when different statistical tests are combined, a brute-force search is performed over the domain [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999] for both p_up and p_low. Note that p_up > p_low, which leads to a search space of 136 combinations. Considering both the recall (figure 4.3a) and the precision (figure 4.3b), we stay with our current settings for the thresholds (p_low = 0.01 and p_up = 0.1).
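A minimal sketch of this scan in R, assuming a hypothetical evaluate_quadruples() that runs the detection step on the data and returns recall and precision for one threshold pair:

    grid <- c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5,
              0.75, 0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999)
    # all pairs with p_low < p_up: choose(17, 2) = 136 combinations
    pairs <- subset(expand.grid(p_low = grid, p_up = grid), p_low < p_up)
    scores <- apply(pairs, 1, function(p)
      evaluate_quadruples(data, p_low = p[["p_low"]], p_up = p[["p_up"]]))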

Figure 4.3: Heatmaps of the recall (left) and the precision (right) of the detection of positive quadruples. Note that no interpolation has been done between measurement points; the figure maps all 136 combinations of p_up and p_low rather than a Euclidean space. Also note that, for aesthetic reasons, the y-axis runs from largest to smallest.

4.1.2 Multiple DAGs

All the analysis above has been done on a single simulated DAG. To make sure the results are not due to a coincidentally favorable DAG structure (i.e. that we are not overfitting), the algorithm is applied to multiple similar DAGs. Similar means that the same simulation parameter settings are used, but the causal structure of the DAGs varies. Before applying the algorithm, we use the Bayes Ball algorithm to check whether the DAG contains a certain minimum number (in this case 20) of quadruples of interest. If not, we discard the DAG and simulate a new one. This prevents discovering afterwards that the algorithm has been applied to a DAG without interesting structures, from which no information about the performance can be obtained. Averaged over 25 DAGs, the performance of the statistical tests for p_low = 0.01 and p_up = 0.1 is given in table 4.4. Averaged over 10 DAGs, the performance of the algorithm is given in table 4.5.

  Relationship               Recall (mean ± s.d.)   Precision (mean ± s.d.)
  Dependence                 0.84 ± 0.12            0.99 ± 0.014
  Independence               0.90 ± 0.045           0.85 ± 0.12
  Conditional dependence     0.79 ± 0.11            0.95 ± 0.015
  Conditional independence   0.85 ± 0.036           0.81 ± 0.11
  Minimal dependence         0.35 ± 0.17            0.67 ± 0.24
  Minimal independence       0.11 ± 0.069           0.27 ± 0.16

Table 4.4: The classification performance for p_low = 0.01 and p_up = 0.1, averaged over 25 DAGs.

                    Two tests         Three tests       All tests
                    (mean ± s.d.)     (mean ± s.d.)     (mean ± s.d.)
  True positives    0.016 ± 0.010     0.014 ± 0.0088    0.016 ± 0.0088
  True negatives    0.19 ± 0.32       0.19 ± 0.31       0.19 ± 0.31
  False positives   0.30 ± 0.35       0.17 ± 0.23       0.14 ± 0.18
  False negatives   0.019 ± 0.012     0.019 ± 0.012     0.016 ± 0.0098
  Unknown           0.19 ± 0.30       -                 0.12 ± 0.18
  All positives     0.23 ± 0.33       0.090 ± 0.19      0.064 ± 0.14
  All negatives     0.19 ± 0.31       0.19 ± 0.30       0.19 ± 0.31
  Recall            0.37              0.43              0.19
  Precision         0.23              0.43              0.52

Table 4.5: Error results for two tests (W ⊥⊥ Y | [X], W ⊥̸⊥ Q | [X]), three tests (addition of Y ⊥⊥ Q | [X]) and all tests (addition of W ⊥̸⊥ X, X ⊥̸⊥ Y, X ⊥̸⊥ Q, Y ⊥̸⊥ Q, X ⊥̸⊥ Q | Y, X ⊥̸⊥ W | Y, Q ⊥⊥ W | Y, X ⊥̸⊥ W | Q, Y ⊥̸⊥ W | Q, X ⊥̸⊥ Y | W, X ⊥̸⊥ Q | W, Q ⊥̸⊥ Y | W), averaged over 10 DAGs.
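The screening step described above amounts to a simple rejection loop; simulate_dag() and count_quadruples_bayes_ball() are hypothetical stand-ins for the simulation and Bayes Ball code:

    # keep simulating until a DAG contains enough quadruples of interest
    sample_usable_dag <- function(n_vars = 30, min_quadruples = 20) {
      repeat {
        g <- simulate_dag(n_vars)
        if (count_quadruples_bayes_ball(g) >= min_quadruples) return(g)
      }
    }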

To be even more sure about the significance of the results, another 25 similar DAGs are simulated, and the mean error of all positive detections by the algorithm, µ_pos, is compared with the mean error of all negative detections, µ_neg. Also here, µ_pos remains about 80% lower than µ_neg (µ_pos = 0.011, µ_neg = 0.095; two-sample t-test with H0: µ_pos = µ_neg, H1: µ_pos < µ_neg; p-value < 2.6 · 10⁻⁶).
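In R this comparison is a one-sided two-sample t-test; err_pos and err_neg are assumed to hold the per-quadruple errors of the positive and negative detections:

    # H0: mean(err_pos) = mean(err_neg), H1: mean(err_pos) < mean(err_neg)
    t.test(err_pos, err_neg, alternative = "less")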

The results of this section are in line with the previous section: the single DAG used there was representative, although the average performance over multiple DAGs is somewhat worse than on that single DAG.


4.2 More realistic data sets

The previous section shows that our methodology and algorithm work on a very favorable data set; the noise regarding unmodelled causes of a variable (E) is very low and every variable is observed 3000 times. If the data becomes more realistic, for instance if the number of observations drops and the noise grows, will the algorithm still work?

Ultimately, we want to know the performance of the algorithm on a real data set. We evaluate it on the Hughes data set (already covered in section 3.1). There are some differences to keep in mind between the simulated data and the real data. Although we can try to mimic the Hughes data set as closely as possible while simulating (for instance, in terms of the variance and mean of the variables), the average indegree and outdegree (the number of incoming and outgoing arrows) of the nodes cannot be estimated. The number of incoming or outgoing arrows of a node in a simulated DAG was chosen to be 0, 1, 2 or 3 with probability 0.25, 0.4, 0.2 and 0.15, respectively. If the underlying DAG of the real data is less dense (i.e. has fewer arrows) than our simulated DAGs, quadruples of interest will probably be scarce. Since the causal structure of the variables in real data is unknown, the Bayes Ball algorithm cannot be used to estimate the number of quadruples of interest. This might lead to many (time consuming) runs that lack positive detections and hence provide no information about the performance of the algorithm.
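Sampling node degrees with these probabilities is a one-liner in R (a sketch; the actual generator may be parameterized differently):

    # degree of each of the 30 nodes: 0, 1, 2 or 3 with the stated probabilities
    degrees <- sample(0:3, size = 30, replace = TRUE,
                      prob = c(0.25, 0.40, 0.20, 0.15))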

There are 5361 observed genes, but only 234 of them have been intervened on (each once). Our validation method requires an intervention on gene X, so the space of X values is limited to these 234 genes. For computational reasons, the data set is shrunk even further to a random subset of 30 genes. If we apply the algorithm with the three minimal tests (W ⊥̸⊥ Q | [X], W ⊥⊥ Y | [X], Y ⊥⊥ Q | [X]) to 10 random subsets of 30 variables, no positive quadruples are found, which is in line with the results on the simulated data. If the 'redundant' test Y ⊥⊥ Q | [X] is discarded, the algorithm does find positive quadruples (20 in total, over the 10 subsets), but their average error (µ_pos = 0.194) is not significantly lower than the average error of the negatives (µ_neg = 0.166). Apparently many false positive detections pass the requirements, analogously to table 4.2a. Trying to filter false positives out of all positive detections by subsequently adding the extra tests described at the end of section 4.1 provides no benefit: either all positives disappear, or the average error µ_pos does not decrease. If we do the same analysis on two random subsets of 50 variables, similar results are found.

Is it possible to filter out the false positives in some other way? Reviewing section 2.4, rules 5 and 6 identify situations in which the causal effect is zero. Rule 6, for example, states that if X ⊥⊥ Y or X ⊥⊥ Y | W, the causal effect is zero. Discarding detected positives that obey rule 5 or 6 does not improve the results. Using the more conservative setting p_low = 0.001 does not improve the results either.

Now that we know the algorithm does not work on the Hughes data set, is it possible to decrease the signal-to-noise ratio of simulated DAGs without losing classification power? We take the DAG on which we did the analysis in section 4.1 and simulate multiple DAGs with an identical structure (all edges are placed in the same positions), but with varying numbers of observations and varying noise levels. The small peak in figure 4.1a is expected to become less prominent, causing more overlap between the p-values of independent and dependent relationships, which complicates the classification. If we drastically lower the number of observations to 25 and keep E_noise equally low at 0.01, the distribution of p-values of dependent relations looks like figure 4.4a. There is a clear but relatively small change compared to figure 4.1a. If E_noise is increased drastically to 0.1 while the number of observations is kept constant at 3000, a similar effect can be seen in figure 4.4c. If we drastically change both parameters, figure 4.4e shows that the signal gets lost.
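The effect can be reproduced with a toy simulation of a single dependent pair, Y = wX + E; the weight distribution follows chapter 5, and treating E_noise as the noise variance is our assumption:

    # p-values of a truly dependent pair for given K_obs and E_noise
    pvals_dependent <- function(K_obs, E_noise, n_rep = 1000) {
      replicate(n_rep, {
        w <- runif(1, -sqrt(3), sqrt(3))        # edge weight
        x <- rnorm(K_obs)
        y <- w * x + rnorm(K_obs, sd = sqrt(E_noise))
        cor.test(x, y)$p.value                  # H0: rho(X, Y) = 0
      })
    }
    hist(pvals_dependent(K_obs = 25,   E_noise = 0.01))   # cf. figure 4.4a
    hist(pvals_dependent(K_obs = 3000, E_noise = 0.10))   # cf. figure 4.4c
    hist(pvals_dependent(K_obs = 25,   E_noise = 0.10))   # cf. figure 4.4e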


Figure 4.4: A, C, E: raw view of the distribution of p-values (H0: ρ(A,B) = 0, H1: ρ(A,B) ≠ 0) among all possible dependent pairs of variables. B, D, F: raw view of the distribution of p-values of conditional dependencies (H0: ρ(A,B | C) = 0, H1: ρ(A,B | C) ≠ 0) among all possible triples of variables. The Bayes Ball algorithm is used to make the distinctions. A, B: E_noise = 0.01, K_obs = 25. C, D: E_noise = 0.10, K_obs = 3000. E, F: E_noise = 0.10, K_obs = 25.

If we evaluate properties of the Hughes data and simulate a data set with similar properties (E_noise = 0.10, K_obs = 63), the distribution of p-values of dependencies looks like figure 4.4e, which explains the poor performance.

We simulate another ten DAGs of 30 variables and two DAGs of 50 variables with a 'medium' signal-to-noise ratio: E_noise = 0.04 and K_obs = 50. The analysis is done analogously to the Hughes data set and, unfortunately, the results for these DAGs are just as poor as those on the Hughes data set.

One remark regarding figure 4.4: incorrect classifications are distributed quite uniformly to the right of the peak. The peak does not shrink smoothly, but ends abruptly. This means that the 'unknown' classification can be considered redundant: a single p-value threshold (instead of p_low and p_up) would have sufficed.
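The three-way decision rule used throughout takes only a few lines of R; collapsing p_low and p_up into one threshold would simply remove the 'unknown' band:

    # classify one tested relation from its p-value
    classify <- function(p, p_low = 0.01, p_up = 0.1) {
      if (p < p_low)     "dependent"
      else if (p > p_up) "independent"
      else               "unknown"
    }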


Discussion and future work

The goal of this project is to pool several theoretical ideas into an algorithm and to find out whether this is a promising track to study in depth. To uncover the potential strengths of the algorithm, the problem we try to solve is narrowed down by simplifying assumptions and by setting some issues aside.

The implementation of our methods is done in the relatively slow programming language R, for two reasons. First, previous work (Maathuis et al.) was available in R, of which we might have wanted to reuse some scripts (eventually this was not the case). Second, due to a lack of personal programming experience in a low-level but computationally fast language (like C++), a high-level language like R was more attractive. Because of the expensive search (O(n⁴)) and the many runs with all kinds of minor adjustments to the algorithm and different parameter settings, using DAGs with many variables would be very time consuming. Besides, an enormous DAG is not essential for demonstrating the key aspects of our methodology. For these reasons, the number of variables n in the simulated DAGs is set to 30 instead of 100 or 1000, even though many real-life problem settings contain more variables. A DAG containing just a couple of nodes (for example, if figure 3.1 were the representation of a complete DAG) is easy to oversee: variations in correlation coefficients can be assigned to a particular variable with higher certainty. The challenge starts when the number of nodes and arrows grows and the DAG becomes more complex. Obviously, it would be interesting to find out to what level of complexity the DAG can be pushed while the power of the algorithm is maintained.
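To make the O(n⁴) search concrete: for n = 30 there are already 657,720 ordered quadruples to test, matching the totals in tables 4.2 and 4.3:

    # number of ordered quadruples (W, X, Y, Q) of 30 distinct variables
    prod(30:27)   # 30 * 29 * 28 * 27 = 657720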

A causal relationship between one variable and another is assumed (and also simulated) to be linear. In real life, these relationships might be nonlinear. Fitting a model through the observations to estimate the error (3.4), as well as (conditional) independence testing, would then become much more complex.

When we perform a conditional test, we condition on only one variable. Conditioning on more variables would lower the number of usable data points¹ and ultimately worsen the classification power. However, expanding the conditioning set might enable other structures to pass the requirements and give access to Entner's rules, for both the Bayes Ball algorithm and our proposed algorithm.

¹Conditioning on a variable implies that the data is restricted to observations in which the […]

Some parameters remained untouched. The total number of interventions K_int was fixed at 200, regardless of the number of variables in the simulated DAG (30 or 50). The value of K_int does not influence the classification performance; an increase of K_int leads to more interventions per variable and a less noisy estimate of the error. However, since interventions are generally expensive, a high K_int is unrealistic: in the Hughes data set, 234 out of 5361 genes have been intervened on (once!). The measurement noise M_noise is also fixed. It could be of interest how heavily an increase of this parameter complicates the classification.

Although the weights were sampled uniformly between −√3 and √3, the correlation distribution of dependent variables (figure 5.1a) shows a peak around zero, leading to a high overlap with the correlations of independent variables (figure 5.1b) and complicating classification based on statistical tests. This could be caused by opposite relationships along different paths balancing each other out (see figure 2.4), which casts doubt on the faithfulness assumption.

Figure 5.1: The correlations of all possible variable pairs are calculated and plotted in a histogram. Using Bayes Ball, a distinction is made between dependent (a) and independent (b) variables. Note the difference in scale on the horizontal axis.
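Such (near-)cancellation is easy to reproduce. In the sketch below (weights chosen by us for illustration), a direct effect is exactly balanced by an indirect path, so the marginal correlation of a causally dependent pair sits at zero:

    # X -> Y with weight a, plus X -> Z -> Y with weights b and c; a + b*c = 0
    set.seed(1)
    n <- 3000
    a <- 0.8; b <- 1.0; c <- -0.8
    x <- rnorm(n)
    z <- b * x + rnorm(n, sd = 0.1)
    y <- a * x + c * z + rnorm(n, sd = 0.1)
    cor(x, y)   # approximately 0, although Y causally depends on X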

In the simulations used by Maathuis et al. [3], weights were sampled uniformly from the interval between 1 and 2, which keeps them bounded away from zero. This would radically increase the classification power of the algorithm, while there is no evidence that it is a less realistic approach.

Compared to the simulations in section 4.1, K_obs and E_noise are changed rather drastically in section 4.2, and the algorithm failed on the new settings. To retrieve more precise parameter settings at which the algorithm still works (i.e. at which the average error of the positives is significantly lower than the average error of the negatives), smoother adjustments of the parameters are required.

Another idea would be to use the PC algorithm to remove false positive detections: if a positive quadruple is in conflict with the structure given by the CPDAG, one might want to remove this quadruple from the list of positives.
