
MSc Artificial Intelligence

Track: Machine Learning

Master Thesis

Discovering Causal Links

In Mass Cytometry Data

by

David Woudenberg

10069143

July 8, 2016

42 EC, December 2015 - July 2016

Supervisor:

Dr J.M. Mooij

Assessor:

Dr M. Welling

Informatics Institute

University of Amsterdam


Abstract

This thesis deals with the challenge of finding causal links between functional proteins in a complex biological dataset. This dataset contains information about the abundance of proteins in individual human white blood cells that were treated with different stimuli. Two constraint-based methods for causal discovery are applied: CLCD, a simple method that only finds one particular causal structure between 3 variables, and ACI, a much more general method that can be applied to a larger number of variables and can also deal with uncertainty in the input. As no ground truth is available to compare the predictions to, validation is a non-trivial task. An internal validation of the CLCD results shows that they are more consistent than predictions produced by the random baseline. ACI predictions score better when validated on an external database of known causal pathways, but they barely outperform the random baseline. This could be caused by the incompleteness of this database, although there are indications that the quality of the data could also explain it.


Acknowledgements

First and foremost I would like to thank Joris Mooij, my daily supervisor, for helping me understand the theory concerning causal discovery, guiding me through the process of conducting this scientific research, and also being willing to help out with small practical issues when they popped up. Next I would like to thank Stephan Bongers for fruitful discussions during our weekly meetings and for feedback and ideas during both the research and writing process; it was very helpful. Also many thanks to Sara Magliacane, who provided code for ACI and was always full of contagious energy and enthusiasm. I would like to thank our collaborators from the University of Crete, Ioannis Tsamardinos and Sofia Triantafillou, for providing a starting point for my research and for filling me in on details of their work when necessary. Furthermore, thanks to Max Welling for acting as assessor on the examination committee. Thanks to my friends and family for their support and for trying to (or at least pretending to) understand what I was working on. A special thank you to my brother, Nicolaas, for making sure that spelling and style were of Ivy League level. Last but not least, thank you Julia, not only for providing welcome distraction now and then but also for encouraging me to put in my best effort.


Contents

1 Introduction
2 Theoretical framework
  2.1 Preliminaries
  2.2 Conservative Local Causal Discovery
  2.3 Ancestral Causal Inference
    2.3.1 Causal reasoning rules
    2.3.2 Finding the ancestral causal structure
  2.4 Related Work
3 The data
  3.1 The experimental setup
  3.2 Gating
  3.3 General remarks on the Bodenmiller data
4 Methodology
  4.1 General Pipeline
    4.1.1 Separate Data
    4.1.2 Independence testing
    4.1.3 Aggregate p-values
    4.1.4 Run an algorithm
  4.2 CLCD
  4.3 ACI
    4.3.1 Method i) CLCD with ACI
    4.3.2 Method ii) Using ACI for post processing
    4.3.3 Method iii) ACI on all 14 functional proteins
    4.3.4 Method iv) Use all (in)dependencies on all proteins
5 Experiments
  5.1 Results of different methods
  5.2 Validating CLCD predictions
    5.2.1 Qualitative comparison of CLCD predictions
    5.2.2 Validating on other datasets
    5.2.3 Internal consistency
    5.2.4 Transitive consistency
  5.3 Comparing the CLCD and ACI predictions
  5.4 Validation on the Kegg database
  5.5 Increasing the time-out of the ASP solver
  5.6 Using different encodings for ACI
  5.7 Using the mean log p-value to aggregate


1 Introduction

One of the main objectives of science is to find causal relations: discovering how a system would behave if we intervene on it. Causation is quite often confused with correlation, even though these concepts are quite different. A well-known study conducted by Messerli [17] illustrates this. It showed that the amount of chocolate consumed per capita annually is a very good predictor of the number of Nobel laureates per 10 million people in a country, as illustrated in Figure 1. This suggests that eating chocolate might make you smarter and thus increase the chance of winning a Nobel prize. However, spending tax money to subsidize chocolate instead of investing in education does not seem like a smart public policy to increase the intelligence of the population. There might be other explanations for the correlation found in [17]. First, winning a Nobel prize might actually increase the amount of chocolate that is eaten in your country. Second, there might be a third (unobserved) variable explaining both the high chocolate consumption and the large number of Nobel laureates in a certain country, for example the general welfare in that country. Another way to explain the correlation is that it was a coincidence, although the small p-value associated with the correlation indicates that this chance is very small. To make matters even worse, any combination of the reasons above might explain the correlation found between chocolate consumption and Nobel prizes.

Figure 1: Figure 1 from [17] showing the correlation between chocolate consumption and Nobel laureates.

The example above illustrates that correlation and causation are two very distinct, although related, concepts. The first concept deals with questions of the type "If we observe that a country consumes a large amount of chocolate annually, what is the chance that its citizens will receive a Nobel prize?". Causality, on the other hand, concerns itself with questions of the type "If we make the inhabitants of a country consume a lot of chocolate, what is the chance that its citizens will receive a Nobel prize?". The difference between these two examples is that in the latter a specific variable (how much chocolate a citizen consumes) is forced to take a certain value and the effect is measured, whereas in the former just the co-occurrence of two events is observed. Current machine learning techniques can give a lot of insight into the (conditional) probability of some observation, but the models learned by these techniques do not give any insight into how interventions change a system.

When, for example, making decisions about public policy or deciding whether a drug is beneficial for certain patients, we are interested in the effects of these interventions. One way to conduct that research would be to randomly divide a sample into two groups, one of which is intervened on while the other is left alone. If all other variables are controlled for, such a study can identify the effects of an intervention. This sort of study is not always possible, however: performing a certain intervention might be very expensive, unethical or simply impossible. Also, an intervention might have other unforeseen consequences.


Pearl showed that under certain assumptions it is also possible to find causal relations in observational data, and he developed a powerful do-calculus to reason mathematically about this [13]. The PC [8] and FCI [12] algorithms were also developed to find the causal structure of the variables in an observational dataset. However, these algorithms do not allow for any type of intervention in the data and do not always produce an unambiguous model. A brief presentation of these algorithms is given in Section 2.4.

The objective of this thesis is finding causal relations between functional proteins in human white blood cells. These cells have been treated with different types of activators, after which the abundance of proteins was measured in the experiment described in [16]. Knowledge of these causal relations could for example be of assistance when developing drugs to treat diseases. Section 3 presents both the Bodenmiller dataset and the experiment conducted to collect it. The treatment of a cell with an activator is an obvious intervention on the system, therefore we use the CLCD [7] and ACI [22] algorithms, which are suited for both observational data and data that contains some intervention. In Section 2 these algorithms are explained in detail. Since there is no ground truth available against which to compare the predictions made by the various methods, validation is also a main challenge addressed in this thesis. The following research questions are answered:

Research Question 1. Can CLCD and ACI find causal relations in the Bodenmiller [16] dataset?

Research Question 2. Is it possible to validate causal predictions made by CLCD and ACI?

To answer these questions, first the simple CLCD algorithm is applied to the Bodenmiller data. This method can only find one particular causal structure between two proteins and one activator variable. Three methods using the CLCD algorithm are used in this thesis and their results compared. Predictions made by CLCD can be validated internally, in a fashion that does not rely on knowing the ground truth. To perform causal discovery on a larger number of variables the ACI algorithm is used. CLCD should be a special case of this framework, therefore the results of CLCD are first reproduced with ACI. Next, the complexity of the input for ACI is gradually increased, resulting in four different methods applying the ACI algorithm. Finally, all predictions can be compared to a database containing known causal pathways. However, the predictions could not be validated in this way; this may be due to noisy data, but the validation method itself might also not be suitable.

The rest of this thesis is outlined as follows: Section 2 contains the theoretical framework, including some preliminaries, the explanation of the two algorithms and a selection of related work. Next, Section 3 presents the Bodenmiller dataset. After that, the exact setup of the different methodologies using either the CLCD or the ACI algorithm is given in Section 4. Then, the research questions posed above are answered in Section 5, and finally, Section 6 contains the conclusion and gives some ideas on possible future work.


2 Theoretical framework

Two main approaches for making causal discoveries are score-based and constraint-based methods. The first approach defines a set of models that can explain the distribution of the data; an algorithm then scores the different models on how well they explain the data that is observed, and the model that scores best is used to make causal predictions. Another possible score-based approach would be to apply Bayesian model averaging to a set of possible models. The second approach uses (in)dependence statements, together with assumptions like faithfulness and the Markov property, which will be explained in the next subsection, to find possible models that explain the data. Usually these approaches start from all possible models and exclude those that violate the (in)dependence statements (constraints) under the given assumptions. The result is usually an equivalence class of possible models that do not violate any constraints. Examples of both approaches are given at the end of this section.

In this thesis two algorithms are used, CLCD (Section 2.2) and ACI (Section 2.3). Both are constraint-based methods that use (in)dependence relations to find causal relations. ACI also uses a scoring mechanism to find the models that are supported by the most evidence, and can therefore be seen as a hybrid method. CLCD is a very simple algorithm and can only identify one specific type of model for three variables. ACI is a much more general method and can essentially find any ancestral structure between a larger number of variables. Also, ACI is one of the first constraint-based methods that can deal with uncertainty in the input. Before the algorithms are presented, some preliminary assumptions and definitions are given. A brief overview of some related work is presented at the end of this section.

2.1 Preliminaries

Variables will be denoted with capital letters and sets of variables with boldface capital letters. As is common in the field of causal research, it will be assumed that the data generating process can be modelled with a Directed Acyclic Graph (DAG), where the vertices model the variables and the directed edges causal relations. When a DAG contains a directed edge X → Y we call X a direct cause of Y, relative to the variables in that DAG. When there is a sequence of directed edges from X_1 to X_n, this is called a directed path from X_1 to X_n, and X_1 is said to be a causal ancestor of X_n. An(X) denotes the set of variables that are ancestors of X, which also includes X itself. When X ≠ Y and X ∈ An(Y) this is denoted as X ⇢ Y. The absence of a directed path from X to Y is denoted as X ⇢̸ Y. For a set of variables W, X ⇢ W means that there is at least one Y ∈ W for which X ⇢ Y. When X ⇢̸ Y holds for all Y ∈ W, this is denoted as X ⇢̸ W.

If there is some latent variable Z and two observed variables X and Y, we call Z a confounder if and only if there is a path Z ⇢ X that does not pass through Y, and a path Z ⇢ Y that does not pass through X. The full DAG over all variables can be marginalized and represented with an acyclic directed mixed graph (ADMG), in which such a confounder is represented by a bi-directed edge between X and Y.

For disjoint sets of variables X, Y and Z, conditional independence of X and Y given the variables in Z is denoted as X ⊥⊥ Y | Z and dependence as X ⊥̸⊥ Y | Z. The cardinality |Z| is called the order of the conditional (in)dependence relation. In [15] the notion of minimal conditional independence is introduced:

X ⊥⊥ Y | W ∪ [Z]  :=  (X ⊥⊥ Y | W ∪ Z)  ∧  (∀ Z′ ⊊ Z : X ⊥̸⊥ Y | W ∪ Z′)

This intuitively says that, in the context of W, the variables in Z are needed for the independence to hold. Minimal conditional dependence is formulated similarly; note that it is not the negation of minimal conditional independence.

A common assumption in the field of causal discovery is that the Global Markov Property [8] holds. This property says that if a subset of variables Z d-separates [5] two other subsets of variables X and Y, where X, Y and Z are disjoint, then X ⊥⊥ Y | Z follows directly. A second common assumption is faithfulness, which means that the only (in)dependence relations that exist between a set of variables are those that can be found with d-separation [5], and no others [8]. The Markov and faithfulness assumptions together imply that everything about the conditional independences in the joint distribution of a set of variables can be read off the corresponding DAG. Both CLCD and ACI make these assumptions, and for the rest of this thesis it is assumed that these properties hold.

Furthermore, it is assumed that ancestral causal relations are irreflexive (1) and transitive (2). Together these two axioms guarantee that ancestral causal relations are acyclic (3) [22]. These last three assumptions are also the first three axioms used as causal reasoning rules by ACI which will be explained in Section 2.3. A relation that satisfies these axioms is called an ancestral structure.


X ⇢̸ X  (1)

X ⇢ Y ∧ Y ⇢ Z ⇒ X ⇢ Z  (2)

X ⇢ Y ⇒ Y ⇢̸ X  (3)
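To make the three axioms concrete, the following minimal sketch in Python checks whether a relation, given as a set of ordered pairs (X, Y) meaning X ⇢ Y, satisfies them; the function name and the representation are illustrative and not part of ACI.

from itertools import product

def is_ancestral_structure(relation, variables):
    """Check irreflexivity (1), transitivity (2) and acyclicity (3) of a
    relation given as a set of ordered pairs (X, Y), read as X is an ancestor of Y."""
    rel = set(relation)
    # (1) irreflexivity: no variable is an ancestor of itself
    if any((x, x) in rel for x in variables):
        return False
    # (2) transitivity: X -> Y and Y -> Z must imply X -> Z
    for (x, y), (y2, z) in product(rel, rel):
        if y == y2 and (x, z) not in rel:
            return False
    # (3) acyclicity: X -> Y excludes Y -> X
    if any((y, x) in rel for (x, y) in rel):
        return False
    return True

# {A -> B, B -> C, A -> C} satisfies all three axioms; a two-cycle does not.
print(is_ancestral_structure({("A", "B"), ("B", "C"), ("A", "C")}, ["A", "B", "C"]))  # True
print(is_ancestral_structure({("A", "B"), ("B", "A")}, ["A", "B"]))                   # False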

2.2 Conservative Local Causal Discovery

Local Causal Discovery (LCD) is a constraint-based algorithm introduced by Cooper in [7] and is a very simple algorithm for finding causal relations between three variables in observational data, possibly containing some intervention.

Definition 1 (LCD). Given three variables I, A, B with the following (in)dependence relations:

• I ⊥⊥ B | A
• I ⊥̸⊥ A
• A ⊥̸⊥ B

and the extra assumption, which might arise from some prior knowledge about the system or experimental setting:

• An(I) ∩ {A, B} = ∅

the triplet (I, A, B) is an LCD triplet with a known causal structure.

The causal structure of the three variables can be represented with one of the ADMGs shown in Figure 2. This was proven by [7], but can also be seen intuitively by using d-separation as introduced in [5] and Reichenbach's principle of common cause [1]. Reichenbach's principle tells us that if two variables are dependent, then either one of them is a cause of the other, or they are confounded by some other latent variable. It is assumed that A is not a cause of I, so a directed edge A → I can be excluded. The faithfulness and Global Markov properties together with I ⊥⊥ B | A then tell us that A d-separates I and B. This knowledge allows us to remove the possibility of a confounder between A and B as well as the edge B → A, as these would not agree with the definition of d-separation. Also, no direct edge I → B or a confounder between these two variables can exist, as A would then not d-separate I and B.


Figure 2: An LCD prediction has a causal structure which is represented by one of the ADMGs above.

In the particular setting that the algorithm is applied to in this thesis, there is additional knowledge that I is a binary variable indicating whether an intervention, specifically the treatment of the sample with some activator, has taken place. Because of this, one could be tempted to remove the confounding edge between I and A, as the value of I is only determined by adding or not adding the activator. An argument not to do this is illustrated in Figure 3. As there are multiple activators in the data examined in this thesis, different interventions could be a cause of A; these different interventions are called I_1 and I_2 in Figure 3. These two variables are confounded by the variable I, which in this context is a variable that governs the experimental setting, and for that reason the confounding edge cannot be removed from Figure 2. However, in the particular experiment examined in this thesis, samples were treated with at most one activator at a time, and intervening with one activator implies that the data has not been treated with some other stimulus. This example shows that there are some complex subtleties that come into play when trying to remove the confounding edge between I and A in Figure 2, therefore no assumptions on the presence or absence of this edge are made in this thesis. In the end the relations that are of interest are those between functional proteins, represented by A and B in the example given here. Whether an intervention actually affects the value of these variables, or whether there is some other complicated confounded relationship, is not important, as it does not affect the causal relation between A and B, the variables of interest. For a detailed description of the data and the experiment conducted to gather it, please see Section 3.

The ADMGs in Figure 2 do not only entail the two dependence relations and the single independence that were used to infer the causal structure, but also an additional three dependence relations. Conservative Local Causal Discovery (CLCD) also requires these extra three conditional dependencies to hold before it predicts a triplet of variables to have the causal structure described in Figure 2. These extra requirements follow from the faithfulness assumption.


Figure 3: An LCD triplet could also be confounded by a latent variable.

Definition 2 (CLCD). Given Definition 1 and the extra requirements:

• I ⊥̸⊥ B
• I ⊥̸⊥ A | B
• A ⊥̸⊥ B | I

the triplet (I, A, B) is a CLCD triplet with a known causal structure.

To determine whether an (in)dependence statement holds, the p-values of statistical independence tests, which are discussed in Section 4.1.2, are compared with a threshold: p < α for dependence and p > β for independence. When all six (in)dependences hold, the triplet of variables is predicted to have the causal structure of Figure 2. The (in)dependences are all binary decisions, and might not be well suited to deal with the uncertainty that comes with performing these statistical tests on real world data. Also, CLCD cannot be used on a larger number of variables or to find triplets with a different causal structure. This simplicity makes CLCD a logical starting point for causal discovery, but for real world applications a more complex approach will most likely be necessary. For that reason the ACI framework is also applied to the Bodenmiller dataset in this thesis. The predictions made by CLCD are a special case of the wide variety of causal structures that can be found with ACI.
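As an illustration, a minimal sketch of this decision rule in Python is given below, using the thresholds α = 0.01 and β = 0.15 that are chosen for the CLCD methods in Section 4.2; the dictionary keys naming the six tests are hypothetical labels, not taken from the thesis code.

ALPHA, BETA = 0.01, 0.15   # dependence / independence thresholds used for CLCD

def dep(p):
    return p < ALPHA       # the test indicates dependence

def indep(p):
    return p > BETA        # the test indicates independence

def clcd_triplet(p):
    """p: dict with the six p-values for a triplet (I, A, B);
    returns True when all CLCD requirements (Definitions 1 and 2) hold."""
    return (dep(p["I_A"]) and dep(p["A_B"]) and dep(p["I_B"]) and
            dep(p["I_A|B"]) and dep(p["A_B|I"]) and indep(p["I_B|A"]))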

2.3 Ancestral Causal Inference

Ancestral Causal Inference (ACI) [22] is a constraint-based causal discovery algorithm, just like CLCD, and also produces predictions of ancestral causal relations. It reduces causal discovery to a constrained discrete minimization problem which can be solved using Answer Set Programming (ASP) following work in [19]. ACI extends LoCI [15] that introduced a set of causal reasoning rules which can be used to determine ancestral causal relations, but did not allow for uncertainty in the input.

ACI has a number of advantages. First, unlike CLCD and many other constraint-based causal discovery algorithms, it does not require binary decisions to determine the (in)dependence relations. Instead, it can deal with the uncertainty of a statistical test of independence by attributing a weight to each (in)dependence statement. Also, where CLCD can only find one particular causal structure for 3 variables, ACI can find any structure between a much larger number of variables. Furthermore, it allows background knowledge to be encoded and works with observational data, possibly, but not necessarily, containing interventions. ACI can be seen as a general framework that uses causal reasoning rules to infer any ancestral causal structure; the predictions made by CLCD are a special case of this framework for which one specific structure holds and where there is no contradiction in the input.

As input ACI takes a list L of weighted constraints (w_i, l_i), where w_i is the weight associated with statement l_i. Such a statement can be a (conditional) dependence or independence relation between two variables, but also a causal or non-causal relation. It is also possible to add absolute, unweighted statements; in practice this is done by setting w_i = ∞. This input is combined with a set of causal reasoning rules, the encoding, and an ASP solver finds the model that minimizes the sum of the weights of the violated statements given the input and the encoding.

2.3.1 Causal reasoning rules

The input of ACI is a list of weighted conditional (in)dependence relations and possibly extra background knowledge on ancestral relations between variables. It then uses the axioms of irreflexivity, transitivity and acyclicity, together with a set of eight causal reasoning rules, to infer the ancestral relations between the variables.

Proofs for each causal reasoning rule can be found in the Supplementary Material of [22]. The rules are listed below.


The first two rules come from [15].

X ⊥⊥ Y | W ∪ [Z] ⇒ Z ⇢ ({X, Y} ∪ W)  (4)

X ⊥̸⊥ Y | W ∪ [Z] ⇒ Z ⇢̸ ({X, Y} ∪ W)  (5)

A third rule comes from [18]:

(X ⊥⊥ Y | W) ∧ (X ⇢̸ W) ⇒ X ⇢̸ Y  (6)

Magliacane, Claassen, and Mooij introduce an additional 5 rules.

X ⊥̸⊥ Y | W ∪ [Z] ⇒ X ⊥̸⊥ Z | W  (7)

X ⊥⊥ Y | W ∪ [Z] ⇒ X ⊥̸⊥ Z | W  (8)

(X ⊥⊥ Y | W ∪ [Z]) ∧ (X ⊥⊥ Z | W ∪ [U]) ⇒ X ⊥⊥ Y | W ∪ U  (9)

(Z ⊥̸⊥ X | W) ∧ (Z ⊥̸⊥ Y | W) ∧ (X ⊥⊥ Y | W) ⇒ X ⊥̸⊥ Y | W ∪ Z  (10)

(X ⊥̸⊥ Y | W) ∧ (X ⊥̸⊥ U | W) ∧ (Y ⊥⊥ U | W) ⇒ X ⇢ Y  (11)

With these causal reasoning rules it can be shown that CLCD predictions are in fact a special case of ACI. Consider three variables I, A and B with the (in)dependence relations and assumptions needed for a CLCD prediction, as given in Definitions 1 and 2. From rule (4) and the independence I ⊥⊥ B | A, it follows that A ⇢ W, where W = {I, B}. The assumption needed for LCD was that neither A nor B is a cause of I, so I can be removed from W. The definition of ancestral causal relations with respect to sets of variables dictates that there should be at least one variable in W that is caused by A; since B is the only variable left in W, it must be that A ⇢ B.

The set of causal reasoning rules, which will also be referred to as the encoding, is not necessarily limited to the one presented above. It is possible to add rules if it is proven that more complete causal relations can be found with the addition. Also, a more limited set can be used if so desired; the effect of doing so is examined in Section 5.6. Two different encodings are defined: ACI-simple, which only encodes rules (1)-(5), and ACI-complete, which uses all rules.

It is also possible to use a completely different set of rules, as for example the encoding of Hyttinen et al., which encodes marginalization and conditioning operations directly on d-connection graphs [19]. An advantage of this encoding is that it also allows extra knowledge on the presence or absence of latent confounders to be encoded. However, as this encoding does not work with an ancestral causal structure but rather tries to find all the direct causal links of the underlying DAG, it is infeasible to solve a system of more than 8 variables with it. Magliacane, Claassen, and Mooij point out in [22] that for a system of 7 variables there are 2.3 × 10^15 possible DAGs but only 6 × 10^6 causal ancestral structures. This makes it possible for ACI, with the encoding presented above, to scale to a much larger number of variables than for example [19].

2.3.2 Finding the ancestral causal structure

In the constraint-based minimization formulation of causal discovery of ACI, the causal reasoning rules are hard constraints. For every input statement that violates one of these constraints there is an associated penalty. An ad-hoc loss function can then be formulated as the sum of weights of the violated statements, which was inspired by [19].

The set of all worlds 𝒲 is, in the formulation of ACI, the set of all possible ancestral structures between the variables of interest. Remember that L denotes the set of weighted constraints (w_i, l_i), where l_i is an (in)dependence or (non-)causal relation and w_i the associated weight. The optimisation problem can then be formulated as in [22]:

W* = argmin_{W ∈ 𝒲} Σ_{(w_i, l_i) ∈ L : W ⊭ l_i} w_i  (12)

Here W ⊭ l_i denotes that statement l_i is not satisfied in world W. To solve this optimisation problem Magliacane, Claassen, and Mooij decided to use an ASP solver [11], [23], as it is well suited for the complexity of the constraints representing the causal reasoning rules.

For each ordered pair of variables (X, Y) in the input, ACI uses the ASP solver to find the sum of the weights of violated statements in the best world in which X ⇢ Y, and the same sum in the best world in which X ⇢̸ Y. The former is then subtracted from the latter, so a positive output indicates that X ⇢ Y is more likely than X ⇢̸ Y, and a negative output that X ⇢̸ Y is more likely than X ⇢ Y.

For a system of k variables this means that the ASP solver needs to be called 2k(k − 1) times. When only using independences up to order 1, the number of input statements already amounts to ½·k(k − 1)². A larger input for the ASP solver does not necessarily increase the complexity, but it can give a larger number of conflicts, which does make the problem harder to solve. This all means that the computational cost increases quickly with the number of variables. However, the algorithm works within a reasonable amount of time for systems of up to about a dozen variables. Furthermore, the ASP solver can return its current best answer at any time, so it can be run with a time-out to produce output within a reasonable amount of time, possibly sacrificing some accuracy.
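To make the growth concrete: for the 15-variable systems used later in this thesis (14 functional proteins plus one activator indicator, so k = 15), this amounts to 2k(k − 1) = 2 · 15 · 14 = 420 calls to the ASP solver and ½·k(k − 1)² = ½ · 15 · 14² = 1470 input statements, which matches the 1470 p-values per subset computed in Section 4.1.2.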

2.4 Related Work

The causal discovery methods used in this thesis are constraint-based; another class of methods are the so-called score-based methods. Traditionally, score-based causal discovery algorithms find a graph, or equivalence class of graphs, that fits the data best by maximizing the Bayesian posterior. These methods do not use (in)dependence statements but rather incorporate the information directly from the data itself. Popular scores used to score the possible models are the Bayesian Information Criterion (BIC) [4] and the Akaike Information Criterion (AIC) [3]. A drawback of score-based methods is that they require certain modelling assumptions that might not be very realistic. Also, it is harder to allow for latent confounders using these techniques. Furthermore, score-based techniques rely on some heuristic to search through the vast space of possible models, which grows exponentially with the number of variables.

One of the earliest constraint-based causal discovery algorithms for observational data is the PC algorithm [8]. Instead of searching the large space of possible graphs, it starts from a complete undirected graph where the vertices represent the variables in the system and the edges describe possible causal relations. Using a set of rules and constraints, which are the conditional independence relations between the variables, edges can then be removed and directed. The result of the algorithm is an equivalence class of all the DAGs that could have produced the data. Several algorithms have extended the idea of the PC algorithm; the FCI algorithm [12] is one of the most well known. With the addition of enough rules it has been shown to be a sound and complete algorithm, under the assumptions that faithfulness and the Global Markov property hold and that the input is correct.

Both FCI and PC have a couple of drawbacks. First of all, both algorithms are only suited for purely observational data, and it is not possible to encode any prior knowledge about causal relations in the data. Also, it is assumed that the independence statements used to infer these relations are correct, an assumption that is not realistic, as real world data is inherently noisy and independence tests are not always reliable. A last drawback of PC and FCI is that they do not give one single model, for example in the form of a DAG, that explains the data. While CLCD does allow for data with interventions, it still relies on the assumption of correct inputs together with certain prior knowledge, and can only predict one specific model to explain the data. ACI overcomes all these drawbacks. First of all, it allows for interventions in the data. Also, it can incorporate prior knowledge about the data, for example on the existence or absence of causal relations. Furthermore, by using weighted instead of binary independences, combined with the loss function formulated in Equation (12), both the problem of unreliable inputs and that of ambiguous output are dealt with.


3 The data

The data analysed in this thesis was collected by Bodenmiller et al. as described in [16]. This section describes their experiment and the data. In the experiment of Bodenmiller et al., human white blood cells were treated with different types of stimuli and inhibitors in several dosages. After treatment, the relative abundance of different proteins in the cells was measured using a technique called multiplexed mass cytometry, which was first introduced in [10]. This state-of-the-art technique allows biologists to obtain a sample, containing a large number of variables, for each cell separately. This is a great advancement over older techniques, where biologists only obtained one measurement per experiment, in which the average protein abundance was found across a large number of cells. Single-cell data is not only interesting because of the large sample size, which allows for more statistical tests, but also because different cell types might behave differently, and single-cell data allows biologists to find these different types of behaviour.

3.1 The experimental setup

The experimental setup of [16] consisted of 27 plates, each corresponding to one inhibitor. Each plate contained 8 × 12 wells, each containing a sample of white blood cells. The 8 rows correspond with the 8 different dosages of the inhibitor that were used; one of the dosages was 0 and can be used as a reference. The 12 columns of the plate correspond with the 11 stimuli and one reference column. A visualization of the whole experiment was given in Figure 4 of [16], a copy of which can be seen in Figure 4.

Figure 4: An overview of the experiment copied from Figure 4 in [16]. a) Shows the different plates, each treated with its own inhibitor. b) Gives a detailed view of one plate, each row corresponding with a certain dosage of the inhibitor and each column with an activator. c) Gives a schematic overview of the gating procedure. d) Shows dosage response curves for different functional proteins inhibited by staurosporine.

After treating the cells with a different combination of stimulus and inhibitor dosage in vitro and waiting a while, the cell membranes were permeabilized. Next, the cells were brought into contact with different antibodies, each of which is known to bind a certain protein. To each of the antibodies a unique isotope of a particular type of metal was attached, each with a unique molecular weight. Next, the samples were passed through the mass cytometer, where the cells were brought to extreme heat, and subsequently the mass spectrum of the metal isotopes that remained could be measured. This mass spectrum corresponds with the number of each type of antibody that was bound, which in turn corresponds with the relative abundance of the distinct proteins.


In [16] a total of 35 variables are obtained for each sample. This includes 14 functional proteins¹, amongst which the interactions are of most interest. Also measured are 10 surface markers, which can be used to determine which of 14 sub-populations a cell belongs to. Furthermore, the abundance of 7 barcode markers, which are used to determine which well a sample came from, 2 DNA markers that correspond to cell size, and a time and cell length variable are measured. All variables are summarized in Table 1.

Type        Count  Names
Functional  14     pNFkB, pp38, pStat5, pAkt, pStat1, pSHP2, pZap70 (pSyk), pStat3, pSlp76 (BLNK), pBtk, pPlcg2, pErk, pLat, pS6
Surface     10     CD3, CD45, CD4, CD20, CD33, CD123, CD14, IgM, HLA-DR, CD7
Barcode     7      BC1, BC2, BC3, BC4, BC5, BC6, BC7
DNA         2      DNA-1, DNA-2
Other       2      Time, Cell Length

Table 1: A summary of the different variables measured in the Bodenmiller dataset [16].

3.2 Gating

With a so-called gating procedure the sub-population of a sample was determined by looking at the abundance of the surface markers. The inhibitor dosage and stimulus of a sample, i.e. the well it came from, were determined by looking at the barcode marker signal of the sample. The barcoding and gating scheme reduced variability between samples in the experiment, caused for example by pipetting a little bit more of a drug into one well than into the others, and this reduction was the main contribution of [16]. The procedure is rather complicated, and could be sensitive to selection bias, as for example signal strength is influenced by the type of cell [20].

The fact that it is possible to distinguish between different populations and even sub-populations is another great advantage of single-cell data as obtained with mass cytometry. Because of this, causal relations can be discovered specifically for certain sub-populations, and the strong assumption that all white blood cells have the same causal mechanisms between their proteins does not need to be made. White blood cells are grouped into 5 populations in the Bodenmiller study: monocytes, T-cells, dendritic cells, B-cells and natural killer cells. These are divided into an even finer grouping of 14 sub-populations, some of which are a lot more abundant than others, as can be seen in Table 2. The subset of the Bodenmiller dataset that was analysed in this thesis consisted of almost 5.9 × 10^6 cells. The entire dataset contains samples from 2.4 × 10^7 white blood cells.

Population       Sub-Population   Percentage of Sample
Monocytes        cd14+hladr-      0.7%
                 cd14+hladrhigh   0.4%
                 cd14+hladrmid    13.6%
                 cd14+surf-       1.8%
                 cd14-hladr-      0.6%
                 cd14-hladrhigh   0.7%
                 cd14-hladrmid    2.7%
                 cd14-surf-       12.1%
T-Cells          cd4+             31.8%
                 cd8+             17.6%
Dendritic        dendritic        0.6%
B-Cells          igm+             5.2%
                 igm-             1.3%
Natural Killers  nk               11.0%

Table 2: The percentage of the cells that belong to each (sub-)population for the data that is used in this thesis.

3.3 General remarks on the Bodenmiller data

The effect of the inhibitors on the proteins seems like a very rich source of information for making causal discoveries, as it entails a wide variety of interventions. Despite this fact, in this thesis only the data from wells with 0-dosage of each inhibitor is used. The first reason for doing this is a practical one: the intervention with an activator is simply binary, whereas there are several different dosages for the inhibitors. Second, as the samples corresponding with the 0-dosage of each inhibitor should have been generated by the same DAG, some confidence might be gained by finding the same predictions in multiple inhibitor datasets. A final reason to disregard the cells treated with different amounts of an inhibitor is that the dosage response curves did not look as expected (shown in Figure 4), but rather seemed to contain a lot of noise. Examples of the actual dosage response curves are shown in Figure 20, in the Appendix.

¹ In the monocyte and B-cell sub-populations pSyk was measured instead of pZap70. In the monocyte, dendritic and B-cell sub-populations BLNK was measured instead of pSlp76. So a total of 16 different functional proteins were measured, but never more than 14 at a time.

As mentioned earlier, only the functional proteins are examined in this thesis. A priori one might expect that there are also causal relations involving surface markers. However, an early analysis with the CLCD algorithm showed that these variables only appeared in a very small subset of the predictions made. As ACI does not scale up to 24 variables at a time, it was decided to disregard the surface markers for the rest of this thesis.

Mass cytometry data is interesting as it can give a large number of single-cell samples in which many variables can be measured. However, because the technique is very new, it needs to be treated with care. For example, the cytometer can get clogged with dirt after running for a long time, which causes measurements to vary. Also, the machine might have cooled off at some point, for reasons such as somebody opening the lab door, which can cause the signal strength to drop. These reasons, amongst others, can cause a lot of noise and variability in the data. This can be corrected for in some cases, but it needs to be kept in mind when analysing this dataset that it may contain a lot of noise.


4 Methodology

Three methods using CLCD and four using ACI were applied to the Bodenmiller dataset [16]. The three CLCD methods are slight variations of each other and should give relatively similar results. The four methods using ACI start by trying to reproduce the predictions made by CLCD. After that, the number of variables and the complexity of the input are increased to predict the causal structure of all functional proteins at once.

A general pipeline that is shared among all methods is presented next; it extracts p-values corresponding to independence relations from the data. These (in)dependences are used as input for the algorithms in different ways, described in the specifics of each method. Each method then produces predictions of ancestral causal relations in the Bodenmiller dataset.

4.1 General Pipeline

As the experiments on the Bodenmiller dataset [16] in this thesis all share the same general pipeline, this pipeline is described here. For certain experiments some parts might be changed or swapped with others. The pipeline is also illustrated in Figure 5.


Figure 5: The General Pipeline. First, the data is divided into appropriate subsets in a). Next, b) shows the independence testing. Aggregation of p-values is shown in c). After this the aggregated p-values can be used as input for an algorithm.

This pipeline produces predictions of the form: some protein P_i (indirectly) causes another protein P_j in a certain sub-population Pop_p under a specific stimulus Act_a. These will be referred to as (Act_a, Pop_p, P_i, P_j). When the predictions are aggregated over the activators and populations they will be referred to as (P_i, P_j).

4.1.1 Separate Data

For each plate, activator and sub-population, independence testing is done separately. Only data where the inhibitor dosage is (almost) 0 is used in this thesis. However, to avoid batch effects, data from different plates are not pooled together. It is a reasonable assumption that the activation of proteins may differ under different stimuli, and it is a safe assumption that causal mechanisms may differ between sub-populations. The data separation is illustrated in Figure 5a. This results in 4158 combinations of 11 activators, 14 sub-populations and 27 different inhibitor plates, where the data belonging to a certain sub-population, stimulated with one specific activator, is pooled together with the reference, or unstimulated, data from the same plate. As the variables of interest in this dataset are the functional proteins, all other variables are excluded from the analysis. This means that per sub-dataset there are a total of 14 protein variables and 1 binary indicator variable that indicates whether the cell was treated with the stimulus.

Because very small samples do not give meaningful independence test results, any dataset that contains fewer than 10 samples in either the stimulated or the reference data is skipped. For the sub-population cd14+hladrhigh and activator pVO4, the reference data contained fewer than 10 samples for each plate, so no results can be obtained for this combination. In total, 291 datasets out of 4158 were skipped. An overview of the number of datasets that contained enough samples for each combination of sub-population and activator can be found in Table 2 in the Appendix. If the sample sizes were sufficient, the data from the activated well was pooled together with the data from the reference well in order to do independence testing.

In this way, a total of 14 × 11 × 27 − 291 = 3867 subsets were extracted from the data for independence testing (populations × activators × inhibitors − skipped datasets).
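A minimal sketch of this separation step is shown below (Python with pandas). The column names plate, activator, population and dose, and the label reference for unstimulated wells, are hypothetical; the actual layout of the Bodenmiller files differs.

import pandas as pd

MIN_SAMPLES = 10  # subsets with fewer cells than this are skipped

def separate(cells, proteins):
    """Yield one subset per (plate, activator, sub-population) combination,
    pooling stimulated cells with the unstimulated reference cells of the
    same plate and sub-population (0-dosage wells only)."""
    zero_dose = cells[cells["dose"] == 0]
    for (plate, pop), grp in zero_dose.groupby(["plate", "population"]):
        reference = grp[grp["activator"] == "reference"]
        for act in grp["activator"].unique():
            if act == "reference":
                continue
            stimulated = grp[grp["activator"] == act]
            if len(stimulated) < MIN_SAMPLES or len(reference) < MIN_SAMPLES:
                continue  # too few cells for meaningful independence tests
            subset = pd.concat([stimulated, reference]).copy()
            subset["I"] = (subset["activator"] == act).astype(int)  # intervention indicator
            yield (plate, act, pop), subset[proteins + ["I"]]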

4.1.2 Independence testing

As the constraint-based causal discovery methods in this thesis require independence statements as inputs, the next step in the pipeline is independence testing. Two random variables X and Y are said to be conditionally independent, conditioned on the variables in W, if and only if their joint distribution is the product of their marginal distributions: P(X, Y | W) = P(X | W) P(Y | W). In other words, when the values of the variables in W (which can be empty) are known, additional knowledge of the value of X does not change the distribution of Y and vice versa. Since the true distribution over the variables in the Bodenmiller data is not known, statistical tests are used to determine (in)dependence. Each test gives a p-value corresponding to the null hypothesis that the two variables are independent, so a high p-value indicates independence and a low value dependence. The specific threshold used to decide whether a test indicates dependence or independence is given in the description of each algorithm. Three different tests were used in the research described in this thesis.

Student's t-test. For tests of the form I ⊥̸⊥ P, where I is a binary variable (whether or not a sample was stimulated) and P is a continuous variable (the abundance of a certain protein), a Student's t-test is used. This statistical test can be used to determine whether the distributions of two samples are significantly different from one another. In the context of this research the two samples are the part of the data that has been treated with the stimulus and the part that was not. If the two samples are significantly different, indicated by a low p-value, this suggests that knowledge of whether or not a sample was stimulated changes the distribution of the protein abundances. The SciPy implementation² of the t-test was used in this thesis, where equal variance was not assumed.
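For illustration, such a test could be computed as follows (a sketch; the function name is illustrative and not taken from the thesis code):

from scipy.stats import ttest_ind

def t_test_pvalue(stimulated_abundances, reference_abundances):
    """p-value for a test of the form I independent of P: do the stimulated
    and reference cells have the same distribution of protein abundance?"""
    # equal_var=False gives Welch's t-test, i.e. equal variance is not assumed
    _, p = ttest_ind(stimulated_abundances, reference_abundances, equal_var=False)
    return p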

Logistic test. For tests of the form I ⊥̸⊥ P_i | P_j, where again I is a binary variable and P_i and P_j are continuous, a logistic test is used. This test builds two logistic regression models, one using only P_j to predict I and one using both P_j and P_i. If the predictive power of the model does not change significantly, the p-value of this test is high, indicating that the extra knowledge of P_i does not give any extra information about the distribution P(I | P_j). Pseudocode³ for performing a logistic test and calculating a p-value is provided in Algorithm 2 in the Appendix.
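A sketch of such a test is shown below, using statsmodels instead of the generalized-linear-model routine referenced in the footnote; the deviance difference of the two nested logistic models is compared to a chi-squared distribution with one degree of freedom.

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_test_pvalue(I, P_i, P_j):
    """p-value for the test: I independent of P_i given P_j."""
    X_null = sm.add_constant(np.column_stack([P_j]))        # I ~ P_j
    X_full = sm.add_constant(np.column_stack([P_j, P_i]))   # I ~ P_j + P_i
    fit_null = sm.Logit(I, X_null).fit(disp=0)
    fit_full = sm.Logit(I, X_full).fit(disp=0)
    # twice the log-likelihood gain equals the deviance difference
    dev_diff = 2.0 * (fit_full.llf - fit_null.llf)
    return chi2.sf(dev_diff, df=1)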

Correlation. For all other dependency tests the p-value of the (partial) correlation coefficient is used. Correlation indicates how strongly two variables are associated with each other; partial correlation additionally allows controlling for the effect of other random variables. The correlation coefficient itself is a measure of how large this association is. For (in)dependence testing the magnitude of this association is not of interest; rather, the corresponding p-value, which indicates the probability that the association arises by chance, is used to decide whether or not two variables are dependent. In this thesis a Python port of the Matlab implementation⁴ of (partial) correlation was used to obtain these p-values.
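A residual-based sketch of the (partial) correlation p-value is shown below; it ignores the small degrees-of-freedom correction for the conditioning variables that a dedicated partial correlation routine would apply.

import numpy as np
from scipy.stats import pearsonr

def partial_corr_pvalue(x, y, z=None):
    """p-value of the correlation between x and y, optionally controlling
    for the variables in z (an n x k array)."""
    if z is None:
        return pearsonr(x, y)[1]
    # regress z out of both x and y, then correlate the residuals
    Z = np.column_stack([np.ones(len(x)), z])
    x_res = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    y_res = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return pearsonr(x_res, y_res)[1]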

These tests can only capture linear (in)dependence relationships. This means that it is implicitly assumed that all relations between the variables in the Bodenmiller dataset are linear. This is a rather strong assumption to make about the data, especially as biological data is notorious for having highly non-linear relations. Our collaborators were already using these tests and it was decided to use the same tests for consistency. More intricate independence tests are left for future work.

For every one of the 3867 subsets of data there were 15 variables. Only independence tests up to the first order were performed, as no higher-order tests are used by the different methods. For every combination of activator, population and inhibitor this gives a total of 15 × 14/2 × 14 = 1470 p-values, as there are 15 × 14/2 pairs of variables, and for every one of these pairs 14 possible conditioning sets (each of the 13 remaining variables and the empty set).

² http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html

³ Actually any generalized linear model can be used to perform such a test. An explanation of this can also be found at http://nl.mathworks.com/help/stats/generalizedlinearmodel-class.html#btwom6v-1 under "deviance".

⁴


4.1.3 Aggregate p-values

The DAG that generated the data for the same sub-population and activator can be assumed to be the same across the different inhibitor plates, as the same variables were measured under the same conditions. The only reason not to pool the data over the plates in the first step of the pipeline was to avoid batch effects. To aggregate, the median log p-value over the plates is used as a default. These aggregated p-values are then used as input for a causal discovery algorithm.

Because of the aggregation step, the algorithms themselves need to be called a factor of 27 fewer times. This speed-up is necessary to achieve results in a reasonable amount of time for the more complex methods. It also avoids the non-trivial problem of aggregating the outcomes of the algorithms. Apart from these practical issues, there is no theoretical reason not to use the separate p-values as input for the algorithms and aggregate the results in some way.

Another way to aggregate p-values is to use the mean log p-value over all inhibitors. This measure is more sensitive to outliers, which could exist in real world data, and is therefore considered less reliable. Nonetheless, the effect of using this measure is examined in some experiments.
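A minimal sketch of both aggregation schemes is given below (in Python); whether the aggregate is mapped back to the p-value scale, as done here, is an assumption and not fixed by the text.

import numpy as np

def aggregate_log_pvalues(pvals, method="median"):
    """Aggregate the p-values of one test over the inhibitor plates using the
    median (default) or mean of the log p-values."""
    logs = np.log(np.asarray(pvals, dtype=float))
    agg = np.median(logs) if method == "median" else np.mean(logs)
    return np.exp(agg)  # back on the p-value scale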

At this point there is a set of 1470 aggregated p-values for every one of the 153 combinations of activator and sub-population.

4.1.4 Run an algorithm

Next, we can use these (aggregated) p-values as input for CLCD and ACI, which were explained in Sections 2.2 and 2.3 respectively. The three different CLCD methods are all slight variations on the general pipeline as described above. All four methods using ACI use the default pipeline, but differ in the number of variables and complexity of the input.

4.2 CLCD

The input for all CLCD methods is a set of 6 p-values, which correspond with the different independence tests as described in the general pipeline. The output is a boolean which indicates whether or not all requirements, as defined in Definitions 1 and 2, hold.

To determine whether or not an (in)dependency holds, the p-value of the corresponding test is compared with a threshold. For all CLCD methods the dependence threshold is chosen as α = 0.01 and the independence threshold as β = 0.15. Remember that in order for a dependence to hold, the corresponding p-value needs to be below α, and for an independence above β. For the CLCD methods in this thesis a p-value in the interval [0.01, 0.15] is regarded as indicating neither dependence nor independence.

Three different configurations of the pipeline were used to produce predictions with CLCD:

• Method a) uses the p-values obtained before aggregating over inhibitors. The binary outputs of CLCD were then aggregated by calling a pair a causal pair if it was found in a minimum of 10 inhibitor datasets. Our collaborators originally proposed this method, and it was used as a starting point for the research in this thesis.

• Method b) uses the default configuration of the general pipeline to produce predictions.

• Method c) uses the mean p-value to aggregate over inhibitors. For all other steps in the pipeline the default setting were used.

In Algorithm 1, a schematic overview of the procedure used to make CLCD predictions is shown. To determine the (in)dependences, the aggregated p-values of the corresponding tests are compared to their respective thresholds. For method a) an extra inner loop over the different inhibitors is needed, as well as a counter that keeps track of the number of plates for which the structure holds and a check that compares this counter to a threshold before predicting a certain pair.

4.3 ACI

Four different methods to find causal relationships in the Bodenmiller data [16] using ACI were applied in this thesis. Starting from a method as close to method b) as possible, first the number of variables is increased and next the complexity of the input. The specifics of each method are given below.

For all methods ACI receives as input a set L of weighted conditional (in)dependence statements. Also, prior knowledge about weighted or unweighted ancestral relations can be encoded. As suggested in [22], the weight of each (in)dependence statement was set to the absolute difference between the log p-value of the (in)dependence test and the log value of a threshold, as given in Equation (13). As a default for each of the methods a single threshold α = β = 0.01 was chosen. If a test returned a p-value below this threshold it was seen as a dependence, and otherwise as an independence relation.


Algorithm 1 A schematic overview of the procedure used to make CLCD predictions, for methods b) and c).

for Act_a ∈ Activators do
    for Pop_p ∈ Populations do
        for P_i ∈ Proteins do
            for P_j ∈ Proteins do
                if P_i == P_j then continue end if
                if Act_a ⊥̸⊥ P_i and Act_a ⊥̸⊥ P_j and P_i ⊥̸⊥ P_j and
                   Act_a ⊥̸⊥ P_i | P_j and Act_a ⊥⊥ P_j | P_i and P_i ⊥̸⊥ P_j | Act_a then
                    Predict(Act_a, Pop_p, P_i, P_j)
                end if
            end for
        end for
    end for
end for

A single threshold was chosen as a default because ACI should be able to work out that independences with a p-value close to the threshold are less important than those with a very high p-value, as the latter will be assigned a higher weight than the former. In Section 5.3 the predictions made by ACI with different thresholds are compared with those made by CLCD.

w = |log(p) − log(thresh)| (13)

As output ACI produces a score for each ordered pair of variables (P_i, P_j) in the system. The ASP solver finds the sum of the weights associated with constraints that are violated in a world where P_i ⇢ P_j, and the same sum for a world where P_i ⇢̸ P_j. The first score is then subtracted from the second, so a positive score indicates that ACI has more evidence for P_i ⇢ P_j than for P_i ⇢̸ P_j, a negative score indicates the reverse, and a zero score means that there is an equal amount of evidence for both. The higher the score, the more evidence ACI has found for P_i being an ancestor of P_j.
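A minimal sketch of the weighting of Equation (13) and of how the two solver losses are combined into a score is given below; the actual implementation used in this thesis was provided by the ACI authors, so this is only an illustration.

import numpy as np

LOG_THRESH = np.log(0.01)  # single threshold alpha = beta = 0.01

def constraint_weight(p_value):
    """Weight of an (in)dependence statement, Equation (13)."""
    return abs(np.log(p_value) - LOG_THRESH)

def aci_score(loss_ancestral, loss_not_ancestral):
    """Score for an ordered pair (P_i, P_j): the loss of the best world in which
    P_i is not an ancestor of P_j minus the loss of the best world in which it is;
    positive values favour P_i being an ancestor of P_j."""
    return loss_not_ancestral - loss_ancestral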

The implementation of ACI used in this thesis was provided by Sara Magliacane, one of the authors of [22].

4.3.1 Method i) CLCD with ACI

First, ACI is used to reproduce the CLCD results. To do so, ACI solves a system consisting of two proteins and one activator using all possible independence tests between these three variables. Also, the prior knowledge that neither of the proteins causes the activator is encoded. This results in 91 systems, or lists of weighted input statements, for each activator and sub-population.

Using the causal reasoning rules presented in Section 2.3.1, ACI should assign a positive score to the same (P_i, P_j) pairs as CLCD predicts with method b). As the same general pipeline is used, the p-values used as input for ACI are exactly the same as for CLCD. Instead of giving a binary decision on whether a certain pair of variables is a causal pair, ACI assigns a score to each possible causal pair. Because it allows for uncertainty in the input and can identify more than one causal structure, ACI might find more predictions than CLCD.

4.3.2 Method ii) Using ACI for post processing

The result of the previous method is, for each sub-population and activator, a score for each ordered pair of proteins (P_i, P_j) that can be associated with the confidence of that pair being a causal pair. The aggregate of these scores can be cyclic and is not necessarily closed under transitivity. These properties can be imposed by using the scores found with method i) as input for ACI again.

All protein pairs with positive scores in method i) are encoded as weighted ancestral relations, and the pairs with negative scores as weighted non-ancestral relations. The absolute value of the score from method i) is used as the weight of the input for this method. ACI then finds the ancestral structure over all 14 functional proteins, omitting the activator variable, that is acyclic and closed under transitivity, and that minimizes Equation (12).

4.3.3 Method iii) ACI on all 14 functional proteins

The next method combines the previous two. Instead of first finding scores for each pair of proteins and then using ACI to find the acyclic transitive closure, all weighted (in)dependences for all 14 functional proteins are fed to ACI directly. So instead of solving 91 systems, each consisting of 2 proteins and an activator, separately for each stimulus and sub-population and then post-processing these predictions, there is only one system of 15 variables, including the 14 proteins and the activator.

The hope is that, by receiving all dependency information at once, ACI can more easily include or exclude more complex causal structures. Note that the input here is far from complete: no independence tests with conditioning sets of size larger than 1 were used. Specifically, no conditional independence tests of the form P_i ⊥⊥ P_j | P_k (two proteins conditioned on a third) were included in the input. This means that 378 weighted (in)dependence statements and 14 non-causal statements were the input for ACI with this method.

4.3.4 Method iv) Use all (in)dependencies on all proteins

To include even more information, independence tests between two proteins conditioned on a third were also run. As input for ACI this means that for each sub-population and activator all independence tests with conditioning sets of size up to 1 are available. Compared to the input of method iii), for each of the 91 pairs (P_i, P_j) there are now 12 extra weighted conditional (in)dependence statements, in which the conditioning set consists of one other protein variable.

This results in a total of 1470 (in)dependence statements and 14 non-causing statements as input for ACI. Compared to method iii) this is a much larger number of clauses, but this does not necessarily increase the complexity of the problem. However, as the Bodenmiller dataset is expected to contain a lot of noise, it can be anticipated that the extra information contains many contradictions. Also, if the extra statements contain many independences, these will increase the number of possible causal relations through Rule (4). These two phenomena increase the computational complexity of running ACI. However, as mentioned earlier, the ASP solver can return an answer at any time. The effect of using different time-outs is examined in Section 5.5; the longest time-out used is 200 seconds, and unless stated otherwise the results presented were obtained with this time-out.
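The quoted numbers of input statements for methods iii) and iv) can be checked with a small count over all tests with conditioning sets of size at most one. The snippet below is only a sanity check of this arithmetic; variable names are illustrative.

from itertools import combinations

proteins = [f'P{i}' for i in range(1, 15)]   # the 14 functional proteins
variables = proteins + ['Act']               # plus the activator: 15 variables

def count_tests(exclude_protein_triples):
    n = 0
    for x, y in combinations(variables, 2):
        for cond in [None] + [z for z in variables if z not in (x, y)]:
            # Method iii) leaves out tests of two proteins conditioned on a third.
            if exclude_protein_triples and cond is not None and 'Act' not in (x, y, cond):
                continue
            n += 1
    return n

print(count_tests(True))    # 378  (method iii)
print(count_tests(False))   # 1470 (method iv)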


5 Experiments

Several experiments were conducted to analyse the results of the different methods presented in the previous section. Predictions are represented with heat maps and may be compared visually. Predictions made by CLCD can be validated in an internal way, which shows that these methods provide more consistent predictions than the random baseline. Next, it is shown that CLCD predictions are in fact a special case of the ACI framework, and the predictions made by all methods are compared to an existing database of causal pathways. This is the only way to test the predictions against an external source, as no ground truth is available. After this, some experiments examining the effect of different encodings, aggregation methods and time-outs on the predictions made by ACI give some extra insight into this method.

5.1 Results of different methods

Methods a) - c), using CLCD, give binary decisions on whether, for a certain ordered pair of proteins, the causal relation Pi ⇢ Pj holds in the context of a specific combination of activator and population. Methods i) - iv), which use ACI, all assign a score to every ordered pair of proteins, where a positive score indicates Pi ⇢ Pj and a negative score indicates the absence of such a relation. Every ordered pair with a positive score is called a prediction; no threshold was used. The analysis of the results focuses on positive predictions. Negative predictions were not examined in detail, as no validation was possible for them and CLCD does not produce negative predictions.

As stated before, every method makes a prediction, (Acta, Popp, Pi, Pj), of some pair of proteins having the causal relation Pi ⇢ Pj under stimulus Acta and in sub-population Popp. For validation on the external database, and also for easy visualisation, it is necessary to aggregate predictions over the activators and sub-populations. The following two schemes are used. For CLCD methods, the number of times a pair (Pi, Pj) is predicted is simply counted. ACI results are aggregated by summing all non-negative weights for each pair of proteins over all cell types and stimulations. Negative weights are not taken into account, because finding the absence of some causal relationship under one stimulus does not make discovering the same pair under a different stimulus less relevant. Table 3 shows the number of predictions produced by each method, and the number of unique pairs that are obtained after aggregating.

Method   Number of Predictions   Aggregated Predictions
a)       122                     61
b)       200                     79
c)       88                      52
i)       1647                    223
ii)      1672                    221
iii)     797                     212
iv)      7854                    234

Table 3: Number of predictions for each method, before and after aggregation.
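The two aggregation schemes described above can be summarised in a few lines of code. The sketch below uses illustrative data structures (a list of prediction tuples for CLCD and a dictionary of scores for ACI), not the actual ones used in this thesis.

from collections import defaultdict

def aggregate_clcd(predictions):
    """predictions: iterable of (activator, population, Pi, Pj) tuples."""
    counts = defaultdict(int)
    for act, pop, pi, pj in predictions:
        counts[(pi, pj)] += 1                 # count how often a pair is predicted
    return counts

def aggregate_aci(scores):
    """scores: dict mapping (activator, population, Pi, Pj) to an ACI score."""
    totals = defaultdict(float)
    for (act, pop, pi, pj), s in scores.items():
        if s > 0:                             # negative scores are ignored
            totals[(pi, pj)] += s
    return totals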

As can be seen in Table 3, the number of predictions made varies a lot between the two algorithms. Moreover, the number of predictions made by methods using the same algorithm also differs. A qualitative analysis of the CLCD predictions made by methods a), b), and c) is presented in Section 5.2.1. The predictions made with CLCD and ACI are compared in Section 5.3.

The predictions produced by the different methods can be visualized as heat maps. This can be done separately for each activator and sub-population, but also for the aggregated predictions. To aggregate, the schemes presented above are used. The result of aggregating over both T-cell sub-populations is shown below in Figure 6. The T-cell sub-populations had the largest sample sizes on average, as can be seen in Table 2, and the results on these sub-populations are regarded as the most reliable. Results are also visualized for the cd4+ sub-population combined with each activator in Figure 23 and for cd8+ in Figure 24, all of which can be found in the Appendix. These low-level heat maps only show predictions of methods i) - iv). In all heat maps, rows represent cause proteins and columns effect proteins.

The heat maps nicely show how the predictions gradually change across the different methods. Comparing the heat maps of method b) and method i) in Figure 6, it seems that the predictions made with CLCD are indeed a special case of those made with ACI; this is further confirmed in Section 5.3. Methods i) - iii) all produce very similar heat maps, which is not very surprising as their respective inputs contain a similar amount of information. Method iv) seems to produce a very different set of predictions; however, as shown in Section 5.5, these results might not be reliable. In the next subsection the CLCD methods will be compared in more detail.


[Seven heat-map panels, (a)-(c) and (i)-(iv), one per method; both axes list the 14 functional proteins (pS6, pLat, pErk, pPlcg2, pBtk, pSlp76, pStat3, pZap70, pSHP2, pStat1, pAkt, pStat5, pp38, pNFkB), with a separate colour scale per panel.]

Figure 6: Visualisation of results for the different methods over the T-cell sub-populations and stimuli. Rows represent cause proteins and columns effect proteins.

When examining the heat maps at the low level, before aggregating over sub-populations and activators, as shown in Figures 23 and 24, a couple of things stand out. First of all, it can be seen in more detail that the predictions produced by methods i) - iii) differ only slightly. Also, there is quite some variation between the results of the two sub-populations under the same stimulus and with the same method. This is observed not only for the different types of T-cells but also for the other populations, and supports the hypothesis that different sub-populations are governed by different causal mechanisms. Another thing that stands out is that method iii) produces no positive predictions for all but three activators in the cd8+ sub-population. This could be caused by the fact that with this method ACI incorporates more information at once, so that the weights associated with positive predictions turn out lower. However, it is strange that the same effect is not observed for cd4+. The results of method iv) on cd8+ look very noisy. This could be caused by the ASP solver timing out for many of the queries in these systems, which is supported by the fact that the cd4+ results look much more reliable for method iv) and that the solver did not time out for most of the queries there. A more detailed analysis of varying the time-out for method iv) is given in Section 5.5.

5.2 Validating CLCD predictions

Predictions produced by the three different CLCD methods were validated in two ways: on other datasets and by an internal validation. These forms of validation are mostly a sanity check to see whether the output of the algorithms is consistent; they do not compare the predictions to any form of ground truth.

To put the results of a validation method into perspective, a random baseline was created in a stratified manner for each method. This was done by predicting n different random ordered pairs of proteins for each activator Acta and population Popp, where n is the number of predictions made by the CLCD method for activator Acta and population Popp. Every validation was iterated over 10000 different random baselines for each method.
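The sketch below shows one way such a stratified baseline could be drawn; the function and variable names, and the container holding the per-combination prediction counts, are assumptions made for illustration.

import random
from itertools import permutations

proteins = [f'P{i}' for i in range(1, 15)]
all_pairs = list(permutations(proteins, 2))   # all ordered protein pairs

def random_baseline(n_predictions, rng=random):
    """n_predictions maps (activator, population) to the number of predictions
    the CLCD method made for that combination."""
    baseline = []
    for (act, pop), n in n_predictions.items():
        for pi, pj in rng.sample(all_pairs, n):   # n distinct random ordered pairs
            baseline.append((act, pop, pi, pj))
    return baseline

# The validation statistic is then averaged over, e.g., 10000 such baselines.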

5.2.1 Qualitative comparison of CLCD predictions

Before the results of CLCD are validated, the differences between the three methods are examined in detail. As can be seen in Table 3, the number of predictions made by the three CLCD methods varies considerably. The Venn diagram in Figure 7 shows how much these sets of predictions intersect and differ.

The lower number of predictions made by method c) compared to method b) could be explained by outlying p-values toward the higher end of the range [0, 1] causing necessary dependences to fail, or conversely by outliers toward the lower end causing necessary independences to fail. Inspection of the p-values showed that in most cases the second hypothesis explained the difference. However, as can be seen in Figure 7, not all predictions made by method c) are included in the set of predictions made by method b), so these two explanations are not sufficient to account for the difference in predictions. Of course, outliers do not only cause (in)dependences to fail, but can also have the opposite effect of introducing (in)dependence relations that are not found when using the median log p-value.

Figure 7: A Venn diagram showing the intersections and differences of the numbers of predictions made by the different CLCD methods.

Figure 7 shows that the three methods, which use the same CLCD algorithm and only slightly different pipelines to produce causal predictions on the Bodenmiller [16] dataset, agree on only a very small number of triplets. If the independence relations could be obtained from an oracle (producing binary decisions instead of p-values), the predictions would be the same for each method. Assuming that the DAG generating the samples obtained from different plates is the same, the (in)dependence relations that hold are also the same in each subset of the data. Whether these binary decisions are then aggregated before running CLCD, or the binary outputs of CLCD are aggregated afterwards, the resulting predictions would be the same. Likewise, changing the aggregation method would not change the predictions made by each method. The most logical explanation for the discrepancy is therefore the unreliability of the statistical tests used to determine (in)dependence, which could be caused by the data being very noisy. Another explanation could be that the linearity assumption made by all the independence tests is too strong, making their outcomes unreliable.

5.2.2 Validating on other datasets

CLCD predictions were validated on other datasets also gathered in [16], in which the same proteins were measured in different donors and after various time intervals, and on a similar mass cytometry dataset described in [14]. For this validation method it is assumed that CLCD predictions have the causal structure described by the DAG in Figure 8, rather than the ADMG described in Figure 2; that is, it is assumed that Pi and Acta are not confounded. For each prediction (Acta, Popp, Pi, Pj), this DAG as well as every other possible DAG over the three variables is scored with an exact scoring mechanism as presented in [6]. The percentage of predictions for which the DAG presented in Figure 8 has the highest score is reported. For the random baseline, the average percentage over 10000 iterations is reported. The idea for this type of validation is credited to our collaborators; the code was also kindly provided by them.

[DAG: Acta → Pi → Pj]

Figure 8: A DAG corresponding to a CLCD prediction, as assumed for validation.
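The validation procedure just described can be sketched as follows: enumerate all DAGs over the three variables, score each of them, and check whether the DAG of Figure 8 wins. In the sketch below, score_dag stands in for the exact scoring mechanism of [6] and is assumed rather than implemented; the other names are illustrative.

from itertools import product

def is_acyclic(nodes, edges):
    """Check acyclicity by repeatedly removing nodes without incoming edges."""
    nodes, edges = set(nodes), set(edges)
    while nodes:
        roots = {n for n in nodes if all(b != n for (_, b) in edges)}
        if not roots:
            return False
        nodes -= roots
        edges = {(a, b) for (a, b) in edges if a not in roots}
    return True

def all_dags(nodes):
    """Enumerate all DAGs over the given labelled nodes (25 DAGs for 3 nodes)."""
    candidate_edges = [(a, b) for a in nodes for b in nodes if a != b]
    for mask in product([False, True], repeat=len(candidate_edges)):
        edges = {e for e, keep in zip(candidate_edges, mask) if keep}
        if is_acyclic(nodes, edges):
            yield edges

def fraction_figure8_best(predictions, data_for, score_dag):
    """Fraction of predictions for which the Figure 8 DAG scores highest."""
    hits = 0
    for act, pop, pi, pj in predictions:
        data = data_for(act, pop)
        target = {(act, pi), (pi, pj)}          # Act -> Pi -> Pj, as in Figure 8
        best = max(all_dags([act, pi, pj]), key=lambda e: score_dag(data, e))
        hits += (best == target)
    return hits / len(predictions)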

The results of this validation can be seen in Figure 9 for method b). In the plots on the left side, green bars indicate the percentage of predictions produced by CLCD for which the DAG of Figure 8 scores highest, yellow bars represent the mean percentage for the random baselines, and the error bars show the standard deviation of the random baseline. The plots on the right show the percentages for the five networks that most frequently scored highest. Similar plots for methods a) and c) can be found in the Appendix in Figure 21 and Figure 22.

Figure 9a shows the validation of the predictions on samples provided by 8 different donors. Figure 9b shows the validation on samples that were measured at different time points. Again, the random baseline is outperformed at all time points; predictions perform best when tested on data measured after 30 minutes. This is not surprising, as the data analysed in this thesis was also treated with stimuli for 30 minutes. The data from different donors and time points was all collected by [16]. A study conducted by Bendall, Simonds, Qiu, et al. collected a very similar dataset [14] from two donors. Only a subset of the predictions can be validated on this
