Causal Compatibility of Directed Acyclic Graphs for Real-World Data


MSc Artificial Intelligence

Master Thesis

Causal Compatibility of

Directed Acyclic Graphs for

Real-World Data

by

Tim Ian Milo Smit

10779744

July 24, 2020

Number of Credits: 48, November 2019 - July 2020

Supervisor:

Dr. Patrick Forré

Assessor:

Prof. dr. Joris Mooij


Abstract

A Causal Bayesian Network is of great importance to ensure that predictive models, based on data, are unbiased. This is of particular importance whenever predictive models make predictions on very personal data. The outcome of such a prediction model can have a significant impact on an individual’s life, especially when the model makes a wrong prediction. To ensure a fair model, a method is proposed that uses the causal beliefs of data experts to construct a probable Causal Bayesian Network of the data. These networks are constructed by using the Inflation Technique as an oracle for causal compatibility. This technique can be applied either to verify the experts’ proposals when they align, or to examine which proposal is the correct one when the experts disagree. This process of testing proposals with the Inflation Technique is applied to a dataset provided by the Municipality of Amsterdam that concerns the collection of social benefits.

We provide an extensive description of the Inflation Technique, in which the technique is investigated, implemented and, where deemed necessary, adjusted. Our contributions are threefold. Firstly, a framework is proposed in which the Inflation Technique can be used to test multiple inflated Directed Acyclic Graphs to find the best possible assignment for causal compatibility. Secondly, several improvements of the Inflation Technique are discussed: we suggest which types of graphs are most useful to inflate, and we introduce the addition of extra latent variables to graphs that are otherwise impossible to inflate. Our final contribution is a step-by-step plan to construct the most probable set of causal assumptions out of the sets of causal assumptions proposed by experts.


Contents

1 Introduction 3

2 Background 6

2.1 Graphs . . . 6

2.2 Bayesian Networks . . . 7

2.2.1 Structural Causal Model . . . 9

2.3 Causal representation . . . 11

2.3.1 Intervention . . . 11

2.3.2 Counterfactuals . . . 12

2.4 Causal Learning . . . 12

2.4.1 Data . . . 12

2.4.2 Causal inference assumptions . . . 13

2.4.3 Constraint-based causal learning . . . 14

2.4.4 Score-based causal discovery . . . 15

3 Inflation Technique 16

3.1 Inflation graph . . . 16

3.2 Properties of the inflated graph . . . 17

3.3 Construction of causal compatibility constraints . . . 20

3.3.1 Marginal Problem . . . 22

3.4 Completeness of the Inflation Technique . . . 23

3.4.1 The inflation hierarchy for triangle correlation scenarios . . . 24

3.4.2 The inflation hierarchy for correlation scenarios . . . 25

3.4.3 Convergence of the Inflation Technique . . . 26

4 Methodology 27

4.1 Inflation of proposal graph . . . 27

4.2 Extracting information from an inflated graph . . . 28

4.2.1 Injectable sets . . . 28

4.2.2 Ai-expressible sets . . . 28

4.3 Linear Problem . . . 29

5 Experimental Design 31

5.1 Data . . . 31

5.1.1 Synthetic data . . . 31

5.1.2 Real-world data . . . 31

5.2 Experiments . . . 32

5.3 Evaluation . . . 33

6 Results 34

6.1 The useful inflation graph . . . 34

6.1.1 The inflation of leaf nodes . . . 34

6.1.2 The most useful inflation . . . 35

6.2 Inflation for non-correlation scenario graphs . . . 36

6.2.1 The addition of latent variables . . . 37

6.3 The Inflation Technique for real-world data . . . 39

6.3.1 Fraud detection concerning housing . . . 39

6.3.2 Fraud detection concerning debt . . . 41

7 Conclusion 44


1 Introduction

On an ordinary morning at the statistics department Onderzoek Informatie Statistiek (OIS) of the Municipality of Amsterdam, two colleagues discuss the newest unemployment rates. There are certain minorities that stand out, which could indicate that there is social inequality in the city. However, as one of the colleagues points out, if the unemployment rates are compared within specific age groups, the disparity could diminish, or perhaps even disappear. This is a classic example of Simpson’s paradox (Judea Pearl, 2000). In this case, Simpson’s paradox could arise because of divergent age distributions within the different minority groups. This could potentially result in an overestimation, or misinterpretation, of statistical differences. Simpson’s paradox is merely one example of a phenomenon that the statistics department has to be aware of on a daily basis. Different forms of selection bias, common confounders and correlations by chance are other instances that can interfere with making sound decisions and policies as a government institute. The ability to check for causal relations instead of correlational relations can therefore be an important tool to ensure that the conclusions drawn from data are the correct ones.

A correlational relation is the statistical relationship between two random variables. Such statistical relationships can indicate predictive power, which can be exploited by a predictive model. In the research field of Artificial Intelligence (AI), such predictive models are able to obtain high accuracies on designated test sets and have therefore become an important tool for modeling real-world data. However, the example of Simpson’s paradox illustrates that solely basing decisions on correlations can be problematic. To make proper decisions, an understanding of the causal relations between random variables is imperative. This understanding is even more important when the predictive model makes decisions concerning humans. When humans are concerned, a wrongly or unfairly allocated prediction can severely affect lives. To make sure that no one is negatively affected by wrong or unfair predictions, the Municipality of Amsterdam wants to ensure that their predictive models capture the correct causal relations within the data. A causal relation is the influence one event can exert on the occurrence of another event. This influence is often stated as the cause and effect relationship. For example, if A influences B it is said that “A causes B”, thus A is the cause and B is the effect. Causal Inference is the research field that studies cause-effect relationships in data. From these cause-effect relationships, a predictive model can be built that is able to not only make predictions of future events, but also provide insights into interventional distributions and answer counterfactual questions. These insights make a predictive model based on causal relations more explainable. This increase in explainability allows predictive models to more clearly indicate unfair allocations and allows users to adjust the predictive models to be fairer.

The process of systematically finding causes and their effects has been actively researched over the past decades (Judea Pearl, 2000; Peters, Janzing, & Schölkopf, 2017; Spirtes, Glymour, & Scheines, 2000). Techniques obtained by research into causal inference have been applied throughout science. In research fields such as biology and physics (Chang R, 2015; Fritz, 2012), causal inference is used to infer causal relations from measurements. Other applications can be found in Artificial Intelligence (AI), where causal inference is used to infer causal graph structures that can be used to build a predictive model (Janzing, Peters, Mooij, & Schoelkopf, 2012; Louizos et al., 2017). In recent years, these predictive models have been used to debias AI models by answering counterfactual questions (Helwegen, 2019; Miao, Geng, & Tchetgen Tchetgen, 2018). For research purposes such as these, it is of great importance to correctly infer causal relations. To ensure that causal relations are correctly inferred, the focus of this thesis is the Causal Compatibility Problem (CCP). The CCP entails that a set of proposed causal relations and a data distribution are causally compatible if the proposed causal relations are sufficient for, and non-contradictory with, this specific data distribution. In other words, when causal compatibility is answered in the positive, it is possible that the causal relations were the generation process of the corresponding data distribution.

To provide a systematic procedure to answer the question “Is a set of causal relations causally compatible with a set of data points?”, a method along the lines of Karl Popper’s philosophy is investigated. Karl Popper argued that science should adopt a methodology based


on falsifiability (Popper, 1959). For example, “Are all swans white?” is a question that one might have asked oneself as a kid. This question is hard to answer in the positive, because to do so the color of all swans needs to be verified. However, to answer this question in the negative, only one counterexample is sufficient to prove that not all swans are white. As this example illustrates, no finite number of experiments can ever prove a theory, but only one experiment is needed to refute one. The work proposed in this thesis addresses the CCP by systematically searching for incompatibilities between the causal relations and the data: looking for the black swans. Whenever no incompatibilities are found, it is plausible that the causal relations could have been the generation process of a certain data distribution.

In causal inference, the structure of causal relations is often captured in a Directed Acyclic Graph (DAG) (Judea Pearl, 2000). Generally, answering the CCP on a dataset is difficult, because there is a vast number of possible DAGs that could capture the causal relations. Although finding the correct DAG is difficult, it is possible to find inequality constraints that are able to falsify DAGs (Judea Pearl, 2013). For a DAG to pass an inequality constraint is a weak requirement for compatibility: a “weak” requirement means that it is a necessary requirement, but not a sufficient one. Such inequality constraints can thus be used to conclude that a DAG does not fit the data. Additionally, it is possible to systematically check multiple of these inequality constraints. The more of these constraints are passed, the more probable it is that the data fits the DAG (Navascues & Wolfe, 2017). Such inequality constraints can be found by inflating a graph: the systematic copying of nodes in a graph, such that the ancestral structure is retained. It has been shown that when the inflation of a graph is taken to an ever-higher-order test of causal compatibility, the number of falsely compatible DAGs asymptotically goes to zero (Wolfe, Spekkens, & Fritz, 2019). This proves that the Inflation Technique indeed solves the CCP.

Although the Inflation Technique solves the CCP, it has two major flaws that limit its usefulness in real-world scenarios. Firstly, the search space for causal compatibility grows exponentially with the number of inflations, so much so that it already poses a problem for very small graphs. Additionally, the cardinalities of the observed variables need to be finite: when the cardinality of an observed variable increases, the search space for causal compatibility grows exponentially. This makes the Inflation Technique intractable even for small DAGs with low-cardinality variables. Secondly, the Inflation Technique works best for a special subset of DAGs: the correlation scenario. Correlation scenario DAGs are bipartite graphs with one part latent variables and one part observed variables, in which only causal relations from latent variables to observed variables are allowed (Fritz, 2012). Although every DAG can be rewritten into a correlation scenario DAG, this is not possible without increasing the search space. These unfortunate characteristics limit the use of the Inflation Technique in real-world scenarios.

In this thesis we research the application of the Inflation Technique to answer the CCP. The goal of this research is to find out whether the Inflation Technique can be of use when testing causal assumptions behind real-world data. To do so, the Inflation Technique is implemented and experiments are devised to research the possibility of using the Inflation Technique in real-world scenarios. Where deemed necessary, the Inflation Technique is extended and the implementation is tested on synthetic data as well as on real-world data provided by the Municipality of Amsterdam. To guide our research, three research questions are devised:

1. Can the search space of the Inflation Technique be reduced by only inflating the “useful” parts of a DAG?

2. Can we generalize the Inflation Technique to work for a set of DAGs that is more general than the correlation scenario DAGs?

3. By using the results of the first two research questions, is it possible to apply the Inflation Technique in a useful manner to real-world data?

In the experiments we designed, tests of causal compatibility are performed on proposed causal relations. These causal relations are devised by experts of the concerned field. If the proposed causal relations are deemed incompatible, small adjustments are made to the causal graph until


they are compatible. If multiple options for the causal relations are deemed compatible, the set of causal relations containing the fewest edges is chosen via Occam’s razor. Consequently, the compatible DAG with the fewest edges is assumed to contain the true causal relations.

This thesis will be structured as follows: in section 2 the required background knowledge on the subject is discussed. Subsequently, in section 3 the Inflation Technique is described in detail. Then, in section 4, the implementation and design choices of the Inflation Technique are discussed. Finally, in section 5 we discuss the experiments and in section 6 the results of these experiments.


2 Background

Causal inference is the research field that studies the inference of cause-effect relationships in data. In this section, we discuss the concepts that underpin our methodology and introduce causal inference as a field of research. First of all, in section 2.1, we present a graphical structure called the directed graph, which can be used to represent and reason with causal relationships. Secondly, in section 2.2, we explain how cause-effect relationships can be incorporated in directed graphs. To do so, certain definitions and properties are introduced, in particular the Structural Causal Model (SCM). Thirdly, in section 2.3, we motivate the main advantages of an SCM over a statistical model: the ability to intervene on variables and to answer counterfactual questions. To conclude, in section 2.4, several methods for causal discovery are discussed: constraint-based causal discovery and score-based causal discovery.

2.1 Graphs

Graphs are structures that are widely used to represent relations between instances. In research fields concerning statistical or causal modeling, these relations are statistical or causal in nature. A graph represents these relations by linking nodes (or vertices) via edges (or links) to other nodes. Here, edges can be seen as bridges carrying relational information between two nodes. For statistical and causal modeling, the structure that arises from these edges has multiple uses:

1. They provide convenient means of expressing substantive assumptions,

2. they facilitate economical representations of joint probability functions and

3. they facilitate efficient inferences from observation (Judea Pearl, 2000).

The relational information carried by edges can either be directed or undirected. A directed edge represents information only going in one specific direction and an undirected edge represents information that is able to pass in both directions. In this thesis, only graphs with directed edges (Directed Graphs Definition 2.1) are considered.

To avoid ambiguity, let us first define the set notation used in this thesis: capital bold letters (e.g. A) denote a set of variable sets, capital letters (e.g. A) denote a variable set and lower-case letters (e.g. a) denote an instance of a variable set.

Definition 2.1 (Directed Graph). A directed graph G = (V, E) consists of nodes V = {v1, . . . , vd} and directed edges E ⊂ V² connecting nodes to each other.

In a directed graph G, the nodes that have a directed edge to node vi are called the parents of vi (see Equation 1). Nodes that receive a directed edge from vi are called the children of vi (see Equation 2). A path in a directed graph G is defined as a sequence of nodes {vi, . . . , vj} such that there is a directed edge between every pair of consecutive nodes in that sequence. Additionally, a path where all the edges face in the same direction is called a directed path. The ancestral set of node vi is the set of all nodes that have a directed path terminating in vi (see Equation 3). Conversely, the descendant set of node vi is the set of all nodes reached by a directed path originating in vi (see Equation 4).

Set of parents of vi := PAG(vi) = {vj | (j, i) ∈ E} (1)

Set of children of vi := CHG(vi) = {vj | (i, j) ∈ E} (2)

Set of ancestors of vi := ANG(vi) = {vj | there exists a directed path from vj to vi} (3)

Set of descendants of vi := DEG(vi) = {vj | there exists a directed path from vi to vj} (4)

For most purposes in causal inference, a directed graph is too general a definition, since these graphs allow for cyclic structures – directed paths with the same start and end node – which complicate the inference process. To avoid these complications, a set of graphs is defined that excludes cycles: Directed Acyclic Graphs (DAGs) (Definition 2.2).
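As a sketch, the four relation sets of Equations 1-4 can be computed directly from an edge list. The function names and the example graph below are our own, purely for illustration:

```python
# Hypothetical helpers computing the relation sets of Equations 1-4
# for a directed graph given as a set of edges (i, j), meaning v_i -> v_j.

def parents(E, v):
    """PA_G(v): nodes with a directed edge into v (Equation 1)."""
    return {i for (i, j) in E if j == v}

def children(E, v):
    """CH_G(v): nodes receiving a directed edge from v (Equation 2)."""
    return {j for (i, j) in E if i == v}

def ancestors(E, v):
    """AN_G(v): nodes with a directed path terminating in v (Equation 3)."""
    result, frontier = set(), parents(E, v)
    while frontier:
        u = frontier.pop()
        if u not in result:
            result.add(u)
            frontier |= parents(E, u)
    return result

def descendants(E, v):
    """DE_G(v): nodes reached by a directed path originating in v (Equation 4)."""
    result, frontier = set(), children(E, v)
    while frontier:
        u = frontier.pop()
        if u not in result:
            result.add(u)
            frontier |= children(E, u)
    return result

# Example: the graph 1 -> 2 -> 3 with an extra edge 1 -> 3.
E = {(1, 2), (2, 3), (1, 3)}
print(parents(E, 3))      # {1, 2}
print(ancestors(E, 3))    # {1, 2}
print(descendants(E, 1))  # {2, 3}
```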


Definition 2.2 (Directed Acyclic Graph). A directed acyclic graph G = (V, E) consists of nodes V = {v1, . . . , vd} and directed edges E ⊂ V² connecting nodes to each other. For every directed path {vi, . . . , vj} in G, all nodes in the path must be distinct, i.e. vk ≠ vm for every pair of different nodes vk, vm in the path.
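The acyclicity condition of Definition 2.2 can be checked with a standard topological-sort criterion (Kahn's algorithm): a directed graph is a DAG exactly when every node can be scheduled after all of its parents. The sketch below, with names of our own choosing, illustrates this:

```python
# Minimal acyclicity check via Kahn's algorithm: repeatedly remove nodes
# with no remaining parents; a cycle leaves some nodes unremovable.

def is_dag(V, E):
    in_degree = {v: 0 for v in V}
    for (_, j) in E:
        in_degree[j] += 1
    queue = [v for v in V if in_degree[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for (i, j) in E:
            if i == v:
                in_degree[j] -= 1
                if in_degree[j] == 0:
                    queue.append(j)
    return seen == len(V)  # all nodes scheduled <=> no directed cycle

print(is_dag({1, 2, 3}, {(1, 2), (2, 3)}))          # True: a chain
print(is_dag({1, 2, 3}, {(1, 2), (2, 3), (3, 1)}))  # False: a 3-cycle
```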

2.2 Bayesian Networks

Bayesian networks are statistical models that use directed graphs to represent the conditional dependencies between the nodes. The term “Bayesian Network” was coined by J. Pearl (1985) to emphasize three aspects:

1. The subjective nature of the input information,

2. the reliance on Bayes’s conditioning as the basis for updating information and

3. the distinction between causal and evidential modes of reasoning (Judea Pearl, 2000).

In this section, we will define and explain Bayesian networks, as well as discuss several properties that are related to these networks. Most notably, we will discuss Markov properties and Structural Causal Models (SCM), that can be used to model Bayesian networks.

Before a Bayesian network is formally defined, we will define (conditional) independence relations between random variables. Two random variables X1 and X2 are called independent if the occurrence of one does not affect the probability of occurrence of the other. In the event that random variables X1 and X2 are independent, the independence condition states that the joint distribution of X1 and X2 is equal to the product of the individual distributions (Equation 5). A classic example of independence between two random variables are the first and second toss of a fair coin – a coin that has equal probabilities to land on heads or tails – where the outcome of the first toss does not influence the outcome of the second toss.

p(x1, x2) = p(x1)p(x2)   ∀(x1, x2) ∈ (X1, X2) (5)

Two random variables X1 and X2 are called conditionally independent given a third random variable Y when there is an absence of dependence between X1 and X2, given that Y is known (conditioned on). The conditional independence condition states that the joint distribution of X1 and X2, conditioned on Y, is equal to the product of the probability distribution of X1, conditioned on Y, and the probability distribution of X2, conditioned on Y (Equation 6). To provide an example of conditionally independent variables, we extend the fair coin toss described earlier. The two tosses of the coin are independent of each other because of the knowledge that it is a fair coin. Let us introduce a third event Y, where a coin is drawn from a hat that contains one fair coin and one coin that is biased to land on heads. After the coin is drawn, it is tossed twice (events X1 and X2). By adding event Y to the experiment, events X1 and X2 become dependent. This dependency arises from the fact that whenever the first toss results in heads, the second toss has a higher probability to also land on heads. This increased probability is the result of the extra knowledge obtained from the first toss, which increases the probability that the biased coin was drawn in event Y. However, in the event that the outcome of Y is known (is conditioned on), this phenomenon disappears and X1 and X2 become independent again.

p(x1, x2|y) = p(x1|y)p(x2|y)   ∀(x1, x2, y) ∈ (X1, X2, Y) (6)

The notion of (conditional) independence is imperative for all Markov properties. One of these properties is the Local Markov property (Definition 2.3). This property is said to hold when, for a given DAG G and a joint probability distribution PX over the observed variables in G, every variable is independent of its non-descendants when conditioned on its parents.
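The coin-and-hat example above can be checked numerically. The following sketch simulates the experiment (the biased coin's heads probability of 0.9 is an assumed value for illustration) and estimates the quantities in Equations 5 and 6:

```python
import random

random.seed(0)

# Simulate the hat example: Y picks the fair coin (p = 0.5 heads) or a
# biased coin (assumed p = 0.9 heads); X1, X2 are two tosses of that coin.
N = 200_000
samples = []
for _ in range(N):
    y = random.random() < 0.5            # True: the biased coin was drawn
    p = 0.9 if y else 0.5
    x1 = random.random() < p
    x2 = random.random() < p
    samples.append((x1, x2, y))

def prob(pred):
    return sum(pred(s) for s in samples) / N

# Marginally, X1 and X2 are dependent: p(x1, x2) exceeds p(x1) p(x2).
p_joint = prob(lambda s: s[0] and s[1])
p_prod = prob(lambda s: s[0]) * prob(lambda s: s[1])
print(p_joint > p_prod)  # the independence condition of Equation 5 fails

# Conditioned on Y (here: the fair coin), the tosses are independent
# again, as in Equation 6.
fair = [s for s in samples if not s[2]]
n = len(fair)
p_joint_f = sum(1 for s in fair if s[0] and s[1]) / n
p_prod_f = (sum(1 for s in fair if s[0]) / n) * (sum(1 for s in fair if s[1]) / n)
print(abs(p_joint_f - p_prod_f) < 0.01)  # Equation 6 holds approximately
```

Analytically, p(x1 = heads, x2 = heads) = 0.5(0.81 + 0.25) = 0.53, while p(x1 = heads) p(x2 = heads) = 0.7 · 0.7 = 0.49, so the marginal dependence is clearly visible in the simulation.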

Definition 2.3 (Local Markov Property). Given a DAG G and a joint distribution PX, this distribution is said to satisfy the local Markov property with respect to the DAG G if each variable is independent of its non-descendants given its parents.


With the Local Markov Property defined, the definition of a Bayesian network can be given. As stated in Definition 2.4, a Bayesian network is a DAG G paired with a probability distribution PX, such that the Local Markov Property holds.

Definition 2.4 (Bayesian network). A Bayesian network is a DAG G = (V, E) of random variables V = {X1, . . . , Xn}, paired with a probability distribution over these random variables, such that the local Markov property holds.

The conditional dependencies represented in a Bayesian network allow for efficient factorization of the joint distribution. This efficient factorization is the result of a Bayesian network satisfying the Markov factorization property (Definition 2.5), which can remove a potentially large number of conditioning variables through factorization. To illustrate, an example of the Markov factorization of a DAG is given in Equation 7. If the joint distribution had been factorized without knowledge of the Bayesian network, it would have been reduced to p(c|a, b)p(b|a)p(a), whereas now it is reduced to p(b|a)p(c|a)p(a). A Bayesian network satisfies the Markov factorization property due to the local Markov property, which requires the set of parents of Xi to be a sufficiently large set to condition on in order to render Xi independent of all its non-descendants.

p(a, b, c) = p(b|a)p(c|a)p(a) (7)

[Figure accompanying Equation 7: a DAG with edges A → B and A → C.]

Definition 2.5 (Markov Factorization Property). Given a DAG G and a joint distribution PX, PX is said to satisfy the Markov factorization property with respect to the DAG G if and only if

p(x1, . . . , xn) = ∏i p(xi | pai), (8)

where pai are instances of the parents of random variable Xi.
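The factorization of Equation 7 can be made concrete for the DAG with edges A → B and A → C. The probability tables below are made-up numbers, used only to show that assembling the joint from p(a), p(b|a) and p(c|a) yields a valid distribution:

```python
from itertools import product

# Markov factorization for the DAG A -> B, A -> C (Equation 7).
# All probability tables are illustrative, assumed values.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_a = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}  # p_c_given_a[a][c]

def joint(a, b, c):
    """p(a, b, c) = p(b|a) p(c|a) p(a), as in Equation 7."""
    return p_b_given_a[a][b] * p_c_given_a[a][c] * p_a[a]

# The factored terms sum to one over all eight assignments, so the
# factorization defines a proper joint distribution.
total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(abs(total - 1.0) < 1e-12)  # True
```

Note the saving promised by Equation 8: the full table p(a, b, c) needs 7 free parameters, while the factored form needs only 5 (one for p(a), two each for p(b|a) and p(c|a)).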

A distribution over random variables can be described by multiple Bayesian networks. A DAG is called Markov Compatible (Definition 2.6) with a joint distribution PX if PX can be factorized as described in Equation 8.

Definition 2.6 (Markov Compatibility). If a joint probability distribution PX admits the Markov factorization relative to a DAG G, we say that G represents PX, that G and PX are compatible, or that PX is Markov relative to G (Judea Pearl, 2000).

The compatibility of the joint distribution of the data PX with a Bayesian network of DAG G is a necessary and sufficient condition for DAG G to be able to explain the data. Being able to explain the data means that there is a set of weights over the edges of the DAG that can generate a distribution identical to PX. To check whether a Bayesian network G is compatible with PX is to check whether the set of independencies and conditional independencies induced by G holds. This set of independencies and conditional independencies can be inferred from the graph G through directional separation (d-separation) (Judea Pearl, 1998).

The d-separation criteria are sufficient to identify the (conditional) independence relations in a Bayesian network G. d-separation is an algorithm that inspects all the different paths between two nodes (or groups of nodes) and provides two rules to indicate whether information is blocked (see Definition 2.7). When all paths between two nodes are blocked, the two nodes must be independent of one another.

Definition 2.7 (d-separation). In a DAG G, a path between nodes Ai and Aj is blocked by a set S ⊆ V \ {Ai, Aj} whenever there is a node Ak on the path for which one of the following statements holds:

1. Ak ∈ S and the path meets Ak in a chain or a fork:

Ak−1 → Ak → Ak+1, or Ak−1 ← Ak ← Ak+1, or Ak−1 ← Ak → Ak+1,

2. neither Ak nor any descendant of Ak is in S, and the path meets Ak in a collider:

Ak−1 → Ak ← Ak+1.

Furthermore, in a DAG G, it is said that two disjoint subsets of nodes A and B are d-separated by a third subset of nodes S if every path between a node in A and a node in B is blocked by S. This is denoted by

A ⊥G B | S.

The d-separation criteria for a DAG G imply conditional independence via the Global Markov Property. This property states that whenever two disjoint sets A and B are d-separated by a set S (disjoint from A and B), A and B are conditionally independent given S (see Definition 2.8).

Definition 2.8 (Global Markov Property). Given a DAG G and a joint distribution PX, PX is said to satisfy the Global Markov Property with respect to the DAG G if and only if

A ⊥G B | S =⇒ A ⊥⊥ B | S for all disjoint node sets A, B and S.
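The two blocking rules of Definition 2.7 can be turned into a brute-force d-separation check for small DAGs: enumerate every path between two nodes and test whether some intermediate node blocks it given the conditioning set. This is only a sketch for tiny graphs (path enumeration is exponential in general), with function names of our own choosing:

```python
# Brute-force d-separation (Definition 2.7) for small DAGs given as edge sets.

def descendants(E, v):
    out, frontier = set(), {j for (i, j) in E if i == v}
    while frontier:
        u = frontier.pop()
        if u not in out:
            out.add(u)
            frontier |= {j for (i, j) in E if i == u}
    return out

def paths(E, a, b, visited=None):
    """All undirected paths from a to b without repeated nodes."""
    visited = (visited or []) + [a]
    if a == b:
        yield visited
        return
    neighbours = {j for (i, j) in E if i == a} | {i for (i, j) in E if j == a}
    for n in neighbours:
        if n not in visited:
            yield from paths(E, n, b, visited)

def blocked(E, path, S):
    for k in range(1, len(path) - 1):
        prev, mid, nxt = path[k - 1], path[k], path[k + 1]
        collider = (prev, mid) in E and (nxt, mid) in E  # prev -> mid <- nxt
        if collider:
            if mid not in S and not (descendants(E, mid) & S):
                return True   # rule 2: a collider outside S blocks the path
        elif mid in S:
            return True       # rule 1: a chain or fork inside S blocks it
    return False

def d_separated(E, a, b, S):
    return all(blocked(E, p, S) for p in paths(E, a, b))

# Chain A -> C -> B: conditioning on C blocks the only path.
E = {("A", "C"), ("C", "B")}
print(d_separated(E, "A", "B", set()))   # False
print(d_separated(E, "A", "B", {"C"}))   # True

# Collider A -> C <- B: blocked by default, unblocked by conditioning on C.
E2 = {("A", "C"), ("B", "C")}
print(d_separated(E2, "A", "B", set()))  # True
print(d_separated(E2, "A", "B", {"C"}))  # False
```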

The Global Markov Property, together with the Local Markov Property (Definition 2.3) and the Markov Factorization Property (Definition 2.5), make up the Markov properties. These properties hold for every Bayesian network. When the distribution PX has a density p, these three properties are equivalent (Peters et al., 2017).

2.2.1 Structural Causal Model

The interpretation of a Bayesian network as a way of representing independence assumptions does not necessarily imply causation in the network. When a representation of a causal nature is desired, a subset of Bayesian networks that only allows for causal relations is used: Causal Bayesian Networks. A causal Bayesian network can be represented with a Structural Causal Model (SCM) (Definition 2.9). SCMs can be seen as an abstraction of the underlying data generation processes that take place over time. In other words, SCMs encode causal relations on a set of random variables, where a node represents a random variable and a directed edge represents a causal relation between two nodes. The represented relation is called the “cause and effect” relation: it is said that “A causes B” whenever there is a directed edge from A to B, and that “A is an effect of B” whenever there is a directed edge from B to A.

Definition 2.9 (Structural causal model). A structural causal model (SCM) C := (S, PN) consists of a collection S of d (structural) assignments

Xj := fj(PAj, Nj), j = 1, . . . , d, (9)

where PAj ⊆ {X1, . . . , Xd} \ {Xj} are called the parents of Xj; and a joint distribution PN = PN1,...,Nd over the noise variables, which we require to be jointly independent; that is, PN is a product distribution.


The graph G of an SCM is obtained by creating one vertex for each Xj and drawing directed edges from each parent in PAj to Xj, that is, from each variable Xk occurring on the right-hand side of Equation 9 to Xj. We henceforth assume this graph to be acyclic (Peters et al., 2017).
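Sampling from an SCM amounts to drawing the independent noise variables and evaluating the structural assignments of Equation 9 in topological order. The toy model below (two binary variables with an assumed 10% "flip" noise) is our own illustrative construction:

```python
import random

random.seed(1)

# A toy SCM (Definition 2.9) with assignments
#   X1 := f1(N1)      = N1
#   X2 := f2(X1, N2)  = X1 XOR N2
# where N1 ~ Bernoulli(0.5) and N2 ~ Bernoulli(0.1) are independent noise.
# Structure and noise probabilities are assumed, for illustration only.

def sample():
    n1 = random.random() < 0.5   # noise N1
    n2 = random.random() < 0.1   # noise N2: a rare "flip"
    x1 = n1                      # X1 := f1(N1)
    x2 = x1 ^ n2                 # X2 := f2(X1, N2)
    return int(x1), int(x2)

draws = [sample() for _ in range(100_000)]
agree = sum(x1 == x2 for x1, x2 in draws) / len(draws)
print(0.85 < agree < 0.95)  # X2 copies X1 about 90% of the time
```

The implied graph G has the single edge X1 → X2, obtained exactly as described above: an edge from each right-hand-side variable to the assigned variable.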

Representing causal relations in a graph, instead of the more general (conditional) dependency relations, has a couple of advantages. Firstly, the relations represented in a causal graph contain more information, since causal relations capture the “real” influences between random variables. This results in a better understanding of the model, due to the tendency of people to think in terms of cause and effect rather than in dependencies. Secondly, local changes in the causal model correspond to small changes in the SCM. As a result, an SCM is better suited to predict changes in a real-world setting, for example the effect of an intervention (see Section 2.3.1), or the answer to a counterfactual question (see Section 2.3.2).

The definition of an SCM as provided in this thesis does not allow for cycles. The exclusion of cycles is the direct result of the acyclicity assumption generally made in causal inference, which states that the data generation process is acyclic in nature. When reasoning about SCMs, acyclicity is assumed to avoid infinite feedback loops in the inference process. However, acyclicity is a strong assumption, because there are plenty of real-world scenarios in which one can identify a causal feedback loop. An example of such a causal feedback loop is blood clotting, where a chemical is released that activates platelets in the blood to clot after tissue is injured. Whenever a platelet has clotted the blood, it starts to release a chemical to signal other platelets to clot until the wound is covered (Guyton & Hall, 1991). Therefore, the process of blood clotting contains cycles, which could be represented as in Figure 1. Since the assumption of acyclicity might not be correct for every dataset, research on SCMs with cycles is actively being done (Forré & Mooij, 2018; Rothenhäusler, Heinze, Peters, & Meinshausen, 2015), but it is beyond the scope of this thesis.

Figure 1: A causal representation of the blood clotting process.

It should be noted that an SCM is, in a sense, too flexible a term: when given a joint probability distribution PX, there are multiple SCMs that could entail this distribution (Proposition 2.0.1). This makes the search for the correct SCM nearly impossible, and feasible only when further assumptions are made. In section 2.4.2 these assumptions will be further discussed.

Proposition 2.0.1. Consider a random vector X = {X1, . . . , Xd} with distribution PX that has a density with respect to the Lebesgue measure, and assume it is Markovian with respect to G. Then there exists an SCM C = (S, PN) with graph G that entails the distribution PX (Peters et al., 2017).

To avoid confusion in the rest of this thesis, a distinction between observed (“known”) and unobserved, or latent, (“unknown”) variables is made. Observed variables are denoted as X = {X1, X2, . . . , Xm} and unobserved variables as U = {U1, U2, . . . , Uk}. When a distinction between observed and unobserved is not necessary, a set of variables is denoted as A = {A1, A2, . . . , Ak}. In figures, nodes representing observed random variables are depicted in green, unobserved random variables in blue and general random variables in yellow (see Figure 2).


(a) Observed Xi   (b) Unobserved Ui   (c) Undefined Ai

Figure 2: The representation of variables in this thesis.

2.3 Causal representation

The different kinds of information that can be captured in an SCM vastly exceed the abilities of a general statistical model. Among other things, with an SCM we are able to predict what would happen under an intervention in the model. This characteristic makes SCMs very useful for a wide range of research domains (Section 2.3.1). In addition, SCMs can be used to answer counterfactual questions. Such questions are usually of the form “What would have happened if another decision had been made instead?” and can provide additional information to help make future decisions (Section 2.3.2).

2.3.1 Intervention

A well-known example to illustrate intervention is the causal dependency of the two random variables altitude (A) and temperature (T ). This dependency of altitude and temperature can be represented as in Figure 3: if the altitude increases, the temperature drops. Thus, a change in altitude causes a change in temperature and a change of temperature could be the effect of a change in altitude.

Altitude (A) → Temperature (T)

Figure 3: An SCM representation of the relation between altitude and temperature

The dependencies represented in an SCM are assumed to be the only dependency relations between the nodes (Principle 2.1). This insight into all dependencies makes it possible to predict the effect of an intervention, which we illustrate with the following example. Assume an earthquake hits a city located 1000 meters above sea level. Because of a landslide, the land underneath the city slides down and suddenly the city is at sea level: an intervention that lowers the altitude of the city. When the landslide occurs, the correct model SCM C would predict an increase in temperature (Figure 3). To illustrate the opposite effect, assume the landslide did not happen and the city is still 1000 meters above sea level. Then assume that global warming occurs and the temperature increases, which can be seen as an intervention on the temperature. According to the SCM, this intervention will not result in a change of the city's altitude. The absence of a change in altitude is corroborated by the absence of a directed path from temperature to altitude in Figure 3 of SCM C.

Changing a city's altitude or temperature is an example of an intervention that is hard to effectuate in the real world and thus hard to measure. With the correct SCM, however, it is possible to intervene on the model and provide a possible scenario of what will happen if such events do occur. Additionally, whenever there is a potential for intervention, it can provide knowledge about the underlying structural model. Whenever a variable is unaltered after an intervention, it is either independent of the intervened variable or in the ancestral set of the intervened variable. Whenever a random variable is altered by an intervention, it has to be in the descendant set of the intervened variable. Thus, we can conclude: interventional data provides information on the structural ordering of the random variables in an SCM.

The intervention on variables can be a powerful tool for causal discovery. Judea Pearl (2000) formalized do-calculus, which enables the modeling of intervention via the do-operator. The do-operator fixes the value of a variable, thus providing the ability to intervene in a causal structure. The do-operator's ability to model intervention is shown in Equations 10 and 11. Equation 10 is identical to the probability p(a), because fixing the temperature does not influence the altitude. Furthermore, Equation 11 is identical to p(t|a), since altitude influences temperature.

p(a|do(t)) = p(a) (10)

p(t|do(a)) = p(t|a) (11)
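The two do-statements above can be checked empirically on a toy simulation of the altitude–temperature SCM. The mechanism below (temperature dropping roughly 6.5 degrees per kilometre of altitude, three possible altitudes) is a hypothetical illustration, not part of the thesis; intervening is modeled by overriding one structural assignment while leaving the others untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def sample(do_a=None, do_t=None):
    # Structural assignments: A is exogenous, T depends on A.
    a = rng.choice([0.0, 1000.0, 2000.0], size=N) if do_a is None else np.full(N, float(do_a))
    t = 20.0 - 6.5 * a / 1000.0 + rng.normal(0.0, 1.0, size=N)
    if do_t is not None:              # do(T = t0) overrides T's assignment,
        t = np.full(N, float(do_t))   # but A's generating process is untouched
    return a, t

a_obs, t_obs = sample()
a_int, _ = sample(do_t=0.0)       # Eq. 10: p(a | do(t)) = p(a)
_, t_int = sample(do_a=1000.0)    # Eq. 11: p(t | do(a)) = p(t | a)

print(np.mean(a_obs), np.mean(a_int))                    # both near 1000
print(np.mean(t_int), np.mean(t_obs[a_obs == 1000.0]))   # both near 13.5
```

The simulation confirms that intervening on the effect leaves the cause's distribution unchanged, while intervening on the cause reproduces the observational conditional.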

2.3.2 Counterfactuals

In addition to the construction of an interventional distribution, an SCM is also able to provide counterfactual statements. Such counterfactual statements are of the form: "If action x had been taken, effect y would, or would not, have happened". To illustrate this concept, imagine being on a holiday in the south of France, where a decision needs to be made: will you take the eastern route (which is longer, but has a lower probability of traffic jams), or the western route (which is shorter, but has a higher probability of traffic jams)? You choose the western route, but an hour into the journey you encounter a traffic jam that takes 2 hours to clear. One reaction to this situation could be: "If we had taken the eastern route, we would have been there by now!". This reaction is a counterfactual statement that uses the newly obtained information in the SCM to analyse the interventional distribution, while the rest of the distributions stay unchanged.

Definition 2.10 (Counterfactuals). Consider an SCM C := (S, PN) over nodes X. Given some observation xi, we define a counterfactual SCM by replacing the distribution of noise variables:

C_{Xi=xi} := (S, P_{N|Xi=xi})    (12)

The new set of noise variables need not be jointly independent anymore. Counterfactual statements can now be seen as do-statements in the new counterfactual SCM (Peters et al., 2017).
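The recipe behind Definition 2.10 (abduce the noise from the observation, intervene, predict) can be made concrete in a minimal sketch. The linear mechanisms and observed values below are hypothetical, chosen only to show the mechanics:

```python
# Toy additive-noise SCM (hypothetical): X := N_x,  Y := 2*X + N_y.
# Observation: X = 1, Y = 5.

# 1. Abduction: recover the noise values consistent with the observation.
n_x = 1.0              # from X := N_x
n_y = 5.0 - 2 * 1.0    # from Y := 2*X + N_y, so N_y = 3

# 2. Action: intervene with do(X := 2) in the modified model.
x_cf = 2.0

# 3. Prediction: push the *same* noise through the intervened assignments.
y_cf = 2 * x_cf + n_y
print(y_cf)  # 7.0 -- "had X been 2, Y would have been 7"
```

Note that step 1 is exactly the replacement of the noise distribution in Equation 12: the noise is no longer free, but pinned down by what was observed.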

2.4 Causal Learning

The data used in causal inference can be partitioned into two types: observational data and interventional data (Section 2.4.1). Observational data is solely based on observations, so there is no possibility to retrieve the interventional distributions. For interventional data, on the contrary, it is possible to work with interventions; interventional data therefore contains a lot more structural information. For both types of data, not all causal relations can be inferred without assumptions on the data generation process. Consequently, assumptions on the generation process of the data are made (Section 2.4.2).

In causal inference, causal learning methods are used for the estimation of an SCM. Generally, these methods can be partitioned into constraint-based methods and score-based methods. Constraint-based methods (Section 2.4.3) look for (conditional) independencies in a graph in order to infer causal relations between random variables. These relations act as constraints to which a joint probability distribution needs to adhere. When such constraints cannot be satisfied by the data, this data could not have been generated by the proposed SCM, so the proposed causal relations cannot be true. Score-based methods (Section 2.4.4), on the other hand, assign a score to the ease with which an SCM can fit the data. The model whose weights were the easiest to infer is then concluded to be the best potential model to have generated the data.

2.4.1 Data

In causal inference, data can be partitioned into observational data and interventional data. Observational data is obtained without any interference with the system, thus only measuring the combinations of variables as they occur. Since multiple graphs could produce the same data distribution, recovering the true causal structure from observational data alone is impossible; only when assumptions on the additivity of noise are made, there are findings that suggest directionality can be found. In contrast, interventional data results from intervening in the data generation process. When collecting interventional data, certain variables can be set to a certain value (intervened on) without manipulating the generation process that came before. Intervention makes the intervened variables independent of all their ancestors and is therefore an important tool to distinguish ancestral variable relations from descendant variable relations. In short, a dataset with the ability to intervene carries much more information about the potential SCM C than an observational dataset does.

2.4.2 Causal inference assumptions

Causal learning is the recovery of the causal relations by which the data was generated. To recover an SCM C from a joint data distribution PX, it is problematic that potentially multiple SCMs could entail the distribution PX (Proposition 2.0.1). Since multiple SCMs C could entail a joint distribution PX, we should specify the compatibility of an SCM C with the joint distribution PX. A joint distribution PX is called compatible with an SCM C if there exists a choice of the causal parameters that entails the joint distribution PX (Definition 2.11). Without any assumptions, the search for the correct SCM C is impossible.

Definition 2.11 (Compatibility). A given distribution PX is compatible with a given causal structure G if there is some choice of the causal parameters that yields PX. A given family of distributions on a family of subsets of observed variables is compatible with a given causal structure if and only if there exists some PX such that both

1. PX is compatible with the causal structure, and

2. PX yields the given family as marginals (Navascues & Wolfe, 2017).

As mentioned in Section 2.2, the search for correlation in a dataset is relatively easy: one compares the joint distribution of two variables with the product of their individual probability distributions. If these distributions are not equal, a correlational relation is present. Reichenbach's Common Cause Principle states that, given that the data selection process was unbiased, every correlation is the result of some causal relation (Reichenbach, 1956). This principle, however, does not specify the structure of the causal relation. Such a causal structure could be from A to B, from B to A, solely a correlation effect induced via a common cause (Figure 4 a-e), or a combination of the above. These different potential SCMs all entail the same probability distributions over the random variables A and B and are therefore indistinguishable. Correlation can also emerge when the sampling of the data was biased. This type of correlation is due to selection bias (Figure 4f). Selection bias can occur when only one group in C is used to sample the data from. The ability to differentiate between these different forms of correlation is one of the main focuses of the causal inference research field.
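The comparison of a joint distribution with the product of its marginals can be scripted directly. The sketch below uses hypothetical binary data with a common cause C, and flags dependence whenever the empirical joint deviates from the product of marginals by more than a small tolerance:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical common cause C inducing correlation between A and B.
c = rng.integers(0, 2, n)
a = (c ^ (rng.random(n) < 0.1)).astype(int)  # noisy copy of C
b = (c ^ (rng.random(n) < 0.1)).astype(int)  # noisy copy of C

def dependent(x, y, tol=0.01):
    """Compare the empirical joint p(x, y) against the product p(x)p(y)."""
    for i in (0, 1):
        for j in (0, 1):
            p_joint = np.mean((x == i) & (y == j))
            p_prod = np.mean(x == i) * np.mean(y == j)
            if abs(p_joint - p_prod) > tol:
                return True
    return False

print(dependent(a, b))                      # True: correlated via C
print(dependent(a, rng.integers(0, 2, n)))  # False: freshly drawn, no common cause
```

As Reichenbach's principle warns, this test only detects the correlation; it cannot tell which of the structures in Figure 4 produced it.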

Before the construction of an SCM C from a joint distribution PX, an assumption is made about the independencies present in the generation process. As described in Peters et al. (2017), the independence of mechanism assumption states that one module's output (the structural assignments of C) may influence another module's input, but altogether the modules are independent (Principle 2.1). The independence of mechanism ensures that all the relations that are found in the data need to be represented in the SCM.

Principle 2.1 (Independence of mechanism). The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes does not inform or influence the other conditional distributions.

An SCM C is said to be faithful to the joint distribution PX if the (conditional) independence relations induced by the joint distribution PX imply d-separation relations in the corresponding


[Figure panels: (a) A causes B; (b) B causes A; (c) common cause C; (d) A causes B and common cause; (e) B causes A and common cause; (f) selection bias]

Figure 4: Different possibilities when correlation is found betweenA and B

DAG G. The assumption that the causal generation process is faithful with respect to a DAG G reduces the search space of potential DAGs dramatically: the only potential DAGs are those that also satisfy the converse of the Global Markov Property. Furthermore, assuming Causal Minimality is to assume that the smallest DAG that is compatible with the joint distribution PX is the most likely one. This assumption is in the spirit of Occam's Razor, which states that the simplest solution is most likely the right one. When we assume that an SCM is faithful and satisfies the Markov Properties, then this SCM satisfies Causal Minimality as well.

Definition 2.12 (Faithfulness and Causal Minimality). Consider a joint distribution PX and a DAG G.

1. PX is faithful to the DAG G if

Xi ⊥⊥ Xj | Xk  =⇒  Xi ⊥⊥_G Xj | Xk    (13)

for all disjoint node sets Xi, Xj and Xk (Peters et al., 2017).

2. A distribution satisfies causal minimality with respect to G if it is Markovian with respect to G, but not to any proper subgraph of G (Peters et al., 2017).

Assuming Independence of mechanism, Faithfulness and Causal Minimality is not sufficient to single out a potential SCM that fits the data distribution. Thus, we make another assumption: when the data generation process is assumed to be an Additive Noise Model (ANM) (Definition 2.13), the direction of cause can potentially be inferred. If we also assume that no node in the ANM is a constant, but has to be a function of its input, an ANM satisfies Causal Minimality as well.
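A minimal sketch of how the additive-noise assumption can reveal direction (a RESIT-style procedure, not a method from this thesis): regress each variable on the other and measure how strongly the residual still depends on the putative cause. The cubic mechanism, the polynomial regression and the crude rank-histogram estimate of mutual information are all hypothetical choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical ANM, X -> Y:  Y := X + X**3 + N  (nonlinear, additive noise).
x = rng.normal(0.0, 1.0, n)
y = x + x**3 + rng.normal(0.0, 1.0, n)

def mi(u, v, bins=12):
    """Crude mutual-information estimate (nats) on rank-transformed data."""
    u = np.argsort(np.argsort(u))   # ranks give uniform margins
    v = np.argsort(np.argsort(v))
    p, _, _ = np.histogram2d(u, v, bins=bins)
    p /= p.sum()
    pu = p.sum(axis=1, keepdims=True)
    pv = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (pu @ pv)[mask])))

def residual_dependence(cause, effect, deg=3):
    """Fit effect = f(cause) + residual; score the residual's remaining
    dependence on the putative cause (low score = plausible direction)."""
    resid = effect - np.polyval(np.polyfit(cause, effect, deg), cause)
    return mi(cause, resid)

fwd = residual_dependence(x, y)  # correct direction: residual ~ pure noise
bwd = residual_dependence(y, x)  # wrong direction: residual depends on y
print(fwd < bwd)
```

For this sample the forward direction yields the (near-)independent residual, matching the claim that additive noise makes the causal direction identifiable outside the linear-Gaussian case.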

Definition 2.13 (Additive Noise Model). We call an SCM C an Additive Noise Model (ANM) if the structural assignments are of the form

Xi := fi(PAi) + Ni,   i = 1, . . . , n    (14)

that is, if the noise is additive. For simplicity, we further assume that the functions fi are differentiable and the noise variables Ni have a strictly positive density (Peters et al., 2017).

2.4.3 Constraint-based causal learning

The constraint-based methods for causal learning make use of constraints to decide causal compatibility. These constraints need to hold in order for the joint probability distribution PX to be compatible with a DAG G. A constraint-based method is usually based on the (conditional) independence relations enforced by the DAG G. From such independence relations, (in)equalities can be constructed that are a necessary requirement on the joint probability distribution PX.

One way to gather such constraints is through different patterns induced via conditional independence relations. Whenever all variables are observed, all constraints that are possibly necessary can be captured solely by the conditional independence relations imposed by the joint distribution (Verma & Pearl, 1990). However, in general not all variables that influence a causal structure are known, causing different causal models to be compliant with the data. This diversity due to latent variables makes some causal models indistinguishable by independency relations (Figure 5).

Figure 5: Two DAGs that induce the same set of independence constraints on the observational variables but which are empirically different.

Whenever causal models with latent variables do not imply a difference in independence relations between the observed variables, one can look for inequality constraints on the observed distribution (Judea Pearl, 2013). To this end, Instrumental Variables (IVs) can play a key role for causal inference, by inducing inequality constraints on the data. IVs are exogenous variables that directly affect some variables, but not all (Bowden & Turkington, 1985). To distinguish graphs as in Figure 5, the IVs (X2 in Figure 5a and X1 in Figure 5b) can impose inequality constraints on the joint distributions. As described in (Judea Pearl, 2013), the IV can help us construct an inequality constraint for the probability distribution as follows:

∑_y max_z p(x, y | z) ≤ 1.    (15)

Such constraints are verifiable and form a necessary condition for a graph to be compatible with the data. The Inflation Technique described in the next chapter uses this idea of searching for (conditional) independence relations for the construction of (in)equality constraints (Section 3).
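Equation 15 is straightforward to evaluate given a conditional distribution table. The sketch below checks it for two hypothetical conditional distributions p(x, y | z) over binary X, Y and a binary instrument Z; the second one violates the constraint and thus cannot come from an IV model:

```python
import numpy as np

def instrumental_ok(p):
    """Pearl's instrumental inequality (Eq. 15):
    for every x,  sum_y max_z p(x, y | z) <= 1.
    `p` has shape (z, x, y) and each p[z] sums to 1."""
    best = p.max(axis=0)                       # max over z, shape (x, y)
    return bool(np.all(best.sum(axis=1) <= 1.0))

# A hypothetical conditional distribution that satisfies the inequality ...
p_good = np.array([[[0.4, 0.1], [0.3, 0.2]],    # p(x, y | z = 0)
                   [[0.1, 0.4], [0.2, 0.3]]])   # p(x, y | z = 1)

# ... and one that violates it: z flips y while x = 0 stays dominant.
p_bad = np.array([[[0.90, 0.05], [0.03, 0.02]],
                  [[0.05, 0.90], [0.03, 0.02]]])

print(instrumental_ok(p_good))  # True
print(instrumental_ok(p_bad))   # False: incompatible with an IV structure
```

Such a check is the simplest instance of the recipe the Inflation Technique generalizes: derive an inequality from the structure, then test the data against it.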

2.4.4 Score-based causal discovery

Score-based methods for causal discovery do not use the independencies of a graph, but assign a score to the ability of a DAG to fit the data. The reasoning behind score-based methods is that a DAG G that does not match the generating structure will not be able to fit the data distribution properly. Most of these methods are optimization problems of the form:

Ĝ := arg max_{DAG G over X} S(D, G),    (16)

where the most probable graph is the graph that attains the highest fit score. However, we will not use these methods in this thesis.


3 Inflation Technique

The Causal Compatibility Problem (CCP) asks whether a provided DAG – possibly involving latent variables – constitutes a genuinely plausible causal explanation for a provided joint probability distribution over the DAG's observed variables (Wolfe et al., 2019). In this section we discuss the Inflation Technique as an answer to the CCP (Navascues & Wolfe, 2017). The Inflation Technique addresses the CCP by systematically constructing (in)equality constraints that are necessary, but not sufficient, requirements for a provided joint probability distribution to be compatible with a provided DAG. Since these (in)equality constraints are not sufficient requirements, a single constraint only gives a weak assurance of causal compatibility. By passing more and more of such weak requirements, the drawn compatibility conclusion becomes more robust.

This section is structured as follows: firstly, in Section 3.1, we discuss the inflated graph needed for the Inflation Technique. Secondly, in Section 3.2, we describe the information that can be inferred from this inflated graph. Thirdly, in Section 3.3, the constraints needed to check for causal compatibility are introduced in the form of a linear program. Lastly, we discuss the findings in Wolfe et al. (2019), which state that the Inflation Technique solves the CCP when the inflation is of sufficient magnitude.

3.1 Inflation graph

The Inflation Technique, as described by Navascues and Wolfe (2017), inflates a DAG to be able to extract (in)equality constraints. In this section, these inflated DAGs and the notation that is used will be defined. Along the lines of the original paper, we discuss the Inflation Technique using correlation scenario graphs as proposal DAGs. These correlation scenario DAGs are bipartite graphs with one part latent variables and one part observed variables, where all edges are directed from latent variables to observed variables (Fritz, 2012). Examples of correlation scenario graphs can be found in Figures 6 and 7.

The Inflation Technique uses a systematic copying method to obtain an inflated graph that retains, for every node, the ancestral subgraph of its corresponding original node (Definition 3.1). In this thesis, a copied node is represented identically to the original node with the addition of a copy index, noted as a superscript of the original node (e.g. the third copy of node Ai is denoted Ai^3).

Definition 3.1 (Ancestral subgraph). The ancestral subgraph of node Xi (AnSub^G(Xi)) in DAG G is the DAG that contains node Xi and the nodes AN_G(Xi) as nodes, and all the edges between these nodes in G as edges.

Graph G′ is called an inflated graph of G if, for every node Xi^j in G′, the generation process is identical to that of the corresponding node Xi in G (Definition 3.2). As a result, every node Xi^j in the inflated graph G′ should have an ancestral subgraph identical to that of its corresponding uninflated node Xi, up to the copy indices. With identical up to the copy indices we mean that there exists a bijection that connects all the inflated nodes to their corresponding original nodes, such that all the nodes are connected in precisely the same way. We denote identical up to copy indices as an equal sign with a dot on top (≐).

Definition 3.2 (Inflated Graph). A graph G′ is said to be an inflation of G if for every variable Ai^j in G′ the ancestral subgraph AnSub^{G′}(Ai^j) is identical up to copy indices to the ancestral subgraph of Ai (AnSub^G(Ai)):

G′ ∈ Inflation(G) iff ∀ Xi^k ∈ XG′ : AnSub^{G′}(Xi^k) ≐ AnSub^G(Xi)

Equivalent to the above statement is:

G′ ∈ Inflation(G) iff ∀ Xi^k ∈ XG′ : PA^{G′}(Xi^k) ≐ PA^G(Xi)
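The parental condition of Definition 3.2 is easy to check mechanically. The sketch below encodes the graph of Figure 6 as parent sets (nodes written as "name^copy", a naming convention of this sketch) and verifies that every inflated node's parents, after dropping copy indices, equal the original node's parents. Comparing parent sets by base name only is a simplification of the ≐ relation, sufficient for this small example.

```python
# Original graph G of Figure 6a: U1 -> {X1, X2}, U2 -> {X2, X3}.
G = {"X1": {"U1"}, "X2": {"U1", "U2"}, "X3": {"U2"},
     "U1": set(), "U2": set()}

# Candidate inflation G' of Figure 6b (nodes as "name^copyindex").
G_inf = {
    "U1^1": set(), "U1^2": set(), "U2^1": set(),
    "X1^1": {"U1^1"}, "X1^2": {"U1^2"},
    "X2^1": {"U1^1", "U2^1"}, "X2^2": {"U1^2", "U2^1"},
    "X3^1": {"U2^1"},
}

def base(node):
    return node.split("^")[0]  # drop the copy index

def is_inflation(orig, inflated):
    # Definition 3.2: parents must match the original up to copy indices.
    return all({base(p) for p in parents} == orig[base(v)]
               for v, parents in inflated.items())

print(is_inflation(G, G_inf))  # True: G' is a valid inflation of G
```

Rewiring any single edge in `G_inf` (say, removing U1^2 from the parents of X2^2) makes the check fail, since that node's parents would no longer project onto those of X2.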


For clarification purposes, an example of an inflated graph is provided in Figure 6. This example copies nodes X1, X2 and U1, while retaining the ancestral subgraph for every copied node. As stated in Definition 3.2, one can check that the graph in Figure 6b is a proper inflation of the graph in Figure 6a by verifying that the parents of every node in the inflated graph are identical up to copy indices to those of the corresponding node in the original graph.

[Figure panels: (a) an example DAG G with observed nodes X1, X2, X3 and latent nodes U1, U2; (b) an inflated DAG G′ with nodes U1^1, U1^2, U2^1, X1^1, X1^2, X2^1, X2^2, X3^1]

Figure 6: An example of a graph and an inflation of it

An SCM contains a graph and the corresponding weights of the probability functions on that graph (Definition 2.9). When this graph is inflated, the copied nodes also copy the structural functions of their corresponding nodes in the original graph. By copying the structural functions, the generation process of every node in the inflated SCM is identical to the generation process of the corresponding node in the original SCM. In other words, the probability distribution of a copied node is identical to the probability distribution of the corresponding original node. This equality is due to the equality of the ancestral subgraph of a node in the inflated graph and the ancestral subgraph of the corresponding original node. The inflated SCM C′ is said to be an inflation of SCM C if it is observationally equivalent, meaning that both structures admit precisely the same set of compatible distributions over their observed variables (see Definition 3.3).

Definition 3.3 (Inflated Structural Causal Model). Consider an SCM C with corresponding DAG G and an SCM C′ with corresponding DAG G′, where G′ is an inflation of G. Then C′ is said to be the G → G′ inflation of C if and only if, for every node Xi^j in G′, the manner in which Xi^j depends causally on its parents within G′ is the same as the manner in which Xi depends causally on its parents within G. Thus, satisfying

∀ Xi^j ∈ X′ : P(Xi^j | PA^{G′}(Xi^j)) = P(Xi | PA^G(Xi))    (17)

3.2 Properties of the inflated graph

An injectable set is a set of observed inflated variables that, together with the ancestors of these observed inflated variables, can be injected into the original graph. This injection, however, is only allowed to map inflated variables to their corresponding original variables. In short, an injectable set is a set that, together with its ancestral subgraph, fits over the original graph. As a result of Definition 3.3, an injectable set has a marginal distribution over the inflated joint distribution that is identical to the marginal distribution of its original counterpart. The set that contains all the injectable sets is named the injectable sets (Definition 3.4).

Definition 3.4 (Injectable Sets). The injectable sets is the set that contains all sets that are injectable. In other words, the injectable sets contain all sets Xi′ with the property that the ancestral subgraph of Xi′ in the inflated graph G′ is identical up to copy indices to the ancestral subgraph of the set of corresponding original nodes Xi in the original graph G:

Xi′ ∈ InjectableSets(G′) iff AnSub^{G′}(Xi′) ≐ AnSub^G(Xi)

As a result of Definition 3.3, Xi′ ∈ InjectableSets(G′) entails P^G_{Xi} = P^{G′}_{Xi′}.

To illustrate an injectable set, the inflated graph G′ in Figure 6 is used. For this inflated graph, the set {X1^2, X2^2} is an injectable set, because its ancestral subgraph (containing the elements {X1^2, U1^2, X2^2, U2^1}) is identical up to copy indices to the ancestral subgraph of {X1, X2} in the uninflated graph ({X1, U1, X2, U2}). This equality, in turn, entails the equality of the probability distributions of {X1^2, X2^2} and {X1, X2}: P^G_{{X1,X2}} = P^{G′}_{{X1^2,X2^2}}.

For a distribution P^G to be compatible with a graph G (Definition 2.11), there must be an SCM that entails it. When P^G is compatible with graph G, Definition 2.11 states that the marginals of P^G should also be compatible with graph G. Since the marginals corresponding with the injectable sets are compatible with G, these injectable sets are, by Definition 2.9, also compatible with G′. This means that in order for the inflated distribution P^{G′} to be compatible with the inflated graph G′, it must be compatible with the marginals of the injectable sets. Formally:

Lemma 3.1 (The compatibility with respect to the injectable sets). Let graph G′ with observed random variables X′ be an inflation of graph G with observed random variables X. Let S′ ⊆ InjectableSets(G′) be a collection of injectable sets and let S be the image of this collection under the dropping of copy indices. If the distribution PX is compatible with G, then the family of marginal distributions {P_{Xi} : Xi ∈ S} is also compatible with G. Furthermore, the family of marginal distributions {P_{Xi′} : Xi′ ∈ S′}, defined via Definition 3.3, is compatible with G′ (Navascues & Wolfe, 2017).

To illustrate the additional value of the Inflation Technique, we will now discuss an example taken from Navascues and Wolfe (2017). Assume we have three perfectly correlated observed variables X = {X1, X2, X3} – the only possible outcomes of (X1, X2, X3) are (1, 1, 1) or (0, 0, 0) – as data, and we propose the triangle scenario graph (Figure 7a) as a potential fit to the data. The joint distribution of this perfectly correlated data is given in Equation 18.

PX(x1, x2, x3) = { 1/2 if x1 = x2 = x3;  0 otherwise }    (18)

The problem with the proposed triangle scenario graph is the absence of conditional independence relations that can be used to check for causal compatibility. To address this problem, we take the inflated graph of Figure 7b, where {X1^2, X2^1} and {X2^1, X3^1} are members of the injectable sets. The images of these members in the original graph structure are {X1, X2} and {X2, X3}, respectively. To prove that the distribution of Equation 18 is not compatible with the triangle scenario graph, a proof by contradiction is given.

Proof. If PX (Equation 18) is compatible with the triangle scenario, the marginals of {X1, X2} and {X2, X3} (Equations 19 and 20) should also be compatible with it:

P{X1^2, X2^1}(x1, x2) = P{X1, X2}(x1, x2) = { 1/2 if x1 = x2;  0 otherwise }    (19)

P{X2^1, X3^1}(x2, x3) = P{X2, X3}(x2, x3) = { 1/2 if x2 = x3;  0 otherwise }    (20)

By Lemma 3.1, {X1^2, X2^1} and {X2^1, X3^1} must be compatible with the inflated graph G′. The injectable set {X1^2, X2^1} is perfectly correlated and the injectable set {X2^1, X3^1} is perfectly correlated as well. Thus, the observed random variables X1^2 and X3^1 should also be perfectly correlated. This correlation, however, cannot be represented in the inflated graph G′, which contradicts the statement that PX is compatible with G. In conclusion, due to the contradiction provided above, PX is not compatible with G.


[Figure panels: (a) the triangle scenario with observed nodes X1, X2, X3 and latent nodes U1, U2, U3; (b) the second-order web inflation of the triangle scenario]

Figure 7: The triangle scenario web inflation of order 2

The Inflation Technique is not only capable of showing incompatibility of probability distributions with graph structures, but is also able to construct requirements for the two to be compatible. These requirements are made up of (in)equality constraints that the inflated joint probability distribution must satisfy in order to be compatible with the inflated graph structure. These inequality constraints will be referred to as causal compatibility inequalities and are formally defined in Definition 3.5. The violation of a causal compatibility inequality ensures incompatibility of the graph with the probability distribution. On the contrary, the satisfaction of a causal compatibility inequality does not ensure compatibility of the graph structure.

Definition 3.5 (Causal compatibility inequalities). Let G be a causal structure and let S be a family of subsets of the observed variables X in G, S ⊆ 2^X. Let I_S denote an inequality that operates on the corresponding distributions {P_{Xi} : Xi ∈ S}. Then I_S is a causal compatibility inequality for the graph structure G whenever it is satisfied for every family of distributions {P_{Xi} : Xi ∈ S} that is compatible with G.

As a result of Lemma 3.1, the inflated graph G′ can help construct causal compatibility inequalities for the uninflated graph G. This is the case since a causal compatibility inequality for the inflated graph G′ is also a causal compatibility inequality for the original graph G.

Corollary 3.1.1. Suppose that G′ is an inflation of graph G. Let S′ ⊆ InjectableSets(G′) be a collection of injectable sets and let S be the image of this collection under the dropping of copy indices. Let I_{S′} be a causal compatibility inequality for G′ operating on the families {P_{Xi′} : Xi′ ∈ S′}. Define an inequality I_S as follows: in the functional form of I_{S′}, replace every occurrence of a term P_{Xi′} by P_{Xi} for the unique Xi ∈ S. Then I_S is a causal compatibility inequality for G operating on families {P_{Xi} : Xi ∈ S}.

Proof. Suppose that the family of distributions {P_{Xi} : Xi ∈ S} is compatible with G. Then, according to Lemma 3.1, {P_{Xi′} : Xi′ ∈ S′}, where P_{Xi′} = P_{Xi} when Xi is identical up to copy indices to Xi′, is compatible with G′. Since I_{S′} is a causal compatibility inequality of graph G′, it is satisfied for {P_{Xi′} : Xi′ ∈ S′}. Definitions 2.11 and 3.5 state that I_{S′} evaluated on {P_{Xi′} : Xi′ ∈ S′} equals I_S evaluated on {P_{Xi} : Xi ∈ S}. It therefore follows that {P_{Xi} : Xi ∈ S} satisfies I_S. Since {P_{Xi} : Xi ∈ S} was an arbitrary family compatible with G, we conclude that I_S is a causal compatibility inequality for G.

Before describing a new example, let us introduce the notion of ancestrally independent sets. Two sets of variables in a DAG are called ancestrally independent, if they do not have any mutual


[Figure panels: (a) the triangle scenario; (b) the spiral inflation of the triangle scenario]

Figure 8: The correlation scenario spiral inflation

ancestors. When two sets of nodes are ancestrally independent, it implies that the sets are independent from each other as well. When injectable sets from an inflated graph G′ are ancestrally independent, the inflated joint probability distribution must satisfy these mutual independency relations. Take, for example, the inflation graph of Figure 8b. A tautology for a joint distribution of 6 binary random variables {X1^1, X2^1, X3^1, X1^2, X2^2, X3^2} would be

P{X1^1, X2^1, X3^1}(1, 1, 1) ≤ P{X1^2, X2^1, X3^1}(1, 1, 1) + P{X1^1, X2^2, X3^1}(1, 1, 1) + P{X1^1, X2^1, X3^2}(1, 1, 1) + P{X1^2, X2^2, X3^2}(0, 0, 0).    (21)

The inflated graph G′ of Figure 8 induces ancestral independence relations over the marginal distributions as stated below. Every marginal can be written as a product of mutually independent variables:

X1^2 ⊥⊥ X2^1, X3^1 =⇒ P{X1^2, X2^1, X3^1}(1, 1, 1) = P{X1^2}(1) · P{X2^1, X3^1}(1, 1)
X2^2 ⊥⊥ X1^1, X3^1 =⇒ P{X1^1, X2^2, X3^1}(1, 1, 1) = P{X2^2}(1) · P{X1^1, X3^1}(1, 1)
X3^2 ⊥⊥ X1^1, X2^1 =⇒ P{X1^1, X2^1, X3^2}(1, 1, 1) = P{X3^2}(1) · P{X1^1, X2^1}(1, 1)
X1^2 ⊥⊥ X2^2 ⊥⊥ X3^2 =⇒ P{X1^2, X2^2, X3^2}(0, 0, 0) = P{X1^2}(0) · P{X2^2}(0) · P{X3^2}(0)

When these mutual independence relations are applied, every probability distribution represented in the inequality constraint is an injectable set of the inflated graph G′. Thus, by Corollary 3.1.1, dropping the copy indices yields a causal compatibility inequality constraint for the original graph of Figure 8: Equation 22. Such an inequality constraint is a fairly weak requirement for causal compatibility on its own, but every requirement can be instantiated for all combinations of values of the random variables. Thus, the causal compatibility inequality of Equation 22 is one of the 2^6 verifiable inequalities that a compatible distribution must satisfy.

P{X1, X2, X3}(1, 1, 1) ≤ P{X1}(1) · P{X2, X3}(1, 1) + P{X2}(1) · P{X1, X3}(1, 1) + P{X3}(1) · P{X1, X2}(1, 1) + P{X1}(0) · P{X2}(0) · P{X3}(0)    (22)
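Equation 22 can be tested mechanically on any joint distribution over three binary variables. The sketch below computes both sides; for the perfectly correlated distribution of Equation 18 this particular inequality happens to be satisfied (0.5 ≤ 0.875), which illustrates that a single causal compatibility inequality is only a necessary condition: the incompatibility of this distribution was established through the argument of Section 3.2 instead.

```python
import numpy as np

def check_eq22(P):
    """Evaluate both sides of Eq. 22 for a joint distribution over three
    binary variables, given as an array with entries P[x1, x2, x3]."""
    P = np.asarray(P, dtype=float)
    p1, p2, p3 = P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))
    p23, p13, p12 = P.sum(axis=0), P.sum(axis=1), P.sum(axis=2)
    lhs = P[1, 1, 1]
    rhs = (p1[1] * p23[1, 1] + p2[1] * p13[1, 1]
           + p3[1] * p12[1, 1] + p1[0] * p2[0] * p3[0])
    return lhs, rhs, bool(lhs <= rhs)

# Perfectly correlated distribution of Eq. 18.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 1, 1] = 0.5

lhs, rhs, ok = check_eq22(P)
print(lhs, rhs, ok)  # 0.5 0.875 True: satisfied, hence inconclusive on its own
```

Running the same check over all 2^6 instantiations mentioned above, and over inequalities from larger inflations, is what progressively strengthens the compatibility verdict.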

3.3 Construction of causal compatibility constraints

The construction of causal compatibility inequality constraints, as stated in Section 3.2, is an extensive and hands-on process. In this section we explore the possibilities for extending this technique to a generalized method for constructing causal compatibility inequalities. To this end, the notion of an expressible set is defined in Definition 3.6, such that it can be used to construct a marginal problem. Whenever the marginal problem cannot be solved for the expressible sets obtained from the inflated graph G′, the joint distribution of observed random variables cannot be compatible with the proposal graph G.

The distributions in the causal compatibility inequality of Equation 21 are not all injectable sets, but they can be turned into injectable sets by splitting the distributions into mutually independent injectable parts. This notion of splitting a set of variables into mutually independent injectable parts is key to constructing causal compatibility inequalities that carry additional information on the compatibility of a graph. Sets of observed random variables that can be split into mutually independent injectable parts are able to transfer information from the inflated graph to the original graph and are called expressible sets (Definition 3.6).

Definition 3.6 (Expressible set). Consider an inflated graph $G'$ with observed random variables $X'$ and the corresponding original graph $G$ with observed random variables $X$. A set $V'$ is said to be expressible if $V' \in \text{InjectableSets}(G')$ or if $V'$ can be obtained from injectable sets by recursively applying these rules:

• For $A, B, C \subseteq X'$: if $A \perp\!\!\!\perp_{G'} B \mid C$ and both $A \cup C$ and $B \cup C$ are expressible, then $A \cup B \cup C$ is also expressible.

• If $V'$ is expressible, then so is any subset of $V'$.

An expressible set is called maximal if it is not a proper subset of another expressible set.

The expressible sets of an inflated graph are convenient, because the probability distributions of these sets can be inferred directly from the probability distribution of the original data. However, expressible sets are complex to construct; therefore a subset of the expressible sets is used: the ancestrally independent expressible sets (ai-expressible sets). An ai-expressible set is an expressible set that is constructed from ancestrally independent injectable sets. Since the injectable sets in an ai-expressible set are ancestrally independent, these injectable sets are mutually independent.

Definition 3.7 (Ai-expressible set). A set $V' \subseteq X_{G'}$ is ai-expressible if it can be written as a union of ancestrally independent injectable sets:

$$V' \in \text{AiExpressibleSets}(G') \iff \exists\, \{X'_i \in \text{InjectableSets}(G')\} \text{ such that } V' = \bigcup_i X'_i \text{ and } \forall i \neq j:\ X'_i \perp\!\!\!\perp_{G'} X'_j$$

An ai-expressible set is called maximal if it is not a proper subset of another ai-expressible set.

To illustrate, the ai-expressible sets of the inflated graph in Figure 6b are described in Equation 23.

$$\{X_1^1\},\ \{X_2^1\},\ \{X_3^1\},\ \{X_1^2\},\ \{X_1^1, X_1^2, X_2^2\},\ \{X_3^1, X_1^2, X_2^2, X_1^1\} \quad (23)$$

Note that for the maximal ai-expressible sets, only ai-expressible sets that are not a proper subset of another ai-expressible set need to be taken into account. The maximal ai-expressible sets would be $\{X_1^1, X_2^1, X_3^1\} \perp\!\!\!\perp_{G'} \{X_1^2\}$ and $\{X_1^2, X_2^2, X_3^1\} \perp\!\!\!\perp_{G'} \{X_1^1\}$.
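Enumerating ai-expressible sets from a list of injectable sets is a small combinatorial exercise. The sketch below is a minimal illustration under stated assumptions: the injectable sets and the `independent` oracle are hypothetical stand-ins (a real implementation would test ancestral independence in the inflated graph $G'$).

```python
from itertools import combinations

def ai_expressible_sets(injectable, independent):
    """Enumerate ai-expressible sets per Definition 3.7: every union of
    pairwise ancestrally independent injectable sets. `independent(a, b)`
    is a hypothetical graphical-independence oracle for the inflated graph."""
    result = set(injectable)  # each injectable set is trivially a union of one
    for r in range(2, len(injectable) + 1):
        for parts in combinations(injectable, r):
            # Union qualifies only if all parts are mutually independent.
            if all(independent(a, b) for a, b in combinations(parts, 2)):
                result.add(frozenset().union(*parts))
    return result

# Toy run mirroring the maximal set {X1^1, X2^1, X3^1} ∪ {X1^2} from the text.
injectable = [frozenset({"X1^1", "X2^1", "X3^1"}), frozenset({"X1^2"})]
oracle = lambda a, b: True  # assume the two injectable sets are independent
sets_ = ai_expressible_sets(injectable, oracle)
print(frozenset({"X1^1", "X2^1", "X3^1", "X1^2"}) in sets_)  # True
```

In practice the oracle would be implemented with the ancestral (d-separation) criteria of the inflated graph, and only the maximal sets would be retained.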

Any probability distribution compatible with an inflated graph has to adhere to the independence relations implied by that graph. A compatible probability distribution therefore has to obey the factorization of an ai-expressible set into its mutually independent injectable parts (Equation 24). This makes it possible to calculate marginals of the inflated probability distribution without explicitly knowing the inflated probability distribution.

$$P_{V'_i} = \prod_j P_{X'_j} \quad (24)$$
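Computationally, the factorization of Equation 24 is just an outer product of the injectable-set marginals. A minimal sketch with numpy, using illustrative marginals (a perfectly correlated binary triple and a uniform binary variable; the numbers are assumptions for the sketch, not from the thesis data):

```python
import numpy as np

# Illustrative injectable-set marginals, as arrays indexed by outcomes.
p_triple = np.zeros((2, 2, 2))
p_triple[0, 0, 0] = p_triple[1, 1, 1] = 0.5  # perfectly correlated triple
p_single = np.array([0.5, 0.5])              # uniform binary variable

# Eq. 24: the ai-expressible marginal is the product of its injectable parts.
p_ai = np.multiply.outer(p_triple, p_single)  # shape (2, 2, 2, 2)

print(p_ai.sum())        # 1.0 -> still a normalized distribution
print(p_ai[1, 1, 1, 1])  # 0.5 * 0.5 = 0.25
```

Each entry of the resulting array is the product of the corresponding entries of the independent parts, which is exactly the factorization the inflated graph imposes.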

For a joint distribution to be compatible with the inflated graph $G'$, all marginal distributions obtained from the ai-expressible sets have to be part of one joint distribution. Calculating whether there exists a joint distribution that results in a certain set of marginals thus amounts to checking for causal compatibility.

To illustrate, we return to the three-way perfect correlation example (Equation 18), where Equation 19 and Equation 20 give the marginal distributions for $\{X_1^2, X_2^1\}$ and $\{X_2^1, X_3^1\}$. When we construct the marginal distribution of $\{X_1^2, X_3^1\}$, $X_1^2$ and $X_3^1$ are independent, resulting in the marginal probability distribution of Equation 25. As shown, the incompatibility in this example results from $\{X_1^2, X_2^1\}$ and $\{X_2^1, X_3^1\}$ being perfectly correlated while $\{X_1^2, X_3^1\}$ is not. This result is corroborated by the impossibility of obtaining a joint distribution from these three marginals. In other words, it is possible to detect the incompatibility of DAG $G$ by searching for an inflated joint distribution that has the distributions constructed from the ai-expressible sets as marginals. Since the maximal ai-expressible sets contain all smaller ai-expressible sets, it is sufficient to construct marginals for the maximal sets.

$$P_{\{X_1^2, X_3^1\}}(x_1, x_3) = P_{X_1^2}(x_1)\, P_{X_3^1}(x_3) = \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4} \quad \text{for } x_1, x_3 \in \{0, 1\} \quad (25)$$

In short, the steps taken to extract information from the inflated graph for testing causal compatibility are the following:

1. Based on the inflated graph $G'$ of a proposal graph $G$, identify all the ai-expressible sets and their partitions into injectable sets.

2. From the provided joint probability distribution, infer the marginal probabilities per ai-expressible set by computing the product of the marginal distributions of the injectable sets it contains.

3. Determine whether the distributions obtained in step 2 could all be marginals of one joint inflated distribution. If not, the proposal graph $G$ is incompatible with the provided probability distributions.

Note that passing a causal compatibility test is a necessary but not sufficient condition for causal compatibility.

3.3.1 Marginal Problem

As stated in the last section, the final step of the causal compatibility test is to determine whether the set of marginals obtained from the maximal ai-expressible sets could have been constructed from one joint inflated distribution. The problem of deciding whether a set of marginals could have been constructed from one joint distribution is called the marginal problem (Fritz & Chaves, 2013). Furthermore, whenever the joint distribution is unknown, the problem becomes the marginal satisfiability problem, where the question is "Does there exist a joint probability distribution that has a specified set of marginals?".

The marginal satisfiability problem is formulated as a set of linear equations. These linear equations are constraints on the joint inflated probability distribution, constructed from the maximal ai-expressible sets. Here, the maximal ai-expressible sets are called the contexts, denoted $(V_1, \ldots, V_n)$ with $V_i \subseteq X'$, of the inflated graph $G'$ with observed variables $X'$. Note that every context $V_i$ can be described through marginalization over the inflated joint probability distribution $\hat{P}_{X'}$, as in Equation 26.

$$P_{V_i} := \sum_{X' \setminus V_i} \hat{P}_{X'} \quad (26)$$

To decide whether there exists an inflated joint probability distribution $\hat{P}_{X'}$ that has all the ai-expressible sets in the contexts as marginals, one should ensure that different contexts agree on their overlap: the marginalizations of $V_i$ and of $V_j$ must assign the same distributional values to the intersection $V_i \cap V_j$. When a $\hat{P}_{X'}$ is found that satisfies all these requirements, the marginal satisfiability problem is answered in the affirmative.
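For finitely many discrete variables, this search can be posed as a linear feasibility program: the unknowns are the entries of $\hat{P}_{X'}$, and each context marginal contributes linear equality constraints. The sketch below (using scipy; the helper names and the toy context marginals are our own) mirrors the perfect-correlation example, where no joint distribution exists:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

states = list(itertools.product([0, 1], repeat=3))  # outcomes of (X1, X2, X3)

def context_rows(var_idx, table):
    """Equality rows forcing the joint to reproduce `table` as the marginal
    over the variable positions in `var_idx`."""
    rows, rhs = [], []
    for outcome, p in table.items():
        rows.append([1.0 if all(s[i] == v for i, v in zip(var_idx, outcome))
                     else 0.0 for s in states])
        rhs.append(p)
    return rows, rhs

# Toy contexts from the perfect-correlation example: (X1, X2) and (X2, X3)
# perfectly correlated, (X1, X3) uniform (independent).
corr = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
unif = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
contexts = {(0, 1): corr, (1, 2): corr, (0, 2): unif}

A_eq, b_eq = [[1.0] * len(states)], [1.0]  # normalization constraint
for var_idx, table in contexts.items():
    rows, rhs = context_rows(var_idx, table)
    A_eq.extend(rows)
    b_eq.extend(rhs)

# Feasibility LP: does any q >= 0 match all context marginals?
res = linprog(c=np.zeros(len(states)), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * len(states), method="highs")
print("joint exists:", res.success)  # False -> marginals are incompatible
```

The LP is infeasible because the two perfect correlations force $X_1 = X_3$ in any joint distribution, contradicting the uniform $(X_1, X_3)$ context; for real inflated graphs the same construction is applied to the maximal ai-expressible sets.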
