
University of Amsterdam

Master Thesis

Conditional Independence Testing in

Causal Inference

Author:

Philip A. Boeken

Supervisor: Prof. Dr. Joris M. Mooij Second reader: Prof. Dr. Peter J.C. Spreij

A thesis submitted in fulfilment of the requirements for the degree of MSc. Stochastics and Financial Mathematics

at the

Korteweg-de Vries Institute for Mathematics University of Amsterdam


Abstract

Constraint-based causal discovery algorithms utilise conditional independence tests to infer causal relations from purely observational data. A lot of attention has been given to the theoretical performance of these algorithms, but under the assumption of having a conditional independence 'oracle' at their disposal. In practice the performance of the causal discovery algorithm heavily relies on the performance of the conditional independence test that is being used. In this thesis we theoretically investigate both classical and state-of-the-art conditional independence tests. We revisit existing theory on the Pólya tree prior and utilise this to propose a novel Bayesian nonparametric conditional d-sample test, which fills a gap in the available conditional independence tests that are required for causal discovery. These tests and their impact on the 'Local Causal Discovery' algorithm are empirically analysed using synthetic and real-world data.

Title: Conditional Independence Testing in Causal Inference Author: Philip A. Boeken

Supervisor: Prof. Dr. Joris M. Mooij Second reader: Prof. Dr. Peter J.C. Spreij Examination date: August 25, 2020

Korteweg-de Vries Institute for Mathematics University of Amsterdam

Science Park 105-107, 1098 XG Amsterdam https://kdvi.uva.nl


Summary

Statistics and machine learning have become indispensable over a wide range of domains: from the voice assistant in your smartphone to the construction of high-resolution fMRI images. Although these algorithms have proven to function well in detecting similarities between new observations and the data they have been trained on, their performance often deteriorates when the new observations are very dissimilar from the training data. For example, a self-driving car can drive a perfect lap on a clean and predictable test track, but is likely to cause accidents when encountering a more dynamic and less predictable environment. Documents from the US National Transportation Safety Board that report a collision between a self-driving car and a person who was jaywalking provide the explanation that the car did not have "the capability to classify an object as a pedestrian unless that object was near a crosswalk".

At the root of this problem lies the fact that current machine learning algorithms do not model any underlying causal mechanisms, but merely capitalise on statistical correlations. For a long time the only well-known method of inferring causal relations was the randomised controlled trial (Fisher, 1935), which requires the data to originate from a certain experiment. Although this method is very reliable, performing such experiments can be very time consuming, costly, and even unethical. Despite its early initiation by Wright (1921), a different mathematical model for describing causal relations was ultimately popularised by Pearl (1988), allowing for inference of these causal relations through purely observational data. A key ingredient of this inference is a statistical pattern called conditional independence, which has to be detected by a conditional independence test for the causal inference algorithm to work properly. A lot of research has been done on the design of causal inference algorithms, but only little attention has been given to this subroutine of conditional independence testing, despite its profound impact on the performance of the overall algorithm.

In this thesis, we analyse some conditional independence tests that have recently been proposed, and by building on existing research we provide a new conditional independence test. The existing and novel conditional independence tests are both theoretically and empirically analysed, and incorporated as a subroutine of a well-known causal discovery algorithm called 'Local Causal Discovery'.


Preface

Writing my thesis in the field of causal discovery has been a gratifying process: by virtue of the type of mathematics it involved, the fact that the field has many open questions, and the promise of powerful applications, should some of those open questions be resolved. It should be noted that the core contents of the research that preceded writing this thesis have culminated in co-authoring the NeurIPS submission "A Bayesian Nonparametric Conditional Two-sample Test with an Application to Local Causal Discovery" (Boeken and Mooij, 2020). Although most of the textual contents of this thesis are original, some sections may exhibit noticeable overlap between the two texts.

I thank Joris Mooij for his dedicated supervision, which included the helpful and motivating weekly meetings, inviting me into his reading group, and even co-authoring the aforementioned paper together. I also thank Peter Spreij for his involvement in this project as a second reader.


Contents

Abstract
Summary
Preface
Introduction
1 Causal Inference
  1.1 Structural Causal Models
    1.1.1 From SCMs to graphs
    1.1.2 Simple SCMs
    1.1.3 Markov properties and causal discovery
    1.1.4 Joint Causal Inference
    1.1.5 Related theory
  1.2 Local Causal Discovery
    1.2.1 Marginalisation and ancestral relations
2 Conditional independence testing
  2.1 Preliminary theory
    2.1.1 Distributions
    2.1.2 Independence
  2.2 A No Free Lunch theorem
  2.3 Overview of conditional independence tests
    2.3.1 Pearson's partial correlation
    2.3.2 Spearman's partial correlation
    2.3.3 The Generalised Covariance Measure
    2.3.4 The Hilbert-Schmidt independence criterion
    2.3.5 The Classifier Conditional Independence Test
3 Independence testing with Pólya tree priors
  3.1 Pólya trees
    3.1.1 Random measures
    3.1.2 Definition and existence of the Pólya tree random measure
    3.1.3 The support of P
    3.1.4 Posterior distribution
  3.2 Independence tests
    3.2.1 Choice of parameters
    3.2.2 Discrete vs. continuous independence
    3.2.3 Continuous vs. continuous independence
    3.2.4 Discrete vs. continuous conditional independence
    3.2.5 Continuous vs. continuous conditional independence
  3.3 Consistency of the Bayes factors
    3.3.1 Marginal independence
    3.3.2 Conditional independence
4 Experiments
  4.1 Simulations
    4.1.1 Independence tests
    4.1.2 Local Causal Discovery
  4.2 Protein expression data
5 Discussion
Bibliography
A Proofs of Chapter 2
  A.1 Proof of Proposition 2.1
B Proofs of Chapter 3
  B.1 Proof of Theorem 3.6
  B.2 Proof of Theorem 3.7

Introduction

In the field of causal inference, the goal is to infer causal relations from data. Formerly, causal inference was only possible by means of a randomised controlled trial, as proposed by Fisher (1935). Such a randomised controlled trial proceeds by randomly splitting a group of subjects into a control group and a treatment group, of which only the latter is subjected to some treatment. After the treatment has been completed one takes measurements of both groups, and if the measurements differ significantly between the two groups, then one concludes that the treatment is a cause of the measured variable. Performing such an experiment can be costly or even unethical, so seeking alternative methods might be worthwhile.

A highly influential alternative model has been popularised by Pearl (1988), which enables causal inference from purely observational data, and thus mitigates the necessity of the possibly problematic experiments. Ever since, a lot of research has been done in this field, and in particular on the Structural Causal Model: a mathematical model for describing causality, originally proposed by Wright (1921). Based on this mathematical model, algorithms have been developed to extract underlying causal relations from datasets. Recently, Mooij et al. (2020) have proposed Joint Causal Inference (JCI): a method which enables causal inference by joining datasets from multiple contexts, which allows for efficiently increasing the sample size of the dataset at hand, while simultaneously allowing causal inference algorithms to uncover causal relations which would otherwise have remained unidentifiable. In Chapter 1 we will cover some of the theory describing the Structural Causal Model. Among the existing causal inference algorithms, the subgroup of constraint-based causal discovery algorithms leverages conditional independence patterns in the data to infer causal relations. In Chapter 1 we will cover such a constraint-based algorithm called Local Causal Discovery (LCD) (Cooper, 1997).

As constraint-based causal discovery algorithms utilise conditional independence tests to infer causal relations, their performance heavily relies on the performance of the conditional independence tests that are being used. Shah and Peters (2020) have shown that conditional independence testing is a hard task, in the sense that there is no conditional independence test which performs well for all types of conditional distributions, provided that the conditioning variable is continuous. When applying a constraint-based causal discovery algorithm to a dataset which adheres to the JCI setting, the algorithm often requires a conditional independence test of the type C |= X|Z, where C is discrete, and X and Z are continuous. A gap in the current state of research is the absence of such a conditional independence test. As there are no conditional independence tests which are specifically designed for this setting, practitioners often use a conditional independence test which assumes that C, X and Z are continuous, with possibly suboptimal results. In Chapter 2 we cover some classical and some state-of-the-art conditional independence tests.


In Chapter 3 we revisit some existing theory on the Pólya tree prior, which is a random measure that is suitable for nonparametric Bayesian hypothesis testing. By building on an existing two-sample test (Holmes et al., 2015) and a continuous conditional independence test (Teymur and Filippi, 2020), we propose a novel conditional independence test which allows for testing the hypothesis C |= X|Z where C is discrete and X, Z continuous. This test fills the previously mentioned gap, and allows for applying causal discovery algorithms in the JCI setting without violating any assumptions.

Although we provide theoretical results for the independence tests in Chapters 2 and 3, these results are hard to interpret in terms of what type of data the tests work well on. In Chapter 4 we address this problem empirically by simulating datasets, and analysing both the performance of the independence tests individually and the performance of the LCD algorithm when implemented with these tests on said datasets. Ultimately we apply the LCD algorithm to real-world data, to obtain some intuition on whether these methods work in practice.


1 Causal Inference

In this chapter we define and analyse a mathematical model for describing causality. We posit the definition of a Structural Causal Model (SCM), and revisit some graph-theoretic notions that are necessary for analysing the SCM. This is restricted to the necessary amount of SCM-theory to be able to state the main theorem underlying the LCD algorithm, which is covered in Section 1.2. We gratefully draw most of this theory from Pearl (2009), Mooij et al. (2020) and Bongers et al. (2020).

1.1 Structural Causal Models

The purpose of the causal model that we consider is the description of the data generating process of some set of observed random variables Xi, where i ranges through the index set I. As we only have access to (a finite sample of) the observational distribution PX of X := {Xi : i ∈ I}, we have to devise a connection between this observational distribution and the causal mechanism underlying the random variable X. In order to do so, we first have to properly define what is understood to be cause and effect. In this work, we consider the following model:

Definition 1.1 (Structural Causal Model, Bongers et al., 2020). A Structural Causal Model (SCM) M is a tuple

M := ⟨I, J, X, E, f, P_E⟩

where

(i) I is a finite index set for the endogenous variables,

(ii) J is a finite index set for the exogenous variables,

(iii) X := ×_{i∈I} X_i is the product of the codomains of the endogenous variables, where each codomain X_i is a standard measurable space,

(iv) E := ×_{j∈J} E_j is the product of the codomains of the exogenous variables, where each codomain E_j is a standard measurable space,

(v) f : X × E → X is a measurable function that specifies the causal mechanisms,

(vi) P_E := ⊗_{j∈J} P_{E_j} is a product measure, where every P_{E_j} is a probability measure on E_j.

A standard measurable space is a measurable space (X , Σ) that is isomorphic to a measurable space (Y, B(Y)), where Y is a Polish space (i.e. separable and completely metrisable) and B(Y) denotes the Borel σ-algebra on Y. We may refer to X as a standard measurable space by implicitly assuming the existence of a σ-algebra Σ such that (X , Σ) is a standard


measurable space. As will be shown in Chapter 2, this rather technical assumption serves the existence of a convenient type of conditional distribution.

The preceding definition does not yet mention any random variables. In practice, we would like an SCM to be the underlying data generating process of observations of random variables Xi with codomain Xi, where i ∈ I. These random variables are related to SCMs via the following definition:

Definition 1.2 (Solution, Bongers et al., 2020). A pair (X, E) of random variables X : Ω → X, E : Ω → E, where (Ω, Σ, P) is a probability space, is a solution of the SCM M if

(i) the distribution of E is equal to P_E,

(ii) the structural equations

Xi = fi(X, E) ∀i ∈ I

are satisfied almost surely.

As the variables X are modelled by the causal mechanism f, this definition rightly suggests that the endogenous variables X are the observed variables of interest, and the exogenous variables E are the latent variables, which can model noise, measurement error, or perhaps some other factor that is absent in the provided data set. A rigorous notion of causality arises when inspecting this functional relation fi between endogenous variable Xi and all other endogenous and exogenous variables. In fact, the structural equations

x_i = f_i(x, e),  x ∈ X, e ∈ E, ∀i ∈ I

are deterministic equations representing the causal mechanism of the SCM M. The causal interpretation of the mapping fi is made clear with the following concept:

Definition 1.3 (Parent, Bongers et al., 2020). Let M := ⟨I, J, X, E, f, P_E⟩ be an SCM. We call j ∈ I ∪ J a parent of i ∈ I if there does not exist a measurable function¹ f̃_i : X_{I\{j}} × E_{J\{j}} → X_i such that

x_i = f_i(x, e) ⇐⇒ x_i = f̃_i(x_{I\{j}}, e_{J\{j}})

for all x ∈ X and for P_E-almost all e ∈ E.

¹ For a subset O of I or J we write X_O := ×_{i∈O} X_i and E_O := ×_{j∈O} E_j for the corresponding products of codomains.

As we will later see, under certain assumptions we may interpret a parent as a direct cause of its child. Note that exogenous variables may be parents of endogenous variables, but the reverse is not allowed.
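As a concrete illustration (not taken from the thesis), the Python sketch below encodes a small acyclic SCM with two endogenous and two exogenous variables and draws a finite sample from its solution; the structural equations and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    """Draw n i.i.d. observations from the solution of a toy acyclic SCM.

    Hypothetical structural equations:
        X1 = E1
        X2 = 2 * X1 + E2
    with E1, E2 independent standard normal exogenous variables.
    """
    e1 = rng.standard_normal(n)        # P_{E_1}
    e2 = rng.standard_normal(n)        # P_{E_2}
    x1 = e1                            # f_1 depends only on E1, so E1 is the only parent of X1
    x2 = 2.0 * x1 + e2                 # f_2 depends on X1 and E2, so 1 is a parent of 2
    return np.column_stack([x1, x2])   # observations of the endogenous variables

X = sample_scm(1000)                   # a finite sample from the observational distribution P_X
```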

1.1.1 From SCMs to graphs

The preceding notion of element i being a parent of element j hints at a graph representation of the SCM M, where the induced graph should contain the edge i → j. To allow for a rigorous explanation of the relation between SCMs and graphs, we first introduce some graph-theoretic definitions.

A directed graph is a graph G := ⟨V, E⟩ with nodes V and edges E, where the edges are of the form i → j and i ← j, allowing for self cycles i → i. A directed mixed graph is a directed graph which also allows edges of the form i ↔ j for i, j ∈ V with i ≠ j. We denote this set of bidirected edges with F, and so the directed mixed graph is denoted as G := ⟨V, E, F⟩. A path between i, j ∈ V with i ≠ j is a tuple (i = i_1, e_1, i_2, e_2, ..., i_{n−1}, e_{n−1}, i_n = j) of distinct alternating nodes and edges in G, such that every e_m ∈ E ∪ F connects i_m and i_{m+1}. A directed path is a path where all edges are of the form i_m → i_{m+1}, and can thus be denoted with (i_1, ..., i_n). A cycle is a tuple (i_1, ..., i_n, i_1) where (i_1, ..., i_n) is a directed path, and where i_n and i_1 are connected with an edge i_n → i_1 ∈ E. A directed acyclic graph (DAG) is a directed graph without any cycles, and an acyclic directed mixed graph is a directed mixed graph without any cycles. If the pattern i → j ← k occurs on a path, then we refer to j as a collider.

Having these graph-theoretic notions at our disposal, we are able to define the graph of an SCM:

Definition 1.4 (The graph of an SCM, Bongers et al., 2020). Given an SCM M := ⟨I, J, X, E, f, P_E⟩, we define:

(i) the augmented graph G^a(M) as the directed graph with nodes I ∪ J and directed edges i → j if and only if i ∈ I ∪ J is a parent of j ∈ I;

(ii) the graph G(M) as the directed mixed graph with nodes I, directed edges i → j if and only if i ∈ I is a parent of j ∈ I, and bidirected edges i ↔ j if and only if there exists a k ∈ J that is a parent of both i ∈ I and j ∈ I.

If we let V := I ∪ J and E be the nodes and edges of the augmented graph G^a(M) := ⟨V, E⟩, then there is no ambiguity in defining the set of parents of j ∈ I as pa(j) := {i ∈ V : i → j ∈ E}, and the ancestors of j ∈ I as

an(j) := {i ∈ V : there is a directed path from i to j in G^a(M)}.

These definitions extend to subsets O ⊆ V by defining pa(O) := ∪_{j∈O} pa(j) and an(O) := ∪_{j∈O} an(j). Note that it may occur that i ∈ an(i), and in accordance with Definition 1.3 and the possibility of self cycles in a directed mixed graph it may occur that i ∈ pa(i). As some of the variables of the augmented graph G^a(M) remain unobserved, we can at best hope to identify the graph G(M). Identifying as much as possible of the graph G(M) is precisely the goal of constraint-based causal inference algorithms, as this graph encodes all observable causal relations.
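As a concrete illustration of the pa(·) and an(·) operations (not part of the thesis), the following Python sketch computes parents and ancestors in a directed graph represented as an adjacency mapping; the graph used in the example is hypothetical.

```python
def parents(graph, j):
    """pa(j): all nodes i with an edge i -> j, for a graph given as {node: set of children}."""
    return {i for i, children in graph.items() if j in children}

def ancestors(graph, j):
    """an(j): the transitive closure of the parent relation, i.e. all nodes with a directed path to j."""
    result, frontier = set(), parents(graph, j)
    while frontier:
        i = frontier.pop()
        if i not in result:
            result.add(i)
            frontier |= parents(graph, i)
    return result

# Toy augmented graph: E1 -> X1 -> X2 and E2 -> X2.
g = {"E1": {"X1"}, "E2": {"X2"}, "X1": {"X2"}, "X2": set()}
print(parents(g, "X2"))    # {'X1', 'E2'}
print(ancestors(g, "X2"))  # {'X1', 'E1', 'E2'}
```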

1.1.2 Simple SCMs

An important subclass of SCMs are simple SCMs, which is the class of SCMs that we will consider in the causal discovery algorithm in Section 1.2. To be able to define simple SCMs, we require the following definition:

Definition 1.5 (Unique solvability, Bongers et al., 2020). Let M := ⟨I, J, X, E, f, P_E⟩ be an SCM. We call M uniquely solvable with respect to O ⊆ I if there exists a measurable function g_O : X_{pa(O)\O} × E_{pa(O)} → X_O such that for P_E-almost every e and for all x ∈ X

x_O = g_O(x_{pa(O)\O}, e_{pa(O)}) ⇐⇒ x_O = f_O(x, e).

We call M uniquely solvable if it is uniquely solvable with respect to I.

From the definition of unique solvability it is not immediately clear whether this notion is related to the existence of a solution as given by Definition 1.2. The following theorem connects these two concepts:

Theorem 1.1 (Conditions for unique solvability, Bongers et al., 2020, Theorem 3.2.5). Given an SCM M := ⟨I, J, X, E, f, P_E⟩ and a subset O ⊆ I, the following are equivalent:

(i) for P_E-almost every e and for all x_{I\O} ∈ X_{I\O}, the structural equations

x_O = f_O(x, e)

have a unique solution x_O ∈ X_O;

(ii) M is uniquely solvable with respect to O.

Furthermore, if M is uniquely solvable, then there exists a solution (X, E), and all solutions have the same observational distribution PX.

The importance of an SCM being uniquely solvable lies in the existence of a solution (X, E), and the uniqueness of the observational distribution.

Definition 1.6 (Simple SCM, Bongers et al., 2020). We call an SCM M simple if it is

uniquely solvable with respect to every subset O ⊆ I.

As simple SCMs have to be uniquely solvable with respect to {i} ⊂ I, it is immediate from Definitions 1.3 and 1.5 that simple SCMs contain no self-cycles. It is important to note that all acyclic SCMs are simple (Bongers et al., 2020).

An advantage of simple SCMs (in contrast with more general SCMs) is the clear causal interpretation of its graph:

Definition 1.7 (Causal interpretation of a simple SCM, Bongers et al., 2020). Let M be

a simple SCM.

(i) If i → j ∈ G(M), i.e. i ∈ pa(j), then we call i a direct cause of j according to M.

(ii) If i → ... → j ∈ G(M), i.e. i ∈ an(j), then we call i a cause of j according to M.

(iii) If i ↔ j ∈ G(M), then we call i and j confounded according to M.

1.1.3 Markov properties and causal discovery

Much subtlety lies in the characterisation of observational distributions that are ‘compatible’ with G(M). As the constraint-based causal discovery algorithms that we consider are not necessarily designed for estimating f directly but merely try to extract the causal graph G(M) from observational data, we capitalise on such a notion of compatibility of distributions with G(M). The key ingredient is the relation between a graph-theoretic notion called σ-separation (Forré and Mooij, 2017) and the observational distribution. As will be explained later, we will assume the set of σ-separations to be isomorphic to the set of possible conditional independences in the observational distribution (Pearl, 1988).


Definition 1.8 (σ-separation, Mooij et al., 2020). We say that a path ⟨i_0, ..., i_n⟩ in the directed mixed graph G = ⟨V, E, F⟩ is σ-blocked by C ⊆ V if

(i) its first node i_0 ∈ C or its last node i_n ∈ C, or

(ii) it contains a collider i_k ∉ an(C), or

(iii) it contains a non-collider i_k ∈ C that points to a neighbouring node on the walk in another strongly-connected component.

If all paths in G between any node in set A ⊆ V and any node in set B ⊆ V are σ-blocked by a set C ⊆ V, we say that A is σ-separated from B by C, and we write A ⊥σ_G B | C.

We connect σ-separation and the observational distribution through the following theorem:

Theorem 1.2 (Generalised Directed Global Markov Property, Mooij et al., 2020). Any solution (X, E) of a simple SCM M obeys the Generalised Directed Global Markov Property with respect to the graph G(M):

A ⊥σ_{G(M)} B | C =⇒ X_A |= X_B | X_C

for all A, B, C ⊆ I.

Theorem 1.2 provides a result on the conditional independences, given a causal structure. The converse is actually what drives constraint-based causal discovery algorithms: when observing a conditional independence, we wish to infer a causal structure. The cornerstone of this reasoning is the following faithfulness assumption:

Assumption 1 (Faithfulness). For a given SCM M := ⟨I, J, X, E, f, P_E⟩, every conditional independence in the observational distribution entails a σ-separation:

A ⊥σ_{G(M)} B | C ⇐= X_A |= X_B | X_C

for all A, B, C ⊆ I.

Originally, a different version of σ-separation, called d-separation, was proposed by Pearl (1985) for modelling directed acyclic graphs, which however does not lead to a Markov property on the class of simple SCMs. The concept of d-separation has been thoroughly studied, and Markov properties and faithfulness have first been described in terms of d-separation. Meek (1995) has for example shown that when G(M) is a DAG and the joint probability measure P_{X_A X_B X_C} is multinomial or multivariate normal, for almost all parameters needed to parametrise P_{X_A X_B X_C}, the faithfulness assumption holds in terms of d-separation. The Markov property from Theorem 1.2 with d-separation instead of σ-separation has been shown to hold when M is acyclic (Spirtes et al., 1998; Forré and Mooij, 2017), when the endogenous spaces X_i are discrete (Pearl and Dechter, 1996; Forré and Mooij, 2017), or when the causal mechanisms f_i are linear (Spirtes, 1995; Forré and Mooij, 2017).

The main distinction between d-separation and σ-separation is whether the Markov property holds when considering simple SCMs. In terms of σ-separation the Markov property holds for all simple SCMs, whereas with d-separation the Markov property only holds for acyclic or discrete or linear simple SCMs, as mentioned earlier. Following Mooij et al. (2020) we assume faithfulness to hold for σ-separation, allowing us to infer causal relations from observed conditional independences.

Constraint-based causal discovery algorithms capitalise on the notion of σ-separation or d-separation to produce their output, being a (partial) reconstruction of the graph G(M). Applying a causal discovery algorithm to some data is done through the following process:

• assume that the data is generated by some simple SCM M;

• observe a sample X_1, ..., X_n from P_X;

• infer a set of conditional independences from the data, which implies a set of σ-separations via the faithfulness assumption (Ass. 1);

• construct a set of graphs {Ĝ} ∋ G(M) that are compatible with the set of σ-separations.

This scheme is depicted in Figure 1.1. Without limiting assumptions, there are always multiple graphs Ĝ that share the same set of σ-separations, except for trivial cases. The steps of constructing the set {Ĝ} and inferring from this set a (partial) reconstruction of the graph G(M) are specific to the causal discovery algorithm that is being used. Theoretical results on the performance of these algorithms assess the extent to which they are able to reconstruct G(M), provided they have a perfect estimate of the set of conditional independences at their disposal. If the dataset involves continuous variables then perfect independence testing is impossible, as we will see in Chapter 2.

Figure 1.1: A diagram which connects fundamental concepts of constraint-based causal discovery algorithms for simple SCMs. The arrow 'CD alg.' denotes the causal discovery algorithm.

1.1.4 Joint Causal Inference

Classically, causal relations are uncovered by means of experimentation. When an experimenter wishes to know whether some medicine cures an illness, she gathers a random sample of people with said illness, randomly splits this group into a treatment and a control group, administers the medicine to the former and a placebo to the latter group, and measures the outcome. When a dependence between the group and outcome is measured, the experimenter may conclude that the medicine indeed cures the illness.

(19)

1.1. Structural Causal Models 9

The causal discovery algorithms which assume purely observational data are not a priori applicable to datasets originating from such an experiment. Mooij et al. (2020) propose the Joint Causal Inference (JCI) framework, which resolves this issue by adding a context variable: a discrete random variable which indicates whether some intervention has been administered to the system. More specifically, any number of context variables can be included in the system, each indicating a different intervention. The related mathematical model is an SCM M := ⟨Ĩ, J, X, E, f, P_E⟩ with Ĩ := I ∪ K, where in accordance with Definition 1.1 the index set of the endogenous variables is denoted with Ĩ, consisting of system variables X := (X_i)_{i∈I} with standard measurable codomain X := ×_{i∈I} X_i and context variables C := (C_k)_{k∈K} with standard measurable codomain C := ×_{k∈K} C_k. A solution of this SCM is a triple of random variables (X, C, E) which satisfy

X_i = f_i(X, C, E)  ∀i ∈ I
C_k = f_k(X, C, E)  ∀k ∈ K

almost surely.

An advantage of incorporating context variables in the SCM is the possibility of merging datasets which are obtained from different contexts. Based on an example from (Mooij et al., 2020) we corroborate this advantage by considering the following question: does wearing a surgical mask lower the risk of contracting a disease related to a certain viral pandemic? If one simply registers observations of whether people wear a surgical mask when meeting people (system variable X1) and whether these people actually contracted the

disease in question (system variable X2), it is impossible to determine whether X1 causes X2 by applying a causal discovery algorithm. If one however adds context variables to the

system by registering the observations from a group of dentists (context variable Cα) and a group of people living in a neighbourhood which is classified as a ‘viral hotspot’ (context variable Cβ), then a possible causal relation between X1 and X2 becomes detectable, as

shown in Figure 1.2. An important condition for this result is that the contexts are not caused by the system; it does seem reasonable to assume that wearing surgical masks does not cause a career choice of becoming a dentist, and that contracting a disease does not cause one to move to a different neighbourhood.

Mooij et al. (2020) provide multiple assumptions regarding possible prior knowledge of the context variables. These assumptions are not necessary for applying causal discovery algorithms to the data, but they may be incorporated to improve the specificity and runtime of the algorithm. An assumption that will be of use in LCD is the following:

Assumption 2 (Exogeneity). No system variable directly causes any context variable, viz. i → k ∉ G(M) for all i ∈ I and all k ∈ K.

Similar to how the exogenous variables may not be caused by the endogenous variables (as per Definition 1.3), under this assumption the context variables can be interpreted as ‘observed exogenous’ variables.
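To make this concrete, the sketch below (not taken from the thesis; the column names, mechanisms and sample sizes are hypothetical) pools observations from the observational regime and from the two contexts of the example above into a single JCI dataset, with binary context indicators that a constraint-based algorithm can treat as ordinary exogenous variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def sample_context(n, c_alpha, c_beta):
    """Sample system variables X1 (mask wearing) and X2 (infection) in one fixed context.

    The context variables shift the distribution of X1 but are not caused by the
    system, in line with the exogeneity assumption (Assumption 2).
    """
    x1 = rng.binomial(1, 0.2 + 0.6 * c_alpha + 0.2 * c_beta, size=n)  # hypothetical mechanism
    x2 = rng.binomial(1, 0.4 - 0.25 * x1, size=n)                     # X1 -> X2
    return pd.DataFrame({"C_alpha": c_alpha, "C_beta": c_beta, "X1": x1, "X2": x2})

# Pool the data from all contexts into one dataset; the context columns become
# observed context variables of the joint SCM.
data = pd.concat([
    sample_context(500, 0, 0),   # general population
    sample_context(200, 1, 0),   # dentists
    sample_context(200, 0, 1),   # 'viral hotspot' neighbourhood
], ignore_index=True)
```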

Figure 1.2: Due to the presence of context variables Cα and Cβ, the causal relation X1 → X2 can be uncovered by a causal discovery algorithm.

1.1.5 Related theory

The theory presented in this section is sufficient for an understanding of causal modelling, and for understanding the relatively straightforward constraint-based causal discovery algorithm Local Causal Discovery, which will be covered in the next section. It is important to note that this exposition is not a complete account of the theory concerned with causal reasoning.

Although initiated by Wright (1921), causal reasoning was ultimately popularised by Pearl (1988), who introduced Bayesian networks as a mathematical framework for causal inference from purely observational data. A Bayesian network is a graphical model which is described by a set of conditional distributions which factorises over the causal graph:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_{pa(i)}).

By definition, Bayesian networks are restricted to the modelling of causal structures represented by a directed acyclic graph (Jensen and Nielsen, 2007), and so they cannot directly model latent confounding. As directed acyclic graphs are a subclass of simple SCMs, we note that the approach of simple SCMs is a generalisation of the classical Bayesian networks.

The realm of SCMs is certainly not restricted to the simple SCMs we introduced. The work of Forré and Mooij (2017) introduces modular Structural Causal Models (mSCM), of which the simple SCM constitutes a subclass. Bongers et al. (2020) analyse solvability issues, equivalence classes of SCMs, marginalisations of SCMs and the causal interpretation of different types of SCMs.

Ever since the first constraint-based algorithm was proposed by Pearl and Verma (1991), multiple causal discovery algorithms have been proposed, none of which is able to completely recover the graph G(M) from the observed set of σ-separations. One of the most advanced algorithms is Fast Causal Inference (FCI) (Spirtes et al., 1993), which outputs a partial ancestral graph containing the true directed mixed graph. In contrast with the straightforward causal interpretation we have for the graph of a simple SCM, partial ancestral graphs constitute mixed graphs with multiple types of edges, where the interpretation of those edges is more subtle than of the edges of the directed mixed graphs we use. Zhang et al. (2011) provide a detailed description of these graphs, as well as a description of FCI and its completeness with regard to acyclic graphs. Mooij and Claassen (2020) have recently extended these results by showing that FCI is sound and complete when the data is generated according to a simple SCM.

(21)

1.2. Local Causal Discovery 11

1.2 Local Causal Discovery

Among many constraint-based causal inference algorithms, the Local Causal Discovery (LCD) algorithm is simple but effective (Cooper, 1997). The main hypothesis of the LCD algorithm is as follows:

Theorem 1.3 (Mooij et al., 2020). Suppose that the data-generating process on three variables X1, X2, X3 can be represented as a subset of a faithful, simple SCM M and that the sampling procedure is not subject to selection bias. If X2 is not a cause of X1 according to M, the following conditional (in)dependences in the observational distribution

X1 6 |= X2,  X2 6 |= X3,  X1 |= X3 | X2   (1.1)

imply that the underlying causal graph must be one of the three DMGs in Figure 1.3. In particular,

(i) X3 is not a cause of X2 according to M

(ii) X2 is a (possibly indirect) cause of X3 according to M

(iii) X2 and X3 are not confounded according to M.

In case the pattern of equation (1.1) is found, we may speak of the LCD triple (X1, X2, X3).

Figure 1.3: All possible graphs which can be detected by LCD.

Proof. Since we have X1 |= X3 | X2, by the faithfulness assumption the graph must contain the σ-separation X1 ⊥σ X3 | X2. By inspection of Definition 1.8 we see that if there were a path ⟨X1, X3⟩, this path would not be σ-blocked by X2. As the σ-separation X1 ⊥σ X3 | X2 requires all paths between X1 and X3 to be σ-blocked by X2, we have that X1 and X3 cannot be adjacent.

Furthermore, using the Generalised Directed Global Markov Property (Theorem 1.2) we have

X1 6 |= X2  ⇐⇒  ¬ X1 |= X2 | ∅  =⇒  ¬ X1 ⊥σ X2 | ∅.

We note that the empty set can only σ-block a path ⟨X1, X2⟩ if the path contains a collider, which is not the case. If there were no path ⟨X1, X2⟩ then the σ-separation X1 ⊥σ X2 | ∅ would be vacuously true, so X1 and X2 must be adjacent. As we assume that X2 is not a cause of X1, the remaining options for this adjacency are X1 → X2, X1 ↔ X2, or both.

Similarly, the dependency X2 6 |= X3 implies adjacency of X2 and X3. As we require the σ-separation X1 ⊥σ X3 | X2, and none of the paths

X1 → X2 ← X3,  X1 → X2 ↔ X3,  X1 ↔ X2 ← X3,  X1 ↔ X2 ↔ X3

are σ-blocked by X2 (on each of them X2 is a collider and X2 ∈ an(X2)), the adjacency between X2 and X3 can neither be X2 ← X3 nor X2 ↔ X3, so it must be of the form X2 → X3. Upon combining these statements we conclude that the only possible graphs are the ones depicted in Figure 1.3. □

When applying LCD to a data set, the assumption that X2 is not a cause of X1 must be satisfied due to some prior knowledge of the data generating process. This is the case in a JCI setting where the context variables are not caused by the system, i.e. when Assumption 2 holds. If this is the case, we implement the LCD algorithm by iterating over all triples (C_k, X_i, X_{i′}), with indices of context variables k ∈ K and indices of system variables i, i′ ∈ I such that i ≠ i′.
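A minimal sketch of this iteration is given below; it is not the implementation used in the thesis. The conditional independence test is left abstract and is here replaced by a simple (partial) correlation test, and the column names and significance level α are assumptions for the sake of the example.

```python
from itertools import permutations

import numpy as np
from scipy import stats

def pcorr_pvalue(x, y, z=None):
    """p-value of a (partial) correlation test; a stand-in for any conditional independence test."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if z is None:
        return stats.pearsonr(x, y)[1]
    Z = np.column_stack([np.ones(len(x)), np.asarray(z, float)])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # residual of X after regressing on Z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # residual of Y after regressing on Z
    r = stats.pearsonr(rx, ry)[0]
    dof = len(x) - 2 - (Z.shape[1] - 1)
    t = np.sqrt(dof) * r / np.sqrt(1 - r**2)
    return 2 * stats.t.sf(abs(t), dof)

def lcd(data, context_cols, system_cols, alpha=0.05):
    """Return all LCD triples (C_k, X_i, X_i') found in a dataset of named columns."""
    triples = []
    for c in context_cols:
        for xi, xj in permutations(system_cols, 2):
            dep_cx = pcorr_pvalue(data[c], data[xi]) < alpha         # C_k and X_i dependent
            dep_xx = pcorr_pvalue(data[xi], data[xj]) < alpha        # X_i and X_i' dependent
            ci = pcorr_pvalue(data[c], data[xj], data[xi]) >= alpha  # C_k independent of X_i' given X_i
            if dep_cx and dep_xx and ci:
                triples.append((c, xi, xj))   # X_i is inferred to be an (ancestral) cause of X_i'
    return triples
```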

1.2.1 Marginalisation and ancestral relations

Theorem 1.3 provides direct causal relationships as it assumes the SCM to be restricted to the random variables X1, X2 and X3. When applying LCD to a data set with more than three variables, the edges in Figure 1.3 can merely be interpreted as ancestral relations. To provide some insight into this fact, we introduce the marginalisation of an SCM:

Definition 1.9 (Marginalisation, Bongers et al., 2020). Consider a simple SCM M := ⟨I, J, X, E, f, P_E⟩, a subset O ⊊ I, and define L := I \ O. For any g_O : X_{pa(O)\O} × E_{pa(O)} → X_O that makes M uniquely solvable with respect to O (see Definition 1.5), we call the SCM M_{marg(O)} := ⟨L, J, X_L, E, f̃, P_E⟩ a marginalisation of M with respect to O, with the 'marginal' causal mechanism f̃ : X_L × E → X_L defined by

f̃(x_L, e) := f_L(g_O(x_{pa(O)\O}, e_{pa(O)}), x_L, e).

Suppose we consider an SCM with endogenous variables C, X1, X2, X3 and the underlying causal mechanism as depicted in Figure 1.4a. When checking for presence of the LCD triple (C, X1, X3), the LCD algorithm is agnostic of the presence of the variable X2, so the result that X1 directly causes X3 can only be interpreted as a result in the marginalised graph M_{marg({X2})}, as depicted in Figure 1.4c. Bongers et al. show that simplicity is preserved under marginalisation, and that X1 ∈ pa(X3) in G(M_{marg({X2})}) implies that X1 ∈ an(X3) in G(M), so we interpret this direct causal effect in the marginalised graph as an ancestral causal effect.

Figure 1.4: 1.4a the graph of M, 1.4b its ancestral relations, and 1.4c its marginalisation with respect to X2.


2 Conditional independence testing

Conditional independence testing is an important element of all constraint-based causal discovery algorithms, of which an example has been demonstrated in Section 1.2. Most research has been done on testing conditional independences of the type X |= Y |Z, where X, Y and Z are continuous. Moreover, no conditional independence test for the case where either X or Y is discrete and the remaining variables are continuous has been brought to our attention, despite its necessity in the field of causal discovery (in a JCI setting, for example). As in practice tests for X, Y, Z continuous are applied in this case, even though X or Y is discrete, we cover some of these continuous tests in Section 2.3. In Section 2.2 we cover a recently proposed 'No Free Lunch theorem', which states that no conditional independence test can have uniform power against any alternative hypothesis. First we provide some preliminary theory.

2.1 Preliminary theory

In this section we revisit basic theory of probability distributions and some of their characteristics, of which we pay extra attention to (conditional) independence.

2.1.1 Distributions

Let (Ω, Σ, P) be a probability space, let (X, B(X)) be a measurable space where B(X) denotes the Borel σ-algebra on X, and let X : Ω → X be a random variable. The probability measure P_X(B) := P(X ∈ B) on (X, B(X)) is referred to as the distribution of X. If X ⊆ R, then the map x ↦ P(X ≤ x) is the distribution function of X. If µ is another measure on (X, B(X)), then P_X is absolutely continuous with respect to µ if µ(B) = 0 =⇒ P_X(B) = 0 for all B ∈ B(X), which is denoted as P_X ≪ µ. If µ is σ-finite and P_X ≪ µ, then there exists a function p ∈ L¹(X, B(X), µ) such that

P_X(B) = ∫_B p(x) dµ(x).

The function p is often referred to as the Radon-Nikodym derivative of P_X with respect to µ (Spreij, 2018), or alternatively as the density of X. We refer to p(X) as the likelihood¹ of X. If X = R^d, µ is the Lebesgue measure and P_X ≪ µ, then we refer to X as a continuous random variable, and to P_X as a continuous distribution. If P_X({x ∈ X : P_X({x}) > 0}) = 1, then we call X a discrete random variable, and P_X a discrete distribution. If X is discrete and X is countable, then P_X is absolutely continuous with respect to the counting measure on X.

¹ The likelihood is often used in relation with a dominated, parametrised model of probability distributions {P_θ : θ ∈ Θ}, where the likelihood p_θ(X) can be interpreted as a function of θ.

Let X, Z be random variables on the probability space (Ω, Σ, P) and let B ∈ B(X). We define the conditional probability of X ∈ B given Z as P_X(B|Z) := P(X ∈ B|Z) := P(X ∈ B|σ(Z)) := E[1_{X∈B}|σ(Z)], and so we have

P(X ∈ B ∩ Z ∈ C) = ∫_{{Z∈C}} P_X(B|Z) dP.   (2.1)

In the case that (X, B(X)) is a standard Borel space (which we will always assume), the mapping B ↦ P_X(B|Z) can be made to be countably additive², and we refer to such a mapping as the conditional distribution of X given Z (Ghosal and van der Vaart, 2017; Spreij, 2018). More specifically, in this case the mapping (B, ω) ↦ P_X(B|Z)(ω) is a Markov kernel from (Ω, σ(Z)) to (X, B(X)):

Definition 2.1 (Markov kernel, Ghosal and van der Vaart, 2017). A Markov kernel from a measurable space (Ω, Σ) into another measurable space (X, B(X)) is a map P : B(X) × Ω → [0, 1] such that

(i) the map B ↦ P(B|ω) is a probability measure for all ω ∈ Ω;

(ii) the map ω ↦ P(B|ω) is measurable for all B ∈ B(X).

If we redefine Z to be the identity map on (Z, B(Z), P_Z), then the conditional distribution of X given Z is a Markov kernel (B, z) ↦ P_X(B|Z = z) from (Z, B(Z)) into (X, B(X)). In this case equation (2.1) can be written as

P(X ∈ B ∩ Z ∈ C) = ∫_C P_X(B|Z = z) dP_Z(z).

When the joint distribution P_XZ admits a joint density p(x, z) with respect to the product measure µ ⊗ ν on X × Z, then the conditional distribution P_X(·|Z) has a conditional density x ↦ p(x|z) with respect to µ for every z ∈ Z, which satisfies

p(x|z) = p(x, z) / p(z)   (2.2)

as shown by Rønn-Nielsen and Hansen (2014).

2.1.2 Independence

We call any two σ-algebras Σ_1, Σ_2 ⊆ Σ independent if P(B_1 ∩ B_2) = P(B_1)P(B_2) for all B_1 ∈ Σ_1, B_2 ∈ Σ_2. Any two random variables X, Y on Ω are independent if the σ-algebras σ(X) and σ(Y) are independent, which can be stated as P(X ∈ B ∩ Y ∈ C) = P_X(B)P_Y(C) for all B ∈ B(X) and C ∈ B(Y). We denote independence of X and Y as X |= Y. If X and Y are not independent, we write X 6 |= Y.

² Countable additivity may a priori hold everywhere except on some set N ⊂ Ω of probability zero. If for all ω ∈ N we redefine P_X(·|Z)(ω) to be some fixed probability measure, then countable additivity holds for all ω ∈ Ω.


Independence of X and Y has multiple characterisations. We for example have that X and Y are independent if and only if the Markov kernel P_X(·|Y = y) is constant over Y (Rønn-Nielsen and Hansen, 2014), in which case

P(X ∈ B) = P(X ∈ B ∩ Y ∈ Y) = ∫_Y P_X(B|Y = y) dP_Y(y) = P_X(B|Y = y_0) · ∫_Y dP_Y(y) = P_X(B|Y = y_0)   (2.3)

for any y_0 ∈ Y. If the density p(x, y) exists, then we have p(x, y) = p(x)p(y), which indeed yields p(x|y) = p(x) via equation (2.2).

It is important to note that X |= Y =⇒ Cov(X, Y ) = 0, but the converse is not necessarily true.

Definition 2.2 (Conditional independence). Let X, Y, Z be random variables on a probability space (Ω, Σ, P). We say that X and Y are conditionally independent given Z if

P(X ∈ B ∩ Y ∈ C|Z) = P(X ∈ B|Z)P(Y ∈ C|Z)

a.s. for all B ∈ B(X) and all C ∈ B(Y), which is denoted with X |= Y |Z.

As we consider (X, B(X)) and (Y, B(Y)) to be standard Borel spaces, we have B(X × Y) = B(X) ⊗ B(Y), and so (X × Y, B(X × Y)) is a standard Borel space as well. This implies that the joint distribution (B, C) ↦ P_XY(B, C|Z = z) is a Markov kernel from (Z, B(Z)) into (X × Y, B(X × Y)), and we have

X |= Y |Z ⇐⇒ P_XY(B, C|Z = z) = P_X(B|Z = z)P_Y(C|Z = z)

for all B ∈ B(X), C ∈ B(Y) and all z ∈ Z. If the distribution of (X, Y, Z) admits a joint density p, then conditional independence is equivalent to p(x, y|z) = p(x|z)p(y|z).

We call X and Y weakly conditionally independent given Z if the identity of Definition 2.2 holds in expectation over Z, viz.

P(X ∈ B ∩ Y ∈ C) = E_Z[P_X(B|Z)P_Y(C|Z)]

for all B ∈ B(X) and C ∈ B(Y) (Zhang et al., 2017). Similar to the marginal case, we have that X |= Y |Z =⇒ Cov(X, Y |Z) = 0 a.s., but the converse need not be true. Daudin (1980) shows that weak conditional independence holds if and only if

E_Z[Cov(f(X), g(Y)|Z)] = 0   (2.4)


2.2 A No Free Lunch theorem

In this section we state a No Free Lunch Theorem³, which states that there is no conditional independence test that is able to detect all types of conditional dependences. In order to state the theorem, we first recapitulate some statistical notions. For this exposition we have gratefully drawn from Shah and Peters (2020).

When given n observations of the random variables X, Y and Z, a possibly randomised conditional independence test is a measurable function

ψ_n : (X × Y × Z)^n × Ω → {0, 1},   (2.5)

for some measurable space (Ω, Σ). Here ψ_n = 0 denotes X |= Y |Z, and ψ_n = 1 denotes X 6 |= Y |Z. An example of such a randomised test is the RCoT (as will be treated in Section 2.3), where repeated application to the same data provides slightly different outcomes due to the involvement of drawing random elements in computing the statistic.

We denote a null hypothesis H_0 as a space of distributions P, where we for example let H_0 : X |= Y |Z be the space of distributions of (X, Y, Z) such that the conditional independence holds. Let (ψ_n)_{n=1}^∞ be a sequence of tests. For any of these ψ_n, we define the size to be the worst case probability of erroneously rejecting H_0, i.e. sup_{P∈P} P(ψ_n = 1). This erroneous rejection of H_0 is referred to as a type I error or a false positive. We say that ψ_n has valid level α ∈ (0, 1) if the size does not exceed α, i.e. sup_{P∈P} P(ψ_n = 1) ≤ α, and the sequence (ψ_n) has uniformly asymptotic level α if

lim sup_{n→∞} sup_{P∈P} P(ψ_n = 1) ≤ α.

If we let Q denote the space of distributions under which the alternative hypothesis H_1 holds, then the power of the test ψ_n against alternative Q ∈ Q is the probability of correctly rejecting H_0, i.e. Q(ψ_n = 1).
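As an illustration of these notions (not part of the thesis), the following sketch estimates the size and the power of a simple correlation-based marginal independence test by Monte Carlo simulation; the data-generating distributions, sample size and number of repetitions are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def reject(x, y, alpha=0.05):
    """psi_n: reject H0 (independence of X and Y) when the correlation test p-value is below alpha."""
    return stats.pearsonr(x, y)[1] < alpha

def rejection_rate(sampler, n=200, reps=2000):
    """Monte Carlo estimate of P(psi_n = 1) under the distribution generated by `sampler`."""
    return np.mean([reject(*sampler(n)) for _ in range(reps)])

# Size: rejection rate under a distribution in the null hypothesis (X and Y independent).
size = rejection_rate(lambda n: (rng.standard_normal(n), rng.standard_normal(n)))

# Power against one particular alternative Q: Y depends linearly on X.
def alternative(n):
    x = rng.standard_normal(n)
    return x, 0.3 * x + rng.standard_normal(n)

power = rejection_rate(alternative)
print(f"estimated size = {size:.3f}, estimated power = {power:.3f}")
```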

Ideally, a test has uniformly asymptotic level, and its power is large uniformly over Q, i.e. lim inf_{n→∞} inf_{Q∈Q} Q(ψ_n = 1) = 1. For nonparametric problems it is often hard to construct a test which adheres to this latter condition, so one needs to restrict Q to obtain power tending to 1 (Balakrishnan and Wasserman, 2019). However, restricting Q may not yet solve the problem. For even harder problems we can only hope to have at least more power than level at some alternative, i.e. sup_{Q∈Q} Q(ψ_n = 1) > sup_{P∈P} P(ψ_n = 1). We call a pair (P, Q) untestable if this is not possible, so if we have

Q(ψ_n = 1) ≤ sup_{P∈P} P(ψ_n = 1)

for all n, tests ψ_n and alternative distributions Q ∈ Q. In this case, the only possibility of obtaining power against any alternative is by restricting P.

³ About the name 'No Free Lunch theorem', the initiators of this term explained: "We have dubbed the associated results No Free Lunch theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems." (Wolpert and Macready, 1997)

The following theorem states that testing for conditional independence among continuous distributions is an untestable hypothesis. More specifically, let E_0 be the space of continuous distributions of (X, Y, Z). Let P_0 ⊂ E_0 be the set of distributions such that X |= Y |Z, and for M ∈ (0, ∞] let E_{0,M} be the subset of distributions which have their support contained in an ℓ∞ ball of radius M (so the range of X, Y and Z is bounded by M in ℓ∞ norm). Let Q_0 := E_0 \ P_0, and set P_{0,M} := P_0 ∩ E_{0,M} and Q_{0,M} := Q_0 ∩ E_{0,M}. Using this we formulate the following theorem:

Theorem 2.1 (No Free Lunch, Shah and Peters, 2020). Given any n ∈ N, α ∈ (0, 1), M ∈ (0, ∞] and any potentially randomised test ψ_n that has valid level α for the null hypothesis P_{0,M}, we have that Q(ψ_n = 1) ≤ α for all Q ∈ Q_{0,M}. Thus, ψ_n cannot have power against any alternative, and the pair (P_{0,M}, Q_{0,M}) is untestable.

Both Shah and Peters (2020) and Neykov et al. (2020) show that X |= Y |Z is untestable, even when X and Y are discrete. Continuity of Z is necessary for untestability. As we have mentioned earlier, it is important to note that the problem of untestability can be mitigated by restricting the null hypothesis P0,M. We will encounter examples of restricting

the null hypothesis in the next section.

2.3 Overview of conditional independence tests

In this section we provide some conditional independence tests, which assume that all the variables are continuous. In accordance with the No Free Lunch theorem, each of these tests is accompanied with the restrictions it imposes on the data, which essentially restrict the space P from the previous section. It should be noted that each of these tests admits a 'default' marginal independence test, which will also be analysed in Chapter 4.

2.3.1 Pearson’s partial correlation

A well known conditional independence test is based on the partial correlation statistic. For one-dimensional random variables X and Y we define Pearson's correlation coefficient by

ρ_XY := Cov(X, Y) / √(Var(X) Var(Y)),   (2.6)

and for observations x_1, ..., x_n and y_1, ..., y_n we define the sample correlation coefficient by

ρ̂_XY := Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / ( √(Σ_{i=1}^n (x_i − x̄)²) √(Σ_{i=1}^n (y_i − ȳ)²) ),

where x̄ denotes the sample mean. Considering a one-dimensional random variable Z with observations z_1, ..., z_n, we define Pearson's (sample) partial correlation by

ρ_{XY|Z} := (ρ_XY − ρ_XZ ρ_YZ) / ( √(1 − ρ²_XZ) √(1 − ρ²_YZ) ),    ρ̂_{XY|Z} := (ρ̂_XY − ρ̂_XZ ρ̂_YZ) / ( √(1 − ρ̂²_XZ) √(1 − ρ̂²_YZ) ).


Whittaker (1990) provides an equivalent definition of the partial correlation by considering the linear regression models X = α_X + β_X Z + ε_X and Y = α_Y + β_Y Z + ε_Y, and denoting the partial correlation coefficient as the Pearson correlation between the residuals, i.e. ρ_{XY|Z} := ρ_{ε_X ε_Y}, with ρ_{ε_X ε_Y} as defined by equation (2.6).

Conditional independence testing proceeds using the assumption that the implication

ρ_{XY|Z} = 0 =⇒ X |= Y |Z   (2.7)

holds. To test whether ρ_{XY|Z} = 0, we use the test statistic

t_{XY|Z} := √(n − 2 − d) · ρ_{XY|Z} / √(1 − ρ²_{XY|Z}),

which has a Student-t distribution with n − 2 − d degrees of freedom when (X, Y, Z) has a multivariate normal distribution (Kim, 2015; Kendall, 1946).

From the definition of Whittaker it is clear that marginally testing for X |= Y with partial correlation defaults to the standard correlation test.
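A small Python sketch (not from the thesis) of this test for a one-dimensional conditioning variable, using the recursion formula above and the t-statistic; the toy data at the end is hypothetical.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Sample partial correlation rho_{XY|Z} for one-dimensional Z, via the recursion formula."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

def partial_corr_test(x, y, z):
    """t-test of H0: rho_{XY|Z} = 0, calibrated under multivariate normality of (X, Y, Z)."""
    n, d = len(x), 1                                   # d = dimension of Z
    r = partial_corr(x, y, z)
    t = np.sqrt(n - 2 - d) * r / np.sqrt(1 - r**2)
    return r, 2 * stats.t.sf(abs(t), df=n - 2 - d)     # two-sided p-value

# Toy check: X and Y depend on Z only, so X _||_ Y | Z should typically not be rejected.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + 0.5 * rng.standard_normal(500)
y = -z + 0.5 * rng.standard_normal(500)
print(partial_corr_test(x, y, z))
```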

Despite the ubiquity of the partial correlation test, no general necessary conditions on the distribution of (X, Y, Z) are known for assumption (2.7) to hold. Baba et al. (2004) investigate this assumption by relating the zero partial correlation with conditional independence through conditional covariance Cov(X, Y |Z) and the conditional correlation defined by

ρ̃_{XY|Z} := Cov(X, Y |Z) / ( √(Var(X|Z)) √(Var(Y |Z)) ).

The first notable result is the following:

Var(Y |Z). The first notable result is the following:

Theorem 2.2 (Baba et al., 2004). For any random variables X, Y and random vector Z = (Z_1, ..., Z_d), if there exist a vector α and a matrix β such that

E[(X, Y )|Z] = α + βZ and ρ̃_{XY|Z} does not depend on Z,   (2.8)

then ρ_{XY|Z} = ρ̃_{XY|Z}.

Baba et al. (2004) proceed by exploring for what classes of distributions the condition of equation (2.8) is satisfied, and only find it to hold when (X, Y, Z) has an elliptical distribution: a distribution P_XYZ with characteristic function

φ(u) := ∫_{X×Y×Z} e^{i u⊤(x,y,z)} dP_XYZ(x, y, z) = e^{i u⊤µ} ψ(u⊤Σu)

for some vector µ, matrix Σ, and scalar function ψ (Baba et al., 2004). The multivariate normal distribution is such an elliptical distribution, as its characteristic function is given by φ(u) = e^{i u⊤µ − ½ u⊤Σu}, where µ and Σ are its mean and covariance matrix respectively.

Moreover, Baba et al. (2004) show that it is the only distribution in the class of elliptical distributions for which conditional independence actually exists, in which case it has zero partial correlation:


Theorem 2.3 (Baba et al., 2004). The normal distribution is the only elliptical distribution for which conditional independence exists. More specifically, if the distribution of (X, Y, Z) is elliptical, then X |= Y |Z if and only if ρ_{XY|Z} = 0 and the distribution of (X, Y, Z) is multivariate normal.

Baba et al. (2004) extend this result to nonparanormal distributions: random vectors (ψ_1(X), ψ_2(Y), Z) with (X, Y, Z) multivariate normal and ψ_1 and ψ_2 monotone transformations (Liu et al., 2009), for which we have

ρ̃_{XY|Z} = 0 ⇐⇒ ψ_1(X) |= ψ_2(Y)|Z.

2.3.2 Spearman’s partial correlation

A nonparametric alternative to Pearson's partial correlation is Spearman's partial correlation. This dependence measure is based on Spearman's rank correlation coefficient:

ρ^s_XY := ρ_{rg(X) rg(Y)} = Cov(rg(X), rg(Y)) / √(Var(rg(X)) Var(rg(Y))),

where rg(X) denotes the conversion of observations X_1, ..., X_n to their ranks. Similar to Pearson's partial correlation coefficient, we define Spearman's partial correlation coefficient by

ρ^s_{XY|Z} := (ρ^s_XY − ρ^s_XZ ρ^s_YZ) / ( √(1 − (ρ^s_XZ)²) √(1 − (ρ^s_YZ)²) ).

The marginal ‘default’ of this test is the well-known rank correlation test.

A well known application of Spearman’s partial correlation test is in the ‘Rank PC’ algorithm, which is a constraint-based causal discovery algorithm known to be consistent in high dimensional nonparanormal settings (Harris and Drton, 2013).

2.3.3 The Generalised Covariance Measure

A more advanced conditional independence test is the Generalised Covariance Measure (GCM), as proposed by Shah and Peters (2020). As mentioned earlier, Pearson's partial correlation can be defined as the correlation of the errors of linear regressions. The GCM consists of a similar dependence measure, namely a normalised version of the covariance of the errors resulting from nonlinear regressions. More specifically, for one-dimensional random variables X and Y and a d-dimensional random variable Z we may decompose

X = E[X|Z] + ε,  Y = E[Y |Z] + ξ.

For given observations (x_1, y_1, z_1), ..., (x_n, y_n, z_n) we may estimate the conditional expectations f(z) := E[X|Z = z] and g(z) := E[Y |Z = z] via some (possibly nonlinear) regression method. Denoting the estimated conditional expectations with f̂ and ĝ, and the resulting residuals with ε̂_i := x_i − f̂(z_i) and ξ̂_i := y_i − ĝ(z_i), we define the products of the regression errors by R_i := ε̂_i ξ̂_i; then the GCM is defined as

T^(n) := ( √n · (1/n) Σ_{i=1}^n R_i ) / ( (1/n) Σ_{i=1}^n R_i² − ((1/n) Σ_{i=1}^n R_i)² )^{1/2}.   (2.9)

Under the null hypothesis X |= Y |Z the GCM converges in distribution to a standard Gaussian random variable, provided the regressions f and g can be estimated sufficiently well:

Theorem 2.4 (Asymptotic normality, Shah and Peters, 2020). Assume that (X, Y, Z) have a distribution such that X |= Y |Z. Under this joint probability measure, if we have

n · E[ (1/n) Σ_{i=1}^n (f(z_i) − f̂(z_i))² ] · E[ (1/n) Σ_{i=1}^n (g(z_i) − ĝ(z_i))² ] → 0,   (2.10)

E[ (1/n) Σ_{i=1}^n (f(z_i) − f̂(z_i))² E[ξ²|Z = z_i] ] → 0,   (2.11)

E[ (1/n) Σ_{i=1}^n (g(z_i) − ĝ(z_i))² E[ε²|Z = z_i] ] → 0,   (2.12)

then T^(n) converges in distribution to a standard Gaussian random variable.

Note that the condition of equation (2.10) controls the mean squared prediction error of the prediction of both X and Y . The proof shows that the numerator of (2.9) weakly converges to a standard Gaussian, and that the denominator of (2.9) converges to 1 in probability.

It is instructive to view the GCM as a normalised version of the conditional covariance, as we have

E[εξ] = E[(X − E[X|Z])(Y − E[Y |Z])] = E[E[XY |Z] − E[X|Z]E[Y |Z]] = E[Cov(X, Y |Z)], and so the GCM is in expectation zero under the null. Comparing this with equation (2.4) we see that the alternative hypothesis related to the GCM is rather restrictive, and that there is no power when E[Cov(X, Y |Z)] = 0 but X 6 |= Y |Z.
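The GCM statistic is straightforward to compute once a regression method is fixed. The sketch below is not the authors' implementation; it uses gradient boosting from scikit-learn as one possible choice of regression, and the example data at the end is hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor

def gcm_test(x, y, Z):
    """Generalised Covariance Measure test of X _||_ Y | Z (Shah and Peters, 2020).

    x, y: one-dimensional samples; Z: (n, d) array of conditioning variables.
    Returns the statistic T^(n) and its asymptotic two-sided p-value
    (standard normal under the null, given sufficiently good regressions).
    """
    eps = x - GradientBoostingRegressor().fit(Z, x).predict(Z)   # residuals of X regressed on Z
    xi = y - GradientBoostingRegressor().fit(Z, y).predict(Z)    # residuals of Y regressed on Z
    R = eps * xi                                                  # products R_i
    T = np.sqrt(len(R)) * R.mean() / np.sqrt(R.var())             # equation (2.9)
    return T, 2 * stats.norm.sf(abs(T))

# Example where X and Y depend on Z only, so the null hypothesis holds:
rng = np.random.default_rng(0)
Z = rng.uniform(-2, 2, size=(400, 1))
x = np.sin(Z[:, 0]) + 0.3 * rng.standard_normal(400)
y = np.cos(Z[:, 0]) + 0.3 * rng.standard_normal(400)
print(gcm_test(x, y, Z))
```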

2.3.4 The Hilbert-Schmidt independence criterion

An entirely different type of conditional independence test is based on the Hilbert-Schmidt independence criterion. This criterion is stated as a certain linear operator between two Reproducing Kernel Hilbert Spaces (RKHSs) being equal to 0, which is under suitable assumptions equivalent to (weak) conditional independence. Stating this theorem in full requires the introduction of an RKHS, the cross-covariance operator, and more (causing this section to be more elaborate than those of the other tests). As we are not necessarily concerned with the broad theory of RKHSs, we adapt the definitions appropriately.

Definition 2.3 (Reproducing Kernel Hilbert Space, Aronszajn, 1950). Let X be a set, let H be a class of functions on X, and suppose that (H, ⟨·, ·⟩) is a real Hilbert space. A reproducing kernel of H is a measurable map k : X × X → R such that

(i) for every y ∈ X, the map x ↦ k(x, y) is in H;

(ii) the reproducing property holds: for every y ∈ X and every f ∈ H, we have

f(y) = ⟨f(·), k(·, y)⟩.

If such a reproducing kernel exists, we refer to H as a Reproducing Kernel Hilbert Space

(RKHS).

Instead of through the existence of a reproducing kernel, an RKHS is often defined as a Hilbert space of functions on which all evaluation functionals are continuous. This is due to item (ii) of the following proposition:

Proposition 2.1 (Aronszajn, 1950). Let H be a class of functions defined on X forming a Hilbert space with the inner product ⟨·, ·⟩, and let k denote a corresponding reproducing kernel. Then we have the following:

(i) If a reproducing kernel k exists then it is unique.

(ii) A reproducing kernel k exists if and only if for every x ∈ X the evaluation functional L_x : H → R, f ↦ f(x) is continuous, i.e. there is a c ∈ R such that |L_x(f)| ≤ c‖f‖ for all f ∈ H.

(iii) k is symmetric and positive definite

(iv) For every symmetric and positive definite k, there exists one and only one class of functions H forming a Hilbert space, admitting k as a reproducing kernel.

Proof. See Appendix, Section A.1. □

Now that we have some insight into RKHSs, we turn to conditional independence testing. First we define the cross-covariance operator:

Definition 2.4 (Cross-covariance operator, Fukumizu et al., 2004). Let X, Y be random variables with canonical representations⁴ (X, B(X), P_X) and (Y, B(Y), P_Y), and let (H_X, k_X) and (H_Y, k_Y) be RKHSs on X and Y respectively with k_X and k_Y measurable. If we have E_X[k_X(X, X)] < ∞ and E_Y[k_Y(Y, Y)] < ∞, then there exists a unique operator Σ_YX : H_X → H_Y such that

⟨g, Σ_YX f⟩_{H_Y} = Cov(f(X), g(Y))

for all f ∈ H_X and g ∈ H_Y. This is called the cross-covariance operator.

The cross-covariance operator is related to independence via the following result:

Proposition 2.2 (Jacod and Protter, 2004, Theorem 10.1; Gretton et al., 2005). Random variables X and Y are independent if and only if Cov(f(X), g(Y)) = 0 for all f ∈ C_b(X) and g ∈ C_b(Y). Consequently, if H_X = C_b(X) and H_Y = C_b(Y) we have that

Σ_YX = 0 ⇐⇒ X |= Y.

⁴ For a measure space (Ω, Σ, P), a measurable space (X, F) and a random variable X : Ω → X with distribution P_X(F) := P(X ∈ F) for all F ∈ F, we call (X, F, P_X) the canonical representation of X, as the identity map on (X, F, P_X) has the same distribution P_X as X.

Kernels which almost satisfy the preceding assumptions are c_0-universal kernels: kernels on a locally compact Hausdorff space (X, d) such that the induced RKHS H is dense in C_0(X), the space of continuous functions on X which vanish at infinity (Sriperumbudur et al., 2011). Examples of c_0-universal kernels on R^d are the Gaussian Radial Basis Function (Gaussian RBF) and the Laplacian kernel, defined by

k_G(x, y) := e^{−‖x−y‖² / (2σ²)}  and  k_L(x, y) := e^{−‖x−y‖ / σ}

for σ > 0 (Steinwart, 2001). The type of kernels which precisely satisfy the assumptions of Proposition 2.2 are c_b-universal kernels: kernels with a corresponding RKHS H which is dense in C_b(X), the space of bounded continuous functions on any topological space X (note that C_0(X) ⊂ C_b(X)) (Sriperumbudur et al., 2011). As more explicit characterisations of these kernels are quite technical and beyond the scope of this thesis, we refer to Sriperumbudur et al. (2011) for more information on c_b-universal kernels.
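The section above defines the cross-covariance operator Σ_YX; the (marginal) Hilbert-Schmidt independence criterion is its squared Hilbert-Schmidt norm, and a standard biased empirical estimate is trace(KHLH)/(n−1)², where K and L are kernel Gram matrices and H is the centring matrix (Gretton et al., 2005). The sketch below (not from the thesis) computes this estimate with the Gaussian RBF kernel; the bandwidths and the toy data are hypothetical, and calibration (e.g. by a permutation test) is not shown.

```python
import numpy as np

def rbf_gram(x, sigma):
    """Gaussian RBF Gram matrix k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) for x of shape (n, d)."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC estimate trace(KHLH) / (n-1)^2."""
    n = x.shape[0]
    K = rbf_gram(x, sigma_x)
    L = rbf_gram(y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n                 # centring matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Toy example: a nonlinear, uncorrelated dependence that a kernel criterion can pick up.
rng = np.random.default_rng(0)
x = rng.standard_normal((300, 1))
y = x**2 + 0.1 * rng.standard_normal((300, 1))
print(hsic(x, y), hsic(x, rng.permutation(y)))          # dependent vs. permuted (near-independent)
```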

The operator Σ_YX is a bounded operator, with adjoint Σ*_YX = Σ_XY. As a result, the covariance operator Σ_XX is bounded and self-adjoint (Fukumizu et al., 2004). If we define Σ̃_XX as the restriction of Σ_XX to (ker Σ_XX)^⊥, then the right inverse Σ̃⁻¹_XX is well defined.

Definition 2.5 (Conditional cross-covariance operator, Fukumizu et al., 2004). If we consider another random variable Z with canonical representation (Z, B(Z), P_Z) and RKHS (H_Z, k_Z) on Z with k_Z measurable, then the conditional cross-covariance operator of (X, Y) given Z is the bounded linear operator Σ_{YX|Z} : H_X → H_Y defined by

Σ_{YX|Z} := Σ_YX − Σ_YZ Σ̃⁻¹_ZZ Σ_ZX.

If we have E_X[k_X(X, X)] < ∞, E_Y[k_Y(Y, Y)] < ∞ and E_Z[k_Z(Z, Z)] < ∞, then

⟨g, Σ_{YX|Z} f⟩_{H_Y} = E[Cov(f(X), g(Y)|Z)].   (2.13)

It is interesting to compare the latter result with Daudin's characterisation (2.4) of weak conditional independence, which already hints at equation (2.13). Since Σ_YZ = Σ^{1/2}_YY V Σ^{1/2}_ZZ for some bounded linear operator V (Baker, 1973), the operator Σ_YZ Σ̃⁻¹_ZZ Σ_ZX is uniquely defined, even if Σ̃⁻¹_ZZ is not unique (Fukumizu et al., 2004). For any two probability measures P and Q on X, the kernel k_X is called characteristic if E_P[f(X)] = E_Q[f(X)] for all f ∈ H_X implies that P = Q. It is important to note that all c_0-universal kernels are characteristic with respect to the space of finite Radon measures⁵ (Sriperumbudur et al., 2011). The conditional cross-covariance operator is related to conditional independence through the following result:

⁵ A Radon measure µ on (X, B(X)) with X Hausdorff is a locally finite measure such that µ(B) = sup{µ(K) : K ⊆ B, K compact} for all open B ∈ B(X), and µ(B) = inf{µ(U) : U ⊇ B, U open} for all B ∈ B(X).
