
Master's Thesis

Detecting causal precursor regions in complex climate data: An evaluation of methods using synthetic data

Author: Lennart Mettrop
Supervisors: Dr. Dim Coumou, Sem Vijverberg
Examiner: Rick Quax

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computational Science

in the Computational Science Lab, Informatics Institute

Declaration of Authorship

I, Lennart Mettrop, declare that this thesis, entitled 'Detecting causal precursor regions in complex climate data: An evaluation of methods using synthetic data' and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at the University of Amsterdam.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date: 19 October 2020


Abstract

Faculty of Science, Informatics Institute

Master of Science in Computational Science

Detecting causal precursor regions in complex climate data: An evaluation of methods using synthetic data

by Lennart Mettrop

To reduce the harmful effects of extreme weather events like heatwaves, one may look at early warning systems. This thesis looks at the prediction of these heatwave events, which is one aspect of early warning systems. To be able to make these predictions, causal inference algorithms are used to infer the precursors of hot weather events. As a causal inference algorithm only works for 1-dimensional datasets, the spatial dimensions of climate data first have to be reduced. Inspired by McKinnon, who reduced the spatial dimensions by clustering together weather stations that simultaneously record extreme temperature anomalies, Vijverberg expanded on an automated SST-based forecasting tool called RG-CPD, originally developed by Kretschmer et al. RG-CPD first calculates a dependence map using correlation for different lags. Adjacent points with significant correlation of the same sign are clustered together. The PCMCI causal inference algorithm is applied to the spatial mean of the detected communities to infer the underlying links to the target variable.

To improve the RG-CPD framework, this research implements different dependence metrics to calculate the dependence map instead of correlation: partial correlation with the history of the target or the precursor as the third variable, transfer entropy, and Granger causality.

A spatio-temporal toy model developed by Tibau et al, called SAVAR, is used to generate synthetic data. This data is used to test RG-CPD. Using different models and experiments, the dependence metrics are compared. The toy model also allows for testing the limitations of the RG-CPD framework.


This study shows that the tested metrics do not influence the final causal model found by RG-CPD. However, in the community detection part of RG-CPD, differences between the metrics are found. Correlation and partial correlation are able to find larger areas, giving us more information. Granger causality is better at filtering false positives and is even able to determine the direction of the dependence relation. Transfer entropy scored worst in all tests.

None of the tested metrics scores optimally in every experiment. Therefore, a combination of different metrics will give a better result. Correlation should be used to find the largest portion of the real area, while partial correlation on the target or Granger causality should be used to minimize false positives.

Acknowledgements

First, I want to thank my supervisors Dim Coumou and Sem Vijverberg for helping me understand the climatological interpretations and for guiding me through this thesis. I also want to thank my friends and family for their support.

Finally, I want to thank my colleagues and supervisors at work for allowing me to focus on finishing my thesis in time.

Contents

Declaration of Authorship
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction

2 Background
2.1 Motivation for causal discovery algorithms
2.2 Correlation based methods
2.3 Granger causality
2.4 Entropy based methods
2.5 Graphical models
2.6 Functional Causal Models
2.7 Dimensionality reduction in climatology
2.7.1 General dimensionality reduction
2.7.2 Dimensionality reduction for timeseries
2.7.3 Clustering
2.7.3.1 Response guided methods and RG-CPD

3 Methods
3.1 Introduction
3.2 Response guided causal precursor detection (RG-CPD)
3.2.1 RG-CPD settings
3.2.1.1 Settings DBSCAN
3.2.1.2 Settings PCMCI
3.3 Spatial averaged vector autoregressive model (SAVAR)
3.3.1 SAVAR settings

4 Experiment design
4.1 Causal models
4.2 Detecting causal model
4.3 Score functions
4.3.1 Other measures
4.3.2 Benchmark
4.4 Experiments
4.4.1 Limitations of RG-CPD
4.4.1.1 Temporal sampling
4.4.1.2 Signal strength
4.4.1.3 Number of datapoints
4.4.1.4 Number of modes
4.4.1.5 Noise and spatial covariance
4.4.2 Limitations of bivariate metrics

5 Results
5.1 Limitations of RG-CPD
5.1.1 Temporal sampling
5.1.2 Signal strength
5.1.3 Number of datapoints
5.1.4 Number of modes
5.1.5 Noise and spatial covariance
5.2 Improvements of RG-CPD
5.2.1 Filtering of non-precursor modes
5.2.2 Indirect links
5.2.3 Outgoing links

6 Discussion
6.1 Limitations

7 Conclusion
7.1 Future work

Bibliography

List of Figures

2.1 Two examples why correlation between variables X and Z may not give the correct causal information. In example (a), variable Y has a causal link with both X and Z, making them correlated. Example (b) shows how the link from X to Z doesn't have to be direct, but can first go through variable Y.

2.2 Example output of the FCI algorithm. There is a causal link found between variables X and Y in both directions, indicating there exists an unmeasured latent confounder. The link between variables Z and Y can either be in both directions or only towards Y.

2.3 The three main steps of transforming 3-dimensional climate timeseries into 1-dimensional timeseries. In the first step, a point-wise correlation map is calculated from the 3-dimensional timeseries. In the second step, the region is clustered using the correlation map. The last step takes the spatial mean of the clusters, returning 1-dimensional timeseries.

3.1 Different values for the ε setting of the DBSCAN clustering algorithm using the scoring method introduced in section 4.3. This setting is the radius around each point to search for other points to add to the cluster. Testing the sensitivity, ~250 was found to be a good value for temperature based correlation maps.

3.2 Mode: a mode is a region with covariant noise. The component of a mode is the weighted average according to a bivariate Gaussian distribution or another distribution. The component influences other modes according to an underlying causal model.

3.3 Found autocorrelation for climate data on the left, and SAVAR data on the right. As seen in the example climate data, the autocorrelation of the precursor timeseries stays much higher than that of the target timeseries. The settings of SAVAR have been chosen such that the precursor autocorrelation and the target autocorrelation are similar. The precursor autocorrelation of SAVAR still doesn't get very similar to real climate data.

4.1 Different causal models used to test RG-CPD. The models are sorted by increasing complexity. Models (A) and (B) are used to test for breakpoints of the RG-CPD framework, while models (C), (D) and (E) are used to test the different methods used by RG-CPD.

4.2 The general workflow used in this thesis. First a causal model is generated, as shown in 4.2a. Using the SAVAR algorithm, 3-dimensional timeseries are generated. Using the RG-CPD framework, a correlation map is calculated (4.2c). The DBSCAN algorithm is used to cluster the map into regions of significant correlation (4.2e). The PCMCI algorithm is performed to find the underlying causal model (4.2b).

5.1 Temporal sampling. The RG-CPD score using different temporal sampling settings. This setting takes the mean over x timesteps of the timeseries within RG-CPD. Figure 5.1a gives the score using an autocorrelation value of 0.6 for the target. Figure 5.1b uses an autocorrelation value of 0.8 for the target. Both plots show two clear tipping points.

5.2 Signal strength. The RG-CPD score using different signal strengths for the causal link. Only one precursor is used, with 100 iterations. RG-CPD starts to find the causal link with a signal strength > 0.03 and finds it consistently with a signal strength > 0.14.

5.3 Number of datapoints. The RG-CPD score using different numbers of generated datapoints for each mode. Six precursor modes are used with 30 iterations using model B. With more datapoints, the score of RG-CPD increases.

5.4 Number of modes. The RG-CPD score using different numbers of modes with model B. 30 iterations are used. When the causal model consists of more modes with a link to the target, the RG-CPD score decreases.

5.5 Noise. The RG-CPD score using different noise values. Figure 5.5a shows the score for model A and figure 5.5b shows the score for model B with 6 modes. For both models the noise parameter doesn't seem to make a difference on the RG-CPD score. However, more noise reduces the found area size.

5.6 Spatial covariance. The RG-CPD score using different spatial covariance values. Figure 5.6a shows the score for model A and figure 5.6b shows the score for model B with 6 modes. For both models the spatial covariance parameter doesn't seem to make a difference on the RG-CPD score.

5.7 Independent noise. The RG-CPD score using different independent noise values. Figure 5.7a shows the score for model A and figure 5.7b shows the score for model B with 6 modes. For both models, a larger independent noise decreases the score and the found region size.

List of Tables

4.1 General experiment settings
5.1 Scoring measures for model (C) using signal strength of 0.08
5.2 Scoring measures for model (C) using signal strength of 0.11
5.3 Scoring measures for model (C) using signal strength of 0.15
5.4 Scoring measures for model (C) using signal strength of 0.11 and a higher spatial covariance of 3000
5.5 Scoring measures for model (D) using signal strength of 0.11
5.6 Scoring measures for model (E) using signal strength of 0.08
5.7 Scoring measures for model (E) using signal strength of 0.11
5.8 Scoring measures for model (E) using signal strength of 0.15

Abbreviations

General

CSL Computational Science Lab

UvA University of Amsterdam

VU Vrije Universiteit Amsterdam

Climate

SST Sea Surface Temperature

S2S Subseasonal to Seasonal

PEP Pacific Extreme Pattern

Algorithms and tests

SAVAR Spatial Averaged Vector AutoRegressive model
RG-CPD Response Guided Causal Precursor Detection

MCI Momentary Conditional Independence

GC Granger Causality

1 Introduction

The global average temperature has warmed by 0.89 °C since 1900 (Hartmann et al., 2013 [20]). These small changes in average temperature may lead to larger changes in the frequency and intensity of extreme weather events (Perkins, 2015 [36]). One such extreme with a negative effect on society on a global scale is the heatwave.

To reduce the harmful effects of heatwaves, one may look at early warning systems. For this thesis, we look at the prediction of heatwave events, which is one aspect of early warning systems. By predicting heatwaves, preparations could be made that reduce the negative effects of these events.

To make these predictions, we have to use collected timeseries datasets. Various methods have been developed to infer causal relationships between different timeseries, leading to the ability to make forecasts. However, these methods work best on datasets of lower dimensions. To make predictions of extreme weather events like heatwaves, one has to look at larger regions, which increases the dimensionality of the dataset. This introduces the problem of not only having to find the causal relationships between the variables, but also having to find the variables themselves.

McKinnon has proposed a new method that first uses a clustering algorithm to group together weather stations that simultaneously record extreme temperature anomalies (McKinnon, 2016 [32]). They found that the Pacific Extreme Pattern (PEP), an evolving pattern of sea surface temperature anomalies, is able to make skillful predictions up to 50 days ahead for hot weather events within a region spanning most of the eastern US.

Inspired by the predictive skill of McKinnon's results, Vijverberg et al have expanded on an automated SST-based forecasting tool called RG-CPD (Vijverberg, [58]). The RG-CPD framework was originally created by Kretschmer et al (2017, [26]). Vijverberg has extended the code to make it more versatile.

The RG-CPD framework consists of two parts. In the first part, the Response Guided Community Detection step, regions are found in 3-dimensional timeseries that show a significant relationship to the target variable. This is done by calculating correlation maps for different lags. Regions are found by clustering together adjacent points with significant correlation of the same sign. These regions are spatially averaged into 1-dimensional timeseries, reducing the data's dimensions.

By construction, these regions have a correlation with the target variable. Correlation doesn't imply causality, as there can be common drivers, autocorrelation or indirect links. To detect the causal drivers, the second part consists of a causal detection algorithm. The RG-CPD framework uses the PCMCI algorithm (Runge [38, 40-44]). Vijverberg has used the correlation metric to detect robust features from Sea Surface Temperatures (SST), but this might have led to spurious results. To make the calculations statistically significant, the correlation analysis requires independent and identically distributed datapoints. However, due to the high persistence of SST compared to the target temperature, this assumption is clearly violated.

Although the original work uses daily mean values, Vijverberg et al have concluded, after a proper validation, that a daily resolution has no predictive skill. A cross-validation framework is applied to optimize the temporal aggregation window size and increase the signal-to-noise ratio. The signal-to-noise ratio turns out to be important for getting good forecasting skill. We suspect the signal-to-noise ratio to also be important for improving the detection skill of causal inference methods. This thesis will focus on the spatial aspect of the framework to improve the signal-to-noise ratio and detection skill.

To improve the spatial dimension, the thesis focuses on the first step of the RG-CPD framework. By default, a correlation map is made between a possible precursor and the target variable. Although correlation maps have proven to be able to detect regions having a causal link to the target variable, we suspect that these regions could be improved by using other tests. This leads to the first question:

RQ 1: Can the spatial detection of precursor regions by the RG-CPD framework be improved by using other dependence tests instead of correlation?

To test the different methods, a spatio-temporal climate 'toy' model developed by Tibau et al (2020, [56]) will be used. This model lets us generate 3-dimensional timeseries using a known underlying causal model; see section 3.3. The causal model defines causal links between different areas called modes, which can be placed on a map.

Knowing the underlying causal model, we can test for its similarity with RG-CPD's output. This gives the opportunity to score the output and compare different conditional independence tests.

The model also gives us the possibility to increase the complexity of the underlying causal model. We can increase the number of regions, add more noise, change the signal strength or generate more years of data. This leads to a second question:

RQ 2: What are the limitations of the RG-CPD framework?

RG-CPD focuses on finding causal links on sub-seasonal to seasonal timescales. Therefore, possible precursors which vary more slowly than the target can be used to make the predictions. This research will use synthetic data where this difference between the precursors and the target is simulated.

This thesis is structured as follows. Chapter 2 gives an overview of the methods developed for causal inference and dimensionality reduction. Chapter 3 gives a more detailed explanation of the methods introduced by this thesis or new to this field; here the SAVAR model and the new implementations for RG-CPD are explained, as well as the general settings used. Chapter 4 discusses how the causal models are generated, which methods are tested and how we can calculate the score of one test; this chapter also introduces the experiments performed in this thesis. Chapter 5 presents the results of the experiments; in this chapter the framework is tested using the different ways of increasing complexity. In chapter 6 the results are interpreted, and the limitations and assumptions of our approach are discussed. Chapter 7 gives a summary of our findings as well as recommendations for further research.

2 Background

This chapter gives an overview and review of the current literature on the topic of causality detection algorithms. Causal discovery is an important research question in many fields; hence, there is a large amount of literature to be found on this topic. First, general causal discovery methods will be discussed. Later in this chapter, we focus on research related to climate science and how we can apply it to 3-dimensional timeseries.

2.1 Motivation for causal discovery algorithms

Large scale complex networks have been the topic of much research. Knowing the structure of a network gives a lot of information about the system it describes. Using the network, we can solve optimization problems, find equilibria [33] or make forecasts. Determining the role of the network structure on the dynamic properties of certain systems [3,11,17,61] is therefore an important area of study.

One method is reverse engineering, where a network is deconstructed to reveal its structure. Using reverse engineering methods, we could find underlying network structures, but we can't do that without interfering with the system itself [53]. Experimenting with the network is not always possible in fields like neuroscience or climate science [44]. When it is possible, it's expensive, time consuming and/or hampered by imperfect models. So other methods are needed which can find the system's network using only observations of the network's behavior.

As there is an abundance of timeseries data, causal network inference methods have been actively researched in recent years. Using collected or simulated timeseries of a system, causal network inference methods can be used to detect the underlying networks. These methods only use observed timeseries, so they don't perturb the network itself.


Figure 2.1: Two examples why correlation between variables X and Z may not give the correct causal information. Panel (a), common driver: variable Y has a causal link with both X and Z, making them correlated. Panel (b), indirect link: the link from X to Z doesn't have to be direct, but can first go through variable Y.

2.2 Correlation based methods

To extract a network from timeseries, all timeseries are compared to each other. One method to test if there is a link between two timeseries is to check how similar they are using correlation. Two timeseries with a high correlation coefficient may have a link between them. Correlation matrices can also be used to define a metric of how close two nodes are on the network [31]. The network can be found by defining a threshold, where a link is set to exist when the correlation coefficient exceeds the threshold [5]. As correlation only tells us whether two timeseries are dependent, but not the direction, the network will be an undirected graph.

Correlation only tells us how similar two timeseries are to each other. The reason they are similar doesn't have to be a causal link between the nodes; there might be common causes or indirect links [1]. Examples are shown in figure 2.1. Correlation therefore does not imply causality. To get closer to a causality test, lagged correlation can be used to get directional links. Lagged correlation and cross-correlation have been used to get the time delay and strength of links [19,25,28]. With this approach, correlation is measured between a time-shifted variable and a possible target variable.

Instead of correlation, partial correlation may be used [23]. Partial correlation tests for dependence with the effect of a third variable removed. In [60] a comparison is made between models using correlation and models using partial correlation in the financial field. They found that the distributions of the correlation and the partial correlation coefficients differ completely. This tells us that other nodes of the network can have a big influence on the correlation coefficient. When chosen well, the third variable used by partial correlation can give a more accurate coefficient. Ideally, a third variable is chosen which has an effect on both initial variables. This way, we test for correlation with the effect of a common cause removed.
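To make the procedure concrete, the sketch below computes a partial correlation by regressing out the controlling variables and correlating the residuals. It is a minimal illustration, not code from any of the cited works; the helper name `partial_correlation` is hypothetical.

```python
import numpy as np
from scipy import stats

def partial_correlation(x, y, z):
    """Correlation between x and y with the linear effect of z removed.

    x, y: 1-D arrays of length T; z: array of shape (T,) or (T, k)
    holding the controlling variable(s) as columns.
    """
    design = np.column_stack([np.ones(len(x)), z])   # add an intercept
    # least-squares residuals of x and y after regressing out z
    e_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    e_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return stats.pearsonr(e_x, e_y)                  # (coefficient, p-value)
```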


2.3 Granger causality

Correlation tests have shown that only checking for similarities between timeseries isn't good enough to find causal effects. The first concept of one variable being causal to another was introduced by Norbert Wiener in 1956 [62]. He introduced the notion that a variable might be causal when the prediction of the second variable is improved using the information of the causal variable. Clive Granger introduced a practical implementation of this concept in 1969 [18].

The idea of Wiener-Granger causality (or just Granger causality, WGC) is that a variable X Granger-causes a variable Y when the past of X contains some relevant information for Y which isn't included in the past of Y itself, and also isn't included in other variables [6, 39]. This can be tested by calculating the score of the prediction of Y without knowing the past of X and with the past of X, and checking if the prediction score significantly improves.

To measure the improvement, autoregressive models (AR) or vector autoregressive models (VAR) are used. Defining $X_t = (X_t^1, \ldots, X_t^N)$ as the set of N variables at timestep t, $\Phi(\tau)$ the N × N coefficient matrix at lag τ, $\tau_{\max}$ the maximum time lag and $\epsilon_t$ an independent noise term, the VAR model is defined as:

$$X_t = \sum_{\tau=1}^{\tau_{\max}} \Phi(\tau)\, X_{t-\tau} + \epsilon_t. \qquad (2.1)$$

We can say variable $X^i$ Granger-causes variable $X^j$ when the variance of the residuals $\epsilon_t$ of the unrestricted model with $X^i$ is significantly less than the variance of the residuals of the restricted model without $X^i$. Geweke [15] has defined WGC as:

$$F_{X \to Y} = \ln \frac{\mathrm{var}(\epsilon_t)}{\mathrm{var}(\epsilon'_t)}, \qquad (2.2)$$

where $\epsilon_t$ is the residual of the restricted model and $\epsilon'_t$ the residual of the unrestricted model. A conditional Granger causality can be defined by including a third variable in $X_t$ for both the restricted and unrestricted model.

WGC does have some limitations. The method requires the estimation of (V)AR models; therefore, the timeseries are required to be stationary to make the model estimation tractable, and enough data is needed to make the estimation sufficient. Only linear relationships are found using this method. However, nonlinear features can be approximated using linear features.
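A minimal sketch of the residual-variance comparison behind equation 2.2 is shown below, assuming stationary 1-D series and a plain least-squares AR fit; the helper name `geweke_granger` is hypothetical and the code is illustrative rather than a statistically rigorous test.

```python
import numpy as np

def geweke_granger(x, y, tau_max=2):
    """Geweke's measure F_{X->Y} = ln(var(restricted)/var(unrestricted)).

    The restricted model predicts y from its own past only; the
    unrestricted model also uses the past of x. A value well above zero
    suggests that x Granger-causes y.
    """
    T = len(y)
    target = y[tau_max:]
    past_y = np.column_stack([y[tau_max - k:T - k] for k in range(1, tau_max + 1)])
    past_x = np.column_stack([x[tau_max - k:T - k] for k in range(1, tau_max + 1)])

    def residual_variance(predictors):
        design = np.column_stack([np.ones(len(target)), predictors])
        beta = np.linalg.lstsq(design, target, rcond=None)[0]
        return np.var(target - design @ beta)

    return np.log(residual_variance(past_y) /
                  residual_variance(np.column_stack([past_y, past_x])))
```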


2.4 Entropy based methods

A more general version of Granger causality is transfer entropy (TE). This method from information theory also tests for additional information contained in possible precursors to consider a causal connection between the precursor and the target. Instead of (V)AR model regression, transfer entropy uses the uncertainty measured by Shannon entropy [47]. Shannon entropy is defined as:

$$h(X) = -\int p(x) \log p(x)\, dx, \qquad (2.3)$$

with $p(x)$ the probability density function of X. From this definition, joint and conditional entropies can be defined for variables X and Y:

$$h(X, Y) \equiv h(Y, X) = -\int p(x, y) \log p(x, y)\, dx\, dy, \qquad (2.4)$$

$$h(X|Y) \equiv -\int p(x, y) \log p(x|y)\, dx\, dy, \qquad h(Y|X) \equiv -\int p(x, y) \log p(y|x)\, dx\, dy. \qquad (2.5)$$

As Shannon entropy gives the amount of uncertainty of a random variable, mutual information uses Shannon entropy to measure the deviation from independence of two variables. Mutual and conditional mutual information are defined as:

$$I(X; Y) \equiv h(X) - h(X|Y) \equiv h(Y) - h(Y|X), \qquad (2.6)$$

$$I(X; Y|Z) \equiv h(X|Z) - h(X|Y, Z) \equiv h(Y|Z) - h(Y|X, Z). \qquad (2.7)$$

To measure the direction of the information flow, transfer entropy can be used, as proposed by Schreiber [46]. Transfer entropy is defined as:

$$T_{X \to Y} \equiv h(Y_{t+1}|Y_t) - h(Y_{t+1}|Y_t, X_t). \qquad (2.8)$$

As the conditional entropy $h(Y_{t+1}|Y_t)$ gives the uncertainty of variable $Y_{t+1}$ given its past $Y_t$, transfer entropy gives the decrease in uncertainty when the past $X_t$ is included as well. A positive value of transfer entropy tells us that, conditioned on both the past of the target Y and the possible precursor X, we have lower uncertainty, meaning variable X contains meaningful information.

A more generalized version of transfer entropy is introduced by Sun et al [54]. They define causation entropy as:

$$C_{I \to J|K} \equiv h(J_{t+1}|K_t) - h(J_{t+1}|K_t, I_t), \qquad (2.9)$$

where $I, J, K \subseteq \{1, 2, \ldots, n\}$ index sets of variables. When we have the uncertainty of a set of variables J conditioned on a set of variables K, causation entropy gives the reduction of uncertainty when we include the set of variables I. In this way, the pairwise restriction of transfer entropy is overcome. Note that we get the definition of transfer entropy when $J = \{Y\}$, $I = \{X\}$ and $K = \{Y\}$.
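For jointly Gaussian variables, the conditional entropies in equation 2.8 reduce to log residual variances of linear regressions, so transfer entropy can be estimated as below. This Gaussian assumption is ours for illustration (practical estimators often use binning or nearest-neighbour methods), and `gaussian_transfer_entropy` is a hypothetical helper.

```python
import numpy as np

def gaussian_transfer_entropy(x, y):
    """T_{X->Y} = h(Y_{t+1}|Y_t) - h(Y_{t+1}|Y_t, X_t) under a Gaussian
    assumption, where h(A|B) = 0.5*ln(2*pi*e*residual_variance); the
    2*pi*e terms cancel in the difference."""
    y_next, y_past, x_past = y[1:], y[:-1], x[:-1]

    def cond_var(target, *conds):
        design = np.column_stack([np.ones(len(target))] + list(conds))
        beta = np.linalg.lstsq(design, target, rcond=None)[0]
        return np.var(target - design @ beta)

    return 0.5 * np.log(cond_var(y_next, y_past) /
                        cond_var(y_next, y_past, x_past))
```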

2.5 Graphical models

The previous methods are used to measure the causal dependence between two variables. To get a causal graph between all variables, conditional tests are performed between all sets of variables. Granger causality and entropy based measures give direction to the links, but still don't account for indirect links or common causes. To be able to find only direct links, other algorithms have been developed. These algorithms all start with a fully connected graphical causal model and then prune links which don't have a causal connection. Graphical causal models are sets of nodes and directed edges, where the edges indicate the causal links, together with a joint probability distribution P over the values of all the variables.

These methods are based on two assumptions. The first assumption is the Markov condition, which tells us a variable X is independent of its nondescendants conditioned on its parents:

Assumption 1 (Markov condition). Let DE(X) be the descendants of a variable X, let PA(X) be the parents of X and let V be a set of variables. For every variable X ∈ V and set of variables Y ⊂ V\DE(X), it holds that P(X|PA(X)) = P(X|PA(X), Y).

When the Markov condition is assumed for a graphical causal model, it is called the Causal Markov Assumption. This assumption tells us that every conditional independence statement of the graph is satisfied by the joint distribution P. There may exist other conditional independence relations satisfied by P which are not included in the graph. To make causal inference possible, we have to include another assumption:

Assumption 2 (Causal faithfulness). Every conditional independence relation in the joint probability distribution P of a graphical causal model must be part of the graph.

Using these assumptions, we can say that two variables X and Y are causally related if and only if there isn’t another set of variables Z in the model such that X and Y are independent conditioned on Z. This statement forms the basis of many causal inference methods which will be described in this section.


Figure 2.2: Example output of the FCI algorithm. There is a causal link found between variables X and Y in both directions, indicating there exists an unmeasured latent confounder. The link between variables Z and Y can either be in both directions or only towards Y.

A widely used algorithm is the PC algorithm [50], named after Peter Spirtes and Clark Glymour. This algorithm prunes the graphical model in multiple steps. First it removes all connections where the variables are independent. Then it prunes all connections which are independent conditioned on a third variable (still in the graph). The number of variables which are conditioned on increases until the algorithm has converged. The resulting graph consists of variables which are causally related, as all links which are independent conditioned on some other set of variables have been removed.
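The sketch below illustrates the skeleton-pruning phase of a PC-style algorithm, using partial correlation as the independence test. It is a simplified illustration (it omits the edge-orientation phase and is not the order-independent PC-stable variant); the function names are hypothetical.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def _indep_pvalue(data, i, j, cond):
    """p-value of the partial correlation between columns i and j given cond."""
    x, y = data[:, i], data[:, j]
    if cond:
        z = np.column_stack([np.ones(len(x)), data[:, list(cond)]])
        x = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
        y = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    return stats.pearsonr(x, y)[1]

def pc_skeleton(data, alpha=0.01, max_cond=2):
    """Start fully connected; remove edge (i, j) as soon as some conditioning
    set drawn from i's other neighbours makes the pair independent."""
    n_vars = data.shape[1]
    adj = {i: set(range(n_vars)) - {i} for i in range(n_vars)}
    for size in range(max_cond + 1):        # growing conditioning-set size
        for i in range(n_vars):
            for j in list(adj[i]):
                for cond in combinations(sorted(adj[i] - {j}), size):
                    if _indep_pvalue(data, i, j, cond) > alpha:
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
    return adj
```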

An extension of the PC algorithm is the PCMCI algorithm. It first performs the PC algorithm to find the possible parents of each variable. Then momentary conditional independence (MCI) tests are performed. The MCI test tests for independence between a precursor X and target Y conditioned on the parents of Y as well as the parents of X. Using ⊥⊥ to denote independence and $\mathcal{P}(X)$ for the parents of X, MCI tests:

$$\text{MCI:} \quad X \perp\!\!\!\perp Y \mid \mathcal{P}(Y), \mathcal{P}(X), \qquad (2.10)$$

for all possible parents X.

Colombo et al [8] have modified the PC algorithm, making it order independent. The new algorithm is called the PC-stable algorithm. The main difference between the PC and the PC-stable algorithm is that PC-stable prunes independent connections only after all conditioning sets of the same size have been tested. Besides making it order independent, this approach makes the algorithm easy to parallelize.

Spirtes et al [50] have developed a variant of the PC algorithm called Fast Causal Inference (FCI), which is able to detect latent confounders which are not measured. The first steps are the same as in the PC algorithm: FCI starts with a fully connected undirected graph and prunes connections which are conditionally independent. The difference with the PC algorithm comes in the orientation phase. While the PC algorithm assumes every edge is directed in one way, FCI allows for directions to be uncertain and for links with directions in both ways. Figure 2.2 gives an example output of the FCI algorithm. This example shows that a link can be directed in two ways, as is the case for variables X and Y. This indicates there is at least one unmeasured confounder.


Colombo et al [9] developed a faster variant of the FCI algorithm called the Really Fast Causal Inference algorithm (RFCI). Although this algorithm is faster, it loses accuracy compared to FCI.

A similar approach is introduced by Chickering et al [7], called Greedy Equivalence Search (GES). Instead of starting with a fully connected graph like the PC and FCI algorithms, GES starts with an empty graph. In the first part of the algorithm, edges are added when they increase a Bayesian-measured fit. When adding edges doesn't increase the fit anymore, the algorithm starts removing edges whose removal improves the fit.

Ogarrio et al combined the FCI and the GES algorithms to create the hybrid GFCI [34]. This algorithm first uses the GES algorithm to build a network and then uses the FCI algorithm to prune unnecessary links.

2.6 Functional Causal Models

The methods described before find the causal model which generated the observed timeseries by testing for (conditional) independence between the different variables. By cleverly organizing these tests, the algorithms are able to find directional, latent and common links and even links from unmeasured variables. For data where a good independence test can't be performed (too little data, form of independence unknown), other methods have been proposed. One of these methods is Functional Causal Models (FCM). FCMs describe the effects of a variable $x_i \in X$ in terms of its parents $PA(x_i)$ as:

$$x_i = f_i(PA(x_i), \epsilon_i), \qquad (2.11)$$

where $\epsilon_i$ is a noise term assumed to be independent of the causes $PA(x_i)$, and $f \in \mathcal{F}$ where $\mathcal{F}$ is a functional class. The function is assumed to be invertible and can be fitted to the data. This is done in both directions; the direction in which the fitted noise term is independent of the causal variable is kept. However, this only works under certain assumptions, and different methods have been proposed making use of different assumptions.

Shimizu et al [48] introduced the LiNGAM method. They assume the data to be generated by an acyclic model, the function to be linear, and at most one noise term to be Gaussian. The noise terms are also probabilistically independent. This leads to the following equation and its matrix form:

$$x_i = \sum_{k(j) < k(i)} b_{ij} x_j + \epsilon_i, \qquad (2.12)$$

$$X = BX + \epsilon, \qquad (2.13)$$

where $b_{ij} \in B$ is the connection strength from $x_j$ to $x_i$, $B$ is an n × n adjacency matrix and $k(i)$ is an ordering such that later variables have no path to earlier variables in the ordering. By the assumption that the model is acyclic, the matrix B is permutable into a strictly lower triangular matrix. By defining $A = (I - B)^{-1}$, we can solve for X:

$$X = A\epsilon. \qquad (2.14)$$

As the noise terms are independent and non-Gaussian, equation 2.14 defines an Independent Component Analysis (ICA) problem, introduced by Hyvärinen et al [22], which has been shown to be solvable [10, 12]. Solving equation 2.14 estimates A and $W = A^{-1}$, but W still needs to be permuted and scaled, and the signs have to be determined. As B needs to be permutable to strictly lower triangular, W can't have any zero elements on the diagonal, which makes it possible to find the correct permutation. Using the assumptions of LiNGAM, the signs and scale can also be found. The matrix B can then be found as $B = I - W$. Knowing B gives us the parameters $b_{ij}$ of equation 2.12 and thus the functions in equation 2.11. The functions induce a probability distribution P on the variables from the distributions of the noise terms. From this distribution, the causal model can be determined.
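A minimal sketch of this ICA stage is given below, using FastICA from scikit-learn and an assignment solver to find the row permutation with a zero-free diagonal. A full LiNGAM implementation additionally estimates the causal ordering and prunes B; `lingam_adjacency` is a hypothetical helper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import FastICA

def lingam_adjacency(X):
    """Rough estimate of B in X = BX + e from data of shape (T, n),
    assuming independent, non-Gaussian noise terms."""
    ica = FastICA(random_state=0, max_iter=1000)
    ica.fit(X)
    W = ica.components_            # unmixing matrix, rows permuted/scaled

    # choose the row permutation that avoids zeros on the diagonal by
    # minimising sum(1/|W_ii|) (a linear assignment problem)
    rows, cols = linear_sum_assignment(1.0 / (np.abs(W) + 1e-12))
    W_perm = np.zeros_like(W)
    W_perm[cols] = W[rows]

    W_perm /= np.diag(W_perm)[:, None]   # rescale rows so the diagonal is 1
    return np.eye(W.shape[0]) - W_perm   # B = I - W (equation 2.13)
```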

The post-nonlinear causal model (PNL) is an extension of LiNGAM developed by Zhang et al [64]. They write the functions in the following form:

$$x_i = g_i(f_i(PA(x_i)) + \epsilon_i), \qquad (2.15)$$

where $f, g \in \mathcal{F}$ are invertible, and the noise terms are probabilistically independent. This model has fewer assumptions, but is still able to identify the causal graph, with five known exceptions.

The LiNGAM method has been further improved by other methods to reduce estimation errors. Sanchez-Romero et al [45] developed a two-step method including unmeasured confounders into the equations. Shimizu et al [48] introduced the DirectLiNGAM method, which combines the regression part with independence tests in between. The above methods all use linear functions. Nonlinear methods have also been proposed: Hoyer et al [21] introduced a model where the function $f_i$ in equation 2.11 is assumed to be any nonlinear function, which may be different for each i. This model also uses an additive noise term.

2.7 Dimensionality reduction in climatology

All the methods discussed so far assume the timeseries to be one dimensional. In many fields like climatology, we have to deal with higher dimensional timeseries. In climatology, the timeseries are three dimensional, as they also include a spatial element. In this field, a possible precursor timeseries might be the Sea Surface Temperature (SST) of a certain region. The data then consists of timeseries for each gridpoint in the area. Treating every gridpoint as a unique variable is computationally inefficient, as the number of gridpoints quickly becomes too large. In general, causal inference methods can't handle high dimensional timeseries efficiently. Therefore, methods are needed to reduce the dimensionality of the data.

2.7.1 General dimensionality reduction

In climate data, we can use the knowledge of the spatial features to reduce the dimension of the data. In general, the timeseries might be of higher dimensions where we don’t have such knowledge beforehand. Here a general approach is needed.

The best known dimension reduction technique is Principal Component Analysis (PCA, Pearson [35]). This technique finds a few orthogonal linear combinations of the variables which have the highest variance. As the first several principal components found by this technique capture the most variance, the rest can be discarded without losing too much information. PCA greatly reduces the dimensions of data. Suri et al [55] have shown that PCA reduces the dimensions even more drastically when applied to timeseries data, as timeseries are often more structured. However, PCA doesn't take lagged interactions within the timeseries into account. Despite this, the PCA approach has been used for timeseries [29,49,63].
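As a small illustration of this reduction step, the snippet below keeps the leading principal components of a flattened gridpoint dataset; the dimensions and random input are hypothetical placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical example: 10 years of daily data on a 30x30 grid,
# flattened so that every gridpoint is one column
data = np.random.randn(3650, 30 * 30)

pca = PCA(n_components=5)               # keep the 5 leading components
reduced = pca.fit_transform(data)       # shape (3650, 5): reduced timeseries
print(pca.explained_variance_ratio_)    # variance captured per component
```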

Another linear method is Factor Analysis (FA), which tries to find a linear relationship between the variables and a lower number of unobserved variables called factors. FA has the explicit goal of finding these factors, whereas PCA only approximates them at best.

Independent Component Analysis (ICA) looks for linear projections, but these don’t have to be orthogonal. The projections have to be statistically independent of each other.


Another approach is t-distributed Stochastic Neighbor Embedding (t-SNE, [30]). This method constructs a probability distribution over pairs of the high-dimensional objects. In this distribution, similar objects get a higher probability than dissimilar objects. Another probability distribution is defined on the lower dimension. By minimizing the Kullback-Leibler divergence between the two distributions, the lower dimensional distribution is made similar to the higher dimensional distribution. In the lower dimension, this method often shows clusters.

2.7.2 Dimensionality reduction for timeseries

The above methods can be used on any high dimensional data, but do not necessarily fit timeseries well, as timeseries also contain lagged information. In timeseries research, a commonly used method is Singular Spectrum Analysis (SSA), which is the application of PCA to lagged versions of a timeseries. This analysis is mainly intended for noise reduction, detrending and the identification of oscillatory patterns. Multivariate SSA (M-SSA) is an extension of SSA which applies the technique to multivariate timeseries. This is however not a dimension reduction technique, as it applies SSA to each channel separately. Rosenzweig et al have developed an adaptation of the M-SSA method which can be used as a dimension reduction technique, called SSA-FARY [37].

Other methods adapt the PCA algorithm to make it more suitable for timeseries. Goerg introduced Forecastable Component Analysis (ForeCA, [16]). This method finds a transformation to separate a multidimensional timeseries into an orthogonal, forecastable white noise space using a new forecastability measure.

Suri et al [55] developed the method Dimensionality Reduction OPtimizer (DROP). They found that, in the case of timeseries datasets, the same result as PCA could be achieved when performed on a smaller subset. They used this result to develop DROP, which returns a low-dimensional basis using online aggregation and progressive sampling techniques. This way they apply PCA to an increasingly large sample of the dataset until the algorithm has converged.

2.7.3 Clustering

For climate data as used for this thesis, the timeseries consist of only three dimensions (longitude, latitude and time). We can use the knowledge of the spatial dimensions to reduce the timeseries to only the time dimension. In climate data, the three dimensional timeseries are regions where every gridpoint is a timeseries. Adjacent gridpoints within a region tend to act similarly; in the example of SST, the temperature of a specific gridpoint won't be very different from a gridpoint very close to it. Treating these different areas within the region as one variable greatly reduces the dimensionality of the timeseries. Finding these clusters of similar gridpoints is still a challenge, but different methods have been proposed to solve this problem of finding the correct areas, making causal discovery with 3-dimensional timeseries possible.

One of the oldest and most approachable clustering techniques is k-means clustering. First it assigns a fixed number of centroids, which are the centers of the clusters. The points are assigned to the cluster with the nearest centroid. Iteratively, the centroids are moved to the mean of the defined clusters and the points are reassigned.

Some methods grow the clusters iteratively. One hybrid of spectral clustering and region growing iteratively merges gridpoints where a predefined dissimilarity measure is low enough [57]. This algorithm has the same problem as k-means clustering, as we have to define the minimum number of clusters beforehand. The DBSCAN clustering algorithm [13] starts with a random point and assigns it a cluster. It then iteratively searches within a region around this point for other points and adds them when there are enough. Points which turn out not to be contained in any cluster are considered noise. For this method, the radius that the algorithm uses to search for other points around the initial point is very important. When this parameter is too low, some less dense clusters may not be found, as their points are considered noise; when it is too high, clusters may be incorrectly grouped together.
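A minimal usage sketch of DBSCAN as provided by scikit-learn is shown below; the coordinates and parameter values are illustrative placeholders (section 3.2.1.1 discusses the values used in this thesis).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical input: coordinates (e.g. in km) of the gridpoints whose
# correlation with the target is significant and of the same sign
points = np.random.uniform(0, 5000, size=(500, 2))

# eps is the search radius around each point; min_samples is the minimum
# number of neighbours needed to grow a cluster
labels = DBSCAN(eps=200, min_samples=3).fit_predict(points)
# labels[i] is the cluster index of point i; -1 marks noise points
```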

One method specifically made for timeseries is ACTS [59], which combines hierarchical and dynamical clustering principles. This method is able to detect the clusters without having to specify beforehand how many there are.

The PCA algorithm can also be used as a clustering algorithm when used on three dimensional data like climate timeseries, but PCA has limitations regarding physical interpretability. It's also not very suitable for detecting weaker patterns [4]. PCA can be used as a starting point for clustering algorithms like Shared Nearest Neighbor (SNN, [51]). This algorithm starts with a known clustering and adds new elements to the clusters according to the k nearest neighbors: if the majority of the k nearest neighbors belong to cluster A, the new element will also belong to cluster A.

Other methods first define a network on the timeseries and apply community detection algorithms. Multiple methods have been developed to infer these communities [14]. A lot of these methods group variables together such that the groups have many links within them, but only a few links between them. One traditional method is to partition the graph. This divides the graph into equally sized clusters with minimal links between them. A disadvantage of this type of method is that we need to specify the size of the clusters as well as the number of clusters. Some algorithms for partitioning are the Kernighan-Lin algorithm [24] and the Spectral Bisection method [2].

Other types of clustering techniques that use the graph are hierarchical clustering, partitional clustering and spectral clustering. Clustering methods can find the communities by finding the connecting links and removing them, or by optimizing modularity or other metrics. All these methods have in common that the clusters have to maximize or minimize some scoring method, but these scoring methods don't take prior knowledge into account.

2.7.3.1 Response guided methods and RG-CPD

To be able to forecast a certain target variable, it is not always necessary to infer the whole causal graph. It can be sufficient to only know the connections to one specific variable we wish to predict. With this goal, new options become available. The clusters we are looking for are now not variables that just act similarly to each other, but variables that have a similar effect on the target variable.

To find the precursors of a response variable, different community detection algorithms have been proposed [51,52]. Their adaptations of the clustering algorithms group the variables together based on their ability to predict the response variable.

Bello et al have developed a response guided community detection strategy to find the regions associated with the response variable of interest [4]. The information of the response variable is used during the clustering process. The precursors found using this method can then be used as predictors of the response variable. They also introduced a method of finding the graph using the response variable with climate data. First they select all gridpoints with a significant correlation with the target variable. These gridpoints form a graph, where a link between two gridpoints exists when the correlation between them is significant. Then they use their clustering method to group the grid cells into different regions.

Limitations of Bello’s strategy is that the clusters don’t have to represent geographical connected regions and it doesn’t take time lagged dependencies and causal relations into account. Kretschmer et al combined Bello’s strategy with causal discovery algorithms [26]. Kretschmer’s approach is called Response-Guided Causal Precursor Detection (RG-CPD). This method first makes correlation maps between the multivariate data and the response variable, similar to Bello’s approach. However Kretschmer’s approach applies this step for different lead-lag times. Figure2.3bshows an example of such a correlation map. This map shows the correlation between a possible precursor timeseries of ERA5

(27)

(a) ERA5 3-dimensional time-series at some time step t.

(b) Correlation map of the timeseries in figure2.3aand a

target variable.

(c) Clustered region using the correlation map of figure2.3b.

Figure 2.3: The three main steps of transforming 3-dimensional climate timeseries into 1-dimensional timeseries. In the first step, a point-wise correlation map is calculated from the 3-dimensional timeseries. In the second step, the region is clustered using the correlation map. The last step takes the spatial mean of the clusters, returning

1-dimensional timeseries.

data shown in figure 2.3a and a target timeseries. The found regions are transformed into one dimensional timeseries by taking area-weighted averages. The last step is to perform a causality test to filter out the non-causal links.

This framework was expanded on by Vijverberg [58]. Vijverberg's implementation first calculates a dependence map using correlation, similar to the previous implementations. Vijverberg uses the DBSCAN algorithm (Ester et al, 1996 [13]) to cluster the area into regions using the dependence map. The PCMCI causal inference algorithm (Runge [38, 40-44]) is used to find the underlying causal model.

Vijverberg’s implementation provides the functionality to use temporal sampling. This allows to change the interval of the timeseries from one day to x days. RG-CPD takes the mean of each consecutive x days and treats this calculated mean value as one timestep. This is done to improve the signal-to-noise ratio for variables which change over a longer period then the noise.

3 Methods

3.1 Introduction

The previous chapter explained the various causal inference algorithms and dimensionality reduction methods that have been developed; this chapter explains the methods introduced by this thesis or new to this field. The general settings which aren't part of the experiments are also discussed.

3.2 Response guided causal precursor detection (RG-CPD)

By default, RG-CPD calculates the dependence map using correlation. As discussed in the previous chapter, dependence metrics other than correlation have been proposed to improve the detection of causal links. For this research, the following metrics are implemented instead of correlation:

1. Partial correlation. The partial correlation of variable X and variable Y given a set of controlling variables Z is calculated by performing a linear regression of X on Z and a linear regression of Y on Z. The partial correlation is then found by calculating the Pearson r correlation between the residuals $e_X$ and $e_Y$ resulting from the linear regressions.

Without a controlling variable Z, partial correlation is the same as Pearson r correlation, so a suitable Z has to be chosen. For this thesis, Z is chosen to be either the history of the possible precursor X, the history of the target Y, or both. For these three possible controlling variables, the history is taken up to time t − lag, where the lag can be varied. This means we measure the correlation between the possible precursor and the target with the effects of the history of the variables up to lag days before the current time removed.

2. Transfer entropy. In information theory, Shannon entropy gives us the level of uncertainty of a variable X. Transfer entropy tells us the reduction of uncertainty about the target when we include the history of the precursor timeseries in addition to the history of the target variable. A positive transfer entropy tells us that the precursor timeseries gives us more information about the target timeseries. A more detailed explanation is found in section 2.4.

3. Granger causality. Granger causality is very similar to transfer entropy, as it tests if a variable X contains information about another variable Y which isn't included in Y itself or in the other variables. Granger causality calculates a prediction score of Y without knowing X and then checks if this prediction score improves when X is included. See also section 2.3.

3.2.1 RG-CPD settings

Apart from the settings of the individual parts of RG-CPD, some general settings have been defined. RG-CPD uses k-fold cross validation. For this research, the average over the test splits is taken as the final result, meaning a mode has to be found in each test split. To test the limitations of RG-CPD, k = 5 is taken. This is because each test split needs to be at least one year, so with k test splits, we need at least k years of simulation. For this research, five years is the minimum number of years tested, so we can't take more test splits. To test the different methods, we will use the entire dataset, except for the consistency score (see section 4.3.1). As we generated the timeseries using a lead-lag time of one timestep, RG-CPD only calculates the dependence map for lag 1.

3.2.1.1 Settings DBSCAN

The DBSCAN algorithm takes two arguments. The first argument is the radius ε. Figure 3.1 shows the scores, as explained in section 4.3, for different values of ε. With ε < 150, the DBSCAN algorithm doesn't find the right clusters anymore, while ε > 350 seems to decrease the score as well. Between these values, this parameter doesn't affect the score much. For this research, a value of ε = 200 is chosen. The second argument is the minimum number of points to form a cluster, which is set to 3.


Figure 3.1: Scores for different values of the ε setting of the DBSCAN clustering algorithm, using the scoring method introduced in section 4.3 (x-axis: DBSCAN distance_eps, y-axis: score). This setting is the radius around each point to search for other points to add to the cluster. Testing the sensitivity, ~250 was found to be a good value for temperature based correlation maps.

3.2.1.2 Settings PCMCI

The PCMCI algorithm concludes that two modes have a connecting link if the p-value of the MCI test is below a threshold value α. This research takes this value as α = 0.01. PCMCI only looks for links up to 2 lead-lag times. For the conditional independence tests, this research uses partial correlation within PCMCI, as this is the default setting.
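A minimal sketch of such a configuration with the tigramite package, which provides PCMCI, is shown below. Import paths differ between tigramite versions, and the random input stands in for the 1-dimensional component timeseries; treat this as an assumption-laden illustration rather than the exact RG-CPD call.

```python
import numpy as np
import tigramite.data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr

# placeholder input: one column per mode, column 0 being the target
data = np.random.randn(1000, 4)

pcmci = PCMCI(dataframe=pp.DataFrame(data), cond_ind_test=ParCorr())
results = pcmci.run_pcmci(tau_max=2, pc_alpha=0.05)

# keep links whose MCI p-value is below alpha = 0.01
links = results["p_matrix"] <= 0.01
```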

3.3 Spatial averaged vector autoregressive model (SAVAR)

The SAVAR model is a spatio-temporal model, developed by Tibau et al [56], that generates timeseries for different regions affecting each other according to a known underlying causal model. These regions are called modes.

To interact with other modes, a mode is assigned a component. A component is the weighted average of the region according to a two-dimensional Gaussian distribution. Optionally, other distributions can be used. The causal model describes how the components influence each other. The effect of other modes on the component influences the region using the same weights. An example of a mode and its bivariate Gaussian distribution is shown in figure 3.2.

This model is described by the equation:

$$y_t^\ell := \sum_{j=1}^{N} u_{\ell j} \sum_{i=1}^{N} \sum_{\tau=1}^{\tau_{\max}} \phi_{ji}(\tau) \sum_{k=1}^{L} w_{ik}\, y_{t-\tau}^{k} + \epsilon_t^\ell, \qquad (3.1)$$

where $y_t^\ell$ is the value of the random variable y at time t at location ℓ, $u_{\ell j}$ is the coefficient that links how the j-th component influences the value of $y_t^\ell$, $\phi_{ji}(\tau)$ gives the contribution of the i-th component to the j-th component at time lag τ, and $w_{ik}$ is the coefficient that links how $y^k$ contributes to the i-th component. The noise term is given by $\epsilon_t^\ell \sim \mathcal{N}(\mu, \Sigma)$, where $\Sigma \in \mathbb{R}^{L \times L}$ is the covariance matrix.

Figure 3.2: Mode: a mode is a region with covariant noise. The component of a mode is the weighted average according to a bivariate Gaussian distribution or another distribution. The component influences other modes according to an underlying causal model.

Using the notation $y_t^T = (y_t^1, \ldots, y_t^L)$, $u_{\ell j} \in W^+$, $\phi_{ji}(\tau) \in \Phi(\tau)$, $w_{ik} \in W$ and $\epsilon_t^\ell \in \epsilon_t$, where $W^+$ is the Moore-Penrose pseudoinverse of W, equation 3.1 can be written in matrix form:

$$y_t = W^+ \sum_{\tau=1}^{\tau_{\max}} \Phi(\tau)\, W y_{t-\tau} + \epsilon_t. \qquad (3.2)$$

For the noise term, Σ is calculated using W:

$$\Sigma = W^+ (W^+)^T \cdot \rho + \sigma I, \qquad (3.3)$$

where ρ is the spatial covariance and σ is the variance of the noise. The spatial covariance regulates how strong the noise is in relation to the signal. A higher value makes it easier for dimensionality reduction algorithms to find the mode, but makes it harder for causal discovery algorithms to detect the relations between the modes. A higher variance makes the data noisier, making it harder to find the causal relations between the modes. By defining a weight matrix W and a matrix Φ defining the causal links between the components, the 3-dimensional timeseries can be generated by applying equation 3.2 at each time step.
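To make equations 3.2 and 3.3 concrete, the sketch below generates SAVAR-like data with numpy. It is our own minimal reimplementation under simplifying assumptions (a fixed transient, Gaussian noise), not the reference implementation by Tibau et al; `generate_savar` is a hypothetical helper.

```python
import numpy as np

def generate_savar(W, Phi, T, rho=1.0, sigma=1.0, transient=200, seed=0):
    """W: (N, L) weights mapping L gridpoints to N components;
    Phi: (tau_max, N, N) causal coefficients per lag.
    Returns gridpoint timeseries of shape (T, L) via equation 3.2."""
    rng = np.random.default_rng(seed)
    tau_max = Phi.shape[0]
    L = W.shape[1]
    W_plus = np.linalg.pinv(W)                          # Moore-Penrose pseudoinverse
    cov = W_plus @ W_plus.T * rho + sigma * np.eye(L)   # equation 3.3
    noise = rng.multivariate_normal(np.zeros(L), cov, size=transient + T)

    y = np.zeros((transient + T, L))
    for t in range(tau_max, transient + T):
        signal = sum(Phi[tau - 1] @ (W @ y[t - tau])    # component dynamics
                     for tau in range(1, tau_max + 1))
        y[t] = W_plus @ signal + noise[t]               # back to grid space
    return y[transient:]                                # discard the transient
```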


Figure 3.3: Found autocorrelation for climate data (a) and SAVAR-generated data (b). In (a), the autocorrelation of the target variable is shown above and that of a precursor below; in (b), the autocorrelation of the target is shown on the left and that of a precursor on the right. As seen in the example climate data, the autocorrelation of the precursor timeseries stays much higher than that of the target timeseries. The settings of SAVAR have been chosen such that the precursor autocorrelation and the target autocorrelation are similar. The precursor autocorrelation of SAVAR still doesn't get very similar to real climate data.

For this research, an independent noise term can be used instead of the noise implemented by SAVAR. This noise term is independent of W and is added to the data after they have been generated.
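In code, adding such a term can be as simple as the following (the noise level sigma_indep and the field shape are illustrative, not SAVAR defaults):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_indep = 0.5                                 # illustrative independent noise level
field = rng.normal(size=(1000, 30, 180))          # placeholder for the generated data
field += rng.normal(0.0, sigma_indep, size=field.shape)   # added after generation, independent of W
```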

3.3.1 SAVAR settings

How we define the causal model used by SAVAR for generating the 3-dimensional timeseries will be explained in section 4.1. Most of the other settings of SAVAR are kept at their defaults. Modes use a Gaussian distribution. The shape of one mode is 30 × 30 grid cells; this gives the mode high enough resolution while keeping the computational time of RG-CPD reasonable. A transient of 200 timesteps is used.

Every mode we define in the causal models has the same autocorrelation, except for the target. The autocorrelation values are chosen such that the autocorrelation found in the generated timeseries is similar to the autocorrelation found in climate data. Figure 3.3a shows autocorrelation found in climate data. We can see that the autocorrelation of the target variable quickly decreases over days, while the autocorrelation of the precursor stays very high even after 8 days. The precursors used for S2S forecasting often vary very slowly in climate data [27]. To reflect this in the SAVAR-generated timeseries, the autocorrelation of the target is set to 0.6 while the autocorrelation of the precursors is set to 0.96. Figure 3.3b shows the autocorrelation of the generated timeseries, which now comes closer to the autocorrelation found in real climate data.

Before performing tests to improve RG-CPD, we first fine-tune the other settings of SAVAR to find a baseline setting such that the precursors are reasonably difficult to find. By tuning the settings of SAVAR, we simultaneously test the limitations of RG-CPD. The settings that will be tuned are described in the experiments section 4.4.


4 Experiment design

This chapter introduces the different models used for this thesis. It also discusses the scoring methods. Finally, it explains the experiments that are performed.

4.1 Causal models

The input of the SAVAR model is a causal model, which specifies the causal links between N modes. All the models used for this thesis share some restrictions:

1. Mode 0 is defined as the target variable. Modes {1, . . . , N } are the possible precursors.

2. The possible precursors are placed in the area ordered from left to right: mode 1 is always the leftmost mode and mode N the rightmost.

3. All modes have autocorrelation.

4. All links have a lag of one timestep, as we only test the spatial features while keeping the temporal features consistent.

Using these restrictions, the following models are defined. Each model is an extension of the previous one. Figure 4.1 shows examples of the defined models.

• Model A. This is the simplest model and consists of only one precursor area, which has a link to the target. This model is used to isolate the effect of the signal-to-noise ratio on RG-CPD's score as defined in section 4.3.


• Model B. This model has multiple precursors which all have a link with the target. The precursors are sorted by signal strength: precursor 1 has the strongest signal while precursor N has the weakest, at half the signal strength of precursor 1; the signal strength increases linearly from precursor N to precursor 1. This model is used to test the impact of multiple regions having possibly different effects on the target. As the signal strength decreases for succeeding modes, this model doubles as a test for the effect of the signal strength.

• Model C. When tested on real climate data, some areas are found in only one or two test splits by RG-CPD. Our assumption is that these rarely found areas are non-precursors picked up through accidental correlation. To test this phenomenon, some non-precursors are added to model B which have no link to the target or to any of the precursors. This model tests RG-CPD's ability to filter out regions that have no effect but could be found by random correlation.

• Model D. We assumed the precursors in the previous models only had a causal link to the target. In real data, these precursors also have their own precursors, which creates indirect links towards the target. To test for these situations, model D is used, where the two non-precursor modes of model C are given links to some of the precursors. Although these modes have no direct effect on the target, their effect can be seen through their indirect links. Ideally, these modes are still filtered out by RG-CPD, as we only seek the direct links towards the target. This is the most complex model.

• Model E. In the previous models, the target only had direct or indirect incoming links. Most realistic networks are more complex, with the target variable also acting as a precursor to another variable. An outgoing link from the target variable will also correlate with the target, but ideally should be filtered out: the bivariate measure must not only find the link, but also find its direction. To test for outgoing links, model E is used, where the two non-precursors are set as successors of the target variable.

The timeseries generated by SAVAR are saved into netCDF4 files which can be read by the RG-CPD framework. The timeseries of the target variable is saved in a separate netCDF4 file as a 1-dimensional timeseries.
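For illustration, such files could be written with xarray; this is a sketch, and the coordinate names, file names and placeholder shapes are our assumptions, not necessarily those used by RG-CPD.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Placeholders standing in for the reshaped SAVAR output and its target mode
field = np.random.randn(1000, 30, 180).astype("float32")
target = field[:, 15, 15]

time = pd.date_range("1979-01-01", periods=field.shape[0], freq="D")
da = xr.DataArray(field, dims=("time", "latitude", "longitude"),
                  coords={"time": time}, name="precursor_field")
da.to_netcdf("savar_field.nc")                       # full 3-dimensional field

xr.DataArray(target, dims=("time",), coords={"time": time},
             name="target").to_netcdf("savar_target.nc")   # 1-d target timeseries
```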


Figure 4.1: Different causal models used to test RG-CPD, sorted by increasing complexity. (a) The simplest causal model, consisting of only one precursor with a link to the target. (b) A causal model using multiple precursors, each with a link to the target; precursor 1 has the strongest signal while precursor 6 has the weakest. (c) A causal model with multiple precursors, each with a link to the target, plus two modes that have no link at all. (d) The most complex causal model, with multiple precursors linked to the target and two other modes that have no link to the target but can have a link to a precursor. (e) A causal model using multiple precursors, each with a link to the target, plus two other modes that have a link from the target. Models (A) and (B) are used to test for breakpoints of the RG-CPD framework, while models (C), (D) and (E) are used to test the different methods used by RG-CPD.

4.2 Detecting the causal model

The RG-CPD framework calculates a dependence map between the precursor timeseries and the target timeseries, shown in figures 4.2c and 4.2d. These examples show a circular region around the center of each mode that correlates with the target; only for modes 2-5 is the correlation found to be significant. In this thesis, the following conditional independence tests are performed for this step:

1. Correlation
2. Partial correlation with lag 5 on the precursor
3. Partial correlation with lag 2 on the precursor
4. Partial correlation with lag 1 on the precursor
5. Partial correlation with lag 5 on the target
6. Partial correlation with lag 2 on the target
7. Partial correlation with lag 1 on the target
8. Transfer entropy
9. Granger causality using the Chi-squared significance test
10. Granger causality using the F-test significance test

For partial correlation, the lag parameter tells us up to how many days before the current time the effect of the variable's history is removed. This can be done for both the precursor and the target variable.
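A minimal sketch of such a partial correlation metric, conditioning on the history of either series, could look as follows. This is our own helper for illustration, not the RG-CPD implementation: the conditioning variable's lagged values are regressed out of both series and the residuals are correlated.

```python
import numpy as np
from scipy import stats

def partial_corr_lagged(x, y, lag, condition="target"):
    """Correlation between precursor x and target y after regressing out
    the conditioning variable's own history up to `lag` steps back."""
    z = y if condition == "target" else x
    # Design matrix of lagged copies of the conditioning series, plus intercept
    Z = np.column_stack([z[lag - k : len(z) - k] for k in range(1, lag + 1)])
    Z = np.column_stack([Z, np.ones(len(Z))])
    x_c, y_c = x[lag:], y[lag:]
    res_x = x_c - Z @ np.linalg.lstsq(Z, x_c, rcond=None)[0]
    res_y = y_c - Z @ np.linalg.lstsq(Z, y_c, rcond=None)[0]
    return stats.pearsonr(res_x, res_y)    # (coefficient, p-value)
```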

Correlation and partial correlation are used to test the limitations of RG-CPD. For partial correlation, we test on the target and on the precursor with the lag parameter set to 1, 2 and 5. For the final experiments, we test all metrics; see section 4.4.

Using the dependence map, the area is clustered into regions using the DBSCAN algorithm, as shown in figure 4.2e. In the example, only modes 2-5 are found by the clustering algorithm. Using the PCMCI algorithm, the causal map is found. Figure 4.2b shows the found causal map. As the correlation analysis did not find modes 1 and 6, PCMCI is not able to find their causal links to the target.
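As a rough sketch of the clustering step (the eps and min_samples values are illustrative, and the real RG-CPD additionally separates regions by correlation sign and handles longitude wrap-around):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder dependence map and p-values over a lat x lon grid
rng = np.random.default_rng(0)
corr = rng.uniform(-1, 1, size=(30, 180))
pval = rng.uniform(size=corr.shape)
sig = (pval < 0.05) & (corr > 0)          # significant, same-sign grid cells

coords = np.argwhere(sig)                 # (row, col) indices of significant cells
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(coords)
# Each non-negative label is one candidate precursor region
```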

4.3 Score functions

By using generated datasets, we try to recover a causal model we already know. This lets us compare our findings with the known truth.

The output of PCMCI consists of three matrices containing the p-values, strengths and confidence intervals for each possible link. A link is considered part of the causal map when its p-value is under a certain threshold α. For the tests performed in this thesis, this threshold is chosen as α = 0.01. Using the threshold, the p-value matrix can be converted into a boolean matrix telling us which links are causal. The strength of a link can then be found in the val matrix using the indices found in the p-value matrix, and the confidence interval can be found in the conf matrix.

The three matrices have shape (N, N, max lag + 1), where N is the number of modes, and max lag is the maximum lag. The first index gives the source of the link, the second index gives the target and the third index gives the lag.
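For illustration, extracting the links into the target from these matrices could look like this; the stand-in results dictionary mimics tigramite's key names, and the random values are placeholders for actual PCMCI output.

```python
import numpy as np

# Stand-in for the PCMCI output dictionary (key names follow tigramite)
rng = np.random.default_rng(0)
N, max_lag = 5, 2
results = {'p_matrix': rng.uniform(size=(N, N, max_lag + 1)),
           'val_matrix': rng.normal(size=(N, N, max_lag + 1))}

alpha = 0.01
causal = results['p_matrix'] <= alpha                 # boolean link matrix
strength = np.where(causal, results['val_matrix'], np.nan)

# Precursor modes with a significant link into the target (mode 0) at lag 1
for i in np.flatnonzero(causal[:, 0, 1]):
    print(f"mode {i} -> target at lag 1, strength {strength[i, 0, 1]:.2f}")
```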


Figure 4.2: The general workflow used in this thesis. (a) A causal model with six precursors, each linked to the target, used to generate the 3-dimensional timeseries. (b) The causal model found by the RG-CPD framework using timeseries generated by SAVAR from the causal model shown in (a). (c) Correlation map of the 3-dimensional timeseries generated by the model shown in (a); red areas have a positive correlation and blue regions a negative correlation. (d) Partial correlation map of the same timeseries. (e) Regions clustered using the correlation map shown in (c). First a causal model is generated (a); the SAVAR algorithm generates 3-dimensional timeseries from it; the RG-CPD framework calculates a correlation map (c); the DBSCAN algorithm clusters the map into regions of significant correlation (e); the PCMCI algorithm then infers the causal map (b).

Knowing the real causal map, we can generate these matrices containing the truth values. A score can be defined as the average number of correct matrix entries. As we are only interested in the links to the target mode 0, we ignore all links to other modes in the calculation of the score. With the default experiment settings, the scoring method tests the whole matrix, but comparing the matrices at different lags and testing the autocorrelation can be turned off. For testing the limitations of RG-CPD, only lag 1 is tested.

As the RG-CPD framework filters some regions out, the output matrices of the PCMCI algorithm might be smaller than the matrices generated by the real causal map. As we cannot compare matrices of unequal size, only the rows and columns corresponding to the found modes are compared.

When a mode is filtered out even though it had a causal link to the target, RG-CPD should be penalized. To do this, a list representing the real links is created. The list is initialized with ones, but elements corresponding to a mode that has a link to the target variable are set to 0 (the target itself is also set to 0, since it has an autocorrelation link to itself). For example, if a model with 8 modes has a link from mode 1 and a link from mode 2 to target mode 0, we get:

\[
R = [0, 0, 0, 1, 1, 1, 1, 1]. \tag{4.1}
\]

Only modes that are found by RG-CPD are compared with the real matrix, giving a score for each found mode. The corresponding indices in the list are replaced by their scores. So if, in the above example, RG-CPD had found that modes 2, 3 and 6 correlate with the target variable, the list becomes:

\[
\text{Scores} = [s_0, 0, s_2, s_3, 1, 1, s_6, 1], \tag{4.2}
\]

where $s_i$ is the score of mode $i$. In the example, mode 1 has a causal link in the real model, but as RG-CPD has not found this mode to have a correlation, its score is now 0. Modes 4, 5 and 7 have no causal link in the real model and are not found by RG-CPD, so their scores remain 1, as RG-CPD has filtered them out correctly. This way, RG-CPD is penalized for not finding modes it should, and gets a bonus if a mode is correctly filtered out.

The final score is the average of the scores in the list of equation 4.2.
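A compact sketch of this scoring rule (the helper name and the example s_i values are illustrative):

```python
def final_score(n_modes, true_parents, found_scores):
    """Score list from equations 4.1 and 4.2: start from the 'real links'
    list R, then overwrite the entries of the modes RG-CPD actually found."""
    scores = [0.0 if m in true_parents else 1.0 for m in range(n_modes)]
    for mode, s in found_scores.items():
        scores[mode] = s            # per-mode score from comparing the PCMCI output
    return sum(scores) / n_modes

# The example of equations 4.1 and 4.2: modes 1 and 2 (and the target 0) have
# real links; RG-CPD found modes 0, 2, 3 and 6, each with some score s_i.
print(final_score(8, true_parents={0, 1, 2},
                  found_scores={0: 0.9, 2: 0.8, 3: 1.0, 6: 0.7}))
```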

4.3.1 Other measures

The scoring method defined above gives a score for the whole RG-CPD process. To get a better understanding of the different parts of the RG-CPD framework, we define other measures which test how well the framework is able to find the modes before the PCMCI algorithm is used.

• Found. This scoring method tests which modes are found by the precursor detection part of RG-CPD. The score is the average number of correctly found/filtered modes.

• Precur area. This is the averaged size of the precursor modes found by the precursor detection part of RG-CPD. For every precursor mode, the area found by RG-CPD is divided by the known size of the mode. The score can be given either as the average over all precursor modes or as a list with a score for each mode. This score should be high.

• Non-Precur area. This is the averaged size of the non-precursor modes found by the precursor detection part of RG-CPD. For every non-precursor mode, the area found by RG-CPD is divided by the known size of the mode. This score should be 0, as these modes must be filtered out.

• Consistency. This score tests, for every mode, in how many test splits the mode is found. It gives a list with the consistency score for each mode. Ideally, the precursor modes should be found in every test split and the non-precursor modes in none.

Combined, these scoring methods give a better understanding of how well the RG-CPD framework is able to find the correct modes. Compared with the final scoring method, they let us separate the errors made by the clustering part of RG-CPD from the errors made by PCMCI.

4.3.2 Benchmark

Using the same Gaussian distribution that is used to generate the areas of the modes, the areas are also converted back to 1-dimensional timeseries. We perform PCMCI on these timeseries directly to give a benchmark score. This score represents the score obtained if the precursor detection part were able to find the modes perfectly. When the RG-CPD score is lower, it means RG-CPD either was not able to find a mode, or did not find the whole area, losing important causal information.
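A sketch of this benchmark, reusing the known weight matrix W that generated the data (the function and variable names are ours):

```python
import numpy as np

def benchmark_components(field, W):
    """Project the flattened (time, locations) field back onto the true modes
    using the known weight matrix W (modes x locations), yielding the
    1-dimensional mode timeseries a perfect precursor detection would find."""
    T = field.shape[0]
    return field.reshape(T, -1) @ W.T      # shape (time, n_modes)

# comps = benchmark_components(field, W)
# Running PCMCI on `comps` directly gives the benchmark score.
```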
