
Master Thesis

Causal Confusion in Imitation Learning

by

Pim de Haan

10243496
36 ECTS
Mar 2018 – Jun 2020

Supervisors:

Sergey Levine

Prof. Max Welling

Examiners:

Prof. Joris M. Mooij


Abstract

Imitation learning is our dominant strategy for teaching machines to automate tasks. Its simplest form, behavioural cloning, reduces imitation learning to supervised learning and uses a demonstration data set to train a model to predict expert actions given observations. Such models are non-causal, as the training procedure is ignorant of the causal structure that generated the environment and the expert's actions. We point out that an imitator that implicitly learns a wrong causal model and is "causally confused" may perform well on the expert's demonstrations, but fail when executed in the environment. We analyze this phenomenon analytically in a toy setup and empirically in more realistic environments, and propose to solve it through targeted intervention to determine the correct causal model. Our model extends to visual settings, in which the causal variables are not given a priori but need to be discovered by representation learning. Furthermore, we propose to use variational inference for Bayesian causal discovery, to infer as much information as possible from the demonstration data alone, before intervention. We evaluate our model on several control domains and show that it outperforms DAgger and other baselines.


Acknowledgements

First, I'd like to thank Sergey Levine for hosting me in Berkeley, supervising me and providing me with an amazing opportunity to experience world-class machine learning research from up close. His endless source of great ideas, ranging from the highest level to the lowest, helped the project throughout, and when I was ready to throw in the towel at several points, he taught me to have the grit to keep pushing. Secondly, this project would have gotten nowhere without the help of collaborator and supervisor Dinesh Jayaraman. I'd like to thank him for sharing his research experience solving actual hard problems, his great ideas, his help organizing and executing the project, and the fantastic writing he did on our paper.

I would also like to thank Lea for having the audacity to travel to the US at the same time, and my parents for making the trip possible, as well as the Hendrik Mullerfonds.

Lastly, I'd like to thank the examiners for reviewing this thesis and my colleagues at Qualcomm for providing me with the time to finish this project.

Declaration

Most of the ideas and experiments in this thesis have previously been published as de Haan et al. [10], co-authored by Dinesh Jayaraman and Sergey Levine. The textual content of the paper, to which Dinesh and Sergey have contributed greatly, has not been used for this thesis, except for section 4.1 which I wrote in its entirety in the paper and re-used for the thesis. Some similarity between the text of the paper and the thesis is nevertheless inevitable. All figures are also new, except for figures 1.1, 4.2 and 4.4, which were made by Dinesh and re-used with permission for this thesis. Although the idea of studying causality in imitation learning was due to Sergey, the majority of ideas and formalization and all of the execution were by me.


Contents

Abstract

Acknowledgements

Declaration

1 Learning from Imitation
  1.1 Causal Confusion
  1.2 Imitation Learning under Distributional Shift
  1.3 Causal Inference
  1.4 Contributions

2 A Simple Example
  2.1 Graphical Models and Causality
  2.2 Markov Decision Processes
  2.3 A Confusing MDP
  2.4 Defining Causal Confusion

3 Diagnosis
  3.1 Diagnoses in driving settings

4 Solution Methods
  4.1 Causal Discovery & Faithfulness
  4.2 Learning a Graph-conditional Policy
  4.3 Scoring Hypotheses with Targeted Soft Intervention
    4.3.1 Intervention by Expert Query
    4.3.2 Intervention by Policy Execution
  4.4 Representation Learning
  4.5 Variational Causal Discovery
    4.5.1 Alternative interpretation

5 Experiments

6 Conclusion


Chapter 1

Learning from Imitation

As it is in humans, learning new behaviour by imitating others is the dominant learning strategy in machine learning. Nevertheless, a key difference exists. Humans imitate while situated in an environment and optimize their imitation using feedback from the environment. They combine the information from the demonstration with the experience gathered by interacting to accomplish a task. Machine learning, on the other hand, makes radically simplifying assumptions in its most prevalent form of imitation learning, supervised learning. It is assumed that the decisions made by the imitator do not affect future decision problems. This leads to very practical algorithms and finds successful application in tasks like disambiguating cats from dogs in images – in which the stack of images to be classified is unaffected by the act of classification – but may fail dramatically in more complicated tasks like driving a car – in which sub-optimal decisions lead to dangerous situations, rarely observed by humans, that need to be resolved.

This reduction of imitation learning in an interactive environment to supervised learning, by making the simplifying assumption that subsequent decision problems are independent tasks, is referred to as behavioural cloning. It is easy to implement and requires neither interaction with the environment to learn, nor a definition of what constitutes good behaviour. All that is needed is a data set of demonstrations, containing many pairs of an observation made by an expert and the decision the expert took based on that observation. A supervised learning algorithm is then used to train a model, such as a neural network, to clone these decisions. Due to its practicality, behavioural cloning is widely used in practice, also in problems where the simplifying assumptions do not hold. Examples include driving cars [40, 34, 6, 2], walking humanoid robots [44], playing table tennis [35], using a robot arm to grasp rigid objects [30], or handling fish [11]. Beyond academia, behavioural cloning is also being used for commercial manipulation tasks and is being developed for use in decision making in autonomous vehicles. This begs the question: is operating under the crude simplifying assumption of behavioural cloning, which is often false in practice, good enough for the problems we care about?


1.1 Causal Confusion

In this thesis, we identify "causal confusion": a surprising instance of the breakdown of the core assumption in behavioural cloning that is due to causality and is very harmful to the performance of our learner. We define causal confusion as the phenomenon in which an imitation learner wrongly identifies the causal model of the expert's behaviour and therefore fails the task. It arises because learning models are generally not endowed with rich prior knowledge about the causal mechanisms in the world, such as the existence of objects, the laws of physics or the arrow of time. From the demonstration data set alone it can be impossible to infer which aspects of the expert's observation are the causes of the expert's action and which aspects are spuriously correlated.

A causally confused imitator may be well able to predict the expert's decisions on the observations present in the demonstration data set by exploiting these spurious correlates. However, such predictions can be unstable and only accurate in the narrow set of circumstances the expert encountered. When this imitator is then executed in the environment, its decisions, which rarely match the expert's exactly, will invariably change the distribution of states encountered from the ones in the demonstration data. This is referred to as the distributional shift from expert demonstrations to imitator roll-outs. The shift can break the spurious correlations on which the imitator relied to predict the expert's next action. Hence the imitator will fail to generalize to states in the shifted state distribution which were rare in the expert's demonstration state distribution, as it will not have been trained to make decisions that are in agreement with the expert, leading to poor performance.

To illustrate, consider a hypothetical motivating example, depicted in figure 1.1. We use behavioural cloning to train a neural network to drive a car. In scenario A, the observations consist of an image of the traffic situation through the windshield, as well as of the dashboard, which contains an indicator that turns on immediately after the brake is applied. In scenario B, the dashboard is blacked out. In both scenarios, behavioural cloning yields an imitator whose predictions are in agreement with the expert on observations from the demonstration data set, and even on demonstration examples that were not used to train the imitator.

Figure 1.1: Scenario A (full information) versus Scenario B (incomplete information); in Scenario A the policy attends to the brake indicator.

However, when the imitators drive on the road, the model from scenario B correctly learns to apply the brakes whenever the traffic situation dictates, but the model from scenario A fails to do so. Instead, it wrongly learned to apply the brake whenever the brake indicator is on, even though the brake indicator is the effect of the expert braking, not its cause. In a driving scenario, braking is often repeated many times successively, as is not braking. Hence, in the demonstration data, the brake indicator is almost always on if the expert decides to brake next, and off if not. The imitator can thus accurately predict the expert's decision to brake on the demonstration data from just observing the brake indicator and ignoring the road view. However, when driving on the road and a pedestrian appears in front of the car, the brake indicator would be off, and the imitator would fail to brake.

This situation presents a tell-tale symptom of causal confusion: access to more information, which might include spurious correlates of the expert's behaviour, leads to worse generalization performance under distributional shift. This may be surprising, as machine learning practitioners generally attempt to provide the learner with observations that are as rich as possible. For example, in interactive environments, immediate observations may not include all necessary information about the current state of the world, which is addressed by providing the learner with a history of observations. However, just as the brake indicator provided information about past actions, causing causal confusion, a history of past observations can cause the same effect, harming imitation performance.

Prior work has already identified particular instances of causal confusion in imitation learning. In particular, Muller et al. [34] and Wang et al. [58] noted that adding history may harm the performance of an imitator and addressed this by not adding history at all. Furthermore, Bansal et al. [2] note that the presence of past actions in the history harms performance, which the authors address by performing dropout on this particular feature. The general case, in which unknown elements of the state are spuriously correlated with the expert's actions, has however not yet been identified in prior work.

1.2 Imitation Learning under Distributional Shift

Imitation learning by behavioural cloning dates back to Widrow and Smith (1964) [60] and is still commonly used today [40, 34, 6, 2, 44, 35, 30, 11]. Its general problem of distributional shift, wherein imitators fail to generalize to states encountered during execution in the environment, has been widely acknowledged [40, 9, 41, 42, 24, 19, 2]. It is closely related to the "feedback problem" in machine learning [47, 1] and the problems with teacher forcing in sequence learning [3]. For imitation learning, various solutions have been proposed [9, 41, 42], of which DAgger [42] is the most widely used. The DAgger algorithm is very simple to state: (1) train an imitator on demonstration data, (2) execute the imitator in the environment, (3) query the expert for its decision on all states encountered by the imitator and add these to the demonstrations, and (4) repeat. With the additional supervision, a DAgger imitator also knows how to act in situations that are rare for the expert to encounter, and it can be proven to converge to a good imitation policy. However, as all states are labelled indiscriminately, it may require many queries to the expert, making it complicated to use in practice. In addition, it turns out that it is difficult for humans to provide the feedback necessary for DAgger. Again using a driving scenario, imagine being the expert providing correct actions for a trajectory generated by the imitator. As it is a replay of the imitator's prior experience, this involves having to select correct steering angles while receiving no feedback from the environment. It turns out this is hard for humans to do. So while algorithms like DAgger address distributional shift in theory, room for improvement on particular instances of distributional shift exists.
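Schematically, the loop reads as follows (a sketch only; train_bc, rollout and expert.act are hypothetical helpers standing in for behavioural cloning, environment execution and expert labelling):

```python
def dagger(expert, env, demos, n_iterations=10):
    """Schematic DAgger loop: train, roll out, relabel every visited state."""
    dataset = list(demos)                                # (state, expert_action) pairs
    for _ in range(n_iterations):
        policy = train_bc(dataset)                       # (1) behavioural cloning
        states = rollout(env, policy)                    # (2) execute the imitator
        dataset += [(s, expert.act(s)) for s in states]  # (3) expert labels all states
    return train_bc(dataset)                             # (4) repeat, then final fit
```

The expensive step is (3): every state visited by the imitator is labelled, regardless of how informative it is.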

1.3 Causal Inference

In our hypothetical motivating example, the issues from distributional shift can be addressed if the imitator has knowledge of the causal mechanisms in the world and in the expert. In some settings, such information is known a priori and can be built into the learner, before any learning takes place. However, in many settings this is difficult, such as when the learner is a neural network directly operating on raw visual input or when the causal structure is highly complicated. In such settings, we may wish to infer the causal mechanisms from observations.

In our motivating example, this inference would let us conclude that only the road view is a cause of the expert's behaviour and that the brake indicator is a mere nuisance, and then let us train a correct predictor of the expert's behaviour based on the road view alone. In general, we can model every hypothesis about the causal structure of the expert as a selection of which aspects of the state are causes of the expert's behaviour and which are nuisance variables, and we desire to infer the correct causal hypothesis.

For some environments, like our example above, the state is already "disentangled" in a way that the state variables can be treated as causal variables, and some subset of state variables can be selected to remove all nuisance information. However, for environments with visual states, the individual pixels cannot in general be readily interpreted as causal variables. Our strategy for still reasoning causally about visual states is to first use representation learning to attempt to infer causal variables from the visual input, and to subsequently infer the causal structure of these variables.

Causal inference is the general problem of inferring cause-effect relationships between variables [36, 38, 53, 51, 52, 12]. When done from passively observing a fixed data set, it is referred to as causal discovery [54, 18, 26, 16, 27, 28, 25, 15, 32, 59]. We will passively discover some information about the causal structure using a variational inference method to perform causal discovery. However, as we will later discuss, passive discovery relies on assumptions [36, 38] that are easily violated in the tasks we consider. Hence, we will perform causal inference in the interventional regime [57, 13, 50, 48], whereby the imitator interacts with the environment and collects data outside the demonstration data. On these new trajectories, we will be able to discern differences between the causal hypotheses. These differences will be scored using some additional information, such as the degree to which the predictions corresponding to a causal hypothesis agree with the expert, and the highest-scoring hypothesis will be selected. Overall, the method we propose consists of two phases:


1. For all causal hypotheses, train an optimal imitator based on the demonstration data.

2. Intervene in the environment and select a causal hypothesis based on scoring.

1.4 Contributions

The goal of this thesis is to answer four questions:

1. Can we understand causal confusion using graphical causal models?

2. Does causal confusion exist in relevant imitation learning problems?

3. Can we solve causal confusion?

4. How do we infer causal mechanisms from unstructured data?

We answer these questions by defining causal confusion as a failure of imitation learning caused by a particular form of distributional shift due to causality. We ground this definition rigorously in the language of Pearl [36], analyse it quantitatively in a tabular toy setup and propose a practical algorithm to address it in more complicated environments, which we extensively evaluate on relevant control benchmarks.

Furthermore, to the best of our knowledge, we are the first to perform causal inference on visual states by representing the state with a learned representation, and the first to use variational inference to approximate Bayesian causal discovery.


Chapter 2

A Simple Example

In order to study causal confusion, we first analyse it in a simple example. The example will show how distributional shift arises naturally in an environment in which we take actions sequentially, how an imitator suffers from it, and how understanding the causal structure of the expert’s behaviour can resolve the issue.

2.1 Graphical Models and Causality

First, we briefly set up some important notation, which will be useful to model probability distributions over many variables. Throughout, we presume basic knowledge of probability theory. A more in-depth introduction to these topics can be found in the excellent reference by Pearl [36], from which much of this section has been adapted. Note that there are other ways of formalizing causality, such as Rubin's [43], which is popular in the social sciences. Here we stick to Pearl's formalization, which is the most popular in the field of Artificial Intelligence.

Given N random variables, we often have some prior knowledge about the existence of conditional independence relationships. Bayesian networks are a structured way of encoding these relationships in a graph. A Bayesian network is simply a directed acyclic graph (DAG) whose nodes are the random variables and whose edges encode independence relationships. We call a joint distribution P over the random variables compatible with Bayesian network G if we can factorise P as

P(x_1, ..., x_N) = ∏_i P(x_i | pa_i),    (2.1)

where pa_i denotes the parents of node i according to the Bayesian network G. A simple example of a Bayesian network is given in figure 2.1.

Pearl derived a theorem, proven in [36], that can be used to characterise which distributions can be compatible with a Bayesian network G. Consider any P compatible with G. Then, for any three disjoint sets of random variables X, Y and Z, X and Y are independent given Z under P if all paths between any node in X and any node in Y are d-separated by Z. A path between nodes x and y is any consecutive sequence of edges starting at x and ending at y, irrespective of the direction of the edges.


Figure 2.1: A simple Bayesian network that encodes independences of the distribution P(N, M, C, T) of the variables that affect the completion of this thesis: N (Pim needs to write a thesis), M (Pim finds motivation to write), C (the computer is working) and T (the thesis is finished). P is compatible with the diagram if we can write P as P(N, M, C, T) = P(T|C, M) P(M|N) P(N) P(C). Among others, this factorisation encodes that the need to write a thesis is independent of the functioning of the computer, and that the need to write only affects the final result through generating motivation and not otherwise.


Definition 2.1.1 (d-separation). A path p is d-separated by a set of random variables Z if and only if either:

1. p contains a "chain" i → m → j or a "fork" i ← m → j such that m ∈ Z, or

2. p contains a "collider" i → m ← j such that m ∉ Z and no descendant of m is in Z.

The first case is rather obvious. In our example in figure 2.1, we see that the path between N and T is blocked by M. Thus, N and T are independent given M, implying, for example, that if we know Pim found motivation to write his thesis, additionally knowing that he needs to write a thesis does not provide more information about whether the thesis is now finished. The second case is somewhat counter-intuitive at first. Clearly, the empty set blocks the path between M and C, meaning that the functioning of the computer and the motivation for writing are independent. However, conditioning on T unblocks the path, so that given T, M and C become correlated. This is called the "explaining away" effect. For example, if the thesis is only finished when both the computer is working and Pim is motivated, then upon learning that the thesis is not finished but that M is true, C must be false.
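These statements can also be verified mechanically. The sketch below encodes the graph of figure 2.1 with networkx and queries its d-separation routine (nx.d_separated, available since networkx 2.8; newer releases also provide nx.is_d_separator). This is an illustrative check, not part of the thesis.

```python
import networkx as nx

# The Bayesian network of figure 2.1: N -> M -> T <- C
g = nx.DiGraph([("N", "M"), ("M", "T"), ("C", "T")])

print(nx.d_separated(g, {"N"}, {"T"}, {"M"}))  # True: the chain N -> M -> T is blocked by M
print(nx.d_separated(g, {"M"}, {"C"}, set()))  # True: the collider at T blocks the path
print(nx.d_separated(g, {"M"}, {"C"}, {"T"}))  # False: conditioning on T unblocks it
```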

We will use this d-separation criterion to derive independence relationships of a probability distribution if we know it is compatible with some Bayesian network G. Note, however, that if P is compatible with G, it is not true in general that all conditional independence relationships of P are also reflected in the d-separation of G. P is faithful (also called stable) with respect to G if all independence relationships of P are reflected in G. Later we will see that the property of faithfulness plays an important role in the discoverability of a Bayesian network given (samples from) a probability distribution.

Figure 2.2: Two Bayesian networks over the variables W (Weekday), R (Rested) and P (Productive) that are equivalent, as they encode the same independence relationships (namely none), but differ as causal diagrams. Statistically, any distribution P(P, W, R) can be factorised as P(P|W, R) P(R|W) P(W) or as P(P|W, R) P(W|R) P(R), and is hence compatible with either diagram, though not necessarily faithfully. The causal structure depicted on the left indicates, for example, that the author is well-rested mostly during weekends, and productive during weekdays when he is well-rested. The diagram on the right implies that the weekend is caused by the well-restedness and is incompatible with reality.

A Bayesian network is merely a representation of the statistical properties of a probability distribution and makes no claims about the underlying process with which the distribution came about. However, we can reinterpret its graph as a causal model, which does make such claims. A causal model or functional causal model consists of a DAG G and, for each random variable x_i of the graph, a noise distribution P(u_i) and a function f_i, so that x_i = f_i(pa_i, u_i) with u_i ∼ P(u_i). Then the joint probability P(x_1, ..., x_N) is by definition compatible with G, interpreted as a Bayesian network. Critically, the interpretation of a causal model is such that the functions f_i do not just describe the data as observed, but really describe the process by which the data is generated. Two examples of causal models are given in figure 2.2.

It is questionable whether we ever really can infer such causal relationships, beyond just finding an explanation that fits the observed data [20]. Nevertheless, in practice we often do make causal statements about the generative process of the data we observe, using prior information. For example, we often assume that events in the future can not cause events in the past.

When we do dare to work with causal models, they can be used not only to answer questions about conditional or marginal probability distributions, but moreover to answer questions about the result of changing the process that generates the data. In particular, from a causal model we can deduce how the joint probability distribution changes when we, as a deus ex machina, externally intervene on the generating process and fix some random variable X_i to have some value x.

Knowing how a data generating process is affected by external intervention is a strictly more powerful and considerably more relevant form of knowledge than merely being able to predict the data generated by an unaffected process.

                 Rested (R)   Tired (¬R)   Sum
Weekday (W)           5           45        50
Weekend (¬W)         20            0        20
Sum                  25           45        70

Table 2.1: A quantitative example of the left causal model of figure 2.2.

The interesting claims in, e.g., the social sciences are generally causal, not just probabilistic. For example, in order to make policy decisions, we are not interested in whether heavy policing correlates with high crime, but in whether crime goes up or down if we intervene and set the amount of policing to high or low. For more examples of the power of having a causal model over just a probabilistic model, we recommend the popular-science book by Pearl and Mackenzie [37].

In the causal model, an intervention corresponds to removing the incoming arrows into X_i, as the variable no longer has a cause internal to the model, and setting it to the chosen value (almost surely). The resulting distribution over a set of random variables Y, given other random variables Z, is called the interventional distribution P(Y | Z, do(X = x)). More generally, we can intervene and change any of the f_i of the causal model, while keeping the others unchanged.

In the example on the left of figure 2.2, we assume that the data generating process is such that the author is always well-rested in the weekend and only rarely on weekdays, and is productive only when well-rested on a weekday. Now the conditional probability P(P | R) of being productive while well-rested is low, because well-restedness occurs mostly in the weekend, when productivity is low. On the other hand, the interventional distribution P(P | do(R)) of being productive when we intervene and force the author to rest well is higher, because, unlike conditioning, intervention does not make it more likely to be the weekend.

To study the example quantitatively, we say the author is well-rested on only 10% of weekdays and productive if and only if it is a weekday and the author is rested. The counts of days over 10 such weeks are shown in table 2.1. We see that P(P | R) = 5/25 = 20%, as only 5 of the 25 rested days fall during the week. On the other hand, if we force the author to be rested, we see he is productive on P(P | do(R)) = 50/70 ≈ 71% of days.
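The distinction can be checked numerically. The sketch below is a minimal Monte Carlo illustration of the left-hand causal model of figure 2.2 with the probabilities above (the helper names are ours); it estimates both P(P | R) and P(P | do(R)).

```python
import random

def sample_day(do_rested=None):
    """Sample one day from the causal model W -> R, (W, R) -> P.

    W: weekday (prob. 5/7), R: rested, P: productive.
    Passing do_rested performs the intervention do(R = do_rested):
    R is set externally, ignoring its usual cause W.
    """
    weekday = random.random() < 5 / 7
    if do_rested is None:
        rested = random.random() < (0.1 if weekday else 1.0)
    else:
        rested = do_rested
    productive = weekday and rested
    return weekday, rested, productive

def estimate(n=200_000):
    # Conditional: among observed days with R = True, how often is P = True?
    obs = [sample_day() for _ in range(n)]
    rested_days = [d for d in obs if d[1]]
    p_cond = sum(d[2] for d in rested_days) / len(rested_days)

    # Interventional: force R = True on every day, then check P.
    intv = [sample_day(do_rested=True) for _ in range(n)]
    p_do = sum(d[2] for d in intv) / n
    return p_cond, p_do

if __name__ == "__main__":
    p_cond, p_do = estimate()
    print(f"P(P | R)     ≈ {p_cond:.2f}")   # about 0.20
    print(f"P(P | do(R)) ≈ {p_do:.2f}")     # about 0.71
```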

2.2 Markov Decision Processes

One particular causal model is widely used to model agents operating in interactive environments in which the agent incurs rewards based on its actions: the Markov decision process (MDP), depicted in figure 2.3. It consists of a space of states the environment can inhabit, a space of actions that can be chosen by the agent, probabilistic dynamics P(S_{t+1} | S_t, A_t) describing how the state evolves given the action taken, and a stochastic reward at each time step. If it is combined with a policy of the agent π(A_t | S_t), i.e. a distribution over actions taken given the current state, it describes how the environment and the agent interact.

Figure 2.3: A general Markov decision process.

This process obeys a Markov property: the only cause of the action at time t is the state at time t and no state prior to time t. Similarly, the next state S_{t+1} at time t + 1 is only affected by the state S_t and action A_t at time t. The graph of figure 2.3 is clearly compatible with this process.

Now we can define reinforcement learning as the optimization that attempts to find a policy π(A_t | S_t) that maximises expected reward, using interactions in the environment and the observation of rewards, or alternatively, if the underlying probabilities are available, using exact optimization with dynamic programming. Similarly, imitation learning can be defined as the optimization of the policy to also maximise expected reward, but where only a (close to) optimal expert policy π_E(A_t | S_t) is available, or samples thereof, without observations of the reward. Instead, the optimization attempts to find an imitation policy that behaves similarly to the expert. The simplest form of imitation learning, behavioural cloning, simply finds the policy that agrees most with the expert on the demonstration data. This data is generated by executing the MDP with policy π_E, resulting in a state distribution P_E(S_t), and then collecting state-action pairs (S_t, A_t). The imitator is inferred with maximum likelihood estimation on this data.
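To make the reduction concrete, behavioural cloning for a discrete action space can be written as the following maximum likelihood fit; the network architecture and hyperparameters here are illustrative assumptions, not the models used in later chapters.

```python
import torch
import torch.nn as nn

def behavioural_cloning(states, actions, n_actions, epochs=20, lr=1e-3):
    """Fit an imitator pi_I(A|S) by maximum likelihood on expert demonstrations.

    states:  float tensor of shape (num_samples, state_dim)
    actions: long tensor of shape (num_samples,) with expert action indices
    """
    policy = nn.Sequential(                      # a small MLP imitator
        nn.Linear(states.shape[1], 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),                # logits over actions
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()              # negative log-likelihood

    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), actions)  # -log pi_I(A_expert | S)
        loss.backward()
        opt.step()
    return policy
```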

However, the resulting policy π_I(A_t | S_t) of this optimization is unlikely to exactly match the expert. When executed in the environment, the different policy will induce a different state distribution P_I(S_t) in the Markov decision process from the state distribution of the expert policy P_E(S_t). This is the distributional shift mentioned in the introduction. Because of this shift, imitators that match the expert reasonably well on the demonstration data can fail spectacularly when executed in the environment.

It is important to note that this is not an instance of overfitting, in which the learner is trained with insufficient data and does not generalise to other states generated by the expert. Even with infinite data, it is possible that the imitator does not exactly match the expert, because of peculiarities of the optimisation, so that the empirical risk is not optimally minimised, or because the hypothesis class of imitators does not contain the expert. Even a small mismatch can lead to distributional shift, potentially generating states that are unlikely in the expert demonstrations P_E(S_t) but likely in the imitator roll-outs P_I(S_t). For such states, little supervision exists, making the imitator perform poorly. When the imitator is trained with finite data, the problem is exacerbated, as the mismatch between the imitator and the expert will be larger, causing a larger distributional shift, and hence causing even fewer demonstrations to exist for the states the imitator will encounter.

Figure 2.4: A confusing Markov decision process.

2.3 A Confusing MDP

Let's look at a very simple example of an MDP, which we'll call the "confusing MDP". It is inspired by our driving example and depicted in figure 2.4. Let the state consist of two binary variables: P, whether a person is in front of the car, and L, whether the brake light is on. The action space consists of the binary variable B, whether we brake. We impose simple dynamics: a person stays in front of the car or remains absent for multiple time steps successively, so we let a person appear or disappear with probability 1/100. With probability 99/100 the person remains, if present, or remains absent, if absent. The brake light is on if and only if the last action was to brake. Clearly, when a person is in front of the car, we ought to brake. Hence, if we fail to do so, we incur a penalty of reward −100. If we brake while no person is present, we incur reward −10. Otherwise we incur neutral reward 0. In this simple setup, we can evaluate a policy using value iteration in closed form with some simple linear algebra [55].

Our expert is an optimal agent that brakes if and only if a person is in front of the car, hence incurring an optimal expected reward of 0 per time step. Note that this expert pays attention to variable P, so it is a cause, while ignoring L, making it a nuisance variable. The graph in figure 2.4 without the dotted line is the causal graph of the process with an expert policy. However, a general agent may be unaware of this causal structure and consider both state variables to be causes, leading to the graph in figure 2.4 with the dotted line.

If we now let the expert interact in the environment and collect some data, we observe a trajectory like the one depicted in figure 2.5. A person appears in front of the car, the expert brakes and one time step later the brake light comes on. Then, when the person disappears, the expert stops braking and the next time step the brake light goes off. Note that even though the brake light is not a cause of the expert's behaviour, it is nevertheless highly correlated with the expert's action in that time step. In the graphical model of the expert, in figure 2.4 without the dotted line, we see that even though L_{t+1} is not a cause of B_{t+1}, the past state and action act as a confounder of the two, making them correlated nonetheless.

Given trajectories from an expert, we can learn an imitator. Our environment can be in 2 × 2 = 4 different states, so a general imitator is defined by the brake probability for each of these states. For exposition purposes, however, we make a simplifying assumption. We parameterize our imitator by

π_{α,β}(B | P, L) = σ(αP + βL),    (2.2)

where σ denotes the sigmoid function and we let P and L be 1 if True and −1 if False. This parameterization assumes that the brake light and the presence of a person independently affect the decision whether to brake. The expert brakes if and only if P = 1 and thus corresponds to π_{∞,0}. For each potential imitator, we could ask three different questions:

1. Would the imitator take the same actions as the expert has done in the demonstration data?

2. Would the expert take the same actions as the imitator has done in the roll-outs from the imitator?

3. Does the imitator yield high reward?

Figure 2.6 shows the answers to all three questions for the space of imitators parameterized by equation 2.2. Again, the expert is at position (∞, 0), so at the middle right we see the highest accuracy on demonstrations and roll-outs and the highest reward. The most interesting region is in the top left, around (−8, 10). This is an imitator that achieves 90% accuracy on the demonstration data, but close to 0% accuracy on roll-outs and the worst possible reward in the class of imitators parameterized by equation 2.2, of −50 per time step.

Clearly, we can see causal confusion in action. An imitator exists that mostly agrees with the expert on the demonstrations, but fails dramatically to pick the same actions as the expert on the imitator's roll-outs and to yield a high reward, because it imitated the expert with the wrong causal model. Erroneously, it assumed that both the brake light and the road view were causes of the expert's behaviour. If we were to train the imitator assuming the correct causal model, corresponding to forcing the parameter β on L to be 0, we would have recovered the expert model (∞, 0).
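The effect is easy to reproduce in simulation. The sketch below is an illustrative re-implementation of the dynamics and rewards stated above (the thesis itself evaluates policies in closed form with value iteration); it compares the causally confused imitator at (α, β) = (−8, 10) with a near-expert imitator at (8, 0) on demonstration accuracy, agreement with the expert on its own roll-outs, and average reward.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expert_action(person):
    return person                      # brake iff a person is present

def imitator_action(alpha, beta, person, light):
    p = sigmoid(alpha * (1 if person else -1) + beta * (1 if light else -1))
    return random.random() < p

def step_person(person):
    # person appears/disappears with prob. 1/100, otherwise stays the same
    return (not person) if random.random() < 0.01 else person

def reward(person, brake):
    if person and not brake:
        return -100
    if brake and not person:
        return -10
    return 0

def rollout(policy, T=100_000):
    """Run a policy(person, light) -> brake in the confusing MDP."""
    person, light = False, False
    traj = []
    for _ in range(T):
        brake = policy(person, light)
        traj.append((person, light, brake, reward(person, brake)))
        person, light = step_person(person), brake   # light = last action
    return traj

if __name__ == "__main__":
    demos = rollout(lambda p, l: expert_action(p))
    for alpha, beta in [(-8, 10), (8, 0)]:
        pol = lambda p, l, a=alpha, b=beta: imitator_action(a, b, p, l)
        demo_acc = sum(pol(p, l) == b for p, l, b, _ in demos) / len(demos)
        runs = rollout(pol)
        roll_acc = sum(expert_action(p) == b for p, l, b, _ in runs) / len(runs)
        avg_r = sum(r for *_, r in runs) / len(runs)
        print(f"({alpha:+d},{beta:+d}): demo acc {demo_acc:.2f}, "
              f"rollout agreement {roll_acc:.2f}, reward/step {avg_r:.1f}")
```

Running this reproduces the qualitative picture of figure 2.6: the (−8, 10) imitator imitates the demonstrations well yet agrees with the expert on almost none of its own roll-out states and collects a reward of roughly −50 per step, while (8, 0) behaves like the expert.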

2.4 Defining Causal Confusion

We can now define causal confusion in general as follows, using figure 2.7.

Figure 2.5: A trajectory from an expert in the confusing MDP. The rows are, from top to bottom, P, L and B. Black indicates True, white False.

Figure 2.6: The different behaviours of the imitators parameterized by equation 2.2: accuracy on demonstrations, accuracy on roll-outs and reward, each as a function of the P parameter α and the L parameter β.

We can always partition (potentially vacuously) the set of state variables S_t into variables C_t and E_t, such that none of the actions A_t is a successor of any E_{t−1} in the DAG, as shown in the solid lines of the figure, which depicts the MDP with the expert policy. Hence, C_t is the only cause of the expert's behaviour. The variable E_t is the effect of the expert's prior behaviour. Note that the path E_{t+1} ← (C_t, A_t) → A_{t+1} is not blocked, making E_{t+1} still correlated with A_{t+1}, due to the confounder (C_t, A_t). An imitator which does not know that E_t is a nuisance variable can learn a non-causal policy that depends on E_t and have worse performance than if it ignored E_t. This phenomenon we call causal confusion.

Above, we formulated a specific form of causal confusion, in which the misspecification of the expert's causal model by the imitator is due to the presence or absence of some arrow in a causal graph, assuming that learning the expert's behaviour from the causal demonstration data (C_t, A_t) leads to a successful imitator. Instead, we could have formulated it more generally as a mismatch between the causal equations A_t = π_E(S_t, u_t), u_t ∼ P_E(u_t) generating the expert's behaviour and A_t = π_I(S_t, u_t), u_t ∼ P_I(u_t) generating the imitator's behaviour. The main reason we pick the former over the latter is that we agree with Pearl [36] that capturing causal structure diagrammatically is a powerful tool for modelling causal relationships. Additionally, this choice leads to interpretable causal hypotheses, so that we can inspect the sensibility of different hypotheses.

Furthermore, we focus our attention on the case in which all information on which the expert relies is available in the state as seen by the imitator. If the imitator were instead lacking access to some variable which is a cause of both the expert's behaviour and the observed state, distributional shift could deteriorate performance even in a single time step. This case is an example of latent confounding, which is widely studied in the causal and statistical literature.

Figure 2.7: Causal Confusion MDP.

A theoretical solution to our form of causal confusion is to use intervention on (C_{t+1}, E_{t+1}) by externally setting the state, to "break" the correlation between (C_t, A_t) and E_{t+1}. That way, we can observe states outside of the expert's distribution. Then, we would observe that P_E(A_t | do(E_t, C_t)) = π_E(A_t | C_t) and hence that E_t is not a cause of the expert's behaviour. However, such intervention is not always practical, as the real world generally does not allow for arbitrary manipulation of its state. We will nevertheless focus on the fully observed setting and try to address causal confusion that is due to the sequential nature of the MDP.

Inspired by the above example and the general definition, we can formulate a potential practical solution to causal confusion. For each possible causal model, an imitator can be optimised to agree with the expert on the demonstration data. These imitators can then be scored based on either agreement with the expert on imitator roll-outs, or simply on reward, and the best-scoring imitator is selected. How to do this in practice when the number of possible causal models is large is the main topic of the remainder of this thesis.

Chapter 3

Diagnosis

The previous chapter showed that causal confusion can occur in a very simple Markov decision process, which we can solve analytically and for which the space of policies fits in a single plot. Now, we will demonstrate that causal confusion also occurs in more complicated environments that are closer to relevant real-life tasks. To do this, we start with standard control benchmark environments, which can be solved by standard reinforcement and imitation learning techniques. These environments still have relatively simple state spaces, containing, we assume, mostly causal information and little nuisance information. We therefore augment the state spaces with additional information by introducing a nuisance variable into the state and study the effect on imitation learning. Such augmentation may seem contrived, but in real-life learning, we wish to tackle hard problems using a wide range of observational data, some of which will invariably be non-causal nuisance information. It seems natural to desire that imitation learning algorithms do not suffer dramatically from the presence of such information.

Figure 3.1: Two of the control benchmarks: (a) MountainCar, (b) MuJoCo Hopper.

The nuisance information we will add is information about the past action, which is a proxy for the brake light in the motivating example. In many imitation learning tasks, we expect this to be the worst kind of information we could add in terms of causing causal confusion. This is because in most environments, as in the motivating driving example, subsequent actions by the expert are highly correlated. However, when a trained imitator is executed in the environment, the past action is the action the imitator chose in the previous time step, whereas in the demonstration data it is the expert's action. Whenever the imitator picks an action the expert would not, the subsequent augmented state is not contained in the demonstration data set, so that the imitator has no supervision about what the optimal next action is.

We study causal confusion in three kinds of environments from the OpenAI Gym benchmark [7], shown in figures 3.1 and 3.2. Firstly, we have the simple MountainCar environment, which has continuous 2D input and discrete actions. The goal of an agent in this environment is to alternatingly accelerate left and right to reach the flag on the right-hand side. Furthermore, we have MuJoCo Hopper [56]. It has continuous 13D input and 3D continuous actions, and the goal is to move the joints in a way that makes the agent "hop" as far as possible. Lastly, we have three games from the Atari simulator – Pong, UpNDown and Enduro – which have 32 by 32 pixel visual input and discrete actions. Pong is a simple tennis-like game, UpNDown is a platform game, and Enduro is a racing game. In order to make the Atari games an MDP, we concatenate two frames from the simulator, so that the agent can infer e.g. velocities from the differences. For each of these tasks, we train an expert with an off-the-shelf reinforcement learning algorithm. Most practical was using DQN [33] for MountainCar, TRPO [45] for Hopper, and PPO [46] for the Atari environments. We visually inspected the resulting expert policies and found they were comparable to human performance.

Figure 3.2: The Atari control benchmarks: (a) Pong, (b) Enduro, (c) UpNDown. The white digit indicates the past action taken in the simulator and is the augmented nuisance information. It is only used in the confounded imitation setting.

After obtaining demonstrations from the trained expert, we compare behavioural cloning on the original states (original) to behavioural cloning on states augmented with the past action (confounded). In terms of the motivating example, the original baseline corresponds to the imitator having access to just the windshield, while the confounded state additionally observes the brake indicator. For the MountainCar and MuJoCo environments, the state is represented as a vector, to which we simply concatenate the past action. The imitator model is a 4-layer multi-layer perceptron. For the visual Atari environments, we represent the past action visually, to stay close to the original state. This is done in the confounded setting by drawing the digit representing the past action onto the frame. The imitator for the visual environments is a 6-layer convolutional neural network.
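As an illustration of the confounded augmentation for vector-valued states (our own helper, not the thesis code), the previous expert action can be appended to each demonstration state:

```python
import numpy as np

def confound_demos(states, actions, n_actions):
    """Augment demonstration states with the (one-hot) previous expert action.

    states:  array of shape (T, state_dim) from a single expert trajectory
    actions: integer array of shape (T,) with the expert's actions
    Returns augmented states of shape (T, state_dim + n_actions).
    """
    prev = np.zeros((len(states), n_actions))
    prev[np.arange(1, len(states)), actions[:-1]] = 1.0  # a_{t-1}; t = 0 has no past action
    return np.concatenate([states, prev], axis=1)
```

At roll-out time the same augmentation must be fed the imitator's own previous action, which is exactly the mismatch that induces the distributional shift discussed above.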

The results of this comparison are shown in figure 3.3. For the original and confounded settings, we compare the behavioural cloning training loss, the loss on held-out demonstration samples and the reward attained by the imitator in the environment. From the earlier discussion, we expect causal confusion to be possible at a large number of samples, if the expert roll-outs do not fully cover the state space, but the effect to be stronger with fewer samples – and this is exactly what we observe. Across all environments, the confounded imitator, which has access to more information, performs worse in terms of reward. This difference decreases when the sample size increases, as expected.

That this is the consequence of distributional shift and not of overfitting is indicated by the fact that the train and test losses are better for the confounded setting than for the original setting. The confounded imitator is better at predicting the expert's behaviour on the expert's trajectories, but fails to perform optimally when executed in the environment. This observation confirms our hypothesis that causal confusion is at play here, in which the additional nuisance information makes the imitator pick up non-causal signals, which are predictive on the original state distribution but fail to be predictive when the distribution shifts.

3.1 Diagnoses in driving settings

Our examples add explicit nuisance information, but prior works have shown convincingly that causal confusion-like phenomena are common in imitation learning tasks with rich, complicated state spaces that have not been artificially augmented. In such cases, information about past behaviour is often implicitly present in the state and can be extracted by a sufficiently expressive imitator model. For example, Muller et al. [34] tried to improve the performance of a self-driving imitator by providing it with two consecutive video frames, but found that it then tended to repeat past actions, making it perform poorly. Similar effects were found by Wang et al. [58], who found that making history available to the learner resulted in improved losses on the demonstration data, but worse driving performance.

Figure 3.3: Diagnosing Causal Confusion. For Mountain Car, Hopper, Pong, Enduro and UpNDown, the panels show the train loss, test loss and reward as a function of the demonstration dataset size (×1000), for Scenario B (Original), Scenario A (Confounded) and the Expert. Across all environments, adding the nuisance information harms imitator performance. Error bars show the standard error of the mean over 7 runs.


Chapter 4

Solution Methods

The space of causal hypotheses about the expert policy is, in its most general form, the space of stochastic functions from the state S_t to the action A_t. For continuous state spaces, this hypothesis space is large (infinite-dimensional even) and hence causal inference is unwieldy. We make the problem tractable by reducing the hypothesis space to the space of causal graphs: the space of subsets of the state variables X_t^i that are causes of the expert's actions. In order to do so, we assume that in the MDP the state S_t consists of multiple variables X_t^i, as shown in figure 4.1. For N variables X_t^0, ..., X_t^N, this leads to 2^N causal hypotheses. One such causal graph G can be seen as a subset of {1, ..., N} or as a binary string of N bits, in which a 1 indicates that a variable is a cause and a 0 that it is a nuisance variable. We denote the space of graphs as 𝒢 and write X_t^G = (X_t^i)_{i∈G} for the causal variables under graph G. Given such a causal hypothesis, we can simply train an optimal imitator using behavioural cloning, so we are mostly concerned with inferring the correct causal hypothesis.

Figure 4.1: The causal confusion problem: which of the X_t^i are true causes of the expert's actions?

4.1 Causal Discovery & Faithfulness

                    I(X_t^i; A_t)    I(X_t^i; A_t | S_{t-1}, A_{t-1})
X_t^0 (cause)           0.377                   0.013
X_t^1 (cause)           0.707                   0.019
X_t^2 (nuisance)        0.654                   0.027

Table 4.1: Mutual information in bits in the confounded MountainCar setup.

In many learning scenarios, much information about the causal model can already be inferred passively from the data. Doing so is the problem of causal discovery. Ideally, it would allow us to perform statistical analysis on the random variables in the demonstration data to determine whether variable X_t^i is a cause of the next expert action A_t or a nuisance variable.

Causal discovery algorithms, such as the PC algorithm [53], test a series of conditional independence relationships in the observed data and construct the set of possible causal graphs whose conditional independence relationships match the data. They do so by assuming faithfulness, meaning the joint probability of the random variables contains no more conditional independence relationships than the causal graph. Under this assumption, the algorithm can simply return the set of graphs whose conditional independence relationships equal the independences found empirically in the data.

In the particular case of the causal model in figure 4.1, we see that (S_{t−1}, A_{t−1}) blocks the path X_t^i ← (S_{t−1}, A_{t−1}) → A_t. Therefore, X_t^i is a cause of A_t, i.e. the arrow X_t^i → A_t exists, if and only if X_t^i is not independent of A_t given (S_{t−1}, A_{t−1}), meaning that X_t^i provides extra information about A_t if (S_{t−1}, A_{t−1}) is already known.

We test this procedure empirically by evaluating the mutual information I(X_t^i; A_t | S_{t−1}, A_{t−1}) for the confounded MountainCar benchmark, using the estimator from Gao et al. [14]. The results in Table 4.1 show that all state variables are correlated with the expert's action, but that all become mostly independent given the confounder (S_{t−1}, A_{t−1}), implying none are causes.
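The test above requires estimating conditional mutual information. The thesis uses the estimator of Gao et al. [14] for continuous variables; for intuition, the sketch below shows the simple discrete plug-in estimator (an illustrative helper of ours, applicable e.g. to the tabular confusing MDP):

```python
from collections import Counter
from math import log2

def conditional_mutual_information(xs, as_, zs):
    """Plug-in estimate of I(X; A | Z) in bits from paired discrete samples.

    xs, as_, zs are equal-length sequences of hashable values, e.g. the
    candidate cause X_t^i, the action A_t and the conditioning set
    (S_{t-1}, A_{t-1}) encoded as a tuple.
    """
    n = len(xs)
    c_xaz = Counter(zip(xs, as_, zs))
    c_xz = Counter(zip(xs, zs))
    c_az = Counter(zip(as_, zs))
    c_z = Counter(zs)

    mi = 0.0
    for (x, a, z), c in c_xaz.items():
        # I(X;A|Z) = sum_{x,a,z} p(x,a,z) log[ p(z) p(x,a,z) / (p(x,z) p(a,z)) ]
        mi += (c / n) * log2(c_z[z] * c / (c_xz[(x, z)] * c_az[(a, z)]))
    return mi
```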

Passive causal discovery failed because the critical faithfulness assumption is violated in the MountainCar case, which has completely deterministic dynamics. Whenever a state variable X_t^i is a deterministic function of the past (S_{t−1}, A_{t−1}), then X_t^i ⊥⊥ A_t | (S_{t−1}, A_{t−1}) always holds and a passive discovery algorithm concludes that no arrow X_t^i → A_t exists. Such a deterministic transition function for at least a part of the state is very common in realistic imitation learning scenarios, making passive causal discovery inapplicable. Active interventions must thus be used to determine the causal model.

4.2 Learning a Graph-conditional Policy

In order to try out the causal hypotheses by interaction in the environment, we already need good imitator policies for each causal graph G we wish to evaluate.


Figure 4.2: Graph-conditional policy learning.

One naive approach would be to train the optimal imitator separately for all 2^N hypotheses, but this quickly becomes infeasible for larger problems. Instead, we learn a single neural network for all graphs. Besides the obvious benefit in terms of computational cost, this has the possible advantage that the network can share some learned structure between similar graphs.

For environments with N_A discrete actions, we construct a neural network f_φ : R^N × R^N → R^{N_A}, with parameters φ. The policy for each graph is then defined by π_G(A = a | X^G) = f_φ([X ⊙ G, G])_a, where we treat G as the binary vector representing the graph. The element-wise product X ⊙ G sets to 0 all variables that are not causal according to G. For environments with continuous actions of N_A dimensions, we use a deterministic policy and a similar setup.

We learn the parameters by training for all graphs jointly from demonstration state-action pairs (X, A) ∼ D. Thus, for each step, we randomly sample a graph G from a uniform distribution over 𝒢 and perform gradient descent on some environment-specific action-space loss function ℓ, which is the cross-entropy in the discrete action case and the mean squared error in the continuous action case. Hence, we optimize the following objective, also illustrated in figure 4.2:

E_{(X,A)∼D} E_G[ℓ(f_φ([X ⊙ G, G]), A)].    (4.1)
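Concretely, the objective in equation 4.1 amounts to masking out non-causal variables and appending the mask to the input. The sketch below is our own minimal PyTorch illustration of one training step, assuming a discrete action space and a small MLP; it samples a mask per sample rather than one graph per step, which is an inessential simplification.

```python
import torch
import torch.nn as nn

class GraphConditionalPolicy(nn.Module):
    """f_phi([X * G, G]) -> action logits, one network shared across all 2^N graphs."""

    def __init__(self, n_vars, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_vars, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x, g):
        # g is a binary mask of shape (batch, n_vars); X * G zeroes nuisance variables
        return self.net(torch.cat([x * g, g], dim=-1))

def training_step(policy, optimizer, x, a):
    """One step of equation (4.1): sample G uniformly, then cross-entropy on actions."""
    g = (torch.rand_like(x) < 0.5).float()   # uniform binary mask over graphs
    loss = nn.functional.cross_entropy(policy(x, g), a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```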

4.3 Scoring Hypotheses with Targeted Soft Intervention

In Chapter 2 we introduced a potential method of evaluating hypotheses by intervention. If the true graph of the expert is G* and we were able to perform arbitrary interventions, we could find out which causal hypothesis is true. We would set the state to arbitrary configurations and query the expert for its action. We would find that only for G* we have that π_E(A | do(S)) = π_E(A | X^{G*}) for all actions A and states S.¹

However, arbitrarily modifying the state of the world is not always possible in reality. What we can do, however, is to intervene and pick different actions, by executing an imitator in the environment. By intervening on A_{t−1}, we modify the distribution over S_t. In a sense, we perform a soft intervention on the subsequent state distribution [39]. This process is illustrated in figure 4.3.

¹This is true under the rather trivial faithfulness assumption that X^i → A only exists in the causal graph G* of the expert if A is not independent of X^i under the expert's interventional policy π_E.

Given these interventions, we can score the causal hypotheses in two different ways, both of which are a proxy for comparing to the expert policy under arbitrary interventions. In the first strategy, we query the expert for its decisions on the intervened trajectories and compare them to the decisions made by the policies for each graph. In the second strategy, we recognise that if we have access to the reward of the MDP and if the expert is a (close to) optimal agent, the graph-policy that achieves the highest reward is likely to correspond to the true expert's graph.

4.3.1 Intervention by Expert Query

In this strategy, we assume that the expert can be queried for its decision on arbitrary states that we encounter in the environment. We use the expert’s actions to find the causal graph whose policy agrees most with the expert. In this setup, we can make two design choices: which actions do we pick upon intervention and which states do we query the expert on.

Empirically, we found that choosing random actions leads to a poor exploration of the state space. For example, in the MountainCar case, the car never reaches its goal. Instead, we use the trained graph-conditional policies to explore the state space, with the simple strategy of randomly sampling a graph, then executing its corresponding policy for an entire episode.

Having collected a data set of such interventions, we wish to query the expert for its actions on some of these states. The classic DAgger [41] algorithm queries the expert on all states, but we wish to minimise the use of this expensive resource. What we do instead is compute which states the graph policies disagree most on, and query the expert on those states. This is an instance of active learning through Query by Committee [49].

Figure 4.3: Choosing a different A_{t−1} modifies the state distribution of S_t, leading to a soft intervention on the subsequent state distribution.

We evaluate disagreement using the standard KL divergence to the mean [31]. For this, we first define the mixture policy π_mix(A|S) = E_G π_G(A|S) and let the disagreement of a state S be

D(S) = E_G D_KL(π_G(A|S) || π_mix(A|S)).

If we wish to evaluate the hypotheses with M labels, we simply pick the M states from the intervention data set with the highest disagreement score. Subsequently, we can define the score of a graph to be the action loss function ℓ of the graph policy on the obtained labels. For discrete action spaces with stochastic policies we use the cross-entropy loss, and for continuous action spaces with deterministic policies we use the mean squared error. This process is outlined in Algorithm 1.

Algorithm 1 Expert query intervention
  Input: policy network f_φ s.t. π_G(X) = f_φ([X ⊙ G, G])
  Collect a set of states 𝒮 by executing π_mix.
  For each S in 𝒮, compute the disagreement score D(S).
  Select 𝒮′ ⊂ 𝒮 with maximal D(S).
  Collect state-action pairs T by querying the expert on 𝒮′.
  Return: arg min_G E_{(S,A)∼T} ℓ(π_G(A|S), A)
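For illustration, the disagreement score and the selection of the M most informative states can be computed as follows (a sketch of ours, reusing the graph-conditional policy sketched earlier; M and the number of sampled graphs are free choices):

```python
import torch

def disagreement_scores(policy, states, n_graph_samples=32):
    """D(S) = E_G KL(pi_G(.|S) || pi_mix(.|S)), estimated with sampled graphs."""
    with torch.no_grad():
        probs = []
        for _ in range(n_graph_samples):
            g = (torch.rand_like(states) < 0.5).float()
            probs.append(torch.softmax(policy(states, g), dim=-1))
        probs = torch.stack(probs)                   # (graphs, batch, actions)
        mix = probs.mean(dim=0, keepdim=True)        # pi_mix
        kl = (probs * (probs / mix).log()).sum(-1)   # KL(pi_G || pi_mix) per graph and state
        return kl.mean(dim=0)                        # average over sampled graphs

# Query the expert on the M most disagreed-upon intervention states:
# top_states = states[disagreement_scores(policy, states).topk(M).indices]
```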

For environments with only a few state variables, like MountainCar, the arg min in Algorithm 1 is tractable. However, for the Atari environments, we use 30 or 50 state variables and hence a hypothesis space of size up to $2^{50} \approx 10^{15}$, making the explicit minimization infeasible. We address this by assuming some shared structure in the hypothesis space, just as we did when learning the graph-conditional policies. In particular, we assume that the loss of a given graph can be modelled with a linear equation $L(G) = \mathbb{E}_{(S,A)\sim\mathcal{T}}\, \ell(\pi_G(A|S), A) = \langle w, G\rangle + b$, where we again interpret $G$ as a binary vector and introduce parameters $w \in \mathbb{R}^N$, $b \in \mathbb{R}$. In effect, this amounts to the assumption that each variable contributes independently to the agreement of the model with the expert. From this linear model, we can also define a distribution over graphs in which graphs that achieve a low loss are more likely. Using a simple naive Bayes model, we get $p(G) \propto \exp(-\langle w, G\rangle)$. From this distribution, we can sample graphs, evaluate their losses and update the linear model. This is summarised in Algorithm 2.


Algorithm 2 Expert query intervention with linear model
Input: policy network $f_\phi$ s.t. $\pi_G(X) = f_\phi([X \odot G, G])$
Initialize $w = 0$, $\mathcal{D} = \emptyset$.
Collect states $\mathcal{S}$ by executing $\pi_{mix}$.
For each $S$ in $\mathcal{S}$, compute the disagreement score $D(S)$.
Select $\mathcal{S}' \subset \mathcal{S}$ with maximal $D(S)$.
Collect state-action pairs $\mathcal{T}$ by querying the expert on $\mathcal{S}'$.
for $i = 1 \dots M$ do
    Sample $G \sim p(G) \propto \exp(-\langle w, G\rangle)$.
    $L \leftarrow \mathbb{E}_{(S,A)\sim\mathcal{T}}\, \ell(\pi_G(A|S), A)$
    $\mathcal{D} \leftarrow \mathcal{D} \cup \{(G, L)\}$
    Fit $w$ on $\mathcal{D}$ with linear regression.
end for
Return: $\arg\max_G p(G)$
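A minimal sketch of the inner loop of Algorithm 2 might look as follows. The helper graph_loss(graph), standing in for evaluating $\ell$ of $\pi_G$ on the expert-labelled states, is hypothetical; note that with the factorised form $p(G) \propto \exp(-\langle w, G\rangle)$, each $G_i$ is an independent Bernoulli with success probability $\sigma(-w_i)$.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_graph(graph_loss, n_vars, n_samples=100, seed=0):
    """Fit a linear model of the per-graph loss and return the most likely graph."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_vars)
    graphs, losses = [], []
    for _ in range(n_samples):
        # Sample G ~ p(G) proportional to exp(-<w, G>): independent Bernoullis.
        g = (rng.random(n_vars) < sigmoid(-w)).astype(float)
        graphs.append(g)
        losses.append(graph_loss(g))
        # Refit the linear model L(G) = <w, G> + b by least squares.
        X = np.column_stack([np.array(graphs), np.ones(len(graphs))])
        coef, *_ = np.linalg.lstsq(X, np.array(losses), rcond=None)
        w = coef[:-1]
    # The mode of p(G) keeps exactly the variables whose inclusion lowers the
    # predicted loss, i.e. those with a negative weight.
    return (w < 0).astype(float)

The policy-execution variant below (Algorithm 3) has the same structure, with graph_loss replaced by the negated episode return of $\pi_G$.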

4.3.2 Intervention by Policy Execution

As an alternative strategy, we can rely on the reward function of the environment as a proxy for the agreement of a causal hypothesis with the expert, assuming the expert behaves optimally. Again, we use a linear model, now to model the quality of the hypotheses. We learn the linear model by sampling a graph from the naive Bayes distribution and executing the corresponding policy for an entire episode. The strategy is summarised in Algorithm 3. It corresponds to a multi-armed bandit or one-step reinforcement learning problem, in which the action space is the space of causal hypotheses and the state space is trivial. Alternatively, we could have chosen a different action per time step, making it a multi-step RL problem, but we found this difficult to make work in practice.

Algorithm 3 Policy execution intervention
Input: policy network $f_\phi$ s.t. $\pi_G(X) = f_\phi([X \odot G, G])$
Initialize $w = 0$, $\mathcal{D} = \emptyset$.
for $i = 1 \dots M$ do
    Sample $G \sim p(G) \propto \exp\langle w, G\rangle$.
    Collect the episode return $R_G$ by executing $\pi_G$.
    $\mathcal{D} \leftarrow \mathcal{D} \cup \{(G, R_G)\}$
    Fit $w$ on $\mathcal{D}$ with linear regression.
end for
Return: $\arg\max_G p(G)$

We can make an immediate connection between this RL problem in hypothesis space and a one-time-step version of Soft Q-Learning [17] with a linear Q-function. Our linear model becomes the Q-function and the linear regression becomes the Q-function optimization. The resulting policy $p(G)$ optimises the reward obtained by executing the hypothesis in the environment plus an entropy bonus. Theorem 3 in [17] implies that if we were to apply Algorithm 3 with infinitely many samples, we would be guaranteed to converge to the optimal policy over graphs. Since we evaluate only a finite number of graphs, we unfortunately do not have this guarantee.
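To make this correspondence explicit, the one-step maximum-entropy objective over graphs and its optimal solution can be written as follows (a sketch of the standard soft Q-learning identities rather than anything specific to our setting; the temperature $\alpha$ is introduced here only for illustration and is implicitly 1 in Algorithm 3):

$$J(p) = \mathbb{E}_{G\sim p}\big[Q(G)\big] + \alpha H(p), \qquad Q(G) = \mathbb{E}[R_G] \approx \langle w, G\rangle + b$$

$$p^*(G) \propto \exp\big(Q(G)/\alpha\big) \propto \exp\big(\langle w, G\rangle/\alpha\big)$$

For $\alpha = 1$ this recovers the sampling distribution over graphs used in Algorithm 3.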

4.4 Representation Learning

In the previous discussion, we have assumed that the state $S$ consists of variables $X_i$ which can function as potential causes in the causal model of the expert we intend to infer. However, many environments with a rich observation space do not come equipped with such a disentangled representation of state. For example, the individual pixels in an Atari game do not constitute meaningful causes. Instead, the pixels are a representation of underlying concepts that can be causes. Before we can do inference on causal structures, we must hence infer a representation of state over whose variables we can reason causally. But how do we learn such a representation of meaningful causal variables from raw data, without supervision on what the representation should look like?

As far as we know, this is an unsolved, though extremely interesting, problem, as it is, in a sense, the problem of disentangled representation learning. An ideal solution would have two systems: one that maps the raw input into causal variables and one that infers a causal structure based on those variables. The learning process should include some form of feedback mechanism, so that the success of fitting a causal model steers the representation. After the completion of the research for this thesis, such a strategy was proposed by Bengio et al. [4] in a meta-learning regime. However, it uses a class of causal hypotheses containing only two alternatives and a class of representations containing only a single one-dimensional parameter. This suggests a practical and scalable solution is yet to be found.

In this thesis, we instead use a very simple approach, without feedback between learning the representation and inferring the causal structure. We simply train a Variational Auto-Encoder (VAE) [22] to encode the unstructured Atari pixel input into a latent representation. A VAE is a model which learns to generate samples from a data set by transforming samples from a prior distribution over latent variables into data samples using a deep "decoder" network. In doing so, it also learns a deep "encoder" network which transforms a data sample into a belief over the latent variables that could have generated that sample.

In our method, the individual dimensions of the latent code given by the encoder are interpreted as one-dimensional causal variables and used for the causal inference through intervention described above. In its common form, a VAE encourages these variables to be statistically independent, and upon visual inspection it generally infers reasonable representations.
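As an illustration, a minimal sketch of this step in PyTorch might look as follows; the network sizes and the frames tensor are placeholders chosen for illustration rather than the configuration used in the experiments. The mean of the encoder's posterior is taken as the vector of candidate causal variables $X_1, \dots, X_N$.

import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Encoder half of a VAE: maps an image to a Gaussian belief over latents."""
    def __init__(self, n_latents=30):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.mu = nn.LazyLinear(n_latents)       # posterior mean
        self.log_var = nn.LazyLinear(n_latents)  # posterior log-variance

    def forward(self, x):
        h = self.conv(x)
        return self.mu(h), self.log_var(h)

# frames: a (batch, 1, 84, 84) tensor of grayscale observations (illustrative shape).
# The posterior means serve as the candidate causal state variables.
encoder = ConvEncoder(n_latents=30)
frames = torch.zeros(8, 1, 84, 84)
x_vars, _ = encoder(frames)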

This approach is simple, but has two serious weaknesses. Firstly, the conventional VAE objective is invariant under a rotation of the latent space. This is unfortunate, because when a true causal representation is rotated, all the structure encoded in a causal graph is lost, as all variables become mixtures of causes and non-causes. The VAE objective can thus not disambiguate between the true causal representation and a useless one. The second problem is that independence of the causal variables is not actually desirable. Imagine our motivating brake light scenario, in which we disentangle the state into the brake light and the road view. This is the desired disentangled representation, but the variables are certainly not independent. In spite of these shortcomings, we luckily still find reasonable representations with this method, as shown in the next chapter.
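To see the first weakness more concretely, consider a rotation $R$ of the latent space (a sketch of the standard argument, assuming the usual factorised Gaussian prior and sufficiently flexible encoder and decoder networks). Because the standard Gaussian prior is rotation invariant, the rotated encoder and decoder

$$\tilde{q}(z|x) = q(R^{\top}z\,|\,x), \qquad \tilde{p}(x|z) = p(x\,|\,R^{\top}z)$$

attain exactly the same evidence lower bound as the original pair, while every rotated latent dimension becomes a mixture of all the original variables, destroying any causal interpretation of the individual coordinates.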

4.5 Variational Causal Discovery

Earlier in this chapter, we showed that inferring the full causal structure from passive observations is impossible in the tasks we consider. This does not, however, exclude the possibility of extracting some information about the causal structure passively, by forming a prior belief $p_{passive}(G)$ over the causal hypotheses, which may aid the subsequent causal inference through intervention. We would use this prior by replacing the linear regression in Algorithms 2 and 3 with maximum a posteriori estimation [5, Sec 1.2.5].

Inferring a belief over causal structures from passive observations is referred to as Bayesian causal discovery [18]. In its general form, it infers a belief over both the causal structure $G$ and the parameters $\theta$ of the functional causal model, using Bayes' rule. Assuming some prior $p_0(G, \theta)$, some demonstration data $\mathcal{D}$ and a likelihood $p(\mathcal{D}|G, \theta)$ of the data given structure $G$ and parameters $\theta$, we get the belief:

$$p(G, \theta|\mathcal{D}) \propto p(\mathcal{D}|G, \theta)\, p_0(G, \theta)$$

This is very simple to write down, but rather difficult to implement in practice. The parameter space is generally uncountable and high dimensional, and the space of graphs is countable but exponentially large, so that the normalization requires exponentially many intractable integrals.

A common solution to this intractability is variational inference, in which we parameterise a distribution $q_\psi(G, \theta)$ and learn parameters $\psi$ such that the distribution is close to the true posterior $p(G, \theta|\mathcal{D})$. To simplify the task, we make an important simplifying assumption: we model the parameters of the functional causal models using the single graph-conditional policy network introduced earlier, and do not maintain a Bayesian belief over these parameters, only a point estimate. This assumption is analogous to the earlier assumption that the causal hypothesis space is well represented by the graphical structure alone. Thus we learn a single set of parameters $\theta$ and a belief $q_\psi(G)$ over the causal hypotheses. For arbitrary causal structures, care needs to be taken to assign no probability mass to causal graphs that are not DAGs, but in our simple case, $q_\psi(G)$ is simply a distribution over $N$ Bernoulli random variables $G_i$.

There are three obvious candidates for modelling the belief over the Bernoulli variables representing the graph. The first is to use an independence assumption, reducing $q_\psi(G)$ to a product of $N$ Bernoulli distributions. However, we found that this is insufficient, as the distribution tends to assign most of its mass to the graph that treats all variables as causes. The second is to model all options as a categorical variable with $2^N$ categories. This is ideal for small $N$, i.e. MountainCar, but impractical for Hopper and the Atari games. For these cases, we choose the third option: modelling correlated distributions using an $M$-dimensional continuous latent variable $U$, such that

$$q_\psi(G) = \int \prod_{i=1}^{N} q_\psi(G_i|U)\, q(U)\, dU$$

where $q(U)$ is a standard Gaussian and $q_\psi(G_i = 1|U) = g_\psi(U)_i$, where $g_\psi: \mathbb{R}^M \to \mathbb{R}^N$ is a 2-layer MLP. In practice, we use $M = 10$. Such a latent variable model is, at least in theory, able to model complicated correlated distributions over the variables $G_i$. To learn the variational distribution, we use the evidence lower bound:

$$\arg\min_{q} D_{KL}\big(q_\psi(G, \theta)\,\|\,p(G, \theta|\mathcal{D})\big) = \arg\max_{\psi, \theta}\; \mathbb{E}_{(S,A)\sim\mathcal{D}}\, \mathbb{E}_{U\sim q(U),\, G\sim q_\psi(G|U)}\big[\log \pi(A|X, G, \theta_G) + \log p(G)\big] + H_q(G) \quad (4.2)$$

In this objective, we can note the following elements:

Likelihood $\pi(A|X, G, \theta_G) = \pi_G(A|S) = f_\phi([X \odot G, G])$ is the graph-conditional policy.

Entropy The entropy term of the KL divergence, $H_q$, acts as a regularizer to prevent the graph distribution from collapsing to the maximum a posteriori estimate. It is intractable to maximize this entropy directly, but a tractable variational lower bound can be formulated. Using the chain rule of entropies, we may write:

$$H_q(G) = H_q(G|U) - H_q(U|G) + H_q(U) = H_q(G|U) + I_q(U; G)$$

In this expression, $H_q(G|U)$ promotes diversity of graphs, while the mutual information $I_q(U; G)$ encourages correlation among the $\{G_i\}$. $I_q(U; G)$ can be bounded from below using the same variational bound as in InfoGAN [8], with a variational distribution $b_\eta$: $I_q(U; G) \geq \mathbb{E}_{U,G\sim q_\psi} \log b_\eta(U|G)$. Thus, during optimization, in lieu of the entropy, we maximize the following lower bound:

$$H_q(G) \geq \mathbb{E}_{U,G\sim q}\Big[-\sum_i \log q_\psi(G_i|U) + \log b_\eta(U|G)\Big]$$
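As a sketch, a single-sample Monte Carlo estimate of this lower bound could be computed as follows, assuming $b_\eta$ is modelled as a diagonal Gaussian whose mean and log-variance are produced by a small network b_net (a hypothetical helper name):

import math
import torch

def entropy_lower_bound(g, g_probs, u, b_net):
    """Estimate H_q(G) >= E[-sum_i log q(G_i|U) + log b_eta(U|G)] from one sample."""
    # -sum_i log q_psi(G_i | U): the conditional-entropy term (Bernoulli log-likelihood).
    log_q = g * torch.log(g_probs + 1e-8) + (1 - g) * torch.log(1 - g_probs + 1e-8)
    cond_entropy_term = -log_q.sum()

    # log b_eta(U | G): the InfoGAN-style mutual-information term (diagonal Gaussian).
    mean, log_var = b_net(g)
    log_b = -0.5 * (((u - mean) ** 2) / log_var.exp() + log_var + math.log(2 * math.pi)).sum()

    return cond_entropy_term + log_b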

Prior We choose to set the prior $p(G)$ over graph structures to prefer graphs with fewer causes for the action $A$; it is thus a sparsity prior:

$$p(G) \propto \exp\Big(-\sum_i G_i\Big)$$

Figure 4.4: Variational Causal Discovery

Optimization Note that $G$ is a discrete variable, so we cannot use the reparameterization trick [22]. Instead, we use the Gumbel-Softmax trick [21, 29] to compute gradients for training $q_\psi(G_i|U)$. Note that this does not affect $f_\phi$, which can be trained with standard backpropagation.
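A minimal sketch of this sampling step in PyTorch might look as follows; the network sizes, the two-layer MLP g_net and the policy head are placeholders standing in for the components described above, and torch.nn.functional.gumbel_softmax provides the relaxed discrete sampling.

import torch
import torch.nn as nn
import torch.nn.functional as F

M, N = 10, 30                        # latent dimension and number of state variables
g_net = nn.Sequential(               # 2-layer MLP: U -> Bernoulli logits for each G_i
    nn.Linear(M, 64), nn.ReLU(), nn.Linear(64, N)
)
policy_net = nn.Sequential(          # graph-conditional policy f_phi([X * G, G])
    nn.Linear(2 * N, 64), nn.ReLU(), nn.Linear(64, 4)  # 4 actions, illustrative
)

def sample_graph(batch_size, tau=1.0):
    """Sample (relaxed) binary graphs G ~ q_psi(G|U) with U ~ N(0, I), differentiably."""
    u = torch.randn(batch_size, M)
    logits = g_net(u)                                               # (batch, N)
    two_class = torch.stack([logits, torch.zeros_like(logits)], dim=-1)
    return F.gumbel_softmax(two_class, tau=tau, hard=True)[..., 0]  # (batch, N) in {0, 1}

def policy_logits(x, g):
    """Forward pass of the graph-conditional policy on the masked input [X * G, G]."""
    return policy_net(torch.cat([x * g, g], dim=-1))

# Example: one sampled graph per state in a batch of 8 (illustrative shapes).
x = torch.randn(8, N)
action_logits = policy_logits(x, sample_graph(batch_size=8))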

4.5.1 Alternative interpretation

The loss of Eq. 4.2 is easily interpretable independently of the formalism of variational Bayesian causal discovery. A mixture of predictors, all sharing the network $f_\phi$, is jointly trained, each paying attention to a different sparse subset (identified by $G$) of the inputs. This is related to variational dropout [23]. Once this model is trained, $q_\psi(G)$ represents the hypothesis distribution over graphs, and $\pi_G(x) = f_\phi([x \odot G, G])$ represents the imitation policy corresponding to a graph $G$. Fig 4.4 shows the architecture.


Chapter 5

Experiments

Having defined the main problem of causal confusion we intend to solve, some solution strategies and the five environments, we can now evaluate the success of these solutions. Recall that in the confounded settings of these environments, the imitator obtained significantly worse rewards compared to the original setting. In our experiments, we seek to answer the following questions:

1. Do both intervention methods enable an imitator in the confounded setting to obtain rewards similar to the original setting?

2. How does this depend on the number of interventions?

3. Does our approach indeed find the true causal graph of the expert?

4. Are disentangled representations of state necessary for the intervention methods to succeed?

5. Does the variational causal discovery (disc-intervention) perform better than using a uniform prior for the intervention phase (unif-intervention)?

Baselines On the confounded setting, we compare our method to three baselines. The first is dropout, which, instead of inferring the causal graph by intervention, trains the graph-conditional policy and then selects the all-ones causal graph, corresponding to assigning all variables to be causes. This amounts to dropout regularization and was proposed by Bansal et al. [2]. The second baseline is DAgger [41], which we compare in query efficiency to intervention by expert querying. Lastly, we compare to Generative Adversarial Imitation Learning (GAIL) [19], an alternative to behavioural cloning which executes the imitator in the environment and adversarially tries to match the imitator's state distribution directly to the demonstration state distribution.

Intervention by policy execution In figure 5.1, we show the rewards per episode versus the number of policy executions for the MountainCar and Hopper environments.
