Hierarchical Learning of Time-Sequences in Deep Neural Networks


MSc Thesis Physics

by

Koen Groenland

University of Amsterdam

July 2015


Abstract

This work investigates how neural networks can interpret and act based on hierarchical time-sequences. Deeply stacked auto-encoders have proven to be highly successful in encoding arbitrary inputs. We propose the ShiftNet architecture, a variation on a convolutional auto-encoder which is efficient for on-the-fly encoding of time-sequences. By creating copies of deep activations, many convolution operations can be avoided, causing ShiftNet's complexity to scale as O(n) rather than O(n²) in the number of layers n. Moreover, we test whether a combination of deep auto-encoders and reinforcement-learning perceptrons can implement hierarchical reinforcement learning without requiring any prior knowledge about the environment. Unfortunately, these networks are not found to perform better in transfer-learning and partial-observability tests than flat networks with a suitable memory.

Supervisor: Dr. Sander Bohte

Centrum Wiskunde en Informatica

Examiners: Dr. Greg Stephens (Vrije Universiteit Amsterdam), Prof. Dr. Bernard Nienhuis (Universiteit van Amsterdam)

Author: Koen Groenland

Student ID: 6039375
Programme: MSc Physics
Track: Theoretical Physics
Institute: University of Amsterdam
Version: First version, 20 July 2015

This document is best printed in full colour, or read online at www.goo.gl/IbNgtx.

The cover image depicts an impression of a neural network. However, it also symbolizes hierarchical structure: The complicated internals form a simple sphere, which can be used as a building block for many new structures. Adaptation of an image taken from http://www.greenspine.ca/.

CONTENTS

1 introduction
2 reinforcement learning
  2.1 Definitions
    2.1.1 Temporal Difference Learning
    2.1.2 Learning Algorithms
    2.1.3 Exploration
    2.1.4 Eligibility Traces
    2.1.5 Complexity of RL problems
  2.2 Hierarchical Learning and Transfer Learning
3 feedforward neural networks
  3.1 Training Neural Networks
    3.1.1 Hyper-Parameters
    3.1.2 Backward Propagation of Errors
  3.2 Auto-Encoders
    3.2.1 Stacking Auto-Encoders into Deep Networks
    3.2.2 Common Extensions to Auto-Encoders
    3.2.3 Convolution in Auto-Encoders
  3.3 Reinforcement Learning in Neural Networks
4 shiftnet
  4.1 A closer look at Temporal Encoding
  4.2 Introducing ShiftNet: A minimalistic architecture
    4.2.1 Computational Advantages of ShiftNet
    4.2.2 Disadvantages of ShiftNet
  4.3 Testing ShiftNet on Reconstruction and Classification Tasks
    4.3.1 About the Datasets
    4.3.2 Reconstruction Methodology
    4.3.3 Reconstruction Results
    4.3.4 Classification Methodology
    4.3.5 Classification Results
  4.4 Conclusion on Temporal Encoders
5 finding hierarchical actions with auto-encoders
  5.1 The Hierarchical Action Decoder
  5.2 Practical Problems of Decoding Hierarchical Actions
    5.2.1 Theoretical Problems
    5.2.2 Empirical Problems
6 reinforcement learning at multiple scales
  6.1 A Multi-Scale Learning Architecture
  6.2 Overcoming Partial Observability
    6.2.1 The Choreography Environment
    6.2.2 Results with Partial Observability
  6.3 Transfer Learning
    6.3.1 The Patternswitcher Environment
    6.3.2 The Switchstate Environment
  6.4 Conclusion on Multi-Scale Learning
7 discussion and conclusion
acknowledgements
list of terminology
bibliography
a shiftnet python code
  a.1 ShiftNetLayer.py

1 INTRODUCTION

The human brain has the remarkable capacity to learn an extremely large range of tasks. Surely it wasn't the process of evolution that specifically prepared us to learn complex activities such as driving cars, operating computers and playing chess. Somehow, the design of our brains allows us to understand and solve any reasonable problem, even problems that we ourselves, or even our ancestors, may never have come across.

For many years, people have tried to build such intelligence into automatic machines, such as chess-playing computers [8] and soccer-playing robots [30]. The main difference from our brains is that these implementations are designed for merely one task: they were hand-engineered using our very own prior knowledge about how these games work. To create something as versatile as the brain, one needs general intelligence, which is able to solve any reasonable problem without being tweaked for any task in particular.

Recently, great progress has been made on artificial neural networks in the form of deep learning [19]. These deep neural networks are specifically designed to interpret high-dimensional inputs such as images, video, and sound. These architectures manage to narrow the gap between human and machine performance on tasks where humans had previously always outperformed computers, such as recognizing a person in a photo. There are many indications that these artificial networks work very similarly to the brain [37]. As a next step after understanding the processing of inputs, we may wonder how the brain learns actual tasks, e.g. which actions to perform in order to maintain the body with food, shelter, social status or money, or to accomplish certain sub-goals, such as making a cup of coffee.

This thesis explores the possibilities of creating general intelligence with the help of deep learning. Most notably, we want to find out how an artificial brain can deal not only with static information such as photos, but with a constantly changing environment and with dynamical tasks, forcing it to handle complex time-evolving information. Because increasingly complex and high-dimensional tasks become exponentially more difficult to solve [16], it makes sense to look for learning algorithms that decompose tasks into simpler problems. Such hierarchical learning is widely applied in our society, for example when we teach our children first to read and write single letters or short words before expecting them to formulate grammatically complex sentences. There are also many indications that our brain applies these methods to learn hard tasks [29].

One particularly interesting feature of an artificial network is for it to be biologically plausible, meaning that it would be possible for a biological brain to implement a similar construction. This would mean that insights about artificial networks may lead to more neuroscientific knowledge. Conversely, it makes sense to build artificial networks based on biological findings, considering the empirical success of the biological brain. The method used to train neural networks in this thesis, called backpropagation, was recently shown to be biologically plausible [21].

The central question in this thesis is: “How can neural networks learn to interpret and act based on hierarchical time-sequence input?”. We are particularly interested in how neural dimensionality reducers can deal with time-sequence information efficiently, and in how an agent can apply hierarchical learning by re-using routines of actions.

This thesis is organized as follows. Chapter 2 introduces Reinforcement Learning, a mathematical framework to describe learning by trial-and-error. Chapter 3 introduces artificial neural networks and deep learning. Our main results about deep networks dealing with time-series are found in chapter 4, whilst chapter 5 and chapter 6 describe our findings on hierarchical reinforcement learning using deep networks. Throughout this thesis, various boxes can be found, which feature fun-to-read facts and other easy-to-digest material; they are optional to the reader.

2 REINFORCEMENT LEARNING

In Reinforcement Learning (RL), an agent attempts to learn to maximize some reward function by interacting with its environment [41]. It is a widely used mathematical framework for modeling learning by trial-and-error, and there are many indications that the methods described in this chapter accurately model the way the brain learns [28]. Interestingly, by defining an appropriate reward function, it also allows self-learning computers or robots to execute specific tasks. A full treatment of all aspects of RL is far beyond the scope of this thesis, so this chapter focuses only on concepts relevant to the later chapters. It starts with the mathematical formulation of reinforcement learning and its learning algorithms, and then elaborates on how RL can scale to large problems using hierarchical methods.

2.1 Definitions

Let S be a set of unique states the environment can be in, and let A be a set of actions that the agent can choose to perform. Performing some action a ∈ A whilst in state s ∈ S may cause the environment to transition to some other state s' with some chance P(s'|s, a) = T(s, a, s') according to some transition rule T, and may also cause the agent to receive some reward r = R(s, a, s') ∈ ℝ based on the reward function R. The agent's goal or task is to maximize the reward it receives. To accomplish this, it will have to find the optimal policy, π* : S × A → [0, 1], which gives the optimal distribution of probabilities¹ to pick some action a in state s. In mathematics, such a task is often called a Markov Decision Problem (MDP). However, in many realistic situations the agent may not be able to observe the full state with perfect accuracy, and it will then have to deal with a Partially Observable Markov Decision Problem (POMDP). Some examples of reinforcement learning include:

maze tasks In a maze task, the agent tries to reach a specified location in a structure of corridors, at which it will receive a positive reward. A typical set of actions could be movements in directions {North, East, South, West}. Environments where the agent can make discrete steps in space are also called gridworlds. To make the task harder, one might tell the agent not its exact position in the maze, but only the observation of the walls surrounding its current position. A state s might then look like [0, 0, 0, 1] at the top-left location indicated by the red arrow.

Figure 2.1: A simple maze task. Image: wikipedia.org

pole balancing In a pole balancing task, the agent controls a cart with a pole hinged at the top. The pole will tilt and drop due to gravity, and the agent's task is to keep the pole from falling by exerting forces on the cart. A typical set of actions for a cart in one dimension is {Left, Right}, and a typical state would be [Cart velocity, Pole angle, Pole angular velocity]. Harder versions include carts with multiple poles, poles with multiple hinges, and limited space for the cart to move.

Figure 2.2: A pole balancing task. Image: Ref. [41]

1 One might wonder why the optimal policy is stochastic, since many tasks have fixed optimal actions. However, consider playing a game like poker against an opponent, where unpredictability is important, or exploring uncharted territory, which in some cases requires rare steps in a random walk.

Throughout this thesis, we will assume time is discrete: a new time step starts after choosing an action a_t, leading to a new observation s_{t+1} and a reward r_{t+1} given for the recently performed action.

We can estimate the utility or value of performing some action a_t in state s_t by calculating the predicted future reward. However, for infinitely long tasks, this expectation value may blow up to infinity. Therefore, we define the discounted cumulative reward G after time step t as

$$ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{2.1} $$

where γ ∈ [0, 1] is a discount factor which determines the exponential decay of the importance of rewards further in the future. Discounting makes a lot of sense in realistic tasks, which may have a finite duration or may change over time. The expectation E[·] of the discounted cumulative reward is called the Q-value:

$$ Q(s_t, a_t) = \mathbb{E}\left[ G_t \mid s = s_t,\ a = a_t \right]. \tag{2.2} $$

Calculating all Q-values to reasonable accuracy is often sufficient to find the optimal policy, which is then to select the action with the highest Q-value in the current state: a_t = argmax_a Q(s_t, a). An example algorithm that achieves this stores the Q-value function as a table over discrete states and actions, where Q(s, a) is estimated by averaging all experienced discounted cumulative rewards after choosing action a in state s. A collection of algorithms that implements this idea in an efficient way is called temporal difference learning.
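As an illustration of this tabular averaging approach, the following minimal sketch (our own code, not from the thesis; a discrete environment with hashable states and actions is assumed) estimates Q(s, a) by averaging the observed discounted returns:

from collections import defaultdict

gamma = 0.9  # discount factor

def discounted_return(rewards):
    # G_t = sum_k gamma^k * r_{t+k+1}, computed from the rewards observed after time t
    return sum(gamma ** k * r for k, r in enumerate(rewards))

returns_sum = defaultdict(float)    # running sums of returns per (state, action)
returns_count = defaultdict(int)

def update_q_table(episode):
    # episode: list of (state, action, reward) triplets in time order
    for t, (s, a, _) in enumerate(episode):
        G = discounted_return([r for (_, _, r) in episode[t:]])
        returns_sum[(s, a)] += G
        returns_count[(s, a)] += 1

def q(s, a):
    n = returns_count[(s, a)]
    return returns_sum[(s, a)] / n if n > 0 else 0.0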

2.1.1 Temporal Difference Learning

There are many ways to solve RL problems, but most methods scale unfavourably with the number of states and actions. Temporal Difference (TD) methods approximate the optimal policy by learning on-the-fly, starting with a random walk, but later sampling experience only from relevant states and actions, in order to approximate only the relevant Q-values. When moving from state s_t to state s_{t+1}, we cannot accurately estimate Q(s_t, a_t) yet through formula 2.2, because we only know r_{t+1}. However, we may already know the value of Q(s_{t+1}, a_{t+1}), which is exactly the estimate of future reward that we cannot calculate yet. We can thus exploit the temporal difference between the Q-values of these two time steps, which can be derived as follows:

$$ Q(s, a) = \mathbb{E}\left[ G_t \mid s_t = s,\ a_t = a \right] = \mathbb{E}\left[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \;\middle|\; s_t = s,\ a_t = a \right]. \tag{2.3} $$

In the second step, we took r_{t+1} out of the summation and relabeled k. We can now describe the expectation value at the next time step t+1 more explicitly: we have to average over all possible states s' that may occur after selecting action a (described by transition probability P(s'|s, a)) and over the value of the action a' that would be chosen in the next state. The chance that a' is selected equals π(s', a'), such that we can write:

$$ Q(s, a) = \sum_{s'} P(s_{t+1} = s' \mid s_t = s, a_t = a) \sum_{a'} \pi^*(s', a') \, \mathbb{E}\left[ R(s, a, s') + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \;\middle|\; s_{t+1} = s',\ a_{t+1} = a' \right] $$
$$ = \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi^*(s', a') \left[ R(s, a, s') + \gamma Q(s', a') \right]. \tag{2.4} $$

Formula 2.4 is the Bellman equation, which shows explicitly how we can write Q(s, a) in terms of the possibly succeeding Q-values Q(s', a'). An approximation of the Q-values can then be found by following and continually tuning some approximate policy π to sample experiences in the form of {s_t, a_t, r_{t+1}} triplets. The approximate Q-values q(s, a) can then be continually updated. Using the potentially badly estimated or initialized q's to learn more about Q(s, a) may seem questionable, but this bootstrapping can be proven to converge for any MDP with a finite number of states, provided that, over all time, all actions are experienced infinitely often in all states [10, 40].

2.1.2 Learning Algorithms

Notice that in formula 2.4, the probabilities P(s'|s, a) and π(s', a') denote exactly the probabilities that the agent visits state s' and uses action a' if it follows the current policy π. We will store a running approximation q_t(s, a) of a Q-value Q(s, a), which we update every time we visit state s and use action a. If at time t the prediction q_t(s, a) is found to be off by a margin δ_t, we can adjust q_t(s, a) by shifting it towards the desired result with a fractional step α:

$$ q_{t+1}(s_t, a_t) = q_t(s_t, a_t) + \alpha \delta_t. \tag{2.5} $$

We often refer to α ∈ [0, 1] as the learning rate. This learning by online updating is efficient, because the agent samples experiences mainly in the states and actions that are visited with a high probability. Therefore, the most often consulted entries q(s, a) will be approximated highly accurately, whilst little time is wasted in finding the Q-values of rarely visited states. The most popular online TD algorithms include:

sarsa In the Sarsa algorithm [41], the Q-values are estimated based on which actions are chosen by the agent, making this an on-policy algorithm:

$$ q_{t+1}(s_t, a_t) = q_t(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \right]. \tag{2.6} $$

Sarsa is named after the order in which states, actions and rewards appear before an update takes place: in order to find q(s_{t+1}, a_{t+1}), the agent first experiences s_t, a_t, r_{t+1}, s_{t+1}, and then chooses some action a_{t+1}. Because the agent may sometimes choose sub-optimal actions a_{t+1}, the Sarsa Q-values do not represent the ‘ideal’ Q-values, but rather the ‘realistic’ reward prediction based on the policy π.
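As a minimal illustration of formula 2.6 (our own sketch, assuming tabular states and actions stored in a Python dictionary; not code from the thesis):

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # q: dict mapping (state, action) -> current estimate q(s, a)
    td_error = r + gamma * q.get((s_next, a_next), 0.0) - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return td_error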

q-learning If one wants the Q-values to represent the ‘ideal’ utility of some state, one can use the Q-learning update rule [46]

$$ q_{t+1}(s_t, a_t) = q_t(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a q(s_{t+1}, a) - q(s_t, a_t) \right] \tag{2.7} $$

in which case the maximal Q-value in state s_{t+1} is used for the update (which is not necessarily the one selected by the policy, making the algorithm off-policy). The difference from Sarsa is fairly minimal, but it is not hard to think of examples where the difference would be apparent. For example, assume an agent can choose between passing a cliff using a narrow bridge, where it might fall off due to ‘accidental’ random actions, or using a wider bridge, which gives a penalty that is not quite as bad as falling off the first bridge. Sarsa might ‘play it safe’ and penalize the Q-values for the exploratory actions that cause falling, whilst Q-learning would select the ‘ideal’ pathway of crossing the narrow bridge.
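Compared with the Sarsa sketch above, the only change for Q-learning (formula 2.7) is the target: the maximum over the next actions replaces the action that was actually chosen. A sketch, reusing the hypothetical q dictionary and assuming a known action set:

def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy target: the best estimated value in the next state, regardless of the action taken
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return td_error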

advantage learning When working with very small time steps, the relative effect of each action becomes very small, causing the Q-values to shift increasingly close to each other, up to the point where we can't distinguish which value is the best. To remedy this, the Advantage Learning algorithm [13] increases the distance of every Q-value to the maximal Q-value by a factor k in so-called advantages A:

$$ A_{t+1}(s_t, a_t) = A_t(s_t, a_t) + \alpha \left[ \frac{1}{k} \left( r_{t+1} + \gamma \max_a A_t(s_{t+1}, a) \right) + \left( 1 - \frac{1}{k} \right) \max_a A_t(s_t, a) - A_t(s_t, a_t) \right]. \tag{2.8} $$

We can readily see that Advantage Learning reduces to Q-learning if k = 1. If we increase k, the term max_a A(s_t, a) becomes increasingly important, causing all advantages to shift towards the maximal advantage. However, by choosing k < 1 (yet still positive), the actual effect of the action, r_{t+1} + γ max_a A(s_{t+1}, a), is emphasized. This way, the maximal advantage max_a A(s, a) converges to the same scalar as the ideal Q-value for every state, but non-maximal advantages lie k times further from the maximal value. When working with RL in continuous time, one should take the limit Δt → 0 whilst keeping Δt/k constant.

Dangers of intelligent machines

A dream of AI is to build an agent so intelligent that it is able to take over any task that humans now have to perform. Most notably, if AI reaches the point where machines can do AI research even better than humans, we reach a technological singularity, a sudden boost in research and technology that we humans cannot comprehend. This scenario may lead to great benefit for society, but giving away control of our lives to machines that we don't understand may also come with substantial risks.

Earlier this year, the Future of Life Institute, an organization working to mitigate risks from runaway AI, wrote an open letter² to urge AI researchers to focus on keeping AI systems beneficial for humanity. This letter was signed by many prominent scientists, including Stephen Hawking and Elon Musk, RL researchers Tom Dietterich and Andrew Barto, and neural network specialists Yann LeCun, Geoffrey Hinton and Yoshua Bengio. Musk, who surely must be well-informed about the current state of technology as founder and CEO of Tesla Motors and SpaceX, is especially concerned with the threats of intelligent machines. During the 2014 AeroAstro symposium at MIT³, he stated “I think we should be very careful about artificial intelligence. If I had to guess at what our biggest existential threat is, it's probably that”. Needless to say, the author took various safety precautions when working with self-learning computer programs, such as not giving them any actions that might be non-beneficial to mankind.

2.1.3 Exploration

Recall that Q-values need to be sampled infinitely often in order for the proof of convergence to hold. This limitation makes a lot of intuitive sense: an agent should at least try every action in every state a substantial number of times before it can reasonably estimate all Q-values. Therefore, we often require a policy to not only exploit available knowledge, but also explore suboptimal states and actions in a balanced way. Frequently used exploration policies are [41]:

• ε-greedy: At every time step, the agent chooses a random action with probability ε ∈ [0, 1], and the optimal (‘greedy’) action otherwise.

• Softmax: The probability of choosing an action is given by

$$ P(a) = \frac{\exp(q(a)/T)}{\sum_{a'} \exp(q(a')/T)} \qquad \text{(Softmax function)} $$

in which T can be interpreted as a temperature which increases the ‘randomness’ of the agent; the sum runs over all actions to guarantee normalization. This equation is similar to the Boltzmann distribution when we interpret the expected reward as a kind of (negative) energy. A small sketch of both ε-greedy and softmax action selection follows after this list.

• Optimistic initialization: By setting all Q-values too high at the beginning of a task, all actions will be executed very often until they are found to be sub-optimal. This causes the agent to explore frequently in the early stages, and only exploit when the policy has converged.
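As an illustration (our own sketch with illustrative names, not code from the thesis), ε-greedy and softmax action selection over a vector of Q-values could look like:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: 1D array with one Q-value per available action in the current state
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))    # explore
    return int(np.argmax(q_values))                # exploit

def softmax_action(q_values, temperature=1.0):
    # Boltzmann exploration: a higher temperature gives more random behaviour
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                           # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(q_values), p=probs))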

2 The open letter can be read and signed by anyone on http://futureoflife.org/AI/open_letter.
3 The appearance of Elon Musk can be viewed online at http://webcast.amps.ms.mit.edu/fall2014/AeroAstro/

Advanced RL methods may decrease the exploration chances over time, in order to quickly find the optimal policy and later maximize the reward. In this thesis however, we will look at constantly learning systems like the brain, which use continuous exploration.

2.1.4 Eligibility Traces

Eligibility traces are a method to improve the learning speed of TD algorithms by updating Q-values not only after experiencing a single reward step, but also after every subsequent reward step. For example, in the Sarsa algorithm, the Q-value Q(s_t, a_t) is updated at every time step to be closer to ‘the best estimate we can make for G_t after 1 time step’,

$$ G_t^1 = r_{t+1} + \gamma q(s_{t+1}, a_{t+1}). \tag{2.9} $$

We could however also update it towards the value G_t^2 = r_{t+1} + γ r_{t+2} + γ² q(s_{t+2}, a_{t+2}), or even G_t^n = Σ_{i=1}^n γ^{i−1} r_{t+i} + γ^n q(s_{t+n}, a_{t+n}) for any n. Using information from further in the future makes the update more accurate, especially if other Q-values are still inaccurate. However, this also leaves the Q-value un-updated for a longer time, possibly penalizing other updates that rely on that value. Eligibility traces find a middle ground by updating Q-values based on a weighted average over all possible future updates, that is, using all targets G_t^n for all n.

To avoid having to store a whole history of states and actions, we equip all Q-values with an eligibility trace Z_t(s, a), which is updated every time step as follows [41]:

$$ Z_t(s, a) = \gamma \lambda Z_{t-1}(s, a) + \delta_{s, s_t} \tag{2.10} $$

where δ_{s,s_t}, the delta with two indices, denotes the Kronecker delta. The update rule in formula 2.5 can then be written as

$$ q_{t+1}(s, a) = q_t(s, a) + \alpha \delta_t Z_t(s, a) \tag{2.11} $$

where δ_t = G_t^1 − q_t(s, a) is the same temporal difference error as in formula 2.5. These formulas prescribe that the trace is increased by one whenever the corresponding state-action pair is visited, and decays exponentially through the parameters γ and λ ∈ [0, 1]. The Q-values are then updated with the most up-to-date prediction errors even after many time steps, although the size of the update decays exponentially. For example, figure 2.3 displays an agent which finds a high-reward state-action combination through a random walk. In this case, and in many other situations, the traces turn out to be a useful extension which helps increase the learning speed.
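A hedged sketch of a Sarsa(λ) update with accumulating eligibility traces (our own code; q and z are dictionaries over (state, action) pairs):

def sarsa_lambda_update(q, z, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.9, lam=0.9):
    delta = r + gamma * q.get((s_next, a_next), 0.0) - q.get((s, a), 0.0)
    z[(s, a)] = z.get((s, a), 0.0) + 1.0           # bump the trace of the visited pair
    for key in list(z.keys()):
        q[key] = q.get(key, 0.0) + alpha * delta * z[key]   # formula 2.11
        z[key] *= gamma * lam                               # exponential decay, formula 2.10
    return delta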

[Figure 2.3 shows three panels: ‘Path taken’, ‘Action values increased by one-step Sarsa’, and ‘Action values increased by Sarsa(λ) with λ = 0.9’.]

Figure 2.3: An example of how eligibility traces improve the learning speed in a simple gridworld, where a reward is received at the state indicated by the star (∗). The size of the arrows denotes the magnitude of the Q-value for taking the action in the direction of the arrow. Using one-step updates (λ = 0), only the action that brings the agent to the starred location gets an update, as can be seen in the middle image. Using eligibility traces, however, all previous actions also receive an exponentially decaying update, as indicated in the right image. Figure taken from Ref. [41].

In principle, all TD algorithms can be extended with traces, leading to a family of algorithms called TD(λ). Most notably, we will use Sarsa(λ) throughout this thesis. One can readily check from equations 2.10 and 2.11 that by setting λ = 0, the TD(λ) algorithm reverts to the standard one-step returns based on G_t^1. On the other hand, in the limit λ → 1, the trace hardly decays, and every Q-value q_t(s, a) that is ever used will keep receiving updates. The downside is that such long-lasting updates are less suitable for the tasks that we will use in this thesis, and therefore λ will have an optimal value somewhere between 0 and 1, which has to be found empirically.

2.1.5 Complexity of RL problems

Finding the correct Q-values in a large reinforcement learning problem is generally considered computationally expensive. A ‘worst case scenario’ for the number of steps an agent must take before finding a goal state (or even a state that gives any reward) is O(|A|^|S|): in every state, up to |A| different actions can be performed, and potentially only a small number of state-action combinations gives the desired result. Ref. [16] reports that, using reasonable assumptions and an appropriate representation of the states and actions, the complexity can be brought down to O(|S||A|) or O(|S|³). It must be noted that the number of states still increases exponentially with the dimensionality of the environment. Thus, if the state has many parameters, solving the RL problem can still take a very long time. In artificial intelligence, this is known as the curse of dimensionality, and it is generally an important bottleneck in RL. Therefore, a great deal of effort will be put into reducing the dimensionality of the state and action spaces throughout this thesis.

2.2 Hierarchical Learning and Transfer Learning

Imagine that the brain uses reinforcement learning to decide which actions to perform in which situations. It receives a huge number of inputs from all the nerves that lead from the many organs throughout the body, and it can activate an incredible number of nerves that cause movements of the plethora of muscles. The total number of states equals every single possible combination of the many inputs the brain may receive, and the ‘correct’ actions require accurate coordination of the various muscles. Decisions even have to be made over several orders of magnitude of time-scales, ranging from quick reflexes to catch a ball out of the air, to long-term determination about where one wants to live or work.

For many realistic reinforcement learning tasks, where many inputs and outputs and large time-scales are involved, the standard RL methods scale too unfavourably with the number of parameters to be of any practical use. The brain seems to have found a solution, presumably by applying a form of Hierarchical Reinforcement Learning (HRL) [7]. Throughout the stages of child development, we learn a set of basic skills such as grasping, walking, opening doors, putting things down, etc. When faced with more complicated tasks, such as making coffee or going to a supermarket, we may re-use these routines without having to learn them anew. The hierarchical arrangement of actions and routines is believed to make learning of realistic-scale problems feasible.

According to Ref. [11], the main advantages of HRL are:

• Scale-up: Many difficult tasks can only be efficiently learned by first learning smaller subtasks. In other words, tasks can be decomposed into overseeable chunks.

• Transfer: Previously learned routines can be re-used in other settings. For example, anyone who has learned to open a specific door might transfer this knowledge to open similar doors in completely different environments.

• Overcoming partial observability.

In artificial intelligence, hierarchical problem solving is done mainly through temporal abstraction, in which actions can be chosen over multiple time steps [2]. Most forms of temporal abstraction use some kind of macro-action, which can be a sequence of actions to be executed in order, or a sub-policy that is trained to execute a specific sub-task. Higher-order macro-actions can call lower macro-actions, which yields exactly the desired hierarchical structure.

Following Barto and Mahadevan's review on HRL [2], there are three commonly used approaches. The first is the Hierarchy of Abstract Machines (HAM), in which a set of machines is initialized at the start of the problem. Based on their internal configuration, the machines may choose to alter their internals, call on other machines, pass control back to the caller, or select actual actions. Through reinforcement learning, only the policy that determines the altering of internals can be changed; all other policies are fixed. Therefore, the programmer must use their own knowledge to initialize the machines appropriately. The second approach to HRL is the MaxQ value decomposition [11], which uses prior knowledge to start a problem with a set of untrained subpolicies.

The subpolicies end whenever they fulfill their designated subgoal, after which credit is assigned based on the reward collected on the way, and the Q-value (expectation of future reward) of the higher-level policy. This way, the agent becomes highly proficient in solving subgoals assigned by the programmer, but it can never find its own subgoals.

The last and most relevant method for this thesis is the options framework [42]. Here, an option is defined as a sub-policy, which replaces the elementary actions available to the agent. Note that every elementary action can still be a special case of an option. Every option consists of a policy π : S × A → [0, 1], an input set I ⊆ S and termination conditions β : S → [0, 1]. The input set determines in which states the option is available, and the termination condition assigns the chance that the option will terminate. Sometimes the word option is used to mean a fixed sequence of actions, as will often be done throughout this thesis.
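As an illustration of the option concept (our own illustrative naming, not code from the thesis), an option could be represented as:

from dataclasses import dataclass
from typing import Callable, Optional, Set

@dataclass
class Option:
    policy: Callable[[int, int], float]    # policy(s, a): probability of picking action a in state s
    input_set: Optional[Set[int]]          # states in which the option may start (None = all states)
    termination: Callable[[int], float]    # termination(s): chance that the option ends in state s

def primitive_option(action):
    # An elementary action as a special case of an option: available everywhere,
    # it always picks the same action and terminates after one step.
    return Option(
        policy=lambda s, a: 1.0 if a == action else 0.0,
        input_set=None,
        termination=lambda s: 1.0,
    )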

By giving the agent the right options, it will learn significantly faster. For example, when solving the maze task in figure 2.4, combining the elementary movement steps with options to move to the next hallway was found to greatly improve the learning speed⁴.

Figure 2.4: A gridworld task used in Ref. [42] to show the effectiveness of options. The agent's goal is to move to a goal location G_i. Whenever the agent is inside a room, it is given the option to navigate to both of the room's narrow hallways. It may come as no surprise that adding these options makes learning significantly faster.

The downside of these options (and of many other hierarchical learning implementations) is that they require extensive prior knowledge about the task. The options in Sutton's maze task will only work in this specific gridworld environment, where the world is made up of rooms with corridors at these specific locations. Notice that simply giving an agent every possible option is not a viable solution, because the number of options scales exponentially with the number of time steps the options take. In order to create fast-learning general intelligence, we want to find structure in action-sequences without resorting to prior knowledge about the environment. To the best of our knowledge, there are no elegant models with all these properties, but we will make an attempt to achieve exactly this in chapters 5 and 6.

4 The author does not unambiguously support these kinds of HRL solutions, which build macro-actions specifically for reaching a specific part of state-space. For example, a human would not learn how to exit a room by continual trial-and-error, but rather by planning. A more logical application for these options would be in cases where a similar routine needs to be performed multiple times, such as directing one's body towards the door, combined with the action of walking forward.

3 FEEDFORWARD NEURAL NETWORKS

Artificial neural networks are multi-dimensional function approximators that are able to approximate any non-linear function when given the appropriate parameters. A network consists of layers of neurons h with connecting weights W in between them. Every neuron has an activation represented by one scalar parameter. Thus, one layer of n_h neurons can represent ℝ^{n_h}.

Figure 3.1: (a) A biological model of a neuron, which transports signals received at the dendrites, through the cell body, to the axon terminals, where a synapse transports the signal further to the next neuron. (b) A simple 2-layer network. Images taken from wikimedia.org.

From a biological viewpoint¹, as in figure 3.1a, every neuron receives various input signals at its dendrites, which are summed in the cell body and transported to the axons. At the end of every axon, an inter-neuronal synapse transports the signal further to the dendrite of a next neuron. The efficiency of a synapse is plastic, and can be adapted based on environmental circumstances, allowing the network to learn. The dendrites also have a (plastic) threshold which determines how easily they are activated, which is called their bias.

From a mathematical viewpoint, as in figure 3.1b, we describe a neural network as a series of n layers h^0, h^1, h^2, ..., h^n interconnected by weight matrices W^0, W^1, ..., W^{n−1}. We will often denote the first layer by x = h^0 and the last layer by y = h^n to indicate that these are the input or output of the network. All layers in between the input and output layer are called hidden layers, whose neurons are called hidden units. When the activation of layer k is given by h^k, the activations of the subsequent layer are given by

$$ a_j^{k+1} = \sum_i W_{ji}^k h_i^k + b_j^{k+1} \tag{3.1} $$
$$ h_j^{k+1} = \sigma(a_j^{k+1}) \tag{3.2} $$

where W_{ji}^k is the matrix element connecting neuron h_i^k to h_j^{k+1}, b_j^{k+1} is the bias of neuron h_j^{k+1}, a_j^{k+1} is the sum of input signals received by that neuron, and σ is a differentiable activation function indicating how strongly a neuron responds to its input. Common activation functions are:

$$ \sigma(x) = \max(0, x) \qquad \text{(rectified linear unit, ReLU)} $$
$$ \sigma(x) = \tanh x \qquad \text{(hyperbolic tangent)} $$
$$ \sigma(x) = \frac{1}{1 + e^{-x}} \qquad \text{(logistic function)} $$

1 Technically, biological neurons use spikes rather than continuous signals. The simplified description given here is accurate to

The activation function is often omitted in the last layer h^n (i.e. the identity function is used).

All these indices can make derivations very confusing, and therefore we opt for a notation that is as clear and structured as possible. The upper indices indicate the neuronal layer, and the lower indices indicate neurons within a layer. We start counting the layers at 0, such that the number n of the output layer is exactly equal to the number of connection layers. We often want to use the much cleaner vector notation:

$$ a^{k+1} = W^k \cdot h^k + b^{k+1} \tag{3.3} $$
$$ h^{k+1} = \sigma(a^{k+1}) \tag{3.4} $$

where σ is applied element-wise.
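As a minimal illustration of formulas 3.3 and 3.4 (our own sketch, not the thesis' code; the weight matrices are stored so that weights[k] maps layer k to layer k+1), a forward pass could look like:

import numpy as np

def forward(x, weights, biases, sigma=np.tanh):
    # weights[k] has shape (n_k, n_{k+1}); biases[k] has shape (n_{k+1},)
    h = np.asarray(x, dtype=float)
    activations = [h]
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = h @ W + b                                   # formula 3.3
        h = a if k == len(weights) - 1 else sigma(a)    # formula 3.4; identity in the last layer
        activations.append(h)
    return activations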

3.1 Training Neural Networks

Neural networks are useful because their parameters θ = {W, b} can be adjusted such that the network's output y = N(x) approximates any other function ỹ = f(x) on a bounded domain, provided the network has at least n = 2 layers and an appropriate activation function σ. This makes them universal function approximators. Technically speaking, for networks with 2 layers, there exists some number of hidden neurons n_{h_1} such that

$$ |N(x) - f(x)| < \epsilon \tag{3.5} $$

for any ε larger than, but arbitrarily close to, zero [9]. Such a 2-layer network is also called a multi-layer perceptron.

We now know that there is a solution for the parameters θ to approximate a function f, but there are no guarantees that we can find this solution easily. An exhaustive search over all parameters in θ is generally too computationally expensive, because the number of possible configurations scales exponentially with the number of parameters |θ|. A better approach is stochastic gradient descent, in which the parameters are randomly initialized and then repeatedly adjusted such that they bring us closer to our goal. We can define our goal as minimizing some cost function E, such as |y − f(x)|². The network can then be given an appropriate selection of inputs x, after which the weights can be adjusted or updated (θ_{i,new} = θ_{i,old} + Δθ_i) according to

$$ \Delta\theta_i = -\alpha \frac{\partial E}{\partial \theta_i} \tag{3.6} $$

where α ∈ [0, 1] is a learning rate scaling the step-size of every adjustment. A sequence in which all data is passed through the network once is called an epoch. One might wonder whether gradient descent doesn't risk getting stuck in a local minimum. It turns out that the cost hyper-surface is filled with zero-gradient saddle points where many dimensions curve upwards, yet almost all of these have very similar values of the cost function [19]. Gradient descent will thus bring the network to a very low-cost state in nearly all practical situations.

3.1.1 Hyper-Parameters

A neural network has a large number of parameters θ which have to be adjusted to the data. However, when training neural networks, we also have to deal with some parameters which have to be set in advance and which are not influenced by the data, which we shall call hyper-parameters [4]. Examples include the learning rate α, but also the number of training iterations and the number of neurons one chooses in a network. Hyper-parameters can have catastrophic effects on the training of neural nets if chosen inappropriately, forcing the experimenter to try various values. For example, larger values of the learning rate α are advantageous because they cause the parameters θ to converge to their ideal values more quickly. However, if the learning rate is chosen too large, divergence might occur, as depicted in figure 3.2.

Figure 3.2: The cost surface E as a function of the parameters θ (here θ_1 and θ_2) close to a minimum. By taking small steps in the direction of the negative gradient, the cost is reduced (green arrows). If, however, the learning rate α is too large, the adjustments might overshoot, and the network could diverge (red dashes). Modification of an image from wikimedia.org.

3.1.2 Backward Propagation of Errors

When calculating the derivative in formula 3.6, most standard computational differentiation methods scale quadratically with the number of weights in the network. This leads to very slow algorithms, as the number of weights is typically very large. Backward propagation of errors, often abbreviated as backpropagation or backprop, is a much more efficient algorithm to train artificial neural networks [33], scaling only linearly with the number of parameters. The aim of this subsection is to derive the backpropagation formulas that can be used in practical computer simulations.

Our goal is to calculate the gradient ∇_θ E of the error function E with respect to the network parameters θ = {W, b}. We will first show an insightful example with only one layer and explicit formulae, and then derive a general formula for an arbitrary network.

a simple example Assume we have a one-layer network with input x and output y, where y is given by

$$ y_j = f(a_j), \qquad a_j = \sum_i W_{ji} x_i + b_j. $$

We take f to be the logistic function. Moreover, assume the error is simply the sum of squares of the errors per output neuron with respect to some target t:

$$ E = \frac{1}{2} \sum_j (y_j - t_j)^2. $$

We can then apply the chain rule to find:

$$ \frac{\partial E}{\partial W_{ji}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial a_j} \frac{\partial a_j}{\partial W_{ji}} = (y_j - t_j) \, \frac{\partial f(a_j)}{\partial a_j} \, x_i. $$

The logistic function happens to be easily differentiable: ∂f(a_j)/∂a_j = f(a_j)(1 − f(a_j)). Thanks to the chain rule, each of these factors can be computed independently from the other factors. We may even compactify the formula by defining the ‘error due to a_j’, which we shall denote by δ_j:

$$ \delta_j \equiv \frac{\partial E}{\partial a_j} = (y_j - t_j) f(a_j)(1 - f(a_j)) \tag{3.7} $$

such that the gradient becomes

$$ \frac{\partial E}{\partial W_{ji}} = \delta_j x_i. $$

We can thus assign an error measure to every neuron layer, such that a weight's gradient is simply the product of its input's activation and its output's error.

general neural networks In the previous example, we used a trick to describe the error δ_j^k due to neuron a_j^k in layer k, which we'll exploit for the general case. For any weight in a general, arbitrarily deep network with an arbitrary activation function, the gradient in weight W_{ji}^{k−1} in layer k−1 is given by

$$ \frac{\partial E}{\partial W_{ji}^{k-1}} = \frac{\partial E}{\partial a_j^k} \frac{\partial a_j^k}{\partial W_{ji}^{k-1}} = \delta_j^k h_i^{k-1}. \tag{3.8} $$

Since h_i^{k−1} is already known at the time of backpropagation, we are left with finding all δ_j^k in the network. The δ's for the last layer n are easy to find:

$$ \delta_j^n = \frac{\partial E}{\partial a_j^n} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial a_j^n} = \frac{\partial E}{\partial y_j} \sigma'(a_j^n) \tag{3.9} $$

where the prime (′) denotes a function's derivative with respect to its argument, which we can often simplify as we did in formula 3.7. If the error function is simply the sum of squared errors, E = ½ Σ_j (y_j − t_j)², the deltas become δ_j^n = (y_j − t_j) σ'(a_j^n).

Now that we have found the error in the last layer, we use induction to find the deltas of successively earlier layers. Assume we know δ_j^{k+1} for all j, and we want to find δ_i^k. With formulas 3.1 and 3.2 in mind, we find

$$ \delta_i^k = \frac{\partial E}{\partial a_i^k} = \sum_j \frac{\partial E}{\partial a_j^{k+1}} \frac{\partial a_j^{k+1}}{\partial a_i^k} \tag{3.10} $$
$$ = \sum_j \delta_j^{k+1} \frac{\partial a_j^{k+1}}{\partial h_i^k} \frac{\partial h_i^k}{\partial a_i^k} \tag{3.11} $$
$$ = \sum_j \delta_j^{k+1} W_{ji}^k \, \sigma'(a_i^k). \tag{3.12} $$

The gradients in the biases b are now also easy to calculate, again with some help from formula 3.1:

$$ \frac{\partial E}{\partial b_j^k} = \frac{\partial E}{\partial a_j^k} \frac{\partial a_j^k}{\partial b_j^k} = \delta_j^k \cdot 1. \tag{3.13} $$

We can thus calculate the gradients of all parameters simply by starting with the errors in the last layer, and ‘back-propagating’ these errors through the network, layer by layer. The formulas can even be written in a compact matrix form, which is easier on the eye and more efficient in computer simulations:

$$ \delta^n = \sigma'(a^n) \circ \nabla_y E \tag{3.14} $$
$$ \delta^k = \left( \delta^{k+1} \cdot (W^k)^T \right) \circ \sigma'(a^k) \tag{3.15} $$
$$ \frac{\partial E}{\partial W^k} = \delta^{k+1} \times h^k \tag{3.16} $$
$$ \frac{\partial E}{\partial b^k} = \delta^k \tag{3.17} $$

where ◦ denotes element-wise multiplication, × denotes the outer product between vectors, and W^T denotes the matrix transpose of W.
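To make formulas 3.14-3.17 concrete, here is a minimal numpy sketch (our own illustrative code, assuming tanh activations in every layer and a squared-error cost; for an identity output layer one would drop the σ' factor in the first delta):

import numpy as np

def backprop(activations, pre_activations, weights, target):
    # activations: [h^0, ..., h^n] from a forward pass; pre_activations: [a^1, ..., a^n]
    # weights[k] has shape (n_k, n_{k+1}); returns the gradients dE/dW^k and dE/db^k
    sigma_prime = lambda a: 1.0 - np.tanh(a) ** 2        # derivative of tanh
    delta = (activations[-1] - target) * sigma_prime(pre_activations[-1])   # formula 3.14
    grads_W, grads_b = [], []
    for k in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(activations[k], delta))                  # formula 3.16
        grads_b.insert(0, delta)                                            # formula 3.17
        if k > 0:
            delta = (delta @ weights[k].T) * sigma_prime(pre_activations[k - 1])   # formula 3.15
    return grads_W, grads_b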

backpropagation efficiency What we have achieved with formulas 3.9 and 3.12 is that we can efficiently calculate the errors (or gradients) in all neurons of the network, simply by ‘propagating errors backwards through the network’. When these errors are known, the gradients in the parameters are easily found using formulas 3.8 and 3.13. One might wonder why this methodology is particularly faster than other ways to calculate the gradients, since all we have done is rewrite partial derivatives. To illustrate this, we consider the naive alternative of finite differences, where we approximate the derivative as follows:

$$ \frac{\partial E}{\partial \theta_m} = \frac{E(\theta_m + \epsilon) - E(\theta_m)}{\epsilon} \tag{3.18} $$

where m is some arbitrary numbering of all parameters θ = {b_j^k, W_{ji}^k}, and ε is a very small scalar. Using this method in a computer program is very simple, as one surely already has a function to calculate the cost E given the weights. However, upon calculating the gradient for all weights, one will loop over the number of parameters |θ| to find all gradients, and within every loop, every element of θ will be needed again (twice) to make a forward propagation to calculate the individual costs E. The method of finite differences thus has computational complexity O(|θ|²), because calculation times scale with the square of the number of parameters. Not so with the method of backpropagation, in which the calculation of the gradients happens in just one forward pass, followed by just one backward pass of the errors. Notice that the calculations performed in 3.15 are basically the same as those in a forward propagation, but this time using transposed weight matrices. Because the parameters θ are only used a few times in the calculation of the gradient, and never in a loop-within-a-loop, the computational complexity is of order O(|θ|). This is very relevant when dealing with large networks, because the number of parameters easily becomes very large: if one wants to make a full connection between two layers of 1000 neurons, the weight matrix will already have one million entries. Surely we don't want to deal with the square of this number!
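As a side note, the finite-difference formula 3.18 remains useful as a sanity check of a backpropagation implementation. A minimal sketch (our own code, with a hypothetical cost function that maps a flat parameter vector to a scalar):

import numpy as np

def finite_difference_gradient(cost, theta, eps=1e-5):
    # One cost evaluation per parameter, each of which is itself a full forward pass,
    # hence the O(|theta|^2) scaling discussed above.
    grad = np.zeros_like(theta)
    base = cost(theta)
    for m in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[m] += eps
        grad[m] = (cost(theta_plus) - base) / eps
    return grad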

3.2 Auto-Encoders

One specifically useful application of neural networks is the Auto-Encoder (AE) [36], which is a network that can encode information into a different representation. These networks currently hold state-of-the-art performance in a wide range of fields such as image and speech recognition, object classification, prediction or extrapolation of data, and even the interpretation of particle accelerator data [19]. Whilst there are other stacked neural network implementations, such as Restricted Boltzmann Machines, this thesis will focus on feedforward neural networks as defined earlier in this chapter. An auto-encoder is basically a 2-layer network with an output layer of the same size as the input layer, which is trained to reconstruct the given input in its output layer. We will denote the layers by x, h and y. The act of forward propagating the input x to the hidden neurons is called encoding:

$$ h = \sigma(W \cdot x + b) \tag{3.19} $$

depending on plastic weights W and b. The activation of the hidden units is also called the code or latent representation. The second layer performs so-called decoding:

$$ y = W' \cdot h + b' \tag{3.20} $$

depending on plastic weights W' and b'. Sometimes, W' is chosen to be the transpose of W, which is called tied weights.
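As an illustration of formulas 3.19 and 3.20 (our own sketch with tied weights, a logistic encoder and a squared reconstruction error; not the thesis' code), one gradient step on a single input could look like:

import numpy as np

def autoencoder_step(x, W, b, b_prime, alpha=0.01):
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    h = sigma(W @ x + b)            # encoding, formula 3.19
    y = W.T @ h + b_prime           # decoding with tied weights W' = W^T, formula 3.20
    delta_y = y - x                               # dE/dy for E = 0.5 * |y - x|^2
    delta_h = (W @ delta_y) * h * (1.0 - h)       # error backpropagated through the decoder
    grad_W = np.outer(delta_h, x) + np.outer(h, delta_y)   # tied weights: both layers contribute
    W -= alpha * grad_W
    b -= alpha * delta_h
    b_prime -= alpha * delta_y
    return W, b, b_prime, float(0.5 * np.sum(delta_y ** 2))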


Modelling neural networks with Theano

Theano is a Python library specialized in optimizing neural network-like calculations [6]. In this package, rather than defining functions to calculate forward and backward propagation through the network, the user builds a symbolic graph stating exactly which calculations have to be performed upon forward propagation. Once the graph is completed, the user can compile functions on this graph, such as getting the feedforward output, the reconstruction error, or stored neural activations. Because Theano is aware of the whole network, it is able to perform symbolic differentiation, such that it can automatically find the gradient in the weights efficiently. This makes it easier for the user to experiment with exotic network architectures, and guarantees that the backpropagation algorithm, which is notoriously prone to errors, works correctly.

For example, we can define a Theano graph as follows:

import numpy as np
import theano

# Vector, Scalar, Tensor etc. variables are flexible inputs.
x = theano.tensor.vector()

# Shared variables will have their values stored.
W = theano.shared(value=np.ones(shape=(2, 3)))

# We can define new nodes using built-in or special Theano functions.
y = theano.tensor.dot(x, W)

# To get a value for y, we have to compile a new function.
get_output = theano.function(inputs=[x], outputs=y)

print(get_output([1, 2]))  # results in [1+2, 1+2, 1+2] = [3, 3, 3]

To find the gradient of the cost with respect to the weights, we can perform symbolic differentiation in matrix form:

target = theano.tensor.vector()
cost = ((y - target) ** 2).sum()
gradient = theano.tensor.grad(cost=cost, wrt=W)

# Updates are stored in a tuple with (original value, updated value).
update = (W, W - 0.1 * gradient)
do_update = theano.function(inputs=[x, target], updates=[update])
do_update([1, 2], [1, 2, 3])  # changes the values stored in W

The command theano.function() optimizes the whole calculation, making sure it is performed as efficiently as possible in machine language. Moreover, it is possible to compile these functions for execution on CUDA-compatible GPUs, which have proven to be extremely fast at large linear-algebraic operations.

Auto-encoders are typically used for dimensionality reduction by choosing n_h < n_x. This way, the hidden layer is forced to store all relevant information about the input in fewer parameters, enforcing a kind of data compression, or a non-linear version of principal component analysis. If the inputs x given to the network were completely random, it would be impossible to temporarily store all information about x in a lower-dimensional vector h, because x requires an n_x-dimensional basis. However, most datasets consist of many parameters that form an overcomplete basis. An appropriately trained auto-encoder can exploit this by mapping the data points to a more compact, lower-dimensional representation which still holds most of the relevant information. For example, figure 3.3 shows data points living in a high-dimensional space, for which an auto-encoder could find a manifold that represents the data almost as accurately. The hidden neurons are often called features, because their activation will depend on the presence of specific, frequently observed combinations of parameters of the input.

What makes auto-encoders particularly useful is that they compress data without any prior knowledge about the data. They apply so-called unsupervised learning, which means that they are trained without any specific goal or target in mind. In artificial intelligence, it is often easy to gather large amounts of data, whilst it requires expensive manual labour to label or interpret the data [4].

Figure 3.3: Often, data is represented in an over-complete basis. In this figure, the dots represent data points in a three-dimensional space, whilst the grey spiral forms a two-dimensional manifold that lies very close to all data points. Auto-encoders excel in finding these manifolds. Image source: [32]

3.2.1 Stacking Auto-Encoders into Deep Networks

The true power of auto-encoders lies in their ability to be stacked into a deep network. This is a surprising result, considering that simple 2-layer networks are able to learn any arbitrary function. The gain in efficiency stems from computational complexity [3]. Computational depth denotes the maximal number of basic computational elements that need to be applied to the input of a function in order to calculate the function's output. For example, f(x) = tanh(a·x + b) using the computational elements {tanh, ·, +} has a depth of 3. However, if the basic operation set consists of neurons with tanh activation (see formulas 3.3 and 3.4), the depth is merely 1. The computational depth of a neural network is thus equal to the number of layers in the network. Now, an important result from complexity theory states that approximating a function of depth k can be done efficiently with an approximator of depth k, whilst approximators of depth k−1 may need a total number of paths exponential in the input size [3]. Thus, shallow networks may need exponentially many more hidden neurons than deep networks in order to make a similar function approximation. This is not only a problem of computational power, but also one of statistics: if the number of hidden neurons increases, the network needs to be trained on more training examples as well, which may not be available.

layer-wise training A typical problem one runs into when training deep networks is that the backpropagation gradient becomes increasingly small after passing through many layers. Recently, it was realized that networks can be trained layer-by-layer [5,14]. Stacked auto-encoders or deep neural networks (DNN) work as follows (figure 3.4): first, an auto-encoder is trained on the training data to map input x of dimension n_x to some code h of dimension n_h. Then, a second auto-encoder interprets the previous encoder's code h as its input x, in order to compress this data even further. This way, an arbitrary number of layers can find features at an arbitrary number of abstraction levels. When deep networks are used for a specific task (such as classification), the layer-wise training of auto-encoders is called pre-training, which trains the network to detect task-independent features. After this, fine-tuning can be applied by back-propagating task-specific errors through the network as a whole [4].
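A hedged sketch of this greedy layer-wise scheme (our own code, reusing the hypothetical autoencoder_step helper from above; epochs, batching and other training details are heavily simplified):

import numpy as np

def pretrain_stack(data, layer_sizes, epochs=10, alpha=0.01):
    # data: array of shape (n_samples, n_input); layer_sizes: e.g. [n_input, 200, 50]
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    weights, biases = [], []
    codes = np.asarray(data, dtype=float)
    for n_in, n_h in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * np.random.randn(n_h, n_in)
        b, b_prime = np.zeros(n_h), np.zeros(n_in)
        for _ in range(epochs):
            for x in codes:
                W, b, b_prime, _ = autoencoder_step(x, W, b, b_prime, alpha)
        weights.append(W)
        biases.append(b)
        codes = sigma(codes @ W.T + b)   # the codes become the inputs of the next auto-encoder
    return weights, biases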

Figure 3.4: A stack of auto-encoders can be trained layer-wise. (a) The hidden neurons of each auto-encoder may be used as input for the next encoder, allowing structures of arbitrary depth. (b) When used for a specific task, pre-training can be combined with fine-tuning, through which the whole encoder is optimized for a specific task, such as the classification depicted here.

3.2.2 Common Extensions to Auto-Encoders

With auto-encoders having grown increasingly popular over the past few years, many extensions to the standard auto-encoder have been proposed. We try to summarize the most important additions:

• Smarter descent on the error surface: Training can be greatly slowed down if some inputs of the AE feature larger variation than others. For example, imagine the cost hypersurface of figure 3.2 being more strongly dependent on one of the parameters θ_i, such that it looks like an oval. When applying gradient descent, the update steps will cause the parameters to oscillate strongly around the long axis, but barely move towards the minimum. A popular solution is to give the weight updates some inertia, as if they behave like a ball rolling down a hypersurface [39]:

$$ \theta_{t+1} = \theta_t + v_{t+1} $$
$$ v_{t+1} = \mu v_t - \alpha \nabla_\theta E(\theta) $$

where µ ∈ [0, 1] determines the velocity decay (similar to friction) and α is the learning rate. Other popular remedies for slow learning at small gradients include Hessian-Free Learning [22], which keeps track of second-order derivatives as in Newton's method, and Nesterov's Accelerated Gradient [26], which modifies the regular momentum's velocity update to ‘look ahead’ for more accurate updates:

$$ v_{t+1} = \mu v_t - \alpha \nabla_\theta E(\theta + \mu v_t). $$

Although the convergence of many of these algorithms can only be mathematically proven for non-stochastic gradient descent (in which updates are based on the full dataset), they are easily modified to work for batch-based stochastic descent [39]. A small sketch of the momentum update, combined with L2 weight decay, follows after this list.

• Regularization: A typical problem for neural networks is over-fitting, for example when an auto-encoder gets better at reconstructing data it has been trained on (training data) than similar data it hasn't seen yet (test or validation data). Regularization methods help prevent this. Two popular methods in deep learning are L2 and L1 regularization, which add a term proportional to Σ(W_ij)² or Σ|W_ij| to the cost function, respectively [27]. This way, the weight update ΔW ∝ −∂E/∂W causes the weights to decay either exponentially or linearly. Heuristically, by favouring smaller weights, no small change in the input can lead to a greatly different encoding or reconstruction in the AE, making sure that the AE generalizes well to new data².

2 A nice graphical explanation of why L2 regularization works is given on the following webpage: http://

• Corruption: Another way to avoid over-fitting is by slightly changing or corrupting the inputs (or even the hidden activations) of an AE. One way to do this is by using dropouts, inputs randomly set to zero, as used in Denoising Auto-Encoders [44]. Similarly to L1 and L2 regularization, this improves the AE's robustness against small changes in the inputs.

• Channel-Out / Winner Takes All: In 2013, two research groups independently introduced the idea of making neighbouring neurons compete to keep their activation values. Both methods are mathematically the same: within every pre-defined group of k neurons, only the maximally activated neuron is allowed to maintain its activation, and all others have their activation set to 0. Ref. [38] argues that biological neurons feature inhibitory connections to neighbouring neurons, leading to this local winner-takes-all (WTA) behaviour. This method would prevent catastrophic forgetting of acquired knowledge when the network is trained on new data, because weights connected to zeroed neurons do not receive any error gradient. The second proposition, Ref. [45], proposes the exact same mechanism under the name Channel-Out as an alternative activation function. This way, the error backpropagation involves only linear matrix operations, which is more efficient to compute than non-linear functions such as tanh(). Both methods were shown to perform well in benchmark tests.
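As promised above, a minimal sketch of the momentum update combined with L2 weight decay (our own illustrative code; theta is a flat numpy parameter vector and grad_E a hypothetical function returning ∇_θE):

import numpy as np

def momentum_step(theta, velocity, grad_E, alpha=0.01, mu=0.9, l2=1e-4):
    # L2 regularization adds l2 * theta to the gradient, so the weights decay towards zero
    grad = grad_E(theta) + l2 * theta
    velocity = mu * velocity - alpha * grad     # v_{t+1} = mu * v_t - alpha * gradient
    theta = theta + velocity                    # theta_{t+1} = theta_t + v_{t+1}
    return theta, velocity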

Degree, dimension, axes and order of a tensor

Many people would describe a matrix (a linear-algebraic object with 2 indices) as having dimension 2, and a tensor with 3 indices as having dimension 3. This is, however, confusing and wrong, since we already use the word dimension to indicate the dimensionality of a space. The dimensionality is simply the number of entries of a vector, or the size of a complete basis of a vector space. The correct word for higher-order objects is order or degree, and to indicate a specific index of a high-order object we use the word axis. Thus, a matrix that maps $\mathbb{R}^3 \to \mathbb{R}^2$ is an object of order 2, whose first axis has dimension 3 and whose second axis has dimension 2. Moreover, we use the word shape to mean the list of the dimensions of all the axes of an object. For example, the aforementioned matrix has shape (3, 2). A scalar is simply an object of degree 0 whose shape is an empty list.
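As a minimal illustration of this terminology in code (numpy's ndim corresponds to the order/degree used here, and shape to the list of axis dimensions; the mapping convention x @ M is our own choice):

```python
import numpy as np

M = np.zeros((3, 2))        # maps R^3 -> R^2 via x @ M, for a length-3 vector x
print(M.ndim)               # 2      -> the order (degree) of the object
print(M.shape)              # (3, 2) -> the dimensions of its two axes
s = np.float64(1.0)
print(s.ndim, s.shape)      # 0 ()   -> a scalar: order 0, empty shape
```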

3.2.3 Convolution in Auto-Encoders

Imagine one has a set of images (matrices of pixels) in which small features (correlations between pixels) can appear at various locations in the image. A conventional auto-encoder would be able to reduce the dimensionality of these images, and the hidden neurons would then be sensitive to features that span the whole image. However, when the same image is shifted by only a small margin, completely different hidden neurons would be activated. A smarter approach is the translation-invariant convolutional auto-encoder (CAE) [23], where hidden neurons are trained to recognize features anywhere in the image.

A convolutional auto-encoder exploits the spatial relationship between pixels by dividing an image into patches of n_x × n_y pixels. All of these patches are then encoded using the same weight set W into a vector of hidden neurons per patch. This way, the weights are trained to recognize the same small features anywhere in the image. Weights in convolution networks are often referred to as filters or kernels. Figure 3.5 gives a graphical explanation. The left panel shows a 5×5 input image where each pixel is represented by one scalar value. In practice, images generally consist of 3 channels (parameters per pixel) for the three elementary colours. The middle panel denotes a filter with convolution windows w of size 3 along both axes. The output of the convolution operation has the same 2D structure as the input, where every activation corresponds to the sum of all values in a small patch of the input multiplied by the weight matrix. In order to get the full output, the filter can be thought of as 'shifting' over the image, trying to find a feature at all possible locations.


Figure 3.5.: An example of convolution. In the left panel, the green square denotes a patch on an image. The values of the filter are given in the middle panel. The result of the convolution operation on this patch is displayed in the right-hand panel. Image taken from http://docs.gimp.org/en/plug-in-convmatrix.html.
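As a concrete numerical sketch of the per-patch operation illustrated in figure 3.5 (the values below are arbitrary, not the ones shown in the figure):

```python
import numpy as np

patch = np.array([[1, 0, 2],
                  [3, 1, 0],
                  [0, 2, 1]])        # a 3x3 patch cut out of the input image
kernel = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]])       # a 3x3 filter (the weight set W)

# One output activation: multiply the patch elementwise with the filter and sum
out = np.sum(patch * kernel)
print(out)   # -> 1 (before adding a bias and applying the non-linearity)
```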

In one dimension, the convolution operator (∗) for discrete steps in a simple auto-encoder works on the first axis of a matrix-shaped input $x_{a,i}$ as follows:

$$h_a = \sigma(W * x + b) = \sigma\Big(\sum_{\xi=1}^{w} W_{\xi} \cdot x_{a+\xi} + b\Big) \qquad (3.21)$$

where $W_{ji}$ consists of $w$ unique weight matrices along the axis of convolution.

More formally, and in arbitrary dimensions, let neuron layer $h^k_{\{a_m\},i}$ have $d$ spatially correlated axes and $c_k$ uncorrelated channels. The indices $\{a_m\} = a_1 \ldots a_d$ label all correlated axes, and $i$ labels the different channels in layer $k$ (thus $i$ runs from 1 to $c_k$). This is equivalent to having a d-dimensional image where each pixel is constructed by a basis of $c_k$ colours. The set of weights $W^k_{\{a_m\},j,i}$ has, for every correlated axis $a_m$, a corresponding axis of size $w_m$. This size is the convolution window for that axis. Moreover, the last two axes form a matrix as in formula 3.1, which maps the i-th channel neurons in layer $k$ to the j-th channel neurons in layer $k+1$. The bias $b^k_j$ is the same across all correlated axes but is unique per channel. The activation in the subsequent layer can then be found through

$$h^{k+1}_{a_1,\ldots,a_d,j} = \sigma\Big(\sum_{\xi_1=1}^{w_1} \cdots \sum_{\xi_d=1}^{w_d} \sum_{i} W^k_{\xi_1,\ldots,\xi_d,j,i}\, h^k_{a_1+\xi_1,\ldots,a_d+\xi_d,i} + b^k_j\Big). \qquad (3.22)$$

The exact behaviour at the edges is somewhat ambiguous, therefore we will describe the two most relevant approaches. In so-called valid convolution, the patches are chosen to fully overlap with the image, and do not cover any area outside of the image. If the convolution axis has length u and the kernel has a window w, the output will have a convolutional axis of size u−w+1 [23]. This is equivalent to having index a of the output run from 1 to u−w+1 in formula 3.22. Contrarily, in so-called full convolution, zeros are added around the image (known as zero padding), such that every possible patch that contains at least one pixel is used. The output will then have size u+w−1. Figure 3.6 graphically depicts the two approaches.
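A minimal one-dimensional, single-channel illustration of the two border modes using numpy is given below. Note that np.convolve flips the kernel (a true convolution) rather than computing the cross-correlation of formula 3.21, but the output sizes are the same.

```python
import numpy as np

u, w = 5, 3
x = np.arange(u, dtype=float)            # input axis of length u
W = np.array([0.5, 1.0, -0.5])           # kernel with window w

valid = np.convolve(x, W, mode='valid')  # length u - w + 1 = 3
full = np.convolve(x, W, mode='full')    # length u + w - 1 = 7 (implicit zero padding)
print(valid.shape, full.shape)           # (3,) (7,)
```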



Figure 3.6.: On the left side, the encoder applies valid convolution to the input x, after which the decoder performs full convolution to form the output y. The opposite operations are used on the right-hand side. We will sometimes refer to the left type of CAE as a 'valid CAE', whilst the right-hand type might be called a 'full CAE'. Notice that every black line can actually be a matrix operation, mapping some number of channels in its input to some number of channels in its output.

An auto-encoder attempts to recreate its input. For this purpose, the input size of any CAE must match its output size. This can only be achieved by mixing the two convolution operations: if the encoding operation uses valid convolution, the decoder must apply full convolution. We will sometimes abuse language and call a CAE that applies valid convolution in the encoder a 'valid CAE', and one that encodes using full convolution a 'full CAE'. These two types are depicted in figure 3.6 on the left- and right-hand side respectively.

Just like normal auto-encoders, CAEs can be stacked into an arbitrary number of layers. These networks were found to be particularly useful in image recognition [19, 25]. The advantage of the convolution is that the networks scale well with the size of the images, since the number of parameters in the weights remains invariant. When a deep CAE is trained on natural images, the hidden neurons in the various layers become responsive to features that are very reminiscent of neurons in the various visual layers of the human brain. For example, the first few, most shallow layers typically encode lines in Gabor-like patches, whilst deeper neurons are found to be responsive to high-level features such as human faces or bodies. Interestingly, a network trained using random frames from Youtube videos was found to have deep neurons particularly responsive to images of cats [18].

pooling Deep convolutional networks often use subsampling methods to enforce translational invariance and to reduce the size of the input, by grouping together the activations of patches of neighbouring neurons [35]. This is very similar to what complex cells in the human visual system's V1 appear to do. The most common operations are max-pooling and average-pooling:

$$h'_a = \max\big(h_{fa},\, h_{fa+1},\, \ldots,\, h_{fa+f-1}\big) \qquad \text{(Max-pooling)}$$
$$h'_a = \frac{1}{f}\big(h_{fa} + h_{fa+1} + \cdots + h_{fa+f-1}\big) \qquad \text{(Average-pooling)}$$

where f is a downsampling factor which determines the size of every patch along the downsampling axis. Thus, these pooling operations reduce the size of one axis by a factor f. Pooling can be performed sequentially on any number of axes, such as for the 2D image in figure 3.7. The operations preserve information about the presence of specific features, but lose information about their exact location. Practically all state-of-the-art image classifiers alternate convolution operations with max-pooling operations [19], continually reducing the size of the spatial dimensions of the data.


Figure 3.7.: The operations of average pooling and max-pooling explained using a concrete example. Image taken from http://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622.
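A minimal numpy sketch of both pooling operations along one axis, assuming the axis length is a multiple of the downsampling factor f (the helper names are ours):

```python
import numpy as np

def max_pool_1d(h, f):
    """Max-pooling: keep the largest activation in every patch of f neurons."""
    return h.reshape(-1, f).max(axis=1)

def avg_pool_1d(h, f):
    """Average-pooling: replace every patch of f neurons by its mean."""
    return h.reshape(-1, f).mean(axis=1)

h = np.array([1., 5., 2., 2., 0., 3., 4., 4.])
print(max_pool_1d(h, 2))   # [5. 2. 3. 4.]
print(avg_pool_1d(h, 2))   # [3.  2.  1.5 4. ]
```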

3.3 reinforcement learning in neural networks

With the neural network's ability to approximate any function, we can use it to estimate Q-values in a reinforcement learning problem. This is particularly useful when the state-space is continuous, because table-based methods, which store a Q-value for every unique state, cannot deal with an infinite number of states. Neural networks naturally handle continuous inputs and automatically perform interpolation when dealing with never-observed states. A neural network N should be initialized with an input size $n_x$ equal to the dimensionality of the state-space, and an output size $n_y$ equal to the number of actions $|A|$. Alternatively, if states are discrete, it could be given one input neuron per state, or any other encoding of the inputs. The network can then be trained to output estimated Q-values $y_{a_t}(s_t) = q_t(s_t, a_t)$ at all of its outputs by minimizing the error with respect to the target $G^1_t$ at every time step t.

t at every time step t. Because in RL we receive feedback based

on only one action and thus on only one Q-value at a time, the best we can do is performing backpropagation starting from only one output neuron. For example, using the SARSA(λ) learning update, we define the error Etat every time step as

Et= 1 2 yat(st) −G 1 t 2 = 1 2 qt(st, at) −γqt+1(st+1, at+1) −rt+1 2 = 1 2δ 2 t. (3.23)

Interestingly, the parameter $\delta_t$ happens to be both the backpropagation error due to the neurons in the last layer (section 3.1.2) and the temporal difference error from section 2.1.2, making the following derivation easy to generalize to any other TD algorithm. We can apply gradient descent to update the network parameters $\theta_i$ such that the prediction of $q_t(s_t, a_t)$ becomes more accurate:

$$\Delta\theta_i = -\alpha\frac{\partial E_t}{\partial \theta_i} = -\alpha\,\delta_t\,\frac{\partial \delta_t}{\partial \theta_i} = -\alpha\,\overbrace{\big(q_t(s_t, a_t) - \gamma\, q_{t+1}(s_{t+1}, a_{t+1}) - r_{t+1}\big)}^{\text{known at time } t+1}\;\underbrace{\frac{\partial q_t(s_t, a_t)}{\partial \theta_i}}_{\text{known at time } t}. \qquad (3.24)$$

In the last step, we used that $\frac{\partial \delta_t}{\partial \theta_i}$ does not depend on any terms other than $q_t(s_t, a_t)$, which is exactly the output neuron $y_{a_t}$ corresponding to the chosen action. This partial derivative can thus easily be found through backpropagation at time t. The actual update, however, has to wait until the term in the brackets is known, which is at time t+1. This makes the algorithm hard to implement, because backprop requires extensive knowledge about activation values at time t. To avoid having to copy the whole network for every time step, we can efficiently equip all weights with a trace $Z_{\theta_i,t}$ which stores exactly one parameter's gradient with respect to the output at each time step: $Z_{\theta_i,t} = \frac{\partial q_t(s_t, a_t)}{\partial \theta_i}$.

If we are keeping track of these traces anyway, we may just as well use them similarly to the eligibility traces defined in section 2.1.4. This works as follows [34]:

$$Z_{\theta_i,t} = \gamma\lambda\, Z_{\theta_i,t-1} + \frac{\partial q_t(s_t, a_t)}{\partial \theta_i} \qquad (3.25)$$
$$\Delta\theta_i = -\alpha\,\big(q_{t-1}(s_{t-1}, a_{t-1}) - \gamma\, q_t(s_t, a_t) - r_t\big)\, Z_{\theta_i,t-1}. \qquad (3.26)$$


Notice how we shifted the time t backwards by one step in formula 3.26, such that all information for this update is available at time t. Moreover, notice how the update of the trace $Z_{\theta_i,t}$ in formula 3.25 must be done after the weight update occurs, in order to be able to use the required $t-1$ version $Z_{\theta_i,t-1}$. In line with other work, we will define $\beta = (1 - \gamma\lambda)$ to describe the trace updates similarly to Ref. [31]³:

similarly to Ref. [31]3: ∆Zθi,t= −βZθi,t−1+ ∂qt(st, at) ∂θi (3.27) Zθi,t=Zθi,t−1+∆Zθi,t. (3.28)

By using the trace updates with β < 1 (or equivalently, λ > 0), the weights do not only receive feedback based on the immediate effect of their action, but also notice an exponentially decaying effect of future rewards and Q-values.
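To make the bookkeeping of formulas 3.25–3.28 explicit, below is a minimal numpy sketch that uses a linear Q-function q(s, a) = W[a]·s as a stand-in for the full network, so that the per-parameter gradients are simply the state features. The names (sarsa_lambda_step, env, pick_action) and all constants are illustrative assumptions, not part of the thesis implementation.

```python
import numpy as np

n_states, n_actions = 4, 3            # dimensionality of s and |A| (illustrative values)
alpha, gamma, lam = 0.1, 0.95, 0.8
beta = 1.0 - gamma * lam              # trace decay as in formula 3.27

W = np.zeros((n_actions, n_states))   # linear Q-function: q(s, a) = W[a] @ s
Z = np.zeros_like(W)                  # one trace per parameter (formula 3.25)

def q(s, a):
    return W[a] @ s

def sarsa_lambda_step(s_prev, a_prev, r, s, a):
    """One update of formulas 3.25-3.28; every quantity is available at time t."""
    global W, Z
    # TD error with the sign convention of formula 3.23, shifted back one step
    delta = q(s_prev, a_prev) - gamma * q(s, a) - r
    W -= alpha * delta * Z            # weight update with the *old* trace (formula 3.26)
    grad = np.zeros_like(W)
    grad[a] = s                       # d q_t(s_t, a_t) / d W for the linear Q-function
    Z = (1.0 - beta) * Z + grad       # trace update, done after the weights (3.27 & 3.28)

# Hypothetical usage inside an episode loop (env and pick_action are placeholders):
# s = env.reset(); a = pick_action(s)
# while not done:
#     s_next, r, done = env.step(a)
#     a_next = pick_action(s_next)
#     sarsa_lambda_step(s, a, r, s_next, a_next)
#     s, a = s_next, a_next
```

In the full network the row of state features in `grad` would be replaced by the gradient of the chosen output neuron with respect to every weight, obtained through backpropagation at time t, exactly as described above.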

3 In Ref. [31], the trace decay parameter is called α, but we choose a different symbol here because α is already used to denote the learning rate.
