
Reconstructing the Equations of Motion of Chaotic Time Series with Artificial Neural Networks

Master Project Theoretical Physics

University of Amsterdam

Rico van Midde

August 14, 2019


Rico van Midde

University of Amsterdam, 14 August 2019

Supervisor: Greg Stephens
Second reader: Efstratios Gavves


Abstract

As many real world dynamical systems show chaotic behaviour, well known mathematical chaotic models can help to provide a clean environment to explore machine learning methods to study these systems. In this thesis we use a feedforward neural network and a reservoir computing network, and propose a new approach: the decoupled reservoir network. These networks are used to learn the equations of motion from data of low dimensional chaotic dynamical systems. Specifically, the logistic map and the Lorenz system are studied. A successfully trained network is used to generate new imitative dynamics, which are used to make quantitative predictions and to analyze the ergodic properties of the system. The feedforward model successfully imitates the logistic map for a variable parameter r in terms of bifurcation and Lyapunov exponents. For quantitative predictions of the Lorenz system, the feedforward model and the decoupled reservoir network both show a remarkable improvement in comparison with the reservoir computing network. The long term behaviour of the Lorenz system in terms of Lyapunov exponents is accurately reconstructed with a feedforward network.


Acknowledgements

Thank you Greg, your guidance, your expertise and your enthusiasm were of great value during my research. Your fascination with biophysics and chaos has sparked my interest and gave me a new hobby! Thank you Stratis, for introducing me to deep learning and for the time you have spent on my project. I want to thank the group members for the fruitful discussion sessions, they have been very interesting. Antonio, thanks for the conversations and your help when I was stuck during my project. Laksmi, a special thanks to you, you have been very kind and you're the one that got me into this.

Athul, thank you for the conversations, the thoughts, the insights, for your endless interest and for your support; I could always turn to you. Kees, your programming skills have often helped me, and thank you for always being there for me. Dear dad, mom, Max and grandma, thank you for the past year and for so much more!


Contents

1 Introduction

2 Nonlinear Dynamics and Chaos
2.1 Historical Remarks
2.2 Dynamical systems
2.3 Definition of Chaos
2.4 Nonlinearity and Chaos
2.5 The Geometry of Dynamics
2.5.1 Eigendecomposition
2.5.2 Linear dynamical systems
2.5.3 Attractors and Strange Attractors
2.6 Lyapunov exponents
2.6.1 Jacobian linearization of a nonlinear system
2.6.2 Theoretical definition
2.6.3 Numerical Calculation: Discrete QR Method for Global Exponents
2.7 Lyapunov Time and the Time Horizon

3 Artificial Neural Networks: Feedforward Networks and Reservoir Computing
3.1 Feedforward neural networks
3.1.1 Computational Graphs
3.1.2 Back Propagation
3.2 Recurrent Neural Networks
3.2.1 Back Propagation Through Time
3.2.2 Vanishing Gradients
3.3 Reservoir Computing
3.3.1 Connectivity
3.3.2 Spectral Radius
3.3.3 Input Scaling
3.3.4 The Echo State Property
3.3.5 Linear Regression
3.3.6 Memory vs Nonlinearity
3.4 Problems with Reservoir Computing

4 Case Study 1: The Logistic Map
4.1 Introduction
4.2 Reconstruction of the Logistic Map
4.2.1 Defining the Model
4.2.2 Results: Bifurcation and the Lyapunov Spectrum
4.2.3 Results: Reconstruction of untrained r-values
4.3 Discussion

5 Case Study 2: The Lorenz System
5.1 Introduction
5.2 Quantitative Predictions
5.2.1 Reservoir Computing Network
5.2.2 Decoupled reservoir network
5.2.3 Feedforward Network
5.3 Reconstruction of the Lyapunov Spectrum
5.4 Discussion

Summary and Outlook


1 Introduction

Chaos theory plays a prominent role in our understanding of nature, from non-living physical dynamical systems such as the weather to the evolution of complex biophysical systems such as the population growth of species and the behaviour of organisms. These examples are typically nonlinear in nature, as the evolution of these systems emerges from an interplay between different components of the system. A chaotic system is deterministic, but characterized by aperiodic motion that is sensitive to small variations in the initial conditions. The latter means that nearby starting points diverge, on average, exponentially fast from each other [16]. The average separation rate between nearby points when a system evolves can be formalized with the Lyapunov exponents. An n-dimensional system has n Lyapunov exponents, one for every direction. A positive Lyapunov exponent means a positive separation rate while a negative Lyapunov exponent means a negative separation (or contraction) rate. When the separation in some direction is conserved this corresponds to a zero exponent.

The Lyapunov exponents are an important dynamical quantity because they are invariant to the initial conditions of the system and have proven to be very useful in analyzing complex dynamics [13]. A positive Lyapunov exponent is a common indicator for the presence of chaos in dynamics, but it also puts an upper limit on the time over which we can theoretically predict the state of the system in the future, known as the time horizon of the system. Other useful dynamical invariants are also related to the Lyapunov exponents, among others the rate of entropy production of the dynamics and the fractal dimension of the fractal-like manifold, called a strange attractor, on which chaotic dynamics evolve [13].

Two milestone toy models in chaos theory are the logistic map and the Lorenz system, which both show the minimum requirements for the emergence of chaos in, respectively, a discrete and a continuous time evolution. Both systems are low dimensional: the logistic map describes 1-dimensional chaos and the Lorenz system is a 3-dimensional system. The study of low dimensional chaotic systems in this thesis is inspired by the goal to learn equations of motion that approximate the evolution of a relatively simple organism called Caenorhabditis elegans. The C. elegans is a nematode with a length of roughly 1 millimeter, which is extensively studied and therefore known as a model organism in biology. Greg Stephens et al. [15] showed that the dimensionality of the posture of the C. elegans can be reduced to a low dimensional representation, called the eigenworm representation. The eigenworm representation provides a minimized test ground to explore the analysis of individual behaviour in organisms.

The aim of this thesis is to use machine learning to predict and analyze chaotic dynamical systems. In particular, artificial neural networks (ANNs) are trained in a supervised manner with observed data to approximate the evolution function that describes the observed dynamics. The equations of motion that arise can then be evolved to make short term, or quantitative, predictions about the system's state and to uncover invariants of the long term behaviour of the chaotic dynamics.

A recent study by Pathak et al. [12] shows that a specific type of ANN, named reservoir computing (RC), can be used to recover two out of three Lyapunov exponents of the Lorenz system and to make short term predictions of the system's state.¹ RC is a recurrent neural network (RNN) type of ANN that is widely used for time series forecasting. An RC consists of an untrained part, called the reservoir, and a trained linear readout. The literature states that a reservoir needs to preserve the echo state property and has a memory-nonlinearity trade-off that needs to be optimized for any given problem [1][5]. Because the reservoir is not trained, this optimization needs to be done by tuning the hyper parameters of the reservoir. Pathak et al. [12] show that the performance of the reservoir on the Lorenz system is sensitive to small changes in these hyper parameters.

A new approach is introduced to optimize this trade-off by combining two parallel operating reservoir computing networks; an older idea, that a decoupling of the hidden state of these networks could improve performance [6], is in line with the proposed model. The introduced model shows a significantly improved performance on the Lorenz system with respect to a vanilla reservoir. Also the flexibility of the hyper parameters greatly improves. However, we also show that a similar improved performance can be achieved with a relatively small feedforward network. This is explained by the fact that the equations that generate chaos in the Lorenz system do not necessarily need 'memory', but only the current state of the system, to compute its next state.

¹ Their study also develops a method to learn the dynamics of a higher dimensional chaotic system, called the Kuramoto-Sivashinsky system, which exhibits spatiotemporal relations. This method uses a linear approach, in the sense that it cuts the observed dynamical system into smaller subsystems of lower dimensionality, solves the individual parts and recombines the solutions together again. In this way, a high dimensional spatiotemporal dynamical system can be analyzed with similar techniques as a low dimensional system. However, for the scope of this thesis we will keep the focus on low dimensional systems.


2 Nonlinear Dynamics and Chaos

”Physicists like to think that all you have to do is say, these are the conditions, now what happens next? ”

-Richard Feynman

This section is meant to provide the reader with a historical and theoretical introduction to concepts in dynamical systems theory, nonlinear physics and chaos theory that are used in the rest of this thesis.

2.1 Historical Remarks

Differential calculus (Newton): The first formulations of dynamics date back to Newton and Leibniz in the 1600s, when they independently introduced differential calculus. Newton used differential equations to describe the motion of celestial bodies. In his work he also introduced the famous n-body problem, which is roughly formulated as: "Given only the present positions and velocities of a group of n celestial bodies and the laws of gravitation, predict their motions for all future time and deduce them for all past time" [3]. With his differential equations and Kepler's laws Newton was able to provide a quantitative solution for the two-body problem, meaning he could give an exact description of any future or past state of the system. Newton and many others have attempted to generalize the two-body problem to systems with n bodies, but without success. It became one of the biggest unsolved problems of physics [3]. The general thought was that the n-body problem was solvable in a quantitative manner.

Qualitative methods and a glimpse of chaos (Poincaré): The first to give new insight on this was Poincaré in the late 1800s. Poincaré shifted the way of thinking by asking different questions regarding the problem. Instead of looking for a quantitative solution, he rather used a geometrical approach to answer qualitative questions about the evolution. This meant he used the phase space² of the dynamical system to treat questions about the final destination of trajectories: is the system bounded or will some trajectories be attracted to infinity?

One famous theorem Poincaré developed is the Recurrence theorem, which states that certain types of systems will inevitably return to a state very close to their initial state. An even more important theorem in chaos theory as we know it today is the Poincaré-Bendixson theorem, which constrains the evolution of dynamical systems that live in the phase planeᵃ. More precisely, it states that continuous motion in a 2-dimensional system that has no fixed point (including infinity) must eventually approach a closed orbit.

ᵃ The phase plane is the 2-dimensional case of the n-dimensional phase space.

Poincaré used a geometrical approach to find qualities that are invariant to different sets of initial conditions. While he was using this different approach to tackle the problem, he slowly started realizing that a quantitative solution for all initial conditions did not exist. Poincaré found that most trajectories of the restricted version³ of the three-body problem would pass through regions where the trajectories are unstable, i.e. small changes to these trajectories lead to a very different evolution [3]. Poincaré realized that these regions meant that the three-body problem exhibits sensitive dependence on the initial conditions and that trajectories are aperiodic in nature. With this insight he concluded that long term predictions were impossible, and that therefore a closed form (quantitative) solution did not exist for most initial conditions. This idea went against all common sense in his time, but is considered the foundation for chaos theory as we know it today.

² The phase space of a dynamical system describes the set of all positions and momenta of the given system. A more general term is the state space, which can also be used in a different context where the degrees of freedom represent other variables than position and momenta. A phase space is therefore also a state space, but a state space is not necessarily a phase space. Poincaré had his contributions to the invention of the phase space representation, which he used to show the sensitivity to initial conditions for the three body problem. A more in-depth discussion on the phase space can be found in [11].

³ The restricted three-body problem is a special case of the three-body problem where the positions of two of the bodies are fixed. This was studied as a simpler example instead of the full problem.


Figure 1: Egg example of stability: (left) the state of the egg is fixed, but sensitive to small deviations; (middle) the state of the egg is not fixed and unstable; (right) the state of the egg is fixed and stable: small deviations will converge back to the same state.

The qualitative approach to studying differential equations is still used to understand the long term behaviour of dynamical systems. Differential equations often cannot be solved explicitly; with geometric tools it is possible to solve these equations qualitatively and describe properties that are invariant to sets of initial conditions. Some invariants that are relevant in this thesis are the spectral radius, the Jacobian and the Lyapunov exponents. Most of these invariants are related to expansion and contraction of small regions in space. Because of this, eigenvectors and eigenvalues play an important role in understanding the long term dynamical behaviour [9], as will be discussed in this section.

Chaotic motion on a strange attractor (Lorenz): Since Poincaré, more interesting research has been done on complex dynamics in Hamiltonian systems, but the idea of chaos was forgotten. It is the masterpiece "Deterministic Nonperiodic Flow" of Lorenz in 1963 that made the glimpse of chaos of Poincaré concrete and described the topology of chaotic dynamics in the form of a strange attractor, which will be discussed later. Lorenz was a meteorologist who was working on weather prediction. When he evolved one of his weather models twice, he found that the evolution would diverge after some time to a point where the system was in a completely different state.

He realized that the model was correct and that small rounding errors in the initial conditions had led to a fundamental divergence for all future states. In his work he reconstructed the erratic behaviour in a model of atmospheric convection. This model is described by the Lorenz equations and treated in section 5.

Figure 2: Patterns in a computer model for weather prediction diverge when iterated twice with initial conditions that differ by a small round-off error (Lorenz's printouts, 1961).

2.2 Dynamical systems

In many real world problems it can be very useful to have a mathematical model that can describe the dynamics of a system that changes over time. If the state of a system at time t can be described by n variables, we say that the system is n-dimensional. We can then write the state as an n-dimensional state vector

$$\mathbf{x}(t) = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \qquad (1)$$

The time evolution of a dynamical system can be continuous or discrete. A discrete time dynamical system can be written in the form of a difference equation

$$x_{t+1} = f(x_t) \qquad (2)$$

Where f is some function that describes how state $x_t$ is mapped to state $x_{t+1}$. These systems are also called (iterated) maps or recursion relations.
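As a concrete illustration of equation 2, the sketch below iterates a one-dimensional map. The logistic map is used here because it is the discrete toy model studied later in section 4; the parameter value r = 3.9 and the initial condition are arbitrary example choices, not settings taken from this thesis.

```python
import numpy as np

def logistic_map(x, r=3.9):
    # one application of the map f: x_t -> x_{t+1}
    return r * x * (1.0 - x)

# iterate the difference equation x_{t+1} = f(x_t)
x = 0.2                       # initial condition x_0
trajectory = [x]
for t in range(100):
    x = logistic_map(x)
    trajectory.append(x)
trajectory = np.array(trajectory)
```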


A continuous time dynamical system can be written as a differential equation

$$\dot{x}(t) = f(x(t)) \qquad (3)$$

Where f is some function describing the dynamics, or the difference for a small timestep τ, and the evolution is then

$$x(t + \tau) = x(t) + \tau\,\dot{x}(t) \qquad (4)$$
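In code, equation 4 is simply an explicit Euler integration step. The sketch below uses the Lorenz equations (treated in section 5) with the standard textbook parameter values as an example right-hand side; the step size is an arbitrary choice.

```python
import numpy as np

def lorenz(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # f(x) = dx/dt for the Lorenz system (see section 5)
    return np.array([
        sigma * (x[1] - x[0]),
        x[0] * (rho - x[2]) - x[1],
        x[0] * x[1] - beta * x[2],
    ])

def euler_step(x, f, tau=0.01):
    # x(t + tau) = x(t) + tau * f(x(t)), cf. equations (3) and (4)
    return x + tau * f(x)

x = np.array([1.0, 1.0, 1.0])
for _ in range(1000):
    x = euler_step(x, lorenz)
```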

Both systems described above are autonomous: the state of the system follows only from the previous state. When the dynamics of the system also depends on any other variable than the previous state, for example some input signal, the system is non-autonomous4.

$$x(t + \tau) = f(x(t), u(t)) \qquad (5)$$

Where u(t) can be any external variable. We use these terms to tell if the reservoir is driven by an external input or acts on itself.

2.3 Definition of Chaos

As described in section 2.1, breakthroughs in chaos theory are spread out over several centuries. However, it is only in the last few decades that chaos theory took off to what it is today. Although the ideas about chaos have been wandering around for a long time, chaos theory is still lacking a formal definition of chaos. In his famous book Nonlinear Dynamics and Chaos [16], Strogatz provides a working definition that nicely covers three important properties of chaos that carry universal consensus:

“Chaos is aperiodic long-term behaviour in a deterministic system that exhibits sensitive dependence on initial conditions.”

If the evolution of a system does not converge to any fixed point (including infinity) or to some periodic or quasiperiodic orbit, the system is said to have aperiodic long-term behaviour. In other words, this means the trajectory will wander around forever, as it will not go to infinity or any other fixed point, and will never reoccur in a state it has seen before. Secondly, it is important that the aperiodic behaviour is fundamentally deterministic; it would otherwise be too easy to create such an orbit with a continuous random variable that is never in the exact same state. Finally, a chaotic system always exhibits sensitive dependence on initial conditions; this says that two nearby points, when they evolve, will diverge exponentially. As we will see in section 2.6 this exponential divergence is characterized by the Lyapunov exponents.

⁴ To prevent any confusion: time could also be an external variable, used to determine the evolution of the system. In that case, the system $x(t + \tau) = f(x(t), t)$ would be non-autonomous. However, here time is only used to indicate where the system is in its evolution.

2.4 Nonlinearity and Chaos

“Using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals.” - Stanislaw Ulam

When a system is linear it can be understood as the sum of its parts; this is also known as the superposition principle. This principle can drastically simplify complex problems because it makes it possible to solve the parts independently and then add them back together to get the solution of the initial, more complex problem. There are many powerful techniques that can be used to tackle linear problems. For instance, the superposition principle allows to understand a system as a linear combination of its normal modes or to describe a function with a Fourier transform. Also Laplace transforms and superposition arguments are widely used in physics [16]. Furthermore, any variations that are made in the initial conditions of a linear system will change proportionally when the system is evolved.

A nonlinear dynamical system is a system that does not evolve linearly. This happens, for example, when parts of the system have some nonlinear interaction like interference, competition or cooperation. Mathematically this is found in products, powers and functions concerning variables from the system's phase space. Nonlinearity also means that the evolution function does not satisfy the superposition principle. That means that all the sophisticated techniques described for the linear system generally cannot be used to effectively reduce a nonlinear problem. Therefore, when possible, a linear approximation is used to describe a simplified version of some problem. In this thesis the linear approximation of the reservoir dynamics is used to understand the real nonlinear dynamics better. Strogatz [16] summed up some systems that he classified based on their number of variables and their nonlinearity, shown in figure 3. With this figure he illustrates the complexity that emerges with nonlinearity. He argues that many fields in the upper half are developed because of the linear techniques that are discussed above; these systems can be solved quantitatively. But many real world phenomena are inherently nonlinear and cannot be approximated as being linear. These problems are found in the lower half and information is often found in qualitative approaches.

In figure 3 one can see that the label chaos is attributed to the nonlinear regime where the number of variables n ≥ 3. From the Poincaré-Bendixson theorem⁵ it follows that systems for which the evolution is continuous need to have at least three dimensions to be chaotic. Because this is the minimum required, it is also the number of dimensions of the toy model of continuous chaos that is used in this thesis, known as the Lorenz system. The Poincaré-Bendixson theorem does not describe dynamical systems with a discrete evolution. For discrete systems it is possible to have chaos in just a single dimension. A toy model commonly used for these discrete dynamics is the logistic map, which is also used in this thesis.

⁵ The Poincaré-Bendixson theorem states that systems of less than three dimensions cannot be chaotic.

Figure 3: Illustration from [16], a comparison of different phenomena in the exact sciences based on the number of dimensions and the presence of nonlinearity.


2.5 The Geometry of Dynamics

In dynamical systems, the repetition of simple equations can lead to complex behaviour that is not always obvious. However, there is “hidden” information in the mathematical operations that are repeated. A technique to uncover hidden information involves a decomposition of the mathematical objects. This can make it possible to isolate universal properties, like for instance rotation and scaling.

2.5.1 Eigendecomposition

To uncover the long term behaviour of the dynamics, eigenvectors and eigenvalues provide the key. As a reminder, for a given n × n matrix A, an eigenvector v is a nonzero vector that is only scaled when it is multiplied by A, while its direction $\hat{v}$ stays the same.

$$A v = \lambda v \qquad (6)$$

The proportion of scaling is given by the corresponding eigenvalue λ. Therefore the eigenvectors of a matrix show us the directions of scaling, and the related eigenvalues the proportion of this scaling. Eigenvector and eigenvalue pairs can be found by solving the characteristic equation of A, which is given by⁶

$$\det(A - \lambda I) = 0 \qquad (7)$$

Where I is the identity matrix. An n × n matrix therefore has exactly n eigenvalues (counted with multiplicity) with corresponding eigenvectors. These eigenvalues can also be complex. Complex eigenvalues reveal information about real dynamical systems, such as periodic behaviour, vibration or other types of spatial rotations [9].
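As noted in the footnote below, numerical libraries provide this decomposition directly. A minimal NumPy sketch, with an arbitrary example matrix:

```python
import numpy as np

A = np.array([[1.2, 0.3],
              [0.1, 0.8]])            # arbitrary example matrix

eigvals, eigvecs = np.linalg.eig(A)   # solves det(A - lambda I) = 0
print(eigvals)                        # eigenvalues, possibly complex
print(eigvecs[:, 0])                  # eigenvector belonging to eigvals[0] (columns)

rho = np.max(np.abs(eigvals))         # spectral radius: largest absolute eigenvalue
```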

2.5.2 Linear dynamical systems

To better understand some of the geometry that arises from a specific difference equation, it is therefore often helpful to decompose the mathematical object into its eigenvalues and eigenvectors.

⁶ Many programming libraries like NumPy, SciPy, PyTorch and TensorFlow provide built-in functions to compute the eigendecomposition of a matrix.

For simplicity let us first consider a 2-dimensional linear discrete dynamical system which can be represented by a linear difference equation of the form

$$x_{t+1} = A x_t \qquad (8)$$

The geometry of the dynamics of an n-dimensional system can be illustrated by evolving an n-sphere of initial points. This gives insight into how nearby trajectories converge or diverge from each other around some region in space. It can also be thought of as how a region is transformed spatially. Figure 4 illustrates the effect of the transformation from equation 8 on a unitary 2-sphere of initial positions. The directions of contraction and expansion are aligned with the eigenvectors of A.

Figure 4: A unitary 2-sphere is deformed by a multiplication by A. (Left) shows a unitary 2-sphere at state $x_t$ and (right) shows how the points have evolved after multiplication with A at state $x_{t+1}$. They expand in the direction of eigenvalue $\lambda_1$, which is greater than one, and contract in the direction of eigenvalue $\lambda_2$, which is smaller than one.

To illustrate the evolution of trajectories over a longer time interval, it is more convenient to use a phase diagram where a few trajectories are plotted. Figure 5 shows the phase portrait for different absolute values of the eigenvalues of A. When $|\lambda_{1,2}| < 1$, the origin acts as an attractor. When $|\lambda_{1,2}| > 1$, the origin acts as a repeller. When $|\lambda_1| < 1$ and $|\lambda_2| > 1$, or vice versa, the origin acts as a saddle point.

Figure 5: Phase portraits for dynamical systems described by eq. 8 with different eigenvalues for A: (a) the origin as an attractor, (b) the origin as a repeller, (c) the origin as a saddle point. For (a) and (b), the eigenvectors are aligned with the axes. (Illustration from David Lay (2012) [9])

For a transition matrix A that also contains rotations, the eigenvalues are complex. If both absolute eigenvalues are unitary, trajectories will spiral around the origin forever (figure 6). The largest absolute eigenvalue of a matrix is also called the spectral radius. The spectral radius is a measure that is commonly used as an important hyper parameter for initializing a reservoir computer. For an autonomous linear system, a spectral radius smaller than 1 guarantees that trajectories do not fly off to infinity and will slowly decay. In reservoir computing theory the guarantee of the decay of signals is referred to as the echo state property.
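This role of the spectral radius is easy to check numerically: under the linear evolution of equation 8, a cloud of points contracts towards the origin when the spectral radius is below one. A small sketch with an arbitrary random matrix rescaled to spectral radius 0.9:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))    # rescale to spectral radius 0.9

# evolve a unit circle of initial conditions under x_{t+1} = A x_t
theta = np.linspace(0.0, 2.0 * np.pi, 100)
points = np.stack([np.cos(theta), np.sin(theta)])  # shape (2, 100)
for t in range(50):
    points = A @ points

print(np.max(np.linalg.norm(points, axis=0)))      # decays towards 0 since rho(A) < 1
```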


Figure 6: States spiral around the origin for complex eigenvalues: (a) conserved spiral around the origin, $|\lambda_{1,2}| = 1$; (b) inward spiral to the origin, $|\lambda_{1,2}| < 1$. (Illustration from David Lay (2012) [9])

2.5.3 Attractors and Strange Attractors

A straightforward example of an attractor is the origin in figure 5a; this is a stable fixed point, which is a point to which all neighbouring trajectories converge. This means that once you are on a stable fixed point, the trajectories will stay there. A stable limit cycle, where all neighbouring trajectories converge to a limit cycle, is also an attractor. If a fixed point or limit cycle is unstable, the evolution will only stay there if it starts exactly on the fixed point or limit cycle, and small deviations will lead to diverging trajectories. The fixed point or limit cycle then acts as a repeller (figure 5b). Just as with the definition of chaos, there is still discussion about the exact definition of an attractor [16]. Roughly speaking, an attractor is defined as a closed set of numerical values that satisfies the following properties [16]:

1. The attractor is an invariant set, meaning any trajectory that starts on the attractor will stay on the attractor.

2. There is an open region that contains the attractor, but is larger than the attractor, for which all initial conditions converge to the attractor. The largest such region is called the basin of attraction.

3. An attractor is minimal, meaning no subset of the attractor satisfies the other properties.


Figure 7: Illustration of a stable, unstable and semi-stable limit cycle.

Linear dynamical systems can only have the origin act as an attractor or repeller. In a nonlinear system, the transition from state $x_t$ to state $x_{t+1}$ is not proportional to the input state and varies with $x_t$. A characteristic of a chaotic system is that nearby trajectories exponentially diverge from each other: this makes the system sensitive to the initial conditions. But they also exponentially converge towards each other: this keeps the trajectories bounded so they do not diverge to infinity.

A way to quantify the convergence and divergence of nearby trajectories is by taking the average divergence and convergence over some time interval. However, for dynamical systems with a nonlinear evolution function, there can be multiple points that attract or repel, or even complex cycles or regions of attraction. When the evolution of a system is chaotic, trajectories are sensitive to the initial conditions and therefore diverge. On the other hand, trajectories are aperiodic and therefore bounded to some region. This leads to trajectories that converge to a complex attractor and generally means that the attractor of a chaotic system is a fractal set. These attractors are called strange attractors. The first known example of a strange attractor is the butterfly attractor of the chaotic Lorenz system, which was discovered by Edward Lorenz. The front page of this thesis shows an illustration of the butterfly attractor, and the Lorenz system will be discussed in detail in section 5.1.


2.6 Lyapunov exponents

The Lyapunov exponents measure the average divergence in n principal directions along a trajectory in $\mathbb{R}^n$. When the trajectory evolves on an attractor, the average divergence is invariant to the initial conditions, and is thus a universal quantity of the attractor. A positive Lyapunov exponent in direction k means that two nearby trajectories diverge from each other in direction k. For a negative Lyapunov exponent this means they converge. And a zero Lyapunov exponent means the mutual distance between points in that direction is conserved. For a continuous evolution there is always a zero exponent along the direction of the flow. Different types of attractors can therefore be characterized by the signs of their Lyapunov exponents:

• Stable fixed point: all exponents are negative

• Stable limit cycle: 1 exponent is zero (along the flow) and the rest is negative

• Continuous chaotic attractor: at least 1 exponent is positive, 1 expo-nent is zero and the rest is negative

• Discrete chaotic attractor: at least 1 exponent is positive and the rest is negative

In this thesis we will use the Lyapunov exponents to verify whether the learned dynamics are characterized by the same universal properties as the original dynamical system. We will also use the Lyapunov time, explained in the next subsection, as a measure for the predictability of a dynamical system. But the Lyapunov exponents reveal more interesting universal properties of the attractor, like an approximation of the rate of entropy that the system generates and the fractal dimension of the attractor. These properties have proven useful to better understand complex dynamical systems, and sophisticated methods to find these exponents are therefore very relevant.

2.6.1 Jacobian linearization of a nonlinear system

For nonlinear systems the divergence and curl can vary along a trajectory in phase space, but at a single point this divergence is fixed. Therefore, we can approximate the Lyapunov exponents by considering the Jacobian of infinitely many points on a trajectory. This is commonly known as the "Jacobian linearization of a nonlinear system". As a reminder, the Jacobian is a matrix containing all first-order partial derivatives of a function with respect to the input.

$$J_{ij} = \frac{\partial f_i}{\partial x_j(t)} \qquad (9)$$

It defines a linear mapping $\mathbb{R}^n \to \mathbb{R}^m$ and acts as a linear approximation of the nonlinear function f near the state x. The Jacobian can therefore be used to study how a volume around a state changes in the linear approximation when it is applied to a difference equation (discrete dynamical systems). For continuous systems, where the evolution is described by a differential equation, the flow map can be used instead. This also adds the previous state by using the identity matrix $I_{ij}$, and because the function f describes $\frac{dx}{dt}$ it is also multiplied with the stepsize dt:

$$M_{ij} = I_{ij} + J_{ij}\, dt \qquad (10)$$

In the linear case, we had the fixed matrix A that describes the change in volume around a state. In the nonlinear case, the Jacobian J or flow map M can do the same. Just as in the linear case, the absolute eigenvalues $|l_k|$, $k \in \{1, \ldots, n\}$, of the flow map tell us the rate of divergence along the principal axes of the hypersphere, which are in the eigendirections. Only now, the scale and directions of the deformations change along the path; therefore the flow map has to be recalculated for every point on the attractor.

Figure 8: Divergence in the 0th principal axis of a 2-sphere of nearby points along a trajectory (red).


The flow map is a linear version of a nonlinear evolution function as described in section 2.2. Sequentially applying the recalculated flow map, amounts to recursively applying the nonlinear evolution function.

$$x_{n+1} = M_n x_n \qquad (11)$$
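To make the linearization concrete, the sketch below builds the Jacobian of equation 9 and the flow map M = I + J dt of equation 10 for the Lorenz system (with the standard textbook parameters; the step size dt is an example value). These flow maps are the input for the QR method described next.

```python
import numpy as np

def lorenz_jacobian(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # J_ij = d f_i / d x_j evaluated at the state x (equation 9)
    return np.array([
        [-sigma,       sigma,  0.0],
        [rho - x[2],   -1.0,  -x[0]],
        [x[1],          x[0], -beta],
    ])

def flow_map(x, dt=0.01):
    # M = I + J dt, the linearized one-step map (equation 10)
    return np.eye(3) + lorenz_jacobian(x) * dt
```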

2.6.2 Theoretical definition

The Lyapunov exponents measure the divergence along the principal axes of these hyper spheres of nearby points. The kth Lyapunov exponent is characterized by the divergence in the kth principal axis.

$$\delta_k(t) = \delta_k(t_0)\, e^{\lambda_k t} \qquad (12)$$

Where $\lambda_k$ is the kth local Lyapunov exponent. With local we mean the Lyapunov exponent of a point or smaller region. As mentioned before, ideally infinitely many points on the attractor are considered and averaged to estimate the Lyapunov exponents of the attractor. Therefore, a formal definition for the Lyapunov exponents of an attractor is

$$\lambda_k = \lim_{t \to \infty} \frac{1}{t} \ln l_k(t), \qquad \text{where } l_k(t) = \frac{\delta_k(t)}{\delta_k(t_0)} \qquad (13)$$

2.6.3 Numerical Calculation: Discrete QR Method for Global Exponents

In practice we can, of course, not consider an infinite time interval t. However, for a large enough value of t, the average value of the exponents converges and becomes a good approximation of the actual global exponents of the attractor. Another problem that emerges is that for large values of t, equation 13 can become numerically unstable. This is because the principal axis $l_k(t)$, for large t, diverges for a positive exponent and converges to zero for a negative exponent. Both are numerically unstable for large stretching or contraction. Also, the initial principal axes $\vec{l}(t_0)$ naturally tend to align along the axis with the largest Lyapunov exponent for large t, due to shearing effects. Thereby, if we would iteratively measure these sheared axes, we would not measure the principal axes $l_k(t)$ of the deformed n-sphere anymore.

A simple trick to overcome these instabilities is to use a QR decomposition. The QR factorization theorem states that a linearly independent m × n matrix A can be decomposed into an orthonormal m × n matrix Q and an invertible upper triangular n × n matrix R with positive values on its diagonal. Geometrically this means that

• Q is a matrix that describes an orthonormal basis containing all the orientation properties of A (rotation and mirroring).

• The upper triangular matrix R contains the scaling and shearing properties of A, where the scaling part is naturally on the diagonal and the shearing part is on the entries above the diagonal.

The trick is to separate the orientation of the deformed n-sphere from the scaling and shearing part of the deformation. Note that Q is orthonormal and therefore again describes a unit n-sphere. This provides the possibility to propagate the orientation of the deformations by $M_t$ along the evolution, without the scaling and shearing effects. This prevents the scaling and shearing effects from accumulating. Because the orientation is propagated along the evolution, the entries on the diagonal of all R matrices are aligned and represent the scaling in the principal axes. This is what we need for the computation of the Lyapunov exponents. Algorithmically this looks as follows [2]:

# Set initial orientation to the identity matrix
Q = I
# Propagate the orientation through the evolution and save the scaling values
for every flow map M_n along the trajectory do
    # 1. Realign M_n with Q
    M'_n = M_n Q
    # 2. Separate orientation and scaling/shearing properties of M'_n
    Q_n, R_n = QR(M'_n)
    # 3. Update current orientation
    Q = Q_n
end

Algorithm 1: Computes the aligned $R_n$ matrices from the linearized flow maps.

The kth value of the diagonal of $R_n$ corresponds to the scaling of the kth principal axis. The Lyapunov exponents are then described by

$$\lambda_k = \frac{1}{N} \ln \prod_{n=1}^{N} [R_n]_{kk} \qquad (14)$$

Note that the shearing part is left out completely by only considering the diagonal. To prevent the scaling effects from accumulating exponentially, we can rewrite this as

$$\lambda_k = \frac{1}{N} \sum_{n=1}^{N} \ln [R_n]_{kk} \qquad (15)$$

Or, if the discrete steps are proportional to some time interval $dt \neq 1$, we should write

$$\lambda_k = \frac{1}{N\, dt} \sum_{n=1}^{N} \ln [R_n]_{kk} \qquad (16)$$

Libraries like TensorFlow, NumPy and PyTorch have built-in methods to produce the QR decomposition of some matrix A. Generally, these algorithms decompose A by multiplying A with a sequence of orthogonal matrices from the left. In this way A is transformed into an upper triangular matrix in a computationally efficient way.
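A minimal NumPy sketch of Algorithm 1 combined with equation 16 is given below. It assumes a list of flow maps that have already been evaluated along a trajectory on the attractor, for example with the flow_map sketch shown earlier; the names are illustrative only.

```python
import numpy as np

def lyapunov_exponents(flow_maps, dt=0.01):
    # discrete QR method: average the log of the diagonal of R along the trajectory
    n = flow_maps[0].shape[0]
    Q = np.eye(n)                          # initial orientation (Algorithm 1)
    log_r = np.zeros(n)
    for M in flow_maps:
        Q, R = np.linalg.qr(M @ Q)         # realign with Q, then split orientation and scaling
        signs = np.sign(np.diag(R))        # enforce a positive diagonal of R
        Q, R = Q * signs, R * signs[:, None]
        log_r += np.log(np.abs(np.diag(R)))
    return log_r / (len(flow_maps) * dt)   # equation (16)
```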

2.7 Lyapunov Time and the Time Horizon

We have seen that a system with a positive Lyapunov exponent has at least one direction in which flows evolve chaotically. Every error, no matter how small, will increase exponentially over time, and will therefore at some time matter. This means that any system that we do not solve with absolute accuracy becomes 'unpredictable' at a time we call the time horizon. To define unpredictable, a range of tolerance α is chosen that is considered acceptable for the error. A small initial error $\|\delta_0\|$ will grow by

$$\|\delta_t\| \sim \|\delta_0\|\, e^{\lambda_{max} t} \qquad (17)$$

Where the largest expansion of the error is in the same direction as the largest Lyapunov exponent $\lambda_{max}$. The error becomes proportional to the chosen tolerance at the time horizon, which we can express by rewriting equation 17 into

$$t_{horizon} \sim O\!\left( \frac{1}{\lambda_{max}} \ln \frac{\alpha}{\|\delta_0\|} \right) \qquad (18)$$

The time horizon $t_{horizon}$ is now written as a function that depends logarithmically on the initial condition. This dependence emphasizes again that errors grow exponentially and the time horizon is therefore limited. The inverse of the maximum Lyapunov exponent is also called the Lyapunov time.

$$t_{Lyapunov} = \frac{1}{\lambda_{max}} \qquad (19)$$

Looking at equation 17 we can see that the Lyapunov time is the time over which an error grows by a factor of e, because the exponent $\lambda_{max} t$ then equals one. The Lyapunov time as a measure for prediction makes the timescale invariant to the maximum Lyapunov exponent of the system. Because of this quality it can be useful to give some insight in how accurately we can predict a system when it evolves. It is, however, important to emphasize that the Lyapunov time is only the incremental factor of the error. If we know the initial conditions of a system with higher accuracy, we can generally predict more Lyapunov times. For example, if we evolve a system for which we know the initial condition with single floating point accuracy (∼ 2⁻²³), this error will grow to unitary size after roughly 16 Lyapunov times, because $e^{16} \approx 2^{23}$. But if we start with double floating point precision (∼ 2⁻⁵²) we can last up to 36 Lyapunov times before we have reached a unitary error. In practice, the evolution of the initial value is generally done by iteration on a computer, which will inevitably give us more errors due to the machine epsilon⁷ of the variables that are used for the evolution.
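The back-of-the-envelope numbers above follow directly from equation 18 with tolerance α = 1; a two-line check:

```python
import numpy as np

# number of Lyapunov times before an error of one machine epsilon reaches order one
print(np.log(2.0 ** 23))   # ~15.9 for single precision
print(np.log(2.0 ** 52))   # ~36.0 for double precision
```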


3 Artificial Neural Networks: Feedforward Networks and Reservoir Computing

”Nature forms patterns. Some are orderly in space but disorderly in time, others orderly in time but disorderly in space. Some pat-terns are fractal, exhibiting structures self-similar in scale. Others give rise to steady states or oscillating ones. Pattern formation has become a branch of physics and of materials science, allowing scientists to model the aggregation of particles into clusters, the fractured spread of electrical discharges, and the growth of crys-tals in ice and metal alloys. The dynamics seem so basic—shapes changing in space and time—yet only now are the tools available to understand them.”

-James Gleick

This section is dedicated to providing the reader with theoretical and historical insights regarding feedforward networks and reservoir computing, and a new approach is introduced: the decoupled reservoir network.

3.1 Feedforward neural networks

A feedforward neural network is an artificial neural network in which information flows in one direction. This means the computational graph does not have any loops, unlike a recurrent neural network (discussed next). The goal of these networks is to approximate a function f that maps the input x to the given output y.

$$y = f(x, \theta) \qquad (20)$$

Where θ are the parameters of the network. A network can contain multiple layers, meaning multiple sets of operations act sequentially on the input. A network of n layers can then be written as n sequential functions; for n = 3 we can write

$$y = f_3(f_2(f_1(x, \theta_1), \theta_2), \theta_3) \qquad (21)$$

n is also called the depth of the network. The last layer provides the output y and is therefore called the output layer; the other layers only provide an output for the next layer and are therefore called hidden layers. The dimensionality of these hidden layers is also called the width of the network. The most basic artificial neural network, commonly used as a vanilla example, is a fully connected⁸ feedforward network with a single hidden layer.

$$a = W_{ih} x + b_h \qquad (22)$$
$$h = g(a) \qquad (23)$$
$$y = W_{ho} h \qquad (24)$$

The input x is multiplied with the input-to-hidden weight matrix $W_{ih}$ and a bias $b_h$ is added. The pre-activation vector a is acted on by an activation function g (which can in principle be any function, but there are some good candidates out there like sigmoid functions or variants of the rectified linear unit) to get to the hidden state h. In the output layer, h is multiplied with the hidden-to-output weight matrix $W_{ho}$. The output layer is linear in this case, but it can also have a nonlinear activation function and/or a bias term. A famous theoretical insight in artificial neural networks is given by the universal approximation theorem, shown by Hornik et al. in 1989 and Cybenko in 1989. It states that if the output layer is linear and the hidden layer has a squashing activation function, such a network can in theory approximate any continuous function that maps an input x ∈ R to an output value y ∈ R with arbitrary accuracy⁹. The proof was later also generalized to a broader range of activation functions. This theorem only states the possibility for a neural network to approximate a function but provides no guarantees about the convergence of a training algorithm [4].
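A minimal sketch of the single-hidden-layer network of equations 22 to 24 as a forward pass in NumPy; the layer sizes, the random initialization and the tanh activation are arbitrary example choices, not the architecture used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 64, 3

W_ih = rng.normal(size=(n_hidden, n_in)) * 0.1    # input-to-hidden weights
b_h = np.zeros(n_hidden)                          # hidden bias
W_ho = rng.normal(size=(n_out, n_hidden)) * 0.1   # hidden-to-output weights

def forward(x):
    a = W_ih @ x + b_h    # pre-activation, equation (22)
    h = np.tanh(a)        # hidden state, equation (23)
    return W_ho @ h       # linear output layer, equation (24)

y_hat = forward(np.array([1.0, 1.0, 1.0]))
```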

3.1.1 Computational Graphs

Artificial neural networks come in many different forms but are always a set of operations on some input to produce some output. The total set of operations can become complex. A common way to formalize the structure of these operations is in the language of a computational graph. This can be done in many ways; one common way is to use nodes to describe variables and edges to describe operations. The operations generally consist of matrix multiplications, adding a bias term, and giving this as input to a nonlinear activation function. The computational graph of the feedforward network described by equation 22 is illustrated in figure 9. The computational graph of a discrete dynamical system, as described by equation 2 in section 2.2, is also illustrated (figure 10).

⁸ Fully connected refers to a network where the forward propagation is done by a matrix multiplication that connects all dimensions of every state to all dimensions of the state that follows.

⁹ The proof of this result is simplistic, beautiful and worth reading, however it is outside of the scope of this thesis. The proof can be found in the original papers but also many clear sources are found online.

Figure 9: Computational graph of the feedforward neural network described by equation 22.

Figure 10: Computational graph of an ordinary dynamical system.

3.1.2 Back Propagation

In a feedforward network a signal of the input data x is propagated forward through the computational graph to generate an output signal $\hat{y}$, which is called forward propagation. The output signal is used in a loss function $J(\theta)$ that describes, as a scalar function of the network's parameters, what we want to optimize. In this thesis we train the network in a supervised manner; this means that a label y of the desired output is available and is compared in the loss function with $\hat{y}$. The goal is to minimize the loss function. The derivative of the loss function with respect to any parameter $\theta_i$ in the computational graph describes how much a variation in this parameter influences the loss function:

$$\frac{dJ(\theta)}{d\theta_i} \qquad (26)$$

Back propagation computes the gradients with respect to all parameters that need to be trained. In this way the parameters of the network can be updated proportionally to their contribution to $\hat{y}$. If this is done recursively for multiple training examples, the parameters are tuned towards the desired result. In this thesis we will not use back propagation. More detailed information on back propagation can be found in Goodfellow et al. [4].

Figure 11: Schematic view of the feedforward neural network described by equation 22. (The additional bias in the hidden layer is implicit.)

3.2 Recurrent Neural Networks

Recurrent neural networks are artificial neural networks for modelling sequential data of variable length τ,

$$\vec{x}_1, \ldots, \vec{x}_\tau \qquad (27)$$

What makes the recurrent neural network unique is a concept called parameter sharing. Recurrent neural networks share the parameters that operate on all the different values of the sequence. In comparison, a traditional fully connected network could handle sequential input, as this can be just a vector, but then each sequence value gets its own special treatment. If you just consider a single fully connected layer as a matrix multiplication, then each value of the input vector is multiplied by a different independent value of the weight matrix. But in recurrent neural networks every sequence value gets treated with the same parameters: the weights are shared. Because of this, it is possible to use, and generalize across, sequential data of variable length. It also reduces the amount of training examples that are needed to train the parameters, because the total amount of values is now used for training instead of the amount of sequences. The number of training examples per trained parameter therefore increases by a factor of O(average sequence length). The recurrent neural network is especially interesting for modelling sequences for which the absolute positions of the values are arbitrary (i.e. could have shifted). Convolutional neural networks are also considered to be well fit for sequence modelling tasks, but this is less important for the purpose of this thesis. To be more explicit about the working mechanism of recurrent neural networks, we can start by stating that a recurrent neural network is itself a dynamical system that is driven by an external signal. In dynamical systems theory this is known as signal driven dynamics and is a non-autonomous system, which we can formalize as follows

$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta) \qquad (28)$$

Where the current state of the system $s^{(t)}$ is some function of the previous state $s^{(t-1)}$, the external input at the current step $x^{(t)}$ and the parameters of the dynamical system θ. The previous state is, in turn, a function of the state previous to the previous state, and so on. We can therefore unfold equation 28 into the form

$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta) = f(f(s^{(t-2)}, x^{(t-1)}; \theta), x^{(t)}; \theta) = \ldots$$


Figure 12: Unfolding of the computational graph of an RNN. Note that the current state is here described by h (hidden state).

We can keep on doing this until the first input is reached. The current state of the system is therefore a function of all previous input values and the parameters θ are shared across them. Because of this, the current state can be seen as a lossy representation of all the previous members of the sequence. This representation is then used as input to a readout layer to make predictions about the input sequence.

3.2.1 Back Propagation Through Time

Similarly to the back propagation algorithm used in feedforward networks, back propagation through time is a widely used training method for recurrent neural networks. Where traditional back propagation assumes that the network has no cycles, back propagation through time unfolds the different time layers of the network (see figure 12). The gradients of all considered timesteps are then summed together to update the weight matrices. Fundamental to the dynamical system structure of the network is that every time step follows from the previous timestep. It is therefore impossible to save computation time with techniques like parallelization, and the runtime of a backward pass will necessarily be of order O(n time steps).


3.2.2 Vanishing Gradients

Consider a static Jacobian for the hidden-to-hidden state, J, which would be the case without nonlinearity. If we back-propagate from a point g, then after N time steps we will have $J^N g$. With the same reasoning as in the previous section, if we start from g with a small perturbation δ, this would lead after N time steps to a perturbation of $J^N \delta$. If the largest absolute eigenvalue of J is larger than unity, this leads to exponential divergence, and for an absolute eigenvalue lower than unity to exponential convergence. In the first case, the gradients will explode to infinity; in the second case the gradients will vanish. Both cases are described by $\delta |\lambda|^N$, where λ is the eigenvalue of J in a specific direction and the initial perturbation δ is in the direction of λ. The exploding gradients would do the most damage to the optimization, but can be solved by re-scaling gradients that exceed some chosen threshold. The vanishing gradients are, however, an inherent problem for recurrent neural networks. The longer the input sequence, the more Jacobians are multiplied.

The same holds for the Jacobian of a nonlinear system, which changes with the input and therefore changes with time. The same argument applies in this case as was used for the Lyapunov exponents: the average divergence, given by the largest absolute eigenvalue of the Jacobian, determines the direction of the largest separation vector. This can prevent convergence during training, and even if the network does converge it often takes many gradient updates because of the vanishing gradients problem. This method is therefore slow and computationally expensive.


3.3 Reservoir Computing

Reservoir computing is a network architecture that is similar to the recurrent neural network, but tries to offer a solution to the convergence problems of the back-propagation-through-time algorithm. It was independently proposed by Jaeger and Haas (2004) and Maass et al. (2002) under the names echo state network and liquid state machine, where the echo state network was proposed with continuous variables for the hidden neurons of the network and the liquid state machine with binary, or "spiking", hidden neurons. Both networks are now commonly referred to as reservoir computing. The main idea of reservoir computing is to only train a linear readout $W_{ho}$ and initialize the rest of the network randomly. A reservoir computing network can be formally written as

$$r^{(t+1)} = f(W_{hh} r^{(t)} + W_{ih} x^{(t)}) \qquad (29)$$
$$\hat{y}^{(t+1)} = W_{ho} r^{(t+1)} \qquad (30)$$

Where $r^{(t)}$ is the H-dimensional hidden state vector at time t, $W_{hh}$ is the hidden-to-hidden weight matrix of size H × H, $x^{(t)}$ is the M-dimensional input vector, $W_{ih}$ is the input-to-hidden weight matrix of size H × M, f is the tanh sigmoid activation function, $\hat{y}^{(t+1)}$ is the network's prediction of the label and $W_{ho}$ is the hidden-to-output weight matrix, also called the linear readout, which contains the only parameters that are trained. Figure 13 illustrates a schematic view of a reservoir computing network.

A linear layer can by definition only learn a linear function of its inputs and can therefore not contain any memory or nonlinearity. This is important for many machine learning tasks because the real function that the network tries to approximate is often nonlinear. In our particular case we want a network to learn the evolution function of a finite dimensional chaotic system. As discussed in section 2, we know theoretically that for these systems a nonlinearity is essential. The memory part can often be very useful for sequential data, and is the reason why we use the recurrent neural network in the first place. Both memory and nonlinearity have to be generated by the untrained part, the "reservoir". The trick in reservoir computing is to try to optimize the reservoir such that it provides a rich nonlinear representation of the current input and the relevant history of the given data. To do this, a set of hyper parameters can be tuned that are discussed in this section.
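The sketch below shows one reservoir update (equation 29) together with a ridge-regression fit of the linear readout. The hidden size, scaling factors and regularization strength are placeholder values, and ridge regression is only one common way to train the readout; the regression actually used in this thesis is discussed in section 3.3.5.

```python
import numpy as np

rng = np.random.default_rng(0)
H, M = 500, 3                                            # hidden and input dimensions (example values)
W_hh = rng.normal(size=(H, H))
W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))    # set the spectral radius to 0.9
W_ih = rng.uniform(-0.5, 0.5, size=(H, M))

def reservoir_step(r, x):
    # r^(t+1) = tanh(W_hh r^(t) + W_ih x^(t)), equation (29)
    return np.tanh(W_hh @ r + W_ih @ x)

def train_readout(R, Y, ridge=1e-6):
    # fit W_ho from collected hidden states R (T x H) and targets Y (T x M)
    return np.linalg.solve(R.T @ R + ridge * np.eye(H), R.T @ Y).T
```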


Figure 13: Schematic view of a reservoir computing network

3.3.1 Connectivity

The hidden-to-hidden state coupling is represented by $W_{hh}$. If the hidden state vector is of size N, this matrix is of size N². In the language of graph theory, this is a finite graph of N vertices, which can have N² edges. The weight matrix $W_{hh}$ that connects the hidden states is then called an adjacency matrix. In graph theory, the connectivity determines whether vertices are connected or disconnected. The average number of connections per vertex is called the degree of the graph. In reservoir computing the connectivity of $W_{hh}$ is considered to be responsible for the variation of the hidden representation [6]. The initial thought was that a sparse adjacency matrix would partly decouple the dynamics in the reservoir, which would lead to a richer representation for the readout layer because dynamics would evolve more independently. However, a sparse matrix with random connections generally still shows small world properties, meaning that the number of edges that needs to be crossed to move from one arbitrary vertex to another is small. Therefore the dynamics could actually be less decoupled than initially thought. Also, according to [10] a sparse $W_{hh}$ leads in practice to only slight improvements of the performance. However, for a trained network, a sparse adjacency matrix with a small degree is computationally more efficient because fewer computations need to be done. If the degree of the adjacency matrix is fixed, independently of the reservoir size, then increasing the reservoir size only leads to a linear scaling in computational complexity instead of a quadratic one. Therefore, and because the sparsity only contributes a small amount to the performance, [10] recommends to generally initialize $W_{hh}$ with a fixed sparse degree around 10.

One way to create a randomly connected sparse graph is as an Erdős–Rényi graph, which is also used throughout this thesis. In an Erdős–Rényi graph every possible edge (or nonzero value in the adjacency matrix) has a uniform probability p, which can be computed from the fixed degree d and hidden size h by p = d/h. If p is close to 1 the adjacency matrix is densely connected and if p is close to 0 the adjacency matrix is sparsely connected. In this thesis, the values of the nonzero elements in the adjacency matrix are drawn from a normal distribution. Other distributions are also possible but show similar performance according to [10].
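A sketch of such an initialization, with a fixed degree and normally distributed nonzero weights; the concrete numbers are placeholders, not the settings used in this thesis.

```python
import numpy as np

def sparse_adjacency(h=500, degree=10, seed=0):
    # Erdos-Renyi style adjacency matrix: each edge exists with probability p = degree / h
    rng = np.random.default_rng(seed)
    p = degree / h
    mask = rng.random((h, h)) < p         # which edges exist
    weights = rng.normal(size=(h, h))     # nonzero values drawn from a normal distribution
    return np.where(mask, weights, 0.0)

W_hh = sparse_adjacency()
```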

3.3.2 Spectral Radius

The spectral radius of the adjacency matrix ρ(Whh) is an important bias of the reservoir because it controls the scale of the recurrent components. The spectral radius is defined as the largest absolute eigenvalue of Whh. In section 2.5 we showed how an absolute eigenvalue |λ| > 1 causes information to spiral away from the origin, while |λ| < 1 leads to an inward spiral, and for |λ| = 1 information can spiral, or "echo", around the origin forever (figure 6). This applies to a linear autonomous evolution, while the evolution of the reservoir state is a nonlinear function that is driven by an input signal. Therefore, in reservoir computing ρ(Whh) gives only a rough estimate of this behaviour; nevertheless the spectral radius provides a quantity for the stretching or contracting effect of Whh that is invariant to its other algebraic properties (like the hidden size and sparseness).
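In practice the initialized adjacency matrix is typically rescaled so that its spectral radius matches a chosen target value. A minimal sketch of this step, reusing the W_hh from the previous example, might be:

```python
import numpy as np

def scale_spectral_radius(W_hh, target_rho=0.9):
    """Rescale the adjacency matrix so that its largest absolute eigenvalue
    equals target_rho."""
    current_rho = np.max(np.abs(np.linalg.eigvals(W_hh)))
    return W_hh * (target_rho / current_rho)

W_hh = scale_spectral_radius(W_hh, target_rho=0.9)
```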

3.3.3 Input Scaling

The input-to-hidden weight matrix Wih is sampled from a random distribution and is generally densely connected. Multiple distributions are possible, just as for Whh, such as a uniform or a Gaussian distribution, but in practice this choice does not show significant differences in performance [10]. In this thesis Wih is initially sampled from a uniform distribution on the interval [−√k, √k], where k = 1/(input dimensions). It is therefore independent of the size of the reservoir. Wih is then scaled by a constant value α, which is called the input scale factor. Together with the spectral radius, the input scale factor controls the scale of the pre-activation vector a(n). Naturally, any squeezing activation function like the tanh drives input values towards the origin (white circle in figure 14), where the local gain is maximal and the slope of the tangent is steepest. In this regime the tanh approximates a linear function, so small values of α will decrease the nonlinear effect of the activation function. A larger value of α will push the average neuron activation upward or downward into the regime where the local gain decreases (grey circle in figure 14). If a(n) is too large the tanh will saturate and as a result the dynamics of the neurons will become effectively binary.

Note that the input scale factor is not an invariant quantity for the scale of the input signal of the reservoir. The initial range of the random distribution that Wih is sampled from and the scale of the (possibly normalized) input data also contribute to the absolute scale of the input signal. Moreover, the input data can be normalized in different ways.
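To make the roles of Wih and α concrete, here is a sketch of the input weight initialization and of one driven reservoir update, assuming the common update r(n) = tanh(Whh r(n−1) + α Wih x(n)) without a bias term; names and defaults are illustrative.

```python
import numpy as np

def init_input_weights(hidden_size, input_dim, alpha=1.0, seed=1):
    """Dense input weights, uniform on [-sqrt(k), sqrt(k)] with k = 1/input_dim,
    multiplied by the input scale factor alpha."""
    rng = np.random.default_rng(seed)
    k = 1.0 / input_dim
    return alpha * rng.uniform(-np.sqrt(k), np.sqrt(k), (hidden_size, input_dim))

def reservoir_states(W_hh, W_ih, inputs):
    """Drive the reservoir with an input sequence of shape (T, input_dim) and
    collect the hidden states, shape (T, hidden_size)."""
    r = np.zeros(W_hh.shape[0])
    states = []
    for x in inputs:
        a = W_hh @ r + W_ih @ x      # pre-activation a(n)
        r = np.tanh(a)               # squeezing nonlinearity keeps r bounded
        states.append(r)
    return np.array(states)
```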

Figure 14: Illustration from Verstraeten et al. (2009) [17]

3.3.4 The Echo State Property

The initial paper of Jaeger (2001) [7] emphasizes the importance of the echo state property for the reservoir. The echo state property states that the state of the network should be a function that is uniquely defined by previous values of the input sequence; any initialization values should therefore be washed out as time passes. Another way Jaeger puts this is that the hidden state r(n) can be described as a set of echo functions E of the input values x,

r(n) = E(..., x(n − 1), x(n))    (31)

where E is a vector of echo functions. The echo state property is controlled by the algebraic properties of the adjacency matrix and the input signal (and, in a stretched understanding, also by the activation function). A spectral radius ρ(Whh) ≤ 1 ensures the echo state property for most cases, as the previous state values, including the initial state, already vanish under the effect of the adjacency matrix alone. A spectral radius larger than unity can in theory violate the echo state property, because the state vector gets amplified in some directions by Whh. In practice, however, the echo state property often also holds for values larger than unity, because the hidden state vector is bounded by the "squeezing" tanh activation function. According to [10] it is possible for the reservoir to host fixed points, periodic modes or chaotic attractor modes if the spectral radius is too large. Furthermore, Yildiz et al. (2012) show that special cases exist where the spectral radius is smaller than unity and bifurcations can still emerge in which the reservoir dynamics are forced into non-trivial fixed points and/or periodic orbits. The practical guide of Lukosevicius (2012) [10] emphasizes that the spectral radius is task dependent and should be smaller than unity for tasks where memory of the recent input is most important and larger for tasks that require a longer input memory.
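The washing-out of initial conditions can be checked empirically by driving two copies of the same reservoir, started from different random initial states, with an identical input sequence and observing that the distance between their hidden states shrinks. A small self-contained sketch, where all sizes and scales are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
H = 200
W_hh = np.where(rng.random((H, H)) < 10 / H, rng.standard_normal((H, H)), 0.0)
W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))   # spectral radius 0.9
W_ih = rng.uniform(-1.0, 1.0, (H, 1))

def run(r0, inputs):
    """Evolve the reservoir from initial state r0 under the given input sequence."""
    r, states = r0.copy(), []
    for x in inputs:
        r = np.tanh(W_hh @ r + W_ih @ x)
        states.append(r)
    return np.array(states)

inputs = rng.standard_normal((200, 1))                   # shared driving signal
traj_a = run(rng.standard_normal(H), inputs)
traj_b = run(rng.standard_normal(H), inputs)

# If the echo state property holds, the two trajectories converge towards each other.
print(np.linalg.norm(traj_a - traj_b, axis=1)[[0, 50, 199]])
```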

3.3.5 Linear Regression

To train the reservoir, we want to find the hidden-to-output weight matrix Who that maps the hidden state of the reservoir r(n) to the prediction ŷ(n), such that the squared error with respect to the true value y(n) is minimized over the training data. To do this, we first evaluate all examples from our training data through the reservoir network to generate the hidden states. In matrix notation this is written as

Who R = Ŷ    (32)

where the matrix R ∈ R^{H×N} contains all generated hidden states r(n), with H the hidden size of the network and N the length of the training data. The matrix Ŷ ∈ R^{D×N} contains all N predictions ŷ(n) of dimensionality D. The

Euclidean norm between the predictions Ŷ and the labels Y can be minimized efficiently using the Moore-Penrose pseudoinverse, or simply pseudoinverse. The pseudoinverse A^+ of a matrix A is defined as

A^+ = lim_{α→0} (A^T A + αI)^{-1} A^T    (33)

It provides an approximation to the inverse of a matrix when the exact inverse (α = 0) is not defined or numerically unstable. Using the pseudoinverse, the weight matrix Who can be computed as

Who = Y R^+    (34)
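A sketch of this readout computation with numpy; here R and Y are random placeholders standing in for the actually collected hidden states and training targets:

```python
import numpy as np

H, N, D = 300, 5000, 3                  # hypothetical sizes: hidden units, time steps, outputs
rng = np.random.default_rng(0)
R = rng.standard_normal((H, N))         # placeholder for the collected hidden states r(n)
Y = rng.standard_normal((D, N))         # placeholder for the training targets y(n)

W_ho = Y @ np.linalg.pinv(R)            # W_ho = Y R^+, equation (34)
Y_hat = W_ho @ R                        # predictions on the training data
```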

In general N ≫ H, which means that solving for Who boils down to solving an over-determined system of linear equations. This is necessary, because if N ∼ H the network would not have to generalize across examples and would likely overfit the data.

Another way to prevent overfitting is to increase the value of α. This is known as Tikhonov regularization, ridge regression or weight decay. Typically the letter β is used instead of α, and equation 33 becomes

A^+ = (A^T A + βI)^{-1} A^T    (35)

Instead of minimizing only the mean squared error, solving for Who now minimizes

Who = argmin_{Who} (1/D) Σ_{d=1}^{D} ( Σ_{n=1}^{N} (y_d(n) − ŷ_d(n))^2 + β ||Who,d||^2 )    (36)

The β term penalizes large values of Who, which makes the network experience the input X as having greater variance [4]. A more in-depth discussion of weight decay and other regularization methods can be found in [4] [10]. The examples in this thesis are solved without regularization; however, for extensions to a more general case of real-world examples, like the eigenworm representation of the C. elegans, this could be necessary.
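For completeness, a sketch of the regularized (Tikhonov / ridge) readout in closed form, with β a small constant; for β → 0 this reduces to the pseudoinverse solution above. The function name and default β are illustrative.

```python
import numpy as np

def ridge_readout(R, Y, beta=1e-6):
    """Closed-form Tikhonov-regularized readout for W_ho R ~ Y.

    R: hidden states of shape (H, N); Y: targets of shape (D, N).
    Returns W_ho of shape (D, H).
    """
    H = R.shape[0]
    return Y @ R.T @ np.linalg.inv(R @ R.T + beta * np.eye(H))
```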

3.3.6 Memory vs Nonlinearity

We discussed that a linear readout layer can by definition not contain memory or nonlinearity, which are both desired in many machine learning tasks. This means that the memory and nonlinearity must come from the reservoir. In his early studies on ESNs, Jaeger showed that for independent and identically distributed (i.i.d.) input, the memory capacity of the reservoir is limited by the size of the reservoir and optimal for reservoirs with a linear activation function [8]. However, nonlinearity is important for a network to approximate more complex functions. This is also shown for reservoir computers on the linearly inseparable problem by Dambre et al. [1]. In their study they uncover the existence of a trade-off between the system's short-term memory capacity and the nonlinearity of the computations in reservoir computing. The examples in this thesis ask the network to approximate a function that generates chaos, which makes nonlinearity essential, as discussed in section 2. However, time-delay methods are also commonly used to reconstruct attractors of chaotic systems like the Lorenz system [14], which indicates that short-term memory in the network could be valuable as well.

3.4 Problems with Reservoir Computing

The parameters of the reservoir are not trained in reservoir computing. This provides an extensive computational profit and overcomes the vanishing/exploding gradients problem that is found in traditional RNNs. But there is also a downside: the tuning of the hyperparameters is now left to the researcher or practitioner, which is currently done by manual experimentation [6]. One problem is that the performance of the reservoir is sensitive to changes in the hyperparameters and is task specific. The tuning of the hyperparameters can therefore be time consuming.

3.5 Inverse Mixing of Decoupled Reservoirs

A new approach is introduced, the decoupled reservoir network (DRN), to optimize the memory-nonlinearity trade-off by automated routing between a set of models, covering a spectrum of hyperparameters, to a single output using the Moore-Penrose pseudoinverse. The main idea is that different models have a different memory-nonlinearity trade-off and that the routing mechanism approaches the optimal combination that fits the given problem best. Another way to see this is that the final output is a linear combination of the outputs of the individual reservoirs in an ensemble. Our hypothesis is that the specific combination of reservoir models found by the routing mechanism approaches a continuous region of hyperparameter space, rather than only the discrete hyperparameter sets of the individual models. The idea of combining separate models is also in line with the original argumentation behind a sparse initialization of the reservoir weights, namely that a decoupling of the hidden states into multiple subsystems would increase the richness of the signal that is presented to the readout layer.

For a given set of models with different hyperparameters

|m_1⟩, |m_2⟩, ..., |m_n⟩    (37)

the mixed model M is a linear combination of the models m,

M_i = Σ_{j=0}^{d} Σ_{n=1}^{N} a_{nij} m_{nj}

for N models with d output dimensions. The connection matrix a is responsible for the mixing of the models. Since the models m are already individually trained, finding the connection matrix becomes an inverse problem, which can again be solved efficiently and with high accuracy using the pseudoinverse. The pseudoinverse has proven to be very efficient for the examples in this thesis (Section 5.2.2). However, with the same argumentation as for using the pseudoinverse directly on a regular reservoir, it could in general be necessary to use a regression method such as Tikhonov regularization to prevent overfitting. During the research for this thesis two variants of the mixed model have been explored: (1) where the connection matrix a combines the outputs of trained reservoirs, and (2) where the connection matrix a is used directly on the signal of the hidden states of the reservoirs. In (2) only a single readout a is trained, whereas in (1) N individual readouts Who,n are trained as if they were individual reservoirs, after which a final readout a is trained that combines the models.
Usually the dimensionality of the hidden state H is much larger than the dimensionality of the output signal D. In case (2) the combination of hidden states could interfere with the condition N ≫ H (discussed in section 3.3.5), where N is the number of training examples. This increases the possibility of overfitting and the use of regularization would therefore be encouraged.
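As an illustration of variant (1), the outputs of the individually trained reservoirs can be mixed through a connection matrix obtained with the pseudoinverse; the function names and shapes below are illustrative, not the exact implementation used for the experiments.

```python
import numpy as np

def learn_mixing(model_outputs, Y):
    """Learn a linear mixing of individually trained model outputs.

    model_outputs: list of arrays of shape (D, T), the training predictions of
                   each trained reservoir.
    Y:             training targets of shape (D, T).
    Returns the connection matrix a of shape (D, n_models * D).
    """
    stacked = np.vstack(model_outputs)      # shape (n_models * D, T)
    return Y @ np.linalg.pinv(stacked)

def mixed_prediction(a, model_outputs):
    """Combine the individual model predictions into the final mixed output."""
    return a @ np.vstack(model_outputs)
```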


Figure 15: Schematic view of an inverse mixing of decoupled reservoirs. The decoupled models each have their own memory-nonlinearity trade-off to increase the richness of the signal to the final state. This model is of type (1): the decoupled models are individually trained on the data and the connection matrix is then learned on the same data.


4 Case Study 1: The Logistic Map

4.1 Introduction

The logistic map is a discrete 1-dimensional map in the form of a first-order quadratic difference equation. It was popularized in 1976 by Robert May, who used it to model the population growth of biological species. The population x at time t takes a value between 0 and 1,

x_{t+1} = r x_t (1 − x_t)    (38)

The evolution of the map is complex: it is chaotic, with a positive Lyapunov exponent, for most values of r between 3.569... and 4. There are, however, some islands of stability in between these values of r where the Lyapunov exponent is not positive and the system leaves the chaotic regime.

Figure 16: Lyapunov exponent of the logistic map for different values of r. In the interval r = [3.569..., 4] the map shows mostly chaotic behaviour.
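The curve in figure 16 can be reproduced numerically: since the derivative of the map is f'(x) = r(1 − 2x), averaging ln |f'(x_t)| along an orbit gives an estimate of the Lyapunov exponent. A short sketch, with illustrative parameter choices:

```python
import numpy as np

def logistic_lyapunov(r, x0=0.2, n_transient=1000, n_steps=10000):
    """Estimate the Lyapunov exponent of x_{t+1} = r x_t (1 - x_t)."""
    x = x0
    for _ in range(n_transient):                # discard transient behaviour
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n_steps):
        x = r * x * (1 - x)
        total += np.log(abs(r * (1 - 2 * x)))   # ln |f'(x_t)|
    return total / n_steps

for r in (2.9, 3.5, 3.7, 4.0):
    print(r, round(logistic_lyapunov(r), 3))    # r = 4 gives roughly ln 2
```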

The logistic map is an archetypal example showing that a simple nonlinear equation can lead to complex behaviour and even chaos. The population size is fixed when the output of the function is equal to the input; such a point is called a fixed point, since the population no longer changes there (x_{t+1} = x_t). As in every deterministic recurrence relation, the only variable input is its own output. The geometry of this interplay, or feedback, between input and output is better illustrated in a cobweb plot (the vertical lines in a cobweb plot show the evaluation of the function x_{t+1} = f(x_t), while the horizontal lines carry the result back to the diagonal x_{t+1} = x_t, where it serves as the next input). For r < 3 there is a single point of attraction; when r = 3 this doubles to 2 points of attraction and the population x will oscillate between these values. At r = 3.449... this doubles again to 4 points of attraction and the population approaches a cycle between these four points; this is called period-doubling. The parameter intervals between successive doublings become smaller and smaller, so the cascade quickly converges to a cycle of infinitely many points, which makes the evolution aperiodic.


Figure 17: Evolution of the logistic map for different values of r (rows, top to bottom: r = 2.9 (a-c), r = 3.1 (d-f), r = 3.5 (g-i), r = 4 (j-l)). (Left) Time evolution of x for 100 timesteps. (Middle) Normalized histogram of the distribution of x over 1e5 timesteps. (Right) x_t to x_{t+1} mapping in a cobweb plot for 200 timesteps. As r increases, the attracting set of points doubles in size at successive values of r (period-doubling). The parameter intervals between successive doublings shrink rapidly, leading to an oscillation between infinitely many points at r = 3.569...: the evolution is then aperiodic.
