
MSc Physics and Astronomy

Theoretical Physics

Master Thesis

A partition perspective of dynamical systems: Stochastic methods for the study of coarse grained deterministic systems

by

Mauricio de Jesús Fonseca Fernández

11834218

August 2019

Supervisor:

Prof. Greg Stephens

Examiner:

Prof. Edan Lerner


Abstract

Sensitivity to initial conditions implies that even if a chaotic dynamical system is deterministic, uncertainty in measurement has severe consequences: states that are very close together quickly evolve into seemingly uncorrelated trajectories. We approach this problem by studying a partitioned phase space stochastically, where groups of nearby states are treated as one symbol and we focus on the transfer probabilities between symbols. We describe the ergodic properties of chaotic attractors that allow us to use this approach, together with methods from information theory and the theory of stochastic processes, to characterize the resulting process. We use information theoretic measures to exemplify how apparently random processes can emerge from deterministic ones and how the resolution of our observations affects the processes we observe. As example models we use the logistic map, the Lorenz system and a pair of coupled unsynchronized Lorenz systems. Finally, we use Markov chain theory on transfer probability matrices to obtain optimal almost-invariant sets in the Lorenz and coupled Lorenz systems.

Contents

1 Introduction
2 Dynamical Systems
  2.0.1 Definition of a Dynamical System
  2.0.2 Long term behaviour of a Dynamical System
  2.1 Examples of Dynamical Systems
    2.1.1 The Logistic Map
    2.1.2 The Lorenz System
    2.1.3 A Pair of Coupled Lorenz Systems
3 Ergodic Theory
  3.1 Time and Ensemble Averages
    3.1.1 Invariant Measures and Ergodicity
  3.2 The Perron-Frobenius Operator
4 Information and Entropy
  4.1 Block Entropy
    4.1.1 Entropy Rate
    4.1.2 Block Entropy Derivatives
5 Stochastic Processes
  5.1 Markov Chains
    5.1.1 Irreducibility, Transient States and Periodicity
    5.1.2 The Perron-Frobenius Theorem
  5.2 Higher Order Markov Processes
    5.2.1 A Model of Higher Order Markov Chains
  5.3 Study of Higher Order Markov Processes
    5.3.1 Information Theoretic Aspects of Markov Processes
  5.4 Statistical Test to Determine the Order of a Markov Process
6 Stochastic Approach to Dynamical Systems
  6.1 Partitioning of the Phase Space
    6.1.1 Partitioning Schemes
  6.2 Approximation of the Perron-Frobenius Operator
    6.2.1 Numerical Estimation of the Perron-Frobenius Operator
    6.2.2 Properties of the Transfer Matrix
    6.2.3 Markov Partitions, Generating Partitions and Kolmogorov-Sinai Entropy
  6.3 Examples
    6.3.1 The Logistic Map
    6.3.2 The Lorenz System
    6.3.3 A Pair of Coupled Lorenz Systems
  6.4 Almost Invariant Sets
    6.4.1 Rayleigh's Theorem
    6.4.2 Finding Almost-Invariant Sets
    6.4.3 Examples
7 Conclusion
A Appendix: Measure Theory and Probability Spaces
  A.1 Measure Theory
    A.1.1 Integration in Measure Theory
  A.2 Probability Spaces
    A.2.1 Examples of Probability Spaces


List of Figures

2.1 A trajectory of the Lorenz system in the chaotic regime.
2.2 Projection of a trajectory of the pair of coupled Lorenz systems in the (x1, y1, z1) subspace (left) and the (x2, y2, z2) subspace (right).
2.3 Projection of a trajectory of the pair of coupled Lorenz systems in the (x⊥, y⊥, z⊥) subspace (left) and the (x∥, y∥, z∥) subspace (right).
4.1 Entropy as a function of p for a binary source where P(0) = p and P(1) = 1 − p.
5.1 Chi-squared test for identifying the order of a Markov process. The data was generated for a seventh order Markov process, as is correctly identified by the test when the string length L is greater than 7.
5.2 Chi-squared test for identifying the order of a Markov process when the process is not Markovian of any order. Three independent Markov processes of orders 3, 4 and 5 were generated for 50000 points each and placed one after the other to produce the data set analyzed. The test does not identify any k as the order of the underlying process in a way that is independent of the string length L.
6.1 Block entropy growth (a) and entropy gain (b) of the surjective logistic map with its Markov partition.
6.2 Invariant density of the surjective logistic map approximated with a homogeneous partition of 100 regions. The black dashed line shows the true analytic invariant density (6.15).
6.3 Block entropy growth (a), entropy gain (b) and predictability gain (c) of the surjective logistic map with a homogeneous partition of 100 regions. The grey dashed line in (b) corresponds to the KS entropy of the map found with the Markov partition, h_KS = 1.
6.4 max_i{P_ii} as a function of τ for the Lorenz system with the chosen partition.
6.5 Visualization of the partitioning into 1485 regions using a k-means clustering algorithm. Nearby points of the same color belong to the same region of the partition.
6.6 Transfer matrix acting as an approximation of the Perron-Frobenius operator for the Lorenz map.
6.7 max_i{P_ii} as a function of τ for the pair of coupled Lorenz systems with the chosen partition.
6.8 Almost invariant sets of the Lorenz attractor for q = 2.
6.9 Almost invariant sets of the Lorenz attractor for q = 4.
6.10 Four almost invariant sets for the pair of coupled Lorenz systems shown in the 3D projections of the original coordinates: (x1, y1, z1) (left) and (x2, y2, z2) (right).
6.11 Four almost invariant sets for the pair of coupled Lorenz systems shown in the 3D projections of the alternative coordinates: (x∥, y∥, z∥) (left) and (x⊥, y⊥, z⊥) (right).


Chapter 1

Introduction

Dynamical systems theory provides a robust framework for stating and studying a wide variety of phenomena, from the traditional examples in classical mechanics to biology, chemistry, the social sciences and economics. Most of the study of dynamical systems focuses on properties of deterministic phenomena: systems where the specification of a configuration, together with an intrinsic evolution law, uniquely determines the future of the system. This stands in contrast to the study of stochastic processes, where instead of deterministic descriptions we specify the probabilities of events happening.

When we observe the world that surrounds us, we often perceive manifestations of both of these seemingly opposite modes of operation. When we see a pendulum swinging, we remember the laws of classical mechanics; when we wait for a number to come out of a lottery cage, we think of the laws of probability. What comes to mind when we think about the weather? If we have been exposed to the theory of meteorology, then we might know how sensitive it is to the precise conditions that determine the atmospheric state. We know that the laws that govern weather are deterministic; the atmosphere should be describable by a deterministic system, complicated as it might be. In practice, however, we are constrained by natural limitations that do not allow us to record the state of such a complex system with enough precision to take full advantage of its deterministic nature. We are stuck somewhere in between an underlying dynamical system that is deterministic and an incomplete description of its state that forces us to speculate. Our best bet in these sorts of situations is to make use of statistics.

Here, we will study some concepts that allow us to build a bridge between the strictly deterministic and the strictly stochastic. We will see how stochastic processes can be defined out of deterministic ones, and how the perception of probabilistic phenomena can be a natural consequence of a lack of observational resolution. Chaotic dynamical systems are of particular interest, since one of their defining properties, sensitivity to initial conditions, makes even the slightest uncertainty in an observation a problem for prediction.


Chapter 2

Dynamical Systems

Consider an arbitrary system, which can be of any nature. For example, it can be mechanical, meteorological, ecological, biological or a purely mathematical abstraction. In any case, the system can be described at any given time by stating a number D of independent properties that determine uniquely and unambiguously the state of the system. Considering all possible combinations of these D quantities, we can construct what is known as the phase space of the system: a D-dimensional space for which each point corresponds to a given state of the system in question. For example:

• A simple pendulum on a plane is a mechanical system that idealizes the physical configuration of a mass suspended on a rigid rod that is only allowed to move within a plane. The state of this system can be determined by specifying the position and velocity of the mass at the end of the rod. Although both the position and velocity of a point in a plane are two-dimensional, the rigidity of the rod and the fact that it is fixed at one end impose constraints on these quantities. The mass can only be found on a circle centered at the point where the rod is fixed, with radius equal to the length of the rod, and the velocity has to be tangential to this circle. This eliminates two degrees of freedom, and the phase space turns out to be 2-dimensional instead of 4-dimensional. A common and convenient way of parameterizing this phase space is by the angular deviation from the vertical direction and the angular velocity of the mass.

• An ecosystem is an ecological system whose state is determined by the number of individuals of each of the species found in a given region. If there are n different species in the region where the ecosystem is defined, then the phase space is n-dimensional. Since we cannot have a negative or non-integer number of individuals of a species, the phase space is the space of all n-dimensional non-negative vectors with integer entries.

• An animal as a whole can, in principle, be thought of as a system whose state could be determined if we had absolute knowledge of each of its components. Of course, an animal is a highly complex system, and its phase space is inevitably extremely high dimensional and very difficult to describe, for various reasons. For example, it is not necessarily clear which quantities are appropriate to determine, nor what the constraints on them are. Nevertheless, it is possible to focus on a given aspect of an animal in order to define an incomplete but useful description of its state. For instance, one could think of a phase space describing the possible postures of an animal that includes no description of the state of its nervous system. We can still use such a model to learn about how the animal moves.

2.0.1 Definition of a Dynamical System

Given a system with its appropriate phase space, as described above, one can go a step further by describing how the system evolves in time. This is done by defining an evolution law on the phase space that deterministically specifies what the state of the system is going to be in the future, given where the system is in the present. More precisely, we define a dynamical system as the triple {T, X, φ^t}, where T is a time interval, X is the phase space of the system and φ^t : X → X is a family of functions on phase space parameterized by t ∈ T. The functions φ^t define what is known as the evolution operator of the dynamical system. Quite naturally, these functions must obey the properties φ^0 = 1 (1 here being interpreted as the identity operator on X: φ^0 x = 1x = x for any x ∈ X) and φ^{t+s} = φ^t ∘ φ^s when the evolution law does not change in time. Systems where the evolution law does not change are called autonomous. It is also possible to consider non-autonomous dynamical systems, where the evolution law depends on time and the second property does not apply [1].

There are several ways of defining the evolution operator of a system. To see how it is often done, it is useful to understand the two most important classes of dynamical systems that are usually studied. These are continuous and discrete time dynamics [2].

• Discrete time dynamical systems: There are dynamical systems that are fully specified by determining the evolution for an integer unit of time. This means that we define an evolution operator

$$\phi^1 = f \qquad (2.1)$$

If one is interested in the evolution of a point after k time steps, the appropriate operator is obtained by iterating φ^1 k times:

$$\phi^k = \phi^{k-1} \circ \phi^1 = \phi^{k-1} \circ f = \dots = f^k \qquad (2.2)$$

For this class of dynamical system, T is a set of integer values. If φ^1 is such that an unambiguous inverse can be found (i.e. f is invertible), then T can include negative integers; it includes only non-negative integers otherwise.

• Continuous time dynamical systems: These dynamical systems are defined continuously in time instead of having a fixed unit of time evolution. The most common way of specifying such a system is by defining a system of differential equations on the phase space variables,

$$\dot{\vec{x}} = \vec{f}(\vec{x}) \qquad (2.3)$$

Taking any point of the phase space together with this system of differential equations, one can define an initial value problem. The solution of this problem defines the evolution operator for this point of phase space.

A standard result in the theory of ordinary differential equations states that for a system of ordinary differential equations such as (2.3), if \vec{f} is smooth in some neighborhood of a point x_0, then the solution to the equation exists and is unique in this neighborhood [1]. In the cases where \vec{f} is smooth everywhere, which is the case in most dynamical systems of physical interest, each initial condition gives rise to a unique solution to the problem.

We define the orbit of a point x_0 ∈ X for either class of dynamical system as

$$\mathrm{Or}(x_0) = \{x \in X \mid x = \phi^t x_0;\ t \in T\} \qquad (2.4)$$

The orbit of x_0 is also known as its trajectory, and it is the collection of points in phase space through which the evolution takes a system starting at x_0. If the system is invertible, one can also include the points in the past of x_0, and so the orbit becomes the points reached by the system from x_0 together with the points that eventually lead to x_0.

2.0.2 Long term behaviour of a Dynamical System

A typical question one asks about a dynamical system is how it behaves in the long term. For arbitrary initial conditions, one can usually expect to observe transitory behaviour that is not representative of how the system behaves when observed for a sufficiently long time. Broadly speaking, there are four kinds of long time limit behaviours [2]:

• Unbounded behaviour: One or many of the variables of the system increase without bound, so the state of the system "escapes" to infinity in the infinite time limit.

• Fixed points: The system evolves into a state that evolves into itself, so the system reaches this point in phase space and stays there forever.

• Cyclic behaviour: The system converges to a limit cycle, a closed trajectory in phase space, and remains on this trajectory forever.


• Chaotic behaviour: The system remains in a bounded region of phase space but never reaches a fixed point or a limit cycle. This means that even in the infinite time limit, the system never revisits any state in the bounded region it is contained in. The attracting sets of chaotic behaviour are known as strange attractors and are capable of this particular behaviour because they are fractal sets. Chaotic behaviour is characterized by sensitivity to initial conditions, meaning that two nearby points in phase space evolve into diverging trajectories. This is quantified by the concept of Lyapunov exponents, which measure the expansion rate of characteristic directions in phase space. Chaos is characterized by positive Lyapunov exponents, identifying the existence of a direction of distance expansion under the evolution of the system.

The long time behaviour of a dynamical system can depend on the initial conditions: there may be different attracting sets in phase space. The set of initial conditions that converge to a given attracting set is known as that attracting set's basin of attraction. If there exists only one attracting set and its basin of attraction is the whole phase space, we call it a global attractor.

2.1 Examples of Dynamical Systems

In this section we will introduce some specific examples of dynamical systems that are going to be used as model systems in this work.

2.1.1 The Logistic Map

The logistic map is a widely studied discrete time dynamical system defined as

$$x_{i+1} = \lambda x_i (1 - x_i) \qquad (2.5)$$

where the variable x is defined on the domain [0, 1] and λ is a parameter in the range [0, 4]. The map is used extensively as a prototype for understanding different classes of behaviour that occur in the study of dynamical systems. By increasing the value of the parameter λ from 0 to 4, one encounters a sequence of bifurcations (qualitative changes in the possible long term behaviours of the system). The system goes from having a single stable fixed point, to a pair of fixed points, one stable and the other unstable, and then enters what is known as a period doubling cascade: a range of values of λ in which a stable cycle exists and undergoes a sequence of period doublings as the parameter increases. The period doubling eventually reaches a saturation point where the period of the cycle tends to infinity; beyond that parameter value one encounters chaotic behaviour. This phenomenon is known as the period doubling route to chaos and can be observed in several other maps [3].
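As a quick numerical illustration (a minimal sketch with illustrative names, not part of the thesis's own analysis), the map can be iterated directly; in the chaotic regime two nearby initial conditions decorrelate within a few dozen steps:

```python
import numpy as np

def logistic_orbit(x0, lam, n):
    """Return the first n points of the orbit of x0 under x -> lam*x*(1-x)."""
    orbit = np.empty(n)
    x = x0
    for i in range(n):
        orbit[i] = x
        x = lam * x * (1 - x)
    return orbit

# Sensitivity to initial conditions at lam = 4 (the surjective logistic map):
a = logistic_orbit(0.200000, 4.0, 50)
b = logistic_orbit(0.200001, 4.0, 50)
print(np.abs(a - b)[::10])  # the separation grows to order one
```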

2.1.2 The Lorenz System

The Lorenz system is a three-dimensional continuous time dynamical system, often used as a prototype for studying properties of deterministic chaos. It is defined by the following system of differential equations

$$\begin{cases} \dot{x} = \sigma(y - x) \\ \dot{y} = x(\rho - z) - y \\ \dot{z} = xy - \beta z \end{cases} \qquad (2.6)$$

where σ, ρ and β are parameters. Figure 2.1 shows a trajectory of the system for parameter values σ = 10, ρ = 28 and β = 8/3, for which the system exhibits chaotic behaviour. For the vast majority of initial conditions (i.e. excluding fixed points and their stable manifolds), the system evolves in such a way that it asymptotically approaches a fractal set known as the Lorenz attractor, which is an example of a strange attractor.
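Trajectories like the one in Figure 2.1 can be generated with any standard ODE integrator; a minimal sketch using scipy's solve_ivp, with the parameter values above (other names are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (2.6)."""
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

# Integrate from an arbitrary initial condition; sampling only t >= 20
# discards the transient, so the points lie close to the attractor.
sol = solve_ivp(lorenz, (0.0, 120.0), [1.0, 1.0, 1.0],
                t_eval=np.linspace(20.0, 120.0, 10_000), rtol=1e-9)
trajectory = sol.y.T  # shape (10000, 3): points (x, y, z) near the attractor
```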


Figure 2.1: A trajectory of the Lorenz system in the chaotic regime.

2.1.3 A Pair of Coupled Lorenz Systems

Given two dynamical systems, we say we have coupled them if we introduce coupling functions into their evolution laws, meaning that we introduce some way in which the state of one of the systems influences the other. Of course, what we obtain is just another dynamical system, with a dimension equal to the sum of the dimensions of the constituent systems. When coupling systems, a phenomenon called synchronization can occur: the component systems behave in such a way that the output of one of them is a given function of the output of the other. In the case where the two component systems are identical, this can manifest itself as both systems, viewed separately, being in the same state, or being lagged a fixed amount of time, among other possibilities [4]. It is certainly interesting when a coupled dynamical system does not synchronize, meaning that its behaviour becomes essentially different from the behaviour of two decoupled systems. We are going to use the coupling of two identical Lorenz systems proposed in [5], which is proven not to synchronize provided that the initial conditions of the component systems, viewed separately, are not the same. This pair of coupled Lorenz systems is defined as

                   ˙ x1 = σ(y1− x1) ˙ y1 = x1(ρ − z1) − y1+ h12(x1, z1, x2, z2, k) ˙ z1 = x1y1− βz1+ g12(x1, y1, x2, y2, k) ˙ x2 = σ(y2− x2) ˙ y2 = x21(ρ − z2) − y2+ h21(x1, z1, x2, z2, k) ˙ z2 = x2y2− βz2+ g21(x1, y1, x2, y2, k) (2.7)

where k is a coupling parameter and hij and gij are the coupling functions

hij= − √ k(xiz1+ xjz2) + xizi (2.8) gij = √ k(xiy1+ xjy2) + xiyi (2.9)

The behaviour of this system for suitable parameters is known to be hyperchaotic, which refers to the existence of more than one positive Lyapunov exponent, two in the case of this system [5]. Figure 2.2 shows three-dimensional projections of a trajectory of the coupled Lorenz systems with


Figure 2.2: Projection of a trajectory of the pair of coupled Lorenz systems in the (x1, y1, z1) subspace (left) and the (x2, y2, z2) subspace (right).

the parameters set to the same values as the single Lorenz system above and k = 1. Consider the following change of coordinates

$$\begin{cases}
x_\perp = x_1 - x_2 \\
y_\perp = y_1 - y_2 \\
z_\perp = z_1 - z_2 \\
x_\parallel = x_1 + x_2 \\
y_\parallel = y_1 + y_2 \\
z_\parallel = z_1 + z_2
\end{cases} \qquad (2.10)$$

Substitution of the variables leads to

$$\begin{cases}
\dot{x}_\perp = \sigma(y_\perp - x_\perp) \\
\dot{y}_\perp = x_\perp(\rho - z_\perp) - y_\perp \\
\dot{z}_\perp = x_\perp y_\perp - \beta z_\perp \\
\dot{x}_\parallel = \sigma(y_\parallel - x_\parallel) \\
\dot{y}_\parallel = x_\parallel(\rho - z_\parallel) - y_\parallel \\
\dot{z}_\parallel = x_\parallel y_\parallel - \beta z_\parallel
\end{cases} \qquad (2.11)$$

Notice that this linear change of coordinates leads to a decoupled pair of Lorenz systems in the new variables. Figure 2.3 shows the trajectory of Figure 2.2 in the new coordinates.
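Because (2.10) is linear, a stored trajectory can be transformed in one step; a minimal sketch, assuming the trajectory is an array whose columns are (x1, y1, z1, x2, y2, z2):

```python
import numpy as np

def to_perp_parallel(traj):
    """Apply the change of coordinates (2.10) to a trajectory array with
    columns (x1, y1, z1, x2, y2, z2); returns an array with columns
    (x_perp, y_perp, z_perp, x_par, y_par, z_par)."""
    s1, s2 = traj[:, :3], traj[:, 3:]
    return np.hstack([s1 - s2, s1 + s2])
```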


Figure 2.3: Projection of a trajectory of the pair of coupled Lorenz systems in the (x⊥, y⊥, z⊥) subspace (left) and the (x∥, y∥, z∥) subspace (right).


Chapter 3

Ergodic Theory

When working with dynamical systems, one is often concerned with the long term behaviour of the system under consideration, that is, the behaviour of the system after transients decay. Some systems have properties that allow certain questions regarding long term behaviour to be formulated in terms of the stochastic characteristics of the evolution of the system on its phase space. Ergodic theory relates to these sorts of properties and to the question of which systems are appropriate to study in this manner, and to what extent. It serves as a link between the theory of dynamical systems and the measure-theoretic approach to probability theory. Appendix A contains an overview of important concepts of measure theory used in this chapter.

3.1 Time and Ensemble Averages

Consider a dynamical system defined by f : X → X, where X is the phase space of the system and f its evolution law. Also consider a function of state φ : X → R, which we call an observable. For some initial condition x_0 we define

$$\overline{\phi}_n(x_0) = \frac{1}{n}\sum_{i=0}^{n-1} \phi\big(f^i(x_0)\big) \qquad (3.1)$$

as the time average of the observable φ over the first n points of the orbit of x_0. If the limit n → ∞ exists, then we can define

$$\overline{\phi}(x_0) = \lim_{n\to\infty} \overline{\phi}_n(x_0) \qquad (3.2)$$

and call this the time average over the orbit of x_0.

Consider now the measure space (X, Σ, µ) consisting of the phase space X along with a σ-algebra Σ and some arbitrary probability measure µ. We define the ensemble average of φ with respect to µ as

$$\langle \phi \rangle = \int_X \phi \, d\mu \qquad (3.3)$$

which is the average over phase space of φ weighted by the probability measure µ. Define the Dirac-δ measure at point x ∈ X as [6]

$$\delta_x(A) = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases} \qquad (3.4)$$

Consider a dynamical system as defined previously. (X, Σ) is a measurable space for which a measure can be defined. For some initial condition x_0 define the following sequence of probability measures

$$\mu_n(x_0) = \frac{1}{n}\sum_{i=0}^{n-1} \delta_{f^i(x_0)} \qquad (3.5)$$


Notice this definition looks like a time average of Dirac-δ measures over the orbit of x_0 under f. We can use any of the measures in this sequence to define measure spaces (X, Σ, µ_n(x_0)). We are interested in the convergence of the sequence in (3.5). We will be concerned with convergence in the so-called "weak-star topology" sense: a sequence of measures µ_n converges to the measure µ if ∫φ dµ_n → ∫φ dµ for any function of state φ [6]. Writing out this integral explicitly for some arbitrary φ for the measures in (3.5) gives

$$\int \varphi \, d\mu_n(x_0) = \int \varphi \, d\Big[\frac{1}{n}\sum_{i=0}^{n-1}\delta_{f^i(x_0)}\Big] = \frac{1}{n}\sum_{i=0}^{n-1}\int \varphi \, d\delta_{f^i(x_0)} = \frac{1}{n}\sum_{i=0}^{n-1}\varphi\big(f^i(x_0)\big) = \overline{\varphi}_n(x_0)$$

where in the last step we used the definition in (3.1). We see that the convergence of the sequence of measures µ_n(x_0) is equivalent to the existence of the time average of any arbitrary function of state φ.

Notice that the measure to which the sequence (3.5) converges, µ(x_0), still depends on x_0. Different initial conditions can lead to convergence to different measures. For a given measure µ, we call the set of all initial conditions x_0 that lead to convergence to µ its basin of attraction. We are going to concern ourselves only with measures whose basins of attraction have non-zero Lebesgue measure, which are known as physical measures.

3.1.1 Invariant Measures and Ergodicity

For a dynamical system defined by f as before, we call a measure µ on its phase space X an invariant measure with respect to f if the following property holds for every measurable set in the measure space (X, Σ, µ):

$$\mu(A) = \mu\big(f^{-1}(A)\big) \qquad (3.6)$$

where f^{-1}(A) = {x | f(x) ∈ A}. Measures with this property, as we will see, have many important features that can be used to study and characterize the behaviour of f.

An invariant set under f is defined as a set A ⊆ X such that f^{-1}(A) = A, which means that all elements that are sent into A under the evolution are already in A. We say a probability measure µ is ergodic with respect to f if for every invariant set A, either µ(A) = 1 or µ(A) = 0. This means that for an ergodic measure, all invariant sets that do not carry the full measure of the phase space must have measure 0. A consequence of this is that no set with non-zero measure can be invariant unless it captures the whole measure of the phase space. If ergodicity holds, no non-full-measure set can isolate states completely from the rest of the phase space: there is always a non-zero probability of escaping any subregion of phase space [6].

One of the most important results of ergodic theory is what is known as Birkhoff's ergodic theorem. It states the following:

• If an invariant measure µ under f exists and φ is any function of state, then the limit in the definition (3.2) of the time average at x_0 exists for µ-almost¹ every initial condition x_0.

• If µ is an invariant and ergodic probability measure under f, then for any function of state φ,

$$\overline{\varphi} = \langle \varphi \rangle \qquad (3.7)$$

Notice that the first result also implies that the time average at x_0 equals the time average at f(x_0), so the time average does not depend on where in the orbit we start, but it might still depend on the orbit (if y_0 is not contained in the orbit of x_0, then the time averages at x_0 and y_0 might differ). The second result implies that if the invariant measure µ is also ergodic, then the time average becomes independent of the initial condition altogether. Notice that this second result states the equivalence between time averages and ensemble averages. This is the most important property of ergodicity and the one that is going to be exploited in this thesis.

¹Saying a property holds "µ-almost" means that the property holds for all points except possibly on a set of measure zero.


3.2 The Perron-Frobenius Operator

The Perron-Frobenius operator P is an infinite-dimensional linear operator used to study dynamical systems in a measure-theoretic framework; it describes the evolution of probability distributions over the phase space of a dynamical system [7]. If φ^t : X → X is the evolution law for time t of a dynamical system defined on its phase space X and we induce a measure space (X, Σ, µ) over X, then P^t is an operator that acts on density functions on X. The induced measure µ in this discussion can be arbitrary up to non-singularity with respect to φ^t: µ ∘ φ^t << µ (see A.1.1 for the definition of absolutely continuous measures, symbolized by <<), which means that φ^t does not map sets of non-zero µ-measure to sets of zero µ-measure. We say ρ is a density function on X if it represents a probability distribution of the possible states of the system treated as a random variable. Being a probability distribution, ρ(x) ≥ 0 for all x ∈ X and

$$\int_X \rho \, d\mu = 1 \qquad (3.8)$$

The Perron-Frobenius operator for φ^t is defined implicitly as the linear operator that satisfies, for all A ∈ Σ and all density functions ρ on X,

$$\int_A P^t\rho \, d\mu = \int_{\phi^{-t}(A)} \rho \, d\mu \qquad (3.9)$$

In words, it is the operator that takes ρ to the distribution P^t ρ that, when integrated over a measurable set A, gives the same value as the integral of ρ over the "past" of A, the set φ^{-t}(A) = {x | φ^t(x) ∈ A}. The action of this operator can be thought of in the following way: imagine an infinite ensemble of identical systems, each in a different initial state, such that if we choose one of the systems of the ensemble at random, the probability distribution of its initial state is given by ρ. If all of these systems are evolved for a time t with φ^t, then the probability distribution of states of the ensemble is going to be P^t ρ.

Notice that P^t acts as an evolution law for a dynamical system that is now defined over the space of probability distributions on X. A fixed point of this dynamical system would satisfy

$$\rho^* = P^t \rho^* \qquad (3.10)$$

for any t, meaning that ρ* is a probability distribution that remains the same under the evolution: we call this an invariant distribution. This is very reminiscent of the concept of an invariant measure, seen in (3.6). Indeed the two concepts are related, as we can define an induced measure π as

$$\pi(A) = \int_A \rho^* \, d\mu \qquad (3.11)$$

for every A ∈ Σ. Since this distribution obeys (3.10),

$$\pi(A) = \int_A \rho^* \, d\mu = \int_{\phi^{-t}(A)} \rho^* \, d\mu = \pi\big(\phi^{-t}(A)\big) \qquad (3.12)$$

This means that, starting from an arbitrary measure µ (up to non-singularity with respect to φ^t), finding an invariant distribution with respect to µ induces an invariant measure π through (3.11).


Chapter 4

Information and Entropy

Information theory is a discipline that started out as a means to understand communication: it was meant to quantify the properties of the messages that can be transmitted through some signal. Through its development it has turned out to be a quite general language for approaching a diverse set of problems. Here we are going to explore some specific parts of information theory that have been found useful in the study of dynamical systems.

In the information theoretic approach we will understand a source as anything that provides us with some signal according to a set of rules inherent to the source. We will think of the source as producing a sequence of symbols which can take values out of an alphabet A, which we take to be a countable set. A common example is the binary alphabet where A = {0, 1}, which means that any signal produced from the source is represented as a sequence of 0’s and 1’s.

Information theory is closely tied to statistics, in the sense that it is a means to study and quantify certain properties resulting from the probability of observing a symbol or sequence of symbols in the signal produced by a source. We define a probability distribution for the symbols produced by a source as P : A → [0, 1], assigning to each symbol in the alphabet the probability that it is produced by the source. Of course, we require $\sum_{s\in A} P(s) = 1$. The information of symbol s is defined as

$$I(s) = -\log_k P(s) = \log_k\Big[\frac{1}{P(s)}\Big] \qquad (4.1)$$

The base of the logarithm, k, determines the units in which information is quantified and can be chosen to facilitate the interpretation of this quantity given the alphabet in use, but it can also be chosen arbitrarily. When working with binary sources, the base of the logarithm is commonly chosen to be 2, in which case the units are known as bits.

I(s) is a measure of the knowledge we gain if we observe symbol s coming out of the source. Notice that the larger P(s) is, the lower I(s). This is usually interpreted as the fact that obtaining symbol s when P(s) is large is "less surprising" and therefore provides less knowledge about the process.

We now define the concept of entropy in information theory, also known as Shannon entropy. This is a scalar quantity defined for the probability distribution P of the source as

$$H(P) = -\sum_{s\in A} P(s) \log_k P(s) \qquad (4.2)$$

which is the average of I(s) under the probability distribution P. Symbols with zero probability pose no problem, as we take 0 × log 0 = 0, made plausible by the limit lim_{x→0} x log x = 0.

The entropy quantifies the expected "surprise" to be experienced when observing the output symbols. It is often described as a quantification of the uncertainty or randomness of the source [8]. To understand this, consider the example of a binary source with P(0) = p and P(1) = 1 − p. Figure 4.1 shows a plot of H(P) as a function of p. Notice that H is maximal, with H = 1 bit, when the distribution is such that P(0) = P(1) = 1/2, which is the binary distribution for which we


Figure 4.1: Entropy as a function of p for a binary source where P(0) = p and P(1) = 1 − p.

have maximum uncertainty regarding the output of the source: a fair coin. At the other extreme, when p = 0 or p = 1, we have a distribution that determines the result deterministically, either always 1 or always 0, respectively. In these cases there is no uncertainty; we gain no information when we observe the output of the source because we already know what the output is going to be. In the intermediate cases the uncertainty lies somewhere in between. These cases can be thought of as coins with different degrees of bias.
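A minimal sketch of the curve in Figure 4.1 (the function name is illustrative):

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy, in bits, of a binary source with P(0)=p, P(1)=1-p."""
    if p in (0.0, 1.0):
        return 0.0  # deterministic source: no uncertainty
    return float(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))

print(binary_entropy(0.5))  # 1.0, the fair-coin maximum
print(binary_entropy(0.9))  # ~0.469, a biased coin
```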

4.1 Block Entropy

We can further expand the usefulness of the concept of entropy by considering the joint distribution of several symbols, which provides us with probability distributions for sequences of symbols. The Shannon entropy defined in (4.2) is sufficient when the source's outputs are independent of each other: the symbol the source provides at any given time does not depend on the past symbols. When this is not the case, when there are correlations between different symbols, we can gain more insight about the source by considering what are known as block entropies, the entropies of sequences of L symbols, defined as [9]

$$H_L(P_L) = -\sum_{s_1,\dots,s_L \in A} P_L(s_1, \dots, s_L) \log_k P_L(s_1, \dots, s_L) \qquad (4.3)$$

where P_L(s_1, ..., s_L) is the probability distribution of sequences of L symbols within the signals provided by the source. We define H_0 = 0.

In the context of dynamical systems, as we will see, these quantities are of particular importance because the way H_L changes as L grows provides insight into how the dynamical system depends on its past.

A property of H_L as a function of L is that it is non-decreasing: H_L(P_L) ≤ H_{L+1}(P_{L+1}). This is because adding a new symbol never makes the uncertainty of a sequence decrease. At most it has no influence, but in general it tends to increase the entropy, since it adds ∼|A|^L possible sequences. The maximum information that can be added per observation, achieved when all resulting sequences are equally likely, is log_k |A|, and therefore we find that

$$H_L(P_L) \leq L \log_k |A| \qquad (4.4)$$

4.1.1 Entropy Rate

H_L is relevant to us because it allows us to quantify the uncertainty of the source as more and more symbols are produced. As stated above, these quantities are of interest when the source has some sort of memory: the production of symbols is correlated with the past symbols produced. An important characteristic of a source is what is known as its entropy rate, which quantifies its


intrinsic randomness, the uncertainty that remains even when an infinite number of past symbols is known. The entropy rate is defined as [9]

$$h = \lim_{L\to\infty} \frac{H_L(P_L)}{L} \qquad (4.5)$$

and is measured in [information units]/symbol (bits/symbol in the case k = 2). The existence of this limit, which is guaranteed for sources with stationary distributions [9], means that for large L the growth of H_L approaches linear behaviour with slope h. In other words, for large L, H_L ∼ Lh. The maximum value of h is log_k |A|, achieved when the source's symbol production is uniformly distributed with no correlations, in which case expression (4.4) becomes an equality.

The same observation allows us to see that the block entropy can be decomposed as [10]

$$H_L = hL + H_L^{(s)} \qquad (4.6)$$

where H_L^{(s)} captures the part of the block entropy that changes more slowly than linearly in L (we call this subextensive change). This means that lim_{L→∞} H_L^{(s)}/L = 0; this component is responsible for the possible non-linear behaviour of H_L for small L, which decays as L becomes large, giving way to the asymptotic linear growth with slope h. Analyzing this subextensive component provides insight into the structural properties of the source.

4.1.2 Block Entropy Derivatives

In order to study the approach of the block entropy to its linear growth with slope h, it is useful to consider its discrete derivatives with respect to L. A discrete derivative is simply the rate of change of a function defined on a discrete domain: if G : Z → R, then its discrete derivative is ∆G(n) = G(n) − G(n − 1).

The first discrete derivative, called the entropy gain, is defined as

$$\Delta H_L = h_L = H_L - H_{L-1} \qquad (4.7)$$

and is the amount of entropy gained when going from length-(L−1) sequences to length-L sequences. The notation h_L suggests that this is a measure that approximates an entropy rate for sequences of length L. It is indeed the case that h = lim_{L→∞} h_L. Notice that this limit is different from the definition (4.5), but the two converge to the same value for stationary probability sources. Operationally, it is better to use the limit of h_L, as it converges faster than (4.5). We define h_0 = log_k |A| and interpret it as the gain of entropy in going from no measurements to the general assumption of a uniformly distributed source with alphabet A.

The second discrete derivative, known as the predictability gain, is the change in entropy gain

$$\Delta^2 H_L = \Delta h_L = h_L - h_{L-1} \qquad (4.8)$$

As h_L is a measure of the intrinsic randomness of the source when looking at sequences of length L, ∆²H_L is the change in this intrinsic randomness as L increases. ∆²H_0 is not defined. Given that h_L converges to the entropy rate h, lim_{L→∞} ∆²H_L = 0. We also have that h_L is monotonically non-increasing for stationary sources, so the predictability gain is never positive.
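Given block entropy estimates, for instance from the block_entropy sketch above, both derivatives are simple differences; a minimal sketch assuming a binary symbol sequence stored in `symbols`:

```python
import numpy as np

L_max = 10
H = [0.0] + [block_entropy(symbols, L) for L in range(1, L_max + 1)]  # H_0 = 0
h = [np.log2(2)] + [H[L] - H[L - 1] for L in range(1, L_max + 1)]     # h_0 = log_k|A|, then (4.7)
d2H = [h[L] - h[L - 1] for L in range(1, L_max + 1)]                  # predictability gain (4.8)
# h[L] decreases towards the entropy rate h; d2H approaches 0 from below.
```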


Chapter 5

Stochastic Processes

The theory of dynamical systems focuses mainly on deterministic processes, phenomena for which we have an unambiguous evolution rule that determines completely what the system’s state is going to be in the future given its state in the present instant. As we have seen, these processes are often described by differential equations for continuous time or discrete maps if time is discretized. We are turning now to the description of processes for which we do not have a deterministic evolution law, but instead we have a stochastic description of the evolution. Our knowledge of the evolution is limited to a description of the probability of evolving into the possible states given what we know about the system in the present.

A discrete time stochastic process is a stochastic process where the evolution proceeds in discrete time steps. The process can be described by a set of random variables X_i, where i denotes the discretized time and each variable can take a value from a finite set of states S = {0, ..., N − 1}. The set of possible states does not need to be finite, but for what is going to be discussed here it is sufficient to consider a finite state set.

The probability of observing a particular realization of the stochastic process can be written as [11]

$$\mathbb{P}\{X_0 = i_0, X_1 = i_1, \dots, X_T = i_T\} \qquad (5.1)$$

The complete description of the stochastic process consists of determining the value of these probabilities for all T in the range of interest and all sequences (i_0, i_1, ..., i_T) of states. To simplify the notation, we write P{X_0 = i_0, X_1 = i_1, ..., X_T = i_T} = P(i_0, i_1, ..., i_T). If the initial probability distribution of states, P{X_0 = i_0}, is known, another way of writing down (5.1) is by using conditional probabilities, as

$$\mathbb{P}(i_0, i_1, \dots, i_T) = \mathbb{P}(i_0)\,\mathbb{P}(i_1|i_0)\,\mathbb{P}(i_2|i_0, i_1)\cdots\mathbb{P}(i_T|i_0, i_1, \dots, i_{T-1}) \qquad (5.2)$$

where P(i_n | i_0, ..., i_{n−1}) is the conditional probability of observing X_n = i_n given that X_0 = i_0, ..., X_{n−1} = i_{n−1}. Certain processes have properties that are easily made explicit in terms of these conditional probabilities; they simplify expression (5.2) and therefore make the description and study of such processes easier than for a general stochastic process. The simplest and most studied processes of this kind are known as Markov chains.

5.1 Markov Chains

A Markov chain is a stochastic process that possesses what is known as the Markov property, which is the condition that the probability distribution of the state of the system at some time t depends only on its state at time t − 1 and not on any other past state. This is also commonly referred to as the system being "memoryless". It is written as [12]

$$\mathbb{P}(i_n|i_0, i_1, \dots, i_{n-1}) = \mathbb{P}(i_n|i_{n-1}) \qquad (5.3)$$

Thus, for a Markov chain, (5.2) can be written as

$$\mathbb{P}(i_0, i_1, \dots, i_T) = \mathbb{P}(i_0)\,\mathbb{P}(i_1|i_0)\,\mathbb{P}(i_2|i_1)\cdots\mathbb{P}(i_T|i_{T-1}) \qquad (5.4)$$


Additionally, we are going to work under the assumption that the conditional probabilities do not depend on time. This leads to what are known as time-homogeneous Markov chains.

A useful consequence of this description is that for a Markov chain we can write the evolution probabilities as a matrix, known as the transition matrix of the process. It is defined as

$$P_{ij} = \mathbb{P}(j|i) \qquad (5.5)$$

so that the entry (i, j) of the matrix is the probability of "transferring" from state i to state j. Notice that for all i,

$$\sum_{j\in S} P_{ij} = \sum_{j\in S} \mathbb{P}(j|i) = 1 \qquad (5.6)$$

since the system's state must go somewhere. This means that all rows of P sum to 1, making it what is known as a stochastic matrix: a non-negative matrix (all entries positive or zero) whose rows sum to 1.

We can also define a probability vector π^(n) = (P{X_n = 0}, P{X_n = 1}, ..., P{X_n = N − 1}), a vector whose i-th entry is the probability of observing the state labeled i − 1 at time n. In general, for a vector to be a probability vector it must be non-negative and its entries must sum to 1. Furthermore, as a matter of convention, we are going to use probability vectors as row vectors. Knowing the initial probability distribution is equivalent to knowing π^(0), so (5.4) can be rewritten as

$$\mathbb{P}(i_0, i_1, \dots, i_T) = \pi^{(0)}_{i_0} P_{i_0 i_1} P_{i_1 i_2} \cdots P_{i_{T-1} i_T} \qquad (5.7)$$

Notice that since P(i)P(j|i) = P(i, j), we have

$$\big(\pi^{(n)} P\big)_j = \sum_{i\in S} \pi^{(n)}_i P_{ij} = \pi^{(n+1)}_j \qquad (5.8)$$

Notice this defines a linear deterministic evolution law for the probability vectors. We can interpret this as a dynamical system on the space of probability vectors with the discrete time map (5.8). Consider a fixed point of this system,

$$\pi^* P = \pi^* \qquad (5.9)$$

In general, (5.8) can be understood as the statement that at each step the probability of finding the system in some particular state gets redistributed by the transition matrix. The fixed point equation (5.9) defines a π* that does not change under this redistribution. For this reason, π* is called an invariant distribution of the system. It is natural now to ask under which conditions such a distribution exists and whether or not it is unique for a given transition matrix P. Before addressing this question, it is necessary to define some concepts that help classify Markov chains into categories sharing properties relevant to the answer.
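A minimal sketch of both steps, with illustrative names: estimating P from an observed state sequence by counting transitions, and locating π* by iterating the map (5.8):

```python
import numpy as np

def transition_matrix(states, n_states):
    """Estimate the row-stochastic transition matrix from a state sequence."""
    P = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):
        P[i, j] += 1.0
    rows = P.sum(axis=1, keepdims=True)
    return np.divide(P, rows, out=np.zeros_like(P), where=rows > 0)

def invariant_distribution(P, n_iter=5_000):
    """Iterate pi <- pi P from the uniform vector; converges to pi*
    for an irreducible aperiodic chain (see section 5.1.2)."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        pi = pi @ P
    return pi
```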

A Markov chain is reversible if it obeys what is known as detailed balance, which is the condition that for all i, j [13]

$$\pi^*_i P_{ij} = \pi^*_j P_{ji} \qquad (5.10)$$

We can understand the quantity π*_i P_{ij} as the flow from state i to state j. If a Markov chain obeys detailed balance, the flow from any state i to any state j is the same as the flow in reverse, and hence the process would look the same if studied backwards in time, once it has converged to its invariant distribution.

5.1.1 Irreducibility, Transient States and Periodicity

Consider

$$p_m(i,j) = \sum_{k_1,\dots,k_{m-1}\in S} \mathbb{P}(k_1|i)\,\mathbb{P}(k_2|k_1)\cdots\mathbb{P}(j|k_{m-1}) \qquad (5.11)$$

which is the probability of starting in state i and ending in state j after m iterations, where we have used the Markov property. Using the transition matrix this becomes

$$p_m(i,j) = \sum_{k_1,\dots,k_{m-1}\in S} P_{i k_1} P_{k_1 k_2} \cdots P_{k_{m-1} j} = \big(P^m\big)_{ij} \qquad (5.12)$$

This means that, for example, the transition matrix of the Markov chain given by a double time step is given by P².

We say that two states i, j of the Markov chain communicate if there are integers m and n such that p_m(i, j) > 0 and p_n(j, i) > 0; in words, there is a finite number of steps in which it is possible to go from i to j and vice versa [14]. If this applies, we write i ↔ j. It is easy to prove that ↔ is an equivalence relation, which means that it is symmetric (i ↔ j ⟹ j ↔ i), transitive (i ↔ k ∧ k ↔ j ⟹ i ↔ j) and reflexive (i ↔ i, trivially, using p_0(i, i) = 1). Equivalence relations always allow the separation of a set into equivalence classes, such that every element belongs to one and only one of the classes. In this case we call these classes communication classes [13]. This means that every state of a Markov chain belongs to a communication class, which it shares with all the states it communicates with.

A communication class T is called a transient class if there exists a state i ∈ T such that for some state j ∉ T, p_1(i, j) > 0 but, for all integers m, p_m(j, i) = 0. This means that the system can transition from T to another communication class but cannot come back. Essentially, if a system starts in T, it will eventually end up outside of it and never return. All states that belong to a transient class are called transient states.

A communication class R is called a recurrent class if it is not a transient class. It has the property that for all states i ∈ R and all possible states j of the Markov chain, if p_m(i, j) > 0 for some integer m, then i ↔ j. In other words, if a Markov chain starts in R, it will always remain in R; if it starts in some transient class T and transitions at some point into R, then it will never leave R. A Markov chain is said to be irreducible if it possesses only one communication class. This means that for every pair of states i, j we have i ↔ j. Notice this property is very similar to the definition of ergodicity of a dynamical system: from any starting state of the system we can reach any other in a finite number of steps. If we are interested only in the long time behaviour of a Markov chain that has transient states and only one recurrent class, we can construct an irreducible Markov chain by ignoring the transient states, as we know that the system will never be in a transient state after enough time has passed. This can be compared with letting a dynamical system evolve for a long time after its initial state in order to study only the behaviour on the attractors. Furthermore, if a Markov chain has several recurrent communication classes, one can treat each of them separately as an irreducible Markov chain by ignoring all states not in the recurrent class. This is similar to recognizing the ergodic components of a dynamical system when the system is not ergodic as a whole.

A general transition matrix for a Markov chain with recurrent classes R_1, ..., R_r and set of transient classes denoted by T can be written as [14]

$$P = \begin{pmatrix}
P_1 & 0 & \cdots & 0 & 0 \\
0 & P_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & P_r & 0 \\
 & & S & & Q
\end{pmatrix}$$

where P_i is the irreducible transfer matrix for the states in the recurrent class R_i, Q is the transfer matrix among the transient states, and S contains the transfer probabilities from states in the transient classes to the recurrent classes. The zeros denote null block matrices of the appropriate dimensions. Note that all probabilities of transferring from a recurrent class to another recurrent class or to the transient classes are zero. Because of the lower triangular block structure of this matrix, one can write the m-th iteration matrix as

$$P^m = \begin{pmatrix}
P_1^m & 0 & \cdots & 0 & 0 \\
0 & P_2^m & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & P_r^m & 0 \\
 & & S_m & & Q^m
\end{pmatrix}$$

for some matrix S_m.

It is useful to be able to characterize the states of a Markov chain in communication classes, since it allows us to separate the state space into sets of states that communicate with each other and not with any other state; in other words, to treat each of these sets independently as an irreducible Markov chain. For this reason, we are now going to consider irreducible Markov chains, knowing that for reducible ones the same results apply by treating each communication class independently.

For a given state i, one can define the periodicity of i as the greatest common divisor of the elements of the set [12]

$$A_i = \{m \geq 0 \mid p_m(i,i) > 0\} \qquad (5.13)$$

One can prove that all states in an irreducible Markov chain have the same periodicity, so we can call it the periodicity of the Markov chain, not merely of a state. In the case of a reducible Markov chain, every recurrent class independently determines a periodicity for the states within it. If the periodicity of a Markov chain is 1, we call it aperiodic.

5.1.2 The Perron-Frobenius Theorem

We now have the appropriate concepts to determine under which conditions a Markov chain has an invariant distribution π* defined by (5.9). The Perron-Frobenius theorem for stochastic matrices states that if a Markov chain is irreducible and aperiodic, then π* exists and is unique. More specifically, the theorem asserts the following for P, the transfer matrix of some irreducible and aperiodic Markov chain [15]:

• The eigenvalue of P with greatest absolute value is 1 and it is a simple (non-degenerate) eigenvalue.

• The right eigenvector of P that corresponds to eigenvalue 1 is the constant vector (all entries the same).

• The left eigenvector of P that corresponds to eigenvalue 1 can be chosen to have all entries positive.

• No other left eigenvector of P can be chosen to have all entries positive.

Notice that (5.9) is the left-eigenvector equation for eigenvalue 1. The Perron-Frobenius theorem then tells us that P does indeed have this eigenvalue and that the solution can be chosen to have all entries positive, meaning that it can be made into a probability vector. For a reducible Markov chain, the result applies independently to each recurrent class, and the eigenvalue 1 is no longer simple but has multiplicity r, where r is the number of recurrent classes. Each of the left eigenvectors with eigenvalue 1 is an invariant distribution, with non-zero elements only on the entries corresponding to the states of one of the recurrent classes.

A property of the left eigenvectors associated with eigenvalues other than 1 is that the sum of their entries is 0. Suppose that φ^(j) is one of these eigenvectors with corresponding eigenvalue |λ_j| < 1. Then

$$\sum_k \big(\varphi^{(j)} P\big)_k = \sum_{i,k} \varphi^{(j)}_i P_{ik} = \sum_i \varphi^{(j)}_i$$

where in the second equality we used that the rows of P sum to 1. Since φ^(j) P = λ_j φ^(j), the same sum also equals

$$\sum_i \varphi^{(j)}_i = \lambda_j \sum_i \varphi^{(j)}_i$$

Since |λ_j| < 1, this can only be true if Σ_i φ^(j)_i = 0. Notice that both this result and the last point of the theorem stated above mean that the φ^(j) cannot be interpreted as probability vectors: they are not non-negative and their elements cannot be made to sum to 1. The set of left eigenvectors of P forms a basis for the n-dimensional space of vectors. Any probability vector ρ can then be written as

$$\rho = a_1 \pi^* + \sum_{j=2}^{n} a_j \varphi^{(j)} \qquad (5.14)$$

for some coefficients {a_j}. Using the fact that the entries of the vectors φ^(j) sum to zero, it is easy to see that, since a probability vector's entries sum to 1 and π* is a probability vector, we have a_1 = 1, so we can write ρ as

$$\rho = \pi^* + \sum_{j=2}^{n} a_j \varphi^{(j)} \qquad (5.15)$$

This means that

$$\rho P^m = \pi^* + \sum_{j=2}^{n} a_j \lambda_j^m \varphi^{(j)} \qquad (5.16)$$

and since |λ_j| < 1,

$$\lim_{m\to\infty} \rho P^m = \pi^* \qquad (5.17)$$

This shows that, viewing P as a linear transformation governing the dynamics of densities on the states of the Markov chain, π* is a globally stable fixed point. For any initial density ρ, the evolution leads to the invariant distribution.

An interesting thing to notice about (5.16) is that the eigenvalues λ_j tell us something about how general densities approach the invariant density: each iteration, the contribution a_j φ^(j) to the density is reduced by a factor λ_j.
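The same statement can be checked numerically: π* is the left eigenvector of P for eigenvalue 1 (a right eigenvector of the transpose), and the second-largest |λ_j| sets the slowest relaxation in (5.16). A minimal sketch with an illustrative 3-state chain:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])  # an illustrative irreducible, aperiodic chain

evals, evecs = np.linalg.eig(P.T)   # right eigenvectors of P.T = left of P
k = np.argmin(np.abs(evals - 1.0))
pi_star = np.real(evecs[:, k])
pi_star /= pi_star.sum()            # normalize into a probability vector
print(pi_star)                      # the invariant distribution pi*
print(sorted(np.abs(evals))[-2])    # second-largest |lambda|: relaxation rate
```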

5.2 Higher Order Markov Processes

Using Markov chains is a simple but useful way of describing a stochastic process. Satisfying the Markov condition means that the model is memoryless: the transition probabilities for the next iteration depend only on the current state of the system. It is possible to consider models that do have memory, in which the transition probabilities depend on the past as well as the current state. The simplest way to do this is to consider a k-th order Markov process, one for which the transition probabilities depend on the present state plus at most k − 1 past states. The Markov chains of the past section are, of course, first order Markov processes.

A k-th order Markov process is characterized by

$$\mathbb{P}(i_n|i_0, i_1, \dots, i_{n-1}) = \mathbb{P}(i_n|i_{n-1}, i_{n-2}, \dots, i_{n-k}) \qquad (5.18)$$

Often the order of the Markov process used to model a system depends on the knowledge available. For example, consider a system consisting of a moving unit on a discrete one-dimensional lattice. Each time step, the unit has to move to one of the nearest neighbours of its current position. Suppose that the probability of moving in the same direction it is already moving is higher than the probability of turning. Then knowing where the unit is is not sufficient to determine the probabilities of where it is going to be next: one needs to know the previous position in order to determine, from its relation to the current position, in which direction the unit is moving. This can be described with a second order Markov chain. Consider now that we extend the state


space of the unit to include its "velocity", which we can model as taking the value +1 if it is moving to the right and −1 if moving to the left, in addition to its position. Now a first order Markov chain is sufficient to determine the probabilities of going from any state to every other. Just as the conditional probabilities of a first order Markov chain can be written in the form of a matrix, the transition probabilities of a k-th order Markov chain can be written as a tensor of rank k + 1. This is done by identifying

$$P^{(k)}_{i_1, i_2, \dots, i_{k+1}} = \mathbb{P}(i_{k+1}|i_1, i_2, \dots, i_k) \qquad (5.19)$$

for every combination of i_1, ..., i_{k+1} ∈ S. The rows of P^(1) sum to 1 because we know that from any state some other must follow. In the case of P^(k) this is still true for the sum over the last index, but only when the sequence (i_1, i_2, ..., i_k) is allowed; there may be sequences not attainable from the definition of the Markov process. If (i_1, i_2, ..., i_k) is not allowed, by definition the conditional probabilities P(i_{k+1}|i_1, i_2, ..., i_k) = 0 for all i_{k+1}. This means that for a given sequence (i_1, i_2, ..., i_k),

$$\sum_{i_{k+1}} P^{(k)}_{i_1, i_2, \dots, i_k, i_{k+1}} = \begin{cases} 1 & \text{if } (i_1, i_2, \dots, i_k) \text{ is allowed} \\ 0 & \text{otherwise} \end{cases} \qquad (5.20)$$

Furthermore, for a given order k of a Markov process, we can define all the transition tensors of ranks 2, ..., k + 1. Naturally, only the transition tensor of rank k + 1 defined above contains the complete transition properties of the process. A transition tensor of rank m < k + 1 describes the probability of observing a transition if we only have knowledge of the m − 1 past states. These partial rank transition tensors can be computed recursively from the full rank transition tensor as follows:

$$P^{(k)}_{i_1, i_2, \dots, i_m} = \sum_{i_0} \frac{\pi_{i_0}}{M_{i_1, \dots, i_m}} P^{(k)}_{i_0, i_1, \dots, i_m} \qquad (5.21)$$

where π is the invariant distribution and

$$M_{i_1, \dots, i_m} = \sum_{\substack{i_0 \text{ such that}\\ (i_0, i_1, \dots, i_m) \text{ is allowed}}} \pi_{i_0} \qquad (5.22)$$

Notationally we are going to treat these tensors on the same footing, noting that the number of indices and its relation to k unambiguously determines which tensor is being used. The partial rank tensors represent the transition probabilities we can predict from what we know about the process when we do not know enough past states to use the full rank tensor probabilities.

Notice that we can write the probability of observing some N-sequence i_1, \ldots, i_N using these tensors as

P(i_1, i_2, \ldots, i_N) = \pi_{i_1} P^{(k)}_{i_1 i_2} P^{(k)}_{i_1 i_2 i_3} \cdots P^{(k)}_{i_1 i_2 \ldots i_{k+1}} P^{(k)}_{i_2 i_3 \ldots i_{k+2}} \cdots P^{(k)}_{i_{N-k} i_{N-k+1} \ldots i_N}    (5.23)

when N \geq k. When N < k the product of the tensor elements ends at P^{(k)}_{i_1, \ldots, i_N}.
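A sketch of (5.23) in code, assuming the tensors are stored in a dict `tensors` that maps each rank m to the corresponding array of (5.19) and (5.21), with `tensors[1]` holding the invariant distribution π:

```python
def sequence_probability(seq, tensors, k):
    # Eq. (5.23): grow the context up to rank k+1, then slide it.
    p = tensors[1][seq[0]]                      # pi_{i_1}
    for n in range(1, len(seq)):
        m = min(n + 1, k + 1)                   # rank of the tensor to use
        p *= tensors[m][tuple(seq[n - m + 1 : n + 1])]
    return p
```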

We have used an invariant distribution π in the above discussion. As for the case of the first order Markov process, it is of interest whether there is an invariant distribution in the first place. It turns out that a generalized Perron-Frobenius theorem for tensors asserts that an invariant distribution exists for any higher order Markov chain (CITE LEK HENG LIM, CHI KWONG LI). Whether or not this invariant distribution is unique, together with other characteristics, can be derived from properties of the tensors, but that will not be relevant to our discussion.
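In practice one can also avoid the tensor machinery: lifting the k-th order chain to a first order chain on k-tuples and power-iterating yields an invariant distribution. A minimal sketch (the function name and iteration count are our own choices):

```python
import numpy as np
from itertools import product

def invariant_distribution(Pk, k, N, iters=2000):
    # Pk[i_1, ..., i_k, i_{k+1}] = P(i_{k+1} | i_1, ..., i_k) over N states.
    ktuples = list(product(range(N), repeat=k))
    jindex = {t: j for j, t in enumerate(ktuples)}
    mu = np.full(len(ktuples), 1.0 / len(ktuples))  # distribution on k-tuples
    for _ in range(iters):
        new = np.zeros_like(mu)
        for j, t in enumerate(ktuples):
            for s in range(N):                      # shift the window by one
                new[jindex[t[1:] + (s,)]] += mu[j] * Pk[t + (s,)]
        mu = new
    pi = np.zeros(N)
    for j, t in enumerate(ktuples):                 # marginalize to one state
        pi[t[-1]] += mu[j]
    return pi
```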

5.2.1 A Model of Higher Order Markov Chains

In order to study higher order Markov processes, we are going to describe a model that generates data following a Markov chain of a given specified order. The model is based on the same idea as the example of the previous section, the unit moving on a lattice.


Again, the system of interest is a movable unit on a discrete one dimensional lattice of M sites. We are going to consider the lattice as periodic so the system has a finite number of states. This can be pictured as the unit moving over M points placed on a circle, labeled from 0 to M − 1. Again, each time step the unit moves to one of its two neighbouring sites. We want the data to be generated in a way that results in a k-th order Markov process.

We begin by choosing an arbitrary but allowed sequence of k sites as the initial k states. By default we have chosen the sequence 0, 1, 2, \ldots, k − 1. Next, we randomly choose for each possible (k − 1)-sequence of movements a normalized pair of probabilities, one for moving to the right and one for moving to the left. A sequence of movements can be encoded as a sequence of +1 and −1, meaning movement to the right and movement to the left respectively. The resulting process is k-th order Markovian because the probability distribution for the next state depends on the current state (which determines the two accessible sites, the ones immediately to its right and left) and on the way it reached that site during the k − 1 previous steps.
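A minimal sketch of this generator for k ≥ 2 (the function name and the use of NumPy's random generator are our own choices):

```python
import numpy as np
from itertools import product

def generate_walk(M, k, T, seed=0):
    rng = np.random.default_rng(seed)
    # one probability of moving right for every (k-1)-sequence of past moves
    p_right = {m: rng.random() for m in product((1, -1), repeat=k - 1)}
    sites = list(range(k))             # default initial states 0, 1, ..., k-1
    moves = [1] * (k - 1)              # the moves that produced them
    for _ in range(T):
        step = 1 if rng.random() < p_right[tuple(moves[-(k - 1):])] else -1
        sites.append((sites[-1] + step) % M)
        moves.append(step)
    return sites
```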

5.3 Study of Higher Order Markov Processes

5.3.1 Information Theoretic Aspects of Markov Processes

Let us first compute the entropy of an n-sequence for a (first order) Markov process, H_n^{(1)}, the superindex identifying the order of the Markov process. For a 1-sequence, we have

H_1^{(1)}(P) = -\sum_{i \in S} P(i) \log P(i) = -\sum_{i \in S} \pi_i \log \pi_i    (5.24)

which is the entropy of the invariant distribution. For L > 1, using (5.4),

H_L^{(1)}(P) = -\sum_{i_1, \ldots, i_L \in S} P(i_1, i_2, \ldots, i_L) \log P(i_1, i_2, \ldots, i_L)
             = -\sum_{i_1, \ldots, i_L} \pi_{i_1} P_{i_1 i_2} \cdots P_{i_{L-1} i_L} \log\big( \pi_{i_1} P_{i_1 i_2} \cdots P_{i_{L-1} i_L} \big)
             = -\sum_{i_1, \ldots, i_L} \pi_{i_1} P_{i_1 i_2} \cdots P_{i_{L-1} i_L} \log \pi_{i_1} - (L-1) \sum_{i_1, \ldots, i_L} \pi_{i_1} P_{i_1 i_2} \cdots P_{i_{L-1} i_L} \log P_{i_1 i_2}
             = -\sum_i \pi_i \log \pi_i - (L-1) \sum_{i,j} \pi_i P_{ij} \log P_{ij}
             = H_1^{(1)}(P) - (L-1) \sum_{i,j} \pi_i P_{ij} \log P_{ij}    (5.25)

where we have identified the conditional probabilities with the entries of the transfer operator P and we have used that \sum_j P_{ij} = 1. In the third line we used that decomposing the logarithm gives L − 1 terms containing \log P_{ij}. We obtain the result by noting that all these terms are equal, using that for the invariant distribution \sum_i \pi_i P_{ij} = \pi_j.

For later convenience, we are going to define h_1 = -\sum_i \pi_i \log \pi_i and h_2 = -\sum_{i,j} \pi_i P_{ij} \log P_{ij}, so

H_L^{(1)} = \begin{cases} h_1 & \text{if } L = 1 \\ h_1 + (L-1)\, h_2 & \text{if } L > 1 \end{cases}    (5.26)

A similar procedure is used to compute the entropy of L-sequences for a Markov process of any order. To write the result we need to define, for m > 1,

h_m = -\sum_{i_1, i_2, \ldots, i_m} \pi_{i_1} P^{(k)}_{i_1 i_2} \cdots P^{(k)}_{i_1 i_2 \ldots i_m} \log P^{(k)}_{i_1 i_2 \ldots i_m}    (5.27)


Remember these tensors were defined in (5.19) and (5.21).
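With the same storage convention as before, the h_m can be evaluated by enumerating words of length m. A sketch reusing the `sequence_probability` helper defined earlier (for m > k + 1 it falls back to the rank k + 1 tensor, consistent with (5.28) below):

```python
import numpy as np
from itertools import product

def h_m(tensors, k, N, m):
    # Eq. (5.27): h_m = -sum_w P(w) log P(last symbol | rest of w).
    r = min(m, k + 1)                   # rank of the conditioning tensor
    total = 0.0
    for w in product(range(N), repeat=m):
        p_w = sequence_probability(w, tensors, k)   # eq. (5.23)
        cond = tensors[r][w[-r:]]
        if p_w > 0 and cond > 0:
            total -= p_w * np.log(cond)
    return total
```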

Following a similar procedure to the one that led us to (5.26), one finds inductively that for a k-th order Markov process

H_L^{(k)} = \begin{cases} \sum_{m=1}^{L} h_m & \text{if } L \leq k \\ \sum_{m=1}^{k} h_m + (L-k)\, h_{k+1} & \text{if } L > k \end{cases}    (5.28)

We can use these results to find closed form expressions for the quantities defined in section 4.1. For instance, from (4.5) and (5.28) it is clear that the entropy rate of a k-th order Markov process is given by

h^{(k)} = h_{k+1}    (5.29)

Consider the entropy gain h_L^{(k)} = \Delta H_L^{(k)} of an order k Markov process. Using (5.28) we see that

h_L^{(k)} = \begin{cases} h_L & \text{if } L \leq k \\ h_{k+1} & \text{if } L > k \end{cases}    (5.30)

which not only confirms (5.29), but tells us exactly how the entropy gain converges to the entropy rate. If the process is a Markov chain of order k, then the entropy gain converges to the entropy rate exactly at L = k + 1. Furthermore, the predictability gain follows easily from (5.30) and is given by

\Delta^2 H_L^{(k)} = \begin{cases} h_L - h_{L-1} & \text{if } L \leq k \\ 0 & \text{if } L > k \end{cases}    (5.31)

To exemplify this, consider the model for higher order Markov chains of section 5.2.1. We generate a sequence of symbols corresponding to a second order Markov process and a fifth order Markov process, both on a lattice of 5 sites with periodic boundary conditions. Figure ?? shows the block entropy, entropy gain and predictability gain for both processes, as well as the values of the h_m.

5.4 Statistical Test to Determine the Order of a Markov Process

Consider a data set D in the form of a sequence of states from a finite collection S, coming from some system of interest. We assume we are confident that the source phenomenon from which the data set was recorded can be appropriately thought of as a stochastic process. If this is the case, we might be interested in approximating the behaviour of this system as a Markov process of some order. By doing so we can obtain an approximation of the transition rules that determine the underlying process in a way that also takes into account the memory the system might have of its previous states. The set of states S = {0, 1, \ldots, N − 1} is the alphabet of our representation of the stochastic process, where each element of S is a label for a specific state of the system. We want to determine what order k of a Markov chain best describes the data set D and to determine an approximation of the probabilities of transitioning to each state in S coming from each possible sequence of k states.

For every string of L labels, we can approximate the probability of finding that string by counting its relative frequency: the number of times we observe it within the data set divided by the total number of strings of that length counted. This way, for each string length L we can obtain a probability distribution P(i_0, i_1, \ldots, i_{L-1}) of strings of that length. If the data set is large enough, we know that these relative frequencies approximate the probability that the system generates each of these strings. An important point of consideration is that for a given data set, the approximation of the probability distribution becomes less accurate as we increase the string length: the number of possible strings of length L increases as N^L. At some point our available data is going to be insufficient to accurately portray the real probability of finding particular sequences for the underlying stochastic process that is generating the data.
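A sketch of this frequency counting, reused by the test below:

```python
from collections import Counter

def string_distribution(data, L):
    # Relative frequency of every observed string of length L.
    counts = Counter(tuple(data[i:i + L]) for i in range(len(data) - L + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```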


Figure 5.1: Chi-squared test for identifying the order of a Markov process. The data was generated for a seventh order Markov process, as is correctly identified by the test when the string length L is greater than 7.

On the other hand, if we assume the underlying stochastic process is Markov of order k, we can approximate the probabilities by approximating the transition tensors P^{(k)} of (5.19), computing the relative frequencies of transitions and using (5.23). We will call this probability distribution P^{(k)}. If the underlying process is indeed Markov of order k, then the distributions obtained in the two ways described must be the same. We can use a test statistic to determine whether the probability distributions coincide, as suggested in [16].

After computing P and P^{(k)} we can compare them using a chi-squared test [16], which consists in computing the quantity

\chi^2_{(k)} = \sum_i \frac{\left( P(X_i^{(L)}) - P^{(k)}(X_i^{(L)}) \right)^2}{P^{(k)}(X_i^{(L)})}    (5.32)

where we have labeled each possible string of length L as X_i^{(L)}, with i = 1, 2, \ldots, N^L.

In order to determine the order k of the process we are studying, we compute \chi^2_{(k)} for several values of k. We can then plot \chi^2_{(k)} against k and find the order by determining when \chi^2_{(k)} converges to zero. Notice that for any k larger than the order of the process, the value of \chi^2_{(k)} is also zero since, trivially, a Markov chain of order k is also of order m for any m > k; what we are interested in is therefore the lowest k for which \chi^2_{(k)} is zero (or close enough to zero).
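A sketch of the full procedure, reusing `string_distribution` from above: fit the order-k transition frequencies, propagate them to length-L strings in the spirit of (5.23), and compare with (5.32). It assumes L ≥ k and that every observed k-context is followed by at least one symbol in the data:

```python
from collections import Counter, defaultdict

def markov_string_distribution(data, k, L):
    # Transition counts for each k-context observed in the data.
    trans = defaultdict(Counter)
    for i in range(len(data) - k):
        trans[tuple(data[i:i + k])][data[i + k]] += 1
    dist = string_distribution(data, k)       # empirical start of length k
    for _ in range(L - k):                    # extend strings one symbol
        new = defaultdict(float)
        for w, p in dist.items():
            ctx = w[-k:]
            total = sum(trans[ctx].values())
            for s, c in trans[ctx].items():
                new[w + (s,)] += p * c / total
        dist = dict(new)
    return dist

def chi_squared(data, k, L):
    # Eq. (5.32): compare empirical and model L-string distributions.
    emp = string_distribution(data, L)
    model = markov_string_distribution(data, k, L)
    return sum((emp.get(w, 0.0) - p) ** 2 / p for w, p in model.items() if p > 0)
```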

Notice that for this procedure we also have to choose a value for L, the length of the strings we are considering. Theoretically it does not matter as long as L is larger than the k we are looking for. This means that L must be large enough to allow us to search over an appropriate range of values of k, but not so large that it compromises the quality of our probability approximations, as mentioned earlier. It is then important to repeat this technique with several values of L in order to confirm that the order of the Markov process we find is reliable. The procedure can fail if we do not have enough data points, namely when the underlying k of the process is so large that we cannot appropriately approximate the probabilities of strings longer than the order of the process.

To show how this works, we have generated a data set of 100000 points of the model described in section 5.2.1 with the order of the Markov process set to 7 and the number of states to 6. Figure 5.1 shows the chi-squared test described above for various values of L. We see that for L ≤ 7, the test incorrectly identifies the process as having order L − 1. For L > 7, the test identifies the order of the process to be 7, which is the correct value.

Figure 5.2: Chi-squared test for identifying the order of a Markov process when the process is not Markovian of any order. Three independent Markov processes of orders 3, 4 and 5 were generated for 50000 points each and placed one after the other to produce the data set analyzed. The test does not identify any k as the order of the underlying process in a way that is independent of the string length L.

If the memory of a stochastic process is not constant, or is not a definite property of the process at all, then the process cannot be described as a Markov chain of any given order. To see how this manifests in the chi-squared test, we have generated a data set following the same model as before, again with 6 possible states, but behaving as a Markov process of order 3 for the first 50000 points, of order 4 for the next 50000 points and of order 5 for the last 50000 points. The underlying transition probabilities are generated independently for each of the three divisions of the data set. Figure 5.2 shows the chi-squared test results for different string lengths. In this case, notice how the test identifies L − 1 as the order of the process for every L. This means that the convergence to zero of the chi-squared test for each L is merely an artifact of how the probability distributions are obtained (counting relative frequencies from the data set). The fact that no string length acts as a threshold beyond which the test starts identifying some definite k as the order of the underlying process means that the underlying process is not Markovian of any order.


Chapter 6

Stochastic Approach to Dynamical Systems

6.1 Partitioning of the Phase Space

Partitioning the phase space of a dynamical system means defining a coarse grained version of it. To do this one defines a collection of sets \{B_i\} such that \bigcup_i B_i = X and B_i \cap B_j = \emptyset for i \neq j. That is, a collection of non overlapping sets that together constitute the whole phase space. Partitioning changes our ability to resolve the state of a system. Instead of reporting the exact point representing a state, we can now only refer to the region of the partition to which the state belongs. This is analogous to what happens in real measurements of systems, where the accuracy with which a state can be determined depends on the instrument's limitations.

After the coarse graining, the description of a trajectory is changed from the specification of a number of points to the specification of a number of regions. If we say our coarse grained trajectory is B_{i_0}, B_{i_1}, \ldots, B_{i_{n-1}}, this means that the true states of the system follow a trajectory x_0, x_1, \ldots, x_{n-1} where x_j \in B_{i_j}. Notice that this means that different true trajectories can come to be represented by the same sequence of region labels: the coarse graining reduces the information we gather from recording a sequence of states of our system.

Partitioning the phase space and dealing with sequences of regions instead of points takes the study of a dynamical system from a deterministic to a probabilistic approach. This can be particularly useful when the deterministic dynamical system behaves chaotically. Deterministic chaos is characterized by sensitivity to initial conditions: the existence of positive Lyapunov exponents means that there are directions in the phase space along which small differences in the state of the system diverge under the evolution. If we wish to predict the evolution of the state of a system that behaves chaotically with confidence for a given period of time T, we need to be able to measure its initial state with a precision that becomes exponentially small relative to T. Instead of doing this, we might shift our perspective to the long term behaviour of the system, using the probabilistic approach that emerges from partitioning, which can reveal interesting properties of the global behaviour of the system.

6.1.1 Partitioning Schemes

Any arbitrary partition is technically allowed but, as we will see, the particular partition one uses influences the results one obtains. Leaving the discussion of how exactly the resulting processes depend on the partitioning for later, we discuss here ways to partition a phase space.

One obvious way of partitioning is to homogeneously slice each dimension into a given number of intervals of the same length (which can be different for each dimension). If M_i is the number of intervals for dimension i, then this leaves us with a partitioning consisting of \prod_i M_i regions. Increasing any of the M_i provides us with a finer partition; a minimal sketch of this grid labeling is given below. In the case this partitioning scheme is
