
Cramér's Large Deviation Theorem

Jens Klooster

June 27, 2016

Bachelor thesis
Supervisor: dr. Sonja Cox


Abstract

In this thesis we study Cramér's Theorem for large deviations. We will introduce the tools needed to understand the theorem, such as the Fenchel-Legendre transform, and give a detailed proof. For the proof of Cramér's Theorem we will follow [2]. Cramér's Theorem will also be compared to the Central Limit Theorem, and with the use of some simulations we will show how Cramér's Theorem can be used in practice.

Acknowledgements

I am grateful for the support of my thesis supervisor dr. Sonja Cox, as she gave me valuable insights and comments throughout the process of writing my thesis.

Title: Cramér's Large Deviation Theorem
Author: Jens Klooster, jens.klooster@student.uva.nl, 10059229
Supervisor: dr. Sonja Cox
Second examiner: dr. Maria Remerova
Date: June 27, 2016

Korteweg-de Vries Instituut voor Wiskunde
Universiteit van Amsterdam

Science Park 904, 1098 XH Amsterdam http://www.science.uva.nl/math


Contents

1 Introduction
  1.1 Large Deviation Theory
  1.2 Notation and definitions
2 Cramér's theorem
  2.1 An important lemma
  2.2 Proof of Cramér's theorem
3 The Central Limit Theorem
  3.1 Central Limit Theorem
4 Simulations
  4.1 Theoretical part
  4.2 Matlab part
5 Conclusion
6 VWO summary


1 Introduction

1.1 Large Deviation Theory

In 2007 the Indian mathematician S.R. Srinivasa Varadhan won the prestigious Abel Prize for his fundamental contributions to probability theory and in particular for creating a unified theory of large deviation. In this thesis we will study one of the main theorems of Large Deviation Theory (LDT), which was introduced by the mathematician Harald Cramér. But before we turn to this theorem we will introduce some basic definitions and explain what LDT is.

The theory of large deviations dates back to the early 1930s, when the Scandinavian actuary (or insurance mathematician) F. Esscher was working on the following problem: how do you compute the probability that the total amount of claims made against an insurance company will exceed the reserve fund set aside for these claims? The standard approach was to model each claim as a random variable with some distribution and then compute the probability that the sum of all these random variables exceeds some amount. In other words, we are interested in tail probabilities of sums of independent random variables.

A theorem that is often used to solve problems of this kind is the Central Limit Theorem (CLT). However, the following example shows that the CLT will not help us very much. Suppose that $X_1, X_2, \dots$ is a sequence of i.i.d. real random variables with mean 0 and variance 1 and common distribution $\mu$. Let $Z_n = \frac{X_1 + \dots + X_n}{\sqrt{n}}$; then $Z_n$ has a limiting normal distribution according to the central limit theorem. Let $k \in \mathbb{R}$; then
$$\lim_{n\to\infty} P[Z_n \ge k] = \frac{1}{\sqrt{2\pi}}\int_k^\infty e^{-x^2/2}\,dx.$$
We now know something about the limiting behavior of $Z_n$, but we are interested in the behavior of $Z_n$ for fixed $n \in \mathbb{N}$. A mistake that is often made in basic statistics courses is to assume the following:
$$P[Z_n \ge k] \approx \frac{1}{\sqrt{2\pi}}\int_k^\infty e^{-x^2/2}\,dx.$$

In Chapter 3 we will go into the details of why this is not allowed. For now, we need to find a different approach to tackle this problem.

One approach to this problem was introduced by the Swedish mathematician Harald Cramér with what is now known as Cramér's theorem for large deviations, or simply Cramér's theorem. Cramér's theorem is one of the fundamental theorems of Large Deviation Theory and is therefore the main subject of study in this thesis. As a mathematician Cramér started out working in analytic number theory and later went to work as an actuary in the insurance business. In a similar way as for Esscher, his work in the insurance business got him into statistics [1].

In the following section we will introduce some basic definitions and notation that are used throughout this text. In Chapter 2 we will introduce and prove an important lemma that is needed to prove Cramér's theorem. We then give a detailed proof of Cramér's theorem, following [2] for both the proof of the lemma and the proof of the theorem. In Chapter 3 we will compare Cramér's theorem to the Central Limit Theorem. In Chapter 4 we will run some simulations to see how well Cramér's theorem works in practice. In Chapter 5 we end with some concluding remarks.

1.2 Notation and definitions

In this thesis we will work in a probability space $(\Omega, \mathcal{F}, P)$, where $P$ is an arbitrary probability measure.

Notation 1.1. Let $X \subset \mathbb{R}$. The interior of $X$ is denoted by $X^o$, the closure is denoted by $\overline{X}$ and the complement is denoted by $X^c$.

Notation 1.2. $\mathcal{B}$ will denote the Borel sigma-algebra on $\mathbb{R}$.

Definition 1.3. A rate function $I$ is a lower semicontinuous mapping $I : \mathbb{R} \to [0, \infty]$ (i.e. for all $\alpha \in [0,\infty)$, the level set $\Psi_I(\alpha) := \{x : I(x) \le \alpha\}$ is a closed subset of $\mathbb{R}$). A good rate function is a rate function for which all the level sets $\Psi_I(\alpha)$ are compact subsets of $\mathbb{R}$. The effective domain of $I$, denoted $D_I$, is the set of points of finite rate, i.e., $D_I := \{x \in \mathbb{R} : I(x) < \infty\}$. When no confusion occurs, we refer to $D_I$ as the domain of $I$.
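For a simple illustration (an example added here, not from the original text): the function $I(x) = x^2/2$ is a good rate function, since its level sets
$$\Psi_I(\alpha) = \{x : x^2/2 \le \alpha\} = [-\sqrt{2\alpha}, \sqrt{2\alpha}]$$
are compact for every $\alpha \in [0,\infty)$. This is in fact the rate function for sample means of standard normal random variables (see the example after Definition 2.1).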

Definition 1.4. Let $(\mu_n)_{n\in\mathbb{N}}$ be a sequence of measures on $\mathbb{R}$. We say that $(\mu_n)_{n\in\mathbb{N}}$ satisfies the large deviation principle with a rate function $I$ if, for all $\Gamma \in \mathcal{B}$,
$$-\inf_{x\in\Gamma^o} I(x) \le \liminf_{n\to\infty}\frac{1}{n}\log\mu_n(\Gamma) \le \limsup_{n\to\infty}\frac{1}{n}\log\mu_n(\Gamma) \le -\inf_{x\in\overline{\Gamma}} I(x). \quad (1.1)$$
This definition states the large deviation principle in a general setting. In this thesis we will be interested in the case where $\mu_n$ is a specific probability measure and $I$ is the Fenchel-Legendre transform. Both of these objects will be introduced at the beginning of Chapter 2.


2 Cramér's theorem

The main goal of this chapter is to introduce Cramér's theorem and give a detailed proof. Before we do so, we have to introduce the Fenchel-Legendre transform that appears in the theorem. The proof of the theorem uses some properties of the Fenchel-Legendre transform, which we will also prove before we start with the proof of Cramér's theorem.

Let $X_1, X_2, \dots$ be i.i.d. random variables with $X_1$ distributed according to the probability law $\mu$. The empirical mean of the first $n$ variables, $n \in \mathbb{N}$, is defined as $\hat{S}_n = \frac{1}{n}\sum_{k=1}^n X_k$. We will use a probability measure $\mu_n$ on $\mathcal{B}$ that is defined as follows: $\mu_n(B) = P[\hat{S}_n \in B]$, in which $B \in \mathcal{B}(\mathbb{R})$ is a Borel set and $n \in \mathbb{N}$. The logarithmic moment generating function associated with the probability measure $\mu$ is defined as
$$\Lambda(\lambda) := \log M(\lambda) := \log E[e^{\lambda X_1}]. \quad (2.1)$$

Definition 2.1 (Fenchel-Legendre). The Fenchel-Legendre transform of $\Lambda$ is given by $\Lambda^* : \mathbb{R} \to \mathbb{R}\cup\{\infty\}$,
$$\Lambda^*(x) = \sup_{\lambda\in\mathbb{R}}\{\lambda x - \Lambda(\lambda)\} \quad (2.2)$$
for all $x \in \mathbb{R}$.
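As a worked illustration (an example added here, not in the original text), consider $X_1 \sim N(0,1)$. Then $M(\lambda) = e^{\lambda^2/2}$, so $\Lambda(\lambda) = \lambda^2/2$, and the supremum in (2.2) is attained at $\lambda = x$:
$$\Lambda^*(x) = \sup_{\lambda\in\mathbb{R}}\Big\{\lambda x - \frac{\lambda^2}{2}\Big\} = x\cdot x - \frac{x^2}{2} = \frac{x^2}{2}.$$
So for standard normal variables the Fenchel-Legendre transform is again a quadratic, and it is precisely the good rate function mentioned after Definition 1.3.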

Before stating and proving some properties of the Fenchel-Legendre transform, we introduce the main theorem.

Theorem 2.2 (Cramér's theorem). Given a sequence of i.i.d. real-valued random variables $X_i \in \mathbb{R}$, the sequence of measures $(\mu_n)_{n\in\mathbb{N}}$ satisfies the LDP with the convex rate function $\Lambda^*(\cdot)$, meaning that:

1. For any closed set $F \subset \mathbb{R}$,
$$\limsup_{n\to\infty}\frac{1}{n}\log\mu_n(F) \le -\inf_{x\in F}\Lambda^*(x). \quad (2.3)$$

2. For any open set $G \subset \mathbb{R}$,
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(G) \ge -\inf_{x\in G}\Lambda^*(x). \quad (2.4)$$
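To see what the theorem buys us, here is a standard illustration (added here, not part of the original text). For fair coin flips, i.e. $P[X_1 = 0] = P[X_1 = 1] = \frac{1}{2}$, one computes $\Lambda(\lambda) = \log\frac{1 + e^\lambda}{2}$ and, for $x \in [0,1]$,
$$\Lambda^*(x) = x\log(2x) + (1-x)\log(2(1-x)).$$
The theorem then says, for instance, that $P[\hat{S}_n \ge 0.6]$ decays like $e^{-n\Lambda^*(0.6)} \approx e^{-0.0201\,n}$: deviations of the empirical mean away from $\frac{1}{2}$ are exponentially unlikely, with an explicitly computable rate.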

To be able to prove Cramér's theorem, we have to prove a few properties of the Fenchel-Legendre transform $\Lambda^*$. The following lemma summarizes the properties of the Fenchel-Legendre transform that we need.


2.1 An important lemma

Lemma 2.3.

1. Both $\Lambda$ and $\Lambda^*$ are convex functions, and $\Lambda^*$ is a rate function.

2. If $D_\Lambda := \{\lambda \mid \Lambda(\lambda) < \infty\} = \{0\}$, then $\Lambda^*$ is identically zero. Let $\bar{x} = E[X_1]$. If $\Lambda(\lambda) < \infty$ for some $\lambda > 0$, then it holds that $\bar{x} < \infty$ (possibly $\bar{x} = -\infty$), and for all $x \ge \bar{x}$,
$$\Lambda^*(x) = \sup_{\lambda\ge 0}[\lambda x - \Lambda(\lambda)],$$
and $\Lambda^*$ is, for $x > \bar{x}$, a nondecreasing function. Similarly, if $\Lambda(\lambda) < \infty$ for some $\lambda < 0$, then it holds that $\bar{x} > -\infty$ (possibly $\bar{x} = \infty$), and for all $x \le \bar{x}$,
$$\Lambda^*(x) = \sup_{\lambda\le 0}[\lambda x - \Lambda(\lambda)],$$
and $\Lambda^*$ is, for $x < \bar{x}$, a nonincreasing function. When $\bar{x}$ is finite, $\Lambda^*(\bar{x}) = 0$. If $\bar{x}$ is not finite, then still
$$\inf_{x\in\mathbb{R}}\Lambda^*(x) = 0.$$

3. $\Lambda(\cdot)$ is differentiable in the interior of $D_\Lambda$, with
$$\Lambda'(\nu) = \frac{1}{M(\nu)}E[X_1 e^{\nu X_1}]$$
and
$$\Lambda'(\nu) = y \implies \Lambda^*(y) = \nu y - \Lambda(\nu).$$

Proof. Proof of statement 1.

To prove the convexity of $\Lambda$ we will use Hölder's inequality [3]. Let $\theta \in [0,1]$ and $\lambda_1, \lambda_2 \in \mathbb{R}$; then
$$\Lambda(\theta\lambda_1 + (1-\theta)\lambda_2) = \log\left(E[(e^{\lambda_1 X_1})^\theta (e^{\lambda_2 X_1})^{1-\theta}]\right) \le \log\left(E[e^{\lambda_1 X_1}]^\theta\, E[e^{\lambda_2 X_1}]^{1-\theta}\right) = \theta\Lambda(\lambda_1) + (1-\theta)\Lambda(\lambda_2).$$
The convexity of $\Lambda^*$ follows directly from its definition. Let $\theta \in [0,1]$ and $x_1, x_2 \in \mathbb{R}$; then
$$\begin{aligned}
\Lambda^*(\theta x_1 + (1-\theta)x_2) &= \sup_{\lambda\in\mathbb{R}}\{(\theta x_1 + (1-\theta)x_2)\lambda - \Lambda(\lambda)\}\\
&= \sup_{\lambda\in\mathbb{R}}\{(\theta x_1 + (1-\theta)x_2)\lambda - \theta\Lambda(\lambda) - (1-\theta)\Lambda(\lambda)\}\\
&\le \sup_{\lambda\in\mathbb{R}}\{\theta x_1\lambda - \theta\Lambda(\lambda)\} + \sup_{\lambda\in\mathbb{R}}\{(1-\theta)x_2\lambda - (1-\theta)\Lambda(\lambda)\}\\
&= \theta\Lambda^*(x_1) + (1-\theta)\Lambda^*(x_2).
\end{aligned}$$


To prove that $\Lambda^*$ is a rate function we need to prove that $\Lambda^*$ is non-negative and lower semicontinuous. Since $\Lambda(0) = 0$, we know that
$$\Lambda^*(x) = \sup_{\lambda\in\mathbb{R}}\{\lambda x - \Lambda(\lambda)\} \ge 0\cdot x - \Lambda(0) = 0.$$
To prove that $\Lambda^*$ is lower semicontinuous we need to show that for every sequence $x_n \to x$ we have
$$\liminf_{x_n\to x}\Lambda^*(x_n) \ge \Lambda^*(x).$$
We have, for all $\lambda \in \mathbb{R}$,
$$\liminf_{x_n\to x}\Lambda^*(x_n) = \liminf_{x_n\to x}\left(\sup_{\lambda\in\mathbb{R}}\{\lambda x_n - \Lambda(\lambda)\}\right) \ge \liminf_{x_n\to x}\left(\lambda x_n - \Lambda(\lambda)\right) = \lambda x - \Lambda(\lambda).$$
We can use the fact that $\liminf_{x_n\to x}\Lambda^*(x_n) \ge \lambda x - \Lambda(\lambda)$ for all $\lambda$ and conclude that
$$\liminf_{x_n\to x}\Lambda^*(x_n) \ge \sup_{\lambda\in\mathbb{R}}[\lambda x - \Lambda(\lambda)] = \Lambda^*(x).$$

Proof of statement 2.

If $D_\Lambda = \{0\}$, then $\Lambda^*(x) = \sup_{\lambda\in\mathbb{R}}\{\lambda x - \Lambda(\lambda)\} = 0\cdot x - \Lambda(0) = 0$ for all $x \in \mathbb{R}$. If $\Lambda(\lambda) < \infty$ for some $\lambda > 0$, we have $\Lambda(\lambda) = \log(E[e^{\lambda X_1}]) < \infty$. Using $y \le e^y$ for all $y \in \mathbb{R}$, it follows that
$$\lambda\bar{x} = \lambda E[X_1] = E[\lambda X_1] \le E[e^{\lambda X_1}] \implies \bar{x} \le \frac{1}{\lambda}E[e^{\lambda X_1}] < \infty. \quad (2.5)$$
Now, by using Jensen's inequality [4], we find that for all $\lambda \in \mathbb{R}$,
$$\Lambda(\lambda) = \log(E[e^{\lambda X_1}]) \ge E[\log(e^{\lambda X_1})] = E[\lambda X_1] = \lambda\bar{x}. \quad (2.6)$$
We have now bounded $\lambda\bar{x}$ in two different ways, and we will need both results. Next we go over a few possibilities for $\bar{x}$ to show that $\Lambda^*(x) = \sup_{\lambda\ge 0}[\lambda x - \Lambda(\lambda)]$ for all $x \ge \bar{x}$.

If $\bar{x} = -\infty$, then $\Lambda(\lambda) = \infty$ for every $\lambda < 0$ by the analogue of (2.5) for negative $\lambda$, so that $\lambda x - \Lambda(\lambda) = -\infty$ for $\lambda \in (-\infty,0)$, and therefore we can restrict
$$\Lambda^*(x) = \sup_{\lambda\in\mathbb{R}}\{\lambda x - \Lambda(\lambda)\} \quad\text{to}\quad \Lambda^*(x) = \sup_{\lambda\ge 0}\{\lambda x - \Lambda(\lambda)\}.$$


If $\bar{x}$ is finite and $x \ge \bar{x}$, then by (2.6) we have, for every $\lambda < 0$,
$$\lambda x - \Lambda(\lambda) \le \lambda x - \lambda\bar{x} = \lambda(x - \bar{x}) \le 0,$$
so that
$$\Lambda^*(x) = \sup_{\lambda\in\mathbb{R}}\{\lambda x - \Lambda(\lambda)\} = \sup_{\lambda\ge 0}\{\lambda x - \Lambda(\lambda)\}.$$
Because we have $\Lambda^*(x) = \sup_{\lambda\ge 0}\{\lambda x - \Lambda(\lambda)\}$ for $x \ge \bar{x}$, we only have to check that the right-hand side is a nondecreasing function on $(\bar{x},\infty)$, which is true since for every $\lambda \ge 0$ the function $\lambda x - \Lambda(\lambda)$ is nondecreasing as a function of $x$. To prove that $\Lambda^*(\bar{x}) = 0$ when $\bar{x} < \infty$, we use the facts that $\Lambda^* \ge 0$ and $\Lambda(\lambda) \ge \lambda\bar{x}$ to conclude that
$$0 \le \Lambda^*(\bar{x}) = \sup_{\lambda\in\mathbb{R}}\{\lambda\bar{x} - \Lambda(\lambda)\} \le \sup_{\lambda\in\mathbb{R}}\{\lambda\bar{x} - \lambda\bar{x}\} = 0.$$

To prove that we always have $\inf_{x\in\mathbb{R}}\Lambda^*(x) = 0$ we consider a few cases. Firstly, we have already established that for $D_\Lambda = \{0\}$ we have $\Lambda^* \equiv 0$, and when $\bar{x}$ is finite we have $\Lambda^*(\bar{x}) = 0$, so in these cases the claim holds.

Next we consider the case $\bar{x} = -\infty$ while $\Lambda(\lambda) < \infty$ for some $\lambda > 0$. With Chebyshev's inequality [6] we find that
$$\log\mu([x,\infty)) \le \inf_{\lambda\ge 0}\log E\big[e^{\lambda(X_1-x)}\big] = -\sup_{\lambda\ge 0}\Big(-\log E\big[e^{\lambda(X_1-x)}\big]\Big) = -\sup_{\lambda\ge 0}\{\lambda x - \Lambda(\lambda)\} = -\Lambda^*(x).$$
Using this fact we have
$$0 \le \lim_{x\to-\infty}\Lambda^*(x) \le \lim_{x\to-\infty}\big(-\log\mu([x,\infty))\big) = 0,$$
so that $\lim_{x\to-\infty}\Lambda^*(x) = 0$, and hence $\inf_{x\in\mathbb{R}}\Lambda^*(x) = 0$.

The last case is $\bar{x} = \infty$ while $\Lambda(\lambda) < \infty$ for some $\lambda < 0$; this is proved analogously to the previous case.


Proof of statement 3. First we notice that
$$\Lambda(\nu) = \log(E[e^{\nu X_1}]) \implies \Lambda'(\nu) = \frac{1}{E[e^{\nu X_1}]}\cdot\frac{\partial E[e^{\nu X_1}]}{\partial\nu}.$$
Now if we show that
$$\frac{\partial}{\partial\nu}E[e^{\nu X_1}] = E\left[\frac{\partial}{\partial\nu}e^{\nu X_1}\right] = E[X_1 e^{\nu X_1}],$$
then we are done. Let $\nu_0 \in D_\Lambda^o$ and let $\epsilon > 0$ be such that $\nu_0 - 2\epsilon$ and $\nu_0 + 2\epsilon$ are in $D_\Lambda$. Let $\nu_1, \nu_2, \dots$ be points in $D_\Lambda^o$ such that $\lim_{n\to\infty}\nu_n = \nu_0$ and $0 < |\nu_n - \nu_0| \le \epsilon$ for all $n \in \mathbb{N}$. Now for all $n \in \mathbb{N}$ we have
$$\frac{M(\nu_n) - M(\nu_0)}{\nu_n - \nu_0} = \frac{E[e^{\nu_n X_1}] - E[e^{\nu_0 X_1}]}{\nu_n - \nu_0} =: E[Y_n], \qquad\text{where } Y_n = \frac{e^{\nu_n X_1} - e^{\nu_0 X_1}}{\nu_n - \nu_0}.$$
Now, as $n \to \infty$,
$$Y_n \to \frac{d}{d\nu}e^{\nu X_1}\Big|_{\nu=\nu_0} = X_1 e^{\nu_0 X_1} =: Y.$$

According to the dominated convergence theorem, if there exists an integrable random variable $D$ such that $|Y_n| \le D$ for all $n \in \mathbb{N}$, then the $Y_n$ and $Y$ are integrable and we have
$$\lim_{n\to\infty}E[Y_n] = E[Y].$$
This would imply that $X_1 e^{\nu_0 X_1}$ is integrable and $M'(\nu_0) = E[X_1 e^{\nu_0 X_1}]$.

Now we need to find an integrable $D$ such that $|Y_n| \le D$ for all $n$. With use of the mean value theorem we find
$$e^{\nu_n X_1} - e^{\nu_0 X_1} = (\nu_n - \nu_0)X_1 e^{\nu^* X_1}$$
for some point $\nu^*$ between $\nu_0$ and $\nu_n$. Hence
$$|Y_n| = \left|\frac{e^{\nu_n X_1} - e^{\nu_0 X_1}}{\nu_n - \nu_0}\right| = |X_1|e^{\nu^* X_1} \le |X_1|\left(e^{(\nu_0-\epsilon)X_1} + e^{(\nu_0+\epsilon)X_1}\right).$$
Moreover,
$$|X_1| = \frac{1}{\epsilon}\,\epsilon|X_1| \le \frac{1}{\epsilon}\left(1 + \epsilon|X_1| + \frac{\epsilon^2|X_1|^2}{2!} + \dots\right) = \frac{1}{\epsilon}e^{\epsilon|X_1|} \le \frac{1}{\epsilon}\left(e^{-\epsilon X_1} + e^{\epsilon X_1}\right).$$


Thus
$$|Y_n| \le \frac{1}{\epsilon}\left(e^{-\epsilon X_1} + e^{\epsilon X_1}\right)\left(e^{(\nu_0-\epsilon)X_1} + e^{(\nu_0+\epsilon)X_1}\right) \le \frac{1}{\epsilon}\left(e^{(\nu_0-2\epsilon)X_1} + 2e^{\nu_0 X_1} + e^{(\nu_0+2\epsilon)X_1}\right) =: D.$$
We have
$$E[D] = \frac{1}{\epsilon}\left(M(\nu_0 - 2\epsilon) + 2M(\nu_0) + M(\nu_0 + 2\epsilon)\right) < \infty$$
by the choice of $\epsilon$. So $D$ works, and we can conclude that $\Lambda(\cdot)$ is differentiable in $D_\Lambda^o$ with
$$\Lambda'(\nu) = \frac{1}{M(\nu)}E[X_1 e^{\nu X_1}].$$
To prove that
$$\Lambda'(\nu) = y \implies \Lambda^*(y) = \nu y - \Lambda(\nu),$$

we will first prove that $g(\nu) = \nu y - \Lambda(\nu)$ is a concave function. Let $t \in [0,1]$; then, since $\Lambda(\cdot)$ is a convex function, we have
$$g((1-t)\lambda_1 + t\lambda_2) = ((1-t)\lambda_1 + t\lambda_2)y - \Lambda((1-t)\lambda_1 + t\lambda_2) \ge (1-t)\lambda_1 y - (1-t)\Lambda(\lambda_1) + t\lambda_2 y - t\Lambda(\lambda_2) = (1-t)g(\lambda_1) + tg(\lambda_2),$$
so that $g(\cdot)$ is concave. Now if $\Lambda'(\nu) = y$, then
$$g'(\nu) = y - \Lambda'(\nu) = y - y = 0.$$
Since $g(\cdot)$ is concave, it follows that $g(\nu) = \sup_{\lambda\in\mathbb{R}} g(\lambda) = \Lambda^*(y)$, as we wanted.

2.2 Proof of Cramér's theorem

We are now able to start with the proof of Cramér's theorem.

Proof (Cramér). (1) Let $F$ be a non-empty closed set and let $I_F := \inf_{x\in F}\Lambda^*(x)$. We first note that when $I_F = 0$, (2.3) trivially holds, because the left-hand side of (2.3) is at most 0. Assume therefore that $I_F > 0$; we will consider the possibilities for $\bar{x} = E[X_1]$ separately below. For all $x$ and every $\lambda \ge 0$, Chebyshev's inequality [6] gives
$$\mu_n([x,\infty)) = E\big[\mathbf{1}_{\{\hat{S}_n - x \ge 0\}}\big] \le E\big[e^{n\lambda(\hat{S}_n - x)}\big] = e^{-n\lambda x}E\Big[e^{\lambda\sum_{i=1}^n X_i}\Big] = e^{-n\lambda x}\prod_{i=1}^n E\big[e^{\lambda X_i}\big] = e^{-n[\lambda x - \Lambda(\lambda)]}.$$


By using part 2 of Lemma 2.3, we know that if $\bar{x} < \infty$, then for every $x > \bar{x}$,
$$\mu_n([x,\infty)) \le e^{-n\Lambda^*(x)}. \quad (2.7)$$
By a similar argument, if $\bar{x} > -\infty$ and $x < \bar{x}$, then
$$\mu_n((-\infty,x]) \le e^{-n\Lambda^*(x)}. \quad (2.8)$$

Next we will consider all possibilities for $\bar{x}$. Firstly, assume that $\bar{x}$ is finite. By Lemma 2.3 we have $\Lambda^*(\bar{x}) = 0$, and because we have assumed that $I_F > 0$, $\bar{x} \in F^c$. Let $(x_-, x_+)$ be the union of all the open intervals $(a,b) \subset F^c$ that contain $\bar{x}$. Note that $x_- < x_+$ and that either $x_-$ or $x_+$ must be finite, since $F$ is non-empty. If $x_-$ is finite, then $x_- \in F$ and therefore $\Lambda^*(x_-) \ge \inf_{x\in F}\Lambda^*(x) = I_F$. By the same argument we have $\Lambda^*(x_+) \ge I_F$ when $x_+$ is finite. Now, applying (2.7) for $x = x_+$ and (2.8) for $x = x_-$, we find
$$\mu_n(F) \le \mu_n((-\infty,x_-]) + \mu_n([x_+,\infty)) \le 2e^{-nI_F},$$
so that
$$\frac{1}{n}\log\mu_n(F) \le \frac{1}{n}\log 2 - I_F,$$
and therefore
$$\limsup_{n\to\infty}\frac{1}{n}\log\mu_n(F) \le -I_F = -\inf_{x\in F}\Lambda^*(x),$$
as desired.

If $\bar{x} = -\infty$, then, since $\Lambda^*$ is nondecreasing on $(\bar{x},\infty)$, it is nondecreasing on all of $\mathbb{R}$, so that $\lim_{x\to-\infty}\Lambda^*(x) = 0$. Hence $x_+ := \inf\{x : x \in F\}$ is finite, because otherwise $I_F = 0$. Since $F$ is a closed set, $x_+ \in F$ and $\Lambda^*(x_+) \ge I_F$, as we have shown before. Moreover, $F \subset [x_+,\infty)$, and therefore the large deviations upper bound follows by applying (2.7) with $x = x_+$.

The case $\bar{x} = \infty$ is handled analogously.

(2) In the following part of the proof we will show that for every $\delta > 0$ and every probability measure $\mu$,
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n((-\delta,\delta)) \ge \inf_{\lambda\in\mathbb{R}}\Lambda(\lambda) = -\Lambda^*(0). \quad (2.9)$$
Proving this is sufficient, due to the following transformation: $Y = X - x$, with $\Lambda_Y(\lambda) = \Lambda(\lambda) - \lambda x$. Now
$$\Lambda_Y^*(y) = \sup_{\lambda\in\mathbb{R}}\{\lambda y - \Lambda_Y(\lambda)\} = \sup_{\lambda\in\mathbb{R}}\{\lambda(y + x) - \Lambda(\lambda)\} = \Lambda^*(y + x).$$
It then follows from (2.9) that for every $x$ and every $\delta > 0$,
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n((x-\delta,x+\delta)) \ge -\Lambda^*(x). \quad (2.10)$$


For any open set $G$ and any $x \in G$, there exists $\delta > 0$ such that $(x-\delta, x+\delta) \subset G$. Therefore we have
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(G) \ge \liminf_{n\to\infty}\frac{1}{n}\log\mu_n((x-\delta,x+\delta)) \ge -\Lambda^*(x).$$
Since this holds for all $x \in G$, it follows that
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n(G) \ge -\inf_{x\in G}\Lambda^*(x),$$
so the large deviations lower bound follows from (2.9).

Now we will start with the proof of (2.9). Firstly, we consider the case where $\mu((-\infty,0)) > 0$ and $\mu((0,\infty)) > 0$, and $\mu$ is supported on a bounded subset of $\mathbb{R}$. Since $\mu$ is supported on a bounded subset of $\mathbb{R}$, say $F$, we have
$$\Lambda(\lambda) = \log E[e^{\lambda X_1}] = \log\int_{\mathbb{R}} e^{\lambda x}\,d\mu(x) = \log\int_F e^{\lambda x}\,d\mu(x) < \infty.$$
Now let $F = A\cup B$, where $A = \{x \in F : x \le 0\}$ and $B = F\setminus A$; then
$$\log\int_F e^{\lambda x}\,d\mu(x) = \log\left(\int_A e^{\lambda x}\,d\mu(x) + \int_B e^{\lambda x}\,d\mu(x)\right).$$

Note that $A$ and $B$ are non-empty due to the assumption that $\mu((-\infty,0)) > 0$ and $\mu((0,\infty)) > 0$; if $0 \in A$, then $A\setminus\{0\}$ is still non-empty because $\mu((-\infty,0)) > 0$. Now, if $\lambda \to \infty$, the integral over $A$ stays bounded while the integral over $B$ tends to infinity, so that
$$\lim_{\lambda\to\infty}\log\left(\int_A e^{\lambda x}\,d\mu(x) + \int_B e^{\lambda x}\,d\mu(x)\right) = \infty.$$
With a similar argument (now the integral over $A$ tends to infinity) we have
$$\lim_{\lambda\to-\infty}\log\left(\int_A e^{\lambda x}\,d\mu(x) + \int_B e^{\lambda x}\,d\mu(x)\right) = \infty.$$
We conclude that $\lim_{|\lambda|\to\infty}\Lambda(\lambda) = \infty$, while $\Lambda(\cdot) < \infty$ everywhere. Due to Lemma 2.3, $\Lambda(\cdot)$ is a continuous, differentiable function, and hence there exists a finite $\nu$ such that
$$\Lambda(\nu) = \inf_{\lambda\in\mathbb{R}}\Lambda(\lambda) \quad\text{and}\quad \Lambda'(\nu) = 0.$$

Define a new probability measure $\tilde{\mu}$ in terms of $\mu$ via
$$\frac{d\tilde{\mu}}{d\mu}(x) = e^{\nu x - \Lambda(\nu)},$$
and notice that $\tilde{\mu}$ is a probability measure because
$$\int_{\mathbb{R}} d\tilde{\mu} = \frac{1}{E[e^{\nu X_1}]}\int_{\mathbb{R}} e^{\nu x}\,d\mu(x) = 1.$$
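As an aside (an illustration added here, not part of the original proof): this change of measure is known as exponential tilting, and for concrete distributions it can be computed explicitly. For example, if $\mu$ is the Poisson($\theta$) distribution, so that $M(\nu) = e^{\theta(e^\nu - 1)}$, then
$$\tilde{\mu}(\{x\}) = \frac{e^{\nu x}}{M(\nu)}\cdot\frac{\theta^x e^{-\theta}}{x!} = \frac{(\theta e^\nu)^x e^{-\theta e^\nu}}{x!}, \qquad x = 0, 1, 2, \dots,$$
so the tilted measure is again Poisson, with parameter $\theta e^\nu$: tilting shifts the mean from $\theta$ to $\theta e^\nu$ without leaving the Poisson family.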

Let $\tilde{\mu}_n$ be the probability measure governing $\hat{S}_n$ when $X_1, X_2, \dots$ are i.i.d. random variables with law $\tilde{\mu}$. Note that for all $\epsilon > 0$,
$$\mu_n((-\epsilon,\epsilon)) = \int_{|\sum_{i=1}^n X_i| < n\epsilon}\mu(dX_1)\cdots\mu(dX_n) \ge e^{-n\epsilon|\nu|}\int_{|\sum_{i=1}^n X_i| < n\epsilon}\exp\Big(\nu\sum_{i=1}^n X_i\Big)\mu(dX_1)\cdots\mu(dX_n) = e^{-n\epsilon|\nu|}e^{n\Lambda(\nu)}\tilde{\mu}_n((-\epsilon,\epsilon)). \quad (2.11)$$

From Lemma 2.3 and the choice of $\nu$, it follows that
$$E_{\tilde{\mu}}[X_1] = \frac{1}{M(\nu)}\int_{\mathbb{R}} x e^{\nu x}\,d\mu = \Lambda'(\nu) = 0.$$
By the law of large numbers,
$$\lim_{n\to\infty}\tilde{\mu}_n((-\epsilon,\epsilon)) = 1. \quad (2.12)$$
With use of (2.11), it follows that for every $0 < \epsilon < \delta$,
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n((-\delta,\delta)) \ge \liminf_{n\to\infty}\frac{1}{n}\log\mu_n((-\epsilon,\epsilon)) \ge \liminf_{n\to\infty}\frac{1}{n}\log\left(e^{-n\epsilon|\nu|}e^{n\Lambda(\nu)}\tilde{\mu}_n((-\epsilon,\epsilon))\right) = \liminf_{n\to\infty}\frac{1}{n}\left(n\Lambda(\nu) - n\epsilon|\nu| + \log\tilde{\mu}_n((-\epsilon,\epsilon))\right) = \Lambda(\nu) - \epsilon|\nu|,$$
where the last equality uses (2.12). By taking the limit $\epsilon \to 0$, (2.9) follows for this case.

Next we consider the case where $\mu$ has unbounded support, while both $\mu((-\infty,0)) > 0$ and $\mu((0,\infty)) > 0$. Fix $M$ large enough that $\mu([-M,0)) > 0$ and $\mu((0,M]) > 0$, and define
$$\Lambda_M(\lambda) := \log\int_{-M}^{M} e^{\lambda x}\,d\mu.$$
Let $\tau$ denote the law of $X_1$ conditioned on $\{|X_1| \le M\}$, and let $\tau_n$ be the law of $\hat{S}_n$ conditioned on $\{|X_i| \le M,\ i = 1,\dots,n\}$. Then for all $n$ and every $\delta > 0$,
$$\mu_n((-\delta,\delta)) \ge \tau_n((-\delta,\delta))\,\mu([-M,M])^n.$$
By the preceding proof, (2.10) holds for the measures $\tau_n$. Therefore, with the logarithmic moment generating function associated with $\tau$ being $\Lambda_M(\lambda) - \log\mu([-M,M])$, we have
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n((-\delta,\delta)) \ge \liminf_{n\to\infty}\frac{1}{n}\log\left(\tau_n((-\delta,\delta))\,\mu([-M,M])^n\right) \ge \liminf_{n\to\infty}\frac{1}{n}\log\tau_n((-\delta,\delta)) + \log\mu([-M,M]) \ge \inf_{\lambda\in\mathbb{R}}\Lambda_M(\lambda).$$


Now, with $I_M := -\inf_{\lambda\in\mathbb{R}}\Lambda_M(\lambda)$ and $I^* := \limsup_{M\to\infty} I_M$, we have
$$\liminf_{n\to\infty}\frac{1}{n}\log\mu_n((-\delta,\delta)) \ge -I^*. \quad (2.13)$$
This is true because $\Lambda_M(\cdot)$ is nondecreasing in $M$, and thus so is $-I_M$. Also, $-I_M \le \Lambda_M(0) \le \Lambda(0) = 0$, and therefore $-I^* \le 0$. Since $-I_M$ is finite for all $M$ large enough, $-I^* > -\infty$. Therefore, the level sets $\{\lambda : \Lambda_M(\lambda) \le -I^*\}$ are non-empty, compact sets that are nested with respect to $M$, and hence there exists at least one point, say $\lambda_0$, in their intersection. By Lebesgue's monotone convergence theorem, $\Lambda(\lambda_0) = \lim_{M\to\infty}\Lambda_M(\lambda_0) \le -I^*$, and consequently the bound of (2.13) yields (2.9) for $\mu$ of unbounded support.

The proof of (2.9) for an arbitrary probability law $\mu$ is completed by observing that if either $\mu((-\infty,0)) = 0$ or $\mu((0,\infty)) = 0$, then $\Lambda(\cdot)$ is a monotone function with $\inf_{\lambda\in\mathbb{R}}\Lambda(\lambda) = \log\mu(\{0\})$. Hence, in this case, (2.9) follows from
$$\mu_n((-\delta,\delta)) \ge \mu_n(\{0\}) = \mu(\{0\})^n.$$

Now that we have finished the proof, we will focus on the application of the theorem, which may not be immediately clear. The following example gives some insight into the usefulness of the theorem in the special case where the closed set $F$ and open set $G$ considered in the theorem are of the form $F = [x,\infty)$ and $G = (x,\infty)$.

Example 2.2.1. Let $X_1, X_2, \dots$ be i.i.d. real random variables and let $(\mu_n)_{n=1}^\infty$ be the sequence of measures defined at the beginning of this chapter; then $(\mu_n)_{n=1}^\infty$ satisfies the Large Deviation Principle with the convex rate function $\Lambda^*(\cdot)$. Take $F = [x,\infty)$ and $G = (x,\infty)$ with $x \in \mathbb{R}$. In this case
$$-\inf_{x\in F}\Lambda^*(x) = -\inf_{x\in G}\Lambda^*(x),$$
so that
$$\lim_{n\to\infty}\frac{1}{n}\log\mu_n(F) = -\inf_{x\in F}\Lambda^*(x).$$
By writing out the definition of the limit we find the following:
$$\forall\epsilon > 0,\ \exists N\in\mathbb{N} \text{ such that } \forall n \ge N:\quad \left|\frac{1}{n}\log\mu_n(F) + \inf_{x\in F}\Lambda^*(x)\right| < \epsilon$$
$$\iff \frac{1}{n}\log\mu_n(F) \in \left(-\epsilon - \inf_{x\in F}\Lambda^*(x),\ \epsilon - \inf_{x\in F}\Lambda^*(x)\right)$$
$$\iff \mu_n(F) \in \left(\exp\Big(n\big(-\epsilon - \inf_{x\in F}\Lambda^*(x)\big)\Big),\ \exp\Big(n\big(\epsilon - \inf_{x\in F}\Lambda^*(x)\big)\Big)\right).$$
Since $\mu_n(F) = P[\hat{S}_n \in F]$, this last expression gives us an interval containing the probability that the empirical mean $\hat{S}_n$ lies in $F$, valid for all $n \ge N$.

Corollary 2.4. Let $X_1, X_2, \dots$ be a sequence of i.i.d. real random variables with $E[X_1] < \infty$. Then for all $x > E[X_1]$, $x \in \mathbb{R}$, it holds that
$$\mu_n([x,\infty)) \le e^{-n\Lambda^*(x)}.$$
By a similar argument, if $E[X_1] > -\infty$ and $x < E[X_1]$, then
$$\mu_n((-\infty,x]) \le e^{-n\Lambda^*(x)}.$$


3 The Central Limit Theorem

In this chapter we will compare the central limit theorem to Cramér's theorem. We start with a brief recap of the central limit theorem.

3.1 Central Limit Theorem

Theorem 3.1 (Central Limit Theorem). Let $X_1, X_2, \dots$ be i.i.d. random variables with finite mean $m$ and finite variance $\sigma^2$. Set $S_n = X_1 + \dots + X_n$. Then, as $n\to\infty$,
$$\frac{S_n - nm}{\sqrt{n}} \xrightarrow{d} N(0,\sigma^2),$$
where $N(0,\sigma^2)$ denotes the normal distribution with mean 0 and variance $\sigma^2$.

Proof. For a proof see [5].

Corollary 3.2. Let $X_1, X_2, \dots$ be i.i.d. random variables with finite mean $m$ and finite variance $\sigma^2$. Set $S_n = X_1 + \dots + X_n$. Then for each fixed $x \in \mathbb{R}$,
$$\lim_{n\to\infty} P\left(\frac{S_n - nm}{\sqrt{n\sigma^2}} \le x\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt. \quad (3.1)$$
We are able to rewrite this last statement as follows:
$$\lim_{n\to\infty} P\left(S_n \le x\sqrt{n\sigma^2} + nm\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt,$$
meaning that $S_n$ will approximately equal $nm$, with deviations of order $\sqrt{n}$.

As mentioned in the introduction, a frequently made mistake is to assume the following: for fixed $n \in \mathbb{N}$,
$$P\left[\frac{S_n - nm}{\sqrt{n\sigma^2}} \ge x\right] \approx \frac{1}{\sqrt{2\pi}}\int_{x}^{\infty} e^{-t^2/2}\,dt.$$

Example 3.1.1. Let $X_1, \dots, X_n$ be random variables such that $X_i \sim \text{Uniform}[0,1]$ for $i = 1,\dots,n$ (notice that $m = \frac{1}{2}$ and $\sigma^2 = \frac{1}{12}$). Then it is obvious that
$$0 = P[S_n \ge n] = P\left[\frac{S_n - nm}{\sqrt{n\sigma^2}} \ge \frac{n - nm}{\sqrt{n\sigma^2}}\right] = P\left[\frac{S_n - nm}{\sqrt{n\sigma^2}} \ge \frac{\sqrt{n}(1-m)}{\sigma}\right].$$
However, for all $n \in \mathbb{N}$ we have
$$\frac{1}{\sqrt{2\pi}}\int_{\frac{\sqrt{n}(1-m)}{\sigma}}^{\infty} e^{-t^2/2}\,dt > 0.$$
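To make the discrepancy concrete, the following small Matlab check (added here, not in the original thesis) evaluates the normal tail approximation of this impossible event for a few values of $n$; it assumes the Statistics Toolbox function normcdf, the standard normal distribution function.

% The CLT tail approximation assigns positive probability to the
% impossible event S_n >= n for X_i ~ Uniform[0,1] (true probability 0).
m = 1/2; sigma = sqrt(1/12);
for n = [1 5 10 20]
    z = sqrt(n)*(1 - m)/sigma;         % standardized threshold
    approx = 1 - normcdf(z);           % normal tail approximation
    fprintf('n = %2d: approximation %.3e (true value 0)\n', n, approx);
end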


This example makes clear that the Central Limit Theorem only tells us something in the limit, and not for an arbitrary fixed $n \in \mathbb{N}$. With Cramér's theorem we are able to say something about both. In the special case where we are interested in the probability that the mean lies in an interval of the form $[x,\infty)$ for $x \in \mathbb{R}$, then for all $n \in \mathbb{N}$ we can use the upper bound
$$\mu_n([x,\infty)) \le e^{-n\Lambda^*(x)}$$
as introduced in Corollary 2.4. In the following chapter we will compute this upper bound for the Poisson distribution for different values of $n$.


4 Simulations

In this chapter we will run some simulations in Matlab to see Cramér's theorem in action. The idea is to compare theoretical bounds to empirical estimates. We start by considering the Poisson distribution.

4.1 Theoretical part

Let $X_1, X_2, \dots$ be i.i.d. real random variables such that $X_1 \sim \text{Poisson}(\theta)$ with $\theta \in (0,\infty)$. To use Cramér's theorem we will start with the computation of
$$\Lambda^*(x) := \sup_{\lambda\in\mathbb{R}}\{\lambda x - \log E[e^{\lambda X_1}]\}.$$
We will use the fact that the moment generating function of the Poisson distribution satisfies $M(\lambda) = e^{\theta(e^\lambda - 1)}$. Using this fact we find
$$\sup_{\lambda\in\mathbb{R}}\{\lambda x - \log M(\lambda)\} = \sup_{\lambda\in\mathbb{R}}\{\lambda x - \theta(e^\lambda - 1)\}. \quad (4.1)$$
By direct evaluation of $\lambda x - \theta(e^\lambda - 1)$ we notice that if $x < 0$ we can let $\lambda \to -\infty$, so that $\lambda x - \theta(e^\lambda - 1) \to \infty$; thus for negative $x$ we have $\Lambda^*(x) = \infty$. If $x = 0$ we can also let $\lambda \to -\infty$, and the supremum equals $\Lambda^*(0) = \theta$. If $x > 0$ we can use a derivative test to find the maximum. To find the maximum we differentiate with respect to $\lambda$ and set this derivative equal to 0. We need to solve
$$\frac{\partial}{\partial\lambda}\left(\lambda x - \theta(e^\lambda - 1)\right) = 0 \implies x - \theta e^\lambda = 0 \implies \lambda = \log\Big(\frac{x}{\theta}\Big).$$
Since
$$\frac{\partial^2}{\partial\lambda^2}\left(\lambda x - \theta(e^\lambda - 1)\right) = -\theta e^\lambda < 0,$$
$\lambda = \log(\frac{x}{\theta})$ is a local maximum. Moreover, for $\lambda < \log(\frac{x}{\theta})$ the function is monotonically increasing and for $\lambda > \log(\frac{x}{\theta})$ it is monotonically decreasing, so $\lambda = \log(\frac{x}{\theta})$ is a global maximum. Substituting this back into (4.1) we find
$$\Lambda^*(x) = \theta - x + x\log\Big(\frac{x}{\theta}\Big), \qquad\text{for } x > 0.$$
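As a quick sanity check (a sketch added here, not from the original text), one can verify the closed form numerically in Matlab by maximizing $\lambda x - \theta(e^\lambda - 1)$ over a grid:

% Numerically maximize lambda*x - theta*(exp(lambda)-1) over a grid and
% compare the result with the closed form theta - x + x*log(x/theta).
theta = 5; x = 5.3;
lambda = linspace(-5, 5, 1e6);    % grid covering the maximizer log(x/theta)
numeric = max(lambda*x - theta*(exp(lambda) - 1));
closedform = theta - x + x*log(x/theta);
fprintf('grid maximum %.6f, closed form %.6f\n', numeric, closedform);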


To use the theorem we will need to choose sets $F$ and $G$ and a parameter $\theta$. We will use $F = [5.3,\infty)$ and $G = (5.3,\infty)$ and set $\theta = 5$, so that $X_1 \sim \text{Poisson}(5)$. In this case
$$-\inf_{x\in F}\Lambda^*(x) = -\inf_{x\in G}\Lambda^*(x) = -\inf_{x\in F}\Big(\theta - x + x\log\frac{x}{\theta}\Big).$$
With use of the lemma we know that for $x > \bar{x}$, $\Lambda^*$ is a nondecreasing function, and it therefore attains its minimum over $F$ at $x = 5.3$. In other words,
$$-\inf_{x\in F}\Big(\theta - x + x\log\frac{x}{\theta}\Big) = -\Big(5 - 5.3 + 5.3\log\frac{5.3}{5}\Big) \approx -0.0088.$$
From the proof we know that $\mu_n(F) \le e^{-n\inf_{x\in F}\Lambda^*(x)}$. Considering $n = 10, 100, 500$ and $1000$, we find the following (approximate) upper bounds, respectively:
$$\mu_{10}(F) \le e^{-10\inf_{x\in F}\Lambda^*(x)} \approx 0.915,$$
$$\mu_{100}(F) \le e^{-100\inf_{x\in F}\Lambda^*(x)} \approx 0.415,$$
$$\mu_{500}(F) \le e^{-500\inf_{x\in F}\Lambda^*(x)} \approx 0.012,$$
$$\mu_{1000}(F) \le e^{-1000\inf_{x\in F}\Lambda^*(x)} \approx 0.000145.$$
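These numbers are easy to reproduce; the following short Matlab sketch (added here, not part of the original thesis) evaluates the bound $e^{-nI}$ with $I = \Lambda^*(5.3)$:

% Chernoff-type upper bounds exp(-n*I), with I = Lambda*(5.3) for theta = 5.
theta = 5; x = 5.3;
I = theta - x + x*log(x/theta);   % rate at the boundary point of F
n = [10 100 500 1000];
bounds = exp(-n*I)                % approx. [0.91 0.41 0.012 0.00015]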

In the next section we will check whether these upper bounds hold, using simulations.

4.2 Matlab part

In this section we will use Matlab to simulate an experiment in which we repeatedly draw $n$ i.i.d. random variables with $X_1 \sim \text{Poisson}(5)$, compute the mean, and check whether this mean falls into our region $[5.3,\infty)$. If it does, we count this occurrence, and in the end we check how many times this happened compared to the total number of runs. The following code runs the experiment for $n = 10$.

% This script checks 5000 times whether the empirical mean of a
% random sample of n Poisson(5) distributed random
% variables lies in the region [5.3, inf).
number = 0;
n = 10;
for k = 1:5000
    empiricalmean = mean(poissrnd(5, [n, 1]));
    if empiricalmean >= 5.3
        number = number + 1;
    end
end
number/5000
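To obtain the averages and standard deviations reported in Table 4.1 below, the experiment can be wrapped in two more loops; a sketch of how this could look (added here, not the author's original script) is the following:

% Sketch: repeat the 5000-run experiment 50 times for each n and report
% the mean and standard deviation of the 50 estimates of mu_n(F).
% Note: this is computationally heavy and may take a while to run.
for n = [10 100 500 1000]
    estimates = zeros(50, 1);
    for r = 1:50
        number = 0;
        for k = 1:5000
            if mean(poissrnd(5, [n, 1])) >= 5.3
                number = number + 1;
            end
        end
        estimates(r) = number/5000;
    end
    fprintf('n = %4d: mean %.6f, std %.6f\n', n, mean(estimates), std(estimates));
end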

The results are shown in Table 4.1: the first column presents the value of $\mu_n(F)$ found by running the script once, the second column the average of 50 such observations, and the third column the standard deviation $\sigma_{50}$ of these 50 observations.

Table 4.1:
             1 observation   average of 50 obs.   σ_50
μ_10(F)      0.3036          0.3050               0.0076
μ_100(F)     0.0822          0.0882               0.0036
μ_500(F)     0.0020          0.0014               0.00052129
μ_1000(F)    0               0.000028             0.0000060609

We notice that there is not a lot of variance within the observations, and that they are a lot lower than the theoretical upper bounds.


5 Conclusion

This thesis started with the introduction of the Fenchel-Legendre transform in Chapter 2. Then some useful properties of this transform were proved, for example that the Fenchel-Legendre transform is a rate function and that the logarithmic moment generating function used in the Fenchel-Legendre transform is convex. These properties and their proofs can be found in Lemma 2.3. Subsequently, Cramér's Theorem was introduced and a detailed proof was given. In the proofs of both Lemma 2.3 and Cramér's Theorem, [2] was followed. A corollary of the proof was the fact that for closed and open sets of the form $[x,\infty)$ and $(x,\infty)$, $x \in \mathbb{R}$, we can find upper and lower bounds for the probability that the mean of $n$ i.i.d. random variables lies in such an interval.

In Chapter 3 we introduced the Central Limit Theorem and gave an example of why the Central Limit Theorem should only be used in the limit and not for a finite number of i.i.d. random variables. In that case Cramér's Theorem can be used, and it is therefore a useful extension of the Central Limit Theorem.

In Chapter 4 we computed the theoretical upper bound on the probability that the mean of $n$ Poisson(5) distributed random variables lies above the value 5.3. We did this for $n = 10, 100, 500$ and $1000$, and found that this probability decreases at an exponential rate. With the use of Matlab we then simulated a random sample of $n$ Poisson(5) distributed values and computed the mean. We did this 5000 times and checked how many times the mean exceeded the value 5.3. We then compared these estimates to the theoretical bounds and found that they indeed hold.

In the simulations we found that the empirical values are often around 4 times smaller than the theoretical bound. This gives rise to the question whether the bound given by Cramér's theorem can be made tighter. Do there exist other theorems that provide similar bounds, and how does Cramér's theorem compare to them? These are all questions that could be addressed in further research.


6 VWO summary

The theory of large deviations is a branch of probability theory concerned with analyzing large deviations in the outcomes of certain events.

A simple example can be given by throwing a die a number of times. A die has 6 faces, each showing one of the numbers 1, 2, 3, 4, 5 or 6. Suppose we throw the die once; then the probability of throwing any particular one of these numbers is 1 in 6. We write this as follows: $P[X_1 = 2] = \frac{1}{6}$, where $X_1$ stands for the number we throw in the first turn. On average we throw 3.5 with a die. We compute this as follows: $E[X] = \frac{1+2+3+4+5+6}{6} = 3.5$, where $E[X]$ stands for what we expect ($E$) of the outcome of the die ($X$). If we throw a die once, the probability that we throw an average of 3.5 is 0, but if we throw a die 100 times, the probability that our average is 3.5 becomes larger. A mathematical theorem, the law of large numbers, even tells us that the more often we throw the die, the closer we get to this average. After infinitely many throws of the die, the average is exactly 3.5. So now we know that if we throw a die infinitely often, we throw 3.5 on average. But what if we throw a die 100 times and want to know the probability that our average is above 4? The mathematician Harald Cramér devised a clever theorem that can tell us something about this. A consequence of his theorem is that it gives lower and upper bounds for the probability that the average lies above (or below) a certain value. The beauty of this theorem is that it works not only for dice, but for all kinds of probability distributions: for discrete distributions such as the Poisson and binomial distributions, but also for continuous distributions such as the normal distribution.

In this thesis we explain what this theorem says exactly and give a proof that it is correct. This has of course been done many times before, and the proof of this theorem appears in many books, but it is often written down very compactly; mathematicians generally like to write things down as briefly as possible. In this thesis we have instead worked everything out carefully and written it down in more detail. In addition, the theorem is compared with another theorem that is sometimes used for the same purpose, and we run a simulation to see how well the theorem works in practice.


Bibliography

[1] H. Cramér. Half a century with probability theory: some personal recollections. The Annals of Probability, 4(4):511-512, 1976.

[2] A. Dembo, O. Zeitouni. Large Deviations Techniques and Applications, Springer, pages 1-50.

[3] B.P. Rynne, M.A. Youngson. Linear Functional Analysis, Springer, page 28.

[4] R.L. Schilling. Measures, Integrals and Martingales, Cambridge University Press, page 116.

[5] J.S. Rosenthal. A First Look at Rigorous Probability Theory, World Scientific Publishing Co, page 133.

[6] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):1-13, 1963.
