Asymptotic Convergence - Wald’s Theorem and
the Kolmogorov-Smirnov Test
Arja Rydin
July 7, 2017
Bachelor Thesis
Supervisors: MSc Nicos Starreveld, prof. dr. Michel Mandjes
Korteweg-de Vries Instituut voor Wiskunde
Abstract
In this project we give a generalization of the proof that Wald’s Statistic converges to a chi-squared distribution. In the literature ([4]) this has been proved for a one-dimensional parameter space; here we give the extended proof for a q-dimensional parameter space. To do so, we also generalize the proof of convergence of the Maximum Likelihood Estimator to q-dimensional parameter spaces.
We also perform a numerical study of the Kolmogorov-Smirnov Test, using the implementation included in MATLAB R2015b. The data which we simulate is scaled and centred, so that it is normally distributed in the limit. However, our data is only approximately normally distributed, since we fix our data in the pre-limit (n = 200). We study how well the Kolmogorov-Smirnov Test can detect this for a growing sample size. We found that when the data originates from a discrete (Bernoulli) distribution, the Null Hypothesis is rejected quite quickly, at around a sample size of m = 1000. When simulating from a continuous (Uniform(0,1)) distribution, a far larger sample size is needed to reject the Null Hypothesis. We eventually saw a decrease in the p-value, and we expect the Null Hypothesis to be rejected at some point, but we were not able to get beyond a sample size of m = 10 000 000.
Title: Asymptotic Convergence - Wald’s Theorem and the Kolmogorov-Smirnov Test
Author: Arja Rydin, arja.rydin@student.uva.nl, 10590854
Supervisors: MSc Nicos Starreveld, prof. dr. Michel Mandjes
Second examiner: prof. dr. Sindo Nunez
Date: July 7, 2017
Korteweg-de Vries Instituut voor Wiskunde Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam http://www.science.uva.nl/math
Contents
Abstract
1. Introduction
2. Prerequisites
   2.1. Notation
   2.2. Definitions
   2.3. Preliminary Results
3. Maximum Likelihood
   3.1. Asymptotic Behaviour of Maximum Likelihood Estimators
   3.2. Wald’s Theorem
4. Kolmogorov-Smirnov Test
   4.1. Algorithm Kolmogorov-Smirnov Test
   4.2. Simulation in Matlab
5. Conclusion
6. Popular Summary
Bibliography
1. Introduction
When we want to study a phenomenon, we rely on measurements. These measurements, or data, we need to interpret in order to draw sensible conclusions. For example, consider a die and suppose we want to establish whether it is fair, i.e. whether all possible outcomes are equally likely. We throw the die multiple times and model this process using a binomial probability distribution. If every side shows up approximately equally often, we will feel reassured in our belief that the die is fair. However, if one of the sides comes up markedly more often than the others, we begin to suspect the die is not fair. Only when we see something very deviant happening do we decide to suspect the die is unfair. We make this explicit by setting up a Null Hypothesis H0: ’the die is fair’ and an Alternative Hypothesis H1: ’the die is unfair’. We set up these hypotheses before conducting the research. After doing the experiment and seeing the data, we decide whether to reject the Null Hypothesis and accept the Alternative Hypothesis, or not to reject the Null Hypothesis (without necessarily accepting it).
In order to decide what to do with regard to our hypotheses, we need to process the data so that we can conclude whether or not to reject the Null Hypothesis. We process the data with a so-called statistic, which we then test: we look at the data and see how likely it is to observe it if the Null Hypothesis is true. If this probability is sufficiently low, we decide to reject the Null Hypothesis and adopt the Alternative Hypothesis. There exist many different statistics and tests; we will examine one specific statistic and one specific test.
In this project our goal is twofold. At first, we want to prove convergence of Wald’s statistic to a chi-squared distribution, we treat this in Section 3.2. In order to prove this convergence we need the asymptotic behaviour of the Maximum Likelihood Estimator, this is done in Section 3.1. In the literature, see [4], the corresponding results are established for the case of a one dimensional parameter. We extend these results for the case of a q-dimensional parameter and our findings are given in Theorems 8 and 9. The proofs rely on the ideas presented in [4] but the extension to vector parameters is not trivial, we refer the reader to Sections 3.1 and 3.2 for more details.
Secondly, we want to explore the sensitivity of the Kolmogorov-Smirnov Test when the sample size grows. We know from the Central Limit Theorem that the scaled and centred sample mean has asymptotically a normal distribution. But in the pre-limit it will not be exactly normally distributed, hence we would want a test statistic which detects this as the sample size grows. We perform some numerical experiments and show our results in Chapter 4.
2. Prerequisites
In this chapter we present the notation we will use and collect the main definitions and results we will rely on. The notation concerns how we denote matrices and vectors. We will give three different kinds of definitions, in this order: from statistics, from probability theory and from matrix theory. Lastly, we state a few important results which will be necessary for the proof of the convergence of Wald’s Statistic.
2.1. Notation
We will assign bold letters to vectors and matrices: x_n will be a column vector of length n, and A_{mn} will be a matrix of size m × n. In the sequel we will drop the dependence on n and m whenever it is clear from the context. Aᵀ will be the transpose of either a vector or a matrix. With ||A||, A ∈ M_{m×n}, we denote the norm of a matrix. We consider the following matrix norm:

||A|| := sup_{i,j} |A_{ij}|.

We write H_x f for the Hessian matrix and ∇_x f for the vector of partial derivatives of the real function f.
2.2. Definitions
We start off with some general statistical definitions, like point estimators and the Covariance Matrix. This will be followed by definitions from probability theory; among others, we define the chi-squared distribution and some important types of convergence. To end this section, we give some matrix properties.
Definitions from Statistics
Definition 1 (Statistic). Let X1, . . . , Xn be i.i.d. random variables, distributed according to some distribution f(x; θ) with an unknown parameter θ ∈ Θ, and denote X = (X1, . . . , Xn). A map T : X → R^q is called a statistic if it does not depend on the unknown parameter θ. Any statistic T(X) defines a form of data reduction or data summary. A minimal sufficient statistic is a statistic that has achieved the maximal amount of data reduction possible while still retaining all the information about the unknown parameter [2].
Definition 2 (Point Estimators). Let X1, . . . , Xn be i.i.d. random variables with some distribution which depends on an unknown parameter θ ∈ Θ ⊂ R^q, with q ≥ 1. A point estimator θ̂ for θ is a statistic based on the realization of the random variables, which gives a specific value for θ ∈ Θ.
Definition 3 (Hypothesis Testing). In a hypothesis testing problem, after observing the sample, the experimenter must decide either not to reject H0 (but not accepting it as true) or to reject H0 and decide H1 to be true. A hypothesis testing procedure or hypothesis test is a rule that specifies:

1. For which sample values the decision is made not to reject H0.
2. For which sample values we reject H0 and accept H1 as true.

The subset of the sample space for which H0 will be rejected is called the critical region. The complement of the rejection region is called the acceptance region. Typically we write: consider the test

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1.

To decide whether or not to reject the Null Hypothesis, we need a test statistic T, which is defined below. If H0 is true, we want the probability of T falling into the critical region K to be small: P_{θ0}(T(X) ∈ K) < α, where α is often chosen to be 0.05.
Definition 4 (Wald’s Statistic [4]). Let X1, . . . , Xn be an i.i.d. sequence of random variables with probability density function f(x; θ), θ ∈ Θ ⊂ R^q. Consider the test H0 : θ = θ0 vs. H1 : θ ≠ θ0. Let θ̂n be the MLE of θ. Then Wald’s Statistic QW is defined as follows:

QW = n(θ̂n − θ0)ᵀ I(θ̂n) (θ̂n − θ0). (2.1)
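To make Definition 4 concrete, here is a small numerical sketch in Python (the thesis itself works in MATLAB) for the one-dimensional Bernoulli(θ) case, where the MLE is the sample mean and I(θ) = 1/(θ(1 − θ)); the function name is ours, not from the literature.

```python
# Hypothetical one-dimensional illustration of Definition 4 for a
# Bernoulli(theta) sample: the MLE is the sample mean and
# I(theta) = 1/(theta(1 - theta)).

def wald_statistic_bernoulli(xs, theta0):
    """Q_W = n (theta_hat - theta0)^2 I(theta_hat) for a 0/1 sample."""
    n = len(xs)
    theta_hat = sum(xs) / n                            # MLE of theta
    info = 1.0 / (theta_hat * (1.0 - theta_hat))       # Fisher information at the MLE
    return n * (theta_hat - theta0) ** 2 * info

# 60 successes in 100 trials, testing H0: theta = 0.5
q_w = wald_statistic_bernoulli([1] * 60 + [0] * 40, 0.5)
print(round(q_w, 4))  # 100 * 0.1^2 / (0.6 * 0.4) ≈ 4.1667
```

Under H0 this value is compared with a χ² critical value with one degree of freedom (3.84 at level α = 0.05), so here H0 would be rejected.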
Definition 5 (Test Statistic). A test statistic is a statistic T(X) whose distribution under H0 is known, so that we can compute how likely the observed value is when θ ∈ Θ0. If this probability is small enough, we say T falls in the critical region, and we decide to reject the Null Hypothesis. If we denote the critical region by K, we say that we reject H0 if T(X) ∈ K.
Definition 6 (Likelihood Function). Let X1, . . . , Xn be i.i.d. random variables with probability density function f(x; θ). We define Ln(θ) as the product of the probability density functions of X1, . . . , Xn:

Ln(θ) := ∏_{i=1}^n f(Xi; θ).
Definition 7 (Maximum Likelihood Estimator (MLE)). The MLE θ̂n is the value for which the likelihood function achieves its maximum:

θ̂n = argsup_{θ∈Θ} ∏_{i=1}^n f(Xi; θ). (2.2)
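Definition 7 can be checked numerically. The sketch below (ours, in Python rather than the thesis’s MATLAB) confirms for i.i.d. Bernoulli data that the likelihood is maximised at the sample mean, using a crude grid search over θ.

```python
# Sketch: for i.i.d. Bernoulli data the log-likelihood
# log L_n(theta) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
# is maximised at the sample mean; we confirm with a grid search.
import math

def log_likelihood(theta, xs):
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]            # 7 successes in 10 trials
grid = [k / 1000 for k in range(1, 1000)]      # avoid theta = 0 and theta = 1
theta_grid = max(grid, key=lambda t: log_likelihood(t, xs))
print(theta_grid)  # 0.7, the sample mean
```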
Definition 8 (Empirical Cumulative Distribution Function (Empirical cdf)). Let x1, . . . , xn be the realization of X1, . . . , Xn i.i.d. real random variables, and let F(y) be their common cumulative distribution function. The empirical cumulative distribution function (empirical cdf) is given by

F̂n(y) = #{xi : xi ≤ y} / n = (1/n) ∑_{i=1}^n 1{xi ≤ y},

where 1{xi ≤ y} is the indicator function.
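A direct transcription of this definition into Python (a sketch of ours; the thesis’s simulations use MATLAB’s built-in ecdf machinery) looks as follows.

```python
# Minimal empirical cdf following the definition: F_n(y) = #{x_i <= y} / n.
def empirical_cdf(data):
    xs = sorted(data)
    n = len(xs)
    def F(y):
        return sum(1 for x in xs if x <= y) / n    # fraction of points <= y
    return F

F = empirical_cdf([2.0, 1.0, 3.0, 2.0])
print(F(0.5), F(2.0), F(10.0))  # 0.0 0.75 1.0
```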
Definition 9 (Fisher Information Matrix). Let X1, . . . , Xn be i.i.d. random variables with probability density function f(x; θ), x ∈ R, θ ∈ Θ ⊂ R^q. Let ∂²f(x; θ)/∂θi∂θj be defined for all i, j = 1, . . . , q. Then the Fisher Information Matrix I(θ) is defined by

I(θ) = E[ (∇_θ log f(X1; θ)) (∇_θ log f(X1; θ))ᵀ ].
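As a one-dimensional sanity check of this definition (a sketch of ours, not from the thesis): for a Bernoulli(θ) variable the score is ∂ log f(x; θ)/∂θ = x/θ − (1 − x)/(1 − θ), and the expected squared score equals the closed form 1/(θ(1 − θ)).

```python
# Exact expectation of the squared Bernoulli score over the two outcomes.
def fisher_info_bernoulli(theta):
    score1 = 1.0 / theta                 # score at the outcome x = 1
    score0 = -1.0 / (1.0 - theta)        # score at the outcome x = 0
    return theta * score1 ** 2 + (1.0 - theta) * score0 ** 2

theta = 0.3
print(round(fisher_info_bernoulli(theta), 4))   # 4.7619
print(round(1.0 / (theta * (1.0 - theta)), 4))  # closed form, same value
```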
Definition 10 (Covariance Matrix). Let X = (X1, . . . , Xn) be a random vector. The Covariance Matrix Σ is the matrix defined as

Σ := [Σij]_{i,j=1,...,n} = [Cov(Xi, Xj)]_{i,j=1,...,n}.
Definitions from Probability Theory
Definition 11 (χ²-distribution). We say the random variable Z is χ²-distributed with n degrees of freedom (notation: Z ∼ χ²_n) if Z is equal to the sum of n squared, independent standard normally distributed random variables:

Z = ∑_{i=1}^n Xi², where Xi ∼ N(0, 1).
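A seeded simulation sketch of this definition (ours, in Python): summing n squared standard normals and checking that the empirical mean is close to E[χ²_n] = n.

```python
# Simulate chi-squared draws as sums of n squared standard normals and
# verify that their empirical mean is near n (= E[chi^2_n]).
import random

random.seed(7)
n_dof, reps = 5, 20000
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n_dof))
           for _ in range(reps)]
mean = sum(samples) / reps
print(round(mean, 2))  # close to n_dof = 5
```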
Definition 12 (Convergence in Distribution). We say a sequence of random vectors {Xn} converges in distribution to a random vector X if

P(Xn ≤ x) → P(X ≤ x)

for every continuity point x of the limiting distribution function. For this type of convergence we write Xn →D X.
Definition 13 (Convergence in Probability). We define the distance function d on R^{m×n} by

d(A, B) = ||A − B|| = sup_{i,j} |Aij − Bij|,

where A and B are matrices. A sequence of random matrices {Xn} is said to converge in probability to a random matrix X if for all ε > 0

P(d(Xn, X) > ε) → 0.

We denote this kind of convergence with Xn →p X.
Definition 14 (Characteristic Function for Multi-dimensional Vectors). Let X be a k-dimensional random vector. For t ∈ R^k the characteristic function is defined as

φ_X(t) = E[e^{i tᵀX}].
Definitions from Matrix Theory
Definition 15. A positive semi-definite matrix is an (n × n)-matrix A for which xᵀAx ≥ 0 for all vectors x of length n.
Definition 16. An (n × n)-matrix A is said to be idempotent when A2 = A.
Definition 17. A generalized inverse of an (n × m)-matrix A is a matrix B ∈ R^{m×n} with the property that ABA = A. A generalized inverse exists for every matrix [3].
Definition 18 (Singularity of a Matrix). A square matrix A is said to be singular if there exists no matrix B such that AB = I.
Definition 19 (Rank of a Matrix). The rank of a matrix A is defined to be the number of linearly independent rows in A [3].
2.3. Preliminary Results
In this section we present some important results we will use later on.
Theorem 1 (Multivariate Central Limit Theorem). Let Yi = [Yi(1), . . . , Yi(q)]ᵀ, i = 1, . . . , n, be a sequence of independent, identically distributed random vectors. Let the sample mean Ȳn be

Ȳn = (1/n) ∑_{i=1}^n Yi = [(1/n) ∑_{i=1}^n Yi(1), . . . , (1/n) ∑_{i=1}^n Yi(q)]ᵀ = [Ȳn(1), . . . , Ȳn(q)]ᵀ.

Then we can write

(1/√n) ∑_{i=1}^n [Yi − EY1] = √n(Ȳn − EY1),

and

√n(Ȳn − EY1) →D N_q(0, Σ),

where we write Σ for the covariance matrix, as defined in Definition 10.
Theorem 2 (Slutsky’s Lemma [4]). Let {Xn} and {Yn} be sequences of random variables such that Xn →D X and Yn →p c, where c is a constant. Then it follows that

1. Xn + Yn →D X + c,
2. YnXn →D cX,
3. Xn/Yn →D X/c if c ≠ 0.
Theorem 3 (Lévy’s Continuity Theorem [4]). Let {Xn} be a sequence of random variables with corresponding characteristic functions {φn}, and let X be a random variable with characteristic function φ. If, for n → ∞,

φn(t) → φ(t) for all t ∈ R,

then Xn converges to X in distribution, i.e. Xn →D X.
Theorem 4 (Khintchine Weak Law of Large Numbers). Let {Xn}, n ≥ 1, be a sequence of i.i.d. random variables with EX1 = θ, and let

X̄n = (1/n) ∑_{k=1}^n Xk, n ≥ 1.

Then X̄n →p θ.
Theorem 5 (Taylor on R^p [4]). Let f(x), x ∈ A ⊂ R^p, be a continuous function with continuous partial derivatives up to (k + 1)th order, for some k ≥ 1. Then for every x ∈ A, x0 ∈ A, we have the following (multivariate) Taylor expansion:

f(x) = f(x0) + ∑_{j=1}^k (1/j!) ∑_{i1=1}^p · · · ∑_{ij=1}^p [∂^j f / (∂x_{i1} · · · ∂x_{ij})](x0) ∏_{l=1}^j (x_{il} − x0_{il}) + Rk(x, x0),

where Rk(x, x0) is given by

(1/(k + 1)!) ∑_{i1=1}^p · · · ∑_{i_{k+1}=1}^p [∂^{k+1} f / (∂x_{i1} · · · ∂x_{i_{k+1}})](hx0 + (1 − h)x) ∏_{l=1}^{k+1} (x_{il} − x0_{il}),

for some 0 < h < 1. For the first order Taylor expansion of f(x) we can actually write

f(x) = f(x0) + (x − x0)ᵀ ∇_x f(x0) + R1(x, x0)
     = f(x0) + (x − x0)ᵀ ∇_x f(x0) + (1/2)(x − x0)ᵀ H_x f(hx0 + (1 − h)x) (x − x0). (2.3)
Theorem 6 (Cochran [4]). Let {Xn} be a sequence of random vectors such that √n(Xn − θ) →D N(0, Σ), rank(Σ) = q ≤ p, and let {An} be a sequence of positive semi-definite non-random matrices. Then

Qn = n(Xn − θ)ᵀ An (Xn − θ) →D χ²_q

if and only if An converges to a generalized inverse of Σ.
Theorem 7 (Courant [4]). Let A and B be two positive semi-definite matrices, where B is non-singular. If λi = ch_i(AB⁻¹) denotes the i-th eigenvalue of AB⁻¹ (where λ1 ≥ . . . ≥ λq), for i = 1, . . . , q, then we get

λq = inf_x (xᵀAx)/(xᵀBx) ≤ sup_x (xᵀAx)/(xᵀBx) = λ1.
3. Maximum Likelihood
When searching for the true parameter of a distribution, we use observed data to estimate the parameter. We of course want to make sensible estimations, maximizing the probability of our estimation being correct. Maximum likelihood estimation is a widely used method of estimating parameters [2]. Asymptotically, the maximum likelihood estimator has some interesting properties.
3.1. Asymptotic Behaviour of Maximum Likelihood
Estimators
We will show normal convergence of the maximum likelihood estimator in the following theorem. In Large Sample Methods in Statistics, An Introduction with Applications ([4]) the proof was given for Θ ⊂ R. The main contribution of this project is extending this theorem to the case Θ ⊂ R^q.
Theorem 8. Let X1, . . . , Xn be i.i.d. random variables with probability density function f(x; θ), θ ∈ Θ ⊂ R^q, which meets the following conditions:

1. For i, j = 1, . . . , q, ∂f(x; θ)/∂θi and ∂²f(x; θ)/∂θi∂θj are defined almost everywhere and are such that

|∂f(x; θ)/∂θi| ≤ Hi(x) and |∂²f(x; θ)/∂θi∂θj| ≤ Gij(x),

where Hi and Gij are integrable functions with ∫_R Hi(x)dx < ∞ and ∫_R Gij(x)dx < ∞.

2. For i, j = 1, . . . , q the partial derivatives ∂ log f(x; θ)/∂θi and the second partial derivatives ∂² log f(x; θ)/∂θi∂θj exist almost everywhere and are such that:

a) The Fisher Information Matrix

I(θ) = E[ (∇_θ log f(X1; θ)) (∇_θ log f(X1; θ))ᵀ ]

is finite and positive definite,

b) for δ → 0,

E_θ[ sup_{h:||h||≤δ} || H_θ log f(X1; θ + h) − H_θ log f(X1; θ) || ] =: ψ_δ → 0.
Then the Maximum Likelihood Estimator (MLE) θ̂n of θ is such that

√n(θ̂n − θ) →D N(0, [I(θ)]⁻¹).

Proof. First, note that

E[∇_θ log f(X1; θ)] = ∫_R [∇_θ log f(x; θ)] f(x; θ) dx
 = [ ∫_R ((∂f/∂θ1)(x; θ)/f(x; θ)) f(x; θ) dx, . . . , ∫_R ((∂f/∂θq)(x; θ)/f(x; θ)) f(x; θ) dx ]ᵀ
 = [ ∫_R (∂f/∂θ1)(x; θ) dx, . . . , ∫_R (∂f/∂θq)(x; θ) dx ]ᵀ
 = [ ∂/∂θ1 ∫_R f(x; θ) dx, . . . , ∂/∂θq ∫_R f(x; θ) dx ]ᵀ
 = ∇_θ ∫_R f(x; θ) dx = 0. (3.1)

We are allowed to interchange the order of differentiation and integration because of Condition 1, which makes it possible to apply the Dominated Convergence Theorem. Consider the Hessian matrix H_θ log f(X1; θ). Then E[H_θ log f(X1; θ)]ij can be written as

E[H_θ log f(X1; θ)]ij = E[ ∂² log f(X1; θ)/∂θi∂θj ]
 = ∫_R ∂/∂θi ( (∂f(x; θ)/∂θj) {f(x; θ)}⁻¹ ) f(x; θ) dx
 = ∫_R ( (∂²f(x; θ)/∂θi∂θj) (1/f(x; θ)) − (∂f(x; θ)/∂θj)(∂f(x; θ)/∂θi) (1/f(x; θ)²) ) f(x; θ) dx
 = ∫_R ∂²f(x; θ)/∂θi∂θj dx − [I(θ)]ij
 = ∂/∂θi ∫_R (∂f/∂θj)(x; θ) dx − [I(θ)]ij = 0 − [I(θ)]ij = −[I(θ)]ij, (3.2)

where the first term vanishes by (3.1).
What we see is that the expected value of the Hessian of log f(X1; θ) in one observation is equal to the negative Fisher Information Matrix. We define the vector of sums of partial derivatives to be

Un(θ) := ∇_θ ∑_{i=1}^n log f(Xi; θ) = ∑_{i=1}^n [ ∂ log f(Xi; θ)/∂θ1, . . . , ∂ log f(Xi; θ)/∂θq ]ᵀ.

We use equation (3.1) and Condition 2a in combination with the Multivariate Central Limit Theorem (Theorem 1) to show that

(1/√n) Un(θ) →D N_q(0, I(θ)).

In order to see this, define the random vector Yi, i = 1, . . . , n, to be

Yi = [ ∂ log f(Xi; θ)/∂θ1, . . . , ∂ log f(Xi; θ)/∂θq ]ᵀ = ∇_θ log f(Xi; θ),

and note that Un(θ) = ∑_{i=1}^n Yi. In equation (3.1) we already saw that

E[Y1] = E[∇_θ log f(X1; θ)] = 0.

Consequently we obtain

∑_{i=1}^n E[Yi] = E[ ∑_{i=1}^n Yi ] = E[Un(θ)] = 0.

For the covariance matrix Σ we have

Σij = Cov(Y1(i), Y1(j)) = E[ (Y1(i) − E[Y1(i)]) (Y1(j) − E[Y1(j)]) ]
 = E[ (∂ log f(X1; θ)/∂θi − 0)(∂ log f(X1; θ)/∂θj − 0) ]
 = E[ (∂ log f(X1; θ)/∂θi)(∂ log f(X1; θ)/∂θj) ] = [I(θ)]ij.

This shows us that

Σ = I(θ). (3.3)

Let Ȳ = (1/n) ∑_{i=1}^n Yi be the sample mean. Using (3.3) and Theorem 1 we can indeed now conclude that

√n Ȳ →D N_q(0, Σ) =⇒ (1/√n) Un(θ) →D N_q(0, I(θ)). (3.4)
Let us define

Vn(θ) = H_θ( ∑_{i=1}^n log f(Xi; θ) ) = ∑_{i=1}^n H_θ log f(Xi; θ).

We will show that (1/n)Vn(θ) converges in probability:

(1/n) Vn(θ) →p −I(θ).

Vn(θ) is a (q × q)-matrix, and for every entry (1/n)[Vn(θ)]ij of this matrix we know, by (3.2), that the expected value is

E[ (1/n)[Vn(θ)]ij ] = E[ (1/n) ∑_{k=1}^n ∂² log f(Xk; θ)/∂θi∂θj ] = −[I(θ)]ij.

We may therefore use the Law of Large Numbers (Theorem 4) to see that

(1/n)[Vn(θ)]ij →p −[I(θ)]ij,

which is equivalent to

P( |(1/n)[Vn(θ)]ij + [I(θ)]ij| > ε ) → 0 as n → ∞, for every ε > 0.

This result holds for every i, j = 1, . . . , q, so it also holds for the maximum. This we use to see that, for every ε > 0,

P( ||(1/n)Vn(θ) + I(θ)|| > ε ) = P( max_{i,j} |(1/n)[Vn(θ)]ij + [I(θ)]ij| > ε ) → 0.

Our conclusion is therefore that

(1/n) Vn(θ) →p −I(θ). (3.5)
Consider u ∈ R^q with ||u|| ≤ K, for some constant 0 < K < ∞. Let us now define λn(u):

λn(u) = log Ln(θ + (1/√n)u) − log Ln(θ) = ∑_{i=1}^n [ log f(Xi; θ + (1/√n)u) − log f(Xi; θ) ], (3.6)

and let us make a first order Taylor expansion of log f(Xi; θ + (1/√n)u) around θ (Theorem 5):

log f(Xi; θ + (1/√n)u) = log f(Xi; θ) + (1/√n) uᵀ ∇_θ log f(Xi; θ) + (1/(2n)) uᵀ H_θ log f(Xi; θ + (1 − h)(1/√n)u) u,

for some 0 < h < 1. Substituting this into the definition of λn(u) in equation (3.6), we see

λn(u) = ∑_{i=1}^n [ (1/√n) uᵀ ∇_θ log f(Xi; θ) + (1/(2n)) uᵀ H_θ log f(Xi; θ + (1 − h)(1/√n)u) u ]
 = (1/√n) uᵀ ∑_{i=1}^n ∇_θ log f(Xi; θ) + (1/(2n)) uᵀ H_θ( ∑_{i=1}^n log f(Xi; θ + (1 − h)(1/√n)u) ) u.

Now, define Zn(u) to be

Zn(u) = (1/n) ∑_{i=1}^n [ H_θ log f(Xi; θ + (1 − h)(1/√n)u) − H_θ log f(Xi; θ) ].

Remember that Un(θ) = ∇_θ ∑_{i=1}^n log f(Xi; θ) and Vn(θ) = H_θ ∑_{i=1}^n log f(Xi; θ). Then

λn(u) = (1/√n) uᵀ Un(θ) + (1/(2n)) uᵀ Vn(θ) u + (1/2) uᵀ Zn(u) u. (3.7)
Let δ > 0. There exists an n0 = n0(δ) such that n^{−1/2}||u|| ≤ n^{−1/2}K < δ for all n ≥ n0. Letting n be large enough and using the triangle inequality, we get the following inequality:

||Zn(u)|| ≤ (1/n) ∑_{i=1}^n sup_{h:||h||≤||u||/√n} || H_θ log f(Xi; θ + h) − H_θ log f(Xi; θ) ||
 ≤ (1/n) ∑_{i=1}^n sup_{h:||h||≤δ} || H_θ log f(Xi; θ + h) − H_θ log f(Xi; θ) ||.

Because of the Law of Large Numbers (Theorem 4) and Condition 2b it follows that

lim_{δ→0} lim_{n→∞} (1/n) ∑_{i=1}^n sup_{h:||h||≤δ} || H_θ log f(Xi; θ + h) − H_θ log f(Xi; θ) || = lim_{δ→0} ψ_δ = 0 a.s. (3.8)

Therefore, from (3.8) we now see

sup_{u:||u||≤K} ||Zn(u)|| →a.s. 0. (3.9)
We can again rewrite λn(u) from (3.7), using (3.5):

λn(u) = (1/√n) uᵀ Un(θ) − (1/2) uᵀ I(θ) u + (1/2) uᵀ( (1/n)Vn(θ) + I(θ) )u + (1/2) uᵀ Zn(u) u.

Uniformly in ||u|| ≤ K, we use (3.9) to see that

lim_{n→∞} |uᵀ Zn(u) u| ≤ lim_{n→∞} uᵀu max_{i,j} |[Zn(u)]ij| ≤ K lim_{n→∞} ||Zn(u)|| = 0,

which means that uᵀZn(u)u = o_q(1). For uᵀ((1/n)Vn(θ) + I(θ))u we can show the same:

lim_{n→∞} |uᵀ( (1/n)Vn(θ) + I(θ) )u| ≤ lim_{n→∞} uᵀu max_{i,j} |[(1/n)Vn(θ) + I(θ)]ij| ≤ K lim_{n→∞} ||(1/n)Vn(θ) + I(θ)|| = 0.

This means that uᵀ((1/n)Vn(θ) + I(θ))u = o_q(1) too, and hence

(1/2) uᵀ( (1/n)Vn(θ) + I(θ) )u + (1/2) uᵀ Zn(u) u = o_q(1) + o_q(1) = o_q(1). (3.10)

Using ||u|| ≤ K and the result of equation (3.10), we conclude that λn(u), as defined in equation (3.6), is equal to

λn(u) = (1/√n) uᵀ Un(θ) − (1/2) uᵀ I(θ) u + o_q(1).
Now let us maximize λn(u) in u. We take the vector of partial derivatives ∇_u λn(u) and determine its root:

∇_u λn(u) = (1/√n) Un(θ) − (1/2) [ ∑_{i=1}^q ui[I(θ)]1i + ∑_{j=1}^q uj[I(θ)]j1, . . . , ∑_{i=1}^q ui[I(θ)]qi + ∑_{j=1}^q uj[I(θ)]jq ]ᵀ
 = (1/√n) Un(θ) − [ ∑_{i=1}^q ui[I(θ)]1i, . . . , ∑_{i=1}^q ui[I(θ)]qi ]ᵀ (by symmetry of I(θ))
 = (1/√n) Un(θ) − I(θ)u = 0 ⇐⇒ (1/√n) Un(θ) = I(θ)u.

The maximum is thus attained in û = [I(θ)]⁻¹ (1/√n) Un(θ). We therefore find that maximizing λn(u) in u gives us

û = [I(θ)]⁻¹ (1/√n) Un(θ) + o_q(1).
Now looking at our first definition of λn(u),

λn(u) = log Ln(θ + (1/√n)u) − log Ln(θ),

we see that maximizing λn(u) in u is the same as maximizing Ln(θ + (1/√n)u) = ∏_{i=1}^n f(Xi; θ + (1/√n)u) in u, since log Ln(θ) does not depend on u. This product is maximal when θ + (1/√n)u equals the MLE θ̂n of θ. What we thus see is

θ̂n = θ + (1/√n) û = θ + (1/√n) ( [I(θ)]⁻¹ (1/√n) Un(θ) + o_q(1) ).

Rewriting this equality, we see:

√n(θ̂n − θ) = û = [I(θ)]⁻¹ (1/√n) Un(θ) + o_q(1).

The o_q(1) term vanishes in probability as n → ∞. Since (1/√n)Un(θ) →D N_q(0, I(θ)), the vector [I(θ)]⁻¹(1/√n)Un(θ) is asymptotically normal with covariance [I(θ)]⁻¹I(θ)[I(θ)]⁻¹ = [I(θ)]⁻¹. Slutsky’s Lemma (Theorem 2) makes us reach the conclusion that

√n(θ̂n − θ) →D N(0, [I(θ)]⁻¹).
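Theorem 8 can be illustrated numerically. Below is a Monte Carlo sketch of ours (the thesis contains no such experiment) for the Exponential(λ) family, where the MLE is λ̂ = 1/X̄ and I(λ) = 1/λ², so √n(λ̂ − λ) should be approximately N(0, λ²).

```python
# Monte Carlo sketch of Theorem 8 for Exponential(lambda): the MLE is the
# reciprocal of the sample mean, and I(lambda) = 1/lambda^2, so
# sqrt(n)(MLE - lambda) is approximately N(0, lambda^2).
import random
import statistics

random.seed(1)
lam, n, reps = 2.0, 400, 2000
vals = []
for _ in range(reps):
    xs = [random.expovariate(lam) for _ in range(n)]
    lam_hat = 1.0 / (sum(xs) / n)                  # MLE of lambda
    vals.append((n ** 0.5) * (lam_hat - lam))      # scaled and centred estimator
print(round(statistics.mean(vals), 2), round(statistics.variance(vals), 2))
# mean near 0, variance near lambda^2 = 1/I(lambda) = 4
```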
3.2. Wald’s Theorem
In this Section we will prove asymptotic convergence of Wald’s Statistic as stated in Definition 4. Wald’s Statistic is defined by
QW = n(ˆθn− θ0)TI(ˆθn)(ˆθn− θ0).
ˆ
θnis defined as the Maximum Likelihood Estimator as defined in Definition 7. We will be
using the result of Theorem 8 to prove convergence of QW to a chi-squared distribution
The proof of Wald’s Theorem is written down in Large Sample Methods in Statistics, An Introduction with Applications ([4]), for θ a one-dimensional parameter in R. We give the generalization for Θ ⊂ Rq.
Theorem 9 (Wald’s Statistic). Let X1, . . . , Xn be a sequence of i.i.d. random variables with probability density function f(x; θ), θ ∈ Θ ⊂ R^q, which satisfies the conditions of Theorem 8. We consider the hypothesis test H0 : θ = θ0 versus H1 : θ ≠ θ0. Then Wald’s Statistic QW = n(θ̂n − θ0)ᵀ I(θ̂n)(θ̂n − θ0) has an asymptotic χ²_q-distribution under H0.
Proof. From Theorem 8 we know that, under H0, √n(θ̂n − θ0) →D N(0, [I(θ0)]⁻¹). Together with Cochran’s Theorem (Theorem 6) it now follows that, under H0,

Q*W = n(θ̂n − θ0)ᵀ I(θ0) (θ̂n − θ0) →D χ²_q.

Writing x = √n(θ̂n − θ0), we get the following equality:

QW/Q*W = (xᵀ I(θ̂n) x)/(xᵀ I(θ0) x),

for which we can use Courant’s Theorem (Theorem 7) to see that

ch_q( I(θ̂n)[I(θ0)]⁻¹ ) ≤ QW/Q*W ≤ ch_1( I(θ̂n)[I(θ0)]⁻¹ ).
Under the assumption that θ = θ0, we get that θ̂n →p θ0 as n → ∞. Consider [I(θ̂n)]ij. Because of our regularity assumptions on f we get

[I(θ̂n)]ij = ∫_R (∂ log f(x; θ)/∂θi)(∂ log f(x; θ)/∂θj) f(x; θ) dx |_{θ=θ̂n}
 → ∫_R (∂ log f(x; θ)/∂θi)(∂ log f(x; θ)/∂θj) f(x; θ) dx |_{θ=θ0} = [I(θ0)]ij, as n → ∞.

Using the Weak Law of Large Numbers (Theorem 4), we see

I(θ̂n) →p I(θ0), and thus I(θ̂n)[I(θ0)]⁻¹ →p I_q.

Consequently, all eigenvalues of I(θ̂n)[I(θ0)]⁻¹ converge in probability to 1:

ch_1{I(θ̂n)[I(θ0)]⁻¹} →p 1 and ch_q{I(θ̂n)[I(θ0)]⁻¹} →p 1.

This makes us see that QW/Q*W →p 1. Rewriting QW as

QW = Q*W · (QW/Q*W),

and using Q*W →D χ²_q and QW/Q*W →p 1, allows us to apply Slutsky’s Lemma (Theorem 2) to finally conclude that, under H0,

QW →D χ²_q.
4. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov Test is used to test samples for certain distributions. It does so by comparing the empirical cumulative distribution function of the data with the cumulative distribution function that we are testing for. One restriction, however, is that the Kolmogorov-Smirnov Test is meant for continuous distributions [1].
In this section we describe and explain the algorithm of the Kolmogorov-Smirnov Test. Consider a sequence of independent and identically distributed random variables X1, . . . , Xn, n ≥ 1. We know that the centred and scaled sample mean converges to a normal distribution. In the pre-limit, for n large, the distribution is only approximately normal. Since this is merely an approximation, we would like to have a test statistic which is able to detect this difference from a normal distribution as the sample size grows large. In the numerical illustrations that follow we consider n to be fixed and study the p-values obtained from the Kolmogorov-Smirnov Test for growing sample sizes.
We study the test by simulating data, trying to explain why the test does not work for discrete distributions and fixing that problem. Our data will be constructed in such a way that it is asymptotically normally distributed, by applying the Central Limit Theorem (Theorem 1).
4.1. Algorithm Kolmogorov-Smirnov Test
If we have data points x1, . . . , xm from some distribution, we can draw the empirical cumulative distribution function (empirical cdf), which is defined by the function

Fm(x) = (1/m) ∑_{i=1}^m 1{xi ≤ x} = |{i : xi ≤ x}|/m, x ∈ R.
This function is non-decreasing, right-continuous and bounded by 0 and 1, and therefore a real distribution function, as can be seen in the figure. Replacing the deterministic {xi} by random {Xi}, Fm(x) becomes random as well.

[Figure: empirical cdf F(x) of a sample, plotted against x.]
We can transform the data X1, . . . , Xm to U1 = F(X1), . . . , Um = F(Xm), to retrieve an i.i.d. Uniform(0,1) sample. We need to make the assumption that F is continuous and strictly increasing here. We denote the empirical cdf of the {Ui} by Gm(u), the so-called reduced empirical cdf; x and u are related by x = F⁻¹(u) and u = F(x). Now, look at

Fm(x) − F(x) = (1/m) ∑_{i=1}^m 1{Xi ≤ x} − F(x) = (1/m) ∑_{i=1}^m 1{Ui ≤ u} − G(u) = Gm(u) − G(u),

where G(u) = u is the Uniform(0,1) cdf.
We define the maximum distance between the empirical and the true cdf to be

Dm = sup_{x∈R} |Fm(x) − F(x)| = sup_{0≤u≤1} |Gm(u) − G(u)|.

This is a statistic, called the Kolmogorov Statistic. More commonly, we use the scaled version Km := √m Dm as a test statistic. The Null Hypothesis is that the empirical cdf is equal to some cdf F, and the Alternative Hypothesis is that it is equal to some other cdf. Now, this Km has a limiting distribution of its own. We refer to the Handbook of Monte Carlo Methods (reference) for the proof, and only show the result:

P(K ≥ y) := 1 − lim_{m→∞} P(Km ≤ y) = 1 − ∑_{k=−∞}^{∞} (−1)^k e^{−2(ky)²}, y > 0.
We set up H0 and H1:

H0 : the empirical cdf is equal to F,
H1 : the empirical cdf is not equal to F,

for a given distribution F. Then we need to follow the next four steps:

1. Order the data x(1) < · · · < x(m).
2. Calculate dm = max_{i=1,...,m} max{ i/m − F(x(i)), F(x(i)) − (i − 1)/m }.
3. Define km = √m dm.
4. Calculate the p-value P(K ≥ km).
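The four steps above can be sketched in Python as follows (the thesis uses MATLAB’s built-in kstest; this hand-rolled version, with names of our own choosing, uses the limiting series for the p-value, which is a good approximation only for large m; the five-point sample is purely illustrative).

```python
# Sketch of the four-step Kolmogorov-Smirnov algorithm described above.
import math

def ks_p_value(sample, cdf, terms=100):
    xs = sorted(sample)                                    # step 1: order the data
    m = len(xs)
    d_m = max(max((i + 1) / m - cdf(x), cdf(x) - i / m)    # step 2: d_m
              for i, x in enumerate(xs))
    k_m = math.sqrt(m) * d_m                               # step 3: k_m
    # step 4: P(K >= k_m) = 1 - sum_k (-1)^k exp(-2 (k k_m)^2)
    p = 1.0 - sum((-1) ** k * math.exp(-2.0 * (k * k_m) ** 2)
                  for k in range(-terms, terms + 1))
    return d_m, min(max(p, 0.0), 1.0)

std_normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
d, p = ks_p_value([-1.2, -0.4, 0.1, 0.7, 1.5], std_normal_cdf)
```

For this well-spread sample the p-value is large, so H0 would not be rejected.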
If the p-value exceeds some level α ∈ (0, 1), we do not reject H0. If, however, p < α,
our test statistic falls into the critical region and we reject H0.
4.2. Simulation in Matlab
In this Section we investigate how the Kolmogorov-Smirnov Test performs when we have a sample of the centered and scaled sample mean of a sequence of random variables from two distributions, the Bernoulli and the Uniform. We consider n to be fixed and equal to n = 200.
The resulting sample will be approximately, but not exactly, normally distributed. It is important to realize this, because it means that the Kolmogorov-Smirnov Test should eventually reject the Null Hypothesis. We compare the speed at which the Kolmogorov-Smirnov Test rejects the Null Hypothesis for growing sample size.
Bernoulli Sample
We simulated a Bernoulli-type trial in Matlab, by assigning the number +1 with probability p = 0.5, and −1 with probability 1 − p = 0.5, to the random variable Xi:

P(Xi = +1) = p = 1 − P(Xi = −1) = 0.5.

We created a vector X of length n = 200 with 1s and −1s. It is clear that the sample mean of this vector should approximately be equal to µX = 0, and the sample variance S²X approximately equal to E[X²] − (E[X])² = 1. By default, the Kolmogorov-Smirnov Test tests for a standard normal distribution. For this reason, we applied the Central Limit Theorem to the vector X:

Sn = ( ∑_{i=1}^n Xi − nµX ) / √(n S²X) = √n · X̄n,

creating a number Sn which, by the Central Limit Theorem (Theorem 1), is approximately standard normally distributed:

Sn ∼ N(0, 1) (approximately).
Now, we repeated this m times, creating a vector Ym = [Sn⁽¹⁾, . . . , Sn⁽ᵐ⁾]. Asymptotically, this vector has standard normally distributed entries. We test this vector for normality with the Kolmogorov-Smirnov Test.
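The construction of the vector Ym can be sketched as follows (a Python sketch of ours; the thesis does this in MATLAB and then calls its built-in kstest, which we omit here, only checking the first two moments).

```python
# Sketch of the simulated data: m copies of S_n = sqrt(n) * sample mean of
# n = 200 independent +/-1 variables; mean should be ~0 and variance ~1.
import random
import statistics

random.seed(3)
n, m = 200, 500
ys = []
for _ in range(m):
    xs = [random.choice((-1, 1)) for _ in range(n)]    # the +/-1 Bernoulli trial
    ys.append(sum(xs) / n ** 0.5)                      # S_n = sqrt(n) * sample mean
print(round(statistics.mean(ys), 2), round(statistics.pvariance(ys), 2))
# Note: each S_n lies on the lattice {2k/sqrt(200)}, so ys contains many ties;
# this discreteness is what the KS test eventually detects.
```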
The Kolmogorov-Smirnov test then gives a p-value. The Null-Hypothesis is that the sample comes from a standard normal distribution. If p > 0.05, we do not reject the Null-Hypothesis.
We observed that the p-value varies when we vary the length m of our vector Ym. The p-value is itself a random variable. We therefore chose to simulate 1000 p-values for every m, and then calculated the mean p-value for every m. This resulted in the following figure, where we see the mean p-value plotted against m.
[Figure: mean p-value of the Kolmogorov-Smirnov Test against sample size m, for the Bernoulli sample.]
We see the p-value varies a lot, even though we averaged over many simulated p-values for every m. Nevertheless, it decreases for growing m. The p-value drops below our significance level α = 0.05 around m = 1000, and keeps on decreasing after this.
Uniform(0,1) Sample
We saw that the p-value drops below 0.05 around a sample size of m = 900 when our data came from a Bernoulli distribution. We will now see what happens when we take the Uniform(0,1) distribution, also applying the Central Limit Theorem (1).
Due to calculation time, we set n equal to 50, and we have X*1, . . . , X*n i.i.d. random variables, where X*i ∼ Uniform(0, 1), for i ∈ {1, . . . , n}. The mean is µX* = 1/2 and the variance S²X* = 1/12. We define S*n:

S*n = ( ∑_{i=1}^n X*i − nµX* ) / √(n S²X*).

Creating vectors Y*m = [S*n⁽¹⁾, . . . , S*n⁽ᵐ⁾], where we vary m, we get the following graph:
[Figure: mean p-value of the Kolmogorov-Smirnov Test against sample size m, for the Uniform(0,1) sample.]
We see the p-value varies a bit more than in the Bernoulli case, which is caused by two things: S*n has a larger variance, and we simulated fewer p-values per sample size due to computing time. But in general, there is no directly visible decrease in the p-value for increasing m. The data is only normally distributed for n going to infinity, so at some point we do expect the p-value to drop for large sample sizes. Computing these large sample sizes takes a lot of time, but it would be interesting to simulate such data. We calculated 10 p-values for m = 10 000 000, which gave us a mean value of p = 0.3014. For larger sample sizes we would expect a further decrease in the p-value.
5. Conclusion
The goal of this bachelor project was to study statistical tests. Our focus lay on Wald's Statistic and the Kolmogorov-Smirnov Test.
In the first part we focused on theoretical properties of the Maximum Likelihood Estimator. In existing literature it has been proved that the Maximum Likelihood Estimator converges to a normal distribution under certain conditions. We managed to extend this proof for q-dimensional parameter spaces. We also extended the proof of Wald’s Theorem to q-dimensional parameter spaces. This theorem states that Wald’s Statistic converges to a chi-squared distribution under the Null Hypothesis H0 : θ = θ0.
These two proofs were the main theoretical contributions of this project. Our proofs relied heavily on the proofs given in the literature ([4]), but the extensions were far from trivial.
In the second part of this project we studied the performance of the Kolmogorov-Smirnov Test. Our motivation was to see whether the Kolmogorov-Smirnov Test can detect the deviation from the normal distribution in the pre-limit of the Central Limit Theorem. The implementation we tested is the one included in MATLAB R2015b. We think that such a numerical investigation is important, and it reveals some limitations in the capabilities of such statistical tests. We found that the test rejects the Null Hypothesis for relatively small sample sizes when the data originates from a Bernoulli distribution. This is related to the double counts (ties) that occur when sampling from a discrete distribution. We did not see the same for Uniform(0,1) data, although we saw a decrease of the p-value for m = 10 000 000. We suspect the Null Hypothesis should be rejected for even larger sample sizes.
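The effect of these double counts can be made concrete: with n = 200, the scaled Bernoulli sum takes at most n + 1 distinct values, so any sample of size m well above n + 1 necessarily contains repeated values, which a sample from a truly continuous normal distribution would almost surely not. A small Python check (our own illustration, not part of the original MATLAB experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 1000
# centred, scaled sums of n Rademacher (+/-1) variables
s = rng.choice([-1.0, 1.0], size=(m, n)).sum(axis=1) / np.sqrt(n)
# the sum of n +/-1 steps takes at most n + 1 distinct values,
# so a sample of size m >> n + 1 is guaranteed to contain ties
print(len(np.unique(s)), "distinct values among", m, "observations")
```

These ties pull the empirical distribution function away from the continuous normal CDF, which is one reason the test rejects at moderate sample sizes in the Bernoulli case.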
6. Popular Summary
When doing research, it is important to know how to deal with the data that we retrieve from experiments. For example, think of conducting research on whether a new medicine works better than a placebo, or not. If we give the medicine to one group of people and a placebo to a different group, and we see the first group has a higher rate of getting cured, we will suspect the medicine works. On the other hand, if we do not see a big difference in the fraction of people being cured by the medicine and the placebo, we will begin to suspect the medicine does not work.
Somehow we need to interpret the information, and process it in such a way that we can draw sensible conclusions. When can we say the medicine works better than a placebo? We cannot simply start conducting an experiment. Before we start, we will probably have some presumption; in our example, that the medicine works better than the placebo. We want to know whether this presumption is likely to be true after having seen our data. For this reason, we set up a so-called Null Hypothesis H0 and an Alternative Hypothesis H1. We do this in such a way that our presumption lies in our Alternative Hypothesis. So the Null Hypothesis in our example could be 'the medicine is equally effective as a placebo'. The Alternative Hypothesis could then be 'the medicine works better than the placebo'. After conducting the experiment, we can do two things: either we do not see enough evidence to reject H0, or we do see enough evidence,
so we reject H0 and adopt H1. Note that we do not treat H0 and H1 symmetrically.
Not rejecting the Null Hypothesis does not mean we say H0 is true, we merely say we
have not seen enough data to be able to reject it.
Continuing with our example, we decide to treat one group with the medicine, and the other group with the placebo, not telling anyone in which group they are placed. We record the fraction of people cured by the medicine, and the fraction of people 'cured' by the placebo. If the former fraction is (much) higher than the latter, we want to say it is highly probable the Null Hypothesis is false, and the Alternative Hypothesis is true. If the fractions do not differ enough from one another, we may not conclude the medicine works.
We can never know for sure whether the medicine works better than the placebo. After all, our experiment is subject to chance. It is possible we get 'unlucky' in how our groups are divided. We can make two kinds of mistakes. We might see many people cured by the medicine and few by the placebo. It is possible that we only saw this by chance, though, and that in fact the medicine does not truly work better than the placebo. After seeing this data we would decide to reject H0 and conclude the medicine does work, when in
reality it does not work better. Falsely rejecting H0 and adopting H1 is called a Type-I
error. We can also make a different kind of mistake. It is possible that, by coincidence, the subjects in our medicine group are not cured by the medicine, while many of the subjects
in our placebo group are cured. We would then say we do not see enough evidence to reject our Null Hypothesis, although in truth we should have been able to do so. This kind of mistake, where we incorrectly do not reject H0, is called a Type-II error.
Falsely rejecting H0 is a grave error which we want to minimize as much as possible.
Imagine falsely concluding a medicine works, and giving this medicine to sick people while the medicine does not work! Of course it is also a shame if we falsely conclude we cannot use a certain medicine, but less harm is done by this mistake than by the first one. We therefore want to minimize the probability of making a Type-I error. By doing so, however, we make it more likely to make a Type-II error. We can never make the probability of a Type-I error zero without making the probability of a Type-II error very big. We have to choose a significance level. This level, often denoted by α, is a number between 0 and 1, and is an upper bound for the probability that we make a Type-I error. In other words: we require the probability that we falsely reject H0 and adopt H1 to be at most α. After setting this level, we minimize
the probability of making a Type-II error. Then we can say we have a probability of at most α to falsely reject our Null Hypothesis.
Bibliography
[1] Dirk P. Kroese, Thomas Taimre, Zdravko I. Botev, Handbook of Monte Carlo Methods, Wiley Interscience.
[2] Fetsje Bijma, Marianne Jonker, Aad van der Vaart, Inleiding in de statistiek, Epsilon Uitgaven, Utrecht, 2013.
[3] Paul Igodt, Wim Veys, Lineaire Algebra, Universitaire Pers Leuven.
[4] Pranab K. Sen, Julio M. Singer, Large Sample Methods in Statistics: An Introduction with Applications, Chapman & Hall, New York, 1993.
A. Matlab Code
Bernoulli Sample
n = 200; mu = 0; sigma = 1;      % Rademacher (+/-1) variables: mean 0, variance 1
K = 100;                         % number of p-values per sample size
P = zeros(1, K);
M = zeros(1, 250);
for N = 10:10:2500               % sample size m
    Y = zeros(1, N);
    for k = 1:K
        for j = 1:N
            s = 0;               % sum of n Rademacher variables
            for i = 1:n
                if rand < 0.5
                    s = s + 1;
                else
                    s = s - 1;
                end
            end
            Y(1, j) = (s - n*mu) / (sqrt(n) * sqrt(sigma));
        end
        [h, p] = kstest(Y);
        P(1, k) = p;
    end
    M(1, N/10) = mean(P);        % mean p-value for this sample size
end
x = 10:10:2500;
plot(x, M)
Uniform Code
n = 50; mu = 0.5; sigma = 1/12;  % Uniform(0,1): mean 1/2, variance 1/12
K = 100;                         % number of p-values per sample size
P = zeros(1, K);
M = zeros(1, 100);
for N = 100:100:10000            % sample size m
    Y = zeros(1, N);
    for k = 1:K
        for j = 1:N
            s = 0;               % sum of n Uniform(0,1) variables
            for i = 1:n
                s = s + rand;
            end
            Y(1, j) = (s - n*mu) / (sqrt(n) * sqrt(sigma));
        end
        [h, p] = kstest(Y);
        P(1, k) = p;
    end
    M(1, N/100) = mean(P);       % mean p-value for this sample size
end
x = 100:100:10000;
plot(x, M)