Asymptotic Convergence - Wald’s Theorem and
the Kolmogorov-Smirnov Test
Arja Rydin
July 7, 2017
Bachelor Thesis
Supervisors: MSc Nicos Starreveld, prof. dr. Michel Mandjes
Korteweg-de Vries Instituut voor Wiskunde
Abstract
In this project we give a generalization of the proof that Wald’s Statistic converges to a chi-squared distribution. In the literature ([4]) this has been proved for a one-dimensional parameter space; here we give the extended proof for a q-dimensional parameter space. To do so, we also generalize the proof of convergence of the Maximum Likelihood Estimator to q-dimensional parameter spaces.
We also perform a numerical study of the Kolmogorov-Smirnov Test, using the implementation included in MATLAB R2015b. The data which we simulate is scaled and centred, so that it is normally distributed in the limit. However, our data is only approximately normally distributed, since we fix our data in the pre-limit (n = 200). We study how well the Kolmogorov-Smirnov Test can detect this for a growing sample size. We found that when the data originates from a discrete (Bernoulli) distribution, the Null Hypothesis is rejected quite quickly, at around a sample size of m = 1000. When simulating from a continuous (Uniform(0,1)) distribution, a far larger sample size is needed to reject the Null Hypothesis. We eventually saw a decrease in the p-value, and we expect the Null Hypothesis to be rejected at some point, but we were not able to get beyond a sample size of m = 10 000 000.
Title: Asymptotic Convergence - Wald’s Theorem and the Kolmogorov-Smirnov Test
Author: Arja Rydin, arja.rydin@student.uva.nl, 10590854
Supervisors: MSc Nicos Starreveld, prof. dr. Michel Mandjes
Second examiner: prof. dr. Sindo Nunez
Date: July 7, 2017
Korteweg-de Vries Instituut voor Wiskunde Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam http://www.science.uva.nl/math
Contents
Abstract
1. Introduction
2. Prerequisites
   2.1. Notation
   2.2. Definitions
   2.3. Preliminary Results
3. Maximum Likelihood
   3.1. Asymptotic Behaviour of Maximum Likelihood Estimators
   3.2. Wald’s Theorem
4. Kolmogorov-Smirnov Test
   4.1. Algorithm Kolmogorov-Smirnov Test
   4.2. Simulation in Matlab
5. Conclusion
6. Popular Summary
Bibliography
1. Introduction
When we want to study a phenomenon, we rely on measurements. These measurements, or data, we need to interpret in order to draw sensible conclusions. For example, consider a die and suppose we want to establish whether it is fair, i.e. whether all possible outcomes are equally likely. We throw the die multiple times and model this process using a binomial probability distribution. If every side shows up approximately equally often, we will feel reassured in our belief that the die is fair. However, if one of the sides comes up markedly more often than the others, we begin to suspect the die is not fair. Only when we see something very deviant happening do we decide to suspect the die is unfair. We make this explicit by setting up a Null Hypothesis H0: ’the die is fair’ and an Alternative Hypothesis H1: ’the die is unfair’. We set up these hypotheses before conducting the research. After doing the experiment and seeing the data, we decide whether to reject the Null Hypothesis and accept the Alternative Hypothesis, or not to reject the Null Hypothesis (without necessarily accepting it).
In order to decide what to do with regard to our hypotheses, we need to process the data so that we can conclude whether or not to reject the Null Hypothesis. We process the data with a so-called statistic, which we then test: we look at the data and see how likely it is to observe it if the Null Hypothesis is true. If this probability is sufficiently low, we decide to reject the Null Hypothesis and adopt the Alternative Hypothesis. There exist many different statistics and tests; we will examine one specific statistic and one specific test.
In this project our goal is twofold. At first, we want to prove convergence of Wald’s statistic to a chi-squared distribution, we treat this in Section 3.2. In order to prove this convergence we need the asymptotic behaviour of the Maximum Likelihood Estimator, this is done in Section 3.1. In the literature, see [4], the corresponding results are established for the case of a one dimensional parameter. We extend these results for the case of a q-dimensional parameter and our findings are given in Theorems 8 and 9. The proofs rely on the ideas presented in [4] but the extension to vector parameters is not trivial, we refer the reader to Sections 3.1 and 3.2 for more details.
Secondly, we want to explore the sensitivity of the Kolmogorov-Smirnov Test when the sample size grows. We know from the Central Limit Theorem that the scaled and centred sample mean has asymptotically a normal distribution. But in the pre-limit it will not be exactly normally distributed, hence we would want a test statistic which detects this as the sample size grows. We perform some numerical experiments and show our results in Chapter 4.
2. Prerequisites
In this chapter we present the notation we will use and collect the main definitions and results we will rely on. The notation concerns how we denote matrices and vectors. We will give three different kinds of definitions, in this order: from statistics, from probability theory and from matrix theory. Lastly, we state a few important results which will be necessary for the proof of the convergence of Wald’s Statistic.
2.1. Notation
We will assign bold letters to vectors and matrices: x_n will be a column vector of length n, and A_{mn} will be a matrix of size m × n. In the sequel we will drop the dependence on n and m whenever it is clear from the context. Aᵀ will be the transpose of either a vector or a matrix. With ||A||, A ∈ M_{m×n}, we denote the norm of a matrix. We consider the following matrix norm:

||A|| := sup_{i,j} |A_{ij}|.

We write H_x f for the Hessian matrix and ∇_x f for the vector of partial derivatives of the real function f.
2.2. Definitions
We start off with some general statistical definitions, like point estimators and the Covariance Matrix. This will be followed by definitions from probability theory; among others, we define the chi-squared distribution and some important types of convergence. To end this section, we give some matrix properties.
Definitions from Statistics
Definition 1 (Statistic). Let X1, . . . , Xn be i.i.d. random variables, distributed according to some distribution f(x; θ) with an unknown parameter θ ∈ Θ, and denote X = (X1, . . . , Xn). A map T : X → R^q is called a statistic if it does not depend on the unknown parameter θ. Any statistic T(X) defines a form of data reduction or data summary. A minimal sufficient statistic is a statistic that has achieved the maximal amount of data reduction possible while still retaining all the information about the unknown parameter [2].
Definition 2 (Point Estimators). Let X1, . . . , Xn be i.i.d. random variables with some distribution which depends on an unknown parameter θ ∈ Θ ⊂ R^q, with q ≥ 1. A point estimator θ̂ for θ is a statistic based on the realization of the random variables, which gives a specific value for θ ∈ Θ.
Definition 3 (Hypothesis Testing). In a hypothesis testing problem, after observing the sample, the experimenter must decide either not to reject H0 (but not accepting it as true) or to reject H0 and decide H1 to be true. A hypothesis testing procedure or hypothesis test is a rule that specifies:

1. For which sample values the decision is made not to reject H0.
2. For which sample values we reject H0 and accept H1 as true.

The subset of the sample space for which H0 will be rejected is called the critical region. The complement of the rejection region is called the acceptance region. Typically we write: consider the test

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1.

To decide whether or not to reject the Null Hypothesis, we need a test statistic T, which is defined below. If H0 is true, we want the probability of T falling into the critical region K to be small: P_{θ0}(T(X) ∈ K) < α, where α is often chosen to be 0.05.
Definition 4 (Wald’s Statistic [4]). Let X1, . . . , Xn be an i.i.d. sequence of random variables with probability density function f(x; θ), θ ∈ Θ ⊂ R^q. Consider the test H0 : θ = θ0 vs. H1 : θ ≠ θ0. Let θ̂n be the MLE of θ. Then Wald’s Statistic QW is defined as follows:

QW = n(θ̂n − θ0)ᵀ I(θ̂n) (θ̂n − θ0). (2.1)
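To make Definition 4 concrete, here is a small numerical sketch in Python (the thesis itself works in MATLAB) for the one-dimensional Bernoulli(θ) case, where the MLE is the sample mean and I(θ) = 1/(θ(1 − θ)); the function name is ours, not from the literature.

```python
# Hypothetical one-dimensional illustration of Definition 4 for a
# Bernoulli(theta) sample: the MLE is the sample mean and
# I(theta) = 1/(theta(1 - theta)).

def wald_statistic_bernoulli(xs, theta0):
    """Q_W = n (theta_hat - theta0)^2 I(theta_hat) for a 0/1 sample."""
    n = len(xs)
    theta_hat = sum(xs) / n                            # MLE of theta
    info = 1.0 / (theta_hat * (1.0 - theta_hat))       # Fisher information at the MLE
    return n * (theta_hat - theta0) ** 2 * info

# 60 successes in 100 trials, testing H0: theta = 0.5
q_w = wald_statistic_bernoulli([1] * 60 + [0] * 40, 0.5)
print(round(q_w, 4))  # 100 * 0.1^2 / (0.6 * 0.4) ≈ 4.1667
```

Under H0 this value is compared with a χ² critical value with one degree of freedom (3.84 at level α = 0.05), so here H0 would be rejected.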
Definition 5 (Test Statistic). A test statistic is a statistic T(X) whose distribution under H0 is known, so that we can compute how likely the observed value is when θ ∈ Θ0. If this probability is small enough, we say T falls in the critical region, and we decide to reject the Null Hypothesis. If we denote the critical region by K, we say that we reject H0 if T(X) ∈ K.
Definition 6 (Likelihood Function). Let X1, . . . , Xn be i.i.d. random variables with probability density function f(x; θ). We define Ln(θ) as the product of the probability density functions of X1, . . . , Xn:

Ln(θ) := ∏_{i=1}^n f(Xi; θ).
Definition 7 (Maximum Likelihood Estimator (MLE)). The MLE θ̂n is the value for which the likelihood function achieves its maximum:

θ̂n = argsup_{θ∈Θ} ∏_{i=1}^n f(Xi; θ). (2.2)
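Definition 7 can be checked numerically. The sketch below (ours, in Python rather than the thesis’s MATLAB) confirms for i.i.d. Bernoulli data that the likelihood is maximised at the sample mean, using a crude grid search over θ.

```python
# Sketch: for i.i.d. Bernoulli data the log-likelihood
# log L_n(theta) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
# is maximised at the sample mean; we confirm with a grid search.
import math

def log_likelihood(theta, xs):
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]            # 7 successes in 10 trials
grid = [k / 1000 for k in range(1, 1000)]      # avoid theta = 0 and theta = 1
theta_grid = max(grid, key=lambda t: log_likelihood(t, xs))
print(theta_grid)  # 0.7, the sample mean
```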
Definition 8 (Empirical Cumulative Distribution Function (Empirical cdf)). Let x1, . . . , xn be the realization of X1, . . . , Xn i.i.d. real random variables, and let F(y) be their common cumulative distribution function. The empirical cumulative distribution function (empirical cdf) is given by

F̂n(y) = #{xi : xi ≤ y} / n = (1/n) ∑_{i=1}^n 1{xi ≤ y},

where 1{xi ≤ y} is the indicator function.
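A direct transcription of this definition into Python (a sketch of ours; the thesis’s simulations use MATLAB’s built-in ecdf machinery) looks as follows.

```python
# Minimal empirical cdf following the definition: F_n(y) = #{x_i <= y} / n.
def empirical_cdf(data):
    xs = sorted(data)
    n = len(xs)
    def F(y):
        return sum(1 for x in xs if x <= y) / n    # fraction of points <= y
    return F

F = empirical_cdf([2.0, 1.0, 3.0, 2.0])
print(F(0.5), F(2.0), F(10.0))  # 0.0 0.75 1.0
```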
Definition 9 (Fisher Information Matrix). Let X1, . . . , Xn be i.i.d. random variables with probability density function f(x; θ), x ∈ R, θ ∈ Θ ⊂ R^q. Let ∂²f(x; θ)/∂θi∂θj be defined for all i, j = 1, . . . , q. Then the Fisher Information Matrix I(θ) is defined by

I(θ) = E[ (∇_θ log f(X1; θ)) (∇_θ log f(X1; θ))ᵀ ].
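As a one-dimensional sanity check of this definition (a sketch of ours, not from the thesis): for a Bernoulli(θ) variable the score is ∂ log f(x; θ)/∂θ = x/θ − (1 − x)/(1 − θ), and the expected squared score equals the closed form 1/(θ(1 − θ)).

```python
# Exact expectation of the squared Bernoulli score over the two outcomes.
def fisher_info_bernoulli(theta):
    score1 = 1.0 / theta                 # score at the outcome x = 1
    score0 = -1.0 / (1.0 - theta)        # score at the outcome x = 0
    return theta * score1 ** 2 + (1.0 - theta) * score0 ** 2

theta = 0.3
print(round(fisher_info_bernoulli(theta), 4))   # 4.7619
print(round(1.0 / (theta * (1.0 - theta)), 4))  # closed form, same value
```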
Definition 10 (Covariance Matrix). Let X = (X1, . . . , Xn) be a random vector. The Covariance Matrix Σ is the matrix defined as

Σ := [Σij]_{i,j=1,...,n} = [Cov(Xi, Xj)]_{i,j=1,...,n}.
Definitions from Probability Theory
Definition 11 (χ²-distribution). We say the random variable Z is χ²-distributed with n degrees of freedom (notation: Z ∼ χ²_n) if Z is equal to the sum of n squared, independent standard normally distributed random variables:

Z = ∑_{i=1}^n Xi², where Xi ∼ N(0, 1).
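A seeded simulation sketch of this definition (ours, in Python): summing n squared standard normals and checking that the empirical mean is close to E[χ²_n] = n.

```python
# Simulate chi-squared draws as sums of n squared standard normals and
# verify that their empirical mean is near n (= E[chi^2_n]).
import random

random.seed(7)
n_dof, reps = 5, 20000
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n_dof))
           for _ in range(reps)]
mean = sum(samples) / reps
print(round(mean, 2))  # close to n_dof = 5
```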
Definition 12 (Convergence in Distribution). We say a sequence of random vectors {Xn} converges in distribution to a random vector X if

P(Xn ≤ x) → P(X ≤ x)

for every continuity point x of the limiting distribution function. For this type of convergence we write Xn →D X.
Definition 13 (Convergence in Probability). We define the distance function d on R^{m×n} by

d(A, B) = ||A − B|| = sup_{i,j} |Aij − Bij|,

where A and B are matrices. A sequence of random matrices {Xn} is said to converge in probability to a random matrix X if for all ε > 0

P(d(Xn, X) > ε) → 0.

We denote this kind of convergence with Xn →p X.
Definition 14 (Characteristic Function for Multi-dimensional Vectors). Let X be a k-dimensional random vector. For t ∈ R^k the characteristic function is defined as

φ_X(t) = E[e^{i tᵀX}].
Definitions from Matrix Theory
Definition 15. A positive semi-definite matrix is an (n × n)-matrix A for which xᵀAx ≥ 0 for all vectors x of length n.
Definition 16. An (n × n)-matrix A is said to be idempotent when A2 = A.
Definition 17. A generalized inverse of an (n × m)-matrix A is a matrix B ∈ R^{m×n} with the property that ABA = A. A generalized inverse exists for every matrix [3].
Definition 18 (Singularity of a Matrix). A square matrix A is said to be singular if there exists no matrix B such that AB = I.
Definition 19 (Rank of a Matrix). The rank of a matrix A is defined to be the number of linearly independent rows in A [3].
2.3. Preliminary Results
In this section we present some important results we will use later on.
Theorem 1 (Multivariate Central Limit Theorem). Let Yi = [Yi(1), . . . , Yi(q)]ᵀ, i = 1, . . . , n, be a sequence of independent, identically distributed random vectors. Let the sample mean Ȳn be

Ȳn = (1/n) ∑_{i=1}^n Yi = [(1/n) ∑_{i=1}^n Yi(1), . . . , (1/n) ∑_{i=1}^n Yi(q)]ᵀ = [Ȳn(1), . . . , Ȳn(q)]ᵀ.

Then we can write

(1/√n) ∑_{i=1}^n [Yi − EY1] = √n(Ȳn − EY1),

and

√n(Ȳn − EY1) →D N_q(0, Σ),

where we write Σ for the covariance matrix, as defined in Definition 10.
Theorem 2 (Slutsky’s Lemma [4]). Let {Xn} and {Yn} be sequences of random variables such that Xn →D X and Yn →p c, where c is a constant. Then it follows that

1. Xn + Yn →D X + c,
2. YnXn →D cX,
3. Xn/Yn →D X/c if c ≠ 0.
Theorem 3 (Lévy’s Continuity Theorem [4]). Let {Xn} be a sequence of random variables with corresponding characteristic functions {φn}, and let X be a random variable with characteristic function φ. If, for n → ∞,

φn(t) → φ(t) for all t ∈ R,

then Xn converges to X in distribution, i.e. Xn →D X.
Theorem 4 (Khintchine Weak Law of Large Numbers). Let {Xn}, n ≥ 1, be a sequence of i.i.d. random variables with EX1 = θ, and let

X̄n = (1/n) ∑_{k=1}^n Xk, n ≥ 1.

Then X̄n →p θ.
Theorem 5 (Taylor on R^p [4]). Let f(x), x ∈ A ⊂ R^p, be a continuous function with continuous partial derivatives up to (k + 1)th order, for some k ≥ 1. Then for every x ∈ A, x0 ∈ A, we have the following (multivariate) Taylor expansion:

f(x) = f(x0) + ∑_{j=1}^k (1/j!) ∑_{i1=1}^p · · · ∑_{ij=1}^p [∂^j f / (∂x_{i1} · · · ∂x_{ij})](x0) ∏_{l=1}^j (x_{il} − x0_{il}) + Rk(x, x0),

where Rk(x, x0) is given by

(1/(k + 1)!) ∑_{i1=1}^p · · · ∑_{i_{k+1}=1}^p [∂^{k+1} f / (∂x_{i1} · · · ∂x_{i_{k+1}})](hx0 + (1 − h)x) ∏_{l=1}^{k+1} (x_{il} − x0_{il}),

for some 0 < h < 1. For the first order Taylor expansion of f(x) we can actually write

f(x) = f(x0) + (x − x0)ᵀ ∇_x f(x0) + R1(x, x0)
     = f(x0) + (x − x0)ᵀ ∇_x f(x0) + (1/2)(x − x0)ᵀ H_x f(hx0 + (1 − h)x) (x − x0). (2.3)
Theorem 6 (Cochran [4]). Let {Xn} be a sequence of random vectors such that √n(Xn − θ) →D N(0, Σ), rank(Σ) = q ≤ p, and let {An} be a sequence of positive semi-definite non-random matrices. Then

Qn = n(Xn − θ)ᵀ An (Xn − θ) →D χ²_q

if and only if An converges to a generalized inverse of Σ.
Theorem 7 (Courant [4]). Let A and B be two positive semi-definite matrices, where B is non-singular. If λi = ch_i(AB⁻¹) denotes the i-th eigenvalue of AB⁻¹ (where λ1 ≥ . . . ≥ λq), for i = 1, . . . , q, then we get

λq = inf_x (xᵀAx)/(xᵀBx) ≤ sup_x (xᵀAx)/(xᵀBx) = λ1.
3. Maximum Likelihood
When searching for the true parameter of a distribution, we use observed data to estimate the parameter. We of course want to make sensible estimations, maximizing the probability of our estimation being correct. Maximum likelihood estimation is a widely used method of estimating parameters [2]. Asymptotically, the maximum likelihood estimator has some interesting properties.
3.1. Asymptotic Behaviour of Maximum Likelihood
Estimators
We will show normal convergence of the maximum likelihood estimator in the following theorem. In Large Sample Methods in Statistics, An Introduction with Applications ([4]) the proof was given for Θ ⊂ R. The main contribution of this project is extending this theorem to the case Θ ⊂ R^q.
Theorem 8. Let X1, . . . , Xn be i.i.d. random variables with probability density function f(x; θ), θ ∈ Θ ⊂ R^q, which meets the following conditions:

1. For i, j = 1, . . . , q, ∂f(x; θ)/∂θi and ∂²f(x; θ)/∂θi∂θj are defined almost everywhere and are such that

|∂f(x; θ)/∂θi| ≤ Hi(x) and |∂²f(x; θ)/∂θi∂θj| ≤ Gij(x),

where Hi and Gij are integrable functions with ∫_R Hi(x)dx < ∞ and ∫_R Gij(x)dx < ∞.

2. For i, j = 1, . . . , q the partial derivatives ∂ log f(x; θ)/∂θi and the second partial derivatives ∂² log f(x; θ)/∂θi∂θj exist almost everywhere and are such that:

a) The Fisher Information Matrix

I(θ) = E[ (∇_θ log f(X1; θ)) (∇_θ log f(X1; θ))ᵀ ]

is finite and positive definite,

b) for δ → 0,

E_θ[ sup_{h:||h||≤δ} || H_θ log f(X1; θ + h) − H_θ log f(X1; θ) || ] =: ψ_δ → 0.
Then the Maximum Likelihood Estimator (MLE) θ̂n of θ is such that

√n(θ̂n − θ) →D N(0, [I(θ)]⁻¹).

Proof. First, note that

E[∇_θ log f(X1; θ)] = ∫_R [∇_θ log f(x; θ)] f(x; θ) dx
 = [ ∫_R ((∂f/∂θ1)(x; θ)/f(x; θ)) f(x; θ) dx, . . . , ∫_R ((∂f/∂θq)(x; θ)/f(x; θ)) f(x; θ) dx ]ᵀ
 = [ ∫_R (∂f/∂θ1)(x; θ) dx, . . . , ∫_R (∂f/∂θq)(x; θ) dx ]ᵀ
 = [ ∂/∂θ1 ∫_R f(x; θ) dx, . . . , ∂/∂θq ∫_R f(x; θ) dx ]ᵀ
 = ∇_θ ∫_R f(x; θ) dx = 0. (3.1)

We are allowed to interchange the order of differentiation and integration because of Condition 1, which makes it possible to apply the Dominated Convergence Theorem. Consider the Hessian matrix H_θ log f(X1; θ). Then E[H_θ log f(X1; θ)]ij can be written as

E[H_θ log f(X1; θ)]ij = E[ ∂² log f(X1; θ)/∂θi∂θj ]
 = ∫_R ∂/∂θi ( (∂f(x; θ)/∂θj) {f(x; θ)}⁻¹ ) f(x; θ) dx
 = ∫_R ( (∂²f(x; θ)/∂θi∂θj) (1/f(x; θ)) − (∂f(x; θ)/∂θj)(∂f(x; θ)/∂θi) (1/f(x; θ)²) ) f(x; θ) dx
 = ∫_R ∂²f(x; θ)/∂θi∂θj dx − [I(θ)]ij
 = ∂/∂θi ∫_R (∂f/∂θj)(x; θ) dx − [I(θ)]ij = 0 − [I(θ)]ij = −[I(θ)]ij, (3.2)

where the first term vanishes by (3.1).
What we see is that the expected value of the Hessian of log f(X1; θ) in one observation is equal to the negative Fisher Information Matrix. We define the vector of sums of partial derivatives to be

Un(θ) := ∇_θ ∑_{i=1}^n log f(Xi; θ) = ∑_{i=1}^n [ ∂ log f(Xi; θ)/∂θ1, . . . , ∂ log f(Xi; θ)/∂θq ]ᵀ.

We use equation (3.1) and Condition 2a in combination with the Multivariate Central Limit Theorem (Theorem 1) to show that

(1/√n) Un(θ) →D N_q(0, I(θ)).

In order to see this, define the random vector Yi, i = 1, . . . , n, to be

Yi = [ ∂ log f(Xi; θ)/∂θ1, . . . , ∂ log f(Xi; θ)/∂θq ]ᵀ = ∇_θ log f(Xi; θ),

and note that Un(θ) = ∑_{i=1}^n Yi. In equation (3.1) we already saw that

E[Y1] = E[∇_θ log f(X1; θ)] = 0.

Consequently we obtain

∑_{i=1}^n E[Yi] = E[ ∑_{i=1}^n Yi ] = E[Un(θ)] = 0.

For the covariance matrix Σ we have

Σij = Cov(Y1(i), Y1(j)) = E[ (Y1(i) − E[Y1(i)]) (Y1(j) − E[Y1(j)]) ]
 = E[ (∂ log f(X1; θ)/∂θi − 0)(∂ log f(X1; θ)/∂θj − 0) ]
 = E[ (∂ log f(X1; θ)/∂θi)(∂ log f(X1; θ)/∂θj) ] = [I(θ)]ij.

This shows us that

Σ = I(θ). (3.3)

Let Ȳ = (1/n) ∑_{i=1}^n Yi be the sample mean. Using (3.3) and Theorem 1 we can indeed now conclude that

√n Ȳ →D N_q(0, Σ) =⇒ (1/√n) Un(θ) →D N_q(0, I(θ)). (3.4)
Let us define

Vn(θ) = H_θ( ∑_{i=1}^n log f(Xi; θ) ) = ∑_{i=1}^n H_θ log f(Xi; θ).

We will show that (1/n)Vn(θ) converges in probability:

(1/n) Vn(θ) →p −I(θ).

Vn(θ) is a (q × q)-matrix, and for every entry (1/n)[Vn(θ)]ij of this matrix we know, by (3.2), that the expected value is

E[ (1/n)[Vn(θ)]ij ] = E[ (1/n) ∑_{k=1}^n ∂² log f(Xk; θ)/∂θi∂θj ] = −[I(θ)]ij.

We may therefore use the Law of Large Numbers (Theorem 4) to see that

(1/n)[Vn(θ)]ij →p −[I(θ)]ij,

which is equivalent to

P( |(1/n)[Vn(θ)]ij + [I(θ)]ij| > ε ) → 0 as n → ∞, for every ε > 0.

This result holds for every i, j = 1, . . . , q, so it also holds for the maximum. This we use to see that, for every ε > 0,

P( ||(1/n)Vn(θ) + I(θ)|| > ε ) = P( max_{i,j} |(1/n)[Vn(θ)]ij + [I(θ)]ij| > ε ) → 0.

Our conclusion is therefore that

(1/n) Vn(θ) →p −I(θ). (3.5)
Consider u ∈ R^q with ||u|| ≤ K, for some constant 0 < K < ∞. Let us now define λn(u):

λn(u) = log Ln(θ + (1/√n)u) − log Ln(θ) = ∑_{i=1}^n [ log f(Xi; θ + (1/√n)u) − log f(Xi; θ) ], (3.6)

and let us make a first order Taylor expansion of log f(Xi; θ + (1/√n)u) around θ (Theorem 5):

log f(Xi; θ + (1/√n)u) = log f(Xi; θ) + (1/√n) uᵀ ∇_θ log f(Xi; θ) + (1/(2n)) uᵀ H_θ log f(Xi; θ + (1 − h)(1/√n)u) u,

for some 0 < h < 1. Substituting this into the definition of λn(u) in equation (3.6), we see

λn(u) = ∑_{i=1}^n [ (1/√n) uᵀ ∇_θ log f(Xi; θ) + (1/(2n)) uᵀ H_θ log f(Xi; θ + (1 − h)(1/√n)u) u ]
 = (1/√n) uᵀ ∑_{i=1}^n ∇_θ log f(Xi; θ) + (1/(2n)) uᵀ H_θ( ∑_{i=1}^n log f(Xi; θ + (1 − h)(1/√n)u) ) u.

Now, define Zn(u) to be

Zn(u) = (1/n) ∑_{i=1}^n [ H_θ log f(Xi; θ + (1 − h)(1/√n)u) − H_θ log f(Xi; θ) ].

Remember that Un(θ) = ∇_θ ∑_{i=1}^n log f(Xi; θ) and Vn(θ) = H_θ ∑_{i=1}^n log f(Xi; θ). Then

λn(u) = (1/√n) uᵀ Un(θ) + (1/(2n)) uᵀ Vn(θ) u + (1/2) uᵀ Zn(u) u. (3.7)
Let δ > 0. There exists an n0 = n0(δ) such that n^{−1/2}||u|| ≤ n^{−1/2}K < δ for all n ≥ n0. Letting n be large enough and using the triangle inequality, we get the following inequality:

||Zn(u)|| ≤ (1/n) ∑_{i=1}^n sup_{h:||h||≤||u||/√n} || H_θ log f(Xi; θ + h) − H_θ log f(Xi; θ) ||
 ≤ (1/n) ∑_{i=1}^n sup_{h:||h||≤δ} || H_θ log f(Xi; θ + h) − H_θ log f(Xi; θ) ||.

Because of the Law of Large Numbers (Theorem 4) and Condition 2b it follows that

lim_{δ→0} lim_{n→∞} (1/n) ∑_{i=1}^n sup_{h:||h||≤δ} || H_θ log f(Xi; θ + h) − H_θ log f(Xi; θ) || = lim_{δ→0} ψ_δ = 0 a.s. (3.8)

Therefore, from (3.8) we now see

sup_{u:||u||≤K} ||Zn(u)|| →a.s. 0. (3.9)
We can again rewrite λn(u) from (3.7), using (3.5):

λn(u) = (1/√n) uᵀ Un(θ) − (1/2) uᵀ I(θ) u + (1/2) uᵀ( (1/n)Vn(θ) + I(θ) )u + (1/2) uᵀ Zn(u) u.

Uniformly in ||u|| ≤ K, we use (3.9) to see that

lim_{n→∞} |uᵀ Zn(u) u| ≤ lim_{n→∞} uᵀu max_{i,j} |[Zn(u)]ij| ≤ K lim_{n→∞} ||Zn(u)|| = 0,

which means that uᵀZn(u)u = o_q(1). For uᵀ((1/n)Vn(θ) + I(θ))u we can show the same:

lim_{n→∞} |uᵀ( (1/n)Vn(θ) + I(θ) )u| ≤ lim_{n→∞} uᵀu max_{i,j} |[(1/n)Vn(θ) + I(θ)]ij| ≤ K lim_{n→∞} ||(1/n)Vn(θ) + I(θ)|| = 0.

This means that uᵀ((1/n)Vn(θ) + I(θ))u = o_q(1) too, and hence

(1/2) uᵀ( (1/n)Vn(θ) + I(θ) )u + (1/2) uᵀ Zn(u) u = o_q(1) + o_q(1) = o_q(1). (3.10)

Using ||u|| ≤ K and the result of equation (3.10), we conclude that λn(u), as defined in equation (3.6), is equal to

λn(u) = (1/√n) uᵀ Un(θ) − (1/2) uᵀ I(θ) u + o_q(1).
Now let us maximize λn(u) in u. We take the vector of partial derivatives ∇_u λn(u) and determine its root:

∇_u λn(u) = (1/√n) Un(θ) − (1/2) [ ∑_{i=1}^q ui[I(θ)]1i + ∑_{j=1}^q uj[I(θ)]j1, . . . , ∑_{i=1}^q ui[I(θ)]qi + ∑_{j=1}^q uj[I(θ)]jq ]ᵀ
 = (1/√n) Un(θ) − [ ∑_{i=1}^q ui[I(θ)]1i, . . . , ∑_{i=1}^q ui[I(θ)]qi ]ᵀ (by symmetry of I(θ))
 = (1/√n) Un(θ) − I(θ)u = 0 ⇐⇒ (1/√n) Un(θ) = I(θ)u.

The maximum is thus attained in û = [I(θ)]⁻¹ (1/√n) Un(θ). We therefore find that maximizing λn(u) in u gives us

û = [I(θ)]⁻¹ (1/√n) Un(θ) + o_q(1).
Now looking at our first definition of λn(u),

λn(u) = log Ln(θ + (1/√n)u) − log Ln(θ),

we see that maximizing λn(u) in u is the same as maximizing Ln(θ + (1/√n)u) = ∏_{i=1}^n f(Xi; θ + (1/√n)u) in u, since log Ln(θ) does not depend on u. This product is maximal when θ + (1/√n)u equals the MLE θ̂n of θ. What we thus see is

θ̂n = θ + (1/√n) û = θ + (1/√n) ( [I(θ)]⁻¹ (1/√n) Un(θ) + o_q(1) ).

Rewriting this equality, we see:

√n(θ̂n − θ) = û = [I(θ)]⁻¹ (1/√n) Un(θ) + o_q(1).

The o_q(1) term vanishes in probability as n → ∞. Since (1/√n)Un(θ) →D N_q(0, I(θ)), the vector [I(θ)]⁻¹(1/√n)Un(θ) is asymptotically normal with covariance [I(θ)]⁻¹I(θ)[I(θ)]⁻¹ = [I(θ)]⁻¹. Slutsky’s Lemma (Theorem 2) makes us reach the conclusion that

√n(θ̂n − θ) →D N(0, [I(θ)]⁻¹).
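Theorem 8 can be illustrated numerically. Below is a Monte Carlo sketch of ours (the thesis contains no such experiment) for the Exponential(λ) family, where the MLE is λ̂ = 1/X̄ and I(λ) = 1/λ², so √n(λ̂ − λ) should be approximately N(0, λ²).

```python
# Monte Carlo sketch of Theorem 8 for Exponential(lambda): the MLE is the
# reciprocal of the sample mean, and I(lambda) = 1/lambda^2, so
# sqrt(n)(MLE - lambda) is approximately N(0, lambda^2).
import random
import statistics

random.seed(1)
lam, n, reps = 2.0, 400, 2000
vals = []
for _ in range(reps):
    xs = [random.expovariate(lam) for _ in range(n)]
    lam_hat = 1.0 / (sum(xs) / n)                  # MLE of lambda
    vals.append((n ** 0.5) * (lam_hat - lam))      # scaled and centred estimator
print(round(statistics.mean(vals), 2), round(statistics.variance(vals), 2))
# mean near 0, variance near lambda^2 = 1/I(lambda) = 4
```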
3.2. Wald’s Theorem
In this Section we will prove asymptotic convergence of Wald’s Statistic as stated in Definition 4. Wald’s Statistic is defined by
QW = n(ˆθn− θ0)TI(ˆθn)(ˆθn− θ0).
ˆ
θnis defined as the Maximum Likelihood Estimator as defined in Definition 7. We will be
using the result of Theorem 8 to prove convergence of QW to a chi-squared distribution
The proof of Wald’s Theorem is written down in Large Sample Methods in Statistics, An Introduction with Applications ([4]), for θ a one-dimensional parameter in R. We give the generalization for Θ ⊂ Rq.
Theorem 9 (Wald’s Statistic). Let X1, . . . , Xn be a sequence of i.i.d. random variables with probability density function f(x; θ), θ ∈ Θ ⊂ R^q, which satisfies the conditions of Theorem 8. We consider the hypothesis test H0 : θ = θ0 versus H1 : θ ≠ θ0. Then Wald’s Statistic QW = n(θ̂n − θ0)ᵀ I(θ̂n)(θ̂n − θ0) has an asymptotic χ²_q-distribution under H0.
Proof. From Theorem 8 we know that, under H0, √n(θ̂n − θ0) →D N(0, [I(θ0)]⁻¹). Together with Cochran’s Theorem (Theorem 6) it now follows that, under H0,

Q*W = n(θ̂n − θ0)ᵀ I(θ0) (θ̂n − θ0) →D χ²_q.

Writing x = √n(θ̂n − θ0), we get the following equality:

QW/Q*W = (xᵀ I(θ̂n) x)/(xᵀ I(θ0) x),

for which we can use Courant’s Theorem (Theorem 7) to see that

ch_q( I(θ̂n)[I(θ0)]⁻¹ ) ≤ QW/Q*W ≤ ch_1( I(θ̂n)[I(θ0)]⁻¹ ).
Under the assumption that θ = θ0, we get that θ̂n →p θ0 as n → ∞. Consider [I(θ̂n)]ij. Because of our regularity assumptions on f we get

[I(θ̂n)]ij = ∫_R (∂ log f(x; θ)/∂θi)(∂ log f(x; θ)/∂θj) f(x; θ) dx |_{θ=θ̂n}
 → ∫_R (∂ log f(x; θ)/∂θi)(∂ log f(x; θ)/∂θj) f(x; θ) dx |_{θ=θ0} = [I(θ0)]ij, as n → ∞.

Using the Weak Law of Large Numbers (Theorem 4), we see

I(θ̂n) →p I(θ0), and thus I(θ̂n)[I(θ0)]⁻¹ →p I_q.

Consequently, all eigenvalues of I(θ̂n)[I(θ0)]⁻¹ converge in probability to 1:

ch_1{I(θ̂n)[I(θ0)]⁻¹} →p 1 and ch_q{I(θ̂n)[I(θ0)]⁻¹} →p 1.

This makes us see that QW/Q*W →p 1. Rewriting QW as

QW = Q*W · (QW/Q*W),

and using Q*W →D χ²_q and QW/Q*W →p 1, allows us to apply Slutsky’s Lemma (Theorem 2) to finally conclude that, under H0,

QW →D χ²_q.
4. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov Test is used to test samples for certain distributions. It does so by comparing the empirical cumulative distribution function of the data with the cumulative distribution function that we are testing for. One restriction, however, is that the Kolmogorov-Smirnov Test is meant for continuous distributions [1].
In this section we describe and explain the algorithm of the Kolmogorov-Smirnov Test. Consider a sequence of independent and identically distributed random variables X1, . . . , Xn, n ≥ 1. We know that the centred and scaled sample mean converges to a normal distribution. In the pre-limit, for n large, the distribution is only approximately normal. Since this is merely an approximation, we would like to have a test statistic which is able to detect this difference from a normal distribution as the sample size grows large. In the numerical illustrations that follow we consider n to be fixed and study the p-values obtained from the Kolmogorov-Smirnov Test for growing sample sizes.
We study the test by simulating data, trying to explain why the test does not work for discrete distributions and fixing that problem. Our data will be constructed in such a way that it is asymptotically normally distributed, by applying the Central Limit Theorem (Theorem 1).
4.1. Algorithm Kolmogorov-Smirnov Test
If we have data points x1, . . . , xm from some distribution, we can draw the empirical cumulative distribution function (empirical cdf), which is defined by the function

Fm(x) = (1/m) ∑_{i=1}^m 1{xi ≤ x} = |{i : xi ≤ x}|/m, x ∈ R.
This function is non-decreasing, right-continuous and bounded by 0 and 1, and therefore a real distribution function, as can be seen in the figure. Replacing the deterministic {xi} by random {Xi}, Fm(x) becomes random as well.

[Figure: empirical cdf F(x) of a sample, plotted against x.]
We can transform the data X1, . . . , Xm to U1 = F(X1), . . . , Um = F(Xm), to retrieve an i.i.d. Uniform(0,1) sample. We need to make the assumption that F is continuous and strictly increasing here. We denote the empirical cdf of the {Ui} by Gm(u), the so-called reduced empirical cdf; x and u are related by x = F⁻¹(u) and u = F(x). Now, look at

Fm(x) − F(x) = (1/m) ∑_{i=1}^m 1{Xi ≤ x} − F(x) = (1/m) ∑_{i=1}^m 1{Ui ≤ u} − G(u) = Gm(u) − G(u),

where G(u) = u is the Uniform(0,1) cdf.
We define the maximum distance between the empirical and the true cdf to be

Dm = sup_{x∈R} |Fm(x) − F(x)| = sup_{0≤u≤1} |Gm(u) − G(u)|.

This is a statistic, called the Kolmogorov Statistic. More commonly, we use the scaled version Km := √m Dm as a test statistic. The Null Hypothesis is that the empirical cdf is equal to some cdf F, and the Alternative Hypothesis is that it is equal to some other cdf. Now, this Km has a limiting distribution of its own. We refer to the Handbook of Monte Carlo Methods (reference) for the proof, and only show the result:

P(K ≥ y) := 1 − lim_{m→∞} P(Km ≤ y) = 1 − ∑_{k=−∞}^{∞} (−1)^k e^{−2(ky)²}, y > 0.
We set up H0 and H1:

H0 : the empirical cdf is equal to F,
H1 : the empirical cdf is not equal to F,

for a given distribution F. Then we need to follow the next four steps:

1. Order the data x(1) < · · · < x(m).
2. Calculate dm = max_{i=1,...,m} max{ i/m − F(x(i)), F(x(i)) − (i − 1)/m }.
3. Define km = √m dm.
4. Calculate the p-value P(K ≥ km).
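The four steps above can be sketched in Python as follows (the thesis uses MATLAB’s built-in kstest; this hand-rolled version, with names of our own choosing, uses the limiting series for the p-value, which is a good approximation only for large m; the five-point sample is purely illustrative).

```python
# Sketch of the four-step Kolmogorov-Smirnov algorithm described above.
import math

def ks_p_value(sample, cdf, terms=100):
    xs = sorted(sample)                                    # step 1: order the data
    m = len(xs)
    d_m = max(max((i + 1) / m - cdf(x), cdf(x) - i / m)    # step 2: d_m
              for i, x in enumerate(xs))
    k_m = math.sqrt(m) * d_m                               # step 3: k_m
    # step 4: P(K >= k_m) = 1 - sum_k (-1)^k exp(-2 (k k_m)^2)
    p = 1.0 - sum((-1) ** k * math.exp(-2.0 * (k * k_m) ** 2)
                  for k in range(-terms, terms + 1))
    return d_m, min(max(p, 0.0), 1.0)

std_normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
d, p = ks_p_value([-1.2, -0.4, 0.1, 0.7, 1.5], std_normal_cdf)
```

For this well-spread sample the p-value is large, so H0 would not be rejected.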
If the p-value exceeds some level α ∈ (0, 1), we do not reject H0. If, however, p < α,
our test statistic falls into the critical region and we reject H0.
4.2. Simulation in Matlab
In this Section we investigate how the Kolmogorov-Smirnov Test performs when we have a sample of the centered and scaled sample mean of a sequence of random variables from two distributions, the Bernoulli and the Uniform. We consider n to be fixed and equal to n = 200.
The resulting sample will be approximately, but not exactly, normally distributed. It is important to realize this, because it means that the Kolmogorov-Smirnov Test should eventually reject the Null Hypothesis. We compare the speed at which the Kolmogorov-Smirnov Test rejects the Null Hypothesis for growing sample size.
Bernoulli Sample
We simulated a Bernoulli-type trial in Matlab, by assigning the number +1 with probability p = 0.5, and −1 with probability 1 − p = 0.5, to the random variable Xi:

P(Xi = +1) = p = 1 − P(Xi = −1) = 0.5.

We created a vector X of length n = 200 with 1s and −1s. It is clear that the sample mean of this vector should approximately be equal to µX = 0, and the sample variance S²X approximately equal to E[X²] − (E[X])² = 1. By default, the Kolmogorov-Smirnov Test tests for a standard normal distribution. For this reason, we applied the Central Limit Theorem to the vector X:

Sn = ( ∑_{i=1}^n Xi − nµX ) / √(n S²X) = √n · X̄n,

creating a number Sn which, by the Central Limit Theorem (Theorem 1), is approximately standard normally distributed:

Sn ∼ N(0, 1) (approximately).
Now, we repeated this m times, creating a vector Ym = [Sn⁽¹⁾, . . . , Sn⁽ᵐ⁾]. Asymptotically, this vector has standard normally distributed entries. We test this vector for normality with the Kolmogorov-Smirnov Test.
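The construction of the vector Ym can be sketched as follows (a Python sketch of ours; the thesis does this in MATLAB and then calls its built-in kstest, which we omit here, only checking the first two moments).

```python
# Sketch of the simulated data: m copies of S_n = sqrt(n) * sample mean of
# n = 200 independent +/-1 variables; mean should be ~0 and variance ~1.
import random
import statistics

random.seed(3)
n, m = 200, 500
ys = []
for _ in range(m):
    xs = [random.choice((-1, 1)) for _ in range(n)]    # the +/-1 Bernoulli trial
    ys.append(sum(xs) / n ** 0.5)                      # S_n = sqrt(n) * sample mean
print(round(statistics.mean(ys), 2), round(statistics.pvariance(ys), 2))
# Note: each S_n lies on the lattice {2k/sqrt(200)}, so ys contains many ties;
# this discreteness is what the KS test eventually detects.
```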
The Kolmogorov-Smirnov test then gives a p-value. The Null-Hypothesis is that the sample comes from a standard normal distribution. If p > 0.05, we do not reject the Null-Hypothesis.
We observed that the p-value varies when we vary the length m of our vector Ym. The p-value is itself a random variable. We therefore chose to simulate 1000 p-values for every m, and then calculated the mean p-value for every m. This resulted in the following figure, where we see the mean p-value plotted against m.
[Figure: mean p-value of the Kolmogorov-Smirnov Test against sample size m, for the Bernoulli sample.]
We see the p-value varies a lot, even though we averaged over many simulated p-values for every m. Nevertheless, it decreases for growing m. The p-value drops below our significance level α = 0.05 around m = 1000, and keeps on decreasing after this.
Uniform(0,1) Sample
We saw that the p-value drops below 0.05 around a sample size of m = 900 when our data came from a Bernoulli distribution. We will now see what happens when we take the Uniform(0,1) distribution, also applying the Central Limit Theorem (1).
Due to calculation time, we set n equal to 50, and we have X*1, . . . , X*n i.i.d. random variables, where X*i ∼ Uniform(0, 1), for i ∈ {1, . . . , n}. The mean is µX* = 1/2 and the variance S²X* = 1/12. We define S*n:

S*n = ( ∑_{i=1}^n X*i − nµX* ) / √(n S²X*).

Creating vectors Y*m = [S*n⁽¹⁾, . . . , S*n⁽ᵐ⁾], where we vary m, we get the following graph:
[Figure: mean p-value of the Kolmogorov-Smirnov Test against sample size m, for the Uniform(0,1) sample.]
We see the p-value varies a bit more than in the Bernoulli case, which is caused by two things: S*n has a larger variance, and we simulated fewer p-values per sample size due to computing time. But in general, there is no directly visible decrease in the p-value for increasing m. The data is only normally distributed for n going to infinity, so at some point we do expect the p-value to drop for large sample sizes. Computing these large sample sizes takes a lot of time, but it would be interesting to simulate such data. We calculated 10 p-values for m = 10 000 000, which gave us a mean value of p = 0.3014. For larger sample sizes we would expect a further decrease in the p-value.
5. Conclusion
The goal of this bachelor project was to study statistical tests. Our focus lay on Wald's Statistic and the Kolmogorov-Smirnov Test.
In the first part we focused on theoretical properties of the Maximum Likelihood Estimator. In existing literature it has been proved that the Maximum Likelihood Estimator converges to a normal distribution under certain conditions. We managed to extend this proof for q-dimensional parameter spaces. We also extended the proof of Wald’s Theorem to q-dimensional parameter spaces. This theorem states that Wald’s Statistic converges to a chi-squared distribution under the Null Hypothesis H0 : θ = θ0.
These two proofs were the main theoretical contributions of this project. Our proofs relied heavily on the proofs given in the literature ([4]), but the extensions were far from trivial.
In the second part of this project we studied the performance of the Kolmogorov-Smirnov Test. Our motivation was to see whether the Kolmogorov-Smirnov Test can detect the deviation from the normal distribution in the pre-limit of the Central Limit Theorem. The implementation we tested is the one included in MATLAB R2015b. We think that such a numerical investigation is important, and it reveals some limitations in the capabilities of such statistical tests. We found that the test rejects the Null Hypothesis for relatively small sample sizes when the data originates from a Bernoulli distribution. This is related to the double counts (ties) that occur when sampling from a discrete distribution. We did not see the same for Uniform(0,1) data, although we saw a decrease of the p-value for m = 10 000 000. We suspect the Null Hypothesis should be rejected for even larger sample sizes.
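The effect of these double counts can be made concrete: with n = 200, the scaled Bernoulli sum takes at most n + 1 distinct values, so any sample of size m well above n + 1 necessarily contains repeated values, which a sample from a truly continuous normal distribution would almost surely not. A small Python check (our own illustration, not part of the original MATLAB experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 1000
# centred, scaled sums of n Rademacher (+/-1) variables
s = rng.choice([-1.0, 1.0], size=(m, n)).sum(axis=1) / np.sqrt(n)
# the sum of n +/-1 steps takes at most n + 1 distinct values,
# so a sample of size m >> n + 1 is guaranteed to contain ties
print(len(np.unique(s)), "distinct values among", m, "observations")
```

These ties pull the empirical distribution function away from the continuous normal CDF, which is one reason the test rejects at moderate sample sizes in the Bernoulli case.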
6. Popular Summary
When doing research, it is important to know how to deal with the data that we retrieve from experiments. For example, think of conducting research on whether a new medicine works better than a placebo, or not. If we give the medicine to one group of people and a placebo to a different group, and we see the first group has a higher rate of getting cured, we will suspect the medicine works. On the other hand, if we do not see a big difference in the fraction of people being cured by the medicine and the placebo, we will begin to suspect the medicine does not work.
Somehow we need to interpret the information, and process it in such a way that we can draw sensible conclusions. When can we say the medicine works better than a placebo? We cannot simply start conducting an experiment. Before we start, we will probably have some presumption; in our example, that the medicine works better than the placebo. We want to know whether this presumption is likely to be true after having seen our data. For this reason, we set up a so-called Null Hypothesis H0 and an Alternative Hypothesis H1. We do this in such a way that our presumption lies in our Alternative Hypothesis. So the Null Hypothesis in our example could be 'the medicine is equally effective as a placebo'. The Alternative Hypothesis could then be 'the medicine works better than the placebo'. After conducting the experiment, we can do two things: either we do not see enough evidence to reject H0, or we do see enough evidence,
so we reject H0 and adopt H1. Note that we do not treat H0 and H1 symmetrically.
Not rejecting the Null Hypothesis does not mean we say H0 is true, we merely say we
have not seen enough data to be able to reject it.
Continuing with our example, we decide to treat one group with the medicine, and the other group with the placebo, not telling anyone in which group they are placed. We record the fraction of people cured by the medicine, and the fraction of people 'cured' by the placebo. If the former fraction is (much) higher than the latter, we want to say it is highly probable the Null Hypothesis is false, and the Alternative Hypothesis is true. If the fractions do not differ enough from one another, we may not conclude the medicine works.
We can never know for sure whether the medicine works better than the placebo. After all, our experiment is subject to chance. It is possible we get 'unlucky' in how our groups are divided. We can make two kinds of mistakes. We might see many people cured by the medicine and few by the placebo. It is possible that we only saw this by chance, though, and that in fact the medicine does not truly work better than the placebo. After seeing this data we would decide to reject H0 and conclude the medicine does work, when in
reality it does not work better. Falsely rejecting H0 and adopting H1 is called a Type-I
error. We can also make a different kind of mistake. It is possible that, by coincidence, the subjects in our medicine group are not cured by the medicine, while many of the subjects
in our placebo group are cured. We would then say we do not see enough evidence to reject our Null Hypothesis, although in truth we should have been able to do so. This kind of mistake, where we incorrectly do not reject H0, is called a Type-II error.
Falsely rejecting H0 is a grave error which we want to minimize as much as possible.
Imagine falsely concluding a medicine works, and giving this medicine to sick people while the medicine does not work! Of course it is also a shame if we falsely conclude we cannot use a certain medicine, but less harm is done by this mistake than by the first one. We therefore want to minimize the probability of making a Type-I error. By doing so, however, we make it more likely to make a Type-II error. We can never make the probability of a Type-I error zero without making the probability of a Type-II error very big. We have to choose a significance level. This level, often denoted by α, is a number between 0 and 1, and is an upper bound for the probability that we make a Type-I error. In other words: we require the probability that we falsely reject H0 and adopt H1 to be at most α. After setting this level, we minimize
the probability of making a Type-II error. Then we can say we have a probability of at most α to falsely reject our Null Hypothesis.
Bibliography
[1] Dirk P. Kroese, Thomas Taimre, Zdravko I. Botev, Handbook of Monte Carlo Methods, Wiley Interscience.
[2] Fetsje Bijma, Marianne Jonker, Aad van der Vaart, Inleiding in de statistiek, Epsilon Uitgaven, Utrecht, 2013.
[3] Paul Igodt, Wim Veys, Lineaire Algebra, Universitaire Pers Leuven.
[4] Pranab K. Sen, Julio M. Singer, Large Sample Methods in Statistics: An Introduction with Applications, Chapman & Hall, New York, 1993.
A. Matlab Code
Bernoulli Sample
n = 200; mu = 0; sigma = 1;      % Rademacher (+/-1) variables: mean 0, variance 1
K = 100;                         % number of p-values per sample size
P = zeros(1, K);
M = zeros(1, 250);
for N = 10:10:2500               % sample size m
    Y = zeros(1, N);
    for k = 1:K
        for j = 1:N
            s = 0;               % sum of n Rademacher variables
            for i = 1:n
                if rand < 0.5
                    s = s + 1;
                else
                    s = s - 1;
                end
            end
            Y(1, j) = (s - n*mu) / (sqrt(n) * sqrt(sigma));
        end
        [h, p] = kstest(Y);
        P(1, k) = p;
    end
    M(1, N/10) = mean(P);        % mean p-value for this sample size
end
x = 10:10:2500;
plot(x, M)
Uniform Code
n = 50; mu = 0.5; sigma = 1/12;  % Uniform(0,1): mean 1/2, variance 1/12
K = 100;                         % number of p-values per sample size
P = zeros(1, K);
M = zeros(1, 100);
for N = 100:100:10000            % sample size m
    Y = zeros(1, N);
    for k = 1:K
        for j = 1:N
            s = 0;               % sum of n Uniform(0,1) variables
            for i = 1:n
                s = s + rand;
            end
            Y(1, j) = (s - n*mu) / (sqrt(n) * sqrt(sigma));
        end
        [h, p] = kstest(Y);
        P(1, k) = p;
    end
    M(1, N/100) = mean(P);       % mean p-value for this sample size
end
x = 100:100:10000;
plot(x, M)