Empirical Bayes

Hasine Efetürk

July 10, 2017

Bachelor Thesis Mathematics
Supervisor: Jarno Hartog MSc

[Cover figure: density of the posterior, showing f0 and f̂ on the interval [0, 1].]

Korteweg-de Vries Institute for Mathematics
Faculty of Science


Abstract

The empirical Bayes procedure is applicable to statistical inference when the same decision problem occurs repeatedly and independently with a fixed, yet unknown, a priori distribution of the parameter. Statistical problems tangled in a hierarchy can be disentangled using empirical Bayes. An approximation to a fully Bayesian treatment of a hierarchical Bayes model is considered. The empirical Bayes approach has certain advantages over any approach which neglects the fact that the parameter is itself a random variable, as well as over any approach which assumes a personal or a conventional distribution of the parameter not subject to change with experience. In this thesis, the computations of the empirical Bayes approach are elaborated in detail and applied to regression analysis.

Title: Empirical Bayes

Author: Hasine Efetürk, hasine.efeturk@student.uva.nl, 10173536

Supervisor: Jarno Hartog MSc

Second examiners: prof. dr. J.H. van Zanten and prof. dr. P.J.C. Spreij
Date: July 10, 2017


Contents

1. Introduction
2. Hierarchical Bayes models
   2.1. Hierarchical Bayes model
   2.2. Empirical Bayes method
3. Regression
   3.1. Linear Regression
   3.2. Projections of inputs into feature space
   3.3. Optimizing the hyperparameter
        3.3.1. The function of the optimal design matrix
        3.3.2. Estimate the variance of the errors
   3.4. The L-BFGS-B method
4. Conclusion and discussion
5. Popular summary
A. Mathematical background
   A.1. Basic concepts of Bayesian statistics
   A.2. Point estimation
   A.3. Conjugate priors
B. R code


1. Introduction

Nowadays, human beings experience a lot of uncertainty. Uncertainty can be divided into aleatoric and epistemic uncertainty: uncertainty about the outcomes of repeatable events, and uncertainty resulting from insufficient knowledge, respectively. The probability of drawing a certain card from a deck can be predicted, but the probability that a suspect indeed committed the crime of which the person is accused is much harder to predict. The two main statistical schools are frequentist and Bayesian statistics. The frequentist approach considers probabilities as objective probabilities, based on data. However, to solve a crime case, for instance, we need the Bayesian approach. Bayesians consider probabilities as measures of belief, since a belief needs to be expressed for an event which is unlikely to re-occur; this is called subjective probability. This approach enables the combination of objective and subjective probabilities, which requires some experience of the statistician. Further on, some methods of the Bayesian approach to statistical inference will be discussed.

Statistical analysis can be carried out for any problem setting. From the obtained information about the situation, a possible research question will arise, and the data required to give a proper answer to this question will become clear. However, before the data are available, the context of the statistical investigation needs to suggest some suitable probability models. If the dependence of the data on the unknown parameters can be expressed as a probability density function, Bayes' theorem can be used for inference.

As stated in [1, p. 13], let the data be the observations x = (x1, . . . , xn), where x is the observed value of a random variable X = (X1, . . . , Xn) having a probability distribution specified by a probability density function p(x). Further, let (Θ, F, P) be a measure space with P(Θ) = 1. Then (Θ, F, P) is a probability space, with parameter space Θ ⊆ R^d of possible values of θ, event space F obtained by restricting the parameter space to a suitable family of subsets, and probability measure P. Suppose further that the probability axioms are fulfilled. In parametric statistical inference, p(x) is of a known form but involves a finite number of real unknown parameters θ = (θ1, . . . , θd). The case we will study is the non-parametric model, where p(x) does not have a parametric representation.

There are three main types of inference: point estimation, confidence set estimation and hypothesis testing. Using point estimation, a single value is computed from the data x and used as an estimate of θ, as described in A.2. In confidence set estimation, a set of values is generated which has a high probability of enclosing the true, yet unknown, value of θ. In hypothesis testing, a Bayesian test will reject the null hypothesis Θ0 ⊂ Θ if p(Θ0|x) is below some threshold c. These types of inference will not be elaborated further in this thesis.

The key conceptual point is the way that the prior distribution on the unknown parameter θ is updated, on observing the realized value of the data x, to the posterior distribution, using Bayes' law. Inference about θ is then extracted from this posterior. However, it is hard to choose a prior distribution on θ if θ depends on other parameters ξ. In chapter 2, we shed light on the hierarchy within the parameters. The main objective is to develop a method for finding the posterior (section 2.2), which is necessary for inference, and to apply it to regression problems, which will be addressed in chapter 3. While writing this thesis, I was mostly inspired by Künsch [2] and Rasmussen [3].

Acknowledgements

I would like to thank my supervisor Jarno Hartog MSc for his great supervision and advice. I appreciate the time and effort he has invested in my work.

Hasine Efetürk


2. Hierarchical Bayes models

2.1. Hierarchical Bayes model

The Bayes model is defined by

$$p(\theta \mid x) = \frac{p(\theta)\,p(x \mid \theta)}{\int_0^{\infty} p(\theta)\,p(x \mid \theta)\,\mathrm{d}\theta},$$

where θ is the parameter to estimate, given observations x. Usually, the likelihood function (the parametric statistical model of x given θ) is denoted by the letter f, and the prior distribution on the parameters as well as the posterior distribution of the parameter θ is denoted by π. Throughout this thesis, the notation p is used both for the likelihood function and for the prior and posterior distribution functions, so be aware that for different density functions the notation p will be restated; the difference between the models will be clear through the usage of different data and parameters. The distribution of θ given x is proportional to the product of the marginal distribution of θ and the conditional distribution of x given θ. Suppose now that the prior for θ depends on other parameters ξ. This leads to the joint density of the triple of random variables (ξ, θ, x):

$$p(\xi, \theta, x) = p(\xi)\,p(\theta\mid\xi)\,p(x\mid\theta).$$

This is a hierarchical Bayes model, where ξ is a hyperparameter. The Bayesian approach is the same in the hierarchical Bayes model: the posterior, that is, the conditional distribution of the variables (ξ, θ) given the observed variables x, will be computed. Based on this posterior, inference is possible. There are two possible ways to compute this posterior. The first is by computing the marginal prior of θ and then using the Bayes formula,

$$p(\theta\mid x) \propto p(\theta)\,p(x\mid\theta), \qquad\text{where}\qquad p(\theta) = \int_0^{\infty} p(\theta\mid\xi)\,p(\xi)\,\mathrm{d}\xi.$$
Remark the special choice of the prior for θ. The second is given by
$$p(\theta\mid x) = \int_0^{\infty} p(\theta\mid\xi, x)\,p(\xi\mid x)\,\mathrm{d}\xi \propto \int_0^{\infty} p(\theta\mid\xi, x)\,p(\xi)\,p(x\mid\xi)\,\mathrm{d}\xi.$$


Example 2.1 (Hierarchical Poisson model). For a certain time period, let the number of claims X_j of contract j be modeled as independent Poisson(θ_j), where j ∈ {1, . . . , n}. Suppose further that the θ_j are independent and identically Γ(γ, λ)-distributed. The parameters (γ, λ) are on the highest level, (θ_1, . . . , θ_n) are on the second level and (X_1, . . . , X_n) are on the lowest level. The joint density is given by p(γ, λ) ∏_{j=1}^n p(θ_j|γ, λ) p(x_j|θ_j). The first approach computes the posterior as follows:
$$p(\theta\mid x) \propto p(x\mid\theta)\int_0^\infty\!\!\int_0^\infty p(\gamma,\lambda)\prod_{j=1}^n p(\theta_j\mid\gamma,\lambda)\,\mathrm{d}\gamma\,\mathrm{d}\lambda = p(x\mid\theta)\int_0^\infty\!\!\int_0^\infty p(\gamma,\lambda)\,\frac{\lambda^{n\gamma}}{\Gamma(\gamma)^n}\Bigl(\prod_{j=1}^n\theta_j^{\gamma-1}\Bigr)\exp\Bigl(-\lambda\sum_{j=1}^n\theta_j\Bigr)\mathrm{d}\gamma\,\mathrm{d}\lambda.$$

This integral is not a standard distribution and the θ_j are not independent under this prior. The second approach seems more inviting to solve. The (θ_j, x_j) conditional on (γ, λ) are independent for different j's, so
$$p(\theta_1,\dots,\theta_n\mid\gamma,\lambda,x_1,\dots,x_n) = \prod_{j=1}^n p(\theta_j\mid\gamma,\lambda,x_j), \qquad p(x_1,\dots,x_n\mid\gamma,\lambda) = \prod_{j=1}^n p(x_j\mid\gamma,\lambda).$$

The prior p(θ|γ, λ) is Gamma distributed, so the prior density is the conjugate for the Poisson model p(x|θ). The posterior is θ_j | γ, λ, x_j ∼ Γ(γ + x_j, λ + 1). The Gamma density with those parameters is given by
$$p(\theta_j\mid\gamma,\lambda,x_j) = \frac{(\lambda+1)^{\gamma+x_j}}{\Gamma(\gamma+x_j)}\,\theta_j^{\gamma+x_j-1}\exp\bigl(-(\lambda+1)\theta_j\bigr),$$
and the fact that $\int_0^\infty \theta^{\alpha-1}\exp(-\beta\theta)\,\mathrm{d}\theta = \Gamma(\alpha)/\beta^{\alpha}$ is used to solve the next equations:
$$p(x_j\mid\gamma,\lambda) = \int_0^\infty p(x_j\mid\theta_j)\,p(\theta_j\mid\gamma,\lambda)\,\mathrm{d}\theta_j = \int_0^\infty \exp(-\theta_j)\frac{\theta_j^{x_j}}{x_j!}\cdot\frac{\lambda^\gamma}{\Gamma(\gamma)}\theta_j^{\gamma-1}\exp(-\lambda\theta_j)\,\mathrm{d}\theta_j = \frac{\lambda^\gamma}{\Gamma(\gamma)\,x_j!}\int_0^\infty \theta_j^{x_j+\gamma-1}\exp\bigl(-(\lambda+1)\theta_j\bigr)\,\mathrm{d}\theta_j = \frac{\lambda^\gamma}{\Gamma(\gamma)\,x_j!}\,\frac{\Gamma(x_j+\gamma)}{(\lambda+1)^{x_j+\gamma}},$$
$$p(\theta_j\mid x) = \int_0^\infty\!\!\int_0^\infty p(\theta_j\mid\gamma,\lambda,x)\,p(\gamma,\lambda\mid x)\,\mathrm{d}\gamma\,\mathrm{d}\lambda \propto \int_0^\infty\!\!\int_0^\infty p(\theta_j\mid\gamma,\lambda,x_j)\,p(\gamma,\lambda)\prod_{i=1}^n p(x_i\mid\gamma,\lambda)\,\mathrm{d}\gamma\,\mathrm{d}\lambda.$$


Hence, the information transferred by this integral is that the insurance company uses information about others to estimate the expected number of claims of a certain person. This can be explained by the fact that every person belongs to a population, where the distribution of the expected number of claims in this population can be estimated by the number of claims of the rest of the population. This integral cannot be computed in closed form; however, approximations are possible.

A more sophisticated approach will be discussed in the next section on empirical Bayes methods.

2.2. Empirical Bayes method

The example from the previous section revealed that the posterior of θ given observations x could not be computed in closed form, while the computation of p(θ|x, ξ) and p(x|ξ) was possible by the use of conjugate priors and independence. A possible solution for determining the posterior p(θ|x) is the empirical Bayes method. This method chooses the value of the hyperparameter with maximal weight instead of taking the weighted average over the hyperparameters. In other words,
$$p(\theta\mid x) \approx p(\theta\mid x, \hat\xi(x)), \qquad \hat\xi(x) = \arg\max_{\xi}\, p(x\mid\xi).$$

Using this method removes the complexity of computing the integral and of choosing the hyperprior. The infinite hierarchy that would arise by acknowledging that even the hyperprior depends on another prior is left aside.

The example in the previous section will be continued.

Example 2.2 (Hierarchical Poisson model). Recall that
$$p(x_j\mid\gamma,\lambda) = \frac{\lambda^\gamma}{\Gamma(\gamma)\,x_j!}\,\frac{\Gamma(x_j+\gamma)}{(\lambda+1)^{x_j+\gamma}}.$$

The aim is to maximize this over the hyperparameters γ and λ.

$$\log p(x_j\mid\gamma,\lambda) = \gamma\log\lambda - \log\Gamma(\gamma) - \log x_j! + \log\Gamma(x_j+\gamma) - (x_j+\gamma)\log(\lambda+1).$$

So the maximum over (γ, λ) is obtained by minimizing the negative log-likelihood
$$\begin{aligned}
-\log p(x_1,\dots,x_n\mid\gamma,\lambda) &= -n\gamma\log\lambda + n\log\Gamma(\gamma) + \sum_{j=1}^n \log x_j! - \sum_{j=1}^n \log\Gamma(x_j+\gamma) + \sum_{j=1}^n (x_j+\gamma)\log(\lambda+1)\\
&= -n\gamma\log\lambda + \sum_{j=1}^n \log\frac{\Gamma(\gamma)}{\Gamma(x_j+\gamma)} + \sum_{j=1}^n \log x_j! + \sum_{j=1}^n x_j\log(\lambda+1) + n\gamma\log(\lambda+1)\\
&= \sum_{j=1}^n \log\frac{x_j!\,\Gamma(\gamma)}{\Gamma(x_j+\gamma)} + \sum_{j=1}^n x_j\log(\lambda+1) + n\gamma\log\Bigl(1+\frac{1}{\lambda}\Bigr)\\
&= \sum_{j:\,x_j>0}\Biggl(\sum_{k=1}^{x_j}\log k - \sum_{k=0}^{x_j-1}\log(\gamma+k)\Biggr) + \sum_{j=1}^n x_j\log(\lambda+1) + n\gamma\log\Bigl(1+\frac{1}{\lambda}\Bigr).
\end{aligned}$$

Minimizing this over λ results in
$$\sum_{j=1}^n \frac{x_j}{\lambda+1} = \frac{n\gamma}{1+\frac{1}{\lambda}}\cdot\frac{1}{\lambda^2} \iff \frac{1}{n}\sum_{j=1}^n x_j = \gamma\cdot\frac{\lambda+1}{\lambda^2+\lambda} = \frac{\gamma}{\lambda}.$$
Minimizing this over γ results in
$$\sum_{j:\,x_j>0}\ \sum_{k=0}^{x_j-1}\frac{1}{\gamma+k} = n\log\Bigl(1+\frac{1}{\lambda}\Bigr).$$

So λ̂ = γ̂/x̄, and the expression for γ̂ cannot be given explicitly. Thus another approximation method is necessary to solve the problem.

Simulating this in RStudio, suppose that the θ_j are independent and identically Γ(5, 1)-distributed, and use that, after substituting γ = λx̄, the left-hand side of the equation is a function of λ, namely g(λ), and the right-hand side is f(λ).
[Figure: plot of f(λ) and g(λ), titled "Optimizing the log likelihood".]

The results found are λ̂ = 1.26 and γ̂ = λ̂x̄ = 5.68. Hence, using empirical Bayes, an approximation of the two-stage hierarchical model is obtained.
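A minimal R sketch of this computation (illustrative, not the script used in the thesis): it simulates claim counts from the model of Example 2.1, uses that the marginal density p(x_j | γ, λ) derived above is a negative binomial probability with size γ and success probability λ/(λ + 1), substitutes λ = γ/x̄, and minimizes the resulting profiled negative log-likelihood over γ numerically.

# Empirical Bayes for the Poisson-Gamma model: a sketch with simulated data.
set.seed(2017)
n     <- 1000
theta <- rgamma(n, shape = 5, rate = 1)   # theta_j ~ Gamma(5, 1)
x     <- rpois(n, theta)                  # x_j | theta_j ~ Poisson(theta_j)
xbar  <- mean(x)

# Negative marginal log-likelihood as a function of gamma, with lambda = gamma / xbar
neg_loglik <- function(gamma) {
  lambda <- gamma / xbar
  -sum(dnbinom(x, size = gamma, prob = lambda / (lambda + 1), log = TRUE))
}

gamma_hat  <- optimize(neg_loglik, interval = c(0.01, 100))$minimum
lambda_hat <- gamma_hat / xbar

# Empirical Bayes posterior for each theta_j: Gamma(gamma_hat + x_j, lambda_hat + 1)
post_mean <- (gamma_hat + x) / (lambda_hat + 1)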

For empirical Bayes computation, the likelihood needs to be maximized over the hyperparameters, and giving the solution explicitly is not always possible. The material evaluated up to now is about determining the posterior of the parameter θ, which depends on other variables. This can be related to regression models. Regression models are used to describe how a response variable depends on other explanatory variables. Most of statistical modelling is about the question 'What happens to y when x varies?', as stated in [4]. The usage of the empirical Bayes method in regression will be discussed; this is the reason why the next chapter is about regression models.


3. Regression

The first known form of regression was the method of least squares, published by Legendre in 1805 and by Gauss in 1809. Both applied this method to the problem of determining, from astronomical observations, the orbits of bodies about the sun (first comets, later the newly discovered minor planets). Regression is a statistical learning process which plays a key role in many areas of science, finance and industry. An example in finance is to predict the price of a commodity as a function of interest rates, currency exchange rates, availability and demand. A second example, this time in science, is to predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on diet and clinical measurements for that patient. All of these random processes represent numerical values of systems randomly changing over time. A formal definition as in [5] follows.

Definition 3.1 (Stochastic process). Let S be the state space. A random element in S is called a random variable when S = R, a random vector when S = R^m with m ∈ N, a random sequence when S = R^∞, and a random or stochastic process when S is a function space.

A stochastic process can be seen as a generalization of a probability distribution (of a finite-dimensional vector) to functions. The restriction to distributions which are Gaussian gives the Gaussian processes (GP); a formal definition will be given later on. This chapter covers Gaussian process methods for regression problems. The view of the regression model in which the output is a linear combination of the inputs is called the weight-space view. At first, we introduce the linear regression model in the weight-space view.

3.1. Linear Regression

Given is the data set D = {(x_i, y_i) : i = 1, . . . , n}. The input vector (explanatory variables) is x = (x_1, . . . , x_d)^⊤ and y is the target (response variable). All n observations are collected in the d × n design matrix X. The linear regression model with Gaussian noise is given by
$$y = f(x) + \varepsilon, \qquad f(x) = x^\top w,$$
where y is the response variable (the n observations), which is approximated by the continuous function f up to the noise ε. Furthermore, the matrix X contains the explanatory variables and w is the vector of associated weights of the linear model. Let the ε_j be i.i.d. according to the normal distribution,
$$\varepsilon_j \sim \mathcal{N}(0, \sigma^2), \qquad j = 1, \dots, n.$$
The likelihood is y | X, w ∼ N(X^⊤w, σ²I) and the prior is w ∼ N(0, Σ), where Σ is the covariance matrix of the prior on the weights. The posterior of the weights is then


$$p(w\mid X, y) = \frac{p(w)\,p(y\mid X, w)}{\int p(w)\,p(y\mid X, w)\,\mathrm{d}w}, \qquad\text{where}\qquad p(y\mid X, w) \propto \exp\Bigl(-\frac{1}{2}\,\frac{(y - X^\top w)^\top(y - X^\top w)}{\sigma^2}\Bigr), \quad p(w) \propto \exp\Bigl(-\frac{1}{2}\,w^\top\Sigma^{-1}w\Bigr).$$
The product is in proportion equal to the posterior:
$$\begin{aligned}
p(w\mid X, y) &\propto \exp\Bigl(-\frac{1}{2}\,\frac{(y - X^\top w)^\top(y - X^\top w)}{\sigma^2}\Bigr)\exp\Bigl(-\frac{1}{2}\,w^\top\Sigma^{-1}w\Bigr)\\
&= \exp\Bigl(-\frac{1}{2}\Bigl[\frac{(y - X^\top w)^\top(y - X^\top w)}{\sigma^2} + w^\top\Sigma^{-1}w\Bigr]\Bigr)\\
&= \exp\Bigl(-\frac{1}{2}\Bigl[\frac{y^\top y}{\sigma^2} - \frac{2y^\top X^\top w}{\sigma^2} + \frac{w^\top X X^\top w}{\sigma^2} + w^\top\Sigma^{-1}w\Bigr]\Bigr)\\
&= \exp\Bigl(-\frac{1}{2}\Bigl[w^\top\Bigl(\frac{XX^\top}{\sigma^2} + \Sigma^{-1}\Bigr)w - \frac{2y^\top X^\top}{\sigma^2}w + \frac{y^\top y}{\sigma^2}\Bigr]\Bigr).
\end{aligned}$$
Suppose w | X, y ∼ N(µ, Ω⁻¹); the exponent of such a density is in proportion equal to (w − µ)^⊤Ω(w − µ) = w^⊤Ωw − 2µ^⊤Ωw + µ^⊤Ωµ. The distribution of w given X and y is therefore also Gaussian,
$$w\mid X, y \sim \mathcal{N}\Bigl(\frac{1}{\sigma^2}A^{-1}Xy,\; A^{-1}\Bigr), \qquad\text{where}\qquad A = \frac{XX^\top}{\sigma^2} + \Sigma^{-1}.$$

The aim is to fit a continuous function to an infinite set given some finite set y. The first target is to make predictions for all possible input values x_*; the collection of all possible input values is called the test case. The predictive distribution for f_* = f(x_*) is deduced from the distribution of w given X and y and is also normally distributed:
$$f_*\mid x_*, X, y \sim \mathcal{N}\Bigl(\frac{1}{\sigma^2}\,x_*^\top A^{-1}Xy,\; x_*^\top A^{-1}x_*\Bigr).$$
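As an illustration of these formulas, the following is a minimal R sketch (with illustrative variable names and data, not taken from the thesis) that forms the Gaussian posterior of the weights and the predictive mean and variance at a test input x_*:

# Bayesian linear regression in the weight-space view: posterior and prediction.
set.seed(1)
n <- 50; d <- 2; sigma2 <- 0.1
X <- rbind(1, runif(n))                      # d x n design matrix (intercept + one input)
w_true <- c(0.5, -1)
y <- as.vector(t(X) %*% w_true) + rnorm(n, 0, sqrt(sigma2))

Sigma  <- diag(d)                            # prior covariance of the weights
A      <- X %*% t(X) / sigma2 + solve(Sigma)
w_mean <- solve(A, X %*% y) / sigma2         # posterior mean (1/sigma^2) A^{-1} X y
w_cov  <- solve(A)                           # posterior covariance A^{-1}

# Predictive distribution at a test input x_star
x_star <- c(1, 0.3)
f_mean <- as.numeric(t(x_star) %*% w_mean)
f_var  <- as.numeric(t(x_star) %*% w_cov %*% x_star)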

Example 3.2. Suppose that some data points {(x_i, y_i) : i ∈ {1, 2, . . . , 12}} are given. It seems like those points lie on the curve f(x) = cos(x), so we try to fit the data points with the linear model.
[Figure: the data points (x, y), the function f, and the fitted linear predictor.]

The Bayesian treatment of the linear model was considered in this section. However, the design matrix contains one-dimensional entries, so approximations of more exotic functions are hard to obtain. An enhancement of this model is obtained by projecting the inputs x into a high-dimensional feature space φ(x) and applying the linear model there. This will be done in the next section.

3.2. Projections of inputs into feature space

This section is about mapping the explanatory variables into a feature space. The linear model applied to the variables does not always fit well; to extend our domain, the input x needs to be projected into some high-dimensional space. Consider the function φ : R^d → R^m, where φ maps a d-dimensional input vector x into an m-dimensional feature space. A possible mapping for φ could be φ(x) = (1, x, x², x³, . . . , x^{m−1})^⊤. The choice of φ in order to obtain an optimal regression model will be discussed later in this chapter. Let the feature space be a basis of polynomials constructed by Gram–Schmidt orthogonalisation. Let Φ ∈ R^{m×n} be the matrix with φ(x_i) in every column, where x_i = (x_1, . . . , x_d) and i ∈ {1, . . . , n}, so Φ(X) = (φ(x_1), . . . , φ(x_n)). The adjusted model is
$$f(x) = \varphi(x)^\top w.$$

The prediction f_* for a test value x_* follows the normal distribution
$$f_*\mid x_*, \Phi, y \sim \mathcal{N}\Bigl(\frac{1}{\sigma^2}\,\varphi(x_*)^\top A^{-1}\Phi y,\; \varphi(x_*)^\top A^{-1}\varphi(x_*)\Bigr), \qquad\text{where}\qquad \Phi = \Phi(X),\quad A = \frac{1}{\sigma^2}\Phi\Phi^\top + \Sigma^{-1}.$$
Furthermore, this distribution can be obtained by substituting X = Φ in the distribution of f_* given y as stated in the previous section. Write φ(x_*) = φ_* and define K = Φ^⊤ΣΦ. The parameters of the latter distribution can be written in the form
$$f_*\mid x_*, \Phi, y \sim \mathcal{N}\bigl(\varphi_*^\top\Sigma\Phi(K+\sigma^2 I)^{-1}y,\; \varphi_*^\top\Sigma\varphi_* - \varphi_*^\top\Sigma\Phi(K+\sigma^2 I)^{-1}\Phi^\top\Sigma\varphi_*\bigr).$$
Showing that the means of these distributions are equal is a matter of some calculus; a lemma, as stated in [3, p. 201], is needed to show the equality of the variances.

Lemma 3.3 (matrix inversion). Let Z ∈ R^{n×n}, W ∈ R^{m×m} and U, V ∈ R^{n×m}, and assume that the relevant matrices are invertible. Then the following equality holds:
$$(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1}U\bigl(W^{-1} + V^\top Z^{-1}U\bigr)^{-1}V^\top Z^{-1}.$$
Setting Z^{-1} = Σ, W^{-1} = σ²I and V = U = Φ leads to the desired equality of the variances.
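The equality of the two variance expressions can also be checked numerically. The following minimal R sketch (arbitrary illustrative sizes and values) compares φ_*^⊤A⁻¹φ_* with φ_*^⊤Σφ_* − φ_*^⊤ΣΦ(K + σ²I)⁻¹Φ^⊤Σφ_*:

# Numerical check of the matrix inversion lemma used above.
set.seed(42)
m <- 4; n <- 20; sigma2 <- 0.1
Phi      <- matrix(rnorm(m * n), m, n)       # m x n feature matrix
Sigma    <- diag(m)                          # prior covariance of the weights
phi_star <- rnorm(m)                         # features of a test input

A <- Phi %*% t(Phi) / sigma2 + solve(Sigma)
K <- t(Phi) %*% Sigma %*% Phi

var_weightspace <- t(phi_star) %*% solve(A, phi_star)
var_kernel <- t(phi_star) %*% Sigma %*% phi_star -
  t(phi_star) %*% Sigma %*% Phi %*%
    solve(K + sigma2 * diag(n), t(Phi) %*% Sigma %*% phi_star)

all.equal(as.numeric(var_weightspace), as.numeric(var_kernel))  # TRUE up to rounding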

A generalization can be made in the distribution of f_*: remark that the feature space enters only through the forms φ_*^⊤ΣΦ, φ_*^⊤Σφ_*, Φ^⊤ΣΦ and Φ^⊤Σφ_*. Thus the entries of these matrices have the form φ(x)^⊤Σφ(x′), where x and x′ are in the set of data points or in the set of predicted points. Define the kernel k(x, x′) = φ(x)^⊤Σφ(x′). The kernel contains φ and Σ: the length of φ(x), i.e. the number of orthogonal polynomials, is going to be varied, and the choice of the covariance matrix Σ will be determined further on. These elements of the kernel are the important parameters which will be adjusted to obtain the optimized model.

As stated in [3, p. 31], the formal definition of a Gaussian process will follow.

Definition 3.4 (Gaussian process). A Gaussian process is a collection of random variables, any finite number of which have a joint normal distribution.

This process is specified by its mean function and covariance function. Let m(x) and k(x, x′) be the mean function and the covariance function of f(x), respectively:
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\bigl[(f(x) - m(x))(f(x') - m(x'))\bigr].$$
A Gaussian process is written as
$$f(x) \sim \mathcal{GP}\bigl(m(x), k(x, x')\bigr).$$

The collection of random variables consists of the values of the function f(x) at the locations x. A Gaussian process is normally defined over time, so that the index set of the random variables is time. Here, however, the index set is the set of all possible inputs x, a subset of R^d. If we increase d, the same procedure for finding the distribution of y needs to hold, so the consistency requirement needs to be fulfilled. Without giving a proof, this means that examination of a larger set of variables does not change the distribution of the smaller set.
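Before turning to the worked example below, here is a minimal R sketch (an illustration, not taken from the thesis) that draws sample functions from the Gaussian process induced by the finite basis model f(x) = φ(x)^⊤w with w ∼ N(0, Σ), i.e. from N(0, K) with K = Φ^⊤ΣΦ; a small jitter is added to K for numerical stability:

# Draw sample functions from the GP prior f | Phi ~ N(0, K), K = Phi^T Sigma Phi.
library(mvtnorm)                       # provides rmvnorm, as in Appendix B
set.seed(2017)
n <- 100; m <- 10; c0 <- 1             # c0 plays the role of the hyperparameter c
x     <- seq(0, 1, length.out = n)
Phi   <- t(cbind(rep(1 / sqrt(n), n), poly(x, degree = m - 1)))  # m x n basis matrix
Sigma <- c0 * diag(m)                  # prior covariance of the weights
K     <- t(Phi) %*% Sigma %*% Phi      # n x n covariance of the function values

f_samples <- rmvnorm(5, mean = rep(0, n), sigma = K + 1e-8 * diag(n))
matplot(x, t(f_samples), type = "l", lty = 1, ylab = "f(x)",
        main = "Draws from the GP prior")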

An example of a Gaussian process is elaborated below.

Example 3.5. Consider the following linear regression model with i.i.d. noise: y = f(x) + ε, where f(x) = φ(x)^⊤w and ε_i ∼ N(0, σ²) for i = 1, . . . , n, and where the marginal likelihood is given by
$$p(y\mid\Phi) = \int p(f\mid\Phi)\,p(y\mid f)\,\mathrm{d}f.$$
The terms inside the integral are the prior and the likelihood respectively, where
$$f\mid\Phi \sim \mathcal{N}(0, K), \qquad y\mid f \sim \mathcal{N}(\Phi^\top w, \sigma^2 I), \qquad y\mid\Phi \sim \mathcal{N}(0, K + \sigma^2 I).$$

Let m be the dimension of the feature space, σ² = 0.1 and let Σ ∈ R^{m×m} be the product of an m × m identity matrix and a constant c. For now, let c = 1. Suppose that the given training data and a first choice for φ(x) are fixed; another choice for φ(x) is the basis of orthogonal polynomials. Those choices are shown, respectively, in the figures below.
[Two figures: the true function f and the posterior estimate f̂, one for each choice of the feature space.]

It is obvious that the choice of the first feature space results in a better approximation. Nevertheless, increasing the dimension of the feature space will give more accurate approximations for the basis with orthogonal polynomials. In short, finding an appropriate feature space is important to optimize the posterior. The relevance of the choice of c and of the dimension of the feature space will become clear from the next example.

Example 3.6. Consider the same linear regression model as in Example 3.5. Let the feature space consist of the basis of orthogonal polynomials and let
$$f_0(x) = \sin(2\pi x) + \cos(5x^4) + \sin(4\pi x), \qquad n = 200, \qquad \Sigma = Ic \text{ where } I \in \mathbb{R}^{m\times m} \text{ and } c \in \mathbb{R}.$$
[Four figures of f0 and f̂ — Figure 3.1: m = 10, c = 1; Figure 3.2: m = 10, c = 0.1; Figure 3.3: m = 25, c = 1; Figure 3.4: m = 10, c = 10.]


From the plots, one can deduce that for c → 0 the function f̂ is approximately constant. Another observation is that f̂ is more accurate if m increases. Hence, the plots are significantly different for different values of c and m. The need to find optimal values for the hyperparameters c and m arises. In the examples, the standard deviation σ was considered known for the sake of simplicity; further on, σ will also be estimated. The next section covers the choice of the hyperparameters.


3.3. Optimizing the hyperparameter

From the examples in the previous section, the need to determine the hyperparameters emerged. First, the distribution used to optimize the value of the design matrix will be derived.

3.3.1. The function of the optimal design matrix

Let y | w ∼ N(φ^⊤w, σ²I) and w | c ∼ N(0, cI). The aim is to maximize p(y|c) over c. The distribution of y given c follows from the law of total probability,
$$p(y\mid c) = \int p(y\mid w)\,p(w\mid c)\,\mathrm{d}w.$$
The conditional densities of the terms in the integral are given by
$$p(y\mid w) = \frac{p(y, w)}{p(w)} = p(y\mid w, c) = \frac{p(y, w, c)}{p(w, c)}, \qquad p(w\mid c) = \frac{p(w, c)}{p(c)}, \qquad p(y\mid w)\,p(w\mid c) = \frac{p(y, w, c)}{p(c)} = p(y, w\mid c).$$

Thus the distributions are given by
$$p(y\mid w) = (2\pi\sigma^2)^{-n/2}\exp\Bigl(-\frac{1}{2}\,\frac{(y-\varphi^\top w)^\top(y-\varphi^\top w)}{\sigma^2}\Bigr), \qquad p(w\mid c) = (2\pi c)^{-m/2}\exp\Bigl(-\frac{1}{2}\,\frac{w^\top w}{c}\Bigr).$$
The integral of the product is
$$(2\pi\sigma^2)^{-n/2}(2\pi c)^{-m/2}\int \exp\Bigl(-\frac{1}{2}\Bigl[\frac{1}{c}w^\top w + \frac{1}{\sigma^2}y^\top y + \frac{1}{\sigma^2}w^\top\varphi\varphi^\top w - \frac{2}{\sigma^2}y^\top\varphi^\top w\Bigr]\Bigr)\mathrm{d}w$$
$$= (2\pi\sigma^2)^{-n/2}(2\pi c)^{-m/2}\exp\Bigl(-\frac{1}{2\sigma^2}y^\top y\Bigr)\int \exp\Bigl(-\frac{1}{2}\Bigl[w^\top\Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)w - \frac{2}{\sigma^2}y^\top\varphi^\top w\Bigr]\Bigr)\mathrm{d}w.$$
The term inside the integral is again a normal density, with parameters
$$\Sigma^{-1} = \frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top \qquad\text{and}\qquad \mu = \Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)^{-1}\varphi y\,\frac{1}{\sigma^2}.$$
If w is multivariate Gaussian with parameters µ and Σ, then
$$\int \exp\Bigl(-\frac{1}{2}(w-\mu)^\top\Sigma^{-1}(w-\mu)\Bigr)\mathrm{d}w = (2\pi)^{m/2}(\det\Sigma)^{1/2}.$$
So the density p(y|c) is equal to
$$(2\pi\sigma^2)^{-n/2}\,c^{-m/2}\exp\Bigl(-\frac{y^\top y}{2\sigma^2} + \frac{1}{2\sigma^4}\,y^\top\varphi^\top\Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)^{-1}\varphi y\Bigr)\Bigl(\det\Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)\Bigr)^{-1/2},$$
so the best value for c is given by
$$\arg\max_c\, p(y\mid c) = \arg\max_c\ c^{-m/2}\exp\Bigl(\frac{1}{2\sigma^4}\,y^\top\varphi^\top\Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)^{-1}\varphi y\Bigr)\Bigl(\det\Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)\Bigr)^{-1/2}.$$
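To connect this with the code in Appendix B, here is a minimal sketch that evaluates log p(y|c) up to an additive constant (with σ² treated as known) and maximizes it over c; the data and basis below are illustrative choices in the spirit of the later examples:

# log p(y | c) up to a constant, sigma^2 known, and its maximizer over c.
log_p_y_given_c <- function(c, phi, y, sigma2) {
  m <- nrow(phi)
  B <- diag(m) / c + phi %*% t(phi) / sigma2          # (1/c) I + phi phi^T / sigma^2
  -m / 2 * log(c) +
    (1 / (2 * sigma2^2)) * t(y) %*% t(phi) %*% solve(B, phi %*% y) -
    0.5 * determinant(B, logarithm = TRUE)$modulus
}

set.seed(2017)
n <- 200; m <- 10; sigma2 <- 1
x   <- seq(0, 1, length.out = n)
phi <- t(cbind(rep(1 / sqrt(n), n), poly(x, degree = m - 1)))   # orthogonal polynomial basis
y   <- 3 * sin(10 * x) + rnorm(n, 0, sqrt(sigma2))

c_hat <- optimize(function(c) -log_p_y_given_c(c, phi, y, sigma2),
                  interval = c(1e-4, 1e4))$minimum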


3.3.2. Estimate the variance of the errors

Before, the variance was considered known. In most regression problems σ² is unknown, so the distribution of y depends on yet another hyperparameter. Suppose that σ² ∼ Γ⁻¹(α, β), with α, β known. Then
$$p(y) = \int\!\!\int_0^\infty p(y\mid w, \sigma^2)\,p(w\mid c)\,p(\sigma^2)\,\mathrm{d}\sigma^2\,\mathrm{d}w.$$

The product yields
$$\frac{\beta^\alpha}{\Gamma(\alpha)}(2\pi)^{-n/2}(\sigma^2)^{-n/2-\alpha-1}(2\pi c)^{-m/2}\exp\Bigl(-\frac{1}{2}\,\frac{(y-\varphi^\top w)^\top(y-\varphi^\top w)}{\sigma^2} - \frac{\beta}{\sigma^2} - \frac{1}{2}\,\frac{w^\top w}{c}\Bigr).$$
Integrating this with respect to σ² results in
$$\frac{\beta^\alpha}{\Gamma(\alpha)}(2\pi)^{-n/2}(2\pi c)^{-m/2}\exp\Bigl(-\frac{1}{2}\,\frac{w^\top w}{c}\Bigr)\int_0^\infty(\sigma^2)^{-n/2-\alpha-1}\exp\Bigl(-\frac{\frac{1}{2}(y-\varphi^\top w)^\top(y-\varphi^\top w)+\beta}{\sigma^2}\Bigr)\mathrm{d}\sigma^2.$$
The term inside the integral is an inverse-gamma kernel with parameters α + n/2 and ½(y − φ^⊤w)^⊤(y − φ^⊤w) + β. The result of integrating this with respect to w cannot be given explicitly, so the integration order is changed. After some calculus, the resulting density in y and σ² is
$$\frac{\beta^\alpha}{\Gamma(\alpha)}(2\pi)^{-n/2}(\sigma^2)^{-n/2-\alpha-1}c^{-m/2}\exp\Bigl(-\frac{\beta + \frac{1}{2}y^\top y}{\sigma^2} + \frac{1}{2\sigma^4}\,y^\top\varphi^\top\Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)^{-1}\varphi y\Bigr)\cdot\Bigl(\det\Bigl(\frac{1}{c}I + \frac{1}{\sigma^2}\varphi\varphi^\top\Bigr)\Bigr)^{-1/2}.$$

Hence, a likelihood function is found for the unknown σ² and c. With this distribution function, the optimal values for σ² and c can be obtained straight away. It seems quite intuitive that adding features to the feature space will result in a more accurate likelihood. In fact, one observes that extending the basis of φ does not always result in a good approximation of the posterior. To find the optimal value of the length m of φ, the same distribution is considered, but now it is optimized with respect to m. To support the previous theory, consider the next example.

Example 3.7. Consider the linear regression model of the data y, where y = f(x) + ε with f(x) = φ(x)^⊤w; assume that f is the function f₀(x) = 3 sin(10x) and ε_j ∼ N(0, 1). Let y = (y₁, . . . , y₂₀₀), φ : R → R¹⁰ and w ∼ N(0, Σ) with Σ = I·c, I ∈ R¹⁰ˣ¹⁰ and c = 1. The computed likelihood of y given the hyperparameters is optimized using maximum likelihood estimation, and the optimized values of the hyperparameters are given below each figure. The range is chosen differently for each hyperparameter. The likelihood distributions are shown below.
[Four figures — Figure 3.5: log-likelihood as a function of c, ĉ = 79.43; Figure 3.6: log-likelihood as a function of σ², σ̂² = 3.98; Figure 3.7: log-likelihood as a function of m, m̂ = 7; Figure 3.8: f₀ and f̂ before optimization.]

In the distribution of c there seems to be no clear optimal value, but the optimization method gives the MLE ĉ = 79.43. Using the hyperparameters found, f̂ seems to be a better approximation:
[Figure: f₀ and f̂ using the separately optimized hyperparameters.]

However, this is not good enough: fixing the other hyperparameters while obtaining the MLE of one results in approximation errors. Fix m = 7, and take a further look at the likelihood with the unknown hyperparameters c and σ².
[Figure: the log-likelihood surface of σ² and c, plotted against log c and log σ².]

Optimizing this using the maximum likelihood estimation method results in ĉ = 115.14 and σ̂² = 0.87, and hence a more accurate posterior:
[Figure: f₀ and f̂ with the jointly optimized hyperparameters.]
Remarkably, the confidence band is narrower and more concentrated around f̂. Using another optimization method, namely L-BFGS-B, yields ĉ = 125.40 and σ̂² = 0.98.
[Figure: f₀ and f̂ using the L-BFGS-B estimates.]


3.4. The L-BFGS-B method

In the previous section another optimization method was used. The BFGS method, named after its discoverers Broyden, Fletcher, Goldfarb and Shanno, is the most popular quasi-Newton algorithm; its limited-memory variant uses only a limited amount of computer memory.

[Figure 3.9: From left to right: Broyden, Fletcher, Goldfarb and Shanno.]

Mathematics students learn Newton's iterative method, but they have probably not heard much about quasi-Newton methods. Meanwhile, this algorithm is a very popular algorithm for parameter estimation in machine learning. The BFGS method is explained in chapter 6 of [6]. The aim of the algorithm is to minimise f(x) over unconstrained values of the real vector x, where f ∈ C¹ is a scalar function. The L-BFGS-B method is an extension of the BFGS method that handles simple box constraints on the variables, i.e. constraints of the form l_i ≤ x_i ≤ u_i, where l_i and u_i are the lower and upper bound for variable x_i, respectively. The method works by identifying fixed and free variables at every step, then using the L-BFGS method on the free variables only to get higher accuracy, and then repeating the process. More precisely, let f : R^n → R be a nonlinear differentiable function, let g be the gradient of f and let B_k be the positive definite limited-memory Hessian approximation. As in [7], a quadratic model of f at x_k can be obtained:
$$m_k(x) = f(x_k) + g_k^\top(x - x_k) + \tfrac{1}{2}(x - x_k)^\top B_k(x - x_k). \tag{3.1}$$

The algorithm approximately minimizes m_k(x) subject to the bounds [l_i, u_i] for each x_i. It first uses the gradient projection method to find a set of active bounds, followed by a minimization of m_k treating those bounds as equality constraints. Consider first the linear path
$$x(t) = P(x_k - t g_k,\, l,\, u),$$
obtained by projecting the direction with the greatest negative slope onto the feasible region, where
$$P(x, l, u)_i = \begin{cases} l_i & \text{if } x_i < l_i,\\ x_i & \text{if } x_i \in [l_i, u_i],\\ u_i & \text{if } x_i > u_i.\end{cases} \tag{3.2}$$

Afterwards, compute the generalized Cauchy point x^c, which is defined as the first local minimizer of the univariate, piecewise quadratic
$$q_k(t) = m_k(x(t)).$$
The variables whose value at x^c is at the lower or upper bound, comprising the active set A(x^c), are kept fixed. Consider the following quadratic problem over the subspace of the free variables:
$$\min\;\{m_k(x) : x_i = x_i^c,\ \forall i \in \mathcal{A}(x^c)\} \tag{3.3}$$
$$\text{subject to}\quad l_i \le x_i \le u_i,\ \forall i \notin \mathcal{A}(x^c). \tag{3.4}$$

The first aim is to solve (3.3), where the bounds of the free variables are ignored, using direct or iterative methods on the subspace of the free variables. When an iterative method is used, x^c is the starting point of the iteration, and the path towards the solution is truncated so as to satisfy (3.4). Hence we obtain an approximate solution x̄_{k+1}. Compute x_{k+1} by a line search along d_k = x̄_{k+1} − x_k that satisfies the decrease condition
$$f(x_{k+1}) \le f(x_k) + \alpha\lambda_k\, g_k^\top d_k \tag{3.5}$$
and also enforces the curvature condition
$$\lvert g_{k+1}^\top d_k\rvert \le \beta\,\lvert g_k^\top d_k\rvert, \tag{3.6}$$
with λ_k the step length and α, β parameters. The line search method, which ensures that the iterates remain in the feasible region, will not be treated here. Finally, evaluate g(x_{k+1}), compute a new limited-memory Hessian approximation B_{k+1} and start a new iteration.
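In R, the L-BFGS-B method is available through the built-in optim function, which is also how the hyperparameters are optimized in Appendix B. The following minimal sketch applies it to a simple illustrative objective (not the thesis' likelihood) with box constraints l_i ≤ x_i ≤ u_i:

# L-BFGS-B via optim, with box constraints on both variables.
objective <- function(x) (x[1] - 3)^2 + (x[2] + 1)^2 + x[1] * x[2]
gradient  <- function(x) c(2 * (x[1] - 3) + x[2], 2 * (x[2] + 1) + x[1])

fit <- optim(par = c(0, 0), fn = objective, gr = gradient,
             method = "L-BFGS-B",
             lower = c(-2, -2), upper = c(2, 2))   # l_i <= x_i <= u_i
fit$par          # constrained minimizer (the unconstrained one lies outside the box)
fit$convergence  # 0 indicates successful convergence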


4. Conclusion and discussion

For statistical inference, the first question was whether the empirical Bayes method has advantages over other methods when the considered model is tangled in a hierarchy. From chapter 2, one can deduce that in the fully Bayesian treatment of a hierarchical model the posterior density is hard to compute analytically; hence there is good reason to analyze the empirical Bayes method. The empirical Bayes method found good approximations of the hyperparameters, which control the uncertainty at the top level of the hierarchy. From applying empirical Bayes to regression, it turns out that the kernel function plays an important role in finding the posterior. The kernel as given in chapter 3 contains the hyperparameters m and c. The conclusion is that the length m of the feature space is less important than the choice of the covariance matrix on the weights, which is the variance of the prior distribution. A greater length of the feature space results in a more accurate posterior, but adding too many features can also be disadvantageous. The importance of c can be explained by the great effect that the choice of c has on the posterior. If a flat prior is chosen (c → ∞), all values in the range are equally likely; such a prior carries little information and may be sensitive to how the model is parameterized. A prior with large c thus expresses that a large range of values is plausible, rather than concentrating the probability mass around a specific range. Hence, finding the optimal value for c has a higher priority than optimizing the other hyperparameters.

Empirical Bayes is a fast and good approximation method for hierarchical Bayesian models. However, the conjugate priors that were considered are not always useful. Remember that conjugate priors have the handy feature that, when multiplied by the appropriate likelihood model, they produce a closed-form expression for the posterior. Conditional conjugacy is a useful idea because it is preserved when a model is expanded hierarchically, while the usual concept of conjugacy is not. For example, in the basic hierarchical normal model, the normal prior distributions on the single parameters are conditionally conjugate but not conjugate: the single parameters have normal posterior distributions conditional on all other parameters in the model, but their marginal posterior distributions are not normal. So for further research on the empirical Bayes method, another choice of the prior could be analyzed. The subject could also be extended by considering classification problems. Classification can also be viewed as a function approximation problem, but the aim there is to assign an input pattern x to one of C classes, C₁, . . . , C_C, as described in [3]. The solution of classification problems using Gaussian processes is more difficult than for the regression problems considered here, because the targets of classification models are discrete class labels, so a Gaussian likelihood is inappropriate.


5. Popular summary

When we want to predict how many final-year pupils will pass their exams this year, the data from previous years will be used to do so. Suppose a secondary school offers the levels vmbo-t, havo, vwo and gymnasium. Each of these levels has its own pass rate, so each of them will, independently of the others, influence the school's pass rate in the new year. It is also important to take into account how many pupils there are each year at, for example, havo level. We expect this to matter because there may be a relation between the number of pupils per year per programme and the number of pupils who pass: busier classes may prevent pupils from concentrating well, which can have a negative effect on the pass rate. To investigate all of this, we therefore need a few ingredients. For now, I restrict myself to the data on the pass rates of previous years per programme and on the number of pupils. Let θ be the parameter that will describe the data of the new year. We want to know whether we can find a model for the pass rate θ of the secondary school, given the pass rates θ_i, and let λ_i be the variable describing the number of pupils per programme, with i ∈ {vmbo-t, havo, vwo, gymnasium}. This is written as

p(θ | θ_i, λ_i, i ∈ {vmbo-t, havo, vwo, gymnasium}).

In wiskunde A (mathematics A) the normal distribution is introduced. Data X = (X_1, . . . , X_n) is normally distributed with parameters µ and σ², which represent the mean and the variance respectively. We denote this as

X ∼ N(µ, σ²).

This thesis is about estimating such parameters from the data we have. For this we use a method called empirical Bayes. This method uses a technique covered in wiskunde B (mathematics B), namely optimization. We want to optimize the distribution of the data on the number of passing pupils given the number of pupils per year. So we look for the optimal number of pupils per year per level, so that we can maximize the number of passing pupils. With this, we can make a prediction about the number of pupils who will pass.


Bibliography

[1] G Alastair Young and Richard L Smith. Essentials of statistical inference, volume 16. Cambridge University Press, 2005.

[2] Hans Rudolf Künsch. Bayesian statistics, 2014.

[3] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[4] Anthony Christopher Davison. Statistical models, volume 11. Cambridge University Press, 2003.

[5] Olav Kallenberg. Foundations of modern probability. Springer Science & Business Media, 2006.

[6] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[7] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory al-gorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.


A. Mathematical background

A.1. Basic concepts of Bayesian statistics

This appendix introduces the basic concepts of Bayesian statistics: Bayes' theorem and its uses, and how to construct the prior and posterior densities.

Let n random variables y_1, . . . , y_n have a joint probability density p(y). Partitioning those variables into two groups y_A and y_B, where the union of the two disjoint sets A and B is equal to {1, . . . , n}, gives p(y) = p(y_A, y_B). The conditional probability density p(y_A|y_B) is defined as
$$p(y_A\mid y_B) = \frac{p(y_A, y_B)}{p(y_B)},$$
where p(y_B) > 0. This is also
$$p(y_A, y_B) = p(y_A\mid y_B)\,p(y_B).$$
The left-hand side is symmetric in y_A and y_B while the right-hand side is not, so we get
$$p(y_A\mid y_B)\,p(y_B) = p(y_B\mid y_A)\,p(y_A) \;\Rightarrow\; p(y_A\mid y_B) = \frac{p(y_A)\,p(y_B\mid y_A)}{p(y_B)},$$
which is the well-known form of Bayes' theorem formulated in terms of densities. For continuous random variables X and Y we get
$$p_{Y\mid X}(y\mid x) = \frac{p_{X\mid Y}(x\mid y)\,p_Y(y)}{\int p_{X\mid Y}(x\mid y)\,p_Y(y)\,\mathrm{d}y},$$
provided the marginal density p(x) > 0 [4].

Prior and Posterior

The aim is to use Bayes' theorem for inference. Suppose that there is a probability model p(x|θ) for some data x: a density function of the data x given the value of θ. Let the belief about θ be summarized in a prior density p(θ), which is constructed before seeing the data x. Thus θ is the outcome of a random variable whose density is p(θ). After the data have been observed, the conditional density of θ given x is obtained from Bayes' formula:
$$p(\theta\mid x) = \frac{p(\theta)\,p(x\mid\theta)}{\int p(\theta)\,p(x\mid\theta)\,\mathrm{d}\theta}.$$

A.2. Point estimation

There are several statistics that can be used to estimate θ by a point, such as the posterior mean, median and mode. The posterior mean can be found by
$$\hat\theta = E(\theta\mid x) = \frac{\int_{-\infty}^{\infty}\theta\,p(\theta)\,p(x\mid\theta)\,\mathrm{d}\theta}{\int_{-\infty}^{\infty}p(\theta)\,p(x\mid\theta)\,\mathrm{d}\theta},$$
where θ ∈ R. The posterior median θ̂ satisfies
$$\int_{-\infty}^{\hat\theta}p(\theta)\,p(x\mid\theta)\,\mathrm{d}\theta = \frac{1}{2}\int_{-\infty}^{\infty}p(\theta)\,p(x\mid\theta)\,\mathrm{d}\theta,$$
and the posterior mode is given by
$$\hat\theta = \arg\max_{\theta}\,p(\theta\mid x) = \arg\max_{\theta}\,\bigl(\log p(\theta) + \log p(x\mid\theta)\bigr).$$
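When the posterior is not available in closed form, these point estimates can be approximated on a grid. The following minimal R sketch (an illustrative Poisson likelihood with a Gamma(2, 1) prior, not a model from the thesis) computes the posterior mean, median and mode numerically:

# Grid approximations of the posterior mean, median and mode.
x_obs  <- 4
theta  <- seq(0.001, 30, length.out = 10000)                        # parameter grid
unnorm <- dgamma(theta, shape = 2, rate = 1) * dpois(x_obs, theta)  # p(theta) p(x | theta)
post   <- unnorm / sum(unnorm)                                      # normalized weights

post_mean   <- sum(theta * post)
post_median <- theta[which(cumsum(post) >= 0.5)[1]]
post_mode   <- theta[which.max(post)]
c(mean = post_mean, median = post_median, mode = post_mode)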

A.3. Conjugate priors

A mathematically convenient choice is a conjugate prior: the posterior distribution then belongs to the same parametric family as the prior distribution, with different parameters. In [8] some examples of conjugate priors are given. But first, the formal definition of a conjugate prior.

Definition A.1. A parametric family P_F = {p_f(θ) : f ∈ F}, F ⊂ R^q, is called conjugate for the model {p(x|θ) : θ ∈ Θ} if for any p ∈ P_F and any x, p(θ|x) is again in P_F.

Many distributions have come across our path; their definitions are given here:

• N(µ, σ), where µ ∈ R and σ > 0, with x ∈ R and density function
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}\Bigl(\frac{x-\mu}{\sigma}\Bigr)^2\Bigr)$$

• Multivariate Gaussian(µ, Σ), where µ ∈ R^m and Σ ∈ R^{m×m} is the positive definite covariance matrix, with x ∈ µ + span(Σ) ⊂ R^m and density function
$$p(x) = \frac{1}{(2\pi)^{m/2}\lvert\Sigma\rvert^{1/2}}\exp\Bigl(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\Bigr)$$

• Γ(γ, λ), where γ > 0 and λ > 0, with x ∈ (0, ∞) and density function
$$p(x) = \frac{\lambda^\gamma}{\Gamma(\gamma)}\,x^{\gamma-1}\exp(-\lambda x)$$

• Γ⁻¹(γ, λ), where γ > 0 and λ > 0, with x ∈ (0, ∞) and density function
$$p(x) = \frac{\lambda^\gamma}{\Gamma(\gamma)}\,x^{-\gamma-1}\exp(-\lambda/x)$$


• Beta(α, β), where α > 0 and β > 0, with x ∈ (0, 1) and density function
$$p(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$$

• Poisson(λ), where λ > 0, with x ∈ N≥0 and probability function
$$p(x) = \exp(-\lambda)\,\frac{\lambda^x}{x!}$$

• Bin(n, x), where n ∈ N and x ∈ [0, 1], with k ∈ {0, . . . , n} and probability function
$$p(k) = \binom{n}{k}\,x^k(1-x)^{n-k}$$

Some examples of the most used conjugate priors:

Likelihood p(x|θ)    Prior p(θ)      Posterior p(θ|x)
N(θ, σ²)             N(µ, τ²)        N((σ²µ + τ²x)/(σ² + τ²), σ²τ²/(σ² + τ²))
Poisson(θ)           Γ(α, β)         Γ(α + x, β + 1)
Γ(γ, θ)              Γ(α, β)         Γ(α + γ, β + x)
Bin(n, θ)            Beta(α, β)      Beta(α + x, β + n − x)
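As a quick numerical check of one row of this table, the following minimal R sketch (with illustrative numbers) compares the grid-normalized product of a Beta(a, b) prior and a binomial likelihood with the conjugate Beta(a + x, b + n − x) posterior:

# Check of the Bin(n, theta) / Beta(a, b) row of the table.
a <- 2; b <- 3                 # Beta(a, b) prior on theta
n <- 10; x_obs <- 7            # x_obs successes out of n trials

theta  <- seq(0.001, 0.999, length.out = 999)
unnorm <- dbeta(theta, a, b) * dbinom(x_obs, size = n, prob = theta)
post   <- unnorm / (sum(unnorm) * diff(theta)[1])   # normalize to a density on the grid

exact <- dbeta(theta, a + x_obs, b + n - x_obs)     # conjugate posterior from the table
max(abs(post - exact))                              # small, up to grid error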


B. R code

# Empirical Bayes
# Hasine Efetürk, 10173536
#
# This code is computing the optimal value
# for the hyperparameters for a regression
# model.

# Necessary packages
install.packages("rgl")
install.packages("mvtnorm")
install.packages("fda")
install.packages("polynom")
install.packages("scatterplot3d")

# Load libraries
library("rgl", lib.loc="~/R/win-library/3.1")
library("mvtnorm", lib.loc="~/R/win-library/3.1")
library("fda", lib.loc="~/R/win-library/3.1")
library("polynom", lib.loc="~/R/win-library/3.1")
library("scatterplot3d", lib.loc="~/R/win-library/3.1")

# Controls to obtain same plots
set.seed(2017)

# Fixed variables
n <- 50
m <- 5
x <- seq(0, 1, length.out=n)
sigma <- sqrt(0.1)
e <- rnorm(n, 0, sigma)
I <- diag(m)

# Used test functions
# f0 <- sqrt(2*pi*x) + sin(4*pi*x)
# f0 <- sin(2*pi*x) + cos(5*x^4) + sin(4*pi*x)
# f0 <- sqrt(x*(1-x))*sin((2.1*pi)/(x+0.05))
f0 <- 3*sin(10*x)

# Observed data: test function plus noise
y <- f0 + e

# Choice of the feature space

# phi_x <- t(cbind(1, sin(2*pi*x), sin(4*pi*x),

# sin(6*pi*x), sin(8*pi*x)))

phi_x <- t(cbind(rep(1/sqrt(n), n), poly(x, degree = m-1)))
# phi_x <- t(fourier(x, nbasis = m, period = 2, nderiv = 0))

#
# First, visualize the optimum of the hyperparameters.
#

# Distribution of y given c and sigma^2
# Maximizing this returns the optimal value for c / sigma^2
p_c_sigma2 <- function(c, sigma2){
  -m/2 * log(c) - n/2 * log(sigma2) -
    (1/2) * (t(y) %*% y / sigma2) +
    (1/2) * (1/sigma2^2) * t(y) %*% t(phi_x) %*%
      solve(((1/c) * I + 1/sigma2 * phi_x %*% t(phi_x)), phi_x %*% y) -
    (1/2) * log(det((1/c) * I + 1/sigma2 * phi_x %*% t(phi_x)))
}

# Computing the log likelihood of c and sigma^2
range_c <- 10 ^ seq(-1, 4, length.out = 50)
range_sigma2 <- 10 ^ seq(-2, 3, length.out = 50)
loglik <- matrix(NA, length(range_c), length(range_sigma2))
for (i in 1:length(range_c)){
  for (j in 1:length(range_sigma2)){
    loglik[i, j] <- p_c_sigma2(range_c[i], range_sigma2[j])
  }
}

# 3D plot of the log likelihood
x1 <- log(range_c)
x2 <- log(range_sigma2)
dens <- loglik
s3d <- scatterplot3d(x1, x2, seq(min(dens), max(dens), length = length(x1)),
                     type = "n", col.grid = "lightblue", col.axis = "blue",
                     grid = TRUE, angle = 140,
                     zlab = substitute(paste(log, " likelihood")),
                     xlab = substitute(paste(log, " c")),
                     ylab = substitute(paste(log, " ", sigma^2)))
for(i in 1:length(x1))
  s3d$points3d(rep(x1[i], length(x2)), x2, dens[i,], type = "l", col = "blue")
for(i in length(x2):1)
  s3d$points3d(x1, rep(x2[i], length(x1)), dens[,i], type = "l", col = "green")

# Distribution of y given c and m to find the
# optimal length for the feature space
p_m <- function(m, c){
  phi_x <- t(fourier(x, nbasis = m, period = 2, nderiv = 0))
  I <- diag(m)
  return(-m/2 * log(c) +
           (1/2) * (1/sigma^4) * t(y) %*% t(phi_x) %*%
             solve(((1/c) * I + 1/sigma^2 * phi_x %*% t(phi_x)), phi_x %*% y) -
           (1/2) * log(det((1/c) * I + 1/sigma^2 * phi_x %*% t(phi_x))))
}

# Computing the log likelihood of m
range_m <- seq(3, 50, 2)
loglik <- rep(NA, length(range_m))
for (i in 1:length(range_m)){
  loglik[i] <- p_m(range_m[i], 2)
}
plot(range_m, loglik)

#
# Second, compute the optimum value.
#

# Easy approach, by determining the MLE
#index <- which(loglik == max(loglik), arr.ind=TRUE)
#chat <- c[index[1]]
#sigma <- sqrt(sigma2[index[2]])

# Numerical optimization method
opt <- function(x) {
  -p_c_sigma2(x[1], x[2])
}
opt_c_sigma2 <- optim(c(1, 1), opt, method = "L-BFGS-B",
                      lower = c(0.000001, 0.0000001))$par


# Inserting the optimal values in the variables
chat <- opt_c_sigma2[1]
sigma <- sqrt(opt_c_sigma2[2])
Sigma <- diag(m) * chat

# The estimates of w and f
what <- solve((phi_x %*% t(phi_x)) / sigma^2 + solve(Sigma),
              (phi_x %*% y) / sigma^2)
fhat <- t(phi_x) %*% what

# The posterior distributions of w and f
posterior_w <- rmvnorm(20, mean = what,
                       sigma = solve((phi_x %*% t(phi_x)) / sigma^2 + solve(Sigma)),
                       method = "chol")
posterior_f <- t(phi_x) %*% t(posterior_w)

# Determine the quantiles to obtain the confidence
# region for each point
outline <- apply(posterior_f, 1, quantile, prob = c(0.025, 0.975))

# Plotting the functions
plot(x, y, xlab = "", ylab = "")
legend("topright", expression(f[0], hat(f)), pch = 15, col = c(3, 2))
polygon(c(x, rev(x)), c(outline[1, ], rev(outline[2, ])),
        col = rgb(0.8, 0.8, 0.8), border = NA)
points(x, y)
# matlines(x, posterior_f, col="gray")
lines(x, f0, col = 3, lwd = 2)
lines(x, fhat, col = 2, lwd = 2)
