
Thijs van Ommen

Combining predictions from linear models using the switch distribution

Master thesis, defended on March 10, 2011
Thesis advisor: Peter D. Grünwald

Specialisation: Applied Mathematics

Mathematisch Instituut, Universiteit Leiden


Contents

1 Introduction
2 Linear regression
   2.1 Definitions and notation
   2.2 Basis functions
   2.3 Maximum likelihood estimators and Bayesian predictive distributions
       2.3.1 The Bayesian framework
       2.3.2 The Bayesian predictive distribution for M(k)
       2.3.3 The Bayesian predictive distribution for S(k)
   2.4 Priors
       2.4.1 Subjective priors: ridge regression and the g-prior
       2.4.2 Objective priors: Jeffreys' prior
       2.4.3 Priors for S(k)
       2.4.4 The effective number of parameters
   2.5 Behaviour of predictive variance
3 Methods for model selection and combination
   3.1 Model selection and prediction
   3.2 The AIC/BIC dilemma
   3.3 Other methods for dealing with multiple models
   3.4 The switch distribution
4 Risk estimates for M(k)
   4.1 Decomposition of prediction difference
   4.2 Oracle aware of all u
   4.3 Oracle aware of past u
   4.4 Oracle unaware of u
   4.5 Comparing two experts
5 Risk estimate for S(k)
   5.1 Oracle unaware of u
   5.2 Comparing two experts
   5.3 Nonincreasing risk
6 Implementation
   6.1 Computations for regression
       6.1.1 Incremental computation for predictive distributions
       6.1.2 Computational considerations regarding basis functions
   6.2 Switching algorithm
       6.2.1 Dealing with missing predictions
       6.2.2 Modifying the switch distribution to only go up
7 Discussion of experimental results
   7.1 AIC and BIC
       7.1.1 AIC
       7.1.2 BIC
   7.2 Bayesian methods
       7.2.1 Bayesian model averaging
       7.2.2 The unmodified switch distribution
       7.2.3 The modified switch distribution
       7.2.4 Further comparisons
       7.2.5 Tuning the switch distributions
   7.3 Conclusion
A Experimental results
   A.1 Results for M(k), g-prior
   A.2 Results for M(k), Jeffreys' prior
   A.3 Results for S(k), Jeffreys' prior
References


1 Introduction

In this thesis we investigate the switch distribution [9] as a method for model selection and prediction in linear regression. We compare the performance of this new method to that of other methods, both theoretically and experimentally.

Many statistical problems involve finding a relation between two variables. Given several pairs of measurements for these two variables, we wish to find a function that estimates the second variable given the first.¹ This type of problem is called a regression problem. The function we find could help us gain insight into the underlying relation between the variables (for example, it might tell us something about the laws of nature determining the relation between two physical quantities), or it could be used to predict the value of the second variable for a given value of the first we have not observed. To make this problem well-defined, the function is chosen from some set of functions. Each function corresponds to a hypothesis, and the set of functions corresponds to a statistical model.

As an example, Figure 1(a) shows ten measurements relating a stream's depth to its rate of flow [22]. (The units of these measurements are not given.)

[Figure 1: Predicting flow rate. (a) Measured flow rate; (b) flow rate fit by a straight line. Both panels plot flow rate (−2 to 8) against depth (0.3 to 0.8).]

Suppose we want to use these observations to predict the flow rate of a stream with depth 0.6. A simple way to answer this question is by finding a straight line that passes as closely as possible to all observed data points. In this case our model contains all straight lines. The chosen line is shown in Figure 1(b); it suggests that the flow rate for a depth of 0.6 should be about 4.3. We choose which line "passes most closely to the data points" using the standard method of least squares: for a given line, we measure the vertical distance to each data point, square these distances, and select the line from our model for which the sum of these squares is smallest.

¹ Technically, we assume that the second variable contains a random component, and we want to estimate the conditional expectation of the second variable given the first.

[Figure 2: Predicting flow rate, continued. (a) Flow rate fit by a quadratic polynomial; (b) flow rate fit by a high-degree polynomial. Both panels plot flow rate (−2 to 8) against depth (0.3 to 0.8).]


However, the choice to use a straight line is rather arbitrary, and we may wonder whether the resulting prediction is as accurate as can be achieved. The line from Figure 1(b) passes above the data points in the centre but below those near the edges of the graph, suggesting that a curved line will be able to provide a better fit to the data. This is shown in Figure 2(a), using a quadratic polynomial as the curved line. This line does indeed seem to follow the observations better than the straight line. It predicts a flow rate of 3.6 at depth 0.6.

But if a quadratic polynomial provides better predictions than a straight line, maybe a polynomial of yet higher degree will do even better. If we increase the degree of the polynomial further, we will be able to pass by each point more closely. In fact, for any n points in different horizontal positions, there is a unique polynomial of degree n − 1 that passes through all those points exactly. In our example data, there are two points with depth 0.29 but different flow rates, so the best we can do is a polynomial of degree 8 that passes through the middle of these two points, and coincides with each other point. This polynomial is shown in Figure 2(b). Obviously, this polynomial is worthless when it comes to making predictions about different depths. In particular, the predicted flow rate for a depth of 0.6 is −1228, which suggests a stream running uphill at alarming speed.
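To make the fitting procedure concrete, here is a minimal sketch in Python. The measurements are made up for illustration (the thesis does not reproduce the exact data from [22]); only the duplicate depth 0.29 is taken from the text.

```python
import numpy as np

# Hypothetical stand-in for the ten stream measurements.
rng = np.random.default_rng(0)
depth = np.array([0.29, 0.29, 0.34, 0.42, 0.48, 0.55, 0.61, 0.66, 0.73, 0.80])
flow = 10 * depth**2 + rng.normal(scale=0.3, size=depth.size)

for degree in (1, 2, 8):
    coeffs = np.polyfit(depth, flow, degree)   # least-squares polynomial fit
    print(degree, np.polyval(coeffs, 0.6))     # prediction at depth 0.6
# The degree-8 fit follows the noise in the data (numpy may warn about
# conditioning); its predictions between observations are erratic, in the
# spirit of Figure 2(b).
```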

In this last example, we selected the best fitting polynomial from among all polynomials with degree up to 8. This means that our statistical model contained a hypothesis for each such polynomial. While it is mathematically easy to select the polynomial that best fits the data even from such a large set, it is very difficult to find a polynomial that will predict the data well given so many choices: a large amount of data would be required to find appropriate values for all the polynomial's coefficients. Large models are therefore called complex.

The importance of choosing a suitable degree for the fitted polynomial is clear. Choosing it too low will lead to a line that fails to mimic the true relation between the variables and therefore does not predict as well as a higher-degree polynomial could. On the other hand, if we set the degree too high, we will obtain a curve that goes astray in many places where we do not have observations, and thus has no predictive value. The degree we are looking for must be a trade-off between these two concerns.

In the above example, we motivated the choice of a quadratic polynomial based on inspection of the data. Such an approach will not always be practical, for example when larger quantities of data are involved. Many methods based on mathematical theory have been developed to make this trade-off in an automatable way, without requiring human judgement calls. Unfortunately, these methods sometimes disagree.

Two such methods are the well-known AIC (Akaike's information criterion) [2] and BIC (Bayesian information criterion) [24]. Both take the form of simple formulae that assign a score to each model, allowing the best model to be chosen. However, they differ in their notion of "best". Informally, AIC is quick to realize when a more complex model is necessary, requiring fewer observations for this than other methods. But if the correct model has already been found, AIC will continue to pick more complex models. BIC, on the other hand, will eventually find the correct model and stay with it, but will stay with too simple models longer than AIC does. In many situations, these competing properties make it unclear which method is to be preferred. This is known as the AIC/BIC dilemma.

To simplify the problem of choosing an appropriate model, we may take a step back and, in place of models, consider experts that look at the data and give us counsel based on it. Each expert uses a model together with a well-defined method to produce a prediction based on the observed data and the expert’s model.

Consider the situation where we are not directly interested in the models the experts use or in the exact mechanism by which the different experts come to their advice, but we wish to choose which expert to trust based on their past performance. For example, to predict the weather, we may consult several experts, using different, possibly very complex, models. Without understanding these models, we may still evaluate their performance by comparing their predictions to the actual weather. This way, we may find that one expert always gives the most reliable predictions. But it is also possible that which expert is best depends on the season or on other factors.

The insight that which expert is best may vary with time can be seen as underlying the switch distribution, a new method for model selection described in [9] and formally defined in Section 3.4. This method is shown to share the favourable aspects of both AIC and BIC: it can quickly go to complex models when needed, like AIC, and will eventually find the correct model (if there is one), like BIC. Without going into the mathematical details of any of these methods, the switch distribution's motivation may be described as follows: the switch distribution is based on the same principles as BIC, but takes into account the possibility that as more data points come in, different experts may become more suitable and we may want to switch between them occasionally.

It should be clear that any prediction about the future involves some degree of uncertainty. We want to allow our experts to express their uncertainty, so that the experts' own assessments of the quality of their predictions can play a role in our choice of which expert is best. This is accomplished by asking the experts for predictions not in the form of a single value, but in the form of a probability density on possible values. When we know the actual outcome, we evaluate the quality of such a prediction with a loss equal to the negative logarithm of the density assigned to the observed outcome. Thus, if this outcome was thought likely by the expert and given a high density, the loss is small, but if an outcome occurred that the expert did not find likely, the loss is large. If an expert is strongly convinced that the outcome will lie close to some value, a predictive density with a sharp peak around that value is appropriate, as according to the expert's beliefs, this will lead to a very small loss. On the other hand, if an expert is not very sure what will happen, a flatter distribution is a way for the expert to hedge their bets: no matter what happens, the loss will not be enormous (but it will be larger than that of the expert in the previous scenario).
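As a concrete illustration (ours, not the thesis's), the logarithmic loss of a Gaussian predictive density shows this trade-off between sharp and hedged predictions; the numbers below are made up.

```python
import math

# Logarithmic loss of a Gaussian predictive density N(mean, var):
# -log density at the observed outcome.
def log_loss(outcome, mean, var):
    return 0.5 * math.log(2 * math.pi * var) + (outcome - mean) ** 2 / (2 * var)

print(log_loss(4.3, mean=4.0, var=0.1))   # confident and nearly right: small loss
print(log_loss(6.0, mean=4.0, var=0.1))   # confident and wrong: large loss
print(log_loss(6.0, mean=4.0, var=2.0))   # hedged: moderate loss either way
```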

The most important theoretical result in this thesis is our finding that once an expert using a complex model has started to outperform experts using simpler models, we may expect it to continue outperforming them as more data are observed. To take advantage of this, we define a modified switch distribution that does not consider switching from a complex to a simple model.

Further, we conduct simulation experiments where we compare several methods for choosing between, as well as combining, the predictions of several models, including the aforementioned AIC, BIC and switch distribution. In these experiments, we find that the switch distribution performs best in general. For most true functions, AIC will fail to settle on the model containing that function as it continues to overfit, resulting in inferior predictions. Similarly, BIC's theoretical slowness (in the sense of requiring many observations before selecting a sufficiently complex model) also translates to practice for most functions, again resulting in predictions that are outperformed by the switch distribution. Finally, the restricted version of the switch distribution that only switches "up" is found to improve the performance of the switch distribution still further.

This thesis is organized as follows: Section 2 covers linear regression models and their frequentist and Bayesian solutions. In Section 3, we look at several methods to solve the problems of model selection and of prediction using several models, including the switch distribution. In Sections 4 and 5, we combine the theory from the earlier sections to analyse how the quality of the different models' predictions changes as more data becomes available. Finally, Section 6 covers some issues that occur when implementing these models and the methods for combining them on a computer, and in Section 7 we evaluate the performance of the methods based on simulation experiments run on such an implementation.

Acknowledgements First of all, I would like to thank Peter Grünwald for his guidance that led me to interesting insights regarding this even more interesting topic. I am grateful to my colleagues at CWI, especially Steven de Rooij, Tim van Erven, and Wouter Koolen, for the inspiring atmosphere they provided, which greatly encouraged me during this work. Finally, I owe thanks to my family and friends for their support and patience while this thesis continued to be "almost done".

2 Linear regression

2.1 Definitions and notation

Regression problems are concerned with finding a relationship between two variables: the regressor variable $u^n = (u_1, \dots, u_n) \in \mathcal{U}^n$ and the regression variable $y^n = (y_1, \dots, y_n) \in \mathbb{R}^n$. (We largely follow the notation of chapter 12 of [14].) Given values of these variables, we wish to find a function $f : \mathcal{U} \to \mathbb{R}$, either to gain insight into the true relation underlying those values, or to use this function for the prediction of further $y_{n+1}, \dots, y_{n+m}$ given $u_{n+1}, \dots, u_{n+m}$. In practice, $u^n$ is often (but not necessarily) under an experimenter's control (so from the experimenter's perspective, $u^n$ can be either chosen values or values sampled from a random variable $U^n$), while $y^n$ represents the measurements resulting from the experiment. These measurements are assumed to be subject to random noise. That is, we assume that

$$Y_i = f(u_i) + Z_i, \tag{2.1}$$

where $f$ is the true but unknown function we wish to learn, and the $Z_i$ are distributed independently of each other and of $U^n$, with $\mathrm{E}\, Z_i = 0$ and $\mathrm{Var}\, Z_i = \sigma^2$ for all $i$.

In the particular case of linear regression, we choose an order $k$, fix that many basis functions $f_1, \dots, f_k$, and confine our search to functions that are linear combinations of these basis functions. These are then the functions of the form $f_\mu(u) = \sum_{i=1}^k \mu_i f_i(u)$ for $\mu \in \mathbb{R}^k$. By letting

$$\mu := \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix}, \qquad x_i := \begin{pmatrix} f_1(u_i) \\ \vdots \\ f_k(u_i) \end{pmatrix}, \qquad\text{and}\qquad X_n := \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \tag{2.2}$$

we can write $f_\mu(u_i) = x_i^T \mu$ and $(f_\mu(u_1), \dots, f_\mu(u_n))^T = X_n \mu$. We call the $x_i$ design vectors and $X_n$ the design matrix.
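As a sketch of (2.2), a design matrix for the monomial basis $f_i(u) = u^{i-1}$ (our choice, purely for illustration) can be built as follows:

```python
import numpy as np

# Build the design matrix X_n whose rows are the design vectors
# x_i^T = (f_1(u_i), ..., f_k(u_i)), here with f_i(u) = u^(i-1).
def design_matrix(u, k):
    u = np.asarray(u, dtype=float)
    return np.column_stack([u ** i for i in range(k)])

X3 = design_matrix([0.3, 0.5, 0.7], k=3)   # rows (1, u_i, u_i^2)
```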

The hypotheses in our models will be densities for $y^n$ of the form

$$p_{\mu,\sigma^2}(y^n \mid u^n) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!n} \exp\left(-\frac{\sum_{i=1}^n (y_i - f_\mu(u_i))^2}{2\sigma^2}\right), \tag{2.3}$$

expressing the further assumption that the noise terms are independent identically distributed Gaussians with mean 0 and variance $\sigma^2$. We will consider two types of linear model, which differ in their treatment of $\sigma^2$. If the true variance is known to be $\sigma^{*2}$, the family of models

$$\mathcal{M}^{(k)} = \{\, p_{\mu,\sigma^{*2}} \mid \mu \in \mathbb{R}^k \,\} \tag{2.4}$$

is appropriate; otherwise, we must estimate the variance along with the function $f_\mu$ and use the family

$$\mathcal{S}^{(k)} = \{\, p_{\mu,\sigma^2} \mid \mu \in \mathbb{R}^k,\ \sigma^2 > 0 \,\}. \tag{2.5}$$

A common way to measure how well $f_\mu$ fits the data is by minimizing the Euclidean length of the vector of errors, which is equivalent to minimizing the sum of squared errors

$$\mathrm{SSE}(\mu, y^n) := \sum_{i=1}^n (y_i - f_\mu(u_i))^2 = \sum_{i=1}^n (y_i - x_i^T\mu)^2 = (y^n - X_n\mu)^T(y^n - X_n\mu).$$

This allows us to rewrite (2.3) to

$$p_{\mu,\sigma^2}(y^n \mid u^n) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!n} \exp\left(-\frac{\mathrm{SSE}(\mu, y^n)}{2\sigma^2}\right). \tag{2.6}$$

Because this is decreasing as a function of $\mathrm{SSE}(\mu, y^n)$, we see that the maximum likelihood estimator $\hat\mu_n$ also minimizes $\mathrm{SSE}(\mu, y^n)$. Another useful property is that whenever $X_n$ has full column rank, $\hat\mu_n$ is uniquely defined and easily computable [13, p. 373]:

$$\hat\mu_n = (X_n^T X_n)^{-1} X_n^T y^n. \tag{2.7}$$

It will be useful to define the residual sum of squares as the sum of squared errors obtained by this minimizing $\hat\mu_n$:

$$\mathrm{RSS}(y^n) := \mathrm{SSE}(\hat\mu_n, y^n) = (y^n - X_n\hat\mu_n)^T(y^n - X_n\hat\mu_n).$$
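A minimal sketch of (2.7) and the residual sum of squares; using np.linalg.lstsq rather than explicitly forming $(X_n^T X_n)^{-1}$ is a standard numerical precaution, anticipating the conditioning issues discussed in Section 6. The helper name is ours.

```python
import numpy as np

# Least squares fit: mu_hat minimizes SSE(mu, y^n); RSS is the minimum.
def fit_least_squares(Xn, yn):
    mu_hat, _, _, _ = np.linalg.lstsq(Xn, yn, rcond=None)
    rss = float(np.sum((yn - Xn @ mu_hat) ** 2))
    return mu_hat, rss
```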

A prediction of $y_{n+1}$ can also be expressed as a probability distribution on values rather than as a single value. Here and below, with some abuse of terminology we use the term "probability distribution" to refer to both probability density and probability mass functions. We measure the quality


of such a prediction using logarithmic loss, which is the negative logarithm of the probability density or mass that such a prediction assigned to the actual outcome. We may predict several outcomes at once with a multivariate predictive distribution; if the outcomes are independent under the predictive distribution, then the loss obtained equals the sum of losses that would have been obtained by individually predicting each outcome with its marginal distribution.

If the predictive distribution was Gaussian, then for any fixed variance of that distribution, the logarithmic loss will be a linear function of the squared error, and for a multivariate Gaussian with fixed covariance matrix, a quadratic form of the squared errors. For example, if we predict $y^n$ with the density function $p_{\mu,\sigma^2}$, the logarithmic loss becomes

$$-\log p_{\mu,\sigma^2}(y^n \mid u^n) = \frac{n}{2}\log 2\pi\sigma^2 + \frac{1}{2\sigma^2}\,\mathrm{SSE}(\mu, y^n). \tag{2.8}$$

We will see that for several combinations of model and prediction strategy, the predictive distribution is indeed Gaussian, so the logarithmic loss for a single outcome is essentially equivalent to the more familiar squared error loss, in the sense that a prediction strategy is better on some data in terms of logarithmic loss if and only if it is better in terms of squared error loss.

The correspondence between logarithmic loss and squared errors, as well as the simple expression for the maximum likelihood estimator, are mathematically elegant, and this is one of the reasons for their widespread use [15, 21]. There is however a disadvantage that must be mentioned: this approach is very sensitive to outliers in the data, and therefore care must be taken when applying it in practice [16].

We use "log" to refer to the natural logarithm. The choice of logarithm base only changes logarithmic loss by a multiplicative constant, but elsewhere the natural logarithm will considerably simplify the algebra. Another popular choice for the logarithmic loss function is base 2, so that all losses are measured in bits [7, 2.1].

2.2 Basis functions

The linear model relies on a choice of basis functions. As is implied by the word “basis”, we want all functions in a certain, preferably fairly general class to have a unique representation as a linear combination of such basis functions. The condition of uniqueness is easy to meet: given a sequence of functions, we can create a new sequence which omits each function that is a linear combination of its predecessors. Before we can also talk about the existence condition in spaces of functions, we need to extend the familiar notion of a basis from linear algebra to deal with infinite numbers of basis elements. To deal with linear combinations of such sets, we first need to define some form of convergence.


Consider the probability triple $(\Omega, \mathcal{F}, P)$ [27]. Define the $L^2$-norm on measurable functions $f$ as $\|f\|_2 = \left(\int |f|^2\, dP\right)^{1/2}$. Let $L^2(\Omega, \mathcal{F}, P)$ be the space of equivalence classes of $\mathcal{F}$-measurable functions $f$ with $\|f\|_2 < \infty$, where functions $f$ and $g$ are in the same equivalence class if $\|f - g\|_2 = 0$. (If $\Omega$ is a compact interval on the real line, the set of continuous functions on $\Omega$ is dense in this space [23, Theorem 1.62].) We will continue to speak of the elements of $L^2(\Omega, \mathcal{F}, P)$ as functions, even though they are really equivalence classes. Equip this space with an inner product defined by $(f, g) = \int fg\, dP$, and note that our norm follows from this inner product. From linear functional analysis we know that for this particular choice of norm and inner product, the set $L^2(\Omega, \mathcal{F}, P)$ is a Hilbert space [23, Example 3.24]. In such a space, we can speak of convergence of an infinite series, as we require. Given a sequence of measurable functions $f_1, f_2, \dots$, we call it orthogonal if $(f_i, f_j) = 0$ for all $i \neq j$; if also $\|f_i\|_2 = 1$ for all $i$, we call it orthonormal. Now for $f$ that can be expressed as a linear combination of an orthonormal sequence $f_1, f_2, \dots$, this linear combination is given by

$$f = \sum_{i=1}^{\infty} (f, f_i)\, f_i. \tag{2.9}$$

This series converges for all $f \in L^2(\Omega, \mathcal{F}, P)$ [23, Corollary 3.44]. Note that, by the definition of our norm, there may be a $P$-null set of points in $\Omega$ which do not participate in the convergence.

Calling $f_1, f_2, \dots$ a basis is justified if (2.9) holds for all $f \in L^2(\Omega, \mathcal{F}, P)$. One example of an orthonormal basis is the Fourier basis

$$f_1 = \frac{1}{\sqrt{2\pi}}, \qquad f_{2j} = \frac{1}{\sqrt{\pi}}\sin ju \ \text{ for } j \in \mathbb{N}, \qquad f_{2j+1} = \frac{1}{\sqrt{\pi}}\cos ju \ \text{ for } j \in \mathbb{N}$$

on $L^2([-\pi,\pi], \mathcal{B}[-\pi,\pi], U[-\pi,\pi])$ (where $\mathcal{B}(S)$ denotes the Borel $\sigma$-algebra and $U(S)$ the uniform probability measure on $S$). Another example is the Legendre basis on $L^2([-1,1], \mathcal{B}[-1,1], U[-1,1])$, which is found by applying the Gram-Schmidt process [23, Remark 3.21] to the sequence of polynomials $(1, u, u^2, \dots)$. (Note that in both cases the functions can be transformed to cover any other bounded interval.)
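The following sketch (ours) constructs the first few orthonormal Legendre basis functions and verifies orthonormality numerically. The scaling $\sqrt{2j+1}\,P_j$ of the standard Legendre polynomial $P_j$ reflects that the measure here is the uniform probability measure on $[-1,1]$ rather than the Lebesgue measure.

```python
import numpy as np
from numpy.polynomial import legendre as L

# Orthonormal Legendre basis w.r.t. the uniform probability measure on
# [-1, 1] (dP = du/2): scale the standard P_j by sqrt(2j + 1).
def legendre_basis(k):
    funcs = []
    for j in range(k):
        coef = np.zeros(j + 1)
        coef[j] = np.sqrt(2 * j + 1)
        funcs.append(lambda u, c=coef: L.legval(u, c))
    return funcs

# Verify orthonormality with Gauss-Legendre quadrature (exact for
# polynomials of this degree).
nodes, weights = L.leggauss(50)        # integrates w.r.t. du over [-1, 1]
fs = legendre_basis(4)
gram = np.array([[np.sum(weights * f(nodes) * g(nodes)) / 2  # /2: uniform P
                  for g in fs] for f in fs])
print(np.round(gram, 10))              # approximately the identity matrix
```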

This theory guarantees us that it is possible for our linear model to approximate any function in $L^2(\Omega, \mathcal{F}, P)$ arbitrarily closely given enough basis functions. In practice, we will have to limit the number of basis functions that we use to a finite $k$, so that we only span a subspace of $L^2(\Omega, \mathcal{F}, P)$, which will impact how well certain functions can be approximated. In this case, we have that for any function $f \in L^2(\Omega, \mathcal{F}, P)$, there exists a unique function $g \in \mathrm{sp}(f_1, \dots, f_k)$ (where $\mathrm{sp}$ denotes the linear span) that minimizes $\|f - g\|_2$, and this same $g$ is also the unique solution to $(f - g, f_i) = 0$ for all $i \in \{1, \dots, k\}$ [23, Theorem 3.34 and its proof]. Geometrically, this means that in terms of squared errors, the best solution $g$ we can create using our basis functions is the orthogonal projection of $f$ onto this subspace.

Once we have an orthonormal basis $f_1, f_2, \dots$, we may of course equally well use any other sequence of functions $g_1, g_2, \dots$ (infinite or finite; not necessarily orthonormal or even orthogonal) for which each function $g_i$ can be expressed as a linear combination of $f_1, f_2, \dots, f_i$, and vice versa for each function $f_i$. Because such a transformation is invertible, this new sequence will be a basis of the same space as the original sequence, in the sense that any function in the space has a unique representation as a linear combination of basis functions. We will therefore call a (possibly infinite) sequence $g_1, g_2, \dots$ linearly independent if it can be transformed as above into some orthonormal sequence $f_1, f_2, \dots$. Note that this is precisely the transformation that is accomplished by the Gram-Schmidt process. A linear dependence relation $\|\sum_j \alpha_j g_j\|_2 = 0$ with the $\alpha_j$ not all 0 exists between $g_1, g_2, \dots, g_i$ if and only if the Gram-Schmidt process fails on $g_1, g_2, \dots, g_i$, justifying this terminology.

Because many properties we are interested in are determined only by what functions can be expressed with some basis, and not by what coefficients occur in the corresponding linear combinations, we can freely change bases when discussing such properties. One useful example is the basis that is orthonormal with respect to the empirical measure $P_n$, which is defined as follows: given $u^n$, for any measurable $A \subseteq \Omega$,

$$P_n(A) := \frac{1}{n} \sum_{i=1}^n I_A(u_i) \tag{2.10}$$

(where $I_A$ denotes the indicator function); that is, $P_n$ assigns a point mass of $\frac{1}{n}$ to each point in $u^n$. This measure leads to a different norm, and thus to a different partition of functions into equivalence classes. In fact, since only the function values on the points $u_i$, $i \in \{1, \dots, n\}$, matter, each basis function now corresponds to a point in $\mathbb{R}^n$, and so the space equipped with this norm will be of dimension at most $n$. If our basis functions are linearly independent in this space (for which a necessary condition is thus that $k$, the cardinality of the basis, is at most $n$), then the Gram-Schmidt process can be applied to find a new basis that is orthonormal with respect to the empirical measure $P_n$, but which spans the same space.
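Numerically, this Gram-Schmidt step with respect to $P_n$ amounts to a QR decomposition of the design matrix, as in the following sketch (a hypothetical helper, not from the thesis):

```python
import numpy as np

# Given X_n with rows x_i^T = (f_1(u_i), ..., f_k(u_i)), compute the
# coefficient matrix C that re-expresses f_1, ..., f_k as a basis that is
# orthonormal w.r.t. the empirical measure P_n.  QR performs the same
# triangular transformation as Gram-Schmidt.
def empirical_orthonormalize(X):
    n, k = X.shape
    Q, R = np.linalg.qr(X)              # X = Q R with Q^T Q = I_k
    if np.any(np.isclose(np.diag(R), 0.0)):
        raise ValueError("basis functions linearly dependent under P_n")
    C = np.sqrt(n) * np.linalg.inv(R)   # new basis g_j = sum_i C[i, j] f_i
    G = X @ C                           # values of the new basis at u_1..u_n
    assert np.allclose(G.T @ G / n, np.eye(k))   # orthonormal w.r.t. P_n
    return C
```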

When doing linear regression, we will receive a sequence of $n$ points from $\mathcal{U}$ and apply our basis functions to them to get a design matrix $X_n$ as in (2.2). We can compute $\hat\mu_n$ using (2.7) if and only if $X_n$ is of full column rank [21, p. 531]; in other words, if $P_n$ induces a space where our $k$ basis functions are linearly independent. In experiments where we choose $u^n$, we can check this ourselves. In the other case, where each $u_i$ is randomly drawn from $P$, we want to know under what conditions this holds. This is answered by the following two closely related propositions.

Proposition 2.1 If our basis functions $f_1, f_2, \dots, f_k$ are linearly independent, then the rank of $X_n$ converges almost surely to $k$ as $n \to \infty$. Otherwise, it almost surely does not converge to $k$.

Proof If the basis functions are linearly dependent, then for any $n$, the same linear dependence will almost surely exist between the columns of $X_n$. Now assume the basis functions are linearly independent, and suppose that for $n$ data points, $X_n$ has rank less than $k$. The next data point will be drawn from $P$ and will add a random row vector $x_{n+1}^T$ to the matrix $X_n$. If $P(x_{n+1} \in \mathrm{sp}(x_1, x_2, \dots, x_n)) = 1$, then the basis functions are linearly dependent: a contradiction. So with positive probability, $X_{n+1}$ will be of higher rank than $X_n$. Because this holds for any $X_n$ that is not of full column rank, any sequence of such matrices eventually attains full column rank with probability 1.
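A quick numerical illustration of Proposition 2.1 (ours): for $u_i$ drawn from a continuous distribution and the monomial basis with $k = 3$, the rank of $X_n$ reaches $k$ as soon as enough points have arrived.

```python
import numpy as np

# rank(X_n) for growing n, with u_i uniform on [-1, 1] and basis 1, u, u^2.
rng = np.random.default_rng(1)
for n in (1, 2, 3, 5, 10):
    u = rng.uniform(-1.0, 1.0, size=n)
    Xn = np.column_stack([u ** i for i in range(3)])
    print(n, np.linalg.matrix_rank(Xn))   # rank min(n, 3) almost surely
```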

Proposition 2.2 $X_n$ has full column rank with probability 1 if and only if $n \ge k$ and for all measurable $U \subseteq \mathcal{U}$ with $P(U) > 0$, our basis functions are still linearly independent when restricted to $U$.

Proof First, it is clear that $X_n$'s rank is at most $n$, so if $n < k$, then the matrix cannot be of full column rank. Assume now that $n \ge k$.

⇒: Suppose there exists a $U \subseteq \mathcal{U}$ with positive measure such that the restrictions to $U$ of our basis functions are linearly dependent. Then with positive probability, $u_1, \dots, u_n \in U$, and given this, the same linear dependence will almost surely exist between the columns of $X_n$.

⇐: Suppose that for $m$ data points, $X_m$ has rank less than $k$. Let $U = f_{1,\dots,k}^{-1}(\mathrm{sp}(x_1, x_2, \dots, x_m))$, the set of points $u$ whose design vector $(f_1(u), \dots, f_k(u))^T$ lies in the span of the design vectors seen so far. When restricted to this $U$, our basis functions are linearly dependent. If $P(u_{m+1} \in U) > 0$, this gives a contradiction. So with probability 1, $X_{m+1}$ will be of higher rank than $X_m$. Because this holds for $X_0, X_1, \dots, X_{k-1}$, with probability 1, $X_k$ and all subsequent design matrices will be of full column rank.

In particular, the conditions of Proposition 2.2 require that our measure contains no point masses, and that there is no $U$ with positive measure on which one of our basis functions is constantly 0.

For the Legendre and Fourier bases, Proposition 2.2 applies: the $U$ in the proof of the proposition contains only finitely many points, so $P(U) = 0$, and it follows that $X_n$ will be of full column rank almost surely. An example of an orthonormal basis that does not satisfy the condition is given by (finite subsequences of) the Haar basis [15, p. 150] on an interval equipped with the Lebesgue measure. This happens because all basis functions are piecewise constant with finite numbers of discontinuities, so intervals $U$ can be found where all basis functions are constant and thus linearly dependent.

In practice, it will often happen that a sample $u^n$ gives a matrix $X_n$ which is theoretically of full column rank, but for which $X_n^T X_n$ is numerically hard to invert. We will discuss this issue in Section 6.

2.3 Maximum likelihood estimators and Bayesian predictive distributions

The maximum likelihood estimator $\hat\mu_n$ is unbiased: if the true function underlying the data (as in (2.1)) is $f = f_\mu$, we have for fixed $X_n$ with full column rank

$$\begin{aligned}
\mathrm{E}_\mu\, \hat\mu_n &= \mathrm{E}_\mu\, (X_n^T X_n)^{-1} X_n^T y^n \\
&= \mathrm{E}_\mu\, (X_n^T X_n)^{-1} X_n^T (X_n\mu + Z^n) \\
&= \mu + (X_n^T X_n)^{-1} X_n^T\, \mathrm{E}_\mu Z^n \\
&= \mu \tag{2.11}
\end{aligned}$$

[21, p. 538]. Assuming the conditions in Proposition 2.2 hold, $X_n$ almost surely has full column rank, thus the above can be extended to random $X_n$. Also, by the Gauss-Markov theorem, it is known that $\hat\mu_n$ has minimum variance among all linear unbiased estimators [15, 3.2.2].

Suppose we have observed data $(u^n, y^n)$. For models $\mathcal{M}^{(k)}$, the above suggests predicting with $p_{\hat\mu_n, \sigma^{*2}}$, as it is in some sense the best hypothesis in our model. However, in terms of logarithmic loss it is not the optimal choice, as we will see below.

For models $\mathcal{S}^{(k)}$, we need an estimator for $\sigma^2$ in addition to $\hat\mu_n$. The maximum likelihood estimator is $\mathrm{RSS}(y^n)/n$ [14, (12.33)], but it is biased and underestimates the true variance. An unbiased estimator is $\hat\sigma_n^2 = \mathrm{RSS}(y^n)/(n-k)$ [21, p. 540], which we will use here.
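A small sketch contrasting the two variance estimators just mentioned (the helper name is ours):

```python
import numpy as np

# Biased maximum likelihood estimator RSS/n versus unbiased RSS/(n - k).
def variance_estimates(Xn, yn):
    n, k = Xn.shape
    mu_hat, _, _, _ = np.linalg.lstsq(Xn, yn, rcond=None)
    rss = float(np.sum((yn - Xn @ mu_hat) ** 2))
    return rss / n, rss / (n - k)   # (ML estimate, unbiased estimate)
```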

2.3.1 The Bayesian framework

So far we have considered predictions based on distributions from our model. But since we formulate our prediction as a distribution rather than as a finite number of parameters, we may use any probability distribution as our prediction without confining ourselves to those contained in our model. The Bayesian predictive distribution makes use of this flexibility. It arises in the Bayesian framework as follows. Let $\mathcal{M} = \{\, P_\theta \mid \theta \in \Theta \,\}$ be a statistical model of hypotheses indexed by a parameter $\theta$. (For example, for the linear model with known variance defined in (2.4), we have $\Theta = \mathbb{R}^k$.) Given a prior distribution $w(\theta)$ on our parameter space $\Theta$, the probability we assign to the observation $y$ is

$$\bar p_{\text{Bayes}}(y) := \int_{\theta\in\Theta} p_\theta(y)\, w(\theta)\, d\theta. \tag{2.12}$$

We call this the prior predictive distribution, as it is our prediction about the data before any observations have been made, based on our prior distribution.

If we have already observed $y$ and want to update our view of which parameter values are likely, we can transform our prior into a posterior by

$$w(\theta \mid y) = \frac{p_\theta(y)\, w(\theta)}{\bar p_{\text{Bayes}}(y)}, \tag{2.13}$$

which is an application of Bayes' theorem [5].

Similar to how we update our degree of belief in the parameter values, we can update our view of the probabilities of future observations $y_2$ given past observations $y_1$:

$$\begin{aligned}
\bar p_{\text{Bayes}}(y_2 \mid y_1) &= \frac{\bar p_{\text{Bayes}}(y_1, y_2)}{\bar p_{\text{Bayes}}(y_1)} = \frac{\int_{\theta\in\Theta} p_\theta(y_2)\, p_\theta(y_1)\, w(\theta)\, d\theta}{\bar p_{\text{Bayes}}(y_1)} \\
&= \frac{\int_{\theta\in\Theta} p_\theta(y_2)\, w(\theta \mid y_1)\, \bar p_{\text{Bayes}}(y_1)\, d\theta}{\bar p_{\text{Bayes}}(y_1)} = \int_{\theta\in\Theta} p_\theta(y_2)\, w(\theta \mid y_1)\, d\theta \tag{2.14}
\end{aligned}$$

[14, (2.36)]. This is called the (posterior) predictive distribution. Note that this formula has the same form as (2.12), but with the prior replaced by the posterior.

Imagine that the true function is drawn at random according to our prior distribution before the experiment starts. Then this predictive distribution accurately expresses how further data will be distributed given our knowledge. As a result, the expected amount of additional logarithmic loss we incur by predicting with some distribution $q$ rather than $p(y_2) := \bar p_{\text{Bayes}}(y_2 \mid y_1)$ equals

$$\mathrm{E}_{Y\sim p}[-\log q(Y)] - \mathrm{E}_{Y\sim p}[-\log p(Y)] = D(p \,\|\, q) \ge 0, \tag{2.15}$$

where $D(p \,\|\, q)$ is the Kullback-Leibler (or KL) divergence from $p$ to $q$, which is nonnegative by the information inequality [7, Theorem 2.6.3]. This shows that no distribution $q$ can expect to do better than the Bayesian predictive distribution.
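For univariate Gaussians, the expected excess loss (2.15) has a standard closed form, which the following sketch (ours) evaluates:

```python
import math

# D(N(m1, v1) || N(m2, v2)): nonnegative, and zero iff the two agree.
def kl_gauss(m1, v1, m2, v2):
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0: predicting with p itself
print(kl_gauss(0.0, 1.0, 0.5, 2.0))   # > 0: any other q does worse on average
```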

Of course, the true function is not usually drawn from a distribution, and if it were, we would rarely know that distribution. Then the above optimality result does not hold, but if the true function looks sufficiently like it could have been drawn from our chosen prior, then we still expect to do well.

2.3.2 The Bayesian predictive distribution for M(k)

To see what forms these distributions take in the models we are interested in, we first have to decide on a prior on $\mu$ and, in the case of $\mathcal{S}^{(k)}$, a prior on $\sigma^2$. In the models of the $\mathcal{M}^{(k)}$ family, where the variance $\sigma^{*2}$ is known, it will be especially convenient to use a Gaussian prior with some mean $\mu_0$ and positive definite covariance matrix $\sigma^{*2}\Sigma_0$ for the parameter $\mu$. More details about the choice of prior will be discussed in Section 2.4. The requirement of a positive definite covariance matrix implies that the prior must assign positive density to all parameter vectors. This is not a restriction in practice, because if we only want a linear subspace of our parameter space to have positive prior probability, the right way to accomplish this is by a different choice of basis functions.

From now on, we will no longer explicitly mention $u^n$ in our formulae; in the context of regression, the relevant values may always be considered known.

Equation (2.12) essentially expresses that to sample a sequence of observations $Y^n$ according to $\bar p_{\text{Bayes}}$, we may first sample $\mu$ from $w$, then $Y^n$ from $p_{\mu,\sigma^{*2}}$. Because both distributions are Gaussian, the resulting distribution will again be Gaussian. For random variables $\mu \sim \mathcal{N}(\mu_0, \sigma^{*2}\Sigma_0)$ and $Z^n \sim \mathcal{N}(0, \sigma^{*2} I_n)$ and the linear transformation $Y^n = X_n\mu + Z^n$, this gives

$$Y^n \sim \mathcal{N}\bigl(X_n\mu_0,\ \sigma^{*2}(I_n + X_n\Sigma_0 X_n^T)\bigr). \tag{2.16}$$

Because $\Sigma_0$ is positive definite, its inverse exists and is also positive definite. Any positive semidefinite matrix is a Gram matrix, meaning that it can be expressed as $A^T A$ for some $k \times k$ matrix $A$ [6, Fact 8.9.37]. Hence we can write $(\sigma^{*2}\Sigma_0)^{-1} = \sigma^{*-2} X_0^T X_0$ for some $k \times k$ matrix $X_0$. This inverse of the covariance matrix is called the precision matrix, and will be useful as a parameter of our normal distributions.

Using this $X_0$, we define $y_0 := X_0\mu_0$. We find using (2.13) that the posterior after observing $n$ data points $y^n$ is of the form

$$w(\mu \mid y^n) \propto p_{\mu,\sigma^{*2}}(y^n)\, w(\mu),$$

where $p_{\mu,\sigma^{*2}}(y^n)$ is given by (2.6) and

$$\begin{aligned}
w(\mu) &\propto \exp\left(-\tfrac{1}{2}(\mu - \mu_0)^T(\sigma^{*2}\Sigma_0)^{-1}(\mu - \mu_0)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(\mu - \mu_0)^T X_0^T X_0 (\mu - \mu_0)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(X_0\mu - y_0)^T(X_0\mu - y_0)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}\,\mathrm{SSE}(\mu, y_0)\right).
\end{aligned}$$

Putting the two back together, we get

$$w(\mu \mid y^n) \propto \exp\left(-\tfrac{1}{2\sigma^{*2}}\bigl(\mathrm{SSE}(\mu, y^n) + \mathrm{SSE}(\mu, y_0)\bigr)\right), \tag{2.17}$$

which is again Gaussian. The maximum and thus also the mean of this density are given by

$$\hat\mu_n = (X_0^T X_0 + X_n^T X_n)^{-1}(X_0^T y_0 + X_n^T y^n) \tag{2.18}$$

as in (2.7). (We use the same notation $\hat\mu_n$ as we did for the maximum likelihood estimator. The reason for this will become apparent in Section 2.4, but whether we mean the maximum likelihood estimator or the posterior mode with respect to some prior will always be clear from context.) By first writing

$$X := \begin{pmatrix} X_0 \\ X_n \end{pmatrix}, \qquad y := \begin{pmatrix} y_0 \\ y^n \end{pmatrix}, \qquad \hat y := X\hat\mu_n = X(X^T X)^{-1} X^T y,$$

and then multiplying the posterior by a factor that does not depend on $\mu$, it becomes

$$\begin{aligned}
w(\mu \mid y^n) &\propto \exp\left(-\tfrac{1}{2\sigma^{*2}}(y - X\mu)^T(y - X\mu)\right) \\
&\propto \exp\left(-\tfrac{1}{2\sigma^{*2}}\bigl((y - X\mu)^T(y - X\mu) - (y^T y - \hat y^T\hat y)\bigr)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(\hat y - X\mu)^T(\hat y - X\mu)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(\hat\mu_n - \mu)^T X^T X (\hat\mu_n - \mu)\right), \tag{2.19}
\end{aligned}$$

which shows that the precision matrix is $\sigma^{*-2} X^T X = \sigma^{*-2}(\Sigma_0^{-1} + X_n^T X_n)$.

The predictive distribution given $n$ data points is now easily obtained by taking (2.16) and plugging in the posterior in place of the prior as we saw in (2.14), yielding

$$Y_{n+1} \mid y^n \sim \mathcal{N}\bigl(x_{n+1}^T\hat\mu_n,\ \sigma^{*2}(1 + x_{n+1}^T(X^T X)^{-1}x_{n+1})\bigr) \tag{2.20}$$

for the prediction of a single data point $n+1$.

Another result of the above derivation is that when observing more data points, we can incorporate them by adding matrices of the form $\sigma^{*-2} X_0^T X_0$ to our precision matrix, and updating $\hat\mu_n$ using (2.18).
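Putting (2.18) and (2.20) together, a sketch of the predictive distribution for $\mathcal{M}^{(k)}$ might look as follows (function and parameter names are ours):

```python
import numpy as np

# Predictive mean and variance (2.20) for M(k), with known noise variance
# sigma2 and Gaussian prior N(mu0, sigma2 * Sigma0) on mu.
def predictive_Mk(Xn, yn, xnew, mu0, Sigma0, sigma2):
    prec0 = np.linalg.inv(Sigma0)             # equals X_0^T X_0 in the text
    A = prec0 + Xn.T @ Xn                     # X^T X in the derivation
    mu_hat = np.linalg.solve(A, prec0 @ mu0 + Xn.T @ yn)    # (2.18)
    mean = xnew @ mu_hat
    var = sigma2 * (1.0 + xnew @ np.linalg.solve(A, xnew))  # (2.20)
    return mean, var
```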

2.3.3 The Bayesian predictive distribution for S(k)

For the model family $\mathcal{S}^{(k)}$ we will use priors of the form $w(\mu, \sigma) = w(\mu \mid \sigma)\, w(\sigma)$, where $w(\mu \mid \sigma)$ is Gaussian as in the previous section, and where $w(\sigma) \propto \sigma^{-\alpha-1}\exp(-\beta/2\sigma^2)$. For $\alpha > 0$, $\beta > 0$, we have

$$\int_0^\infty \sigma^{-\alpha-1}\exp\left(-\frac{\beta}{2\sigma^2}\right) d\sigma = \frac{1}{2}\left(\frac{\beta}{2}\right)^{-\alpha/2}\Gamma\left(\frac{\alpha}{2}\right),$$

so that $w(\sigma)$ defines a probability distribution, known as the square root inverted gamma distribution (because $\sigma$ is the inverted square root of the precision $\sigma^{-2}$, which follows a gamma distribution). The combined prior $w(\mu, \sigma)$ will be seen to be a convenient choice for similar reasons as our choice of a multivariate normal prior in the previous section [5, Example 5.6].

Given a prior of this form with mean $\mu_0$ and precision matrix $(\sigma^2\Sigma_0)^{-1} = \sigma^{-2}X_0^T X_0$ for $w(\mu \mid \sigma)$, and $\alpha = \alpha_0 > 0$, $\beta = \beta_0 > 0$ for $w(\sigma)$, the prior predictive distribution is

$$\begin{aligned}
\bar p_{\text{Bayes}}(y^n) &= \int_0^\infty \int_{\mathbb{R}^k} p_{\mu,\sigma^2}(y^n)\, w(\mu \mid \sigma)\, d\mu\; w(\sigma)\, d\sigma \\
&= \int_0^\infty \bar p_{\text{Bayes}}(y^n \mid \sigma)\, w(\sigma)\, d\sigma \\
&\propto \int_0^\infty \sigma^{-n}\exp\left(-\frac{R}{2\sigma^2}\right) \cdot \sigma^{-\alpha_0-1}\exp\left(-\frac{\beta_0}{2\sigma^2}\right) d\sigma \\
&= \int_0^\infty \sigma^{-n-\alpha_0-1}\exp\left(-\frac{R+\beta_0}{2\sigma^2}\right) d\sigma \\
&\propto \left(\tfrac{1}{2}(R+\beta_0)\right)^{-(n+\alpha_0)/2} \propto (R+\beta_0)^{-(n+\alpha_0)/2}, \tag{2.21}
\end{aligned}$$

where $R = (y^n - X_n\mu_0)^T(I_n + X_n\Sigma_0 X_n^T)^{-1}(y^n - X_n\mu_0)$, the factor occurring in the density function for (2.16). We see that this distribution is multivariate Student with mean $X_n\mu_0$, $\alpha_0$ degrees of freedom, and precision matrix $\frac{\alpha_0}{\beta_0}(I_n + X_n\Sigma_0 X_n^T)^{-1}$. (This last parameter is often called "precision matrix" even though it is not exactly the inverse of the distribution's covariance matrix [5].) Note that Student's t distribution also appears in the frequentist approach to the problem of estimating the mean of a Gaussian whose variance is unknown [26].
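Since (2.21) is a multivariate Student distribution, its log-density can be evaluated directly, for instance with scipy's multivariate_t (available in scipy ≥ 1.6). The sketch below is ours; the parameter names are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_t

# Log prior predictive density (2.21) for S(k): Student with mean X_n mu0,
# alpha0 degrees of freedom, and scale matrix (beta0/alpha0)(I + X Sigma0 X^T).
def log_prior_predictive_Sk(Xn, yn, mu0, Sigma0, alpha0, beta0):
    n = len(yn)
    shape = (beta0 / alpha0) * (np.eye(n) + Xn @ Sigma0 @ Xn.T)
    return multivariate_t(loc=Xn @ mu0, shape=shape, df=alpha0).logpdf(yn)

# The predictive density of y_{n+1} given y^n then follows from the ratio
# p(y^{n+1}) / p(y^n), as in (2.14).
```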


To interpret $R$, we define for any data $y_0$ with least squares solution $\mu_0$:

$$y := \begin{pmatrix} y_0 \\ y^n \end{pmatrix}, \qquad y' := \begin{pmatrix} X_0\mu_0 \\ y^n \end{pmatrix}.$$

Then by [14, (12.58)], $R = \mathrm{RSS}(y')$. Because $y_0$ and $X_0\mu_0$ have the same least squares solution $\mu_0$, by part 3 of the composition lemma [14, 12.2], $y$ and $y'$ will have the same least squares solution $\hat\mu_n$, and by part 4 of that lemma, the amount by which the residual sum of squares increases as a result of adding observations $y^n$ is the same in either case. Because $\mathrm{RSS}(X_0\mu_0) = 0$, we see that $R$ equals this amount.

We now set out to compute the posterior after observing data $X_n, y^n$. If $X_n$ is of full column rank, let $\mu_0' = (X_n^T X_n)^{-1} X_n^T y^n$, the least squares estimator for the new data only; otherwise, choose any $\mu_0'$ such that $X_n\mu_0' = y^n$ (this system is underdetermined in this case). In either case, we have $X_n^T X_n \mu_0' = X_n^T y^n$. Furthermore, we will need that for $\hat\mu_n = (X^T X)^{-1}(X_n^T X_n \mu_0' + X_0^T X_0\mu_0)$,

$$\begin{aligned}
(\mu - \hat\mu_n)^T X^T X (\mu - \hat\mu_n) &= \mu^T(X_n^T X_n + X_0^T X_0)\mu - 2\mu^T(X_n^T X_n\mu_0' + X_0^T X_0\mu_0) \\
&\quad + (\mu_0'^T X_n^T X_n + \mu_0^T X_0^T X_0)(X^T X)^{-1}(X_n^T X_n\mu_0' + X_0^T X_0\mu_0),
\end{aligned}$$

and for the last term occurring there,

$$\begin{aligned}
&(\mu_0'^T X_n^T X_n + \mu_0^T X_0^T X_0)(X^T X)^{-1}(X_n^T X_n\mu_0' + X_0^T X_0\mu_0) \\
&\qquad = \bigl((\mu_0' - \mu_0)^T X_n^T X_n + \mu_0^T X^T X\bigr)(X^T X)^{-1}\bigl(X_0^T X_0(\mu_0 - \mu_0') + X^T X\mu_0'\bigr) \\
&\qquad = \mu_0^T X^T X\mu_0' + (\mu_0' - \mu_0)^T X_n^T X_n\mu_0' + \mu_0^T X_0^T X_0(\mu_0 - \mu_0') \\
&\qquad\qquad - (\mu_0' - \mu_0)^T X_n^T X_n (X^T X)^{-1} X_0^T X_0 (\mu_0' - \mu_0) \\
&\qquad = \mu_0'^T X_n^T X_n\mu_0' + \mu_0^T X_0^T X_0\mu_0 - (\mu_0' - \mu_0)^T X_n^T X_n (X^T X)^{-1} X_0^T X_0 (\mu_0' - \mu_0).
\end{aligned}$$
