
Thijs van Ommen

Combining predictions from linear models using the switch distribution

Master thesis, defended on March 10, 2011
Thesis advisor: Peter D. Grünwald

Specialisation: Applied Mathematics

Mathematisch Instituut, Universiteit Leiden


Contents

1 Introduction
2 Linear regression
   2.1 Definitions and notation
   2.2 Basis functions
   2.3 Maximum likelihood estimators and Bayesian predictive distributions
       2.3.1 The Bayesian framework
       2.3.2 The Bayesian predictive distribution for M(k)
       2.3.3 The Bayesian predictive distribution for S(k)
   2.4 Priors
       2.4.1 Subjective priors: ridge regression and the g-prior
       2.4.2 Objective priors: Jeffreys' prior
       2.4.3 Priors for S(k)
       2.4.4 The effective number of parameters
   2.5 Behaviour of predictive variance
3 Methods for model selection and combination
   3.1 Model selection and prediction
   3.2 The AIC/BIC dilemma
   3.3 Other methods for dealing with multiple models
   3.4 The switch distribution
4 Risk estimates for M(k)
   4.1 Decomposition of prediction difference
   4.2 Oracle aware of all u
   4.3 Oracle aware of past u
   4.4 Oracle unaware of u
   4.5 Comparing two experts
5 Risk estimate for S(k)
   5.1 Oracle unaware of u
   5.2 Comparing two experts
   5.3 Nonincreasing risk
6 Implementation
   6.1 Computations for regression
       6.1.1 Incremental computation for predictive distributions
       6.1.2 Computational considerations regarding basis functions
   6.2 Switching algorithm
       6.2.1 Dealing with missing predictions
       6.2.2 Modifying the switch distribution to only go up
7 Discussion of experimental results
   7.1 AIC and BIC
       7.1.1 AIC
       7.1.2 BIC
   7.2 Bayesian methods
       7.2.1 Bayesian model averaging
       7.2.2 The unmodified switch distribution
       7.2.3 The modified switch distribution
       7.2.4 Further comparisons
       7.2.5 Tuning the switch distributions
   7.3 Conclusion
A Experimental results
   A.1 Results for M(k), g-prior
   A.2 Results for M(k), Jeffreys' prior
   A.3 Results for S(k), Jeffreys' prior
References


1 Introduction

In this thesis we investigate the switch distribution [9] as a method for model selection and prediction in linear regression. We compare the performance of this new method to that of other methods, both theoretically and experimentally.

Many statistical problems involve finding a relation between two variables. Given several pairs of measurements for these two variables, we wish to find a function that estimates the second variable given the first.¹ This type of problem is called a regression problem. The function we find could help us gain insight into the underlying relation between the variables (for example, it might tell us something about the laws of nature determining the relation between two physical quantities), or it could be used to predict the value of the second variable for a given value of the first we have not observed. To make this problem well-defined, the function is chosen from some set of functions. Each function corresponds to a hypothesis, and the set of functions corresponds to a statistical model.

As an example, Figure 1(a) shows ten measurements relating a stream's depth to its rate of flow [22]. (The units of these measurements are not given.)

[Figure 1: Predicting flow rate. (a) Measured flow rate; (b) flow rate fit by a straight line. Both panels plot flow rate (−2 to 8) against depth (0.3 to 0.8).]

Suppose we want to use these observations to predict the flow rate of a stream with depth 0.6. A simple way to answer this question is by finding a straight line that passes as closely as possible to all observed data points. In this case our model contains all straight lines. The chosen line is shown in Figure 1(b); it suggests that the flow rate for a depth of 0.6 should be about 4.3. We choose which line "passes most closely to the data points" using the standard method of least squares: for a given line, we measure the vertical distance to each data point, square these distances, and select the line from our model for which the sum of these squares is smallest.

¹ Technically, we assume that the second variable contains a random component, and we want to estimate the conditional expectation of the second variable given the first.

[Figure 2: Predicting flow rate, continued. (a) Flow rate fit by a quadratic polynomial; (b) flow rate fit by a high-degree polynomial. Both panels plot flow rate (−2 to 8) against depth (0.3 to 0.8).]


However, the choice to use a straight line is rather arbitrary, and we may wonder whether the resulting prediction is as accurate as can be achieved. The line from Figure 1(b) passes above the data points in the centre but below those near the edges of the graph, suggesting that a curved line will be able to provide a better fit to the data. This is shown in Figure 2(a), using a quadratic polynomial as the curved line. This line does indeed seem to follow the observations better than the straight line. It predicts a flow rate of 3.6 at depth 0.6.

But if a quadratic polynomial provides better predictions than a straight line, maybe a polynomial of yet higher degree will do even better. If we increase the degree of the polynomial further, we will be able to pass by each point more closely. In fact, for any n points in different horizontal positions, there is a unique polynomial of degree n − 1 that passes through all those points exactly. In our example data, there are two points with depth 0.29 but different flow rates, so the best we can do is a polynomial of degree 8 that passes through the middle of these two points, and coincides with each other point. This polynomial is shown in Figure 2(b). Obviously, this polynomial is worthless when it comes to making predictions about different depths. In particular, the predicted flow rate for a depth of 0.6 is −1228, which suggests a stream running uphill at alarming speed.
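To make the fitting procedure concrete, here is a minimal sketch in Python. The measurements are made up for illustration (the thesis does not reproduce the exact data from [22]); only the duplicate depth 0.29 is taken from the text.

```python
import numpy as np

# Hypothetical stand-in for the ten stream measurements.
rng = np.random.default_rng(0)
depth = np.array([0.29, 0.29, 0.34, 0.42, 0.48, 0.55, 0.61, 0.66, 0.73, 0.80])
flow = 10 * depth**2 + rng.normal(scale=0.3, size=depth.size)

for degree in (1, 2, 8):
    coeffs = np.polyfit(depth, flow, degree)   # least-squares polynomial fit
    print(degree, np.polyval(coeffs, 0.6))     # prediction at depth 0.6
# The degree-8 fit follows the noise in the data (numpy may warn about
# conditioning); its predictions between observations are erratic, in the
# spirit of Figure 2(b).
```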

In this last example, we selected the best fitting polynomial from among all polynomials with degree up to 8. This means that our statistical model contained a hypothesis for each such polynomial. While it is mathematically easy to select the polynomial that best fits the data even from such a large set, it is very difficult to find a polynomial that will predict the data well given so many choices: a large amount of data would be required to find appropriate values for all the polynomial's coefficients. Large models are therefore called complex.

The importance of choosing a suitable degree for the fitted polynomial is clear. Choosing it too low will lead to a line that fails to mimic the true relation between the variables and therefore does not predict as well as a higher-degree polynomial could. On the other hand, if we set the degree too high, we will obtain a curve that goes astray in many places where we do not have observations, and thus has no predictive value. The degree we are looking for must be a trade-off between these two concerns.

In the above example, we motivated the choice of a quadratic polynomial based on inspection of the data. Such an approach will not always be practical, for example when larger quantities of data are involved. Many methods based on mathematical theory have been developed to make this trade-off in an automatable way, without requiring human judgement calls. Unfortunately, these methods sometimes disagree.

Two such methods are the well-known AIC (Akaike's information criterion) [2] and BIC (Bayesian information criterion) [24]. Both take the form of simple formulae that assign a score to each model, allowing the best model to be chosen. However, they differ in their notion of "best". Informally, AIC is quick to realize when a more complex model is necessary, requiring fewer observations for this than other methods. But if the correct model has already been found, AIC will continue to pick more complex models. BIC, on the other hand, will eventually find the correct model and stay with it, but will stay with too simple models longer than AIC does. In many situations, these competing properties make it unclear which method is to be preferred. This is known as the AIC/BIC dilemma.

To simplify the problem of choosing an appropriate model, we may take a step back and, in place of models, consider experts that look at the data and give us counsel based on it. Each expert uses a model together with a well-defined method to produce a prediction based on the observed data and the expert’s model.

Consider the situation where we are not directly interested in the models the experts use or in the exact mechanism by which the different experts come to their advice, but we wish to choose which expert to trust based on their past performance. For example, to predict the weather, we may consult several experts, using different, possibly very complex, models. Without understanding these models, we may still evaluate their performance by comparing their predictions to the actual weather. This way, we may find that one expert always gives the most reliable predictions. But it is also possible that which expert is best depends on the season or on other factors.

The insight that which expert is best may vary with time can be seen as underlying the switch distribution, a new method for model selection described in [9] and formally defined in Section 3.4. This method is shown to share the favourable aspects of both AIC and BIC: it can quickly go to complex models when needed, like AIC, and will eventually find the correct model (if there is one), like BIC. Without going into the mathematical details of any of these methods, the switch distribution's motivation may be described as follows: the switch distribution is based on the same principles as BIC, but takes into account the possibility that as more data points come in, different experts may become more suitable and we may want to switch between them occasionally.

It should be clear that any prediction about the future involves some degree of uncertainty. We want to allow our experts to express their uncertainty, so that the experts' own assessments of the quality of their predictions can play a role in our choice of which expert is best. This is accomplished by asking the experts for predictions not in the form of a single value, but in the form of a probability density on possible values. When we know the actual outcome, we evaluate the quality of such a prediction with a loss equal to the negative logarithm of the density assigned to the observed outcome. Thus, if this outcome was thought likely by the expert and given a high density, the loss is small, but if an outcome occurred that the expert did not find likely, the loss is large. If an expert is strongly convinced that the outcome will lie close to some value, a predictive density with a sharp peak around that value is appropriate, as according to the expert's beliefs, this will lead to a very small loss. On the other hand, if an expert is not very sure what will happen, a flatter distribution is a way for the expert to hedge their bets: no matter what happens, the loss will not be enormous (but it will be larger than that of the expert in the previous scenario).
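As a concrete illustration (ours, not the thesis's), the logarithmic loss of a Gaussian predictive density shows this trade-off between sharp and hedged predictions; the numbers below are made up.

```python
import math

# Logarithmic loss of a Gaussian predictive density N(mean, var):
# -log density at the observed outcome.
def log_loss(outcome, mean, var):
    return 0.5 * math.log(2 * math.pi * var) + (outcome - mean) ** 2 / (2 * var)

print(log_loss(4.3, mean=4.0, var=0.1))   # confident and nearly right: small loss
print(log_loss(6.0, mean=4.0, var=0.1))   # confident and wrong: large loss
print(log_loss(6.0, mean=4.0, var=2.0))   # hedged: moderate loss either way
```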

The most important theoretical result in this thesis is our finding that once an expert using a complex model has started to outperform experts using simpler models, we may expect it to continue outperforming them as more data are observed. To take advantage of this, we define a modified switch distribution that does not consider switching from a complex to a simple model.

Further, we conduct simulation experiments where we compare several methods for choosing between, as well as combining, the predictions of several models, including the aforementioned AIC, BIC and switch distribution. In these experiments, we find that the switch distribution performs best in general. For most true functions, AIC will fail to settle on the model containing that function as it continues to overfit, resulting in inferior predictions. Similarly, BIC's theoretical slowness (in the sense of requiring many observations before selecting a sufficiently complex model) also translates to practice for most functions, again resulting in predictions that are outperformed by the switch distribution. Finally, the restricted version of the switch distribution that only switches "up" is found to improve the performance of the switch distribution still further.

This thesis is organized as follows: Section 2 covers linear regression models and their frequentist and Bayesian solutions. In Section 3, we look at several methods to solve the problems of model selection and of prediction using several models, including the switch distribution. In Sections 4 and 5, we combine the theory from the earlier sections to analyse how the quality of the different models' predictions changes as more data becomes available. Finally, Section 6 covers some issues that occur when implementing these models and the methods for combining them on a computer, and in Section 7 we evaluate the performance of the methods based on simulation experiments run on such an implementation.

Acknowledgements First of all, I would like to thank Peter Grünwald for his guidance that led me to interesting insights regarding this even more interesting topic. I am grateful to my colleagues at CWI, especially Steven de Rooij, Tim van Erven, and Wouter Koolen, for the inspiring atmosphere they provided, which greatly encouraged me during this work. Finally, I owe thanks to my family and friends for their support and patience while this thesis continued to be "almost done".

2 Linear regression

2.1 Definitions and notation

Regression problems are concerned with finding a relationship between two variables: the regressor variable $u^n = (u_1, \dots, u_n) \in \mathcal{U}^n$ and the regression variable $y^n = (y_1, \dots, y_n) \in \mathbb{R}^n$. (We largely follow the notation of chapter 12 of [14].) Given values of these variables, we wish to find a function $f : \mathcal{U} \to \mathbb{R}$, either to gain insight into the true relation underlying those values, or to use this function for the prediction of further $y_{n+1}, \dots, y_{n+m}$ given $u_{n+1}, \dots, u_{n+m}$. In practice, $u^n$ is often (but not necessarily) under an experimenter's control (so from the experimenter's perspective, $u^n$ can be either chosen values or values sampled from a random variable $U^n$), while $y^n$ represents the measurements resulting from the experiment. These measurements are assumed to be subject to random noise. That is, we assume that

$$Y_i = f(u_i) + Z_i, \tag{2.1}$$

where $f$ is the true but unknown function we wish to learn, and the $Z_i$ are distributed independently of each other and of $U^n$, with $\mathrm{E}\, Z_i = 0$ and $\mathrm{Var}\, Z_i = \sigma^2$ for all $i$.

In the particular case of linear regression, we choose an order $k$, fix that many basis functions $f_1, \dots, f_k$, and confine our search to functions that are linear combinations of these basis functions. These are then the functions of the form $f_\mu(u) = \sum_{i=1}^k \mu_i f_i(u)$ for $\mu \in \mathbb{R}^k$. By letting

$$\mu := \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix}, \qquad x_i := \begin{pmatrix} f_1(u_i) \\ \vdots \\ f_k(u_i) \end{pmatrix}, \qquad\text{and}\qquad X_n := \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \tag{2.2}$$

we can write $f_\mu(u_i) = x_i^T \mu$ and $(f_\mu(u_1), \dots, f_\mu(u_n))^T = X_n \mu$. We call the $x_i$ design vectors and $X_n$ the design matrix.
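As a sketch of (2.2), a design matrix for the monomial basis $f_i(u) = u^{i-1}$ (our choice, purely for illustration) can be built as follows:

```python
import numpy as np

# Build the design matrix X_n whose rows are the design vectors
# x_i^T = (f_1(u_i), ..., f_k(u_i)), here with f_i(u) = u^(i-1).
def design_matrix(u, k):
    u = np.asarray(u, dtype=float)
    return np.column_stack([u ** i for i in range(k)])

X3 = design_matrix([0.3, 0.5, 0.7], k=3)   # rows (1, u_i, u_i^2)
```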

The hypotheses in our models will be densities for $y^n$ of the form

$$p_{\mu,\sigma^2}(y^n \mid u^n) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!n} \exp\left(-\frac{\sum_{i=1}^n (y_i - f_\mu(u_i))^2}{2\sigma^2}\right), \tag{2.3}$$

expressing the further assumption that the noise terms are independent identically distributed Gaussians with mean 0 and variance $\sigma^2$. We will consider two types of linear model, which differ in their treatment of $\sigma^2$. If the true variance is known to be $\sigma^{*2}$, the family of models

$$\mathcal{M}^{(k)} = \{\, p_{\mu,\sigma^{*2}} \mid \mu \in \mathbb{R}^k \,\} \tag{2.4}$$

is appropriate; otherwise, we must estimate the variance along with the function $f_\mu$ and use the family

$$\mathcal{S}^{(k)} = \{\, p_{\mu,\sigma^2} \mid \mu \in \mathbb{R}^k,\ \sigma^2 > 0 \,\}. \tag{2.5}$$

A common way to measure how well $f_\mu$ fits the data is by minimizing the Euclidean length of the vector of errors, which is equivalent to minimizing the sum of squared errors

$$\mathrm{SSE}(\mu, y^n) := \sum_{i=1}^n (y_i - f_\mu(u_i))^2 = \sum_{i=1}^n (y_i - x_i^T\mu)^2 = (y^n - X_n\mu)^T(y^n - X_n\mu).$$

This allows us to rewrite (2.3) to

$$p_{\mu,\sigma^2}(y^n \mid u^n) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!n} \exp\left(-\frac{\mathrm{SSE}(\mu, y^n)}{2\sigma^2}\right). \tag{2.6}$$

Because this is decreasing as a function of $\mathrm{SSE}(\mu, y^n)$, we see that the maximum likelihood estimator $\hat\mu_n$ also minimizes $\mathrm{SSE}(\mu, y^n)$. Another useful property is that whenever $X_n$ has full column rank, $\hat\mu_n$ is uniquely defined and easily computable [13, p. 373]:

$$\hat\mu_n = (X_n^T X_n)^{-1} X_n^T y^n. \tag{2.7}$$

It will be useful to define the residual sum of squares as the sum of squared errors obtained by this minimizing $\hat\mu_n$:

$$\mathrm{RSS}(y^n) := \mathrm{SSE}(\hat\mu_n, y^n) = (y^n - X_n\hat\mu_n)^T(y^n - X_n\hat\mu_n).$$
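A minimal sketch of (2.7) and the residual sum of squares; using np.linalg.lstsq rather than explicitly forming $(X_n^T X_n)^{-1}$ is a standard numerical precaution, anticipating the conditioning issues discussed in Section 6. The helper name is ours.

```python
import numpy as np

# Least squares fit: mu_hat minimizes SSE(mu, y^n); RSS is the minimum.
def fit_least_squares(Xn, yn):
    mu_hat, _, _, _ = np.linalg.lstsq(Xn, yn, rcond=None)
    rss = float(np.sum((yn - Xn @ mu_hat) ** 2))
    return mu_hat, rss
```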

A prediction of $y_{n+1}$ can also be expressed as a probability distribution on values rather than as a single value. Here and below, with some abuse of terminology we use the term "probability distribution" to refer to both probability density and probability mass functions. We measure the quality


of such a prediction using logarithmic loss, which is the negative logarithm of the probability density or mass that such a prediction assigned to the actual outcome. We may predict several outcomes at once with a multivariate predictive distribution; if the outcomes are independent under the predictive distribution, then the loss obtained equals the sum of losses that would have been obtained by individually predicting each outcome with its marginal distribution.

If the predictive distribution was Gaussian, then for any fixed variance of that distribution, the logarithmic loss will be a linear function of the squared error, and for a multivariate Gaussian with fixed covariance matrix, a quadratic form of the squared errors. For example, if we predict $y^n$ with the density function $p_{\mu,\sigma^2}$, the logarithmic loss becomes

$$-\log p_{\mu,\sigma^2}(y^n \mid u^n) = \frac{n}{2}\log 2\pi\sigma^2 + \frac{1}{2\sigma^2}\,\mathrm{SSE}(\mu, y^n). \tag{2.8}$$

We will see that for several combinations of model and prediction strategy, the predictive distribution is indeed Gaussian, so the logarithmic loss for a single outcome is essentially equivalent to the more familiar squared error loss, in the sense that a prediction strategy is better on some data in terms of logarithmic loss if and only if it is better in terms of squared error loss.

The correspondence between logarithmic loss and squared errors, as well as the simple expression for the maximum likelihood estimator, are mathematically elegant, and this is one of the reasons for their widespread use [15, 21]. There is however a disadvantage that must be mentioned: this approach is very sensitive to outliers in the data, and therefore care must be taken when applying it in practice [16].

We use "log" to refer to the natural logarithm. The choice of logarithm base only changes logarithmic loss by a multiplicative constant, but elsewhere the natural logarithm will considerably simplify the algebra. Another popular choice for the logarithmic loss function is base 2, so that all losses are measured in bits [7, 2.1].

2.2 Basis functions

The linear model relies on a choice of basis functions. As is implied by the word “basis”, we want all functions in a certain, preferably fairly general class to have a unique representation as a linear combination of such basis functions. The condition of uniqueness is easy to meet: given a sequence of functions, we can create a new sequence which omits each function that is a linear combination of its predecessors. Before we can also talk about the existence condition in spaces of functions, we need to extend the familiar notion of a basis from linear algebra to deal with infinite numbers of basis elements. To deal with linear combinations of such sets, we first need to define some form of convergence.


Consider the probability triple $(\Omega, \mathcal{F}, P)$ [27]. Define the $L^2$-norm on measurable functions $f$ as $\|f\|_2 = \left(\int |f|^2\, dP\right)^{1/2}$. Let $L^2(\Omega, \mathcal{F}, P)$ be the space of equivalence classes of $\mathcal{F}$-measurable functions $f$ with $\|f\|_2 < \infty$, where functions $f$ and $g$ are in the same equivalence class if $\|f - g\|_2 = 0$. (If $\Omega$ is a compact interval on the real line, the set of continuous functions on $\Omega$ is dense in this space [23, Theorem 1.62].) We will continue to speak of the elements of $L^2(\Omega, \mathcal{F}, P)$ as functions, even though they are really equivalence classes. Equip this space with an inner product defined by $(f, g) = \int fg\, dP$, and note that our norm follows from this inner product. From linear functional analysis we know that for this particular choice of norm and inner product, the set $L^2(\Omega, \mathcal{F}, P)$ is a Hilbert space [23, Example 3.24]. In such a space, we can speak of convergence of an infinite series, as we require. Given a sequence of measurable functions $f_1, f_2, \dots$, we call it orthogonal if $(f_i, f_j) = 0$ for all $i \neq j$; if also $\|f_i\|_2 = 1$ for all $i$, we call it orthonormal. Now for $f$ that can be expressed as a linear combination of an orthonormal sequence $f_1, f_2, \dots$, this linear combination is given by

$$f = \sum_{i=1}^{\infty} (f, f_i)\, f_i. \tag{2.9}$$

This series converges for all $f \in L^2(\Omega, \mathcal{F}, P)$ [23, Corollary 3.44]. Note that, by the definition of our norm, there may be a $P$-null set of points in $\Omega$ which do not participate in the convergence.

Calling $f_1, f_2, \dots$ a basis is justified if (2.9) holds for all $f \in L^2(\Omega, \mathcal{F}, P)$. One example of an orthonormal basis is the Fourier basis

$$f_1 = \frac{1}{\sqrt{2\pi}}, \qquad f_{2j} = \frac{1}{\sqrt{\pi}}\sin ju \ \text{ for } j \in \mathbb{N}, \qquad f_{2j+1} = \frac{1}{\sqrt{\pi}}\cos ju \ \text{ for } j \in \mathbb{N}$$

on $L^2([-\pi,\pi], \mathcal{B}[-\pi,\pi], U[-\pi,\pi])$ (where $\mathcal{B}(S)$ denotes the Borel $\sigma$-algebra and $U(S)$ the uniform probability measure on $S$). Another example is the Legendre basis on $L^2([-1,1], \mathcal{B}[-1,1], U[-1,1])$, which is found by applying the Gram-Schmidt process [23, Remark 3.21] to the sequence of polynomials $(1, u, u^2, \dots)$. (Note that in both cases the functions can be transformed to cover any other bounded interval.)
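The following sketch (ours) constructs the first few orthonormal Legendre basis functions and verifies orthonormality numerically. The scaling $\sqrt{2j+1}\,P_j$ of the standard Legendre polynomial $P_j$ reflects that the measure here is the uniform probability measure on $[-1,1]$ rather than the Lebesgue measure.

```python
import numpy as np
from numpy.polynomial import legendre as L

# Orthonormal Legendre basis w.r.t. the uniform probability measure on
# [-1, 1] (dP = du/2): scale the standard P_j by sqrt(2j + 1).
def legendre_basis(k):
    funcs = []
    for j in range(k):
        coef = np.zeros(j + 1)
        coef[j] = np.sqrt(2 * j + 1)
        funcs.append(lambda u, c=coef: L.legval(u, c))
    return funcs

# Verify orthonormality with Gauss-Legendre quadrature (exact for
# polynomials of this degree).
nodes, weights = L.leggauss(50)        # integrates w.r.t. du over [-1, 1]
fs = legendre_basis(4)
gram = np.array([[np.sum(weights * f(nodes) * g(nodes)) / 2  # /2: uniform P
                  for g in fs] for f in fs])
print(np.round(gram, 10))              # approximately the identity matrix
```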

This theory guarantees us that it is possible for our linear model to approximate any function in $L^2(\Omega, \mathcal{F}, P)$ arbitrarily closely given enough basis functions. In practice, we will have to limit the number of basis functions that we use to a finite $k$, so that we only span a subspace of $L^2(\Omega, \mathcal{F}, P)$, which will impact how well certain functions can be approximated. In this case, we have that for any function $f \in L^2(\Omega, \mathcal{F}, P)$, there exists a unique function $g \in \mathrm{sp}(f_1, \dots, f_k)$ (where $\mathrm{sp}$ denotes the linear span) that minimizes $\|f - g\|_2$, and this same $g$ is also the unique solution to $(f - g, f_i) = 0$ for all $i \in \{1, \dots, k\}$ [23, Theorem 3.34 and its proof]. Geometrically, this means that in terms of squared errors, the best solution $g$ we can create using our basis functions is the orthogonal projection of $f$ onto this subspace.

Once we have an orthonormal basis $f_1, f_2, \dots$, we may of course equally well use any other sequence of functions $g_1, g_2, \dots$ (infinite or finite; not necessarily orthonormal or even orthogonal) for which each function $g_i$ can be expressed as a linear combination of $f_1, f_2, \dots, f_i$, and vice versa for each function $f_i$. Because such a transformation is invertible, this new sequence will be a basis of the same space as the original sequence, in the sense that any function in the space has a unique representation as a linear combination of basis functions. We will therefore call a (possibly infinite) sequence $g_1, g_2, \dots$ linearly independent if it can be transformed as above into some orthonormal sequence $f_1, f_2, \dots$. Note that this is precisely the transformation that is accomplished by the Gram-Schmidt process. A linear dependence relation $\|\sum_j \alpha_j g_j\|_2 = 0$ with the $\alpha_j$ not all 0 exists between $g_1, g_2, \dots, g_i$ if and only if the Gram-Schmidt process fails on $g_1, g_2, \dots, g_i$, justifying this terminology.

Because many properties we are interested in are determined only by what functions can be expressed with some basis, and not by what coefficients occur in the corresponding linear combinations, we can freely change bases when discussing such properties. One useful example is the basis that is orthonormal with respect to the empirical measure $P_n$, which is defined as follows: given $u^n$, for any measurable $A \subseteq \Omega$,

$$P_n(A) := \frac{1}{n} \sum_{i=1}^n I_A(u_i) \tag{2.10}$$

(where $I_A$ denotes the indicator function); that is, $P_n$ assigns a point mass of $\frac{1}{n}$ to each point in $u^n$. This measure leads to a different norm, and thus to a different partition of functions into equivalence classes. In fact, since only the function values on the points $u_i$, $i \in \{1, \dots, n\}$, matter, each basis function now corresponds to a point in $\mathbb{R}^n$, and so the space equipped with this norm will be of dimension at most $n$. If our basis functions are linearly independent in this space (for which a necessary condition is thus that $k$, the cardinality of the basis, is at most $n$), then the Gram-Schmidt process can be applied to find a new basis that is orthonormal with respect to the empirical measure $P_n$, but which spans the same space.
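Numerically, this Gram-Schmidt step with respect to $P_n$ amounts to a QR decomposition of the design matrix, as in the following sketch (a hypothetical helper, not from the thesis):

```python
import numpy as np

# Given X_n with rows x_i^T = (f_1(u_i), ..., f_k(u_i)), compute the
# coefficient matrix C that re-expresses f_1, ..., f_k as a basis that is
# orthonormal w.r.t. the empirical measure P_n.  QR performs the same
# triangular transformation as Gram-Schmidt.
def empirical_orthonormalize(X):
    n, k = X.shape
    Q, R = np.linalg.qr(X)              # X = Q R with Q^T Q = I_k
    if np.any(np.isclose(np.diag(R), 0.0)):
        raise ValueError("basis functions linearly dependent under P_n")
    C = np.sqrt(n) * np.linalg.inv(R)   # new basis g_j = sum_i C[i, j] f_i
    G = X @ C                           # values of the new basis at u_1..u_n
    assert np.allclose(G.T @ G / n, np.eye(k))   # orthonormal w.r.t. P_n
    return C
```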

When doing linear regression, we will receive a sequence of $n$ points from $\mathcal{U}$ and apply our basis functions to them to get a design matrix $X_n$ as in (2.2). We can compute $\hat\mu_n$ using (2.7) if and only if $X_n$ is of full column rank [21, p. 531]; in other words, if $P_n$ induces a space where our $k$ basis functions are linearly independent. In experiments where we choose $u^n$, we can check this ourselves. In the other case, where each $u_i$ is randomly drawn from $P$, we want to know under what conditions this holds. This is answered by the following two closely related propositions.

Proposition 2.1 If our basis functions $f_1, f_2, \dots, f_k$ are linearly independent, then the rank of $X_n$ converges almost surely to $k$ as $n \to \infty$. Otherwise, it almost surely does not converge to $k$.

Proof If the basis functions are linearly dependent, then for any $n$, the same linear dependence will almost surely exist between the columns of $X_n$. Now assume the basis functions are linearly independent, and suppose that for $n$ data points, $X_n$ has rank less than $k$. The next data point will be drawn from $P$ and will add a random row vector $x_{n+1}^T$ to the matrix $X_n$. If $P(x_{n+1} \in \mathrm{sp}(x_1, x_2, \dots, x_n)) = 1$, then the basis functions are linearly dependent: a contradiction. So with positive probability, $X_{n+1}$ will be of higher rank than $X_n$. Because this holds for any $X_n$ that is not of full column rank, any sequence of such matrices eventually attains full column rank with probability 1.
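A quick numerical illustration of Proposition 2.1 (ours): for $u_i$ drawn from a continuous distribution and the monomial basis with $k = 3$, the rank of $X_n$ reaches $k$ as soon as enough points have arrived.

```python
import numpy as np

# rank(X_n) for growing n, with u_i uniform on [-1, 1] and basis 1, u, u^2.
rng = np.random.default_rng(1)
for n in (1, 2, 3, 5, 10):
    u = rng.uniform(-1.0, 1.0, size=n)
    Xn = np.column_stack([u ** i for i in range(3)])
    print(n, np.linalg.matrix_rank(Xn))   # rank min(n, 3) almost surely
```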

Proposition 2.2 $X_n$ has full column rank with probability 1 if and only if $n \ge k$ and for all measurable $U \subseteq \mathcal{U}$ with $P(U) > 0$, our basis functions are still linearly independent when restricted to $U$.

Proof First, it is clear that $X_n$'s rank is at most $n$, so if $n < k$, then the matrix cannot be of full column rank. Assume now that $n \ge k$.

⇒: Suppose there exists a $U \subseteq \mathcal{U}$ with positive measure such that the restrictions to $U$ of our basis functions are linearly dependent. Then with positive probability, $u_1, \dots, u_n \in U$, and given this, the same linear dependence will almost surely exist between the columns of $X_n$.

⇐: Suppose that for $m$ data points, $X_m$ has rank less than $k$. Let $U = f_{1,\dots,k}^{-1}(\mathrm{sp}(x_1, x_2, \dots, x_m))$, the set of points $u$ whose design vector $(f_1(u), \dots, f_k(u))^T$ lies in the span of the design vectors seen so far. When restricted to this $U$, our basis functions are linearly dependent. If $P(u_{m+1} \in U) > 0$, this gives a contradiction. So with probability 1, $X_{m+1}$ will be of higher rank than $X_m$. Because this holds for $X_0, X_1, \dots, X_{k-1}$, with probability 1, $X_k$ and all subsequent design matrices will be of full column rank.

In particular, the conditions of Proposition 2.2 require that our measure contains no point masses, and that there is no $U$ with positive measure on which one of our basis functions is constantly 0.

For the Legendre and Fourier bases, Proposition 2.2 applies: the $U$ in the proof of the proposition contains only finitely many points, so $P(U) = 0$, and it follows that $X_n$ will be of full column rank almost surely. An example of an orthonormal basis that does not satisfy the condition is given by (finite subsequences of) the Haar basis [15, p. 150] on an interval equipped with the Lebesgue measure. This happens because all basis functions are piecewise constant with finite numbers of discontinuities, so intervals $U$ can be found where all basis functions are constant and thus linearly dependent.

In practice, it will often happen that a sample $u^n$ gives a matrix $X_n$ which is theoretically of full column rank, but for which $X_n^T X_n$ is numerically hard to invert. We will discuss this issue in Section 6.

2.3 Maximum likelihood estimators and Bayesian predictive distributions

The maximum likelihood estimator $\hat\mu_n$ is unbiased: if the true function underlying the data (as in (2.1)) is $f = f_\mu$, we have for fixed $X_n$ with full column rank

$$\begin{aligned}
\mathrm{E}_\mu\, \hat\mu_n &= \mathrm{E}_\mu\, (X_n^T X_n)^{-1} X_n^T y^n \\
&= \mathrm{E}_\mu\, (X_n^T X_n)^{-1} X_n^T (X_n\mu + Z^n) \\
&= \mu + (X_n^T X_n)^{-1} X_n^T\, \mathrm{E}_\mu Z^n \\
&= \mu \tag{2.11}
\end{aligned}$$

[21, p. 538]. Assuming the conditions in Proposition 2.2 hold, $X_n$ almost surely has full column rank, thus the above can be extended to random $X_n$. Also, by the Gauss-Markov theorem, it is known that $\hat\mu_n$ has minimum variance among all linear unbiased estimators [15, 3.2.2].

Suppose we have observed data $(u^n, y^n)$. For models $\mathcal{M}^{(k)}$, the above suggests predicting with $p_{\hat\mu_n, \sigma^{*2}}$, as it is in some sense the best hypothesis in our model. However, in terms of logarithmic loss it is not the optimal choice, as we will see below.

For models $\mathcal{S}^{(k)}$, we need an estimator for $\sigma^2$ in addition to $\hat\mu_n$. The maximum likelihood estimator is $\mathrm{RSS}(y^n)/n$ [14, (12.33)], but it is biased and underestimates the true variance. An unbiased estimator is $\hat\sigma_n^2 = \mathrm{RSS}(y^n)/(n-k)$ [21, p. 540], which we will use here.
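A small sketch contrasting the two variance estimators just mentioned (the helper name is ours):

```python
import numpy as np

# Biased maximum likelihood estimator RSS/n versus unbiased RSS/(n - k).
def variance_estimates(Xn, yn):
    n, k = Xn.shape
    mu_hat, _, _, _ = np.linalg.lstsq(Xn, yn, rcond=None)
    rss = float(np.sum((yn - Xn @ mu_hat) ** 2))
    return rss / n, rss / (n - k)   # (ML estimate, unbiased estimate)
```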

2.3.1 The Bayesian framework

So far we have considered predictions based on distributions from our model. But since we formulate our prediction as a distribution rather than as a finite number of parameters, we may use any probability distribution as our prediction without confining ourselves to those contained in our model. The Bayesian predictive distribution makes use of this flexibility. It arises in the Bayesian framework as follows. Let $\mathcal{M} = \{\, P_\theta \mid \theta \in \Theta \,\}$ be a statistical model of hypotheses indexed by a parameter $\theta$. (For example, for the linear model with known variance defined in (2.4), we have $\Theta = \mathbb{R}^k$.) Given a prior distribution $w(\theta)$ on our parameter space $\Theta$, the probability we assign to the observation $y$ is

$$\bar p_{\text{Bayes}}(y) := \int_{\theta\in\Theta} p_\theta(y)\, w(\theta)\, d\theta. \tag{2.12}$$

We call this the prior predictive distribution, as it is our prediction about the data before any observations have been made, based on our prior distribution.

If we have already observed $y$ and want to update our view of which parameter values are likely, we can transform our prior into a posterior by

$$w(\theta \mid y) = \frac{p_\theta(y)\, w(\theta)}{\bar p_{\text{Bayes}}(y)}, \tag{2.13}$$

which is an application of Bayes' theorem [5].

Similar to how we update our degree of belief in the parameter values, we can update our view of the probabilities of future observations $y_2$ given past observations $y_1$:

$$\begin{aligned}
\bar p_{\text{Bayes}}(y_2 \mid y_1) &= \frac{\bar p_{\text{Bayes}}(y_1, y_2)}{\bar p_{\text{Bayes}}(y_1)} = \frac{\int_{\theta\in\Theta} p_\theta(y_2)\, p_\theta(y_1)\, w(\theta)\, d\theta}{\bar p_{\text{Bayes}}(y_1)} \\
&= \frac{\int_{\theta\in\Theta} p_\theta(y_2)\, w(\theta \mid y_1)\, \bar p_{\text{Bayes}}(y_1)\, d\theta}{\bar p_{\text{Bayes}}(y_1)} = \int_{\theta\in\Theta} p_\theta(y_2)\, w(\theta \mid y_1)\, d\theta \tag{2.14}
\end{aligned}$$

[14, (2.36)]. This is called the (posterior) predictive distribution. Note that this formula has the same form as (2.12), but with the prior replaced by the posterior.

Imagine that the true function is drawn at random according to our prior distribution before the experiment starts. Then this predictive distribution accurately expresses how further data will be distributed given our knowledge. As a result, the expected amount of additional logarithmic loss we incur by predicting with some distribution $q$ rather than $p(y_2) := \bar p_{\text{Bayes}}(y_2 \mid y_1)$ equals

$$\mathrm{E}_{Y\sim p}[-\log q(Y)] - \mathrm{E}_{Y\sim p}[-\log p(Y)] = D(p \,\|\, q) \ge 0, \tag{2.15}$$

where $D(p \,\|\, q)$ is the Kullback-Leibler (or KL) divergence from $p$ to $q$, which is nonnegative by the information inequality [7, Theorem 2.6.3]. This shows that no distribution $q$ can expect to do better than the Bayesian predictive distribution.
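For univariate Gaussians, the expected excess loss (2.15) has a standard closed form, which the following sketch (ours) evaluates:

```python
import math

# D(N(m1, v1) || N(m2, v2)): nonnegative, and zero iff the two agree.
def kl_gauss(m1, v1, m2, v2):
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0: predicting with p itself
print(kl_gauss(0.0, 1.0, 0.5, 2.0))   # > 0: any other q does worse on average
```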

Of course, the true function is not usually drawn from a distribution, and if it were, we would rarely know that distribution. Then the above optimality result does not hold, but if the true function looks sufficiently like it could have been drawn from our chosen prior, then we still expect to do well.

2.3.2 The Bayesian predictive distribution for M(k)

To see what forms these distributions take in the models we are interested in, we first have to decide on a prior on $\mu$ and, in the case of $\mathcal{S}^{(k)}$, a prior on $\sigma^2$. In the models of the $\mathcal{M}^{(k)}$ family, where the variance $\sigma^{*2}$ is known, it will be especially convenient to use a Gaussian prior with some mean $\mu_0$ and positive definite covariance matrix $\sigma^{*2}\Sigma_0$ for the parameter $\mu$. More details about the choice of prior will be discussed in Section 2.4. The requirement of a positive definite covariance matrix implies that the prior must assign positive density to all parameter vectors. This is not a restriction in practice, because if we only want a linear subspace of our parameter space to have positive prior probability, the right way to accomplish this is by a different choice of basis functions.

From now on, we will no longer explicitly mention $u^n$ in our formulae; in the context of regression, the relevant values may always be considered known.

Equation (2.12) essentially expresses that to sample a sequence of observations $Y^n$ according to $\bar p_{\text{Bayes}}$, we may first sample $\mu$ from $w$, then $Y^n$ from $p_{\mu,\sigma^{*2}}$. Because both distributions are Gaussian, the resulting distribution will again be Gaussian. For random variables $\mu \sim \mathcal{N}(\mu_0, \sigma^{*2}\Sigma_0)$ and $Z^n \sim \mathcal{N}(0, \sigma^{*2} I_n)$ and the linear transformation $Y^n = X_n\mu + Z^n$, this gives

$$Y^n \sim \mathcal{N}\bigl(X_n\mu_0,\ \sigma^{*2}(I_n + X_n\Sigma_0 X_n^T)\bigr). \tag{2.16}$$

Because $\Sigma_0$ is positive definite, its inverse exists and is also positive definite. Any positive semidefinite matrix is a Gram matrix, meaning that it can be expressed as $A^T A$ for some $k \times k$ matrix $A$ [6, Fact 8.9.37]. Hence we can write $(\sigma^{*2}\Sigma_0)^{-1} = \sigma^{*-2} X_0^T X_0$ for some $k \times k$ matrix $X_0$. This inverse of the covariance matrix is called the precision matrix, and will be useful as a parameter of our normal distributions.

Using this $X_0$, we define $y_0 := X_0\mu_0$. We find using (2.13) that the posterior after observing $n$ data points $y^n$ is of the form

$$w(\mu \mid y^n) \propto p_{\mu,\sigma^{*2}}(y^n)\, w(\mu),$$

where $p_{\mu,\sigma^{*2}}(y^n)$ is given by (2.6) and

$$\begin{aligned}
w(\mu) &\propto \exp\left(-\tfrac{1}{2}(\mu - \mu_0)^T(\sigma^{*2}\Sigma_0)^{-1}(\mu - \mu_0)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(\mu - \mu_0)^T X_0^T X_0 (\mu - \mu_0)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(X_0\mu - y_0)^T(X_0\mu - y_0)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}\,\mathrm{SSE}(\mu, y_0)\right).
\end{aligned}$$

Putting the two back together, we get

$$w(\mu \mid y^n) \propto \exp\left(-\tfrac{1}{2\sigma^{*2}}\bigl(\mathrm{SSE}(\mu, y^n) + \mathrm{SSE}(\mu, y_0)\bigr)\right), \tag{2.17}$$

which is again Gaussian. The maximum and thus also the mean of this density are given by

$$\hat\mu_n = (X_0^T X_0 + X_n^T X_n)^{-1}(X_0^T y_0 + X_n^T y^n) \tag{2.18}$$

as in (2.7). (We use the same notation $\hat\mu_n$ as we did for the maximum likelihood estimator. The reason for this will become apparent in Section 2.4, but whether we mean the maximum likelihood estimator or the posterior mode with respect to some prior will always be clear from context.) By first writing

$$X := \begin{pmatrix} X_0 \\ X_n \end{pmatrix}, \qquad y := \begin{pmatrix} y_0 \\ y^n \end{pmatrix}, \qquad \hat y := X\hat\mu_n = X(X^T X)^{-1} X^T y,$$

and then multiplying the posterior by a factor that does not depend on $\mu$, it becomes

$$\begin{aligned}
w(\mu \mid y^n) &\propto \exp\left(-\tfrac{1}{2\sigma^{*2}}(y - X\mu)^T(y - X\mu)\right) \\
&\propto \exp\left(-\tfrac{1}{2\sigma^{*2}}\bigl((y - X\mu)^T(y - X\mu) - (y^T y - \hat y^T\hat y)\bigr)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(\hat y - X\mu)^T(\hat y - X\mu)\right) \\
&= \exp\left(-\tfrac{1}{2\sigma^{*2}}(\hat\mu_n - \mu)^T X^T X (\hat\mu_n - \mu)\right), \tag{2.19}
\end{aligned}$$

which shows that the precision matrix is $\sigma^{*-2} X^T X = \sigma^{*-2}(\Sigma_0^{-1} + X_n^T X_n)$.

The predictive distribution given $n$ data points is now easily obtained by taking (2.16) and plugging in the posterior in place of the prior as we saw in (2.14), yielding

$$Y_{n+1} \mid y^n \sim \mathcal{N}\bigl(x_{n+1}^T\hat\mu_n,\ \sigma^{*2}(1 + x_{n+1}^T(X^T X)^{-1}x_{n+1})\bigr) \tag{2.20}$$

for the prediction of a single data point $n+1$.

Another result of the above derivation is that when observing more data points, we can incorporate them by adding matrices of the form $\sigma^{*-2} X_0^T X_0$ to our precision matrix, and updating $\hat\mu_n$ using (2.18).
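Putting (2.18) and (2.20) together, a sketch of the predictive distribution for $\mathcal{M}^{(k)}$ might look as follows (function and parameter names are ours):

```python
import numpy as np

# Predictive mean and variance (2.20) for M(k), with known noise variance
# sigma2 and Gaussian prior N(mu0, sigma2 * Sigma0) on mu.
def predictive_Mk(Xn, yn, xnew, mu0, Sigma0, sigma2):
    prec0 = np.linalg.inv(Sigma0)             # equals X_0^T X_0 in the text
    A = prec0 + Xn.T @ Xn                     # X^T X in the derivation
    mu_hat = np.linalg.solve(A, prec0 @ mu0 + Xn.T @ yn)    # (2.18)
    mean = xnew @ mu_hat
    var = sigma2 * (1.0 + xnew @ np.linalg.solve(A, xnew))  # (2.20)
    return mean, var
```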

2.3.3 The Bayesian predictive distribution for S(k)

For the model family $\mathcal{S}^{(k)}$ we will use priors of the form $w(\mu, \sigma) = w(\mu \mid \sigma)\, w(\sigma)$, where $w(\mu \mid \sigma)$ is Gaussian as in the previous section, and where $w(\sigma) \propto \sigma^{-\alpha-1}\exp(-\beta/2\sigma^2)$. For $\alpha > 0$, $\beta > 0$, we have

$$\int_0^\infty \sigma^{-\alpha-1}\exp\left(-\frac{\beta}{2\sigma^2}\right) d\sigma = \frac{1}{2}\left(\frac{\beta}{2}\right)^{-\alpha/2}\Gamma\left(\frac{\alpha}{2}\right),$$

so that $w(\sigma)$ defines a probability distribution, known as the square root inverted gamma distribution (because $\sigma$ is the inverted square root of the precision $\sigma^{-2}$, which follows a gamma distribution). The combined prior $w(\mu, \sigma)$ will be seen to be a convenient choice for similar reasons as our choice of a multivariate normal prior in the previous section [5, Example 5.6].

Given a prior of this form with mean $\mu_0$ and precision matrix $(\sigma^2\Sigma_0)^{-1} = \sigma^{-2}X_0^T X_0$ for $w(\mu \mid \sigma)$, and $\alpha = \alpha_0 > 0$, $\beta = \beta_0 > 0$ for $w(\sigma)$, the prior predictive distribution is

$$\begin{aligned}
\bar p_{\text{Bayes}}(y^n) &= \int_0^\infty \int_{\mathbb{R}^k} p_{\mu,\sigma^2}(y^n)\, w(\mu \mid \sigma)\, d\mu\; w(\sigma)\, d\sigma \\
&= \int_0^\infty \bar p_{\text{Bayes}}(y^n \mid \sigma)\, w(\sigma)\, d\sigma \\
&\propto \int_0^\infty \sigma^{-n}\exp\left(-\frac{R}{2\sigma^2}\right) \cdot \sigma^{-\alpha_0-1}\exp\left(-\frac{\beta_0}{2\sigma^2}\right) d\sigma \\
&= \int_0^\infty \sigma^{-n-\alpha_0-1}\exp\left(-\frac{R+\beta_0}{2\sigma^2}\right) d\sigma \\
&\propto \left(\tfrac{1}{2}(R+\beta_0)\right)^{-(n+\alpha_0)/2} \propto (R+\beta_0)^{-(n+\alpha_0)/2}, \tag{2.21}
\end{aligned}$$

where $R = (y^n - X_n\mu_0)^T(I_n + X_n\Sigma_0 X_n^T)^{-1}(y^n - X_n\mu_0)$, the factor occurring in the density function for (2.16). We see that this distribution is multivariate Student with mean $X_n\mu_0$, $\alpha_0$ degrees of freedom, and precision matrix $\frac{\alpha_0}{\beta_0}(I_n + X_n\Sigma_0 X_n^T)^{-1}$. (This last parameter is often called "precision matrix" even though it is not exactly the inverse of the distribution's covariance matrix [5].) Note that Student's t distribution also appears in the frequentist approach to the problem of estimating the mean of a Gaussian whose variance is unknown [26].
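Since (2.21) is a multivariate Student distribution, its log-density can be evaluated directly, for instance with scipy's multivariate_t (available in scipy ≥ 1.6). The sketch below is ours; the parameter names are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_t

# Log prior predictive density (2.21) for S(k): Student with mean X_n mu0,
# alpha0 degrees of freedom, and scale matrix (beta0/alpha0)(I + X Sigma0 X^T).
def log_prior_predictive_Sk(Xn, yn, mu0, Sigma0, alpha0, beta0):
    n = len(yn)
    shape = (beta0 / alpha0) * (np.eye(n) + Xn @ Sigma0 @ Xn.T)
    return multivariate_t(loc=Xn @ mu0, shape=shape, df=alpha0).logpdf(yn)

# The predictive density of y_{n+1} given y^n then follows from the ratio
# p(y^{n+1}) / p(y^n), as in (2.14).
```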


To interpret $R$, we define for any data $y_0$ with least squares solution $\mu_0$:

$$y := \begin{pmatrix} y_0 \\ y^n \end{pmatrix}, \qquad y' := \begin{pmatrix} X_0\mu_0 \\ y^n \end{pmatrix}.$$

Then by [14, (12.58)], $R = \mathrm{RSS}(y')$. Because $y_0$ and $X_0\mu_0$ have the same least squares solution $\mu_0$, by part 3 of the composition lemma [14, 12.2], $y$ and $y'$ will have the same least squares solution $\hat\mu_n$, and by part 4 of that lemma, the amount by which the residual sum of squares increases as a result of adding observations $y^n$ is the same in either case. Because $\mathrm{RSS}(X_0\mu_0) = 0$, we see that $R$ equals this amount.

We now set out to compute the posterior after observing data $X_n, y^n$. If $X_n$ is of full column rank, let $\mu_0' = (X_n^T X_n)^{-1} X_n^T y^n$, the least squares estimator for the new data only; otherwise, choose any $\mu_0'$ such that $X_n\mu_0' = y^n$ (this system is underdetermined in this case). In either case, we have $X_n^T X_n \mu_0' = X_n^T y^n$. Furthermore, we will need that for $\hat\mu_n = (X^T X)^{-1}(X_n^T X_n \mu_0' + X_0^T X_0\mu_0)$,

$$\begin{aligned}
(\mu - \hat\mu_n)^T X^T X (\mu - \hat\mu_n) &= \mu^T(X_n^T X_n + X_0^T X_0)\mu - 2\mu^T(X_n^T X_n\mu_0' + X_0^T X_0\mu_0) \\
&\quad + (\mu_0'^T X_n^T X_n + \mu_0^T X_0^T X_0)(X^T X)^{-1}(X_n^T X_n\mu_0' + X_0^T X_0\mu_0),
\end{aligned}$$

and for the last term occurring there,

$$\begin{aligned}
&(\mu_0'^T X_n^T X_n + \mu_0^T X_0^T X_0)(X^T X)^{-1}(X_n^T X_n\mu_0' + X_0^T X_0\mu_0) \\
&\qquad = \bigl((\mu_0' - \mu_0)^T X_n^T X_n + \mu_0^T X^T X\bigr)(X^T X)^{-1}\bigl(X_0^T X_0(\mu_0 - \mu_0') + X^T X\mu_0'\bigr) \\
&\qquad = \mu_0^T X^T X\mu_0' + (\mu_0' - \mu_0)^T X_n^T X_n\mu_0' + \mu_0^T X_0^T X_0(\mu_0 - \mu_0') \\
&\qquad\qquad - (\mu_0' - \mu_0)^T X_n^T X_n (X^T X)^{-1} X_0^T X_0 (\mu_0' - \mu_0) \\
&\qquad = \mu_0'^T X_n^T X_n\mu_0' + \mu_0^T X_0^T X_0\mu_0 - (\mu_0' - \mu_0)^T X_n^T X_n (X^T X)^{-1} X_0^T X_0 (\mu_0' - \mu_0).
\end{aligned}$$
