Investigating the effect of restricting residuals in bootstrap based hypothesis testing

T Nombebe
orcid.org/0000-0002-5940-9738

Dissertation submitted in partial fulfilment of the requirements for the Masters degree in Statistics at the North-West University

Supervisor: Prof L Santana
Co-supervisor: Prof JS Allison

Graduation: May 2018
23800364

Investigating the effect of restricting residuals in bootstrap based hypothesis testing

Thobeka Nombebe
23800364

School for Computer, Statistical and Mathematical Sciences
North-West University, Potchefstroom campus

Keywords: Bootstrap, Inference, Linear Regression, Location-scale model, Residuals-based bootstrap.

Acknowledgements

I would like to take this opportunity to acknowledge and thank those who made this work possible:

• To Almighty God for giving me the strength.

• My supervisor, Professor Leonard Santana, for coming up with such an interesting project, for his guidance, great support and kind advice throughout the research project, for giving me great insights into bootstrap hypothesis testing, for assistance with LaTeX and R code, as well as for his painstaking efforts in proof-reading the drafts. It was a great privilege and honour for me to share not only his exceptional knowledge of Statistics, but also his extraordinary human qualities.

• I would also like to thank my co-supervisor, Professor James Allison, for his constant support, availability and constructive suggestions, which were decisive for the accomplishment of the work presented in this dissertation.

• I would like to thank my close friend Lwandiso Dyubeni for supporting me and keeping me motivated throughout this work, and my colleagues for their everlasting encouragement and support. Last, but not least, I would like to express my sincere gratitude to my family for their unconditional support, encouragement, love and patience throughout these years.


Abstract

In this dissertation, model-based bootstrap methods for testing hypotheses about the population mean, the population variance and regression model coefficient parameters are discussed by considering eight different ways to approach these tests. The approaches differ in how each of the three stages of the model-based bootstrap hypothesis testing procedure is conducted:

1. In the first stage, residuals are calculated. Here, the residuals can be obtained from either the unrestricted model (with the null hypothesis not enforced) or the restricted model (with the null hypothesis enforced).

2. In the second stage, the bootstrap data are generated. These values can be generated from either the unrestricted model or the restricted model.

3. In the final stage, the pivotal test statistic is calculated using either a sample estimate of the parameter of interest or the actual hypothesised value of the parameter.

This results in eight different combinations of approaches. The aims of this dissertation are to determine which approaches are ‘correct’ and which are not, as well as the gains and losses associated with using certain ‘correct’ and ‘incorrect’ approaches. Additionally, it is also of interest to find out which approaches produce the same results. This problem is first tackled by deriving theoretical expressions for the bootstrap test statistics for all eight approaches so that it might be clear how these approaches behave in general. Then a Monte Carlo study is conducted to approximate the size and power of the tests for each of the eight different approaches.

In this study it is found that the only two approaches that work well for all three test scenarios considered are:

• The approach where one resamples from unrestricted residuals, generates bootstrap data from the restricted model, and uses the hypothesised value in the test statistic, and

• the approach where one resamples from unrestricted residuals, generates bootstrap data from the unrestricted model and uses the estimated parameter value in the test statistic.

The bootstrap test statistics for these two approaches are equivalent for testing both the mean and the variance in the location-scale model and for testing the coefficient parameter in the simple linear regression model. The simulation studies corroborate these results, where it is found that these approaches always produce adequate sizes and powers for all simulation settings considered. For each of the two models a few other approaches were found to perform reasonably well, but those approaches differed between the models and did not produce noticeably higher powers or improvements in the size of the tests. The two tests mentioned are therefore the only two recommended approaches for conducting residual-based bootstrap hypothesis testing.


Contents

1 Introduction 1

2 Linear regression models 3

2.1 Introduction . . . 3

2.1.1 History of regression . . . 3

2.1.2 Basic applications of regression analysis . . . 4

2.2 Regression models . . . 4

2.2.1 The simple linear regression . . . 4

2.2.2 Multiple regression . . . 6

2.3 Parameter estimation and properties of estimates . . . 7

2.3.1 Ordinary Least Squares Estimation (OLS) . . . 7

2.3.2 Estimating σ² . . . 9

2.4 Inferences about regression parameters . . . 10

3 The Bootstrap 11

3.1 An introduction to the Bootstrap . . . 11

3.2 Notation . . . 12

3.3 The bootstrap and the plug-in principle . . . 13

3.3.1 Application of the bootstrap to bias and standard error estimation . . . 14

3.4 Different types of bootstrap resampling methods . . . 15

3.4.1 The parametric bootstrap . . . 15

3.4.2 The nonparametric bootstrap . . . 15

3.4.3 The semi-parametric bootstrap . . . 16

3.5 Bootstrap regression . . . 16

3.5.1 Case resampling . . . 16

3.5.2 Residual resampling . . . 17

3.6 Bootstrap hypothesis testing . . . 19

3.6.1 General framework . . . 22

4 Bootstrap model-based hypothesis testing 26

4.1 General bootstrap model-based hypothesis testing . . . 27

4.1.1 The general model for the model-based bootstrap hypothesis testing . . . 27

4.1.2 Examples of models . . . 28

4.1.3 Literature review based on the simple linear regression model . . . 29

4.2 Testing for the mean using the location-scale model . . . 32

4.2.1 The algorithm to approximate the p-value . . . 35

4.2.2 Derivations of test statistics and equivalent forms . . . 37

4.3 Testing for variance using the location-scale model . . . 39

4.3.1 The algorithm to approximate the p-value . . . 42

4.3.2 Derivations of test statistics and equivalent forms . . . 44

4.4 Simple linear regression . . . 48

4.4.1 The algorithm to approximate the p-value . . . 53

4.4.2 Derivations of test statistics and equivalent forms . . . 55

4.5 Conclusions regarding tests for the mean, variance and simple linear regression . . . 62

5 Simulation Study and Results 63

5.1 Results of the analysis . . . 63

5.1.1 Simulation settings . . . 63

5.1.2 Sizes of the tests . . . 64

5.1.3 Power of the tests . . . 70

5.2 Discussion . . . 91


List of Figures

5.1 Powers of the test for the mean (Distribution=1). . . 71

5.2 Powers of the test for the mean (Distribution=2). . . 71

5.3 Powers of the test for the mean (Distribution=3). . . 72

5.4 Powers of the test for the mean (Distribution=4). . . 72

5.5 Powers of the test for the mean (Distribution=5). . . 73

5.6 Powers of the test for the mean (Distribution=6). . . 73

5.7 Powers of the test for the mean (Distribution=7). . . 74

5.8 Powers of the test for the mean (Distribution=8). . . 74

5.9 Powers of the test for the mean (Distribution=9). . . 75

5.10 Powers of the test for the variance using proper test (Distribution=1). . . 76

5.11 Powers of the test for the variance using proper test (Distribution=2). . . 76

5.12 Powers of the test for the variance using proper test (Distribution=3). . . 77

5.13 Powers of the test for the variance using proper test (Distribution=4). . . 77

5.14 Powers of the test for the variance using proper test (Distribution=5). . . 78

5.15 Powers of the test for the variance using proper test (Distribution=6). . . 78

5.16 Powers of the test for the variance using proper test (Distribution=7). . . 79

5.17 Powers of the test for the variance using proper test (Distribution=8). . . 79

5.18 Powers of the test for the variance using proper test (Distribution=9). . . 80

5.19 Powers of the test for the variance using alternative test (Distribution=1). . . 81

5.20 Powers of the test for the variance using alternative test (Distribution=2). . . 81

5.21 Powers of the test for the variance using alternative test (Distribution=3). . . 82

5.22 Powers of the test for the variance using alternative test (Distribution=4). . . 82

5.23 Powers of the test for the variance using alternative test (Distribution=5). . . 83

5.24 Powers of the test for the variance using alternative test (Distribution=6). . . 83

5.25 Powers of the test for the variance using alternative test (Distribution=7). . . 84

5.26 Powers of the test for the variance using alternative test (Distribution=8). . . 84

5.27 Powers of the test for the variance using alternative test (Distribution=9). . . 85

5.28 Powers of the test for the simple linear regression using alternative test (Distribution=1). . . 86

5.29 Powers of the test for the simple linear regression using alternative test (Distribution=2). . . 86

5.30 Powers of the test for the simple linear regression using alternative test (Distribution=3). . . 87

5.31 Powers of the test for the simple linear regression using alternative test (Distribution=4). . . 87

5.32 Powers of the test for the simple linear regression using alternative test (Distribution=5). . . 88

5.33 Powers of the test for the simple linear regression using alternative test (Distribution=6). . . 88

5.34 Powers of the test for the simple linear regression using alternative test (Distribution=7). . . 89

5.35 Powers of the test for the simple linear regression using alternative test (Distribution=8). . . 89

5.36 Powers of the test for the simple linear regression using alternative test (Distribution=9). . . 90

List of Tables

5.1 Error distributions used in the simulation studies. . . 65

5.2 Estimated sizes of the test for the parameter H0 : µ = 0 using the 8 different model based bootstrap approaches to hypothesis testing (α = 0.05). . . 66

5.3 Estimated sizes of the test for the parameter H0 : σ² = 1 (version 1) using the 8 different model based bootstrap approaches to hypothesis testing (α = 0.05). . . 67

5.4 Estimated sizes of the test for the parameter H0 : σ² = 1 (version 2) using the 8 different model based bootstrap approaches to hypothesis testing (α = 0.05). . . 68

5.5 Estimated sizes of the test for the parameter H0 : β1 = 0 in simple linear regression using the 8 different model based bootstrap approaches to hypothesis testing (α = 0.05). . . 69

Chapter 1

Introduction

Conducting inference for the parameters in a statistical model by making use of the bootstrap has been studied in a number of publications in the literature; see, for example, Efron and Tibshirani (1993), Shao and Tu (1995), Davison and Hinkley (1997) and Chernick (2008), among others. A key benefit of using the bootstrap over classical theory is that classical statistical inference in, for example, the location-scale and linear regression models typically requires an assumption on the distribution of the errors in the model. For the linear regression model, the error terms are typically assumed to be normally distributed with mean zero and constant variance. When these assumptions hold, the traditional tests for linear regression models are appropriate for the data, and the least squares theory of regression and the resulting inference will be reliable (Davison and Hinkley, 1997). However, when these assumptions do not hold, it is still possible to conduct inference using bootstrap techniques. Several authors have studied the application of the bootstrap in linear regression models, such as Freedman (1981), Freedman and Peters (1984), and Wu (1986). In general, the bootstrap can also be employed to conduct inference for the parameters of other models with no (or few) distributional assumptions made on the underlying data; see, for example, Efron and Tibshirani (1993).

In this study, interest lies in testing hypotheses for the population mean and variance parameters of the location-scale model, i.e., when the data X_i are assumed to be generated independently and identically from a process with mean µ and variance σ², and also for the coefficient parameters in linear regression models. The errors of these models are assumed to be independent, to have an expected value of zero, and to have constant variance. More specifically, the model based bootstrap, or residual based bootstrap, method of conducting inference in these models will be studied in this dissertation. One drawback of this method is that it is not entirely non-parametric, i.e., the use of this technique requires that the form of the model used to generate the data be known, or at least well estimated by the model proposed. The validity of the method in both of the statistical models that will be investigated (i.e., the location-scale and linear regression models) depends on how well the assumptions of the model are satisfied. For example, in the linear regression model, if the errors are assumed to be independent and identically distributed (i.i.d.), then conducting the bootstrap by sampling the residuals is a valid technique. However, if the errors are heteroskedastic, or display some other deviations from these distributional assumptions, then the residual based bootstrap procedure cannot be used. The residual bootstrap technique involves resampling the sample residuals to ultimately construct the bootstrap version of the response variable, after which it is possible to use the bootstrap to estimate the null distribution of various test statistics. The typical approaches to doing this involve, among other things, the calculation of sample residuals and the generation of bootstrap sample data. However, the uninitiated can occasionally implement these approaches incorrectly, without realising it.

The idea of correctly conducting inference with the bootstrap has been extensively studied and, with reference to the literature on bootstrap based hypothesis testing, the following rules for suitable application of the bootstrap hypothesis testing methodology are suggested by Hall and Wilson (1991):

• The first rule states that, when one wants to estimate the null distribution of a test statistic, resampling must be conducted in a way that ‘imitates’ the null hypothesis. This must happen even if the data were generated from a distribution specified by the alternative. This rule is essential for bootstrap hypothesis testing and has also been emphasised by Beran (1988), Fisher and Hall (1990), Westfall and Young (1993), and Martin (2007).

• In their second rule, Hall and Wilson advocated conducting bootstrap hypothesis tests by basing these tests on test statistics that are (at least) asymptotically pivotal. The benefits of using asymptotically pivotal test statistics are discussed in Beran (1988) and Hall (1992), among others.

However, even given these rules for appropriate application of the bootstrap hypothesis testing methodology, there appears to be some mild disagreement on how they are to be implemented when one tests for the coefficient parameters in a regression model. Some authors prefer to resample residuals determined from a restricted model, that is, where the null hypothesis is imposed when computing the residuals; see, for example, Nankervis and Savin (1996), Park (2003), Swensen (2003) and Davidson and MacKinnon (2010). Others recommend the use of the unrestricted residuals, including Li and Maddala (1996), MacKinnon (2006) and Martin (2007). Due to this uncertainty regarding the correct choices in the literature, this dissertation attempts to provide conclusive answers to the questions relating to the effects that the choices made when resampling have on the power properties of these simple tests. In particular, the study will provide an indication of which method performs best under different conditions, such as differing sample sizes and differing error distributions, of whether the residuals should be restricted or unrestricted under the null hypothesis when conducting these tests, and of the extent of the loss of power when incorrect methods are used.

The rest of this dissertation is organized as follows: Chapter 2 explores the history of linear regression and classical linear regression models, together with the assumptions on which they are based. The statistical properties of the ordinary least squares (OLS) estimators are also explained. An introduction to the bootstrap, bootstrap regression and bootstrap hypothesis testing in regression models, together with the algorithms and methodology used to calculate the different bootstrap statistics, is given in Chapter 3. Chapter 4 investigates the different bootstrap approaches, providing algorithms for generating the bootstrap data and for calculating the size and power of the tests for the eight different methods. The different bootstrap approaches involve either restricting (imposing the null hypothesis) or not restricting (not imposing the null hypothesis) the model when generating the bootstrap data, combined with either restricting (parameters estimated and residuals calculated from the restricted model) or not restricting (residuals calculated from the unrestricted model) the residuals, combined with two versions of the bootstrap test statistic: the first version uses the estimated value of the parameter and the second version uses the hypothesised value. These different bootstrap approaches are discussed for testing hypotheses about the population mean and the population variance using the location-scale model, and also for testing the hypothesis about the slope parameter in a simple linear regression model. This is followed by Chapter 5, where the results obtained through simulations are provided and discussed. Finally, Chapter 6 contains the conclusion, where the detailed findings from the results are discussed along with findings from similar studies done previously. Some recommendations for future research, which can be developed depending on the findings of this dissertation, are also provided in this chapter.

Chapter 2

Linear regression models

2.1 Introduction

Regression analysis is an important tool used to explain and estimate the relationship among variables. It can assist in the understanding of how the value of a dependent variable changes when any one of the independent variables is changed while the others are held fixed (Weisberg, 2005). In this dissertation bootstrap inference for the parameters of the regression model is discussed, and so an overview of the history of linear regression, classical linear regression and its model assumptions is important. The discussion will start with simple linear regression models and later expand to multiple linear regression models. The statistical properties of the ordinary least squares (OLS) estimators, as well as the related statistical procedures, are also provided.

2.1.1 History of regression

According to Pearson (1930), regression problems were first considered in the 19th century by Sir Francis Galton. Galton was a pioneer in the application of statistical methods to measurements in many branches of science and in studying data on the relative sizes of parents and their offspring in various species of plants and animals. Galton's experiment with sweet peas in 1875 led to the development of the initial concept of linear regression. Sweet peas can 'self-fertilise', i.e., daughter plants express genetic variations from the mother plant without contribution from a second parent. Galton collected data and plotted the masses of daughter seeds against the masses of mother seeds, hand fitted a line to the data, and found that the means of the daughter seed masses within each column (i.e., for each mother seed mass) fell approximately on a straight line. He noticed that mother plants with seeds of a given size tended to have daughter seeds of roughly similar sizes, but that the offspring did not tend to match their parent's seed size, being instead 'mediocre' when compared to the parent, and he thought of this as a kind of 'regression toward mediocrity'. He therefore called this phenomenon 'regression to the mean'. In humans, Galton found that the height of the children of unusually tall or unusually short parents tends to move toward the average height of the population. Galton's law of universal regression was confirmed by his friend Karl Pearson, using more than a thousand records of heights of members of family groups. He found that the average height of sons of a group of tall fathers was less than their fathers' heights and the average height of sons of a group of short fathers was greater than their fathers' heights, thus regressing tall and short sons alike toward the average height of all men. Again, to Galton, this was 'regression to mediocrity'.

2.1.2 Basic applications of regression analysis

Applications of regression analysis exist in many scientific fields, including business, economics, engineering, medicine, biology, agriculture, geology and geography. In business, regression analysis is used mainly for forecasting and the optimization of business processes. It is also used in decision making about a number of business issues, to generate insights into consumer behaviour, and to understand business conditions and the factors influencing profitability. It can also be used to evaluate trends and make estimates or forecasts; for example, business managers can use regression analysis to predict future demand for their products using the historical sales data available to them. In marketing, regression analysis is used to predict how the relationship between two variables, such as advertising and sales, can develop over time.

Regression analysis can also be used to answer questions in demography¹, environmental statistics², climatology³, the study of plant and animal populations, geo-statistics⁴, population ecology⁵, psychometrics⁶, and quantitative psychology⁷.

2.2 Regression models

In this section, the simple linear regression model is discussed together with the assumptions associated with the method. The method will later be extended to include more independent variables, i.e., the multiple regression model. Simple linear regression is the most commonly used technique for determining how one variable, the response variable, usually denoted by Y, is affected by changes in another variable, the explanatory variable, denoted by X. Even though the relationship between the dependent and independent variables cannot be determined exactly by regression analysis, the estimated regression function often comes close to the true relationship. The determination of the regression function has two stages: in the first stage, the type of relationship between the variables is determined, and in the second stage the regression function itself is determined. The relationship between the independent and dependent variables can be linear, curvilinear, polynomial, quadratic or logarithmic. If the relationship between the dependent variable and the independent variable is linear, then the process of determining the relation between X and Y, which involves fitting a straight line through the data, is called simple linear regression.

2.2.1 The simple linear regression

The simple linear regression model can be expressed by the following equation:

E(Y | X = x) = β_0 + β_1 x.    (2.1)

The parameters in the mean function are the intercept β_0, which is the value of E(Y | X = x) when x equals zero, and the slope β_1, which is the rate of change in E(Y | X = x) for a unit change in x. For any given value of X, the distribution of Y is centered about E(Y | X = x), meaning that for each given value of X one has to sample Y repeatedly to obtain the distribution of the Y values at each of the chosen values of X.

¹ Demography is the statistical study of all populations.
² Environmental statistics is the application of statistical methods to environmental science.
³ Climatology includes weather, climate, air and water quality analysis.
⁴ Geo-statistics is a branch of geography that deals with the analysis of data from disciplines such as petroleum geology, hydrogeology, meteorology, oceanography, and geochemistry.
⁵ Population ecology is a sub-field of ecology that deals with the dynamics of species populations and how these populations interact with the environment.
⁶ Psychometrics is the theory and technique of educational and psychological measurement of knowledge, abilities, attitudes, and personality traits.

The model given in (2.1) is also known as the population regression function (PRF), and represents the true relationship between the variables. It is thus understood that it generates the actual data and can be referred to as the data generating process (DGP) for the simple linear regression model. The PRF is fixed, but unknown. The parameters β_0 and β_1 are unknown and must be estimated using data. The typical experiment for the simple linear regression is that we observe n pairs of data (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) from a scientific experiment, and the model in terms of the n pairs of data can be written as:

Y_i = β_0 + β_1 X_i + ε_i,  i = 1, 2, ..., n,

where Y_i is the value of the response variable in the ith trial, β_0 and β_1 are parameters, the X_i are n observations assumed to be fixed (under the control of a researcher), and ε_i is a random error term with mean E(ε_i) = 0 and variance Var(ε_i) = σ². The deviation of an observation Y_i from its population mean is taken into account by adding a random error ε_i; these errors are unobservable quantities. However, these errors can be estimated with the use of sample residuals, which are the differences between the actual point Y_i and its fitted value Ŷ_i. The residuals are thus given by e_i = Y_i − (β̂_0 + β̂_1 X_i), i = 1, 2, ..., n, where β̂_0 and β̂_1 are the estimates of the parameters β_0 and β_1. The theory for simple linear regression is well developed and there are several textbooks one can refer to, such as Greene (2003), Gujarati (2004) and Kutner et al. (2005). These texts outline that simple linear regression models are used for three main purposes:

1. To describe the linear dependence of one variable on another.

2. To predict values of one variable from values of another, for which more data are available.

3. To correct for the linear dependence of one variable on another, in order to clarify other features of its variability.

The classical simple linear regression model is based on the following assumptions:

• The regression model is linear in the parameters, i.e., Y_i = β_0 + β_1 X_i + ε_i.

• The values of the independent variable are assumed to be fixed in repeated sampling. This implies that there is no sampling variation in X as its value is determined outside the model.

• E(ε_i | X_i) = 0, i.e., the average value of ε does not depend on X.

• The errors are all linearly independent of one another, i.e., Cov(ε_i, ε_j) = 0 for i ≠ j, meaning that the value of the error for one case gives no information about the value of the error for another case.

• There is no relationship between the error and the corresponding X variable, i.e., Cov(ε_i, X_i) = 0.

• The variance of ε_i is the same for all observations, i.e., Var(ε_1) = Var(ε_2) = · · · = Var(ε_n) = σ². This is known as homoscedasticity.

• The errors are assumed to be normally distributed, i.e., ε_i ∼ N(0, σ²). This assumption is essential when using classical regression inference techniques to make valid inferences about the population parameters.
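To make the data generating process concrete, the following is a minimal illustrative sketch in Python of simulating data from the simple linear regression model under the classical assumptions above; the parameter values, sample size and design points are arbitrary choices for illustration and are not taken from the dissertation.

import numpy as np

rng = np.random.default_rng(seed=1)

n = 50                                 # sample size (illustrative)
beta0, beta1, sigma = 2.0, 0.5, 1.0    # 'true' parameter values (illustrative)
x = np.linspace(0, 10, n)              # fixed design points: no sampling variation in X

# Errors are i.i.d. N(0, sigma^2): mean zero, homoscedastic and independent
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# Data generating process (DGP) of the simple linear regression model
y = beta0 + beta1 * x + eps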

2.2.2 Multiple regression

Multiple regression generalises the simple linear regression model by allowing for many terms in a mean function rather than just one intercept and one slope (Weisberg, 2005). Typically, if one adds more factors to the model that are useful for explaining Y, then more variation in Y can be explained. Thus multiple regression analysis can be used to build better models for predicting the dependent variable (Maddala, 1992). The general multiple linear regression model with response Y and independent variables X_1, ..., X_{p−1} has the form:

Y_i = β_0 X_{i,0} + β_1 X_{i,1} + β_2 X_{i,2} + · · · + β_{p−1} X_{i,p−1} + ε_i.    (2.2)

Note that β_0 is the intercept and β_j, j = 1, 2, ..., p − 1, measures the change in Y with respect to X_j, holding other factors fixed. Since there are p − 1 independent variables and an intercept, the above equation contains p − 1 + 1 = p unknown population parameters. The error term captures the effect of all factors other than X_1, X_2, ..., X_{p−1} that affect Y. In other words, no matter how many explanatory variables are included in the model, there will always be factors that cannot be included, and these are collectively contained in ε. In matrix notation, the multiple linear regression model can be written as follows:

Y(n×1) = X(n×p) β(p×1) + ε(n×1),

where

Y = (Y_1, Y_2, ..., Y_n)',  β = (β_0, β_1, ..., β_{p−1})',  ε = (ε_1, ε_2, ..., ε_n)',

and X is the matrix with rows (1, X_{i1}, X_{i2}, ..., X_{i,p−1}), i = 1, ..., n. Therefore, Y is an n × 1 vector, X is a known n × p matrix including a column of 1's for the intercept (if the intercept is included in the mean function), β is a p × 1 vector of regression coefficients and ε is the n × 1 vector of statistical errors. For emphasis, the assumptions of the classical multiple linear regression model, which are similar to those of simple linear regression, are as follows:

• Linearity: Y_i = β_0 + β_1 X_{i1} + β_2 X_{i2} + · · · + β_{p−1} X_{i,p−1} + ε_i, or in matrix form, Y = Xβ + ε. The model specifies a linear relationship between Y and X.

• Full rank: There is no exact linear relationship between any of the independent variables in the model; in matrix form, X is an n × p matrix with rank p. There are at least p observations and X has full column rank.

• Exogeneity of the independent variables: E(ε_i | X_{j1}, X_{j2}, ..., X_{j,p−1}) = 0. The expected value of the error at observation i in the sample is not a function of the independent variables observed at any observation.

• Homoskedasticity and non-autocorrelation: Each error ε_i has the same finite variance σ² and is uncorrelated with every other error ε_j.

• Normal distribution: The errors are normally distributed.

That is, E(ε) = 0 and Var(ε) = σ²I_n, so that one can write ε ∼ N(0, σ²I_n), where Var(ε) represents the variance/covariance matrix of ε, I_n is the n × n identity matrix with ones on the diagonal and zeroes everywhere else, and 0 is a matrix or vector of zeroes (of appropriate size). In the next section, parameter estimation and the properties of the estimates are discussed together with the relevant results; these two subjects are well known, but are provided for completeness.

2.3 Parameter estimation and properties of estimates

Now that the model is specified and the assumptions of classical linear regression have been stated, the next important step is to fit the model to the data. To do this, estimators of the parameters β have to be determined using the data on Y and X. The most popular method for estimating these parameters is the ordinary least squares method, discussed next.

2.3.1 Ordinary Least Squares Estimation (OLS)

The task is to estimate the parameter vector β using the sample data. The idea is to choose the estimates β̂ such that the distances of the data points to the fitted line are minimised. The fitted regression line is given by

Ŷ = Xβ̂,

where β̂ is the estimated vector of parameters. The residuals are given by:

ê = Y − Xβ̂.

Note that ê is the vertical distance between the estimated regression line and the data points (X_i, Y_i). To obtain the least squares estimate of β one therefore needs to minimise the residual sum of squares by solving the following minimisation problem:

β̂ = arg min_β (Y − Xβ)'(Y − Xβ).

By taking the partial derivative with respect to β, the following expression is obtained:

∂/∂β [(Y − Xβ)'(Y − Xβ)] = ∂/∂β [Y'Y − 2Y'Xβ + β'X'Xβ] = −2X'Y + 2X'Xβ.

The normal equations of the multiple linear regression model can be obtained by setting this derivative equal to zero, i.e., the following expressions are obtained:

−2X'Y + 2X'Xβ = 0
X'Xβ̂ = X'Y.

And, since X'X is assumed to be non-singular, it follows that

β̂ = (X'X)⁻¹X'Y.

If the classical assumptions are satisfied, then the ordinary least squares estimator is the Best Linear Unbiased Estimator (BLUE) among all linear estimators; that is, the variance of the OLS estimator β̂ is minimum among the class of linear unbiased estimators, it is a linear combination of Y, and the expected values of the estimated parameters equal the true values describing the relationship between X and Y. However, even if some of these assumptions (for example, normality of the errors) do not hold, it can still be shown that the OLS estimators are unbiased.
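As a concrete illustration, a minimal sketch of computing the OLS estimate β̂ = (X'X)⁻¹X'Y in Python could look as follows; the data generation step and all names are illustrative assumptions rather than part of the dissertation.

import numpy as np

def ols_estimate(X, y):
    """Compute the OLS estimate (X'X)^(-1) X'y for a given design matrix X."""
    # Solving the normal equations is numerically preferable to forming an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage for a simple linear regression with an intercept column of ones
rng = np.random.default_rng(seed=2)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)
X = np.column_stack([np.ones_like(x), x])    # n x 2 design matrix
beta_hat = ols_estimate(X, y)                # [beta0_hat, beta1_hat]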

Unbiasedness of the OLS estimator

To show that the least squares estimates of β are unbiased, first express β̂ as a function of the vector ε using Y = Xβ + ε, then:

β̂ = (X'X)⁻¹X'Y
   = (X'X)⁻¹X'(Xβ + ε)
   = (X'X)⁻¹(X'X)β + (X'X)⁻¹X'ε
   = β + (X'X)⁻¹X'ε.    (2.3)

Next, taking expectations and using the fact that E(ε) = 0, the following is obtained:

E(β̂) = β + (X'X)⁻¹X'E(ε)
      = β + (X'X)⁻¹X'·0
      = β.

Therefore, β̂ is an unbiased estimator. In addition, the variance of β̂ can be expressed as follows:

Variance of the OLS estimators

The variance/covariance matrix of the OLS estimators is given by

Var(β̂) =
[ Var(β̂_0)            Cov(β̂_0, β̂_1)        · · ·   Cov(β̂_0, β̂_{p−1})   ]
[ Cov(β̂_1, β̂_0)      Var(β̂_1)              · · ·   Cov(β̂_1, β̂_{p−1})   ]
[       ⋮                    ⋮                 ⋱             ⋮            ]
[ Cov(β̂_{p−1}, β̂_0)  Cov(β̂_{p−1}, β̂_1)    · · ·   Var(β̂_{p−1})         ]

with Cov(β̂_i, β̂_j) = Cov(β̂_j, β̂_i). The covariance matrix of β̂ can then be determined using the assumptions that Var(ε_i) = σ² and E(ε_i ε_j) = 0 for i ≠ j, i.e.,

Var(β̂) = E[(β̂ − E(β̂))(β̂ − E(β̂))']
        = E[(β̂ − β)(β̂ − β)']
        = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
        = (X'X)⁻¹X'E(εε')X(X'X)⁻¹
        = (X'X)⁻¹X'(σ²I)X(X'X)⁻¹
        = σ²(X'X)⁻¹,

since β̂ − β = (X'X)⁻¹X'ε, E(β̂) = β and E(εε') = σ²I. Therefore, Var(β̂) = σ²(X'X)⁻¹ is the covariance matrix of the vector β̂.

The σ² that appears in this expression is unknown, and therefore an estimator for Var(β̂) depends on an estimator for σ². Statistical inference on the regression coefficients relies on the estimation of the error variance σ². This quantity is unobservable and has to be estimated using the data at hand.

2.3.2 Estimating σ²

The estimator for the variance σ² can be obtained by using the sum of squared residuals. Consider the residual sum of squares in the multiple regression, e'e, which can be written as a quadratic form of the response vector Y:

e'e = (Y − Xβ̂)'(Y − Xβ̂)
    = Y'[I − X(X'X)⁻¹X']Y.

Note that X(X'X)⁻¹X' is idempotent, and that

tr(X(X'X)⁻¹X') = tr(X'X(X'X)⁻¹) = tr(I_p) = p.

Now, the expected value of this quantity can be derived as follows⁸:

E(e'e) = E[Y'(I − X(X'X)⁻¹X')Y]
       = E(Y)'(I − X(X'X)⁻¹X')E(Y) + σ² tr(I − X(X'X)⁻¹X')
       = (Xβ)'(I − X(X'X)⁻¹X')Xβ + σ² tr(I) − σ² tr(X(X'X)⁻¹X')
       = (Xβ)'(I − X(X'X)⁻¹X')(Xβ) + σ²(n − p)
       = (Xβ)'(Xβ − X(X'X)⁻¹X'Xβ) + σ²(n − p)
       = σ²(n − p).

The parameter σ² can then be written as σ² = E(e'e)/(n − p), and an unbiased estimator for this variance is therefore given by:

S² = e'e/(n − p) = Y'(I − X(X'X)⁻¹X')Y/(n − p) = (1/(n − p)) Σ_{i=1}^{n} (Y_i − Ŷ_i)².

Note that if the Y_i's follow a normal distribution, then the residuals will also be normally distributed, and the scaled residual sum of squares will have a chi-squared distribution with df = n − p degrees of freedom, or, in symbols,

(n − p)S²/σ² ∼ χ²(n − p).    (2.4)

Using this estimated variance, an estimator for Var(β̂) is then simply V̂ar(β̂) = (X'X)⁻¹S².

⁸ Note that if Y is an (n × 1) random vector and A is some (n × n) constant matrix, then E(Y'AY) = tr(AΣ) + µ'Aµ, where µ = E(Y) and Σ = Var(Y).
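Continuing the illustrative OLS sketch from Subsection 2.3.1, the estimators S² and V̂ar(β̂) = (X'X)⁻¹S² might be computed as follows; the function and variable names are assumptions made for the example only.

import numpy as np

def sigma2_hat(X, y, beta_hat):
    """Unbiased estimator S^2 = e'e / (n - p) of the error variance."""
    n, p = X.shape
    resid = y - X @ beta_hat
    return resid @ resid / (n - p)

def var_beta_hat(X, s2):
    """Estimated covariance matrix of the OLS estimator: (X'X)^(-1) * S^2."""
    return s2 * np.linalg.inv(X.T @ X)

# Illustrative usage, assuming X, y and beta_hat as in the earlier sketch:
# s2 = sigma2_hat(X, y, beta_hat)
# cov_beta = var_beta_hat(X, s2)
# se_beta = np.sqrt(np.diag(cov_beta))    # estimated standard errors of the coefficients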


2.4 Inferences about regression parameters

Statistical inference is a process of drawing conclusions about the population based on the results observed through random sampling. There are different types of inferences such as point estimation, interval estimation and hypothesis testing. However, only hypothesis testing will be discussed in this section since it is the main focus of this study.

Once the estimates of β_0 and β_1 have been obtained from the sample (recall the OLS estimates discussed in Subsection 2.3.1) along with their estimated standard errors (see Subsection 2.3.2), and if the model assumptions for the linear regression model are satisfied, it is possible to conduct inferences for these population parameters.

To do this, one first needs to obtain the sampling distribution of β̂_j. This sampling distribution can be derived by noting that if there are two variables X_1 ∼ N(0, 1) and X_2 ∼ χ²_k, and if X_1 and X_2 are independent, then X = X_1/√(X_2/k) has a t-distribution with k degrees of freedom. Now, for the linear regression model note that ε ∼ N(0, σ²I) and β̂ ∼ N(β, σ²(X'X)⁻¹). In addition, since σ² is unknown, it has to be estimated by S² and, as mentioned in (2.4), (n − p)S²/σ² has a χ²_{n−p} distribution. It then follows that when one subtracts the expected value and divides by the estimated standard error of β̂_j,

T_j = (β̂_j − β_j)/ŝe(β̂_j),  j = 0, 1

(i.e., the 'studentised' values), the result has a t-distribution with n − p degrees of freedom. Note that ŝe(β̂_j) is the estimated standard error of β̂_j and is defined as the square root of the jth diagonal element of V̂ar(β̂). This sampling distribution holds in multiple linear regression too, and can be applied to each of the estimated parameters β̂_0, β̂_1, ..., β̂_{p−1}.

Naturally, once the sampling distribution is known, inference regarding the parameters can easily be conducted. Below are the details of the standard test for the coefficient parameters in a linear regression model.

To test the hypothesis

H_0: β_j = β_{j,H_0} versus H_1: β_j ≠ β_{j,H_0},

one needs to assume that ε_i ∼ N(0, σ²). The test statistic is given by:

T_j = (β̂_j − β_{j,H_0})/ŝe(β̂_j) ∼ t_{n−p} under H_0,    (2.5)

where 'under H_0' indicates the distribution of the statistic when the null hypothesis is true. The null hypothesis is therefore rejected for extreme values of the test statistic. In particular, the following two quantiles of the t distribution with n − p degrees of freedom are used to specify the decision rule at a given significance level α: t_{n−p}(1 − α/2) and t_{n−p}(α/2). The decision rule for this hypothesis is then simply:

Reject H_0 if t_j > t_{n−p}(1 − α/2) or if t_j < t_{n−p}(α/2),

where t_j denotes the calculated value of the test statistic (2.5).
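A minimal illustrative sketch of this t-test in Python is given below, using scipy.stats for the t quantiles; beta_hat, cov_beta and the other names are assumed to come from sketches like the earlier ones and are not part of the dissertation.

import numpy as np
from scipy import stats

def coefficient_t_test(beta_hat, cov_beta, j, beta_j_h0, n, p, alpha=0.05):
    """Two-sided t-test of H0: beta_j = beta_j_h0 in a linear regression model."""
    se_j = np.sqrt(cov_beta[j, j])               # estimated standard error of beta_hat_j
    t_stat = (beta_hat[j] - beta_j_h0) / se_j
    crit = stats.t.ppf(1 - alpha / 2, df=n - p)  # quantile t_{n-p}(1 - alpha/2)
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)
    reject = abs(t_stat) > crit                  # reject H0 for extreme values of T_j
    return t_stat, p_value, reject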

It has been suggested that under strong departures from the model assumptions above, a better alternative for testing the significance of the regression coefficients is the use of resampling techniques such as the bootstrap. Bootstrap techniques for testing the hypotheses mentioned here are the main focus of this dissertation, specifically the various approaches that can be used to perform this bootstrap inference.

In the next chapter the general theory of bootstrap methods is discussed, along with specific applications of the method in regression models and in hypothesis testing. In addition, detailed explanations of the algorithms that can be used to implement these techniques for various statistics are also provided.

Chapter 3

The Bootstrap

In this chapter the concept of the bootstrap, which is the main subject of this dissertation, is introduced. The history, methodology and applications to different models, such as simple regression and location-scale models, are also discussed. At the beginning of the chapter, the notation that will be used throughout the study is introduced. The applications of the bootstrap method based on the plug-in principle are also presented. This is followed by comparisons between the parametric, nonparametric and semi-parametric bootstrap methods, which include algorithms for drawing bootstrap samples for each of the different methods. The algorithms for approximating bootstrap quantities, such as the bootstrap standard deviation, bias, critical values and p-values, are also provided. Bootstrap regression models with independent and identically distributed (i.i.d.) error terms, including the two bootstrap regression methods, i.e., the cases bootstrap and residual resampling, are also explored. Finally, bootstrap hypothesis testing and some guidelines for using the method are reviewed. Furthermore, the bootstrap hypothesis testing methodology is expanded to the case of simple and multiple linear regression.

3.1 An introduction to the Bootstrap

The bootstrap was developed by Efron in 1979 and has since become a widely used technique that depends a great deal on the availability of high speed computers. It can be used to approximate the sampling distribution of a statistic of interest and it is widely used as a means of assessing properties of the sampling distribution, such as standard errors, confidence intervals for unknown parameters, and critical values for test statistics under a null hypothesis. Efron and Tibshirani (1993) gave an extended account of the use of the term "bootstrap" in 'An Introduction to the Bootstrap', where they explain that the term derives from the phrase "to pull oneself up by one's own bootstrap", based on the eighteenth century novel "The Surprising Adventures of Baron Munchausen" by Rudolph Erich Raspe. In the tale, it is explained how the Baron tried to do the impossible task of getting himself out of the bottom of a deep lake by pulling himself up by his bootlaces. With this analogy in mind, the statistical bootstrap methodology also attempts the "impossible" by obtaining estimators of population quantities by first considering how these quantities behave when calculated in a situation where the sample is treated as a population. Once these results are obtained, they are "pulled up" to the real population setting, i.e., all distributional properties observed in the "bootstrap" setting are used as estimators for the real population setting.

The bootstrap is similar to Monte Carlo methods, but instead of fully specifying the data generating process (DGP), the bootstrap uses information from the sample. Efron and Tibshirani (1993) discuss the distinction between the "bootstrap world" and the "real world". In the bootstrap world one treats the sample as a population of values, since the sample data is the only available information about the distribution of the underlying random variable, and also treats sample statistics calculated from the sample as population parameters. That is, instead of drawing independently from a specified parametric distribution in order to create a random sample, the traditional bootstrap draws with replacement from a sample of observed values. In this case, the empirical distribution function (EDF) is treated as the true distribution function. Therefore, everything calculated in the bootstrap world is an estimator for its counterpart in the real world.

The bootstrap approach is free from the typical assumptions of parametric models about the form of the distribution. When applied to perform inference, the significance levels, critical values and confidence intervals are not derived from normal theory, but are estimated from the bootstrap estimate of the sampling distribution. It is therefore robust to assumption violations. According to Conover and Iman (1981), the bootstrap keeps the distributional information of the original sample (unlike most nonparametric techniques, which convert data into ranks) and, when conducting inference, will only be less powerful than parametric techniques in those cases where the sample dataset is small. The bootstrap has found support in problems where large-sample theory is either inadequate or unavailable. It answers questions related to the bias and variability of an estimator through computational rather than analytical power. In the next section some notation is introduced that will be used throughout this dissertation.

3.2 Notation

Let X_1, ..., X_n be n i.i.d. random variables from an unknown probability distribution F. Let F_n be the empirical distribution function, that is, the distribution having mass 1/n at each observed X_i. F_n may be represented as:

F_n(x) = (1/n) Σ_{i=1}^{n} I(X_i ≤ x),

where I(A) is the indicator function, defined as I(A) = 1 if A occurs and I(A) = 0 if A^c occurs.
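For concreteness, a small illustrative sketch of evaluating the EDF at a point x in Python might be the following (the function name and the example values are assumptions made for the illustration):

import numpy as np

def edf(sample, x):
    """Empirical distribution function F_n(x) = (1/n) * sum of I(X_i <= x)."""
    sample = np.asarray(sample)
    return np.mean(sample <= x)

# Illustrative usage: edf([1.2, 0.7, 3.1, 2.4], 2.0) returns 0.5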

Let T_n = T_n(X_1, X_2, ..., X_n) denote the test statistic calculated from the sample data and t_n = T_n(x_1, x_2, ..., x_n) denote the test statistic calculated from the observed data. Suppose the parameter of interest is ϑ, and suppose further that one can write the unknown parameter ϑ as ϑ = t(F), i.e., some functional of the true, unknown distribution F. ϑ can be estimated by ϑ̂, a function of the random variables X_1, X_2, ..., X_n, denoted by ϑ̂_n = ϑ̂(X_1, X_2, ..., X_n). Let P(·) denote the probability operator when using sample data and statistics, E(·) the expectation operator and Var(·) the variance operator when using the sample data and statistics. The sampling distribution of ϑ̂ may be difficult to obtain since, firstly, F is unknown and, secondly, ϑ̂ may be a complex function of X_1, X_2, ..., X_n, so finding its distribution would need complicated analytical calculations. One of the solutions would be to use the bootstrap approach. The notation X_1*, ..., X_n* will be used to denote a sample drawn independently from F_n, and is often referred to as the "bootstrap sample". Let ϑ̂_n* = ϑ̂(X_1*, ..., X_n*) be the bootstrap replication of ϑ̂, which is obtained by calculating the sample statistic ϑ̂ using the bootstrap sample. Let P*(·) denote the bootstrap probability operator, with P*(X_i* < x) = P(X_i* < x | X_1, ..., X_n), where the conditioning is on the sample data; let E*(·) denote the expectation operator when using bootstrap data, where

E*(X_i*) = E(X_i* | X_1, ..., X_n) = E_{F_n}(X_i*),

and let Var*(·) denote the variance operator when using bootstrap data and statistics, where

Var*(X_i*) = Var(X_i* | X_1, ..., X_n) = Var_{F_n}(X_i*).

The next section discusses one of the major building blocks of the bootstrap method, known as the plug-in principle. The plug-in principle is an important tool in inferential statistics for both traditional and bootstrap methods. The bootstrap employs the plug-in principle by approximating the population distribution with the sample distribution. Examples illustrating the plug-in principle are also provided.

3.3 The bootstrap and the plug-in principle

The bootstrap is based on the plug-in principle, whereby estimators are obtained for population parameters by substituting unknown population elements with direct empirical equivalents. The bootstrap estimates are obtained by plugging in an estimate of the population distribution function F, namely F_n, the empirical distribution function of the sample X_1, ..., X_n. Suppose the quantity of interest is ϑ = t(F), i.e., some functional t of the distribution function F. Applying the plug-in principle, one can estimate the parameter t(F) by substituting F with F_n, to obtain the statistic ϑ̂ = t(F_n). Specific examples are:

• The population mean, ϑ = t(F) = E_F(X) = ∫ x dF(x), with plug-in estimate

ϑ̂ = t(F_n) = ∫ x dF_n(x) = E_{F_n}(X*) = (1/n) Σ_{i=1}^{n} X_i = X̄.

• The population variance, ϑ = t(F) = Var_F(X) = ∫ (x − E_F(X))² dF(x), with plug-in estimate

ϑ̂ = t(F_n) = ∫ (x − E_{F_n}(X*))² dF_n(x) = (1/n) Σ_{i=1}^{n} (X_i − X̄)².

• The standard error of the sample mean, t(F) = se_F(X̄) = √Var_F(X̄), with plug-in (bootstrap) estimate ϑ̂ = t(F_n) = √Var_{F_n}(X̄*), where

√Var_{F_n}(X̄*) = √Var_{F_n}((1/n) Σ_{i=1}^{n} X_i*)
               = √((1/n²) Σ_{i=1}^{n} Var_{F_n}(X_i*))        (since X_1*, ..., X_n* are i.i.d.)
               = √((1/n) Var_{F_n}(X_1*))
               = √((1/n) E_{F_n}[X_1* − E_{F_n}(X_1*)]²)
               = √((1/n) E_{F_n}[X_1* − X̄]²)
               = √((1/n²) Σ_{i=1}^{n} (X_i − X̄)²)
               = √(S_n²/n),

where S_n² = (1/n) Σ_{i=1}^{n} (X_i − X̄)².
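A short illustrative sketch of these three plug-in estimates in Python (the function names are assumptions made for the example):

import numpy as np

def plug_in_mean(x):
    """Plug-in estimate of the population mean: the sample mean."""
    return np.mean(x)

def plug_in_variance(x):
    """Plug-in estimate of the population variance: (1/n) * sum((X_i - X_bar)^2)."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 2)

def plug_in_se_mean(x):
    """Plug-in estimate of the standard error of the sample mean: sqrt(S_n^2 / n)."""
    return np.sqrt(plug_in_variance(x) / len(x))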

3.3.1 Application of the bootstrap to bias and standard error estimation

The discussion of how the bootstrap can be used to estimate the bias and standard error of an estimator will now be presented. One can use the bootstrap to find out how accurate an estimator is, for example by using bootstrap methods to estimate its bias. The theoretical bias of some estimator θ̂ = θ̂(X_1, X_2, ..., X_n) when estimating ϑ is defined as

Bias(θ̂) = E(θ̂ − ϑ) = E(θ̂) − ϑ,

where θ̂ is any estimator based on the sample data, and can possibly be chosen as the plug-in estimator. In order to obtain the bootstrap estimate of bias, one plugs in the plug-in estimate ϑ̂ = t(F_n) in place of the parameter ϑ and θ̂* in place of θ̂, where θ̂* is based on a simple random sample drawn from F_n. The result is denoted as

Bias_{F_n}(θ̂*) = E_{F_n}(θ̂* − ϑ̂) = E_{F_n}(θ̂*) − ϑ̂.

Subtracting the estimated bias from θ̂ results in the bias-corrected version of θ̂, calculated as

θ̂ − Bias_{F_n}(θ̂*) = θ̂ + ϑ̂ − E_{F_n}(θ̂*).

If θ̂ is the plug-in estimator, i.e., θ̂ = ϑ̂, then the above expression becomes 2ϑ̂ − E_{F_n}(θ̂*).

Estimating the standard error of ϑ̂ can be difficult, since the statistic may be a complex function of X_1, X_2, ..., X_n. The exact theoretical expression might be impossible to calculate, but the bootstrap estimate of the standard error of ϑ̂, denoted as se_{F_n}(ϑ̂*) = √Var_{F_n}(ϑ̂*), is the plug-in estimate of se(ϑ̂) = √Var(ϑ̂). An algorithm to approximate the bootstrap standard error and bias is provided below (see, e.g., Efron and Tibshirani, 1993).

Algorithm for approximating the bootstrap standard error and bias

1. Draw a sample of size n with replacement from X_1, ..., X_n to get the resulting bootstrap sample X_1*, ..., X_n*.

2. Compute the statistic ϑ̂_n* = ϑ̂(X_1*, ..., X_n*) for the sample drawn in step 1.

3. Repeat steps 1 and 2 B times to obtain the bootstrap replications ϑ̂*_{n,1}, ϑ̂*_{n,2}, ..., ϑ̂*_{n,B}.

The standard error calculation:

• Compute the approximation of the estimate of the standard error as

ŝe_B = √( (1/(B − 1)) Σ_{b=1}^{B} (ϑ̂*_{n,b} − ϑ̂*_n(·))² ),  where  ϑ̂*_n(·) = (1/B) Σ_{b=1}^{B} ϑ̂*_{n,b}.

The limit of ŝe_B as B goes to infinity is se_{F_n}(ϑ̂*), which is the ideal bootstrap estimate. Another way to estimate the standard error is to use the "ideal" bootstrap standard error estimator, denoted ŝe_∞. Generally, analytical calculation of ŝe_∞ is very difficult, so the bootstrap standard error estimator ŝe_B, based on B simulations, can be used to estimate the "ideal" bootstrap standard error. According to the law of large numbers, lim_{B→∞} ŝe_B = ŝe_∞ almost surely, where

ŝe_∞ = √( E_{F_n}[ϑ̂(X*) − E_{F_n}(ϑ̂(X*))]² ).

• In order to approximate the bootstrap bias, carry out steps 1, 2 and 3 of the above algorithm; the estimated bias is then

Bias_{F_n,B} = ϑ̂*_n(·) − ϑ̂,  where  ϑ̂*_n(·) = (1/B) Σ_{b=1}^{B} ϑ̂*_{n,b}.
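A minimal Python sketch of this algorithm for a generic statistic is given below; the statistic is passed in as a function, and all names and default values are illustrative assumptions.

import numpy as np

def bootstrap_se_and_bias(x, stat, B=1000, rng=None):
    """Approximate the bootstrap standard error and bias of the statistic `stat`."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    # Steps 1-3: resample with replacement B times and recompute the statistic each time
    reps = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    theta_hat = stat(x)                  # estimate from the original sample
    se_B = reps.std(ddof=1)              # sqrt of (1/(B-1)) * sum (rep - mean of reps)^2
    bias_B = reps.mean() - theta_hat     # Bias_{F_n,B} = mean of replications - estimate
    return se_B, bias_B

# Illustrative usage: se, bias = bootstrap_se_and_bias(x, np.mean, B=2000)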

In the next section various types of bootstrap resampling methods, how these bootstrap methods differ in generating data, and the assumptions associated with them are discussed.

3.4 Different types of bootstrap resampling methods

There are several types of bootstrap methods; these include the parametric, the nonparametric and the semi-parametric methods.

3.4.1 The parametric bootstrap

In the parametric bootstrap it is assumed that F belongs to a parametric family of distributions with unknown parameters, which have to be estimated from the sample. That is, a model is assumed (e.g., a normal distribution with unknown mean and variance), the parameters of that model are estimated, and bootstrap samples are then drawn from the model with those estimated parameters. In other words, the general functional form of the CDF is known, but not the exact parameters. This estimation method leads to more accurate inference if the distribution family is correctly specified; on the other hand, the assumed F may be fairly far from the true F if the family assumption is wrong. Suppose we wish to estimate the distributional properties of some statistic ϑ̂ = ϑ̂(X_1, ..., X_n) using the parametric bootstrap. The simulation approach to approximating this distribution is given by the following algorithm:

1. Suppose the sample data is X_1, ..., X_n.

2. Assume that the data come from a known distribution family F_ψ, where ψ = [ψ_1, ψ_2, ..., ψ_p] is a vector of p unknown population parameters. F_ψ is a known distribution family described by a set of parameters, for example ψ = [µ, σ²] for a normal distribution.

3. Estimate ψ with, say, maximum likelihood estimation to get ψ̂.

4. Sample independently a new data set of size n from F_ψ̂ to obtain X_1*, ..., X_n*.

5. Get the estimate ϑ̂* = ϑ̂(X_1*, X_2*, ..., X_n*).

6. Repeat steps 4 and 5 B times to obtain ϑ̂_1*, ϑ̂_2*, ..., ϑ̂_B*.

7. Consider the empirical distribution of ϑ̂_1*, ..., ϑ̂_B* as an approximation of the true distribution of ϑ̂.
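As an illustration, a hedged sketch of this algorithm for an assumed normal family is given below; the choice of family, of the statistic and of all names is made only for the example.

import numpy as np

def parametric_bootstrap_normal(x, stat, B=1000, rng=None):
    """Parametric bootstrap assuming X_1, ..., X_n ~ N(mu, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    mu_hat = x.mean()             # maximum likelihood estimate of mu
    sigma_hat = x.std(ddof=0)     # maximum likelihood estimate of sigma
    reps = np.empty(B)
    for b in range(B):
        x_star = rng.normal(mu_hat, sigma_hat, size=n)   # step 4: sample from F_psi_hat
        reps[b] = stat(x_star)                           # step 5: recompute the statistic
    return reps    # empirical distribution approximating that of the statistic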

3.4.2 The nonparametric bootstrap

In the nonparametric bootstrap one assumes that F is completely unknown. We then estimate F by F_n, the EDF. F_n is a discrete probability distribution that gives probability 1/n to each observed value X_1, ..., X_n. Sampling from F_n is thus the same as sampling with replacement from the sample X_1, X_2, ..., X_n. The algorithm is as follows:

1. Construct an empirical probability distribution, F_n, from the sample by placing probability 1/n on each data point X_1, X_2, ..., X_n of the sample.

2. From the empirical distribution function F_n, independently draw a random sample of size n, i.e., sample with replacement from X_1, ..., X_n. The result, denoted X_1*, ..., X_n*, is called the "bootstrap" sample or "resample".

3. Calculate the statistic of interest, θ̂_n, based on the bootstrap sample to get θ̂_n*.

4. Repeat steps 2 and 3 B times, where B is some large number, to obtain θ̂*_{n,1}, ..., θ̂*_{n,B}.

5. Consider the empirical distribution of θ̂*_{n,1}, ..., θ̂*_{n,B} as an approximation of the true distribution of θ̂_n.
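A minimal illustrative sketch of these steps in Python (the statistic of interest is passed in as a function; the names are assumptions made for the example):

import numpy as np

def nonparametric_bootstrap(x, stat, B=1000, rng=None):
    """Approximate the sampling distribution of `stat` by resampling from the EDF."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    # Each bootstrap sample is drawn with replacement from the observed data (i.e., from F_n)
    return np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])

# Illustrative usage: reps = nonparametric_bootstrap(x, np.median, B=2000)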

3.4.3 The semi-parametric bootstrap

Semi-parametric models typically have some known parametric properties combined with other unknown properties. As an example, consider the location-scale model Y = µ + σε, with only the assumption that the error term ε is centered at zero and has a scale of 1. Another example is the classical linear regression model,

Y = Xβ + ε,

where the coefficients β are unknown and can be estimated by β̂. This model becomes parametric when the error terms are assumed to be normally distributed with a mean of zero and constant variance. The bootstrap is conducted in a way that takes these assumptions into account. This often requires changing the original sample data to comply with the assumptions made about the distribution of the data.
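As a hedged illustration of the idea, a semi-parametric bootstrap for the location-scale model could be sketched as follows; the centring and rescaling choices, the statistic used, and all names are illustrative assumptions rather than the dissertation's own algorithm.

import numpy as np

def location_scale_bootstrap(x, B=1000, rng=None):
    """Semi-parametric bootstrap for Y = mu + sigma*eps with E(eps) = 0 and scale 1."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    mu_hat = x.mean()
    sigma_hat = x.std(ddof=1)
    eps_hat = (x - mu_hat) / sigma_hat            # estimated standardised errors
    reps = np.empty(B)
    for b in range(B):
        eps_star = rng.choice(eps_hat, size=n, replace=True)   # resample the errors
        y_star = mu_hat + sigma_hat * eps_star                 # rebuild bootstrap data
        reps[b] = y_star.mean()                   # e.g., bootstrap replication of the mean
    return reps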

The following section explores different types of bootstrap regression models. The bootstrap application in linear regression models is considered. This is one of the important parts of this dissertation as linear regression models are the major subject of the study. The discussion of the extension to multiple regression is provided as well. The algorithms for generating bootstrap data using bootstrap regression techniques are presented.

3.5 Bootstrap regression

The choice of the bootstrap regression model depends on whether the assumptions about the model hold. There are two commonly used bootstrap regression techniques for linear regression: resampling the residuals (or model based resampling) and case resampling. Both of these will be discussed next, although this study is only based on the method of resampling the residuals; case resampling is provided for completeness.

3.5.1 Case resampling

The case resampling approach consists of drawing the observation of the dependent variable (Y) along with the independent variable (X) for the same observation from the bivariate distribution function, $F_n(X, Y)$ (Freedman, 1981). This is implemented by sampling the i.i.d. pairs $(X_1^*, Y_1^*), \dots, (X_n^*, Y_n^*)$ with replacement from the cases (or pairs) $(X_1, Y_1), \dots, (X_n, Y_n)$. Sampling with replacement is equivalent to assigning probability 1/n to each pair $(X_j, Y_j)$, $j = 1, 2, \dots, n$, where $X_j$ is a row vector of all independent variables for observation j. The approach is not affected by whether or not the linear regression model holds, since the model structure is not used in this approach. The resulting resampled data have a structure similar to the original data because the $Y_i$ values remain bound to the corresponding $X_i$. The case resampling method is valid even when the errors show heteroskedasticity of unknown form (Davidson and MacKinnon, 2010). The bootstrap values $\hat{\beta}_0^*$ and $\hat{\beta}_1^*$ of the coefficient estimates are computed from the bootstrap pairs $(X_1^*, Y_1^*), \dots, (X_n^*, Y_n^*)$.

The simple linear regression model that will be considered in this section is given by
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad (3.1)$$
with $E(\epsilon_i) = 0$ and the $\epsilon_i$'s i.i.d. errors. If the data are $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$, the method to produce bootstrap samples using case-based resampling and to approximate the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ involves:

• Drawing a bootstrap sample of size n, with replacement, from these n pairs. The bootstrap data set is of the form $(X_1^*, Y_1^*), (X_2^*, Y_2^*), \dots, (X_n^*, Y_n^*)$.

• Ordinary least squares is then used to estimate the regression coefficients $\hat{\beta}_0^*$ and $\hat{\beta}_1^*$ for this bootstrap sample of paired cases. That is, for each of the B sets of resampled pairs estimate
$$\hat{\beta}^*(b) = \left(X^{*\prime}(b)\, X^*(b)\right)^{-1} X^{*\prime}(b)\, Y^*(b) = (\hat{\beta}_0^*, \hat{\beta}_1^*)_b, \qquad b = 1, 2, \dots, B,$$
where $X^*(b)$ is the b-th design matrix
$$X^*(b) = \begin{pmatrix} 1 & X_1^* \\ 1 & X_2^* \\ \vdots & \vdots \\ 1 & X_n^* \end{pmatrix}$$
and $Y^*(b)$ is the b-th set of resampled Y values $Y_1^*, Y_2^*, \dots, Y_n^*$. Both of these quantities are obtained from the resampled paired values $(X_1^*, Y_1^*), (X_2^*, Y_2^*), \dots, (X_n^*, Y_n^*)$.

• The sampling distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$ can thus be approximated by this sequence. From the sequence, we can then get the estimated standard error of these estimated regression coefficients as follows:
$$se_{k,B} = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\beta}_k^*(b) - \hat{\beta}_k^*(\cdot)\right)^2}, \qquad k = 0, 1,$$
where $\hat{\beta}_k^*(\cdot) = \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}_k^*(b)$. The limit of $se_{k,B}$, as B goes to infinity, is the ideal bootstrap estimate of $se(\hat{\beta}_k)$.
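A minimal R sketch of the case (pairs) resampling algorithm for the simple linear regression model (3.1); the data-generating values below are purely illustrative.

```r
## Case (pairs) resampling sketch for simple linear regression
set.seed(1)
n <- 50
X <- runif(n, 0, 10)
Y <- 2 + 0.5 * X + rnorm(n, sd = X / 5)   # illustrative data (heteroskedastic errors)
B <- 1000

beta.star <- matrix(NA, nrow = B, ncol = 2)
for (b in 1:B) {
  idx            <- sample(1:n, size = n, replace = TRUE)  # resample whole cases (X_i, Y_i)
  fit.star       <- lm(Y[idx] ~ X[idx])                    # OLS on the resampled pairs
  beta.star[b, ] <- coef(fit.star)
}

## Bootstrap standard errors of the intercept and slope
apply(beta.star, 2, sd)
```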

If the independent variables have to be controlled for some reason, for example by the design of the study, then the methods of resampling residuals can be used.

3.5.2 Residual resampling

In the case of resampling pairs, the assumption is that both the independent and dependent variables are random. However, when making use of resampled residuals, the independent variable is assumed to be fixed. In residual resampling, the first step is to fit the model, compute the predicted values $\hat{Y}_i$ and obtain the residuals $e_i = Y_i - \hat{Y}_i$; a bootstrap sample is then generated by using the original X values. The bootstrap response values are obtained by adding the predicted values and random residuals, i.e., $Y_i^* = \hat{Y}_i + \hat{e}_i^*$, where the $\hat{e}_i^*$ are sampled randomly with replacement from the original centered residuals. Under the linear regression model, the distribution of the true errors, $\epsilon_i$, can be estimated by resampling the mean-corrected residuals. To bootstrap residuals, the empirical distribution of the centered residuals is used, i.e., $\hat{e}_i^* \sim EDF(e_1 - \bar{e}, \dots, e_n - \bar{e})$, and a sample of size n is taken independently from this empirical distribution to obtain the bootstrap sample for the residuals, denoted $\hat{e}_1^*, \hat{e}_2^*, \dots, \hat{e}_n^*$. From $Y_i^* = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{e}_i^*$, $i = 1, 2, \dots, n$, a bootstrap sample $(X_1, Y_1^*), \dots, (X_n, Y_n^*)$ can be generated and the bootstrap coefficient estimates $\hat{\beta}_0^*$ and $\hat{\beta}_1^*$ can be calculated from $(X_1, Y_1^*), \dots, (X_n, Y_n^*)$. For model-based or residual-based resampling, the simple linear regression model is given by (3.1). The algorithm below provides an approximation to the distribution of $\hat{\beta} - \beta$, from which the bias and variance can be obtained.

• The fitted values $\hat{Y}_i$ and residuals $e_i$ are first obtained from the observed data, i.e. $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ and $e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$.

• Sample with replacement from the centered residuals $\hat{e}_1, \hat{e}_2, \dots, \hat{e}_n$ to obtain $\hat{e}_1^*, \hat{e}_2^*, \dots, \hat{e}_n^*$.

• The bootstrap sample for the regression, $(X_i, Y_i^*)$, comprises the X values from the original data and the $Y^*$ values computed by adding the fitted values and the bootstrap residuals together, i.e. $Y_i^* = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{e}_i^* = \hat{Y}_i + \hat{e}_i^*$.

• Regress the bootstrap $Y^*$ on the fixed X values to get the bootstrap regression coefficients by the method of least squares, i.e. $\hat{\beta}^* = (X'X)^{-1}X'Y^*$.

• Repeat the previous two steps B times to obtain $\hat{\beta}^*(1), \hat{\beta}^*(2), \dots, \hat{\beta}^*(B)$. Each bootstrap data set is of the form $(X_1, Y_1^*), (X_2, Y_2^*), \dots, (X_n, Y_n^*)$.

• The distribution of $\hat{\beta}_k^* - \hat{\beta}_k$, $k = 0, 1$, which is the bootstrap estimate of the distribution of $\hat{\beta}_k - \beta_k$, $k = 0, 1$, can be constructed, and the bootstrap estimates of bias, denoted $E^*(\hat{\beta}_k^*) - \hat{\beta}_k$, $k = 0, 1$, and variance, denoted $Var^*(\hat{\beta}_k^*)$, $k = 0, 1$, can then be obtained.
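A minimal R sketch of the residual (model-based) resampling algorithm above; the simulated data are illustrative and the residuals are centered before resampling.

```r
## Residual (model-based) resampling sketch for simple linear regression
set.seed(1)
n <- 50
X <- runif(n, 0, 10)
Y <- 2 + 0.5 * X + rnorm(n)               # illustrative data
fit   <- lm(Y ~ X)
Y.hat <- fitted(fit)
e.c   <- resid(fit) - mean(resid(fit))    # centered residuals (OLS residuals already have mean 0)
B     <- 1000

beta.star <- matrix(NA, nrow = B, ncol = 2)
for (b in 1:B) {
  e.star         <- sample(e.c, size = n, replace = TRUE)  # resample the residuals
  Y.star         <- Y.hat + e.star                         # rebuild responses; X stays fixed
  beta.star[b, ] <- coef(lm(Y.star ~ X))
}

## Bootstrap estimates of bias and standard error for (beta0, beta1)
colMeans(beta.star) - coef(fit)           # E*(beta.hat*) - beta.hat
apply(beta.star, 2, sd)
```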

Next, residual resampling is extended to cover multiple regression (where the independent variables are fixed) and more general regression models.

Consider now the general regression model equation
$$Z_i = g(X_i, \beta) + \epsilon_i, \qquad i = 1, \dots, n,$$
with Z a vector of response variables and g a known real-valued function of the predictor variables X and the parameters β. The least-squares estimate of β is obtained from the expression
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left(Z_i - g(X_i, \beta)\right)^2.$$

The bootstrap procedure is as follows:

1. Start with the data $Z = [Z_1, Z_2, \dots, Z_n]$ and X.

2. Least-squares estimates of the parameters are obtained using
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left(Z_i - g(X_i, \beta)\right)^2.$$

3. Compute the centered residuals $\hat{e}_i = Z_i - g(X_i, \hat{\beta}) - \frac{1}{n}\sum_{j=1}^{n}\left(Z_j - g(X_j, \hat{\beta})\right)$, $i = 1, \dots, n$, to obtain $\hat{e}_1, \hat{e}_2, \dots, \hat{e}_n$.

4. Sample with replacement from $\hat{e}_1, \hat{e}_2, \dots, \hat{e}_n$ to get the bootstrap residual sample $\hat{e}_1^*, \hat{e}_2^*, \dots, \hat{e}_n^*$.

5. The bootstrap sample is generated using the relationship $Z_i^* = g(X_i, \hat{\beta}) + \hat{e}_i^*$, $i = 1, \dots, n$, to obtain $Z_1^*, Z_2^*, \dots, Z_n^*$.

6. Using the data $X = [x_1, x_2, \dots, x_n]'$ and the bootstrap sample $Z_1^*, Z_2^*, \dots, Z_n^*$, determine the bootstrap estimate
$$\hat{\beta}^* = \arg\min_{\beta} \sum_{i=1}^{n} \left(Z_i^* - g(X_i, \beta)\right)^2.$$

7. Repeat steps 4 to 6 B times to obtain $\hat{\beta}_n^*(1), \hat{\beta}_n^*(2), \dots, \hat{\beta}_n^*(B)$.

8. Estimate the standard error as
$$\hat{se}_{n,B} = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\beta}_n^*(b) - \hat{\beta}_n^*(\cdot)\right)^2},$$
where $\hat{\beta}_n^*(\cdot) = \frac{1}{B}\sum_{b=1}^{B} \hat{\beta}_n^*(b)$.
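The following R sketch illustrates steps 1 to 8 for a nonlinear choice of g, here $g(X, \beta) = \beta_1 \exp(\beta_2 X)$; the data, the starting values for the least-squares fit and B are illustrative assumptions, and in practice individual nls fits may occasionally fail to converge.

```r
## Residual bootstrap sketch for a nonlinear regression Z = g(X, beta) + eps
set.seed(1)
n <- 60
X <- runif(n, 0, 2)
Z <- 3 * exp(0.8 * X) + rnorm(n, sd = 0.5)                          # illustrative data
fit   <- nls(Z ~ b1 * exp(b2 * X), start = list(b1 = 1, b2 = 1))    # step 2: least squares
g.hat <- fitted(fit)
e.c   <- resid(fit) - mean(resid(fit))                              # step 3: centered residuals
B     <- 1000

beta.star <- matrix(NA, nrow = B, ncol = 2)
for (b in 1:B) {
  e.star <- sample(e.c, n, replace = TRUE)                          # step 4
  Z.star <- g.hat + e.star                                          # step 5
  fit.b  <- nls(Z.star ~ b1 * exp(b2 * X), start = coef(fit))       # step 6
  beta.star[b, ] <- coef(fit.b)
}

apply(beta.star, 2, sd)                                             # step 8: bootstrap standard errors
```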

Comparison between case resampling and residual resampling. Case resampling is more robust against model mis-specification than residual resampling. That is, with case resampling no assumption is made about constant variance, or about the form of the relationship between X and Y, when generating the data. This offers the advantage of robustness to heteroscedasticity and the disadvantage of inefficiency if the variance is homoscedastic. Case resampling only requires the sample $(X_i, Y_i)_{i=1}^{n}$ to be randomly drawn from the distribution. The error terms are excluded from the bootstrap DGP and the method does not condition on X, which means the X's are random. In the context of bootstrap hypothesis testing, which is the subject of the next section, case resampling is inadequate because the bootstrap DGP is not able to impose any restrictions on the parameters of the model. Residual resampling, on the other hand, requires the validity of the linear regression model and that the errors be i.i.d., but it has the benefit of being able to impose restrictions on the model. If these assumptions hold, then the residual resampling method will be more efficient than the case resampling method.

In the next section, bootstrap hypothesis testing, a very important subject of this dissertation, is discussed. Recall that the study investigates and compares the performance of bootstrap tests in terms of Type I error and power using different approaches to generating the bootstrap data and residuals, i.e. an unrestricted/restricted bootstrap model with unrestricted/restricted residuals. To compare and contrast the performance of the bootstrap tests, that is, to identify which bootstrap approach is more powerful, hypothesis testing is conducted.

3.6 Bootstrap hypothesis testing

Hypothesis testing is a vital tool used in scientific research; it allows researchers to carry out inferences about population parameters using data from a sample. A hypothesis is a specific statement about a property of a population of interest. There are two types of hypotheses: a simple hypothesis and a composite hypothesis. A hypothesis in which all parameters of the distribution are specified is a simple hypothesis, for example, the hypothesis that the data are normally distributed with $\mu = 0$ and $\sigma^2 = 1$. On the other hand, if the exact distribution of the population is not known, or not all parameters are specified under the hypothesis, it is a composite hypothesis, for example, the hypothesis that the data are normally distributed with $\mu = 0$ and unspecified variance.

In hypothesis testing, one wants to show that the alternative hypothesis, denoted by $H_1$, is correct by finding sufficient evidence against the null hypothesis. For example, let $X_n = (X_1, X_2, \dots, X_n)$ be a random sample from some unknown distribution F. Suppose the following right-sided alternative hypothesis has to be tested:
$$H_0: \vartheta(F) = \vartheta_0 \quad \text{vs.} \quad H_1: \vartheta(F) > \vartheta_0, \qquad (3.2)$$

where the parameter $\vartheta(F)$ is some functional of F. Suppose also that the test statistic $T_n$ summarizes the information from the data. The behaviour of $T_n = T_n(X_1, X_2, \dots, X_n)$ under the null hypothesis is then studied. If the observed data are $x_1, \dots, x_n$, then the observed value of the test statistic can be calculated, say,
$$t_n = T_n(x_1, \dots, x_n).$$

$T_n^* = T_n(X_1^*, X_2^*, \dots, X_n^*)$ is the bootstrap counterpart of $T_n$. To implement the decision rule, on the basis of sample data, on whether to reject $H_0$, one needs a critical value $c_n(\alpha)$. The critical value for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected. The decision rule is to reject $H_0$ at level α if
$$T_n \geq c_n(\alpha),$$
where
$$P_{H_0}(T_n \geq c_n(\alpha)) \approx \alpha.$$
The distribution F is typically unknown and therefore the critical value, $c_n(\alpha)$, is also unknown. The bootstrap can be used to estimate $c_n(\alpha)$, and the bootstrap estimator of $c_n(\alpha)$, denoted $\hat{c}_n(\alpha)$, is the $(1-\alpha)$th quantile of the bootstrap null distribution of $T_n^*$.

In estimating the critical value $c_n(\alpha)$ by $\hat{c}_n(\alpha)$, the first step should be the transformation of the data. The main aim of transforming the data is to make sure that they imitate the null hypothesis $H_0$, in accordance with MacKinnon's (2009) suggestion and Hall and Wilson's (1991) first guideline. According to MacKinnon (2009), it is important that the distribution of the bootstrap statistic, $T_n^*$, imitates the null distribution of $T_n$, whether or not the null hypothesis is true; otherwise the bootstrap test will have poor power properties. The same sentiments were echoed by Hall and Wilson (1991) in their guidelines for bootstrap hypothesis testing. The first guideline suggests that when the critical value is estimated, it should be done in a manner that takes the null hypothesis into account, even if the data were generated under the alternative hypothesis. The second guideline states that bootstrap hypothesis tests should be carried out on test statistics that are pivotal.
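As a hedged R sketch of how these guidelines can be put into practice for the mean test illustrated below: the bootstrap data are resampled from the observed sample, but the bootstrap statistic is recentred at $\bar{X}$ rather than at $\mu_0$, so that its distribution mimics the null distribution of the test statistic even when the data were generated under the alternative. The sample, $\mu_0$ and B are illustrative choices only, not values used elsewhere in this study.

```r
## Bootstrap test sketch for H0: mu = mu0 vs H1: mu > mu0, following the guidelines above
set.seed(1)
x     <- rexp(30, rate = 1)        # illustrative sample
mu0   <- 0.8                       # hypothesised value (illustrative)
n     <- length(x)
B     <- 2000
alpha <- 0.05

T.obs <- mean(x) - mu0             # observed test statistic

T.star <- numeric(B)
for (b in 1:B) {
  x.star    <- sample(x, n, replace = TRUE)
  T.star[b] <- mean(x.star) - mean(x)   # recentre at the sample mean, not at mu0,
                                        # so that T.star mimics the null distribution
}

c.hat <- quantile(T.star, 1 - alpha)    # bootstrap critical value c_n(alpha)
p.val <- mean(T.star >= T.obs)          # bootstrap p-value
c(reject = unname(T.obs >= c.hat), p.value = p.val)
```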

These guidelines are now illustrated with a simple example involving the population mean.

Let $X_1, \dots, X_n$ be a random sample from a distribution with mean µ. Suppose the hypothesis to be tested is
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0. \qquad (3.3)$$

We first consider the situation of testing this hypothesis with a test statistic defined as
$$T_1 = \bar{X} - \mu_0,$$
and we compare the value of $T_1$ to the distribution of the bootstrap test statistic
$$T_1^* = \bar{X}^* - \mu_0,$$
where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{X}^* = \frac{1}{n}\sum_{i=1}^{n} X_i^*$. The test using $T_1^*$ may lack power when $H_0$ is incorrect, since the distribution of $\bar{X}^*$ is centered at $\bar{X}$ and not at $\mu_0$. Secondly, we can test the hypothesis using, once again, the statistic
