
2.1.1 Definitions

In many situations, we must decide whether a certain claim about a population is true. A statistician is asked to answer this question based on data extracted from the population. Data is defined in the following way:

Definition 2.1.1. (Data) Data is a collection of measurements of random variables.

To aid in such a decision, two hypotheses are defined: the null hypothesis, H0, and the alternative, H1. Choosing between these hypotheses based on data is known as hypothesis testing. This is defined formally by Abramovich & Ritov (2013) as follows:

Definition 2.1.2. (Hypothesis Testing) Let Θ denote the parameter set of our underlying model. Consider Θ0, Θ1 ⊆ Θ ⊆ R such that Θ0 ∩ Θ1 = ∅. Assume the data X = (X1, . . . , Xn) has distribution Fθ, where θ is unknown, θ ∈ Θ0 ∪ Θ1.

We want to test the null hypothesis

H0: θ ∈ Θ0, against the alternative hypothesis

H1: θ ∈ Θ1.

Our goal is to construct a test function X ↦ ψ(X) ∈ {0, 1}, where ψ(X) = 1 means we reject the null hypothesis and ψ(X) = 0 means we fail to reject the null hypothesis.

For the null hypothesis, there are also two classifications: simple and composite. A simple hypothesis is one that completely specifies the distribution; otherwise, the null hypothesis is composite. In Definition 2.1.2 the null hypothesis is simple if Θ0 is a singleton, i.e. contains only one value.

Our decision to reject the null hypothesis or not is based on data. Therefore, the decision is also random and can be incorrect. In hypothesis testing there are two types of error: type I and type II.

Definition 2.1.3. (Type I & Type II Errors)

Type I error: reject H0 when H0 is true.
Type II error: fail to reject H0 when H1 is true.

A test is constructed in order to minimise the occurrence of these errors.

From Definition 2.1.2, we can define the critical region of the test.

CHAPTER 2. BACKGROUND INFORMATION

Definition 2.1.4. (Critical Region) The critical region of a test satisfying Definition 2.1.2 is the set C such that,

C = {x : ψ(x) = 1}.

In words, it is the set of realisations of X, denoted by x, which result in a rejection of H0. Therefore, it is the set of realisations which would be unlikely to occur under H0.

For the test ψ we can define the following probabilities,

αψ(θ) = Pθ(reject H0) = Pθ(ψ(X) = 1) = Pθ(X ∈ C),   (2.1)
βψ(θ) = Pθ(fail to reject H0) = Pθ(ψ(X) = 0) = Pθ(X ∉ C).   (2.2)

We write Pθ to emphasise that the data is a random sample from Fθ.

Equations 2.1 and 2.2 define the probability of a type I error (for θ ∈ Θ0) and the probability of a type II error (for θ ∈ Θ1), respectively. From these equations we can define the significance level of a test as follows:

Definition 2.1.5. (Significance Level) A test ψ is said to have significance level α ∈ (0, 1) if

supθ∈Θ0 αψ(θ) ≤ α.   (2.3)

Thus, the significance level controls how often H0 can be rejected. With significance level α, the test should not incorrectly reject H0 more than α · 100% of the time.
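This control can be checked empirically. The sketch below uses an illustrative setup not taken from the text (a two-sided z-test of H0: µ = 0 with known σ = 1 and n = 30) and simulates data under a true H0, confirming that a level-0.05 test rejects close to 5% of the time:

```python
import numpy as np

# Simulate a level-0.05 two-sided z-test under a true H0 (mu = 0,
# sigma = 1 known, n = 30) and check the empirical rejection rate.
rng = np.random.default_rng(0)
alpha = 0.05
n, reps = 30, 100_000
z_crit = 1.959964   # two-sided critical value for alpha = 0.05

samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))   # H0 is true
z = samples.mean(axis=1) * np.sqrt(n)                      # z-statistics
rejection_rate = np.mean(np.abs(z) >= z_crit)
print(rejection_rate)   # close to alpha = 0.05
```

With 100,000 repetitions the empirical rate settles within a few tenths of a percent of α.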

Then to assist in our decision between the hypotheses H0 and H1, we define a test statistic.

Definition 2.1.6. (Test Statistic) A test statistic is a function of the data that is used as a basis for our decisions. More specifically, it is used to construct the test function.

With Definition 2.1.6 we can define the statistical test as follows:

Definition 2.1.7. (Statistical Test with Test Statistic) A statistical test based on test statistic T rejects the null hypothesis if

T ∈ C,

where C is the critical region of the test as defined in Definition 2.1.4. In this case, our test function is,

ψ(X) = 1{T ∈ C}, where 1 is the indicator function.

A test statistic is chosen to minimise the probability of making errors. The significance level of the test is often chosen during design and is usually set to a value of 0.05, so that the probability of a type I error is small. A small type I error is desirable; however, the smaller the type I error, the less likely we are to reject the null hypothesis even when we should, which leads to a high type II error. Thus, there is a trade-off between the two types of errors: decreasing one of them increases the other. This can be illustrated with a simple example. If we have a test that always rejects H0, regardless of the data, then the type II error is zero. However, the type I error is one.

Therefore, we need to find a balance between the two errors.
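The trade-off can be made concrete with a small numerical sketch. Assume a one-sided z-test of H0: µ = 0 against an illustrative alternative µ = 1 (σ = 1, a single observation); raising the rejection cutoff lowers the type I error while raising the type II error:

```python
from math import erf, sqrt

# One-sided z-test of H0: mu = 0 against the illustrative alternative
# mu = 1 (sigma = 1, one observation). Reject when X >= c: raising the
# cutoff c lowers the type I error but raises the type II error.
def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

cutoffs = [0.5, 1.0, 1.5, 2.0]
type_i  = [1 - Phi(c) for c in cutoffs]    # P(X >= c | mu = 0)
type_ii = [Phi(c - 1.0) for c in cutoffs]  # P(X <  c | mu = 1)

assert all(a > b for a, b in zip(type_i, type_i[1:]))    # decreasing
assert all(a < b for a, b in zip(type_ii, type_ii[1:]))  # increasing
```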

To aid in this balance, we define the power and power function of a test.

Definition 2.1.8. (Power of a Test) The power of a test ψ(X) is computed for any θ ∉ Θ0 and is given by

1 − βψ(θ).

In other words, the power of the test is the proportion of the time that the test rejects H0 when H1 is true.


Definition 2.1.9. (Power Function) The power function of a test ψ(X) is defined as πψ : Θ \ Θ0 → [0, 1], where

πψ(θ) = Pθ(reject H0) = Pθ(ψ(X) = 1) = Pθ(X ∈ C), for θ ∈ Θ \ Θ0.

The aim is to find a test of maximum power given the chosen significance level. Such a test is called the uniformly most powerful test, defined by Abramovich & Ritov (2013) as follows:

Definition 2.1.10. (Uniformly Most Powerful Test) A test ψ (defined as in Definition 2.1.2) of significance level α with power function πψ is the uniformly most powerful test if, for any other test ψ′ of significance level no larger than α with power function πψ′,

∀θ1 ∈ Θ1: πψ(θ1) ≥ πψ′(θ1).

In summary, we define a test with an appropriate test statistic, significance level α, and critical region C. Then we check whether the observed value t of the test statistic T lies in the critical region. If it does, we reject the null hypothesis at significance level α; otherwise, it cannot be rejected. Note that failing to reject the null hypothesis does not mean that we accept H0 as true. We just do not have enough evidence to reject it.

However, it can now be argued that one can always obtain the conclusion one wants by choosing the significance level accordingly. Therefore, it is preferable to report the p-value of a test.

Definition 2.1.11. (p-value) The p-value of a test ψ is

p(X) = inf {α ∈ (0, 1) : ψα(X) = 1}, where ψα denotes the test with significance level α.

In words, it is the smallest significance level α that would lead to a rejection of H0. Thus, we reject H0 at significance level α if the p-value is less than or equal to α, else we fail to reject.

Therefore, a low p-value (usually, less than 0.05) would imply a rejection of H0.

2.1.2 Example

To relate the concepts defined above, we assume that there is a company that sells 500-millilitre (ml) bottles of water. We are asked to check whether the filling machine works correctly. For simplicity, we assume that filling the bottles can be modelled by a normal distribution with unknown mean µ and known standard deviation of 20 ml. We collect ten bottles at random, with the amount of water found in each bottle recorded in Table 2.1.

512.7280 518.8793 500.6510 541.8786 535.5288 513.4458 479.1822 500.8753 525.4474 485.1904

Table 2.1: Amount of water (in ml) of the ten bottles collected randomly.

We represent the values in Table 2.1 by the vector X. Then, we define the competing hypotheses H0: µ = 500 versus H1: µ ≠ 500,

where µ ∈ R. In this case we can use the sample mean

X̄ = (1/n) Σᵢ₌₁ⁿ Xi ∼ N(µ, σ²/n)

as the test statistic because it is a natural estimator of the mean. Then we reject H0 if the observed value X̄ is significantly above or below 500 ml. Thus, we can construct the test function

ψ(X) = 1{X̄ ∈ Cδ}


where Cδ = (−∞, 500 − δ] ∪ [500 + δ, ∞) is the critical region of our test for any δ ≥ 0.

Suppose we let δ = 15 ml. Then under H0, µ = 500 and so X̄ ∼ N(500, 20²/10).
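As a quick sanity check of this sampling distribution, one can simulate it (a sketch using numpy; the seed is arbitrary):

```python
import numpy as np

# Simulate the sample mean of n = 10 fills under H0 (mu = 500,
# sigma = 20) and check that its standard deviation is close to
# 20 / sqrt(10), about 6.32.
rng = np.random.default_rng(42)
reps, n, mu, sigma = 200_000, 10, 500.0, 20.0
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(xbar.mean(), xbar.std())   # near 500 and 6.32
```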

Thus, the significance level of the test is

α = Pµ=500({X̄ ≥ 500 + δ} ∪ {X̄ ≤ 500 − δ}).

The significance level is easy to find as we have a simple null hypothesis; therefore, no supremum needs to be computed (see Definition 2.1.5). Then, by rewriting the probability in terms of the standard normal Z, we find that α = 0.0177 when δ = 15 ml.

If we let µ ∈ Θ \ Θ0, the power of the test can be calculated. As Θ \ Θ0 = R \ {500}, we choose µ = 508. Then the power of the test can be calculated as follows:

1 − βψ(508) = Pµ=508({X̄ ≥ 500 + δ} ∪ {X̄ ≤ 500 − δ}) = 0.134.
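Both numbers can be reproduced with the standard normal CDF; the sketch below uses only the Python standard library:

```python
from math import erf, sqrt

# Reproduce the example's numbers: the sample mean satisfies
# X-bar ~ N(mu, 20^2/10), and the test rejects when X-bar falls
# outside (500 - delta, 500 + delta) with delta = 15.
def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

sd = 20 / sqrt(10)   # standard deviation of the sample mean
delta = 15.0

# Significance level: rejection probability under H0 (mu = 500).
alpha = Phi(-delta / sd) + (1 - Phi(delta / sd))
print(round(alpha, 4))   # 0.0177

# Power at the alternative mu = 508.
mu = 508.0
power = Phi((500 - delta - mu) / sd) + (1 - Phi((500 + delta - mu) / sd))
print(round(power, 3))   # 0.134
```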

However, the value of µ is unknown and it is unusual to determine the significance level through the choice of δ. Therefore, we choose the significance level of our test first. Let α = 0.1, then due to symmetry

α/2 = Pµ=500(X̄ ≥ 500 + δ) = Pµ=500(Z ≥ δ√10/20),

thus,

δ√10/20 = zα/2 = 1.645,

and then δ = 10.4. This results in C10.4 = (−∞, 489.6] ∪ [510.4, ∞). For the data that we have collected, X̄ = 511.4, which falls inside the rejection region. Therefore, we reject H0 at significance level α = 0.1.
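The value δ = 10.4 can also be recovered numerically; the sketch below solves Φ(z) = 1 − α/2 by simple bisection rather than consulting statistical tables:

```python
from math import erf, sqrt

# Solve Phi(z) = 1 - alpha/2 for z by bisection, then convert to
# delta = z * 20 / sqrt(10).
def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def upper_quantile(p, lo=0.0, hi=10.0, tol=1e-10):
    """Find z with Phi(z) = p by bisection; valid for p in (0.5, 1)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = 0.1
z = upper_quantile(1 - alpha / 2)
delta = z * 20 / sqrt(10)
print(round(z, 3), round(delta, 1))   # 1.645 10.4
```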

If we chose α = 0.05, the critical region would be (−∞, 487.6] ∪ [512.4, ∞). In this case, there is not enough evidence to reject the null hypothesis at significance level α = 0.05.

This example illustrates the importance of the p-value. Depending on what we choose α to be, we can get the conclusion we want, and thus we prefer to report the p-value. We know the p-value is the smallest α value for which we reject H0. Thus, δ = x̄ − 500 = 11.4 and,

p-value = Pµ=500({X̄ ≥ 511.4} ∪ {X̄ ≤ 488.6}) = 0.072.

Thus, we reject H0 if α ≥ 0.072, and we fail to reject H0 if α < 0.072. Therefore, if α = 0.1 we reject H0, and if α = 0.05 we fail to reject H0, as we saw before.
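The p-value can be recomputed directly from the data in Table 2.1 (a standard-library sketch):

```python
from math import erf, sqrt

# Recompute the example's p-value from the data in Table 2.1.
data = [512.7280, 518.8793, 500.6510, 541.8786, 535.5288,
        513.4458, 479.1822, 500.8753, 525.4474, 485.1904]

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n = len(data)
xbar = sum(data) / n          # about 511.4
sd = 20 / sqrt(n)             # sd of the sample mean under the model
delta = abs(xbar - 500)

# Two-sided tail probability under H0 (mu = 500).
p_value = 2 * (1 - Phi(delta / sd))
print(round(p_value, 3))      # 0.072
```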

Lastly, this example does not have a uniformly most powerful test due to the two-sided H1 (µ can be less than 500 or greater than 500). For more information about this the reader is referred to Abramovich & Ritov (2013).

2.1.3 Receiver Operating Characteristic (ROC) Curves

Receiver Operating Characteristic (ROC) curves are useful tools for assessing the performance of tests. The ROC curve of a test is created by plotting its false positive rate against its true positive rate. To define these rates we first introduce the contingency table.


Definition 2.1.12. (Contingency Table)

                    | H0 true             | H1 true
Fail to reject H0   | True Negative (TN)  | False Negative (FN)
Reject H0           | False Positive (FP) | True Positive (TP)

Note that a false negative corresponds to a type II error and a false positive corresponds to a type I error.

Given Definition 2.1.12, we can calculate the false positive and true positive rate of a test as follows,

True positive rate = TP / (TP + FN),
False positive rate = FP / (FP + TN).

Plotting these values with the false positive rate as the x-axis and true positive rate as the y-axis gives us the ROC curve of a test.

A perfect test represents the case where there are no false negatives and no false positives. Thus, its ROC curve would pass through the upper left corner (0, 1). A random guess is represented by the diagonal line from the bottom left corner (0, 0) to the top right corner (1, 1), passing through the point (0.5, 0.5). This line is called the line of no-discrimination. Curves above this diagonal represent good results and curves underneath it represent bad results. Note that a bad test can simply be inverted.

Plotting ROC curves for different tests allows for a fair comparison. The better test will have a ROC curve closer to the top left corner, i.e. closer to the ROC curve of a perfect test. If the curves cross at multiple points and it is unclear which is better, the area under the curve (AUC) is calculated. The curve with the largest area corresponds to the better test.
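The construction of an ROC curve and its area can be sketched in a few lines; the scores and labels below are illustrative, not taken from any test in this chapter:

```python
# Build an ROC curve from scores and binary labels, then compute the
# area under it with the trapezoidal rule.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]

P = sum(labels)          # number of positives
N = len(labels) - P      # number of negatives

# Sweep the threshold over every score (descending): at each step,
# count true/false positives among scores >= threshold.
points = [(0.0, 0.0)]
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    points.append((fp / N, tp / P))   # (FPR, TPR)

# Area under the curve via trapezoids.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 2))   # 0.8
```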

2.1.4 Permutation Methods

Permutation methods are simple, yet powerful methods based on non-parametric concepts that can be traced back to the work of Fisher (1936) and Pitman (1937). In non-parametric statistics, the goal is to drop as many assumptions on a model as possible. This allows for a wider application and a higher level of robustness. Until the late 20th century permutation methods were rarely discussed in non-parametric literature, but have recently proved to be invaluable (Ernst, 2004).

Permutation methods are defined as follows:

Definition 2.1.13. (Permutation Methods) Given that the observations are exchangeable under the null hypothesis of a test, permutation methods derive the exact distribution of the test statistic under the null hypothesis by considering every permutation of the observations.

For n observations there are n! permutations; therefore, these methods were impractical, and hence neglected, until the recent exponential growth in computer processing power. We are now able to easily find exact inferences using these methods. It is also important to note that permutation methods themselves can be applied to parametric or non-parametric models; however, they are based on non-parametric concepts.

In Definition 2.1.13, "exchangeable under the null hypothesis" means that the joint distribution of the sample is invariant under permutations of the observations when the null hypothesis holds. If this is the case, then permutation methods can be used. In addition, the value of the test statistic cannot remain constant under permutations. If it is constant, we cannot derive its distribution.


Assuming that these constraints are satisfied, we can derive the probability mass function of a test statistic T by computing its value for every permutation. Then we can determine whether the value of T for the observed order of data is extreme. To establish how extreme the value is, we can compute the p-value of the test.

To illustrate this further, consider the hypotheses X = 0 against X > 0. Let T denote the test statistic that we use to choose between the two hypotheses. Then, we reject the null hypothesis if T is large. Assuming that all the constraints are satisfied, we use permutation methods and can compute the p-value of the test as follows. Let t denote the observed value of T, i.e. the value of T for the observed order of the data, and P be the set of all permutations of the observed data of length n. Then we calculate the p-value as follows:

p-value = (1/|P|) Σp∈P 1{tp ≥ t},   (2.4)

where tp denotes the value of T for permutation p ∈ P and 1 is the indicator function.
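Equation 2.4 can be made concrete with a standard instance: a two-sample permutation test with T equal to the difference in group means (the data below are made up for illustration). Enumerating which indices form the first group is equivalent to enumerating all permutations, because T ignores the order within each group:

```python
from itertools import combinations

# A standard concrete instance of equation (2.4): a two-sample
# permutation test with T = mean(x) - mean(y).
def perm_p_value(x, y):
    pooled = x + y
    n, k = len(pooled), len(x)
    t_obs = sum(x) / len(x) - sum(y) / len(y)
    count, total = 0, 0
    for idx in combinations(range(n), k):
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(n) if i not in idx]
        t_p = sum(g1) / k - sum(g2) / (n - k)
        count += t_p >= t_obs   # assignments at least as extreme
        total += 1
    return count / total

# Illustrative data: the observed split is the most extreme of the
# C(6, 3) = 20 group assignments, so the p-value is 1/20.
p = perm_p_value([3.0, 4.0, 5.0], [0.0, 1.0, 2.0])
print(p)   # 0.05
```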

In hypothesis testing, we want to control the type I error of a test by choosing its significance level. Because permutation methods derive the exact distribution of the test statistic under the null hypothesis, the type I error can be controlled exactly. Thus, we can construct exact tests using permutation methods. We define an exact test as follows:

Definition 2.1.14. (Exact Test) An exact test is a statistical test for which the probability of the type I error is equal to the significance level. Thus,

P(reject H0 when H0 is true) = α.

In other words, if an exact test has significance level α = 0.05, repeating the test over samples where the null hypothesis is true will result in rejecting H0 in 5% of cases. This is a desirable property for a test. However, as the permutation distribution of the test statistic is discrete, this property does not hold for all α. More specifically, it only holds if α is a multiple of 1/|P|.
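This discreteness can be seen by enumerating the permutation distribution for a small illustrative example: with |P| equally likely rearrangements, the p-value can only take values k/|P|. A sketch, using a two-sample difference-of-means statistic on made-up data:

```python
from itertools import combinations

# Enumerate the permutation distribution of a two-sample
# difference-of-means statistic and list every attainable p-value.
pooled = [3.0, 4.0, 5.0, 0.0, 1.0, 2.0]
n, k = len(pooled), 3

t_values = []
for idx in combinations(range(n), k):
    g1 = [pooled[i] for i in idx]
    g2 = [pooled[i] for i in range(n) if i not in idx]
    t_values.append(sum(g1) / k - sum(g2) / (n - k))

total = len(t_values)   # C(6, 3) = 20 equally likely group assignments
p_values = sorted({sum(tp >= t for tp in t_values) / total
                   for t in t_values})
print(p_values)   # every value is a multiple of 1/20 = 0.05
```

Exactness is therefore only attainable at significance levels in this discrete set.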