
Eindhoven University of Technology

BACHELOR

Testing for the Period of a Function using Permutation Methods

Freyer, Caroline

Award date:

2020

Link to publication

Disclaimer

This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.


Testing for the Period of a Function using

Permutation Methods

Bachelor Final Project

C.W.S. Freyer 1036039

Department of Mathematics and Computer Science

Supervisor:

P.J. De Andrade Serra

Final Version

Eindhoven, July, 2020


Abstract

Testing for the period of repetitive behaviour in data has high practical value in fields such as signal processing and business forecasting. In this report, new tests to determine the period of a function are proposed, seemingly for the first time, using permutation methods. Permutation methods allow tests to be exact, flexible, and simple under minimal constraints. The test statistics for the new tests were either extracted from known approaches and applied in the context of permutation methods, or newly derived. The most prominent benefit of using permutation methods is that we do not need to derive the distribution of the test statistics used; we can use any test statistic that does not remain constant under permutations. The tests using Fisher's G-statistic, Bartlett's method, and the new test statistic derived from Fisher's G-statistic show very promising results. For single-frequency functions, the tests using Fisher's G-statistic and the new test statistic perform consistently well. For functions with multiple frequencies, the test using Bartlett's method performs best. Furthermore, in terms of efficiency, the tests using Fisher's G-statistic and the new statistic perform exceptionally well, with a runtime of O(n log n). Thus, for practical use, these tests are recommended.


Contents

1 Introduction

2 Background Information
  2.1 Hypothesis Testing
    2.1.1 Definitions
    2.1.2 Example
    2.1.3 Receiver Operating Characteristic (ROC) Curves
    2.1.4 Permutation Methods
  2.2 Chi-Squared Distribution
  2.3 Fourier Analysis
  2.4 Periodogram
  2.5 Time Series
    2.5.1 Time Domain
  2.6 Frequency Domain
    2.6.1 Periodogram for a Time Series
    2.6.2 Bartlett's Method
    2.6.3 Welch's Method

3 Literature Overview
  3.1 Problem Description
  3.2 Distribution of the Periodogram
  3.3 Known Approaches
    3.3.1 Fisher's Test
    3.3.2 Testing using an F-statistic
    3.3.3 Testing using Fisher's G-statistic
    3.3.4 Testing using Bartlett's Method
    3.3.5 Testing using Welch's Method
    3.3.6 Testing using Vanicek's Method
    3.3.7 Testing using Lomb-Scargle Periodogram

4 Testing using Permutation Methods
  4.1 Methods for Permuting
  4.2 Test Statistics

5 Results & Comparison of Methods
  5.1 Relation to Chapter 3
  5.2 Algorithm Specifics
  5.3 Incorrectly Rejecting H0
    5.3.1 Errors
  5.4 Receiver Operating Characteristic (ROC) Curves
    5.4.1 Larger Sample Sizes
  5.5 Limitations
    5.5.1 Multiples of the Period
    5.5.2 Fractions of a Multiple of the Period
    5.5.3 Remarks for Possible Solutions
  5.6 Testing Functions with Multiple Frequencies
    5.6.1 Test using Fisher's G-statistic
    5.6.2 Other Tests
  5.7 Runtime
  5.8 New Test Statistic

6 Conclusion

References

A Notation

B Proof of Proposition 3.2.1

C Python Code


Chapter 1

Introduction

Testing for the period of a function is important for many applications. In terms of business forecasting, companies need to predict possible future scenarios. For example, if sales data is found to be periodic, knowing the period of this data allows one to form viable predictions for possible sales patterns in the future. Another example is collision avoidance when transmitting signals. To ensure that signals transferring information do not interfere with each other and corrupt the transferred data, it is important to know the period of these signals.

If a function f has a period τ > 0, then f(t) = f(t + iτ) for all t in the domain of f and all i ∈ Z. Thus, the function repeats itself in a regular pattern at established intervals of length τ. This allows one to determine the value of f(t) at any point once the values of f are known for a single period. Common examples of such functions are the sine and cosine functions; both have period 2π, so these functions repeat themselves on intervals of length 2π.

Since Ronald A. Fisher proposed a test for detecting periodicity in 1929, many mathematicians have contributed to the field of detecting periodicity in data, and statistical inference methods have evolved based on Fisher's work. One specific research area is determining the period of a function, and it is this area that this report focuses on. More specifically, the problem that will be addressed is defined as follows.

Problem: Suppose f is a periodic function with unknown period τ ∈ Q that we observe at every time unit. However, the observations are corrupted by noise. Thus, we are observing the series X_t such that

$$X_t = f_t + \varepsilon_t,$$

where f_t = f(t) is the function value at time t and ε_t is the random error at time t due to noise. Then, on the basis of the observations X_1, ..., X_n, we would like to test whether f has period τ_0 ∈ Q.

To solve this problem we first consider the existing tests used to determine the period of a given function. Many of the existing tests are only approximate, not exact. Therefore, in this report, we propose a new exact statistical test for the given problem. The novelty of this test stems from its procedure, which uses permutation methods.

Using permutation methods allows us to find the exact distribution of a chosen test statistic under minimal constraints. By computing the value of the test statistic under different permutations, we can construct its distribution; there is no need to solve for the distribution analytically. Due to these benefits, in this report, we attempt to answer the guiding question:


Guiding Question: Can permutation methods be used to accurately test the period of a function?

Chapter 2 gives an overview of the necessary background information and notation that is important throughout this report. A summary of the notation can also be found in Appendix A. In Chapter 3, the existing tests used to solve the problem described above are detailed and evaluated. The new test using permutation methods is proposed in Chapter 4. In this chapter, the methods used to permute the data are given and the test statistics used are defined. Chapter 5 evaluates which test statistic performs best for a number of examples. In these examples, functions with both single and multiple frequencies are addressed. Lastly, a final conclusion for the best test statistic is given with recommendations for future research.


Chapter 2

Background Information

2.1 Hypothesis Testing

2.1.1 Definitions

In many situations, we must decide if a certain claim about a population is true. A statistician is required to make this decision based on some data extracted from the population. Data is defined in the following way:

Definition 2.1.1. (Data) Data is a collection of measurements of random variables.

To aid in such a decision, two hypotheses are defined: the null hypothesis, H_0, and the alternative, H_1. Choosing between these hypotheses based on data is known as hypothesis testing. This is defined formally by Abramovich & Ritov (2013) as follows:

Definition 2.1.2. (Hypothesis Testing) Let Θ denote the parameter set of our underlying model. Consider Θ_0, Θ_1 ⊆ Θ ⊆ R such that Θ_0 ∩ Θ_1 = ∅. Assume the data X = (X_1, ..., X_n) has distribution F_θ, where θ is unknown and θ ∈ Θ_0 ∪ Θ_1.

We want to test the null hypothesis

$$H_0: \theta \in \Theta_0,$$

against the alternative hypothesis

$$H_1: \theta \in \Theta_1.$$

Our goal is to construct a test function X ↦ ψ(X) ∈ {0, 1}, where ψ(X) = 1 means we reject the null hypothesis and ψ(X) = 0 means we fail to reject the null hypothesis.

For the null hypothesis, there are also two classifications: simple and composite. A simple hypothesis is one that completely specifies the distribution; if this is not the case, the null hypothesis is composite. In Definition 2.1.2, the null hypothesis is simple if Θ_0 is a singleton, i.e. contains only one value.

Our decision to reject the null hypothesis or not is based on data. Therefore, the decision is also random and can be incorrect. In hypothesis testing there are two errors: type I and type II.

Definition 2.1.3. (Type I & Type II Errors)

type I error = reject H_0 when H_0 is true,
type II error = fail to reject H_0 when H_1 is true.

A test is constructed in order to minimise the occurrence of these errors.

From Definition 2.1.2, we can define the critical region of the test.


Definition 2.1.4. (Critical Region) The critical region of a test satisfying Definition 2.1.2 is the set C such that

$$C = \{x : \psi(x) = 1\}.$$

In words, it is the set of realisations of X, denoted by x, which result in a rejection of H_0. Therefore, it is the set of realisations which would be unlikely to occur under H_0.

For the test ψ we can define the following probabilities:

$$\alpha_\psi(\theta) = P_\theta(\text{reject } H_0) = P_\theta(\psi(X) = 1) = P_\theta(X \in C), \tag{2.1}$$

$$\beta_\psi(\theta) = P_\theta(\text{fail to reject } H_0) = P_\theta(\psi(X) = 0) = P_\theta(X \notin C). \tag{2.2}$$

We write P_θ to emphasise that the data is a random sample from F_θ.

Equations 2.1 and 2.2 define the probability of a type I error and the probability of a type II error, respectively. From these equations we can define the significance level of a test as follows:

Definition 2.1.5. (Significance Level) A test ψ is said to have significance level α ∈ (0, 1) if

$$\sup_{\theta \in \Theta_0} \alpha_\psi(\theta) \le \alpha. \tag{2.3}$$

Thus, the significance level controls how often H_0 can be rejected. With significance level α, the test should not incorrectly reject H_0 more than α · 100% of the time.

Then to assist in our decision between the hypotheses H0 and H1, we define a test statistic.

Definition 2.1.6. (Test Statistic) A test statistic is a function of the data that is used as a basis for our decision. More specifically, it is used to construct the test function.

With Definition 2.1.6 we can define the statistical test as follows:

Definition 2.1.7. (Statistical Test with Test Statistic) A statistical test based on a test statistic T rejects the null hypothesis if

$$T \in C,$$

where C is the critical region of the test as defined in Definition 2.1.4. In this case, our test function is

$$\psi(X) = \mathbb{1}\{T \in C\},$$

where 𝟙 is the indicator function.

A test statistic is chosen to minimise the probability of making errors. The significance level of the test is often chosen during design and is usually set to a value of 0.05, so that the probability of a type I error is small. A small type I error is desirable; however, the smaller the type I error, the less likely we are to reject the null hypothesis, even when we should. This leads to a high type II error. Thus, there is a trade-off between the two types of errors: decreasing one of them increases the other. This can be illustrated with a simple example. If we have a test that always rejects H_0, regardless of the data, then the type II error is zero; however, the type I error is one. Therefore, we need to find a balance between the two errors.

To aid in this balance, we define the power and power function of a test.

Definition 2.1.8. (Power of a Test) The power of a test ψ(X) is computed for any θ ∉ Θ_0 and is given by

$$1 - \beta_\psi(\theta).$$

In other words, the power of the test is the probability that the test rejects H_0 when H_1 is true.


Definition 2.1.9. (Power Function) The power function of a test ψ(X) is defined as π_ψ : Θ \ Θ_0 → [0, 1], where

$$\pi_\psi(\theta) = P_\theta(\text{reject } H_0) = P_\theta(\psi(X) = 1) = P_\theta(X \in C), \quad \text{for } \theta \in \Theta \setminus \Theta_0.$$

The aim is to find a test of maximum power given the chosen significance level. Such a test is called the uniformly most powerful test, defined by Abramovich & Ritov (2013) as follows:

Definition 2.1.10. (Uniformly Most Powerful Test) A test ψ (defined as in Definition 2.1.2) of significance level α with power function π_ψ is the uniformly most powerful test if for any other test ψ′ of significance level no larger than α with power function π_{ψ′},

$$\forall \theta_1 \in \Theta_1: \pi_\psi(\theta_1) \ge \pi_{\psi'}(\theta_1).$$

In summary, we define a test with an appropriate test statistic, significance level α, and critical region C. Then we check whether the observed value t of the test statistic T lies in the critical region. If it does, we reject the null hypothesis at significance level α; otherwise, it cannot be rejected. Note that failing to reject the null hypothesis does not mean that we accept H_0 as true; we just do not have enough evidence to reject it.

However, it can be argued that one can always obtain the desired conclusion by choosing the significance level accordingly. Therefore, it is preferred to report the p-value of a test.

Definition 2.1.11. (p-value) The p-value of a test ψ is

$$p(X) = \inf\{\alpha \in (0, 1) : \psi_\alpha(X) = 1\},$$

where ψ_α denotes the test with significance level α.

In words, it is the smallest significance level α that would lead to a rejection of H_0. Thus, we reject H_0 at significance level α if the p-value is less than or equal to α; otherwise, we fail to reject. Therefore, a low p-value (usually less than 0.05) implies a rejection of H_0.

2.1.2 Example

To relate the concepts defined above, we consider a company that sells 500-millilitre (ml) bottles of water. We are asked to check if the filling machine works correctly. For simplicity, we assume that the filling of the bottles can be modelled by a normal distribution with unknown mean µ and known standard deviation of 20 ml. We collect ten bottles at random, with the amount of water found in each bottle recorded in Table 2.1.

512.7280 518.8793 500.6510 541.8786 535.5288 513.4458 479.1822 500.8753 525.4474 485.1904

Table 2.1: Amount of water (in ml) of the ten bottles collected randomly.

We represent the values in Table 2.1 by the vector X. Then, we define the competing hypotheses

$$H_0: \mu = 500 \quad \text{versus} \quad H_1: \mu \ne 500,$$

where µ ∈ R. In this case we can use the sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$$

as the test statistic because it is a natural estimator of the mean. Then we reject H_0 if the observed value of X̄ is significantly above or below 500 ml. Thus, we can construct the test function

$$\psi(X) = \mathbb{1}\{\bar{X} \in C_\delta\},$$


where $C_\delta = (-\infty, 500 - \delta] \cup [500 + \delta, \infty)$ is the critical region of our test for any δ ≥ 0.

Suppose we let δ = 15 ml. Then under H_0, µ = 500 and so

$$\bar{X} \sim N\!\left(500, \frac{20^2}{10}\right).$$

Thus, the significance level of the test is

$$\alpha = P_{\mu=500}(\{\bar{X} \ge 500 + \delta\} \cup \{\bar{X} \le 500 - \delta\}).$$

The significance level is easy to find, as we have a simple null hypothesis; therefore, no supremum needs to be computed (see Definition 2.1.5). Then, by rewriting the probability in terms of the standard normal Z, we find that α = 0.0177 when δ = 15 ml.

If we let µ ∈ Θ \ Θ_0, the power of the test can be calculated. As Θ \ Θ_0 = R \ {500}, we choose µ = 508. Then the power of the test is

$$1 - \beta_\psi(508) = P_{\mu=508}(\{\bar{X} \ge 500 + \delta\} \cup \{\bar{X} \le 500 - \delta\}) = 0.134.$$

However, the value of µ is unknown, and it is unusual to determine the significance level through the choice of δ. Therefore, we choose the significance level of our test first. Let α = 0.1; then, due to symmetry,

$$\frac{\alpha}{2} = P_{\mu=500}(\bar{X} \ge 500 + \delta) = P_{\mu=500}\!\left(Z \ge \frac{\delta\sqrt{10}}{20}\right),$$

thus,

$$\frac{\delta\sqrt{10}}{20} = z_{\alpha/2} = 1.645,$$

and then δ = 10.4. This results in $C_{10.4} = (-\infty, 489.6] \cup [510.4, \infty)$. For the data that we have collected, X̄ = 511.4, which falls inside the rejection region. Therefore, we reject H_0 at significance level α = 0.1.

If we chose α = 0.05, the critical region would be (−∞, 487.6] ∪ [512.4, ∞). In this case, there is not enough evidence to reject the null hypothesis at significance level α = 0.05.

This example illustrates the importance of the p-value: based on what we choose α to be, we can get the conclusion we want, and thus we prefer to report the p-value. We know the p-value is the smallest α value for which we reject H_0. Thus, δ = x̄ − 500 = 11.4 and

$$\text{p-value} = P_{\mu=500}(\{\bar{X} \ge 511.4\} \cup \{\bar{X} \le 488.6\}) = 0.072.$$

Thus, we reject H_0 if α ≥ 0.072, and we fail to reject H_0 if α < 0.072. Therefore, if α = 0.1 we reject H_0, and if α = 0.05 we fail to reject H_0, as we saw before.

Lastly, this example does not have a uniformly most powerful test due to the two-sided H_1 (µ can be less than 500 or greater than 500). For more information about this, the reader is referred to Abramovich & Ritov (2013).
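The numbers in this example are easy to reproduce. The following sketch (our own illustration, separate from the thesis code in Appendix C, and using only the Python standard library) recomputes the significance level for the fixed cutoff δ = 15 ml and the two-sided p-value for the observed sample:

```python
from statistics import NormalDist

# Water volumes (ml) of the ten sampled bottles from Table 2.1.
data = [512.7280, 518.8793, 500.6510, 541.8786, 535.5288,
        513.4458, 479.1822, 500.8753, 525.4474, 485.1904]

n, sigma, mu0 = len(data), 20.0, 500.0
se = sigma / n ** 0.5            # standard deviation of the sample mean
xbar = sum(data) / n             # observed sample mean (about 511.4)
Z = NormalDist()                 # standard normal distribution

# Significance level of the test with fixed cutoff delta = 15 ml.
delta = 15.0
alpha = 2 * (1 - Z.cdf(delta / se))                 # about 0.0177

# Two-sided p-value for the observed sample mean.
p_value = 2 * (1 - Z.cdf(abs(xbar - mu0) / se))     # about 0.072
```

The p-value computed from the unrounded sample mean agrees with the value 0.072 obtained above with δ rounded to 11.4.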

2.1.3 Receiver Operating Characteristic (ROC) Curves

Receiver Operating Characteristic (ROC) curves are useful tools for assessing the performance of tests. The ROC curve of a test is created by plotting its false positive rate against its true positive rate. To define these rates we first introduce the contingency table.


Definition 2.1.12. (Contingency Table)

                     | H0 true             | H1 true
Fail to Reject H0    | True Negative (TN)  | False Negative (FN)
Reject H0            | False Positive (FP) | True Positive (TP)

Note that a false negative corresponds to a type II error and a false positive corresponds to a type I error.

Given Definition 2.1.12, we can calculate the true positive and false positive rates of a test as follows:

$$\text{True positive rate} = \frac{TP}{TP + FN}, \qquad \text{False positive rate} = \frac{FP}{FP + TN}.$$

Plotting these values with the false positive rate as the x-axis and true positive rate as the y-axis gives us the ROC curve of a test.

A perfect test represents the case where there are no false negatives and no false positives; thus, its ROC curve would go through the upper left corner (0, 1). A random guess is represented by a diagonal line from the bottom left to the top right corner, passing through the point (0.5, 0.5). This line is called the line of no-discrimination. Curves above this diagonal represent good results and curves underneath this diagonal represent bad results. Note that bad tests can simply be inverted.

Plotting ROC curves for different tests allows for a fair comparison. The better test will have a ROC curve closer to the top left corner i.e. closer to the ROC curve of a perfect test. In case the lines overlap each other at multiple points and it is unclear which of the curves is better, the area under the curve is calculated. The curve with the largest area corresponds to the better test.
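As a concrete illustration (our own sketch, not code from the thesis), an ROC curve can be traced by sweeping the rejection threshold of a test statistic over values simulated under H_0 and under H_1, and the area under the curve can then be compared across tests:

```python
def roc_curve(stats_h0, stats_h1):
    """Build an ROC curve for a test that rejects when the statistic
    exceeds a threshold. stats_h0: statistic values simulated under H0,
    stats_h1: values simulated under H1. Returns (fpr, tpr) lists."""
    thresholds = sorted(set(stats_h0) | set(stats_h1), reverse=True)
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        fp = sum(s >= t for s in stats_h0)   # false positives at cutoff t
        tp = sum(s >= t for s in stats_h1)   # true positives at cutoff t
        fpr.append(fp / len(stats_h0))
        tpr.append(tp / len(stats_h1))
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal area under the ROC curve: 1.0 for a perfect test,
    # 0.5 for the line of no-discrimination.
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))
```

For perfectly separated statistic values the area is 1; for overlapping values it falls towards 0.5.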

2.1.4 Permutation Methods

Permutation methods are simple, yet powerful methods based on non-parametric concepts that can be traced back to the work of Fisher (1936) and Pitman (1937). In non-parametric statistics, the goal is to drop as many assumptions on a model as possible. This allows for a wider application and a higher level of robustness. Until the late 20th century, permutation methods were rarely discussed in non-parametric literature, but they have recently proved to be invaluable (Ernst, 2004).

Permutation methods are defined as follows:

Definition 2.1.13. (Permutation Methods) Given that the observations are exchangeable under the null hypothesis of a test, permutation methods derive the exact distribution of the test statistic under the null hypothesis by considering every permutation of the observations.

For n observations there are n! permutations; therefore, these methods were not practical, and were largely neglected, until the recent exponential growth in computer processing power. We are now able to find exact inferences using these methods. It is also important to note that permutation methods themselves can be applied to parametric or non-parametric models; however, they are based on non-parametric concepts.

In Definition 2.1.13, "exchangeable under the null hypothesis" means that the joint distribution of the sample is invariant under permutations of the observations when the null hypothesis holds. If this is the case, then permutation methods can be used. In addition, the value of the test statistic cannot remain constant under permutations; if it is constant, we cannot derive its distribution.


Assuming that these constraints are satisfied, we can derive the probability mass function of a test statistic T by computing its value for every permutation. Then we can determine whether the value of T for the observed order of the data is extreme. To establish how extreme the value is, we can compute the p-value of the test.

To illustrate this further, consider the hypotheses X = 0 against X > 0, and let T denote the test statistic that we use to choose between them. Then, we reject the null hypothesis if T is large. Assuming that all the constraints are satisfied, we use permutation methods and can compute the p-value of the test as follows. Let t denote the observed value of T, i.e. the value of T for the observed order of the data, and let P be the set of all permutations of the observed data of length n. Then we calculate the p-value as follows:

$$\text{p-value} = \frac{1}{|P|} \sum_{p \in P} \mathbb{1}\{t_p \ge t\}, \tag{2.4}$$

where t_p denotes the value of T for permutation p ∈ P and 𝟙 is the indicator function.
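Equation 2.4 translates directly into code. The following sketch is our own illustration with a toy trend statistic (not one of the test statistics proposed later in this report); it enumerates all n! permutations, which is only feasible for small n — in practice one samples a random subset of permutations instead.

```python
from itertools import permutations

def permutation_p_value(data, statistic):
    # Exact p-value from equation (2.4): the fraction of permutations p
    # whose statistic value t_p is at least the observed value t.
    t_obs = statistic(data)
    perms = list(permutations(data))
    count = sum(statistic(p) >= t_obs for p in perms)
    return count / len(perms)

# Toy statistic: large when the data increases with time, so it is not
# constant under permutations of the observations.
def trend_statistic(x):
    return sum(t * v for t, v in enumerate(x))

p = permutation_p_value((1.0, 2.0, 3.0), trend_statistic)   # 1/6
```

Here the observed (increasing) order attains the largest of the six statistic values, so the p-value is 1/6, the smallest value this exact test can report for n = 3.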

In hypothesis testing, we want to control the type I error of a test by choosing its significance level. Because permutation methods yield the exact distribution of the test statistic under the null hypothesis, the type I error can be controlled exactly. Thus, we can construct exact tests using permutation methods. We define an exact test as follows:

Definition 2.1.14. (Exact Test) An exact test is a statistical test for which the probability of the type I error is equal to the significance level. Thus,

$$P(\text{reject } H_0 \text{ when } H_0 \text{ is true}) = \alpha.$$

In other words, if an exact test has significance level α = 0.05, repeating the test over samples where the null hypothesis is true will result in H_0 being rejected in 5% of cases. This is a desirable property for a test. However, as the test statistic is a discrete random variable, this property does not hold for all α; more specifically, it only holds if α is a multiple of 1/|P|.

2.2 Chi-Squared Distribution

The chi-squared distribution is characterised by one parameter: the degrees of freedom, usually denoted by k. The distribution is related to the standard normal. Given a standard normal random variable Z ∼ N(0, 1),

$$Z^2 \sim \chi^2(1), \tag{2.5}$$

where the (1) denotes the single degree of freedom. It is known that the sum of independent chi-squared variables is chi-squared. Therefore, if we have random variables Z = (Z_1, ..., Z_n) with Z_i i.i.d. N(0, 1) for each i = 1, ..., n,

$$\|Z\|^2 = \sum_{i=1}^{n} Z_i^2 \sim \chi^2(n). \tag{2.6}$$

In addition, it is known that

$$\sum_{i=1}^{n} (Z_i - \bar{Z})^2 \sim \sigma^2 \chi^2(n - 1), \tag{2.7}$$

where $\bar{Z} = n^{-1}\sum_{i=1}^{n} Z_i$ is the sample mean of Z. The sample mean constrains the equation: after the first n − 1 components are found, the last one can be derived from the sample mean. Therefore, there is one less degree of freedom.
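Relation 2.6 is easy to check empirically. This small simulation (our own sketch, using only the standard library) draws vectors of n i.i.d. standard normals and compares the average of ‖Z‖² with n, the mean of a χ²(n) distribution:

```python
import random

random.seed(0)                     # fixed seed for reproducibility
n, reps = 5, 20000

# Each sample is ||Z||^2 = sum of n squared N(0,1) draws, which by
# equation (2.6) follows a chi-squared distribution with n degrees
# of freedom and therefore has mean n.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(n))
           for _ in range(reps)]
empirical_mean = sum(samples) / reps   # close to n = 5
```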


2.3 Fourier Analysis

A Fourier series is an expansion of a periodic function using harmonically related sinusoids. We use the following definitions by Brockwell & Davis (2006) to construct the Fourier series for an arbitrary sequence of numbers f1, . . . , fn.

Definition 2.3.1. (Harmonics) Suppose f_1, ..., f_n is a sequence of numbers where f_t = f(t) for some function f with period n. Then the fundamental frequency is 2π/n and the harmonics are

$$n^{-1/2} e^{it\omega_j},$$

for t = 1, ..., n, where ω_j := 2πj/n represents an integer multiple of the fundamental frequency 2π/n within the interval (−π, π]. These frequencies, ω_j ∈ (−π, π], are called the Fourier frequencies of the series {f_1, ..., f_n}.

The vector form of these harmonics is

$$e_j = n^{-1/2}\left(e^{i\omega_j}, e^{i2\omega_j}, \dots, e^{in\omega_j}\right)^T, \quad j \in J_n, \tag{2.8}$$

where

$$J_n = \{j \in \mathbb{Z} : -\pi < \omega_j \le \pi\}. \tag{2.9}$$

Note that J_n contains n integers.

With this notation we introduce an important lemma from Brockwell & Davis (2006).

Lemma 2.3.1. The vectors e_j, j ∈ J_n, as defined above constitute an orthonormal basis for C^n.

Proof.

$$\langle e_j, e_k \rangle = n^{-1} \sum_{r=1}^{n} e^{ir(\omega_j - \omega_k)} =
\begin{cases}
1 & \text{if } j = k, \\[4pt]
n^{-1} e^{i(\omega_j - \omega_k)} \dfrac{1 - e^{in(\omega_j - \omega_k)}}{1 - e^{i(\omega_j - \omega_k)}} = 0 & \text{if } j \ne k.
\end{cases}$$

A consequence of this lemma is the following corollary.

Corollary 2.3.2. For any f = (f_1, ..., f_n)^T ∈ C^n, we can express f as a linear combination of the harmonics as follows:

$$f = \sum_{j \in J_n} \tilde{f}_j e_j, \tag{2.10}$$

where

$$\tilde{f}_j = \langle f, e_j \rangle = n^{-1/2} \sum_{t=1}^{n} f_t e^{-it\omega_j}. \tag{2.11}$$

Definition 2.3.2. (Discrete Fourier Transform (DFT)) The discrete Fourier transform (DFT) of f ∈ C^n is the sequence {f̃_j, j ∈ J_n} defined by equation 2.11.

The DFT converts a finite sequence of equally spaced data points in the time domain into a sequence of values of a complex frequency function. Thus, the DFT is the frequency-domain representation of the original sequence. As a finite amount of data is used, methods using the DFT can be implemented efficiently. A well-known algorithm is the fast Fourier transform, which computes the DFT of a sequence in O(n log n) time.
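A direct implementation of the coefficients in equation 2.11 takes O(n²) operations (the fast Fourier transform reduces this to O(n log n)). The following sketch, written by us for illustration, computes the coefficients over the Fourier frequencies:

```python
import cmath

def dft_coefficients(f):
    # f~_j = <f, e_j> = n^{-1/2} * sum_{t=1}^n f_t * exp(-i*t*w_j),
    # with w_j = 2*pi*j/n and j in J_n = {j in Z : -pi < w_j <= pi},
    # as in equation (2.11). Direct O(n^2) evaluation, not an FFT.
    n = len(f)
    coeffs = {}
    for j in range(-((n - 1) // 2), n // 2 + 1):   # J_n has n integers
        w = 2 * cmath.pi * j / n
        coeffs[j] = n ** -0.5 * sum(f[t - 1] * cmath.exp(-1j * t * w)
                                    for t in range(1, n + 1))
    return coeffs
```

Since the e_j form an orthonormal basis (Lemma 2.3.1), the squared moduli of the coefficients sum to ‖f‖², which gives a convenient sanity check on the implementation.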


2.4 Periodogram

A periodogram is a tool used to check the level of dominance of each frequency in the function that generates the sequence of numbers f_1, ..., f_n. It is defined by Brockwell & Davis (2006) as follows:

Definition 2.4.1. (Periodogram of f ∈ C^n) Let f = (f_1, ..., f_n)^T ∈ C^n and let I_f(ω_j) denote the value of the periodogram of f at frequency ω_j = 2πj/n, j ∈ J_n. Then I_f(ω_j) is defined in terms of the discrete Fourier transform {f̃_j}_{j ∈ J_n} of f by

$$I_f(\omega_j) := |\tilde{f}_j|^2 = n^{-1} \left| \sum_{t=1}^{n} f_t e^{-it\omega_j} \right|^2. \tag{2.12}$$

From Definition 2.4.1 it can be seen that

$$I_f(\omega_0) = I_f(0) = n^{-1} \left| \sum_{t=1}^{n} f_t \right|^2. \tag{2.13}$$

Clearly, this value is determined by the sample mean of the sequence: I_f(0) = n|f̄|². Moreover, it can be seen that the periodogram values decompose ‖f‖². Thus,

$$\|f\|^2 = \sum_{j \in J_n} I_f(\omega_j). \tag{2.14}$$

This implies that if the periodogram value decreases at one frequency, it must increase by the same amount at a different frequency (or the difference must be distributed among several others).

When considering a sequence in the real plane, f ∈ R^n, we can simplify the periodogram expressions. If ω_j and ω_{−j} are both in (−π, π], from equation 2.11 we see that f̃_j is equal to the complex conjugate of f̃_{−j}, denoted by f̃*_{−j}. Therefore, I_f(ω_j) = I_f(−ω_j). Due to this symmetry, we can rewrite equation 2.10 in the form

$$f = \tilde{f}_0 e_0 + \sum_{j=1}^{\lfloor (n-1)/2 \rfloor} \left( \tilde{f}_j e_j + \tilde{f}_j^{*} e_{-j} \right) + \tilde{f}_{n/2} e_{n/2}, \tag{2.15}$$

where f̃_{n/2} e_{n/2} is defined as zero if n is odd. Then, if we express f̃_j in polar form, f̃_j = a_j e^{iθ_j}, we can rewrite equation 2.15 as

$$f = \tilde{f}_0 e_0 + \sum_{j=1}^{\lfloor (n-1)/2 \rfloor} \sqrt{2}\, a_j \left( c_j \cos\theta_j - s_j \sin\theta_j \right) + \tilde{f}_{n/2} e_{n/2}, \tag{2.16}$$

where

$$c_j = (2/n)^{1/2} (\cos\omega_j, \cos 2\omega_j, \dots, \cos n\omega_j)^T \tag{2.17}$$

and

$$s_j = (2/n)^{1/2} (\sin\omega_j, \sin 2\omega_j, \dots, \sin n\omega_j)^T. \tag{2.18}$$

Note that {e_0, c_1, s_1, ..., c_{⌊(n−1)/2⌋}, s_{⌊(n−1)/2⌋}, e_{n/2}} is now an orthonormal basis for R^n, where e_{n/2} is excluded if n is odd. Thus, we can decompose ‖f‖² = Σ_{i=1}^{n} f_i² into components corresponding to the elements of this basis set. For 1 ≤ j ≤ ⌊(n − 1)/2⌋, the components corresponding to c_j and s_j can be combined to form an ω_j component, as shown in Table 2.2. Then ‖f‖² is decomposed into components associated with the Fourier frequencies ω_j for j = 0, ..., n/2, where n/2 is only considered if n is even. Note that the ω_j component is the squared length of the projection of f onto the subspace sp{c_j, s_j} ⊂ R^n, defined as the closure of the span of c_j and s_j.


Frequency            | Component                                     | Degrees of Freedom
ω_0 (mean)           | |f̃_0|² = n^{-1}(Σ_{t=1}^{n} f_t)² = I(0)     | 1
ω_1                  | 2a_1² = 2|f̃_1|² = 2I(ω_1)                    | 2
...                  | ...                                           | ...
ω_k                  | 2a_k² = 2|f̃_k|² = 2I(ω_k)                    | 2
...                  | ...                                           | ...
ω_{n/2} (if n even)  | |f̃_{n/2}|² = I(π)                            | 1
Total                | Σ_{t=1}^{n} f_t² = ‖f‖²                       | n

Table 2.2: Decomposition of ‖f‖² for f ∈ R^n based on equation 2.16 (Brockwell & Davis, 2006).

Then we can rewrite equation 2.14 as follows:

$$\|f\|^2 = I_f(0) + \sum_{j=1}^{\lfloor (n-1)/2 \rfloor} 2\, I_f(\omega_j) + I_f(\pi), \tag{2.19}$$

where I_f(π) = I_f(ω_{n/2}) is defined as zero if n is odd.

2.5 Time Series

Time series occur frequently and are used in many areas such as economic and sales forecasting, inventory studies, and astronomy. They are defined as follows:

Definition 2.5.1. (Time Series) A time series {X_t : t = 1, ..., n} is an ordered sequence of data points indexed at fixed intervals of time.

One of the most basic time series is discrete white noise. It is defined by Cowpertwait & Metcalfe (2009) as follows:

Definition 2.5.2. (Discrete White Noise) A time series {w_t : t = 1, ..., n} is discrete white noise if all the variables are independent and identically distributed with mean zero and variance σ². This implies that Cov(w_i, w_j) = 0 for all i ≠ j. If, in addition, the variables follow a normal distribution, the series is called Gaussian white noise.

A time series can be considered in two different domains: the time domain and the frequency domain. In the following sections, we describe the properties of a time series in these two domains.

2.5.1 Time Domain

Within the time domain, there are three important properties of a time series: the autocovariance function, the autocorrelation function, and stationarity.

Definition 2.5.3. (Autocovariance Function) The autocovariance function of the time series {X_t : t = 1, ..., n} measures the linear dependence between different points in time of the series. It is defined as

$$\gamma(t, s) = \mathrm{Cov}(X_t, X_s) = E[(X_t - \mu_t)(X_s - \mu_s)], \tag{2.20}$$


where t, s ∈ {1, ..., n} are points in time, µ_t = E(X_t), and µ_s = E(X_s).

Note that γ(t, t) is the variance of the time series at time t. Moreover, even if γ(t, s) = 0, it is still possible that there is a dependence between the two points in time; however, it is not linear.

The autocovariance function as defined is dependent on the scale of the time series; thus, we introduce the autocorrelation function, which is a rescaled version of the autocovariance function.

Definition 2.5.4. (Autocorrelation Function) The autocorrelation function of the time series {X_t : t = 1, ..., n} measures the linear predictability of X_t given X_s. Formally,

$$\rho(t, s) = \frac{\gamma(t, s)}{\sqrt{\gamma(t, t)\,\gamma(s, s)}}, \tag{2.21}$$

for t, s ∈ {1, ..., n}.

The autocorrelation function allows for an accurate comparison between different time series, as it is rescaled. In addition, we can define a property of the autocovariance function: absolute summability.

Definition 2.5.5. (Absolutely Summable Autocovariance Function) The autocovariance function γ of a time series {X_t : t = 1, ..., n} is absolutely summable if

$$\sum_{j=-\infty}^{\infty} |\gamma(j)| < \infty. \tag{2.22}$$

Another important property of a time series is stationarity. To allow for informative predictions about a time series, it must exhibit some form of regularity; in a time series, this notion of regularity is called stationarity.

Definition 2.5.6. (Stationary Time Series) A time series {X_t : t = 1, . . . , n} is stationary when its mean and variance do not change over time. Thus, µ(t) = E[X_t] and σ²(t) = E[(X_t − µ(t))²] are constant, meaning that there is no time trend and that the time series is homoscedastic, respectively. In this case, {X_t} has mean µ and variance σ².

Given a stationary time series we can derive new definitions for the autocovariance and autocorrelation functions, as shown in Cowpertwait & Metcalfe (2009). For a stationary time series, the mean function is constant and independent of time, so µ(t) = µ for all t. Secondly, the autocovariance function γ only depends on the difference between the points in time. Therefore, if we have two points in time t and s, we define the difference in time as h = |s − t|, where s = t + h.

Definition 2.5.7. (Autocovariance Function of a Stationary Time Series) Let {X_t : t = 1, . . . , n} be a stationary time series with mean µ and let h be the difference in time between two points t, s ∈ {1, . . . , n}. Then the autocovariance function of {X_t} is

γ(h) = Cov(X_t, X_{t+h}) = E[(X_t − µ)(X_{t+h} − µ)].  (2.23)

Definition 2.5.8. (Autocorrelation Function of a Stationary Time Series) The autocorrelation function of a stationary time series {X_t : t = 1, . . . , n} with mean µ is

ρ(h) = γ(t, t + h) / √(γ(t, t) γ(t + h, t + h)) = γ(h) / γ(0),  (2.24)

where h is the difference in time between two points t, s ∈ {1, . . . , n}.

Note that when the above definitions use observed data, we get estimates for the autocovariance and autocorrelation function as shown in the definition below.


Definition 2.5.9. (Sample Autocovariance and Autocorrelation Function) Given the observed data {X_t : t = 1, . . . , n}, we estimate the autocovariance and autocorrelation functions to get the sample autocovariance and sample autocorrelation functions, defined as

ˆγ(h) = (1/n) ∑_{t=1}^{n−h} (X_t − X̄)(X_{t+h} − X̄)  and  ρ̂(h) = ˆγ(h) / ˆγ(0),  (2.25)

respectively, for h ≥ 0. If h < 0, then ˆγ(h) = ˆγ(−h).
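As an illustration, the estimators of Definition 2.5.9 can be computed directly; the following NumPy sketch uses our own function names and the divide-by-n normalisation of equation 2.25.

```python
import numpy as np

def sample_autocovariance(x, h):
    """Gamma-hat(h) from equation 2.25, normalised by n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = abs(h)                         # gamma-hat(h) = gamma-hat(-h)
    xbar = x.mean()
    return np.sum((x[:n - h] - xbar) * (x[h:] - xbar)) / n

def sample_autocorrelation(x, h):
    """Rho-hat(h) = gamma-hat(h) / gamma-hat(0)."""
    return sample_autocovariance(x, h) / sample_autocovariance(x, 0)
```

For long series one would compute all lags at once via the FFT, but the direct form above mirrors the definition.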

2.6 Frequency Domain

Traditional ways of analysing a time series focus on the time domain; however, considering the frequency domain in time series analysis has been shown to be more informative (Brandes et al., 1968).

2.6.1 Periodogram for a Time Series

The periodogram can also be applied to a time series and is a fundamental tool in constructing statistical inferences about its frequency properties (Brockwell & Davis, 2006). The expected value of the periodogram is asymptotically the power spectral density (PSD) of the time series (Brandes et al., 1968). Thus, a periodogram can be used to estimate the PSD of a stationary time series.

The PSD of a stationary time series is defined by Brockwell & Davis (2006) as follows:

Definition 2.6.1. (Power Spectral Density (PSD)) Given a stationary time series {X_t : t = 1, . . . , n} with mean µ and absolutely summable autocovariance function γ, the power spectral density (PSD) of {X_t} is

(2π)^{−1} ∑_{k=−∞}^{∞} γ(k) e^{−ikω},  (2.26)

for frequency values ω ∈ [−π, π].

In words, the PSD determines the energy of each frequency component. Thus, the periodogram estimates the energy of these frequency components. The periodogram for the time series {X_t : t = 1, . . . , n} at the Fourier frequencies ω_j = 2πj/n ∈ (−π, π] for j ∈ J_n is defined by Brockwell & Davis (2006) as follows:

I_{(X,n)}(ω_j) := |X̃_j|² = n^{−1} |∑_{t=1}^{n} X_t e^{−itω_j}|².  (2.27)

Equation 2.27 is analogous to the general definition in Definition 2.4.1, where the sequence {X̃_j : j ∈ J_n} is the DFT of the time series. The extra subscript n denotes the dependence on a finite number of values.
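Equation 2.27 can be evaluated at all Fourier frequencies at once with the FFT; the sketch below assumes the zero-based indexing t = 0, . . . , n − 1 used by NumPy, which changes only the phase of the DFT, not |X̃_j|².

```python
import numpy as np

def periodogram(x):
    """All periodogram values I(omega_j), j = 0, ..., n-1 (equation 2.27)."""
    x = np.asarray(x, dtype=float)
    return np.abs(np.fft.fft(x)) ** 2 / len(x)
```

The returned values decompose the squared norm of the data as in Section 2.4: they sum to ‖X‖², and the zero-frequency value equals n X̄².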

The periodogram value is also closely related to the estimate ˆγ of the autocovariance function, as shown by Brockwell & Davis (2006) with the following lemma.

Lemma 2.6.1. (The Periodogram of X in Terms of the Sample Autocovariance Function) If ω_j is any non-zero Fourier frequency, then

I_{(X,n)}(ω_j) = ∑_{|k|<n} ˆγ(k) e^{−ikω_j}.  (2.28)

Based on Lemma 2.6.1 we can relate the absolute summability condition (Definition 2.5.5) to the observed data. In Section 2.4 it was shown that the periodogram values decompose ‖X‖². Therefore, ‖X‖² < ∞ implies that ˆγ is absolutely summable. Thus it is sufficient to check whether ‖X‖² is finite.
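Lemma 2.6.1 is easy to verify numerically; the snippet below is a sketch using arbitrary simulated data (our own choice, not from the text), comparing both sides of equation 2.28 at one non-zero Fourier frequency.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)
n = len(x)
j = 3                                    # any non-zero Fourier frequency index
omega_j = 2 * np.pi * j / n

# Left-hand side: the periodogram value at omega_j (equation 2.27).
t = np.arange(n)
lhs = np.abs(np.sum(x * np.exp(-1j * omega_j * t))) ** 2 / n

# Right-hand side: the sum of sample autocovariances (equation 2.28).
xbar = x.mean()
def gamma_hat(k):
    k = abs(k)
    return np.sum((x[:n - k] - xbar) * (x[k:] - xbar)) / n
rhs = sum(gamma_hat(k) * np.exp(-1j * k * omega_j) for k in range(-(n - 1), n))
```

The two sides agree up to floating-point error, and the imaginary part of the right-hand side vanishes because ˆγ is even in k.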


Additionally, Lemma 2.6.1 allows us to rewrite equation 2.27 as follows:

I_{(X,n)}(0) = n X̄²,  I_{(X,n)}(ω_j) = ∑_{|k|<n} ˆγ(k) e^{−ikω_j} if ω_j ≠ 0.  (2.29)

Comparing equations 2.26 and 2.29, we see that I_{(X,n)}(ω_j)/(2π) is a natural estimator of the PSD for ω_j ≠ 0. However, in order to estimate the PSD at any arbitrary non-zero frequency in the interval [−π, π], we extend the periodogram. We extend the periodogram on [−π, π] using a piecewise constant function coinciding with equation 2.29 at the Fourier frequencies, as shown by Fuller (1976). This gives us the following definition.

Definition 2.6.2. (Extension of the Periodogram) For any ω ∈ [−π, π] the extended periodogram is defined as

I_{(X,n)}(ω) = I_{(X,n)}(ω_j)  if ω_j − π/n < ω ≤ ω_j + π/n and 0 ≤ ω ≤ π,
I_{(X,n)}(ω) = I_{(X,n)}(−ω)  if ω ∈ [−π, 0).  (2.30)

With this definition we can estimate the energy of each frequency in [−π, π].
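A sketch of Definition 2.6.2 in NumPy; the function name and the arithmetic used to locate the bin index j are our own choices.

```python
import numpy as np

def extended_periodogram(x, omega):
    """Extended periodogram of Definition 2.6.2 for omega in [-pi, pi]:
    piecewise constant, coinciding with I(omega_j) on each bin
    (omega_j - pi/n, omega_j + pi/n]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if omega < 0:                      # symmetry: I(omega) = I(-omega)
        omega = -omega
    # Index j with omega_j - pi/n < omega <= omega_j + pi/n.
    j = int(np.ceil(omega * n / (2 * np.pi) - 0.5))
    I = np.abs(np.fft.fft(x)) ** 2 / n
    return I[j]
```

At a Fourier frequency the extension returns the ordinary periodogram value, and by the mirroring step it is even in ω.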

2.6.2 Bartlett’s Method

Bartlett’s method (1950) is a variant of the periodogram which estimates the PSD of a time series with a smaller variance. However, it also reduces the resolution: the ability to distinguish frequency components that lie close together.

Definition 2.6.3. (Bartlett’s Method) Given a stationary time series {X_t : t = 1, . . . , n} with ‖X‖² finite,

1. Segment the n observations into K non-overlapping segments, each of a fixed length L.

2. For each segment compute the periodogram values and take their average.

The result is the energy for each frequency bin, which is used to estimate the PSD.

The averaging is key to reducing the variance; however, the resolution is reduced, as frequencies are considered in groups rather than separately. For further information see Bartlett (1950).
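The two steps of Bartlett's method can be sketched as follows; the function name is ours, and samples beyond K·L are simply discarded, one common convention.

```python
import numpy as np

def bartlett_psd(x, L):
    """Bartlett's method (Definition 2.6.3): split x into K = n // L
    non-overlapping segments of length L, compute each segment's
    periodogram, and average them."""
    x = np.asarray(x, dtype=float)
    K = len(x) // L
    segments = x[:K * L].reshape(K, L)                  # step 1
    periodograms = np.abs(np.fft.fft(segments, axis=1)) ** 2 / L
    return periodograms.mean(axis=0)                    # step 2
```

With a single segment (K = 1) the estimate reduces to the plain periodogram of that segment.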

2.6.3 Welch’s Method

Welch’s method (1967) is an improvement on Bartlett’s method. It reduces the noise in the estimation of the PSD in exchange for a further reduction in the resolution.

Definition 2.6.4. (Welch’s Method) Given a stationary time series {X_t : t = 1, . . . , n} with ‖X‖² finite,

1. Segment the observations into K segments of length L, overlapping by D points.

2. Window the segments, i.e. outside the chosen segment all values are set to zero, and within the segment the data is tapered such that the data at the centre of the segment have a greater influence.

3. For each windowed segment compute the periodogram values and take their average.

The result is the energy for each frequency bin, which is used to estimate the PSD.

Note that for D = 0 and a rectangular window, Welch’s method reduces to Bartlett’s method.

In Welch’s method, the windowing of the segments leads to a loss of information, and thus the segments are made to overlap to reduce this loss. However, the extent of this reduction depends on the value of D, and the loss can never be fully mitigated. Further information can be found in Welch (1967).
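A sketch of Welch's method with an assumed Hann taper; normalisation conventions differ between references, and here each windowed periodogram is divided by the window energy so its scale matches the plain periodogram. In practice one would typically use a library routine such as scipy.signal.welch.

```python
import numpy as np

def welch_psd(x, L, D):
    """Welch's method (Definition 2.6.4): segments of length L overlapping
    by D points, each tapered by a Hann window (an assumed choice), then
    the windowed periodograms are averaged."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(L)
    step = L - D                                        # assumes D < L
    psds = [np.abs(np.fft.fft(x[s:s + L] * window)) ** 2 / np.sum(window ** 2)
            for s in range(0, len(x) - L + 1, step)]
    return np.mean(psds, axis=0)
```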


Chapter 3

Literature Overview

3.1 Problem Description

Suppose we observe X_t such that

X_t = f_t + ε_t, for t = 1, . . . , n.

Here, f_t = f(t) is the value at time t of a periodic function f with period τ, and ε_t is a random error due to noise at time t. For simplicity we assume that

ε_t i.i.d.∼ N(0, σ²), where σ is known.

Then, on the basis of the observations X_1, . . . , X_n, we would like to test whether f(t) has period τ_0 ∈ Q⁺ ∪ {0}. Therefore, we want to test the null hypothesis

H_0 : τ = τ_0

against the alternative

H_1 : τ ≠ τ_0.

In this problem {ε_t : t = 1, . . . , n} is Gaussian white noise. Therefore, it is a stationary time series and its terms are uncorrelated. In contrast, {X_t : t = 1, . . . , n} is not stationary. The mean of {X_t},

µ(t) = E[X_t] = E[f_t + ε_t] = f_t,

is not constant with respect to time. Thus, it has a time trend and cannot be stationary by Definition 2.5.6. Although stationarity is a desirable property, the time trend is needed to test the periodicity of f.

Lastly, it can be seen that

X_t ∼ N(f_t, σ²).  (3.1)

Given that f_t is not constant over time, the X_t terms are not identically distributed. However, they remain independent of each other.
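The observation model can be simulated to see how the periodogram reveals τ; the amplitude, period, and noise level below are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau, sigma = 240, 12, 1.0                    # illustrative values
t = np.arange(1, n + 1)
f = 2.0 * np.sin(2 * np.pi * t / tau)           # periodic signal, period tau
x = f + rng.normal(scale=sigma, size=n)         # X_t = f_t + eps_t

I = np.abs(np.fft.fft(x)) ** 2 / n              # periodogram at omega_j = 2*pi*j/n
j_peak = int(np.argmax(I[1 : n // 2 + 1])) + 1  # dominant non-zero frequency
tau_hat = n / j_peak                            # period estimate, 2*pi / omega_j
```

With these settings the signal places all of its energy at the Fourier index j = n/τ = 20, so the peak stands far above the noise level and the period is recovered exactly.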

3.2 Distribution of the Periodogram

This section explains the relationship between the periodogram of X and of f and derives the distribution for the periodogram of X. The following proposition summarises our findings.


CHAPTER 3. LITERATURE OVERVIEW

Proposition 3.2.1. Suppose that X_t = f_t + ε_t where ε_t ∼ N(0, σ²). Let the sets {X̃_j}_{j∈J_n}, {f̃_j}_{j∈J_n}, and {ε̃_j}_{j∈J_n} denote the DFT of X, f, ε respectively, and let I_{(X,n)}(ω_j), I_{(f,n)}(ω_j), I_{(ε,n)}(ω_j) denote the periodogram of X, f, ε for j ∈ J_n as defined in equation 2.27.

(i) X̃_j ∼ N(f̃_j, σ²), with the X̃_j independent.

(ii) I_X(ω_j) = I_f(ω_j) + I_ε(ω_j) + C_j, where C_j ∼ N(0, 4σ²f̃_j²), I_ε(ω_j) ∼ Exp(1/σ²), and I_f(ω_j) is deterministic.

(iii) E[I_{(X,n)}(ω_j)] = I_f(ω_j) + σ².

(iv) If E[X_1⁴] < ∞ and ω_j = 2πj/n ∈ [0, π], then

Cov(I_{(X,n)}(ω_j), I_{(X,n)}(ω_k)) = 4σ²d / n² → 0 as n → ∞, where d ∈ ℝ, and

Var(I_{(X,n)}(ω_j)) = (2σ²/n) I_f(ω_j) + (2σ²/n²) d₀ + { 2σ⁴ if ω_j = 0 or π; σ⁴ else },

where d₀ ∈ ℝ. (See Appendix B for the exact expressions for d and d₀.)

Proof. See Appendix B for the proof of this Proposition.

To choose between the two hypotheses in the problem statement, the PSD of f can be used. By definition, I_f(ω_j) is the PSD of f. In Proposition 3.2.1 we see that I_X(ω_j) can be used to estimate I_f(ω_j) and can therefore be used as the test statistic. The bias of an estimator is the difference between its expected value and the true value. From (iii) we see that the bias is σ², which is constant. Moreover, we notice that σ² is the PSD of ε. From (ii),

E[I_{(ε,n)}(ω_j)] = σ².  (3.2)

As σ² is a constant, we can conclude that the PSD of ε is σ². Thus, in expectation, the periodogram of X is the sum of the PSD of f and the PSD of ε, with the latter a constant value. Hence, the structure of the periodogram of X comes from f.

From this proposition we also see that I_X(ω_j) is the sum of an exponentially distributed random variable, a normally distributed random variable, and a deterministic value. Its distribution is not explicit; however, given equation 3.1 we can conclude that

‖X‖² = ∑_{t=1}^{n} X_t² ∼ σ²χ²(n, λ),  (3.3)

where λ = ∑_{i=1}^{n} f_i² is the non-centrality parameter. We can make it a central chi-squared distribution by subtracting the mean as follows,

∑_{t=1}^{n} (X_t − X̄)² = ‖X‖² − n^{−1} (∑_{t=1}^{n} X_t)² ∼ σ²χ²(n − 1),  (3.4)

where X̄ is the mean of X and n^{−1}(∑_{t=1}^{n} X_t)² = n X̄² is the periodogram value of X at frequency zero, I_{(X,n)}(0). The degrees of freedom are reduced by one due to the constraint induced by the mean (see Section 2.2).

From the distribution of C_j and the variance given in (iv), we see that the variance of I_X(ω_j) depends on f̃_j. Therefore, the variance will be higher at the peaks of the periodogram. Furthermore, Scargle (1982) shows that the periodogram values are noisy and notes that increasing the sample size does not reduce the noise. In fact, increasing the sample size increases the noise, as with more data the number of available frequencies increases. Thus, methods such as Bartlett's


and Welch’s are used to reduce the variance by averaging the periodogram.

The periodogram also suffers from leakage: at a given frequency ω, energy is leaked to other frequencies. The leakage to frequencies close to ω is due to the finite amount of data used, and the leakage to frequencies further away is due to the finite interval used to sample the data. Moreover, the periodogram is based on the DFT, which assumes that the data has sinusoidal components only at Fourier frequencies. Therefore, in cases where this is not true, multiple peaks will be found in the periodogram around the true value. This is also why periodogram values at non-Fourier frequencies have to be approximated using the extended periodogram as defined in Definition 2.6.2. For these reasons, the periodogram is now usually just a component in more sophisticated methods.

3.3 Known Approaches

This section presents known tests for the problem described in Section 3.1. In the following sections, we shall assume that the data is real unless stated otherwise. Thus the Fourier frequencies are ω_j for j ∈ J′_n = {0, . . . , n/2}, where n/2 is only considered if n is even.

Before the tests are described it is important to address why τ_0 ≥ 2. As we are sampling at every time unit, the Nyquist frequency is π. This means that π is roughly the highest frequency about which there is information (Scargle, 1982). Thus, if τ_0 < 2, the corresponding frequency satisfies 2π/τ_0 > π. Therefore, only periods greater than or equal to two can be considered for the given problem. Note that it is possible to consider smaller periods if the sampling rate is increased. This is a possible extension for the problem given in this report.

3.3.1 Fisher’s Test

Fisher’s test was designed to test for periodicity in data of unspecified frequency. However, understanding this test forms a fundamental base for the other tests.

For the observations X as defined in Section 3.1, the null hypothesis of Fisher’s test states that X is Gaussian white noise, i.e.

H_0 : f_1 = . . . = f_n.

The idea is to reject H_0 if there is a periodogram value substantially larger than the average value.

As in the work of Fisher (1929), we assume that n is odd such that n = 2ν + 1. A simple extension to an even number of observations can be found in Fisher (1939).

Then we define

Y_j = I_{(X,n)}(ω_j) / ∑_{i=1}^{ν} I_{(X,n)}(ω_i)  (3.5)

for j = 1, . . . , ν. In equation 3.5 the periodogram value is normalised to reduce the effect of the variance. Then Fisher’s test is based on the statistic

T = ν max_{1≤j≤ν} Y_j = max_{1≤j≤ν} I_{(X,n)}(ω_j) / (ν^{−1} ∑_{i=1}^{ν} I_{(X,n)}(ω_i)).  (3.6)

Thus if T is large, it is likely that there is a value larger than the average, and thus we reject H_0. Under H_0, Brockwell & Davis (2006) show that

P(T ≥ t) = 1 − ∑_{j=0}^{ν} (−1)^j (ν choose j) (1 − jt/ν)_+^{ν−1},  (3.7)


where x_+ = max(x, 0). Then we compute the realised value t of T from the data X_1, . . . , X_n and reject H_0 at significance level α if P(T ≥ t) < α.

It is shown by Anderson (1971) that Fisher’s test is the uniformly most powerful test in detecting periodicity in data containing a single dominant frequency. However, when data contains multiple frequencies, the power of Fisher’s test is greatly reduced, as found by Shimshoni (1971) and Siegel (1980). Therefore, both suggest considering not only the maximum periodogram value but all large periodogram values.
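Fisher's statistic (equation 3.6) and the exact null probability (equation 3.7) can be sketched as follows; the function names are ours.

```python
import numpy as np
from math import comb

def fisher_pvalue(t, nu):
    """P(T >= t) under H0, equation 3.7, with (x)_+ = max(x, 0)."""
    s = sum((-1) ** j * comb(nu, j) * max(1 - j * t / nu, 0.0) ** (nu - 1)
            for j in range(nu + 1))
    return 1 - s

def fisher_statistic(x):
    """Realised T of equation 3.6 for odd n = 2*nu + 1."""
    x = np.asarray(x, dtype=float)
    nu = (len(x) - 1) // 2
    I = (np.abs(np.fft.fft(x)) ** 2 / len(x))[1 : nu + 1]
    return I.max() / I.mean()
```

As a sanity check: for ν = 2, T = 2 max(Y_1, Y_2) is uniform on (1, 2), so P(T ≥ 1.4) = 0.6, which the formula reproduces.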

3.3.2 Testing using an F-statistic

Assume that ω ∈ [−π, π] and we hypothesise that the data has frequency ω. Let k = ωn/(2π). Then we consider two cases: k ∈ J′_n and k ∉ J′_n. If k ∈ J′_n, then ω is the Fourier frequency ω_k. Thus, the hypotheses are

H_0 : X is Gaussian white noise, against

H_1 : f has frequency ω_k.

Then, we reject H_0 if 2I_{(X,n)}(ω_k) is sufficiently large. Note that H_1 can easily be translated in terms of periods using the formula

τ = 2π/ω,  (3.8)

where τ is the period and ω the frequency of f.

In Section 2.4 we saw that the ω_k component corresponds to the projection of the data onto the two-dimensional subspace sp{c_k, s_k} of ℝⁿ. Thus,

2I_{(X,n)}(ω_k) = ‖P_{sp{c_k,s_k}} X‖²,  (3.9)

where P is the projection matrix. By Theorem 2.4.1 of Brockwell & Davis (2006) we have that

P_{sp{c_k,s_k}} X = ⟨X, c_k⟩ c_k + ⟨X, s_k⟩ s_k.  (3.10)

Then under H_0, X_t = µ + ε_t, and since the constant component of X is orthogonal to sp{c_k, s_k},

2I_{(X,n)}(ω_k) = ‖P_{sp{c_k,s_k}} X‖² = ‖P_{sp{c_k,s_k}} ε‖² = 2I_{(ε,n)}(ω_k).  (3.11)

Then from Proposition 3.2.1 (ii),

2I_{(ε,n)}(ω_k) ∼ 2 Exp(1/σ²) = σ²χ²(2).  (3.12)

Moreover, sp{c_k, s_k} has an orthogonal complement, as it is a subspace of ℝⁿ. The projection onto this complement is denoted by I − P_{sp{c_k,s_k}}, where I is the identity mapping of ℝⁿ. Clearly, (I − P_{sp{c_k,s_k}})X and P_{sp{c_k,s_k}}X are orthogonal. Since X is Gaussian, projections onto orthogonal subspaces are independent, so their squared lengths are independent of each other.

Then we can rewrite the projection on the complement as follows:

‖(I − P_{sp{c_k,s_k}})X‖² = ∑_{i=1}^{n} X_i² − 2I_{(X,n)}(ω_k).  (3.13)


From equations 3.3 and 3.12 it follows that this expression has a non-central chi-squared distribution with n − 2 degrees of freedom. We centralise this expression by subtracting n^{−1}(∑_{t=1}^{n} X_t)² = I_{(X,n)}(0) (see equation 2.13). Thus,

∑_{i=1}^{n} X_i² − 2I_{(X,n)}(ω_k) − I_{(X,n)}(0) ∼ σ²χ²(n − 3).  (3.14)

Then we see that

∑_{i=1}^{n} X_i² − 2I_{(X,n)}(ω_k) − I_{(X,n)}(0) = ‖X − P_{sp{e_0,c_k,s_k}} X‖²,  (3.15)

which is still independent of ‖P_{sp{c_k,s_k}} X‖², as it is a projection onto a subspace of the complement of sp{c_k, s_k}.

Due to the independence and the distributions given in equations 3.12 and 3.14, we can define the test statistic

T = (n − 3) I_{(X,n)}(ω_k) / (∑_{t=1}^{n} X_t² − I_{(X,n)}(0) − 2I_{(X,n)}(ω_k)) ∼ F(2, n − 3),  (3.16)

as done by Brockwell & Davis (2006). Then we reject H_0 in favour of H_1 at significance level α if T > F_{1−α}(2, n − 3).

If k ∉ J′_n, then we have to approximate the value of I_{(X,n)}(ω) using the extended periodogram. In this case, T only approximately follows the F-distribution with parameters 2 and n − 3.

Furthermore, it was shown by Novick (1994) that this test statistic is inadequate, as there is a high chance that the F-statistic yields an incorrect rejection of H_0 (see Novick (1994), Section 6.1). Thus, this test has a high type I error. Therefore, Novick (1994) advises the use of Fisher's G-statistic instead. Yet, this comes at the cost of higher complexity.
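A sketch of the test for a Fourier frequency ω_k. Instead of quantile tables, it uses the closed-form survival function of F(2, m), P(T > t) = (1 + 2t/m)^{−m/2}, which follows because the numerator χ²(2) is exponential; the demonstration data below are illustrative choices of ours.

```python
import numpy as np

def f_test(x, k):
    """Sketch of the test of Section 3.3.2 at Fourier frequency
    omega_k = 2*pi*k/n: the statistic T of equation 3.16 and its p-value
    from the F(2, n - 3) survival function."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    I = np.abs(np.fft.fft(x)) ** 2 / n
    T = (n - 3) * I[k] / (np.sum(x ** 2) - I[0] - 2 * I[k])
    p_value = (1 + 2 * T / (n - 3)) ** (-(n - 3) / 2)
    return T, p_value

# Hypothetical demonstration: a sinusoid at omega_8 plus mild noise.
rng = np.random.default_rng(2)
t = np.arange(64)
x_demo = np.sin(2 * np.pi * 8 * t / 64) + 0.3 * rng.normal(size=64)
T_demo, p_demo = f_test(x_demo, k=8)
```

For this strongly periodic input the statistic is very large and the p-value is far below any conventional significance level, so H_0 is rejected.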

3.3.3 Testing using Fisher’s G-statistic

This test is based on Fisher's test described in Section 3.3.1. We define the r-th order Fisher G-statistic as defined by Grenander & Rosenblatt (1957):

G_r = I_(r) / (I_(1) + . . . + I_(n)) = I_(r) / ‖X‖²,  (3.17)

where I_r = I_{(X,n)}(ω_r) and I_(r) denotes the r-th order statistic, ordered such that I_(1) is the largest. Let k ∈ J′_n denote the index of the highest peak in the periodogram plot; then the frequency 2πk/n corresponds to the periodogram value I_(1). We hypothesise that the data has frequency 2πk/n. In this case the hypotheses are

H_0 : X is Gaussian white noise, against

H_1 : f has frequency 2πk/n.

Thus, G_1 can be used as the test statistic to choose between the competing hypotheses.

Under H_0, Grenander & Rosenblatt (1957) show that

P(G_r > x) = (m! / (r − 1)!) ∑_{j=r}^{⌊1/x⌋} (−1)^{j−r} (1 − jx)^{m−1} / (j (j − r)! (m − j)!).  (3.18)

Then we can compute the realised value g_1 of G_1 from the data X_1, . . . , X_n and reject H_0 at significance level α if P(G_1 > g_1) ≤ α.
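Equation 3.18 and the realised G_r can be sketched as follows; we read m as the number of periodogram ordinates in the denominator, and we cap the upper summation limit at m so the factorials stay defined when x < 1/m (both are our assumptions).

```python
import numpy as np
from math import factorial

def g_pvalue(x, r, m):
    """P(G_r > x), equation 3.18, with the upper limit floor(1/x)
    capped at m (our reading)."""
    upper = min(m, int(1 / x))
    return (factorial(m) / factorial(r - 1)) * sum(
        (-1) ** (j - r) * (1 - j * x) ** (m - 1)
        / (j * factorial(j - r) * factorial(m - j))
        for j in range(r, upper + 1))

def g_statistic(data, r):
    """Realised G_r of equation 3.17: the r-th largest periodogram value
    divided by the sum of all periodogram values (= ||X||^2)."""
    data = np.asarray(data, dtype=float)
    I = np.abs(np.fft.fft(data)) ** 2 / len(data)
    return np.sort(I)[::-1][r - 1] / I.sum()
```

As a toy check of the formula itself: with m = 2 i.i.d. exponential ordinates, G_1 = max(Y, 1 − Y) for Y uniform on (0, 1), so P(G_1 > 0.6) = 0.8, which g_pvalue reproduces.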
